Abstract for HONS 05/02 - Computer Science and Software Engineering - University of Canterbury - New Zealand
HONS 05/02

A comparison of BWT approaches to compressed-domain pattern matching

Andrew Firth
Department of Computer Science
University of Canterbury

Abstract


A number of algorithms have recently been developed to search files compressed with the Burrows-Wheeler Transform (BWT) without the need for full decompression first. This allows the storage requirement of data to be reduced through the exceptionally good compression offered by BWT, while still allowing fast access to the information for searching. We provide a detailed description of five of these algorithms: Compressed-Domain Boyer-Moore (Bell et al. 2002), Binary Search (Bell et al. 2002), Suffix Arrays (Sadakane & Imai 1999), q-grams (Adjeroh et al. 2002) and the FM-index (Ferragina & Manzini 2001), and also present results from a set of extensive experiments that were performed to evaluate and compare the algorithms. Furthermore, we introduce a technique to improve the search times of Binary Search, Suffix Arrays and q-grams by around 20%, as well as reduce the memory requirement of the latter two by 40% and 31%, respectively. Our results indicate that, while the compressed files of the FM-index are larger than those of the other approaches, it is able to perform searches with considerably less memory. Additionally, when only counting the occurrences of a pattern, or when locating the positions of a small number of matches, it is the fastest algorithm. For larger searches, q-grams provides the fastest results.

  • Phone: +64 3 369 2777
    Fax: +64 3 364 2569
    CSSEadministration@canterbury.ac.nz
  • Computer Science and Software Engineering
    University of Canterbury
    Private Bag 4800, Christchurch
    New Zealand
  • Follow us
    FacebookYoutubetwitterLinked In