Posted Content

Efficient Pattern Matching on Binary Strings

TL;DR: This paper presents two efficient algorithms for the binary string matching problem, adapted to avoid any reference to individual bits so that pattern and text can be processed byte by byte.
Abstract: The binary string matching problem consists of finding all occurrences of a pattern in a text where both strings are built on a binary alphabet. This is an interesting problem in computer science, since binary data are omnipresent in telecom and computer network applications. The problem also has applications in image processing and in pattern matching on compressed texts. Recently it has been shown that adaptations of classical exact string matching algorithms are not very efficient on binary data. In this paper we present two efficient algorithms for the problem, adapted to avoid any reference to individual bits so that pattern and text can be processed byte by byte. Experimental results show that the new algorithms outperform existing solutions in most cases.
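To make the problem statement concrete, here is a minimal sketch of the naive bit-by-bit baseline. It is not one of the two algorithms the paper presents. It assumes both strings are packed eight bits per byte, most significant bit first, with explicit bit lengths `n` and `m`; the names `bit_at` and `naive_binary_search` are hypothetical.

```python
def bit_at(data: bytes, i: int) -> int:
    """Return bit i of a packed byte string (bit 0 is the MSB of byte 0)."""
    return (data[i >> 3] >> (7 - (i & 7))) & 1


def naive_binary_search(text: bytes, n: int, pattern: bytes, m: int) -> list[int]:
    """Return every bit offset s such that pattern[0:m] equals text[s:s+m].

    n and m are lengths in bits; trailing padding bits in the last byte are ignored.
    """
    return [
        s
        for s in range(n - m + 1)
        if all(bit_at(text, s + j) == bit_at(pattern, j) for j in range(m))
    ]


# Text 0110 1010 1101 1000 (16 bits), pattern 1011 (4 bits, zero-padded to a byte).
occurrences = naive_binary_search(bytes([0b01101010, 0b11011000]), 16,
                                  bytes([0b10110000]), 4)
print(occurrences)  # [6, 9]
```

The occurrence at offset 6 straddles the boundary between the two text bytes. Such unaligned matches are what force this baseline to address individual bits, the overhead that byte-by-byte processing is meant to avoid.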
Citations
Book ChapterDOI
18 Jun 2009
TL;DR: Experimental results show that the newly presented algorithm outperforms existing solutions in most cases.
Abstract: We present a new efficient algorithm for exact matching in encoded DNA sequences and on binary strings. Our algorithm combines a multi-pattern version of the BNDM algorithm and a simplified version of the Commentz-Walter algorithm. We also performed experimental comparisons with the most efficient algorithms presented in the literature. Experimental results show that the newly presented algorithm outperforms existing solutions in most cases.

28 citations


Cites methods from "Efficient Pattern Matching on Binar..."

  • ...Best results are bold faced.

    Experimental results for a Rand(0/1)50 problem

        m      BBM     BSKS      BHM      BFL      FED
       25  161.905   19.484   28.625   12.014   13.107
       50   90.109   11.047   14.671    4.344    7.000
       75   70.718    8.846   10.220    2.828    4.936
      100   65.797    7.720    8.094    2.264    3.968
      125   58.780    7.016    6.939    1.938    3.406
      150   52.593    6.375    6.171    1.798    3.032
      200   42.032    5.484    5.171    1.609    2.625
      250   50.751    4.875    4.563    1.485    2.500
      300   47.564    4.327    4.375    1.498    2.375
      350   45.498    4.079    4.094    1.546    2.328
      400   42.502    3.702    3.904    1.564    2.253
      450   45.344    3.562    3.800    1.562    2.234
      500   44.345    3.311    3.658    1.497    2.267

    [Plot: BSKS, BHM, BFL and FED against m, for m from 50 to 500]

    Experimental results for a Rand(0/1)70 problem

        m      BBM     BSKS      BHM      BFL      FED
       25  188.469   29.842   33.110   14.095   20.219
       50  112.720   20.031   17.860    5.142   11.125
       75   88.953   16.299   13.251    3.624    8.390
      100   82.360   13.909   10.938    2.797    6.938
      125   74.671   12.531    9.686    2.358    6.032
      150   69.875   11.531    8.641    2.218    5.389
      200   58.952    9.967    7.452    1.843    4.641
      250   64.921    9.093    6.690    1.689    4.406
      300   61.219    8.283    6.218    1.671    4.063
      350   58.141    7.921    5.908    1.670    3.796
      400   54.420    7.595    5.563    1.624    3.718
      450   57.402    7.284    5.423    1.642    3.594
      500   55.296    7.077    5.281    1.625    3.405

    [Plot: BSKS, BHM, BFL and FED against m, for m from 50 to 500]

    Experimental results for an encoded DNA sequence

        m      BBM     BSKS      BHM      BFL      FED
       16   41.266    8.062   19.407    6.594    8.249
       32   28.955    5.046   10.046    2.814    4.422
       64   29.485    3.813    5.420    1.641    2.533
       96   26.764    3.375    4.031    1.453    2.032
      128   26.436    3.047    3.422    1.361    1.766
      160   24.577    2.859    2.862    1.347    1.701
      192   25.624    2.592    2.733    1.469    1.578
      224   33.170    2.438    2.641    1.373    1.623
      256   28.595    2.453    2.517    1.372    1.608
      288   26.421    2.299    2.421    1.377    1.593
      320   27.596    2.234    2.374    1.407    1.703
      352   24.251    2.235    2.281    1.391    1.625
      384   23.593    2.221    2.359    1.327    1.734
      448   24.063    2.626    2.343    1.294    1.830
      496   24.659    2.891    2.362    1.452    1.906

    [Plot: BSKS, BHM, BFL and FED against m, for m from 50 to 500]

    Experimental results show that the Bfl algorithm obtains the best run-time performance in all cases....

    [...]

  • ...Here we present experimental data which allow to compare, in terms of running time, the following string matching algorithms on binary strings and encoded DNA sequences: the Binary-Boyer-Moore algorithm (BBM) [8] by Klein and Ben-Nissan, the Binary-Hash-Matching algorithm (BHM) [5], the Binary-Skip-Search algorithm (BSKS) [5], Fed algorithm (FED) [7] and the new Bfl (BFL) algorithm....

    [...]

  • ...Recently in [5] two efficient algorithms have been presented for the problem adapted to completely avoid any reference to bits allowing to process pattern and text byte by byte....

    [...]

  • ...Experimental results conducted in [5] on various conditions showed that the proposed algorithms perform better than existing solutions and even than the most effective algorithms for standard pattern matching....

    [...]

Proceedings ArticleDOI
01 Jan 2011
TL;DR: The Crochemore-Perrin constant-space O(n)-time string matching algorithm is extended to run in optimal O(n/alpha) time and even in real-time, achieving a factor alpha speedup over traditional algorithms that examine each character individually.
Abstract: In the packed string matching problem, each machine word accommodates alpha characters, thus an n-character text occupies n/alpha memory words. We extend the Crochemore-Perrin constant-space O(n)-time string matching algorithm to run in optimal O(n/alpha) time and even in real-time, achieving a factor alpha speedup over traditional algorithms that examine each character individually. Our solution can be efficiently implemented, unlike prior theoretical packed string matching work. We adapt the standard RAM model and only use its AC0 instructions (i.e. no multiplication) plus two specialized AC0 packed string instructions. The main string-matching instruction is available in commodity processors (i.e. Intel's SSE4.2 and AVX Advanced String Operations); the other maximal-suffix instruction is only required during pattern preprocessing. In the absence of these two specialized instructions, we propose theoretically-efficient emulation using integer multiplication (not AC0) and table lookup.

23 citations

Journal ArticleDOI
TL;DR: In this paper, the worst-case complexity of string matching on strings given in packed representation is studied, where m is the number of characters in a single word and m = m.

19 citations

Journal ArticleDOI
TL;DR: The Crochemore-Perrin constant-space O(n)-time string-matching algorithm is extended to run in optimal O(n/alpha) time and even in real-time, achieving a factor alpha speedup over traditional algorithms that examine each character individually.

14 citations

Book ChapterDOI
03 Jul 2012
TL;DR: A novel string-matching algorithm that requires constant time for text scanning in an unusual model where the input pattern and text are each packed into a single word, and the output is a one word bit-mask identifying the pattern occurrences in the text.
Abstract: We present a novel string-matching algorithm that requires constant time for text scanning in an unusual model where (a) the input pattern and text are each packed into a single word, (b) the output is a one word bit-mask identifying the pattern occurrences in the text, and (c) there are constant-time arithmetic, bitwise, and shift instructions that operate on words whose size is proportional to the arbitrarily long input length. Our bit-parallelism techniques build upon and also greatly simplify existing parallel random access machine algorithms by using two "simple structure" rather than "small size" deterministic samples, i.e., one deterministic sample is very small (size two), while the other is a potentially very long prefix of the pattern. Pattern preprocessing takes time proportional to the word size. Our results also establish, by recent reductions, new bounds for the packed string matching problem.
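The input/output convention of this model (text packed into one wide word, result returned as a one-word bit-mask of occurrences) can be illustrated with Python's arbitrary-precision integers standing in for the wide words. The sketch below is a naive illustration only. It uses one shift and one AND per pattern character, so it does not achieve the constant-time scanning claimed in the abstract and does not use deterministic samples. The function name `occurrence_mask` and the bit layout (bit i of the word holds text character i) are assumptions made for the example.

```python
def occurrence_mask(text_bits: str, pattern_bits: str) -> int:
    """Return an integer whose bit i is set iff the pattern occurs at text position i.

    Both arguments are strings of '0'/'1'. This is a naive O(m)-word-operation
    illustration of the bit-mask output format, not the paper's algorithm.
    """
    n, m = len(text_bits), len(pattern_bits)
    if m == 0 or m > n:
        return 0
    # Pack the text so that bit i of the word equals text_bits[i].
    word = int(text_bits[::-1], 2)
    full = (1 << n) - 1
    # Only positions 0 .. n-m can start an occurrence.
    mask = (1 << (n - m + 1)) - 1
    for j, c in enumerate(pattern_bits):
        shifted = word >> j  # bit i now holds text bit i + j
        mask &= shifted if c == "1" else (~shifted & full)
    return mask


result = occurrence_mask("0110101011011000", "1011")
print(bin(result))  # bits 6 and 9 are set
```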

9 citations

References
01 Jan 1999
TL;DR: It is argued that, in addition to previous applications that required such search, multi-pattern matching can be used in lieu of indexed or sorted data in some applications involving small to medium size datasets.
Abstract: A new algorithm to search for multiple patterns at the same time is presented. The algorithm is faster than previous algorithms and can support a very large number — tens of thousands — of patterns. Several applications of the multi-pattern matching problem are discussed. We argue that, in addition to previous applications that required such search, multi-pattern matching can be used in lieu of indexed or sorted data in some applications involving small to medium size datasets. Its advantage, of course, is that no additional search structure is needed.

564 citations

Journal ArticleDOI
TL;DR: A very fast new family of string matching algorithms based on hashing q-grams are proposed, which are the fastest on many cases, in particular, on small size alphabets.

122 citations

Journal ArticleDOI
TL;DR: Pertinent approaches to compression of the various files are reviewed, and it is shown that, under simple models of text generation, Huffman encoding produces a bit-string indistinguishable from a representation of coin flips.
Abstract: The emergence of the CD-ROM as a storage medium for full-text databases raises the question of the maximum size database that can be contained by this medium. As an example, the problem of storing the Tresor de la Langue Francaise on a CD-ROM is examined in this paper. The text alone of this database is 700 megabytes long, more than a CD-ROM can hold. In addition, the dictionary and concordance needed to access these data must be stored. A further constraint is that some of the material is copyrighted, and it is desirable that such material be difficult to decode except through software provided by the system. Pertinent approaches to compression of the various files are reviewed, and the compression of the text is related to the problem of data encryption: Specifically, it is shown that, under simple models of text generation, Huffman encoding produces a bit-string indistinguishable from a representation of coin flips.

60 citations


"Efficient Pattern Matching on Binar..." refers background or methods in this paper

  • ...This is a reasonable assumption for compressed text [KBD89]....

    [...]

  • ...For compression scheme using Huffman coding, such randomness has been shown to hold in [KBD89]....

    [...]

Journal ArticleDOI
TL;DR: It turns out that the new proposed variants of the BOM string matching algorithm are very flexible and achieve very good results, especially in the case of large alphabets.
Abstract: In this article we present two variants of the BOM string matching algorithm that are more efficient and flexible than the original algorithm. We also present bit-parallel versions of them, obtaining an efficient variant of the BNDM algorithm. Then we compare the newly presented algorithms with some of the most recent and effective string matching algorithms. It turns out that the new proposed variants are very flexible and achieve very good results, especially in the case of large alphabets.

51 citations

Proceedings ArticleDOI
01 May 1989
TL;DR: Until a few years ago, large full-text information retrieval systems could only be operated on powerful mainframes, but recently, the CD-ROM (compact disc read only memory) optical disc medium has become widespread, permitting access by a PC to very large amounts of storage at very low cost.
Abstract: The emergence of the CD-ROM as a storage medium for full-text databases raises the question of the maximum size database that can be contained by this medium. As an example, the problem of storing the Tresor de la Langue Francaise on a CD-ROM is examined in this paper. The text alone of this database is 700 megabytes long, more than a CD-ROM can hold. In addition, the dictionary and concordance needed to access these data must be stored. A further constraint is that some of the material is copyrighted, and it is desirable that such material be difficult to decode except through software provided by the system. Pertinent approaches to compression of the various files are reviewed, and the compression of the text is related to the problem of data encryption: Specifically, it is shown that, under simple models of text generation, Huffman encoding produces a bit-string indistinguishable from a representation of coin flips.

46 citations