Efficient Pattern Matching on Binary Strings

Posted Content

Simone Faro¹, Thierry Lecroq²

University of Catania¹, University of Rouen²

14 Oct 2008 - arXiv: Data Structures and Algorithms

TL;DR: This paper presents two efficient algorithms for the binary string matching problem that completely avoid any reference to bits, so that pattern and text can be processed byte by byte.

Abstract: The binary string matching problem consists in finding all the occurrences of a pattern in a text where both strings are built on a binary alphabet. This is an interesting problem in computer science, since binary data are omnipresent in telecom and computer network applications. Moreover, the problem also finds applications in image processing and in pattern matching on compressed texts. Recently it has been shown that adaptations of classical exact string matching algorithms are not very efficient on binary data. In this paper we present two efficient algorithms for the problem, adapted to completely avoid any reference to bits, so that pattern and text can be processed byte by byte. Experimental results show that the new algorithms outperform existing solutions in most cases.
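To make the "byte by byte" idea concrete, the following is a minimal sketch and not the paper's algorithms. It assumes the text is packed most-significant-bit first, precomputes the eight bit-shifted copies of the pattern together with their byte masks, and then compares whole bytes only. The names `build_tables` and `find_binary` are hypothetical, and the scan is a naive one with no skipping heuristics.

```python
def build_tables(pattern_bits: str):
    """For each bit offset k in 0..7, return (pattern bytes, mask bytes)
    for the pattern shifted right by k bits and zero-padded to whole bytes."""
    m = len(pattern_bits)
    tables = []
    for k in range(8):
        padded = "0" * k + pattern_bits
        mask_bits = "0" * k + "1" * m
        pad = (-len(padded)) % 8          # zero-fill up to a byte boundary
        padded += "0" * pad
        mask_bits += "0" * pad
        patt = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
        mask = bytes(int(mask_bits[i:i + 8], 2) for i in range(0, len(mask_bits), 8))
        tables.append((patt, mask))
    return tables


def find_binary(text: bytes, pattern_bits: str) -> list[int]:
    """Return the bit positions (0-based, MSB-first) where the pattern occurs."""
    if not pattern_bits:
        return []
    tables = build_tables(pattern_bits)
    hits = []
    for i in range(len(text)):                      # candidate starting byte
        for k, (patt, mask) in enumerate(tables):   # candidate bit offset
            if i + len(patt) > len(text):
                continue
            # Only byte-level operations: mask the text byte, compare to the table.
            if all((text[i + j] & mask[j]) == patt[j] for j in range(len(patt))):
                hits.append(8 * i + k)
    return hits


print(find_binary(bytes([0b10110100, 0b11010000]), "1101"))
```

The inner comparison never extracts individual bits from the text; all bit-level work is confined to preprocessing the pattern.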

Citations

Book Chapter

An Efficient Matching Algorithm for Encoded DNA Sequences and Binary Strings

Simone Faro¹, Thierry Lecroq²

University of Catania¹, University of Rouen²

18 Jun 2009

TL;DR: Experimental results show that the newly presented algorithm outperforms existing solutions in most cases.

Abstract: We present a new efficient algorithm for exact matching in encoded DNA sequences and on binary strings. Our algorithm combines a multi-pattern version of the BNDM algorithm and a simplified version of the Commentz-Walter algorithm. We performed also experimental comparisons with the most efficient algorithms presented in the literature. Experimental results show that the newly presented algorithm outperforms existing solutions in most cases.

28 citations

Cites methods from "Efficient Pattern Matching on Binar..."

...Best results are bold faced.

m     BBM       BSKS     BHM      BFL      FED
25    161.905   19.484   28.625   12.014   13.107
50     90.109   11.047   14.671    4.344    7.000
75     70.718    8.846   10.220    2.828    4.936
100    65.797    7.720    8.094    2.264    3.968
125    58.780    7.016    6.939    1.938    3.406
150    52.593    6.375    6.171    1.798    3.032
200    42.032    5.484    5.171    1.609    2.625
250    50.751    4.875    4.563    1.485    2.500
300    47.564    4.327    4.375    1.498    2.375
350    45.498    4.079    4.094    1.546    2.328
400    42.502    3.702    3.904    1.564    2.253
450    45.344    3.562    3.800    1.562    2.234
500    44.345    3.311    3.658    1.497    2.267

Experimental results for a Rand(0/1)50 problem
[Plot: BSKS, BHM, BFL and FED against m]

m     BBM       BSKS     BHM      BFL      FED
25    188.469   29.842   33.110   14.095   20.219
50    112.720   20.031   17.860    5.142   11.125
75     88.953   16.299   13.251    3.624    8.390
100    82.360   13.909   10.938    2.797    6.938
125    74.671   12.531    9.686    2.358    6.032
150    69.875   11.531    8.641    2.218    5.389
200    58.952    9.967    7.452    1.843    4.641
250    64.921    9.093    6.690    1.689    4.406
300    61.219    8.283    6.218    1.671    4.063
350    58.141    7.921    5.908    1.670    3.796
400    54.420    7.595    5.563    1.624    3.718
450    57.402    7.284    5.423    1.642    3.594
500    55.296    7.077    5.281    1.625    3.405

Experimental results for a Rand(0/1)70 problem
[Plot: BSKS, BHM, BFL and FED against m]

m     BBM       BSKS     BHM      BFL      FED
16     41.266    8.062   19.407    6.594    8.249
32     28.955    5.046   10.046    2.814    4.422
64     29.485    3.813    5.420    1.641    2.533
96     26.764    3.375    4.031    1.453    2.032
128    26.436    3.047    3.422    1.361    1.766
160    24.577    2.859    2.862    1.347    1.701
192    25.624    2.592    2.733    1.469    1.578
224    33.170    2.438    2.641    1.373    1.623
256    28.595    2.453    2.517    1.372    1.608
288    26.421    2.299    2.421    1.377    1.593
320    27.596    2.234    2.374    1.407    1.703
352    24.251    2.235    2.281    1.391    1.625
384    23.593    2.221    2.359    1.327    1.734
448    24.063    2.626    2.343    1.294    1.830
496    24.659    2.891    2.362    1.452    1.906

Experimental results for an encoded DNA sequence
[Plot: BSKS, BHM, BFL and FED against m]

Experimental results show that the Bfl algorithm obtains the best run-time performance in all cases....
[...]
...Here we present experimental data which allow to compare, in terms of running time, the following string matching algorithms on binary strings and encoded DNA sequences: the Binary-Boyer-Moore algorithm (BBM) [8] by Klein and Ben-Nissan, the Binary-Hash-Matching algorithm (BHM) [5], the Binary-Skip-Search algorithm (BSKS) [5], Fed algorithm (FED) [7] and the new Bfl (BFL) algorithm....
[...]
...Recently in [5] two efficient algorithms have been presented for the problem adapted to completely avoid any reference to bits allowing to process pattern and text byte by byte....
[...]
...Experimental results conducted in [5] on various conditions showed that the proposed algorithms perform better than existing solutions and even than the most effective algorithms for standard pattern matching....
[...]

Proceedings Article

Optimal Packed String Matching

Oren Ben-Kiki¹, Philip Bille, Dany Breslauer², Leszek Gasieniec³, Roberto Grossi⁴, Oren Weimann²

Intel¹, University of Haifa², University of Liverpool³, University of Pisa⁴

01 Jan 2011

TL;DR: The Crochemore-Perrin constant-space O(n)-time string matching algorithm is extended to run in optimal O(n/alpha) time and even in real-time, achieving a factor alpha speedup over traditional algorithms that examine each character individually.

Abstract: In the packed string matching problem, each machine word accommodates alpha characters, thus an n-character text occupies n/alpha memory words. We extend the Crochemore-Perrin constant-space O(n)-time string matching algorithm to run in optimal O(n/alpha) time and even in real-time, achieving a factor alpha speedup over traditional algorithms that examine each character individually. Our solution can be efficiently implemented, unlike prior theoretical packed string matching work. We adapt the standard RAM model and only use its AC0 instructions (i.e. no multiplication) plus two specialized AC0 packed string instructions. The main string-matching instruction is available in commodity processors (i.e. Intel's SSE4.2 and AVX Advanced String Operations); the other maximal-suffix instruction is only required during pattern preprocessing. In the absence of these two specialized instructions, we propose theoretically-efficient emulation using integer multiplication (not AC0) and table lookup.
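The packed representation itself can be illustrated with a short sketch. It only shows how alpha characters fit in one word and why an n-character text needs about n/alpha words; it does not implement the matching algorithm described in the abstract. The function name `pack`, the 64-bit word size, and the fixed-width character codes are assumptions made for illustration.

```python
def pack(text: str, alphabet: str, word_bits: int = 64) -> tuple[list[int], int]:
    """Pack `text` into integers of `word_bits` bits.

    Returns (words, alpha), where alpha is the number of characters per word.
    Each character is stored as a fixed-width code of ceil(log2(|alphabet|)) bits.
    """
    bits = max(1, (len(alphabet) - 1).bit_length())   # bits per character
    alpha = word_bits // bits                         # characters per word
    code = {c: i for i, c in enumerate(alphabet)}
    words = []
    for start in range(0, len(text), alpha):
        w = 0
        for c in text[start:start + alpha]:
            w = (w << bits) | code[c]
        words.append(w)
    return words, alpha


words, alpha = pack("ACGT" * 20, "ACGT")
print(alpha, len(words))   # 80 characters at 32 per word -> 3 words
```

With a four-letter alphabet each character takes 2 bits, so one 64-bit word holds alpha = 32 characters and a single word comparison covers 32 character comparisons.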

23 citations

Journal Article

Fast searching in packed strings

Philip Bille¹

Technical University of Denmark¹

01 Mar 2011-Journal of Discrete Algorithms

TL;DR: This paper studies the worst-case complexity of string matching on strings given in packed representation.

19 citations

Journal Article

Towards optimal packed string matching

Oren Ben-Kiki¹, Philip Bille², Dany Breslauer³, Leszek Gasieniec⁴, Roberto Grossi⁵, Oren Weimann³

Intel¹, Technical University of Denmark², University of Haifa³, University of Liverpool⁴, University of Pisa⁵

01 Mar 2014-Theoretical Computer Science

TL;DR: The Crochemore-Perrin constant-space O(n)-time string-matching algorithm is extended to run in optimal O(n/alpha) time and even in real-time, achieving a factor alpha speedup over traditional algorithms that examine each character individually.

14 citations

Book Chapter

Constant-Time word-size string matching

Dany Breslauer¹, Leszek Gąsieniec², Roberto Grossi³

University of Haifa¹, University of Liverpool², University of Pisa³

03 Jul 2012

TL;DR: A novel string-matching algorithm that requires constant time for text scanning in an unusual model where the input pattern and text are each packed into a single word, and the output is a one word bit-mask identifying the pattern occurrences in the text.

Abstract: We present a novel string-matching algorithm that requires constant time for text scanning in an unusual model where (a) the input pattern and text are each packed into a single word, (b) the output is a one word bit-mask identifying the pattern occurrences in the text, and (c) there are constant-time arithmetic, bitwise, and shift instructions that operate on words whose size is proportional to the arbitrarily long input length. Our bit-parallelism techniques build upon and also greatly simplify existing parallel random access machine algorithms by using two "simple structure" rather than "small size" deterministic samples, i.e., one deterministic sample is very small (size two), while the other is a potentially very long prefix of the pattern. Pattern preprocessing takes time proportional to the word size. Our results also establish, by recent reductions, new bounds for the packed string matching problem.
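The input and output conventions of this model (pattern and text each in one word, result as a one-word bit-mask) can be sketched with Python's arbitrary-precision integers standing in for the long machine word. The sketch below is restricted to a binary alphabet and uses one shift-and-AND step per pattern bit, so it takes time proportional to the pattern length; it is not the constant-time algorithm of the abstract. The name `occurrence_mask` and the convention that bit i of the word holds text position i are assumptions.

```python
def occurrence_mask(text_bits: str, pattern_bits: str) -> int:
    """Return an integer whose bit i is set iff the pattern occurs at text position i."""
    n, m = len(text_bits), len(pattern_bits)
    if m == 0 or m > n:
        return 0
    t = int(text_bits[::-1], 2)          # bit i of t is text_bits[i]
    full = (1 << n) - 1                  # n-bit all-ones word
    result = (1 << (n - m + 1)) - 1      # candidate start positions 0..n-m
    for j, p in enumerate(pattern_bits):
        shifted = t >> j                 # bit i now holds text bit i+j
        # Keep positions whose (i+j)-th text bit equals pattern bit j.
        result &= shifted if p == "1" else (~shifted & full)
    return result


mask = occurrence_mask("1011010011010000", "1101")
print(bin(mask))   # bits 2 and 8 are set
```

Each loop iteration is a constant number of whole-word operations, which is the flavour of bit-parallelism the abstract refers to, applied here in the most direct way.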

9 citations

References

A fast algorithm for multi-pattern searching

Sun Wu¹, Udi Manber²

National Chung Cheng University¹, University of Arizona²

01 Jan 1999

TL;DR: It is argued that, in addition to previous applications that required such search, multi-pattern matching can be used in lieu of indexed or sorted data in some applications involving small to medium size datasets.

Abstract: A new algorithm to search for multiple patterns at the same time is presented. The algorithm is faster than previous algorithms and can support a very large number — tens of thousands — of patterns. Several applications of the multi-pattern matching problem are discussed. We argue that, in addition to previous applications that required such search, multi-pattern matching can be used in lieu of indexed or sorted data in some applications involving small to medium size datasets. Its advantage, of course, is that no additional search structure is needed.

564 citations

Journal Article

Fast exact string matching algorithms

Thierry Lecroq¹

University of Rouen¹

30 May 2007-Information Processing Letters

TL;DR: A new family of very fast string matching algorithms based on hashing q-grams is proposed; they are the fastest in many cases, in particular on small alphabets.

122 citations

Journal Article

Storing text retrieval systems on CD-ROM: compression and encryption considerations

Shmuel T. Klein¹, Abraham Bookstein¹, Scott Deerwester¹

University of Chicago¹

01 Jul 1989-ACM Transactions on Information Systems

TL;DR: Pertinent approaches to compression of the various files are reviewed, and it is shown that, under simple models of text generation, Huffman encoding produces a bit-string indistinguishable from a representation of coin flips.

Abstract: The emergence of the CD-ROM as a storage medium for full-text databases raises the question of the maximum size database that can be contained by this medium. As an example, the problem of storing the Tresor de la Langue Francaise on a CD-ROM is examined in this paper. The text alone of this database is 700 megabytes long, more than a CD-ROM can hold. In addition, the dictionary and concordance needed to access these data must be stored. A further constraint is that some of the material is copyrighted, and it is desirable that such material be difficult to decode except through software provided by the system. Pertinent approaches to compression of the various files are reviewed, and the compression of the text is related to the problem of data encryption: Specifically, it is shown that, under simple models of text generation, Huffman encoding produces a bit-string indistinguishable from a representation of coin flips.

60 citations

"Efficient Pattern Matching on Binar..." refers background or methods in this paper

...This is a reasonable assumption for compressed text [KBD89]....
[...]
...For compression scheme using Huffman coding, such randomness has been shown to hold in [KBD89]....
[...]

Journal Article

Efficient variants of the backward-oracle-matching algorithm

Simone Faro¹, Thierry Lecroq², Litis Ea

University of Catania¹, University of Rouen²

01 Dec 2009-International Journal of Foundations of Computer Science

TL;DR: It turns out that the new proposed variants of the BOM string matching algorithm are very flexible and achieve very good results, especially in the case of large alphabets.

Abstract: In this article we present two efficient variants of the BOM string matching algorithm which are more efficient and flexible than the original algorithm. We also present bitparallel versions of them obtaining an efficient variant of the BNDM algorithm. Then we compare the newly presented algorithms with some of the most recent and effective string matching algorithms. It turns out that the new proposed variants are very flexible and achieve very good results, especially in the case of large alphabets.

51 citations

Proceedings Article

Storing text retrieval systems on CD-ROM: compression and encryption considerations

Shmuel T. Klein¹, Abraham Bookstein¹, Scott Deerwester¹

University of Chicago¹

01 May 1989

TL;DR: Until a few years ago, large full-text information retrieval systems could only be operated on powerful mainframes, but recently, the CD-ROM (compact disc read only memory) optical disc medium has become widespread, permitting access by a PC to very large amounts of storage at very low cost.

46 citations