Topic

Compressed pattern matching

About: Compressed pattern matching is a research topic. Over its lifetime, 133 publications have been published within this topic, receiving 10,207 citations.


Papers
Journal Article
TL;DR: A simple, efficient algorithm that locates all occurrences of any of a finite number of keywords in a string of text; it has been used to improve the speed of a library bibliographic search program by a factor of 5 to 10.
Abstract: This paper describes a simple, efficient algorithm to locate all occurrences of any of a finite number of keywords in a string of text. The algorithm consists of constructing a finite state pattern matching machine from the keywords and then using the pattern matching machine to process the text string in a single pass. Construction of the pattern matching machine takes time proportional to the sum of the lengths of the keywords. The number of state transitions made by the pattern matching machine in processing the text string is independent of the number of keywords. The algorithm has been used to improve the speed of a library bibliographic search program by a factor of 5 to 10.

3,270 citations
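
The construction and single-pass scan described in the abstract above can be sketched in Python as follows. This is a minimal illustration that stores transitions in dictionaries; the names `build_automaton` and `find_all` are assumptions made for this sketch and are not taken from the paper.

```python
from collections import deque


def build_automaton(keywords):
    """Build goto, failure and output tables for a set of non-empty keywords."""
    goto = [{}]      # goto[state][char] -> next state (a trie)
    fail = [0]       # fail[state] -> state to fall back to on a mismatch
    output = [[]]    # output[state] -> keywords that end at this state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)

    # Breadth-first pass: failure links of shallower states are ready
    # before deeper states need them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(text, keywords):
    """Return (start_index, keyword) for every occurrence, in one pass over text."""
    goto, fail, output = build_automaton(keywords)
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            hits.append((i - len(word) + 1, word))
    return hits


print(find_all("ushers", ["he", "she", "his", "hers"]))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The text is read once from left to right, and the failure links replace any backing up in the text, which is why the scan does not restart for each keyword.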

Journal Article
TL;DR: The algorithm has the unusual property that, in most cases, not all of the first i characters of string are inspected.
Abstract: An algorithm is presented that searches for the location, “i,” of the first occurrence of a character string, “pat,” in another string, “string.” During the search operation, the characters of pat are matched starting with the last character of pat. The information gained by starting the match at the end of the pattern often allows the algorithm to proceed in large jumps through the text being searched. Thus the algorithm has the unusual property that, in most cases, not all of the first i characters of string are inspected. The number of characters actually inspected (on the average) decreases as a function of the length of pat. For a random English pattern of length 5, the algorithm will typically inspect i/4 characters of string before finding a match at i. Furthermore, the algorithm has been implemented so that (on the average) fewer than i + patlen machine instructions are executed. These conclusions are supported with empirical evidence and a theoretical analysis of the average behavior of the algorithm. The worst case behavior of the algorithm is linear in i + patlen, assuming the availability of array space for tables linear in patlen plus the size of the alphabet.

2,542 citations
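
The right-to-left matching idea in the abstract above can be illustrated with a short Python sketch. Note that this is the simpler Horspool variant, which uses a single shift table keyed by text character; it is not the full algorithm of the paper and does not share its worst-case bound. The function name and the comparison counter are assumptions added for illustration.

```python
def horspool_search(pat, string):
    """Return (index of first occurrence or -1, number of character comparisons)."""
    m, n = len(pat), len(string)
    if m == 0:
        return 0, 0
    # Distance from the last occurrence of each character (excluding the
    # final pattern position) to the end of the pattern.
    shift = {ch: m - 1 - k for k, ch in enumerate(pat[:-1])}
    comparisons = 0
    i = m - 1  # position in string aligned with the last character of pat
    while i < n:
        j, k = m - 1, i
        # Compare from the end of the pattern toward its start.
        while j >= 0:
            comparisons += 1
            if string[k] != pat[j]:
                break
            j -= 1
            k -= 1
        if j < 0:
            return k + 1, comparisons
        # Jump ahead; characters absent from pat allow a full-length jump.
        i += shift.get(string[i], m)
    return -1, comparisons


text = "here is a simple example"
position, comparisons = horspool_search("example", text)
print(position, comparisons, len(text))
```

Comparing the printed comparison count with the length of the text gives a feel for how the jumps skip over characters without inspecting them.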

Journal Article
TL;DR: The string-matching problem is a very common problem with many extensions; for example, one may be looking for a set of patterns, a pattern with "wild cards," or a regular expression.
Abstract: The string-matching problem is a very common problem. We are searching for a string $P = p_1 p_2 \cdots p_m$ inside a large text file $T = t_1 t_2 \cdots t_n$, both sequences of characters from a finite character set $\Sigma$. The characters may be English characters in a text file, DNA base pairs, lines of source code, angles between edges in polygons, machines or machine parts in a production schedule, music notes and tempo in a musical score, and so forth. We want to find all occurrences of $P$ in $T$; namely, we are searching for the set of starting positions $F = \{\, i \mid 1 \le i \le n - m + 1 \text{ such that } t_i t_{i+1} \cdots t_{i+m-1} = P \,\}$. The two most famous algorithms for this problem are the Boyer-Moore algorithm [3] and the Knuth-Morris-Pratt algorithm [10]. There are many extensions to this problem; for example, we may be looking for a set of patterns, a pattern with "wild cards," or a regular expression. String-matching tools are included in every reasonable text editor, word processor, and many other applications.

806 citations
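
The set of starting positions F defined in the abstract above can be computed directly from its definition. The Python sketch below is a brute-force check of every candidate position, not one of the algorithms the abstract cites; the function name `occurrences` is an assumption, and positions are reported 1-based to match the abstract's indexing.

```python
def occurrences(P, T):
    """Return all 1-based positions i, 1 <= i <= n - m + 1, where P occurs in T."""
    m, n = len(P), len(T)
    # Python slices are 0-based, so position i corresponds to T[i-1 : i-1+m].
    return [i for i in range(1, n - m + 2) if T[i - 1:i - 1 + m] == P]


print(occurrences("aba", "ababa"))  # [1, 3]
```

Overlapping occurrences are all reported, as the definition of F requires.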

Journal Article
TL;DR: A fast compression technique for natural language texts that allows a large number of variations over the basic word and phrase search capability, such as sets of characters, arbitrary regular expressions, and approximate matching.
Abstract: We present a fast compression technique for natural language texts. The novelties are that (1) decompression of arbitrary portions of the text can be done very efficiently, (2) exact search for words and phrases can be done on the compressed text directly, using any known sequential pattern-matching algorithm, and (3) word-based approximate and extended search can also be done efficiently without any decoding. The compression scheme uses a semistatic word-based model and a Huffman code where the coding alphabet is byte-oriented rather than bit-oriented. We compress typical English texts to about 30% of their original size, against 40% and 35% for Compress and Gzip, respectively. Compression time is close to that of Compress and approximately half of the time of Gzip, and decompression time is lower than that of Gzip and one third of that of Compress. We present three algorithms to search the compressed text. They allow a large number of variations over the basic word and phrase search capability, such as sets of characters, arbitrary regular expressions, and approximate matching. Separators and stopwords can be discarded at search time without significantly increasing the cost. When searching for simple words, the experiments show that running our algorithms on a compressed text is twice as fast as running the best existing software on the uncompressed version of the same text. When searching complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique in inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, decompressing only for displaying purposes.

276 citations
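
The idea of searching word-based, byte-oriented compressed text directly can be sketched in Python. This sketch deliberately replaces the paper's Huffman code with a much simpler frequency-ranked byte code, and it treats whitespace-separated tokens as words. The flag bit on the first byte of each codeword, the boundary check, and all function names are assumptions made for illustration only.

```python
from collections import Counter


def encode_rank(rank):
    """Codeword for a vocabulary rank: base-128 digits, high bit set on the first byte."""
    digits = [rank % 128]
    rank //= 128
    while rank:
        digits.append(rank % 128)
        rank //= 128
    digits.reverse()
    digits[0] |= 0x80  # marks the start of a codeword
    return bytes(digits)


def compress(text):
    """Semistatic model: a first pass counts words, a second pass encodes them."""
    tokens = text.split()
    vocab = [word for word, _ in Counter(tokens).most_common()]
    code = {word: encode_rank(i) for i, word in enumerate(vocab)}
    return b"".join(code[word] for word in tokens), code


def search_phrase(data, code, phrase):
    """Return word indexes where phrase occurs, scanning only compressed bytes."""
    words = phrase.split()
    if not words or any(w not in code for w in words):
        return []
    pattern = b"".join(code[w] for w in words)
    hits = []
    pos = data.find(pattern)  # any byte-level string matcher would do here
    while pos != -1:
        end = pos + len(pattern)
        # Accept only matches that end on a codeword boundary.
        if end == len(data) or data[end] & 0x80:
            hits.append(sum(1 for b in data[:pos] if b & 0x80))
        pos = data.find(pattern, pos + 1)
    return hits


data, code = compress("to be or not to be that is the question")
print(search_phrase(data, code, "to be"))  # [0, 4]
```

Because every codeword is a whole number of bytes, the pattern is compressed once and then matched against the compressed bytes with an ordinary matcher, with no decoding of the text.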

Proceedings Article
29 Mar 1999
TL;DR: A compression scheme is developed that is a combination of a simple but powerful phrase derivation method and a compact dictionary encoding that is highly efficient, particularly in decompression, and has characteristics that make it a favorable choice when compressed data is to be searched directly.
Abstract: Dictionary-based modelling is the mechanism used in many practical compression schemes. We use the full message (or a large block of it) to infer a complete dictionary in advance, and include an explicit representation of the dictionary as part of the compressed message. Intuitively, the advantage of this offline approach is that with the benefit of having access to all of the message, it should be possible to optimize the choice of phrases so as to maximize compression performance. Indeed, we demonstrate that very good compression can be attained by an offline method without compromising the fast decoding that is a distinguishing characteristic of dictionary-based techniques. Several nontrivial sources of overhead, in terms of both computation resources required to perform the compression, and bits generated into the compressed message, have to be carefully managed as part of the offline process. To meet this challenge, we have developed a novel phrase derivation method and a compact dictionary encoding. In combination these two techniques produce the compression scheme RE-PAIR, which is highly efficient, particularly in decompression.

228 citations
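
Offline phrase derivation of the kind the abstract describes can be illustrated by repeatedly replacing the most frequent adjacent pair of symbols with a new symbol. The Python sketch below is a naive version that recounts all pairs on every round; it ignores the resource management the abstract mentions and does not attempt the compact dictionary encoding. The function names and the choice of integers from 256 upward for new symbols are assumptions.

```python
from collections import Counter


def repair_compress(data: bytes):
    """Return (sequence, rules); rules maps each new symbol to the pair it replaces."""
    seq = list(data)  # terminals are byte values 0..255
    rules = {}
    next_sym = 256    # new symbols are numbered from 256 (an assumption)
    while len(seq) >= 2:
        # Naive count of adjacent pairs; overlapping pairs are counted too.
        pair, count = Counter(zip(seq, seq[1:])).most_common(1)[0]
        if count < 2:
            break
        rules[next_sym] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_sym += 1
    return seq, rules


def repair_expand(seq, rules):
    """Expand symbols back to bytes using an explicit stack."""
    out = []
    stack = list(reversed(seq))
    while stack:
        sym = stack.pop()
        if sym in rules:
            left, right = rules[sym]
            stack.append(right)
            stack.append(left)
        else:
            out.append(sym)
    return bytes(out)


message = b"abracadabra abracadabra"
seq, rules = repair_compress(message)
print(len(message), len(seq), len(rules))
assert repair_expand(seq, rules) == message
```

Decoding only follows dictionary entries, which is in line with the abstract's point that dictionary-based schemes decode quickly.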


Network Information
Related Topics (5)
Time complexity
36K papers, 879.5K citations
81% related
Tree (data structure)
44.9K papers, 749.6K citations
79% related
String (computer science)
19.4K papers, 333.2K citations
79% related
Approximation algorithm
23.9K papers, 654.3K citations
77% related
Data structure
28.1K papers, 608.6K citations
76% related
Performance Metrics
No. of papers in the topic in previous years
Year    Papers
2021    2
2019    1
2018    2
2017    4
2016    3
2015    3