# Shift-And Approach to Pattern Matching in LZW Compressed Text

01 Jan 1999 · Vol. 156

TL;DR: In this article, the Shift-And algorithm is applied to the problem of pattern matching in LZW compressed text, for patterns whose length is at most 32, i.e., the machine word length.

Abstract: This paper considers the Shift-And approach to the problem of pattern matching in LZW compressed text, and gives a new algorithm that solves it. The algorithm is fast in practice when the pattern length is at most 32, the machine word length. After an O(m + |Σ|) time and O(|Σ|) space preprocessing of the pattern, it scans an LZW compressed text in O(n + r) time and reports all occurrences of the pattern, where n is the compressed text length, m is the pattern length, and r is the number of pattern occurrences. Experimental results show that it runs approximately 1.5 times faster than decompression followed by a simple search using the Shift-And algorithm. Moreover, like the Shift-And algorithm itself, the new algorithm can be extended to generalized pattern matching, to pattern matching with k mismatches, and to multiple-pattern matching.
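The bit-parallel Shift-And step that the abstract builds on can be sketched as follows. This is a minimal illustrative Python version of the classical uncompressed-text algorithm (the function name and structure are my own), not the paper's compressed-text variant; the ≤ 32 restriction in the paper comes from packing the state vector into one machine word.

```python
def shift_and_search(pattern: str, text: str):
    """Report all starting positions where `pattern` occurs in `text`."""
    m = len(pattern)
    # Preprocessing, O(m + |alphabet|): one bitmask per character,
    # with bit i set iff pattern[i] == c.
    mask = {}
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)   # this bit set means a full match ended here
    state = 0               # bit i set iff pattern[0..i] matches a text suffix
    occurrences = []
    for pos, c in enumerate(text):
        # One shift, one OR, one AND per text character (bit-parallel step).
        state = ((state << 1) | 1) & mask.get(c, 0)
        if state & accept:
            occurrences.append(pos - m + 1)
    return occurrences

print(shift_and_search("aba", "abababa"))  # → [0, 2, 4]
```

Python integers are arbitrary precision, so this sketch works for any m; on a real 32- or 64-bit word, the state fits in a single register only up to the word length, which is exactly the regime the paper targets.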

##### Citations



TL;DR: A fast compression technique for natural language texts that allows a large number of variations over the basic word and phrase search capability, such as sets of characters, arbitrary regular expressions, and approximate matching.

Abstract: We present a fast compression technique for natural language texts. The novelties are that (1) decompression of arbitrary portions of the text can be done very efficiently, (2) exact search for words and phrases can be done on the compressed text directly, using any known sequential pattern-matching algorithm, and (3) word-based approximate and extended search can also be done efficiently without any decoding. The compression scheme uses a semistatic word-based model and a Huffman code where the coding alphabet is byte-oriented rather than bit-oriented. We compress typical English texts to about 30% of their original size, against 40% and 35% for Compress and Gzip, respectively. Compression time is close to that of Compress and approximately half of the time of Gzip, and decompression time is lower than that of Gzip and one third of that of Compress. We present three algorithms to search the compressed text. They allow a large number of variations over the basic word and phrase search capability, such as sets of characters, arbitrary regular expressions, and approximate matching. Separators and stopwords can be discarded at search time without significantly increasing the cost. When searching for simple words, the experiments show that running our algorithms on a compressed text is twice as fast as running the best existing software on the uncompressed version of the same text. When searching complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique in inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, decompressing only for displaying purposes.

276 citations


TL;DR: This work addresses the challenge of computing the similarity of two strings in subquadratic time for metrics which use a scoring matrix of unrestricted weights, and presents an algorithm for comparing two run-length encoded strings of length m and n, compressed into m' and n' runs, respectively, in O(m'n + n'm) complexity.

Abstract: Given two strings of size n over a constant alphabet, the classical algorithm for computing the similarity between two sequences [D. Sankoff and J. B. Kruskal, eds., Time Warps, String Edits, and Macromolecules, Addison-Wesley, Reading, MA, 1983; T. F. Smith and M. S. Waterman, J. Molec. Biol., 147 (1981), pp. 195-197] uses a dynamic programming matrix and compares the two strings in O(n^2) time. We address the challenge of computing the similarity of two strings in subquadratic time for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global similarity computations. The speed-up is achieved by dividing the dynamic programming matrix into variable-sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an O(n^2 / log n) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(h n^2 / log n), where h ≤ 1 is the entropy of the text. We also present an algorithm for comparing two run-length encoded strings of length m and n, compressed into m' and n' runs, respectively, in O(m'n + n'm) complexity. This result extends to all distance or similarity scoring schemes that use an additive gap penalty.

156 citations


22 Jul 1999

TL;DR: A general technique for string matching when the text comes as a sequence of blocks is developed, which abstracts the essential features of Ziv-Lempel compression and yields the first algorithm to find all the matches of a pattern in a text compressed using LZ77.

Abstract: We address the problem of string matching on Ziv-Lempel compressed text. The goal is to search a pattern in a text without uncompressing it. This is a highly relevant issue to keep compressed text databases where efficient searching is still possible. We develop a general technique for string matching when the text comes as a sequence of blocks. This abstracts the essential features of Ziv-Lempel compression. We then apply the scheme to each particular type of compression. We present the first algorithm to find all the matches of a pattern in a text compressed using LZ77. When we apply our scheme to LZ78, we obtain a much more efficient search algorithm, which is faster than uncompressing the text and then searching on it. Finally, we propose a new hybrid compression scheme which is between LZ77 and LZ78, being in practice as good to compress as LZ77 and as fast to search in as LZ78.

123 citations


TL;DR: A general framework is introduced that captures the essence of compressed pattern matching for various dictionary-based compressions, including the Lempel-Ziv family, RE-PAIR, SEQUITUR, and static dictionary-based methods.

109 citations


01 Jan 2001

TL;DR: This chapter discusses parallel models of computation, from a textbook originally published in 1993 and updated in 2013 and again in 2016.

Abstract: 1. RAM Model * 2. Lists * 3. Induction and Recursion * 4. Trees * 5. Algorithms Design Techniques * 6. Hashing * 7. Heaps * 8. Balanced Trees * 9. Sets Over a Small Universe * 10. Discrete Fourier Transform (DFT) * 11. Strings * 12. Graphs * 13. Parallel Models of Computation * Appendix of Common Sums * Bibliography * Notation * Index

84 citations

##### References



TL;DR: The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source.

Abstract: A universal algorithm for sequential data compression is presented. Its performance is investigated with respect to a nonprobabilistic model of constrained sources. The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source.

5,844 citations


TL;DR: A new compression algorithm is introduced that is based on principles not found in existing commercial methods in that it dynamically adapts to the redundancy characteristics of the data being compressed, and serves to illustrate system problems inherent in using any compression scheme.

Abstract: Data stored on disks and tapes or transferred over communications links in commercial computer systems generally contains significant redundancy. A mechanism or procedure which recodes the data to lessen the redundancy could possibly double or triple the effective data densities in stored or communicated data. Moreover, if compression is automatic, it can also aid in the reduction of software development costs. A transparent compression mechanism could permit the use of "sloppy" data structures, in that empty space or sparse encoding of data would not greatly expand the use of storage space or transfer time; however, that requires a good compression procedure. Several problems encountered when common compression methods are integrated into computer systems have prevented the widespread use of automatic data compression. For example: (1) poor runtime execution speeds interfere in the attainment of very high data rates; (2) most compression techniques are not flexible enough to process different types of redundancy; (3) blocks of compressed data that have unpredictable lengths present storage-space management problems. Each compression strategy poses a different set of these problems and, consequently, the use of each strategy is restricted to applications where its inherent weaknesses present no critical problems. This article introduces a new compression algorithm that is based on principles not found in existing commercial methods. This algorithm avoids many of the problems associated with older methods in that it dynamically adapts to the redundancy characteristics of the data being compressed. An investigation into possible application of this algorithm yields insight into the compressibility of various types of data and serves to illustrate system problems inherent in using any compression scheme.

For readers interested in simple but subtle procedures, some details of this algorithm and its implementations are also described. The focus throughout this article will be on transparent compression, in which the computer programmer is not aware of the existence of compression except in system performance. This form of compression is "noiseless": the decompressed data is an exact replica of the input data, and the compression apparatus is given no special program information, such as data type or usage statistics. Transparency is perceived to be important because putting an extra burden on the application programmer would cause… (This article was written while Welch was employed at Sperry Research Center; he is now employed with Digital Equipment Corporation.)
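The dynamic adaptation the abstract describes is the heart of LZW: the dictionary grows as the data is read, so frequent phrases get single codes. Below is a compact teaching sketch in Python (names and structure are illustrative, not Welch's implementation, which emits fixed-width codes rather than a Python list):

```python
def lzw_compress(data: bytes):
    """Return the list of dictionary codes for `data`."""
    dictionary = {bytes([i]): i for i in range(256)}  # all single bytes
    w, out = b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc                      # extend the current phrase
        else:
            out.append(dictionary[w])   # emit the longest known phrase
            dictionary[wc] = len(dictionary)  # learn: old phrase + next byte
            w = bytes([byte])
    if w:
        out.append(dictionary[w])
    return out

def lzw_decompress(codes):
    """Rebuild the original bytes, growing the same dictionary in lockstep."""
    dictionary = {i: bytes([i]) for i in range(256)}
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        # A code can refer to the entry currently being built (cScSc case).
        entry = dictionary[code] if code in dictionary else prev + prev[:1]
        out.append(entry)
        dictionary[len(dictionary)] = prev + entry[:1]
        prev = entry
    return b"".join(out)

data = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(data)
assert lzw_decompress(codes) == data
print(len(data), "bytes ->", len(codes), "codes")
```

Note that no dictionary is transmitted: the decompressor reconstructs it from the code stream alone, which is what makes the scheme transparent in the article's sense.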

2,426 citations


TL;DR: The string-matching problem is a very common problem; there are many extensions to this problem; for example, one may be looking for a set of patterns, a pattern with "wild cards," or a regular expression.

Abstract: The string-matching problem is a very common problem. We are searching for a string P = p1 p2 ⋯ pm inside a large text file T = t1 t2 ⋯ tn, both sequences of characters from a finite character set Σ. The characters may be English characters in a text file, DNA base pairs, lines of source code, angles between edges in polygons, machines or machine parts in a production schedule, music notes and tempo in a musical score, and so forth. We want to find all occurrences of P in T; namely, we are searching for the set of starting positions F = {i | 1 ≤ i ≤ n − m + 1 such that t_i t_{i+1} ⋯ t_{i+m−1} = P}. The two most famous algorithms for this problem are the Boyer-Moore algorithm [3] and the Knuth-Morris-Pratt algorithm [10]. There are many extensions to this problem; for example, we may be looking for a set of patterns, a pattern with "wild cards," or a regular expression. String-matching tools are included in every reasonable text editor, word processor, and many other applications.

806 citations


TL;DR: A family of simple and fast algorithms for solving the classical string matching problem, string matching with don't care symbols and complement symbols, and multiple patterns are introduced.

Abstract: We introduce a family of simple and fast algorithms for solving the classical string matching problem, string matching with don't care symbols and complement symbols, and multiple patterns. In addition we solve the same problems allowing up to k mismatches. Among the features of these algorithms are that they are real time algorithms, they don't need to buffer the input, and they are suitable to be implemented in hardware.

656 citations


TL;DR: A generalization of string matching is investigated, in which the pattern is a sequence of pattern elements, each compatible with a set of symbols; it is shown that generalized string matching requires a time-space product of $\Omega(n^2 / \log n)$ on a powerful model of computation when the alphabet is restricted to n symbols.

Abstract: Given a pattern string of length n and an object string of length m, the string matching problem asks for the positions of all occurrences of the pattern in the object string. This paper investigates a generalization of string matching, in which the pattern is a sequence of pattern elements, each compatible with a set of symbols. The alphabet of symbols is infinite, with its members encoded in a finite alphabet. In contrast to standard string matching, which can be solved in simultaneous linear time and constant space, it is shown that generalized string matching requires a time-space product of $\Omega(n^2 / \log n)$ on a powerful model of computation, when the alphabet is restricted to n symbols. Our proof uses a method of Borodin. The obvious algorithm for generalized string matching requires time $O(NM)$, where N is the length of the encoding of the pattern, and M is that of the object string. We describe an algorithm which solves generalized string matching in time $O(N + M + mN^{1/2}$…

351 citations