Journal ArticleDOI

Adapting the Knuth-Morris-Pratt algorithm for pattern matching in Huffman encoded texts

01 Mar 2006-Information Processing and Management (Pergamon Press, Inc.)-Vol. 42, Iss: 2, pp 429-439
TL;DR: A bitwise KMP algorithm is proposed that can move one extra bit in the case of a mismatch, since the alphabet is binary, and is combined with two practical Huffman decoding schemes which handle more than a single bit per machine operation.
Abstract: In the present work we perform compressed pattern matching in binary Huffman encoded texts [Huffman, D. (1952). A method for the construction of minimum-redundancy codes, Proc. of the IRE, 40, 1098-1101]. A modified Knuth-Morris-Pratt algorithm is used in order to overcome the problem of false matches, i.e., an occurrence of the encoded pattern in the encoded text that does not correspond to an occurrence of the pattern itself in the original text. We propose a bitwise KMP algorithm that can move one extra bit in the case of a mismatch, since the alphabet is binary. To avoid processing any bit of the encoded text more than once, a preprocessed table is used to determine how far to back up when a mismatch is detected, and is defined so that we are always able to align the start of the encoded pattern with the start of a codeword in the encoded text. We combine our KMP algorithm with two practical Huffman decoding schemes which handle more than a single bit per machine operation: skeleton trees defined by Klein [Klein, S. T. (2000). Skeleton trees for efficient decoding of Huffman encoded texts. Information Retrieval, 3, 7-23], and numerical comparisons between special canonical values and portions of a sliding window presented in Moffat and Turpin [Moffat, A., & Turpin, A. (1997). On the implementation of minimum redundancy prefix codes. IEEE Transactions on Communications, 45, 1200-1207]. Experiments show rapid search times of our algorithms compared to the "decompress then search" method; files can therefore be kept in their compressed form, saving memory space. When compression gain is important, these algorithms are better than cgrep [Ferragina, P., Tommasi, A., & Manzini, G. (2004). C library to search over compressed texts, http://roquefort.di.unipi.it/~ferrax/CompressedSearch], which is only slightly faster than ours.
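To make the bit-level search concrete, here is a minimal C sketch of classical KMP run directly over the encoded bitstream, one array element per bit. It is an illustration only: the paper's preprocessed table additionally restricts back-ups to codeword-aligned shifts and can advance one extra bit on a mismatch, neither of which is reproduced here.

```c
#include <stdio.h>
#include <stdlib.h>

/* Classical KMP failure function over a pattern of bits (one 0/1 per
   array element). The paper's back-up table is stricter: it only
   admits shifts that keep the pattern start on a codeword boundary. */
static void failure(const unsigned char *p, int m, int *fail) {
    fail[0] = 0;
    for (int i = 1, k = 0; i < m; i++) {
        while (k > 0 && p[i] != p[k]) k = fail[k - 1];
        if (p[i] == p[k]) k++;
        fail[i] = k;
    }
}

static void kmp_bits(const unsigned char *t, int n,
                     const unsigned char *p, int m) {
    int *fail = malloc(m * sizeof *fail);
    failure(p, m, fail);
    for (int i = 0, k = 0; i < n; i++) {
        while (k > 0 && t[i] != p[k]) k = fail[k - 1];
        if (t[i] == p[k]) k++;
        if (k == m) {          /* bit-level hit: without the paper's
                                  aligned table this is only a candidate
                                  and needs a codeword-boundary check */
            printf("candidate at bit %d\n", i - m + 1);
            k = fail[k - 1];
        }
    }
    free(fail);
}

int main(void) {
    unsigned char text[] = {1,0,1,1,0,1,0,1,1,0,0,1,0,1,1,0,1};
    unsigned char pat[]  = {0,1,1,0,1};
    kmp_bits(text, 17, pat, 5);   /* candidates at bits 1 and 12 */
    return 0;
}
```

Because the alphabet is binary, a mismatch determines the text bit exactly, which is what the one-extra-bit shift exploits.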

Summary (1 min read)


  • A modified Knuth-Morris-Pratt (KMP) algorithm is used in order to overcome the problem of false matches, i.e., an occurrence of the encoded pattern in the encoded text that does not correspond to an occurrence of the pattern itself in the original text.
  • The authors propose a bitwise KMP algorithm that can move one extra bit in the case of a mismatch, since the alphabet is binary.
  • To avoid processing any encoded text bit more than once, a preprocessed table is used to determine how far to back up when a mismatch is detected, and is defined so that the encoded pattern is always aligned with the start of a codeword in the encoded text.
  • The KMP algorithm is combined with two Huffman decoding schemes that handle more than a single bit per machine operation; the authors call the combined algorithms sk-kmp and win-kmp respectively.
  • The following table compares their algorithms with cgrep of Moura et al. [2] and agrep which searches the uncompressed text.
  • Columns three and four compare the compression performance (size of the compressed text as a percentage of the uncompressed text) of the Huffman code (huff) with cgrep.
  • The next columns compare the processing time of pattern matching of these algorithms.
  • The “decompress and search” methods, which decode using skeleton trees or Moffat and Turpin’s sliding window and search in parallel using agrep, are called sk-d and win-d respectively.
  • The search times are average values for patterns ranging from infrequent to frequent ones.

Adapting the Knuth-Morris-Pratt Algorithm for Pattern
Matching in Huffman Encoded Texts
Ajay Daptardar and Dana Shapira
{amax/shapird}@cs.brandeis.edu
Computer Science Department, Brandeis University, Waltham, MA
We perform compressed pattern matching in Huffman encoded texts. A modified
Knuth-Morris-Pratt (KMP) algorithm is used in order to overcome the problem of
false matches, i.e., an occurrence of the encoded pattern in the encoded text that does
not correspond to an occurrence of the pattern itself in the original text. We propose
a bitwise KMP algorithm that can move one extra bit in the case of a mismatch,
since the alphabet is binary. To avoid processing any encoded text bit more than
once, a preprocessed table is used to determine how far to back up when a mismatch
is detected, and is defined so that the encoded pattern is always aligned with the
start of a codeword in the encoded text. We combine our KMP algorithm with
two Huffman decoding algorithms which handle more than a single bit per machine
operation: skeleton trees defined by Klein [1], and numerical comparisons between
special canonical values and portions of a sliding window presented in Moffat and
Turpin [3]. We call the combined algorithms sk-kmp and win-kmp respectively.
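As a rough illustration of the second scheme, the sketch below decodes a canonical Huffman code one bit at a time by numerical comparison against per-length boundary values. Moffat and Turpin's method gains its speed by instead comparing a machine-word-sized window of the input against precomputed left-justified boundaries, handling several bits per operation; all identifiers and the toy code here are illustrative, not taken from the paper.

```c
#include <stdint.h>
#include <stdio.h>

/* Canonical Huffman decode tables, indexed by codeword length. */
typedef struct {
    int      max_len;
    uint32_t first_code[16]; /* smallest codeword value of each length */
    int      first_sym[16];  /* index of first symbol of each length   */
    int      count[16];      /* number of codewords of each length     */
    char     sym[16];        /* symbols in canonical order             */
} canon_t;

static const unsigned char stream[] = {1,1,0, 0, 1,0, 1,1,1};
static int pos;
static int next_bit(void) { return stream[pos++]; }

/* A length-len prefix is a complete codeword exactly when its value
   lies in [first_code[len], first_code[len] + count[len]). */
static char decode_symbol(const canon_t *c) {
    uint32_t code = 0;
    for (int len = 1; len <= c->max_len; len++) {
        code = (code << 1) | (uint32_t)next_bit();
        if (c->count[len] &&
            code < c->first_code[len] + (uint32_t)c->count[len])
            return c->sym[c->first_sym[len] + (code - c->first_code[len])];
    }
    return '?';   /* corrupt input */
}

int main(void) {
    /* toy canonical code: a=0, b=10, c=110, d=111 */
    canon_t c = { .max_len = 3,
                  .first_code = {0, 0, 2, 6}, .first_sym = {0, 0, 1, 2},
                  .count = {0, 1, 1, 2}, .sym = "abcd" };
    for (int i = 0; i < 4; i++) putchar(decode_symbol(&c));
    putchar('\n');   /* prints "cabd" */
    return 0;
}
```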
The following table compares our algorithms with cgrep of Moura et al. [2] and
agrep which searches the uncompressed text. Columns three and four compare the
compression performance (size of the compressed text as a percentage of the
uncompressed text) of the Huffman code (huff) with cgrep. The next columns compare
the processing time of pattern matching of these algorithms. The “decompress and
search” methods, which decode using skeleton trees or Moffat and Turpin’s sliding
window and search in parallel using agrep, are called sk-d and win-d respectively. The
search times are average values for patterns ranging from infrequent to frequent ones.
Files            Size (bytes)   Compression (%)    Search Times (sec)
                                cgrep    huff      cgrep   sk-kmp   win-kmp   sk-d   win-d
world192.txt        2,473,400   50.88    32.20     0.07    0.13     0.08      0.21   0.13
bible.txt           4,047,392   49.70    26.18     0.05    0.22     0.13      0.36   0.22
books.txt          12,582,090   52.10    30.30     0.21    0.69     0.39      1.21   0.74
95-03-erp.txt      23,976,547   34.49    25.14     0.18    1.10     0.65      1.80   1.11
As can be seen, the KMP variants are faster than the methods corresponding to
“decompress and search” but slower than cgrep. However, when compression
performance is important or when one does not want to re-compress Huffman
encoded files in order to use cgrep, the proposed algorithms are the better choice.
References
[1] Klein, S. T., Skeleton trees for efficient decoding of Huffman encoded texts, Information Retrieval, 3, 7-23, 2000.
[2] Moura, E. S., Navarro, G., Ziviani, N., & Baeza-Yates, R., Fast and flexible word searching on compressed text, ACM TOIS, 18(2), 113-139, 2000.
[3] Turpin, A., & Moffat, A., Fast file search using text compression, Proc. 20th Australian Computer Science Conference, 1-8, 1997.
Citations
Proceedings ArticleDOI
01 Aug 2017
TL;DR: Comparison of the results of both serial and parallel implementations gives insight into how performance and efficiency are achieved through various techniques of parallelism.
Abstract: String matching refers to the search for each and every occurrence of a string in another string. Nowadays this problem arises in a great many settings, from standard programs for text editing and processing, through databases, all the way to various applications in other sciences. There are numerous efficient algorithms to solve this problem. One of them is the Rabin-Karp algorithm, which has complexity O(m(n-m+1)), whereas the complexity of the proposed advanced Rabin-Karp algorithm is O(n-m). However, the main focus of this research is to apply the concepts of parallelism to improve the performance of the algorithm. There are many parallel processing Application Programming Interfaces (APIs) available, like OpenMP, MPI, CUDA, MapReduce, etc.; of these, we have chosen OpenMP and CUDA to achieve parallelism. Comparing the results of both serial and parallel implementations gives insight into how performance and efficiency are achieved through various techniques of parallelism.
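For reference, a textbook serial Rabin-Karp with a rolling polynomial hash is sketched below; the base and modulus are illustrative choices, and this is the plain algorithm, not the paper's advanced or OpenMP/CUDA variants.

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define B 256ULL          /* hash base: one byte per character */
#define M 1000000007ULL   /* large prime modulus               */

static void rabin_karp(const char *t, const char *p) {
    size_t n = strlen(t), m = strlen(p);
    if (m == 0 || m > n) return;
    uint64_t hp = 0, ht = 0, pw = 1;         /* pw = B^(m-1) mod M */
    for (size_t i = 0; i + 1 < m; i++) pw = pw * B % M;
    for (size_t i = 0; i < m; i++) {
        hp = (hp * B + (unsigned char)p[i]) % M;
        ht = (ht * B + (unsigned char)t[i]) % M;
    }
    for (size_t i = 0; ; i++) {
        /* a hash hit is verified by memcmp to rule out collisions */
        if (ht == hp && memcmp(t + i, p, m) == 0)
            printf("match at %zu\n", i);
        if (i + m == n) break;
        /* roll the window: drop t[i], append t[i+m] */
        ht = ((ht + M - (unsigned char)t[i] * pw % M) * B
              + (unsigned char)t[i + m]) % M;
    }
}

int main(void) {
    rabin_karp("abracadabra", "abra");   /* matches at 0 and 7 */
    return 0;
}
```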

9 citations


Cites background from "Adapting the Knuth-Morris-Pratt alg..."

  • ...It makes sure that the string search won’t take more than n (length of text) comparisons for string matching [12], [13]....

Book ChapterDOI
01 Apr 2021
TL;DR: A new dynamic Huffman encoding approach is proposed that provably always performs at least as well as static Huffman coding, and may be better than the standard dynamic Huffman coding for certain files.
Abstract: Huffman coding is known to be optimal, yet its dynamic version may yield smaller compressed files. The best known bound is that the number of bits used by dynamic Huffman coding in order to encode a message of n characters is at most n bits more than the number of bits required by static Huffman coding. In particular, dynamic Huffman coding can also generate a larger encoded file than the static variant, though in practice the file might often, but not always, be smaller. We propose here a new dynamic Huffman encoding approach that provably always performs at least as well as static Huffman coding, and may be better than the standard dynamic Huffman coding for certain files. This is achieved by reversing the direction of the references of the encoded elements to those forming the model of the encoding, from pointing backwards to looking into the future.

8 citations


Cites background from "Adapting the Knuth-Morris-Pratt alg..."

  • ...Compressed Pattern Matching, that is, searching for strings directly in the compressed form of the text [1, 26]....

Journal ArticleDOI
TL;DR: Evidence is presented here that arithmetic coding may produce an output that is identical to that of Huffman coding, and it is found that there is much variability in the randomness of the output of these techniques.
Abstract: It seems reasonable to expect from a good compression method that its output should not be further compressible, because it should behave essentially like random data. We investigate this premise for a variety of known lossless compression techniques, and find that, surprisingly, there is much variability in the randomness, depending on the chosen method. Arithmetic coding seems to produce perfectly random output, whereas that of Huffman or Ziv-Lempel coding still contains many dependencies. In particular, the output of Huffman coding has already been proven to be random under certain conditions, and we present evidence here that arithmetic coding may produce an output that is identical to that of Huffman.

7 citations

Journal ArticleDOI
TL;DR: In this paper, an efficient approach to the compressed string matching problem on Huffman encoded texts, based on the Boyer-Moore strategy, is proposed; once a candidate valid shift has been located, a subsequent verification phase checks whether the shift is codeword aligned by taking advantage of the skeleton tree data structure.
Abstract: In this paper we propose an efficient approach to the compressed string matching problem on Huffman encoded texts, based on the Boyer-Moore strategy. Once a candidate valid shift has been located, a subsequent verification phase checks whether the shift is codeword aligned by taking advantage of the skeleton tree data structure. Our approach leads to algorithms that exhibit a sublinear behavior on the average, as shown by extensive experimentation.
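The verification step can be pictured with the brute-force C sketch below, which walks the decode tree from the start of the encoded block and tests whether a codeword boundary falls exactly at a candidate bit offset; the skeleton-tree check of the paper does the same job while touching far fewer nodes. The tree layout and the toy code are illustrative, not the authors' data structure.

```c
#include <stdio.h>

/* child[v][b] is the node reached from node v on bit b; children of -1
   mark a leaf. Starting from a known codeword boundary, every return
   to the root marks the next boundary. */
static int is_aligned(const int (*child)[2], const unsigned char *bits,
                      long cand) {
    int v = 0;                        /* 0 is the root */
    for (long i = 0; i < cand; i++) {
        v = child[v][bits[i]];
        if (child[v][0] < 0) v = 0;   /* leaf: next bit starts a codeword */
    }
    return v == 0;                    /* aligned iff the walk is at the root */
}

int main(void) {
    /* toy code a=0, b=10, c=11: root 0 -> {leaf 1, node 2},
       node 2 -> {leaf 3, leaf 4} */
    static const int child[5][2] = {{1,2},{-1,-1},{3,4},{-1,-1},{-1,-1}};
    static const unsigned char bits[] = {1,0, 0, 1,1, 0};   /* b a c a */
    printf("%d %d\n", is_aligned(child, bits, 3),    /* 1: boundary    */
                      is_aligned(child, bits, 4));   /* 0: inside "11" */
    return 0;
}
```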

6 citations

Proceedings ArticleDOI
16 Mar 2009
TL;DR: A new algorithm for solving CTDE is proposed and its compression performance is compared against the traditional "double delta decompression".
Abstract: Given a source file $S$ and two differencing files $\Delta(S,T)$ and $\Delta(T,R)$, where $\Delta(X,Y)$ denotes the delta file of the target file $Y$ with respect to the source file $X$, the objective is to be able to construct $R$. This is intended for the scenario of upgrading software where intermediate releases are missing, or for the case of file system backups, where non-consecutive versions must be recovered. The traditional way is to decompress $\Delta(S,T)$ in order to construct $T$ and then apply $\Delta(T,R)$ on $T$ and obtain $R$. The Compressed Transitive Delta Encoding (CTDE) paradigm, introduced in this paper, is to construct a delta file $\Delta(S,R)$ working directly on the two given delta files, $\Delta(S,T)$ and $\Delta(T,R)$, without any decompression or the use of the base file $S$. A new algorithm for solving CTDE is proposed and its compression performance is compared against the traditional "double delta decompression". Not only does it use constant additional space, as opposed to the traditional method which uses linear additional memory storage, but experiments show that the size of the delta files involved is reduced by 15% on average.

6 citations

References
Journal ArticleDOI
01 Sep 1952
TL;DR: A minimum-redundancy code is one constructed in such a way that the average number of coding digits per message is minimized.
Abstract: An optimum method of coding an ensemble of messages consisting of a finite number of members is developed. A minimum-redundancy code is one constructed in such a way that the average number of coding digits per message is minimized.
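The construction behind this result repeatedly merges the two least-weight subtrees. The sketch below derives codeword lengths that way, using a simple O(n^2) minimum selection to stay short (a heap gives the usual O(n log n)); the frequencies are made up.

```c
#include <stdio.h>

#define N 6   /* alphabet size (illustrative) */

int main(void) {
    double w[2 * N];              /* node weights                      */
    int parent[2 * N] = {0};      /* 0 = not merged yet (no node can
                                     have leaf 0 as its parent)        */
    const double freq[N] = { 0.30, 0.25, 0.20, 0.12, 0.08, 0.05 };
    for (int i = 0; i < N; i++) w[i] = freq[i];

    int next = N;                 /* internal nodes get indices N..    */
    for (int step = 0; step < N - 1; step++, next++) {
        int a = -1, b = -1;       /* two active nodes of least weight  */
        for (int i = 0; i < next; i++) {
            if (parent[i]) continue;
            if (a < 0 || w[i] < w[a]) { b = a; a = i; }
            else if (b < 0 || w[i] < w[b]) b = i;
        }
        w[next] = w[a] + w[b];
        parent[a] = parent[b] = next;
    }
    /* codeword length of a symbol = depth of its leaf */
    for (int i = 0; i < N; i++) {
        int len = 0;
        for (int j = i; parent[j]; j = parent[j]) len++;
        printf("symbol %d: weight %.2f, length %d\n", i, freq[i], len);
    }
    return 0;   /* lengths 2,2,2,3,4,4 for these weights */
}
```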

5,221 citations

Journal ArticleDOI
TL;DR: An algorithm is presented which finds all occurrences of one given string within another, in running time proportional to the sum of the lengths of the strings, showing that the set of concatenations of even palindromes, i.e., the language $\{\alpha \alpha ^R\}^*$, can be recognized in linear time.
Abstract: An algorithm is presented which finds all occurrences of one given string within another, in running time proportional to the sum of the lengths of the strings. The constant of proportionality is low enough to make this algorithm of practical use, and the procedure can also be extended to deal with some more general pattern-matching problems. A theoretical application of the algorithm shows that the set of concatenations of even palindromes, i.e., the language $\{\alpha \alpha ^R\}^*$, can be recognized in linear time. Other algorithms which run even faster on the average are also considered.

3,156 citations

Journal ArticleDOI
TL;DR: The string-matching problem is a very common problem; there are many extensions to this problem; for example, we may be looking for a set of patterns, a pattern with "wild cards," or a regular expression.
Abstract: The string-matching problem is a very common problem. We are searching for a string $P = p_1 p_2 \cdots p_m$ inside a large text file $T = t_1 t_2 \cdots t_n$, both sequences of characters from a finite character set $\Sigma$. The characters may be English characters in a text file, DNA base pairs, lines of source code, angles between edges in polygons, machines or machine parts in a production schedule, music notes and tempo in a musical score, and so forth. We want to find all occurrences of $P$ in $T$; namely, we are searching for the set of starting positions $F = \{ i \mid 1 \le i \le n - m + 1 \text{ such that } t_i t_{i+1} \cdots t_{i+m-1} = P \}$. The two most famous algorithms for this problem are the Boyer-Moore algorithm [3] and the Knuth-Morris-Pratt algorithm [10]. There are many extensions to this problem; for example, we may be looking for a set of patterns, a pattern with "wild cards," or a regular expression. String-matching tools are included in every reasonable text editor, word processor, and many other applications.

806 citations

Journal ArticleDOI
TL;DR: A fast compression technique for natural language texts that allows a large number of variations over the basic word and phrase search capability, such as sets of characters, arbitrary regular expressions, and approximate matching.
Abstract: We present a fast compression technique for natural language texts. The novelties are that (1) decompression of arbitrary portions of the text can be done very efficiently, (2) exact search for words and phrases can be done on the compressed text directly, using any known sequential pattern-matching algorithm, and (3) word-based approximate and extended search can also be done efficiently without any decoding. The compression scheme uses a semistatic word-based model and a Huffman code where the coding alphabet is byte-oriented rather than bit-oriented. We compress typical English texts to about 30% of their original size, against 40% and 35% for Compress and Gzip, respectively. Compression time is close to that of Compress and approximately half of the time of Gzip, and decompression time is lower than that of Gzip and one third of that of Compress. We present three algorithms to search the compressed text. They allow a large number of variations over the basic word and phrase search capability, such as sets of characters, arbitrary regular expressions, and approximate matching. Separators and stopwords can be discarded at search time without significantly increasing the cost. When searching for simple words, the experiments show that running our algorithms on a compressed text is twice as fast as running the best existing software on the uncompressed version of the same text. When searching complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique in inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, decompressing only for displaying purposes.
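Because codewords are whole bytes in this scheme, the encoded pattern is an ordinary byte string and any sequential byte matcher can scan the compressed text, as in the sketch below (all byte values made up). In the tagged variant of the scheme a flag bit marks codeword starts and rules out false matches; with a plain byte code, each hit remains a candidate until its codeword alignment is verified.

```c
#include <stdio.h>
#include <string.h>

/* Scan byte-oriented Huffman-encoded text for an encoded pattern with
   memchr/memcmp; any other sequential matcher would do as well. */
static void search_compressed(const unsigned char *ctext, size_t clen,
                              const unsigned char *cpat, size_t plen) {
    const unsigned char *p = ctext, *end = ctext + clen;
    while (plen && (size_t)(end - p) >= plen &&
           (p = memchr(p, cpat[0], end - p - plen + 1))) {
        if (memcmp(p, cpat, plen) == 0)
            printf("candidate at byte %ld\n", (long)(p - ctext));
        p++;
    }
}

int main(void) {
    /* stand-in bytes; in practice cpat is the concatenation of the
       codewords of the query words */
    const unsigned char text[] = { 0x81, 0x42, 0x81, 0x07, 0x81, 0x42 };
    const unsigned char pat[]  = { 0x81, 0x42 };
    search_compressed(text, sizeof text, pat, sizeof pat);  /* 0 and 4 */
    return 0;
}
```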

276 citations


"Adapting the Knuth-Morris-Pratt alg..." refers background in this paper

  • ...[2] and agrep which searches the uncompressed text....

Journal ArticleDOI
TL;DR: In this article, the authors consider pattern matching without decompression in the UNIX Z-compression scheme and show how to modify their algorithms to achieve a trade-off between the amount of extra space used and the algorithm's time complexity.

223 citations