Journal ArticleDOI

Adapting the Knuth-Morris-Pratt algorithm for pattern matching in Huffman encoded texts

01 Mar 2006-Information Processing and Management (Pergamon Press, Inc.)-Vol. 42, Iss: 2, pp 429-439
TL;DR: A bitwise KMP algorithm is proposed that can move one extra bit in the case of a mismatch, since the alphabet is binary, and it is combined with two practical Huffman decoding schemes that handle more than a single bit per machine operation.
Abstract: In the present work we perform compressed pattern matching in binary Huffman encoded texts [Huffman, D. (1952). A method for the construction of minimum redundancy codes, Proc. of the IRE, 40, 1098-1101]. A modified Knuth-Morris-Pratt algorithm is used in order to overcome the problem of false matches, i.e., an occurrence of the encoded pattern in the encoded text that does not correspond to an occurrence of the pattern itself in the original text. We propose a bitwise KMP algorithm that can move one extra bit in the case of a mismatch, since the alphabet is binary. To avoid processing any bit of the encoded text more than once, a preprocessed table is used to determine how far to back up when a mismatch is detected; it is defined so that we are always able to align the start of the encoded pattern with the start of a codeword in the encoded text. We combine our KMP algorithm with two practical Huffman decoding schemes which handle more than a single bit per machine operation: skeleton trees defined by Klein [Klein, S. T. (2000). Skeleton trees for efficient decoding of Huffman encoded texts. Information Retrieval, 3, 7-23], and numerical comparisons between special canonical values and portions of a sliding window presented in Moffat and Turpin [Moffat, A., & Turpin, A. (1997). On the implementation of minimum redundancy prefix codes. IEEE Transactions on Communications, 45, 1200-1207]. Experiments show rapid search times for our algorithms compared to the "decompress then search" method; files can therefore be kept in their compressed form, saving memory space. When compression gain is important, these algorithms are better than cgrep [Ferragina, P., Tommasi, A., & Manzini, G. (2004). C Library to search over compressed texts, http://roquefort.di.unipi.it/~ferrax/CompressedSearch], which is only slightly faster than ours.
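
To make the "extra bit" observation concrete, the following is a minimal sketch of KMP over a binary alphabet. It is an illustration of the idea only, not the authors' implementation: the bit streams are modelled here as strings of '0'/'1' characters rather than packed machine words, and the optimized failure table guarantees that after a mismatch the next comparison must succeed, so the scan advances one extra bit.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* fail[j]: longest proper border of pat[0..j-1]  (0 <= j <= m)
     * next[j]: failure value used on a mismatch at pattern position j;
     *          whenever next[j] >= 0, the bit following that border differs
     *          from pat[j], which is what enables the extra-bit shift.      */
    static void build_tables(const char *pat, int m, int *fail, int *next)
    {
        int j, k = -1;
        fail[0] = -1;
        next[0] = -1;
        for (j = 0; j < m; j++) {
            while (k >= 0 && pat[k] != pat[j])
                k = fail[k];
            k++;
            fail[j + 1] = k;
            if (j + 1 < m)
                next[j + 1] = (pat[k] == pat[j + 1]) ? next[k] : k;
        }
    }

    /* Count the occurrences of pat in txt, both over the alphabet {0,1}. */
    static int bitwise_kmp(const char *txt, int n, const char *pat, int m,
                           const int *fail, const int *next)
    {
        int i = 0, j = 0, count = 0;
        while (i < n) {
            if (txt[i] == pat[j]) {
                i++; j++;
                if (j == m) {      /* full match ending at bit i-1       */
                    count++;
                    j = fail[m];   /* keep scanning, allowing overlaps   */
                }
            } else {
                /* Mismatch at position j: next[j] names a border whose
                 * following bit differs from pat[j]; on a binary alphabet
                 * that bit must equal txt[i], so advance one extra bit
                 * without re-comparing.                                   */
                j = next[j] + 1;   /* next[j] == -1 simply gives j = 0    */
                i++;
            }
        }
        return count;
    }

    int main(void)
    {
        const char *txt = "0110101101011";
        const char *pat = "10110";
        int m = (int)strlen(pat);
        int *fail = malloc((m + 1) * sizeof *fail);
        int *next = malloc(m * sizeof *next);
        build_tables(pat, m, fail, next);
        printf("occurrences: %d\n",
               bitwise_kmp(txt, (int)strlen(txt), pat, m, fail, next));
        free(fail);
        free(next);
        return 0;
    }

Note that this plain bitwise KMP by itself still admits false matches; the preprocessed back-up table described in the abstract is what restricts candidate occurrences to codeword boundaries.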

Summary (1 min read)


Summary

  • A modified Knuth-Morris-Pratt (KMP) algorithm is used in order to overcome the problem of false matches, i.e., an occurrence of the encoded pattern in the encoded text that does not correspond to an occurrence of the pattern itself in the original text.
  • The authors propose a bitwise KMP algorithm that can move one extra bit in the case of a mismatch, since the alphabet is binary.
  • To avoid processing any encoded text bit more than once, a preprocessed table is used to determine how far to back up when a mismatch is detected, and is defined so that the encoded pattern is always aligned with the start of a codeword in the encoded text.
  • The authors call the combined algorithms sk-kmp and win-kmp respectively.
  • The following table compares their algorithms with cgrep of Moura et al. [2] and with agrep, which searches the uncompressed text.
  • Columns three and four compare the compression performance (size of the compressed text as a percentage of the uncompressed text) of the Huffman code (huff) with cgrep.
  • The next columns compare the pattern matching times of these algorithms.
  • The “decompress and search” methods, which decode using skeleton trees or Moffat and Turpin’s sliding window and search in parallel using agrep, are called sk-d and win-d respectively; a sketch of this style of canonical decoding follows this list.
  • The search times are average values for patterns ranging from infrequent to frequent ones.
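
As a rough illustration of the decoding style behind the win-kmp and win-d variants (canonical-code decoding by numerical comparison, in the spirit of Moffat and Turpin), the sketch below decodes a toy four-symbol canonical code. The table names and the tiny code are invented for this example, and a real decoder would refill a machine-word window instead of pulling one bit at a time.

    #include <stdio.h>

    #define MAX_LEN 3

    /* A toy canonical code: a=0, b=10, c=110, d=111.                        */
    static const char symbols[]          = "abcd";       /* canonical order   */
    static const int  count[MAX_LEN + 1] = {0, 1, 1, 2}; /* codewords per len */
    static int first_code[MAX_LEN + 1];                  /* value of 1st code */
    static int first_sym[MAX_LEN + 1];                   /* index of 1st sym  */

    static void build_tables(void)
    {
        int l, code = 0, idx = 0;
        for (l = 1; l <= MAX_LEN; l++) {
            first_code[l] = code;
            first_sym[l]  = idx;
            code = (code + count[l]) << 1;   /* canonical-code recurrence */
            idx += count[l];
        }
    }

    /* Decode a '0'/'1' string produced by the code above. */
    static void decode(const char *bits)
    {
        int v = 0, l = 0;
        for (; *bits; bits++) {
            v = (v << 1) | (*bits - '0');
            l++;
            /* Codewords of length l occupy the numerical interval
             * [first_code[l], first_code[l] + count[l]); any larger value
             * is the prefix of a longer codeword, so keep reading bits.   */
            if (v < first_code[l] + count[l]) {
                putchar(symbols[first_sym[l] + (v - first_code[l])]);
                v = 0;
                l = 0;
            }
        }
        putchar('\n');
    }

    int main(void)
    {
        build_tables();
        decode("110010111");   /* prints "cabd": 110=c, 0=a, 10=b, 111=d */
        return 0;
    }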



Adapting the Knuth-Morris-Pratt Algorithm for Pattern
Matching in Huffman Encoded Texts
Ajay Daptardar and Dana Shapira
{amax/shapird}@cs.brandeis.edu
Computer Science Department, Brandeis University, Waltham, MA
We perform compressed pattern matching in Huffman encoded texts. A modified
Knuth-Morris-Pratt (KMP) algorithm is used in order to overcome the problem of
false matches, i.e., an occurrence of the encoded pattern in the encoded text that does
not correspond to an occurrence of the pattern itself in the original text. We propose
a bitwise KMP algorithm that can move one extra bit in the case of a mismatch,
since the alphabet is binary. To avoid processing any encoded text bit more than
once, a preprocessed table is used to determine how far to back up when a mismatch
is detected, and is defined so that the encoded pattern is always aligned with the
start of a codeword in the encoded text; a sketch of this alignment table follows
this paragraph. We combine our KMP algorithm with two Huffman decoding
algorithms which handle more than a single bit per machine operation: skeleton
trees defined by Klein [1], and numerical comparisons between special canonical
values and portions of a sliding window presented in Moffat and Turpin [3]. We
call the combined algorithms sk-kmp and win-kmp respectively.
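
The back-up table can be read as a standard KMP failure table restricted to shifts that keep the pattern start on a codeword boundary: over the matched region the text bits equal the encoded-pattern bits, and a prefix code decodes deterministically, so codeword boundaries inside that region sit exactly at the boundary offsets of the encoded pattern itself. The sketch below encodes that restriction; it is our reading of the idea, not necessarily the table exactly as defined in the paper, and the sentinel -1 stands for "discard the matched prefix and resynchronize with the decoder at the next codeword boundary".

    #include <stdio.h>

    /* epat  : encoded pattern as '0'/'1' characters, m bits long
     * bound : bound[b] != 0 iff bit offset b of epat starts a codeword
     *         (bound[0] and bound[m] are 1, since epat is whole codewords)
     * aligned[j]: longest border f of epat[0..j-1] such that shifting by
     *         j - f lands the pattern start on a codeword boundary.       */
    static void build_aligned_fail(const char *epat, int m,
                                   const unsigned char *bound, int *aligned)
    {
        int j, k = -1;
        int fail[m + 1];                /* plain KMP border table */

        fail[0] = -1;
        for (j = 0; j < m; j++) {
            while (k >= 0 && epat[k] != epat[j])
                k = fail[k];
            k++;
            fail[j + 1] = k;
        }

        aligned[0] = -1;
        for (j = 1; j <= m; j++) {
            int f = fail[j];
            /* walk the border chain until the implied shift j - f lands on
             * a codeword boundary; -1 means no such realignment exists.   */
            while (f >= 0 && !bound[j - f])
                f = (f > 0) ? fail[f] : -1;
            aligned[j] = f;
        }
    }

    int main(void)
    {
        /* toy prefix code a=0, b=10, c=11; the pattern "aba" encodes to
         * 0 10 0, so codewords start at bit offsets 0, 1 and 3.          */
        const char *epat = "0100";
        const unsigned char bound[5] = {1, 1, 0, 1, 1};
        int aligned[5], j;

        build_aligned_fail(epat, 4, bound, aligned);
        for (j = 0; j <= 4; j++)
            printf("aligned[%d] = %d\n", j, aligned[j]);
        return 0;
    }
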
The following table compares our algorithms with cgrep of Moura et al. [2] and
agrep which searches the uncompressed text. Columns three and four compare the
compression performance (size of the compressed text as a percentage of the uncom-
pressed text) of the Huffman code (huff) with cgrep. The next columns compare
the pattern matching times of these algorithms. The “decompress and
search” methods, which decode using skeleton trees or Moffat and Turpin’s sliding
window and search in parallel using agrep, are called sk-d and win-d respectively. The
search times are average values for patterns ranging from infrequent to frequent ones.
Files            Size (bytes)   Compression (%)      Search Times (sec)
                                cgrep    huff        cgrep   sk-kmp   win-kmp   sk-d   win-d
world192.txt      2,473,400     50.88    32.20       0.07    0.13     0.08      0.21   0.13
bible.txt         4,047,392     49.70    26.18       0.05    0.22     0.13      0.36   0.22
books.txt        12,582,090     52.10    30.30       0.21    0.69     0.39      1.21   0.74
95-03-erp.txt    23,976,547     34.49    25.14       0.18    1.10     0.65      1.80   1.11
As can be seen, the KMP variants are faster than the methods corresponding to
“decompress and search” but slower than cgrep. However, when compression perfor-
mance is important or when one does not want to re-compress Huffman encoded files
in order to use cgrep, the proposed algorithms are the better choice.
References
[1] Klein S.T., Skeleton trees for efficient decoding of Huffman encoded texts, Information Retrieval, 3, 7-23, 2000.
[2] Moura E.S., Navarro G., Ziviani N. and Baeza-Yates R., Fast and flexible word searching on compressed text, ACM TOIS, 18(2), 113-139, 2000.
[3] Turpin A. and Moffat A., Fast file search using text compression, Proc. 20th Australasian Computer Science Conference, 1-8, 1997.
Citations
Journal ArticleDOI
TL;DR: The experimental results show that the proposed algorithm could achieve an excellent compression ratio without losing data when compared to the standard compression algorithms.
Abstract: The development of multimedia and digital imaging has led to the high quantity of data required to represent modern imagery. This requires large disk space for storage and long transmission times over computer networks, both of which are relatively expensive. These factors prove the need for image compression. Image compression addresses the problem of reducing the amount of space required to represent a digital image, yielding a compact representation of an image and thereby reducing the image storage/transmission time requirements. The key idea here is to remove the redundancy present within an image to reduce its size without affecting its essential information. We are concerned with lossless image compression in this paper. Our proposed approach is a mix of a number of already existing techniques. Our approach works as follows: first, we apply the well-known Lempel-Ziv-Welch (LZW) algorithm to the image in hand. The output of the first step is forwarded to the second step, where the Bose, Chaudhuri and Hocquenghem (BCH) error correction and detection algorithm is used. To improve the compression ratio, the proposed approach applies the BCH algorithm repeatedly until “inflation” is detected. The experimental results show that the proposed algorithm could achieve an excellent compression ratio without losing data when compared to the standard compression algorithms.

29 citations


Additional excerpts

  • ...An advantage of this technique is that it allows for higher compression ratio than the lossless [3,4]....


Journal ArticleDOI
TL;DR: The Wavelet tree is adapted, in this paper, to Fibonacci codes, so that in addition to supporting direct access to the Fibonacci encoded file, it also increases the compression savings when compared to the original Fibonacci compressed file.

27 citations


Cites methods from "Adapting the Knuth-Morris-Pratt alg..."

  • ...This has also been applied on Huffman trees [20] producing a compact tree for efficient use, such as compressed pattern matching [29]....


Journal ArticleDOI
01 Sep 2017
TL;DR: This research models the search process of the Knuth-Morris-Pratt algorithm as an easy-to-understand visualization; the Knuth-Morris-Pratt algorithm was selected because it is easy to learn and easy to implement in many programming languages.
Abstract: In this research we model the search process of the Knuth-Morris-Pratt algorithm in the form of an easy-to-understand visualization; the Knuth-Morris-Pratt algorithm was selected because it is easy to learn and easy to implement in many programming languages.

26 citations

Posted Content
TL;DR: This paper presents two efficient algorithms for the binary string matching problem that completely avoid any reference to bits, allowing pattern and text to be processed byte by byte.
Abstract: The binary string matching problem consists in finding all the occurrences of a pattern in a text where both strings are built on a binary alphabet. This is an interesting problem in computer science, since binary data are omnipresent in telecom and computer network applications. The problem also finds applications in the field of image processing and in pattern matching on compressed texts. Recently it has been shown that adaptations of classical exact string matching algorithms are not very efficient on binary data. In this paper we present two efficient algorithms for the problem that completely avoid any reference to bits, allowing pattern and text to be processed byte by byte. Experimental results show that the new algorithms outperform existing solutions in most cases.

15 citations

Journal ArticleDOI
TL;DR: The pruning procedure is improved and empirical evidence is given that when memory storage is the main concern, the suggested data structure outperforms other direct access techniques such as those due to Külekci, DACs and sampling, with a slowdown as compared to DACs and fixed-length encoding.

15 citations


Cites methods from "Adapting the Knuth-Morris-Pratt alg..."

  • ...Skeleton trees have been used to accelerate compressed pattern matching in [23]....


References
Proceedings ArticleDOI
24 Mar 1992
TL;DR: The authors propose that compression be used as a time saving tool, in addition to its traditional role of space saving, by introducing a new pattern matching paradigm, compressed matching.
Abstract: Digitized images are known to be extremely space consuming. However, regularities in the images can often be exploited to reduce the necessary storage area. Thus, many systems store images in a compressed form. The authors propose that compression be used as a time saving tool, in addition to its traditional role of space saving. They introduce a new pattern matching paradigm, compressed matching. A text array T and pattern array P are given in compressed forms c(T) and c(P). They seek all appearances of P in T, without decompressing T. This achieves a search time that is sublinear in the size |T| of the uncompressed text. They show that for the two-dimensional run-length compression there is an O(|c(T)| log |P| + |P|), or almost optimal, algorithm. The algorithm uses a novel multidimensional pattern matching technique, two-dimensional periodicity analysis.

182 citations

Journal ArticleDOI
Udi Manber1
TL;DR: A new text compression scheme is presented in this article to speed up string matching by searching the compressed file directly, and can remain compressed indefinitely, saving space while allowing faster search at the same time.
Abstract: A new text compression scheme is presented in this article. The main purpose of this scheme is to speed up string matching by searching the compressed file directly. The scheme requires no modification of the string-matching algorithm, which is used as a black box; any string-matching procedure can be used. Instead, the pattern is modified; only the outcome of the matching of the modified pattern against the compressed file is decompressed. Since the compressed file is smaller than the original file, the search is faster both in terms of I/O time and processing time than a search in the original file. For typical text files, we achieve about 30% reduction of space and slightly less in search time. A 30% space saving is not competitive with good text compression schemes, and thus should not be used where space is the predominant concern. The intended applications of this scheme are files that are searched often, such as catalogs, bibliographic files, and address books. Such files are typically not compressed, but with this scheme they can remain compressed indefinitely, saving space while allowing faster search at the same time. A particular application to an information retrieval system that we developed is also discussed.

151 citations

Journal ArticleDOI
TL;DR: Computer programs for generating a minimum-redundancy exhaustive prefix encoding are described: one program generates a Huffman frequency tree, another determines the structure functions of an encoding, and a third assigns codes.
Abstract: Computer programs for generating a minimum-redundancy exhaustive prefix encoding are described. One program generates a Huffman frequency tree, another determines the structure functions of an encoding, and a third program assigns codes.

137 citations

Journal ArticleDOI
TL;DR: A modified decoding method is described that allows improved decoding speed, requiring just a few machine operations per output symbol (rather than for each decoded bit), and uses just a few hundred bytes of memory above and beyond the space required to store an enumeration of the source alphabet.
Abstract: Minimum redundancy coding (also known as Huffman coding) is one of the enduring techniques of data compression. Many efforts have been made to improve the efficiency of minimum redundancy coding, the majority based on the use of improved representations for explicit Huffman trees. In this paper, we examine how minimum redundancy coding can be implemented efficiently by divorcing coding from a code tree, with emphasis on the situation when n is large, perhaps on the order of 10^6. We review techniques for devising minimum redundancy codes, and consider in detail how encoding and decoding should be accomplished. In particular, we describe a modified decoding method that allows improved decoding speed, requiring just a few machine operations per output symbol (rather than for each decoded bit), and uses just a few hundred bytes of memory above and beyond the space required to store an enumeration of the source alphabet.

114 citations

Book ChapterDOI
16 Aug 1995
TL;DR: It is shown that if p is sorted then it is possible to calculate the array l in-place, with \(l_i\) overwriting \(p_i\), in O(n) time and using O(1) additional space.
Abstract: The optimal prefix-free code problem is to determine, for a given array \(p = [p_i \mid i \in \{1 \ldots n\}]\) of n weights, an integer array \(l = [l_i \mid i \in \{1 \ldots n\}]\) of n codeword lengths such that \(\sum_{i=1}^{n} 2^{-l_i} \le 1\) and \(\sum_{i=1}^{n} p_i l_i\) is minimized. Huffman's famous greedy algorithm solves this problem in O(n log n) time, if p is unsorted; and can be implemented to execute in O(n) time, if the input array p is sorted. Here we consider the space requirements of the greedy method. We show that if p is sorted then it is possible to calculate the array l in-place, with \(l_i\) overwriting \(p_i\), in O(n) time and using O(1) additional space. The new implementation leads directly to an O(n log n)-time and n + O(1) words of extra space implementation for the case when p is not sorted. The proposed method is simple to implement and executes quickly.

66 citations