Adapting the Knuth-Morris-Pratt algorithm for pattern matching in Huffman encoded texts

doi:10.1016/J.IPM.2005.02.003

Journal Article•DOI•

Adapting the Knuth-Morris-Pratt algorithm for pattern matching in Huffman encoded texts

Dana Shapira¹, Ajay H. Daptardar¹•Institutions (1)

01 Mar 2006-Information Processing and Management (Pergamon Press, Inc.)-Vol. 42, Iss: 2, pp 429-439

TL;DR: A bitwise KMP algorithm is proposed that can move one extra bit in the case of a mismatch since the alphabet is binary, and two practical Huffman decoding schemes which handle more than a single bit per machine operation are combined.

read less

Abstract: In the present work we perform compressed pattern matching in binary Huffman encoded texts [Huffman, D. (1952). A method for the construction of minimum redundancy codes, Proc. of the IRE, 40, 1098-1101]. A modified Knuth-Morris-Pratt algorithm is used in order to overcome the problem of false matches, i.e., an occurrence of the encoded pattern in the encoded text that does not correspond to an occurrence of the pattern itself in the original text. We propose a bitwise KMP algorithm that can move one extra bit in the case of a mismatch since the alphabet is binary. To avoid processing any bit of the encoded text more than once, a preprocessed table is used to determine how far to back up when a mismatch is detected, and is defined so that we are always able to align the start of the encoded pattern with the start of a codeword in the encoded text. We combine our KMP algorithm with two practical Huffman decoding schemes which handle more than a single bit per machine operation; skeleton trees defined by Klein [Klein, S. T. (2000). Skeleton trees for efficient decoding of huffman encoded texts. Information Retrieval, 3, 7-23], and numerical comparisons between special canonical values and portions of a sliding window presented in Moffat and Turpin [Moffat, A., & Turpin, A. (1997). On the implementation of minimum redundancy prefix codes. IEEE Transactions on Communications, 45, 1200-1207]. Experiments show rapid search times of our algorithms compared to the "decompress then search" method, therefore, files can be kept in their compressed form, saving memory space. When compression gain is important, these algorithms are better than cgrep [Ferragina, P., Tommasi, A., & Manzini, G. (2004). C Library to search over compressed texts, http://roquefort.di.unipi.it/~ferrax/CompressedSearch], which is only slightly faster than ours.

...read moreread less

Summary (1 min read)

Jump to: and [Summary]

Summary

A modified Knuth-Morris-Pratt (KMP) algorithm is used in order to overcome the problem of false matches, i.e., an occurrence of the encoded pattern in the encoded text that does not correspond to an occurrence of the pattern itself in the original text.
The authors propose a bitwise KMP algorithm that can move one extra bit in the case of a mismatch, since the alphabet is binary.
To avoid processing any encoded text bit more than once, a preprocessed table is used to determine how far to back up when a mismatch is detected, and is defined so that the encoded pattern is always aligned with the start of a codeword in the encoded text.
The authors call the combined algorithms sk-kmp and win-kmp respectively.
The following table compares their algorithms with cgrep of Moura et al. [2] and agrep which searches the uncompressed text.
Columns three and four compare the compression performance (size of the compressed text as a percentage of the uncompressed text) of the Huffman code (huff ) with cgrep.
The next columns compare the processing time of pattern matching of these algorithms.
The “decompress and search” methods, which decode using skeleton trees or Moffat and Turpin’s sliding window and search in parallel using agrep, are called sk-d and win-d respectively.
The search times are average values for patterns ranging from infrequent to frequent ones.

Did you find this useful? Give us your feedback

Content maybe subject to copyright Report

Adapting the Knuth-Morris-Pratt Algorithm for Pattern

Matching in Huﬀman Encoded Texts

Ajay Daptardar and Dana Shapira

{amax/shapird}@cs.brandeis.edu

Computer Science Department, Brandeis University, Waltham, MA

We perform compressed pattern matching in Huﬀman encoded texts. A modiﬁed

Knuth-Morris-Pratt (KMP) algorithm is used in order to overcome the problem of

false matches, i.e., an occurrence of the encoded pattern in the encoded text that does

not correspond to an occurrence of the pattern itself in the original text. We propose

a bitwise KMP algorithm that can move one extra bit in the case of a mismatch,

since the alphabet is binary. To avoid processing any encoded text bit more than

once, a preprocessed table is used to determine how far to back up when a mismatch

is detected, and is deﬁned so that the encoded pattern is always aligned with the

start of a codeword in the encoded text. We combine our KMP algorithm with

two Huﬀman decoding algorithms which handle more than a single bit per machine

operation; Skeleton trees deﬁned by Klein [1], and numerical comparisons between

special canonical values and portions of a sliding window presented in Moﬀat and

Turpin [3]. We call the combined algorithms sk-kmp and win-kmp resp e ctively.

The following table compares our algorithms with cgrep of Moura et al. [2] and

agrep which searches the uncompressed text. Columns three and four compare the

compression performance (size of the compressed text as a percentage of the uncom-

pressed text) of the Huﬀman code (huﬀ ) with cgrep. The next columns compare

the processing time of pattern matching of these algorithms. The “decompress and

search” methods, which decode using skeleton trees or Moﬀat and Turpin’s sliding

window and search in parallel using agrep, are called sk-d and win-d respectively. The

search times are average values for patterns ranging from infrequent to frequent ones.

Files Size (bytes) Compression Search Times (sec)

cgrep huﬀ cgrep sk-kmp win-kmp sk-d win-d

world192.txt 2,473,400 50.88 32.20 0.07 0.13 0.08 0.21 0.13

bible.txt 4,047,392 49.70 26.18 0.05 0.22 0.13 0.36 0.22

books.txt 12,582,090 52.10 30.30 0.21 0.69 0.39 1.21 0.74

95-03-erp.txt 23,976,547 34.49 25.14 0.18 1.10 0.65 1.80 1.11

As can be seen, the KMP variants are faster than the methods corresponding to

“decompress and search” but slower than cgrep. However, when compression perfor-

mance is important or when one does not want to re-compress Huﬀman encoded ﬁles

in order to use cgrep, the proposed algorithms are the better choice.

References

[1] Klein S.T., Skeleton Trees for eﬃcient decoding of Huﬀman encoded texts, Infor-

mation Retrieval , 3, 7-23, 2000.

[2] Moura E.S., Navarro G., Ziviani N. and Baeza-Yates R., Fast and ﬂexible

word searching on compressed Text, ACM TOIS, 18(2), 113-139, 2000.

[3] Turpin A., Moffat A., Fast ﬁle search using text compression, 20th Proc. Aus-

tralian Computer Science Conference, 1-8, 1997.

HTML Viewer

Adapting the Knuth-Morris-Pratt algorithm for pattern matching in Huffman encoded texts

Summary (1 min read)

Summary

Citations

Additional excerpts

Cites methods from "Adapting the Knuth-Morris-Pratt alg..."

Cites methods from "Adapting the Knuth-Morris-Pratt alg..."

References

Related Papers (5)