Adapting the Knuth-Morris-Pratt algorithm for pattern matching in Huffman encoded texts

doi:10.1016/J.IPM.2005.02.003

Adapting the Knuth-Morris-Pratt Algorithm for Pattern

Matching in Huﬀman Encoded Texts

Ajay Daptardar and Dana Shapira

{amax/shapird}@cs.brandeis.edu

Computer Science Department, Brandeis University, Waltham, MA

We perform compressed pattern matching in Huﬀman encoded texts. A modiﬁed

Knuth-Morris-Pratt (KMP) algorithm is used in order to overcome the problem of

false matches, i.e., an occurrence of the encoded pattern in the encoded text that does

not correspond to an occurrence of the pattern itself in the original text. We propose

a bitwise KMP algorithm that can move one extra bit in the case of a mismatch,

since the alphabet is binary. To avoid processing any encoded text bit more than

once, a preprocessed table is used to determine how far to back up when a mismatch

is detected, and is deﬁned so that the encoded pattern is always aligned with the

start of a codeword in the encoded text. We combine our KMP algorithm with

two Huﬀman decoding algorithms which handle more than a single bit per machine

operation; Skeleton trees deﬁned by Klein [1], and numerical comparisons between

special canonical values and portions of a sliding window presented in Moﬀat and

Turpin [3]. We call the combined algorithms sk-kmp and win-kmp resp e ctively.

The following table compares our algorithms with cgrep of Moura et al. [2] and

agrep which searches the uncompressed text. Columns three and four compare the

compression performance (size of the compressed text as a percentage of the uncom-

pressed text) of the Huﬀman code (huﬀ ) with cgrep. The next columns compare

the processing time of pattern matching of these algorithms. The “decompress and

search” methods, which decode using skeleton trees or Moﬀat and Turpin’s sliding

window and search in parallel using agrep, are called sk-d and win-d respectively. The

search times are average values for patterns ranging from infrequent to frequent ones.

Files Size (bytes) Compression Search Times (sec)

cgrep huﬀ cgrep sk-kmp win-kmp sk-d win-d

world192.txt 2,473,400 50.88 32.20 0.07 0.13 0.08 0.21 0.13

bible.txt 4,047,392 49.70 26.18 0.05 0.22 0.13 0.36 0.22

books.txt 12,582,090 52.10 30.30 0.21 0.69 0.39 1.21 0.74

95-03-erp.txt 23,976,547 34.49 25.14 0.18 1.10 0.65 1.80 1.11

As can be seen, the KMP variants are faster than the methods corresponding to

“decompress and search” but slower than cgrep. However, when compression perfor-

mance is important or when one does not want to re-compress Huﬀman encoded ﬁles

in order to use cgrep, the proposed algorithms are the better choice.

References

[1] Klein S.T., Skeleton Trees for eﬃcient decoding of Huﬀman encoded texts, Infor-

mation Retrieval , 3, 7-23, 2000.

[2] Moura E.S., Navarro G., Ziviani N. and Baeza-Yates R., Fast and ﬂexible

word searching on compressed Text, ACM TOIS, 18(2), 113-139, 2000.

[3] Turpin A., Moffat A., Fast ﬁle search using text compression, 20th Proc. Aus-

tralian Computer Science Conference, 1-8, 1997.

Adapting the Knuth-Morris-Pratt algorithm for pattern matching in Huffman encoded texts

Citations

A Space Efficient Direct Access Data Structure

Optimal skeleton and reduced Huffman trees

Optimal Skeleton Huffman Trees Revisited

Content-aware compression for big textual data analysis

Building a information system for looking up contextual technical dictionary

References

A Method for the Construction of Minimum-Redundancy Codes

Fast Pattern Matching in Strings

Fast text searching: allowing errors

Fast and flexible word searching on compressed text

Let Sleeping Files Lie

Related Papers (5)

Skeleton Trees for the Efficient Decoding of Huffman Encoded Texts

A Method for the Construction of Minimum-Redundancy Codes

High-order entropy-compressed text indexes

Space-efficient static trees and graphs

Fast, small, simple rank/select on bitmaps