Open Access Journal Article (DOI)

Adapting the Knuth-Morris-Pratt algorithm for pattern matching in Huffman encoded texts

Dana Shapira, Ajay Daptardar
- 01 Mar 2006 - 
- Vol. 42, Iss: 2, pp 429-439
TLDR
A bitwise KMP algorithm is proposed that, since the alphabet is binary, can move one extra bit in the case of a mismatch; it is combined with two practical Huffman decoding schemes that handle more than a single bit per machine operation.
Abstract
In the present work we perform compressed pattern matching in binary Huffman encoded texts [Huffman, D. (1952). A method for the construction of minimum redundancy codes, Proc. of the IRE, 40, 1098-1101]. A modified Knuth-Morris-Pratt algorithm is used in order to overcome the problem of false matches, i.e., an occurrence of the encoded pattern in the encoded text that does not correspond to an occurrence of the pattern itself in the original text. We propose a bitwise KMP algorithm that, since the alphabet is binary, can move one extra bit in the case of a mismatch. To avoid processing any bit of the encoded text more than once, a preprocessed table is used to determine how far to back up when a mismatch is detected; it is defined so that we are always able to align the start of the encoded pattern with the start of a codeword in the encoded text. We combine our KMP algorithm with two practical Huffman decoding schemes which handle more than a single bit per machine operation: skeleton trees defined by Klein [Klein, S. T. (2000). Skeleton trees for efficient decoding of Huffman encoded texts. Information Retrieval, 3, 7-23], and numerical comparisons between special canonical values and portions of a sliding window presented in Moffat and Turpin [Moffat, A., & Turpin, A. (1997). On the implementation of minimum redundancy prefix codes. IEEE Transactions on Communications, 45, 1200-1207]. Experiments show that our algorithms search faster than the "decompress then search" method; files can therefore be kept in their compressed form, saving space. When compression gain is important, these algorithms are better than cgrep [Ferragina, P., Tommasi, A., & Manzini, G. (2004). C Library to search over compressed texts, http://roquefort.di.unipi.it/~ferrax/CompressedSearch], which is only slightly faster than ours.



Adapting the Knuth-Morris-Pratt Algorithm for Pattern
Matching in Huffman Encoded Texts
Ajay Daptardar and Dana Shapira
{amax/shapird}@cs.brandeis.edu
Computer Science Department, Brandeis University, Waltham, MA
We perform compressed pattern matching in Huffman encoded texts. A modified
Knuth-Morris-Pratt (KMP) algorithm is used in order to overcome the problem of
false matches, i.e., an occurrence of the encoded pattern in the encoded text that does
not correspond to an occurrence of the pattern itself in the original text. We propose
a bitwise KMP algorithm that can move one extra bit in the case of a mismatch,
since the alphabet is binary. To avoid processing any encoded text bit more than
once, a preprocessed table is used to determine how far to back up when a mismatch
is detected, and is defined so that the encoded pattern is always aligned with the
start of a codeword in the encoded text. We combine our KMP algorithm with
two Huffman decoding algorithms which handle more than a single bit per machine
operation: skeleton trees defined by Klein [1], and numerical comparisons between
special canonical values and portions of a sliding window presented in Moffat and
Turpin [3]. We call the combined algorithms sk-kmp and win-kmp respectively.
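As a rough illustration of the alignment idea (a minimal sketch, not the paper's exact backup-table construction), the code below runs bitwise KMP over a toy Huffman bit stream and keeps only occurrences that start on a codeword boundary, which filters out the false matches described above. The four-symbol code is an assumption chosen for the demo.

```python
# Toy prefix code for a 4-symbol alphabet (an assumption for this demo).
CODE = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

def encode(s):
    """Return the encoded bit string and the set of codeword-start offsets."""
    bits, starts, pos = [], set(), 0
    for ch in s:
        starts.add(pos)
        cw = CODE[ch]
        bits.append(cw)
        pos += len(cw)
    return ''.join(bits), starts

def kmp_failure(p):
    """Classic KMP failure function over the bit pattern."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail

def search(text, pattern):
    """Bitwise KMP over the encoded text; report only codeword-aligned hits."""
    ebits, starts = encode(text)
    pbits, _ = encode(pattern)
    fail = kmp_failure(pbits)
    hits, j = [], 0
    for i, bit in enumerate(ebits):
        while j and bit != pbits[j]:
            j = fail[j - 1]
        if bit == pbits[j]:
            j += 1
        if j == len(pbits):
            start = i - len(pbits) + 1
            if start in starts:  # aligned with a codeword: a real occurrence
                hits.append(start)
            j = fail[j - 1]
    return hits
```

For example, searching for "b" finds its bit pattern 10 at offsets 1 and 4 of the encoded text 010110111, but only offset 1 is codeword-aligned, so only that occurrence is reported.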
The following table compares our algorithms with cgrep of Moura et al. [2] and
with agrep, which searches the uncompressed text. Columns three and four compare
the compression performance (size of the compressed text as a percentage of the
uncompressed text) of the Huffman code (huff) with that of cgrep. The remaining
columns compare the pattern matching times of these algorithms. The “decompress
and search” methods, which decode using skeleton trees or Moffat and Turpin’s
sliding window and search in parallel using agrep, are called sk-d and win-d
respectively. The search times are averages over patterns ranging from infrequent
to frequent ones.
Files          Size (bytes)   Compression (%)   Search times (sec)
                              huff    cgrep     cgrep   sk-kmp  win-kmp  sk-d   win-d
world192.txt    2,473,400     50.88   32.20     0.07    0.13    0.08     0.21   0.13
bible.txt       4,047,392     49.70   26.18     0.05    0.22    0.13     0.36   0.22
books.txt      12,582,090     52.10   30.30     0.21    0.69    0.39     1.21   0.74
95-03-erp.txt  23,976,547     34.49   25.14     0.18    1.10    0.65     1.80   1.11
As can be seen, the KMP variants are faster than the corresponding "decompress
and search" methods but slower than cgrep. However, when compression performance
is important, or when one does not want to re-compress Huffman encoded files in
order to use cgrep, the proposed algorithms are the better choice.
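For completeness, the numerical-comparison style of decoding behind win-kmp can be sketched as follows. This is a generic canonical-Huffman decoder in the spirit of Moffat and Turpin's method (per-length first-code values compared against an accumulating bit value); the toy alphabet and table layout are assumptions for illustration, not their exact implementation.

```python
# Code lengths for a toy canonical code (an assumption for this demo):
# the resulting codewords are a=0, b=10, c=110, d=111.
LENGTHS = {'a': 1, 'b': 2, 'c': 3, 'd': 3}

def canonical_tables(lengths):
    """Build, per code length, the symbol list and the first (smallest) code value."""
    maxlen = max(lengths.values())
    by_len = {l: sorted(s for s, L in lengths.items() if L == l)
              for l in range(1, maxlen + 1)}
    first, code = {}, 0
    for l in range(1, maxlen + 1):
        first[l] = code
        code = (code + len(by_len[l])) << 1
    return by_len, first

def decode(bits, lengths):
    """Decode by comparing an accumulating integer against per-length code ranges."""
    by_len, first = canonical_tables(lengths)
    out, i = [], 0
    while i < len(bits):
        l, v = 1, int(bits[i])
        i += 1
        # Extend the value until it falls inside length-l's canonical range.
        while v - first[l] >= len(by_len[l]):
            v = (v << 1) | int(bits[i])
            i += 1
            l += 1
        out.append(by_len[l][v - first[l]])
    return ''.join(out)
```

Decoding advances by integer comparisons on small values rather than per-bit tree traversal; a production version would read several bits at a time into the comparison value, while this sketch consumes one bit per step to stay short.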
References
[1] Klein, S.T., Skeleton trees for efficient decoding of Huffman encoded texts, Information Retrieval, 3, 7-23, 2000.
[2] Moura, E.S., Navarro, G., Ziviani, N. and Baeza-Yates, R., Fast and flexible word searching on compressed text, ACM TOIS, 18(2), 113-139, 2000.
[3] Turpin, A. and Moffat, A., Fast file search using text compression, Proc. 20th Australasian Computer Science Conference, 1-8, 1997.
Citations
More filters
Proceedings ArticleDOI

A Space Efficient Direct Access Data Structure

TL;DR: The pruning procedure is improved, and empirical evidence is given that when memory storage is the main concern, the suggested data structure outperforms other direct access techniques such as those due to Kulekci, DACs, and sampling, with a slowdown compared to DACs and fixed-length encoding.
Journal ArticleDOI

Optimal skeleton and reduced Huffman trees

TL;DR: It is shown that the straightforward ways of basing the constructions of a skeleton tree as well as that of a reduced skeleton tree on a canonical Huffman tree does not necessarily yield the least number of nodes.
Book ChapterDOI

Optimal Skeleton Huffman Trees Revisited

TL;DR: In this paper, an optimal skeleton Huffman tree with the least number of nodes among all optimal prefix-free code trees (not necessarily Huffman's) with shrunk perfect subtrees is presented.
Dissertation

Content-aware compression for big textual data analysis

Dapeng Dong
Journal ArticleDOI

Building a information system for looking up contextual technical dictionary

TL;DR: This paper proposes a model for searching technical terms and the contexts of terms, based on analyzing, evaluating, and choosing an optimal algorithm in the pattern matching technique; the model was integrated into a dictionary system.
References
More filters
Journal ArticleDOI

A Method for the Construction of Minimum-Redundancy Codes

TL;DR: A minimum-redundancy code is one constructed in such a way that the average number of coding digits per message is minimized.
Journal ArticleDOI

Fast Pattern Matching in Strings

TL;DR: An algorithm is presented which finds all occurrences of one given string within another, in running time proportional to the sum of the lengths of the strings, showing that the set of concatenations of even palindromes, i.e., the language $\{\alpha \alpha ^R\}^*$, can be recognized in linear time.
Journal ArticleDOI

Fast text searching: allowing errors

TL;DR: The string-matching problem is a very common problem; there are many extensions to this problem; for example, one may be looking for a set of patterns, a pattern with "wild cards," or a regular expression.
Journal ArticleDOI

Fast and flexible word searching on compressed text

TL;DR: A fast compression technique for natural language texts that allows a large number of variations over the basic word and phrase search capability, such as sets of characters, arbitrary regular expressions, and approximate matching.
Journal ArticleDOI

Let Sleeping Files Lie

TL;DR: In this article, the authors consider pattern matching without decompression in the UNIX Z-compression scheme and show how to modify their algorithms to achieve a trade-off between the amount of extra space used and the algorithm's time complexity.