Showing papers on "Compressed pattern matching published in 2002"

PDF

Open Access

Journal Article•DOI•

On the Complexity of Pattern Matching for Highly Compressed Two-Dimensional Texts

[...]

Piotr Berman¹, Marek Karpinski², Lawrence L. Larmore³, Wojciech Plandowski, Wojciech Rytter⁴ - Show less +1 more•Institutions (4)

Pennsylvania State University¹, University of Bonn², University of Nevada, Las Vegas³, University of Liverpool⁴

01 Sep 2002-Journal of Computer and System Sciences

TL;DR: This work considers the complexity of problems related to two-dimensional texts (2D-texts) described succinctly, and gives efficient algorithms for the related problems of randomized equality testing and testing for a given occurrence of an uncompressed pattern.

...read moreread less

59 citations

Book Chapter•DOI•

Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts

[...]

Masayuki Takeda¹, Masayuki Takeda², Satoru Miyamoto¹, Takuya Kida¹, Ayumi Shinohara¹, Ayumi Shinohara², Shuichi Fukamachi³, Takeshi Shinohara³, Setsuo Arikawa¹ - Show less +5 more•Institutions (3)

Kyushu University¹, National Presto Industries², Kyushu Institute of Technology³

11 Sep 2002

TL;DR: This paper generalizes the compressed pattern matching technique so as to handle structured texts such as XML documents, and can avoid false detection of keyword even if it is a substring of a tag name or of an attribute description, without any sacrifice of searching speed.

...read moreread less

Abstract: Techniques in processing text files "as is" are presented, in which given text files are processed without modification. The compressed pattern matching problem, first defined by Amir and Benson (1992), is a good example of the "as-is" principle. Another example is string matching over multi-byte character texts, which is a significant problem common to oriental languages such as Japanese, Korean, Chinese, and Taiwanese. A text file from such languages is a mixture of single-byte characters and multi-byte characters. Naive solution would be (1) to convert a given text into a fixed length encoded one and then apply any string matching routine to it; or (2) to directly search the text file byte after byte for (the encoding of) a pattern in which an extra work is needed for synchronization to avoid false detection. Both the solutions, however, sacrifice the searching speed. Our algorithm runs on such a multi-byte character text file at the same speed as on an ordinary ASCII text file, without false detection. The technique is applicable to any prefix code such as the Huffman code and variants of Unicode. We also generalize the technique so as to handle structured texts such as XML documents. Using this technique, we can avoid false detection of keyword even if it is a substring of a tag name or of an attribute description, without any sacrifice of searching speed.

...read moreread less

17 citations

Journal Article•

Processing text files as is: Pattern matching over compressed texts, multi-byte character texts, and semi-structured texts

[...]

Masayuki Takeda¹, Masayuki Takeda², Satoru Miyamoto¹, Takuya Kida¹, Ayumi Shinohara², Ayumi Shinohara¹, Shuichi Fukamachi³, Takeshi Shinohara³, Setsuo Arikawa¹ - Show less +5 more•Institutions (3)

Kyushu University¹, National Presto Industries², Kyushu Institute of Technology³

01 Jan 2002-Lecture Notes in Computer Science

TL;DR: In this article, a technique for string matching over multi-byte character text files is presented. But the technique is applicable to any prefix code such as the Huffman code and variants of Unicode.

...read moreread less

Abstract: Techniques in processing text files as is are presented, in which given text files are processed without modification. The compressed pattern matching problem, first defined by Amir and Benson (1992), is a good example of the as-is principle. Another example is string matching over multi-byte character texts, which is a significant problem common to oriental languages such as Japanese, Korean, Chinese, and Taiwanese. A text file from such languages is a mixture of single-byte characters and multi-byte characters. Naive solution would be (1) to convert a given text into a fixed length encoded one and then apply any string matching routine to it; or (2) to directly search the text file byte after byte for (the encoding of) a pattern in which an extra work is needed for synchronization to avoid false detection. Both the solutions, however, sacrifice the searching speed. Our algorithm runs on such a multi-byte character text file at the same speed as on an ordinary ASCII text file, without false detection. The technique is applicable to any prefix code such as the Huffman code and variants of Unicode. We also generalize the technique so as to handle structured texts such as XML documents. Using this technique, we can avoid false detection of keyword even if it is a substring of a tag name or of an attribute description, without any sacrifice of searching speed.

...read moreread less

16 citations

Proceedings Article•DOI•

Pattern matching in BWT-transformed text

[...]

Donald A. Adjeroh¹, Amar Mukherjee, Tim Bell, M. Powell, Nan Zhang - Show less +1 more•Institutions (1)

West Virginia University¹

02 Apr 2002

TL;DR: The motivation for the approach is the observation that the BWT provides a lexicographic ordering of the input text as part of its inverse transformation process.

...read moreread less

Abstract: Summary form only given. The compressed pattern matching problem is to locate the occurrence(s) of a pattern P in a text string T using a compressed representation of T, with minimal (or no) decompression. The BWT performs a permutation of the characters in the text, such that characters in lexically similar contexts will be near to each other. The motivation for our approach is the observation that the BWT provides a lexicographic ordering of the input text as part of its inverse transformation process.

...read moreread less

15 citations

Proceedings Article•DOI•

A dictionary-based compressed pattern matching algorithm

[...]

Meng-Hang Ho, Hsu-Chun Yen

26 Aug 2002

TL;DR: This work designs and implements a dictionary-based compressed pattern matching algorithm that takes advantage of the dictionary structure common in the LZ78 family and is able to do 'block decompression' (a key in many existing compressedpattern matching schemes) as well as pattern matching on-the-fly, resulting in performance improvement as the experimental results indicate.

...read moreread less

Abstract: Compressed pattern matching refers to the process of, given a text in a compressed form and a pattern, finding all the occurrences of the pattern in the text without decompression. To utilize bandwidth more effectively in the Internet environment, it is highly desirable that data be kept and sent over the Internet in compressed form. In order to support information retrieval for compressed data, compressed pattern matching has been gaining increasing attention from both theoretical and practical viewpoints. We design and implement a dictionary-based compressed pattern matching algorithm. Our algorithm takes advantage of the dictionary structure common in the LZ78 family. With the help of a slightly modified dictionary structure, we are able to do 'block decompression' (a key in many existing compressed pattern matching schemes) as well as pattern matching on-the-fly, resulting in performance improvement as our experimental results indicate.

...read moreread less

11 citations

Journal Article•

Time/space efficient compressed pattern matching

[...]

Leszek Gasieniec¹, Igor Potapov¹•Institutions (1)

University of Liverpool¹

01 Oct 2002-Fundamenta Informaticae

TL;DR: In this article, the problem of exact pattern matching is solved in an optimal linear time, with a help of a constant extra space, where the problem is to find all occurrences of a pattern p in a text t in a straight-line program.

...read moreread less

Abstract: An exact pattern matching problem is to find all occurrences of a pattern p in a text t. We say that the pattern matching algorithm is optimal if its running time is linear in the sizes of t and p, i.e., O(t + p). Perhaps one of the most interesting settings of the pattern matching problem is when one has to design an efficient algorithm with a help of a small extra space. In this paper we explore this setting to the extreme. We work under an assumption that the text t is available only in a compressed form, represented by a straight-line program. The compression methods based on efficient construction of straight-line programs are as competitive as the compression standards, including the Lempel-Ziv compression scheme and recently intensively studied text compression via block sorting, due to Burrows and Wheeler. Our main result is an algorithm that solves the compressed string matching problem in an optimal linear time, with a help of a constant extra space. We also discuss an efficient implementation of a version our algorithm showing that the new concept may have also some interesting real applications. Our result is in contrast with many other compressed pattern matching algorithms where the goal is to find all pattern occurrences in time related to the size of the compressed text. However one must remember that all previous algorithms used at least a linear (in a compressed text, a dictionary, or a pattern) extra memory while our algorithm can be implemented in a constant size extra space. Also from the practical point of view, when the compression ratio is constant (very rarely smaller than 25%), there is no dramatic difference between the running time based on the size of the compressed text and the size of the original text, while an extra space resources might be strictly limited.

...read moreread less

8 citations