# Takuya Kida

Other affiliations: Kyushu University, Hokkai Gakuen University

Bio: Takuya Kida is an academic researcher from Hokkaido University. The author has contributed to research on compressed pattern matching and pattern matching. The author has an h-index of 16 and has co-authored 55 publications that have received 793 citations. Previous affiliations of Takuya Kida include Kyushu University and Hokkai Gakuen University.

##### Papers

TL;DR: This work introduces a general framework that captures the essence of compressed pattern matching for various dictionary-based compression methods, including the Lempel-Ziv family, RE-PAIR, SEQUITUR, and the static dictionary-based method.

109 citations

30 Mar 1998

TL;DR: This work addresses the problem of searching in LZW compressed text directly, and presents a new algorithm for finding multiple patterns by simulating the moves of the Aho-Corasick (1975) pattern matching machine.

Abstract: We address the problem of searching in LZW compressed text directly, and present a new algorithm for finding multiple patterns by simulating the moves of the Aho-Corasick (1975) pattern matching machine. The new algorithm finds all occurrences of multiple patterns, whereas the algorithm proposed by Amir, Benson, and Farach (see Journal of Computer and System Sciences, vol. 52, pp. 299-307, 1996) finds only the first occurrence of a single pattern. The new algorithm runs in O(n + m² + r) time using O(n + m²) space, where n is the length of the compressed text, m is the total length of the patterns, and r is the number of occurrences of the patterns. We implemented a simple version of the algorithm, and showed that it is approximately twice as fast as decompression followed by a search using the Aho-Corasick machine.

79 citations

01 Mar 2000

TL;DR: In this paper, the authors show that BPE compression is suitable from a practical viewpoint for compressed pattern matching, where the goal is to find a pattern directly in compressed text without decompressing it explicitly.

Abstract: Byte pair encoding (BPE) is a simple universal text compression scheme. Decompression is very fast and requires small work space. Moreover, it is easy to decompress an arbitrary part of the original text. However, it has not been so popular since the compression is rather slow and the compression ratio is not as good as other methods such as Lempel-Ziv type compression.
In this paper, we bring out a potential advantage of BPE compression. We show that it is very suitable from a practical viewpoint for compressed pattern matching, where the goal is to find a pattern directly in compressed text without decompressing it explicitly. We compare running times to find a pattern in (1) BPE compressed files, (2) Lempel-Ziv-Welch compressed files, and (3) original text files, in various situations. Experimental results show that pattern matching in BPE compressed text is even faster than matching in the original text. Thus BPE compression reduces not only the disk space but also the searching time.
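To make the scheme described in the abstract concrete, here is a minimal sketch of byte pair encoding in Python. It is not the authors' implementation: the function names, the `max_rules` limit, and the choice to number new symbols from 256 upward (rather than reusing unused byte values) are assumptions made for illustration.

```python
from collections import Counter


def bpe_compress(data: bytes, max_rules: int = 64):
    """Repeatedly replace the most frequent adjacent pair with a new symbol.

    Returns (symbols, rules), where rules maps each new symbol to the pair
    it stands for. New symbols start at 256 (an assumption of this sketch).
    """
    seq = list(data)
    rules = {}
    next_sym = 256
    for _ in range(max_rules):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # no pair repeats, so nothing more to gain
            break
        rules[next_sym] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_sym += 1
    return seq, rules


def bpe_expand(sym: int, rules: dict) -> bytes:
    """Expand one symbol back to the bytes it represents."""
    out = bytearray()
    stack = [sym]
    while stack:
        s = stack.pop()
        if s < 256:
            out.append(s)
        else:
            left, right = rules[s]
            stack.append(right)  # push right first so left is expanded first
            stack.append(left)
    return bytes(out)


def bpe_decompress(seq, rules) -> bytes:
    """Expand every symbol independently and concatenate the results."""
    return b"".join(bpe_expand(s, rules) for s in seq)
```

Because each symbol expands independently of its neighbours, any slice of the compressed sequence can be decoded on its own, which matches the abstract's remark that an arbitrary part of the text is easy to decompress.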

62 citations

22 Jul 1999

TL;DR: This paper considers the Shift-And approach to the problem of pattern matching in LZW compressed text, gives a new algorithm that solves it, and shows that the algorithm is indeed fast when the pattern length is at most 32, i.e., the word length.

Abstract: This paper considers the Shift-And approach to the problem of pattern matching in LZW compressed text, and gives a new algorithm that solves it. The algorithm is indeed fast when the pattern length is at most 32, i.e., the word length. After O(m + |Σ|) time and O(|Σ|) space preprocessing of a pattern, it scans an LZW compressed text in O(n + r) time and reports all occurrences of the pattern, where n is the compressed text length, m is the pattern length, and r is the number of pattern occurrences. Experimental results show that it runs approximately 1.5 times faster than decompression followed by a simple search using the Shift-And algorithm. Moreover, like the Shift-And algorithm, the algorithm can be extended to generalized pattern matching, to pattern matching with k mismatches, and to multiple pattern matching.
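For readers unfamiliar with the building block, the following is a minimal sketch of the plain Shift-And algorithm on uncompressed text. It does not reproduce the paper's LZW-specific algorithm; the function name `shift_and_search` is hypothetical. Python integers are arbitrary precision, so the 32-bit word limit mentioned in the abstract does not apply to this sketch, although the bit-parallel speed advantage relies on the state fitting in one machine word.

```python
def shift_and_search(pattern: str, text: str) -> list:
    """Return the start positions of all occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    # Preprocessing: masks[c] has bit i set iff pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)
    state = 0  # bit i set iff pattern[:i + 1] matches a suffix of the text read so far
    hits = []
    for j, ch in enumerate(text):
        # Extend every partial match by one character and start a new one.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits
```

For example, `shift_and_search("aba", "ababa")` reports the overlapping occurrences at positions 0 and 2.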

58 citations

27 Mar 2001

TL;DR: This work presents a different approach to the approximate string matching problem, which reduces the problem to multipattern searching of pattern pieces plus local decompression and direct verification of candidate text areas, thus becoming the first practical solution to the problem.

Abstract: Approximate string matching on compressed text was an open problem for almost a decade. The two existing solutions are very new. Although they represent important complexity breakthroughs, in most practical cases they are not useful, in the sense that they are slower than uncompressing the text and then searching the uncompressed text. We present a different approach, which reduces the problem to multipattern searching of pattern pieces plus local decompression and direct verification of candidate text areas. We show experimentally that this solution is 10-30 times faster than previous work and up to three times faster than the trivial approach of uncompressing and searching, thus becoming the first practical solution to the problem.
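The reduction to "pattern pieces plus verification" can be illustrated on plain text. The sketch below assumes the standard partitioning argument: if a pattern is cut into k + 1 pieces, any occurrence with at most k edit errors must contain at least one piece unchanged. It works on uncompressed text, uses `str.find` as a stand-in for a real multipattern search, and skips the local decompression step, so it is not the paper's algorithm. The names `sellers_ends` and `approx_find` are hypothetical.

```python
def sellers_ends(pattern: str, text: str, k: int, offset: int = 0) -> set:
    """End positions (exclusive, shifted by offset) where some substring of
    text ending there is within edit distance k of pattern."""
    m = len(pattern)
    col = list(range(m + 1))  # col[i] = best distance for pattern[:i]
    ends = set()
    for j, c in enumerate(text):
        prev_diag = col[0]
        col[0] = 0  # a match may start at any text position
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(prev_diag + (pattern[i - 1] != c),  # match or substitute
                         old + 1,                            # skip a text character
                         col[i - 1] + 1)                     # skip a pattern character
            prev_diag = old
        if col[m] <= k:
            ends.add(offset + j + 1)
    return ends


def approx_find(pattern: str, text: str, k: int) -> list:
    """Filter with exact piece search, then verify only the candidate areas."""
    m = len(pattern)
    if not 0 <= k < m:
        raise ValueError("need 0 <= k < len(pattern) so every piece is non-empty")
    bounds = [i * m // (k + 1) for i in range(k + 2)]
    ends = set()
    for i in range(k + 1):
        start_in_pattern = bounds[i]
        piece = pattern[start_in_pattern:bounds[i + 1]]
        t = text.find(piece)
        while t != -1:
            # An occurrence containing this piece lies inside this window.
            lo = max(0, t - start_in_pattern - k)
            hi = min(len(text), t + (m - start_in_pattern) + k)
            ends |= sellers_ends(pattern, text[lo:hi], k, offset=lo)
            t = text.find(piece, t + 1)
    return sorted(ends)
```

The filter pays off when pieces are rare in the text, because the dynamic-programming verification then runs only on a few short windows instead of the whole text.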

58 citations

##### Cited by

27 May 2002

TL;DR: This book presents a practical approach to string matching problems, focusing on the algorithms and implementations that perform best in practice, and includes all of the most significant new developments in complex pattern searching.

Abstract: This book presents a practical approach to string matching problems, focusing on the algorithms and implementations that perform best in practice. It covers searching for simple, multiple, and extended strings, as well as regular expressions, exactly and approximately. It includes all of the most significant new developments in complex pattern searching. The clear explanations, step-by-step examples, algorithms pseudo-code, and implementation efficiency maps will enable researchers, professionals, and students in bioinformatics, computer science, and software engineering to choose the most appropriate algorithms for their applications.

463 citations

TL;DR: This paper shows that every efficient algorithm for the smallest grammar problem has approximation ratio at least 8569/8568 unless P=NP, and bounds approximation ratios for several of the best known grammar-based compression algorithms, including LZ78, BISECTION, SEQUENTIAL, LONGEST MATCH, GREEDY, and RE-PAIR.

Abstract: This paper addresses the smallest grammar problem: What is the smallest context-free grammar that generates exactly one given string σ? This is a natural question about a fundamental object connected to many fields such as data compression, Kolmogorov complexity, pattern identification, and addition chains. Due to the problem's inherent complexity, our objective is to find an approximation algorithm which finds a small grammar for the input string. We focus attention on the approximation ratio of the algorithm (and implicitly, the worst case behavior) to establish provable performance guarantees and to address shortcomings in the classical measure of redundancy in the literature. Our first results concern the hardness of approximating the smallest grammar problem. Most notably, we show that every efficient algorithm for the smallest grammar problem has approximation ratio at least 8569/8568 unless P=NP. We then bound approximation ratios for several of the best known grammar-based compression algorithms, including LZ78, BISECTION, SEQUENTIAL, LONGEST MATCH, GREEDY, and RE-PAIR. Among these, the best upper bound we show is O(n^(1/2)). We finish by presenting two novel algorithms with exponentially better ratios of O(log³ n) and O(log(n/m*)), where m* is the size of the smallest grammar for that input. The latter algorithm highlights a connection between grammar-based compression and LZ77.

457 citations

TL;DR: A fast compression technique for natural language texts that allows a large number of variations over the basic word and phrase search capability, such as sets of characters, arbitrary regular expressions, and approximate matching.

Abstract: We present a fast compression technique for natural language texts. The novelties are that (1) decompression of arbitrary portions of the text can be done very efficiently, (2) exact search for words and phrases can be done on the compressed text directly, using any known sequential pattern-matching algorithm, and (3) word-based approximate and extended search can also be done efficiently without any decoding. The compression scheme uses a semistatic word-based model and a Huffman code where the coding alphabet is byte-oriented rather than bit-oriented. We compress typical English texts to about 30% of their original size, against 40% and 35% for Compress and Gzip, respectively. Compression time is close to that of Compress and approximately half of the time of Gzip, and decompression time is lower than that of Gzip and one third of that of Compress. We present three algorithms to search the compressed text. They allow a large number of variations over the basic word and phrase search capability, such as sets of characters, arbitrary regular expressions, and approximate matching. Separators and stopwords can be discarded at search time without significantly increasing the cost. When searching for simple words, the experiments show that running our algorithms on a compressed text is twice as fast as running the best existing software on the uncompressed version of the same text. When searching complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique in inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, decompressing only for displaying purposes.

276 citations

TL;DR: In this paper, the data structure of the compressed suffix array is modified so that pattern matching can be done without any access to the text, and new operations search, decompress and inverse are added.

269 citations