scispace - formally typeset
Journal ArticleDOI

Faster Approximate String Matching

Ricardo Baeza-Yates, +1 more
- 01 Feb 1999 - 
- Vol. 23, Iss: 2, pp 127-158
Reads0
Chats0
TLDR
The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input and it is shown that the algorithms are among the fastest for typical text searching, being the fastest in some cases.
Abstract
We present a new algorithm for on-line approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = Ω (log n) bits, where n is the text size. This is essentially similar to the model used in Wu and Manber's work, although we improve the search time by packing the automaton states differently. The running time achieved is O(n) for small patterns (i.e., whenever mk = O(log n)) , where m is the pattern length and k<m is the number of allowed errors. This is in contrast with the result of Wu and Manber, which is O(kn) for m=O(log n) . Longer patterns can be processed by partitioning the automaton into many machine words, at O(mk/w n) search cost. We allow generalizations in the pattern, such as classes of characters, gaps, and others, at essentially the same search cost. We then explore other novel techniques to cope with longer patterns. We show how to partition the pattern into short subpatterns which can be searched with less errors using the simple automaton, to obtain an average cost close to $ O(\sqrt{mk/w} n) $ . Moreover, we allow the superimposition of many subpatterns in a single automaton, obtaining near $ O(\sqrt{mk/(\sigma w)} n) $ average complexity (σ is the alphabet size). We perform a complete analysis of all the techniques and show how to combine them in an optimal form, also obtaining new tighter bounds for the probability of an approximate occurrence in random text. Finally, we show experimental results comparing our algorithms against previous work. These experiments show that our algorithms are among the fastest for typical text searching, being the fastest in some cases. Although we aim mainly at text searching, we believe that our ideas can be successfully applied to other areas such as computational biology.

read more

Citations
More filters
Journal ArticleDOI

A guided tour to approximate string matching

TL;DR: This work surveys the current techniques to cope with the problem of string matching that allows errors, and focuses on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms.
Book

Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences

TL;DR: This book presents a practical approach to string matching problems, focusing on the algorithms and implementations that perform best in practice, and includes all of the most significant new developments in complex pattern searching.
Journal ArticleDOI

Fast and flexible word searching on compressed text

TL;DR: A fast compression technique for natural language texts that allows a large number of variations over the basic word and phrase search capability, such as sets of characters, arbitrary regular expressions, and approximate matching.
Journal ArticleDOI

An Efficient and Scalable Semiconductor Architecture for Parallel Automata Processing

TL;DR: The design and development of the automata processor is presented, a massively parallel non-von Neumann semiconductor architecture that is purpose-built for automata processing that exceeds the capabilities of high-performance FPGA-based implementations of regular expression processors.
Journal ArticleDOI

Fast and flexible string matching by combining bit-parallelism and suffix automata

TL;DR: A new automaton to recognize suffixes of patterns with classes of characters is introduced, which seems very adequate for computational biology applications, since it is the fastest algorithm to search on DNA sequences and flexible searching is an important problem in that area.
References
More filters
Journal ArticleDOI

A general method applicable to the search for similarities in the amino acid sequence of two proteins

TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed and it is possible to determine whether significant homology exists between the proteins to trace their possible evolutionary development.
Journal ArticleDOI

Efficient string matching: an aid to bibliographic search

TL;DR: A simple, efficient algorithm to locate all occurrences of any of a finite number of keywords in a string of text that has been used to improve the speed of a library bibliographic search program by a factor of 5 to 10.
Journal ArticleDOI

Fast text searching: allowing errors

TL;DR: T h e string-matching problem is a very c o m m o n problem; there are many extensions to t h i s problem; for example, it may be looking for a set of patterns, a pattern w i t h "wi ld cards," or a regular expression.
Journal ArticleDOI

Algorithms for approximate string matching

TL;DR: An improved algorithm that works in time and in space O and algorithms that can be used in conjunction with extended edit operation sets, including, for example, transposition of adjacent characters.
Journal ArticleDOI

Approximate string-matching with q -grams and maximal matches

TL;DR: Two string distance functions that are computable in linear time give a lower bound for the edit distance (in the unit cost model), which leads to fast hybrid algorithms for the edited distance based string matching.