Showing papers on "Approximate string matching" published in 1998


Journal ArticleDOI
TL;DR: The stochastic model allows us to learn a string-edit distance function from a corpus of examples and is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.
Abstract: In many applications, it is necessary to determine the similarity of two strings. A widely-used notion of string similarity is the edit distance: the minimum number of insertions, deletions, and substitutions required to transform one string into the other. In this report, we provide a stochastic model for string-edit distance. Our stochastic model allows us to learn a string-edit distance function from a corpus of examples. We illustrate the utility of our approach by applying it to the difficult problem of learning the pronunciation of words in conversational speech. In this application, we learn a string-edit distance with nearly one-fifth the error rate of the untrained Levenshtein distance. Our approach is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.
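For reference, the unweighted Levenshtein distance referred to above can be computed with the classic dynamic program sketched below (an illustrative Python sketch of the standard recurrence, not the paper's method; the paper's contribution is learning a weighted, stochastic version of this distance).

    def edit_distance(a, b):
        # Classic dynamic-programming (Levenshtein) edit distance: the minimum
        # number of insertions, deletions, and substitutions turning a into b.
        # Works on any two sequences (strings, lists of words, ...).
        n, m = len(a), len(b)
        prev = list(range(m + 1))             # row for the empty prefix of a
        for i in range(1, n + 1):
            cur = [i] + [0] * m
            for j in range(1, m + 1):
                cur[j] = min(prev[j] + 1,                           # delete a[i-1]
                             cur[j - 1] + 1,                        # insert b[j-1]
                             prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitute / match
            prev = cur
        return prev[m]

    # edit_distance("kitten", "sitting") -> 3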

897 citations


Journal ArticleDOI
TL;DR: This paper considers the following incremental version of comparing two sequences A and B to determine their longest common subsequence (LCS) or the edit distance between them, and obtains O(nk) algorithms for the longest prefix approximate match problem, the approximate overlap problem, and cyclic string comparison.
Abstract: The problem of comparing two sequences A and B to determine their longest common subsequence (LCS) or the edit distance between them has been much studied. In this paper we consider the following incremental version of these problems: given an appropriate encoding of a comparison between A and B, can one incrementally compute the answer for A and bB, and the answer for A and Bb, with equal efficiency, where b is an additional symbol? Our main result is a theorem exposing a surprising relationship between the dynamic programming solutions for two such "adjacent" problems. Given a threshold k on the number of differences to be permitted in an alignment, the theorem leads directly to an O(k) algorithm for incrementally computing a new solution from an old one, in contrast to the O(k²) time required to compute a solution from scratch. We further show, with a series of applications, that this algorithm is indeed more powerful than its nonincremental counterpart, by solving the applications with greater asymptotic efficiency than heretofore possible. For example, we obtain O(nk) algorithms for the longest prefix approximate match problem, the approximate overlap problem, and cyclic string comparison.
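As a point of comparison for the thresholded setting, the sketch below is the standard from-scratch way to decide whether two strings are within k differences, filling only a diagonal band of the dynamic-programming matrix; it is not the paper's incremental algorithm, and the function name is ours.

    def within_k_edits(a, b, k):
        # Decide whether the edit distance of a and b is at most k by filling
        # only the diagonal band |i - j| <= k of the DP matrix (O(k * len(a)) work).
        n, m = len(a), len(b)
        if abs(n - m) > k:
            return False
        big = k + 1                           # stands in for "certainly more than k"
        prev = [j if j <= k else big for j in range(m + 1)]
        for i in range(1, n + 1):
            cur = [big] * (m + 1)
            if i <= k:
                cur[0] = i
            for j in range(max(1, i - k), min(m, i + k) + 1):
                cur[j] = min(prev[j] + 1,                           # delete a[i-1]
                             cur[j - 1] + 1,                        # insert b[j-1]
                             prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitute / match
            prev = cur
        return prev[m] <= k

    # within_k_edits("survey", "surgery", 2) -> True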

216 citations


Journal ArticleDOI
TL;DR: This paper gives the first nontrivial compressed matching algorithm for the classic adaptive compression scheme, the LZ77 algorithm, which is known to compress more than other dictionary compression schemes, such as LZ78 and LZW, though for strings with constant per bit entropy, all these schemes compress optimally in the limit.
Abstract: String matching and compression are two widely studied areas of computer science. The theory of string matching has a long association with compression algorithms: data structures from string matching can be used to derive fast implementations of many important compression schemes, most notably the Lempel-Ziv (LZ77) algorithm. Intuitively, once a string has been compressed, and its repetitive nature thereby elucidated, one might be tempted to exploit this knowledge to speed up string matching. The Compressed Matching Problem is that of performing string matching in a compressed text without uncompressing it. More formally, let T be a text, let Z be the compressed string representing T, and let P be a pattern. The Compressed Matching Problem is that of deciding whether P occurs in T, given only P and Z. Compressed matching algorithms have been given for several compression schemes such as LZW. In this paper we give the first nontrivial compressed matching algorithm for the classic adaptive compression scheme, the LZ77 algorithm. In practice, the LZ77 algorithm is known to compress more than other dictionary compression schemes, such as LZ78 and LZW, though for strings with constant per-bit entropy all these schemes compress optimally in the limit. However, for strings with o(1) per-bit entropy, while it was recently shown that LZ77 gives compression to within a constant factor of optimal, schemes such as LZ78 and LZW may deviate from optimality by an exponential factor. Asymptotically, compressed matching is only relevant if |Z| = o(|T|), i.e., if the compression ratio |T|/|Z| is more than a constant. These results show that LZ77 is the appropriate compression method in such settings. We present an LZ77 compressed matching algorithm which runs in time O(n log²(u/n) + p), where n = |Z|, u = |T|, and p = |P|. Compare this with the naive decompression algorithm, which takes Θ(u + p) time to decide whether P occurs in T. Writing u + p as (nu)/n + p, we see that we have improved the complexity, replacing the compression factor u/n by the factor log²(u/n). Our algorithm is competitive in the sense that O(n log²(u/n) + p) = O(u + p), and opportunistic in the sense that O(n log²(u/n) + p) = o(u + p) if n = o(u) and p = o(u).
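To fix the notation, the toy factorization below shows the relationship between a text T of length u and an LZ77-style parse Z of length n. It is a deliberately naive sketch (unbounded window, quadratic match search, no output encoding) meant only to illustrate the objects involved, not the paper's compressed matching algorithm.

    def lz77_factorize(text):
        # Greedy LZ77-style factorization (toy sketch).  Each factor is either a
        # literal character or an (offset, length) copy of earlier text, where
        # the copy may overlap the position being written.
        factors, i = [], 0
        while i < len(text):
            best_len, best_off = 0, 0
            for j in range(i):                # naive O(u^2) search, for clarity only
                l = 0
                while i + l < len(text) and text[j + l] == text[i + l]:
                    l += 1
                if l > best_len:
                    best_len, best_off = l, i - j
            if best_len >= 1:
                factors.append(('copy', best_off, best_len))
                i += best_len
            else:
                factors.append(('literal', text[i]))
                i += 1
        return factors

    # lz77_factorize("abcabcabcd")
    #   -> [('literal','a'), ('literal','b'), ('literal','c'), ('copy', 3, 6), ('literal','d')]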

179 citations


Book ChapterDOI
Gene Myers1
20 Jul 1998
TL;DR: This work presents an algorithm of comparable simplicity that requires only O(nm/w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the approximate string matching problem.
Abstract: The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k-or-fewer differences. Simple and practical bit-vector algorithms have been designed for this problem, most notably the one used in agrep. These algorithms compute a bit representation of the current state-set of the k-difference automaton for the query, and asymptotically run in O(nmk/w) time where w is the word size of the machine (e.g. 32 or 64 in practice). Here we present an algorithm of comparable simplicity that requires only O(nm/w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the problem. Thus the algorithm's performance is independent of k, and it is found to be more efficient than the previous results for many choices of k and small m.
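The flavor of the bit-vector computation can be conveyed by the following sketch, which follows the commonly published formulation of the recurrence; Python integers stand in for the machine word, so the O(nm/w) speedup is only notional here, and the variable names are ours.

    def bitvector_search(pattern, text, k):
        # Bit-parallel k-differences search: returns the 0-based end positions j
        # such that some substring of text ending at j matches pattern with at
        # most k differences.
        m = len(pattern)
        assert m > 0
        mask = (1 << m) - 1          # keep all bit-vectors to m bits
        last = 1 << (m - 1)          # bit of the last pattern row
        peq = {}                     # peq[c]: bit i set iff pattern[i] == c
        for i, c in enumerate(pattern):
            peq[c] = peq.get(c, 0) | (1 << i)
        pv, mv, score = mask, 0, m   # column 0 of the DP matrix: values 0..m
        hits = []
        for j, c in enumerate(text):
            eq = peq.get(c, 0)
            xv = eq | mv
            xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
            ph = mv | (~(xh | pv) & mask)
            mh = pv & xh
            if ph & last:            # bottom-row value changes by at most 1 per column
                score += 1
            elif mh & last:
                score -= 1
            ph = (ph << 1) & mask    # row-0 horizontal delta is 0 when searching
            mh = (mh << 1) & mask
            pv = mh | (~(xv | ph) & mask)
            mv = ph & xv
            if score <= k:
                hits.append(j)
        return hits

    # bitvector_search("abc", "xxabcxx", 0) -> [4]  (exact occurrence ends at index 4)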

124 citations


Proceedings ArticleDOI
Piotr Indyk1
08 Nov 1998
TL;DR: This paper gives a randomized O(n log n)-time algorithm for the string matching with don't cares problem, which improves the Fischer-Paterson bound from 1974 and answers the open problem posed by Weiner and Galil.
Abstract: In this paper we give a randomized O(n log n)-time algorithm for the string matching with don't cares problem. This improves the Fischer-Paterson bound from 1974 and answers the open problem posed (among others) by Weiner and Galil. Using the same technique, we give an O(n log n)-time algorithm for other problems, including subset matching, tree pattern matching, (general) approximate threshold matching, and point set matching. As this bound essentially matches the complexity of computing the fast Fourier transform, which is the only known technique for solving problems of this type, it is likely that the algorithms are in fact optimal. Additionally, the technique used for the threshold matching problem can be applied to the online version of this problem, in which we are allowed to preprocess the text and are required to process the pattern in time sublinear in the text length. This result involves an interesting variant of the Karp-Rabin fingerprint method in which the hash functions are locality-sensitive, i.e., the probability of collision of two words depends on the distance between them.
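To pin down the problem statement (and only that; this is not the randomized O(n log n) method), a naive quadratic check of string matching with don't cares might look as follows, with '?' chosen here as the wildcard symbol.

    def match_with_dont_cares(text, pattern, wildcard='?'):
        # Naive O(n * m) check: report every text position where the pattern
        # matches, with `wildcard` (in either string) allowed to match anything.
        n, m = len(text), len(pattern)
        hits = []
        for i in range(n - m + 1):
            if all(pattern[j] == wildcard or text[i + j] == wildcard
                   or pattern[j] == text[i + j] for j in range(m)):
                hits.append(i)
        return hits

    # match_with_dont_cares("abcabd", "ab?") -> [0, 3]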

116 citations


Proceedings ArticleDOI
09 Sep 1998
TL;DR: It is shown that with reasonable space overhead the authors can improve by a factor of two over the fastest online algorithms, when the tolerated error level is low (which is reasonable in text searching).
Abstract: A successful technique to search large textual databases allowing errors relies on an online search in the vocabulary of the text. To reduce the time of that online search, we index the vocabulary as a metric space. We show that with reasonable space overhead we can improve by a factor of two over the fastest online algorithms, when the tolerated error level is low (which is reasonable in text searching).
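One classic way to index a vocabulary as a metric space under edit distance is a Burkhard-Keller (BK) tree, sketched below purely to make the idea concrete; it is not necessarily the index structure used in the paper.

    class BKTree:
        # Burkhard-Keller tree: a metric-space index over a vocabulary.
        def __init__(self, dist):
            self.dist = dist                  # any metric, e.g. an edit distance
            self.root = None                  # node = [word, {distance: child}]

        def add(self, word):
            if self.root is None:
                self.root = [word, {}]
                return
            node = self.root
            while True:
                d = self.dist(word, node[0])
                child = node[1].get(d)
                if child is None:
                    node[1][d] = [word, {}]
                    return
                node = child

        def query(self, word, k):
            # All indexed words within distance k of `word`; the triangle
            # inequality lets us skip children whose edge label lies outside
            # [d - k, d + k].
            out, stack = [], ([self.root] if self.root else [])
            while stack:
                node = stack.pop()
                d = self.dist(word, node[0])
                if d <= k:
                    out.append(node[0])
                for edge, child in node[1].items():
                    if d - k <= edge <= d + k:
                        stack.append(child)
            return out

    # tree = BKTree(edit_distance)            # e.g. the edit_distance sketch above
    # for w in vocabulary: tree.add(w)
    # tree.query("color", 1) -> vocabulary words within one edit of "color"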

82 citations


Proceedings ArticleDOI
23 Feb 1998
TL;DR: This work proposes techniques for retrieving songs by rhythm from music databases by defining similarity measures on rhythm strings and proposing an index structure, called L-tree, to support efficient sub-string matching.
Abstract: We propose techniques for retrieving songs by rhythm from music databases. The rhythm of songs is modeled by rhythm strings. The song retrieval problem is then transformed to the string matching problem. In order to allow approximate string matching, we define similarity measures on rhythm strings. An index structure, called L-tree, is proposed to support efficient sub-string matching. Retrieval algorithms based on L-tree are then designed to provide approximate and sub-song retrieval. Experimental results show that this approach is effective and efficient.

77 citations


Proceedings ArticleDOI
01 Jan 1998
TL;DR: This article gave two algorithms for finding all approximate matches of a pattern in a text, where the edit distance between the pattern and the matching text substring is at most k. The first algorithm, which is quite simple, runs in time O( nk 3 m + n + m) on all patterns except k-break periodic strings.
Abstract: We give two algorithms for finding all approximate matches of a pattern in a text, where the edit distance between the pattern and the matching text substring is at most k The first algorithm, which is quite simple, runs in time O( nk 3 m + n + m) on all patterns except k-break periodic strings (defined later) The second algorithm runs in time O( nk 4 m + n + m )o nk-break periodic patterns The two classes of patterns are easily distinguished in O(m) time

55 citations


Proceedings Article
01 Jul 1998
TL;DR: A model for strings of characters is described that is loosely based on the Lempel-Ziv model, with the addition that a repeated substring can be an approximate match to the original substring; this is close to the situation of DNA, for example.
Abstract: We describe a model for strings of characters that is loosely based on the Lempel-Ziv model, with the addition that a repeated substring can be an approximate match to the original substring; this is close to the situation of DNA, for example. Typically there are many explanations for a given string under the model, some optimal and many suboptimal. Rather than commit to one optimal explanation, we sum the probabilities over all explanations under the model, because this gives the probability of the data under the model. The model has a small number of parameters, and these can be estimated from the given string by an expectation-maximization (EM) algorithm. Each iteration of the EM algorithm takes O(n²) time, and a few iterations are typically sufficient. O(n²) complexity is impractical for strings of more than a few tens of thousands of characters, and a faster approximation algorithm is also given. The model is further extended to include approximate reverse complementary repeats when analyzing DNA strings. Tests include the recovery of parameter estimates from known sources and applications to real DNA strings.

50 citations


Patent
Jean-Pierre Chanod1
29 Dec 1998
TL;DR: In this paper, selected character strings are automatically found by performing an automatic search of a text for character strings that match any of a list of selected strings and that end at a probable string ending.
Abstract: Selected character strings are automatically found by performing an automatic search of a text to find character strings that match any of a list of selected strings. The automatic search includes a series of iterations, each with a starting point in the text. Each iteration determines whether its starting point is followed by a character string that matches any of the list of selected strings and that ends at a probable string ending. Each iteration also finds a starting point for the next iteration that is a probable string beginning. The selected strings can be words and multiple word expressions, in which case probable string endings and beginnings are word boundaries. A finite state lexicon, such as a finite state transducer or a finite state automaton, can be used to determine whether character strings match the list of selected strings. A tokenizing automaton can be used to find starting points.

50 citations


Journal ArticleDOI
TL;DR: Some interesting special cases of patterns are considered, namely, patterns where there is no length-one run, i.e., there are no a, b, c ∈ Σ where b ≠ a and b ≠ c and where the substring abc appears in the pattern.

Journal ArticleDOI
TL;DR: Experimental results show that this approach can effectively discover the hidden costs of elementary operations in a set of string classes.

Book ChapterDOI
20 Jul 1998
TL;DR: For approximate regular expression matching, this work develops notions of what constitutes a significant match and gives algorithms for them; for exact regular expression pattern matching, it gives algorithms for finding a longest match and all symbols involved in some match.
Abstract: While much work has been done on determining if a document or a line of a document contains an exact or approximate match to a regular expression, less effort has been expended in formulating and determining what to report as “the match” once such a “hit” is detected. For exact regular expression pattern matching, we give algorithms for finding a longest match, all symbols involved in some match, and finding optimal submatches to tagged parts of a pattern. For approximate regular expression matching, we develop notions of what constitutes a significant match, give algorithms for them, and also for finding a longest match and all symbols in a match.
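As a baseline definition of "a longest match" (ignoring empty matches), the brute-force sketch below tries every substring and keeps the longest one the whole regular expression matches; the paper's algorithms achieve this far more efficiently, and the helper name is ours.

    import re

    def longest_regex_match(regex, text):
        # Try every substring, longest first at each start, and keep the longest
        # one that the whole regular expression matches (O(n^2) fullmatch calls).
        pat = re.compile(regex)
        best = None
        for i in range(len(text)):
            for j in range(len(text), i, -1):
                if pat.fullmatch(text, i, j):
                    if best is None or j - i > len(best):
                        best = text[i:j]
                    break                     # longest match starting at i found
        return best

    # longest_regex_match("ab*|ba", "xabbbay") -> "abbb"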

Journal ArticleDOI
TL;DR: This paper considers a class of opposite problems connected with string noninclusion relations: find a shortest string included in no string of a given finite language, and find a longest string including no string of a given finite language.
Abstract: For every string inclusion relation there are two optimization problems: find a longest string included in every string of a given finite language, and find a shortest string including every string of a given finite language. As an example, the two well-known pairs of problems, the longest common substring (or subsequence) problem and the shortest common superstring (or supersequence) problem, are interpretations of these two problems. In this paper we consider a class of opposite problems connected with string noninclusion relations: find a shortest string included in no string of a given finite language and find a longest string including no string of a given finite language. The predicate "string $\alpha$ is not included in string $\beta$" is interpreted as either "$\alpha$ is not a substring of $\beta$" or "$\alpha$ is not a subsequence of $\beta$". The main purpose is to determine the complexity status of the string noninclusion optimization problems. Using graph approaches we present polynomial-time algorithms for the first interpretation and NP-hardness proofs for the second. We also discuss restricted versions of the problems, correlations between the string inclusion and noninclusion problems, and generalized problems which are the string inclusion problems for one language and the string noninclusion problems for another. In applications the string inclusion problems are used to find a similarity between any structures which can be represented by strings. Respectively, the noninclusion problems can be used to find a nonsimilarity. Such problems occur in computational molecular biology, data compression, pattern recognition, and flexible manufacturing. The above generalized problems arise naturally in all of these applied areas. Apart from this practical reason, we hope that studying the string noninclusion problems will yield deeper understanding of the string inclusion problems.
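Under the substring interpretation, the first of these problems, finding a shortest string included in no string of a given finite language, can at least be stated executably by brute force, as below; this exponential enumeration is for illustration only, since the paper gives polynomial-time graph-based algorithms for this case.

    from itertools import product

    def shortest_absent_string(language, alphabet):
        # Shortest string over `alphabet` that is a substring of no string in
        # `language` (brute-force enumeration by increasing length).
        length = 1
        while True:
            for cand in map(''.join, product(alphabet, repeat=length)):
                if not any(cand in s for s in language):
                    return cand
            length += 1

    # shortest_absent_string(["abab", "baba"], "ab") -> "aa"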

Journal ArticleDOI
TL;DR: An efficient multi-attribute pattern matching machine is described that locates all occurrences of any of a finite number of sequences of rule structures (called matching rules) in a sequence of input structures.

Journal ArticleDOI
TL;DR: Experimental results where symbols are taken among potentially infinite sets such as integers, reals or composed structures show that, in most cases, it is better to decompose each symbol into a sequence of bytes and use algorithms which assume that the alphabet is bounded.
Abstract: Various string matching algorithms have been designed and some experimental work on string matching over bounded alphabets has been performed, but string matching over unbounded alphabets has been little investigated. We present here experimental results where symbols are taken among potentially infinite sets such as integers, reals or composed structures. These results show that, in most cases, it is better to decompose each symbol into a sequence of bytes and use algorithms which assume that the alphabet is bounded, and use heuristics on symbols. © 1998 John Wiley & Sons, Ltd.
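The byte-decomposition idea can be illustrated as follows, under the assumption that symbols are unsigned 32-bit integers; each symbol becomes four fixed-width bytes, a byte-level matcher is reused, and hits are checked for symbol alignment. The function name and encoding choice are ours.

    import struct

    def find_int_pattern(haystack, needle):
        # Encode every symbol as four big-endian bytes and reuse a
        # bounded-alphabet (byte-level) matcher; a byte-level hit must then be
        # checked for alignment, since it may start in the middle of a symbol.
        hb = b''.join(struct.pack('>I', x) for x in haystack)
        nb = b''.join(struct.pack('>I', x) for x in needle)
        pos = hb.find(nb)
        while pos != -1:
            if pos % 4 == 0:                  # symbol-aligned hit -> real occurrence
                return pos // 4
            pos = hb.find(nb, pos + 1)
        return -1

    # find_int_pattern([7, 300, 300, 42], [300, 42]) -> 2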

Book ChapterDOI
12 Aug 1998
TL;DR: New similarity measures are presented and they can be used to perform more general two-dimensional approximate pattern matching and to compute the edit distance between two images.
Abstract: In this paper we discuss how to compute the edit distance (or similarity) between two images. We present new similarity measures and show how to compute them. They can be used to perform more general two-dimensional approximate pattern matching. Previous work on two-dimensional approximate string matching either allows only substitutions or uses a restricted edit distance that permits only some types of errors.

Book
08 Jul 1998
TL;DR: A fast bit-vector algorithm for approximate string matching based on dynamic programming and a bit-parallel approach to suffix automata: Fast extended string matching.
Abstract: Contents: A fast bit-vector algorithm for approximate string matching based on dynamic programming; A bit-parallel approach to suffix automata: Fast extended string matching; A dictionary matching algorithm fast on the average for terms of varying length; A very fast string matching algorithm for small alphabets and long patterns; Approximate word sequence matching over Sparse Suffix Trees; Efficient parallel algorithm for the editing distance between ordered trees; Reporting exact and approximate regular expression matches; An approximate oracle for distance in metric spaces; A rotation invariant filter for two-dimensional string matching; Constructing suffix arrays for multi-dimensional matrices; Simple and flexible detection of contiguous repeats using a suffix tree (Preliminary Version); Comparison of coding DNA; Fixed topology alignment with recombination; Aligning alignments; Efficient special cases of pattern matching with swaps; Aligning DNA sequences to minimize the change in protein; Genome halving.

Proceedings Article
01 Jan 1998
TL;DR: In this article, reduced nondeterministic finite automata (NFAs) for approximate string matching are presented, where the pattern may occur with a limited number of errors measured by edit distance.
Abstract: Approximate string and sequence matching is the problem of searching for all occurrences of a pattern (a string or a sequence) in a text, where the pattern may occur with a limited number of errors measured by edit distance. Several methods have been designed for approximate string matching that simulate a nondeterministic finite automaton (NFA) constructed for this problem. This paper presents reduced NFAs for approximate string matching, usable when we are interested only in occurrences whose edit distance is at most a given integer but not in the exact edit distance of each found occurrence. An algorithm based on dynamic programming that simulates these reduced NFAs is then presented, and it is shown how to use this algorithm for approximate sequence matching.

Proceedings Article
01 Jan 1998
TL;DR: In this paper, a fuzzy automaton-based approximate string matching algorithm is presented, which can be used for approximate searching in special cases when some pairs of symbols are more similar to each other than the others.
Abstract: We explain new ways of constructing search algorithms using fuzzy sets and fuzzy automata. This technique can be used to search or match strings in special cases when some pairs of symbols are more similar to each other than the others. This kind of similarity cannot be handled by usual searching algorithms.We present sample situations, which would use this kind of searching. Then we define a fuzzy automaton, and some basic constructions we need for our purposes. We continue with definition of our fuzzy automaton based approximate string matching algorithm, and add some notes to fuzzy-trellis construction which can be used for approximate searching.

Book ChapterDOI
20 Jul 1998
TL;DR: In this paper, word sequence matching is discussed, and the common edit distance metric for approximate string matching is adapted to searching for words and sequences of words.
Abstract: In this paper, we discuss word sequence matching, and we adapt the common edit distance metric for approximate string matching to searching for words and sequences of words. We furthermore create a variant of the Sparse Suffix Tree ([3]) and adapt algorithms for approximate word and word sequence matching over the Sparse Suffix Tree variant. The algorithms have been implemented and tested in a WWW information retrieval environment, and performance data is presented.
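Adapting edit distance from characters to words is mechanically simple: the same dynamic program is run with whole words as the symbols, as in the short sketch below (our illustration, not the paper's Sparse Suffix Tree machinery).

    def word_edit_distance(a_words, b_words):
        # Same dynamic program as character-level Levenshtein distance, with
        # whole words as the symbols.
        n, m = len(a_words), len(b_words)
        prev = list(range(m + 1))
        for i in range(1, n + 1):
            cur = [i] + [0] * m
            for j in range(1, m + 1):
                cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                             prev[j - 1] + (a_words[i - 1] != b_words[j - 1]))
            prev = cur
        return prev[m]

    # word_edit_distance("approximate string matching over sparse suffix trees".split(),
    #                    "approximate word matching over suffix trees".split()) -> 2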

Proceedings ArticleDOI
16 Aug 1998
TL;DR: This paper describes an effort to mark and annotate read Cantonese speech, covering both citation pronunciations and sentences/phrases read aloud, using dynamic programming as in approximate string matching.
Abstract: This paper describes our effort to mark and annotate read Cantonese speech, covering both citation pronunciations and sentences/phrases read aloud. Four signals are recorded simultaneously to assist marking and annotation: acoustic, laryngograph, nasal, and air-burst signals. A coarse match between voiced segments of the speech and voiced segments of the phonetic spelling of the utterance is computed by dynamic programming, as in approximate string matching. Finally, we discuss general issues in the design of our annotation software.

Proceedings ArticleDOI
16 Aug 1998
TL;DR: This work considers Gaussian stationary sources and studies the problem of string matching with distortion, and proves theorems concerning the asymptotic behavior of the probability of string match with distortion and the waiting time for the string matching.
Abstract: Wyner and Ziv (1989) studied the asymptotic properties of recurrence times of stationary processes, and applied the results to obtain optimal data compression schemes in information transmission. Since then many data compression algorithms based upon string matching have been proposed and studied. We consider Gaussian stationary sources and study the problem of string matching with distortion. We prove theorems concerning the asymptotic behavior of the probability of string matching with distortion and the waiting time for the string matching.