Showing papers on "Approximate string matching" published in 1993


Book ChapterDOI
02 Jun 1993
TL;DR: It is shown how the searches can be done fast using the suffix tree of T augmented with the suffix links as the preprocessed form of T and applying dynamic programming over the tree.
Abstract: The classical approximate string-matching problem of finding the locations of approximate occurrences P′ of pattern string P in text string T such that the edit distance between P and P′ is ≤ k is considered. We concentrate on the special case in which T is available for preprocessing before the searches with varying P and k. It is shown how the searches can be done fast using the suffix tree of T augmented with the suffix links as the preprocessed form of T and applying dynamic programming over the tree. Three variations of the search algorithm are developed with running times $O(mq + n)$, $O(mq \log q + \text{size of the output})$, and $O(m^2 q + \text{size of the output})$. Here $n = |T|$, $m = |P|$, and q varies depending on the problem instance between 0 and n. In the case of the unit cost edit distance it is shown that $q = O(\min(n, m^{k+1}|\Sigma|^k))$ where $\Sigma$ is the alphabet.

159 citations
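
For orientation, the problem stated in the abstract above (report where some substring of T lies within unit-cost edit distance k of P) can be sketched with the plain column-by-column dynamic program. This is only the O(mn) baseline, not the suffix-tree algorithm of the paper; the function name `approx_occurrences` and the choice to report 1-based end positions are assumptions made for this illustration.

```python
def approx_occurrences(pattern: str, text: str, k: int) -> list[int]:
    """Return 1-based end positions j such that some substring of text
    ending at j has unit-cost edit distance <= k from pattern."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any
    # substring of the text ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]   # value of row 0 in the previous column
        col[0] = 0           # an occurrence may start anywhere in the text
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(prev_diag + cost,  # match or substitution
                      col[i] + 1,        # extra text character
                      col[i - 1] + 1)    # missing pattern character
            prev_diag, col[i] = col[i], new
        if col[m] <= k:
            ends.append(j)
    return ends
```

With k = 0 this degenerates to exact matching; larger k also reports the positions next to an exact occurrence, because dropping or adding one character costs a single edit.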


Journal ArticleDOI
TL;DR: A new method for the recognition of arbitrary two-dimensional shapes based on string edit distance computation is described, which is invariant under translation, rotation, scaling and partial occlusion.

154 citations


Journal ArticleDOI
TL;DR: The generalized Boyer–Moore algorithm is shown to solve the k mismatches problem and a related algorithm is developed for the k differences problem, where the task is to find all approximate occurrences of a pattern in a text with k differences.
Abstract: The Boyer–Moore idea applied in exact string matching is generalized to approximate string matching. Two versions of the problem are considered. The k mismatches problem is to find all approximate occurrences of a pattern string (length m) in a text string (length n) with at most k mismatches. The generalized Boyer–Moore algorithm is shown (under a mild independence assumption) to solve the problem in expected time $O\left(kn\left(\frac{1}{m-k} + \frac{k}{c}\right)\right)$, where c is the size of the alphabet. A related algorithm is developed for the k differences problem, where the task is to find all approximate occurrences of a pattern in a text with $\leq k$ differences (insertions, deletions, changes). Experimental evaluation of the algorithms is reported, showing that the new algorithms are often significantly faster than the old ones. Both algorithms are functionally equivalent with the Horspool version of the Boyer–Moore algorithm when $k = 0$.

117 citations
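
The abstract above notes that both algorithms reduce to the Horspool version of Boyer–Moore when k = 0. A minimal sketch of that exact-matching special case follows; it does not implement the generalized k-mismatches or k-differences algorithms, and the name `horspool_search` is an assumption of this example.

```python
def horspool_search(pattern: str, text: str) -> list[int]:
    """Exact matching with the Horspool shift rule; returns 0-based start positions."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # For each character of pattern[:-1], the distance from its last
    # occurrence to the end of the pattern. Other characters shift by m.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift according to the text character aligned with the pattern's last position.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```

The shift depends only on the text character under the last pattern position, which is what lets the scan skip over parts of the text.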


Proceedings ArticleDOI
27 Apr 1993
TL;DR: A minimum string error rate training algorithm, segmental minimum string error rate training, is described, which takes a further step in modeling the basic speech recognition units by directly applying discriminative analysis to string level acoustic model matching.
Abstract: The authors study issues related to string level acoustic modeling in continuous speech recognition. They derive the formulation of minimum string error rate training. A minimum string error rate training algorithm, segmental minimum string error rate training, is described. It takes a further step in modeling the basic speech recognition units by directly applying discriminative analysis to string level acoustic model matching. One of the advantages of this training algorithm lies in its ability to model strings which are competitive with the correct string but are unseen in the training material. The robustness and acoustic resolution of the unit model set can therefore be significantly improved. Various experimental results have shown that significant error rate reduction can be achieved using this approach.

99 citations


Journal ArticleDOI
TL;DR: A new O(kn) algorithm for the approximate string matching problem, where n is the length of the text, is given; it is based on the suffix automaton with failure transitions and on the diagonalwise monotonicity of the edit distance table.
Abstract: The approximate string matching problem is, given a text string, a pattern string, and an integer k, to find in the text all approximate occurrences of the pattern. An approximate occurrence means a substring of the text with edit distance at most k from the pattern. We give a new O(kn) algorithm for this problem, where n is the length of the text. The algorithm is based on the suffix automaton with failure transitions and on the diagonalwise monotonicity of the edit distance table. Some experiments showing that the algorithm has a small overhead are reported.

65 citations
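
The diagonalwise monotonicity mentioned in the abstract above can be checked numerically on small inputs: moving one step down a diagonal of the edit distance table either keeps the value or increases it by exactly 1. The sketch below builds the full table (with a zero top row so that occurrences may start anywhere in the text) and tests that property. It illustrates the property only, not the O(kn) algorithm; both function names are assumptions of this example.

```python
def edit_table(pattern: str, text: str) -> list[list[int]]:
    """D[i][j] = least unit-cost edit distance between pattern[:i]
    and any substring of text ending at position j."""
    m, n = len(pattern), len(text)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i  # pattern prefix against the empty string
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + cost,
                          D[i - 1][j] + 1,
                          D[i][j - 1] + 1)
    return D


def diagonals_are_monotone(D: list[list[int]]) -> bool:
    """True if every step down a diagonal changes the value by 0 or 1."""
    return all(D[i + 1][j + 1] - D[i][j] in (0, 1)
               for i in range(len(D) - 1)
               for j in range(len(D[0]) - 1))
```

Because values along a diagonal never decrease, an algorithm only needs to track where each diagonal steps from one value to the next, which is the usual route to O(kn)-type bounds.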



Proceedings ArticleDOI
20 Oct 1993
TL;DR: A new algorithm for string edit distance computation that needs time that is only linear in the length of one of the two strings to be matched, provided that the other string has undergone some preprocessing in an off-line phase is proposed.
Abstract: A new algorithm for string edit distance computation is proposed. It needs time that is only linear in the length of one of the two strings to be matched, provided that the other string has undergone some preprocessing in an off-line phase. The algorithm can be extended to matching a word against a dictionary of any size. In this case the time complexity is independent of the length of the dictionary words, and the number of entries in the dictionary.

24 citations


Book ChapterDOI
13 Sep 1993
TL;DR: A new method for jigsaw puzzle solving that takes real images of puzzle pieces as input data and resolves ambiguities that may result from local shape matching using a best-first tree search procedure with backtracking.
Abstract: In this paper we describe a new method for jigsaw puzzle solving. The main steps of the method are local shape analysis followed by global assembly. Local shape analysis is based on an approximate string matching procedure that detects corresponding partial boundaries of pairs of puzzle pieces. In the assembly phase, ambiguities that may result from local shape matching are resolved using a best-first tree search procedure with backtracking. The method takes real images of puzzle pieces as input data. It has been completely implemented and successfully tested on a number of puzzles.

22 citations


Proceedings ArticleDOI
01 Aug 1993
TL;DR: This work presents a parallel algorithm for two dimensional matching that takes time O(log m) on a CREW PRAM, thus matching the lower bound for string matching on a PRAM without concurrent writes.
Abstract: We present a parallel algorithm for two dimensional matching. This algorithm is optimal in two ways. First, the total number of operations on the text is linear. Second, the algorithm takes time O(log m) on a CREW PRAM, thus matching the lower bound for string matching on a PRAM without concurrent writes. On a CRCW, the algorithm runs in time O(log log m). Finding such an algorithm was a problem posed in 1985 and has been open since.

22 citations


Book ChapterDOI
01 Nov 1993
TL;DR: This article presents a case-based system for predicting processes that human beings control little, such as forest fires, in which cases are matched using approximate string matching adapted to each point of view.
Abstract: This article deals with the prediction of processes. Research work on this topic usually considers some well known processes, whose prediction is based on a model that takes precisely into account the influence of the parameters involved in the modification of their state. Such models are not conceivable here, since we deal with processes that human beings control little, like forest fires, which are the subject of the system presented here. The reasoning that it uses relies on cases: we consider that if two processes behaved the same way during a certain interval, then their behaviours are very likely to be similar afterwards. The matching is based on approximate string matching. Because of the complexity of the handled processes, points of view have been introduced. Their consideration requires a matching adapted to each one. They are presented here.

20 citations


Proceedings ArticleDOI
01 Jun 1993
TL;DR: This work describes the first efficient algorithm for simultaneously matching multiple rectangular patterns of varying sizes and aspect ratios in a rectangular text, and extends the algorithm to a dynamic setting where the set of patterns can change over time.
Abstract: We describe the first worst-case efficient algorithm for simultaneously matching multiple rectangular patterns of varying sizes and aspect ratios in a rectangular text. Efficient means significantly more efficient asymptotically than applying known algorithms that handle one height (or width or aspect ratio) at a time for each height. Our algorithm features an interesting use of multidimensional range searching, as well as new adaptations of several known techniques for two-dimensional string matching. We also extend our algorithm to a dynamic setting where the set of patterns can change over time.

Journal ArticleDOI
TL;DR: The Knuth-Morris-Pratt string matching algorithm can be easily adapted to solve the string prefix-matching problem without making additional comparisons.

Proceedings ArticleDOI
03 Oct 1993
TL;DR: The architecture is a parallel realization of the standard dynamic programming algorithm proposed by Wagner and Fischer (1974), and can perform approximate string matching for variable edit costs, and makes use of simple basic cells and requires regular nearest-neighbor communication, which makes it suitable for VLSI implementation.
Abstract: The edit distance between two strings is defined as the minimum cost of a sequence of editing operations (insertions, deletions and substitutions) that convert one string into the other. This paper presents a linear systolic array for computing the edit distance between two strings over a given alphabet. An encoding scheme is proposed which reduces the number of bits required to represent a state in the computation. The architecture is a parallel realization of the standard dynamic programming algorithm proposed by Wagner and Fischer (1974), and can perform approximate string matching for variable edit costs. More importantly, the architecture does not place any constraint on the lengths of the strings that can be compared. It makes use of simple basic cells and requires regular nearest-neighbor communication, which makes it suitable for VLSI implementation. A prototype of this array is currently being built.
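
The Wagner and Fischer dynamic program that the architecture parallelizes can be sketched sequentially as follows. The parameter names `ins`, `dele` and `sub` (per-operation costs) are assumptions of this example, and nothing here models the systolic array or its state encoding.

```python
def weighted_edit_distance(a: str, b: str,
                           ins: int = 1, dele: int = 1, sub: int = 1) -> int:
    """Minimum total cost of insertions, deletions and substitutions turning a into b."""
    # prev[j] = cost of converting the current prefix of a into b[:j]
    prev = [j * ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * dele]  # converting a[:i] into the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j - 1] + (0 if ca == cb else sub),  # keep or substitute
                prev[j] + dele,                          # delete ca
                curr[j - 1] + ins))                      # insert cb
        prev = curr
    return prev[-1]
```

Each cell depends only on its left, upper and upper-left neighbours, so all cells on one anti-diagonal can be computed at the same time; that independence is what a linear array with nearest-neighbor communication can exploit.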

Book ChapterDOI
02 Jun 1993
TL;DR: This paper describes a two-stage process that uses a new technique to preselect roughly similar m-tuples and demonstrates the advantages of multiple filtration in comparison with other techniques for approximate pattern matching.
Abstract: Given a text of length n and a query of length q we present an algorithm for finding all locations of m-tuples in the text and in the query that differ by at most k mismatches. This problem is motivated by the dot-matrix constructions for sequence comparison and optimal oligonucleotide probe selection routinely used in molecular biology. In the case q = m the problem coincides with the classical approximate string matching with k mismatches problem. We present a new approach to this problem based on multiple filtration which may have advantages over some sophisticated and theoretically efficient methods that have been proposed. This paper describes a two-stage process. The first stage (multiple filtration) uses a new technique to preselect roughly similar m-tuples. The second stage compares these m-tuples using an accurate method. We demonstrate the advantages of multiple filtration in comparison with other techniques for approximate pattern matching.

Journal ArticleDOI
Lloyd Allison
TL;DR: Ukkonen's (pair-wise) string alignment technique is extended to the problem of finding an optimal alignment for three strings; the resulting algorithm has worst-case time complexity $O(nd^2)$ and space complexity $O(d^3)$, where the string lengths are approximately n and d is the three-way edit distance based on tree costs.

Proceedings ArticleDOI
Y. Mishina, K. Kojima
03 Oct 1993
TL;DR: Measurements show that this string matching algorithm using IDP is more than 10 times faster than a scalar program using the Aho-Corasick method.
Abstract: The paper describes a new string matching algorithm that is suitable for vector processors. The hardware implementation of the algorithm is also presented. The algorithm consists of two parts. In the first part, candidate strings that are similar to pattern strings are extracted from a text string (cutout part). Candidate strings may include noise strings, and these are removed in the second part of the algorithm. Each part is efficiently vectorized using vector instructions of conventional vector processors for numerical computations. Moreover, the cutout part is implemented as an added instruction of the Integrated Database Processor (IDP). Measurements show that this algorithm using IDP is more than 10 times faster than a scalar program using the Aho-Corasick method.

Proceedings ArticleDOI
I. Sadeh
30 Mar 1993
TL;DR: The duality between the two algorithms is proved with some asymptotic properties concerning the workings of an approximate string matching algorithm for ergodic stationary sources.
Abstract: Two practical universal source coding schemes are proposed. One is an approximate fixed length string matching data compression, and the other is LZ-type quasi parsing by approximate string matching. It is shown that in the former algorithm the compression rate converges to the theoretical bound of R(D) for a large class of processes as the database size and the string length tend to infinity. A similar result holds for the latter algorithm in the limit of infinite database size. The performance of the two algorithms is evaluated where the database size and the string length are finite. The duality between the two algorithms is proved with some asymptotic properties concerning the workings of an approximate string matching algorithm for ergodic stationary sources.

Journal ArticleDOI
TL;DR: The (string) pattern matching problem in a probabilistic framework is investigated, namely, it is assumed that both strings form independent sequences of i.i.d. symbols, and it is proved that $M_{m,n}/m \to P$ almost surely (a.s.) for $\log n = o(m)$.
Abstract: The study and comparison of strings of symbols from a finite or an infinite alphabet is relevant to various areas of science, notably molecular biology, speech recognition, and computer science. In particular, the problem of finding the minimum “distance” between two strings (in general, two blocks of data) is of practical importance. In this article we investigate the (string) pattern matching problem in a probabilistic framework, namely, it is assumed that both strings form independent sequences of i.i.d. symbols. Given a text string a of length n and a pattern string b of length m, let $M_{m,n}$ be the maximum number of matches between b and all m-substrings of a. Our main probabilistic result shows that for a wide range of input parameters in probability (pr.) provided $m, n \to \infty$ such that $\log n = o(m)$, where P is the probability of a match between any two symbols of these strings, and T is the probability of a match between two positions in the text string and a given position of the pattern string. We also prove that $M_{m,n}/m \to P$ almost surely (a.s.) for $\log n = o(m)$. © 1993 John Wiley & Sons, Inc.
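
The quantity analysed in the abstract above, the maximum number of matches between b and all m-substrings of a, is straightforward to compute directly. The sketch below is a brute-force O(nm) evaluation for experimenting with small inputs; the function name `max_matches` is an assumption of this example.

```python
def max_matches(a: str, b: str) -> int:
    """Maximum, over all length-m windows of the text a, of the number of
    positions at which the window agrees with the pattern b (m = len(b))."""
    n, m = len(a), len(b)
    if m > n:
        raise ValueError("pattern must not be longer than the text")
    return max(
        sum(x == y for x, y in zip(a[i:i + m], b))  # matches in one alignment
        for i in range(n - m + 1)
    )
```

Dividing the result by m gives the per-symbol match rate whose limiting behaviour the paper studies.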

Book ChapterDOI
30 Sep 1993
TL;DR: This paper shows how to modify Crochemore and Perrin's elegant linear-time, constant-space string matching algorithm, which makes at most 2n−m symbol comparisons, to use fewer comparisons.
Abstract: Crochemore and Perrin discovered an elegant linear-time constant-space string matching algorithm that makes at most 2n−m symbol comparisons. This paper shows how to modify their algorithm to use fewer comparisons.

Book ChapterDOI
02 Jun 1993
TL;DR: The complexity of two problems in this context is investigated, namely, checking if there is any false match, and identifying all the false matches in the match vector.
Abstract: Consider a text string of length n, a pattern string of length m and a match vector of length n which declares each location in the text to be either a mismatch (the pattern does not occur beginning at that location in the text) or a potential match (the pattern may occur beginning at that location in the text). Some of the potential matches could be false, i.e., the pattern may not occur beginning at some location in the text declared to be a potential match. We investigate the complexity of two problems in this context, namely, checking if there is any false match, and identifying all the false matches in the match vector.
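
To make the setting concrete, the second problem (identifying all false matches) can be solved naively by re-checking every flagged location. This brute-force sketch takes O(nm) time in the worst case and says nothing about the complexity bounds studied in the paper; the name `false_matches` and the use of a list of booleans for the match vector are assumptions of this example.

```python
def false_matches(text: str, pattern: str, match_vector: list[bool]) -> list[int]:
    """Return the 0-based locations flagged as potential matches at which
    the pattern does not actually occur."""
    m = len(pattern)
    return [i for i, flagged in enumerate(match_vector)
            if flagged and text[i:i + m] != pattern]
```

The first problem (is there any false match at all) is then simply whether the returned list is non-empty.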

Book ChapterDOI
01 Jan 1993
TL;DR: The q-gram distance yields a lower bound for the unit cost edit distance, which leads to a fast hybrid algorithm for the k-differences problem.
Abstract: Some results are summarized on approximate string-matching with a string distance function that is computable in linear time and is based on the so-called q-grams (‘n-grams’). An algorithm is given for the associated string matching problem that finds the locally best approximate occurrences of pattern P, ∣P∣ = m, in text T, ∣T∣ = n, in time O(n log(m - q)). The occurrences with distance ≤ k can be found in time O(n log k). This should be compared to the edit distance based k-differences problem for which the best algorithm currently known needs O(kn). The q-gram distance yields a lower bound for the unit cost edit distance, which leads to a fast hybrid algorithm for the k-differences problem.
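
A minimal sketch of a q-gram distance follows, assuming the usual definition as the sum, over all q-grams, of the absolute difference between their occurrence counts in the two strings. The function names are assumptions of this example, and the matching algorithms of the paper are not reproduced.

```python
from collections import Counter


def qgram_profile(s: str, q: int) -> Counter:
    """Count every substring of length q in s (empty if s is shorter than q)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(x: str, y: str, q: int) -> int:
    """Sum of absolute differences between the q-gram counts of x and y."""
    px, py = qgram_profile(x, q), qgram_profile(y, q)
    # A Counter returns 0 for q-grams that do not occur in a string.
    return sum(abs(px[g] - py[g]) for g in px.keys() | py.keys())
```

Under this definition a single edit operation changes at most q of the q-grams in each string, so the q-gram distance divided by 2q cannot exceed the unit cost edit distance. For example, "abcd" and "abed" are one substitution apart and have 2-gram distance 4 = 2 · 2 · 1.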

Journal ArticleDOI
TL;DR: Recurrence relations and closed form analytic expressions are derived for the run time complexity of two models of “fuzzy pattern matching” for use in music analysis; each model assumes the existence of an atomic exact pattern matching operation.
Abstract: In music analysis it is a common requirement to search a musical score for occurrences of a particular musical motif and its variants. This tedious and time-consuming task can be carried out by computer, using one of several models to specify which variants are to be included in the search. The question arises: just how many variants must be explicitly considered? The answer has a profound effect on the computer time needed. In this paper, recurrence relations and closed form analytic expressions are derived for the run time complexity of two models of “fuzzy pattern matching” for use in music analysis; each model assumes the existence of an atomic exact pattern matching operation. The formulae so obtained are evaluated and tabulated as a function of their independent parameters. These results enable a priori estimates to be made of the relative run times of different music searches performed using either model. This is illustrated by applying the results to an actual musical example.


Journal ArticleDOI
TL;DR: In this article, the authors consider the problems of finding the shortest string included in no string of a given finite language and the longest string including no string of a given finite language.
Abstract: Similarity problems intensively investigated in computational molecular biology have the following two stringology models: find the longest string included in every string of a given finite language, and find the shortest string including every string of a given finite language. These two problems are exemplified by the two well-known pairs of problems, the longest common subsequence (or substring) problem and the shortest common supersequence (or superstring) problem. In this paper we consider opposite problems connected with string non-inclusion relations: find the shortest string included in no string of a given finite language and find the longest string including no string of a given finite language. The predicate "string alpha is not included in string beta" is interpreted either as "alpha is not a subsequence of beta" or as "alpha is not a substring of beta". The main purpose is to determine the complexity status of the non-similarity problems. Using graph approaches, we present NP-hardness proofs for the first interpretation and polynomial-time algorithms for the second one. Special cases of the problems, and related issues are discussed.

Book ChapterDOI
01 Jan 1993
TL;DR: An algorithm is derived to find all occurrences of P in T with bounded distance k, in time O(k|T| + |P|).
Abstract: In this paper we consider matching problems on arbitrary ordered labelled trees and ranked trees, which have important applications in many fields such as molecular biology, term rewriting systems and language processing. Given a text tree T and a pattern tree P, we derive an algorithm to find all occurrences of P in T with bounded distance k, in time O(k|T| + |P|). The distance refers to the number of subtrees to be inserted or deleted from T to obtain P. This problem is an extension of the tree pattern matching problem where deletions of subtrees occur only in T, and of the approximate string matching problem applied to trees. Extensions of the algorithm to solve other relevant problems, such as ranked trees matching, as well as their parallel versions are then devised.

Patent
19 Feb 1993
TL;DR: In this paper, an apparatus and method for finding a target string in a history buffer, where the found target string matches a given current string to a maximum practical length, is described.
Abstract: An apparatus and method are disclosed for finding a target string in a history buffer, where the found target string matches a given current string to a maximum practical length. A presorted array of array entries (SP) is defined where each entry uniquely identifies a value and a location of a respective string-start byte pair in the history buffer. The array entries are sorted primarily upon their string-start byte-pair values and secondarily upon their pointed-to locations. A direct lookup table (DLT) is further provided, indexable by each possible string-start byte pair that may appear in the history buffer. The DLT is used to locate a first array entry for a given string-start byte pair. To find a longest matching target string, the first two bytes of the current string are used as an index into the direct lookup table, and the given table entry is then used as an index into the pre-sorted SP array. The corresponding array entry is used as an index to a first target string in the buffer. Each subsequent array entry having the same string-start byte pair value is used to locate a next target string. A longest matching string is determined from among the target strings pointed to by the SP array. The location and length of the longest matching string are returned as a result.
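
The lookup procedure described in the abstract above can be sketched in a few lines. This is an illustrative reading of the abstract only: a Python dict stands in for the direct lookup table (which the abstract describes as indexable by every possible byte pair), and the function names and the returned (position, length) tuple are assumptions of this example.

```python
def build_index(history: bytes):
    """Build the sorted SP array and a lookup table standing in for the DLT."""
    # SP entries: (string-start byte pair, location), sorted primarily
    # by pair value and secondarily by location.
    sp = sorted((history[i:i + 2], i) for i in range(len(history) - 1))
    # DLT stand-in: byte pair -> index of the first SP entry with that pair.
    dlt = {}
    for idx, (pair, _) in enumerate(sp):
        dlt.setdefault(pair, idx)
    return sp, dlt


def longest_match(history: bytes, current: bytes, sp, dlt):
    """Return (location, length) of the longest target string in history that
    matches a prefix of current, or (-1, 0) if no byte pair matches."""
    best_pos, best_len = -1, 0
    pair = current[:2]
    idx = dlt.get(pair)
    if idx is None:
        return best_pos, best_len
    # Walk all SP entries that share the current string's first two bytes.
    while idx < len(sp) and sp[idx][0] == pair:
        pos = sp[idx][1]
        length = 0
        while (length < len(current) and pos + length < len(history)
               and history[pos + length] == current[length]):
            length += 1
        if length > best_len:
            best_pos, best_len = pos, length
        idx += 1
    return best_pos, best_len
```

For instance, with the history `b"abcabcdab"` and the current string `b"abcdx"`, the three entries for the pair `ab` point at locations 0, 3 and 7, and the longest match is the 4-byte string at location 3.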

01 Jan 1993
TL;DR: The method used in the error correction system is based on approximate string matching between the misrecognized words and the terms occurring in the database as opposed to the entire dictionary.
Abstract: The method used in our error correction system is based on three principles: 1) approximate string matching between the misrecognized words and the terms occurring in the database as opposed to the entire dictionary 2) local information obtained from the individual documents 3) the use of a confusion matrix, which contains information inherently specific to the nature of errors caused by the particular OCR device. This system is utilized to process a database composed of approximately 9300 pages of OCR generated documents.

Book ChapterDOI
01 Jan 1993
TL;DR: This paper surveys recent results on parallel algorithms for the string matching problem.
Abstract: The string matching problem is one of the most studied problems in computer science. While it is very easily stated and many of the simple algorithms perform very well in practice, numerous works have been published on the subject and research is still very active. In this paper we survey recent results on parallel algorithms for the string matching problem.

Book
18 May 1993
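TL;DR: A collection of papers on combinatorial pattern matching, ranging from a linear time pattern matching algorithm between a string and a tree and tight comparison bounds for the string prefix-matching problem to a unifying look at d-dimensional periodicities and space coverings.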
Abstract: A linear time pattern matching algorithm between a string and a tree.- Tight comparison bounds for the string prefix-matching problem.- 3-D docking of protein molecules.- Minimal separators of two words.- Covering a string.- On the worst-case behaviour of some approximation algorithms for the shortest common supersequence of k strings.- An algorithm for locating non-overlapping regions of maximum alignment score.- Exact and approximation algorithms for the inversion distance between two chromosomes.- The maximum weight trace problem in multiple sequence alignment.- An algorithm for approximate tandem repeats.- Two dimensional pattern matching in a digitized image.- Analysis of a string edit problem in a probabilistic framework.- Detecting false matches in string matching algorithms.- On suboptimal alignments of biological sequences.- A fast filtration algorithm for the substring matching problem.- A unifying look at d-dimensional periodicities and space coverings.- Approximate string-matching over suffix trees.- Multiple sequence comparison and n-dimensional image reconstruction.- A new editing based distance between unordered labeled trees.