
Showing papers on "Edit distance published in 1992"


Journal ArticleDOI
06 Jan 1992
TL;DR: Two string distance functions that are computable in linear time give a lower bound for the edit distance (in the unit cost model), which leads to fast hybrid algorithms for edit distance based string matching.
Abstract: We study approximate string matching in connection with two string distance functions that are computable in linear time. The first function is based on the so-called $q$-grams. An algorithm is given for the associated string matching problem that finds the locally best approximate occurrences of pattern $P$, $|P|=m$, in text $T$, $|T|=n$, in time $O(n\log (m-q))$. The occurrences with distance $\leq k$ can be found in time $O(n\log k)$. The other distance function is based on finding maximal common substrings and allows a form of approximate string matching in time $O(n)$. Both distances give a lower bound for the edit distance (in the unit cost model), which leads to fast hybrid algorithms for the edit distance based string matching.

665 citations
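The q-gram lower bound is easy to sketch. The following is an illustrative implementation, not the paper's code, and the function name is mine: the profile of a string counts its length-q substrings, and the distance is the L1 difference between the two profiles. Since one unit-cost edit affects at most q q-grams, the edit distance is at least qgram_distance(x, y, q) / (2q).

```python
from collections import Counter

def qgram_distance(x, y, q=2):
    """L1 distance between the q-gram occurrence profiles of x and y."""
    px = Counter(x[i:i + q] for i in range(len(x) - q + 1))
    py = Counter(y[i:i + q] for i in range(len(y) - q + 1))
    return sum(abs(px[g] - py[g]) for g in set(px) | set(py))
```

For example, qgram_distance("abcd", "bcda") is 2 (the profiles differ in "ab" and "da"), while the unit-cost edit distance of the two strings is 2, consistent with the bound 2 ≥ 2/(2·2).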


Journal ArticleDOI
01 Sep 1992
TL;DR: The most striking results were found while implementing a large software package for parametric sequence analysis, and have in turn led to faster algorithms for those tasks.
Abstract: The optimal alignment or the weighted minimum edit distance between two DNA or amino acid sequences for a given set of weights is computed by classical dynamic programming techniques, and is widely used in molecular biology. However, for DNA and amino acid sequences there is considerable disagreement about how to weight matches, mismatches, insertions/deletions (indels) and gaps. Parametric sequence alignment is the problem of computing the optimal-valued alignment between two sequences as a function of variable weights for matches, mismatches, spaces and gaps. The goal is to partition the parameter space into regions (which are necessarily convex) such that in each region one alignment is optimal throughout and such that the regions are maximal for this property. In this paper we are primarily concerned with the structure of this convex decomposition, and secondarily with the complexity of computing the decomposition. The most striking results are the following: For the special case where only matches, mismatches and spaces are counted, and where spaces are counted throughout the alignment, we show that the decomposition is surprisingly simple: all regions are infinite; there are at most n^(2/3) regions; the lines that bound the regions are all of the form β = c + (c + 0.5)α; and the entire decomposition can be found in O(knm) time, where k is the actual number of regions and n and m are the lengths of the two strings. These results were found while implementing a large software package to do parametric sequence analysis, and have in turn led to faster algorithms for those tasks.

119 citations


Book ChapterDOI
29 Apr 1992
TL;DR: A probabilistic analysis of the DP table is given in order to prove that the expected running time of the algorithm (as well as an earlier “cut-off” algorithm due to Ukkonen) is O(kn) for random text.
Abstract: We study in depth a model of non-exact pattern matching based on edit distance, which is the minimum number of substitutions, insertions, and deletions needed to transform one string of symbols into another. More precisely, the k differences approximate string matching problem specifies a text string of length n, a pattern string of length m, the number k of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We have carefully implemented and analyzed various O(kn) algorithms based on dynamic programming (DP), paying particular attention to dependence on the alphabet size b. An empirical observation on the average values of the DP tabulation makes apparent each algorithm's dependence on b. A new algorithm is presented that computes far fewer entries of the DP table. In practice, its speedup over the previous fastest algorithm is 2.5X for a binary alphabet; 4X for a four-letter alphabet; 10X for a twenty-letter alphabet. We give a probabilistic analysis of the DP table in order to prove that the expected running time of our algorithm (as well as an earlier “cut-off” algorithm due to Ukkonen) is O(kn) for random text. Furthermore, we give a heuristic argument that our algorithm is O(kn/(√b-1)) on the average, when alphabet size is taken into consideration.

89 citations
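The dynamic program that all of these O(kn) algorithms refine can be sketched as follows. This is a plain column-by-column version without the paper's entry-pruning or Ukkonen's cut-off, and the names are illustrative; the top row is kept at zero so a match may begin at any text position.

```python
def k_differences(pattern, text, k):
    """Report 1-based end positions in text where pattern matches with <= k edits.

    One DP column per text character; col[i] is the best edit distance between
    pattern[:i] and some suffix of the text read so far.
    """
    m = len(pattern)
    col = list(range(m + 1))
    hits = []
    for j, tc in enumerate(text, 1):
        prev_diag = col[0]  # col[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            tmp = col[i]
            cost = 0 if pattern[i - 1] == tc else 1
            col[i] = min(col[i] + 1,        # insertion into pattern
                         col[i - 1] + 1,    # deletion from pattern
                         prev_diag + cost)  # match / substitution
            prev_diag = tmp
        if col[m] <= k:
            hits.append(j)
    return hits
```

For instance, k_differences("abc", "xabcx", 0) reports the single exact occurrence ending at position 4.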


Book ChapterDOI
Guy Jacobson1, Kiem-Phong Vo1
29 Apr 1992
TL;DR: The famous Robinson-Schensted correspondence between permutations and pairs of Young tableaux can be extended to compute heaviest increasing subsequences and gives rise to a specialized HCS algorithm of the same type as the Apostolico-Guerra LCS algorithm.
Abstract: In this paper, we define the heaviest increasing subsequence (HIS) and heaviest common subsequence (HCS) problems as natural generalizations of the well-studied longest increasing subsequence (LIS) and longest common subsequence (LCS) problems. We show how the famous Robinson-Schensted correspondence between permutations and pairs of Young tableaux can be extended to compute heaviest increasing subsequences. Then, we point out a simple weight-preserving correspondence between the HIS and HCS problems. From this duality between the two problems, the Hunt-Szymanski LCS algorithm can be seen as a special case of the Robinson-Schensted algorithm. Our HIS algorithm immediately gives rise to a Hunt-Szymanski type of algorithm for HCS with the same time complexity. When weights are position-independent, we can exploit the structure inherent in the HIS-HCS correspondence to further refine the algorithm. This gives rise to a specialized HCS algorithm of the same type as the Apostolico-Guerra LCS algorithm.

76 citations
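For intuition, the special case the paper generalizes, the longest increasing subsequence, can be computed in O(n log n) in the patience-sorting style; the heaviest variants replace length by total weight. A minimal sketch (not the paper's tableau-based algorithm):

```python
import bisect

def lis_length(seq):
    """Length of a longest strictly increasing subsequence (patience sorting).

    tails[i] holds the smallest possible tail of an increasing
    subsequence of length i + 1 seen so far.
    """
    tails = []
    for x in seq:
        i = bisect.bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)   # extends the longest subsequence found so far
        else:
            tails[i] = x      # improves (lowers) an existing tail
    return len(tails)
```

On [3, 1, 4, 1, 5, 9, 2, 6] this returns 4 (e.g. the subsequence 1, 4, 5, 9).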


Journal ArticleDOI
TL;DR: Minimum message length encoding is a technique of inductive inference with theoretical and practical advantages that allows the posterior odds-ratio of two theories or hypotheses to be calculated in problems of aligning or relating two strings.
Abstract: Minimum message length encoding is a technique of inductive inference with theoretical and practical advantages. It allows the posterior odds-ratio of two theories or hypotheses to be calculated. Here it is applied to problems of aligning or relating two strings, in particular two biological macromolecules. We compare the r-theory, that the strings are related, with the null-theory, that they are not related. If they are related, the probabilities of the various alignments can be calculated. This is done for one-, three-, and five-state models of relation or mutation. These correspond to linear and piecewise linear cost functions on runs of insertions and deletions. We describe how to estimate parameters of a model. The validity of a model is itself a hypothesis and can be objectively tested. This is done on real DNA strings and on artificial data. The tests on artificial data indicate limits on what can be inferred in various situations. The tests on real DNA support either the three- or five-state models over the one-state model. Finally, a fast, approximate minimum message length string comparison algorithm is described.

71 citations


Patent
John C. Handley1, Thomas B. Hickey1
18 Mar 1992
TL;DR: In this paper, three OCR systems are employed for text conversion and the results generated from each of the three are merged using an edit distance algorithm to estimate a correct common text ancestor.
Abstract: Three OCR systems are employed for text conversion and the results generated from each of the three are merged using an edit distance algorithm to estimate a correct common text ancestor. To make the process computationally feasible for large strings, such as pages of documentation with 3,000 characters, the method is executed in two stages. The first procedure is carried out with each page considered as a string of lines, using the edit distance between the lines on a page to find the optimal alignment of the lines. Where differences exist and a choice must be made among three non-null lines, the procedure is then invoked on the three lines, using the edit distance between the characters on a line to find the optimal alignment. The number of computations required of the procedure is further reduced by corner-cutting that heuristically determines an upper bound on the edit distance and limits calculations to those which do not exceed the upper bound.

37 citations
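The upper-bound trick in the last step is in the spirit of Ukkonen's banded dynamic program: cells farther than k from the main diagonal cannot lie on a path of cost ≤ k and are never computed. A generic sketch of that idea (my illustration of the technique, not the patent's exact procedure):

```python
def banded_edit_distance(a, b, k):
    """Unit-cost edit distance if it is <= k, else None.

    Only DP cells with |i - j| <= k are computed; everything
    outside the band is capped at the sentinel k + 1.
    """
    if abs(len(a) - len(b)) > k:
        return None
    INF = k + 1
    prev = [j if j <= k else INF for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [INF] * (len(b) + 1)
        if i <= k:
            cur[0] = i
        lo, hi = max(1, i - k), min(len(b), i + k)
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    d = prev[len(b)]
    return d if d <= k else None
```

Values at or below k are exact, because any path through a capped out-of-band cell already costs more than k.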


Book ChapterDOI
24 May 1992
TL;DR: For a noisy clock-controlled shift register, a statistically optimal probabilistic constrained edit distance and a recursive algorithm for its efficient computation are derived, and a corresponding generalized correlation attack is proposed.
Abstract: For a noisy clock-controlled shift register, a statistically optimal probabilistic constrained edit distance and a recursive algorithm for its efficient computation are derived. A corresponding generalized correlation attack is proposed.

35 citations


Book ChapterDOI
29 Apr 1992
TL;DR: This work considers a string matching problem where the pattern is a template that matches many different strings with various degrees of perfection, and shows that the structure of P^n can be exploited and the problem reduced to essentially solving a dynamic programming problem of size O(mn).
Abstract: We consider a string matching problem where the pattern is a template that matches many different strings with various degrees of perfection. The quality of a match is given by a penalty matrix that assigns each pair of characters a score that characterizes how well the characters match. Superfluous characters in the text and superfluous characters in the pattern may also occur, and the respective penalties for such gaps in the alignment are also given by the penalty matrix. For a text T of length n and a template P of length m, we wish to find the best alignment of T with P^n, which is the concatenation of n copies of P (m will typically be much smaller than n). Such an alignment can simply be obtained by solving a dynamic programming problem of size O(n^2 m), ignoring the periodic character of P^n. We show that the structure of P^n can be exploited and the problem reduced to essentially solving a dynamic programming problem of size O(mn). If the complexity of computing gap penalties is O(1) (which is frequently the case), our algorithm runs in O(mn) time. The problem was motivated by a protein structure problem.

35 citations
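The naive O(n^2 m) baseline mentioned above can be sketched directly: align T against the concatenation P^n with an ordinary dynamic program. The function and parameter names are mine, and letting the alignment end anywhere in the repeated template (unused trailing copies are free) is one plausible convention, not necessarily the paper's.

```python
def align_to_repeats(T, P, sub, gap):
    """Minimum-penalty alignment of T against P repeated len(T) times.

    sub(a, b) is the penalty for aligning characters a and b; gap is the
    per-character penalty for an unmatched character on either side.
    len(T) copies of P always suffice for an optimal alignment.
    """
    S = P * len(T)
    prev = [j * gap for j in range(len(S) + 1)]
    for i, tc in enumerate(T, 1):
        cur = [i * gap]
        for j, pc in enumerate(S, 1):
            cur.append(min(prev[j - 1] + sub(tc, pc),  # align tc with pc
                           prev[j] + gap,              # tc unmatched
                           cur[j - 1] + gap))          # pc unmatched
        prev = cur
    return min(prev)  # alignment may end anywhere in the repeated template
```

With unit mismatch and gap penalties, aligning "abab" to the template "ab" costs 0, since the text is two exact copies.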


01 Apr 1992
TL;DR: A new systolic algorithm for the sequence alignment problem is introduced and an implementation on the SPLASH programmable logic array is described; it performs several orders of magnitude faster than conventional computers.
Abstract: This report introduces a new systolic algorithm for the sequence alignment problem. This work builds upon an existing systolic array for computing the edit distance between two sequences. The alignment array is meant to be used as the second phase in a two-phase design, with a modified edit distance array serving as the first phase. An implementation on the SPLASH programmable logic array is described. Because of the extensive pipelining in the systolic array, computing an alignment on the array takes the same amount of time as computing just the edit distance. Compared to conventional computers, the SPLASH implementation performs several orders of magnitude faster.

29 citations


Proceedings ArticleDOI
01 Jul 1992
TL;DR: This paper introduces a new noise model for learning sets of strings in the framework of PAC learning, considers the effect of the noise on learning, and shows general upper bounds on the EDIT noise rate that a learning algorithm taking the strategy of minimizing disagreements can tolerate.
Abstract: In this paper, we introduce a new noise model on learning sets of strings in the framework of PAC learning and consider the effect of the noise on learning. The instance domain is the set Σ^n of strings over a finite alphabet Σ, and the examples are corrupted by purely random errors affecting only the instances (and not the labels). We consider three types of errors on instances, called EDIT operation errors. EDIT operations consist of “insertion”, “deletion”, and “change” of a symbol in a string. We call such noise, where the examples are corrupted by random errors of EDIT operations on instances, EDIT noise. First we show general upper bounds on the EDIT noise rate that a learning algorithm taking the strategy of minimizing disagreements can tolerate, and that any learning algorithm can tolerate. Next we present an efficient algorithm that can learn a class of decision lists over the attributes “a string w contains a pattern p?” from noisy examples, under some restriction on the EDIT noise rate.

25 citations
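A minimal sketch of the EDIT noise process as described above (my own illustration, not the paper's formal definition): each instance symbol is independently hit with probability `rate`, and a hit applies a random insertion, deletion, or change.

```python
import random

def edit_noise(s, rate, alphabet="ab", rng=None):
    """Corrupt a string with random EDIT operations.

    Each position is independently affected with probability `rate`;
    an affected position gets a random insertion, deletion, or change.
    """
    rng = rng or random.Random(0)
    out = []
    for ch in s:
        if rng.random() < rate:
            op = rng.choice(["ins", "del", "chg"])
            if op == "ins":
                out.append(rng.choice(alphabet))  # insert before ch
                out.append(ch)
            elif op == "chg":
                out.append(rng.choice(alphabet))  # replace ch
            # "del": drop ch entirely
        else:
            out.append(ch)
    return "".join(out)
```

With rate 0 the instance passes through unchanged; each affected position changes the length by at most one, so the output length stays within twice the input length.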


Proceedings ArticleDOI
01 Apr 1992
TL;DR: An algorithm for measuring the similarity of run-length coded strings, using as its basic data structure an edit matrix similar to that of the classical algorithm of Wagner and Fischer.
Abstract: We give an algorithm for measuring the similarity of run-length coded strings. In run-length coding, not all individual symbols in a string are listed. Instead, one run of identical consecutive symbols is coded by giving one representative symbol together with its multiplicity. If the strings under consideration consist of long runs of identical symbols, significant reductions in memory and access time can be achieved by run-length coding. Our algorithm determines the minimum cost sequence of edit operations needed to transform one string into another. It uses as its basic data structure an edit matrix similar to that of the classical algorithm of Wagner and Fischer [1]. However, depending on the particular pair of strings to be compared, usually only a part of this edit matrix needs to be computed. In the worst case, our algorithm has a time complexity of O(n·m), where n and m give the lengths of the strings to be compared. In the best case, the time complexity is O(k·l), where k and l are the numbers of runs of identical symbols in the two strings under comparison.
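Run-length coding itself is a line of bookkeeping; a small sketch (the function name is mine):

```python
from itertools import groupby

def run_length_encode(s):
    """Code each maximal run of identical symbols as (symbol, multiplicity)."""
    return [(ch, sum(1 for _ in g)) for ch, g in groupby(s)]
```

For example, "aaabcc" codes to three runs instead of six symbols, and the best-case O(k·l) bound above is quadratic in the number of such runs rather than in the string lengths.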

Proceedings ArticleDOI
30 Aug 1992
TL;DR: The authors propose a generalized version of the string matching algorithm by Wagner and Fischer (1974) based on a parametrization of the edit cost, which computes the edit distance of A and B in terms of the parameter r.
Abstract: String matching is a useful concept in pattern recognition that is constantly receiving attention from both theoretical and practical points of view. The authors propose a generalized version of the string matching algorithm by Wagner and Fischer (1974). It is based on a parametrization of the edit cost. The authors assume constant cost for any delete and insert operation, but the cost for replacing a symbol is given as a parameter r. For any two given strings A and B, the algorithm computes the edit distance of A and B in terms of the parameter r. The authors give the new algorithm and study some of its properties. Its time complexity is O(n^2·m), where n and m are the lengths of the two strings to be compared and n ≥ m.
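For any fixed value of the parameter r, this cost model is an ordinary dynamic program; a sketch of that fixed-r case (my illustration, not the authors' parametric algorithm, which tracks the distance as a function of r):

```python
def edit_distance_param(a, b, r):
    """Edit distance with unit insert/delete cost and substitution cost r."""
    n = len(b)
    prev = list(range(n + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else r)
            cur[j] = min(sub, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[n]
```

Note how the optimal operation mix changes with r: for "abc" vs "adc", a single substitution (cost r) wins while r < 2, after which a delete plus insert (cost 2) takes over.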

Journal ArticleDOI
TL;DR: Recent algorithms for computing the modified edit distance given convex or concave gap cost functions are shown to require Ω(n^2) space for certain inputs.

01 Jan 1992
TL;DR: This work reduces the problem to finding an optimal path in a weighted grid graph and provides several results regarding the typical behavior of such a path, observing that the optimal path is asymptotically almost surely (a.s.) equal to a·n, where a is a constant and n is the sum of the lengths of both strings.
Abstract: We consider a string edit problem in a probabilistic framework. This problem is of considerable interest to many facets of science, most notably molecular biology and computer science. String editing transforms one string into another by performing a series of weighted edit operations of overall maximum (minimum) cost. An edit operation can be the deletion of a symbol, the insertion of a symbol or the substitution of a symbol. We assume that these weights can be arbitrarily distributed. We reduce the problem to finding an optimal path in a weighted grid graph and provide several results regarding the typical behavior of such a path. In particular, we observe that the optimal path (i.e., edit distance) is asymptotically almost surely (a.s.) equal to a·n, where a is a constant and n is the sum of the lengths of both strings. We also obtain explicit bounds on the constant a. More importantly, we show that the edit distance is well concentrated around its average value. As a by-product of our results, we also present a precise estimate of the number of alignments between two strings. To prove these findings we use techniques of random walks, diffusion limiting processes, generating functions and the method of bounded differences.
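The linear growth and the concentration around the mean are easy to observe empirically. A small seeded simulation (my own illustration; here both strings have length n, so the paper's n corresponds to 2n):

```python
import random

def edit_distance(a, b):
    """Standard unit-cost edit distance, row-rolling DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (ca != cb),  # match / substitution
                           prev[j] + 1,               # deletion
                           cur[j - 1] + 1))           # insertion
        prev = cur
    return prev[-1]

def edit_rate(n, rng):
    """Edit distance per symbol for two independent random binary strings."""
    x = "".join(rng.choice("01") for _ in range(n))
    y = "".join(rng.choice("01") for _ in range(n))
    return edit_distance(x, y) / n

rng = random.Random(1)
r1, r2 = edit_rate(200, rng), edit_rate(400, rng)
```

The two per-symbol rates land close together despite the different lengths, which is the concentration the abstract describes.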

Book
01 Jan 1992
TL;DR: This paper presents a probabilistic analysis of generalized suffix trees and two algorithms for the longest common subsequence of three (or more) strings.
Abstract: Probabilistic analysis of generalized suffix trees.- A language approach to string searching evaluation.- Pattern matching with mismatches: A probabilistic analysis and a randomized algorithm.- Fast multiple keyword searching.- Heaviest increasing/common subsequence problems.- Approximate regular expression pattern matching with concave gap penalties.- Matrix longest common subsequence problem, duality and Hilbert bases.- From regular expressions to DFA's using compressed NFA's.- Identifying periodic occurrences of a template with applications to protein structure.- Edit distance for genome comparison based on non-local operations.- 3-D substructure matching in protein molecules.- Fast serial and parallel algorithms for approximate tree matching with VLDC's (Extended Abstract).- Grammatical tree matching.- Theoretical and empirical comparisons of approximate string matching algorithms.- Fast and practical approximate string matching.- DZ A text compression algorithm for natural languages.- Multiple alignment with guaranteed error bounds and communication cost.- Two algorithms for the longest common subsequence of three (or more) strings.- Color Set Size problem with applications to string matching.- Computing display conflicts in string and circular string visualization.- Efficient randomized dictionary matching algorithms.- Dynamic dictionary matching with failure functions.

Book ChapterDOI
Norbert Blum1
13 Feb 1992
TL;DR: It is shown how to compute all substrings of x which have c-locally minimal distance from y, and all corresponding alignments, in O(m·n) time, where n is the length of x and m is the length of y.
Abstract: A substring x̃ of a text string x has c-locally minimal distance from a pattern string y, c ∈ N ∪ {0}, if no other substring x′ of x with smaller edit distance to y exists which overlaps x̃ by more than c characters. We show how to compute all substrings of x which have c-locally minimal distance from y, and all corresponding alignments, in O(m·n) time, where n is the length of x and m is the length of y.

Proceedings ArticleDOI
01 Apr 1992
TL;DR: An optimized version of the edit distance algorithm is described which has proven more accurate for a particular commercial application than the existing (benchmark) algorithm.
Abstract: We present an approximate string matching case study. An optimized version of the edit distance algorithm is described which has proven more accurate for a particular commercial application than the existing (benchmark) algorithm. The evolution and nature of the optimization are detailed and test results are presented.

Proceedings ArticleDOI
04 Aug 1992
TL;DR: The author first develops a 'greedy' algorithm to determine some of the LCS and then proposes a generalization to determine all LCS of the given pair of sequences.
Abstract: This paper presents a special-purpose linear array processor architecture for determining longest common subsequences (LCS) of two sequences. The algorithm uses systolic and pipelined architectures suitable for VLSI implementation. The algorithms are also suitable for implementation on parallel machines. The author first develops a 'greedy' algorithm to determine some of the LCS and then proposes a generalization to determine all LCS of the given pair of sequences. Earlier hardware algorithms were concerned with determining only the length of the LCS or the edit distance of two sequences.
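The recurrence such an array parallelizes is the standard LCS dynamic program; a serial O(mn) sketch for the length (systolic designs typically evaluate the anti-diagonals of this table in parallel):

```python
def lcs_length(a, b):
    """Length of a longest common subsequence of a and b, row-rolling DP."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1   # extend a common subsequence
            else:
                cur[j] = max(prev[j], cur[j - 1])
        prev = cur
    return prev[-1]
```

On the classic pair "ABCBDAB" and "BDCABA" this returns 4 (e.g. the subsequence "BCAB").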
