Showing papers on "Edit distance" published in 1993


Journal ArticleDOI
TL;DR: Experiments in hand-written digit recognition are presented, revealing that the normalized edit distance consistently provides better results than either unnormalized or post-normalized classical edit distances.
Abstract: Given two strings X and Y over a finite alphabet, the normalized edit distance between X and Y, d(X,Y), is defined as the minimum of W(P)/L(P), where P is an editing path between X and Y, W(P) is the sum of the weights of the elementary edit operations of P, and L(P) is the number of these operations (length of P). It is shown that in general, d(X,Y) cannot be computed by first obtaining the conventional (unnormalized) edit distance between X and Y and then normalizing this value by the length of the corresponding editing path. In order to compute normalized edit distances, an algorithm that can be implemented to work in O(m·n²) time and O(n²) memory space is proposed, where m and n are the lengths of the strings under consideration, and m ≥ n. Experiments in hand-written digit recognition are presented, revealing that the normalized edit distance consistently provides better results than either unnormalized or post-normalized classical edit distances.
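The definition of d(X,Y) as the minimum of W(P)/L(P) can be illustrated with a small sketch. This is not necessarily the algorithm proposed in the paper: it is a straightforward dynamic program indexed by path length, taking O(m·n·(m+n)) time. The function name, the default weights, and the convention that a match counts as a zero-weight operation in L(P) are assumptions made for illustration.

```python
import math

def normalized_edit_distance(x, y, w_ins=1.0, w_del=1.0, w_sub=1.0):
    """Minimum of W(P)/L(P) over all editing paths P between x and y.

    Assumption: matching symbols count as an operation of weight 0,
    so every step of the path contributes 1 to L(P).
    """
    m, n = len(x), len(y)
    if m == 0 and n == 0:
        return 0.0
    inf = math.inf
    # prev[i][j]: cheapest path reaching (i, j) with exactly k-1 operations
    prev = [[inf] * (n + 1) for _ in range(m + 1)]
    prev[0][0] = 0.0
    best = inf
    for k in range(1, m + n + 1):  # a path has at most m + n operations
        cur = [[inf] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            for j in range(n + 1):
                c = inf
                if i > 0:
                    c = min(c, prev[i - 1][j] + w_del)
                if j > 0:
                    c = min(c, prev[i][j - 1] + w_ins)
                if i > 0 and j > 0:
                    step = 0.0 if x[i - 1] == y[j - 1] else w_sub
                    c = min(c, prev[i - 1][j - 1] + step)
                cur[i][j] = c
        if cur[m][n] < inf:
            best = min(best, cur[m][n] / k)  # normalize by path length k
        prev = cur
    return best
```

With a substitution weight of 1.5 and unit insertions and deletions, the strings "a" and "b" show why post-normalization differs: the cheapest unnormalized path is a single substitution (1.5/1 = 1.5), whereas the path delete-then-insert has weight 2 over length 2, giving a normalized distance of 1.0.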

339 citations


Proceedings ArticleDOI
Dzung T. Hoang
05 Apr 1993
TL;DR: Simulations indicate that the faster Splash 2 implementation can search a database at a rate of 12 million characters per second, several orders of magnitude faster than implementations of the dynamic programming algorithm on conventional computers.
Abstract: The author describes two systolic arrays for computing the edit distance between two genetic sequences using a well-known dynamic programming algorithm. The systolic arrays have been implemented for the Splash 2 programmable logic array and are intended to be used for database searching. Simulations indicate that the faster Splash 2 implementation can search a database at a rate of 12 million characters per second, several orders of magnitude faster than implementations of the dynamic programming algorithm on conventional computers.
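Systolic implementations of the dynamic programming algorithm generally rely on the fact that all cells on one anti-diagonal of the table depend only on the two preceding anti-diagonals, so they can be computed in the same step. The sketch below is a sequential Python illustration of that evaluation order with unit costs; it does not model the Splash 2 hardware, and the function name is an assumption.

```python
def edit_distance_wavefront(a, b):
    """Unit-cost edit distance, filling the DP table one anti-diagonal at a time."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete the first i symbols of a
    for j in range(n + 1):
        d[0][j] = j  # insert the first j symbols of b
    # Cells with i + j == s are mutually independent; a systolic array
    # would update all of them in a single clock step.
    for s in range(2, m + n + 1):
        for i in range(max(1, s - n), min(m, s - 1) + 1):
            j = s - i
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n]
```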

202 citations


Journal ArticleDOI
TL;DR: This paper considers two previously proposed measures and gives two computationally efficient multiple alignment methods whose deviation from the optimal value is guaranteed to be less than a factor of two, along with a related randomized method which gives, with high probability, multiple alignments with fairly small error bounds.

198 citations


Book ChapterDOI
02 Jun 1993
TL;DR: It is shown how the searches can be done fast using the suffix tree of T augmented with the suffix links as the preprocessed form of T and applying dynamic programming over the tree.
Abstract: The classical approximate string-matching problem of finding the locations of approximate occurrences P′ of pattern string P in text string T such that the edit distance between P and P′ is ≤ k is considered. We concentrate on the special case in which T is available for preprocessing before the searches with varying P and k. It is shown how the searches can be done fast using the suffix tree of T augmented with the suffix links as the preprocessed form of T and applying dynamic programming over the tree. Three variations of the search algorithm are developed with running times O(mq + n), O(mq log q + size of the output), and O(m²q + size of the output). Here n = |T|, m = |P|, and q varies depending on the problem instance between 0 and n. In the case of the unit cost edit distance it is shown that q = O(min(n, m^(k+1)·|Σ|^k)) where Σ is the alphabet.
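For reference, the problem statement itself (report where approximate occurrences of P with at most k differences end in T) can be solved without any preprocessing by the classical O(mn) column-wise dynamic program. The sketch below shows that baseline, not the suffix-tree method of the paper; the function name and the 1-based end positions are choices made for illustration.

```python
def approx_occurrences(pattern, text, k):
    """End positions (1-based) in text of substrings within edit distance k of pattern."""
    m = len(pattern)
    # col[i] = minimum edit distance between pattern[:i] and some
    # substring of the text ending at the current position.
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        diag = col[0]  # value of the cell up and to the left
        col[0] = 0     # an occurrence may start anywhere in the text
        for i in range(1, m + 1):
            above_left = diag
            diag = col[i]
            cost = 0 if pattern[i - 1] == ch else 1
            col[i] = min(col[i] + 1,         # text symbol is extra
                         col[i - 1] + 1,     # pattern symbol is missing
                         above_left + cost)  # match or substitution
        if col[m] <= k:
            ends.append(j)
    return ends
```

For example, `approx_occurrences("abc", "xxabcxx", 1)` reports end positions 4, 5 and 6, corresponding to the substrings "ab", "abc" and "abcx".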

159 citations


Journal ArticleDOI
TL;DR: A new method for the recognition of arbitrary two-dimensional shapes based on string edit distance computation is described, which is invariant under translation, rotation, scaling and partial occlusion.

154 citations


Book ChapterDOI
02 Jun 1993
TL;DR: This paper considers two criteria of similarity for approximate tandem repeats, the Hamming distance (k mismatches) and the edit distance (k differences), and gives algorithms that report all locally optimal approximate repeats in a string S of length n.
Abstract: A perfect tandem repeat within a string S is a substring r = r_1 ... r_{2l} of S, for which r_1 ... r_l = r_{l+1} ... r_{2l}. An approximate tandem repeat is a substring r = r_1, ..., r_{l′}, ..., r_l, for which r_1, ..., r_{l′} and r_{l′+1}, ..., r_l are similar. In this paper we consider two criteria of similarity: the Hamming distance (k mismatches) and the edit distance (k differences). For a string S of length n and an integer k our algorithm reports all locally optimal approximate repeats, r = ūu, for which the Hamming distance of ū and u is at most k in O(nk log (n/k)) time, or all those for which the edit distance of ū and u is at most k, in O(nk log k log n) time.

136 citations


Book ChapterDOI
01 Nov 1993
TL;DR: This paper proposes a similarity measure for structured representations that is based on graph edit operations and shows how this similarity measure can be computed by means of state space search and considers subgraph isomorphism as a special case of graph similarity.
Abstract: A key concept in case-based reasoning is similarity. In this paper, we first propose a similarity measure for structured representations that is based on graph edit operations. Then we show how this similarity measure can be computed by means of state space search. Subsequently, subgraph isomorphism is considered as a special case of graph similarity and a new efficient algorithm for its detection is proposed. The new algorithm is particularly suitable if there is a large number of library cases being tested against an input graph. Finally, we present experimental results showing the computational efficiency of the proposed approach.

120 citations


Journal ArticleDOI
TL;DR: The generalized Boyer–Moore algorithm is shown to solve the k mismatches problem and a related algorithm is developed for the k differences problem, where the task is to find all approximate occurrences of a pattern in a text with k differences.
Abstract: The Boyer–Moore idea applied in exact string matching is generalized to approximate string matching. Two versions of the problem are considered. The k mismatches problem is to find all approximate occurrences of a pattern string (length m) in a text string (length n) with at most k mismatches. The generalized Boyer–Moore algorithm is shown (under a mild independence assumption) to solve the problem in expected time $O(kn(1/(m - k) + k/c))$, where c is the size of the alphabet. A related algorithm is developed for the k differences problem, where the task is to find all approximate occurrences of a pattern in a text with $\leqslant k$ differences (insertions, deletions, changes). Experimental evaluation of the algorithms is reported, showing that the new algorithms are often significantly faster than the old ones. Both algorithms are functionally equivalent with the Horspool version of the Boyer–Moore algorithm when $k = 0$.
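To make the k mismatches problem concrete, the sketch below is the naive O(mn) baseline that checks every alignment and abandons it once more than k mismatches are seen. It is not the generalized Boyer–Moore algorithm described above, which skips alignments using a shift table; the function name and 0-based start positions are assumptions.

```python
def k_mismatch_occurrences(pattern, text, k):
    """Start positions (0-based) where pattern matches text with at most k mismatches."""
    m, n = len(pattern), len(text)
    starts = []
    for s in range(n - m + 1):
        mismatches = 0
        for i in range(m):
            if pattern[i] != text[s + i]:
                mismatches += 1
                if mismatches > k:
                    break  # this alignment cannot qualify
        if mismatches <= k:
            starts.append(s)
    return starts
```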

117 citations


Book ChapterDOI
02 Jun 1993
TL;DR: This work considers the problem of computing the shortest series of reversals that transform one permutation to another, where a reversal takes an arbitrary substring of elements and reverses their order.
Abstract: Motivated by the problem in computational biology of reconstructing the series of chromosome inversions by which one organism evolved from another, we consider the problem of computing the shortest series of reversals that transform one permutation to another. The permutations describe the order of genes on corresponding chromosomes, and a reversal takes an arbitrary substring of elements and reverses their order.
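The reversal operation and the notion of a shortest series of reversals can be illustrated by exhaustive breadth-first search over permutations. This is only a sketch for very small inputs (the state space grows factorially) and is not the method of the paper; the function name is an assumption, and reversals are treated as unsigned.

```python
from collections import deque

def reversal_distance(source, target):
    """Length of a shortest series of reversals turning source into target (BFS)."""
    source, target = tuple(source), tuple(target)
    if sorted(source) != sorted(target):
        raise ValueError("inputs must be permutations of the same elements")
    if source == target:
        return 0
    n = len(source)
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        perm, dist = queue.popleft()
        # Reverse every substring perm[i:j] of length at least 2.
        for i in range(n - 1):
            for j in range(i + 2, n + 1):
                nxt = perm[:i] + perm[i:j][::-1] + perm[j:]
                if nxt in seen:
                    continue
                if nxt == target:
                    return dist + 1
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise AssertionError("every permutation is reachable by reversals")
```

For instance, (3, 1, 2) needs two reversals to become (1, 2, 3): first reverse the prefix "3 1" to obtain (1, 3, 2), then reverse "3 2".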

96 citations


Book ChapterDOI
02 Jun 1993
TL;DR: This work states that for most sequences, the true alignment is unknown, and a method that either assesses the significance of the optimal alignment, or that provides few “close” alternatives to the optimal one, is of great importance.
Abstract: It is widely accepted that the optimal alignment between a pair of proteins or nucleic acid sequences that minimizes the edit distance may not necessarily reflect the correct biological alignment. Alignments of proteins based on their structures or of DNA sequences based on evolutionary changes are often different from alignments that minimize edit distance. However, in many cases (e.g. when the sequences are close), the edit distance alignment is a good approximation to the biological one. Since, for most sequences, the true alignment is unknown, a method that either assesses the significance of the optimal alignment, or that provides few “close” alternatives to the optimal one, is of great importance.

31 citations


Proceedings ArticleDOI
20 Oct 1993
TL;DR: A new algorithm for string edit distance computation is proposed that needs time only linear in the length of one of the two strings to be matched, provided that the other string has undergone some preprocessing in an off-line phase.
Abstract: A new algorithm for string edit distance computation is proposed. It needs time that is only linear in the length of one of the two strings to be matched, provided that the other string has undergone some preprocessing in an off-line phase. The algorithm can be extended to matching a word against a dictionary of any size. In this case the time complexity is independent of the length of the dictionary words, and the number of entries in the dictionary.

Journal ArticleDOI
TL;DR: An algorithm for the computation of the edit distance of run-length coded strings is given, which determines the minimum cost sequence of edit operations transforming one string into another.
Abstract: An algorithm for the computation of the edit distance of run-length coded strings is given. In run-length coding, not all individual symbols in a string are explicitly listed. Instead, one run of identical consecutive symbols is coded by giving one representative symbol together with its multiplicity. The algorithm determines the minimum cost sequence of edit operations transforming one string into another. In the worst case, the algorithm has a time complexity of O(n·m), where n and m give the lengths of the strings to be compared. In the best case, the time complexity is O(k·l), where k and l are the numbers of runs of identical symbols in the two strings under comparison.
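The run-length coding described here (one representative symbol plus its multiplicity per run) can be sketched as follows. This shows only the input representation, not the edit distance algorithm of the paper; the function names and the list-of-pairs format are assumptions.

```python
from itertools import groupby

def run_length_encode(s):
    """Code each run of identical consecutive symbols as (symbol, multiplicity)."""
    return [(symbol, len(list(run))) for symbol, run in groupby(s)]

def run_length_decode(runs):
    """Expand (symbol, multiplicity) pairs back into the original string."""
    return "".join(symbol * count for symbol, count in runs)
```

A string of length n with k runs is represented by k pairs, which is why an algorithm operating on runs can do better than O(n·m) when the strings contain long runs.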

Proceedings ArticleDOI
08 Nov 1993
TL;DR: The tool classifies office documents by matching their layout structure trees against retained samples, which involves computing the edit distance between two trees using a previously developed pattern matching toolkit; experimental results show that it can classify various types of office documents, even with very few samples in the sample base.
Abstract: The authors present the design of a tool for classifying office documents. They represent a document's layout structure using an ordered labeled tree, called the layout structure tree (L-S-tree), based on a nested segmentation procedure. The tool uses a sample-based approach for learning, where concepts are learned by retaining samples and new documents are classified by matching their L-S-trees with samples. The matching process involves both computing the edit distance between two trees using a previously developed pattern matching toolkit, and calculating the degree of conceptual closeness between the documents and samples. The experimental results show that the tool is capable of classifying various types of office documents, even with very few samples in the sample base.

Proceedings ArticleDOI
03 Oct 1993
TL;DR: The architecture is a parallel realization of the standard dynamic programming algorithm proposed by Wagner and Fischer (1974), and can perform approximate string matching for variable edit costs, and makes use of simple basic cells and requires regular nearest-neighbor communication, which makes it suitable for VLSI implementation.
Abstract: The edit distance between two strings is defined as the minimum cost of a sequence of editing operations (insertions, deletions and substitutions) that convert one string into the other. This paper presents a linear systolic array for computing the edit distance between two strings over a given alphabet. An encoding scheme is proposed which reduces the number of bits required to represent a state in the computation. The architecture is a parallel realization of the standard dynamic programming algorithm proposed by Wagner and Fischer (1974), and can perform approximate string matching for variable edit costs. More importantly, the architecture does not place any constraint on the lengths of the strings that can be compared. It makes use of simple basic cells and requires regular nearest-neighbor communication, which makes it suitable for VLSI implementation. A prototype of this array is currently being built.
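The standard dynamic programming algorithm with variable edit costs can be sketched sequentially as follows. This is a plain software version, not the systolic architecture or its encoding scheme; the function name and the cost-function parameters are assumptions, with unit costs as defaults.

```python
def weighted_edit_distance(a, b,
                           ins_cost=lambda c: 1,
                           del_cost=lambda c: 1,
                           sub_cost=lambda x, y: 0 if x == y else 1):
    """Minimum total cost of insertions, deletions and substitutions turning a into b."""
    # prev[j] = cost of converting the processed prefix of a into b[:j]
    prev = [0]
    for ch in b:
        prev.append(prev[-1] + ins_cost(ch))
    for x in a:
        cur = [prev[0] + del_cost(x)]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + del_cost(x),         # delete x
                           cur[j - 1] + ins_cost(y),      # insert y
                           prev[j - 1] + sub_cost(x, y))) # substitute x by y
        prev = cur
    return prev[-1]
```

With the defaults, `weighted_edit_distance("kitten", "sitting")` is 3. Raising the substitution cost to 2 makes a substitution no cheaper than a deletion plus an insertion, and the distance becomes 5.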

Journal ArticleDOI
TL;DR: An algorithm is derived that computes the minimum edit distance associated with editing X to Y subject to the specified constraints, and possible applications to synchronization error-correcting codes and to the cryptanalysis of certain stream ciphers are discussed.

Journal ArticleDOI
01 Mar 1993
TL;DR: A structural model of cursive handwriting, built on the median axis of the word, is proposed, which constitutes an alternative to the use of the conventional, letter-by-letter analytical model and could be used for other problems involving cursive word recognition.
Abstract: The studies presented in this paper deal with the global recognition of a restricted variety of handwritten words comprising the vocabulary used to write French bank checks. Several tools have been developed within the constraints of this application, tools that relate to the general problem of off-line cursive handwriting recognition. The first difficulty when one wants to read a text is the location of words. This is done by using a top-down analysis that first locates lines of text before segmenting them into individual words. A structural model of cursive handwriting, built on the median axis of the word, is proposed. This constitutes an alternative to the use of the conventional, letter-by-letter analytical model and could be used for other problems involving cursive word recognition. According to this structural model of cursive handwriting, an edit distance is computed between the extracted structural description and the reference descriptions that are interpreted as grapheme strings. This provides an ordered list of candidates for each individual word. Feature extraction in the binary image of the word is performed using a specific line-following algorithm. Since it is possible to express the syntax of the sentences by a finite grammar, this information is used to discard the inconsistent sentences from the possible ones. These various algorithms have been tested on personal data, as well as on real check images.

Proceedings ArticleDOI
25 Oct 1993
TL;DR: A new class of minimum distance binary pattern classifiers based on a generalized Hamming distance metric applied to binary patterns is presented and it is demonstrated that calculation of their weights is very simple.
Abstract: In this paper we present a new class of minimum distance binary pattern classifiers based on a generalized Hamming distance metric applied to binary patterns. While classical minimum distance classifiers, and especially the ones using Hamming distance, consider pattern features as having the same significance for the classification task, the proposed new distance metric based classifiers assign weights to the features according to their distinguishing abilities. Concerning neural network implementation of such weighted Hamming distance based classifiers, it is demonstrated that calculation of their weights is very simple. Finally we evaluate their distinguishing properties and we find that their performance is much better than that of traditional Hamming distance classifiers.
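The idea of a minimum distance classifier with per-feature weights can be sketched as below. The abstract does not say how the weights are calculated, so the weights here are simply supplied by the caller; the function names, the prototype dictionary, and the example values are all hypothetical.

```python
def weighted_hamming(x, y, weights):
    """Sum of the weights of the positions where binary patterns x and y differ."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("patterns and weights must have equal length")
    return sum(w for a, b, w in zip(x, y, weights) if a != b)

def classify(pattern, prototypes, weights):
    """Return the label of the prototype closest to pattern (minimum distance rule)."""
    return min(prototypes,
               key=lambda label: weighted_hamming(pattern, prototypes[label], weights))

# Hypothetical example: the first feature is treated as far more distinguishing.
prototypes = {"A": (0, 0, 0, 0), "B": (1, 1, 1, 1)}
weights = (5, 1, 1, 1)
label = classify((1, 0, 0, 0), prototypes, weights)
```

In this example the pattern differs from "A" in one position and from "B" in three, so an unweighted Hamming classifier would choose "A"; with the weights above the distances are 5 and 3, and the result is "B".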

Book ChapterDOI
01 Jan 1993
TL;DR: These new algorithms for the solution of many dynamic programming recurrences for sequence comparison and for RNA secondary structure prediction effectively exploit the physical constraints of the problem to derive more efficient methods for sequence analysis.
Abstract: We consider new algorithms for the solution of many dynamic programming recurrences for sequence comparison and for RNA secondary structure prediction. The techniques upon which the algorithms are based effectively exploit the physical constraints of the problem to derive more efficient methods for sequence analysis.

Book ChapterDOI
01 Jan 1993
TL;DR: The q-gram distance yields a lower bound for the unit cost edit distance, which leads to a fast hybrid algorithm for the k-differences problem.
Abstract: Some results are summarized on approximate string-matching with a string distance function that is computable in linear time and is based on the so-called q-grams (‘n-grams’). An algorithm is given for the associated string matching problem that finds the locally best approximate occurrences of pattern P, |P| = m, in text T, |T| = n, in time O(n log(m - q)). The occurrences with distance ≤ k can be found in time O(n log k). This should be compared to the edit distance based k-differences problem for which the best algorithm currently known needs O(kn). The q-gram distance yields a lower bound for the unit cost edit distance, which leads to a fast hybrid algorithm for the k-differences problem.
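A q-gram distance of this kind is commonly defined as the L1 distance between the q-gram occurrence counts of the two strings. The sketch below assumes that definition and uses a hash-based counter rather than the linear-time construction mentioned in the abstract; the function names are assumptions.

```python
from collections import Counter

def qgram_profile(s, q):
    """Count how often each substring of length q occurs in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def qgram_distance(x, y, q):
    """L1 distance between the q-gram profiles of x and y."""
    px, py = qgram_profile(x, q), qgram_profile(y, q)
    return sum(abs(px[g] - py[g]) for g in px.keys() | py.keys())
```

Under this definition a single edit operation changes at most q of the q-grams, so it can alter the distance by at most 2q. The lower bound is therefore usually stated as qgram_distance(x, y, q) / (2q) ≤ unit cost edit distance. For "abcd" and "abed" with q = 2 the q-gram distance is 4, giving a lower bound of 1, which matches the actual edit distance.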


Book ChapterDOI
02 Jun 1993
TL;DR: In this paper, the problem of string edit is reduced to finding an optimal path in a weighted grid graph, and several results regarding the typical behavior of such a path are provided; for example, the edit distance is asymptotically almost surely (a.s.) equal to αn, where α is a constant and n is the sum of the lengths of both strings.
Abstract: We consider a string edit problem in a probabilistic framework. This problem is of considerable interest to many facets of science, most notably molecular biology and computer science. A string editing transforms one string into another by performing a series of weighted edit operations of overall maximum (minimum) cost. An edit operation can be the deletion of a symbol, the insertion of a symbol or the substitution of a symbol. We assume that these weights can be arbitrarily distributed. We reduce the problem to finding an optimal path in a weighted grid graph, and provide several results regarding the typical behavior of such a path. In particular, we observe that the optimal path (i.e., edit distance) is asymptotically almost surely (a.s.) equal to αn where α is a constant and n is the sum of lengths of both strings. We also obtain some bounds on α in the so-called independent model in which all weights (in the associated grid graph) are assumed to be independent. More importantly, we show that the edit distance is well concentrated around its average value. As a by-product of our results, we also present a precise estimate of the number of alignments between two strings. To prove these findings we use techniques of random walks, diffusion limiting processes, generating functions, and the method of bounded differences.