Showing papers on "Edit distance" published in 1993

Journal Article•DOI•

Computation of normalized edit distance and applications

[...]

01 Sep 1993-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: Experiments in hand-written digit recognition are presented, revealing that the normalized edit distance consistently provides better results than both unnormalized or post-normalized classical edit distances.

Abstract: Given two strings X and Y over a finite alphabet, the normalized edit distance between X and Y, d(X,Y), is defined as the minimum of W(P)/L(P), where P is an editing path between X and Y, W(P) is the sum of the weights of the elementary edit operations of P, and L(P) is the number of these operations (length of P). It is shown that in general, d(X,Y) cannot be computed by first obtaining the conventional (unnormalized) edit distance between X and Y and then normalizing this value by the length of the corresponding editing path. In order to compute normalized edit distances, an algorithm that can be implemented to work in O(m·n²) time and O(n²) memory space is proposed, where m and n are the lengths of the strings under consideration, and m ≥ n. Experiments in hand-written digit recognition are presented, revealing that the normalized edit distance consistently provides better results than both unnormalized or post-normalized classical edit distances.
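The definition above can be illustrated with a small sketch. It is not the paper's algorithm: it is a naive dynamic program that tracks, for every path length k, the minimum weight of an editing path with exactly k operations, and then takes the minimum of W/k. The function name, the constant cost parameters, and the convention that a match counts as a zero-weight operation of length 1 are assumptions made for this illustration. It also uses more memory than the O(n²) bound quoted in the abstract.

```python
from math import inf


def normalized_edit_distance(x, y, sub_cost=1.0, indel_cost=1.0):
    """Minimum of W(P)/L(P) over all editing paths P from x to y (naive sketch)."""
    m, n = len(x), len(y)
    if m == 0 and n == 0:
        return 0.0
    max_len = m + n  # no editing path uses more than m + n operations
    # best[i][j][k]: minimum weight of a path with exactly k operations
    # that turns x[:i] into y[:j]; inf if no such path exists.
    best = [[[inf] * (max_len + 1) for _ in range(n + 1)] for _ in range(m + 1)]
    best[0][0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            for k in range(1, i + j + 1):
                cost = inf
                if i > 0:  # delete x[i-1]
                    cost = min(cost, best[i - 1][j][k - 1] + indel_cost)
                if j > 0:  # insert y[j-1]
                    cost = min(cost, best[i][j - 1][k - 1] + indel_cost)
                if i > 0 and j > 0:  # substitute, or match at zero weight
                    w = 0.0 if x[i - 1] == y[j - 1] else sub_cost
                    cost = min(cost, best[i - 1][j - 1][k - 1] + w)
                best[i][j][k] = cost
    return min(
        best[m][n][k] / k for k in range(1, max_len + 1) if best[m][n][k] < inf
    )
```

With unit costs, "ab" and "ba" have conventional edit distance 2, reached both by two substitutions (length 2, ratio 1) and by delete, match, insert (length 3, ratio 2/3). The normalized distance is therefore 2/3, and dividing the conventional distance by the length of the two-substitution path would overestimate it.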

339 citations

Proceedings Article•DOI•

Searching genetic databases on Splash 2

Dzung T. Hoang¹•Institutions (1)

Brown University¹

05 Apr 1993

TL;DR: Simulations indicate that the faster Splash 2 implementation can search a database at a rate of 12 million characters per second, several orders of magnitude faster than implementations of the dynamic programming algorithm on conventional computers.

Abstract: The author describes two systolic arrays for computing the edit distance between two genetic sequences using a well-known dynamic programming algorithm. The systolic arrays have been implemented for the Splash 2 programmable logic array and are intended to be used for database searching. Simulations indicate that the faster Splash 2 implementation can search a database at a rate of 12 million characters per second, several orders of magnitude faster than implementations of the dynamic programming algorithm on conventional computers.
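As a rough illustration of why the dynamic programming algorithm maps onto a systolic array, the sketch below fills the edit distance table one anti-diagonal at a time. Every cell on an anti-diagonal depends only on the two previous anti-diagonals, so in hardware those cells could be updated simultaneously. This is a sequential software sketch under a unit-cost assumption, not a model of the Splash 2 design; the function name is hypothetical.

```python
def edit_distance_wavefront(a, b):
    """Unit-cost edit distance, filling the table by anti-diagonals i + j = s."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete the first i symbols of a
    for j in range(n + 1):
        d[0][j] = j  # insert the first j symbols of b
    for s in range(2, m + n + 1):
        # All interior cells with i + j == s are mutually independent.
        for i in range(max(1, s - n), min(m, s - 1) + 1):
            j = s - i
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]
```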

202 citations

Journal Article•DOI•

Efficient methods for multiple sequence alignment with guaranteed error bounds

Dan Gusfield¹•Institutions (1)

University of California, Davis¹

01 Jan 1993-Bulletin of Mathematical Biology

TL;DR: This paper considers two previously proposed measures and gives two computationally efficient multiple alignment methods whose deviation from the optimal value is guaranteed to be less than a factor of two; it also gives a related randomized method which gives, with high probability, multiple alignments with fairly small error bounds.

198 citations

Book Chapter•DOI•

Approximate String-Matching over Suffix Trees

Esko Ukkonen¹•Institutions (1)

University of Helsinki¹

02 Jun 1993

TL;DR: It is shown how the searches can be done fast using the suffix tree of T augmented with the suffix links as the preprocessed form of T and applying dynamic programming over the tree.

Abstract: The classical approximate string-matching problem of finding the locations of approximate occurrences P′ of pattern string P in text string T such that the edit distance between P and P′ is ≤ k is considered. We concentrate on the special case in which T is available for preprocessing before the searches with varying P and k. It is shown how the searches can be done fast using the suffix tree of T augmented with the suffix links as the preprocessed form of T and applying dynamic programming over the tree. Three variations of the search algorithm are developed with running times O(mq + n), O(mq log q + size of the output), and O(m²q + size of the output). Here n = |T|, m = |P|, and q varies depending on the problem instance between 0 and n. In the case of the unit cost edit distance it is shown that q = O(min(n, m^(k+1)|Σ|^k)) where Σ is the alphabet.
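For orientation, the problem statement above can be made concrete with the plain O(mn) column-by-column dynamic program, which needs no preprocessing of T. This is a baseline sketch, not the suffix-tree method of the chapter. The function name is hypothetical, unit costs are assumed, and occurrences are reported by their end positions in T (the index just past the last matched character).

```python
def approximate_occurrences(pattern, text, k):
    """End positions j (1..len(text)) such that some substring of text
    ending at j is within unit-cost edit distance k of pattern."""
    m = len(pattern)
    # col[i]: edit distance between pattern[:i] and the best-matching
    # substring of text ending at the current position; col[0] stays 0
    # because an occurrence may start anywhere in the text.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(
                old + 1,                           # c is an extra text symbol
                col[i - 1] + 1,                    # pattern[i-1] is missing in text
                prev_diag + (pattern[i - 1] != c)  # match or substitution
            )
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```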

159 citations

Journal Article•DOI•

Applications of approximate string matching to 2D shape recognition

Horst Bunke¹, Urs Bühler¹•Institutions (1)

University of Bern¹

01 Dec 1993-Pattern Recognition

TL;DR: A new method for the recognition of arbitrary two-dimensional shapes based on string edit distance computation is described, which is invariant under translation, rotation, scaling and partial occlusion.

154 citations

Book Chapter•DOI•

An Algorithm for Approximate Tandem Repeats

Gad M. Landau¹, Jeanette P. Schmidt¹•Institutions (1)

New York University¹

02 Jun 1993

TL;DR: This paper considers two criteria of similarity, the Hamming distance (k mismatches) and the edit distance (k differences), for a string S of length n and an integer k.

Abstract: A perfect tandem repeat within a string S is a substring r = r_1 ... r_{2l} of S, for which r_1 ... r_l = r_{l+1} ... r_{2l}. An approximate tandem repeat is a substring r = r_1, ..., r_{l′}, ..., r_l, for which r_1, ..., r_{l′} and r_{l′+1}, ..., r_l are similar. In this paper we consider two criteria of similarity: the Hamming distance (k mismatches) and the edit distance (k differences). For a string S of length n and an integer k our algorithm reports all locally optimal approximate repeats, r = ūu, for which the Hamming distance of ū and u is at most k in O(nk log (n/k)) time, or all those for which the edit distance of ū and u is at most k, in O(nk log k log n) time.

136 citations

Book Chapter•DOI•

Horst Bunke¹, Bruno T. Messmer¹•Institutions (1)

University of Bern¹

01 Nov 1993

TL;DR: This paper proposes a similarity measure for structured representations that is based on graph edit operations and shows how this similarity measure can be computed by means of state space search and considers subgraph isomorphism as a special case of graph similarity.

Abstract: A key concept in case-based reasoning is similarity. In this paper, we first propose a similarity measure for structured representations that is based on graph edit operations. Then we show how this similarity measure can be computed by means of state space search. Subsequently, subgraph isomorphism is considered as a special case of graph similarity and a new efficient algorithm for its detection is proposed. The new algorithm is particularly suitable if there is a large number of library cases being tested against an input graph. Finally, we present experimental results showing the computational efficiency of the proposed approach.

120 citations

Journal Article•DOI•

Approximate Boyer-Moore string matching

Jorma Tarhio, Esko Ukkonen

01 Apr 1993-SIAM Journal on Computing

TL;DR: The generalized Boyer–Moore algorithm is shown to solve the k mismatches problem and a related algorithm is developed for the k differences problem, where the task is to find all approximate occurrences of a pattern in a text with k differences.

Abstract: The Boyer–Moore idea applied in exact string matching is generalized to approximate string matching. Two versions of the problem are considered. The k mismatches problem is to find all approximate occurrences of a pattern string (length m) in a text string (length n) with at most k mismatches. The generalized Boyer–Moore algorithm is shown (under a mild independence assumption) to solve the problem in expected time $O(kn(\frac{1}{m - k} + \frac{k}{c}))$, where c is the size of the alphabet. A related algorithm is developed for the k differences problem, where the task is to find all approximate occurrences of a pattern in a text with $\le k$ differences (insertions, deletions, changes). Experimental evaluation of the algorithms is reported, showing that the new algorithms are often significantly faster than the old ones. Both algorithms are functionally equivalent with the Horspool version of the Boyer–Moore algorithm when $k = 0$.
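Since the abstract names the Horspool variant as the k = 0 case, a minimal sketch of Horspool's exact matching may help as a reference point. It is not the generalized algorithm of the paper: it covers exact matching only, and the function name and the dictionary-based shift table are choices made for this illustration.

```python
def horspool_search(pattern, text):
    """Start positions of all exact occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # shift[c]: distance from the last occurrence of c in pattern[:-1]
    # to the end of the pattern; symbols not in pattern[:-1] shift by m.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift according to the text symbol aligned with the pattern's end.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```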

117 citations

Book Chapter•DOI•

Exact and Approximation Algorithms for the Inversion Distance Between Two Chromosomes

John Kececioglu¹, David Sankoff²•Institutions (2)

University of California, Davis¹, Université de Montréal²

02 Jun 1993

TL;DR: This work considers the problem of computing the shortest series of reversals that transform one permutation to another, where a reversal takes an arbitrary substring of elements and reverses their order.

Abstract: Motivated by the problem in computational biology of reconstructing the series of chromosome inversions by which one organism evolved from another, we consider the problem of computing the shortest series of reversals that transform one permutation to another. The permutations describe the order of genes on corresponding chromosomes, and a reversal takes an arbitrary substring of elements and reverses their order.
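The reversal operation described above can be made concrete with a brute-force sketch: a breadth-first search over permutations, which returns the exact length of the shortest series of reversals but is only feasible for very short permutations. It is not one of the chapter's exact or approximation algorithms. The function name is hypothetical, and reversals are taken to be unsigned and of length at least 2.

```python
from collections import deque


def reversal_distance(source, target):
    """Length of the shortest series of reversals turning source into target."""
    source, target = tuple(source), tuple(target)
    if sorted(source) != sorted(target):
        raise ValueError("inputs must be permutations of the same elements")
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        perm, dist = queue.popleft()
        if perm == target:
            return dist
        n = len(perm)
        for i in range(n - 1):
            for j in range(i + 2, n + 1):
                # Reverse the contiguous block perm[i:j].
                nxt = perm[:i] + perm[i:j][::-1] + perm[j:]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    raise RuntimeError("unreachable: reversals generate every permutation")
```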

96 citations

Book Chapter•DOI•

On Suboptimal Alignments of Biological Sequences

Dalit Naor¹, Douglas L. Brutlag¹•Institutions (1)

Stanford University¹

02 Jun 1993

TL;DR: This work states that for most sequences, the true alignment is unknown, and a method that either assesses the significance of the optimal alignment, or that provides few “close” alternatives to the optimal one, is of great importance.

Abstract: It is widely accepted that the optimal alignment between a pair of proteins or nucleic acid sequences that minimizes the edit distance may not necessarily reflect the correct biological alignment. Alignments of proteins based on their structures or of DNA sequences based on evolutionary changes are often different from alignments that minimize edit distance. However, in many cases (e.g. when the sequences are close), the edit distance alignment is a good approximation to the biological one. Since, for most sequences, the true alignment is unknown, a method that either assesses the significance of the optimal alignment, or that provides few “close” alternatives to the optimal one, is of great importance.

31 citations

Proceedings Article•DOI•

A fast algorithm for finding the nearest neighbor of a word in a dictionary

H. Bunke

20 Oct 1993

TL;DR: A new algorithm for string edit distance computation is proposed that needs time only linear in the length of one of the two strings to be matched, provided that the other string has undergone some preprocessing in an off-line phase.

Abstract: A new algorithm for string edit distance computation is proposed. It needs time that is only linear in the length of one of the two strings to be matched, provided that the other string has undergone some preprocessing in an off-line phase. The algorithm can be extended to matching a word against a dictionary of any size. In this case the time complexity is independent of the length of the dictionary words, and the number of entries in the dictionary.

Journal Article•DOI•

An algorithm for matching run-length coded strings

Horst Bunke¹, János Csirik²•Institutions (2)

University of Bern¹, University of Szeged²

01 Dec 1993-Computing

TL;DR: An algorithm for the computation of the edit distance of run-length coded strings is given, which determines the minimum cost sequence of edit operations transforming one string into another.

Abstract: An algorithm for the computation of the edit distance of run-length coded strings is given. In run-length coding, not all individual symbols in a string are explicitly listed. Instead, one run of identical consecutive symbols is coded by giving one representative symbol together with its multiplicity. The algorithm determines the minimum cost sequence of edit operations transforming one string into another. In the worst case, the algorithm has a time complexity of O(n·m), where n and m give the lengths of the strings to be compared. In the best case, the time complexity is O(k·l), where k and l are the numbers of runs of identical symbols in the two strings under comparison.
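The run-length coding described above can be sketched in a few lines. This shows only the input representation, not the paper's matching algorithm; the function names are hypothetical.

```python
from itertools import groupby


def run_length_encode(s):
    """Code each run of identical consecutive symbols as (symbol, multiplicity)."""
    return [(symbol, len(list(run))) for symbol, run in groupby(s)]


def run_length_decode(runs):
    """Expand (symbol, multiplicity) pairs back into the plain string."""
    return "".join(symbol * count for symbol, count in runs)
```

For example, "aaabccdddd" has length n = 10 but only k = 4 runs, which is the gap between the worst-case and best-case bounds quoted in the abstract.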

Proceedings Article•DOI•

A tool for classifying office documents

X. Hao¹, J. Wang, Michael Bieber¹, P.A. Ng¹•Institutions (1)

New Jersey Institute of Technology¹

08 Nov 1993

TL;DR: The experimental results show that the tool is capable of classifying various types of office documents, even with very few samples in the sample base; the matching process involves both computing the edit distance between two trees using a previously developed pattern matching toolkit and calculating the degree of conceptual closeness between the documents and samples.

Abstract: The authors present the design of a tool for classifying office documents. They represent a document's layout structure using an ordered labeled tree, called the layout structure tree (L-S-tree), based on a nested segmentation procedure. The tool uses a sample-based approach for learning, where concepts are learned by retaining samples and new documents are classified by matching their L-S-trees with samples. The matching process involves both computing the edit distance between two trees using a previously developed pattern matching toolkit, and calculating the degree of conceptual closeness between the documents and samples. The experimental results show that the tool is capable of classifying various types of office documents, even with very few samples in the sample base.

Proceedings Article•DOI•

A systolic array for approximate string matching

R. Sastry¹, Nagarajan Ranganathan¹•Institutions (1)

University of Florida¹

03 Oct 1993

TL;DR: The architecture is a parallel realization of the standard dynamic programming algorithm proposed by Wagner and Fischer (1974), and can perform approximate string matching for variable edit costs, and makes use of simple basic cells and requires regular nearest-neighbor communication, which makes it suitable for VLSI implementation.

Abstract: The edit distance between two strings is defined as the minimum cost of a sequence of editing operations (insertions, deletions and substitutions) that convert one string into the other. This paper presents a linear systolic array for computing the edit distance between two strings over a given alphabet. An encoding scheme is proposed which reduces the number of bits required to represent a state in the computation. The architecture is a parallel realization of the standard dynamic programming algorithm proposed by Wagner and Fischer (1974), and can perform approximate string matching for variable edit costs. More importantly, the architecture does not place any constraint on the lengths of the strings that can be compared. It makes use of simple basic cells and requires regular nearest-neighbor communication, which makes it suitable for VLSI implementation. A prototype of this array is currently being built.
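The definition in the abstract corresponds to the standard dynamic programming recurrence, sketched below in sequential form with variable edit costs. The function name and the choice to pass costs as callables are assumptions for this illustration. The sketch keeps two rows of the table and says nothing about the encoding scheme or the systolic architecture of the paper.

```python
def weighted_edit_distance(source, target, ins_cost, del_cost, sub_cost):
    """Minimum total cost of insertions, deletions and substitutions
    that convert source into target.

    ins_cost(c) and del_cost(c) give the cost of inserting or deleting
    symbol c; sub_cost(a, b) gives the cost of replacing a by b (a != b).
    """
    # prev[j]: cost of converting the processed prefix of source into target[:j].
    prev = [0]
    for t in target:
        prev.append(prev[-1] + ins_cost(t))
    for s in source:
        cur = [prev[0] + del_cost(s)]
        for j, t in enumerate(target, start=1):
            cur.append(min(
                prev[j] + del_cost(s),                            # delete s
                cur[j - 1] + ins_cost(t),                         # insert t
                prev[j - 1] + (0 if s == t else sub_cost(s, t)),  # keep or replace
            ))
        prev = cur
    return prev[-1]
```

With unit costs this reduces to the familiar edit distance. If a substitution is made more expensive than a deletion followed by an insertion, the minimum switches to the cheaper pair of operations.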

Journal Article•DOI•

String editing under a combination of constraints

Slobodan Petrovic, Jovan Dj. Golic¹•Institutions (1)

University of Belgrade¹

15 Oct 1993-Information Sciences

TL;DR: An algorithm is derived that computes the minimum edit distance associated with editing X to Y subject to the specified constraints and possible applications for the synchronization error-correcting codes and for the cryptanalysis of certain stream ciphers are discussed.

Journal Article•DOI•

Automatic reading of the literal amount of bank checks

Thierry Paquet¹, Yves Lecourtier¹•Institutions (1)

University of Rouen¹

01 Mar 1993

TL;DR: A structural model of cursive handwriting, built on the median axis of the word, is proposed, which constitutes an alternative to the use of the conventional, letter-by-letter analytical model and could be used for other problems involving cursive word recognition.

Abstract: The studies presented in this paper deal with the global recognition of a restricted variety of handwritten words comprising the vocabulary used to write French bank checks. Several tools have been developed within the constraints of this application, tools that relate to the general problem of off-line cursive handwriting recognition. The first difficulty when one wants to read a text is the location of words. This is done by using a top-down analysis that first locates lines of text before segmenting them into individual words. A structural model of cursive handwriting, built on the median axis of the word, is proposed. This constitutes an alternative to the use of the conventional, letter-by-letter analytical model and could be used for other problems involving cursive word recognition. According to this structural model of cursive handwriting, an edit distance is computed between the extracted structural description and the reference descriptions that are interpreted as grapheme strings. This provides an ordered list of candidates for each individual word. Feature extraction in the binary image of the word is performed using a specific line-following algorithm. Since it is possible to express the syntax of the sentences by a finite grammar, this information is used to discard the inconsistent sentences from the possible ones. These various algorithms have been tested on personal data, as well as on real check images.

Proceedings Article•DOI•

Pattern classification using a generalised Hamming distance metric

N. Gaitanis, G. Kapogianopoulos, Dimitrios A. Karras

25 Oct 1993

TL;DR: A new class of minimum distance binary pattern classifiers based on a generalized Hamming distance metric applied to binary patterns is presented and it is demonstrated that calculation of their weights is very simple.

Abstract: In this paper we present a new class of minimum distance binary pattern classifiers based on a generalized Hamming distance metric applied to binary patterns. While classical minimum distance classifiers and especially the ones using Hamming-distance consider pattern features as having the same significance for the classification task, the proposed new distance metric based classifiers assign weights to the features according to their distinguishing abilities. Concerning neural network implementation of such weighted Hamming distance based classifiers, it is demonstrated that calculation of their weights is very simple. Finally we evaluate their distinguishing properties and we find that their performance is much better than the one of traditional Hamming distance classifiers.
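The idea of weighting features in a Hamming-distance classifier can be sketched as follows. The abstract does not give the paper's weight computation, so the weights here are simply supplied by the caller, and the function names and prototype dictionary are hypothetical.

```python
def weighted_hamming(x, y, weights):
    """Sum of the weights of the positions where binary patterns x and y differ."""
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


def classify(pattern, prototypes, weights):
    """Label of the prototype at minimum weighted Hamming distance."""
    return min(
        prototypes,
        key=lambda label: weighted_hamming(pattern, prototypes[label], weights),
    )


# Hypothetical two-class example: feature 0 is treated as far more
# distinguishing than the other three features.
prototypes = {"A": (0, 0, 1, 1), "B": (1, 1, 0, 0)}
sample = (0, 1, 0, 0)
plain = classify(sample, prototypes, (1, 1, 1, 1))     # classical Hamming distance
weighted = classify(sample, prototypes, (5, 1, 1, 1))  # feature 0 dominates
```

In this toy case the unweighted classifier picks "B" (one mismatch against three), while the weighted one picks "A" because the single mismatch with "B" falls on the heavily weighted feature.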

Book Chapter•DOI•

Efficient Algorithms for Sequence Analysis

David Eppstein¹, Zvi Galil²,³, Raffaele Giancarlo⁴, Giuseppe F. Italiano³,⁵•Institutions (5)

University of California¹, Tel Aviv University², Columbia University³, Bell Labs⁴, Sapienza University of Rome⁵

01 Jan 1993

TL;DR: These new algorithms for the solution of many dynamic programming recurrences for sequence comparison and for RNA secondary structure prediction effectively exploit the physical constraints of the problem to derive more efficient methods for sequence analysis.

Abstract: We consider new algorithms for the solution of many dynamic programming recurrences for sequence comparison and for RNA secondary structure prediction. The techniques upon which the algorithms are based effectively exploit the physical constraints of the problem to derive more efficient methods for sequence analysis.

Book Chapter•DOI•

Approximate string-matching and the q-gram distance

Esko Ukkonen¹•Institutions (1)

University of Helsinki¹

01 Jan 1993

TL;DR: The q-gram distance yields a lower bound for the unit cost edit distance, which leads to a fast hybrid algorithm for the k-differences problem.

Abstract: Some results are summarized on approximate string-matching with a string distance function that is computable in linear time and is based on the so-called q-grams ('n-grams'). An algorithm is given for the associated string matching problem that finds the locally best approximate occurrences of pattern P, |P| = m, in text T, |T| = n, in time O(n log(m - q)). The occurrences with distance ≤ k can be found in time O(n log k). This should be compared to the edit distance based k-differences problem for which the best algorithm currently known needs O(kn). The q-gram distance yields a lower bound for the unit cost edit distance, which leads to a fast hybrid algorithm for the k-differences problem.
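A small sketch of the q-gram distance between two whole strings may make the lower-bound remark concrete. It assumes the usual definition, the L1 distance between q-gram occurrence counts, and uses hash-based counting rather than whatever linear-time construction the chapter describes. The function names are hypothetical.

```python
from collections import Counter


def qgram_profile(s, q):
    """Occurrence count of every length-q substring of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(x, y, q):
    """L1 distance between the q-gram profiles of x and y."""
    px, py = qgram_profile(x, q), qgram_profile(y, q)
    return sum(abs(px[g] - py[g]) for g in px.keys() | py.keys())
```

The lower bound is commonly stated as qgram_distance(x, y, q) ≤ 2q · ed(x, y), since one edit operation disturbs at most q q-grams in each string. For "abcd" and "abce" with q = 2 the q-gram distance is 2, the edit distance is 1, and 2 ≤ 4 holds.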

Journal Article•

Data Compression of ECG Based on the Edit Distance Algorithms (Special Section on ECG Data Compression)

Hiroyoshi Morita, Kingo Kobayashi

25 Dec 1993-IEICE Transactions on Information and Systems

Book Chapter•DOI•

Analysis of a String Edit Problem in a Probabilistic Framework (Extended Abstract)

Guy Louchard¹, Wojciech Szpankowski²•Institutions (2)

Université libre de Bruxelles¹, Purdue University²

02 Jun 1993

TL;DR: The string edit problem is reduced to finding an optimal path in a weighted grid graph, and several results regarding the typical behavior of such a path are provided; in particular, the edit distance is asymptotically almost surely (a.s.) equal to αn, where α is a constant and n is the sum of lengths of both strings.

Abstract: We consider a string edit problem in a probabilistic framework. This problem is of considerable interest to many facets of science, most notably molecular biology and computer science. A string editing transforms one string into another by performing a series of weighted edit operations of overall maximum (minimum) cost. An edit operation can be the deletion of a symbol, the insertion of a symbol or the substitution of a symbol. We assume that these weights can be arbitrarily distributed. We reduce the problem to finding an optimal path in a weighted grid graph, and provide several results regarding a typical behavior of such a path. In particular, we observe that the optimal path (i.e., edit distance) is asymptotically almost surely (a.s.) equal to αn where α is a constant and n is the sum of lengths of both strings. We also obtained some bounds on α in the so-called independent model in which all weights (in the associated grid graph) are assumed to be independent. More importantly, we show that the edit distance is well concentrated around its average value. As a by-product of our results, we also present a precise estimate of the number of alignments between two strings. To prove these findings we use techniques of random walks, diffusion limiting processes, generating functions, and the method of bounded difference.
