
Showing papers on "Edit distance published in 1995"


Posted Content
Kemal Oflazer1
TL;DR: Error-tolerant recognition, as described in this paper, enables the recognition of strings that deviate mildly from any string in the regular set recognized by the underlying finite state recognizer; it can be applied to morphological analysis of any language whose morphology is fully captured by a single (and possibly very large) finite state transducer.
Abstract: Error-tolerant recognition enables the recognition of strings that deviate mildly from any string in the regular set recognized by the underlying finite state recognizer. Such recognition has applications in error-tolerant morphological processing, spelling correction, and approximate string matching in information retrieval. After a description of the concepts and algorithms involved, we give examples from two applications: In the context of morphological analysis, error-tolerant recognition allows misspelled input word forms to be corrected, and morphologically analyzed concurrently. We present an application of this to error-tolerant analysis of agglutinative morphology of Turkish words. The algorithm can be applied to morphological analysis of any language whose morphology is fully captured by a single (and possibly very large) finite state transducer, regardless of the word formation processes and morphographemic phenomena involved. In the context of spelling correction, error-tolerant recognition can be used to enumerate correct candidate forms from a given misspelled string within a certain edit distance. Again, it can be applied to any language with a word list comprising all inflected forms, or whose morphology is fully described by a finite state transducer. We present experimental results for spelling correction for a number of languages. These results indicate that such recognition works very efficiently for candidate generation in spelling correction for many European languages such as English, Dutch, French, German, Italian (and others) with very large word lists of root and inflected forms (some containing well over 200,000 forms), generating all candidate solutions within 10 to 45 milliseconds (with edit distance 1) on a SparcStation 10/41. For spelling correction in Turkish, error-tolerant recognition operating with a (circular) recognizer of Turkish words (with about 29,000 states and 119,000 transitions) can generate all candidate words in less than 20 milliseconds (with edit distance 1).
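As an illustration of the candidate-generation step described above, the sketch below performs error-tolerant lookup in a trie-structured lexicon, pruning any branch whose partial edit-distance row already exceeds the bound k. It is a minimal sketch of the general idea with unit costs, not Oflazer's transducer-based implementation; the names Trie and tolerant_lookup are made up for the example.

```python
# Minimal sketch of error-tolerant lexicon lookup within edit distance k.
# Illustration of the general idea only, not the paper's transducer-based
# implementation; Trie and tolerant_lookup are made-up names.

class Trie:
    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

def tolerant_lookup(trie, word, k):
    """Return all lexicon entries within edit distance k of `word`."""
    results = []
    first_row = list(range(len(word) + 1))       # distance from the empty prefix

    def walk(node, prefix, prev_row):
        for ch, child in node.children.items():
            row = [prev_row[0] + 1]
            for i in range(1, len(word) + 1):
                cost = 0 if word[i - 1] == ch else 1
                row.append(min(row[i - 1] + 1,            # input char unmatched
                               prev_row[i] + 1,           # lexicon char unmatched
                               prev_row[i - 1] + cost))   # match / substitution
            if child.is_word and row[-1] <= k:
                results.append((prefix + ch, row[-1]))
            if min(row) <= k:                    # cut-off: branch still viable
                walk(child, prefix + ch, row)

    walk(trie, "", first_row)
    return results

lexicon = Trie()
for w in ["kitten", "mitten", "sitting", "kitchen"]:
    lexicon.insert(w)
print(tolerant_lookup(lexicon, "kiten", 1))   # [('kitten', 1)]
```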

190 citations


Book ChapterDOI
05 Jul 1995
TL;DR: This work focuses on the case in which T is fixed and preprocessed in linear time, while P and k vary over consecutive searches, and gives an O(mq + t_occ) time and O(q) space algorithm, where q ≤ n depends on the problem instance, and t_occ is the size of the output.
Abstract: Let T be a text of length n and P a pattern of length m, both strings over a fixed finite alphabet σ. We wish to find all approximate occurrences of P in T having weighted edit distance at most k from P: this is the approximate substring matching problem. We focus on the case in which T is fixed and preprocessed in linear time, while P and k vary over consecutive searches. We give an O(mq + t_occ) time and O(q) space algorithm, where q ≤ n depends on the problem instance, and t_occ is the size of the output. The running time is proportional to the amount of matching, in the worst case as fast as standard dynamic programming. The algorithm uses the suffix tree representation of the text. The best previous algorithm requires O(mq log q + t_occ) time and O(mq) space.
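For scale, the plain dynamic-programming baseline for this problem (no preprocessing of T, unit costs) runs in O(mn) time; the sketch below shows that baseline only, not the suffix-tree algorithm of the paper.

```python
# Baseline O(mn) approximate substring matching with unit edit costs
# (Sellers-style dynamic programming). This is only a reference point for the
# preprocessed suffix-tree algorithm described above, not that algorithm.

def approx_occurrences(text, pattern, k):
    """Yield (end_position, distance) for substrings of text matching pattern
    with edit distance at most k."""
    m = len(pattern)
    prev = list(range(m + 1))                 # column for the empty text prefix
    for j, tch in enumerate(text, start=1):
        curr = [0]                            # an occurrence may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == tch else 1
            curr.append(min(prev[i] + 1,          # text char left unmatched
                            curr[i - 1] + 1,      # pattern char left unmatched
                            prev[i - 1] + cost))  # match / substitution
        if curr[m] <= k:
            yield j, curr[m]
        prev = curr

print(list(approx_occurrences("abracadabra", "brac", 1)))
# [(4, 1), (5, 0), (6, 1), (11, 1)]
```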

81 citations


Book ChapterDOI
05 Jul 1995
TL;DR: To the authors' knowledge, this work gives the first polynomial time algorithm ever presented to solve the edit distance problem between undirected acyclic graphs.
Abstract: Using these simple, efficient algorithms, a user can submit a query structure and obtain those data structures approximately matching the query. To our knowledge, this work gives the first polynomial time algorithm ever presented to solve the edit distance problem between undirected acyclic graphs. We will have this algorithm implemented within a few months and will make it available to the community.

69 citations


Journal ArticleDOI
TL;DR: This paper focuses on string distance computation based on a set of edit operations; the computation is based on dynamic programming and has a time complexity of O(n·m), where n and m are the lengths of the two strings to be compared.
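The O(n·m) dynamic program referred to here is the classical Wagner-Fischer recurrence; a minimal sketch with unit costs:

```python
# Classical O(n*m) edit-distance dynamic program (Wagner-Fischer),
# shown here with unit costs for insertion, deletion and substitution.

def edit_distance(a, b):
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(m + 1):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match / substitution
    return d[n][m]

print(edit_distance("kitten", "sitting"))  # 3
```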

69 citations


Journal ArticleDOI
TL;DR: In this paper, an algorithm for computing the normalized edit distance (NED) between two strings X and Y is proposed; the NED is defined as the minimum quotient between the sum of weights of the edit operations required to transform X into Y and the length of the editing path corresponding to these operations.
Abstract: The normalized edit distance (NED) between two strings X and Y is defined as the minimum quotient between the sum of weights of the edit operations required to transform X into Y and the length of the editing path corresponding to these operations. An algorithm for computing the NED was introduced by Marzal and Vidal (1993) that exhibits O(mn^2) computing complexity, where m and n are the lengths of X and Y. We propose here an algorithm that is observed to require in practice the same O(mn) computing resources as the conventional unnormalized edit distance algorithm does. The performance of this algorithm is illustrated through computational experiments with synthetic data, as well as with real data consisting of OCR chain-coded strings.
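A direct (and deliberately unoptimised) way to compute the normalized edit distance exactly as defined above is a dynamic program indexed additionally by path length, in the spirit of the O(mn^2) Marzal-Vidal method; the sketch below assumes unit operation weights and is not the faster algorithm proposed in the paper.

```python
# Normalized edit distance: minimum over editing paths of (total weight) / (path length).
# Direct O(m*n*(m+n)) dynamic program over (i, j, L), in the spirit of Marzal and Vidal;
# unit weights are assumed here. This is not the faster algorithm of the paper.

import math

def normalized_edit_distance(x, y, w_ins=1.0, w_del=1.0, w_sub=1.0):
    m, n = len(x), len(y)
    max_len = m + n                     # longest possible editing path
    INF = math.inf
    # D[i][j][L] = minimum weight of an editing path from (0,0) to (i,j) with L operations
    D = [[[INF] * (max_len + 1) for _ in range(n + 1)] for _ in range(m + 1)]
    D[0][0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            for L in range(1, max_len + 1):
                best = INF
                if i > 0:                               # delete x[i-1]
                    best = min(best, D[i - 1][j][L - 1] + w_del)
                if j > 0:                               # insert y[j-1]
                    best = min(best, D[i][j - 1][L - 1] + w_ins)
                if i > 0 and j > 0:                     # match or substitute
                    cost = 0.0 if x[i - 1] == y[j - 1] else w_sub
                    best = min(best, D[i - 1][j - 1][L - 1] + cost)
                D[i][j][L] = best
    if m == 0 and n == 0:
        return 0.0
    return min(D[m][n][L] / L for L in range(1, max_len + 1) if D[m][n][L] < INF)

# 0.5: path "match a, delete b, match c, insert d" has weight 2 and length 4.
print(normalized_edit_distance("abc", "acd"))
```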

59 citations


Journal ArticleDOI
01 Jan 1995
TL;DR: A generalized version of the string matching algorithm by Wagner and Fischer (1974) is proposed, based on a parametrization of the edit cost, which, for any two strings, computes their edit distance in terms of the parameter τ.
Abstract: A generalized version of the string matching algorithm by Wagner and Fischer (1974) is proposed. It is based on a parametrization of the edit cost. We assume constant cost for any delete and insert operation, but the cost for replacing a symbol is given as a parameter τ. For any two strings A and B, our algorithm computes their edit distance in terms of the parameter τ. We give the new algorithm, study some of its properties, and discuss potential applications to pattern recognition.
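To make the cost model concrete, the sketch below evaluates the distance for individual fixed values of τ (unit insertion/deletion cost, substitution cost τ). The paper's algorithm computes the distance as a function of τ; this sketch does not.

```python
# Edit distance with unit insertion/deletion costs and substitution cost tau,
# evaluated for one fixed tau at a time. The paper computes the distance
# symbolically as a function of tau; this sketch only evaluates single values.

def edit_distance_tau(a, b, tau):
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = float(i)
    for j in range(m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else tau
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + sub)   # match / substitution
    return d[n][m]

# For small tau a substitution is cheaper than delete+insert; for tau >= 2 it never is.
for tau in (0.5, 1.0, 2.5):
    print(tau, edit_distance_tau("spam", "spun", tau))   # 1.0, 2.0, 4.0
```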

50 citations


Journal ArticleDOI
TL;DR: The architecture is a parallel realization of the standard dynamic programming algorithm proposed by Wagner and Fischer (1974) and can perform approximate string matching for variable edit costs; it makes use of simple basic cells and requires regular nearest neighbor communication, which makes it suitable for VLSI implementation.
Abstract: The edit distance between two strings a_1, ..., a_m and b_1, ..., b_n is the minimum cost of a sequence of editing operations (insertions, deletions and substitutions) that convert one string into the other. This paper describes the design and implementation of a linear systolic array chip for computing the edit distance between two strings over a given alphabet. An encoding scheme is proposed which reduces the number of bits required to represent a state in the computation. The architecture is a parallel realization of the standard dynamic programming algorithm proposed by Wagner and Fischer (1974), and can perform approximate string matching for variable edit costs. More importantly, the architecture does not place any constraint on the lengths of the strings that can be compared. It makes use of simple basic cells and requires regular nearest neighbor communication, which makes it suitable for VLSI implementation. A prototype of this array has been built at the University of South Florida.
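The parallelism exploited by such a systolic array comes from the fact that every cell on one anti-diagonal of the dynamic-programming table depends only on the two preceding anti-diagonals. The sequential sketch below shows that wavefront evaluation order only; it is not a model of the chip described in the paper.

```python
# Anti-diagonal (wavefront) evaluation of the edit-distance table: every cell on
# diagonal i+j depends only on diagonals i+j-1 and i+j-2, so all cells of one
# diagonal could be computed in parallel, which is what a systolic array exploits.
# This sequential sketch only shows the dependency structure, not the hardware.

def edit_distance_wavefront(a, b):
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for k in range(n + m + 1):                  # k indexes the anti-diagonal i + j
        for i in range(max(0, k - m), min(n, k) + 1):
            j = k - i
            if i == 0:
                d[i][j] = j
            elif j == 0:
                d[i][j] = i
            else:
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,
                              d[i][j - 1] + 1,
                              d[i - 1][j - 1] + cost)
    return d[n][m]

print(edit_distance_wavefront("edit", "distance"))  # 6
```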

40 citations


Patent
05 Apr 1995
TL;DR: In this article, a method for comparing an electronic handwritten pattern to a stored string is presented, where a linear systolic array processor determines an edit distance between the string and the pattern, and a plurality of edit distance components are generated based on the comparison.
Abstract: Apparatus and a method for comparing an electronic handwritten pattern to a stored string are provided. The string includes a group of portions, each having at least one stroke. Movement of a stylus forms the pattern, and a sequence of strokes is generated. Each stroke represents a stylus movement within a predetermined alphabet. The sequence of strokes has a plurality of portions. A linear systolic array processor determines an edit distance between the string and the pattern. The processor compares a first portion of the string to a first portion of the pattern. A plurality of edit distance components are generated based on the comparison. Each component corresponds to a different set of operations that transforms the first portion of the stored string into the first portion of the pattern. The components are calculated based on a further comparison between additional portions of the stored string and the pattern. The component which has a minimum value is selected. The comparison is performed between each respective portion of the pattern and the corresponding portion of the stored string. The total edit distance is based on the component selected during a last comparison between a last portion of the stored string and a last portion of the pattern.

33 citations


Book ChapterDOI
05 Jul 1995
TL;DR: The case when the given tree is a regular d-ary tree for some fixed d is considered, and a (d+1)/(d−1)-approximation algorithm is given for this problem, running in time O(d(2kn)^d + n^2k^2d), where k is the number of leaves in the tree and n is the maximum length of any of the sequences labeling the leaves.
Abstract: We consider the problem of aligning sequences related by a given evolutionary tree: given a fixed tree with its leaves labeled with sequences, find ancestral sequences to label the internal nodes so as to minimize the total cost of all the edges in the tree. The cost of an edge is the edit distance between the sequences labeling its endpoints. In this paper, we consider the case when the given tree is a regular d-ary tree for some fixed d and provide a (d+1)/(d−1)-approximation algorithm for this problem that runs in time O(d(2kn)^d + n^2k^2d), where k is the number of leaves in the tree and n is the maximum length of any of the sequences labeling the leaves.

23 citations


Proceedings ArticleDOI
14 Aug 1995
TL;DR: Column segmentation logically precedes OCR in the document analysis process; the trainable algorithm correctly segments the page image for a (fairly) wide range of parameter values, although small, local and repairable errors may be made, an effect measured by a repair cost function.
Abstract: Column segmentation logically precedes OCR in the document analysis process. The trainable algorithm XYCUT relies on horizontal and vertical binary profiles to produce an XY-tree representing the column structure of a page of a technical document in a single pass through the bit image. Training against ground truth adjusts a single, resolution independent, parameter using only local information and guided by an edit distance function. The algorithm correctly segments the page image for a (fairly) wide range of parameter values, although small, local and repairable errors may be made, an effect measured by a repair cost function.

19 citations


Proceedings ArticleDOI
14 Aug 1995
TL;DR: An iterative supervised automatic learning algorithm is proposed which determines the costs for the edit operations; experiments reveal that this method significantly improves the recognition accuracy.
Abstract: We describe the realization of a dictionary based lexical postprocessing approach. A character hypotheses lattice (CHL) serves as input which is compared with the words of the vocabulary, using a generalization of the weighted edit distance. The search for the best word is based on a depth first traversal through the paths of the CHL and is directed by several heuristics to achieve a reasonable processing speed without deteriorating the recognition rate significantly. An iterative supervised automatic learning algorithm is proposed which determines the costs for the edit operations. Experiments reveal that this method significantly improves the recognition accuracy.

Proceedings ArticleDOI
30 Mar 1995
TL;DR: The Damerau-Levenshtein string difference metric is generalized in two ways to more accurately compensate for the types of errors that are present in the script recognition domain.
Abstract: In this paper the Damerau-Levenshtein string difference metric is generalized in two ways to more accurately compensate for the types of errors that are present in the script recognition domain. First, the basic dynamic programming method for computing such a measure is extended to allow for merges, splits and two-letter substitutions. Second, edit operations are refined into categories according to the effect they have on the visual 'appearance' of words. A set of recognizer-independent constraints is developed to reflect the severity of the information lost due to each operation. These constraints are solved to assign specific costs to the operations. Experimental results on 2,335 corrupted strings and a lexicon of 21,299 words show higher correcting rates than with the original form.
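A hedged sketch of the first extension: the usual recurrence with additional merge (two source letters recognised as one), split (one source letter recognised as two) and two-letter substitution operations. The cost values below are illustrative placeholders, not the constraint-derived costs of the paper.

```python
# Edit distance extended with merge (two source letters -> one target letter),
# split (one source letter -> two target letters) and two-letter substitution,
# in addition to the usual insertion, deletion and substitution. The costs are
# illustrative placeholders, not the recognizer-independent costs of the paper.

def extended_edit_distance(a, b, ins=1.0, dele=1.0, sub=1.0,
                           merge=0.7, split=0.7, pair=0.9):
    n, m = len(a), len(b)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:
                d[i][j] = min(d[i][j], d[i - 1][j] + dele)          # delete a[i-1]
            if j > 0:
                d[i][j] = min(d[i][j], d[i][j - 1] + ins)           # insert b[j-1]
            if i > 0 and j > 0:
                cost = 0.0 if a[i - 1] == b[j - 1] else sub
                d[i][j] = min(d[i][j], d[i - 1][j - 1] + cost)      # match / substitute
            if i > 1 and j > 0:
                d[i][j] = min(d[i][j], d[i - 2][j - 1] + merge)     # a[i-2:i] merged into b[j-1]
            if i > 0 and j > 1:
                d[i][j] = min(d[i][j], d[i - 1][j - 2] + split)     # a[i-1] split into b[j-2:j]
            if i > 1 and j > 1 and a[i - 2:i] != b[j - 2:j]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + pair)      # two-letter substitution
    return d[n][m]

# 'm' misread as 'rn' is one split away under this model:
print(extended_edit_distance("modern", "rnodern"))   # 0.7
```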

Journal ArticleDOI
TL;DR: A new algorithm for string edit distance computation is given, which assumes that one of the two strings to be compared is a dictionary entry known a priori; this dictionary word is converted in an off-line phase into a deterministic finite state automaton.
Abstract: A new algorithm for string edit distance computation is given. The algorithm assumes that one of the two strings to be compared is a dictionary entry that is known a priori. This dictionary word is converted in an off-line phase into a deterministic finite state automaton. Given an input string and the automaton derived from the dictionary word, the computation of the edit distance between the two strings corresponds to a traversal of the states of the automaton. This procedure needs time which is only linear in the length of the input string. It is independent of the length of the dictionary word. Given not only one but N different dictionary words, their corresponding automata can be combined into a single deterministic finite state automaton. Thus the computation of the edit distance between the input word and each dictionary entry, and the determination of the nearest neighbor in the dictionary, need time that is only linear in the length of the input string. However, the number of states of the automaton is exponential.
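The automaton view can be mimicked in a simple, hedged form by lazily caching column transitions for one fixed dictionary word: the state is the current dynamic-programming column, and each (state, input symbol) transition is computed once and reused. This only illustrates the idea of compiling the dictionary word ahead of time; it is not the explicit deterministic automaton construction of the paper.

```python
# Lazy "automaton" view of edit distance against one fixed dictionary word:
# a state is the whole dynamic-programming column, and transitions on input
# symbols are computed once and cached. This only illustrates the idea; it is
# not the explicit deterministic automaton construction of the paper.

class DictionaryWordMatcher:
    def __init__(self, word):
        self.word = word
        self.start = tuple(range(len(word) + 1))
        self.transitions = {}                      # (state, symbol) -> state

    def step(self, state, ch):
        key = (state, ch)
        if key not in self.transitions:
            col = [state[0] + 1]
            for i in range(1, len(self.word) + 1):
                cost = 0 if self.word[i - 1] == ch else 1
                col.append(min(col[i - 1] + 1, state[i] + 1, state[i - 1] + cost))
            self.transitions[key] = tuple(col)
        return self.transitions[key]

    def distance(self, s):
        state = self.start
        for ch in s:                               # one transition per input symbol
            state = self.step(state, ch)
        return state[-1]

m = DictionaryWordMatcher("necessary")
print(m.distance("neccessary"), m.distance("nesessary"))   # 1 1
```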

Journal ArticleDOI
TL;DR: A discriminative error criterion is proposed for optimising the substitution-cost parameters of the minimal edit distance; the method has been successfully tested on a string-to-string matching problem.

Journal ArticleDOI
TL;DR: This paper observes that the optimal path is almost surely (a.s.) equal to αn for large n where α is a constant and n is the sum of lengths of both strings, and derives some bounds for the constant α.
Abstract: We consider a string editing problem in a probabilistic framework. This problem is of considerable interest to many facets of science, most notably molecular biology and computer science. A string editing transforms one string into another by performing a series of weighted edit operations of overall maximum (minimum) cost. The problem is equivalent to finding an optimal path in a weighted grid graph. In this paper we provide several results regarding a typical behaviour of such a path. In particular, we observe that the optimal path (i.e. edit distance) is almost surely (a.s.) equal to αn for large n where α is a constant and n is the sum of lengths of both strings. More importantly, we show that the edit distance is well concentrated around its average value. In the so called independent model in which all weights (in the associated grid graph) are statistically independent, we derive some bounds for the constant α. As a by-product of our results, we also present a precise estimate of the number of alignments between two strings. To prove these findings we use techniques of random walks, diffusion limiting processes, generating functions, and the method of bounded difference. © 1995, Cambridge University Press. All rights reserved.
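The constant α can be estimated empirically for a concrete model by averaging edit distance divided by n over random string pairs; the rough simulation below assumes uniform i.i.d. symbols and unit costs, which is only one instance of the weighted models treated in the paper.

```python
# Rough Monte Carlo estimate of the constant alpha: for random strings, the
# edit distance divided by n (n = sum of both lengths) concentrates around a
# constant for large n. Uniform i.i.d. symbols and unit costs are assumed here;
# the paper treats more general weighted models.

import random

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def estimate_alpha(alphabet="ab", half_length=300, trials=10, seed=0):
    rng = random.Random(seed)
    ratios = []
    for _ in range(trials):
        x = "".join(rng.choice(alphabet) for _ in range(half_length))
        y = "".join(rng.choice(alphabet) for _ in range(half_length))
        ratios.append(edit_distance(x, y) / (2 * half_length))   # n = |x| + |y|
    return sum(ratios) / len(ratios)

# Prints an empirical estimate of alpha for a binary alphabet with unit costs.
print(round(estimate_alpha(), 3))
```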

01 May 1995
TL;DR: The two new alternative methods for defining distance strings are shown to outperform the earlier approach; the results can be used to aid users in determining the extent to which distance string cutoffs should be employed in applications, as well as to aid the development of location modeling software.
Abstract: Many solution techniques for discrete location-allocation problems make use of a sorted distance strings data structure in order to speed processing time. The primary drawback to distance strings is that in a standard computer architecture they require approximately 50% additional memory storage in comparison to a standard distance matrix. To reduce the additional memory requirements of distance strings, researchers such as Hillsman [1980] and Densham and Rushton [1992a] have proposed a strategy for cutting the distance strings to include only sets of relatively close neighbor nodes. This has become an important implementation issue for solving relatively large location-allocation problems. In fact, the new Location-Allocation module of the ARC/Info GIS system uses a distance string structure and a string cutoff option to save storage and processing time. The danger in employing only partial distance strings is that if too few distance entries are stored in the distance strings, heuristic or algorithmic performance may be compromised in that the quality of the solutions generated may be less than desirable. This paper tests the effects on solution quality of imposing different distance string definitions and sizes. The goal is to establish guidelines concerning the degree to which the distance strings data structure may be reduced for memory savings without adversely affecting solution quality. The paper proposes and compares two alternative methods to that of Hillsman and of Densham and Rushton for the selection of nodes to be included within the distance strings data structure. We show that the two new alternative methods for defining distance strings appear to outperform the earlier approach. The results of this paper can be used to aid users in determining the extent that distance string cutoffs should be employed in application as well as aid in the development of location modeling software.
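A minimal sketch of the data structure under discussion, under the assumption that a distance string is simply the list of candidate sites sorted by distance from a demand node and optionally truncated to the q nearest entries; the node-selection rules compared in the paper are not reproduced.

```python
# Minimal sketch of a "distance strings" structure: for every demand node keep the
# candidate sites sorted by distance, optionally cut off after the q nearest ones.
# The node-selection strategies compared in the paper are not reproduced here.

import math

def build_distance_strings(demand_points, candidate_sites, cutoff=None):
    """Return {demand_id: [(distance, site_id), ...]} sorted by distance,
    truncated to `cutoff` entries per demand node when a cutoff is given."""
    strings = {}
    for d_id, (dx, dy) in demand_points.items():
        row = sorted((math.hypot(dx - sx, dy - sy), s_id)
                     for s_id, (sx, sy) in candidate_sites.items())
        strings[d_id] = row if cutoff is None else row[:cutoff]
    return strings

demand = {"a": (0, 0), "b": (4, 3)}
sites = {1: (1, 0), 2: (5, 5), 3: (0, 4)}
full = build_distance_strings(demand, sites)          # full strings (3 entries each)
cut = build_distance_strings(demand, sites, cutoff=2) # only the 2 nearest sites kept
print(cut["a"])   # [(1.0, 1), (4.0, 3)]
```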

Journal ArticleDOI
TL;DR: A new constrained edit distance between an ordered set of input strings and a single output string is proposed, where all the strings are over a finite alphabet.

Book ChapterDOI
11 Dec 1995
TL;DR: The noisy subsequence recognition problem is solved by defining and using the constrained edit distance between X ∈ H and Y, subject to any arbitrary edit constraint involving the number and type of edit operations to be performed.
Abstract: We consider a problem which can greatly enhance the areas of cursive script recognition and the recognition of printed character sequences. This problem involves recognizing words/strings by processing their noisy subsequences. Let X* be any unknown word from a finite dictionary H. Let U be any arbitrary subsequence of X*. We study the problem of estimating X* by processing Y, a noisy version of U. Y contains substitution, insertion, deletion and generalized transposition errors, the latter occurring when transposed characters are themselves subsequently substituted. We solve the noisy subsequence recognition problem by defining and using the constrained edit distance between X ∈ H and Y subject to any arbitrary edit constraint involving the number and type of edit operations to be performed. An algorithm to compute this constrained edit distance is presented. Using these algorithms, we present a syntactic Pattern Recognition (PR) scheme which corrects noisy text containing all these types of errors. Experimental results which involve strings of lengths between 40 and 80, with an average of 30.24 deleted characters and an overall average noise of 68.69%, demonstrate the superiority of our system over existing methods.

Proceedings ArticleDOI
22 Jan 1995
TL;DR: This paper describes in detail a VLSI architecture for computing the edit distance between arbitrary ordered trees, based on a parallel, systolic realization of the dynamic programming algorithm proposed by S.Y. Lu (1979).
Abstract: The distance between two labeled ordered trees α and β is the minimum cost sequence of editing operations (insertions, deletions and substitutions) needed to transform α into β such that the predecessor-descendant relation between nodes and the ordering of nodes is not changed. Approximate tree matching has applications in genetic sequence comparison, scene analysis, error recovery and correction in programming languages, and cluster analysis. Edit distance determination is a computationally intensive task, and the design of special purpose hardware could result in a significant speed up. This paper describes in detail a VLSI architecture for computing the edit distance between arbitrary ordered trees, based on a parallel, systolic realization of the dynamic programming algorithm proposed by S.Y. Lu (1979). This architecture represents a significant improvement over that described by Sastry and Ranganathan (1994), which restricted the type of trees that could be processed by it. Two partitioning strategies to process trees of arbitrary sizes and structures on a fixed size implementation in multiple passes are proposed and analyzed.

Proceedings Article
20 Sep 1995
TL;DR: Error-tolerant recognition, as described in this paper, enables the recognition of strings that deviate slightly from any string in the regular set recognized by the underlying finite state recognizer; it has applications in error-tolerant morphological analysis and spelling correction.
Abstract: Error-tolerant recognition enables the recognition of strings that deviate slightly from any string in the regular set recognized by the underlying finite state recognizer. In the context of natural language processing, it has applications in error-tolerant morphological analysis and spelling correction. After a description of the concepts and algorithms involved, we give examples from these two applications: In morphological analysis, error-tolerant recognition allows misspelled input word forms to be corrected and morphologically analyzed concurrently. The algorithm can be applied to the morphological analysis of any language whose morphology is fully captured by a single (and possibly very large) finite state transducer, regardless of the word formation processes (such as agglutination or productive compounding) and morphographemic phenomena involved. We present an application to error-tolerant analysis of agglutinative morphology of Turkish words. In spelling correction, error-tolerant recognition can be used to enumerate correct candidate forms from a given misspelled string within a certain edit distance. It can be applied to any language whose morphology is fully described by a finite state transducer, or with a word list comprising all inflected forms. With very large word lists of root and inflected forms (some containing well over 200,000 forms), all candidate solutions are generated within 10 to 45 milliseconds (with edit distance 1) on a SparcStation 10/41. For spelling correction in Turkish, error-tolerant recognition operating with a (circular) recognizer of Turkish words (with about 29,000 states and 119,000 transitions) can generate all candidate words in less than 20 milliseconds (with edit distance 1). Spelling correction using a recognizer constructed from a large German word list that simulates compounding also indicates that the approach is applicable in such cases.

Book ChapterDOI
13 Sep 1995
TL;DR: This note addresses the recognition of rotated hand-printed characters and defines the sides and the lids of a plane figure and outline an inexact matching process using such features based on the edit distance between circular words of different lengths.
Abstract: In this note we address the recognition of rotated hand-printed characters. We define the sides and the lids of a plane figure and outline an inexact matching process using such features based on the edit distance between circular words of different lengths.
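A naive baseline for comparing circular words is to minimise the ordinary edit distance over all rotations of one of them, at the cost of O(n) full distance computations; the sketch below shows only that baseline, not the matching process developed in the note.

```python
# Naive edit distance between circular words: minimize the ordinary edit distance
# over all rotations of one word, i.e. O(n) distance computations of O(nm) each.
# This is only a baseline illustration, not the matching process of the paper.

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def cyclic_edit_distance(a, b):
    if not a:
        return len(b)
    return min(edit_distance(a[i:] + a[:i], b) for i in range(len(a)))

# As linear strings the distance is 4, but as circular words they coincide.
print(edit_distance("abcd", "cdab"), cyclic_edit_distance("abcd", "cdab"))   # 4 0
```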

Book ChapterDOI
22 Aug 1995
TL;DR: The edit distance between an input string and a language L is the minimum number of edit operations needed to change the input string into a sentence of L.
Abstract: The notion of edit distance arises from very different fields, such as self-correcting codes, parsing theory, speech recognition and molecular biology. The edit distance between an input string and a language L is the minimum number of edit operations (substitution of a symbol in another incorrect symbol, insertion of an extraneous symbol, deletion of a symbol) needed to change the input string into a sentence of L.
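For the special case where L is a regular language given by a DFA, this edit distance can be computed as a shortest path over pairs (input position, automaton state). The sketch below assumes unit costs and a total transition function, and is only an illustration of the definition above, not an algorithm from the chapter.

```python
# Edit distance from an input string to a regular language L, for the special
# case where L is given as a DFA: shortest path over nodes (input position,
# DFA state) with unit-cost edit operations. Illustration of the definition
# above only, not an algorithm from the chapter.

import heapq

def edit_distance_to_dfa(s, start, accepting, delta):
    """delta: dict mapping (state, symbol) -> state; alphabet inferred from delta."""
    alphabet = {sym for (_, sym) in delta}
    dist = {(0, start): 0}
    heap = [(0, 0, start)]
    while heap:
        d, i, q = heapq.heappop(heap)
        if d > dist.get((i, q), float("inf")):
            continue
        if i == len(s) and q in accepting:
            return d
        moves = []
        if i < len(s):
            moves.append((i + 1, q, 1))                            # delete s[i]
        for sym in alphabet:
            nq = delta[(q, sym)]
            moves.append((i, nq, 1))                               # insert sym
            if i < len(s):
                moves.append((i + 1, nq, 0 if s[i] == sym else 1)) # match / substitute
        for ni, nq, c in moves:
            nd = d + c
            if nd < dist.get((ni, nq), float("inf")):
                dist[(ni, nq)] = nd
                heapq.heappush(heap, (nd, ni, nq))
    return float("inf")

# DFA for the language (ab)* over {a, b}; state 2 is a dead state.
delta = {(0, "a"): 1, (0, "b"): 2, (1, "a"): 2, (1, "b"): 0, (2, "a"): 2, (2, "b"): 2}
print(edit_distance_to_dfa("aba", 0, {0}, delta))   # 1 ("aba" is one edit from "ab" or "abab")
```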

Book ChapterDOI
04 Dec 1995
TL;DR: The basic idea behind the algorithms is to represent each dictionary pattern with one or two points in a |Σ|^q-dimensional real space under the L1-metric, where Σ is the underlying alphabet and q a fixed integer, and then organize these points with some spatial data structure to make subsequent searches with different texts of different lengths and different tolerance values fast.
Abstract: In the approximate dictionary matching problem, a dictionary that contains a set of pattern strings is given. The user presents a text string and a tolerance k (k is a positive integer) and asks for all occurrences of all dictionary patterns that appear in the text with at most k differences to the original patterns. We present two algorithms for the problem. The first algorithm assumes that all patterns in the dictionary are of the same length. The second algorithm removes this assumption at the expense of a somewhat more complicated preprocessing of the dictionary and slower query time. The basic idea behind our algorithms is to represent each dictionary pattern with one or two points in a |Σ|^q-dimensional real space under the L1-metric, where Σ is the underlying alphabet and q a fixed integer, and then organize these points with some spatial data structure to make subsequent searches with different texts of different lengths and different tolerance values fast. Although approximate dictionary matching would be of enormous importance in molecular biological applications, no previous results for the problem are known.
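The mapping can be illustrated with a small filtering sketch: each string is reduced to its vector of q-gram counts, and because one edit operation changes at most q q-grams, edit distance at most k implies L1 distance at most 2kq between the count vectors. The sketch below uses that standard counting bound as a candidate filter only; it does not reproduce the paper's spatial data structure or query algorithms.

```python
# Pattern filtering via q-gram count vectors: each string is mapped to its vector
# of q-gram counts (a point in |Sigma|^q-dimensional space under the L1 metric).
# Since one edit operation changes at most q of a string's q-grams, edit distance
# <= k implies L1 distance between the count vectors <= 2*k*q, so the vectors can
# be used to discard most dictionary patterns before an exact distance check.
# This illustrates the filtering idea only, not the paper's data structure.

from collections import Counter

def qgram_vector(s, q):
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def l1_distance(u, v):
    return sum(abs(u[g] - v[g]) for g in set(u) | set(v))

def candidate_patterns(dictionary, query, k, q=2):
    """Return the dictionary patterns that survive the q-gram filter for tolerance k."""
    qv = qgram_vector(query, q)
    return [p for p in dictionary
            if l1_distance(qgram_vector(p, q), qv) <= 2 * k * q]

words = ["acgtacgt", "ttttcccc", "acgaacgt", "gggggggg"]
print(candidate_patterns(words, "acgtacgg", k=1))   # ['acgtacgt', 'acgaacgt']
```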