
Showing papers on "Edit distance" published in 1998


Journal Article
TL;DR: The stochastic model allows us to learn a string-edit distance function from a corpus of examples and is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.
Abstract: In many applications, it is necessary to determine the similarity of two strings. A widely-used notion of string similarity is the edit distance: the minimum number of insertions, deletions, and substitutions required to transform one string into the other. In this report, we provide a stochastic model for string-edit distance. Our stochastic model allows us to learn a string-edit distance function from a corpus of examples. We illustrate the utility of our approach by applying it to the difficult problem of learning the pronunciation of words in conversational speech. In this application, we learn a string-edit distance with nearly one-fifth the error rate of the untrained Levenshtein distance. Our approach is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.

897 citations
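The abstract above defines edit distance as the minimum number of insertions, deletions, and substitutions needed to turn one string into the other. As a minimal sketch of that untrained baseline only (the classic unit-cost dynamic program, not the stochastic model the paper learns), with the function name `levenshtein` chosen here for illustration:

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance between a and b, computed row by row."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1       # drop ca from a
            insertion = curr[j - 1] + 1  # insert cb into a
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

A learned distance would replace the fixed unit costs with costs estimated from example pairs; that part is not sketched here.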


Journal Article
TL;DR: It is formally shown that the new distance measure is a metric, based on the maximal common subgraph of two graphs, which is superior to edit distance based measures in that no particular edit operations together with their costs need to be defined.

782 citations


Book Chapter
Philip N. Klein
24 Aug 1998
TL;DR: This work gives an $O(n^3 \log n)$ algorithm to compute the edit distance between two ordered trees, where an ordered tree is a tree in which each node's incident edges are cyclically ordered.
Abstract: An ordered tree is a tree in which each node's incident edges are cyclically ordered; think of the tree as being embedded in the plane. Let A and B be two ordered trees. The edit distance between A and B is the minimum cost of a sequence of operations (contract an edge, uncontract an edge, modify the label of an edge) needed to transform A into B. We give an $O(n^3 \log n)$ algorithm to compute the edit distance between two ordered trees.

283 citations


Journal Article
TL;DR: This paper considers an incremental version of the problem of comparing two sequences A and B to determine their longest common subsequence (LCS) or the edit distance between them, and obtains $O(nk)$ algorithms for the longest prefix approximate match problem, the approximate overlap problem, and cyclic string comparison.
Abstract: The problem of comparing two sequences A and B to determine their longest common subsequence (LCS) or the edit distance between them has been much studied. In this paper we consider the following incremental version of these problems: given an appropriate encoding of a comparison between A and B, can one incrementally compute the answer for A and bB, and the answer for A and Bb with equal efficiency, where b is an additional symbol? Our main result is a theorem exposing a surprising relationship between the dynamic programming solutions for two such "adjacent" problems. Given a threshold k on the number of differences to be permitted in an alignment, the theorem leads directly to an $O(k)$ algorithm for incrementally computing a new solution from an old one, in contrast to the $O(k^2)$ time required to compute a solution from scratch. We further show, with a series of applications, that this algorithm is indeed more powerful than its nonincremental counterpart, by solving the applications with greater asymptotic efficiency than heretofore possible. For example, we obtain $O(nk)$ algorithms for the longest prefix approximate match problem, the approximate overlap problem, and cyclic string comparison.

216 citations
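The role of the threshold k in the abstract above can be illustrated with the standard banded dynamic program, which only evaluates cells within k of the main diagonal. This is a sketch of that well-known band technique, not the paper's incremental algorithm; the name `bounded_edit_distance` and the choice to return `None` when the threshold is exceeded are assumptions made for this example.

```python
from typing import Optional


def bounded_edit_distance(a: str, b: str, k: int) -> Optional[int]:
    """Return the unit-cost edit distance if it is at most k, else None."""
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None  # at least |n - m| insertions or deletions are needed
    inf = k + 1  # stands for "more than k"; values are clamped to it
    prev = [j if j <= k else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        # Full rows are allocated for brevity; only band cells are evaluated.
        curr = [inf] * (m + 1)
        lo, hi = max(0, i - k), min(m, i + k)
        if lo == 0:
            curr[0] = i  # here i <= k, so the value is within the threshold
        for j in range(max(1, lo), hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost, prev[j] + 1, curr[j - 1] + 1, inf)
        prev = curr
    return prev[m] if prev[m] <= k else None


if __name__ == "__main__":
    print(bounded_edit_distance("kitten", "sitting", 3))
    print(bounded_edit_distance("kitten", "sitting", 2))
```

Cells outside the band can be treated as "more than k" because any alignment passing through cell (i, j) costs at least |i - j|.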


Journal Article
TL;DR: This work builds a data structure, in $O(mn \log m)$ time, that supports $O(\log m)$ time queries about the weight of any of the $O(m^2 n)$ best paths from the vertices in column 0 of the graph to all other vertices, and presents a simple $O(n^2 \log n)$ time and $\Theta(n^2)$ space algorithm to find all approximate tandem repeats xy within a string of size n.
Abstract: Weighted paths in directed grid graphs of dimension $m \times n$ can be used to model the string edit problem, which consists of obtaining optimal (weighted) alignments between substrings of A, |A|=m, and substrings of B, |B|=n. We build a data structure (in $O(mn \log m)$ time) that supports $O(\log m)$ time queries about the weight of any of the $O(m^2 n)$ best paths from the vertices in column 0 of the graph to all other vertices. Using these techniques we present a simple $O(n^2 \log n)$ time and $\Theta(n^2)$ space algorithm to find all (the locally optimal) approximate tandem (or nontandem) repeats xy within a string of size n. This improves (by a factor of log n) upon several previous algorithms for this problem and is the first algorithm to find all locally optimal repeats. For edit graphs with weights in {0, -1, 1}, a slight modification of our techniques yields an $O(n^2)$ algorithm for the cyclic string comparison problem, as compared to $O(n^2 \log n)$ for the case of general weights.

148 citations
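For readers unfamiliar with the cyclic string comparison problem mentioned above, a brute-force baseline makes the problem statement concrete. The sketch below assumes the formulation "minimise the unit-cost edit distance over all rotations of one string" and simply tries every rotation; it is far slower than the bounds quoted in the abstract and is not the paper's method. Both function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j - 1] + cost, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def cyclic_edit_distance(a: str, b: str) -> int:
    """Smallest edit distance between b and any rotation of a (brute force)."""
    if not a:
        return len(b)
    # One full dynamic program per rotation of a.
    return min(edit_distance(a[s:] + a[:s], b) for s in range(len(a)))


if __name__ == "__main__":
    print(cyclic_edit_distance("abcd", "cdab"))
```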


Proceedings Article
D.C. Gibbon
23 Feb 1998
TL;DR: The techniques presented can produce high quality hypermedia documents of video programs with little or no additional manual effort.
Abstract: This paper presents a method of automatically creating hypermedia documents from conventional transcriptions of television programs. Using parallel text alignment techniques, the temporal information derived from the closed caption signal is exploited to convert the transcription into a synchronized text stream. Given this text stream, we can create links between the transcription and the image and audio media streams. We describe a two-pass method for aligning parallel texts that first uses dynamic programming techniques to maximize the number of corresponding words (by minimizing the word edit distance). The second stage converts the word alignment into a sentence alignment, taking into account the cases of sentence split and merge. We present results of text alignment on a database of 610 programs (including three television news programs over a one-year period) for which we have closed caption, transcript, audio and image streams. The techniques presented can produce high quality hypermedia documents of video programs with little or no additional manual effort.

120 citations
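The first pass described above, aligning two texts by minimizing the word edit distance, can be sketched as a dynamic program over word lists followed by a backtrace. This is only an illustration under unit costs; the function name `align_words`, the `None` marker for an unmatched word, and the tie-breaking order in the backtrace are assumptions, and the paper's second pass (sentence alignment) is not covered.

```python
from typing import List, Optional, Sequence, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def align_words(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, List[Pair]]:
    """Return the word edit distance and one optimal word-to-word alignment."""
    n, m = len(ref), len(hyp)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1)
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Walk back from the bottom-right cell, preferring diagonal moves.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        diag_cost = 0 if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] else 1
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + diag_cost:
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))  # word only in ref
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))  # word only in hyp
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


if __name__ == "__main__":
    print(align_words("the quick brown fox".split(), "the quick fox".split()))
```

Once words are paired this way, timing information attached to one text (for example caption timestamps) could be carried over to the matched words of the other.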


Book Chapter
Horst Bunke
TL;DR: Some new optimal algorithms for error-tolerant graph matching are discussed, and under specific conditions, the new algorithms may be significantly more efficient than traditional methods.
Abstract: This paper first reviews some theoretical results in error-tolerant graph matching that were obtained recently. The results include a new metric for error-tolerant graph matching based on maximum common subgraph, a relation between maximum common subgraph and graph edit distance, and the existence of classes of cost functions for error-tolerant graph matching. Then some new optimal algorithms for error-tolerant graph matching are discussed. Under specific conditions, the new algorithms may be significantly more efficient than traditional methods.

57 citations


Proceedings Article
01 Jan 1998
TL;DR: This article gives two algorithms for finding all approximate matches of a pattern in a text, where the edit distance between the pattern and the matching text substring is at most k. The first algorithm, which is quite simple, runs in time $O(nk^3/m + n + m)$ on all patterns except k-break periodic strings.
Abstract: We give two algorithms for finding all approximate matches of a pattern in a text, where the edit distance between the pattern and the matching text substring is at most k. The first algorithm, which is quite simple, runs in time $O(nk^3/m + n + m)$ on all patterns except k-break periodic strings (defined later). The second algorithm runs in time $O(nk^4/m + n + m)$ on k-break periodic patterns. The two classes of patterns are easily distinguished in $O(m)$ time.

55 citations
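The problem stated above, finding every place in a text where some substring is within edit distance k of the pattern, has a classic O(mn) dynamic programming solution in which a match may start at any text position for free. The sketch below shows that textbook baseline, not either of the paper's faster algorithms; the function name and the choice to report 0-based end positions are assumptions.

```python
from typing import List


def approximate_match_ends(pattern: str, text: str, k: int) -> List[int]:
    """Return 0-based text positions where a match with at most k errors ends."""
    m = len(pattern)
    # col[i] = fewest edits aligning pattern[:i] with some text substring
    # ending at the current text position; before any text is read, col[i] = i.
    col = list(range(m + 1))
    ends = []
    for pos, ch in enumerate(text):
        new = [0]  # the empty pattern prefix matches anywhere at no cost
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new.append(min(col[i - 1] + cost, col[i] + 1, new[i - 1] + 1))
        col = new
        if col[m] <= k:
            ends.append(pos)
    return ends


if __name__ == "__main__":
    print(approximate_match_ends("abc", "xxabcxx", 1))
```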


Proceedings Article
05 Aug 1998
TL;DR: The problem of video sequence-to-sequence matching as a pattern matching problem is formulated and the vstring edit distance is proposed as a new similarity measure for video sequence matching.
Abstract: Contrary to current approaches which generally treat the video data as a random collection of static images, content-based video retrieval requires methods for video sequence-to-sequence matching incorporating the temporal order inherent in video data. We formulate the problem of video sequence-to-sequence matching as a pattern matching problem. New string edit operations required for the special characteristics of video sequences and the unique features of the vstring representation are introduced. Based on the edit operations, the vstring edit distance is proposed as a new similarity measure for video sequence matching.

47 citations


Book Chapter
22 Jul 1998
TL;DR: An improved similarity measure for the first-order instance based learner Ribl is presented that employs the concept of edit distances to efficiently compute distances between lists and terms.
Abstract: The similarity measures used in first-order IBL so far have been limited to the function-free case. In this paper we show that a lot of predictive power can be gained by allowing lists and other terms in the input representation and designing similarity measures that work directly on these structures. We present an improved similarity measure for the first-order instance based learner Ribl that employs the concept of edit distances to efficiently compute distances between lists and terms, discuss its computational and formal properties, and show that it is empirically superior by a wide margin on a problem from the domain of biochemistry.

32 citations


Proceedings Article
10 Aug 1998
TL;DR: This paper proposes reducing the scope of spelling correction by focusing only on dubious areas in the input sentence, which are located by applying a word segmentation algorithm and finding word sequences with low probability; a part-of-speech trigram model and the Winnow algorithm are then combined to determine the most probable correction.
Abstract: For languages that have no explicit word boundary such as Thai, Chinese and Japanese, correcting words in text is harder than in English because of additional ambiguities in locating error words. The traditional method handles this by hypothesizing that every substring in the input sentence could be an error word and trying to correct all of them. In this paper, we propose the idea of reducing the scope of spelling correction by focusing only on dubious areas in the input sentence. Boundaries of these dubious areas can be obtained approximately by applying a word segmentation algorithm and finding word sequences with low probability. To generate the candidate correction words, we use a modified edit distance which reflects the characteristics of Thai OCR errors. Finally, a part-of-speech trigram model and the Winnow algorithm are combined to determine the most probable correction.

Proceedings Article
16 Aug 1998
TL;DR: This paper describes a novel framework for comparing and matching corrupted relational graphs and shows how the normalised edit distance of Marzal and Vidal (1993) can be used to model the probability distribution for structural errors in the graph-matching problem.
Abstract: This paper describes a novel framework for comparing and matching corrupted relational graphs. The paper develops the idea of edit distance originally used for graph-matching by Sanfeliu and Fu (1983). We show how the normalised edit distance of Marzal and Vidal (1993) can be used to model the probability distribution for structural errors in the graph-matching problem. This probability distribution is used to locate matches using MAP label updates. We compare the resulting graph-matching algorithm with that reported by Wilson and Hancock (1997).

Book Chapter
12 Aug 1998
TL;DR: New similarity measures are presented and they can be used to perform more general two-dimensional approximate pattern matching and to compute the edit distance between two images.
Abstract: In this paper we discuss how to compute the edit distance (or similarity) between two images. We present new similarity measures and how to compute them. They can be used to perform more general two-dimensional approximate pattern matching. Previous work on two-dimensional approximate string matching either works with only substitutions or uses a restricted edit distance that allows only some types of errors.

Book
01 Jan 1998
TL;DR: This thesis proposes index structures for efficient evaluation of temporal queries in temporal databases and similarity based queries in multimedia databases, and introduces a distance-based index structure, called mvp-tree (multi-vantage point tree) for similarity queries on high-dimensional metric spaces.
Abstract: This thesis proposes index structures for efficient evaluation of temporal queries in temporal databases and similarity based queries in multimedia databases. To support temporal operators and to increase the efficiency of temporal queries, indexing based on temporal attributes is required. A temporal database can support two notions of time. Valid time is the time when a data entity is valid in reality, and the transaction time is the time when a data entity is recorded in the database. In this thesis, methods for indexing time intervals in transaction time and valid time databases are proposed. Transaction time databases are append only databases. Data is never deleted from the database, and data versions that were deleted or modified are stored as historical versions. This thesis proposes indexing current and historical versions of temporal entities separately to exploit the behavior of transaction time data. Two structures, namely IB+tree and AD*-tree, are proposed for indexing the historical data versions. IB+tree (Interval B+tree) is an augmented B+tree to support interval search queries. Similarly, AD*-tree is an augmented AD-tree (Append-Delete tree) which is a one-dimensional tree structure specifically designed for FIFO domains. Some experimental and analytical results are provided for evaluation of these proposed structures. In valid time databases, temporal data versions can be inserted, deleted, or modified at any point in time. Furthermore, their lifespans can extend into the future. For indexing a dynamic set of valid time intervals, an indexing scheme that uses IB+trees with time-splits is proposed. Time-splits partition long intervals into disjoint subintervals and distribute them among several leaf nodes to increase efficiency of search operations, especially for timeslice queries. The effect of using time-splits on query performance is evaluated empirically. 
The extensions to IB+trees for handling valid time intervals that span into the future and valid time intervals whose end points move along the current timeline are also shown. In the context of multimedia databases, indexing methods for answering similarity based queries are studied. An indexing method for finding approximate matches in a collection of numerical sequences is proposed, and a general index structure for similarity based queries in metric spaces is introduced. The first problem considered is efficient matching and retrieval of sequences of different lengths. For similarity matching of sequences, a modified version of the edit distance function is used. For efficient retrieval of matching sequences, an indexing scheme based on lengths and relative distances between sequences is proposed. Conventional index structures are designed to function in Euclidean domains. They cannot be directly used for non-Euclidean data (e.g., text), and they are not as efficient for high dimensional data spaces. Distance based index structures are proposed for applications where the data domain is high dimensional, or the distance function used to compute distances between data objects is non-Euclidean. This thesis introduces a distance-based index structure, called mvp-tree (multi-vantage point tree), for similarity queries on high-dimensional metric spaces. The mvp-tree uses multiple reference points (vantage points) in each node to partition the data space into spherical shell-like regions in a hierarchical manner. It also utilizes the pre-computed distances at search time. The performance of mvp-tree is evaluated empirically, and the results are provided.
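The remark that pre-computed distances to vantage points are used at search time rests on the triangle inequality: if |d(q, v) - d(x, v)| exceeds the query radius for some vantage point v, then x cannot lie within that radius of q. The sketch below shows only this pruning rule over a flat table of pre-computed distances; it is not the mvp-tree itself (there is no hierarchy or shell partitioning), and every name in it is hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
Metric = Callable[[T, T], float]


def build_pivot_table(items: Sequence[T], pivots: Sequence[T], dist: Metric) -> List[List[float]]:
    """Pre-compute the distance from every item to every vantage point."""
    return [[dist(x, p) for p in pivots] for x in items]


def range_query(query: T, radius: float, items: Sequence[T], pivots: Sequence[T],
                table: List[List[float]], dist: Metric) -> List[T]:
    """Return items within radius of query, skipping items the pivots rule out."""
    q_to_pivot = [dist(query, p) for p in pivots]
    hits = []
    for x, row in zip(items, table):
        # Triangle inequality: d(query, x) >= |d(query, p) - d(x, p)|.
        if any(abs(qp - xp) > radius for qp, xp in zip(q_to_pivot, row)):
            continue  # pruned without computing d(query, x)
        if dist(query, x) <= radius:
            hits.append(x)
    return hits


if __name__ == "__main__":
    def absolute_difference(a: float, b: float) -> float:
        return abs(a - b)

    data = [1, 5, 9, 14, 20]
    pivots = [0, 20]
    table = build_pivot_table(data, pivots, absolute_difference)
    print(range_query(10, 4, data, pivots, table, absolute_difference))
```

The rule is valid for any metric, including an edit distance that satisfies the triangle inequality, which is why distance-based indexes suit non-Euclidean data such as text.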

Journal Article
TL;DR: A new linear systolic algorithm for sequence alignment that uses min(n, m) processors and computes the edit distance and the sequence alignment of two sequences Target and Source in time min(n, m) + 2·max(n, m), where n and m denote the lengths of Target and Source respectively.
Abstract: This paper introduces a new linear systolic algorithm [10] for the sequence alignment problem [18]. It is made up of min(n, m) processors and computes the edit distance and the sequence alignment of two sequences Target and Source in time min(n, m) + 2·max(n, m), where n and m denote the lengths of Target and Source respectively. Its characteristics make it faster and more efficient than the previous linear array algorithm for the alignment search.

Book Chapter
TL;DR: This paper uses the tree adjoining grammar developed by Joshi as a prototypical structural model, and applies genetic algorithms to tree adjoining grammars with the introduction of a new editing operation.
Abstract: This paper describes the use of discrete graphical editing operations to dynamically fit hierarchical structural models to input data. We use the tree adjoining grammar developed by Joshi [1] as a prototypical structural model, and realise the editing process using a genetic algorithm. The novelty of our approach lies firstly in the use of the edit distance between the ordered frontier nodes of a tree and a set of dictionaries of legal labels derived from the input as a cost function. Secondly, we apply genetic algorithms to tree adjoining grammars with the introduction of a new editing operation. We demonstrate the utility of the method on a simple natural language processing problem.

01 Jan 1998
TL;DR: The goal in this dissertation is the development of two techniques to measure the similarity between the behavior of different dynamical systems, based on the string edit distance computation and a representation of system trajectories in a Fuzzy Information Space.
Abstract: Our goal in this dissertation is the development of two techniques to measure the similarity between the behavior of different dynamical systems. The behavior of a dynamical system is defined by its motions or trajectories. The first technique will be based on the string edit distance computation. In this approach we will treat the system trajectories as 2D or 3D curves, and represent them as a string of symbols; the individual symbols will denote the value of the curvature and the torsion along the curve. In the literature, the string edit distance has only been used to match 2D curves. We will extend it to handle 3D curves. This can be considered a static approach, in the sense that it does not take into account the temporal elements of the system. In order to include the temporal elements of the system, we will develop a second technique to measure the similarity of trajectories. This technique will be based on a representation of system trajectories in a Fuzzy Information Space. A Fuzzy Information Space is characterized by a collection of temporal fuzzy sets, which decompose the observed trajectories into simple modes of dynamic activities. Each collection of temporal fuzzy sets is called the dynamic profile of a physical system, and can be used to construct a dynamic fuzzy set, which generates a system trajectory. This dynamic profile of a physical system represents a decomposition of the system trajectory in feature space. A temporal fuzzy set belonging to the dynamic profile characterizes a region of attraction in feature space, and quantifies the extent to which the observed dynamic system is governed by that region at any time. Then, by measuring the similarity of these temporal fuzzy sets, we may infer some characteristics of the underlying dynamical system. This can be considered a dynamic approach, because it takes into account the temporal component of the system in question. 
Finally, the techniques were successfully applied to real world problems like 3D-curve recognition, analysis of tremor movement and speech recognition.

Journal Article
TL;DR: This paper proposes a VLSI architecture for computing the distance between ordered h-ary trees, as well as arbitrary ordered trees, and is the very first special purpose architecture that has been proposed for this important problem.
Abstract: The distance between two labeled ordered trees, $\alpha$ and $\beta$, is the minimum cost sequence of editing operations (insertions, deletions, and substitutions) needed to transform $\alpha$ into $\beta$ such that the predecessor-descendant relation between nodes and the ordering of nodes is not changed. Approximate tree matching has applications in genetic sequence comparison, scene analysis, error recovery and correction in programming languages, and cluster analysis. Edit distance computation is a computationally intensive task, and the design of special purpose hardware could result in a significant speed up. This paper proposes a VLSI architecture for computing the distance between ordered h-ary trees, as well as arbitrary ordered trees. This is the very first special purpose architecture that has been proposed for this important problem. The architecture is a parallel realization of a dynamic programming algorithm and makes use of simple basic cells and requires regular nearest-neighbor communication. The architecture has been simulated and verified using the Cadence design tools.

Book Chapter
14 Dec 1998
TL;DR: A method for retrieving objects within some distance from a given object by utilizing the R-tree spatial indexing/access method; the paper proves that objects in a discrete L1 metric space can be embedded into vertices of a unit hypercube when the square root of L1 distance is used as the distance.
Abstract: High-dimensional data, such as documents, digital images, and audio clips, can be considered as spatial objects, which induce a metric space where the metric can be used to measure dissimilarities between objects. We propose a method for retrieving objects within some distance from a given object by utilizing a spatial indexing/access method, the R-tree. Since the R-tree usually assumes a Euclidean metric, we have to embed objects into a Euclidean space. However, some naturally defined distance measures, such as L1 distance (or Manhattan distance), cannot be embedded into any Euclidean space. First, we prove that objects in a discrete L1 metric space can be embedded into vertices of a unit hypercube when the square root of L1 distance is used as the distance. To take full advantage of R-tree spatial indexing, we have to project objects into a space of relatively lower dimension. We adopt FastMap by Faloutsos and Lin to reduce the dimension of the object space. The range corresponding to a query (Q, h) for retrieving objects within distance h from an object Q is naturally considered as a hyper-sphere even after FastMap projection, which is an orthogonal projection in Euclidean space. However, it turns out that the query range is contracted into a hyper-box smaller than the hyper-sphere by applying FastMap to objects embedded in the above mentioned way. Finally, we give a brief summary of experiments in applying our method to Japanese chess boards.
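The embedding claim above can be checked numerically with a unary encoding, which is one standard way to realise it and is assumed here rather than taken from the paper. Each non-negative integer coordinate x, bounded by some `max_value`, becomes x ones followed by zeros. The Hamming distance between two such bit vectors equals the L1 distance of the originals, so their Euclidean distance is its square root. The function names and the bound `max_value` are illustrative.

```python
import math
from typing import List, Sequence


def unary_embed(vec: Sequence[int], max_value: int) -> List[int]:
    """Map each coordinate x (0 <= x <= max_value) to x ones then zeros."""
    bits: List[int] = []
    for x in vec:
        if not 0 <= x <= max_value:
            raise ValueError("coordinate outside the assumed range")
        bits.extend([1] * x + [0] * (max_value - x))
    return bits


def l1_distance(u: Sequence[int], v: Sequence[int]) -> int:
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean_distance(u: Sequence[int], v: Sequence[int]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    p, q = [3, 0, 2], [1, 4, 2]
    embedded = euclidean_distance(unary_embed(p, 4), unary_embed(q, 4))
    # The two printed values should agree up to floating-point rounding.
    print(embedded, math.sqrt(l1_distance(p, q)))
```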

Proceedings Article
01 Jan 1998
TL;DR: This article presents reduced nondeterministic finite automata (NFAs) for approximate string matching, where the pattern can occur with some limited number of errors given by edit distance.
Abstract: Approximate string and sequence matching is the problem of searching for all occurrences of a pattern (string or sequence) in some text, where the pattern can occur with some limited number of errors given by edit distance. Several methods have been designed for approximate string matching that simulate nondeterministic finite automata (NFAs) constructed for this problem. This paper presents reduced NFAs for approximate string matching, usable when we are interested only in occurrences having edit distance less than or equal to a given integer, but not in the exact edit distance of each found occurrence. Then an algorithm based on dynamic programming that simulates these reduced NFAs is presented. The paper also shows how to use this algorithm for approximate sequence matching.

Book Chapter
20 Jul 1998
TL;DR: This extended abstract presents a polylogarithmic time algorithm for the tree editing problem on ordered labeled trees T1 and T2; previously, efficient solutions were known only for special cases such as the more restricted degree-2 edit distance problem.
Abstract: Ordered labeled trees are trees whose nodes are labeled and in which the left-to-right order among siblings is significant. The tree editing problem for input ordered labeled trees T1 and T2 is defined as transforming T1 into T2 by performing a series of weighted edit operations on T1 with overall minimum cost. An edit operation can be a deletion, an insertion, or a substitution. Previous results on this problem are only for some special cases and the time complexity depends on the actual distance, though for the more restricted version of the degree-2 edit distance problem there are efficient solutions. In this extended abstract, we show a polylogarithmic time algorithm for this problem.

Book Chapter
TL;DR: This paper considers the problem of recognizing ordered labeled trees by processing their noisy subsequence-trees, which are “patched-up” noisy portions of their fragments, and presents what the authors believe to be the first reported solution to the problem.
Abstract: In this paper we consider the problem of recognizing ordered labeled trees by processing their noisy subsequence-trees which are “patched-up” noisy portions of their fragments. We assume that we are given H, a finite dictionary of ordered labeled trees. X* is an unknown element of H, and U is any arbitrary subsequence-tree of X*. We consider the problem of estimating X* by processing Y — a noisy version of U. We do this by sequentially comparing Y with every element X of H, the basis of comparison being the constrained edit distance between two trees [OL94], where the constraint implicitly captures the properties of the corrupting mechanism (“channel”) which noisily garbles U into Y. Experimental results which involve manually constructed trees of sizes between 25 and 35 nodes and which contain an average of 21.8 errors per tree demonstrate that the scheme has about 92.8% accuracy. Similar experiments for randomly generated trees yielded an accuracy of 86.4%. To our knowledge this is the first reported solution to the problem.

Book Chapter
20 Jul 1998
TL;DR: This paper discusses word sequence matching and adapts the common edit distance metric for approximate string matching to searching for words and sequences of words.
Abstract: In this paper, we discuss word sequence matching, and we adapt the common edit distance metric for approximate string matching to searching for words and sequences of words. We furthermore create a variant of the Sparse Suffix Tree ([3]) and adapt algorithms for approximate word and word sequence matching over the Sparse Suffix Tree variant. The algorithms have been implemented and tested in a WWW information retrieval environment, and performance data is presented.