Topic

Edit distance

About: Edit distance is a research topic. Over its lifetime, 2,887 publications have been published on this topic, receiving 71,491 citations.


Papers
Book Chapter
17 Sep 2007
TL;DR: This article learns a general stochastic tree edit distance in which node deletions and insertions are allowed, and evaluates the learned metric on an image recognition task where images are represented as trees.
Abstract: The problem of learning metrics between structured data (strings, trees or graphs) has been the subject of various recent papers. With regard to the specific case of trees, some approaches focused on the learning of edit probabilities required to compute a so-called stochastic tree edit distance. However, to reduce the algorithmic and learning constraints, the deletion and insertion operations are achieved on entire subtrees rather than on single nodes. We aim in this article at filling the gap with the learning of a more general stochastic tree edit distance where node deletions and insertions are allowed. Our approach is based on an adaptation of the EM optimization algorithm to learn parameters of a tree model. We propose an original experimental approach aiming at representing images by a tree-structured representation and then at using our learned metric in an image recognition task. Comparisons with a non learned tree edit distance confirm the effectiveness of our approach.

19 citations

Journal Article
TL;DR: This paper shows that edit distance with block deletions can be solved optimally, whereas edit distance with block moves and block deletions remains NP-complete and can be reduced to the problem of block moves only, keeping the same log n approximation factor.
Abstract: We consider the addition of some or all of the operations block move, block delete, block copy, block reversals, and block copy reversals, to the traditional edit distance problem (finding the minimum number of insert-character and delete-character operations to convert one string to another). When all of the above operations are allowed, the problem, called the nearest neighbors problem, is NP hard, and the best known approximation is O(log n log* n), which was achieved by Muthukrishnan and Sahinalp [2000,2002a]. In this paper we show that this problem can be approximated by a constant factor of 3.5 using a simple sliding window method. When eliminating reversals, the same method reduces the best known approximation of 12, achieved by Ergun, Muthukrishnan and Sahinalp [2003], down to a factor of 4. Both constant factors are proved to be tight. Allowing only subsets of these operations does not necessarily make the problem easier. Shapira and Storer [2002] present a log n factor approximation algorithm for edit distance with block moves (which is also an NP-complete problem). Here, we show that edit distance with block deletions can be solved optimally, but edit distance with block moves and block deletions remains NP-complete and can be reduced to the problem of block moves only, keeping the same log n factor approximation.
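
The "traditional" edit distance named in the abstract above (insert-character and delete-character operations only) can be illustrated with a minimal sketch. This is a textbook dynamic program, not the authors' sliding window method, and it does not handle any of the block operations; the function name `indel_distance` is a hypothetical choice made for this example.

```python
def indel_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions and deletions
    needed to turn a into b (no substitutions, no block operations)."""
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                # Matching characters cost nothing.
                curr.append(prev[j - 1])
            else:
                # Either delete ca from a, or insert cb.
                curr.append(1 + min(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(indel_distance("kitten", "sitting"))
```

The result equals len(a) + len(b) minus twice the length of a longest common subsequence, which is why only a quadratic table is needed for this restricted operation set.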

18 citations

Journal Article
TL;DR: This work proposes a new solution for approximate overlaps based on backward backtracking (Lam, et al., 2008) and suffix filters (Karkkainen and Na, 2008), and uses nH_k + o(n log σ) + r log r bits of space, where H_k is the k-th order entropy and σ the alphabet size.
Abstract: Finding approximate overlaps is the first phase of many sequence assembly methods. Given a set of strings of total length n and an error-rate ε, the goal is to find, for all pairs of strings, their suffix/prefix matches (overlaps) that are within edit distance k = ⌈εℓ⌉, where ℓ is the length of the overlap. We propose a new solution for this problem based on backward backtracking (Lam, et al., 2008) and suffix filters (Karkkainen and Na, 2008). Our technique uses nH_k + o(n log σ) + r log r bits of space, where H_k is the k-th order entropy and σ the alphabet size. In practice, it is more scalable in terms of space, and comparable in terms of time, than q-gram filters (Rasmussen, et al., 2006). Our method is also easy to parallelize and scales up to millions of DNA reads.

18 citations

01 Jan 2004
TL;DR: A string comparator based on edit distance that uses variable edit-step costs derived from training data is compared, on first and last name data from Census files, with a comparator without variable edit-step costs and with the Jaro-Winkler string comparator used in the Census Bureau’s record linkage software.
Abstract: We develop a string comparator based on edit distance that uses variable edit-step costs derived from training data. Using first and last name data from Census files, we compare the performance of this string comparator with one without variable edit-step costs and with the Jaro-Winkler string comparator, which is standardly used in the Census Bureau’s record linkage software.
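
To make the idea of variable edit-step costs concrete, here is a minimal sketch of an edit distance whose substitution cost depends on the character pair. The function name, the parameters, and the cost values are assumptions made for illustration; they are not the costs or the training procedure from the paper, which are not given in the abstract.

```python
from typing import Dict, Tuple


def weighted_edit_distance(
    a: str,
    b: str,
    sub_cost: Dict[Tuple[str, str], float],
    default_sub: float = 1.0,
    indel: float = 1.0,
) -> float:
    """Edit distance where substituting x by y costs sub_cost[(x, y)]
    (or default_sub if the pair is not listed); insertions and deletions
    cost `indel` each. All cost values are supplied by the caller."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel]
        for j, cb in enumerate(b, start=1):
            # A match is free; a mismatch looks up its pair-specific cost.
            step = 0.0 if ca == cb else sub_cost.get((ca, cb), default_sub)
            curr.append(min(
                prev[j] + indel,      # delete ca
                curr[j - 1] + indel,  # insert cb
                prev[j - 1] + step,   # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical cost table: made-up values, NOT taken from the paper.
    costs = {("e", "o"): 0.5}
    print(weighted_edit_distance("hansen", "hanson", costs))
```

With an empty cost table and the default parameters, the function reduces to the ordinary unit-cost edit distance, which gives a baseline comparable to the "one without variable edit-step costs" mentioned in the abstract.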

18 citations

Journal Article
TL;DR: The edit-distance between two strings is the smallest number of operations required to transform one string into the other.
Abstract: The edit-distance between two strings is the smallest number of operations required to transform one string into the other. The distance between languages L1 and L2 is the smallest edit-distance be...
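
The string-level definition above can be sketched as a short dynamic program. The abstract does not list the allowed operations, so this example assumes the common choice of single-character insertion, deletion, and substitution at unit cost (often called Levenshtein distance); the function name `levenshtein` is chosen for this example only, and the distance between languages is not covered.

```python
def levenshtein(a: str, b: str) -> int:
    """Smallest number of single-character insertions, deletions, and
    substitutions (assumed unit costs) that transform a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the stored row short
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free on a match
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

Keeping only two rows of the table brings memory use down to the length of the shorter string, while the running time stays proportional to the product of the two lengths.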

18 citations


Network Information
Related Topics (5)
Graph (abstract data type)
69.9K papers, 1.2M citations
86% related
Unsupervised learning
22.7K papers, 1M citations
81% related
Feature vector
48.8K papers, 954.4K citations
81% related
Cluster analysis
146.5K papers, 2.9M citations
81% related
Scalability
50.9K papers, 931.6K citations
80% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    39
2022    96
2021    111
2020    149
2019    145
2018    139