Topic

Edit distance

About: Edit distance is a research topic. Over its lifetime, 2,887 publications have been published within this topic, receiving 71,491 citations.


Papers
Proceedings ArticleDOI
23 Jan 2013
TL;DR: This work presents a systematic and formal framework for obtaining new data structures by quantitatively relaxing existing ones, and gives concurrent implementations of relaxed data structures and demonstrates that bounded relaxations provide the means for trading correctness for performance in a controlled way.
Abstract: There is a trade-off between performance and correctness in implementing concurrent data structures. Better performance may be achieved at the expense of relaxing correctness, by redefining the semantics of data structures. We address such a redefinition of data structure semantics and present a systematic and formal framework for obtaining new data structures by quantitatively relaxing existing ones. We view a data structure as a sequential specification S containing all "legal" sequences over an alphabet of method calls. Relaxing the data structure corresponds to defining a distance from any sequence over the alphabet to the sequential specification: the k-relaxed sequential specification contains all sequences over the alphabet within distance k from the original specification. In contrast to other existing work, our relaxations are semantic (distance in terms of data structure states). As an instantiation of our framework, we present two simple yet generic relaxation schemes, called out-of-order and stuttering relaxation, along with several ways of computing distances. We show that the out-of-order relaxation, when further instantiated to stacks, queues, and priority queues, amounts to tolerating bounded out-of-order behavior, which cannot be captured by a purely syntactic relaxation (distance in terms of sequence manipulation, e.g. edit distance). We give concurrent implementations of relaxed data structures and demonstrate that bounded relaxations provide the means for trading correctness for performance in a controlled way. The relaxations are monotonic which further highlights the trade-off: increasing k increases the number of permitted sequences, which as we demonstrate can lead to better performance. Finally, since a relaxed stack or queue also implements a pool, we actually have new concurrent pool implementations that outperform the state-of-the-art ones.
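As a rough illustration of the out-of-order idea (a sketch only, not the paper's concurrent implementations), the following Python fragment relaxes a FIFO queue so that dequeue may legally return any of the first k + 1 elements; the class and method names are assumptions made for this example.

import random
from collections import deque

class RelaxedQueue:
    """k-out-of-order relaxed FIFO queue: dequeue may return any of the
    first k + 1 enqueued elements (k = 0 gives a strict queue)."""
    def __init__(self, k):
        self.k = k
        self._items = deque()

    def enqueue(self, x):
        self._items.append(x)

    def dequeue(self):
        if not self._items:
            return None  # empty behaves like a pool with nothing to hand out
        # Any of the first k + 1 elements is a legal answer under the
        # relaxation; a concurrent version could pick the least-contended
        # slot instead of choosing at random.
        window = min(self.k + 1, len(self._items))
        i = random.randrange(window)
        self._items.rotate(-i)
        x = self._items.popleft()
        self._items.rotate(i)
        return x

q = RelaxedQueue(k=2)
for v in "abcde":
    q.enqueue(v)
print([q.dequeue() for _ in range(5)])  # e.g. ['b', 'a', 'c', 'e', 'd']; every element exactly once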

107 citations

Journal ArticleDOI
TL;DR: It turns out that none of the algorithms is the best for all values of the problem parameters, and the speed differences between the methods can be considerable.
Abstract: Experimental comparisons of the running time of approximate string matching algorithms for the k differences problem are presented. Given a pattern string, a text string, and an integer k, the task is to find all approximate occurrences of the pattern in the text with at most k differences (insertions, deletions, changes). We consider seven algorithms based on different approaches including dynamic programming, Boyer-Moore string matching, suffix automata, and the distribution of characters. It turns out that none of the algorithms is the best for all values of the problem parameters, and the speed differences between the methods can be considerable.
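For reference, a minimal Python sketch of the dynamic-programming approach to the k differences problem (a Sellers-style column recurrence; an illustration, not one of the paper's tuned implementations):

def k_differences(pattern, text, k):
    """Return the (1-based) end positions of all substrings of text whose
    edit distance to pattern is at most k."""
    m = len(pattern)
    # col[i] = best edit distance between pattern[:i] and some suffix of
    # the text processed so far.
    col = list(range(m + 1))
    matches = []
    for j, tc in enumerate(text, start=1):
        prev_diag = col[0]
        col[0] = 0                       # an occurrence may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == tc else 1
            cur = min(col[i] + 1,        # deletion
                      col[i - 1] + 1,    # insertion
                      prev_diag + cost)  # match or change
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            matches.append(j)
    return matches

print(k_differences("survey", "surgery", 2))  # [5, 6, 7]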

106 citations

Proceedings ArticleDOI
Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, Xuemin Lin
12 Jun 2011
TL;DR: This paper shows that the minimum signature size lower bound is t + 1, proposes asymmetric signature schemes that achieve this lower bound, and develops efficient query processing algorithms based on the new scheme.
Abstract: Given a query string Q, an edit similarity search finds all strings in a database whose edit distance to Q is no more than a given threshold t. Most existing methods for answering edit similarity queries rely on a signature scheme to generate candidates given the query string. We observe that the number of signatures generated by existing methods is far greater than the lower bound, and this results in high query time and index space complexities. In this paper, we show that the minimum signature size lower bound is t + 1. We then propose asymmetric signature schemes that achieve this lower bound. We develop efficient query processing algorithms based on the new scheme. Several dynamic programming-based candidate pruning methods are also developed to further speed up the performance. We have conducted a comprehensive experimental study involving nine state-of-the-art algorithms. The experimental results clearly demonstrate the efficiency of our methods.
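The t + 1 bound can be read through a pigeonhole argument: if the query is cut into t + 1 disjoint segments, each of the at most t edit operations touches at most one segment, so any string within edit distance t must contain at least one segment verbatim. Below is a hedged Python sketch of that filter-and-verify idea; it is not the paper's asymmetric signature scheme or pruning, and all names are illustrative assumptions.

def segments(s, t):
    """Split s into t + 1 roughly equal, disjoint segments."""
    n, parts = len(s), t + 1
    bounds = [round(i * n / parts) for i in range(parts + 1)]
    return [s[bounds[i]:bounds[i + 1]] for i in range(parts)]

def candidates(query, database, t):
    """Strings containing at least one query segment: a superset of all
    true matches within edit distance t."""
    segs = [seg for seg in segments(query, t) if seg]
    if not segs:                       # degenerate case: tiny or empty query
        return list(database)
    return [s for s in database if any(seg in s for seg in segs)]

def edit_distance(a, b):
    """Plain Levenshtein distance, used here only to verify candidates."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

db = ["surgery", "service", "survey", "surveys", "serve"]
cands = candidates("survey", db, 2)
print([s for s in cands if edit_distance("survey", s) <= 2])
# ['surgery', 'survey', 'surveys', 'serve']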

106 citations

Journal Article
TL;DR: This work considers the more general problem in which strings are represented as singly linked lists and the edit operations may be applied to the pointer associated with a vertex as well as to its character, and shows that this problem is NP-complete.
Abstract: The traditional edit-distance problem is to find the minimum number of insert-character and delete-character (and sometimes change character) operations required to transform one string into another. Here we consider the more general problem of strings being represented by a singly linked list (one character per node) and being able to apply these operations to the pointer associated with a vertex as well as the character associated with the vertex. That is, in O(1) time, not only can characters be inserted or deleted, but also substrings can be moved or deleted. We limit our attention to the ability to move substrings and leave substring deletions for future research. Note that O(1) time substring move operations imply O(1) substring exchange operations as well, a form of transformation that has been of interest in molecular biology. We show that this problem is NP-complete, show that a recursive sequence of moves can be simulated with at most a constant factor increase by a non-recursive sequence, and present a polynomial time greedy algorithm for non-recursive moves with a worst-case log factor approximation to optimal. The development of this greedy algorithm shows how to reduce moves of substrings to moves of characters, and how to convert moves with characters to only insert and deletes of characters.
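To make the gap between the two models concrete, a small Python sketch (illustrative only, not the paper's greedy algorithm) contrasts the traditional character-level edit distance with a single simulated substring move:

def levenshtein(a, b):
    """Traditional edit distance: insert, delete, or change one character."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def move_substring(s, start, end, dest):
    """Simulate one substring move: relocate s[start:end] so it begins at
    index dest of the remaining string (the linked-list model does this
    with O(1) pointer updates)."""
    block, rest = s[start:end], s[:start] + s[end:]
    return rest[:dest] + block + rest[dest:]

src, dst = "abcdefgh", "efghabcd"
print(levenshtein(src, dst))                # 8 character-level edits
print(move_substring(src, 0, 4, 4) == dst)  # True: a single substring move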

106 citations

Proceedings ArticleDOI
01 Jan 1997
TL;DR: This paper proposes an indexing scheme which is totally based on lengths and relative distances between sequences, and uses vp-trees as the underlying distance-based index structures in its method.
Abstract: In this paper, we consider the problem of efficient matching and retrieval of sequences of different lengths. Most of the previous research is concentrated on similarity matching and retrieval of sequences of the same length using Euclidean distance metric. For similarity matching of sequences, we use a modified version of the edit distance function, and consider two sequences matching if a majority of the elements in the sequences match. In the matching process a mapping among non-matching elements is created to check if there are unacceptable deviations among them. This means that two matching sequences should have lengths that are comparable. For efficient retrieval of matching sequences, we propose an indexing scheme which is totally based on lengths and relative distances between sequences. We use vp-trees as the underlying distance-based index structures in our method.
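As background on the indexing side, here is a minimal Python sketch of a vp-tree with a triangle-inequality range search; it assumes only that the distance is a metric and is not the paper's modified edit-distance function or indexing scheme.

import random

class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point, self.radius = point, radius
        self.inside, self.outside = inside, outside

def build_vp_tree(points, dist):
    if not points:
        return None
    i = random.randrange(len(points))
    vp, rest = points[i], points[:i] + points[i + 1:]
    if not rest:
        return VPNode(vp, 0, None, None)
    radius = sorted(dist(vp, p) for p in rest)[len(rest) // 2]  # median distance
    inside = [p for p in rest if dist(vp, p) <= radius]
    outside = [p for p in rest if dist(vp, p) > radius]
    return VPNode(vp, radius,
                  build_vp_tree(inside, dist), build_vp_tree(outside, dist))

def range_search(node, query, threshold, dist, out):
    """Collect all indexed points within threshold of query, pruning whole
    subtrees with the triangle inequality."""
    if node is None:
        return
    d = dist(query, node.point)
    if d <= threshold:
        out.append(node.point)
    if d - threshold <= node.radius:     # inside ball may still hold matches
        range_search(node.inside, query, threshold, dist, out)
    if d + threshold > node.radius:      # outside shell may still hold matches
        range_search(node.outside, query, threshold, dist, out)

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

tree = build_vp_tree(["survey", "surgery", "serve", "service", "salary"], edit_distance)
hits = []
range_search(tree, "survey", 2, edit_distance, hits)
print(hits)  # strings within edit distance 2 of "survey"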

105 citations


Network Information
Related Topics (5)
Graph (abstract data type): 69.9K papers, 1.2M citations, 86% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Feature vector: 48.8K papers, 954.4K citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 81% related
Scalability: 50.9K papers, 931.6K citations, 80% related
Performance Metrics
No. of papers in the topic in previous years

Year    Papers
2023    39
2022    96
2021    111
2020    149
2019    145
2018    139