Topic

Edit distance

About: Edit distance is a research topic. Over its lifetime, 2,887 publications have appeared within this topic, receiving 71,491 citations.


Papers
Journal Article
TL;DR: TUIUIU is the first filter designed for multiple repeats and for dealing with error rates greater than 10% of the repeat length, and it is particularly useful with large error rates.
Abstract: Identifying local similarity between two or more sequences, or identifying repeats occurring at least twice in a sequence, is an essential part of the analysis of biological sequences and of their phylogenetic relationships. Finding such fragments while allowing for a certain number of insertions, deletions, and substitutions is, however, known to be a computationally expensive task, and consequently exact methods usually cannot be applied in practice. The filter TUIUIU that we introduce in this paper provides a possible solution to this problem. It can be used as a preprocessing step to any multiple alignment or repeats inference method, eliminating a possibly large fraction of the input that is guaranteed not to contain any approximate repeat. It consists of verifying several strong necessary conditions that can be checked quickly. We implemented three versions of the filter. The first is a straightforward extension, to the case of multiple sequences, of conditions already existing in the literature. The second uses a stronger condition which, as our results show, enables filtering noticeably more with negligible (if any) additional time. The third version uses an additional condition and pushes the filtering power even further, at a non-negligible additional time cost in many circumstances; our experiments show that it is particularly useful with large error rates. The latter version was applied as a preprocessing step for a multiple alignment tool, obtaining an overall time (filter plus alignment) on average 63 and at best 530 times smaller than before (direct alignment), with in most cases a better quality alignment. To the best of our knowledge, TUIUIU is the first filter designed for multiple repeats and for dealing with error rates greater than 10% of the repeat length.

20 citations

Proceedings Article
17 Jan 2010
TL;DR: It is proved that any sketching protocol for edit distance achieving a constant approximation requires nearly logarithmic (in the strings' length) communication complexity, and an intimate connection between non-embeddability, sketching and communication complexity is suggested.
Abstract: We prove that any sketching protocol for edit distance achieving a constant approximation requires nearly logarithmic (in the strings' length) communication complexity. This is an exponential improvement over the previous, doubly-logarithmic, lower bound of [Andoni-Krauthgamer, FOCS'07]. Our lower bound also applies to the Ulam distance (edit distance over non-repetitive strings). In this special case, it is polynomially related to the recent upper bound of [Andoni-Indyk-Krauthgamer, SODA'09]. From a technical perspective, we prove a direct-sum theorem for sketching product metrics that is of independent interest. We show that, for any metric X that requires sketch size which is a sufficiently large constant, sketching the max-product metric $\ell_\infty^d(X)$ requires Ω(d) bits. The conclusion, in fact, also holds for arbitrary two-way communication. The proof uses a novel technique for information complexity based on Poincaré inequalities and suggests an intimate connection between non-embeddability, sketching and communication complexity.

20 citations
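
The abstract above describes the Ulam distance as edit distance restricted to non-repetitive strings. As a rough illustration of the quantity being sketched (not of the lower-bound proof), the following computes the distance exactly with the textbook dynamic program. The unit-cost insert/delete/substitute model and the function names `edit_distance` and `ulam_distance` are assumptions made for this sketch; the paper's exact cost model is not stated here.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Unit-cost edit distance (insert, delete, substitute), two-row DP."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]  # deleting the first i symbols of a
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def ulam_distance(a: Sequence, b: Sequence) -> int:
    """Edit distance, defined only when neither input repeats a symbol."""
    for s in (a, b):
        if len(set(s)) != len(s):
            raise ValueError("Ulam distance requires non-repetitive strings")
    return edit_distance(a, b)
```

This exact computation takes time proportional to the product of the two lengths and needs both strings in one place, which is the cost that sketching protocols try to avoid.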

Book Chapter
25 Jun 2012
TL;DR: The State Set Index (SSI) is introduced, based on a trie (prefix index) that is interpreted as a nondeterministic finite automaton, and implements a novel state labeling strategy making the index highly space-efficient.
Abstract: String similarity search is required by many real-life applications, such as spell checking, data cleansing, fuzzy keyword search, or comparison of DNA sequences. Given a very large string set and a query string, the string similarity search problem is to efficiently find all strings in the string set that are similar to the query string. Similarity is defined using a similarity (or distance) measure, such as edit distance or Hamming distance. In this paper, we introduce the State Set Index (SSI) as an efficient solution for this search problem. SSI is based on a trie (prefix index) that is interpreted as a nondeterministic finite automaton. SSI implements a novel state labeling strategy making the index highly space-efficient. Furthermore, SSI's space consumption can be gracefully traded against search time. We evaluated SSI on different sets of person names with up to 170 million strings from a social network and compared it to other state-of-the-art methods. We show that in the majority of cases, SSI is significantly faster than other tools and requires less index space.

20 citations
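
To make the string similarity search problem from the abstract above concrete, here is a minimal sketch of a classic trie-based search under an edit distance threshold. It is not the State Set Index: it has no state labeling or space optimization, and it simply walks a plain trie while carrying one row of the edit distance table per node. The names `TrieNode`, `build_trie`, and `search` are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None  # set when a stored string ends here


def build_trie(words: List[str]) -> TrieNode:
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root


def search(root: TrieNode, query: str, k: int) -> List[Tuple[str, int]]:
    """Return (word, distance) for every stored word within edit distance k."""
    results: List[Tuple[str, int]] = []
    first_row = list(range(len(query) + 1))  # empty prefix vs. query prefixes
    if root.word is not None and first_row[-1] <= k:
        results.append((root.word, first_row[-1]))

    def walk(node: TrieNode, ch: str, prev_row: List[int]) -> None:
        # Extend the DP table by one row for the trie edge labeled ch.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1,            # insertion
                           prev_row[j] + 1,           # deletion
                           prev_row[j - 1] + cost))   # match or substitution
        if node.word is not None and row[-1] <= k:
            results.append((node.word, row[-1]))
        # Prune: row minima never decrease along a path, so a row whose
        # minimum exceeds k cannot lead to a match further down.
        if min(row) <= k:
            for c, child in node.children.items():
                walk(child, c, row)

    for c, child in root.children.items():
        walk(child, c, first_row)
    return sorted(results)
```

For example, with the stored names `["smith", "smyth", "smithe", "jones", "smit"]`, calling `search(build_trie(names), "smith", 1)` returns every name except `"jones"`, each paired with its distance. Sharing prefixes in the trie means the table rows for a common prefix are computed once rather than once per string.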

Journal Article
TL;DR: A novel unidimensional measure is introduced that is proven to be metric and satisfies a number of qualitative prerequisites that previous measures do not, and that is effective in evaluating scene segmentation techniques and in helping to optimize their parameters.
Abstract: In this paper, a novel approach to evaluating video temporal decomposition algorithms is presented. The evaluation measures typically used to this end are nonlinear combinations of precision-recall or coverage-overflow, which are not metrics and additionally possess undesirable properties, such as nonsymmetricity. To alleviate these drawbacks, we introduce a novel unidimensional measure that is proven to be metric and satisfies a number of qualitative prerequisites that previous measures do not. This measure is named differential edit distance (DED), since it can be seen as a variation of the well-known edit distance. After defining DED, we further introduce an algorithm that computes it in less than cubic time. Finally, DED is extensively compared with state-of-the-art measures, namely, the harmonic means (F-score) of precision-recall and coverage-overflow. The experiments include comparisons of qualitative properties, the time required for optimizing the parameters of scene segmentation algorithms with the help of these measures, and a user study gauging the agreement of these measures with the users' assessment of the segmentation results. The results confirm that the proposed measure is a unidimensional metric that is effective in evaluating scene segmentation techniques and in helping to optimize their parameters.

20 citations

Proceedings Article
01 Jan 2007
TL;DR: An improved comparison method is provided based on the concept of tree edit distance, introducing the notion of commonality between sub-trees, which yields better similarity results with respect to alternative methods, while maintaining quadratic time complexity.
Abstract: Developing efficient techniques for comparing XML-based documents has become essential in the database and information retrieval communities. Various algorithms for comparing hierarchically structured data, e.g. XML documents, have been proposed in the literature. Most of them make use of techniques for finding the edit distance between tree structures, XML documents being modeled as ordered labeled trees. Nevertheless, a thorough investigation of current approaches led us to identify several unaddressed structural similarities, i.e. sub-tree related similarities, while comparing XML documents. In this paper, we provide an improved comparison method to deal with such resemblances. Our approach is based on the concept of tree edit distance, introducing the notion of commonality between sub-trees. Experiments demonstrate that our approach yields better similarity results with respect to alternative methods, while maintaining quadratic time complexity.

20 citations


Network Information
Related Topics (5)
Graph (abstract data type): 69.9K papers, 1.2M citations, 86% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Feature vector: 48.8K papers, 954.4K citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 81% related
Scalability: 50.9K papers, 931.6K citations, 80% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  39
2022  96
2021  111
2020  149
2019  145
2018  139