Topic

Edit distance

About: Edit distance is a research topic. Over its lifetime, 2,887 publications have been published within this topic, receiving 71,491 citations.
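For readers skimming the papers below, the underlying measure is usually the Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal dynamic-programming sketch with unit costs (the function name is illustrative and not tied to any paper on this page):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b with unit costs."""
    # dp[j] holds the distance between a[:i] and b[:j] for the current row i.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete ca
                                     dp[j - 1] + 1,      # insert cb
                                     prev + (ca != cb))  # substitute or match
    return dp[len(b)]

# edit_distance("kitten", "sitting") == 3
```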


Papers
Journal ArticleDOI
01 Dec 2011
TL;DR: Introduces the class of LRH (Left-Right-Heavy) algorithms, which includes RTED and the fastest tree edit distance algorithms presented in the literature, and proves that RTED outperforms all previously proposed LRH algorithms in terms of runtime complexity.
Abstract: We consider the classical tree edit distance between ordered labeled trees, which is defined as the minimum-cost sequence of node edit operations that transform one tree into another. The state-of-the-art solutions for the tree edit distance are not satisfactory. The main competitors in the field either have optimal worst-case complexity, but the worst case happens frequently, or they are very efficient for some tree shapes, but degenerate for others. This leads to unpredictable and often infeasible runtimes. There is no obvious way to choose between the algorithms. In this paper we present RTED, a robust tree edit distance algorithm. The asymptotic complexity of RTED is smaller or equal to the complexity of the best competitors for any input instance, i.e., RTED is both efficient and worst-case optimal. We introduce the class of LRH (Left-Right-Heavy) algorithms, which includes RTED and the fastest tree edit distance algorithms presented in the literature. We prove that RTED outperforms all previously proposed LRH algorithms in terms of runtime complexity. In our experiments on synthetic and real-world data we empirically evaluate our solution and compare it to the state of the art.
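For context on what is being computed above: the classical tree edit distance admits a direct recursion over ordered forests that deletes, inserts, or renames the rightmost root. The memoized sketch below implements that textbook recursion with unit costs; it is not RTED (no LRH strategy, no optimized runtime), and the (label, children) tuple representation is an assumption for illustration only.

```python
from functools import lru_cache

# A tree is a (label, children) pair; a forest is a tuple of trees.
def tree(label, *children):
    return (label, tuple(children))

@lru_cache(maxsize=None)
def forest_dist(F, G):
    """Unit-cost edit distance between two ordered forests."""
    if not F and not G:
        return 0
    if not G:                             # delete the rightmost root of F
        _, kids = F[-1]
        return forest_dist(F[:-1] + kids, G) + 1
    if not F:                             # insert the rightmost root of G
        _, kids = G[-1]
        return forest_dist(F, G[:-1] + kids) + 1
    (fl, fk), (gl, gk) = F[-1], G[-1]
    return min(
        forest_dist(F[:-1] + fk, G) + 1,      # delete the rightmost root of F
        forest_dist(F, G[:-1] + gk) + 1,      # insert the rightmost root of G
        forest_dist(fk, gk)                   # match the two rightmost roots:
        + forest_dist(F[:-1], G[:-1])         #   child forests, remaining forests,
        + (0 if fl == gl else 1))             #   plus the rename cost

def tree_edit_distance(t1, t2):
    return forest_dist((t1,), (t2,))

# Example: inserting node "d" under "b" costs one operation.
# tree_edit_distance(tree("a", tree("b"), tree("c")),
#                    tree("a", tree("b", tree("d")), tree("c"))) == 1
```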

176 citations

Book ChapterDOI
09 Sep 1991
TL;DR: Describes a scheme in which T is first preprocessed to make subsequent searches with different patterns P fast, in order to find all approximate occurrences P′ of a pattern string P in a text string T such that the edit distance between P and P′ is ≤ k.
Abstract: The problem of finding all approximate occurrences P′ of a pattern string P in a text string T such that the edit distance between P and P′ is ≤ k is considered. We concentrate on a scheme in which T is first preprocessed to make the subsequent searches with different P fast. Two preprocessing methods and the corresponding search algorithms are described. The first is based on suffix automata and is applicable for edit distances with general edit operation costs. The second is a special design for the unit-cost edit distance and is based on q-gram lists. In both cases the preprocessing needs time and space O(|T|). The search algorithms run in the worst case in time O(|P||T|) or O(k|T|), and in the best case in time O(|P|).
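The O(|P||T|) worst-case bound quoted above is the cost of the classical column dynamic program for approximate string matching, which the two preprocessing schemes are designed to beat in practice. A minimal sketch of that baseline search (unit costs; this is not the suffix-automaton or q-gram method from the paper):

```python
def approx_occurrences(P: str, T: str, k: int):
    """End positions j (1-based) in T where some substring of T ending at j
    is within unit-cost edit distance k of P.  Runs in O(|P||T|) time."""
    m = len(P)
    # col[i] = distance between P[:i] and the best substring of T ending
    # at the current text position.
    col = list(range(m + 1))
    hits = []
    for j, c in enumerate(T, 1):
        prev, col[0] = col[0], 0          # an occurrence may start anywhere in T
        for i in range(1, m + 1):
            best = min(col[i] + 1,                # c left unmatched
                       col[i - 1] + 1,            # P[i-1] left unmatched
                       prev + (P[i - 1] != c))    # align P[i-1] with c
            prev, col[i] = col[i], best
        if col[m] <= k:
            hits.append(j)
    return hits

# approx_occurrences("survey", "surgery", 2) reports the end positions of
# substrings of "surgery" within two edits of "survey".
```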

175 citations

Proceedings ArticleDOI
14 Jun 2005
TL;DR: This paper proposes to transform tree-structured data into an approximate numerical multidimensional vector that encodes the original structure information, and proves that the L1 distance of the corresponding vectors, computable in O(|T1| + |T2|) time, forms a lower bound for the edit distance between trees.
Abstract: Tree-structured data are becoming ubiquitous nowadays and manipulating them based on similarity is essential for many applications. The generally accepted similarity measure for trees is the edit distance. Although similarity search has been extensively studied, searching for similar trees is still an open problem due to the high complexity of computing the tree edit distance. In this paper, we propose to transform tree-structured data into an approximate numerical multidimensional vector which encodes the original structure information. We prove that the L1 distance of the corresponding vectors, whose computational complexity is O(|T1| + |T2|), forms a lower bound for the edit distance between trees. Based on the theoretical analysis, we describe a novel algorithm which embeds the proposed distance into a filter-and-refine framework to process similarity search on tree-structured data. The experimental results show that our algorithm dramatically reduces the distance computation cost. Our method is especially suitable for accelerating similarity query processing on large trees in massive datasets.
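The paper's specific vector embedding and the exact constant of its bound are not reproduced here. As a simpler stand-in that illustrates the same filter-and-refine principle, the L1 distance between node-label histograms also lower-bounds the unit-cost tree edit distance, since an insertion or deletion shifts the histogram by one and a rename by at most two. A hedged sketch, reusing a (label, children) tuple representation:

```python
from collections import Counter

def label_histogram(t):
    """Multiset of node labels of a (label, children) tree."""
    label, children = t
    hist = Counter([label])
    for child in children:
        hist += label_histogram(child)
    return hist

def histogram_lower_bound(t1, t2):
    """Cheap lower bound: TED(t1, t2) >= ceil(L1(hist(t1), hist(t2)) / 2)."""
    h1, h2 = label_histogram(t1), label_histogram(t2)
    l1 = sum(abs(h1[key] - h2[key]) for key in set(h1) | set(h2))
    return (l1 + 1) // 2

# Filter-and-refine: compute the exact (expensive) tree edit distance only for
# pairs whose lower bound does not already exceed the query threshold.
```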

174 citations

Journal ArticleDOI
TL;DR: This work presents an efficient read mapping tool called RazerS, which allows the user to align sequencing reads of arbitrary length using either the Hamming distance or the edit distance and guarantees not to lose more reads than specified.
Abstract: Second-generation sequencing technologies deliver DNA sequence data at unprecedented high throughput. Common to most biological applications is a mapping of the reads to an almost identical or highly similar reference genome. Due to the large amounts of data, efficient algorithms and implementations are crucial for this task. We present an efficient read mapping tool called RazerS. It allows the user to align sequencing reads of arbitrary length using either the Hamming distance or the edit distance. Our tool can work either lossless or with a user-defined loss rate at higher speeds. Given the loss rate, we present an approach that guarantees not to lose more reads than specified. This enables the user to adapt to the problem at hand and provides a seamless tradeoff between sensitivity and running time. [RazerS is freely available at http://www.seqan.de/projects/razers.html.]
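RazerS's filtration machinery is not reproduced here; as a hedged illustration of the two distance modes the abstract mentions, the sketch below verifies a read against a reference window under at most k mismatches (Hamming) or within edit distance k via a banded dynamic program. The function names and the banded verification are assumptions for illustration, not RazerS's implementation.

```python
def within_hamming(read: str, window: str, k: int) -> bool:
    """True if read and an equal-length reference window differ in <= k positions."""
    if len(read) != len(window):
        return False
    mismatches = 0
    for a, b in zip(read, window):
        mismatches += a != b
        if mismatches > k:
            return False
    return True

def within_edit(read: str, window: str, k: int) -> bool:
    """True if the unit-cost edit distance between read and window is <= k.
    Only cells with |i - j| <= k are filled (banded DP, O(k * length))."""
    n, m = len(read), len(window)
    if abs(n - m) > k:
        return False
    BIG = k + 1                      # any value > k means "no alignment in budget"
    prev = [j if j <= k else BIG for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [BIG] * (m + 1)
        if i <= k:
            cur[0] = i
        for j in range(max(1, i - k), min(m, i + k) + 1):
            cur[j] = min(prev[j] + 1,                               # delete read[i-1]
                         cur[j - 1] + 1,                            # insert window[j-1]
                         prev[j - 1] + (read[i - 1] != window[j - 1]))
        prev = cur
    return prev[m] <= k
```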

173 citations

Journal ArticleDOI
01 Nov 2011
TL;DR: This paper studies string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold, using a partition-based method called Pass-Join.
Abstract: As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this paper, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long strings, and there is no algorithm that can efficiently and adaptively support both short strings and long strings. To address this problem, we propose a partition-based method called Pass-Join. Pass-Join partitions a string into a set of segments and creates inverted indices for the segments. Then for each string, Pass-Join selects some of its substrings and uses the selected substrings to find candidate pairs using the inverted indices. We devise efficient techniques to select the substrings and prove that our method can minimize the number of selected substrings. We develop novel pruning techniques to efficiently verify the candidate pairs. Experimental results show that our algorithms are efficient for both short strings and long strings, and outperform state-of-the-art methods on real datasets.
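The filtering idea behind partition-based joins is the pigeonhole principle: split a string into τ + 1 segments, and if its edit distance to another string is at most τ, at least one segment must occur verbatim in that other string. A minimal sketch of this filter follows; the even partition and candidate test are illustrative only, whereas Pass-Join additionally constrains segment positions and lengths and probes inverted indices over the segments.

```python
def even_partition(s: str, tau: int):
    """Split s into tau + 1 segments of (nearly) equal length."""
    parts = tau + 1
    base, extra = divmod(len(s), parts)
    segments, pos = [], 0
    for i in range(parts):
        length = base + (1 if i < extra else 0)
        segments.append(s[pos:pos + length])
        pos += length
    return segments

def may_be_similar(s: str, t: str, tau: int) -> bool:
    """Pigeonhole filter: if ed(s, t) <= tau, some segment of s occurs in t.
    Pairs that survive still require exact edit-distance verification."""
    if abs(len(s) - len(t)) > tau:
        return False
    return any(segment in t for segment in even_partition(s, tau))

# may_be_similar("passjoin", "pasjoin", 1) is True: the segment "join"
# occurs verbatim in "pasjoin", so the pair goes on to verification.
```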

172 citations


Network Information
Related Topics (5)
Graph (abstract data type): 69.9K papers, 1.2M citations, 86% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Feature vector: 48.8K papers, 954.4K citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 81% related
Scalability: 50.9K papers, 931.6K citations, 80% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    39
2022    96
2021    111
2020    149
2019    145
2018    139