Topic

Edit distance

About: Edit distance is a research topic. Over its lifetime, 2887 publications have been published within this topic, receiving 71491 citations.

Papers

Journal Article•DOI•

Eike Schallehn¹, Kai-Uwe Sattler¹, Gunter Saake¹•Institutions (1)

Otto-von-Guericke University Magdeburg¹

01 Mar 2004

TL;DR: Similarity-based variants of the grouping and join operators are presented: the extended grouping operator produces groups of similar tuples, and the extended join combines tuples satisfying a given similarity condition.

Abstract: Dealing with discrepancies in data is still a big challenge in data integration systems. The problem occurs both when eliminating duplicates from semantically overlapping sources and when combining complementary data from different sources. Though using SQL operations like grouping and join seems to be a viable way, they fail if the attribute values of the potential duplicates or related tuples are not equal but only similar by certain criteria. As a solution to this problem, we present in this paper similarity-based variants of grouping and join operators. The extended grouping operator produces groups of similar tuples; the extended join combines tuples satisfying a given similarity condition. We describe the semantics of these operators, discuss efficient implementations for the edit distance similarity and present evaluation results. Finally, we give application examples from the context of a data reconciliation project for looted art.
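
To make the idea of a join on an edit-distance similarity condition concrete, here is a minimal Python sketch. It is not the implementation from the paper: the function names `within_edit_distance` and `similarity_join`, the row layout (dictionaries), and the naive nested-loop strategy are all assumptions chosen for illustration.

```python
from itertools import product


def within_edit_distance(a: str, b: str, k: int) -> bool:
    """Return True if the Levenshtein distance between a and b is at most k."""
    # The distance is at least the length difference, so this is a cheap filter.
    if abs(len(a) - len(b)) > k:
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        # Row minima never decrease, so once every cell exceeds k we can stop.
        if min(curr) > k:
            return False
        prev = curr
    return prev[-1] <= k


def similarity_join(left, right, key, k):
    """Hypothetical nested-loop join: pair rows whose `key` values are within k edits."""
    return [(l, r) for l, r in product(left, right)
            if within_edit_distance(l[key], r[key], k)]


if __name__ == "__main__":
    left = [{"artist": "Rembrandt"}]
    right = [{"artist": "Rembrant"}, {"artist": "Vermeer"}]
    print(similarity_join(left, right, "artist", 1))
```

The nested loop compares every pair of rows, which is only practical for small inputs; the length filter and the early exit merely reduce the cost of each comparison.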

72 citations

Error log processing for accurate failure prediction

Felix Salfner¹, Steffen Tschirpke²•Institutions (2)

International Computer Science Institute¹, Humboldt University of Berlin²

07 Dec 2008

TL;DR: By experiments using data of a commercial telecommunication system, it is shown that data preparation is an important step to achieve accurate error-based online failure prediction.

Abstract: Error logs are a fruitful source of information both for diagnosis and for proactive fault handling; however, elaborate data preparation is necessary to filter out valuable pieces of information. In addition to the usage of well-known techniques, we propose three algorithms: (a) assignment of error IDs to error messages based on Levenshtein's edit distance, (b) a clustering approach to group similar error sequences, and (c) a statistical noise filtering algorithm. By experiments using data of a commercial telecommunication system we show that data preparation is an important step to achieve accurate error-based online failure prediction.
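
Step (a) above can be illustrated with a short Python sketch. The abstract does not give the procedure, so everything here is an assumption: the greedy "first matching representative" rule, the fixed threshold `max_dist`, and the names `levenshtein` and `assign_error_ids` are hypothetical and only show how an edit distance can map similar log messages to a shared ID.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def assign_error_ids(messages, max_dist):
    """Give each message the ID of the first representative within max_dist edits.

    A message that matches no existing representative becomes the
    representative of a new ID (assumed greedy rule, not from the paper).
    """
    representatives = []  # representatives[i] is the first message seen for ID i
    ids = []
    for msg in messages:
        for error_id, rep in enumerate(representatives):
            if levenshtein(msg, rep) <= max_dist:
                ids.append(error_id)
                break
        else:
            representatives.append(msg)
            ids.append(len(representatives) - 1)
    return ids


if __name__ == "__main__":
    log = ["disk 12 full", "disk 47 full", "link down"]
    print(assign_error_ids(log, max_dist=2))
```

With this rule the result depends on the order of the messages and on the threshold, so both would need tuning against real log data.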

72 citations

Dissertation•

Sequence distance embeddings

Graham Cormode

01 Jan 2003

TL;DR: The embeddings are shown to be practical, with a series of large scale experiments which demonstrate that given only a small space, approximate solutions to several similarity and clustering problems can be found that are as good as or better than those found with prior methods.

Abstract: Sequences represent a large class of fundamental objects in Computer Science: sets, strings, vectors and permutations are considered to be sequences. Distances between sequences measure their similarity, and computations based on distances are ubiquitous: either to compute the distance, or to use distance computation as part of a more complex problem. This thesis takes a very specific approach to solving questions of sequence distance: sequences are embedded into other distance measures, so that distance in the new space approximates the original distance. This allows the solution of a variety of problems, including: fast computation of short sketches in a variety of computing models, which allow sequences to be compared in constant time and space irrespective of the size of the original sequences; approximate nearest neighbor and clustering problems, solved significantly faster than by the naive exact solutions; algorithms to find approximate occurrences of pattern sequences in long text sequences in near-linear time; and efficient communication schemes to approximate the distance between, and exchange, sequences in close to the optimal amount of communication. Solutions are given for these problems for a variety of distances, including fundamental distances on sets and vectors; distances inspired by biological problems for permutations; and certain text editing distances for strings. Many of these embeddings are computable in a streaming model where the data is too large to store in memory, and instead has to be processed as and when it arrives, piece by piece. The embeddings are also shown to be practical, with a series of large-scale experiments which demonstrate that given only a small space, approximate solutions to several similarity and clustering problems can be found that are as good as or better than those found with prior methods.

72 citations

Proceedings Article•DOI•

Cache-oblivious dynamic programming

Rezaul Chowdhury, Vijaya Ramachandran

22 Jan 2006

TL;DR: A new cache-oblivious framework called the Gaussian Elimination Paradigm (GEP) is presented for Gaussian elimination without pivoting; it also gives cache-oblivious algorithms for Floyd-Warshall all-pairs shortest paths in graphs and 'simple DP', among other problems.

Abstract: We present efficient cache-oblivious algorithms for several fundamental dynamic programs. These include new algorithms with improved cache performance for longest common subsequence (LCS), edit distance, gap (i.e., edit distance with gaps), and least weight subsequence. We present a new cache-oblivious framework called the Gaussian Elimination Paradigm (GEP) for Gaussian elimination without pivoting that also gives cache-oblivious algorithms for Floyd-Warshall all-pairs shortest paths in graphs and 'simple DP', among other problems.
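
For orientation, the dynamic programs named in this abstract share a simple row-by-row baseline. The Python sketch below shows that baseline for LCS and for the insert/delete-only variant of edit distance. It is the textbook formulation, not the cache-oblivious algorithm of the paper, and the function names `lcs_length` and `indel_distance` are assumptions made for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, keeping only two rows of the table."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a character of a or of b
        prev = curr
    return prev[-1]


def indel_distance(a: str, b: str) -> int:
    """Edit distance when only insertions and deletions are allowed.

    Every character outside a longest common subsequence must be
    deleted from a or inserted from b.
    """
    return len(a) + len(b) - 2 * lcs_length(a, b)


if __name__ == "__main__":
    print(lcs_length("ABCBDAB", "BDCABA"))
    print(indel_distance("ABCBDAB", "BDCABA"))
```

This version fills the table in row-major order, which is exactly the memory access pattern that cache-oblivious formulations set out to improve on.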

72 citations

Book Chapter•DOI•

An optimal decomposition algorithm for tree edit distance

Erik D. Demaine¹, Shay Mozes¹, Benjamin Rossman¹, Oren Weimann¹•Institutions (1)

Massachusetts Institute of Technology¹

09 Jul 2007

TL;DR: The optimality of the algorithm is proved among the family of decomposition strategy algorithms (which also includes the previous fastest algorithms) by tightening the known lower bound of Ω(n² log² n) to Ω(n³), matching the algorithm's running time.

Abstract: The edit distance between two ordered rooted trees with vertex labels is the minimum cost of transforming one tree into the other by a sequence of elementary operations consisting of deleting and relabeling existing nodes, as well as inserting new nodes. In this paper, we present a worst-case O(n³)-time algorithm for this problem, improving the previous best O(n³ log n)-time algorithm [7]. Our result requires a novel adaptive strategy for deciding how a dynamic program divides into subproblems, together with a deeper understanding of the previous algorithms for the problem. We prove the optimality of our algorithm among the family of decomposition strategy algorithms (which also includes the previous fastest algorithms) by tightening the known lower bound of Ω(n² log² n) [4] to Ω(n³), matching our algorithm's running time. Furthermore, we obtain matching upper and lower bounds of Θ(nm²(1 + log(n/m))) when the two trees have sizes m and n where m < n.

71 citations

Network Information

Performance metrics

Papers: 3,030
Citations: 78,281

No. of papers in the topic in previous years
Year	Papers
2023	39
2022	96
2021	111
2020	149
2019	145
2018	139
