Edit distance

About: Edit distance is a research topic. Over its lifetime, 2,887 publications have been published on this topic, receiving 71,491 citations.


Papers
Posted Content
TL;DR: The fastest known algorithm for tree edit distance runs in cubic $O(n^3)$ time and is based on a dynamic programming solution similar to that for string edit distance; this paper gives evidence that a truly subcubic algorithm is unlikely.
Abstract: The edit distance between two rooted ordered trees with $n$ nodes labeled from an alphabet~$\Sigma$ is the minimum cost of transforming one tree into the other by a sequence of elementary operations consisting of deleting and relabeling existing nodes, as well as inserting new nodes. Tree edit distance is a well-known generalization of string edit distance. The fastest known algorithm for tree edit distance runs in cubic $O(n^3)$ time and is based on a dynamic programming solution similar to that for string edit distance. In this paper we show that a truly subcubic $O(n^{3-\varepsilon})$ time algorithm for tree edit distance is unlikely: For $|\Sigma| = \Omega(n)$, a truly subcubic algorithm for tree edit distance implies a truly subcubic algorithm for the all pairs shortest paths problem. For $|\Sigma| = O(1)$, a truly subcubic algorithm for tree edit distance implies an $O(n^{k-\varepsilon})$ algorithm for finding a maximum weight $k$-clique. Thus, while in terms of upper bounds string edit distance and tree edit distance are highly related, in terms of lower bounds string edit distance exhibits the hardness of the strong exponential time hypothesis [Backurs, Indyk STOC'15] whereas tree edit distance exhibits the hardness of all pairs shortest paths. Our result provides a matching conditional lower bound for one of the last remaining classic dynamic programming problems.

19 citations
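
For reference, the classic string edit distance dynamic program that the cubic tree edit distance algorithm generalizes can be sketched in a few lines; unit costs for insertion, deletion, and relabeling are assumed here.

```python
def string_edit_distance(a: str, b: str) -> int:
    """Classic O(len(a) * len(b)) dynamic program for edit (Levenshtein) distance."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances from the empty prefix of a to every prefix of b
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                           # delete a[i-1]
                         cur[j - 1] + 1,                        # insert b[j-1]
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # relabel a[i-1] if it differs
        prev = cur
    return prev[n]

print(string_edit_distance("kitten", "sitting"))  # 3
```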

Proceedings ArticleDOI
01 Dec 2011
TL;DR: It is shown that morpholexical and n-best-list features are effective in improving the accuracy of the system (by 0.8%); the proposed methods are evaluated on a Turkish broadcast news transcription task.
Abstract: This paper explores rich morphological and novel n-best-list features for reranking automatic speech recognition hypotheses. The morpholexical features are defined over the morphological features obtained by using an n-gram language model over lexical and grammatical morphemes in the first-pass. The n-best-list features for each hypothesis are defined using that hypothesis and other alternate hypotheses in an n-best list. Our methodology is to align each hypothesis with other hypotheses one by one using minimum edit distance alignment. This gives us a set of edit operations - substitution, addition and deletion as seen in these alignments. These edit operations constitute our n-best-list features as indicator features. The reranking model is trained using a word error rate sensitive averaged perceptron algorithm introduced in this paper. The proposed methods are evaluated on a Turkish broadcast news transcription task. The baseline systems are word and statistical sub-word systems which also employ morphological features for reranking. We show that morpholexical and n-best-list features are effective in improving the accuracy of the system (0.8%).

19 citations
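
As an illustration of the minimum edit distance alignment step described above, the sketch below aligns one hypothesis against the other hypotheses in an n-best list and collects the resulting substitution, insertion, and deletion operations as indicator-style features. The function names and the (operation, word) feature encoding are illustrative assumptions, not the paper's implementation.

```python
def edit_ops(hyp, other):
    """Word-level minimum edit distance alignment of two hypotheses; returns the
    substitution, deletion, and insertion operations seen in the alignment."""
    m, n = len(hyp), len(other)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,                                 # deletion
                           dp[i][j - 1] + 1,                                 # insertion
                           dp[i - 1][j - 1] + (hyp[i - 1] != other[j - 1]))  # match / substitution
    ops, i, j = [], m, n  # backtrace to recover the edit operations
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (hyp[i - 1] != other[j - 1]):
            if hyp[i - 1] != other[j - 1]:
                ops.append(("sub", hyp[i - 1], other[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("del", hyp[i - 1]))
            i -= 1
        else:
            ops.append(("ins", other[j - 1]))
            j -= 1
    return ops

def nbest_indicator_features(hypothesis, nbest):
    """Hypothetical indicator features: counts of edit operations observed when
    aligning `hypothesis` one by one against the other hypotheses in the list."""
    feats = {}
    for other in nbest:
        if other == hypothesis:
            continue
        for op in edit_ops(hypothesis.split(), other.split()):
            feats[op] = feats.get(op, 0) + 1
    return feats

nbest = ["the cat sat on the mat", "the cat sat on a mat", "a cat sat on the mat"]
print(nbest_indicator_features(nbest[0], nbest))
```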

Book ChapterDOI
21 Jul 2004
TL;DR: A novel solution is proposed for error-tolerant graph matching by extending the original edit-distance-based framework to account for a new operator that supports node merging during the matching process.
Abstract: In this paper a novel solution is proposed for error-tolerant graph matching. The solution belongs to the class of edit-distance-based techniques. In particular, the original edit-distance-based framework is extended so as to account for a new operator to support node merging during the matching process.

19 citations

Proceedings ArticleDOI
01 Apr 2014
TL;DR: This paper investigates and evaluates the use of several matching algorithms, including the edit distance algorithm that is believed to be at the heart of most modern commercial translation memory systems, and shows how well various matching algorithms correlate with human judgments of helpfulness.
Abstract: Translation Memory (TM) systems are one of the most widely used translation technologies. An important part of TM systems is the matching algorithm that determines what translations get retrieved from the bank of available translations to assist the human translator. Although detailed accounts of the matching algorithms used in commercial systems can’t be found in the literature, it is widely believed that edit distance algorithms are used. This paper investigates and evaluates the use of several matching algorithms, including the edit distance algorithm that is believed to be at the heart of most modern commercial TM systems. This paper presents results showing how well various matching algorithms correlate with human judgments of helpfulness (collected via crowdsourcing with Amazon’s Mechanical Turk). A new algorithm based on weighted n-gram precision that can be adjusted for translator length preferences consistently returns translations judged to be most helpful by translators for multiple domains and language pairs.

19 citations
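
The following is a minimal sketch of a weighted n-gram precision matcher with a length-preference adjustment, in the spirit of the algorithm described above; the uniform weights, the penalty form, and all names are assumptions made for illustration rather than the formula evaluated in the paper.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def weighted_ngram_precision(query, candidate, max_n=4, weights=None, length_penalty=0.5):
    """Score a TM candidate against the query segment by weighted n-gram precision,
    damped by a length-difference penalty that encodes a length preference."""
    q, c = query.split(), candidate.split()
    weights = weights or [1.0 / max_n] * max_n  # uniform weights by default (assumption)
    score = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(c, n), ngrams(q, n)
        total = sum(cand.values())
        if total == 0:
            continue
        matched = sum(min(count, ref[g]) for g, count in cand.items())
        score += weights[n - 1] * matched / total
    # Penalize candidates whose length differs from the query; increasing
    # length_penalty favours length-matched candidates more strongly.
    diff = abs(len(q) - len(c)) / max(len(q), len(c), 1)
    return score * (1.0 - length_penalty * diff)

tm_bank = ["the contract shall enter into force", "the agreement enters into force today"]
query = "the contract enters into force"
print(max(tm_bank, key=lambda seg: weighted_ngram_precision(query, seg)))
```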

Proceedings ArticleDOI
14 Nov 2005
TL;DR: This paper studies the problem of computing the smallest edit distance between any pair of distinct words of a regular language, where the edit distance between two words is the smallest number of substitutions, insertions, and deletions needed to transform one word into the other.
Abstract: The edit distance (or Levenshtein distance) between two words is the smallest number of substitutions, insertions, and deletions of symbols that can be used to transform one of the words into the other. In this paper we consider the problem of computing the edit distance of a regular language (also known as a constraint system), that is, the set of words accepted by a given finite automaton. This quantity is the smallest edit distance between any pair of distinct words of the language. We show that the problem is of polynomial time complexity. We distinguish two cases depending on whether the given automaton is deterministic or nondeterministic. In the latter case the time complexity is higher.

19 citations
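
To make the quantity concrete, the sketch below computes the "edit distance of a regular language" for a toy DFA by brute force: it enumerates the accepted words up to a length bound and takes the minimum pairwise edit distance over distinct words. This exponential enumeration and the function names are only for illustration; it is not the polynomial-time algorithm of the paper.

```python
from itertools import product, combinations

def levenshtein(a, b):
    """Standard dynamic program for the edit distance between two words."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def inner_edit_distance(alphabet, start, accept, delta, max_len=6):
    """Brute-force edit distance of the language of a DFA: the minimum edit
    distance over all pairs of distinct accepted words of length <= max_len.
    `delta` maps (state, symbol) -> state."""
    accepted = []
    for length in range(max_len + 1):
        for word in product(alphabet, repeat=length):
            state = start
            for sym in word:
                state = delta.get((state, sym))
                if state is None:
                    break
            else:
                if state in accept:
                    accepted.append("".join(word))
    return min(levenshtein(u, v) for u, v in combinations(accepted, 2))

# Toy DFA over {a, b} accepting exactly the words whose length is a multiple of 3.
delta = {(s, c): (s + 1) % 3 for s in range(3) for c in "ab"}
print(inner_edit_distance("ab", start=0, accept={0}, delta=delta, max_len=4))  # 1
```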


Network Information
Related Topics (5)
Graph (abstract data type): 69.9K papers, 1.2M citations, 86% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Feature vector: 48.8K papers, 954.4K citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 81% related
Scalability: 50.9K papers, 931.6K citations, 80% related
Performance Metrics
No. of papers in the topic in previous years:

Year  Papers
2023      39
2022      96
2021     111
2020     149
2019     145
2018     139