
Edit distance

About: Edit distance is a research topic. Over its lifetime, 2,887 publications on this topic have received 71,491 citations.


Papers
Journal ArticleDOI
TL;DR: This paper focuses on variations of the unordered tree edit distance that are tractable by algorithms built on a network-algorithm submodule, either the minimum-cost maximum-flow algorithm or the maximum weighted bipartite matching algorithm, and shows that the two network algorithms are interchangeable.
Abstract: In this paper, we investigate the problem of computing structural sensitive variations of an unordered tree edit distance. First, we focus on the variations tractable by the algorithms including the submodule of a network algorithm, either the minimum cost maximum flow algorithm or the maximum weighted bipartite matching algorithm. Then, we show that both network algorithms are replaceable, and hence the time complexity of computing these variations can be reduced to O(nmd) time, where n is the number of nodes in a tree, m is the number of nodes in another tree and d is the minimum degree of given two trees. Next, we show that the problem of computing the bottom-up distance is MAX SNP-hard. Note that the well-known linear-time algorithm for the bottom-up distance designed by Valiente (2001) computes just a bottom-up indel (insertion-deletion) distance allowing no substitutions.
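The maximum weighted bipartite matching submodule mentioned in the abstract can be illustrated with a brute-force sketch. This is an exponential-time illustration for tiny instances only (the paper's algorithms use polynomial-time matching routines); the function name and matrix representation here are illustrative choices, not taken from the paper:

```python
from itertools import permutations

def max_weight_matching(weights):
    """Brute-force maximum-weight bipartite matching for a small square
    weight matrix (rows = left nodes, columns = right nodes).
    Returns (best total weight, assignment as a tuple where entry i is
    the right node matched to left node i)."""
    n = len(weights)
    best, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(weights[i][perm[i]] for i in range(n))
        if total > best:
            best, best_perm = total, perm
    return best, best_perm
```

In tree edit distance algorithms, the weight matrix would hold the (negated) costs of matching subtrees of one node's children against another's; the matching picks the cheapest overall child correspondence.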

17 citations

Proceedings Article
01 May 2004
TL;DR: This paper explores the link between legitimate translation variation and statistical measures of a word's salience within a given document, such as tf.idf scores, and shows that the use of such scores extends the N-gram distance measures in a way that allows us to accurately predict multiple quality parameters of the text.
Abstract: Automatic methods for MT evaluation are often based on the assumption that MT quality is related to some kind of distance between the evaluated text and a professional human translation (e.g., an edit distance or the precision of matched N-grams). However, independently produced human translations are necessarily different, conveying the same content by dissimilar means. Such legitimate translation variation is a serious problem for distance-based evaluation methods, because mismatches do not necessarily mean degradation in MT quality. In this paper we explore the link between legitimate translation variation and statistical measures of a word's salience within a given document, such as tf.idf scores. We show that the use of such scores extends the N-gram distance measures in a way that allows us to accurately predict multiple quality parameters of the text, such as translation adequacy and fluency. However, legitimate translation variation also reveals fundamental limits on the applicability of distance-based MT evaluation methods and on data-driven architectures for MT.
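As a reference point for the tf.idf salience scores the paper builds on, here is a minimal sketch of the standard formulation (term frequency times log inverse document frequency); the paper's exact weighting and smoothing scheme may differ:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf.idf scores for each word in each document.
    docs: list of token lists. Returns one {word: score} dict per document.
    A word occurring in every document gets score 0 (log(N/N) = 0),
    reflecting that it carries no document-specific salience."""
    n = len(docs)
    # document frequency: in how many documents each word appears
    df = Counter(w for doc in docs for w in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        scores.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return scores
```

Under this weighting, N-gram matches on high-tf.idf words count for more than matches on ubiquitous function words, which is the intuition behind extending N-gram distance measures with salience.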

17 citations

Posted Content
TL;DR: This paper presents the first efficient deterministic protocol for this problem, where efficiency is measured in both message size and running time, and derives from it an error-correcting code that remains efficient even for large numbers of (adversarial) edit errors.
Abstract: Suppose that we have two parties that possess each a binary string. Suppose that the length of the first string (document) is $n$ and that the two strings (documents) have edit distance (minimal number of deletes, inserts and substitutions needed to transform one string into the other) at most $k$. The problem we want to solve is to devise an efficient protocol in which the first party sends a single message that allows the second party to guess the first party's string. In this paper we show an efficient deterministic protocol for this problem. The protocol runs in time $O(n\cdot \mathtt{polylog}(n))$ and has message size $O(k^2+k\log^2n)$ bits. To the best of our knowledge, ours is the first efficient deterministic protocol for this problem, if efficiency is measured in both the message size and the running time. As an immediate application of our new protocol, we show a new error correcting code that is efficient even for large numbers of (adversarial) edit errors.
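The string edit distance defined in this abstract (the minimal number of deletes, inserts, and substitutions needed to transform one string into the other) is the classic Levenshtein distance, computable by dynamic programming. A minimal sketch (quadratic time, not the communication-efficient protocol of the paper):

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t: the minimum number
    of insertions, deletions, and substitutions turning s into t."""
    m, n = len(s), len(t)
    # prev[j] = distance between s[:i-1] and t[:j]; row 0 is 0..n
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete s[i-1]
                          curr[j - 1] + 1,     # insert t[j-1]
                          prev[j - 1] + cost)  # substitute or match
        prev = curr
    return prev[n]
```

The protocol in the paper avoids computing this table directly: the whole point is that the sender transmits only O(k² + k log² n) bits rather than the full string.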

17 citations

Proceedings ArticleDOI
22 Jun 2013
TL;DR: This paper proposes the first index structure for subtree similarity-search, provided that the unit cost function is used; extensive experimentation and comparison to previous work show the large improvement gained when using the proposed index structure and processing algorithm.
Abstract: Given a tree Q and a large set of trees T = {T1,...,Tn}, the subtree similarity-search problem is that of finding the subtrees of trees among T that are most similar to Q, using the tree edit distance metric. Determining similarity using tree edit distance has been proven useful in a variety of application areas. While subtree similarity-search has been studied in the past, solutions required traversal of all of T, which poses a severe bottleneck in processing time, as T grows larger. This paper proposes the first index structure for subtree similarity-search, provided that the unit cost function is used. Extensive experimentation and comparison to previous work shows the huge improvement gained when using the proposed index structure and processing algorithm.

17 citations

Eva Forsbom1
01 Jan 2003
TL;DR: Preliminary experiments showed that the measures are not portable without redefinitions, so two new measures are defined, WAFT and NEVA, which could be applied for both purposes and granularities.
Abstract: Two string comparison measures, edit distance and n-gram co-occurrence, are tested for automatic evaluation of translation quality, where the quality is compared to one or several reference translations. The measures are tested in combination for diagnost…

17 citations


Network Information
Related Topics (5)
Graph (abstract data type): 69.9K papers, 1.2M citations, 86% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Feature vector: 48.8K papers, 954.4K citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 81% related
Scalability: 50.9K papers, 931.6K citations, 80% related
Performance Metrics
No. of papers in the topic in previous years:

Year  Papers
2023  39
2022  96
2021  111
2020  149
2019  145
2018  139