Topic

Edit distance

About: Edit distance is a research topic. Over its lifetime, 2,887 publications have been published within this topic, receiving 71,491 citations.

Papers

Proceedings Article

Adaptive duplicate detection using learnable string similarity measures

Mikhail Bilenko¹, Raymond J. Mooney¹

University of Texas at Austin¹

24 Aug 2003

TL;DR: This paper proposes to employ learnable text distance functions for each database field, and shows that such measures are capable of adapting to the specific notion of similarity that is appropriate for the field's domain.

Abstract: The problem of identifying approximately duplicate records in databases is an essential step for data cleaning and data integration processes. Most existing approaches have relied on generic or manually tuned distance metrics for estimating the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity. We propose to employ learnable text distance functions for each database field, and show that such measures are capable of adapting to the specific notion of similarity that is appropriate for the field's domain. We present two learnable text similarity measures suitable for this task: an extended variant of learnable string edit distance, and a novel vector-space based measure that employs a Support Vector Machine (SVM) for training. Experimental results on a range of datasets show that our framework can improve duplicate detection accuracy over traditional techniques.

1,020 citations

Journal Article

Learning string-edit distance

Eric Sven Ristad¹, Peter N. Yianilos¹

Princeton University¹

01 May 1998, IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: The stochastic model allows us to learn a string-edit distance function from a corpus of examples and is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.

Abstract: In many applications, it is necessary to determine the similarity of two strings. A widely-used notion of string similarity is the edit distance: the minimum number of insertions, deletions, and substitutions required to transform one string into the other. In this report, we provide a stochastic model for string-edit distance. Our stochastic model allows us to learn a string-edit distance function from a corpus of examples. We illustrate the utility of our approach by applying it to the difficult problem of learning the pronunciation of words in conversational speech. In this application, we learn a string-edit distance with nearly one-fifth the error rate of the untrained Levenshtein distance. Our approach is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.

897 citations
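
The Ristad and Yianilos abstract above defines edit distance as the minimum number of insertions, deletions, and substitutions needed to transform one string into the other. A minimal Python sketch of that untrained (Levenshtein) baseline follows. It assumes unit cost for every operation and does not implement the paper's learned stochastic model; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b.

    Assumption: every operation costs 1 (the untrained baseline).
    """
    # Row 0 of the dynamic-programming table: turning "" into b[:j]
    # takes j insertions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Column 0: turning a[:i] into "" takes i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]
```

For example, `levenshtein("kitten", "sitting")` returns 3 (two substitutions and one insertion). Keeping only two rows of the table holds memory at O(len(b)) while the running time stays O(len(a) * len(b)).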

Proceedings Article

Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources

Bill Dolan¹, Chris Quirk¹, Chris Brockett¹

Microsoft¹

23 Aug 2004

TL;DR: Investigation of unsupervised techniques for acquiring monolingual sentence-level paraphrases from a corpus of temporally and topically clustered news articles collected from thousands of web-based news sources shows that edit distance data is cleaner and more easily-aligned than the heuristic data.

Abstract: We investigate unsupervised techniques for acquiring monolingual sentence-level paraphrases from a corpus of temporally and topically clustered news articles collected from thousands of web-based news sources. Two techniques are employed: (1) simple string edit distance, and (2) a heuristic strategy that pairs initial (presumably summary) sentences from different news stories in the same cluster. We evaluate both datasets using a word alignment algorithm and a metric borrowed from machine translation. Results show that edit distance data is cleaner and more easily-aligned than the heuristic data, with an overall alignment error rate (AER) of 11.58% on a similarly-extracted test set. On test data extracted by the heuristic strategy, however, performance of the two training sets is similar, with AERs of 13.2% and 14.7% respectively. Analysis of 100 pairs of sentences from each set reveals that the edit distance data lacks many of the complex lexical and syntactic alternations that characterize monolingual paraphrase. The summary sentences, while less readily alignable, retain more of the non-trivial alternations that are of greatest interest for learning paraphrase relationships.

895 citations

Journal Article

A survey on tree edit distance and related problems

Philip Bille¹

IT University of Copenhagen¹

09 Jun 2005, Theoretical Computer Science

TL;DR: This work surveys the problem of comparing labeled trees based on simple local operations of deleting, inserting, and relabeling nodes and presents one or more of the central algorithms for solving the problem.

831 citations

Book Chapter

On the marriage of Lp-norms and edit distance

Lei Chen¹, Raymond T. Ng²

University of Waterloo¹, University of British Columbia²

31 Aug 2004

TL;DR: This paper proposes ERP ("Edit distance with Real Penalty"), a distance function that marries the L1-norm with the edit distance. ERP supports local time shifting and is a metric, and the pruning strategies built on it are shown to dominate all existing strategies.

Abstract: Existing studies on time series are based on two categories of distance functions. The first category consists of the Lp-norms. They are metric distance functions but cannot support local time shifting. The second category consists of distance functions which are capable of handling local time shifting but are nonmetric. The first contribution of this paper is the proposal of a new distance function, which we call ERP ("Edit distance with Real Penalty"). Representing a marriage of L1-norm and the edit distance, ERP can support local time shifting, and is a metric. The second contribution of the paper is the development of pruning strategies for large time series databases. Given that ERP is a metric, one way to prune is to apply the triangle inequality. Another way to prune is to develop a lower bound on the ERP distance. We propose such a lower bound, which has the nice computational property that it can be efficiently indexed with a standard B+-tree. Moreover, we show that these two ways of pruning can be used simultaneously for ERP distances. Specifically, the false positives obtained from the B+-tree can be further minimized by applying the triangle inequality. Based on extensive experimentation with existing benchmarks and techniques, we show that this combination delivers superb pruning power and search time performance, and dominates all existing strategies.

790 citations
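
The Chen and Ng abstract above describes ERP as combining the L1-norm with edit distance so that local time shifting is supported while the result remains a metric. The following is a minimal Python sketch under stated assumptions, not the authors' implementation. It assumes the commonly described recurrence in which aligned elements cost their absolute difference and an element aligned with a gap costs its absolute difference from a fixed reference value `g`. The abstract does not state that recurrence or the default `g = 0.0`, so both are assumptions here, as is the function name `erp`. The pruning and B+-tree indexing from the abstract are not shown.

```python
def erp(r, s, g=0.0):
    """Sketch of an ERP-style distance between two numeric sequences.

    Assumptions (not taken from the abstract): matching two elements
    costs |r_i - s_j|; aligning an element with a gap costs its
    absolute difference from the constant reference value g.
    """
    n, m = len(r), len(s)
    # d[i][j] is the distance between the prefixes r[:i] and s[:j].
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + abs(r[i - 1] - g)  # all of r[:i] against gaps
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + abs(s[j - 1] - g)  # all of s[:j] against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + abs(r[i - 1] - s[j - 1]),  # align r_i with s_j
                d[i - 1][j] + abs(r[i - 1] - g),             # r_i against a gap
                d[i][j - 1] + abs(s[j - 1] - g),             # s_j against a gap
            )
    return d[n][m]
```

With equal-length sequences and no gaps chosen, the value reduces to the L1 distance. Because gaps are penalized against a constant rather than against a neighboring element, the triangle inequality is preserved, which is the property the abstract's pruning strategies rely on.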

Metrics

3,030 papers

78,281 citations

No. of papers in the topic in previous years
Year	Papers
2023	39
2022	96
2021	111
2020	149
2019	145
2018	139
