
Edit distance

About: Edit distance is a research topic. Over its lifetime, 2,887 publications have been published within this topic, receiving 71,491 citations.


Papers
Proceedings Article
24 Aug 2003
TL;DR: This paper proposes to employ learnable text distance functions for each database field, and shows that such measures are capable of adapting to the specific notion of similarity that is appropriate for the field's domain.
Abstract: The problem of identifying approximately duplicate records in databases is an essential step for data cleaning and data integration processes. Most existing approaches have relied on generic or manually tuned distance metrics for estimating the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity. We propose to employ learnable text distance functions for each database field, and show that such measures are capable of adapting to the specific notion of similarity that is appropriate for the field's domain. We present two learnable text similarity measures suitable for this task: an extended variant of learnable string edit distance, and a novel vector-space based measure that employs a Support Vector Machine (SVM) for training. Experimental results on a range of datasets show that our framework can improve duplicate detection accuracy over traditional techniques.

1,020 citations

Journal Article
TL;DR: The stochastic model allows us to learn a string-edit distance function from a corpus of examples and is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.
Abstract: In many applications, it is necessary to determine the similarity of two strings. A widely-used notion of string similarity is the edit distance: the minimum number of insertions, deletions, and substitutions required to transform one string into the other. In this report, we provide a stochastic model for string-edit distance. Our stochastic model allows us to learn a string-edit distance function from a corpus of examples. We illustrate the utility of our approach by applying it to the difficult problem of learning the pronunciation of words in conversational speech. In this application, we learn a string-edit distance with nearly one-fifth the error rate of the untrained Levenshtein distance. Our approach is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.

897 citations
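
The untrained edit distance described in the abstract above (the minimum number of insertions, deletions, and substitutions needed to turn one string into the other) can be computed with a standard dynamic program. The following is a minimal sketch of that baseline Levenshtein distance with unit costs, not the stochastic, learned model the paper proposes; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance: insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory

    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string is i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory grows with the shorter string while running time stays proportional to the product of the two lengths.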

Proceedings Article
23 Aug 2004
TL;DR: Investigation of unsupervised techniques for acquiring monolingual sentence-level paraphrases from a corpus of temporally and topically clustered news articles collected from thousands of web-based news sources shows that edit distance data is cleaner and more easily-aligned than the heuristic data.
Abstract: We investigate unsupervised techniques for acquiring monolingual sentence-level paraphrases from a corpus of temporally and topically clustered news articles collected from thousands of web-based news sources. Two techniques are employed: (1) simple string edit distance, and (2) a heuristic strategy that pairs initial (presumably summary) sentences from different news stories in the same cluster. We evaluate both datasets using a word alignment algorithm and a metric borrowed from machine translation. Results show that edit distance data is cleaner and more easily-aligned than the heuristic data, with an overall alignment error rate (AER) of 11.58% on a similarly-extracted test set. On test data extracted by the heuristic strategy, however, performance of the two training sets is similar, with AERs of 13.2% and 14.7% respectively. Analysis of 100 pairs of sentences from each set reveals that the edit distance data lacks many of the complex lexical and syntactic alternations that characterize monolingual paraphrase. The summary sentences, while less readily alignable, retain more of the non-trivial alternations that are of greatest interest for learning paraphrase relationships.

895 citations

Journal Article
TL;DR: This work surveys the problem of comparing labeled trees based on simple local operations of deleting, inserting, and relabeling nodes and presents one or more of the central algorithms for solving the problem.

831 citations

Book Chapter
31 Aug 2004
TL;DR: This paper proposes ERP ("Edit distance with Real Penalty"), a new distance function that marries the L1-norm and the edit distance, supports local time shifting, and is a metric; combined with the proposed pruning strategies, it dominates all existing strategies.
Abstract: Existing studies on time series are based on two categories of distance functions. The first category consists of the Lp-norms. They are metric distance functions but cannot support local time shifting. The second category consists of distance functions which are capable of handling local time shifting but are nonmetric. The first contribution of this paper is the proposal of a new distance function, which we call ERP ("Edit distance with Real Penalty"). Representing a marriage of L1-norm and the edit distance, ERP can support local time shifting, and is a metric. The second contribution of the paper is the development of pruning strategies for large time series databases. Given that ERP is a metric, one way to prune is to apply the triangle inequality. Another way to prune is to develop a lower bound on the ERP distance. We propose such a lower bound, which has the nice computational property that it can be efficiently indexed with a standard B+-tree. Moreover, we show that these two ways of pruning can be used simultaneously for ERP distances. Specifically, the false positives obtained from the B+-tree can be further minimized by applying the triangle inequality. Based on extensive experimentation with existing benchmarks and techniques, we show that this combination delivers superb pruning power and search time performance, and dominates all existing strategies.

790 citations
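
To make the "marriage of L1-norm and the edit distance" in the abstract above concrete, here is a minimal sketch of an ERP-style dynamic program. The abstract does not spell out the recurrence, so this assumes the commonly cited form: aligned elements cost their absolute difference, and an element left unaligned is charged its absolute difference from a constant gap value. The function name `erp` and the default `gap=0.0` are assumptions made for illustration, and the pruning and B+-tree indexing described in the abstract are not shown.

```python
from typing import Sequence


def erp(r: Sequence[float], s: Sequence[float], gap: float = 0.0) -> float:
    """ERP-style distance between two time series (assumed recurrence)."""
    n, m = len(r), len(s)
    # dp[i][j] = distance between the prefixes r[:i] and s[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]

    # Against an empty series, every element is matched with the gap value.
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + abs(r[i - 1] - gap)
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + abs(s[j - 1] - gap)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + abs(r[i - 1] - s[j - 1]),  # align both elements (L1 cost)
                dp[i - 1][j] + abs(r[i - 1] - gap),           # leave r[i-1] unaligned
                dp[i][j - 1] + abs(s[j - 1] - gap),           # leave s[j-1] unaligned
            )
    return dp[n][m]


if __name__ == "__main__":
    print(erp([1, 2, 3], [1, 3]))
```

Charging gaps against a fixed reference value, rather than against a neighbouring element, is what allows local time shifting while keeping the triangle inequality that the abstract relies on for pruning.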


Network Information
Related Topics (5)
Graph (abstract data type): 69.9K papers, 1.2M citations, 86% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Feature vector: 48.8K papers, 954.4K citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 81% related
Scalability: 50.9K papers, 931.6K citations, 80% related
Performance Metrics
No. of papers in the topic in previous years
Year  Papers
2023  39
2022  96
2021  111
2020  149
2019  145
2018  139