scispace - formally typeset
Search or ask a question
Topic

Edit distance

About: Edit distance is a research topic. Over the lifetime, 2887 publications have been published within this topic receiving 71491 citations.


Papers
More filters
Journal ArticleDOI
TL;DR: It is shown how sparse dynamic programming can be used to solve transposition invariant problems, and its connection with multidimensional range-minimum search.

64 citations

Journal ArticleDOI
TL;DR: A use of regular expressions in character recognition problem scenarios in sequence analysis that are ideally suited for the application of regular expression algorithms.
Abstract: Regular expressions are extremely useful, because they allow us to work with text in terms of patterns. They are considered the most sophisticated means of performing operations such as string searching, manipulation, validation, and formatting in all applications that deal with text data. Character recognition problem scenarios in sequence analysis that are ideally suited for the application of regular expression algorithms. This paper describes a use of regular expressions in this problem domain, and demonstrates how the effective use of regular expressions that can serve to facilitate more efficient and more effective character recognition.

64 citations

Proceedings ArticleDOI
13 Dec 2010
TL;DR: This work proposes to use these interdependencies to quantify the quality of subgroups, by integrating Bayesian networks with the Exceptional Model Mining framework, and shows interesting subgroups found with this method on datasets from music theory, semantic scene classification, biology and zoogeography.
Abstract: Whenever a dataset has multiple discrete target variables, we want our algorithms to consider not only the variables themselves, but also the interdependencies between them. We propose to use these interdependencies to quantify the quality of subgroups, by integrating Bayesian networks with the Exceptional Model Mining framework. Within this framework, candidate subgroups are generated. For each candidate, we fit a Bayesian network on the target variables. Then we compare the network’s structure to the structure of the Bayesian network fitted on the whole dataset. To perform this comparison, we define an edit distance-based distance metric that is appropriate for Bayesian networks. We show interesting subgroups that we experimentally found with our method on datasets from music theory, semantic scene classification, biology and zoogeography.

64 citations

Proceedings ArticleDOI
16 Jun 2008
TL;DR: Given several systems' automatic translations of the same sentence, it is shown how to combine them into a confusion network, whose various paths represent composite translations that could be considered in a subsequent rescoring step.
Abstract: Given several systems' automatic translations of the same sentence, we show how to combine them into a confusion network, whose various paths represent composite translations that could be considered in a subsequent rescoring step. We build our confusion networks using the method of Rosti et al. (2007), but, instead of forming alignments using the tercom script (Snover et al., 2006), we create alignments that minimize invWER (Leusch et al., 2003), a form of edit distance that permits properly nested block movements of substrings. Oracle experiments with Chinese newswire and weblog translations show that our confusion networks contain paths which are significantly better (in terms of BLEU and TER) than those in tercom-based confusion networks.

64 citations

Journal ArticleDOI
TL;DR: Approximate string comparators increase deterministic linkage sensitivity by up to 10% compared to exact match comparisons and represent an accurate method of linking to vital statistics data.
Abstract: Medical record linkage is becoming increasingly important as clinical data is distributed across independent sources. To improve linkage accuracy we studied different name comparison methods that establish agreement or disagreement between corresponding names. In addition to exact raw name matching and exact phonetic name matching, we tested three approximate string comparators. The approximate comparators included the modified Jaro-Winkler method, the longest common substring, and the Levenshtein edit distance. We also calculated the combined root-mean square of all three. We tested each name comparison method using a deterministic record linkage algorithm. Results were consistent across both hospitals. At a threshold comparator score of 0.8, the Jaro-Winkler comparator achieved the highest linkage sensitivities of 97.4% and 97.7%. The combined root-mean square method achieved sensitivities higher than the Levenshtein edit distance or longest common substring while sustaining high linkage specificity. Approximate string comparators increase deterministic linkage sensitivity by up to 10% compared to exact match comparisons and represent an accurate method of linking to vital statistics data.

64 citations


Network Information
Related Topics (5)
Graph (abstract data type)
69.9K papers, 1.2M citations
86% related
Unsupervised learning
22.7K papers, 1M citations
81% related
Feature vector
48.8K papers, 954.4K citations
81% related
Cluster analysis
146.5K papers, 2.9M citations
81% related
Scalability
50.9K papers, 931.6K citations
80% related
Performance
Metrics
No. of papers in the topic in previous years
YearPapers
202339
202296
2021111
2020149
2019145
2018139