Topic

Edit distance

About: Edit distance is a research topic. Over its lifetime, 2,887 publications have been published within this topic, receiving 71,491 citations.


Papers
Journal Article
TL;DR: This paper presents a short survey and experimental results for well-known sequential approximate string searching algorithms based on different approaches, including dynamic programming, deterministic finite automata, filtering, counting and bit parallelism.
Abstract: The problem of approximate string searching comprises two classes of problems: string searching with k mismatches and string searching with k differences. In this paper we present a short survey and experimental results for well-known sequential approximate string searching algorithms. We consider algorithms based on different approaches, including dynamic programming, deterministic finite automata, filtering, counting and bit parallelism. We compare these algorithms in terms of running time against pattern length and for several values of k for four different kinds of text: binary alphabet, alphabet of size 8, English alphabet and DNA alphabet. Finally, we compare the experimental results of the algorithms with their theoretical complexities.
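
The two problem classes named in the abstract can be contrasted with a minimal Python sketch. It is not taken from the survey and is not one of its optimized algorithms: `search_k_mismatches` is a naive window scan that allows only substitutions, while `search_k_differences` uses the classic column-wise dynamic programming recurrence that also allows insertions and deletions. Both function names are illustrative assumptions.

```python
def search_k_mismatches(pattern, text, k):
    """Return start positions where pattern matches text with at most k substitutions."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for p, t in zip(pattern, text[start:start + m]):
            if p != t:
                mismatches += 1
                if mismatches > k:
                    break  # this window already exceeds the budget
        if mismatches <= k:
            hits.append(start)
    return hits


def search_k_differences(pattern, text, k):
    """Return end positions in text where some substring is within edit distance k of pattern."""
    m = len(pattern)
    # col[i] = best edit distance between pattern[:i] and a substring ending at the current text position
    col = list(range(m + 1))
    ends = []
    for j, t in enumerate(text):
        prev_diag = col[0]  # col[0] stays 0: a match may start at any text position
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(
                prev_diag + (pattern[i - 1] != t),  # match or substitution
                old + 1,                            # extra text character
                col[i - 1] + 1,                     # missing pattern character
            )
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

For the pattern "abc" and the text "abxc" with k = 1, the mismatch search reports only the window starting at 0 ("abx"), whereas the difference search additionally accepts the whole text "abxc" because a single inserted character is allowed.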

28 citations

Book Chapter
10 Oct 2011
TL;DR: SpSim is a new spelling similarity measure for cognate identification that tolerates characteristic spelling differences automatically extracted from a set of cognates known a priori.
Abstract: The most commonly used measures of string similarity, such as the Longest Common Subsequence Ratio (LCSR) and those based on Edit Distance, only take into account the number of matched and mismatched characters. However, we observe that cognates belonging to a pair of languages exhibit recurrent spelling differences such as "ph" and "f" in English-Portuguese cognates "phase" and "fase". Those differences are attributable to the evolution of the spelling rules of each language over time, and thus they should not be penalized in the same way as arbitrary differences found in non-cognate words, if we are using word similarity as an indicator of cognaticity. This paper describes SpSim, a new spelling similarity measure for cognate identification that is tolerant towards characteristic spelling differences that are automatically extracted from a set of cognates known a priori. Compared to LCSR and EdSim (Edit Distance-based similarity), SpSim yields an F-measure 10% higher when used for cognate identification on five different language pairs.
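
The two baseline measures mentioned in the abstract can be sketched as follows. The abstract does not spell out their formulas, so the normalizations below are assumptions based on the commonly used definitions: LCSR as the LCS length divided by the longer string's length, and EdSim as one minus the Levenshtein distance divided by the longer string's length. SpSim itself is not reproduced here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def levenshtein(a, b):
    """Unit-cost edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def lcsr(a, b):
    """Assumed definition: LCS length over the length of the longer string."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def edsim(a, b):
    """Assumed definition: 1 - (edit distance / length of the longer string)."""
    longest = max(len(a), len(b))
    return 1 - levenshtein(a, b) / longest if longest else 1.0
```

Under these assumed definitions, the pair "phase" and "fase" has an LCS of length 3 ("ase") and an edit distance of 2, so both measures penalize the regular "ph"/"f" correspondence as if it were an arbitrary difference, which is the behavior the abstract argues against.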

28 citations

Proceedings Article
08 Jul 1997
TL;DR: In this paper, a stochastic model for string-edit distance is proposed, which is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.
Abstract: In many applications, it is necessary to determine the similarity of two strings. A widely-used notion of string similarity is the edit distance: the minimum number of insertions, deletions, and substitutions required to transform one string into the other. In this report, we provide a stochastic model for string-edit distance. Our stochastic model allows us to learn a string-edit distance function from a corpus of examples. We illustrate the utility of our approach by applying it to the difficult problem of learning the pronunciation of words in conversational speech. In this application, we learn a string-edit distance with nearly one-fifth the error rate of the untrained Levenshtein distance. Our approach is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.
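
The definition of edit distance given in the abstract maps directly onto a dynamic programming recurrence. The sketch below is a plain weighted edit distance, not the paper's stochastic model: the cost callables `sub_cost`, `ins_cost` and `del_cost` are hypothetical stand-ins for whatever costs a learned model might supply, and with their unit-cost defaults the function reduces to the untrained Levenshtein distance.

```python
def weighted_edit_distance(a, b, sub_cost=None, ins_cost=None, del_cost=None):
    """Minimum total cost of insertions, deletions and substitutions turning a into b."""
    # Hypothetical cost hooks; the defaults give the ordinary Levenshtein distance.
    sub_cost = sub_cost or (lambda x, y: 0 if x == y else 1)
    ins_cost = ins_cost or (lambda y: 1)
    del_cost = del_cost or (lambda x: 1)

    # prev[j] = cost of transforming the processed prefix of a into b[:j]
    prev = [0]
    for cb in b:
        prev.append(prev[-1] + ins_cost(cb))
    for ca in a:
        cur = [prev[0] + del_cost(ca)]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + del_cost(ca),          # delete ca
                cur[j - 1] + ins_cost(cb),       # insert cb
                prev[j - 1] + sub_cost(ca, cb),  # substitute ca -> cb (free if equal)
            ))
        prev = cur
    return prev[-1]


def vowel_friendly_sub(x, y):
    """Illustrative, hand-picked cost: swapping one vowel for another is cheaper."""
    if x == y:
        return 0
    return 0.5 if x in "aeiou" and y in "aeiou" else 1
```

Calling `weighted_edit_distance("kitten", "sitting")` uses unit costs, while passing `sub_cost=vowel_friendly_sub` lowers the price of the e/i substitution. A learned distance would replace such hand-picked numbers with costs estimated from a corpus of examples.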

28 citations

Patent
28 Apr 2005
TL;DR: A method for computing the minimum edit distance, measured as insertions plus deletions, between two sequences of data using A* (or A-star) search with a novel counting heuristic that gives a lower bound on the minimum edit distance for any given subproblem.
Abstract: This invention relates to a method for computing the minimum edit distance, measured as the number of insertions plus the number of deletions, between two sequences of data, which runs in an amount of time that is nearly proportional to the size of the input data under many circumstances. Utilizing the A* (or A-star) search, the invention searches for the answer using a novel counting heuristic that gives a lower bound on the minimum edit distance for any given subproblem. In addition, regions over which the heuristic matches the maximum value of the answer are optimized by eliminating the search over redundant paths. The invention can also be used to produce the edit script. The invention can be modified for other types of comparison and pattern recognition.
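
The general idea of combining A* search with a counting-based lower bound can be sketched as below. This is an illustration under stated assumptions, not the patented method: the heuristic used here (the sum over symbols of the absolute difference in remaining occurrence counts) is one simple admissible bound for the insert/delete-only distance and may differ from the patent's heuristic, and the redundant-path optimization is omitted. The function names are hypothetical.

```python
import heapq
from collections import Counter


def counting_lower_bound(x, y):
    """Each surplus symbol occurrence needs at least one insertion or deletion."""
    diff = Counter(x)
    diff.subtract(y)  # counts may go negative; only their magnitudes matter
    return sum(abs(v) for v in diff.values())


def indel_distance_astar(a, b):
    """Minimum number of insertions plus deletions turning a into b, found with A*."""
    goal = (len(a), len(b))
    best = {(0, 0): 0}
    # Heap entries: (g + h, g, (i, j)) where a[:i] and b[:j] are already aligned.
    heap = [(counting_lower_bound(a, b), 0, (0, 0))]
    while heap:
        _, g, (i, j) = heapq.heappop(heap)
        if (i, j) == goal:
            return g
        if g > best.get((i, j), float("inf")):
            continue  # stale entry: a cheaper path to this state was found later
        moves = []
        if i < len(a) and j < len(b) and a[i] == b[j]:
            moves.append((i + 1, j + 1, 0))  # matching symbols cost nothing
        if i < len(a):
            moves.append((i + 1, j, 1))      # delete a[i]
        if j < len(b):
            moves.append((i, j + 1, 1))      # insert b[j]
        for ni, nj, cost in moves:
            ng = g + cost
            if ng < best.get((ni, nj), float("inf")):
                best[(ni, nj)] = ng
                h = counting_lower_bound(a[ni:], b[nj:])
                heapq.heappush(heap, (ng + h, ng, (ni, nj)))
    return None  # not reached: the goal state is always reachable
```

The bound is recomputed from scratch for every state to keep the sketch short; a practical version would maintain the symbol counts incrementally. For "kitten" and "sitting" the bound at the start state already equals the true distance, so the search is guided almost directly to the goal.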

28 citations

Journal Article
TL;DR: Optimal linear-time algorithms for solving recurrence equations on simple systolic arrays are presented and applications to some pattern recognition and sequence comparison problems are given.
Abstract: Optimal linear-time algorithms for solving recurrence equations on simple systolic arrays are presented. The systolic arrays use only one-way communication between processors and communicate with the external environment through only one I/O port. Because of their architectural simplicity, the arrays are well suited for direct VLSI implementation. Applications to some pattern recognition and sequence comparison problems are given. For example, it is shown that the set of $(k + 2)$-tuples of strings $(x_1, \ldots, x_{k+1}, y)$ such that $y$ is a shuffle of $x_1, \ldots, x_{k+1}$ can be recognized by a one-way $k$-dimensional systolic array in $(k + 1)n - k$ time. The longest common subsequence (LCS) problem and the string-to-string correction problem are also considered: the length of an LCS of $k + 1$ sequences can be computed by a one-way $k$-dimensional systolic array in $(k + 1)n - k$ time; the edit distance between two strings can be computed by a one-way one-dimensional systolic array in $2n - 1$ time. Applications to other related problems, e.g., dynamic time warping and optimum generalized alignment, as well as optimal-time simulations of multihead acceptors and multitape transducers are also given.
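
One way to see where a step count such as 2n - 1 can come from is to evaluate the edit distance recurrence along anti-diagonals: cells on the same anti-diagonal do not depend on each other, so a row of processors could update them simultaneously. The sketch below is a sequential Python simulation of that wavefront order, offered as an assumption-laden illustration; it does not model the paper's one-way communication or single I/O port, and the function name is hypothetical.

```python
def edit_distance_wavefront(a, b):
    """Return (edit distance, number of anti-diagonal steps) for strings a and b."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # delete all of a[:i]
    for j in range(m + 1):
        D[0][j] = j  # insert all of b[:j]

    steps = 0
    for d in range(2, n + m + 1):  # anti-diagonal of inner cells with i + j == d
        lo, hi = max(1, d - m), min(n, d - 1)
        if lo > hi:
            continue  # only happens when one string is empty
        # Every cell on this anti-diagonal depends only on earlier anti-diagonals,
        # so a parallel machine could compute all of them in a single step.
        for i in range(lo, hi + 1):
            j = d - i
            D[i][j] = min(
                D[i - 1][j] + 1,
                D[i][j - 1] + 1,
                D[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
            )
        steps += 1
    return D[n][m], steps
```

For two strings of equal length n, the inner n-by-n table has exactly 2n - 1 anti-diagonals, which matches the form of the bound quoted in the abstract.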

28 citations


Network Information
Related Topics (5)
Graph (abstract data type)
69.9K papers, 1.2M citations
86% related
Unsupervised learning
22.7K papers, 1M citations
81% related
Feature vector
48.8K papers, 954.4K citations
81% related
Cluster analysis
146.5K papers, 2.9M citations
81% related
Scalability
50.9K papers, 931.6K citations
80% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    39
2022    96
2021    111
2020    149
2019    145
2018    139