scispace - formally typeset
Search or ask a question
Topic

Edit distance

About: Edit distance is a research topic. Over the lifetime, 2887 publications have been published within this topic receiving 71491 citations.


Papers
More filters
Journal ArticleDOI
TL;DR: The methodology developed is based on the noisy channel model for spelling correction and makes use of statistics harvested from user logs to estimate the probabilities of different types of edits that lead to misspellings.
Abstract: It is known that users of internet search engines often enter queries with misspellings in one or more search terms. Several web search engines make suggestions for correcting misspelled words, but the methods used are proprietary and unpublished to our knowledge. Here we describe the methodology we have developed to perform spelling correction for the PubMed search engine. Our approach is based on the noisy channel model for spelling correction and makes use of statistics harvested from user logs to estimate the probabilities of different types of edits that lead to misspellings. The unique problems encountered in correcting search engine queries are discussed and our solutions are outlined.

36 citations

Book ChapterDOI
24 May 1992
TL;DR: For a noisy clock-controlled shift register statistically optimal probabilistic constrained edit distance a recursive algorithm for its efficient computation are derived and corresponding generalized correlation attack is proposed.
Abstract: For a noisy clock-controlled shift register statistically optimal probabilistic constrained edit distance a recursive algorithm for its efficient computation are derived. corresponding generalized correlation attack is proposed.

35 citations

Book ChapterDOI
11 Sep 2002
TL;DR: This paper shows how to obtain an O(n/m) average time string matching algorithm, using a super-alphabet for simulating suffix automaton and adopting a similar technique to the shift-or algorithm, extending its bit-parallelism in another direction.
Abstract: Given a text T[1 . . . n] and a pattern P[1 . . . m] over some alphabet ? of size ?, finding the exact occurrences of P in T requires at least ?(n log? m/m) character comparisons on average, as shown in [19]. Consequently, it is believed that this lower bound implies also an ?(n log? m/m) lower bound for the execution time of an optimal algorithm. However, in this paper we show how to obtain an O(n/m) average time algorithm. This is achieved by slightly changing the model of computation, and with a modification of an existing algorithm. Our technique uses a super-alphabet for simulating suffix automaton. The space usage of the algorithm is O(?m). The technique can be applied to many other string matching algorithms, including dictionary matching, which is also solved in expected time O(n/m), and approximate matching allowing k edit operations (mismatches, insertions or deletions of characters). This is solved in expected time O(nk/m) for k ? O(m/log? m). The known lower bound for this problem is ?(n(k + log? m)/m), given in [6]. Finally we show how to adopt a similar technique to the shift-or algorithm, extending its bit-parallelism in another direction. This gives a speed-up by a factor s, where s is the number of characters processed simultaneously. Some of the algorithms are implemented, and we show that the methods work well in practice too. This is especially true for the shift-or algorithm, which in some cases works faster than predicted by the theory. The result is the fastest known algorithm for exact string matching for short patterns and small alphabets. All the methods and analyses assume the RAM model of computation, and that each symbol is coded in b = ?log2 ?? bits. They work for larger b too, but the speed-up is decreased.

35 citations

Book ChapterDOI
29 Apr 1992
TL;DR: This work considers a string matching problem where the pattern is a template that matches many different strings with various degrees of perfection, and shows that the structure of Pn can be exploited and the problem reduced to essentially solving a dynamic programming of size O(mn).
Abstract: We consider a string matching problem where the pattern is a template that matches many different strings with various degrees of perfection. The quality of a match is given by a penalty matrix that assigns each pair of characters a score that characterizes how well the characters match. Superfluous characters in the text and superfluous characters in the pattern may also occur and the respective penalties for such gaps in the alignment are also given by the penalty matrix. For a text T of length n, and a template P of length m, we wish to find the best alignment of T with Pn, which is the concatenation of n copies of P, (m will typically be much smaller than n). Such an alignment can simply be obtained by solving a dynamic programming problem of size O(n2m), and ignoring the periodic character of Pn. We show that the structure of Pn can be exploited and the problem reduced to essentially solving a dynamic programming of size O(mn). If the complexity of computing gap penalties is O(1), (which is frequently the case), our algorithm runs in O(mn) time. The problem was motivated by a protein structure problem.

35 citations

Journal ArticleDOI
TL;DR: This work proves the first non-trivial communication complexity lower bound for the problem of estimating the edit distance (aka Levenshtein distance) between two strings, and provides the first setting in which the complexity of computing the edit Distance is provably larger than that of Hamming distance.
Abstract: We prove the first nontrivial communication complexity lower bound for the problem of estimating the edit distance (aka Levenshtein distance) between two strings. To the best of our knowledge, this is the first computational setting in which the complexity of estimating the edit distance is provably larger than that of Hamming distance. Our lower bound exhibits a trade-off between approximation and communication, asserting, for example, that protocols with $O(1)$ bits of communication can obtain only approximation $\alpha\geq\Omega(\log d/\log\log d)$, where $d$ is the length of the input strings. This case of $O(1)$ communication is of particular importance since it captures constant-size sketches as well as embeddings into spaces like $l_1$ and squared-$l_2$, two prevailing algorithmic approaches for dealing with edit distance. Indeed, the known nontrivial communication upper bounds are all derived from embeddings into $l_1$. By excluding low-communication protocols for edit distance, we rule out a strictly richer class of algorithms than previous results. Furthermore, our lower bound holds not only for strings over a binary alphabet but also for strings that are permutations (aka the Ulam metric). For this case, our bound nearly matches an upper bound known via embedding the Ulam metric into $l_1$. Our proof uses a new technique that relies on Fourier analysis in a rather elementary way.

35 citations


Network Information
Related Topics (5)
Graph (abstract data type)
69.9K papers, 1.2M citations
86% related
Unsupervised learning
22.7K papers, 1M citations
81% related
Feature vector
48.8K papers, 954.4K citations
81% related
Cluster analysis
146.5K papers, 2.9M citations
81% related
Scalability
50.9K papers, 931.6K citations
80% related
Performance
Metrics
No. of papers in the topic in previous years
YearPapers
202339
202296
2021111
2020149
2019145
2018139