scispace - formally typeset
Search or ask a question
Topic

Edit distance

About: Edit distance is a research topic. Over the lifetime, 2887 publications have been published within this topic receiving 71491 citations.


Papers
More filters
Journal ArticleDOI
TL;DR: It is proved that the optimal partitioning of the data set is an NP-hard problem, and therefore a heuristic approach for selecting the reference strings greedily is proposed and an optimal partition assignment strategy is presented to minimize the expected number of strings that need to be verified during the query evaluation.
Abstract: Edit distance is widely used for measuring the similarity between two strings. As a primitive operation, edit distance based string similarity search is to find strings in a collection that are similar to a given query string using edit distance. Existing approaches for answering such string similarity queries follow the filter-and-verify framework by using various indexes. Typically, most approaches assume that indexes and data sets are maintained in main memory. To overcome this limitation, in this paper, we propose B + -tree based approaches to answer edit distance based string similarity queries, and hence, our approaches can be easily integrated into existing RDBMSs. In general, we answer string similarity search using pruning techniques employed in the metric space in that edit distance is a metric. First, we split the string collection into partitions according to a set of reference strings. Then, we index strings in all partitions using a single B + -tree based on the distances of these strings to their corresponding reference strings. Finally, we propose two approaches to efficiently answer range and KNN queries, respectively, based on the B + -tree. We prove that the optimal partitioning of the data set is an NP-hard problem, and therefore propose a heuristic approach for selecting the reference strings greedily and present an optimal partition assignment strategy to minimize the expected number of strings that need to be verified during the query evaluation. Through extensive experiments over a variety of real data sets, we demonstrate that our B + -tree based approaches provide superior performance over state-of-the-art techniques on both range and KNN queries in most cases.

34 citations

Proceedings Article
01 Dec 2012
TL;DR: This work semi-automatically develops a dictionary of 9.3 million fully inflected Arabic words using a morphological transducer and a large corpus and improves the error model and language model.
Abstract: A spelling error detection and correction application is based on three main components: a dictionary (or reference word list), an error model and a language model. While most of the attention in the literature has been directed to the language model, we show how improvements in any of the three components can lead to significant cumulative improvements in the overall performance of the system. We semi-automatically develop a dictionary of 9.3 million fully inflected Arabic words using a morphological transducer and a large corpus. We improve the error model by analysing error types and creating an edit distance based re-ranker. We also improve the language model by analysing the level of noise in different sources of data and selecting the optimal subset to train the system on. Testing and evaluation experiments show that our system significantly outperforms Microsoft Word 2010, OpenOffice Ayaspell and Google Docs.

34 citations

01 Jan 2000
TL;DR: This work gives provably better algorithms for normalized edit distance computation with proven complexity bounds: an -time algorithm when the cost function is uniform, i.e, the weights of edit operations depend only on the type but not on the individual symbols involved, and an - time algorithms when the weights are rational.
Abstract: A common model for computing the similarity of two strings and of lengths and respectively, with , is to transform into through a sequence of edit operations, called an edit sequence. The edit operations are of three types: insertion, deletio n, and substitu- tion. A given cost function assigns a weight to each edit operation. The amortized weight for a given edit sequence is the ratio of its weight to its length, a nd the minimum of this ratio over all edit sequences is the normalized edit distance. Existing algorithms for normalized edit distance computation with proven complexity bounds require time in the worst-case. We give provably better algorithms: an -time algorithm when the cost function is uniform, i.e, the weights of edit operations depend only on the type but not on the individual symbols involved, and an -time algorithm when the weights are rational.

34 citations

01 Dec 2004
TL;DR: In this paper, the problem of estimating the rearrangement distance in terms of reversals, insertion and deletion between two genomes, G and H with possibly multiple genes from the same gene family, is considered.
Abstract: We consider the problem of estimating the rearrangement distance in terms of reversals, insertion and deletion between two genomes, G and H with possibly multiple genes from the same gene family. We define a notion of breakpoint distance for this problem, based on matching genes from the same family between G and H. We show that this distance is a good approximation of the edit distance, but NP-hard to compute, even when just one family of genes is non-trivial. We also propose a branch-and-cut exact algorithm for the computation of the breakpoint distance.

34 citations

Proceedings ArticleDOI
01 Jan 2016
TL;DR: In this paper, the authors considered the problem of interactive communication in which two remote parties perform a computation while their communication channel is (adversarially) noisy, and they obtained the first interactive coding scheme that has a constant rate and tolerates noise rates of up to 1/18 - epsilon.
Abstract: We consider the question of interactive communication, in which two remote parties perform a computation while their communication channel is (adversarially) noisy. We extend here the discussion into a more general and stronger class of noise, namely, we allow the channel to perform insertions and deletions of symbols. These types of errors may bring the parties "out of sync", so that there is no consensus regarding the current round of the protocol. In this more general noise model, we obtain the first interactive coding scheme that has a constant rate and tolerates noise rates of up to 1/18 - epsilon. To this end we develop a novel primitive we name edit distance tree code. The edit distance tree code is designed to replace the Hamming distance constraints in Schulman's tree codes (STOC 93), with a stronger edit distance requirement. However, the straightforward generalization of tree codes to edit distance does not seem to yield a primitive that suffices for communication in the presence of synchronization problems. Giving the "right" definition of edit distance tree codes is a main conceptual contribution of this work.

34 citations


Network Information
Related Topics (5)
Graph (abstract data type)
69.9K papers, 1.2M citations
86% related
Unsupervised learning
22.7K papers, 1M citations
81% related
Feature vector
48.8K papers, 954.4K citations
81% related
Cluster analysis
146.5K papers, 2.9M citations
81% related
Scalability
50.9K papers, 931.6K citations
80% related
Performance
Metrics
No. of papers in the topic in previous years
YearPapers
202339
202296
2021111
2020149
2019145
2018139