scispace - formally typeset
Search or ask a question
Topic

Edit distance

About: Edit distance is a research topic. Over the lifetime, 2887 publications have been published within this topic receiving 71491 citations.


Papers
More filters
Journal ArticleDOI
TL;DR: The Edit distance and the DNFP methods have the highest discrimination powers and can be extended to applications using the core barcodes and the other supplemental DNA barcode ITS2.
Abstract: DNA barcoding technology, which uses a short piece of DNA sequence to identify species, has wide ranges of applications. Until today, a universal DNA barcode marker for plants remains elusive. The rbc L and mat K regions have been proposed as the “core barcode” for plants and the ITS2 and psbA-trnH intergenic spacer (PTIGS) regions were later added as supplemental barcodes. The use of PTIGS region as a supplemental barcode has been limited by the lack of computational tools that can handle significant insertions and deletions in the PTIGS sequences. Here, we compared the most commonly used alignment-based and alignment-free methods and developed a web server to allow the biologists to carry out PTIGS-based DNA barcoding analyses. First, we compared several alignment-based methods such as BLAST and those calculating P distance and Edit distance, alignment-free methods Di-Nucleotide Frequency Profile (DNFP) and their combinations. We found that the DNFP and Edit-distance methods increased the identification success rate to ~80%, 20% higher than the most commonly used BLAST method. Second, the combined methods showed overall better success rate and performance. Last, we have developed a web server that allows (1) retrieving various sub-regions and the consensus sequences of PTIGS, (2) annotating novel PTIGS sequences, (3) determining species identity by PTIGS sequences using eight methods, and (4) examining identification efficiency and performance of the eight methods for various taxonomy groups. The Edit distance and the DNFP methods have the highest discrimination powers. Hybrid methods can be used to achieve significant improvement in performance. These methods can be extended to applications using the core barcodes and the other supplemental DNA barcode ITS2. To our knowledge, the web server developed here is the only one that allows species determination based on PTIGS sequences. The web server can be accessed at http://psba-trnh-plantidit.dnsalias.org .

33 citations

Proceedings ArticleDOI
21 Oct 2007
TL;DR: This work proves the first non-trivial communication complexity lower bound for the problem of estimating the edit distance (aka Levenshtein distance) between two strings, and provides the first setting in which the complexity of computing the edit Distance is provably larger than that of Hamming distance.
Abstract: We prove the first non-trivial communication complexity lower bound for the problem of estimating the edit distance (aka Levenshtein distance) between two strings. A major feature of our result is that it provides the first setting in which the complexity of computing the edit distance is provably larger than that of Hamming distance. Our lower bound exhibits a trade-off between approximation and communication, asserting, for example, thai protocols with O(1) bits of communication can only obtain approximation a ges Omega(log d/log log d), where d is the length of the input strings. This case of O(1) communication is of particular importance, since it captures constant-size sketches as well as embaddings into spaces like L1 and squared-L2. two prevailing algorithmic approaches for dealing with edit distance. Furthermore, the bound holds not only for strings over alphabet Sigma= {0, 1}, but also for strings that are permu-tations (called the Ulam metric). Besides being applicable to a much richer class of algorithms than all previous results, our bounds are near-tight in at. least one case, namely of embedding permutations into L1. The proof uses a new technique, that relies on Fourier analysis in a rather elementary way.

33 citations

Proceedings ArticleDOI
23 Jun 2019
TL;DR: In this article, fast-decodable indexing schemes for edit distance were introduced, which can be used to speed up edit distance computations to near-linear time if one of the strings is indexed by an indexing string I.
Abstract: We introduce fast-decodable indexing schemes for edit distance which can be used to speed up edit distance computations to near-linear time if one of the strings is indexed by an indexing string I. In particular, for every length n and every e >0, one can in near linear time construct a string I ∈ Σ′n with |Σ′| = Oe(1), such that, indexing any string S ∈ Σn, symbol-by-symbol, with I results in a string S′ ∈ Σ″n where Σ″ = Σ × Σ′ for which edit distance computations are easy, i.e., one can compute a (1+e)-approximation of the edit distance between S′ and any other string in O(n (logn)) time. Our indexing schemes can be used to improve the decoding complexity of state-of-the-art error correcting codes for insertions and deletions. In particular, they lead to near-linear time decoding algorithms for the insertion-deletion codes of [Haeupler, Shahrasbi; STOC ‘17] and faster decoding algorithms for list-decodable insertion-deletion codes of [Haeupler, Shahrasbi, Sudan; ICALP ‘18]. Interestingly, the latter codes are a crucial ingredient in the construction of fast-decodable indexing schemes.

33 citations

Proceedings ArticleDOI
17 Jan 2010
TL;DR: Near-tight bounds are given for estimating the edit distance between two non-repetitive strings (Ulam distance) with constant approximation, in sub-linear time and a matching lower bound is proved.
Abstract: We give near-tight bounds for estimating the edit distance between two non-repetitive strings (Ulam distance) with constant approximation, in sub-linear time. For two strings of length d and at edit distance R, our algorithm runs in time O(d/R + √d) and outputs a constant approximation to R. We also prove a matching lower bound (up to logarithmic terms). Both upper and lower bounds are improvements over previous results from, respectively, [Andoni-Indyk-Krauthgamer, SODA'09] and [Batu-Ergun-Kilian-Magen-Raskhodnikova-Rubinfeld-Sami, STOC'03].

33 citations

Journal ArticleDOI
TL;DR: This work presents a local pairwise aligner STELLAR that has full sensitivity for ε-alignments, i.e. guarantees to report all local alignments of a given minimal length and maximal error rate, and applies the SWIFT algorithm for lossless filtering.
Abstract: Large-scale comparison of genomic sequences requires reliable tools for the search of local alignments. Practical local aligners are in general fast, but heuristic, and hence sometimes miss significant matches. We present here the local pairwise aligner STELLAR that has full sensitivity for e-alignments, i.e. guarantees to report all local alignments of a given minimal length and maximal error rate. The aligner is composed of two steps, filtering and verification. We apply the SWIFT algorithm for lossless filtering, and have developed a new verification strategy that we prove to be exact. Our results on simulated and real genomic data confirm and quantify the conjecture that heuristic tools like BLAST or BLAT miss a large percentage of significant local alignments. STELLAR is very practical and fast on very long sequences which makes it a suitable new tool for finding local alignments between genomic sequences under the edit distance model. Binaries are freely available for Linux, Windows, and Mac OS X at http://www.seqan.de/projects/stellar . The source code is freely distributed with the SeqAn C++ library version 1.3 and later at http://www.seqan.de .

33 citations


Network Information
Related Topics (5)
Graph (abstract data type)
69.9K papers, 1.2M citations
86% related
Unsupervised learning
22.7K papers, 1M citations
81% related
Feature vector
48.8K papers, 954.4K citations
81% related
Cluster analysis
146.5K papers, 2.9M citations
81% related
Scalability
50.9K papers, 931.6K citations
80% related
Performance
Metrics
No. of papers in the topic in previous years
YearPapers
202339
202296
2021111
2020149
2019145
2018139