Topic

Edit distance

About: Edit distance is a research topic. Over its lifetime, 2,887 publications have been published within this topic, receiving 71,491 citations.

Papers

Journal Article

PTIGS-IdIt, a system for species identification by DNA sequences of the psbA-trnH intergenic spacer region.

Chang Liu¹, Dong Liang², Ting Gao³, Xiaohui Pang¹, Jingyuan Song¹, Hui Yao¹, Jianping Han¹, Zhihua Liu¹, Xiaojun Guan⁴, Kun Jiang, Huan Li², Shilin Chen¹

Peking Union Medical College¹, Beihang University², Qingdao Agricultural University³, University of North Carolina at Chapel Hill⁴

30 Nov 2011-BMC Bioinformatics

TL;DR: The Edit distance and the DNFP methods have the highest discrimination powers and can be extended to applications using the core barcodes and the other supplemental DNA barcode ITS2.

Abstract: DNA barcoding technology, which uses a short piece of DNA sequence to identify species, has a wide range of applications. To date, a universal DNA barcode marker for plants remains elusive. The rbcL and matK regions have been proposed as the “core barcode” for plants, and the ITS2 and psbA-trnH intergenic spacer (PTIGS) regions were later added as supplemental barcodes. The use of the PTIGS region as a supplemental barcode has been limited by the lack of computational tools that can handle significant insertions and deletions in the PTIGS sequences. Here, we compared the most commonly used alignment-based and alignment-free methods and developed a web server to allow biologists to carry out PTIGS-based DNA barcoding analyses. First, we compared several alignment-based methods, such as BLAST and those calculating P distance and Edit distance, the alignment-free Di-Nucleotide Frequency Profile (DNFP) method, and their combinations. We found that the DNFP and Edit-distance methods increased the identification success rate to ~80%, 20% higher than the most commonly used BLAST method. Second, the combined methods showed an overall better success rate and performance. Last, we have developed a web server that allows (1) retrieving various sub-regions and the consensus sequences of PTIGS, (2) annotating novel PTIGS sequences, (3) determining species identity by PTIGS sequences using eight methods, and (4) examining the identification efficiency and performance of the eight methods for various taxonomy groups. The Edit distance and the DNFP methods have the highest discrimination powers. Hybrid methods can be used to achieve significant improvement in performance. These methods can be extended to applications using the core barcodes and the other supplemental DNA barcode ITS2. To our knowledge, the web server developed here is the only one that allows species determination based on PTIGS sequences. The web server can be accessed at http://psba-trnh-plantidit.dnsalias.org.
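
As a rough illustration of what an alignment-free di-nucleotide frequency profile comparison might look like, the Python sketch below counts overlapping dinucleotides, normalizes them into a 16-entry profile, and picks the reference with the closest profile. The abstract does not specify the distance measure or any implementation details, so the Euclidean distance, the handling of ambiguous bases, and all function names here are assumptions for illustration only, not the PTIGS-IdIt implementation.

```python
from itertools import product
from math import sqrt

# All 16 dinucleotides over the alphabet ACGT, in a fixed order.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dnfp(seq):
    """Return the normalized frequencies of overlapping dinucleotides.

    Pairs containing characters other than A, C, G, T are skipped
    (an assumption; the source does not say how they are treated).
    """
    seq = seq.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no valid dinucleotide")
    return [counts[d] / total for d in DINUCLEOTIDES]


def dnfp_distance(a, b):
    """Euclidean distance between two profiles (assumed metric)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(dnfp(a), dnfp(b))))


def closest_reference(query, references):
    """Return the name of the reference whose profile is nearest to the query.

    `references` maps a (hypothetical) species name to its sequence.
    """
    return min(references, key=lambda name: dnfp_distance(query, references[name]))
```

Because the profile has a fixed length regardless of sequence length, insertions and deletions shift the frequencies only slightly, which is one plausible reason such a method copes with indel-rich regions.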

33 citations

Proceedings Article

The Computational Hardness of Estimating Edit Distance [Extended Abstract]

A. Andoni¹, R. Krauthgamer

Massachusetts Institute of Technology¹

21 Oct 2007

TL;DR: This work proves the first non-trivial communication complexity lower bound for the problem of estimating the edit distance (aka Levenshtein distance) between two strings, and provides the first setting in which the complexity of computing the edit distance is provably larger than that of Hamming distance.

Abstract: We prove the first non-trivial communication complexity lower bound for the problem of estimating the edit distance (aka Levenshtein distance) between two strings. A major feature of our result is that it provides the first setting in which the complexity of computing the edit distance is provably larger than that of Hamming distance. Our lower bound exhibits a trade-off between approximation and communication, asserting, for example, that protocols with O(1) bits of communication can only obtain approximation α ≥ Ω(log d / log log d), where d is the length of the input strings. This case of O(1) communication is of particular importance, since it captures constant-size sketches as well as embeddings into spaces like L1 and squared-L2, two prevailing algorithmic approaches for dealing with edit distance. Furthermore, the bound holds not only for strings over alphabet Σ = {0, 1}, but also for strings that are permutations (called the Ulam metric). Besides being applicable to a much richer class of algorithms than all previous results, our bounds are near-tight in at least one case, namely that of embedding permutations into L1. The proof uses a new technique that relies on Fourier analysis in a rather elementary way.
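
To make the contrast between the two distances concrete, here is a minimal Python sketch of Hamming distance and of the textbook dynamic program for Levenshtein distance. It only illustrates the definitions; it is not the communication protocol or the lower-bound argument from the paper, and the function names are illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b.

    Classic O(len(a) * len(b)) dynamic program, keeping one row at a time.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]
```

For the binary strings "0101" and "1010", the Hamming distance is 4 because every position differs, while the edit distance is 2: delete the leading 0 and append a 0. A single shift can thus separate the two measures sharply.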

33 citations

Proceedings Article

Near-linear time insertion-deletion codes and (1+ε)-approximating edit distance via indexing

Bernhard Haeupler¹, Aviad Rubinstein², Amirbehshad Shahrasbi¹

Carnegie Mellon University¹, Stanford University²

23 Jun 2019

TL;DR: In this article, fast-decodable indexing schemes for edit distance were introduced, which can be used to speed up edit distance computations to near-linear time if one of the strings is indexed by an indexing string I.

Abstract: We introduce fast-decodable indexing schemes for edit distance which can be used to speed up edit distance computations to near-linear time if one of the strings is indexed by an indexing string I. In particular, for every length n and every ε > 0, one can in near-linear time construct a string I ∈ Σ′ⁿ with |Σ′| = O_ε(1), such that indexing any string S ∈ Σⁿ, symbol by symbol, with I results in a string S′ ∈ Σ″ⁿ, where Σ″ = Σ × Σ′, for which edit distance computations are easy, i.e., one can compute a (1+ε)-approximation of the edit distance between S′ and any other string in O(n poly(log n)) time. Our indexing schemes can be used to improve the decoding complexity of state-of-the-art error correcting codes for insertions and deletions. In particular, they lead to near-linear time decoding algorithms for the insertion-deletion codes of [Haeupler, Shahrasbi; STOC '17] and faster decoding algorithms for list-decodable insertion-deletion codes of [Haeupler, Shahrasbi, Sudan; ICALP '18]. Interestingly, the latter codes are a crucial ingredient in the construction of fast-decodable indexing schemes.

33 citations

Proceedings Article

Near-optimal sublinear time algorithms for Ulam distance

Alexandr Andoni¹, Huy Nguyen¹

Princeton University¹

17 Jan 2010

TL;DR: Near-tight bounds are given for estimating the edit distance between two non-repetitive strings (Ulam distance) with constant approximation, in sub-linear time and a matching lower bound is proved.

Abstract: We give near-tight bounds for estimating the edit distance between two non-repetitive strings (Ulam distance) with constant approximation, in sub-linear time. For two strings of length d and at edit distance R, our algorithm runs in time O(d/R + √d) and outputs a constant approximation to R. We also prove a matching lower bound (up to logarithmic terms). Both upper and lower bounds are improvements over previous results from, respectively, [Andoni-Indyk-Krauthgamer, SODA'09] and [Batu-Ergun-Kilian-Magen-Raskhodnikova-Rubinfeld-Sami, STOC'03].
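
For orientation, the sketch below computes an exact Ulam-style distance between two non-repetitive strings over the same symbols in O(d log d) time, using a longest increasing subsequence. This is a simple baseline, not the sublinear-time estimator of the paper. It uses one common convention, d minus the length of the longest common subsequence, which is within a factor of 2 of the Levenshtein distance on such strings; the function name and that convention are assumptions made for illustration.

```python
from bisect import bisect_left


def ulam_distance(p, q):
    """Return len(p) minus the longest common subsequence of p and q.

    Both inputs must contain the same symbols, each exactly once.
    Relabeling q by positions in p turns the LCS into a longest
    increasing subsequence, found here by patience sorting.
    """
    if len(p) != len(q) or len(set(p)) != len(p) or set(p) != set(q):
        raise ValueError("inputs must be permutations of the same symbols")
    position_in_p = {symbol: i for i, symbol in enumerate(p)}
    tails = []  # tails[k] = smallest last value of an increasing run of length k + 1
    for symbol in q:
        value = position_in_p[symbol]
        k = bisect_left(tails, value)
        if k == len(tails):
            tails.append(value)
        else:
            tails[k] = value
    return len(p) - len(tails)
```

For example, "eabcd" is obtained from "abcde" by moving a single symbol, so the function returns 1, whereas reversing "abc" to "cba" leaves a common subsequence of length 1 and gives 2.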

33 citations

Journal Article

STELLAR: fast and exact local alignments

Birte Kehr¹, David Weese¹, Knut Reinert¹

Free University of Berlin¹

05 Oct 2011-BMC Bioinformatics

TL;DR: This work presents a local pairwise aligner STELLAR that has full sensitivity for ε-alignments, i.e. guarantees to report all local alignments of a given minimal length and maximal error rate, and applies the SWIFT algorithm for lossless filtering.

Abstract: Large-scale comparison of genomic sequences requires reliable tools for the search of local alignments. Practical local aligners are in general fast, but heuristic, and hence sometimes miss significant matches. We present here the local pairwise aligner STELLAR that has full sensitivity for ε-alignments, i.e. guarantees to report all local alignments of a given minimal length and maximal error rate. The aligner is composed of two steps, filtering and verification. We apply the SWIFT algorithm for lossless filtering, and have developed a new verification strategy that we prove to be exact. Our results on simulated and real genomic data confirm and quantify the conjecture that heuristic tools like BLAST or BLAT miss a large percentage of significant local alignments. STELLAR is very practical and fast on very long sequences, which makes it a suitable new tool for finding local alignments between genomic sequences under the edit distance model. Binaries are freely available for Linux, Windows, and Mac OS X at http://www.seqan.de/projects/stellar. The source code is freely distributed with the SeqAn C++ library version 1.3 and later at http://www.seqan.de.

33 citations

Metrics

Papers: 3,030
Citations: 78,281

No. of papers in the topic in previous years
Year	Papers
2023	39
2022	96
2021	111
2020	149
2019	145
2018	139
