Topic

Edit distance

About: Edit distance is a research topic. Over the lifetime, 2887 publications have been published within this topic receiving 71491 citations.


Papers
Journal ArticleDOI
TL;DR: The approach to computing distances between two arbitrary genomes is generalized, with the focus on approximating the true evolutionary distance rather than the edit distance; the distances produced are good enough to enable the simple neighbor-joining procedure to reconstruct the authors' test trees with high accuracy.
Abstract: As more and more genomes are sequenced, evolutionary biologists are becoming increasingly interested in evolution at the level of whole genomes, in scenarios in which the genome evolves through insertions, duplications, deletions, and movements of genes along its chromosomes. In the mathematical model pioneered by Sankoff and others, a unichromosomal genome is represented by a signed permutation of a multiset of genes; Hannenhalli and Pevzner showed that the edit distance between two signed permutations of the same set can be computed in polynomial time when all operations are inversions. El-Mabrouk extended that result to allow deletions and a limited form of insertions (which forbids duplications); in turn we extended it to compute a nearly optimal edit sequence between an arbitrary genome and the identity permutation. In this paper we generalize our approach to compute distances between two arbitrary genomes, but focus on approximating the true evolutionary distance rather than the edit distance. We present experimental results showing that our algorithm produces excellent estimates of the true evolutionary distance up to a (high) threshold of saturation; indeed, the distances thus produced are good enough to enable the simple neighbor-joining procedure to reconstruct our test trees with high accuracy.

79 citations
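As background for the signed-permutation model described above, here is a minimal sketch (an illustration, not the authors' algorithm) of how an inversion acts on a signed gene order, together with a brute-force breadth-first search for the minimum number of inversions between two tiny signed permutations. The brute force is exponential and only works on toy inputs; Hannenhalli and Pevzner's result computes this inversion distance in polynomial time.

```python
from collections import deque

def invert(perm, i, j):
    """Reverse the segment perm[i..j] and flip the signs of its elements."""
    segment = [-g for g in reversed(perm[i:j + 1])]
    return tuple(perm[:i]) + tuple(segment) + tuple(perm[j + 1:])

def inversion_distance_bruteforce(source, target):
    """Minimum number of inversions turning `source` into `target`.

    Breadth-first search over all reachable signed permutations; exponential,
    so only usable for very small gene orders (illustration only).
    """
    source, target = tuple(source), tuple(target)
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        perm, dist = queue.popleft()
        if perm == target:
            return dist
        n = len(perm)
        for i in range(n):
            for j in range(i, n):
                nxt = invert(perm, i, j)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None  # unreachable only if the two gene sets differ

# Example: transform (+3, -2, +1) into the identity (+1, +2, +3).
print(inversion_distance_bruteforce((3, -2, 1), (1, 2, 3)))
```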

Proceedings Article
01 Jan 2007
TL;DR: The idea is to use a fast but suboptimal bipartite graph matching algorithm as a heuristic function that estimates the future costs; because the heuristic is a lower bound on those costs, the search is still guaranteed to return the exact graph edit distance of the two given graphs.
Abstract: Graph edit distance is a dissimilarity measure for arbitrarily structured and arbitrarily labeled graphs. In contrast with other approaches, it does not suffer from any restrictions and can be applied to any type of graph, including hypergraphs [1]. Graph edit distance can be used to address various graph classification problems with different methods, for instance, k-nearest-neighbor classifier (k-NN), graph embedding classifier [2], or classification with graph kernel machines [3]. The main drawback of graph edit distance is its computational complexity which is exponential in the number of nodes of the involved graphs. Consequently, computation of graph edit distance is feasible for graphs of rather small size only. In order to overcome this restriction, a number of fast but suboptimal methods have been proposed in the literature (e.g. [4]). In the present paper we aim at speeding up the computation of exact graph edit distance. We propose to combine the standard tree search approach to graph edit distance computation with the suboptimal procedure described in [4]. The idea is to use a fast but suboptimal bipartite graph matching algorithm as a heuristic function that estimates the future costs. The overhead for computing this heuristic function is small, and easily compensated by the speed-up achieved in tree traversal. Since the heuristic function provides us with a lower bound of the future costs, it is guaranteed to return the exact graph edit distance of two given graphs.

77 citations
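The guarantee in the abstract above rests on a general property of best-first search: a heuristic that never overestimates the remaining cost preserves exactness. The sketch below illustrates that property on ordinary string edit distance rather than graph edit distance (a hypothetical analogue, not the paper's method), using the length difference of the remaining suffixes as the admissible lower bound.

```python
import heapq

def edit_distance_astar(a, b):
    """Levenshtein distance via A* search over prefix pairs (i, j).

    The heuristic h(i, j) = |(len(a) - i) - (len(b) - j)| never overestimates
    the remaining cost, so the first time the goal is popped its cost is exact,
    mirroring the role of the bipartite lower bound in graph edit distance search.
    """
    def h(i, j):
        return abs((len(a) - i) - (len(b) - j))

    start = (0, 0)
    best = {start: 0}
    frontier = [(h(0, 0), 0, start)]
    while frontier:
        f, g, (i, j) = heapq.heappop(frontier)
        if (i, j) == (len(a), len(b)):
            return g
        if g > best.get((i, j), float("inf")):
            continue  # stale queue entry
        moves = []
        if i < len(a) and j < len(b):
            moves.append(((i + 1, j + 1), 0 if a[i] == b[j] else 1))  # match/substitute
        if i < len(a):
            moves.append(((i + 1, j), 1))  # delete a[i]
        if j < len(b):
            moves.append(((i, j + 1), 1))  # insert b[j]
        for (ni, nj), cost in moves:
            ng = g + cost
            if ng < best.get((ni, nj), float("inf")):
                best[(ni, nj)] = ng
                heapq.heappush(frontier, (ng + h(ni, nj), ng, (ni, nj)))

print(edit_distance_astar("kitten", "sitting"))  # 3
```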

Journal ArticleDOI
TL;DR: Improvements to previously published methods for similarity searching with reduced graphs are described, with a particular focus on ligand-based virtual screening, and a novel use of reduced graphs in the clustering of high-throughput screening data is introduced.
Abstract: Virtual screening and high-throughput screening are two major components of lead discovery within the pharmaceutical industry. In this paper we describe improvements to previously published methods for similarity searching with reduced graphs, with a particular focus on ligand-based virtual screening, and describe a novel use of reduced graphs in the clustering of high-throughput screening data. Literature methods for reduced graph similarity searching encode the reduced graphs as binary fingerprints, which has a number of issues. In this paper we extend the definition of the reduced graph to include positively and negatively ionizable groups and introduce a new method for measuring the similarity of reduced graphs based on a weighted edit distance. Moving beyond simple similarity searching, we show how more flexible queries can be built using reduced graphs and describe a database system that allows iterative querying with multiple representations. Reduced graphs capture many important features of ligand...

77 citations
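For intuition about what a weighted edit distance with chemistry-aware costs can look like, here is a simplified sketch on label sequences rather than reduced graphs; the label set and cost values are invented for illustration and are not taken from the paper.

```python
# Hypothetical reduced-graph node labels: D = H-bond donor, A = acceptor,
# R = aromatic ring, P = positively ionizable, N = negatively ionizable.
# Substitution costs below are invented for illustration only.
SUB_COST = {
    ("D", "A"): 0.5, ("A", "D"): 0.5,
    ("P", "N"): 1.0, ("N", "P"): 1.0,
}
INDEL_COST = 1.0

def sub_cost(x, y):
    if x == y:
        return 0.0
    return SUB_COST.get((x, y), 0.8)  # assumed default mismatch cost

def weighted_edit_distance(s, t):
    """Weighted Levenshtein distance between two label sequences."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * INDEL_COST
    for j in range(1, n + 1):
        d[0][j] = j * INDEL_COST
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + INDEL_COST,                        # delete s[i-1]
                d[i][j - 1] + INDEL_COST,                        # insert t[j-1]
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),  # substitute
            )
    return d[m][n]

# Two toy "linearized" reduced graphs.
print(weighted_edit_distance(["R", "D", "A"], ["R", "A", "A"]))  # 0.5
```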

Journal ArticleDOI
TL;DR: In this paper, an exact formula is introduced for the maximum number of common supersequences shared by sequences at a certain edit distance, yielding an upper bound on the number of distinct traces necessary to guarantee exact reconstruction.
Abstract: This paper studies problems in data reconstruction, an important area with numerous applications. In particular, we examine the reconstruction of binary and nonbinary sequences from synchronization (insertion/deletion-correcting) codes. These sequences have been corrupted by a fixed number of symbol insertions (larger than the minimum edit distance of the code), yielding a number of distinct traces to be used for reconstruction. We wish to know the minimum number of traces needed for exact reconstruction. This is a general version of a problem tackled by Levenshtein for uncoded sequences. We introduce an exact formula for the maximum number of common supersequences shared by sequences at a certain edit distance, yielding an upper bound on the number of distinct traces necessary to guarantee exact reconstruction. Without specific knowledge of the code words, this upper bound is tight. We apply our results to the famous single deletion/insertion-correcting Varshamov–Tenengolts (VT) codes and show that a significant number of VT code word pairs achieve the worst case number of outputs needed for exact reconstruction. We also consider extensions to other channels, such as adversarial deletion and insertion/deletion channels and probabilistic channels.

77 citations
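The central quantity in the abstract above, the number of common supersequences of two words, can be computed by brute force for toy instances. The sketch below (illustration only; the paper derives an exact formula rather than enumerating) intersects the sets of length-(n + t) supersequences of two binary words after t insertions.

```python
def supersequences(word, total_len, alphabet=("0", "1")):
    """All distinct strings of length `total_len` containing `word` as a subsequence."""
    results = set()

    def is_subsequence(w, s):
        it = iter(s)
        return all(c in it for c in w)

    # Enumerate every string of the target length (exponential; toy sizes only).
    def gen(prefix):
        if len(prefix) == total_len:
            if is_subsequence(word, prefix):
                results.add(prefix)
            return
        for c in alphabet:
            gen(prefix + c)

    gen("")
    return results

def common_supersequence_count(x, y, insertions):
    """Number of length-(n + t) strings that are supersequences of both x and y."""
    total_len = len(x) + insertions
    return len(supersequences(x, total_len) & supersequences(y, total_len))

# Two binary words of length 4, channel performs t = 2 insertions.
print(common_supersequence_count("0101", "0110", 2))
```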

Book ChapterDOI
10 Sep 2012
TL;DR: The paper treats the problem of securely outsourcing sequence comparison from a client to remote servers: given two strings λ and μ of respective lengths n and m, find a minimum-cost sequence of insertions, deletions, and substitutions that transforms λ into μ.
Abstract: We treat the problem of secure outsourcing of sequence comparisons by a client to remote servers, which given two strings λ and μ of respective lengths n and m, consists of finding a minimum-cost sequence of insertions, deletions, and substitutions (also called an edit script) that transform λ into μ. In our setting a client owns λ and μ and outsources the computation to two servers without revealing to them information about either the input strings or the output sequence. Our solution is non-interactive for the client (who only sends information about the inputs and receives the output) and the client’s work is linear in its input/output. The servers’ performance is O(σmn) computation (which is optimal) and communication, where σ is the alphabet size, and the solution is designed to work when the servers have only O(σ(m + n)) memory. By utilizing garbled circuit evaluation in a novel way, we completely avoid public-key cryptography, which makes our solution particularly efficient.

77 citations
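The computation being outsourced is, at its core, the classic O(mn) dynamic program for a minimum-cost edit script. The sketch below shows that plaintext baseline with unit costs (an assumption for brevity); the paper's contribution, evaluating this computation obliviously across two servers via garbled circuits, is not reflected here.

```python
def edit_script(lam, mu):
    """Minimum-cost edit script (unit-cost insertions, deletions, substitutions)
    transforming `lam` into `mu`, via the standard O(mn) dynamic program."""
    n, m = len(lam), len(mu)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (lam[i - 1] != mu[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)

    # Trace back one optimal script; each op refers to positions in `lam`.
    script, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (lam[i - 1] != mu[j - 1]):
            if lam[i - 1] != mu[j - 1]:
                script.append(("substitute", i - 1, mu[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            script.append(("delete", i - 1))
            i -= 1
        else:
            script.append(("insert", i, mu[j - 1]))
            j -= 1
    return d[n][m], list(reversed(script))

cost, ops = edit_script("sunday", "saturday")
print(cost, ops)  # cost 3
```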


Network Information
Related Topics (5)
Graph (abstract data type): 69.9K papers, 1.2M citations, 86% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Feature vector: 48.8K papers, 954.4K citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 81% related
Scalability: 50.9K papers, 931.6K citations, 80% related
Performance Metrics
No. of papers in the topic in previous years:
Year  Papers
2023  39
2022  96
2021  111
2020  149
2019  145
2018  139