Topic

Edit distance

About: Edit distance is a research topic. Over its lifetime, 2887 publications have appeared within this topic, receiving 71491 citations.


Papers
Journal Article
TL;DR: It is shown that the approximate matching problem with swap and mismatch as the edit operations can be computed in time O(n √m log m).
Abstract: There is no known algorithm that solves the general case of the approximate string matching problem with the extended edit distance, where the edit operations are insertion, deletion, mismatch and swap, in time o(nm), where n is the length of the text and m is the length of the pattern. In an effort to study this problem, the edit operations were analysed independently. It turns out that the approximate matching problem with only the mismatch operation can be solved in time O(n √m log m). If the only edit operation allowed is swap, then the problem can be solved in time O(n log m log σ), where σ = min(m, |Σ|). In this paper we show that the approximate string matching problem with swap and mismatch as the edit operations can be computed in time O(n √m log m).

30 citations
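
For orientation, the swap-and-mismatch distance discussed in the entry above can be computed at every text position with a naive O(nm) dynamic program. The sketch below is only that quadratic baseline, not the O(n √m log m) algorithm of the paper. The function name `swap_mismatch_distances` is hypothetical, and it assumes the usual convention that a swap exchanges two adjacent pattern characters at cost 1 and that each character takes part in at most one swap.

```python
def swap_mismatch_distances(text: str, pattern: str) -> list[int]:
    """Naive O(n*m) baseline (an illustrative sketch, not the paper's method).

    For each alignment of `pattern` against `text`, return the minimum number
    of mismatches plus adjacent swaps needed to turn the pattern into the
    aligned text window.
    """
    n, m = len(text), len(pattern)
    results = []
    for i in range(n - m + 1):
        prev2, prev1 = 0, 0  # costs for pattern prefixes of length j-2 and j-1
        for j in range(1, m + 1):
            # Option 1: align pattern[j-1] with text[i+j-1], paying for a mismatch.
            cur = prev1 + (pattern[j - 1] != text[i + j - 1])
            # Option 2: swap pattern[j-2] and pattern[j-1] if that matches the text.
            if (j >= 2
                    and pattern[j - 1] == text[i + j - 2]
                    and pattern[j - 2] == text[i + j - 1]):
                cur = min(cur, prev2 + 1)
            prev2, prev1 = prev1, cur
        results.append(prev1)
    return results
```

For example, `swap_mismatch_distances("bacd", "abcd")` gives one value for the single alignment, in which one swap of the first two pattern characters is enough.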

Journal Article
TL;DR: This paper presents a fixed-parameter algorithm for the tree edit distance problem for unordered trees under the unit cost model that works in O(2.62^k · poly(n)) time and O(n^2) space, where the parameter k is the maximum bound of the edit distance and n is the maximum size of input trees.

30 citations

Posted Content
08 Nov 2017-bioRxiv
TL;DR: An algorithm is introduced to compute the minimum edit distance of a sequence of length m to any path in a node-labeled directed graph (V, E) in O(|V| + m|E|) time and O(|V|) space.
Abstract: Graphs are commonly used to represent sets of sequences. Either edges or nodes can be labeled by sequences, so that each path in the graph spells a concatenated sequence. Examples include graphs to represent genome assemblies, such as string graphs and de Bruijn graphs, and graphs to represent a pan-genome and hence the genetic variation present in a population. Being able to align sequencing reads to such graphs is a key step for many analyses and its applications include genome assembly, read error correction, and variant calling with respect to a variation graph. Given the wide range of applications of this basic problem, it is surprising that algorithms with optimal runtime are, to the best of our knowledge, yet unknown. In particular, aligning sequences to cyclic graphs currently represents a challenge both in theory and practice. Here, we introduce an algorithm to compute the minimum edit distance of a sequence of length m to any path in a node-labeled directed graph (V, E) in O(|V| + m|E|) time and O(|V|) space. The corresponding alignment can be obtained in the same runtime using O(√m |V|) space. The time complexity depends only on the length of the sequence and the size of the graph. In particular, it does not depend on the cyclicity of the graph, or any other topological features.

30 citations
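
The sequence-to-graph problem in the entry above can be illustrated with a shortest-path search over (node, sequence position) states. The sketch below is not the algorithm of the paper: it uses plain Dijkstra, so it carries an extra logarithmic factor and stores the whole state table, but it does handle cyclic graphs. It assumes each node carries a single-character label, that the path may start and end at any node, and that the empty path is allowed; the name `graph_edit_distance` and the virtual start node are illustrative choices.

```python
import heapq

def graph_edit_distance(labels: str, edges: list[tuple[int, int]], seq: str) -> int:
    """Minimum unit-cost edit distance from `seq` to the label string of any path.

    labels[v] is the one-character label of node v (an assumption of this
    sketch); edges are directed (u, w) pairs and may form cycles.
    """
    n, m = len(labels), len(seq)
    succ = [[] for _ in range(n + 1)]
    for u, w in edges:
        succ[u].append(w)
    start = n                       # virtual, unlabeled node with an edge to every node
    succ[start] = list(range(n))

    inf = float("inf")
    dist = [[inf] * (m + 1) for _ in range(n + 1)]
    dist[start][0] = 0
    heap = [(0, start, 0)]
    while heap:
        d, u, j = heapq.heappop(heap)
        if d > dist[u][j]:
            continue                # stale heap entry
        if j == m:
            return d                # first settled state that consumed all of seq
        moves = [(u, j + 1, 1)]     # seq[j] is matched to no graph node
        for w in succ[u]:
            moves.append((w, j, 1))                               # node w is skipped
            moves.append((w, j + 1, 0 if labels[w] == seq[j] else 1))  # match or substitute
        for v, k, cost in moves:
            if d + cost < dist[v][k]:
                dist[v][k] = d + cost
                heapq.heappush(heap, (d + cost, v, k))
    return m                        # not reached: seq can always be fully inserted
```

Because all step costs are 0 or 1, Dijkstra settles states in nondecreasing cost order, so the first state popped with the whole sequence consumed is optimal regardless of cycles.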

Journal Article
01 Sep 2013
TL;DR: This paper devises a pipeline framework over a two-level index for searching KNN in the sequence database using the edit distance, and brings various advantages over existing works, including a large reduction in false positive candidates, which avoids heavy overhead in candidate verification.
Abstract: In this paper, we address the problem of finding k-nearest neighbors (KNN) in sequence databases using the edit distance. Unlike most existing works using short and exact n-gram matchings together with a filter-and-refine framework for KNN sequence search, our new approach allows us to use longer but approximate n-gram matchings as a basis of KNN candidates pruning. Based on this new idea, we devise a pipeline framework over a two-level index for searching KNN in the sequence database. By coupling this framework together with several efficient filtering strategies, i.e. the frequency queue and the well-known Combined Algorithm (CA), our proposal brings various enticing advantages over existing works, including 1) huge reduction on false positive candidates to avoid large overheads on candidate verifications; 2) progressive result update and early termination; and 3) good extensibility to parallel computation. We conduct extensive experiments on three real datasets to verify the superiority of the proposed framework.

30 citations
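
As a point of comparison for the KNN entry above, the simplest filter-and-refine search scans every sequence, skips candidates whose length difference already rules them out, and verifies the rest with the full edit distance. The sketch below is that brute-force baseline only; it does not implement the two-level index, approximate n-gram matching, frequency queue, or Combined Algorithm described in the abstract. The names `edit_distance` and `knn_edit_distance` are hypothetical.

```python
import heapq

def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]

def knn_edit_distance(query: str, database: list[str], k: int) -> list[tuple[int, str]]:
    """Return up to k (distance, sequence) pairs closest to `query`.

    Filter step: |len(s) - len(query)| is a lower bound on the edit distance,
    so a candidate is skipped when that bound cannot beat the current k-th best.
    Ties are resolved in favor of the earlier database entry.
    """
    if k <= 0:
        return []
    heap = []  # max-heap on distance via negation: (-distance, -index, sequence)
    for idx, s in enumerate(database):
        lower_bound = abs(len(s) - len(query))
        if len(heap) == k and lower_bound >= -heap[0][0]:
            continue                                   # pruned without verification
        d = edit_distance(query, s)
        if len(heap) < k:
            heapq.heappush(heap, (-d, -idx, s))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, -idx, s))
    ranked = sorted((-nd, -nidx, s) for nd, nidx, s in heap)
    return [(d, s) for d, _, s in ranked]
```

The length filter is a much weaker bound than the n-gram based pruning the paper describes; it is used here only because it is easy to verify.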

Patent
Jilei Tian1, Sunil Sivadas1, Tommi Lahti1
09 Jun 2005
TL;DR: In this article, a model-based approach is proposed for measuring confusability or similarity between given entry pairs, including text string pairs and acoustic model pairs, in systems such as speech recognition and synthesis systems.
Abstract: A system and method are proposed for measuring confusability or similarity between given entry pairs, including text string pairs and acoustic model pairs, in systems such as speech recognition and synthesis systems. A string edit distance (Levenshtein distance) can be applied to measure distance between any pair of text strings. It also can be used to calculate a confusion measurement between acoustic model pairs of different words and a model-driven method can be used to calculate an HMM model confusion matrix. This model-based approach can be efficiently calculated with low memory and low computational resources. Thus it can improve the speech recognition performance and models trained from text corpus.

30 citations
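
The text-string part of the entry above, using the Levenshtein distance to judge how easily two entries could be confused, can be sketched as follows. The normalization by the longer string length and the names `confusability` and `most_confusable` are assumptions made for illustration. The patent's model-driven HMM confusion matrix is not reproduced here.

```python
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def confusability(a: str, b: str) -> float:
    """Hypothetical score in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def most_confusable(words: list[str], top: int = 3) -> list[tuple[float, str, str]]:
    """Rank all word pairs by the score above, most confusable first."""
    pairs = [(confusability(a, b), a, b) for a, b in combinations(words, 2)]
    return sorted(pairs, key=lambda t: -t[0])[:top]
```

With a small vocabulary such as `["call", "fall", "dial"]`, the pair differing by a single substitution ranks first.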


Network Information
Related Topics (5)
Graph (abstract data type)
69.9K papers, 1.2M citations
86% related
Unsupervised learning
22.7K papers, 1M citations
81% related
Feature vector
48.8K papers, 954.4K citations
81% related
Cluster analysis
146.5K papers, 2.9M citations
81% related
Scalability
50.9K papers, 931.6K citations
80% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    39
2022    96
2021    111
2020    149
2019    145
2018    139