Edit distance

About: Edit distance is a research topic. Over its lifetime, 2887 publications have been published within this topic, receiving 71491 citations.


Papers
Proceedings Article
30 Aug 2005
TL;DR: The approach presented is based on representing sets of strings at higher levels of the index structure as tries suitably compressed in a way that reasoning about edit distance between a query string and a compressed trie at index nodes is still feasible.
Abstract: In various applications such as data cleansing, being able to retrieve categorical or numerical attributes based on notions of approximate match (e.g., edit distance, numerical distance) is of profound importance. Commonly, approximate match predicates are specified on combinations of attributes in conjunction. Existing database techniques for approximate retrieval, however, limit their applicability to single attribute retrieval through B-trees and their variants. In this paper, we propose a methodology that utilizes known multidimensional indexing structures for the problem of approximate multi-attribute retrieval. Our method enables indexing of a collection of string and/or numeric attributes to facilitate approximate retrieval using edit distance as an approximate match predicate for strings and numeric distance for numeric attributes. The approach presented is based on representing sets of strings at higher levels of the index structure as tries suitably compressed in a way that reasoning about edit distance between a query string and a compressed trie at index nodes is still feasible. We propose and evaluate various techniques to generate the compressed trie representation and fully specify our indexing methodology. Our experimental results show the benefits of our proposal when compared with various alternate strategies for the same problem.

24 citations
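
The abstract above relies on reasoning about the edit distance between a query string and a trie. As a rough illustration of that general idea only, the sketch below searches a plain, uncompressed trie for all stored strings within a given edit distance of a query, sharing one dynamic-programming row per trie edge. It is not the paper's compressed-trie index; the names `TrieNode`, `build_trie`, and `search_within` are hypothetical and chosen for this example.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """One node of a plain (uncompressed) character trie."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None  # set when a stored string ends here


def build_trie(words: List[str]) -> TrieNode:
    """Insert every word into a fresh trie and return its root."""
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root


def search_within(root: TrieNode, query: str, max_dist: int) -> List[Tuple[str, int]]:
    """Return (word, distance) for every stored word within max_dist edits."""
    results: List[Tuple[str, int]] = []
    # Row for the empty prefix: turning "" into query[:j] takes j insertions.
    first_row = list(range(len(query) + 1))

    def visit(node: TrieNode, ch: str, prev_row: List[int]) -> None:
        # Extend the DP table by one row for the trie edge labelled `ch`.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1,            # insertion
                           prev_row[j] + 1,           # deletion
                           prev_row[j - 1] + cost))   # substitution or match
        if node.word is not None and row[-1] <= max_dist:
            results.append((node.word, row[-1]))
        # Prune: no extension of this prefix can score below min(row).
        if min(row) <= max_dist:
            for next_ch, child in node.children.items():
                visit(child, next_ch, row)

    if root.word is not None and first_row[-1] <= max_dist:
        results.append((root.word, first_row[-1]))
    for ch, child in root.children.items():
        visit(child, ch, first_row)
    return sorted(results)


trie = build_trie(["data", "date", "gate", "database", "cleansing"])
matches = search_within(trie, "dat", 1)
```

Because strings with a common prefix share the rows computed for that prefix, whole subtrees can be skipped once the smallest entry of a row exceeds the threshold.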

Journal Article
TL;DR: In this paper, the problem of computing the transposition invariant distance for various distance functions d, that are different versions of the edit distance, was studied, and algorithms whose time complexities are close to the known upper bounds were given.
Abstract: Given strings $A$ and $B$ over an alphabet $\Sigma \subseteq U$, where $U$ is some numerical universe closed under addition and subtraction, and a distance function $d(A, B)$ that gives the score of the best (partial) matching of $A$ and $B$, the transposition invariant distance is $\min_{t \in U} d(A + t, B)$, where $A + t = (a_1 + t)(a_2 + t) \cdots (a_m + t)$. We study the problem of computing the transposition invariant distance for various distance (and similarity) functions $d$ that are different versions of the edit distance. For all these problems we give algorithms whose time complexities are close to the known upper bounds without transposition invariance. In particular, we show how sparse dynamic programming can be used to solve transposition invariant problems.

24 citations
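
To make the definition in the abstract above concrete, here is a brute-force sketch for the unit-cost edit distance over integer sequences. It is not the sparse dynamic programming approach the paper describes. It rests on one observation: only shifts of the form t = b_j - a_i can produce a match, and any other shift gives a distance of max(len(a), len(b)), which a candidate shift never exceeds. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Unit-cost edit distance, keeping only two DP rows in memory."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute or match
        prev = cur
    return prev[-1]


def transposition_invariant_distance(a: Sequence[int],
                                     b: Sequence[int]) -> Tuple[int, int]:
    """Return (distance, shift t) minimising edit_distance(a + t, b).

    Brute force: tries every candidate shift b_j - a_i, so it performs up to
    len(a) * len(b) full edit-distance computations.
    """
    candidates = {y - x for x in a for y in b} or {0}
    scored: List[Tuple[int, int, int]] = []
    for t in candidates:
        shifted = [x + t for x in a]
        # abs(t) is used only to break ties in favour of small shifts.
        scored.append((edit_distance(shifted, b), abs(t), t))
    dist, _, best_t = min(scored)
    return dist, best_t


# A short sequence and the same sequence shifted up by 7.
melody = [60, 62, 64, 65]
shifted_melody = [67, 69, 71, 72]
plain = edit_distance(melody, shifted_melody)
invariant = transposition_invariant_distance(melody, shifted_melody)
```

The plain distance treats the two sequences as entirely different, while the transposition invariant distance recognises them as identical up to a shift.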

Lei Chen
01 Jan 2005
TL;DR: Various similarity models are proposed to capture the similarities among time series and trajectory data under various circumstances and requirements, such as the appearance of noise and local time shifting.
Abstract: Time series data have been used in many applications, such as financial data analysis and weather forecasting. Similarly, trajectories of moving objects are often used to perform movement pattern analysis in surveillance video and sensor monitoring systems. All these applications are closely related to similarity-based time series or trajectory data retrieval. In this dissertation, various similarity models are proposed to capture the similarities among time series and trajectory data under various circumstances and requirements, such as the appearance of noise and local time shifting. A novel representation, called multi-scale time series histograms, is proposed to answer pattern existence queries and shape match queries. Earlier proposals generally address one or the other; multi-scale time series histograms can answer both types, which offers users more flexibility. A metric distance function, called Edit distance with Real Penalty (ERP), is proposed that can support local time shifting in time series and trajectory data. A second distance function, Edit Distance on Real sequence (EDR), is proposed to measure the similarity between time series or trajectories with local time shifting and noise. Since the proposed similarity models are computationally expensive, several indexing and pruning methods are proposed to improve the retrieval efficiency. For multi-scale time series histograms, a multi-step filtering process is introduced to improve the retrieval efficiency without introducing false dismissals. For ERP, a framework is developed to index time series or trajectory data under a metric distance function, which exploits the pruning power of lower bounding and triangle inequality. For EDR, three pruning techniques (mean value Q-grams, near triangle inequality, and trajectory histograms) are developed to improve the retrieval efficiency.

24 citations
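
The abstract above names ERP and EDR without giving their recurrences. The sketch below follows the definitions as they are commonly stated for these two measures, simplified to one-dimensional series, and should be read as an illustration under those assumptions rather than as the dissertation's implementation. The gap reference value `g` (defaulting to 0) and the matching threshold `eps` are parameters of this sketch.

```python
from typing import List, Sequence


def erp(r: Sequence[float], s: Sequence[float], g: float = 0.0) -> float:
    """Edit distance with Real Penalty for 1-D series.

    Aligning two elements costs their absolute difference; leaving an
    element unaligned (a gap) costs its absolute difference from g.
    """
    m, n = len(r), len(s)
    d: List[List[float]] = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + abs(r[i - 1] - g)
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + abs(s[j - 1] - g)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j - 1] + abs(r[i - 1] - s[j - 1]),  # align
                          d[i - 1][j] + abs(r[i - 1] - g),             # gap in s
                          d[i][j - 1] + abs(s[j - 1] - g))             # gap in r
    return d[m][n]


def edr(r: Sequence[float], s: Sequence[float], eps: float) -> int:
    """Edit Distance on Real sequence for 1-D series.

    Two elements match (cost 0) when they differ by at most eps; every
    mismatch, insertion, or deletion costs 1, so one noisy point adds at most 1.
    """
    m, n = len(r), len(s)
    d: List[List[int]] = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subcost = 0 if abs(r[i - 1] - s[j - 1]) <= eps else 1
            d[i][j] = min(d[i - 1][j - 1] + subcost,  # match or mismatch
                          d[i - 1][j] + 1,            # skip element of r
                          d[i][j - 1] + 1)            # skip element of s
    return d[m][n]


erp_value = erp([1, 2, 3], [1, 3])
# 9.0 plays the role of a noisy sample inserted into an otherwise similar series.
edr_value = edr([1.0, 2.0, 3.0], [1.1, 2.1, 9.0, 3.1], eps=0.25)
```

In this form ERP charges real-valued penalties, which is what lets it satisfy the triangle inequality, while EDR quantises every operation to 0 or 1, which limits the influence of outliers.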

Book Chapter
30 Jun 2009
TL;DR: ELKI 0.2 is extended to time series data and offers a selection of specialized distance measures; it can serve as a visualization and evaluation tool for the behavior of different distance measures on time series data.
Abstract: ELKI is a unified software framework, designed as a tool suitable for evaluation of different algorithms on high dimensional real-valued feature-vectors. A special case of high dimensional real-valued feature-vectors are time series data, where traditional distance measures like $L_p$-distances can be applied. However, a broad range of specialized distance measures (e.g., dynamic time warping) and generalized distance measures such as second-order distances (e.g., shared-nearest-neighbor distances) have also been proposed. The new version ELKI 0.2 is now extended to time series data and offers a selection of these distance measures. It can serve as a visualization and evaluation tool for the behavior of different distance measures on time series data.

23 citations

Journal Article
TL;DR: This paper presents an algorithm running in $O(nN \lg(N/n))$ time for computing the edit-distance of these two strings under any rational scoring function, and an $O(n^{2/3} N^{4/3})$ time algorithm for arbitrary scoring functions.
Abstract: The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution for this problem computes the edit-distance between a pair of strings of total length $O(N)$ in $O(N^2)$ time. To this date, this quadratic upper-bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound in case the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known $o(N^2)$ edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression. In this paper we set out to answer this question by using straight line programs. These provide a generic platform for representing many popular compression schemes including the LZ-family, Run-Length Encoding, Byte-Pair Encoding, and dictionary methods. For two strings of total length $N$ having straight-line program representations of total size $n$, we present an algorithm running in $O(nN \lg(N/n))$ time for computing the edit-distance of these two strings under any rational scoring function, and an $O(n^{2/3} N^{4/3})$ time algorithm for arbitrary scoring functions. Our new result, while providing a speed up for compressible strings, does not surpass the quadratic time bound even in the worst case scenario.

23 citations
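
For reference, the standard quadratic-time dynamic programming solution mentioned in the abstract above can be sketched as follows. This is the textbook baseline, not the straight-line-program algorithm the paper presents; the cost parameters `ins`, `dele`, and `sub` are an assumed, simple form of scoring function added for this example.

```python
from typing import List


def edit_distance(a: str, b: str, ins: int = 1, dele: int = 1, sub: int = 1) -> int:
    """Classic edit distance via a full (len(a)+1) x (len(b)+1) table.

    table[i][j] holds the cheapest way to turn a[:i] into b[:j].
    Time and space are both proportional to len(a) * len(b).
    """
    m, n = len(a), len(b)
    table: List[List[int]] = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        table[i][0] = i * dele          # delete all of a[:i]
    for j in range(1, n + 1):
        table[0][j] = j * ins           # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            change = 0 if a[i - 1] == b[j - 1] else sub
            table[i][j] = min(table[i - 1][j] + dele,        # delete a[i-1]
                              table[i][j - 1] + ins,         # insert b[j-1]
                              table[i - 1][j - 1] + change)  # match or substitute
    return table[m][n]


unit_cost = edit_distance("kitten", "sitting")
costly_substitution = edit_distance("kitten", "sitting", sub=2)
```

Every cell is filled exactly once from three neighbours, which is where the quadratic bound comes from; the compressed-string algorithms described in the abstract aim to avoid touching every cell when the inputs are highly repetitive.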


Network Information
Related Topics (5)
Graph (abstract data type): 69.9K papers, 1.2M citations, 86% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Feature vector: 48.8K papers, 954.4K citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 81% related
Scalability: 50.9K papers, 931.6K citations, 80% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    39
2022    96
2021    111
2020    149
2019    145
2018    139