Topic

Edit distance

About: Edit distance is a research topic. Over its lifetime, 2887 publications have been published within this topic, receiving 71491 citations.


Papers
Patent
09 Feb 2004
TL;DR: This patent describes a process that determines, for a search string, which strings (if any) in a text list have an edit distance from the search string below a threshold, using dynamic programming.
Abstract: A process determines for a search string which, if any, of the strings in a text list have edit distance from the search string less than a threshold. The process uses dynamic programming on a grid with search string characters corresponding to rows and text characters corresponding to columns. For each text string, computation proceeds by columns. If successive text strings share a prefix, then the columns corresponding to the prefix are re-used. If the minimum value in a column is at least the threshold, then the prefix corresponding to that and previous columns causes edit distance to be at least the threshold. So the computation for the present text is abandoned, and computations for any other texts that share the prefix are avoided.
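The column-wise scheme described in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `search_within_threshold` and the choice to sort the text list (so that strings sharing a prefix are adjacent) are assumptions made for the example, and unit-cost Levenshtein distance is assumed as the edit distance.

```python
def search_within_threshold(query, texts, threshold):
    """Return the texts whose edit distance from `query` is < threshold.

    Hypothetical helper: rows of the grid correspond to query characters,
    columns to text characters, and columns are computed left to right.
    """
    results = []
    # cols[j] is the DP column after consuming j text characters.
    # Column 0 compares every query prefix with the empty text prefix.
    cols = [list(range(len(query) + 1))]
    prev_text = ""
    dead_prefix = None  # prefix known to force distance >= threshold

    # Assumption: sorting makes texts with a common prefix consecutive.
    for text in sorted(texts):
        # Skip texts that extend a prefix already shown to be hopeless.
        if dead_prefix is not None and text.startswith(dead_prefix):
            continue

        # Re-use the columns of the prefix shared with the previous text.
        limit = min(len(prev_text), len(text), len(cols) - 1)
        shared = 0
        while shared < limit and prev_text[shared] == text[shared]:
            shared += 1
        del cols[shared + 1:]

        abandoned = False
        for j in range(shared, len(text)):
            prev = cols[j]
            col = [prev[0] + 1]
            for i in range(1, len(query) + 1):
                cost = 0 if query[i - 1] == text[j] else 1
                col.append(min(prev[i] + 1,          # skip a text character
                               col[i - 1] + 1,       # skip a query character
                               prev[i - 1] + cost))  # match or substitute
            cols.append(col)
            # Column minima never decrease, so once the minimum reaches
            # the threshold no extension of this prefix can qualify.
            if min(col) >= threshold:
                dead_prefix = text[:j + 1]
                abandoned = True
                break

        prev_text = text
        if not abandoned and cols[-1][-1] < threshold:
            results.append(text)
    return results
```

For example, `search_within_threshold("kitten", ["kitchen", "kitten", "mitten", "sitting"], 3)` would consider "sitting" (distance 3) out of range, since the abstract's condition is strictly less than the threshold. Results come back in sorted order because of the sorting assumption above.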

17 citations

Journal ArticleDOI
01 Nov 2010
TL;DR: This paper addresses the problem of finding pairs of strings with small Hamming distances from huge databases composed of short strings of a fixed length, and proposes an algorithm that runs in time almost linear in the input/output size.
Abstract: Finding similar substrings/substructures is a central task in analyzing huge string data such as genome sequences, Web documents, log data, feature vectors of pictures, photos, videos, etc. Although the existence of polynomial time algorithms for such problems is trivial since the number of substrings is bounded by the square of their lengths, straightforward algorithms do not work for huge databases because of their high degree order of the computation time. This paper addresses the problem of finding pairs of strings with small Hamming distances from huge databases composed of short strings of a fixed length. Comparison of long strings can be solved by inputting all their substrings of fixed length so that we can find candidates of similar non-short substrings. We focus on the practical efficiency of algorithms, and propose an algorithm that runs in time almost linear in the input/output size. We prove that the computation time of its variant is linear in the database size when the length of the short strings is constant, and computational experiments for genome sequences and Web texts show its practical efficiency. Slight modifications adapt to the edit distance and mismatch tolerance computation. An implementation is available at the author’s homepage.

17 citations

Book ChapterDOI
30 Mar 2005
TL;DR: This paper explores distance measures based on genetic operators for tree-based genetic programming, focusing on the subtree crossover operator, and makes progress toward improved algorithmic analysis by using appropriate measures of distance and similarity.
Abstract: This paper explores distance measures based on genetic operators for genetic programming using tree structures. The consistency between genetic operators and distance measures is a crucial point for analytical measures of problem difficulty, such as fitness distance correlation, and for measures of population diversity, such as entropy or variance. The contribution of this paper is the exploration of possible definitions and approximations of operator-based edit distance measures. In particular, we focus on the subtree crossover operator. An empirical study is presented to illustrate the features of an operator-based distance. This paper makes progress toward improved algorithmic analysis by using appropriate measures of distance and similarity.

17 citations

Proceedings ArticleDOI
07 Apr 2008
TL;DR: The new algorithm, called P2P fast similarity search (P2PFastSS), finds similar keys in any distributed hash table (DHT) using the edit distance metric, and is independent of the underlying P2P routing algorithm.
Abstract: Peer-to-peer (P2P) systems show numerous advantages over centralized systems, such as load balancing, scalability, and fault tolerance, and they require certain functionality, such as search, repair, and message and data transfer. In particular, structured P2P networks perform an exact search in time logarithmic in the number of peers. However, keyword similarity search in a structured P2P network remains a challenge. Similarity search for service discovery can significantly improve service management in a distributed environment. As services are often described informally in text form, keyword similarity search can find the required services or data items more reliably. This paper presents a fast similarity search algorithm for structured P2P systems. The new algorithm, called P2P fast similarity search (P2PFastSS), finds similar keys in any distributed hash table (DHT) using the edit distance metric, and is independent of the underlying P2P routing algorithm. Performance analysis shows that P2PFastSS carries out a similarity search in time proportional to the logarithm of the number of peers. Simulations on PlanetLab confirm these results and show that a similarity search with 34,000 peers performs in less than three seconds on average. Thus, P2PFastSS is suitable for similarity search in large-scale network infrastructures, such as service description matching in service discovery or searching for similar terms in P2P storage networks.

17 citations

01 Jan 2001
TL;DR: A measure of the similarity of the long-term structure of musical pieces is presented, which can be matched to other similar scores using a generalized edit distance, in order to assess structural similarity.
Abstract: We present a measure of the similarity of the long-term structure of musical pieces. The system deals with raw polyphonic data. Through unsupervised learning, we generate an abstract representation of music, the “texture score”. This “texture score” can be matched to other similar scores using a generalized edit distance, in order to assess structural similarity. We notably apply this algorithm to the retrieval of different interpretations of the same song within a music database.

17 citations


Network Information
Related Topics (5)
Graph (abstract data type): 69.9K papers, 1.2M citations, 86% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Feature vector: 48.8K papers, 954.4K citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 81% related
Scalability: 50.9K papers, 931.6K citations, 80% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    39
2022    96
2021    111
2020    149
2019    145
2018    139