Topic

Edit distance

About: Edit distance is a research topic. To date, 2887 publications have appeared within this topic, receiving 71491 citations in total.


Papers
Book Chapter
18 Jun 2009
TL;DR: This paper proposes the first data structure for approximate dictionary search that occupies optimal space (up to a constant factor) and can answer an approximate query for edit distance 1 (report all dictionary strings at edit distance at most 1 from the query string) in time linear in the length of the query string.
Abstract: In the approximate dictionary search problem, we must construct a data structure on a set of strings so that we can answer queries of the form: find all strings in the set that are similar (according to some string distance) to a given string. In this paper we propose the first data structure for approximate dictionary search that occupies optimal space (up to a constant factor) and can answer an approximate query for edit distance 1 (report all dictionary strings at edit distance at most 1 from the query string) in time linear in the length of the query string. Based on our new dictionary, we propose a full-text index for approximate queries with edit distance 1 (report all positions of all substrings of the text that are at edit distance at most 1 from the query string) that answers a query in time linear in the length of the query string, using space $O(n(\lg(n)\lg\lg(n))^2)$ in the worst case on a text of length $n$. Our index is the first to answer queries in time linear in the length of the query string while using space $O(n \cdot \mathrm{poly}(\log n))$ in the worst case and for any alphabet size.

26 citations
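
To make the query semantics in the entry above concrete, here is a minimal brute-force sketch of an edit-distance-1 dictionary lookup. It is not the data structure proposed in the paper and does not achieve its space or time bounds: it simply enumerates every string within one insertion, deletion, or substitution of the query and intersects that set with the dictionary. The function names, the lowercase alphabet, and the sample words are assumptions made for illustration.

```python
import string


def neighbors_within_one(query, alphabet=string.ascii_lowercase):
    """Return every string at edit distance at most 1 from `query`."""
    results = {query}  # distance 0
    for i in range(len(query) + 1):
        # Insert a symbol before position i.
        for c in alphabet:
            results.add(query[:i] + c + query[i:])
        if i < len(query):
            # Delete the symbol at position i.
            results.add(query[:i] + query[i + 1:])
            # Substitute the symbol at position i.
            for c in alphabet:
                results.add(query[:i] + c + query[i + 1:])
    return results


def approx_lookup(dictionary, query, alphabet=string.ascii_lowercase):
    """Report all dictionary strings within edit distance 1 of `query`.

    `dictionary` is assumed to be a Python set of strings.
    """
    return sorted(dictionary & neighbors_within_one(query, alphabet))


if __name__ == "__main__":
    words = {"cat", "cart", "bat", "dog", "at", "cats"}
    print(approx_lookup(words, "cat"))
```

This baseline generates a number of candidates proportional to the query length times the alphabet size, which is the kind of overhead a dedicated index aims to avoid.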

Journal Article
TL;DR: This analysis is extended and a first approach to a robust statistical test is developed that addresses the central issue of whether two groups of songs belong to the same population of songs or are significantly different.
Abstract: The Levenshtein or string edit distance is an objective measure of the difference between two strings of elements. Levenshtein distance analysis has previously been applied to humpback whale songs, where it provided a quantitative measure of song change from year to year. This analysis is extended and a first approach to a robust statistical test is developed. The statistical test addresses the central issue whether two groups of songs (either from different individuals, different groups or different years) belong to the same population of songs or are significantly different. This is accomplished through derivation of the Kohonen median song sequence, which has the smallest possible summed Levenshtein distance to all songs of the group. By a simple t-test or nonparametric equivalent it is tested whether the median distance to the Kohonen median song sequence of a second group is significantly larger, which indicates that the groups are different. The test is expanded to handle multiple comparisons among several groups of songs.

26 citations
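
The Levenshtein distance and the idea of a median sequence from the entry above can be sketched as follows. This is a simplified illustration, not the authors' procedure: the `set_median` helper (a hypothetical name) only searches among the songs already in the group, whereas the median described in the abstract minimizes the summed distance over all possible sequences. Songs are assumed to be strings or lists of unit labels.

```python
def levenshtein(a, b):
    """Levenshtein distance between two sequences, using two DP rows."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def set_median(songs):
    """Return the group member with the smallest summed distance to all members."""
    return min(songs, key=lambda s: sum(levenshtein(s, t) for t in songs))


if __name__ == "__main__":
    group = ["abc", "abd", "abc", "xbc"]
    median = set_median(group)
    print(median, [levenshtein(median, s) for s in group])
```

The list of distances to the median is the kind of sample that a t-test or nonparametric test could then compare across two groups.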

Proceedings Article
Jasha Droppo, Alex Acero
14 Mar 2010
TL;DR: It is shown how this phonetic string edit distance can be learned from data, and that including context in the model is essential for good performance, and improved accuracy on a business search task is demonstrated.
Abstract: An automatic speech recognition system searches for the word transcription with the highest overall score for a given acoustic observation sequence. This overall score is typically a weighted combination of a language model score and an acoustic model score. We propose including a third score, which measures the similarity of the word transcription's pronunciation to the output of a less constrained phonetic recognizer. We show how this phonetic string edit distance can be learned from data, and that including context in the model is essential for good performance. We demonstrate improved accuracy on a business search task.

26 citations
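
As a rough illustration of a phonetic string edit distance like the one in the entry above, the sketch below computes an edit distance over phone sequences with per-pair substitution costs. The cost table, phone labels, and function name are hypothetical; in the paper the costs are learned from data and depend on context, neither of which this context-free sketch attempts.

```python
def weighted_edit_distance(ref, hyp, sub_cost, ins_cost=1.0, del_cost=1.0):
    """Edit distance between phone sequences with per-pair substitution costs.

    `sub_cost` maps (ref_phone, hyp_phone) to a cost; unlisted pairs cost 1.0.
    """
    prev = [j * ins_cost for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [i * del_cost]
        for j, h in enumerate(hyp, start=1):
            s = 0.0 if r == h else sub_cost.get((r, h), 1.0)
            curr.append(min(
                prev[j] + del_cost,      # drop a reference phone
                curr[j - 1] + ins_cost,  # extra phone in the hypothesis
                prev[j - 1] + s,         # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical cost: confusing "ae" with "eh" is cheaper than other swaps.
    costs = {("ae", "eh"): 0.3}
    print(weighted_edit_distance(["k", "ae", "t"], ["k", "eh", "t"], costs))
    print(weighted_edit_distance(["k", "ae", "t"], ["k", "ae", "p"], costs))
```

A score of this kind could be combined with language model and acoustic model scores as a third term, which is the role the abstract describes.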

Journal Article
TL;DR: This paper proposes a new distance for sequences of symbols (or strings) called Optimal Symbol Alignment distance (OSA distance, for short), whose very low cost in practice makes it a suitable candidate for computing distances in applications with large numbers of sequences.
Abstract: Comparison functions for sequences (of symbols) are important components of many applications, for example, clustering, data cleansing, and integration. For years, many efforts have been made to improve the performance of such comparison functions. Improvements have been made either at the cost of reducing the accuracy of the comparison, or by compromising certain basic characteristics of the functions, such as the triangle inequality. In this paper, we propose a new distance for sequences of symbols (or strings) called Optimal Symbol Alignment distance (OSA distance, for short). This distance has a very low cost in practice, which makes it a suitable candidate for computing distances in applications with large numbers of (very long) sequences. After providing a mathematical proof that the OSA distance is a real distance, we present some experiments for different scenarios (DNA sequences, record linkage, etc.), showing that the proposed distance outperforms, in terms of execution time and/or accuracy, other well-known comparison functions such as the Edit or Jaro-Winkler distances.

26 citations

Patent
18 May 2007
TL;DR: The authors provide a character string updated degree evaluation program that enables quantitative grasping of the amount of intellectual work done through editing and updating of character strings, where a text subjected to comparison is divided into common part character strings, each having a length greater than or equal to a threshold value, and non-common part character strings.
Abstract: There is provided a character string updated degree evaluation program that enables quantitative grasping of the amount of intellectual work done through editing and updating of character strings. A text subjected to comparison is divided into common part character strings, each having a length greater than or equal to a threshold value, and non-common part character strings. The number of edited points from the original text and a context edit distance are calculated based on the rate of the common part character strings and their occurrence pattern. The number of edited points is acquired from the number of elements contained in a common part character string set, and the context edit distance is acquired from a change in the order of occurrence of the common part character strings. Calculation of a new creation percentage and analysis by N-gram are performed on the non-common part character strings. The new creation percentage is acquired from the total length of the elements contained in a non-common part character string set, and a new creation novelty degree is acquired from a non-partial matching rate between a non-common part character string set and an element contained in the non-common part character string set. Calculations for the common part character string set and for the non-common part character string set are united, thereby calculating a text updated degree.

26 citations
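
The first step described in the entry above, dividing a text into common parts of at least a threshold length and the remaining non-common parts, can be sketched with Python's standard `difflib`. This is only an assumed reading of the abstract, not the patented method: the function names, the threshold of 3, and the choice to normalize the non-common length by the length of the new text are all illustrative assumptions.

```python
from difflib import SequenceMatcher


def split_common(old, new, threshold=3):
    """Split `new` into common parts (shared with `old`, length >= threshold)
    and the non-common parts lying between them."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    common, non_common = [], []
    cursor = 0  # end of the last common part kept, as an index into `new`
    for block in matcher.get_matching_blocks():
        if block.size < threshold:
            continue  # short matches (and the final size-0 sentinel) are ignored
        if block.b > cursor:
            non_common.append(new[cursor:block.b])
        common.append(new[block.b:block.b + block.size])
        cursor = block.b + block.size
    if cursor < len(new):
        non_common.append(new[cursor:])
    return common, non_common


def new_creation_percentage(old, new, threshold=3):
    """Assumed definition: share of `new` covered by non-common parts."""
    _, non_common = split_common(old, new, threshold)
    return sum(len(s) for s in non_common) / len(new) if new else 0.0


if __name__ == "__main__":
    old_text = "the quick brown fox"
    new_text = "the quick red fox jumps"
    print(split_common(old_text, new_text))
    print(new_creation_percentage(old_text, new_text))
```

The number of common parts and their order of occurrence would feed the edited-point count and context edit distance mentioned in the abstract; those steps are not sketched here.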


Network Information
Related Topics (5)
Graph (abstract data type): 69.9K papers, 1.2M citations, 86% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Feature vector: 48.8K papers, 954.4K citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 81% related
Scalability: 50.9K papers, 931.6K citations, 80% related
Performance Metrics
No. of papers in the topic in previous years
Year  Papers
2023  39
2022  96
2021  111
2020  149
2019  145
2018  139