Topic

Edit distance

About: Edit distance is a research topic. Over its lifetime, 2,887 publications have been published within this topic, receiving 71,491 citations.


Papers
Patent
02 Jan 2007
TL;DR: A system and method of machine learning that uses the inverse of a reference similarity matrix as a transformation matrix is proposed. The transformation matrix is used to improve the performance of query vectors in classifying or identifying digital representations of an unknown object.
Abstract: A system and method of machine learning that uses an inverse matrix of a reference similarity matrix as a transformation matrix. The reference similarity matrix relates a reference set of objects to themselves using a distance metric such as an image edit distance. The transformation matrix is used to improve the performance of query vectors in classifying or identifying digital representations of an unknown object. The query vector is a measure of similarity between the unknown object and the members of the reference set. Multiplying the query vector by the transformation matrix produces an improved query vector having improved similarity scores. The highest improved similarity score indicates the best match member of the reference set. If the similarity score is high enough, the unknown object may either be classified as belonging to the same class, or recognized as being the same object, as the best match object.

20 citations
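
The transformation described in the abstract above can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: strings stand in for images, plain Levenshtein distance stands in for the image edit distance, distances are converted to similarities as 1 - d / max(len), and the names `levenshtein`, `similarity`, `invert`, `improve_query`, `best_match` and `REFERENCES` are hypothetical.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute, each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed conversion from distance to a similarity score in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting; raises ValueError if singular."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("similarity matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def improve_query(query, transform):
    """Multiply the query row vector by the transformation matrix."""
    n = len(transform)
    return [sum(query[k] * transform[k][j] for k in range(n)) for j in range(n)]


def best_match(unknown, references, transform):
    """Return the reference with the highest improved similarity score."""
    query = [similarity(unknown, ref) for ref in references]
    improved = improve_query(query, transform)
    best = max(range(len(references)), key=lambda i: improved[i])
    return references[best], improved


# Hypothetical reference set; S relates the reference objects to themselves.
REFERENCES = ["kitten", "sitting", "mitten"]
S = [[similarity(a, b) for b in REFERENCES] for a in REFERENCES]
T = invert(S)  # transformation matrix = inverse of the reference similarity matrix
```

One property of this construction: when the unknown object is identical to a reference member, its query vector equals a row of S, so multiplying by the inverse yields a unit vector with a 1 at that member and 0 elsewhere (up to floating-point error). For example, `best_match("kitten", REFERENCES, T)` selects `"kitten"`.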

Patent
30 Sep 2005
TL;DR: A system and method of approximating edit distance for a set of character strings in a database produces a representative sketch for each character string and approximates the edit distance between two selected character strings based only on their representative sketches.
Abstract: A system and method of approximating edit distance for a set of character strings in a database includes producing a representative sketch for each of the character strings; and approximating an edit distance between two selected character strings based only on the representative sketch for each of the selected character strings. The character strings may comprise text, wherein the method further comprises encoding positions of substrings in the text using anchors, wherein the anchors comprise identical substrings occurring in two input character strings at a nearby position. A set of anchors may be used in a correlated manner, wherein character strings with a sufficiently small edit distance are likely to use a same sequence of anchors. The character strings may be substantially non-repetitive. The representative sketch of a first character string is preferably constructed absent knowledge of a second character string. A size of the representative sketch may be constant.

20 citations

Proceedings ArticleDOI
29 Nov 2001
TL;DR: The FlExPat algorithm is designed to cope with the trade-off between flexibility, particularly in sequence data representation and in the associated similarity metrics, and computational efficiency. Experimental results obtained with FlExPat on music data are presented and discussed.
Abstract: This paper addresses sequential data mining, a sub-area of data mining where the data to be analyzed is organized in sequences. In many problem domains a natural ordering exists over data. Examples of sequential databases (SDBs) include: (a) collections of temporal data sequences, such as chronological series of daily stock indices or multimedia data (sound, music, video, etc.); and (b) macromolecule banks, where amino acid or protein sequences are represented as strings. In an SDB it is often valuable to detect regularities across one or several sequences. In particular, finding exact or approximate repetitions of segments can be used directly (e.g. for determining the biochemical activity of a protein region) or indirectly (e.g. for prediction in finance). To this end, we present concepts and an algorithm for automatically extracting sequential patterns from a sequential database. Such a pattern is defined as a group of significantly similar segments from one or several sequences. Appropriate functions for measuring similarity between sequence segments are proposed, generalizing the edit distance framework. There is a trade-off between flexibility, particularly in sequence data representation and in the associated similarity metrics, and computational efficiency. We designed the FlExPat algorithm to cope with this trade-off. FlExPat's complexity is in practice less than quadratic in the total length of the SDB analyzed, while allowing high flexibility. Some experimental results obtained with FlExPat on music data are presented and discussed.

20 citations

Journal ArticleDOI
TL;DR: The improved method is obtained by introducing a dynamic programming scheme and heuristic techniques to the previous clique-based method for the tree edit distance problem for unordered trees, and is much faster than the previous method.
Abstract: Many kinds of tree-structured data, such as RNA secondary structures, have become available due to the progress of techniques in the field of molecular biology. To analyze tree-structured data, various measures for computing the similarity between trees have been developed and applied. Among them, tree edit distance is one of the most widely used measures. However, the tree edit distance problem for unordered trees is NP-hard, so efficient algorithms for the problem are needed. Recently, a practical method called the clique-based algorithm has been proposed, but it is not fast for large trees. This article presents an improved clique-based method for the tree edit distance problem for unordered trees. The improved method is obtained by introducing a dynamic programming scheme and heuristic techniques to the previous clique-based method. To evaluate the efficiency of the improved method, we applied the method to comparison of real tree structured data such as glycan structure...

20 citations

Journal ArticleDOI
Hao Liu, Qingjie Zhao, Hao Wang, Peng Lv, Yanming Chen
TL;DR: An image-based algorithm using an improved Edit distance is proposed for near-duplicate video retrieval and localization. A detect-and-refine-strategy-based dynamic programming algorithm generates the path matrix, which can be used to aggregate scores for video similarity measurement and to localize the similar parts.
Abstract: The rapid development of social networks in recent years has spurred enormous growth of near-duplicate videos. The existence of huge volumes of near-duplicates creates a rising demand for effective near-duplicate video retrieval techniques for copyright violation detection and search result reranking. In this paper, we propose an image-based algorithm using improved Edit distance for near-duplicate video retrieval and localization. By regarding video sequences as strings, Edit distance is used and extended to retrieve and localize near-duplicate videos. First, a bag-of-words (BOW) model is used to measure the frame similarities, which is robust to spatial transformations. Then, non-near-duplicate videos are filtered out by computing the proposed relative Edit distance similarity (REDS). Next, a detect-and-refine-strategy-based dynamic programming algorithm is proposed to generate the path matrix, which can be used to aggregate scores for video similarity measurement and to localize the similar parts. Experiments on the CC_WEB_VIDEO and TREC CBCD 2011 datasets demonstrated the effectiveness and robustness of the proposed method in retrieval and localization tasks.

20 citations
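
The idea of treating video sequences as strings, as described in the abstract above, can be sketched as follows. This is not the paper's algorithm: the REDS measure and the detect-and-refine procedure are not defined in the abstract and are not reproduced here. The sketch assumes each frame is a set of visual-word IDs, uses Jaccard overlap as a stand-in frame similarity, and treats two frames as matching when that similarity reaches a hypothetical `threshold`; all function names are illustrative.

```python
def frame_similarity(frame_a, frame_b):
    """Jaccard overlap between two sets of visual-word IDs (assumed stand-in)."""
    if not frame_a and not frame_b:
        return 1.0
    return len(frame_a & frame_b) / len(frame_a | frame_b)


def frame_edit_distance(video_a, video_b, threshold=0.5):
    """Edit distance over frame sequences; frames 'match' above the threshold."""
    prev = list(range(len(video_b) + 1))
    for i, fa in enumerate(video_a, start=1):
        curr = [i]
        for j, fb in enumerate(video_b, start=1):
            substitution = 0 if frame_similarity(fa, fb) >= threshold else 1
            curr.append(min(prev[j] + 1,                  # drop a frame from video_a
                            curr[j - 1] + 1,              # insert a frame from video_b
                            prev[j - 1] + substitution))  # align the two frames
        prev = curr
    return prev[-1]


def normalized_similarity(video_a, video_b, threshold=0.5):
    """Assumed normalization: 1 - distance / length of the longer video."""
    longest = max(len(video_a), len(video_b))
    if longest == 0:
        return 1.0
    return 1.0 - frame_edit_distance(video_a, video_b, threshold) / longest


# Hypothetical data: a near-duplicate with one altered and one inserted frame.
original = [{1, 2, 3}, {4, 5, 6}, {7, 8, 9}]
near_duplicate = [{1, 2, 3, 10}, {4, 5, 6}, {20, 21}, {7, 8, 9}]
unrelated = [{30}, {31}, {32}]
```

Because matching is thresholded rather than exact, the slightly altered first frame still aligns with its original, and only the inserted frame contributes to the distance.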


Network Information
Related Topics (5)
Graph (abstract data type): 69.9K papers, 1.2M citations, 86% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Feature vector: 48.8K papers, 954.4K citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 81% related
Scalability: 50.9K papers, 931.6K citations, 80% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    39
2022    96
2021    111
2020    149
2019    145
2018    139