scispace - formally typeset
Search or ask a question
Topic

Edit distance

About: Edit distance is a research topic. Over the lifetime, 2887 publications have been published within this topic receiving 71491 citations.


Papers
More filters
Journal ArticleDOI
TL;DR: The experimental results proved that the participation of the proposed aligner in STS is effective, and the proposed UESTS outperforms the state-of-the-art unsupervised approaches, which is a promising result.
Abstract: Semantic textual similarity (STS) is the task of assessing the degree of similarity between two texts in terms of meaning. Several approaches have been proposed in the literature to determine the semantic similarity between texts. The most promising work recently presented in the literature was supervised approaches. Unsupervised STS approaches are characterized by the fact that they do not require learning data, but they still suffer from some limitations. Word alignment has been widely used in the state-of-the-art approaches. From this point, this paper has three contributions. First, a new synset-oriented word aligner is presented, which relies on a huge multilingual semantic network named BabelNet. Second, three unsupervised STS approaches are proposed: string kernel-based (SK), alignment-based (AL), and weighted alignment-based (WAL). Third, some limitations of the state-of-the-art approaches are tackled, and different similarity methods are demonstrated to be complementary with each other by proposing an unsupervised ensemble STS (UESTS) approach. The UESTS incorporates the merits of four similarity measures: proposed alignment-based, surface-based, corpus-based, and enhanced edit distance. The experimental results proved that the participation of the proposed aligner in STS is effective. Over all the evaluation data sets, the proposed UESTS outperforms the state-of-the-art unsupervised approaches, which is a promising result.

16 citations

01 Jan 2013
TL;DR: In this paper, a new algorithm was proposed to compute an approximation to the median of a set of strings, which is obtained through the successive improvements of a partial solution, thus accounting for the frequency of each of the edit operations in all the positions of the approximate median.
Abstract: This paper presents a new algorithm that can be used to compute an approximation to the median of a set of strings. The approximate median is obtained through the successive improvements of a partial solution. The edit distance from the partial solution to all the strings in the set is computed in each iteration, thus accounting for the frequency of each of the edit operations in all the positions of the approximate median. A goodness index for edit operations is later computed by multiplying their frequency by the cost. Each operation is tested, starting from that with the highest index, in order to verify whether applying it to the partial solution leads to an improvement. If successful, a new iteration begins from the new approximate median. The algorithm finishes when all the operations have been examined without a better solution being found. Comparative experiments involving Freeman chain codes encoding 2D shapes and the Copenhagen chromosome database show that the quality of the approximate median string is similar to benchmark approaches but achieves a much faster convergence.

16 citations

Proceedings Article
01 Jan 1996
TL;DR: In CLARIT TREC-5 confusion track experiments, they explored two techniques for improving retrieval performance over corrupted data : (1) OCR word error correction to improve OCR text accuracy, and (2) query expansion by adding query term variants found in the corrupted text.
Abstract: In CLARIT TREC-5 confusion track experiments, they explored two techniques for improving retrieval performance over corrupted data : (1) OCR word error correction to improve OCR text accuracy, and (2) query expansion by adding query term variants found in the corrupted text. The OCR word correction technique is based on statistical word bigram modeling (Tong & Evans 1996). The variants of a query term are terms similar to the query term, as measured by the edit distance (Wagner 1974). While the official runs were based on the first approach, in the follow-up experiments they tested the second approach as well. In this report, they give a brief description of the OCR correction and query expansion techniques, and then discuss the results of the experiments

16 citations

Journal ArticleDOI
TL;DR: This paper presents a new method of human action recognition, which is based on ℜ transform and template matching after the key frame is extracted from a cycle, and utilizes a novel string matching scheme based on edit distance to analyze different human actions.

16 citations

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: This work considers the problem of finding the longest common subsequence of two strings, and develops significantly faster algorithms for a special class of strings which emerge frequently in pattern matching problems.
Abstract: Measuring the similarity between two strings, through such standard measures as Hamming distance, edit distance, and longest common subsequence, is one of the fundamental problems in pattern matching. We consider the problem of finding the longest common subsequence of two strings. A well-known dynamic programming algorithm computes the longest common subsequence of strings X and Y in O(|X|/spl middot/|Y|) time. We develop significantly faster algorithms for a special class of strings which emerge frequently in pattern matching problems. A string S is run-length encoded if it is described as an ordered sequence of pairs (/spl sigma/,i), each consisting of an alphabet symbol /spl sigma/ and an integer i. Each pair corresponds to a run in S consisting of i consecutive occurrences of /spl sigma/. For example, the string aaaabbbbcccabbbbcc can be encoded as a/sup 4/b/sup 4/c/sup 3/a/sup 1/b/sup 4/c/sup 2/. Such a run-length encoded string can be significantly shorter than the expanded string representation. Indeed, runlength coding serves as a popular image compression technique, since many classes of images, such as binary images in facsimile transmission, typically contain large patches of identically-valued pixels.

16 citations


Network Information
Related Topics (5)
Graph (abstract data type)
69.9K papers, 1.2M citations
86% related
Unsupervised learning
22.7K papers, 1M citations
81% related
Feature vector
48.8K papers, 954.4K citations
81% related
Cluster analysis
146.5K papers, 2.9M citations
81% related
Scalability
50.9K papers, 931.6K citations
80% related
Performance
Metrics
No. of papers in the topic in previous years
YearPapers
202339
202296
2021111
2020149
2019145
2018139