scispace - formally typeset
Search or ask a question
Topic

Edit distance

About: Edit distance is a research topic. Over the lifetime, 2887 publications have been published within this topic receiving 71491 citations.


Papers
More filters
Journal ArticleDOI
TL;DR: The result suggests that the proposed method can efficiently compute the edit distance for moderate size unordered trees and has the accuracy comparative to those by the edit Distance for ordered trees and by an existing method for glycan search.
Abstract: Measuring similarities between tree structured data is important for analysis of RNA secondary structures, phylogenetic trees, glycan structures, and vascular trees. The edit distance is one of the most widely used measures for comparison of tree structured data. However, it is known that computation of the edit distance for rooted unordered trees is NP-hard. Furthermore, there is almost no available software tool that can compute the exact edit distance for unordered trees. In this paper, we present a practical method for computing the edit distance between rooted unordered trees. In this method, the edit distance problem for unordered trees is transformed into the maximum clique problem and then efficient solvers for the maximum clique problem are applied. We applied the proposed method to similar structure search for glycan structures. The result suggests that our proposed method can efficiently compute the edit distance for moderate size unordered trees. It also suggests that the proposed method has the accuracy comparative to those by the edit distance for ordered trees and by an existing method for glycan search. The proposed method is simple but useful for computation of the edit distance between unordered trees. The object code is available upon request.

24 citations

Journal ArticleDOI
Tatsuya Akutsu1
TL;DR: It is shown that the editdistance between trees is at least half of the edit distance between the Euler strings and is at most 2h+1 times the edit Distance between theEuler strings, where h is the minimum height of two trees.

24 citations

Proceedings ArticleDOI
12 Aug 2007
TL;DR: This paper explores the use of edit distance measures to construct a canonical representation that is "central" in the sense that it is most similar to each of the disparate records, and reduces the impact of noisy records on the canonical representation.
Abstract: It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from online papers. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g. abbreviations, aliases, and misspellings). Therefore, it can be difficult to construct a single, standard representation to present to the user. We refer to the task of constructing this representation as canonicalization. Despite its importance, there is little existing work on canonicalization.In this paper, we explore the use of edit distance measures to construct a canonical representation that is "central" in the sense that it is most similar to each of the disparate records. This approach reduces the impact of noisy records on the canonical representation. Furthermore, because the user may prefer different styles of canonicalization, we show how different edit distance costs can result in different forms of canonicalization. For example, reducing the cost of character deletions can result in representations that favor abbreviated forms over expanded forms (e.g. KDD versus Conference on Knowledge Discovery and Data Mining). We describe how to learn these costs from a small amount of manually annotated data using stochastic hill-climbing. Additionally, we investigate feature-based methods to learn ranking preferences over canonicalizations. These approaches can incorporate arbitrary textual evidence to select a canonical record. We evaluate our approach on a real-world publications database and show that our learning method results in a canonicalization solution that is robust to errors and easily customizable to user preferences.

24 citations

Proceedings Article
05 Nov 2015
TL;DR: A method that combines techniques based on edit distance and frequency counts with a contextual similarity-based method for detecting and correcting orthographic errors, including misspellings, word breaks, and punctuation errors is proposed.
Abstract: Orthographic and grammatical errors are a common feature of informal texts written by lay people. Health-related questions asked by consumers are a case in point. Automatic interpretation of consumer health questions is hampered by such errors. In this paper, we propose a method that combines techniques based on edit distance and frequency counts with a contextual similarity-based method for detecting and correcting orthographic errors, including misspellings, word breaks, and punctuation errors. We evaluate our method on a set of spell-corrected questions extracted from the NLM collection of consumer health questions. Our method achieves a F1 score of 0.61, compared to an informed baseline of 0.29, achieved using ESpell, a spelling correction system developed for biomedical queries. Our results show that orthographic similarity is most relevant in spelling error correction in consumer health questions and that frequency and contextual information are complementary to orthographic features.

24 citations

01 Jan 2005
TL;DR: A novel kernel-based method for relation extraction between named entities from Chinese texts that needs less manual efforts in feature transformation and achieves a better performance than traditional feature-based learning methods.
Abstract: In this paper, a novel kernel-based method is presented for the problem of relation extraction between named entities from Chinese texts The kernel is defined over the original Chinese string representations around particular entities As a kernel function, the Improved-Edit-Distance (IED) is used to calculate the similarity between two Chinese strings By employing the Voted Perceptron and Support Vector Machine (SVM) kernel machines with the IED kernel as the classifiers, we tested the method by extracting person-affiliation relation from Chinese texts By comparing with traditional feature-based learning methods, we conclude that our method needs less manual efforts in feature transformation and achieves a better performance

24 citations


Network Information
Related Topics (5)
Graph (abstract data type)
69.9K papers, 1.2M citations
86% related
Unsupervised learning
22.7K papers, 1M citations
81% related
Feature vector
48.8K papers, 954.4K citations
81% related
Cluster analysis
146.5K papers, 2.9M citations
81% related
Scalability
50.9K papers, 931.6K citations
80% related
Performance
Metrics
No. of papers in the topic in previous years
YearPapers
202339
202296
2021111
2020149
2019145
2018139