Topic
Edit distance
About: Edit distance is a research topic. Over its lifetime, 2887 publications have appeared within this topic, receiving 71491 citations.
Papers published on a yearly basis
Papers
TL;DR: The results suggest that the proposed method can efficiently compute the edit distance for moderate-size unordered trees, with accuracy comparable to that of the edit distance for ordered trees and of an existing method for glycan search.
Abstract: Measuring similarities between tree-structured data is important for the analysis of RNA secondary structures, phylogenetic trees, glycan structures, and vascular trees. The edit distance is one of the most widely used measures for comparing tree-structured data. However, computing the edit distance for rooted unordered trees is known to be NP-hard, and almost no software tools are available that compute the exact edit distance for unordered trees. In this paper, we present a practical method for computing the edit distance between rooted unordered trees. In this method, the edit distance problem for unordered trees is transformed into the maximum clique problem, and efficient solvers for the maximum clique problem are then applied. We applied the proposed method to similar-structure search for glycan structures. The results suggest that the proposed method can efficiently compute the edit distance for moderate-size unordered trees, and that its accuracy is comparable to that of the edit distance for ordered trees and of an existing method for glycan search. The proposed method is simple but useful for computing the edit distance between unordered trees. The object code is available upon request.
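The transformation the abstract describes can be sketched in miniature: create one clique-graph vertex per candidate node pair, connect two vertices when the two mappings are mutually consistent (no node mapped twice, ancestor relations preserved), and read the unit-cost edit distance off a maximum-weight clique. Below is a minimal pure-Python sketch with a brute-force clique search; all names and the unit-cost model are illustrative, not the paper's actual code or solver:

```python
def flatten(tree, parent=None, out=None):
    """List the nodes of a (label, children) tree as (id, label, ancestor_id_set)."""
    if out is None:
        out = []
    label, children = tree
    anc = set() if parent is None else out[parent][2] | {parent}
    nid = len(out)
    out.append((nid, label, anc))
    for child in children:
        flatten(child, nid, out)
    return out

def tree_edit_distance(t1, t2):
    """Unit-cost edit distance between rooted unordered trees,
    via a maximum-weight clique over candidate node mappings."""
    a, b = flatten(t1), flatten(t2)
    # one vertex per candidate pair; its weight is the delete+insert cost
    # the pair saves (2), minus 1 if mapping it requires a relabel
    verts = [(i, j, 2 - (la != lb)) for i, la, _ in a for j, lb, _ in b]

    def compatible(p, q):
        i1, j1, _ = p
        i2, j2, _ = q
        if i1 == i2 or j1 == j2:
            return False  # each node may be mapped at most once
        # ancestor relations must be preserved in both directions
        return ((i1 in a[i2][2]) == (j1 in b[j2][2]) and
                (i2 in a[i1][2]) == (j2 in b[j1][2]))

    best = 0

    def grow(weight, cand):
        nonlocal best
        best = max(best, weight)
        for k, v in enumerate(cand):
            rest = [u for u in cand[k + 1:] if compatible(v, u)]
            # branch only if the remaining weight could still beat the best
            if weight + v[2] + sum(u[2] for u in rest) > best:
                grow(weight + v[2], rest)

    grow(0, verts)
    return len(a) + len(b) - best
```

For example, relabeling one leaf of `('a', [('b', []), ('c', [])])` to `('a', [('b', []), ('d', [])])` yields distance 1, and deleting both leaves yields distance 2. The brute-force clique search is exponential, which is why the paper delegates this step to dedicated maximum clique solvers.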
24 citations
TL;DR: It is shown that the edit distance between trees is at least half of the edit distance between their Euler strings and at most 2h+1 times the edit distance between the Euler strings, where h is the minimum height of the two trees.
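The Euler string of a rooted tree lists node labels along a depth-first traversal, once on entering each node and once (marked) on leaving it, so the stated bounds relate tree edit distance to ordinary string edit distance. A small illustration (marking the "leaving" copy by upper-casing is an arbitrary choice):

```python
def euler_string(tree):
    """Labels along a DFS traversal of a (label, children) tree:
    one copy on entering a node, a marked (upper-case) copy on leaving."""
    label, children = tree
    out = [label]
    for child in children:
        out += euler_string(child)
    out.append(label.upper())
    return out

def levenshtein(s, t):
    """Classic unit-cost string edit distance (rolling-row DP)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[-1] + 1,              # insert b
                           prev[j - 1] + (a != b)))  # match / substitute
        prev = cur
    return prev[-1]

t1 = ('a', [('b', []), ('c', [])])
t2 = ('a', [('b', [])])
d = levenshtein(euler_string(t1), euler_string(t2))
```

Here `euler_string(t1)` is `['a','b','B','c','C','A']` and `d` is 2, while the tree edit distance (delete node c) is 1, consistent with the lower bound d/2.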
24 citations
12 Aug 2007
TL;DR: This paper explores the use of edit distance measures to construct a canonical representation that is "central" in the sense that it is most similar to each of the disparate records, which reduces the impact of noisy records on the canonical representation.
Abstract: It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from online papers. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g. abbreviations, aliases, and misspellings). Therefore, it can be difficult to construct a single, standard representation to present to the user. We refer to the task of constructing this representation as canonicalization. Despite its importance, there is little existing work on canonicalization. In this paper, we explore the use of edit distance measures to construct a canonical representation that is "central" in the sense that it is most similar to each of the disparate records. This approach reduces the impact of noisy records on the canonical representation. Furthermore, because the user may prefer different styles of canonicalization, we show how different edit distance costs can result in different forms of canonicalization. For example, reducing the cost of character deletions can result in representations that favor abbreviated forms over expanded forms (e.g. KDD versus Conference on Knowledge Discovery and Data Mining). We describe how to learn these costs from a small amount of manually annotated data using stochastic hill-climbing. Additionally, we investigate feature-based methods to learn ranking preferences over canonicalizations. These approaches can incorporate arbitrary textual evidence to select a canonical record. We evaluate our approach on a real-world publications database and show that our learning method results in a canonicalization solution that is robust to errors and easily customizable to user preferences.
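The cost-sensitivity described above can be illustrated with a weighted Levenshtein distance plus a "star center" choice of canonical record: cheapening deletions makes short, abbreviated candidates cheap to reach from long records, so they tend to be selected. A hedged sketch, where the cost values and function names are illustrative and not the paper's learned costs:

```python
def weighted_edit(s, t, ins=1.0, dele=1.0, sub=1.0):
    """Edit distance from s to t with separate insert/delete/substitute costs."""
    prev = [j * ins for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [i * dele]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + dele,                     # delete a from s
                           cur[-1] + ins,                      # insert b
                           prev[j - 1] + (0 if a == b else sub)))
        prev = cur
    return prev[-1]

def canonical(records, **costs):
    """Pick the record with the smallest summed edit distance from all records."""
    return min(records,
               key=lambda r: sum(weighted_edit(o, r, **costs) for o in records))
```

With `dele=0.1`, collapsing `["KDD", "KDD Conference", "Conference on Knowledge Discovery and Data Mining"]` selects "KDD", since the long records can delete their way down to it cheaply; with symmetric unit costs the balance shifts toward longer centers.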
24 citations
05 Nov 2015
TL;DR: A method is proposed that combines techniques based on edit distance and frequency counts with a contextual similarity-based method for detecting and correcting orthographic errors, including misspellings, word breaks, and punctuation errors.
Abstract: Orthographic and grammatical errors are a common feature of informal texts written by lay people. Health-related questions asked by consumers are a case in point. Automatic interpretation of consumer health questions is hampered by such errors. In this paper, we propose a method that combines techniques based on edit distance and frequency counts with a contextual similarity-based method for detecting and correcting orthographic errors, including misspellings, word breaks, and punctuation errors. We evaluate our method on a set of spell-corrected questions extracted from the NLM collection of consumer health questions. Our method achieves an F1 score of 0.61, compared to an informed baseline of 0.29 achieved using ESpell, a spelling correction system developed for biomedical queries. Our results show that orthographic similarity is most relevant for spelling error correction in consumer health questions, and that frequency and contextual information are complementary to orthographic features.
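Combining edit distance with frequency counts can be sketched with a classic candidate-generation corrector: enumerate every string within edit distance 1 of the misspelling and keep the known word with the highest corpus frequency. This is a generic baseline in the spirit of the paper, not its actual system (which additionally uses contextual similarity):

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings at edit distance 1 from word (with transpositions)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = {L + R[1:]               for L, R in splits if R}
    inserts    = {L + c + R               for L, R in splits for c in ALPHABET}
    replaces   = {L + c + R[1:]           for L, R in splits if R for c in ALPHABET}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    return deletes | inserts | replaces | transposes

def correct(word, freq):
    """Return word if known, else the most frequent known word at distance 1."""
    if word in freq:
        return word
    candidates = [w for w in edits1(word) if w in freq]
    return max(candidates, key=freq.get, default=word)
```

For instance, with `freq = Counter({"headache": 50, "head": 30, "ache": 10})`, the misspelling "headach" is corrected to "headache" (one insertion), while known words pass through unchanged.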
24 citations
01 Jan 2005
TL;DR: A novel kernel-based method for relation extraction between named entities from Chinese texts requires less manual effort in feature transformation and achieves better performance than traditional feature-based learning methods.
Abstract: In this paper, a novel kernel-based method is presented for the problem of relation extraction between named entities from Chinese texts. The kernel is defined over the original Chinese string representations around particular entities. As the kernel function, the Improved-Edit-Distance (IED) is used to calculate the similarity between two Chinese strings. By employing the Voted Perceptron and Support Vector Machine (SVM) kernel machines with the IED kernel as the classifiers, we tested the method by extracting the person-affiliation relation from Chinese texts. Comparing with traditional feature-based learning methods, we conclude that our method requires less manual effort in feature transformation and achieves better performance.
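The IED kernel itself is not specified in this abstract, but the general pattern of an edit-distance-based string kernel can be sketched: turn a distance into a similarity, e.g. exp(-γ·d(s, t)), and let a kernel machine combine similarities to labeled training strings. The following toy stand-in uses plain Levenshtein distance and a simple weighted vote; the exponential form and the voting rule are assumptions, not the paper's IED:

```python
import math

def levenshtein(s, t):
    """Classic unit-cost string edit distance (rolling-row DP)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[-1] + 1,              # insert b
                           prev[j - 1] + (a != b)))  # match / substitute
        prev = cur
    return prev[-1]

def edit_kernel(s, t, gamma=0.5):
    """Similarity in (0, 1]: 1 for identical strings, decaying with distance."""
    return math.exp(-gamma * levenshtein(s, t))

def kernel_predict(x, train):
    """Weighted vote over labeled strings, the way a kernel machine
    combines kernel values with +1/-1 labels."""
    score = sum(y * edit_kernel(x, s) for s, y in train)
    return 1 if score >= 0 else -1
```

Note that similarities of the form exp(-γ·d) built on edit distance are not guaranteed to be positive semi-definite kernels; in practice they are often used anyway with kernel machines that tolerate indefinite kernels, such as the Voted Perceptron mentioned above.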
24 citations