
Showing papers on "Edit distance published in 2004"


Proceedings ArticleDOI
23 Aug 2004
TL;DR: Investigation of unsupervised techniques for acquiring monolingual sentence-level paraphrases from a corpus of temporally and topically clustered news articles collected from thousands of web-based news sources shows that edit distance data is cleaner and more easily-aligned than the heuristic data.
Abstract: We investigate unsupervised techniques for acquiring monolingual sentence-level paraphrases from a corpus of temporally and topically clustered news articles collected from thousands of web-based news sources. Two techniques are employed: (1) simple string edit distance, and (2) a heuristic strategy that pairs initial (presumably summary) sentences from different news stories in the same cluster. We evaluate both datasets using a word alignment algorithm and a metric borrowed from machine translation. Results show that edit distance data is cleaner and more easily-aligned than the heuristic data, with an overall alignment error rate (AER) of 11.58% on a similarly-extracted test set. On test data extracted by the heuristic strategy, however, performance of the two training sets is similar, with AERs of 13.2% and 14.7% respectively. Analysis of 100 pairs of sentences from each set reveals that the edit distance data lacks many of the complex lexical and syntactic alternations that characterize monolingual paraphrase. The summary sentences, while less readily alignable, retain more of the non-trivial alternations that are of greatest interest for learning paraphrase relationships.

895 citations


Book ChapterDOI
31 Aug 2004
TL;DR: A new distance function, ERP ("Edit distance with Real Penalty"), which is a marriage of the L1-norm and the edit distance: ERP can support local time shifting, is a metric, and, combined with the proposed pruning strategies, dominates all existing strategies.
Abstract: Existing studies on time series are based on two categories of distance functions. The first category consists of the Lp-norms. They are metric distance functions but cannot support local time shifting. The second category consists of distance functions which are capable of handling local time shifting but are nonmetric. The first contribution of this paper is the proposal of a new distance function, which we call ERP ("Edit distance with Real Penalty"). Representing a marriage of the L1-norm and the edit distance, ERP can support local time shifting, and is a metric. The second contribution of the paper is the development of pruning strategies for large time series databases. Given that ERP is a metric, one way to prune is to apply the triangle inequality. Another way to prune is to develop a lower bound on the ERP distance. We propose such a lower bound, which has the nice computational property that it can be efficiently indexed with a standard B+-tree. Moreover, we show that these two ways of pruning can be used simultaneously for ERP distances. Specifically, the false positives obtained from the B+-tree can be further minimized by applying the triangle inequality. Based on extensive experimentation with existing benchmarks and techniques, we show that this combination delivers superb pruning power and search time performance, and dominates all existing strategies.

790 citations
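
The ERP recurrence described in the entry above is short enough to sketch directly. The following is a minimal Python sketch, assuming real-valued series and a constant gap value g (taken as 0 here; the paper argues that a constant gap is what keeps ERP a metric). It is illustrative only and omits the lower-bounding and B+-tree indexing machinery.

```python
def erp(r, s, g=0.0):
    """O(len(r) * len(s)) dynamic program for the ERP distance."""
    m, n = len(r), len(s)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):            # r aligned against an empty series
        dp[i][0] = dp[i - 1][0] + abs(r[i - 1] - g)
    for j in range(1, n + 1):            # s aligned against an empty series
        dp[0][j] = dp[0][j - 1] + abs(s[j - 1] - g)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + abs(r[i - 1] - s[j - 1]),  # match the two elements
                dp[i - 1][j] + abs(r[i - 1] - g),             # gap in s (real penalty)
                dp[i][j - 1] + abs(s[j - 1] - g),             # gap in r (real penalty)
            )
    return dp[m][n]

print(erp([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))  # 2.0: the unmatched element costs |2 - g|
```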


Proceedings ArticleDOI
17 Oct 2004
TL;DR: Algorithms are developed that solve gap versions of the edit distance problem: given two strings of length n with the promise that their edit distance is either at most k or greater than /spl lscr/, decide which of the two holds and develop an n/sup 3/7/-approximation quasilinear time algorithm.
Abstract: Edit distance has been extensively studied for the past several years. Nevertheless, no linear-time algorithm is known to compute the edit distance between two strings, or even to approximate it to within a modest factor. Furthermore, for various natural algorithmic problems such as low-distortion embeddings into normed spaces, approximate nearest-neighbor schemes, and sketching algorithms, known results for the edit distance are rather weak. We develop algorithms that solve gap versions of the edit distance problem: given two strings of length n with the promise that their edit distance is either at most k or greater than ℓ, decide which of the two holds. We present two sketching algorithms for gap versions of edit distance. Our first algorithm solves the k vs. (kn)^{2/3} gap problem, using a constant size sketch. A more involved algorithm solves the stronger k vs. ℓ gap problem, where ℓ can be as small as O(k^2) - still with a constant sketch - but works only for strings that are mildly "nonrepetitive". Finally, we develop an n^{3/7}-approximation quasilinear time algorithm for edit distance, improving the previous best factor of n^{3/4} (Cole and Hariharan, 2002); if the input strings are assumed to be nonrepetitive, then the approximation factor can be strengthened to n^{1/3}.

161 citations
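
The sketching and approximation results above do not reduce to a few lines, but the underlying gap question has a classical exact counterpart: deciding whether the edit distance is at most k can be done in O(nk) time with a banded dynamic program (often attributed to Ukkonen). The sketch below is that baseline, not the paper's algorithms; the helper name within_k is ours.

```python
def within_k(x, y, k):
    """Exact decision: is edit_distance(x, y) <= k?  Runs in O(len(x) * k) time."""
    n, m = len(x), len(y)
    if abs(n - m) > k:
        return False
    INF = k + 1                                   # any value above k can be capped
    prev = {j: j for j in range(min(m, k) + 1)}   # DP row 0, restricted to the band
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                best = i                                                    # delete x[:i]
            else:
                best = min(prev.get(j - 1, INF) + (x[i - 1] != y[j - 1]),   # substitute/match
                           cur.get(j - 1, INF) + 1)                         # insert y[j-1]
            best = min(best, prev.get(j, INF) + 1)                          # delete x[i-1]
            cur[j] = min(best, INF)
        if min(cur.values()) > k:     # every cell in the band exceeds k: give up early
            return False
        prev = cur
    return prev.get(m, INF) <= k

print(within_k("kitten", "sitting", 3))  # True: the edit distance is exactly 3
print(within_k("kitten", "sitting", 2))  # False
```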


Book ChapterDOI
20 Oct 2004
TL;DR: An algorithm is presented that attempts to select the best choice among all possible corrections for a misspelled term, and its implementation based on a ternary search tree data structure is discussed.
Abstract: Search engines have become the primary means of accessing information on the Web. However, recent studies show misspelled words are very common in queries to these systems. When users misspell a query, the results are incorrect or provide inconclusive information. In this work, we discuss the integration of a spelling correction component into tumba!, our community Web search engine. We present an algorithm that attempts to select the best choice among all possible corrections for a misspelled term, and discuss its implementation based on a ternary search tree data structure.

84 citations
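
As a rough illustration of the data structure mentioned, here is a compact ternary search tree with insertion and exact lookup; the traversal that enumerates corrections within a bounded edit distance, and tumba!'s ranking of candidate corrections, are omitted.

```python
class TSTNode:
    __slots__ = ("ch", "lo", "eq", "hi", "is_word")
    def __init__(self, ch):
        self.ch, self.lo, self.eq, self.hi, self.is_word = ch, None, None, None, False

class TernarySearchTree:
    """Each node holds one character and three children: smaller, equal, greater."""
    def __init__(self):
        self.root = None

    def insert(self, word):
        self.root = self._insert(self.root, word, 0)

    def _insert(self, node, word, i):
        ch = word[i]
        if node is None:
            node = TSTNode(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1)
        else:
            node.is_word = True
        return node

    def contains(self, word):
        node, i = self.root, 0
        while node is not None:
            if word[i] < node.ch:
                node = node.lo
            elif word[i] > node.ch:
                node = node.hi
            else:
                if i + 1 == len(word):
                    return node.is_word
                node, i = node.eq, i + 1
        return False

tst = TernarySearchTree()
for w in ("search", "sear", "starch"):
    tst.insert(w)
print(tst.contains("search"), tst.contains("serch"))  # True False
```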


Book ChapterDOI
14 Mar 2004
TL;DR: In this article, a set of new filter methods for structural and for content-based information in tree-structured data as well as ways to flexibly combine different filter criteria are presented.
Abstract: Structured and semi-structured object representations are getting more and more important for modern database applications. Examples for such data are hierarchical structures including chemical compounds, XML data or image data. As a key feature, database systems have to support the search for similar objects where it is important to take into account both the structure and the content features of the objects. A successful approach is to use the edit distance for tree structured data. As the computation of this measure is NP-complete, constrained edit distances have been successfully applied to trees. While yielding good results, they are still computationally complex and, therefore, of limited benefit for searching in large databases. In this paper, we propose a filter and refinement architecture to overcome this problem. We present a set of new filter methods for structural and for content-based information in tree-structured data as well as ways to flexibly combine different filter criteria. The efficiency of our methods, resulting from the good selectivity of the filters, is demonstrated in extensive experiments with real-world applications.

84 citations
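
As one hedged example of the filter-and-refinement idea (not necessarily one of the paper's filters): with unit edit costs, an insertion or deletion changes the multiset of node labels by one element and a relabeling by at most two, so half of the L1 distance between label histograms lower-bounds the tree edit distance. Candidate trees whose bound already exceeds the query threshold can be pruned before the expensive refinement step.

```python
from collections import Counter
from math import ceil

def label_histogram_lower_bound(labels_a, labels_b):
    """labels_a, labels_b: iterables of node labels of the two trees."""
    ha, hb = Counter(labels_a), Counter(labels_b)
    l1 = sum(abs(ha[k] - hb[k]) for k in set(ha) | set(hb))
    # each unit-cost edit operation changes the histogram by at most 2 elements
    return ceil(l1 / 2)

# A query tree with labels {a, a, b, c} against a candidate with labels {a, d, d}:
print(label_histogram_lower_bound("aabc", "add"))  # 3 -> prune whenever the threshold is below 3
```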


Book ChapterDOI
TL;DR: An efficient algorithm is proposed for edit distance computation of planar graphs: given graphs embedded in the plane, small subgraphs are iteratively matched by locally optimizing structural correspondences, which yields a valid edit path and hence an upper bound on the edit distance.
Abstract: Graph edit distance is a powerful error-tolerant similarity measure for graphs. For pattern recognition problems involving large graphs, however, the high computational complexity makes it sometimes impossible to apply edit distance algorithms. In the present paper we propose an efficient algorithm for edit distance computation of planar graphs. Given graphs embedded in the plane, we iteratively match small subgraphs by locally optimizing structural correspondences. Eventually we obtain a valid edit path and hence an upper bound of the edit distance. To demonstrate the efficiency of our approach, we apply the proposed algorithm to the problem of fingerprint classification.

83 citations


Journal ArticleDOI
TL;DR: Improvements to previously published methods for similarity searching with reduced graphs are described, with a particular focus on ligand-based virtual screening, and a novel use of reduced graphs in the clustering of high-throughput screening data is described.
Abstract: Virtual screening and high-throughput screening are two major components of lead discovery within the pharmaceutical industry. In this paper we describe improvements to previously published methods for similarity searching with reduced graphs, with a particular focus on ligand-based virtual screening, and describe a novel use of reduced graphs in the clustering of high-throughput screening data. Literature methods for reduced graph similarity searching encode the reduced graphs as binary fingerprints, which has a number of issues. In this paper we extend the definition of the reduced graph to include positively and negatively ionizable groups and introduce a new method for measuring the similarity of reduced graphs based on a weighted edit distance. Moving beyond simple similarity searching, we show how more flexible queries can be built using reduced graphs and describe a database system that allows iterative querying with multiple representations. Reduced graphs capture many important features of ligand...

77 citations


Journal ArticleDOI
01 Mar 2004
TL;DR: Similarity-based variants of grouping and join operators are presented: the extended grouping operator produces groups of similar tuples, while the extended join combines tuples satisfying a given similarity condition.
Abstract: Dealing with discrepancies in data is still a big challenge in data integration systems. The problem occurs both when eliminating duplicates from semantically overlapping sources and when combining complementary data from different sources. Though using SQL operations like grouping and join seems to be a viable way, they fail if the attribute values of the potential duplicates or related tuples are not equal but only similar by certain criteria. As a solution to this problem, we present in this paper similarity-based variants of grouping and join operators. The extended grouping operator produces groups of similar tuples, the extended join combines tuples satisfying a given similarity condition. We describe the semantics of these operators, discuss efficient implementations for the edit distance similarity and present evaluation results. Finally, we give examples of application from the context of a data reconciliation project for looted art.

72 citations
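
A toy sketch of the similarity-based grouping idea (not the paper's implementation, and the artwork titles are invented): values within k edits of one another fall into the same group, with the transitive closure computed by union-find and a cheap length filter applied before each quadratic edit distance computation.

```python
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity_grouping(values, k):
    parent = list(range(len(values)))
    def find(x):                         # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            a, b = values[i], values[j]
            # cheap length filter before the O(len(a) * len(b)) edit distance
            if abs(len(a) - len(b)) <= k and edit_distance(a, b) <= k:
                parent[find(i)] = find(j)
    groups = {}
    for i, v in enumerate(values):
        groups.setdefault(find(i), []).append(v)
    return list(groups.values())

titles = ["Portrait of a Lady", "Portait of a Lady", "Still Life with Fruit"]
print(similarity_grouping(titles, k=2))
# [['Portrait of a Lady', 'Portait of a Lady'], ['Still Life with Fruit']]
```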


Proceedings Article
01 Jan 2004
TL;DR: A rolling parallel printer in which a pressure element is driven through a swiveling motion each printing cycle and a pressure segment thereof rolls off a line of type.
Abstract: A rolling parallel printer in which a pressure element is driven through a swiveling motion each printing cycle and a pressure segment thereof rolls off a line of type. The pressure element is connected to a mechanical linkage which minimizes the sweep of travel of the pressure element, while maintaining the pressure element sufficiently far from the type in a rest position to facilitate reading of the printed matter.

68 citations


Journal ArticleDOI
TL;DR: Approximate string comparators increase deterministic linkage sensitivity by up to 10% compared to exact match comparisons and represent an accurate method of linking to vital statistics data.
Abstract: Medical record linkage is becoming increasingly important as clinical data is distributed across independent sources. To improve linkage accuracy we studied different name comparison methods that establish agreement or disagreement between corresponding names. In addition to exact raw name matching and exact phonetic name matching, we tested three approximate string comparators. The approximate comparators included the modified Jaro-Winkler method, the longest common substring, and the Levenshtein edit distance. We also calculated the combined root-mean square of all three. We tested each name comparison method using a deterministic record linkage algorithm. Results were consistent across both hospitals. At a threshold comparator score of 0.8, the Jaro-Winkler comparator achieved the highest linkage sensitivities of 97.4% and 97.7%. The combined root-mean square method achieved sensitivities higher than the Levenshtein edit distance or longest common substring while sustaining high linkage specificity. Approximate string comparators increase deterministic linkage sensitivity by up to 10% compared to exact match comparisons and represent an accurate method of linking to vital statistics data.

64 citations
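
A rough sketch of two of the comparators and the combined root-mean-square score from the study above; the Jaro-Winkler comparator is omitted for brevity, and scaling each raw value by the longer name's length is an assumption rather than the study's exact normalization.

```python
from math import sqrt

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def longest_common_substring(a, b):
    best, prev = 0, [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else 0)   # extend a run of equal characters
        best, prev = max(best, max(cur)), cur
    return best

def edit_sim(a, b):
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

def lcs_sim(a, b):
    return longest_common_substring(a, b) / max(len(a), len(b), 1)

def combined_rms(a, b):
    scores = (edit_sim(a, b), lcs_sim(a, b))
    return sqrt(sum(s * s for s in scores) / len(scores))

print(edit_sim("JOHNSON", "JONSON"), combined_rms("JOHNSON", "JONSON"))  # ~0.86, ~0.73
```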


Proceedings ArticleDOI
18 Jun 2004
TL;DR: This paper presents a domain-independent algorithm that effectively identifies duplicates in an XML document; it adopts a top-down traversal of the XML tree structure to identify duplicate elements on each level.
Abstract: The problem of detecting duplicate entities that describe the same real-world object (and purging them) is an important data cleansing task, necessary to improve data quality. For data stored in a flat relation, numerous solutions to this problem exist. As XML becomes increasingly popular for data representation, algorithms to detect duplicates in nested XML documents are required. In this paper, we present a domain-independent algorithm that effectively identifies duplicates in an XML document. The solution adopts a top-down traversal of the XML tree structure to identify duplicate elements on each level. Pairs of duplicate elements are detected using a thresholded similarity function, and are then clustered by computing the transitive closure. To minimize the number of pairwise element comparisons, an appropriate filter function is used. The similarity measure involves string similarity for pairs of strings, which is measured using their edit distance. To increase efficiency, we avoid the computation of edit distance for pairs of strings by successively applying three filtering methods. First experiments show that our approach detects XML duplicates accurately and efficiently.

Journal ArticleDOI
TL;DR: Inspired by the success of the Markov Random Field theory, a new edit distance called Markov edit distance (MED) within the dynamic programming framework is proposed to take full advantage of the local statistical dependencies in the pattern in order to arrive at enhanced matching performance.
Abstract: Edit distance was originally developed by Levenshtein several decades ago to measure the distance between two strings. It was found that this distance can be computed by an elegant dynamic programming procedure. The edit distance has played important roles in a wide array of applications due to its representational efficacy and computational efficiency. To effect a more reasonable distance measure, the normalized edit distance was proposed. Many algorithms and studies have been dedicated along this line with impressive performances in recent years. There is, however, a fundamental problem with the original definition of edit distance that has remained elusive: its context-free nature. In determining the possible actions, i.e., insertion, deletion, and substitution, no systematic consideration was given to the local behaviors of the string/pattern in question that indeed encompass a great amount of useful information regarding its content. In this paper, inspired by the success of the Markov Random Field theory, a new edit distance called Markov edit distance (MED) within the dynamic programming framework is proposed to take full advantage of the local statistical dependencies in the pattern in order to arrive at enhanced matching performance. Within this framework, two specialized distance measures are developed: the reshuffling MED to handle cases where a subpattern in the target pattern is a reshuffle of that in the source pattern, and the coherence MED, which is able to incur local content based substitution, insertion, and deletion. The applications based on these two MEDs in string matching are then explored, where encouraging empirical results have been observed.
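
The normalized edit distance the abstract alludes to divides the cost of an edit path by the path's length (in the style of Marzal and Vidal) rather than by the string lengths. The sketch below is a brute-force dynamic program over path lengths with unit costs, shown only to make that baseline concrete; the Markov edit distance itself is not reproduced here.

```python
def normalized_edit_distance(a, b):
    """Minimum over all edit paths of (path cost) / (path length), with unit costs."""
    m, n = len(a), len(b)
    if m == 0 and n == 0:
        return 0.0
    INF = float("inf")
    prev = [[INF] * (n + 1) for _ in range(m + 1)]
    prev[0][0] = 0.0                  # zero operations transform "" into ""
    best = INF
    for k in range(1, m + n + 1):     # k = number of operations (matches count, at cost 0)
        cur = [[INF] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            for j in range(n + 1):
                c = INF
                if i > 0:
                    c = min(c, prev[i - 1][j] + 1)                           # delete a[i-1]
                if j > 0:
                    c = min(c, prev[i][j - 1] + 1)                           # insert b[j-1]
                if i > 0 and j > 0:
                    c = min(c, prev[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitute or match
                cur[i][j] = c
        if cur[m][n] < INF:
            best = min(best, cur[m][n] / k)
        prev = cur
    return best

print(normalized_edit_distance("abc", "abd"))  # 0.333...: one substitution over a 3-step path
```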

01 Jan 2004
TL;DR: An approach for answer selection in a free form question answering task by representing both questions and candidate passages using dependency trees, augmented with semantic information such as named entities, and computing a generalized edit distance between a candidate passage representation and the question representation, a distance which aims to capture some level of meaning similarity.
Abstract: We describe an approach for answer selection in a free form question answering task. In order to go beyond a key-word based matching in selecting answers to questions, one would like to develop a principled way for the answer selection process that incorporates both syntactic and semantic information. We achieve this goal by (1) representing both questions and candidate passages using dependency trees, augmented with semantic information such as named entities, and (2) computing a generalized edit distance between a candidate passage representation and the question representation, a distance which aims to capture some level of meaning similarity. The sentence that best answers a question is determined to be the one that minimizes the generalized edit distance we define, computed via a dynamic programming based approximate tree matching algorithm. We evaluate the approach on question-answer pairs taken from previous TREC Q/A competitions. Preliminary experiments show its potential by significantly outperforming common bag-of-word scoring methods.

Journal ArticleDOI
TL;DR: This paper addresses the problem of computing the length of the longest common subsequence (LCS) between run-length-encoded (RLE) strings, and exploits RLE to reduce the complexity of LCS computation from O(M × N) to O(mN + Mn - mn), where M and N are the lengths of the two strings and m and n are the numbers of runs in their encodings.
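
For reference, this is the classical O(M × N) dynamic program that the RLE-based algorithm improves upon; the run-length-encoded variant itself is not reproduced here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b, classical O(M * N) DP."""
    M, N = len(a), len(b)
    dp = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1              # extend a common subsequence
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])   # drop one character
    return dp[M][N]

print(lcs_length("aaabbbccc", "aabbcc"))  # 6
```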

Proceedings ArticleDOI
23 Aug 2004
TL;DR: This work proposes a cost inference method based on a distribution estimation of edit operations: an expectation maximization algorithm learns mixture densities from a labeled sample of graphs, and the derived edit costs are subsequently applied in the context of a graph edit distance computation framework.
Abstract: Graph edit distance provides an error-tolerant way to measure distances between attributed graphs. The effectiveness of edit distance based graph classification algorithms relies on the adequate definition of edit operation costs. We propose a cost inference method that is based on a distribution estimation of edit operations. For this purpose, we employ an expectation maximization algorithm to learn mixture densities from a labeled sample of graphs and derive edit costs that are subsequently applied in the context of a graph edit distance computation framework. We evaluate the performance of the proposed distance model in comparison to another recently introduced learning model for edit costs.

Book ChapterDOI
05 Jul 2004
TL;DR: This paper gives a solution using an O(n)-bit indexing data structure with O(m log^2 n) query time; to the best of the authors' knowledge, this is the first result that requires only linear indexing space, and the results can be extended to the k-difference problem with k ≥ 1.
Abstract: Let T be a text of length n and P be a pattern of length m, both strings over a fixed finite alphabet A. The k-difference (k-mismatch, respectively) problem is to find all occurrences of P in T that have edit distance (Hamming distance, respectively) at most k from P. In this paper we investigate a well-studied case in which k=1 and T is fixed and preprocessed into an indexing data structure so that any pattern query can be answered faster [16-19]. This paper gives a solution using an O(n)-bit indexing data structure with O(m log^2 n) query time. To the best of our knowledge, this is the first result which requires linear indexing space. The results can be extended for the k-difference problem with k ≥ 1.

Patent
John M. Carnahan1
16 Jan 2004
TL;DR: In this paper, a method is provided for increasing relevance of database search issues by determining a trained edit distance (130) between the subject query string (110) and a candidate string (120) using trained cost factors (150) derived from a training set of labeled query transformations (140).
Abstract: In one implementation, a method is provided for increasing relevance of database search issues. The method includes receiving a subject query string (110) and determining a trained edit distance (130) between the subject query string (110) and a candidate string (120) using trained cost factors (150) derived from a training set of labeled query transformations (140). A trained cost factor (150) includes a conditional probability for mutations in labeled non-relevant query transformations and a conditional probability for mutations in labeled relevant query transformations. The candidate string (120) is evaluated for selection based on the trained edit distance (130). In some implementations, the cost factors (150) may take into account the context of a mutation. In some implementations, multi-dimensional matrices (160) are utilized which include the trained cost factors (150).
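
A hedged illustration of the general idea rather than the patented method: a weighted edit distance whose per-operation costs come from a lookup table that could, in principle, be trained on labeled query transformations. The cost values below are invented for the example.

```python
def weighted_edit_distance(a, b, sub_cost, ins_cost, del_cost):
    """Edit distance with per-character operation costs supplied by the caller."""
    m, n = len(a), len(b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + del_cost(a[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + ins_cost(b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),  # substitute / match
                dp[i - 1][j] + del_cost(a[i - 1]),                # delete from a
                dp[i][j - 1] + ins_cost(b[j - 1]),                # insert from b
            )
    return dp[m][n]

# Hypothetical "trained" costs: confusable character pairs are cheap, everything else costs 1.
CONFUSABLE = {("c", "k"): 0.2, ("k", "c"): 0.2, ("i", "y"): 0.3, ("y", "i"): 0.3}
sub = lambda x, y: 0.0 if x == y else CONFUSABLE.get((x, y), 1.0)
print(weighted_edit_distance("cat", "kat", sub, lambda c: 1.0, lambda c: 1.0))  # 0.2
```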

Proceedings ArticleDOI
11 Jan 2004
TL;DR: To the authors' knowledge, this is the first data structure for this problem with both query time and storage subexponential in d; the space requirement of the data structure is roughly O(n^{d^{1/(l+1)}}), i.e., strongly subexponential.
Abstract: We present a data structure for the approximate nearest neighbor problem under the edit metric (which is defined as the minimum number of insertions, deletions and character substitutions needed to transform one string into another). For any l ≥ 1 and a set of n strings of length d, the data structure reports a 3^l-approximate nearest neighbor for any given query string q in O(d) time. The space requirement of this data structure is roughly O(n^{d^{1/(l+1)}}), i.e., strongly subexponential. To our knowledge, this is the first data structure for this problem with both o(n) query time and storage subexponential in d.

01 Dec 2004
TL;DR: In this paper, the problem of estimating the rearrangement distance in terms of reversals, insertion and deletion between two genomes, G and H with possibly multiple genes from the same gene family, is considered.
Abstract: We consider the problem of estimating the rearrangement distance in terms of reversals, insertion and deletion between two genomes, G and H with possibly multiple genes from the same gene family. We define a notion of breakpoint distance for this problem, based on matching genes from the same family between G and H. We show that this distance is a good approximation of the edit distance, but NP-hard to compute, even when just one family of genes is non-trivial. We also propose a branch-and-cut exact algorithm for the computation of the breakpoint distance.

Book ChapterDOI
29 Aug 2004
TL;DR: In this article, the authors show how testers and correctors for regular trees can be used to estimate distances between a document and a set of DTDs, a useful operation to rank XML documents.
Abstract: A corrector takes an invalid XML file F as input and produces a valid file F′ which is not far from F when F is ε-close to its DTD, using the classical tree edit distance between a tree T and a language L defined by a DTD or a tree automaton. We show how testers and correctors for regular trees can be used to estimate distances between a document and a set of DTDs, a useful operation to rank XML documents.

Journal ArticleDOI
TL;DR: For the edit distance between strings A and B, the difference representation of the D-table is defined, which leads to a simple and intuitive algorithm for the incremental/decremental edit distance problem.

Proceedings ArticleDOI
01 Dec 2004
TL;DR: This work presents a measure of contextual similarity for biomedical terms, which augments the traditional concept of edit distance by elements of linguistic and biomedical knowledge, which together provide flexible selection of contextual features and their comparison.
Abstract: We present a measure of contextual similarity for biomedical terms. The contextual features need to be explored, because newly coined terms are not explicitly described and efficiently stored in biomedical ontologies and their inner features (e.g. morphologic or orthographic) do not always provide sufficient information about the properties of the underlying concepts. The context of each term can be represented as a sequence of syntactic elements annotated with biomedical information retrieved from an ontology. The sequences of contextual elements may be matched approximately by edit distance defined as the minimal cost incurred by the changes (including insertion, deletion and replacement) needed to transform one sequence into the other. Our approach augments the traditional concept of edit distance by elements of linguistic and biomedical knowledge, which together provide flexible selection of contextual features and their comparison.

Book ChapterDOI
12 Jul 2004
TL;DR: In the complete version of the paper, it is shown that the distance problem is NP-complete on ordered trees.
Abstract: We consider the edit distance with moves on the class of words and the class of ordered trees. We first exhibit a simple tester for the class of regular languages on words and generalize it to the class of ranked regular trees. In the complete version of the paper, we show that the distance problem is NP-complete on ordered trees.

Patent
16 Nov 2004
TL;DR: In this paper, a system for effectively collecting, without omissions, spelling variations centering on particular technical terms occurring in documents is presented. But the system requires a large-scale collection of terms.
Abstract: A system for effectively collecting, without omissions, spelling variations centering on particular technical terms occurring in documents. In advance, the system sorts out technical terms considered to be potential spelling variations from among a large-scale collection of terms. By measuring an edit distance with adjusted costs between the terms that are potential spelling variations, the system can collect terms considered spelling variations from among the candidate terms with a high degree of accuracy.

Book ChapterDOI
21 Jul 2004
TL;DR: A novel solution is proposed for error tolerant graph matching by extending the original edit distance based framework so as to account for a new operator to support node merging during the matching process.
Abstract: In this paper a novel solution is proposed for error tolerant graph matching. The solution belongs to the class of edit distance based techniques. In particular, the original edit distance based framework is extended so as to account for a new operator to support node merging during the matching process.

Proceedings ArticleDOI
26 Oct 2004
TL;DR: The aim of this work is to correct recognition and segmentation errors using lexical information from a lexicon, together with a new approach to automatically learn an edit distance specifically adapted to the properties of on-line handwritten word recognition.
Abstract: This paper presents an optimized lexical post-processing designed for handwritten word recognition. The aim of this work is to correct recognition and segmentation errors using lexical information from a lexicon. The presented lexical post-processing is based on two phases: in the first phase, the lexicon is organized into sub-lexicons to reduce the search space during the recognition process; the second phase develops a specific edit distance to identify the handwritten word using a selection of the sub-lexicons. The paper presents two original strategies of lexicon reduction and a new approach to automatically learn an edit distance specifically adapted to the properties of on-line handwritten word recognition. Experimental results are reported to compare the two lexicon reduction strategies, and first results emphasize the impact of the learning process of the new edit distance.

01 Jan 2004
TL;DR: A string comparator based on edit distance that uses variable edit-step costs derived from training data; its performance is compared with that of a comparator without variable edit-step costs and with the Jaro-Winkler string comparator used in the Census Bureau's record linkage software.
Abstract: We develop a string comparator based on edit distance that uses variable edit-step costs derived from training data. Using first and last name data from Census files, we compare the performance of this string comparator with one without variable edit-step costs and with the Jaro-Winkler string comparator, which is standardly used in the Census Bureau's record linkage software.

Patent
09 Feb 2004
TL;DR: In this paper, a process determines for a search string which, if any, of the strings in a text list have edit distance from the search string less than a threshold, using dynamic programming.
Abstract: A process determines for a search string which, if any, of the strings in a text list have edit distance from the search string less than a threshold. The process uses dynamic programming on a grid with search string characters corresponding to rows and text characters corresponding to columns. For each text string, computation proceeds by columns. If successive text strings share a prefix, then the columns corresponding to the prefix are re-used. If the minimum value in a column is at least the threshold, then the prefix corresponding to that and previous columns causes edit distance to be at least the threshold. So the computation for the present text is abandoned, and computations for any other texts that share the prefix are avoided.
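
A minimal sketch of the described process, under stated assumptions (the function name and the exact pruning policy are illustrative): texts are processed in sorted order so that shared prefixes are adjacent, columns computed for a shared prefix are reused, a column whose minimum reaches the threshold abandons the current text, and later texts sharing that failing prefix are skipped.

```python
def search_within_threshold(query, texts, threshold):
    """Return the texts whose edit distance to query is below threshold."""
    m = len(query)
    results = []
    cols = [list(range(m + 1))]      # DP column for the empty text prefix
    prev_text = ""
    skip_prefix = None
    for text in sorted(texts):
        if skip_prefix is not None and text.startswith(skip_prefix):
            continue                 # this prefix already forced the threshold
        skip_prefix = None
        shared = 0                   # length of the prefix shared with the previous text
        while shared < min(len(prev_text), len(text)) and prev_text[shared] == text[shared]:
            shared += 1
        del cols[shared + 1:]        # reuse the columns computed for the shared prefix
        abandoned = False
        for j in range(shared, len(text)):
            prev_col, col = cols[-1], [j + 1]
            for i in range(1, m + 1):
                col.append(min(prev_col[i] + 1,                               # insert text[j]
                               col[i - 1] + 1,                                # delete query[i-1]
                               prev_col[i - 1] + (query[i - 1] != text[j])))  # substitute/match
            cols.append(col)
            if min(col) >= threshold:   # no extension of this prefix can beat the threshold
                skip_prefix = text[:j + 1]
                abandoned = True
                break
        if not abandoned and cols[-1][m] < threshold:
            results.append(text)
        prev_text = text[:len(cols) - 1]   # the prefix the retained columns describe
    return results

print(search_within_threshold("color", ["colour", "colouring", "colored", "cat"], 2))
# ['colour']
```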

Proceedings Article
01 May 2004
TL;DR: This paper explores the link between legitimate translation variation and statistical measures of a word's salience within a given document, such as tf.idf scores, and shows that the use of such scores extends the N-gram distance measures in a way that allows us to accurately predict multiple quality parameters of the text.
Abstract: Automatic methods for MT evaluation are often based on the assumption that MT quality is related to some kind of distance between the evaluated text and a professional human translation (e.g., an edit distance or the precision of matched N-grams). However, independently produced human translations are necessarily different, conveying the same content by dissimilar means. Such legitimate translation variation is a serious problem for distance-based evaluation methods, because mismatches do not necessarily mean degradation in MT quality. In this paper we explore the link between legitimate translation variation and statistical measures of a word's salience within a given document, such as tf.idf scores. We show that the use of such scores extends the N-gram distance measures in a way that allows us to accurately predict multiple quality parameters of the text, such as translation adequacy and fluency. However, legitimate translation variation also reveals fundamental limits on the applicability of distance-based MT evaluation methods and on data-driven architectures for MT.

Book ChapterDOI
26 May 2004
TL;DR: In this article, an evolutionary approach was used to optimize the parameter values of cost functions of the edit distance for music performance annotation, and the validity of the optimized parameter settings was shown by assessing their error-percentage on a test set.
Abstract: In this paper we present an enhancement of edit distance based music performance annotation. The annotation captures musical expressivity not only in terms of timing deviations but also represents e.g. spontaneous note ornamentation. To reduce the number of errors in automatic performance annotation, some optimization is essential. We have taken an evolutionary approach to optimize the parameter values of cost functions of the edit distance. Automatic optimization is desirable since manual parameter tuning is unfeasible when more than a few performances are taken into account. The validity of the optimized parameter settings is shown by assessing their error-percentage on a test set.