
Showing papers on "Edit distance published in 2004"


Proceedings ArticleDOI
23 Aug 2004
TL;DR: Investigation of unsupervised techniques for acquiring monolingual sentence-level paraphrases from a corpus of temporally and topically clustered news articles collected from thousands of web-based news sources shows that edit distance data is cleaner and more easily-aligned than the heuristic data.
Abstract: We investigate unsupervised techniques for acquiring monolingual sentence-level paraphrases from a corpus of temporally and topically clustered news articles collected from thousands of web-based news sources. Two techniques are employed: (1) simple string edit distance, and (2) a heuristic strategy that pairs initial (presumably summary) sentences from different news stories in the same cluster. We evaluate both datasets using a word alignment algorithm and a metric borrowed from machine translation. Results show that edit distance data is cleaner and more easily-aligned than the heuristic data, with an overall alignment error rate (AER) of 11.58% on a similarly-extracted test set. On test data extracted by the heuristic strategy, however, performance of the two training sets is similar, with AERs of 13.2% and 14.7% respectively. Analysis of 100 pairs of sentences from each set reveals that the edit distance data lacks many of the complex lexical and syntactic alternations that characterize monolingual paraphrase. The summary sentences, while less readily alignable, retain more of the non-trivial alternations that are of greatest interest for learning paraphrase relationships.

895 citations


Book ChapterDOI
31 Aug 2004
TL;DR: A new distance function, ERP ("Edit distance with Real Penalty"), which is a marriage of the L1-norm and the edit distance: ERP can support local time shifting, is a metric, and, combined with the proposed pruning strategies, dominates all existing strategies.
Abstract: Existing studies on time series are based on two categories of distance functions. The first category consists of the Lp-norms. They are metric distance functions but cannot support local time shifting. The second category consists of distance functions which are capable of handling local time shifting but are nonmetric. The first contribution of this paper is the proposal of a new distance function, which we call ERP ("Edit distance with Real Penalty"). Representing a marriage of the L1-norm and the edit distance, ERP can support local time shifting, and is a metric. The second contribution of the paper is the development of pruning strategies for large time series databases. Given that ERP is a metric, one way to prune is to apply the triangle inequality. Another way to prune is to develop a lower bound on the ERP distance. We propose such a lower bound, which has the nice computational property that it can be efficiently indexed with a standard B+-tree. Moreover, we show that these two ways of pruning can be used simultaneously for ERP distances. Specifically, the false positives obtained from the B+-tree can be further minimized by applying the triangle inequality. Based on extensive experimentation with existing benchmarks and techniques, we show that this combination delivers superb pruning power and search time performance, and dominates all existing strategies.

790 citations
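
The ERP recurrence described in the entry above is short enough to sketch directly. The following is a minimal Python sketch, assuming real-valued series and a constant gap value g (taken as 0 here; the paper argues that a constant gap is what keeps ERP a metric). It is illustrative only and omits the lower-bounding and B+-tree indexing machinery.

```python
def erp(r, s, g=0.0):
    """O(len(r) * len(s)) dynamic program for the ERP distance."""
    m, n = len(r), len(s)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):            # r aligned against an empty series
        dp[i][0] = dp[i - 1][0] + abs(r[i - 1] - g)
    for j in range(1, n + 1):            # s aligned against an empty series
        dp[0][j] = dp[0][j - 1] + abs(s[j - 1] - g)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + abs(r[i - 1] - s[j - 1]),  # match the two elements
                dp[i - 1][j] + abs(r[i - 1] - g),             # gap in s (real penalty)
                dp[i][j - 1] + abs(s[j - 1] - g),             # gap in r (real penalty)
            )
    return dp[m][n]

print(erp([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))  # 2.0: the unmatched element costs |2 - g|
```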


Proceedings ArticleDOI
17 Oct 2004
TL;DR: Algorithms are developed that solve gap versions of the edit distance problem: given two strings of length n with the promise that their edit distance is either at most k or greater than /spl lscr/, decide which of the two holds and develop an n/sup 3/7/-approximation quasilinear time algorithm.
Abstract: Edit distance has been extensively studied for the past several years. Nevertheless, no linear-time algorithm is known to compute the edit distance between two strings, or even to approximate it to within a modest factor. Furthermore, for various natural algorithmic problems such as low-distortion embeddings into normed spaces, approximate nearest-neighbor schemes, and sketching algorithms, known results for the edit distance are rather weak. We develop algorithms that solve gap versions of the edit distance problem: given two strings of length n with the promise that their edit distance is either at most k or greater than ℓ, decide which of the two holds. We present two sketching algorithms for gap versions of edit distance. Our first algorithm solves the k vs. (kn)^{2/3} gap problem, using a constant size sketch. A more involved algorithm solves the stronger k vs. ℓ gap problem, where ℓ can be as small as O(k^2) - still with a constant sketch - but works only for strings that are mildly "nonrepetitive". Finally, we develop an n^{3/7}-approximation quasilinear time algorithm for edit distance, improving the previous best factor of n^{3/4} (Cole and Hariharan, 2002); if the input strings are assumed to be nonrepetitive, then the approximation factor can be strengthened to n^{1/3}.

161 citations
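
The sketching and approximation results above do not reduce to a few lines, but the underlying gap question has a classical exact counterpart: deciding whether the edit distance is at most k can be done in O(nk) time with a banded dynamic program (often attributed to Ukkonen). The sketch below is that baseline, not the paper's algorithms; the helper name within_k is ours.

```python
def within_k(x, y, k):
    """Exact decision: is edit_distance(x, y) <= k?  Runs in O(len(x) * k) time."""
    n, m = len(x), len(y)
    if abs(n - m) > k:
        return False
    INF = k + 1                                   # any value above k can be capped
    prev = {j: j for j in range(min(m, k) + 1)}   # DP row 0, restricted to the band
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                best = i                                                    # delete x[:i]
            else:
                best = min(prev.get(j - 1, INF) + (x[i - 1] != y[j - 1]),   # substitute/match
                           cur.get(j - 1, INF) + 1)                         # insert y[j-1]
            best = min(best, prev.get(j, INF) + 1)                          # delete x[i-1]
            cur[j] = min(best, INF)
        if min(cur.values()) > k:     # every cell in the band exceeds k: give up early
            return False
        prev = cur
    return prev.get(m, INF) <= k

print(within_k("kitten", "sitting", 3))  # True: the edit distance is exactly 3
print(within_k("kitten", "sitting", 2))  # False
```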


Book ChapterDOI
20 Oct 2004
TL;DR: An algorithm is presented that attempts to select the best choice among all possible corrections for a misspelled term, and its implementation based on a ternary search tree data structure is discussed.
Abstract: Search engines have become the primary means of accessing information on the Web. However, recent studies show misspelled words are very common in queries to these systems. When users misspell a query, the results are incorrect or provide inconclusive information. In this work, we discuss the integration of a spelling correction component into tumba!, our community Web search engine. We present an algorithm that attempts to select the best choice among all possible corrections for a misspelled term, and discuss its implementation based on a ternary search tree data structure.

84 citations
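
As a rough illustration of the data structure mentioned, here is a compact ternary search tree with insertion and exact lookup; the traversal that enumerates corrections within a bounded edit distance, and tumba!'s ranking of candidate corrections, are omitted.

```python
class TSTNode:
    __slots__ = ("ch", "lo", "eq", "hi", "is_word")
    def __init__(self, ch):
        self.ch, self.lo, self.eq, self.hi, self.is_word = ch, None, None, None, False

class TernarySearchTree:
    """Each node holds one character and three children: smaller, equal, greater."""
    def __init__(self):
        self.root = None

    def insert(self, word):
        self.root = self._insert(self.root, word, 0)

    def _insert(self, node, word, i):
        ch = word[i]
        if node is None:
            node = TSTNode(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1)
        else:
            node.is_word = True
        return node

    def contains(self, word):
        node, i = self.root, 0
        while node is not None:
            if word[i] < node.ch:
                node = node.lo
            elif word[i] > node.ch:
                node = node.hi
            else:
                if i + 1 == len(word):
                    return node.is_word
                node, i = node.eq, i + 1
        return False

tst = TernarySearchTree()
for w in ("search", "sear", "starch"):
    tst.insert(w)
print(tst.contains("search"), tst.contains("serch"))  # True False
```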


Book ChapterDOI
14 Mar 2004
TL;DR: In this article, a set of new filter methods for structural and for content-based information in tree-structured data as well as ways to flexibly combine different filter criteria are presented.
Abstract: Structured and semi-structured object representations are getting more and more important for modern database applications. Examples for such data are hierarchical structures including chemical compounds, XML data or image data. As a key feature, database systems have to support the search for similar objects where it is important to take into account both the structure and the content features of the objects. A successful approach is to use the edit distance for tree structured data. As the computation of this measure is NP-complete, constrained edit distances have been successfully applied to trees. While yielding good results, they are still computationally complex and, therefore, of limited benefit for searching in large databases. In this paper, we propose a filter and refinement architecture to overcome this problem. We present a set of new filter methods for structural and for content-based information in tree-structured data as well as ways to flexibly combine different filter criteria. The efficiency of our methods, resulting from the good selectivity of the filters, is demonstrated in extensive experiments with real-world applications.

84 citations
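
As one hedged example of the filter-and-refinement idea (not necessarily one of the paper's filters): with unit edit costs, an insertion or deletion changes the multiset of node labels by one element and a relabeling by at most two, so half of the L1 distance between label histograms lower-bounds the tree edit distance. Candidate trees whose bound already exceeds the query threshold can be pruned before the expensive refinement step.

```python
from collections import Counter
from math import ceil

def label_histogram_lower_bound(labels_a, labels_b):
    """labels_a, labels_b: iterables of node labels of the two trees."""
    ha, hb = Counter(labels_a), Counter(labels_b)
    l1 = sum(abs(ha[k] - hb[k]) for k in set(ha) | set(hb))
    # each unit-cost edit operation changes the histogram by at most 2 elements
    return ceil(l1 / 2)

# A query tree with labels {a, a, b, c} against a candidate with labels {a, d, d}:
print(label_histogram_lower_bound("aabc", "add"))  # 3 -> prune whenever the threshold is below 3
```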


Book ChapterDOI
TL;DR: An efficient algorithm is proposed for edit distance computation of planar graphs: given graphs embedded in the plane, small subgraphs are iteratively matched by locally optimizing structural correspondences, which yields a valid edit path and hence an upper bound on the edit distance.
Abstract: Graph edit distance is a powerful error-tolerant similarity measure for graphs. For pattern recognition problems involving large graphs, however, the high computational complexity makes it sometimes impossible to apply edit distance algorithms. In the present paper we propose an efficient algorithm for edit distance computation of planar graphs. Given graphs embedded in the plane, we iteratively match small subgraphs by locally optimizing structural correspondences. Eventually we obtain a valid edit path and hence an upper bound of the edit distance. To demonstrate the efficiency of our approach, we apply the proposed algorithm to the problem of fingerprint classification.

83 citations


Journal ArticleDOI
TL;DR: Improvements to previously published methods for similarity searching with reduced graphs are described, with a particular focus on ligand-based virtual screening, and a novel use of reduced graphs in the clustering of high-throughput screening data is described.
Abstract: Virtual screening and high-throughput screening are two major components of lead discovery within the pharmaceutical industry. In this paper we describe improvements to previously published methods for similarity searching with reduced graphs, with a particular focus on ligand-based virtual screening, and describe a novel use of reduced graphs in the clustering of high-throughput screening data. Literature methods for reduced graph similarity searching encode the reduced graphs as binary fingerprints, which has a number of issues. In this paper we extend the definition of the reduced graph to include positively and negatively ionizable groups and introduce a new method for measuring the similarity of reduced graphs based on a weighted edit distance. Moving beyond simple similarity searching, we show how more flexible queries can be built using reduced graphs and describe a database system that allows iterative querying with multiple representations. Reduced graphs capture many important features of ligand...

77 citations


Journal ArticleDOI
01 Mar 2004
TL;DR: Similarity-based variants of grouping and join operators are presented: the extended grouping operator produces groups of similar tuples, while the extended join combines tuples satisfying a given similarity condition.
Abstract: Dealing with discrepancies in data is still a big challenge in data integration systems. The problem occurs both when eliminating duplicates from semantically overlapping sources and when combining complementary data from different sources. Though using SQL operations like grouping and join seems to be a viable way, they fail if the attribute values of the potential duplicates or related tuples are not equal but only similar by certain criteria. As a solution to this problem, we present in this paper similarity-based variants of grouping and join operators. The extended grouping operator produces groups of similar tuples, the extended join combines tuples satisfying a given similarity condition. We describe the semantics of these operators, discuss efficient implementations for the edit distance similarity and present evaluation results. Finally, we give examples of application from the context of a data reconciliation project for looted art.

72 citations
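
A toy sketch of the similarity-based grouping idea (not the paper's implementation, and the artwork titles are invented): values within k edits of one another fall into the same group, with the transitive closure computed by union-find and a cheap length filter applied before each quadratic edit distance computation.

```python
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity_grouping(values, k):
    parent = list(range(len(values)))
    def find(x):                         # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            a, b = values[i], values[j]
            # cheap length filter before the O(len(a) * len(b)) edit distance
            if abs(len(a) - len(b)) <= k and edit_distance(a, b) <= k:
                parent[find(i)] = find(j)
    groups = {}
    for i, v in enumerate(values):
        groups.setdefault(find(i), []).append(v)
    return list(groups.values())

titles = ["Portrait of a Lady", "Portait of a Lady", "Still Life with Fruit"]
print(similarity_grouping(titles, k=2))
# [['Portrait of a Lady', 'Portait of a Lady'], ['Still Life with Fruit']]
```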


Proceedings Article
01 Jan 2004
TL;DR: A rolling parallel printer in which a pressure element is driven through a swiveling motion each printing cycle and a pressure segment thereof rolls off a line of type.
Abstract: A rolling parallel printer in which a pressure element is driven through a swiveling motion each printing cycle and a pressure segment thereof rolls off a line of type. The pressure element is connected to a mechanical linkage which minimizes the sweep of travel of the pressure element, while maintaining the pressure element sufficiently far from the type in a rest position to facilitate reading of the printed matter.

68 citations


Journal ArticleDOI
TL;DR: Approximate string comparators increase deterministic linkage sensitivity by up to 10% compared to exact match comparisons and represent an accurate method of linking to vital statistics data.
Abstract: Medical record linkage is becoming increasingly important as clinical data is distributed across independent sources. To improve linkage accuracy we studied different name comparison methods that establish agreement or disagreement between corresponding names. In addition to exact raw name matching and exact phonetic name matching, we tested three approximate string comparators. The approximate comparators included the modified Jaro-Winkler method, the longest common substring, and the Levenshtein edit distance. We also calculated the combined root-mean square of all three. We tested each name comparison method using a deterministic record linkage algorithm. Results were consistent across both hospitals. At a threshold comparator score of 0.8, the Jaro-Winkler comparator achieved the highest linkage sensitivities of 97.4% and 97.7%. The combined root-mean square method achieved sensitivities higher than the Levenshtein edit distance or longest common substring while sustaining high linkage specificity. Approximate string comparators increase deterministic linkage sensitivity by up to 10% compared to exact match comparisons and represent an accurate method of linking to vital statistics data.

64 citations
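
A rough sketch of two of the comparators and the combined root-mean-square score from the study above; the Jaro-Winkler comparator is omitted for brevity, and scaling each raw value by the longer name's length is an assumption rather than the study's exact normalization.

```python
from math import sqrt

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def longest_common_substring(a, b):
    best, prev = 0, [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else 0)   # extend a run of equal characters
        best, prev = max(best, max(cur)), cur
    return best

def edit_sim(a, b):
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

def lcs_sim(a, b):
    return longest_common_substring(a, b) / max(len(a), len(b), 1)

def combined_rms(a, b):
    scores = (edit_sim(a, b), lcs_sim(a, b))
    return sqrt(sum(s * s for s in scores) / len(scores))

print(edit_sim("JOHNSON", "JONSON"), combined_rms("JOHNSON", "JONSON"))  # ~0.86, ~0.73
```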


Proceedings ArticleDOI
18 Jun 2004
TL;DR: This paper presents a domain-independent algorithm that effectively identifies duplicates in an XML document; it adopts a top-down traversal of the XML tree structure to identify duplicate elements on each level.
Abstract: The problem of detecting duplicate entities that describe the same real-world object (and purging them) is an important data cleansing task, necessary to improve data quality. For data stored in a flat relation, numerous solutions to this problem exist. As XML becomes increasingly popular for data representation, algorithms to detect duplicates in nested XML documents are required. In this paper, we present a domain-independent algorithm that effectively identifies duplicates in an XML document. The solution adopts a top-down traversal of the XML tree structure to identify duplicate elements on each level. Pairs of duplicate elements are detected using a thresholded similarity function, and are then clustered by computing the transitive closure. To minimize the number of pairwise element comparisons, an appropriate filter function is used. The similarity measure involves string similarity for pairs of strings, which is measured using their edit distance. To increase efficiency, we avoid the computation of edit distance for pairs of strings by successively applying three filtering methods. First experiments show that our approach detects XML duplicates accurately and efficiently.

Journal ArticleDOI
TL;DR: Inspired by the success of the Markov Random Field theory, a new edit distance called Markov edit distance (MED) within the dynamic programming framework is proposed to take full advantage of the local statistical dependencies in the pattern in order to arrive at enhanced matching performance.
Abstract: Edit distance was originally developed by Levenshtein several decades ago to measure the distance between two strings. It was found that this distance can be computed by an elegant dynamic programming procedure. The edit distance has played important roles in a wide array of applications due to its representational efficacy and computational efficiency. To effect a more reasonable distance measure, the normalized edit distance was proposed. Many algorithms and studies have been dedicated along this line with impressive performances in recent years. There is, however, a fundamental problem with the original definition of edit distance that has remained elusive: its context-free nature. In determining the possible actions, i.e., insertion, deletion, and substitution, no systematic consideration was given to the local behaviors of the string/pattern in question that indeed encompass a great amount of useful information regarding its content. In this paper, inspired by the success of the Markov Random Field theory, a new edit distance called Markov edit distance (MED) within the dynamic programming framework is proposed to take full advantage of the local statistical dependencies in the pattern in order to arrive at enhanced matching performance. Within this framework, two specialized distance measures are developed: the reshuffling MED to handle cases where a subpattern in the target pattern is a reshuffle of that in the source pattern, and the coherence MED, which is able to incur local content based substitution, insertion, and deletion. The applications based on these two MEDs in string matching are then explored, where encouraging empirical results have been observed.
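
The normalized edit distance the abstract alludes to divides the cost of an edit path by the path's length (in the style of Marzal and Vidal) rather than by the string lengths. The sketch below is a brute-force dynamic program over path lengths with unit costs, shown only to make that baseline concrete; the Markov edit distance itself is not reproduced here.

```python
def normalized_edit_distance(a, b):
    """Minimum over all edit paths of (path cost) / (path length), with unit costs."""
    m, n = len(a), len(b)
    if m == 0 and n == 0:
        return 0.0
    INF = float("inf")
    prev = [[INF] * (n + 1) for _ in range(m + 1)]
    prev[0][0] = 0.0                  # zero operations transform "" into ""
    best = INF
    for k in range(1, m + n + 1):     # k = number of operations (matches count, at cost 0)
        cur = [[INF] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            for j in range(n + 1):
                c = INF
                if i > 0:
                    c = min(c, prev[i - 1][j] + 1)                           # delete a[i-1]
                if j > 0:
                    c = min(c, prev[i][j - 1] + 1)                           # insert b[j-1]
                if i > 0 and j > 0:
                    c = min(c, prev[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitute or match
                cur[i][j] = c
        if cur[m][n] < INF:
            best = min(best, cur[m][n] / k)
        prev = cur
    return best

print(normalized_edit_distance("abc", "abd"))  # 0.333...: one substitution over a 3-step path
```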

01 Jan 2004
TL;DR: An approach for answer selection in a free form question answering task by representing both questions and candidate passages using dependency trees, augmented with semantic information such as named entities, and computing a generalized edit distance between a candidate passage representation and the question representation, a distance which aims to capture some level of meaning similarity.
Abstract: We describe an approach for answer selection in a free form question answering task. In order to go beyond a key-word based matching in selecting answers to questions, one would like to develop a principled way for the answer selection process that incorporates both syntactic and semantic information. We achieve this goal by (1) representing both questions and candidate passages using dependency trees, augmented with semantic information such as named entities, and (2) computing a generalized edit distance between a candidate passage representation and the question representation, a distance which aims to capture some level of meaning similarity. The sentence that best answers a question is determined to be the one that minimizes the generalized edit distance we define, computed via a dynamic programming based approximate tree matching algorithm. We evaluate the approach on question-answer pairs taken from previous TREC Q/A competitions. Preliminary experiments show its potential by significantly outperforming common bag-of-word scoring methods.

Journal ArticleDOI
TL;DR: This paper addresses the problem of computing the length of the longest common subsequence (LCS) between run-length-encoded (RLE) strings, and exploits RLE to reduce the complexity of LCS computation from O(M × N) to O(mN + Mn - mn), where M and N are the lengths of the two strings and m and n are the numbers of runs in their encodings.
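
For reference, this is the classical O(M × N) dynamic program that the RLE-based algorithm improves upon; the run-length-encoded variant itself is not reproduced here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b, classical O(M * N) DP."""
    M, N = len(a), len(b)
    dp = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1              # extend a common subsequence
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])   # drop one character
    return dp[M][N]

print(lcs_length("aaabbbccc", "aabbcc"))  # 6
```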

Proceedings ArticleDOI
23 Aug 2004
TL;DR: This work proposes a cost inference method based on a distribution estimation of edit operations: an expectation maximization algorithm learns mixture densities from a labeled sample of graphs, and the derived edit costs are subsequently applied in the context of a graph edit distance computation framework.
Abstract: Graph edit distance provides an error-tolerant way to measure distances between attributed graphs. The effectiveness of edit distance based graph classification algorithms relies on the adequate definition of edit operation costs. We propose a cost inference method that is based on a distribution estimation of edit operations. For this purpose, we employ an expectation maximization algorithm to learn mixture densities from a labeled sample of graphs and derive edit costs that are subsequently applied in the context of a graph edit distance computation framework. We evaluate the performance of the proposed distance model in comparison to another recently introduced learning model for edit costs.

Book ChapterDOI
05 Jul 2004
TL;DR: This paper gives a solution using an O(n)-bit indexing data structure with O(m log^2 n) query time; to the best of the authors' knowledge, this is the first result that requires only linear indexing space, and the results can be extended to the k-difference problem with k ≥ 1.
Abstract: Let T be a text of length n and P be a pattern of length m, both strings over a fixed finite alphabet A. The k-difference (k-mismatch, respectively) problem is to find all occurrences of P in T that have edit distance (Hamming distance, respectively) at most k from P. In this paper we investigate a well-studied case in which k=1 and T is fixed and preprocessed into an indexing data structure so that any pattern query can be answered faster [16-19]. This paper gives a solution using an O(n)-bit indexing data structure with O(m log^2 n) query time. To the best of our knowledge, this is the first result which requires linear indexing space. The results can be extended for the k-difference problem with k ≥ 1.

Patent
John M. Carnahan1
16 Jan 2004
TL;DR: In this paper, a method is provided for increasing relevance of database search issues by determining a trained edit distance (130) between the subject query string (110) and a candidate string (120) using trained cost factors (150) derived from a training set of labeled query transformations (140).
Abstract: In one implementation, a method is provided for increasing relevance of database search issues. The method includes receiving a subject query string (110) and determining a trained edit distance (130) between the subject query string (110) and a candidate string (120) using trained cost factors (150) derived from a training set of labeled query transformations (140). A trained cost factor (150) includes a conditional probability for mutations in labeled non-relevant query transformations and a conditional probability for mutations in labeled relevant query transformations. The candidate string (120) is evaluated for selection based on the trained edit distance (130). In some implementations, the cost factors (150) may take into account the context of a mutation. In some implementations, multi-dimensional matrices (160) are utilized which include the trained cost factors (150).
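
A hedged illustration of the general idea rather than the patented method: a weighted edit distance whose per-operation costs come from a lookup table that could, in principle, be trained on labeled query transformations. The cost values below are invented for the example.

```python
def weighted_edit_distance(a, b, sub_cost, ins_cost, del_cost):
    """Edit distance with per-character operation costs supplied by the caller."""
    m, n = len(a), len(b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + del_cost(a[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + ins_cost(b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),  # substitute / match
                dp[i - 1][j] + del_cost(a[i - 1]),                # delete from a
                dp[i][j - 1] + ins_cost(b[j - 1]),                # insert from b
            )
    return dp[m][n]

# Hypothetical "trained" costs: confusable character pairs are cheap, everything else costs 1.
CONFUSABLE = {("c", "k"): 0.2, ("k", "c"): 0.2, ("i", "y"): 0.3, ("y", "i"): 0.3}
sub = lambda x, y: 0.0 if x == y else CONFUSABLE.get((x, y), 1.0)
print(weighted_edit_distance("cat", "kat", sub, lambda c: 1.0, lambda c: 1.0))  # 0.2
```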

Proceedings ArticleDOI
11 Jan 2004
TL;DR: To the authors' knowledge, this is the first data structure for this problem with both query time and storage subexponential in d; the space requirement of the data structure is roughly O(n^{d^{1/(l+1)}}), i.e., strongly subexponential.
Abstract: We present a data structure for the approximate nearest neighbor problem under the edit metric (which is defined as the minimum number of insertions, deletions and character substitutions needed to transform one string into another). For any l ≥ 1 and a set of n strings of length d, the data structure reports a 3^l-approximate nearest neighbor for any given query string q in O(d) time. The space requirement of this data structure is roughly O(n^{d^{1/(l+1)}}), i.e., strongly subexponential. To our knowledge, this is the first data structure for this problem with both o(n) query time and storage subexponential in d.

01 Dec 2004
TL;DR: In this paper, the problem of estimating the rearrangement distance in terms of reversals, insertion and deletion between two genomes, G and H with possibly multiple genes from the same gene family, is considered.
Abstract: We consider the problem of estimating the rearrangement distance in terms of reversals, insertion and deletion between two genomes, G and H with possibly multiple genes from the same gene family. We define a notion of breakpoint distance for this problem, based on matching genes from the same family between G and H. We show that this distance is a good approximation of the edit distance, but NP-hard to compute, even when just one family of genes is non-trivial. We also propose a branch-and-cut exact algorithm for the computation of the breakpoint distance.

Book ChapterDOI
29 Aug 2004
TL;DR: In this article, the authors show how testers and correctors for regular trees can be used to estimate distances between a document and a set of DTDs, a useful operation to rank XML documents.
Abstract: A corrector takes an invalid XML file F as input and produces a valid file F′ which is not far from F when F is ε-close to its DTD, using the classical tree edit distance between a tree T and a language L defined by a DTD or a tree automaton. We show how testers and correctors for regular trees can be used to estimate distances between a document and a set of DTDs, a useful operation to rank XML documents.

Journal ArticleDOI
TL;DR: For the edit distance between strings A and B, the difference representation of the D-table is defined, which leads to a simple and intuitive algorithm for the incremental/decremental edit distance problem.

Proceedings ArticleDOI
01 Dec 2004
TL;DR: This work presents a measure of contextual similarity for biomedical terms, which augments the traditional concept of edit distance by elements of linguistic and biomedical knowledge, which together provide flexible selection of contextual features and their comparison.
Abstract: We present a measure of contextual similarity for biomedical terms. The contextual features need to be explored, because newly coined terms are not explicitly described and efficiently stored in biomedical ontologies and their inner features (e.g. morphologic or orthographic) do not always provide sufficient information about the properties of the underlying concepts. The context of each term can be represented as a sequence of syntactic elements annotated with biomedical information retrieved from an ontology. The sequences of contextual elements may be matched approximately by edit distance defined as the minimal cost incurred by the changes (including insertion, deletion and replacement) needed to transform one sequence into the other. Our approach augments the traditional concept of edit distance by elements of linguistic and biomedical knowledge, which together provide flexible selection of contextual features and their comparison.

Book ChapterDOI
12 Jul 2004
TL;DR: In the complete version of the paper, it is shown that the distance problem is NP-complete on ordered trees.
Abstract: We consider the edit distance with moves on the class of words and the class of ordered trees. We first exhibit a simple tester for the class of regular languages on words and generalize it to the class of ranked regular trees. In the complete version of the paper, we show that the distance problem is NP-complete on ordered trees.

Patent
16 Nov 2004
TL;DR: In this paper, a system for effectively collecting, without omissions, spelling variations centering on particular technical terms occurring in documents is presented. But the system requires a large-scale collection of terms.
Abstract: A system for effectively collecting, without omissions, spelling variations centering on particular technical terms occurring in documents. In advance, the system sorts out technical terms considered to be potential spelling variations from among a large-scale collection of terms. By measuring an edit distance with adjusted costs between the terms that are potential spelling variations, the system can collect terms considered spelling variations from among the candidate terms with a high degree of accuracy.

Book ChapterDOI
21 Jul 2004
TL;DR: A novel solution is proposed for error tolerant graph matching by extending the original edit distance based framework so as to account for a new operator to support node merging during the matching process.
Abstract: In this paper a novel solution is proposed for error tolerant graph matching. The solution belongs to the class of edit distance based techniques. In particular, the original edit distance based framework is extended so as to account for a new operator to support node merging during the matching process.

Proceedings ArticleDOI
26 Oct 2004
TL;DR: The aim of this work is to correct recognition and segmentation errors using lexical information from a lexicon, together with a new approach to automatically learn an edit distance specifically adapted to the properties of on-line handwritten word recognition.
Abstract: This paper presents an optimized lexical post-processing designed for handwritten word recognition. The aim of this work is to correct recognition and segmentation errors using lexical information from a lexicon. The presented lexical post-processing is based on two phases: in the first phase, the lexicon is organized into sub-lexicons to reduce the search space during the recognition process; the second phase develops a specific edit distance to identify the handwritten word using a selection of the sub-lexicons. The paper presents two original strategies of lexicon reduction and a new approach to automatically learn an edit distance specifically adapted to the properties of on-line handwritten word recognition. Experimental results are reported to compare the two lexicon reduction strategies, and first results emphasize the impact of the learning process of the new edit distance.

01 Jan 2004
TL;DR: A string comparator based on edit distance that uses variable edit-step costs derived from training data; its performance is compared with that of a comparator without variable edit-step costs and with the Jaro-Winkler string comparator used in the Census Bureau's record linkage software.
Abstract: We develop a string comparator based on edit distance that uses variable edit-step costs derived from training data. Using first and last name data from Census files, we compare the performance of this string comparator with one without variable edit-step costs and with the Jaro-Winkler string comparator, which is standardly used in the Census Bureau's record linkage software.

Patent
09 Feb 2004
TL;DR: In this paper, a process determines for a search string which, if any, of the strings in a text list have edit distance from the search string less than a threshold, using dynamic programming.
Abstract: A process determines for a search string which, if any, of the strings in a text list have edit distance from the search string less than a threshold. The process uses dynamic programming on a grid with search string characters corresponding to rows and text characters corresponding to columns. For each text string, computation proceeds by columns. If successive text strings share a prefix, then the columns corresponding to the prefix are re-used. If the minimum value in a column is at least the threshold, then the prefix corresponding to that and previous columns causes edit distance to be at least the threshold. So the computation for the present text is abandoned, and computations for any other texts that share the prefix are avoided.
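
A minimal sketch of the described process, under stated assumptions (the function name and the exact pruning policy are illustrative): texts are processed in sorted order so that shared prefixes are adjacent, columns computed for a shared prefix are reused, a column whose minimum reaches the threshold abandons the current text, and later texts sharing that failing prefix are skipped.

```python
def search_within_threshold(query, texts, threshold):
    """Return the texts whose edit distance to query is below threshold."""
    m = len(query)
    results = []
    cols = [list(range(m + 1))]      # DP column for the empty text prefix
    prev_text = ""
    skip_prefix = None
    for text in sorted(texts):
        if skip_prefix is not None and text.startswith(skip_prefix):
            continue                 # this prefix already forced the threshold
        skip_prefix = None
        shared = 0                   # length of the prefix shared with the previous text
        while shared < min(len(prev_text), len(text)) and prev_text[shared] == text[shared]:
            shared += 1
        del cols[shared + 1:]        # reuse the columns computed for the shared prefix
        abandoned = False
        for j in range(shared, len(text)):
            prev_col, col = cols[-1], [j + 1]
            for i in range(1, m + 1):
                col.append(min(prev_col[i] + 1,                               # insert text[j]
                               col[i - 1] + 1,                                # delete query[i-1]
                               prev_col[i - 1] + (query[i - 1] != text[j])))  # substitute/match
            cols.append(col)
            if min(col) >= threshold:   # no extension of this prefix can beat the threshold
                skip_prefix = text[:j + 1]
                abandoned = True
                break
        if not abandoned and cols[-1][m] < threshold:
            results.append(text)
        prev_text = text[:len(cols) - 1]   # the prefix the retained columns describe
    return results

print(search_within_threshold("color", ["colour", "colouring", "colored", "cat"], 2))
# ['colour']
```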

Proceedings Article
01 May 2004
TL;DR: This paper explores the link between legitimate translation variation and statistical measures of a word's salience within a given document, such as tf.idf scores, and shows that the use of such scores extends the N-gram distance measures in a way that allows us to accurately predict multiple quality parameters of the text.
Abstract: Automatic methods for MT evaluation are often based on the assumption that MT quality is related to some kind of distance between the evaluated text and a professional human translation (e.g., an edit distance or the precision of matched N-grams). However, independently produced human translations are necessarily different, conveying the same content by dissimilar means. Such legitimate translation variation is a serious problem for distance-based evaluation methods, because mismatches do not necessarily mean degradation in MT quality. In this paper we explore the link between legitimate translation variation and statistical measures of a word's salience within a given document, such as tf.idf scores. We show that the use of such scores extends the N-gram distance measures in a way that allows us to accurately predict multiple quality parameters of the text, such as translation adequacy and fluency. However, legitimate translation variation also reveals fundamental limits on the applicability of distance-based MT evaluation methods and on data-driven architectures for MT.

Book ChapterDOI
26 May 2004
TL;DR: In this article, an evolutionary approach was used to optimize the parameter values of cost functions of the edit distance for music performance annotation, and the validity of the optimized parameter settings was shown by assessing their error-percentage on a test set.
Abstract: In this paper we present an enhancement of edit distance based music performance annotation. The annotation captures musical expressivity not only in terms of timing deviations but also represents e.g. spontaneous note ornamentation. To reduce the number of errors in automatic performance annotation, some optimization is essential. We have taken an evolutionary approach to optimize the parameter values of cost functions of the edit distance. Automatic optimization is desirable since manual parameter tuning is unfeasible when more than a few performances are taken into account. The validity of the optimized parameter settings is shown by assessing their error-percentage on a test set.