
Showing papers on "Edit distance" published in 2003


Proceedings Article
09 Aug 2003
TL;DR: Using an open-source Java toolkit of name-matching methods, the authors experimentally compare string distance metrics on the task of matching entity names and find that the best-performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme.
Abstract: Using an open-source Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names. We investigate a number of different metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. Overall, the best-performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme, which was developed in the probabilistic record linkage community.

1,355 citations
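As an illustration of the Jaro-Winkler measure named above, here is a minimal Python sketch of the standard definition (an independent reconstruction for reference; the paper's own implementations live in its Java toolkit):

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: fraction of matched characters, penalized by transpositions."""
    if not s1 or not s2:
        return 1.0 if s1 == s2 else 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1, matched2 = [False] * len(s1), [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):  # greedy matching within the window
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, j = 0, 0  # count out-of-order matches (transpositions)
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            t += s1[i] != s2[j]
            j += 1
    return (m / len(s1) + m / len(s2) + (m - t // 2) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Winkler's variant boosts pairs sharing a common prefix (up to 4 characters)."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)

print(jaro_winkler("MARTHA", "MARHTA"))  # ~0.961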


Proceedings ArticleDOI
24 Aug 2003
TL;DR: This paper proposes to employ learnable text distance functions for each database field, and shows that such measures are capable of adapting to the specific notion of similarity that is appropriate for the field's domain.
Abstract: The problem of identifying approximately duplicate records in databases is an essential step for data cleaning and data integration processes. Most existing approaches have relied on generic or manually tuned distance metrics for estimating the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity. We propose to employ learnable text distance functions for each database field, and show that such measures are capable of adapting to the specific notion of similarity that is appropriate for the field's domain. We present two learnable text similarity measures suitable for this task: an extended variant of learnable string edit distance, and a novel vector-space based measure that employs a Support Vector Machine (SVM) for training. Experimental results on a range of datasets show that our framework can improve duplicate detection accuracy over traditional techniques.

1,020 citations
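As a rough sketch of the idea behind a learnable edit distance, the code below exposes the operation costs of the standard dynamic program as parameters that a learner could fit to labeled duplicate pairs; the paper's actual training procedure (and its per-character-pair cost tables) is not reproduced here:

```python
def weighted_edit_distance(a: str, b: str,
                           c_ins: float = 1.0, c_del: float = 1.0,
                           c_sub: float = 1.0) -> float:
    # standard dynamic program, but with the three operation costs exposed
    # as parameters; a trainable variant would fit them (or full
    # per-character-pair cost tables) to labeled duplicate/non-duplicate pairs
    m, n = len(a), len(b)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * c_del
    for j in range(1, n + 1):
        D[0][j] = j * c_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + c_del,
                          D[i][j - 1] + c_ins,
                          D[i - 1][j - 1] + (0.0 if a[i - 1] == b[j - 1] else c_sub))
    return D[m][n]
```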


01 Jan 2003
TL;DR: An open-source Java toolkit of methods for matching names and records is described and results obtained from using various string distance metrics on the task of matching entity names are summarized.
Abstract: We describe an open-source Java toolkit of methods for matching names and records. We summarize results obtained from using various string distance metrics on the task of matching entity names. These metrics include distance functions proposed by several different communities, such as edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. We then describe an extension to the toolkit which allows records to be compared. We discuss some issues involved in performing a similar comparison for record-matching techniques, and finally present results for some baseline record-matching algorithms that aggregate string comparisons between fields.

552 citations
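A hypothetical minimal version of a baseline record matcher that aggregates string comparisons between fields, as the abstract describes (the field names and the averaging rule are illustrative, not the toolkit's API; jaro_winkler refers to the sketch given earlier):

```python
def record_similarity(rec1: dict, rec2: dict, field_sims: dict) -> float:
    # average per-field string similarities into one record-level score;
    # field_sims maps a field name to a string-similarity function
    scores = [sim(rec1[f], rec2[f]) for f, sim in field_sims.items()]
    return sum(scores) / len(scores)

# e.g. record_similarity(r1, r2, {"name": jaro_winkler, "city": jaro_winkler})
```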


Proceedings ArticleDOI
30 Oct 2003
TL;DR: An efficient protocol for sequence comparisons of the edit-distance kind, such that neither party reveals anything about their private sequence to the other party (other than what can be inferred from the edit distance between their two sequences - which is unavoidable because computing that distance is the purpose of the protocol).
Abstract: We give an efficient protocol for sequence comparisons of the edit-distance kind, such that neither party reveals anything about their private sequence to the other party (other than what can be inferred from the edit distance between their two sequences - which is unavoidable because computing that distance is the purpose of the protocol). The amount of communication done by our protocol is proportional to the time complexity of the best-known algorithm for performing the sequence comparison. The problem of determining the similarity between two sequences arises in a large number of applications, in particular in bioinformatics. In these application areas, the edit distance is one of the most widely used notions of sequence similarity: it is the least-cost set of insertions, deletions, and substitutions required to transform one string into the other. The generalizations of edit distance that are solved by the same kind of dynamic programming recurrence relation as the one for edit distance cover an even wider domain of applications.

214 citations
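For reference, the recurrence that such a protocol evaluates privately is the classic quadratic-time dynamic program for edit distance; in the clear (without the cryptographic layer) it looks like this:

```python
def edit_distance(a: str, b: str) -> int:
    # Wagner-Fischer dynamic programming with a rolling row:
    # O(len(a) * len(b)) time, O(min(len(a), len(b))) space
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]

assert edit_distance("kitten", "sitting") == 3
```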


Journal ArticleDOI
Mehryar Mohri1
TL;DR: The edit-distance of two distributions over strings is defined and algorithms for computing it when these distributions are given by automata are presented, including the general algorithm of composition of weighted transducers combined with a single-source shortest-paths algorithm.
Abstract: The problem of computing the similarity between two sequences arises in many areas such as computational biology and natural language processing. A common measure of the similarity of two strings is their edit-distance, that is, the minimal cost of a series of symbol insertions, deletions, or substitutions transforming one string into the other. In several applications such as speech recognition or computational biology, the objects to compare are distributions over strings, i.e., sets of strings representing a range of alternative hypotheses with their associated weights or probabilities. We define the edit-distance of two distributions over strings and present algorithms for computing it when these distributions are given by automata. In the particular case where two sets of strings are given by unweighted automata, their edit-distance can be computed using the general algorithm of composition of weighted transducers combined with a single-source shortest-paths algorithm. In the general case, we show that...

126 citations


Proceedings ArticleDOI
09 Jun 2003
TL;DR: The algorithm for testing the edit distance works by recursively subdividing the strings A and B into smaller substrings and looking for pairs of substrings in A, B with small edit distance, and a lower bound of Ω(n^(α/2)) is shown on the query complexity of every algorithm that distinguishes pairs of strings with edit distance at most n^α from those with edit distance at least n/6.
Abstract: We show how to determine whether the edit distance between two given strings is small in sublinear time. Specifically, we present a test which, given two n-character strings A and B, runs in time o(n) and with high probability returns "CLOSE" if their edit distance is O(n^α), and "FAR" if their edit distance is Ω(n), where α is a fixed parameter less than 1. Our algorithm for testing the edit distance works by recursively subdividing the strings A and B into smaller substrings and looking for pairs of substrings in A, B with small edit distance. To do this, we query both strings at random places using a special technique for economizing on the samples which does not pick the samples independently and provides better query and overall complexity. As a result, our test runs in time O(n^(max(α/2, 2α − 1))) for any fixed α.

116 citations


01 Jan 2003
TL;DR: An approach for answer selection in a free form question answering task is described, representing both questions and candidate passages using dependency trees, and incorporating semantic information such as named entities in this representation.
Abstract: We describe an approach for answer selection in a free-form question answering task. In order to go beyond keyword-based matching in selecting answers to questions, one would like to incorporate both syntactic and semantic information in the question answering process. We achieve this goal by representing both questions and candidate passages using dependency trees, and incorporating semantic information such as named entities in this representation. The sentence that best answers a question is determined to be the one that minimizes the generalized edit distance between it and the question tree, computed via an approximate tree matching algorithm. We evaluate the approach on question-answer pairs taken from previous TREC Q/A competitions. Preliminary experiments show its potential by significantly outperforming common bag-of-word scoring methods.

95 citations


23 Sep 2003
TL;DR: The authors introduce a string-to-string distance measure which extends the edit distance by block transpositions as a constant-cost edit operation and demonstrate how this distance measure can be used as an evaluation criterion in machine translation.
Abstract: We introduce a string-to-string distance measure which extends the edit distance by block transpositions as a constant-cost edit operation. An algorithm for the calculation of this distance measure in polynomial time is presented. We then demonstrate how this distance measure can be used as an evaluation criterion in machine translation. The correlation between this evaluation criterion and human judgment is systematically compared with that of other automatic evaluation measures on two translation tasks. In general, like other automatic evaluation measures, the criterion shows low correlation at sentence level, but good correlation at system level.

92 citations


Journal ArticleDOI
25 Jul 2003
TL;DR: This paper extends El-Mabrouk's work to handle duplications as well as insertions and presents an alternate framework for computing (near) minimal edit sequences involving insertions, deletions, and inversions.
Abstract: As more and more genomes are sequenced, evolutionary biologists are becoming increasingly interested in evolution at the level of whole genomes, in scenarios in which the genome evolves through insertions, deletions, and movements of genes along its chromosomes. In the mathematical model pioneered by Sankoff and others, a unichromosomal genome is represented by a signed permutation of a multi-set of genes; Hannenhalli and Pevzner showed that the edit distance between two signed permutations of the same set can be computed in polynomial time when all operations are inversions. El-Mabrouk extended that result to allow deletions and a limited form of insertions (which forbids duplications). In this paper we extend El-Mabrouk's work to handle duplications as well as insertions and present an alternate framework for computing (near) minimal edit sequences involving insertions, deletions, and inversions. We derive an error bound for our polynomial-time distance computation under various assumptions and present preliminary experimental results that suggest that performance in practice may be excellent, within a few percent of the actual distance.

84 citations


Dissertation
01 Jan 2003
TL;DR: The embeddings are shown to be practical, with a series of large scale experiments which demonstrate that given only a small space, approximate solutions to several similarity and clustering problems can be found that are as good as or better than those found with prior methods.
Abstract: Sequences represent a large class of fundamental objects in Computer Science: sets, strings, vectors and permutations are all considered to be sequences. Distances between sequences measure their similarity, and computations based on distances are ubiquitous: either to compute the distance, or to use distance computation as part of a more complex problem. This thesis takes a very specific approach to solving questions of sequence distance: sequences are embedded into other distance measures, so that distance in the new space approximates the original distance. This allows the solution of a variety of problems, including: fast computation of short sketches in a variety of computing models, which allow sequences to be compared in constant time and space irrespective of the size of the original sequences; approximate nearest neighbor and clustering problems, significantly faster than the naive exact solutions; algorithms to find approximate occurrences of pattern sequences in long text sequences in near linear time; and efficient communication schemes to approximate the distance between, and exchange, sequences in close to the optimal amount of communication. Solutions are given for these problems for a variety of distances, including fundamental distances on sets and vectors; distances inspired by biological problems for permutations; and certain text editing distances for strings. Many of these embeddings are computable in a streaming model where the data is too large to store in memory, and instead has to be processed as and when it arrives, piece by piece. The embeddings are also shown to be practical, with a series of large-scale experiments which demonstrate that given only a small space, approximate solutions to several similarity and clustering problems can be found that are as good as or better than those found with prior methods.

72 citations
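Among sketching techniques of the kind the thesis studies, set-resemblance sketches are the easiest to illustrate. The MinHash construction below (a classic example of the sketching idea, not necessarily the thesis's own construction) lets two sets be compared in time proportional to the signature length k, independent of set size:

```python
import random

def minhash_signature(items: set, k: int = 64, seed: int = 0) -> list:
    # k hash functions simulated by salting; each coordinate keeps the
    # minimum hash value, so the signature has fixed size regardless of |items|
    # (Python's hash() is consistent within one process, which suffices here)
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(k)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

def estimated_jaccard(sig1: list, sig2: list) -> float:
    # P[min-hashes agree] equals the Jaccard resemblance |A ∩ B| / |A ∪ B|
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

a = set("the quick brown fox".split())
b = set("the quick brown dog".split())
print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))  # ~0.6
```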


01 Jan 2003
TL;DR: A general framework for defining distance functions for monophonic music sequences is presented and transposition invariant versions of the edit distance and the Hamming distance are constructed directly, without an explicit conversion of the sequences into interval encoding.
Abstract: A general framework for defining distance functions for monophonic music sequences is presented. The distance functions given by the framework have a structure similar to that of the well-known edit distance (Levenshtein distance), based on local transformations, and can be evaluated using dynamic programming. The costs of the local transformations are allowed to be context-sensitive, a natural property when dealing with music. In order to understand transposition invariance in music comparison, the effect of interval encoding on some distance functions is analyzed. Then transposition invariant versions of the edit distance and the Hamming distance are constructed directly, without an explicit conversion of the sequences into interval encoding. A transposition invariant generalization of the Longest Common Subsequence measure is introduced and an efficient evaluation algorithm is developed. Finally, the necessary modifications of the distance functions for music information retrieval are sketched.
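The paper builds transposition invariance directly into the distance computation; for contrast, a brute-force baseline simply minimizes an ordinary edit distance over all transpositions t that align at least one pair of pitches (only such t can create matches, so the minimum lies among them). A sketch of that baseline:

```python
def seq_edit_distance(a, b):
    # unit-cost Levenshtein distance over integer sequences (e.g. MIDI pitches)
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def transposition_invariant_distance(A, B):
    # only transpositions that make some pair of pitches coincide can beat
    # the no-match baseline max(|A|, |B|)
    candidates = {b - a for a in A for b in B}
    baseline = max(len(A), len(B))
    return min((seq_edit_distance([a + t for a in A], B) for t in candidates),
               default=baseline)

print(transposition_invariant_distance([60, 62, 64], [65, 67, 69]))  # 0: pure transposition
```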

Book ChapterDOI
25 Jun 2003
TL;DR: The two dynamic programming algorithms are shown to be instances of a more general framework of cover strategies, and this analysis yields a new tree edit distance algorithm that is optimal among cover strategies.
Abstract: In this article, we study the behaviour of dynamic programming methods for the tree edit distance problem, such as [4] and [2]. We show that those two algorithms may be described in a more general framework of cover strategies. This analysis allows us to define a new tree edit distance algorithm that is optimal for cover strategies.

Proceedings ArticleDOI
12 Jan 2003
TL;DR: A low-distortion embedding of edit distance into an l_p norm would be very useful for several applications.
Abstract: The edit distance (also called the Levenshtein metric) between two strings is the minimum number of operations (insertions, deletions and character substitutions) needed to transform one string into another. This distance is of key importance in computational biology, as well as text processing and other areas. Algorithms for problems involving this metric have been extensively investigated. In particular, the quadratic-time dynamic programming algorithm for computing the edit distance between two strings is one of the most investigated and used algorithms in computational biology. Recently, a new approach to problems involving edit distance has been proposed. Its basic component is the construction of a mapping f (called an embedding), which maps any string s into a vector f(s) ∈ R^d, so that for any pair of strings s, s', the l_p distance ||f(s) − f(s')||_p is approximately equal to the edit distance between s and s'. The approximation factor is called the distortion of the embedding f. A low-distortion embedding of edit distance into an l_p norm would be very useful for several applications.

Journal ArticleDOI
TL;DR: In this paper, the problem of computing tree edit distance was transformed into a series of maximum weight clique problems, and a relaxation labeling method was used to find an approximation to the tree edit distance.

Proceedings ArticleDOI
01 Mar 2003
TL;DR: It is shown that several distance measures, such as the compression distance and the weighted character edit distance, are almost metrics, and how to modify VP trees to improve their performance on string data, providing tradeoffs between search time and space.
Abstract: In many database applications involving string data, it is common to have near neighbor queries (asking for strings that are similar to a query string) or nearest neighbor queries (asking for strings that are most similar to a query string). The similarity between strings is defined in terms of a distance function determined by the application domain. The most popular string distance measures are based on a (possibly weighted) count of (i) character edit or (ii) block edit operations needed to transform one string into the other. Examples include the Levenshtein edit distance and the recently introduced compression distance. The main goal is to develop efficient near(est) neighbor search tools that work for both character and block edit distances. Our premise is that distance-based indexing methods, which were originally designed for metric distances, can be modified for string distance measures, provided that they form almost metrics. We show that several distance measures, such as the compression distance and the weighted character edit distance, are almost metrics. In order to analyze the performance of distance-based indexing methods (in particular VP trees) for strings, we then develop a model based on the distribution of pairwise distances. Based on this model, we show how to modify VP trees to improve their performance on string data, providing tradeoffs between search time and space. We test our theoretical results on synthetic data sets and protein strings.
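As a baseline for the distance-based indexing the paper modifies, a plain vantage-point tree supporting near-neighbor range queries might look as follows (a minimal sketch assuming the distance satisfies the triangle inequality; the paper's distribution-based tuning is not reproduced):

```python
import random

class VPTree:
    """Minimal vantage-point tree for range queries under a distance function d."""
    def __init__(self, points, d):
        self.d = d
        self.root = self._build(list(points))

    def _build(self, pts):
        if not pts:
            return None
        vp = pts.pop(random.randrange(len(pts)))  # pick a vantage point
        if not pts:
            return (vp, 0.0, None, None)
        dists = [self.d(vp, p) for p in pts]
        mu = sorted(dists)[len(dists) // 2]       # median distance splits the set
        inner = [p for p, dp in zip(pts, dists) if dp < mu]
        outer = [p for p, dp in zip(pts, dists) if dp >= mu]
        return (vp, mu, self._build(inner), self._build(outer))

    def range_query(self, q, r):
        """Return all indexed points within distance r of q."""
        found, stack = [], [self.root]
        while stack:
            node = stack.pop()
            if node is None:
                continue
            vp, mu, inner, outer = node
            dq = self.d(q, vp)
            if dq <= r:
                found.append(vp)
            if dq - r < mu:    # query ball may reach the inner partition
                stack.append(inner)
            if dq + r >= mu:   # query ball may reach the outer partition
                stack.append(outer)
        return found
```

Using the edit_distance sketch given earlier, VPTree(words, edit_distance).range_query("color", 2) would return all indexed words within edit distance 2 of the query.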

Book ChapterDOI
08 Oct 2003
TL;DR: This paper establishes the best method among six baseline matching methods for each language pair and tests novel matching methods based on binary digrams formed of both adjacent and non-adjacent characters of words, which consistently outperformed all baseline methods.
Abstract: Untranslatable query keys pose a problem in dictionary-based cross-language information retrieval (CLIR). One solution consists of using approximate string matching methods for finding the spelling variants of the source key among the target database index. In such a setting, it is important to select a matching method suited especially for CLIR. This paper focuses on comparing the effectiveness of several matching methods in a cross-lingual setting. Search words from five domains were expressed in six languages (French, Spanish, Italian, German, Swedish, and Finnish). The target data consisted of the index of an English full-text database. In this setting, we first established the best method among six baseline matching methods for each language pair. Secondly, we tested novel matching methods based on binary digrams formed of both adjacent and non-adjacent characters of words. The latter methods consistently outperformed all baseline methods.
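A hypothetical matcher in the spirit of the paper's digram methods (the paper's exact gram classification is not reproduced): build binary digrams from both adjacent (skip 0) and non-adjacent (skip 1) character pairs and score word pairs with the Dice coefficient:

```python
def digrams(word: str, skips=(0, 1)) -> set:
    # character pairs with 'skip' characters between them; skip 0 gives the
    # usual adjacent digrams, skip 1 the non-adjacent ones
    return {(word[i], word[i + s + 1])
            for s in skips for i in range(len(word) - s - 1)}

def dice_similarity(a: str, b: str) -> float:
    ga, gb = digrams(a), digrams(b)
    return 2 * len(ga & gb) / (len(ga) + len(gb)) if ga or gb else 1.0

print(dice_similarity("colour", "color"))  # 0.75 despite the spelling difference
```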

Journal ArticleDOI
TL;DR: This paper presents a formalism showing that graph probing provides a lower bound on the true edit distance between two graphs, and examines in detail the graph probing paradigm first put forth in the context of table understanding and later extended to HTML-coded Web pages.
Abstract: Finding efficient, effective ways to compare graphs arising from recognition processes with their corresponding ground-truth graphs is an important step toward more rigorous performance evaluation. In this paper, we examine in detail the graph probing paradigm we first put forth in the context of our work on table understanding and later extended to HTML-coded Web pages. We present a formalism showing that graph probing provides a lower bound on the true edit distance between two graphs. From an empirical standpoint, the results of two simulation studies and an experiment using scanned pages show that graph probing correlates well with the latter measure. Moreover, our technique is very fast; graphs with tens or hundreds of thousands of vertices can be compared in mere seconds. Ease of implementation, scalability, and speed of execution make graph probing an attractive alternative for graph comparison.
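A minimal sketch of the probing idea, assuming the simplest possible probe (vertex out-degree; the paper's probes are richer): count how often each probe value occurs in each graph and take the L1 difference of the histograms, which relates to the true edit distance as a lower bound up to a constant factor:

```python
from collections import Counter

def degree_histogram(adj: dict) -> Counter:
    # probe every vertex by its out-degree; adj maps vertex -> list of neighbors
    return Counter(len(nbrs) for nbrs in adj.values())

def probing_distance(adj1: dict, adj2: dict) -> int:
    h1, h2 = degree_histogram(adj1), degree_histogram(adj2)
    return sum(abs(h1[k] - h2[k]) for k in set(h1) | set(h2))

g1 = {1: [2, 3], 2: [3], 3: []}
g2 = {"a": ["b"], "b": ["c"], "c": []}
print(probing_distance(g1, g2))  # 2: a degree-2 vertex differs from a degree-1 vertex
```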

Proceedings Article
13 Oct 2003
TL;DR: The aim is to convert graphs to string sequences so that standard string edit distance techniques can be used; a graph spectral seriation method converts the adjacency matrix into a string or sequence order.
Abstract: This paper is concerned with computing graph edit distance. One of the criticisms that can be leveled at existing methods for computing graph edit distance is that they lack the formality and rigour of the computation of string edit distance. Hence, our aim is to convert graphs to string sequences so that standard string edit distance techniques can be used. To do this we use a graph spectral seriation method to convert the adjacency matrix into a string or sequence order. We pose the problem of graph matching as maximum a posteriori probability alignment of the seriation sequences for pairs of graphs. This treatment leads to an expression for the edit costs. We compute the edit distance by finding the sequence of string edit operations which minimises the cost of the path traversing the edit lattice. The edit costs are defined in terms of the a posteriori probability of visiting a site on the lattice. We demonstrate the method with results on a data-set of Delaunay graphs.

Book ChapterDOI
01 Jan 2003
TL;DR: It is shown how the introduction of the MSSM algorithm based on dynamic programming techniques leads to a real gain in recall and precision, and allows the extension of TM towards rudimentary, yet useful Example-Based Machine Translation (EBMT) that is called ‘Shallow Translation’.
Abstract: The TELA structure — a set of layered and linked lattices — the notion of similarity between TELA structures based on the notion of Edit Distance, and the MSSM algorithm based on dynamic programming techniques are all introduced in order to formalize Translation Memories (TM). We show how this approach leads to a real gain in recall and precision, and allows the extension of TM towards rudimentary, yet useful Example-Based Machine Translation (EBMT) that we call ‘Shallow Translation’.

Proceedings Article
01 Mar 2003
TL;DR: In this paper, the authors present an algorithm that runs in time O(⌈d/w⌉m + ⌈n/w⌉σ) or O(⌈d/w⌉n + ⌈m/w⌉σ), where d is the edit distance, m and n are the lengths of the two strings, w is the computer word size and σ is the size of the alphabet.
Abstract: The edit distance between strings A and B is defined as the minimum number of edit operations needed in converting A into B or vice versa. The Levenshtein edit distance allows three types of operations: an insertion, a deletion or a substitution of a character. The Damerau edit distance allows the previous three plus in addition a transposition between two adjacent characters. To the best of our knowledge the best current practical algorithms for computing these edit distances run in time O(dm) and O(⌈m/w⌉(n + σ)), where d is the edit distance between the two strings, m and n are their lengths (m ≤ n), w is the computer word size and σ is the size of the alphabet. In this paper we present an algorithm that runs in time O(⌈d/w⌉m + ⌈n/w⌉σ) or O(⌈d/w⌉n + ⌈m/w⌉σ). The structure of the algorithm is such that, in practice, it is mostly suitable for testing whether the edit distance between two strings is within some pre-determined error threshold. We also present some initial test results with thresholded edit distance computation. In them our algorithm works faster than the original algorithm of Myers.
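The paper's algorithm is bit-parallel and beyond a short sketch, but the thresholded question it targets ("is the edit distance at most k?") can be illustrated with the classical banded dynamic program, which fills only a diagonal band of width 2k + 1 and treats everything outside it as exceeding the threshold:

```python
def within_edit_distance(a: str, b: str, k: int) -> bool:
    """Check edit_distance(a, b) <= k via a banded DP in O(k * min(m, n)) time.

    A simple stand-in for the paper's bit-parallel thresholded algorithm.
    """
    m, n = len(a), len(b)
    if abs(m - n) > k:
        return False  # length difference alone already exceeds the threshold
    INF = k + 1  # any value above k behaves as infinity here
    prev = [j if j <= k else INF for j in range(n + 1)]
    for i in range(1, m + 1):
        curr = [INF] * (n + 1)
        if i <= k:
            curr[0] = i
        for j in range(max(1, i - k), min(n, i + k) + 1):
            curr[j] = min(prev[j] + 1,
                          curr[j - 1] + 1,
                          prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = curr
    return prev[n] <= k

assert within_edit_distance("kitten", "sitting", 3)
assert not within_edit_distance("kitten", "sitting", 2)
```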

Journal ArticleDOI
TL;DR: This paper compares the behaviour of AESA and LAESA when string and tree edit distances are used and finds that the average number of distances computed by these algorithms is very low and does not depend on the number of prototypes in the training set.

Journal ArticleDOI
TL;DR: This work extends an existing algorithm for the LCS to the Levenshtein distance, achieving O(m'n + n'm) complexity, and further extends this algorithm to a weighted edit distance model, where the weights of the three basic edit operations can be chosen arbitrarily.
Abstract: We focus on the problem of approximate matching of strings that have been compressed using run-length encoding. Previous studies have concentrated on the problem of computing the longest common subsequence (LCS) between two strings of length m and n , compressed to m' and n' runs. We extend an existing algorithm for the LCS to the Levenshtein distance achieving O(m'n+n'm) complexity. Furthermore, we extend this algorithm to a weighted edit distance model, where the weights of the three basic edit operations can be chosen arbitrarily. This approach also gives an algorithm for approximate searching of a pattern of m letters (m' runs) in a text of n letters (n' runs) in O(mm'n') time. Then we propose improvements for a greedy algorithm for the LCS, and conjecture that the improved algorithm has O(m'n') expected case complexity. Experimental results are provided to support the conjecture.

Journal ArticleDOI
TL;DR: This work answers the uniqueness problem of whether two different functions may share the same distance transform in a generality completely sufficient for all practical applications in imaging sciences; the full-scale problem remains open.

01 Jan 2003
TL;DR: This paper presents a framework for improving duplicate detection using trainable measures of textual similarity, and proposes to employ learnable text distance functions for each data field, and introduces an extended variant of learnable string edit distance based on an Expectation-Maximization (EM) training algorithm.
Abstract: The problem of identifying approximately duplicate objects in databases is an essential step for the information integration process. Most existing approaches have relied on generic or manually tuned distance metrics for estimating the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity. We propose to employ learnable text distance functions for each data field, and introduce an extended variant of learnable string edit distance based on an Expectation-Maximization (EM) training algorithm. Experimental results on a range of datasets show that this similarity metric is capable of adapting to the specific notions of similarity that are appropriate for different domains. Our overall system, MARLIN, utilizes support vector machines to combine multiple similarity metrics, which are shown to perform better than ensembles of decision trees, which were employed for this task in previous work.

Book ChapterDOI
25 Jun 2003
TL;DR: This work provides an answer to the question of whether the MEDIAN STRING problem is NP-complete for finite and even binary alphabets, and gives the complexity of the related CENTRE STRING problem.
Abstract: Given a finite set of strings, the MEDIAN STRING problem consists in finding a string that minimizes the sum of the distances to the strings in the set. Approximations of the median string are used in a very broad range of applications where one needs a representative string that summarizes common information to the strings of the set. It is the case in Classification, in Speech and Pattern Recognition, and in Computational Biology. In the latter, MEDIAN STRING is related to the key problem of Multiple Alignment. In the recent literature, one finds a theorem stating the NP-completeness of the MEDIAN STRING for unbounded alphabets. However, in the above mentioned areas, the alphabet is often finite. Thus, it remains a crucial question whether the MEDIAN STRING problem is NP-complete for finite and even binary alphabets. In this work, we provide an answer to this question and also give the complexity of the related CENTRE STRING problem. Moreover, we study the parametrized complexity of both problems with respect to the number of input strings.
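Because the exact median string is hard, applications often fall back on the "set median", which restricts the minimization to strings already in the set and is computable in polynomial time. A sketch, usable with any string distance d (e.g. the edit_distance sketch given earlier):

```python
def set_median(strings, d):
    # the input string minimizing the total distance to all others, a common
    # polynomial-time surrogate for the (generally NP-hard) median string
    return min(strings, key=lambda s: sum(d(s, t) for t in strings))

# e.g. set_median(["caat", "cart", "chat"], edit_distance) -> "caat"
```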

Journal ArticleDOI
TL;DR: A dynamic programming algorithm to compare two quotiented trees using a constrained edit distance, the core of which is an adaptation of an algorithm recently proposed by Zhang for comparing unordered rooted trees.
Abstract: In this paper we propose a dynamic programming algorithm to compare two quotiented trees using a constrained edit distance. A quotiented tree is a tree defined with an additional equivalence relation on vertices, such that the quotient graph is also a tree. The core of the method relies on an adaptation of an algorithm recently proposed by Zhang for comparing unordered rooted trees. This method is currently being used in plant architecture modelling to quantify different types of variability between plants represented by quotiented trees.

Book ChapterDOI
14 Apr 2003
TL;DR: The underlying principles of similarity joins are studied and three categories of implementation strategies based on filtering, partitioning, or similarity range searching are suggested; an application of the D-index is studied to implement the most promising alternative of range searching.
Abstract: Similarity join in distance spaces constrained by the metric postulates is the necessary complement of the more famous similarity range and nearest neighbor search primitives. However, the quadratic computational complexity of similarity joins prevents their application to large data collections. We first study the underlying principles of such joins and suggest three categories of implementation strategies based on filtering, partitioning, or similarity range searching. Then we study an application of the D-index to implement the most promising alternative of range searching. Though this approach too is not able to eliminate the intrinsic quadratic complexity of similarity joins, significant performance improvements are confirmed by experiments.
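Of the three strategy categories, filtering is the simplest to sketch: with one precomputed distance per object to a pivot p, the triangle inequality gives |d(p, x) − d(p, y)| ≤ d(x, y), so many pairs can be discarded without evaluating the expensive distance. An illustrative baseline (not the D-index), assuming d is a metric and objects are hashable:

```python
from itertools import combinations

def similarity_join(points, d, theta, pivot=None):
    """All pairs (x, y) with d(x, y) <= theta, using one pivot as a cheap filter."""
    pts = list(points)
    p = pivot if pivot is not None else pts[0]
    dp = {x: d(p, x) for x in pts}            # one distance per object
    result = []
    for x, y in combinations(pts, 2):
        if abs(dp[x] - dp[y]) > theta:
            continue                          # filtered by the triangle inequality
        if d(x, y) <= theta:
            result.append((x, y))
    return result
```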

Journal Article
TL;DR: In this paper, the problem of computing the transposition invariant distance for various distance functions d, that are different versions of the edit distance, was studied, and algorithms whose time complexities are close to the known upper bounds were given.
Abstract: Given strings A and B over an alphabet Σ ⊆ U, where U is some numerical universe closed under addition and subtraction, and a distance function d(A, B) that gives the score of the best (partial) matching of A and B, the transposition invariant distance is min_{t ∈ U} d(A + t, B), where A + t = (a_1 + t)(a_2 + t) ⋯ (a_m + t). We study the problem of computing the transposition invariant distance for various distance (and similarity) functions d, that are different versions of the edit distance. For all these problems we give algorithms whose time complexities are close to the known upper bounds without transposition invariance. In particular, we show how sparse dynamic programming can be used to solve transposition invariant problems.

01 Jan 2003
TL;DR: This work proposes two primitives: a fuzzy extractor extracts nearly uniform randomness R from its biometric input; the extraction is error-tolerant in the sense that R will be the same even if the input changes, as long as it remains reasonably close to the original.
Abstract: We provide formal definitions and efficient secure techniques for
• turning biometric information into keys usable for any cryptographic application, and
• reliably and securely authenticating biometric data.
Our techniques apply not just to biometric information, but to any keying material that, unlike traditional cryptographic keys, is (1) not reproducible precisely and (2) not distributed uniformly. We propose two primitives: a fuzzy extractor extracts nearly uniform randomness R from its biometric input; the extraction is error-tolerant in the sense that R will be the same even if the input changes, as long as it remains reasonably close to the original. Thus, R can be used as a key in any cryptographic application. A fuzzy fingerprint produces public information about its biometric input w that does not reveal w, and yet allows exact recovery of w given another value that is close to w. Thus, it can be used to reliably reproduce error-prone biometric inputs without incurring the security risk inherent in storing them. In addition to formally introducing our new primitives, we provide nearly optimal constructions of both primitives for various measures of “closeness” of input data, such as Hamming distance, edit distance, and set difference.

Proceedings Article
27 Feb 2003
TL;DR: In this article, the problem of computing the transposition invariant distance for various distance functions d, that are different versions of the edit distance, was studied, and algorithms whose time complexities are close to the known upper bounds were given.
Abstract: Given strings A and B over an alphabet Σ ⊆ U, where U is some numerical universe closed under addition and subtraction, and a distance function d(A, B) that gives the score of the best (partial) matching of A and B, the transposition invariant distance is min_{t ∈ U} d(A + t, B), where A + t = (a_1 + t)(a_2 + t) ⋯ (a_m + t). We study the problem of computing the transposition invariant distance for various distance (and similarity) functions d, that are different versions of the edit distance. For all these problems we give algorithms whose time complexities are close to the known upper bounds without transposition invariance. In particular, we show how sparse dynamic programming can be used to solve transposition invariant problems.