
Showing papers on "Edit distance" published in 2005


Proceedings ArticleDOI
14 Jun 2005
TL;DR: Analysis and comparison of EDR with other popular distance functions, such as Euclidean distance, Dynamic Time Warping (DTW), Edit distance with Real Penalty (ERP), and Longest Common Subsequences (LCSS), indicate that EDR is more robust than Euclidean distance, DTW and ERP, and it is on average 50% more accurate than LCSS.
Abstract: An important consideration in similarity-based retrieval of moving object trajectories is the definition of a distance function. The existing distance functions are usually sensitive to noise, shifts and scaling of data that commonly occur due to sensor failures, errors in detection techniques, disturbance signals, and different sampling rates. Cleaning data to eliminate these is not always possible. In this paper, we introduce a novel distance function, Edit Distance on Real sequence (EDR) which is robust against these data imperfections. Analysis and comparison of EDR with other popular distance functions, such as Euclidean distance, Dynamic Time Warping (DTW), Edit distance with Real Penalty (ERP), and Longest Common Subsequences (LCSS), indicate that EDR is more robust than Euclidean distance, DTW and ERP, and it is on average 50% more accurate than LCSS. We also develop three pruning techniques to improve the retrieval efficiency of EDR and show that these techniques can be combined effectively in a search, increasing the pruning power significantly. The experimental results confirm the superior efficiency of the combined methods.
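The EDR recurrence is a small variation on classic edit distance: two points match at cost 0 when they are within a threshold, and every other operation costs 1. A minimal sketch for one-dimensional trajectories, assuming a matching threshold `eps` (the paper handles multi-dimensional trajectories; names here are ours):

```python
def edr(s, t, eps=0.25):
    """Edit Distance on Real sequence (EDR) for 1-D trajectories.

    Two points "match" at cost 0 when they differ by at most eps;
    substitutions, insertions and deletions each cost 1.
    """
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all remaining points of s
    for j in range(n + 1):
        d[0][j] = j          # insert all remaining points of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 0 if abs(s[i - 1] - t[j - 1]) <= eps else 1
            d[i][j] = min(d[i - 1][j - 1] + match,   # match / substitute
                          d[i - 1][j] + 1,           # delete s[i-1]
                          d[i][j - 1] + 1)           # insert t[j-1]
    return d[m][n]
```

Because each element contributes a 0/1 cost, a single noisy outlier changes the distance by at most 1, which is the robustness property the paper exploits.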

1,225 citations


Journal ArticleDOI
TL;DR: This work surveys the problem of comparing labeled trees based on simple local operations of deleting, inserting, and relabeling nodes and presents one or more of the central algorithms for solving the problem.

831 citations


Book ChapterDOI
02 Nov 2005
TL;DR: A family of word similarity measures based on n-grams are formulated, and the results of experiments suggest that the new measures outperform their unigram equivalents.
Abstract: In many applications, it is necessary to algorithmically quantify the similarity exhibited by two strings composed of symbols from a finite alphabet. Numerous string similarity measures have been proposed. Particularly well-known measures are based on edit distance and the length of the longest common subsequence. We develop a notion of n-gram similarity and distance. We show that edit distance and the length of the longest common subsequence are special cases of n-gram distance and similarity, respectively. We provide formal, recursive definitions of n-gram similarity and distance, together with efficient algorithms for computing them. We formulate a family of word similarity measures based on n-grams, and report the results of experiments that suggest that the new measures outperform their unigram equivalents.
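In the same spirit (a sketch, not the paper's exact recursive definitions): compare the two strings n-gram by n-gram, charging an aligned pair the fraction of positions where the grams disagree. With n = 1 this collapses to ordinary unit-cost edit distance:

```python
def ngram_distance(x, y, n=2):
    """Sketch of an n-gram edit distance.

    Strings are front-padded, decomposed into overlapping n-grams, and
    aligned by the usual edit-distance DP; the substitution cost of two
    n-grams is the fraction of positions where they differ.
    """
    pad = "\0" * (n - 1)
    x, y = pad + x, pad + y
    gx = [x[i:i + n] for i in range(len(x) - n + 1)]
    gy = [y[i:i + n] for i in range(len(y) - n + 1)]
    m, k = len(gx), len(gy)
    d = [[0.0] * (k + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(k + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, k + 1):
            cost = sum(a != b for a, b in zip(gx[i - 1], gy[j - 1])) / n
            d[i][j] = min(d[i - 1][j - 1] + cost,  # align two n-grams
                          d[i - 1][j] + 1,         # delete an n-gram
                          d[i][j - 1] + 1)         # insert an n-gram
    return d[m][k]
```

With `n=1` each "n-gram" is a single character and the fractional cost becomes the usual 0/1 substitution cost.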

296 citations


Journal ArticleDOI
TL;DR: The aim is to convert graphs to string sequences so that string matching techniques can be used and to compute the edit distance by finding the sequence of string edit operations which minimizes the cost of the path traversing the edit lattice.
Abstract: This paper is concerned with computing graph edit distance. One of the criticisms that can be leveled at existing methods for computing graph edit distance is that they lack some of the formality and rigor of the computation of string edit distance. Hence, our aim is to convert graphs to string sequences so that string matching techniques can be used. To do this, we use a graph spectral seriation method to convert the adjacency matrix into a string or sequence order. We show how the serial ordering can be established using the leading eigenvector of the graph adjacency matrix. We pose the problem of graph-matching as a maximum a posteriori probability (MAP) alignment of the seriation sequences for pairs of graphs. This treatment leads to an expression in which the edit cost is the negative logarithm of the a posteriori sequence alignment probability. We compute the edit distance by finding the sequence of string edit operations which minimizes the cost of the path traversing the edit lattice. The edit costs are determined by the components of the leading eigenvectors of the adjacency matrix and by the edge densities of the graphs being matched. We demonstrate the utility of the edit distance on a number of graph clustering problems.
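The seriation step can be sketched in a few lines of linear algebra, assuming a symmetric 0/1 adjacency matrix (the paper's full method additionally enforces edge connectivity along the ordering):

```python
import numpy as np

def spectral_seriation(adj):
    """Order nodes by the leading eigenvector of the adjacency matrix.

    Returns node indices in decreasing order of their component in the
    eigenvector of the largest eigenvalue (the Perron vector for a
    connected graph).
    """
    vals, vecs = np.linalg.eigh(np.asarray(adj, dtype=float))
    lead = vecs[:, np.argmax(vals)]
    if lead.sum() < 0:            # eigenvector sign is arbitrary; fix it
        lead = -lead
    return list(np.argsort(-lead))
```

The resulting node order turns each graph's label sequence into a string, after which standard string edit distance machinery applies.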

191 citations


Proceedings ArticleDOI
14 Jun 2005
TL;DR: This paper proposes to transform tree-structured data into an approximate numerical multidimensional vector which encodes the original structure information and proves that the L1 distance of the corresponding vectors, whose computational complexity is O(|T1| + |T2|), forms a lower bound for the edit distance between trees.
Abstract: Tree-structured data are becoming ubiquitous nowadays and manipulating them based on similarity is essential for many applications. The generally accepted similarity measure for trees is the edit distance. Although similarity search has been extensively studied, searching for similar trees is still an open problem due to the high complexity of computing the tree edit distance. In this paper, we propose to transform tree-structured data into an approximate numerical multidimensional vector which encodes the original structure information. We prove that the L1 distance of the corresponding vectors, whose computational complexity is O(|T1| + |T2|), forms a lower bound for the edit distance between trees. Based on the theoretical analysis, we describe a novel algorithm which embeds the proposed distance into a filter-and-refine framework to process similarity search on tree-structured data. The experimental results show that our algorithm reduces dramatically the distance computation cost. Our method is especially suitable for accelerating similarity query processing on large trees in massive datasets.
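To illustrate the flavor of such a lower bound with a much simpler embedding than the paper's (here only the node-label histogram, a hypothetical stand-in for the paper's structure-encoding vector): one edit operation changes the label histogram by at most 2 in L1 norm, so half the histograms' L1 distance never exceeds the tree edit distance.

```python
from collections import Counter

def label_histogram_lb(labels1, labels2):
    """Half the L1 distance between node-label histograms.

    A relabeling moves one unit of mass between two histogram bins
    (L1 change <= 2); an insertion or deletion changes one bin by 1.
    Hence this quantity lower-bounds the unit-cost tree edit distance.
    """
    h1, h2 = Counter(labels1), Counter(labels2)
    l1 = sum(abs(h1[k] - h2[k]) for k in set(h1) | set(h2))
    return l1 / 2
```

The paper's vector additionally encodes structure, giving a tighter bound, but the filter-and-refine usage is the same: discard candidate trees whose cheap lower bound already exceeds the query threshold.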

174 citations


Proceedings ArticleDOI
06 Oct 2005
TL;DR: This paper investigates using the Expectation Maximization algorithm to learn edit distance weights directly from search query logs, without relying on a corpus of paired words.
Abstract: Applying the noisy channel model to search query spelling correction requires an error model and a language model. Typically, the error model relies on a weighted string edit distance measure. The weights can be learned from pairs of misspelled words and their corrections. This paper investigates using the Expectation Maximization algorithm to learn edit distance weights directly from search query logs, without relying on a corpus of paired words.
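The error model underneath is a weighted string edit distance; only the per-operation cost functions, which the paper learns with EM, differ from the textbook recurrence. A sketch with caller-supplied cost functions (names are ours):

```python
def weighted_edit_distance(s, t, sub_cost, ins_cost, del_cost):
    """Weighted edit distance with per-character(-pair) costs."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost(s[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost(t[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),
                d[i - 1][j] + del_cost(s[i - 1]),
                d[i][j - 1] + ins_cost(t[j - 1]))
    return d[m][n]
```

With unit costs this reduces to plain Levenshtein distance; in the paper the costs come from the error model trained on query-log data.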

120 citations


ReportDOI
26 Jul 2005
TL;DR: This paper presents discriminative string-edit CRFs, a finite-state conditional random field model for edit sequences between strings, trained on both positive and negative instances of string pairs.
Abstract: The need to measure sequence similarity arises in information extraction, object identity, data mining, biological sequence analysis, and other domains. This paper presents discriminative string-edit CRFs, a finite-state conditional random field model for edit sequences between strings. Conditional random fields have advantages over generative approaches to this problem, such as pair HMMs or the work of Ristad and Yianilos, because as conditionally-trained methods, they enable the use of complex, arbitrary actions and features of the input strings. As in generative models, the training data does not have to specify the edit sequences between the given string pairs. Unlike generative models, however, our model is trained on both positive and negative instances of string pairs. We present positive experimental results on several data sets.

120 citations


01 Jan 2005
TL;DR: A novel automatic sentence segmentation method for evaluating machine translation output with possibly erroneous sentence boundaries that efficiently produces an optimal automatic segmentation of the hypotheses and thus allows application of existing well-established evaluation measures.
Abstract: This paper presents a novel automatic sentence segmentation method for evaluating machine translation output with possibly erroneous sentence boundaries. The algorithm can process translation hypotheses with segment boundaries which do not correspond to the reference segment boundaries, or a completely unsegmented text stream. Thus, the method is especially useful for evaluating translations of spoken language. The evaluation procedure takes advantage of the edit distance algorithm and is able to handle multiple reference translations. It efficiently produces an optimal automatic segmentation of the hypotheses and thus allows application of existing well-established evaluation measures. Experiments show that the evaluation measures based on the automatically produced segmentation correlate with the human judgement at least as well as the evaluation measures which are based on manual sentence boundaries.
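The central step, choosing segment boundaries in the hypothesis stream so that the total edit distance to the reference segments is minimal, can be sketched as a dynamic program (a direct, unoptimized illustration of the idea; function names are ours):

```python
def lev(a, b):
    """Unit-cost edit distance between two token sequences."""
    d = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, y in enumerate(b, 1):
            prev, d[j] = d[j], min(prev + (x != y), d[j] + 1, d[j - 1] + 1)
    return d[len(b)]

def segment_hypothesis(hyp, refs, edit):
    """Split hyp into len(refs) contiguous segments minimizing the
    summed edit distance to the reference segments; returns the total
    cost and the (start, end) bounds of each segment."""
    INF = float("inf")
    n = len(hyp)
    best = [[INF] * (n + 1) for _ in range(len(refs) + 1)]
    cut = [[0] * (n + 1) for _ in range(len(refs) + 1)]
    best[0][0] = 0
    for k in range(1, len(refs) + 1):
        for i in range(n + 1):
            for j in range(i + 1):
                if best[k - 1][j] == INF:
                    continue
                c = best[k - 1][j] + edit(hyp[j:i], refs[k - 1])
                if c < best[k][i]:
                    best[k][i], cut[k][i] = c, j
    bounds, i = [], n
    for k in range(len(refs), 0, -1):
        bounds.append((cut[k][i], i))
        i = cut[k][i]
    bounds.reverse()
    return best[len(refs)][n], bounds
```

Once the hypothesis is segmented this way, any segment-level evaluation measure (WER, BLEU, and so on) can be applied as if reference boundaries were known.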

98 citations


Journal ArticleDOI
01 Jun 2005
TL;DR: A system of self-organizing maps (SOMs) that represent the distance measuring spaces of node and edge labels is proposed; it adapts the edit costs in such a way that the similarity of graphs from the same class is increased, whereas the similarity of graphs from different classes decreases.
Abstract: Although graph matching and graph edit distance computation have become areas of intensive research recently, the automatic inference of the cost of edit operations has remained an open problem. In the present paper, we address the issue of learning graph edit distance cost functions for numerically labeled graphs from a corpus of sample graphs. We propose a system of self-organizing maps (SOMs) that represent the distance measuring spaces of node and edge labels. Our learning process is based on the concept of self-organization. It adapts the edit costs in such a way that the similarity of graphs from the same class is increased, whereas the similarity of graphs from different classes decreases. The learning procedure is demonstrated on two different applications involving line drawing graphs and graphs representing diatoms, respectively.

90 citations


Proceedings ArticleDOI
30 Aug 2005
TL;DR: The pq-gram distance between ordered labeled trees is defined as an effective and efficient approximation of the well-known tree edit distance, and the properties of the pq-gram distance are analyzed to compare it with the edit distance and alternative approximations.
Abstract: When integrating data from autonomous sources, exact matches of data items that represent the same real world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. As a running example we use residential address information. Addresses are hierarchical structures and are present in many databases. Often they are the best, if not only, relationship between autonomous data sources. Typically the matching has to be approximate since the representations in the sources differ. We propose pq-grams to approximately match hierarchical information from autonomous sources. We define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the well-known tree edit distance. We analyze the properties of the pq-gram distance and compare it with the edit distance and alternative approximations. Experiments with synthetic and real world data confirm the analytic results and the scalability of our approach.
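A sketch of a pq-gram profile and distance under the construction described above, with trees as `(label, children)` tuples, `p` ancestors and `q` consecutive children per gram, and `*` as the dummy padding label (the normalization below is one common variant):

```python
from collections import Counter

def pq_profile(tree, p=2, q=3):
    """Bag of pq-grams of an ordered labeled tree."""
    prof = []
    def walk(node, anc):
        label, children = node
        anc = (anc + (label,))[-p:]
        stem = ("*",) * (p - len(anc)) + anc   # p ancestors, padded
        if not children:
            prof.append(stem + ("*",) * q)     # leaf: q dummy children
            return
        sib = ("*",) * q                       # sliding window of q siblings
        for child in children:
            sib = sib[1:] + (child[0],)
            prof.append(stem + sib)
            walk(child, anc)
        for _ in range(q - 1):                 # drain the window
            sib = sib[1:] + ("*",)
            prof.append(stem + sib)
    walk(tree, ())
    return prof

def pq_distance(t1, t2, p=2, q=3):
    """1 - 2 * |bag intersection| / |bag sum| of the two profiles."""
    b1, b2 = Counter(pq_profile(t1, p, q)), Counter(pq_profile(t2, p, q))
    inter = sum((b1 & b2).values())
    return 1 - 2 * inter / sum((b1 + b2).values())
```

Because profiles are plain bags of fixed-length tuples, they can be stored and joined with ordinary database machinery, which is what makes the approximation scale.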

87 citations


Patent
14 Jul 2005
TL;DR: In this article, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema.
Abstract: The invention concerns the detection of duplicate tuples in a database. Previous domain independent detection of duplicated tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior art approaches result in large numbers of false positives if they are used to identify domain-specific abbreviations and conventions. In accordance with the invention, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high quality, scalable duplicate detection process.

01 Jan 2005
TL;DR: The use of annotated datasets and Support Vector Machines is described to induce larger monolingual paraphrase corpora from a comparable corpus of news clusters found on the World Wide Web, which dramatically reduces the Alignment Error Rate of the extracted corpora.
Abstract: The lack of readily-available large corpora of aligned monolingual sentence pairs is a major obstacle to the development of Statistical Machine Translation-based paraphrase models. In this paper, we describe the use of annotated datasets and Support Vector Machines to induce larger monolingual paraphrase corpora from a comparable corpus of news clusters found on the World Wide Web. Features include: morphological variants; WordNet synonyms and hypernyms; loglikelihood-based word pairings dynamically obtained from baseline sentence alignments; and formal string features such as word-based edit distance. Use of this technique dramatically reduces the Alignment Error Rate of the extracted corpora over heuristic methods based on position of the sentences in the text.

Book ChapterDOI
TL;DR: The main contribution is the definition of modified directional variance in orientation vector fields, which allows us to extract regions from fingerprints that are relevant for the classification in the Henry scheme.
Abstract: In the present paper we address the fingerprint classification problem with a structural pattern recognition approach. Our main contribution is the definition of modified directional variance in orientation vector fields. The new directional variance allows us to extract regions from fingerprints that are relevant for the classification in the Henry scheme. After processing the regions of interest, the resulting structures are converted into attributed graphs. The classification is finally performed with an efficient graph edit distance algorithm. The performance of the proposed classification method is evaluated on the NIST-4 database of fingerprints.

01 Jan 2005
TL;DR: Variations of string comparators based on the Jaro-Winkler comparator and edit distance comparator are applied to Census data to see which are better classifiers for matches and nonmatches.
Abstract: We compare variations of string comparators based on the Jaro-Winkler comparator and edit distance comparator. We apply the comparators to Census data to see which are better classifiers for matches and nonmatches, first by comparing their classification abilities using a ROC curve based analysis, then by considering a direct comparison between two candidate comparators in record linkage results.
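For reference, the Jaro similarity counts matching characters within a sliding window and penalizes transpositions, and Jaro-Winkler adds a common-prefix bonus; a sketch:

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):                 # greedy matching in window
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0                                # count transpositions
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, scale=0.1):
    """Jaro similarity boosted by a bonus for a shared prefix (<= 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * scale * (1 - j)
```

The prefix bonus reflects the empirical observation, central to record linkage on name data, that errors rarely occur at the start of a string.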

Book ChapterDOI
28 Aug 2005
TL;DR: In this article, the authors present a scalable and distributed access structure for similarity search in metric spaces based on the Content-addressable Network (CAN) paradigm, which provides a Distributed Hash Table (DHT) abstraction over a Cartesian space.
Abstract: In this paper we present a scalable and distributed access structure for similarity search in metric spaces. The approach is based on the Content-addressable Network (CAN) paradigm, which provides a Distributed Hash Table (DHT) abstraction over a Cartesian space. We have extended the CAN structure to support storage and retrieval of generic metric space objects. We use pivots for projecting objects of the metric space in an N-dimensional vector space, and exploit the CAN organization for distributing the objects among the computing nodes of the structure. We obtain a Peer-to-Peer network, called the MCAN, which is able to search metric space objects by means of the similarity range queries. Experiments conducted on our prototype system confirm full scalability of the approach.
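The pivot projection itself is simple: an object maps to the vector of its distances to the N pivots, and by the triangle inequality the L-infinity distance in pivot space lower-bounds the true metric distance, which is what makes safe filtering of range queries possible. A sketch (names are ours):

```python
def pivot_project(obj, pivots, dist):
    """Map a metric-space object to its vector of distances to pivots."""
    return [dist(obj, p) for p in pivots]

def pivot_lower_bound(v1, v2):
    """L-infinity distance in pivot space.

    By the triangle inequality |d(o1, p) - d(o2, p)| <= d(o1, o2) for
    every pivot p, so a range query of radius r can safely discard any
    object whose bound already exceeds r.
    """
    return max(abs(a - b) for a, b in zip(v1, v2))
```

In MCAN these projected coordinates also determine which CAN node stores the object, so the same embedding drives both distribution and filtering.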

Journal ArticleDOI
TL;DR: In this paper, the authors improved the time complexity of the problem from O(rn^2m^2) to O(rnm), where r, n, and m are the lengths of P, S1, and S2, respectively.
Abstract: Given strings S1, S2, and P, the constrained longest common subsequence problem for S1 and S2 with respect to P is to find a longest common subsequence lcs of S1 and S2 which contains P as a subsequence. We present an algorithm which improves the time complexity of the problem from the previously known O(rn^2m^2) to O(rnm) where r, n, and m are the lengths of P, S1, and S2, respectively. As a generalization of this, we extend the definition of the problem so that the lcs sought contains a subsequence whose edit distance from P is less than a given parameter d. For the latter problem, we propose an algorithm whose time complexity is O(drnm).
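A direct O(rnm) dynamic program in the spirit of the algorithm above (our own formulation): L[i][j][k] is the length of the longest common subsequence of the first i characters of S1 and the first j characters of S2 that contains the first k characters of P.

```python
def clcs(s1, s2, p):
    """Length of the longest common subsequence of s1 and s2 that
    contains p as a subsequence, or -1 if none exists."""
    n, m, r = len(s1), len(s2), len(p)
    NEG = float("-inf")
    L = [[[NEG] * (r + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            L[i][j][0] = 0                     # empty constraint: plain LCS
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for k in range(r + 1):
                best = max(L[i - 1][j][k], L[i][j - 1][k])
                if s1[i - 1] == s2[j - 1]:
                    if L[i - 1][j - 1][k] != NEG:       # match, no P advance
                        best = max(best, L[i - 1][j - 1][k] + 1)
                    if (k and s1[i - 1] == p[k - 1]     # match consumes p[k-1]
                            and L[i - 1][j - 1][k - 1] != NEG):
                        best = max(best, L[i - 1][j - 1][k - 1] + 1)
                L[i][j][k] = best
    return L[n][m][r] if L[n][m][r] != NEG else -1
```

Each of the r·n·m cells is filled in constant time, giving the O(rnm) bound the paper establishes.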

Posted Content
TL;DR: This work shows how to improve the space and/or remove a dependency on the alphabet size for each problem using either an improved tabulation technique of an existing algorithm or by combining known algorithms in a new way.
Abstract: We study 4 problems in string matching, namely, regular expression matching, approximate regular expression matching, string edit distance, and subsequence indexing, on a standard word RAM model of computation that allows logarithmic-sized words to be manipulated in constant time. We show how to improve the space and/or remove a dependency on the alphabet size for each problem using either an improved tabulation technique of an existing algorithm or by combining known algorithms in a new way.

Journal ArticleDOI
TL;DR: It is shown how sparse dynamic programming can be used to solve transposition invariant problems, and how this relates to multidimensional range-minimum search.

Proceedings ArticleDOI
22 May 2005
TL;DR: Efficient implementations of the embedding are shown that yield solutions to various computational problems involving edit distance, including sketching, communication complexity, and nearest neighbor search.
Abstract: We show that {0,1}^d endowed with edit distance embeds into l1 with distortion 2^O(√(log d log log d)). We further show efficient implementations of the embedding that yield solutions to various computational problems involving edit distance. These include sketching, communication complexity, and nearest neighbor search. For all these problems, we improve upon previous bounds.

Journal ArticleDOI
TL;DR: This work provides an algorithm to compute an optimal center under a weighted edit distance in polynomial time when the number of input strings is fixed and gives the complexity of the related Center String problem.

Journal ArticleDOI
TL;DR: This paper introduces a new technique for analyzing card sort data that uses quantitative measures to discover rich qualitative results and is based upon a distance metric between sorts that allows one to measure the similarity of groupings and then look for clusters of closely related sorts across individuals.
Abstract: Card sorts are a knowledge elicitation technique in which participants are given a collection of items and are asked to partition them into groups based on their own criteria. Information about the participant's knowledge structure is inferred from the groups formed and the names used to describe the groups through various methods ranging from simple quantitative statistical measures (e.g. co-occurrence frequencies) to complex qualitative methods (e.g. content analysis on the group names). This paper introduces a new technique for analyzing card sort data that uses quantitative measures to discover rich qualitative results. This method is based upon a distance metric between sorts that allows one to measure the similarity of groupings and then look for clusters of closely related sorts across individuals. By using software for computing these clusters, it is possible to identify common concepts across individuals, despite the use of different terminology.

Proceedings Article
30 Aug 2005
TL;DR: This paper develops a novel technique, called SEPIA, which groups strings into clusters, builds a histogram structure for each cluster, and constructs a global histogram for the database and discusses how to extend the techniques to other similarity functions.
Abstract: Many database applications have the emerging need to support fuzzy queries that ask for strings that are similar to a given string, such as "name similar to smith" and "telephone number similar to 412-0964." Query optimization needs the selectivity of such a fuzzy predicate, i.e., the fraction of records in the database that satisfy the condition. In this paper, we study the problem of estimating selectivities of fuzzy string predicates. We develop a novel technique, called SEPIA, to solve the problem. It groups strings into clusters, builds a histogram structure for each cluster, and constructs a global histogram for the database. It is based on the following intuition: given a query string q, a preselected string p in a cluster, and a string s in the cluster, based on the proximity between q and p, and the proximity between p and s, we can obtain a probability distribution from a global histogram about the similarity between q and s. We give a full specification of the technique using the edit distance function. We study challenges in adopting this technique, including how to construct the histogram structures, how to use them to do selectivity estimation, and how to alleviate the effect of non-uniform errors in the estimation. We discuss how to extend the techniques to other similarity functions. Our extensive experiments on real data sets show that this technique can accurately estimate selectivities of fuzzy string predicates.

01 Jan 2005
TL;DR: 1st Prize in the MIREX Symbolic Melodic Similarity Contest.
Abstract: 1st Prize in the MIREX Symbolic Melodic Similarity Contest. To appear in on-line proceedings: http://www.music-ir.org/evaluation/mirex-results/sym-melody/index.html

Book ChapterDOI
17 Apr 2005
TL;DR: Experimental results show that the proposed q-gram-based indexing method is efficient in detecting similarity regions in a DNA sequence database with high sensitivity.
Abstract: We have observed in recent years a growing interest in similarity search on large collections of biological sequences. Contributing to the interest, this paper presents a method for indexing DNA sequences efficiently based on q-grams to facilitate similarity search in a DNA database and sidestep the need for a linear scan of the entire database. A two-level index – hash table and c-trees – is proposed based on the q-grams of DNA sequences. The proposed data structures allow the quick detection of sequences within a certain distance to the query sequence. Experimental results show that our method is efficient in detecting similarity regions in a DNA sequence database with high sensitivity.

Journal ArticleDOI
TL;DR: This analysis allows for a new tree edit distance algorithm that is optimal for cover strategies, and provides an exact characterization of the complexity of cover strategies.

Journal ArticleDOI
TL;DR: It is shown that descriptive statistics of distance distribution, such as skewness and kurtosis, can guide the appropriate choice of the exponent, whereas the more familiar distance properties appear to have much less effect on the performance of distances.
Abstract: Motivation: Many types of genomic data are naturally represented as binary vectors. Numerous tasks in computational biology can be cast as analysis of relationships between these vectors, and the first step is, frequently, to compute their pairwise distance matrix. Many distance measures have been proposed in the literature, but there is no theory justifying the choice of distance measure. Results: We examine the approaches to measuring distances between binary vectors and study the characteristic properties of various distance measures and their performance in several tasks of genome analysis. Most distance measures between binary vectors turn out to belong to a single parametric family, namely generalized average-based distance with different exponents. We show that descriptive statistics of distance distribution, such as skewness and kurtosis, can guide the appropriate choice of the exponent. On the contrary, the more familiar distance properties, such as metric and additivity, appear to have much less effect on the performance of distances. Availability: R code GADIST and Supplementary material are available at http://research.stowers-institute.org/bioinfo/ Contact: gvg@stowers-institute.org
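For concreteness, two of the familiar measures between binary vectors discussed in this literature (how each relates to the paper's generalized average-based family is detailed there):

```python
def hamming(u, v):
    """Number of positions where two equal-length binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def jaccard_distance(u, v):
    """1 minus the ratio of shared 1s to positions with a 1 in either
    vector; ignores shared 0s, which matters for sparse genomic data."""
    both = sum(a and b for a, b in zip(u, v))
    either = sum(a or b for a, b in zip(u, v))
    return 1 - both / either if either else 0.0
```

The choice between such measures is exactly what the paper argues should be guided by the shape of the observed distance distribution rather than by formal properties alone.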

Journal ArticleDOI
TL;DR: This paper shows how multiple patterns can be packed into a single computer word so as to search for all of them simultaneously, and how the ideas can be applied to other problems such as multiple exact string matching and one-against-all computation of edit distance and longest common subsequences.
Abstract: Bit-parallelism permits executing several operations simultaneously over a set of bits or numbers stored in a single computer word. This technique permits searching for the approximate occurrences of a pattern of length m in a text of length n in time O(⌈m/w⌉n), where w is the number of bits in the computer word. Although this is asymptotically the optimal bit-parallel speedup over the basic O(mn) time algorithm, it wastes bit-parallelism's power in the common case where m is much smaller than w, since w−m bits in the computer words are unused. In this paper, we explore different ways to increase the bit-parallelism when the search pattern is short. First, we show how multiple patterns can be packed into a single computer word so as to search for all of them simultaneously. Instead of spending O(rn) time to search for r patterns of length m≤w/2, we need O(⌈rm/w⌉n) time. Second, we show how the mechanism permits boosting the search for a single pattern of length m≤w/2, which can be searched for in O(⌈n/⌊w/m⌋⌉) bit-parallel steps instead of O(n). Third, we show how to extend these algorithms so that the time bounds essentially depend on k instead of m, where k is the maximum number of differences permitted. Finally, we show how the ideas can be applied to other problems such as multiple exact string matching and one-against-all computation of edit distance and longest common subsequences. Our experimental results show that the new algorithms work well in practice, obtaining significant speedups over the best existing alternatives, especially on short patterns and moderate number of differences allowed. This work fills an important gap in the field, where little work has focused on very short patterns.
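The single-pattern building block being packed is a classic bit-parallel pattern automaton; an exact-matching Shift-And sketch (the paper packs several such automata into one word and extends the approach to approximate matching):

```python
def shift_and(pattern, text):
    """Bit-parallel Shift-And exact matching.

    Bit i of the state is set iff pattern[:i+1] is a suffix of the text
    read so far.  Returns the end positions of all matches.
    """
    m = len(pattern)
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)   # bit i set where pattern[i] == c
    accept = 1 << (m - 1)
    state, out = 0, []
    for pos, c in enumerate(text):
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            out.append(pos)
    return out
```

Packing r such length-m automata side by side in one w-bit word, with guard bits between them, is what yields the O(⌈rm/w⌉n) multi-pattern bound quoted above.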

Book ChapterDOI
19 Jun 2005
TL;DR: A linear algorithm is described for comparing two similar ordered rooted trees with node labels; an optimal mapping which uses at most k insertions or deletions can be constructed in O(nk^3), where n is the size of the trees.
Abstract: We describe a linear algorithm for comparing two similar ordered rooted trees with node labels. The method for comparing trees is the usual tree edit distance. We show that an optimal mapping which uses at most k insertions or deletions can then be constructed in O(nk^3) where n is the size of the trees. The approach is inspired by the Zhang-Shasha algorithm for tree edit distance in combination with an adequate pruning of the search space.

Journal ArticleDOI
TL;DR: A new bit-parallel technique for approximate string matching is presented, based on the concept of a witness, which permits sampling some dynamic programming matrix values to bound, deduce or compute others fast; it is the fastest algorithm for several combinations of m, k and alphabet sizes.
Abstract: We present a new bit-parallel technique for approximate string matching. We build on two previous techniques. The first one, BPM (Myers, 1999), searches for a pattern of length m in a text of length n permitting k differences in $O(\lceil m/w \rceil n)$ time, where w is the width of the computer word. The second one, ABNDM (Navarro and Raffinot, 2000), extends a sublinear-time exact algorithm to approximate searching. ABNDM relies on another algorithm, BPA (Wu and Manber, 1992), which makes use of an $O(k \lceil m/w \rceil n)$ time algorithm for its internal workings. BPA is slow but flexible enough to support all operations required by ABNDM. We improve previous ABNDM analyses, showing that it is average-optimal in number of inspected characters, although the overall complexity is higher because of the $O(k \lceil m/w \rceil )$ work done per inspected character. We then show that the faster BPM can be adapted to support all the operations required by ABNDM. This involves extending it to compute edit distance, to search for any pattern suffix, and to detect in advance the impossibility of a later match. The solution to those challenges is based on the concept of a witness, which permits sampling some dynamic programming matrix values to bound, deduce or compute others fast. The resulting algorithm is average-optimal for m ≤ w, assuming the alphabet size is constant. In practice, it performs better than the original ABNDM and is the fastest algorithm for several combinations of m, k and alphabet sizes that are useful, for example, in natural language searching and computational biology. To show that the concept of witnesses can be used in further scenarios, we also improve a recent variant of BPM. The use of witnesses greatly improves the running time of this algorithm too.

Book ChapterDOI
02 Nov 2005
TL;DR: It is shown that unlike the general edit distance between RNA secondary structures, the conservative edit distance can be computed in polynomial time and space, and an algorithm for this problem is described, which can be used in the more general problem of completeRNA secondary structures comparison.
Abstract: We introduce the notion of conservative edit distance and mapping between two RNA stem-loops. We show that unlike the general edit distance between RNA secondary structures, the conservative edit distance can be computed in polynomial time and space, and we describe an algorithm for this problem. We show how this algorithm can be used in the more general problem of complete RNA secondary structures comparison.