
Showing papers on "Edit distance" published in 2005


Proceedings ArticleDOI
14 Jun 2005
TL;DR: Analysis and comparison of EDR with other popular distance functions, such as Euclidean distance, Dynamic Time Warping (DTW), Edit distance with Real Penalty (ERP), and Longest Common Subsequences (LCSS), indicate that EDR is more robust than Euclidean distance, DTW and ERP, and it is on average 50% more accurate than LCSS.
Abstract: An important consideration in similarity-based retrieval of moving object trajectories is the definition of a distance function. The existing distance functions are usually sensitive to noise, shifts and scaling of data that commonly occur due to sensor failures, errors in detection techniques, disturbance signals, and different sampling rates. Cleaning data to eliminate these is not always possible. In this paper, we introduce a novel distance function, Edit Distance on Real sequence (EDR) which is robust against these data imperfections. Analysis and comparison of EDR with other popular distance functions, such as Euclidean distance, Dynamic Time Warping (DTW), Edit distance with Real Penalty (ERP), and Longest Common Subsequences (LCSS), indicate that EDR is more robust than Euclidean distance, DTW and ERP, and it is on average 50% more accurate than LCSS. We also develop three pruning techniques to improve the retrieval efficiency of EDR and show that these techniques can be combined effectively in a search, increasing the pruning power significantly. The experimental results confirm the superior efficiency of the combined methods.
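The EDR recurrence is a small variation on classic edit distance: two points match at cost 0 when they are within a threshold, and every other operation costs 1. A minimal sketch for one-dimensional trajectories, assuming a matching threshold `eps` (the paper handles multi-dimensional trajectories; names here are ours):

```python
def edr(s, t, eps=0.25):
    """Edit Distance on Real sequence (EDR) for 1-D trajectories.

    Two points "match" at cost 0 when they differ by at most eps;
    substitutions, insertions and deletions each cost 1.
    """
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all remaining points of s
    for j in range(n + 1):
        d[0][j] = j          # insert all remaining points of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 0 if abs(s[i - 1] - t[j - 1]) <= eps else 1
            d[i][j] = min(d[i - 1][j - 1] + match,   # match / substitute
                          d[i - 1][j] + 1,           # delete s[i-1]
                          d[i][j - 1] + 1)           # insert t[j-1]
    return d[m][n]
```

Because each element contributes a 0/1 cost, a single noisy outlier changes the distance by at most 1, which is the robustness property the paper exploits.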

1,225 citations


Journal ArticleDOI
TL;DR: This work surveys the problem of comparing labeled trees based on simple local operations of deleting, inserting, and relabeling nodes and presents one or more of the central algorithms for solving the problem.

831 citations


Book ChapterDOI
02 Nov 2005
TL;DR: A family of word similarity measures based on n-grams are formulated, and the results of experiments suggest that the new measures outperform their unigram equivalents.
Abstract: In many applications, it is necessary to algorithmically quantify the similarity exhibited by two strings composed of symbols from a finite alphabet. Numerous string similarity measures have been proposed. Particularly well-known measures are based on edit distance and the length of the longest common subsequence. We develop a notion of n-gram similarity and distance. We show that edit distance and the length of the longest common subsequence are special cases of n-gram distance and similarity, respectively. We provide formal, recursive definitions of n-gram similarity and distance, together with efficient algorithms for computing them. We formulate a family of word similarity measures based on n-grams, and report the results of experiments that suggest that the new measures outperform their unigram equivalents.
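In the same spirit (a sketch, not the paper's exact recursive definitions): compare the two strings n-gram by n-gram, charging an aligned pair the fraction of positions where the grams disagree. With n = 1 this collapses to ordinary unit-cost edit distance:

```python
def ngram_distance(x, y, n=2):
    """Sketch of an n-gram edit distance.

    Strings are front-padded, decomposed into overlapping n-grams, and
    aligned by the usual edit-distance DP; the substitution cost of two
    n-grams is the fraction of positions where they differ.
    """
    pad = "\0" * (n - 1)
    x, y = pad + x, pad + y
    gx = [x[i:i + n] for i in range(len(x) - n + 1)]
    gy = [y[i:i + n] for i in range(len(y) - n + 1)]
    m, k = len(gx), len(gy)
    d = [[0.0] * (k + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(k + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, k + 1):
            cost = sum(a != b for a, b in zip(gx[i - 1], gy[j - 1])) / n
            d[i][j] = min(d[i - 1][j - 1] + cost,  # align two n-grams
                          d[i - 1][j] + 1,         # delete an n-gram
                          d[i][j - 1] + 1)         # insert an n-gram
    return d[m][k]
```

With `n=1` each "n-gram" is a single character and the fractional cost becomes the usual 0/1 substitution cost.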

296 citations


Journal ArticleDOI
TL;DR: The aim is to convert graphs to string sequences so that string matching techniques can be used and to compute the edit distance by finding the sequence of string edit operations which minimizes the cost of the path traversing the edit lattice.
Abstract: This paper is concerned with computing graph edit distance. One of the criticisms that can be leveled at existing methods for computing graph edit distance is that they lack some of the formality and rigor of the computation of string edit distance. Hence, our aim is to convert graphs to string sequences so that string matching techniques can be used. To do this, we use a graph spectral seriation method to convert the adjacency matrix into a string or sequence order. We show how the serial ordering can be established using the leading eigenvector of the graph adjacency matrix. We pose the problem of graph-matching as a maximum a posteriori probability (MAP) alignment of the seriation sequences for pairs of graphs. This treatment leads to an expression in which the edit cost is the negative logarithm of the a posteriori sequence alignment probability. We compute the edit distance by finding the sequence of string edit operations which minimizes the cost of the path traversing the edit lattice. The edit costs are determined by the components of the leading eigenvectors of the adjacency matrix and by the edge densities of the graphs being matched. We demonstrate the utility of the edit distance on a number of graph clustering problems.
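The seriation step can be sketched in a few lines of linear algebra, assuming a symmetric 0/1 adjacency matrix (the paper's full method additionally enforces edge connectivity along the ordering):

```python
import numpy as np

def spectral_seriation(adj):
    """Order nodes by the leading eigenvector of the adjacency matrix.

    Returns node indices in decreasing order of their component in the
    eigenvector of the largest eigenvalue (the Perron vector for a
    connected graph).
    """
    vals, vecs = np.linalg.eigh(np.asarray(adj, dtype=float))
    lead = vecs[:, np.argmax(vals)]
    if lead.sum() < 0:            # eigenvector sign is arbitrary; fix it
        lead = -lead
    return list(np.argsort(-lead))
```

The resulting node order turns each graph's label sequence into a string, after which standard string edit distance machinery applies.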

191 citations


Proceedings ArticleDOI
14 Jun 2005
TL;DR: This paper proposes to transform tree-structured data into an approximate numerical multidimensional vector which encodes the original structure information and proves that the L1 distance of the corresponding vectors, whose computational complexity is O(|T1| + |T2|), forms a lower bound for the edit distance between trees.
Abstract: Tree-structured data are becoming ubiquitous nowadays and manipulating them based on similarity is essential for many applications. The generally accepted similarity measure for trees is the edit distance. Although similarity search has been extensively studied, searching for similar trees is still an open problem due to the high complexity of computing the tree edit distance. In this paper, we propose to transform tree-structured data into an approximate numerical multidimensional vector which encodes the original structure information. We prove that the L1 distance of the corresponding vectors, whose computational complexity is O(|T1| + |T2|), forms a lower bound for the edit distance between trees. Based on the theoretical analysis, we describe a novel algorithm which embeds the proposed distance into a filter-and-refine framework to process similarity search on tree-structured data. The experimental results show that our algorithm reduces dramatically the distance computation cost. Our method is especially suitable for accelerating similarity query processing on large trees in massive datasets.
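To illustrate the flavor of such a lower bound with a much simpler embedding than the paper's (here only the node-label histogram, a hypothetical stand-in for the paper's structure-encoding vector): one edit operation changes the label histogram by at most 2 in L1 norm, so half the histograms' L1 distance never exceeds the tree edit distance.

```python
from collections import Counter

def label_histogram_lb(labels1, labels2):
    """Half the L1 distance between node-label histograms.

    A relabeling moves one unit of mass between two histogram bins
    (L1 change <= 2); an insertion or deletion changes one bin by 1.
    Hence this quantity lower-bounds the unit-cost tree edit distance.
    """
    h1, h2 = Counter(labels1), Counter(labels2)
    l1 = sum(abs(h1[k] - h2[k]) for k in set(h1) | set(h2))
    return l1 / 2
```

The paper's vector additionally encodes structure, giving a tighter bound, but the filter-and-refine usage is the same: discard candidate trees whose cheap lower bound already exceeds the query threshold.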

174 citations


Proceedings ArticleDOI
06 Oct 2005
TL;DR: This paper investigates using the Expectation Maximization algorithm to learn edit distance weights directly from search query logs, without relying on a corpus of paired words.
Abstract: Applying the noisy channel model to search query spelling correction requires an error model and a language model. Typically, the error model relies on a weighted string edit distance measure. The weights can be learned from pairs of misspelled words and their corrections. This paper investigates using the Expectation Maximization algorithm to learn edit distance weights directly from search query logs, without relying on a corpus of paired words.
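The error model underneath is a weighted string edit distance; only the per-operation cost functions, which the paper learns with EM, differ from the textbook recurrence. A sketch with caller-supplied cost functions (names are ours):

```python
def weighted_edit_distance(s, t, sub_cost, ins_cost, del_cost):
    """Weighted edit distance with per-character(-pair) costs."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost(s[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost(t[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),
                d[i - 1][j] + del_cost(s[i - 1]),
                d[i][j - 1] + ins_cost(t[j - 1]))
    return d[m][n]
```

With unit costs this reduces to plain Levenshtein distance; in the paper the costs come from the error model trained on query-log data.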

120 citations


ReportDOI
26 Jul 2005
TL;DR: This paper presents discriminative string-edit CRFs, a finite-state conditional random field model for edit sequences between strings, trained on both positive and negative instances of string pairs.
Abstract: The need to measure sequence similarity arises in information extraction, object identity, data mining, biological sequence analysis, and other domains. This paper presents discriminative string-edit CRFs, a finite-state conditional random field model for edit sequences between strings. Conditional random fields have advantages over generative approaches to this problem, such as pair HMMs or the work of Ristad and Yianilos, because as conditionally-trained methods, they enable the use of complex, arbitrary actions and features of the input strings. As in generative models, the training data does not have to specify the edit sequences between the given string pairs. Unlike generative models, however, our model is trained on both positive and negative instances of string pairs. We present positive experimental results on several data sets.

120 citations


01 Jan 2005
TL;DR: A novel automatic sentence segmentation method for evaluating machine translation output with possibly erroneous sentence boundaries that efficiently produces an optimal automatic segmentation of the hypotheses and thus allows application of existing well-established evaluation measures.
Abstract: This paper presents a novel automatic sentence segmentation method for evaluating machine translation output with possibly erroneous sentence boundaries. The algorithm can process translation hypotheses with segment boundaries which do not correspond to the reference segment boundaries, or a completely unsegmented text stream. Thus, the method is especially useful for evaluating translations of spoken language. The evaluation procedure takes advantage of the edit distance algorithm and is able to handle multiple reference translations. It efficiently produces an optimal automatic segmentation of the hypotheses and thus allows application of existing well-established evaluation measures. Experiments show that the evaluation measures based on the automatically produced segmentation correlate with the human judgement at least as well as the evaluation measures which are based on manual sentence boundaries.
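The central step, choosing segment boundaries in the hypothesis stream so that the total edit distance to the reference segments is minimal, can be sketched as a dynamic program (a direct, unoptimized illustration of the idea; function names are ours):

```python
def lev(a, b):
    """Unit-cost edit distance between two token sequences."""
    d = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, y in enumerate(b, 1):
            prev, d[j] = d[j], min(prev + (x != y), d[j] + 1, d[j - 1] + 1)
    return d[len(b)]

def segment_hypothesis(hyp, refs, edit):
    """Split hyp into len(refs) contiguous segments minimizing the
    summed edit distance to the reference segments; returns the total
    cost and the (start, end) bounds of each segment."""
    INF = float("inf")
    n = len(hyp)
    best = [[INF] * (n + 1) for _ in range(len(refs) + 1)]
    cut = [[0] * (n + 1) for _ in range(len(refs) + 1)]
    best[0][0] = 0
    for k in range(1, len(refs) + 1):
        for i in range(n + 1):
            for j in range(i + 1):
                if best[k - 1][j] == INF:
                    continue
                c = best[k - 1][j] + edit(hyp[j:i], refs[k - 1])
                if c < best[k][i]:
                    best[k][i], cut[k][i] = c, j
    bounds, i = [], n
    for k in range(len(refs), 0, -1):
        bounds.append((cut[k][i], i))
        i = cut[k][i]
    bounds.reverse()
    return best[len(refs)][n], bounds
```

Once the hypothesis is segmented this way, any segment-level evaluation measure (WER, BLEU, and so on) can be applied as if reference boundaries were known.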

98 citations


Journal ArticleDOI
01 Jun 2005
TL;DR: A system of self-organizing maps (SOMs) that represent the distance measuring spaces of node and edge labels is proposed; it adapts the edit costs in such a way that the similarity of graphs from the same class is increased, whereas the similarity of graphs from different classes decreases.
Abstract: Although graph matching and graph edit distance computation have become areas of intensive research recently, the automatic inference of the cost of edit operations has remained an open problem. In the present paper, we address the issue of learning graph edit distance cost functions for numerically labeled graphs from a corpus of sample graphs. We propose a system of self-organizing maps (SOMs) that represent the distance measuring spaces of node and edge labels. Our learning process is based on the concept of self-organization. It adapts the edit costs in such a way that the similarity of graphs from the same class is increased, whereas the similarity of graphs from different classes decreases. The learning procedure is demonstrated on two different applications involving line drawing graphs and graphs representing diatoms, respectively.

90 citations


Proceedings ArticleDOI
30 Aug 2005
TL;DR: The pq-gram distance between ordered labeled trees is defined as an effective and efficient approximation of the well-known tree edit distance, and the properties of the pq-gram distance are analyzed to compare it with the edit distance and alternative approximations.
Abstract: When integrating data from autonomous sources, exact matches of data items that represent the same real world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. As a running example we use residential address information. Addresses are hierarchical structures and are present in many databases. Often they are the best, if not only, relationship between autonomous data sources. Typically the matching has to be approximate since the representations in the sources differ. We propose pq-grams to approximately match hierarchical information from autonomous sources. We define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the well-known tree edit distance. We analyze the properties of the pq-gram distance and compare it with the edit distance and alternative approximations. Experiments with synthetic and real world data confirm the analytic results and the scalability of our approach.
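A sketch of a pq-gram profile and distance under the construction described above, with trees as `(label, children)` tuples, `p` ancestors and `q` consecutive children per gram, and `*` as the dummy padding label (the normalization below is one common variant):

```python
from collections import Counter

def pq_profile(tree, p=2, q=3):
    """Bag of pq-grams of an ordered labeled tree."""
    prof = []
    def walk(node, anc):
        label, children = node
        anc = (anc + (label,))[-p:]
        stem = ("*",) * (p - len(anc)) + anc   # p ancestors, padded
        if not children:
            prof.append(stem + ("*",) * q)     # leaf: q dummy children
            return
        sib = ("*",) * q                       # sliding window of q siblings
        for child in children:
            sib = sib[1:] + (child[0],)
            prof.append(stem + sib)
            walk(child, anc)
        for _ in range(q - 1):                 # drain the window
            sib = sib[1:] + ("*",)
            prof.append(stem + sib)
    walk(tree, ())
    return prof

def pq_distance(t1, t2, p=2, q=3):
    """1 - 2 * |bag intersection| / |bag sum| of the two profiles."""
    b1, b2 = Counter(pq_profile(t1, p, q)), Counter(pq_profile(t2, p, q))
    inter = sum((b1 & b2).values())
    return 1 - 2 * inter / sum((b1 + b2).values())
```

Because profiles are plain bags of fixed-length tuples, they can be stored and joined with ordinary database machinery, which is what makes the approximation scale.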

87 citations


Patent
14 Jul 2005
TL;DR: In this article, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema.
Abstract: The invention concerns the detection of duplicate tuples in a database. Previous domain independent detection of duplicated tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior art approaches result in large numbers of false positives if they are used to identify domain-specific abbreviations and conventions. In accordance with the invention, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high quality, scalable duplicate detection process.

01 Jan 2005
TL;DR: The use of annotated datasets and Support Vector Machines is described to induce larger monolingual paraphrase corpora from a comparable corpus of news clusters found on the World Wide Web, which dramatically reduces the Alignment Error Rate of the extracted corpora.
Abstract: The lack of readily-available large corpora of aligned monolingual sentence pairs is a major obstacle to the development of Statistical Machine Translation-based paraphrase models. In this paper, we describe the use of annotated datasets and Support Vector Machines to induce larger monolingual paraphrase corpora from a comparable corpus of news clusters found on the World Wide Web. Features include: morphological variants; WordNet synonyms and hypernyms; loglikelihood-based word pairings dynamically obtained from baseline sentence alignments; and formal string features such as word-based edit distance. Use of this technique dramatically reduces the Alignment Error Rate of the extracted corpora over heuristic methods based on position of the sentences in the text.

Book ChapterDOI
TL;DR: The main contribution is the definition of modified directional variance in orientation vector fields, which allows us to extract regions from fingerprints that are relevant for the classification in the Henry scheme.
Abstract: In the present paper we address the fingerprint classification problem with a structural pattern recognition approach. Our main contribution is the definition of modified directional variance in orientation vector fields. The new directional variance allows us to extract regions from fingerprints that are relevant for the classification in the Henry scheme. After processing the regions of interest, the resulting structures are converted into attributed graphs. The classification is finally performed with an efficient graph edit distance algorithm. The performance of the proposed classification method is evaluated on the NIST-4 database of fingerprints.

01 Jan 2005
TL;DR: Variations of string comparators based on the Jaro-Winkler comparator and edit distance comparator are applied to Census data to see which are better classifiers for matches and nonmatches.
Abstract: We compare variations of string comparators based on the Jaro-Winkler comparator and edit distance comparator. We apply the comparators to Census data to see which are better classifiers for matches and nonmatches, first by comparing their classification abilities using a ROC curve based analysis, then by considering a direct comparison between two candidate comparators in record linkage results.
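For reference, the Jaro similarity counts matching characters within a sliding window and penalizes transpositions, and Jaro-Winkler adds a common-prefix bonus; a sketch:

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):                 # greedy matching in window
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0                                # count transpositions
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, scale=0.1):
    """Jaro similarity boosted by a bonus for a shared prefix (<= 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * scale * (1 - j)
```

The prefix bonus reflects the empirical observation, central to record linkage on name data, that errors rarely occur at the start of a string.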

Book ChapterDOI
28 Aug 2005
TL;DR: In this article, the authors present a scalable and distributed access structure for similarity search in metric spaces based on the Content-addressable Network (CAN) paradigm, which provides a Distributed Hash Table (DHT) abstraction over a Cartesian space.
Abstract: In this paper we present a scalable and distributed access structure for similarity search in metric spaces. The approach is based on the Content-addressable Network (CAN) paradigm, which provides a Distributed Hash Table (DHT) abstraction over a Cartesian space. We have extended the CAN structure to support storage and retrieval of generic metric space objects. We use pivots for projecting objects of the metric space in an N-dimensional vector space, and exploit the CAN organization for distributing the objects among the computing nodes of the structure. We obtain a Peer-to-Peer network, called the MCAN, which is able to search metric space objects by means of the similarity range queries. Experiments conducted on our prototype system confirm full scalability of the approach.
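The pivot projection itself is simple: an object maps to the vector of its distances to the N pivots, and by the triangle inequality the L-infinity distance in pivot space lower-bounds the true metric distance, which is what makes safe filtering of range queries possible. A sketch (names are ours):

```python
def pivot_project(obj, pivots, dist):
    """Map a metric-space object to its vector of distances to pivots."""
    return [dist(obj, p) for p in pivots]

def pivot_lower_bound(v1, v2):
    """L-infinity distance in pivot space.

    By the triangle inequality |d(o1, p) - d(o2, p)| <= d(o1, o2) for
    every pivot p, so a range query of radius r can safely discard any
    object whose bound already exceeds r.
    """
    return max(abs(a - b) for a, b in zip(v1, v2))
```

In MCAN these projected coordinates also determine which CAN node stores the object, so the same embedding drives both distribution and filtering.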

Journal ArticleDOI
TL;DR: In this paper, the authors improved the time complexity of the problem from O(rn^2m^2) to O(rnm), where r, n, and m are the lengths of P, S1, and S2, respectively.
Abstract: Given strings S1, S2, and P, the constrained longest common subsequence problem for S1 and S2 with respect to P is to find a longest common subsequence lcs of S1 and S2 which contains P as a subsequence. We present an algorithm which improves the time complexity of the problem from the previously known O(rn^2m^2) to O(rnm) where r, n, and m are the lengths of P, S1, and S2, respectively. As a generalization of this, we extend the definition of the problem so that the lcs sought contains a subsequence whose edit distance from P is less than a given parameter d. For the latter problem, we propose an algorithm whose time complexity is O(drnm).
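A direct O(rnm) dynamic program in the spirit of the algorithm above (our own formulation): L[i][j][k] is the length of the longest common subsequence of the first i characters of S1 and the first j characters of S2 that contains the first k characters of P.

```python
def clcs(s1, s2, p):
    """Length of the longest common subsequence of s1 and s2 that
    contains p as a subsequence, or -1 if none exists."""
    n, m, r = len(s1), len(s2), len(p)
    NEG = float("-inf")
    L = [[[NEG] * (r + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            L[i][j][0] = 0                     # empty constraint: plain LCS
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for k in range(r + 1):
                best = max(L[i - 1][j][k], L[i][j - 1][k])
                if s1[i - 1] == s2[j - 1]:
                    if L[i - 1][j - 1][k] != NEG:       # match, no P advance
                        best = max(best, L[i - 1][j - 1][k] + 1)
                    if (k and s1[i - 1] == p[k - 1]     # match consumes p[k-1]
                            and L[i - 1][j - 1][k - 1] != NEG):
                        best = max(best, L[i - 1][j - 1][k - 1] + 1)
                L[i][j][k] = best
    return L[n][m][r] if L[n][m][r] != NEG else -1
```

Each of the r·n·m cells is filled in constant time, giving the O(rnm) bound the paper establishes.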

Posted Content
TL;DR: This work shows how to improve the space and/or remove a dependency on the alphabet size for each problem using either an improved tabulation technique of an existing algorithm or by combining known algorithms in a new way.
Abstract: We study 4 problems in string matching, namely, regular expression matching, approximate regular expression matching, string edit distance, and subsequence indexing, on a standard word RAM model of computation that allows logarithmic-sized words to be manipulated in constant time. We show how to improve the space and/or remove a dependency on the alphabet size for each problem using either an improved tabulation technique of an existing algorithm or by combining known algorithms in a new way.

Journal ArticleDOI
TL;DR: It is shown how sparse dynamic programming can be used to solve transposition invariant problems, and how this relates to multidimensional range-minimum search.

Proceedings ArticleDOI
22 May 2005
TL;DR: Efficient implementations of the embedding are shown that yield solutions to various computational problems involving edit distance, including sketching, communication complexity, and nearest neighbor search.
Abstract: We show that {0,1}^d endowed with edit distance embeds into l1 with distortion 2^O(√(log d log log d)). We further show efficient implementations of the embedding that yield solutions to various computational problems involving edit distance. These include sketching, communication complexity, and nearest neighbor search. For all these problems, we improve upon previous bounds.

Journal ArticleDOI
TL;DR: This work provides an algorithm to compute an optimal center under a weighted edit distance in polynomial time when the number of input strings is fixed and gives the complexity of the related Center String problem.

Journal ArticleDOI
TL;DR: This paper introduces a new technique for analyzing card sort data that uses quantitative measures to discover rich qualitative results and is based upon a distance metric between sorts that allows one to measure the similarity of groupings and then look for clusters of closely related sorts across individuals.
Abstract: Card sorts are a knowledge elicitation technique in which participants are given a collection of items and are asked to partition them into groups based on their own criteria. Information about the participant's knowledge structure is inferred from the groups formed and the names used to describe the groups through various methods ranging from simple quantitative statistical measures (e.g. co-occurrence frequencies) to complex qualitative methods (e.g. content analysis on the group names). This paper introduces a new technique for analyzing card sort data that uses quantitative measures to discover rich qualitative results. This method is based upon a distance metric between sorts that allows one to measure the similarity of groupings and then look for clusters of closely related sorts across individuals. By using software for computing these clusters, it is possible to identify common concepts across individuals, despite the use of different terminology.

Proceedings Article
30 Aug 2005
TL;DR: This paper develops a novel technique, called SEPIA, which groups strings into clusters, builds a histogram structure for each cluster, and constructs a global histogram for the database and discusses how to extend the techniques to other similarity functions.
Abstract: Many database applications have the emerging need to support fuzzy queries that ask for strings that are similar to a given string, such as "name similar to smith" and "telephone number similar to 412-0964." Query optimization needs the selectivity of such a fuzzy predicate, i.e., the fraction of records in the database that satisfy the condition. In this paper, we study the problem of estimating selectivities of fuzzy string predicates. We develop a novel technique, called SEPIA, to solve the problem. It groups strings into clusters, builds a histogram structure for each cluster, and constructs a global histogram for the database. It is based on the following intuition: given a query string q, a preselected string p in a cluster, and a string s in the cluster, based on the proximity between q and p, and the proximity between p and s, we can obtain a probability distribution from a global histogram about the similarity between q and s. We give a full specification of the technique using the edit distance function. We study challenges in adopting this technique, including how to construct the histogram structures, how to use them to do selectivity estimation, and how to alleviate the effect of non-uniform errors in the estimation. We discuss how to extend the techniques to other similarity functions. Our extensive experiments on real data sets show that this technique can accurately estimate selectivities of fuzzy string predicates.

01 Jan 2005
TL;DR: 1st Prize in the MIREX Symbolic Melodic Similarity Contest.
Abstract: 1st Prize in the MIREX Symbolic Melodic Similarity Contest. To appear in on-line proceedings: http://www.music-ir.org/evaluation/mirex-results/sym-melody/index.html

Book ChapterDOI
17 Apr 2005
TL;DR: Experimental results show that the proposed q-gram-based indexing method is efficient in detecting similarity regions in a DNA sequence database with high sensitivity.
Abstract: We have observed in recent years a growing interest in similarity search on large collections of biological sequences. Contributing to the interest, this paper presents a method for indexing DNA sequences efficiently based on q-grams to facilitate similarity search in a DNA database and sidestep the need for a linear scan of the entire database. A two-level index – hash table and c-trees – is proposed based on the q-grams of DNA sequences. The proposed data structures allow the quick detection of sequences within a certain distance to the query sequence. Experimental results show that our method is efficient in detecting similarity regions in a DNA sequence database with high sensitivity.

Journal ArticleDOI
TL;DR: This analysis allows for a new tree edit distance algorithm that is optimal for cover strategies, and provides an exact characterization of the complexity of cover strategies.

Journal ArticleDOI
TL;DR: It is shown that descriptive statistics of distance distribution, such as skewness and kurtosis, can guide the appropriate choice of the exponent, whereas the more familiar distance properties appear to have much less effect on the performance of distances.
Abstract: Motivation: Many types of genomic data are naturally represented as binary vectors. Numerous tasks in computational biology can be cast as analysis of relationships between these vectors, and the first step is, frequently, to compute their pairwise distance matrix. Many distance measures have been proposed in the literature, but there is no theory justifying the choice of distance measure. Results: We examine the approaches to measuring distances between binary vectors and study the characteristic properties of various distance measures and their performance in several tasks of genome analysis. Most distance measures between binary vectors turn out to belong to a single parametric family, namely generalized average-based distance with different exponents. We show that descriptive statistics of distance distribution, such as skewness and kurtosis, can guide the appropriate choice of the exponent. On the contrary, the more familiar distance properties, such as metric and additivity, appear to have much less effect on the performance of distances. Availability: R code GADIST and Supplementary material are available at http://research.stowers-institute.org/bioinfo/ Contact: gvg@stowers-institute.org
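For concreteness, two of the familiar measures between binary vectors discussed in this literature (how each relates to the paper's generalized average-based family is detailed there):

```python
def hamming(u, v):
    """Number of positions where two equal-length binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def jaccard_distance(u, v):
    """1 minus the ratio of shared 1s to positions with a 1 in either
    vector; ignores shared 0s, which matters for sparse genomic data."""
    both = sum(a and b for a, b in zip(u, v))
    either = sum(a or b for a, b in zip(u, v))
    return 1 - both / either if either else 0.0
```

The choice between such measures is exactly what the paper argues should be guided by the shape of the observed distance distribution rather than by formal properties alone.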

Journal ArticleDOI
TL;DR: This paper shows how multiple patterns can be packed into a single computer word so as to search for all of them simultaneously, and how the ideas can be applied to other problems such as multiple exact string matching and one-against-all computation of edit distance and longest common subsequences.
Abstract: Bit-parallelism permits executing several operations simultaneously over a set of bits or numbers stored in a single computer word. This technique permits searching for the approximate occurrences of a pattern of length m in a text of length n in time O(⌈m/w⌉n), where w is the number of bits in the computer word. Although this is asymptotically the optimal bit-parallel speedup over the basic O(mn) time algorithm, it wastes bit-parallelism's power in the common case where m is much smaller than w, since w−m bits in the computer words are unused. In this paper, we explore different ways to increase the bit-parallelism when the search pattern is short. First, we show how multiple patterns can be packed into a single computer word so as to search for all of them simultaneously. Instead of spending O(rn) time to search for r patterns of length m≤w/2, we need O(⌈rm/w⌉n) time. Second, we show how the mechanism permits boosting the search for a single pattern of length m≤w/2, which can be searched for in O(⌈n/⌊w/m⌋⌉) bit-parallel steps instead of O(n). Third, we show how to extend these algorithms so that the time bounds essentially depend on k instead of m, where k is the maximum number of differences permitted. Finally, we show how the ideas can be applied to other problems such as multiple exact string matching and one-against-all computation of edit distance and longest common subsequences. Our experimental results show that the new algorithms work well in practice, obtaining significant speedups over the best existing alternatives, especially on short patterns and moderate number of differences allowed. This work fills an important gap in the field, where little work has focused on very short patterns.
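The single-pattern building block being packed is a classic bit-parallel pattern automaton; an exact-matching Shift-And sketch (the paper packs several such automata into one word and extends the approach to approximate matching):

```python
def shift_and(pattern, text):
    """Bit-parallel Shift-And exact matching.

    Bit i of the state is set iff pattern[:i+1] is a suffix of the text
    read so far.  Returns the end positions of all matches.
    """
    m = len(pattern)
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)   # bit i set where pattern[i] == c
    accept = 1 << (m - 1)
    state, out = 0, []
    for pos, c in enumerate(text):
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            out.append(pos)
    return out
```

Packing r such length-m automata side by side in one w-bit word, with guard bits between them, is what yields the O(⌈rm/w⌉n) multi-pattern bound quoted above.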

Book ChapterDOI
19 Jun 2005
TL;DR: A linear algorithm is described for comparing two similar ordered rooted trees with node labels; an optimal mapping which uses at most k insertions or deletions can be constructed in O(nk^3), where n is the size of the trees.
Abstract: We describe a linear algorithm for comparing two similar ordered rooted trees with node labels. The method for comparing trees is the usual tree edit distance. We show that an optimal mapping which uses at most k insertions or deletions can then be constructed in O(nk^3) where n is the size of the trees. The approach is inspired by the Zhang-Shasha algorithm for tree edit distance in combination with an adequate pruning of the search space.

Journal ArticleDOI
TL;DR: A new bit-parallel technique for approximate string matching is presented, based on the concept of a witness, which permits sampling some dynamic programming matrix values to bound, deduce or compute others fast; it is the fastest algorithm for several combinations of m, k and alphabet sizes.
Abstract: We present a new bit-parallel technique for approximate string matching. We build on two previous techniques. The first one, BPM (Myers, 1999), searches for a pattern of length m in a text of length n permitting k differences in $O(\lceil m/w \rceil n)$ time, where w is the width of the computer word. The second one, ABNDM (Navarro and Raffinot, 2000), extends a sublinear-time exact algorithm to approximate searching. ABNDM relies on another algorithm, BPA (Wu and Manber, 1992), which makes use of an $O(k \lceil m/w \rceil n)$ time algorithm for its internal workings. BPA is slow but flexible enough to support all operations required by ABNDM. We improve previous ABNDM analyses, showing that it is average-optimal in number of inspected characters, although the overall complexity is higher because of the $O(k \lceil m/w \rceil )$ work done per inspected character. We then show that the faster BPM can be adapted to support all the operations required by ABNDM. This involves extending it to compute edit distance, to search for any pattern suffix, and to detect in advance the impossibility of a later match. The solution to those challenges is based on the concept of a witness, which permits sampling some dynamic programming matrix values to bound, deduce or compute others fast. The resulting algorithm is average-optimal for m ≤ w, assuming the alphabet size is constant. In practice, it performs better than the original ABNDM and is the fastest algorithm for several combinations of m, k and alphabet sizes that are useful, for example, in natural language searching and computational biology. To show that the concept of witnesses can be used in further scenarios, we also improve a recent variant of BPM. The use of witnesses greatly improves the running time of this algorithm too.

Book ChapterDOI
02 Nov 2005
TL;DR: It is shown that unlike the general edit distance between RNA secondary structures, the conservative edit distance can be computed in polynomial time and space, and an algorithm for this problem is described, which can be used in the more general problem of completeRNA secondary structures comparison.
Abstract: We introduce the notion of conservative edit distance and mapping between two RNA stem-loops. We show that unlike the general edit distance between RNA secondary structures, the conservative edit distance can be computed in polynomial time and space, and we describe an algorithm for this problem. We show how this algorithm can be used in the more general problem of complete RNA secondary structures comparison.