Showing papers on "Edit distance published in 2002"

PDF

Open Access

Book Chapter•DOI•

Eliminating fuzzy duplicates in data warehouses

[...]

Rohit Ananthakrishna¹, Surajit Chaudhuri², Venkatesh Ganti²•Institutions (2)

20 Aug 2002

TL;DR: An algorithm for eliminating duplicates in dimensional tables in a data warehouse, which is usually associated with hierarchies is developed and evaluated on real datasets from an operational data warehouse.

...read moreread less

Abstract: The duplicate elimination problem of detecting multiple tuples, which describe the same real world entity, is an important data cleaning problem. Previous domain independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches result in large numbers of false positives if we want to identify domain-specific abbreviations and conventions. In this paper, we develop an algorithm for eliminating duplicates in dimensional tables in a data warehouse, which are usually associated with hierarchies. We exploit hierarchies to develop a high quality, scalable duplicate elimination algorithm, and evaluate it on real datasets from an operational data warehouse.

...read moreread less

465 citations

Journal Article•DOI•

A general edit distance between RNA structures.

[...]

Tao Jiang¹, Guohui Lin, Bin Ma, Kaizhong Zhang•Institutions (1)

University of California, Riverside¹

01 Jan 2002-Journal of Computational Biology

TL;DR: The notion of edit distance is proposed to measure the similarity between two RNA secondary and tertiary structures, by incorporating various edit operations performed on both bases and arcs (i.e., base-pairs).

...read moreread less

Abstract: Arc-annotated sequences are useful in representing the structural information of RNA sequences. In general, RNA secondary and tertiary structures can be represented as a set of nested arcs and a set of crossing arcs, respectively. Since RNA functions are largely determined by molecular confirmation and therefore secondary and tertiary structures, the comparison between RNA secondary and tertiary structures has received much attention recently. In this paper, we propose the notion of edit distance to measure the similarity between two RNA secondary and tertiary structures, by incorporating various edit operations performed on both bases and arcs (i.e., base-pairs). Several algorithms are presented to compute the edit distance between two RNA sequences with various arc structures and under various score schemes, either exactly or approximately, with provably good performance. Preliminary experimental tests confirm that our definition of edit distance and the computation model are among the most reasonable ones ever studied in the literature.

...read moreread less

218 citations

Journal Article•DOI•

Fast string correction with Levenshtein automata

[...]

Klaus U. Schulz¹, Stoyan Mihov²•Institutions (2)

Ludwig Maximilian University of Munich¹, Bulgarian Academy of Sciences²

01 Nov 2002-International Journal on Document Analysis and Recognition

TL;DR: This work shows how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein automaton of degree n for W in time linear to the length of W, which leads to a very fast method for correcting corrupted input words of unrestricted text using large electronic dictionaries.

...read moreread less

Abstract: The Levenshtein distance between two words is the minimal number of insertions, deletions or substitutions that are needed to transform one word into the other. Levenshtein automata of degree n for a word W are defined as finite state automata that recognize the set of all words V where the Levenshtein distance between V and W does not exceed n. We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein automaton of degree n for W in time linear to the length of W. Given an electronic dictionary that is implemented in the form of a trie or a finite state automaton, the Levenshtein automaton for W can be used to control search in the lexicon in such a way that exactly the lexical words V are generated where the Levenshtein distance between V and W does not exceed the given bound. This leads to a very fast method for correcting corrupted input words of unrestricted text using large electronic dictionaries. We then introduce a second method that avoids the explicit computation of Levenshtein automata and leads to even improved efficiency. Evaluation results are given that also address variants of both methods that are based on modified Levenshtein distances where further primitive edit operations (transpositions, merges and splits) are used.

...read moreread less

192 citations

Journal Article•DOI•

Approximate String Matching: A Simpler Faster Algorithm

[...]

Richard Cole, Ramesh Hariharan¹•Institutions (1)

Indian Institute of Science¹

01 Jun 2002-SIAM Journal on Computing

TL;DR: Two algorithms for finding all approximate matches of a pattern in a text, where the edit distance between the pattern and the matching text substring is at most k, are given.

...read moreread less

Abstract: We give two algorithms for finding all approximate matches of a pattern in a text, where the edit distance between the pattern and the matching text substring is at most k. The first algorithm, which is quite simple, runs in time $O(\frac{nk^3}{m}+n+m)$ on all patterns except k-break periodic strings (defined later). The second algorithm runs in time $O(\frac{nk^4}{m}+n+m)$ on k-break periodic patterns. The two classes of patterns are easily distinguished in O(m)time.

...read moreread less

126 citations

Proceedings Article•DOI•

Unsupervised discovery of morphologically related words based on orthographic and semantic similarity

[...]

Marco Baroni, Johannes Matiasek, Harald Trost

11 Jul 2002

TL;DR: In this paper, an algorithm that takes an unannotated corpus as its input, and returns a ranked list of probable morphologically related pairs as its output is presented, where orthographic similarity is measured in terms of minimum edit distance, and semantic similarity is calculated by mutual information.

...read moreread less

Abstract: We present an algorithm that takes an unannotated corpus as its input, and returns a ranked list of probable morphologically related pairs as its output. The algorithm tries to discover morphologically related pairs by looking for pairs that are both orthographically and semantically similar, where orthographic similarity is measured in terms of minimum edit distance, and semantic similarity is measured in terms of mutual information. The procedure does not rely on a morpheme concatenation model, nor on distributional properties of word substrings (such as affix frequency). Experiments with German and English input give encouraging results, both in terms of precision (proportion of good pairs found at various cutoff points of the ranked list), and in terms of a qualitative analysis of the types of morphological patterns discovered by the algorithm.

...read moreread less

109 citations

Journal Article•

Edit distance with move operations

[...]

Dana Shapira¹, James A. Storer¹•Institutions (1)

Brandeis University¹

01 Jan 2002-Lecture Notes in Computer Science

TL;DR: This work considers the more general problem of strings being represented by a singly linked list and being able to apply these operations to the pointer associated with a vertex as well as the character associated with the vertex, and shows that this problem is NP-complete.

...read moreread less

Abstract: The traditional edit-distance problem is to find the minimum number of insert-character and delete-character (and sometimes change character) operations required to transform one string into another. Here we consider the more general problem of strings being represented by a singly linked list (one character per node) and being able to apply these operations to the pointer associated with a vertex as well as the character associated with the vertex. That is, in O(1) time, not only can characters be inserted or deleted, but also substrings can be moved or deleted. We limit our attention to the ability to move substrings and leave substring deletions for future research. Note that O(1) time substring move operations imply O(1) substring exchange operations as well, a form of transformation that has been of interest in molecular biology. We show that this problem is NP-complete, show that a recursive sequence of moves can be simulated with at most a constant factor increase by a non-recursive sequence, and present a polynomial time greedy algorithm for non-recursive moves with a worst-case log factor approximation to optimal. The development of this greedy algorithm shows how to reduce moves of substrings to moves of characters, and how to convert moves with characters to only insert and deletes of characters.

...read moreread less

106 citations

Journal Article•DOI•

Evaluating the performance of table processing algorithms

[...]

Jianying Hu¹, Ramanujan S. Kashi¹, Daniel P. Lopresti², Gordon Wilfong²•Institutions (2)

Avaya¹, Alcatel-Lucent²

01 Mar 2002-International Journal on Document Analysis and Recognition

TL;DR: An intuitive, easy-to-implement evaluation schemes for the related problems of table detection and table structure recognition are introduced and a new paradigm, “graph probing,” is described for comparing the results returned by the recognition system and the representation created during ground-truthing.

...read moreread less

Abstract: While techniques for evaluating the performance of lower-level document analysis tasks such as optical character recognition have gained acceptance in the literature, attempts to formalize the problem for higher-level algorithms, while receiving a fair amount of attention in terms of theory, have generally been less successful in practice, perhaps owing to their complexity. In this paper, we introduce intuitive, easy-to-implement evaluation schemes for the related problems of table detection and table structure recognition. We also present the results of several small experiments, demonstrating how well the methodologies work and the useful sorts of feedback they provide. We first consider the table detection problem. Here algorithms can yield various classes of errors, including non-table regions improperly labeled as tables (insertion errors), tables missed completely (deletion errors), larger tables broken into a number of smaller ones (splitting errors), and groups of smaller tables combined to form larger ones (merging errors). This leads naturally to the use of an edit distance approach for assessing the results of table detection. Next we address the problem of evaluating table structure recognition. Our model is based on a directed acyclic attribute graph, or table DAG. We describe a new paradigm, “graph probing,” for comparing the results returned by the recognition system and the representation created during ground-truthing. Probing is in fact a general concept that could be applied to other document recognition tasks as well.

...read moreread less

92 citations

Proceedings Article•DOI•

The Influence of Minimum Edit Distance on Reference Resolution

[...]

Michael Strube, Stefan Rapp, Christoph Müller

06 Jul 2002

TL;DR: A cheap, language and domain independent feature based on the minimum edit distance between strings yielded a significant improvement for data sets consisting of definite noun phrases and proper names, respectively.

...read moreread less

Abstract: We report on experiments in reference resolution using a decision tree approach. We started with a standard feature set used in previous work, which led to moderate results. A closer examination of the performance of the features for different forms of anaphoric expressions showed good results for pronouns, moderate results for proper names, and poor results for definite noun phrases. We then included a cheap, language and domain independent feature based on the minimum edit distance between strings. This feature yielded a significant improvement for data sets consisting of definite noun phrases and proper names, respectively. When applied to the whole data set the feature produced a smaller but still significant improvement.

...read moreread less

90 citations

Proceedings Article•DOI•

The string edit distance matching problem with moves

[...]

Graham Cormode¹, S. Muthukrishnan²•Institutions (2)

University of Warwick¹, AT&T²

06 Jan 2002

TL;DR: In this article, a significantly subquadratic algorithm for string edit distance matching with nontrivial alignments is presented. But the algorithm requires O(log n log*n) time to compute the edit distance.

...read moreread less

Abstract: The edit distance between two strings S and R is defined to be the minimum number of character inserts, deletes and changes needed to convert R to S. Given a text string t of length n, and a pattern string p of length m, informally, the string edit distance matching problem is to compute the smallest edit distance between p and substrings of t. A well known dynamic programming algorithm takes time O(nm) to solve this problem, and it is an important open problem in Combinatorial Pattern Matching to significantly improve this bound.We relax the problem so that (a) we allow an additional operation, namely, substring moves, and (b) we approximate the string edit distance upto a factor of O(log n log*n). Our result is a near linear time deterministic algorithm for this version of the problem. This is the first known significantly subquadratic algorithm for a string edit distance problem in which the distance involves nontrivial alignments. Our results are obtained by embedding strings into L1 vector space using a simplified parsing technique we call Edit Sensitive Parsing (ESP). This embedding is approximately distance preserving, and we show many applications of this embedding to string proximity problems including nearest neighbors, outliers, and streaming computations with strings.

...read moreread less

82 citations

Proceedings Article•

A Bit-Vector Algorithm for Computing Levenshtein and Damerau Edit Distances.

[...]

Heikki Hyyrö¹•Institutions (1)

University of Tampere¹

01 Jan 2002

TL;DR: The structure of the algorithm is such, that in practice it is mostly suitable for testing whether the edit distance between two strings is within some pre-determined error threshold and works faster than the original algorithm of Myers.

...read moreread less

Abstract: The edit distance between strings A and B is defined as the minimum number of edit operations needed in converting A into B or vice versa. The Levenshtein edit distance allows three types of operations: an insertion, a deletion or a substitution of a character. The Damerau edit distance allows the previous three plus in addition a transposition between two adjacent characters. To our best knowledge the best current practical algorithms for computing these edit distances run in time O(dm) and O(⌈m/w⌉(n + σ)), where d is the edit distance between the two strings, m and n are their lengths (m ≤ n), w is the computer word size and σ is the size of the alphabet. In this paper we present an algorithm that runs in time O(⌈d/w⌉m + ⌈n/w⌉σ) or O(⌈d/w⌉n + ⌈m/w⌉σ). The structure of the algorithm is such, that in practice it is mostly suitable for testing whether the edit distance between two strings is within some pre-determined error threshold. We also present some initial test results with thresholded edit distance computation. In them our algorithm works faster than the original algorithm of Myers.

...read moreread less

61 citations

Book Chapter•DOI•

Flexible Pattern Matching in Strings: Approximate matching

[...]

Gonzalo Navarro, Mathieu Raffinot

01 Jan 2002

Proceedings Article•

A Probabilistic Model of Melodic Similarity

[...]

Ning Hu, Roger B. Dannenberg, Ann L. Lewis

01 Jan 2002

TL;DR: This approach computes an “edit distance” as a measure of melodic dissimilarity and demonstrates how it can be used to search a database of melodies.

...read moreread less

Abstract: Melodic similarity is an important concept for music databases, musicological studies, and interactive music systems. Dynamic programming is commonly used to compare melodies, often with a distance function based on pitch differences measured in semitones. This approach computes an “edit distance” as a measure of melodic dissimilarity. The problem can also be viewed in probabilistic terms: What is the probability that a melody is a “mutation” of another melody, given a table of mutation probabilities? We explain this approach and demonstrate how it can be used to search a database of melodies. Our experiments show that the probabilistic model performs better than a typical “edit distance” comparison.

...read moreread less

Book Chapter•DOI•

String Matching with Metric Trees Using an Approximate Distance

[...]

Ilaria Bartolini¹, Paolo Ciaccia¹, Marco Patella¹•Institutions (1)

University of Bologna¹

11 Sep 2002

TL;DR: This paper investigates the performance of metric trees, namely the M-tree, when they are extended using a cheap approximate distance function as a filter to quickly discard irrelevant strings, and shows an improvement in performance up to 90% with respect to the basic case.

...read moreread less

Abstract: Searching in a large data set those strings that are more similar, according to the edit distance, to a given one is a time-consuming process. In this paper we investigate the performance of metric trees, namely the M-tree, when they are extended using a cheap approximate distance function as a filter to quickly discard irrelevant strings. Using the bag distance as an approximation of the edit distance, we show an improvement in performance up to 90% with respect to the basic case. This, along with the fact that our solution is independent on both the distance used in the pre-test and on the underlying metric index, demonstrates that metric indices are a powerful solution, not only for many modern application areas, as multimedia, data mining and pattern recognition, but also for the string matching problem.

...read moreread less

Book Chapter•DOI•

A Metric Index for Approximate String Matching

[...]

Edgar Chávez¹, Gonzalo Navarro²•Institutions (2)

Universidad Michoacana de San Nicolás de Hidalgo¹, University of Chile²

03 Apr 2002

TL;DR: A radically new indexing approach for approximate string matching where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metric space.

...read moreread less

Abstract: We present a radically new indexing approach for approximate string matching. The scheme uses the metric properties of the edit distance and can be applied to any other metric between strings. We build a metric space where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metric space. This permits us finding the R occurrences of a pattern of length m in a text of length n in average time O(mlog2 n+m2+R), using O(n log n) space and O(n log2 n) index construction time. This complexity improves by far over all other previous methods. We also show a simpler scheme needing O(n) space.

...read moreread less

Patent•

Method of performing approximate substring indexing

[...]

H. V. Jagadish¹, Nikolaos Koudas¹, S. Muthukrishnan¹, Divesh Srivastava¹•Institutions (1)

AT&T¹

17 Jun 2002

TL;DR: The authors decompose each string in a database into overlapping "positional q-grams", sequences of a predetermined length q, and contain information regarding the position of each qgram within the string.

...read moreread less

Abstract: Approximate substring indexing is accomplished by decomposing each string in a database into overlapping “positional q-grams”, sequences of a predetermined length q, and containing information regarding the “position” of each q-gram within the string (i.e., 1 st q-gram, 4 th q-gram, etc.). An index is then formed of the tuples of the positional q-gram data (such as, for example, a B-tree index or a hash index). Each query applied to the database is similarly parsed into a plurality of positional q-grams (of the same length), and a candidate set of matches is found. Position-directed filtering is used to remove the candidates which have the q-grams in the wrong order and/or too far apart to form a “verified” output of matching candidates. If errors are permitted (defined in terms of an edit distance between each candidate and the query), an edit distance calculation can then be performed to produce the final set of matching strings.

...read moreread less

Journal Article•DOI•

Edit distance of run-length encoded strings

[...]

Ora Arbell¹, Gad M. Landau¹, Joseph S. B. Mitchell²•Institutions (2)

University of Haifa¹, State University of New York System²

30 Sep 2002-Information Processing Letters

TL;DR: A simple O(|X|l + |Y|k) time algorithm that computes X and Y, two run-length encoded strings, of encoded lengths k and l, respectively.

...read moreread less

Proceedings Article•DOI•

Edit-distance of weighted automata

[...]

Mehryar Mohri¹•Institutions (1)

AT&T Labs¹

03 Jul 2002

TL;DR: In this paper, the authors define the edit distance of two distributions of strings given by two weighted automata and present a synchronization algorithm for weighted transducers which, combined with ǫ-removal, can be used to normalize weighted automaton with bounded delays.

...read moreread less

Abstract: The edit-distance of two strings is the minimal cost of a sequence of symbol insertions, deletions, or substitutions transforming one string into the other. The definition is used in various contexts to give a measure of the difference or similarity between two strings. This definition can be extended to measure the similarity between two sets of strings. In particular, when these sets are represented by automata, their edit-distance can be computed using the general algorithm of composition of weighted transducers combined with a single-source shortest-paths algorithm. More generally, in some applications such as speech recognition and computational biology, the strings may represent a range of alternative hypotheses with associated probabilities. Thus, we introduce the definition of the edit-distance of two distributions of strings given by two weighted automata. We show that general weighted automata algorithms over the appropriate semirings can be used to compute the edit-distance of two weighted automata exactly. The algorithm for computing exactly the edit-distance of weighted automata can be used to improve the word accuracy of automatic speech recognition systems. More generally, the algorithm can be extended to provide an edit-distance automaton useful for rescoring and other post-processing purposes in the context of large-vocabulary speech recognition. In the course of the presentation of our algorithm, we also introduce a new and general synchronization algorithm for weighted transducers which, combined with Ɛ-removal, can be used to normalize weighted transducers with bounded delays.

...read moreread less

Proceedings Article•

Two Approaches to Handling Noisy Variation in Text Mining

[...]

Un Yong Nahm and Mikhail Bilenko and Raymond J. Mooney

01 Jan 2002

TL;DR: A domain-independent two-level method for improving duplicate detection accuracy based on machine learning and an algorithm that discovers association rules by allowing partial matching of items based on a textual similarity metric such as edit distance or cosine similarity are presented.

...read moreread less

Abstract: Variation and noise in textual database entries can prevent text mining algorithms from discovering important regularities. We present two novel methods to cope with this problem: (1) an adaptive approach to “hardening” noisy databases by identifying duplicate records, and (2) mining “soft” association rules. For identifying approximately duplicate records, we present a domain-independent two-level method for improving duplicate detection accuracy based on machine learning. For mining soft matching rules, we introduce an algorithm that discovers association rules by allowing partial matching of items based on a textual similarity metric such as edit distance or cosine similarity. Experimental results on real and synthetic datasets show that our methods outperform traditional techniques for noisy textual databases.

...read moreread less

Proceedings Article•DOI•

Parallel dynamic programming for solving the string editing problem on a CGM/BSP

[...]

Carlos Eduardo Rodrigues Alves, Edson Norberto Cáceres, Frank Dehne¹•Institutions (1)

Carleton University¹

10 Aug 2002

TL;DR: A coarse-grained parallel algorithm for solving the string edit distance problem for a string A and all substrings of a string C and is the first efficient CGM/BSP algorithm for the alignment of all sub strings of C with A.

...read moreread less

Abstract: In this paper we present a coarse-grained parallel algorithm for solving the string edit distance problem for a string A and all substrings of a string C. Our method is based on a novel CGM/BSP parallel dynamic programming technique for computing all highest scoring paths in a weighted grid graph. The algorithm requires \log p rounds/supersteps and O(\fracn^2p\log m) local computation, where $p$ is the number of processors, p^2 \leq m \leq n. To our knowledge, this is the first efficient CGM/BSP algorithm for the alignment of all substrings of C with A. Furthermore, the CGM/BSP parallel dynamic programming technique presented is of interest in its own right and we expect it to lead to other parallel dynamic programming methods for the CGM/BSP.

...read moreread less

Book Chapter•DOI•

Faster Bit-Parallel Approximate String Matching

[...]

Heikki Hyyrö¹, Gonzalo Navarro²•Institutions (2)

University of Tampere¹, University of Chile²

03 Jul 2002

TL;DR: This paper shows that the faster algorithm of Myers can be adapted to support all the required operations for approximate string matching, and involves extending it to compute edit distance, to search for any pattern suffix, and to detect in advance the impossibility of a later match.

...read moreread less

Abstract: We present a new bit-parallel technique for approximate string matching. We build on two previous techniques. The first one [Myers, J. of the ACM, 1999], searches for a pattern of length m in a text of length n permitting k differences in O(mn/w) time, where w is the width of the computer word. The second one [Navarro and Raffinot, ACM JEA, 2000], extends a sublinear-time exact algorithm to approximate searching. The latter technique makes use of an O(kmn/w) time algorithm [Wu and Manber, Comm. ACM, 1992] for its internal workings. This algorithm is slow but flexible enough to support all the required operations. In this paper we show that the faster algorithm of Myers can be adapted to support all those operations. This involves extending it to compute edit distance, to search for any pattern suffix, and to detect in advance the impossibility of a later match. The result is an algorithm that performs better than the original version of Navarro and Raffinot and that is the fastest for several combinations of m, k and alphabet sizes that are useful, for example, in natural language searching and computational biology.

...read moreread less

Book Chapter•DOI•

String Edit Distance, Random Walks and Graph Matching

[...]

Antonio Robles-Kelly¹, Edwin R. Hancock¹•Institutions (1)

University of York¹

06 Aug 2002-Lecture Notes in Computer Science

TL;DR: This paper shows how the eigenstructure of the adjacency matrix can be used for the purposes of robust graph-matching, by finding the sequence of string edit operations which minimise edit distance.

...read moreread less

Abstract: This paper shows how the eigenstructure of the adjacency matrix can be used for the purposes of robust graph-matching. We commence from the observation that the leading eigenvector of a transition probability matrix is the steady state of the associated Markov chain. When the transition matrix is the normalised adjacency matrix of a graph, then the leading eigenvector gives the sequence of nodes of the steady state random walk on the graph. We use this property to convert the nodes in a graph into a string where the node-order is given by the sequence of nodes visited in the random walk. We match graphs represented in this way, by finding the sequence of string edit operations which minimise edit distance.

...read moreread less

Journal Article•DOI•

On the Weighted Mean of a Pair of Strings

[...]

Horst Bunke¹, Xiaoyi Jiang¹, Karin Abegglen¹, Abraham Kandel²•Institutions (2)

University of Bern¹, University of South Florida²

01 May 2002-Pattern Analysis and Applications

TL;DR: The weighted mean of a pair of strings is introduced, formal properties of the weighted mean are shown, a procedure for its computation is described, and practical examples are given.

...read moreread less

Abstract: String matching and string edit distance are fundamental concepts in structural pattern recognition. In this paper, the weighted mean of a pair of strings is introduced. Given two strings, x and y, where d(x, y) is the edit distance of x and y, the weighted mean of x and y is a string z that has edit distances d(x, z) and d(z, y)to x and y, respectively, such that d(x, z) _ d(z, y) = d(x, y). We'll show formal properties of the weighted mean, describe a procedure for its computation, and give practical examples.

...read moreread less

Journal Article•

String matching with metric trees using an approximate distance

[...]

Ilaria Bartolini, Paolo Ciaccia, Marco Patella

01 Jan 2002-Lecture Notes in Computer Science

TL;DR: In this article, the authors investigated the performance of metric trees, namely the M-tree, when they are extended using a cheap approximate distance function as a filter to quickly discard irrelevant strings.

...read moreread less

Journal Article•

Faster bit-parallel approximate string matching

[...]

Heikki Hyyrö¹, Gonzalo Navarro²•Institutions (2)

University of Tampere¹, University of Chile²

01 Jan 2002-Lecture Notes in Computer Science

TL;DR: In this paper, a bit-parallel algorithm for approximate string matching is presented, which can be adapted to support edit distance, search for any pattern suffix, and detect in advance the impossibility of a later match.

...read moreread less

Book Chapter•DOI•

Faster String Matching with Super-Alphabets

[...]

Kimmo Fredriksson¹•Institutions (1)

University of Helsinki¹

11 Sep 2002

TL;DR: This paper shows how to obtain an O(n/m) average time string matching algorithm, using a super-alphabet for simulating suffix automaton and adopting a similar technique to the shift-or algorithm, extending its bit-parallelism in another direction.

...read moreread less

Abstract: Given a text T[1 . . . n] and a pattern P[1 . . . m] over some alphabet ? of size ?, finding the exact occurrences of P in T requires at least ?(n log? m/m) character comparisons on average, as shown in [19]. Consequently, it is believed that this lower bound implies also an ?(n log? m/m) lower bound for the execution time of an optimal algorithm. However, in this paper we show how to obtain an O(n/m) average time algorithm. This is achieved by slightly changing the model of computation, and with a modification of an existing algorithm. Our technique uses a super-alphabet for simulating suffix automaton. The space usage of the algorithm is O(?m). The technique can be applied to many other string matching algorithms, including dictionary matching, which is also solved in expected time O(n/m), and approximate matching allowing k edit operations (mismatches, insertions or deletions of characters). This is solved in expected time O(nk/m) for k ? O(m/log? m). The known lower bound for this problem is ?(n(k + log? m)/m), given in [6]. Finally we show how to adopt a similar technique to the shift-or algorithm, extending its bit-parallelism in another direction. This gives a speed-up by a factor s, where s is the number of characters processed simultaneously. Some of the algorithms are implemented, and we show that the methods work well in practice too. This is especially true for the shift-or algorithm, which in some cases works faster than predicted by the theory. The result is the fastest known algorithm for exact string matching for short patterns and small alphabets. All the methods and analyses assume the RAM model of computation, and that each symbol is coded in b = ?log2 ?? bits. They work for larger b too, but the speed-up is decreased.

...read moreread less

Proceedings Article•DOI•

Using contextual spelling correction to improve retrieval effectiveness in degraded text collections

[...]

Patrick Ruch¹•Institutions (1)

École Polytechnique Fédérale de Lausanne¹

24 Aug 2002

TL;DR: The study presented relies on the design and evaluation of an improved IR system susceptible to cope with textual misspellings, and compares the improvement brought to the engine by the adjunction of two different non-interactive spelling correction strategies.

...read moreread less

Abstract: The study presented relies on the design and evaluation of an improved IR system susceptible to cope with textual misspellings. After selecting an optimal weighting scheme for the engine, we evaluate the effect of misspellings on the retrieval effectiveness. Then, we compare the improvement brought to the engine by the adjunction of two different non-interactive spelling correction strategies: a classical one, based on a string-to-string edit distance calculus, and a contextual one, which adds linguistically-motivated features to the string distance module. The results for the latter suggest that average precision in degraded texts can be reduced to a few percents (4%).

...read moreread less

Journal Article•DOI•

On-Line Approximate String Searching Algorithms: Survey and Experimental Results

[...]

Panagiotis D. Michailidis¹, Konstantinos G. Margaritis¹•Institutions (1)

University of Macedonia¹

01 Jan 2002-International Journal of Computer Mathematics

TL;DR: This paper presents a short survey and experimental results for well known sequential approximate string searching algorithms based on different approaches including dynamic programming, deterministic finite automata, filtering, counting and bit parallelism.

...read moreread less

Abstract: The problem of approximate string searching comprises two classes of problems: string searching with k mismatches and string searching with k differences. In this paper we present a short survey and experimental results for well known sequential approximate string searching algorithms. We consider algorithms based on different approaches including dynamic programming, deterministic finite automata, filtering, counting and bit parallelism. We compare these algorithms in terms of running time against pattern length and for several values of k for four different kinds of text: binary alphabet, alphabet of size 8, English alphabet and DNA alphabet. Finally, we compare the experimental results of the algorithms with their theoretical complexities.

...read moreread less

Book Chapter•DOI•

Derivation of L-system Models from Measurements of Biological Branching Structures Using Genetic Algorithms

[...]

Bian Runqiang¹, Bian Runqiang², Yi-Ping Phoebe Chen³, Yi-Ping Phoebe Chen², Kevin Burrage², Jim Hanan², P. M. Room², John A. Belward² - Show less +4 more•Institutions (3)

Tianjin University of Technology¹, University of Queensland², Queensland University of Technology³

17 Jun 2002

TL;DR: An algorithm is proposed for automatic L-system translation that compares randomly generated branching structures with the target structure and Edit distance, which is proposed as a measure of dissimilarity between rooted trees, is extended for the comparison of structures represented in axial trees.

...read moreread less

Abstract: L-systems are widely used in the modelling of branching structures and the growth process of biological objects such as plants, nerves and airways in lungs. The derivation of such L-system models involves a lot of hard mental work and time-consuming manual procedures. A method based on genetic algorithms for automating the derivation of L-systems is presented here. The method involves representation of branching structure, translation of L-systems to axial tree architectures, comparison of branching structure and the application of genetic algorithms. Branching structures are represented as axial trees and positional information is considered as an important attribute along with length and angle in the database configuration of branches. An algorithm is proposed for automatic L-system translation that compares randomly generated branching structures with the target structure. Edit distance, which is proposed as a measure of dissimilarity between rooted trees, is extended for the comparison of structures represented in axial trees and positional information is involved in the local cost function. Conventional genetic algorithms and repair mechanics are employed in the search for L-system models having the best fit to observational data.

...read moreread less

Journal Article•DOI•

Distances between languages and reflexivity of relations

[...]

Christian Choffrut¹, Giovanni Pighizzini²•Institutions (2)

University of Paris¹, University of Milan²

06 Sep 2002

TL;DR: The notion of distance between subsets to that of "almost reflexivity" of relations over strings is extended, intuitively a relation is almost reflexive if every element of its domain is in relation with some "close" element in its range and vice versa.

...read moreread less

Abstract: We extend the Hamming, edit, prefix, suffix and subword distances between strings to subsets of strings. We show that computing these distances between two rational subsets reduces to computing the weight of an automaton "with distance function" as introduced by Hashiguchi (this latter notion of distance has nothing to do with our notion). We make a step further by extending the notion of distance between subsets to that of "almost reflexivity" of relations over strings: intuitively a relation is almost reflexive if every element of its domain is in relation with some "close" element in its range and vice versa. Various properties connected to almost reflexivity are investigated. With two exceptions, their decidability status relative to the five notions of distances is settled for the three families of recognizable, synchronous and deterministic relations.

...read moreread less

Journal Article•DOI•

Shape recognition using attributed string matching with polygon vertices as the primitives

[...]

Serkan Kaygin¹, Mehmet Bulut¹•Institutions (1)

Middle East Technical University¹

01 Jan 2002-Pattern Recognition Letters

TL;DR: The vertices of the polygons are suggested as the primitives of the attributed strings so that the benefits of split and merge operations are placed in the dynamic programming algorithm for the edit distance evaluation without an extra computation-cost.

...read moreread less