Showing papers on "Approximate string matching" published in 2002


Proceedings Article
01 Jan 2002
TL;DR: A new algorithm for matching discrete objects such as strings and trees in linear time is presented, obviating dynamic programming with quadratic time complexity; this improvement on currently available algorithms makes string kernels a viable alternative for the practitioner.
Abstract: In this paper we present a new algorithm suitable for matching discrete objects such as strings and trees in linear time, thus obviating dynamic programming with quadratic time complexity. Furthermore, prediction cost in many cases can be reduced to linear cost in the length of the sequence to be classified, regardless of the number of support vectors. This improvement on the currently available algorithms makes string kernels a viable alternative for the practitioner.

354 citations


Journal ArticleDOI
TL;DR: Two polynomial-time approximation algorithms with approximation ratio $1 + \epsilon$ for any small $\epsilon$ are presented, settling both the Closest String problem and the Closest Substring problem.
Abstract: The problem of finding a center string that is "close" to every given string arises in computational molecular biology and coding theory. This problem has two versions: the Closest String problem and the Closest Substring problem. Given a set of strings $S = \{s_1, s_2, \ldots, s_n\}$, each of length m, the Closest String problem is to find the smallest d and a string s of length m which is within Hamming distance d of each $s_i \in S$. This problem comes from coding theory when we are looking for a code not too far away from a given set of codes. The Closest Substring problem, with an additional input integer L, asks for the smallest d and a string s, of length L, which is within Hamming distance d of a substring, of length L, of each $s_i$. This problem is much more elusive than the Closest String problem. The Closest Substring problem is formulated from applications in finding conserved regions, identifying genetic drug targets and generating genetic probes in molecular biology. Whether there are efficient approximation algorithms for both problems are major open questions in this area. We present two polynomial-time approximation algorithms with approximation ratio $1 + \epsilon$ for any small $\epsilon$ to settle both questions.

219 citations
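
To make the Closest String objective above concrete, here is a minimal Python sketch that evaluates the Hamming radius of a candidate center and finds an optimal center by exhaustive search. It is a brute-force illustration of the problem definition only, not the polynomial-time approximation scheme from the paper; the function names and the explicit `alphabet` parameter are assumptions made for this example.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def radius(center: str, strings: list[str]) -> int:
    """Largest Hamming distance from the candidate center to any input string."""
    return max(hamming(center, s) for s in strings)


def closest_string_bruteforce(strings: list[str], alphabet: str) -> tuple[int, str]:
    """Try every string of length m over the alphabet (exponential in m).

    Returns (d, s): the smallest radius d and one center s achieving it.
    """
    m = len(strings[0])
    best = None
    for cand in product(alphabet, repeat=m):
        center = "".join(cand)
        d = radius(center, strings)
        if best is None or d < best[0]:
            best = (d, center)
    return best
```

The exhaustive search is only feasible for very short strings, which is exactly why approximation algorithms for this problem are of interest.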


Proceedings ArticleDOI
19 May 2002
TL;DR: The crucial new idea underlying the first three results above is that of confirming matches by convolving vectors obtained by coding characters in the alphabet with non-boolean entries; in contrast, almost all previous pattern matching algorithms consider only boolean codes for the alphabet.
Abstract: This paper obtains the following results on pattern matching problems in which the text has length n and the pattern has length m. (1) An $O(n \log m)$ time deterministic algorithm for the String Matching with Wildcards problem, even when the alphabet is large. (2) An $O(k \log^2 m)$ time Las Vegas algorithm for the Sparse String Matching with Wildcards problem, where $k \ll n$ is the number of non-zeros in the text. We also give Las Vegas algorithms for the higher dimensional version of this problem. (3) As an application of the above, an $O(n \log^2 m)$ time Las Vegas algorithm for the Subset Matching and Tree Pattern Matching problems, and a Las Vegas algorithm for the Geometric Pattern Matching problem. (4) Finally, an $O(n \log^2 m)$ time deterministic algorithm for Subset Matching and Tree Pattern Matching. The crucial new idea underlying the first three results above is that of confirming matches by convolving vectors obtained by coding characters in the alphabet with non-boolean (i.e., rational or even complex) entries; in contrast, almost all previous pattern matching algorithms consider only boolean codes for the alphabet. The crucial new idea underlying the fourth result is a simpler method of shifting characters which ensures that each character occurs as a singleton in some shift.

159 citations


Journal ArticleDOI
TL;DR: Two algorithms for finding all approximate matches of a pattern in a text, where the edit distance between the pattern and the matching text substring is at most k, are given.
Abstract: We give two algorithms for finding all approximate matches of a pattern in a text, where the edit distance between the pattern and the matching text substring is at most k. The first algorithm, which is quite simple, runs in time $O(\frac{nk^3}{m}+n+m)$ on all patterns except k-break periodic strings (defined later). The second algorithm runs in time $O(\frac{nk^4}{m}+n+m)$ on k-break periodic patterns. The two classes of patterns are easily distinguished in $O(m)$ time.

126 citations


Journal ArticleDOI
TL;DR: In this paper, the authors introduce two new notions of approximate matching with application in computer assisted music analysis, and present algorithms for each notion of approximation: for approximate string matching and for computing approximate squares.
Abstract: Here we introduce two new notions of approximate matching with application in computer assisted music analysis. We present algorithms for each notion of approximation: for approximate string matching and for computing approximate squares.

102 citations


Proceedings ArticleDOI
06 Jan 2002
TL;DR: A near-linear-time deterministic algorithm is presented for a relaxed string edit distance matching problem that allows substring moves and approximates the distance up to a factor of O(log n log* n); it is the first known significantly subquadratic algorithm for a string edit distance problem involving nontrivial alignments.
Abstract: The edit distance between two strings S and R is defined to be the minimum number of character inserts, deletes and changes needed to convert R to S. Given a text string t of length n, and a pattern string p of length m, informally, the string edit distance matching problem is to compute the smallest edit distance between p and substrings of t. A well known dynamic programming algorithm takes time O(nm) to solve this problem, and it is an important open problem in Combinatorial Pattern Matching to significantly improve this bound. We relax the problem so that (a) we allow an additional operation, namely, substring moves, and (b) we approximate the string edit distance up to a factor of O(log n log* n). Our result is a near linear time deterministic algorithm for this version of the problem. This is the first known significantly subquadratic algorithm for a string edit distance problem in which the distance involves nontrivial alignments. Our results are obtained by embedding strings into L1 vector space using a simplified parsing technique we call Edit Sensitive Parsing (ESP). This embedding is approximately distance preserving, and we show many applications of this embedding to string proximity problems including nearest neighbors, outliers, and streaming computations with strings.

82 citations
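
The O(nm) dynamic programming baseline mentioned in the abstract above can be sketched as follows. This is the classical column-by-column recurrence for the smallest edit distance between a pattern and any substring of a text, not the Edit Sensitive Parsing method of the paper; the function name is an assumption made for this example.

```python
def min_substring_edit_distance(pattern: str, text: str) -> int:
    """Smallest edit distance between `pattern` and any substring of `text`.

    col[i] holds the minimum edit distance between pattern[:i] and some
    substring of the text ending at the current text position. Row 0 stays 0
    because a match may start anywhere in the text.
    """
    m = len(pattern)
    col = list(range(m + 1))   # column for the empty text prefix
    best = col[m]
    for ch in text:
        prev_diag = col[0]     # value of col[i-1] from the previous column
        col[0] = 0
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == ch else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra text character
                         col[i - 1] + 1)    # unmatched pattern character
            prev_diag = old
        best = min(best, col[m])
    return best
```

Each text character updates one column of m + 1 cells, which gives the O(nm) time bound with O(m) space.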


Journal ArticleDOI
TL;DR: The three new algorithms for on‐line multiple string matching allowing errors are the first to allow more errors, and are faster than previous work for a moderate number of patterns (e.g. less than 50–100 on English text, depending on the pattern length).
Abstract: We present three new algorithms for on-line multiple string matching allowing errors. These are extensions of previous algorithms that search for a single pattern. The average running time achieved is in all cases linear in the text size for moderate error level, pattern length, and number of patterns. They adapt (with higher costs) to the other cases. However, the algorithms differ in speed and thresholds of usefulness. We theoretically analyze when each algorithm should be used, and show their performance experimentally. The only previous solution for this problem allows only one error. Our algorithms are the first to allow more errors, and are faster than previous work for a moderate number of patterns (e.g. less than 50-100 on English text, depending on the pattern length).

81 citations


Journal ArticleDOI
11 Nov 2002
TL;DR: It is shown experimentally that suffix trees can be effectively used in approximate string matching with biological data and the requirements for further database and algorithmic research to support efficient use of large suffix indexes in biological applications are detailed.
Abstract: Our aim is to develop new database technologies for the approximate matching of unstructured string data using indexes. We explore the potential of the suffix tree data structure in this context. We present a new method of building suffix trees, allowing us to build trees in excess of RAM size, which has hitherto not been possible. We show that this method performs in practice as well as the O(n) method of Ukkonen [70]. Using this method we build indexes for 200 Mb of protein and 300 Mbp of DNA, whose disk-image exceeds the available RAM. We show experimentally that suffix trees can be effectively used in approximate string matching with biological data. For a range of query lengths and error bounds the suffix tree reduces the size of the unoptimised O(mn) dynamic programming calculation required in the evaluation of string similarity, and the gain from indexing increases with index size. In the indexes we built this reduction is significant, and less than 0.3% of the expected matrix is evaluated. We detail the requirements for further database and algorithmic research to support efficient use of large suffix indexes in biological applications.

76 citations


Proceedings ArticleDOI
06 Jan 2002
TL;DR: A randomized algorithm for the string matching with don't cares problem is presented and is simpler and slightly faster than the previous algorithms.
Abstract: We present a randomized algorithm for the string matching with don't cares problem. Based on the simple fingerprint method of Karp and Rabin for ordinary string matching [4], our algorithm runs in time O(n log m) for a text of length n and a pattern of length m and is simpler and slightly faster than the previous algorithms [3, 5, 1].

70 citations



Book ChapterDOI
11 Sep 2002
TL;DR: This paper investigates the performance of metric trees, namely the M-tree, when they are extended using a cheap approximate distance function as a filter to quickly discard irrelevant strings, and shows an improvement in performance up to 90% with respect to the basic case.
Abstract: Searching a large data set for the strings that are most similar to a given one, according to the edit distance, is a time-consuming process. In this paper we investigate the performance of metric trees, namely the M-tree, when they are extended using a cheap approximate distance function as a filter to quickly discard irrelevant strings. Using the bag distance as an approximation of the edit distance, we show an improvement in performance up to 90% with respect to the basic case. This, along with the fact that our solution is independent of both the distance used in the pre-test and the underlying metric index, demonstrates that metric indices are a powerful solution, not only for many modern application areas, such as multimedia, data mining and pattern recognition, but also for the string matching problem.
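
The filtering idea above can be illustrated with a small Python sketch. It assumes the usual definition of the bag distance (the larger of the two multiset differences between the character bags), which never exceeds the edit distance, so any string whose bag distance to the query is above k can be discarded without running the expensive computation. The linear scan below stands in for the M-tree, and all function names are assumptions made for this example.

```python
from collections import Counter


def bag_distance(x: str, y: str) -> int:
    """Cheap lower bound on the edit distance, based on character multisets."""
    cx, cy = Counter(x), Counter(y)
    only_x = sum((cx - cy).values())   # characters x has in excess of y
    only_y = sum((cy - cx).values())   # characters y has in excess of x
    return max(only_x, only_y)


def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or substitution
        prev = cur
    return prev[-1]


def range_search(strings: list[str], query: str, k: int) -> list[str]:
    """Return strings within edit distance k, using the bag distance as a pre-test."""
    hits = []
    for s in strings:
        if bag_distance(s, query) > k:
            continue                    # safely discarded by the filter
        if edit_distance(s, query) <= k:
            hits.append(s)
    return hits
```

In this sketch the filter costs time linear in the string lengths, while the verification step is quadratic, so every discarded string saves a full dynamic programming computation.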

Book ChapterDOI
03 Apr 2002
TL;DR: A radically new indexing approach for approximate string matching is presented, in which the sites of a metric space are the nodes of the suffix tree of the text and the approximate query is seen as a proximity query on that metric space.
Abstract: We present a radically new indexing approach for approximate string matching. The scheme uses the metric properties of the edit distance and can be applied to any other metric between strings. We build a metric space where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metric space. This lets us find the R occurrences of a pattern of length m in a text of length n in average time $O(m \log^2 n + m^2 + R)$, using $O(n \log n)$ space and $O(n \log^2 n)$ index construction time. This complexity improves by far over all other previous methods. We also show a simpler scheme needing $O(n)$ space.

Journal Article
TL;DR: This paper considers several new versions of approximate string matching with gaps, the main characteristic of which is the existence of gaps in the matching of a given pattern in a text.
Abstract: In this paper we consider several new versions of approximate string matching with gaps. The main characteristic of these new versions is the existence of gaps in the matching of a given pattern in a text. Algorithms are devised for each version and their time and space complexities are stated. These specific versions of approximate string matching have various applications in computerized music analysis.

Patent
17 Jun 2002
TL;DR: The authors decompose each string in a database into overlapping "positional q-grams", sequences of a predetermined length q that carry information regarding the position of each q-gram within the string.
Abstract: Approximate substring indexing is accomplished by decomposing each string in a database into overlapping "positional q-grams", sequences of a predetermined length q, and containing information regarding the "position" of each q-gram within the string (i.e., 1st q-gram, 4th q-gram, etc.). An index is then formed of the tuples of the positional q-gram data (such as, for example, a B-tree index or a hash index). Each query applied to the database is similarly parsed into a plurality of positional q-grams (of the same length), and a candidate set of matches is found. Position-directed filtering is used to remove the candidates which have the q-grams in the wrong order and/or too far apart to form a "verified" output of matching candidates. If errors are permitted (defined in terms of an edit distance between each candidate and the query), an edit distance calculation can then be performed to produce the final set of matching strings.
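
A rough Python sketch of the pipeline described above: decompose strings into positional q-grams, index them in a hash table, and keep only candidates that share enough q-grams at nearby positions. The unpadded q-grams, the count threshold max(|s|, |query|) - q + 1 - k·q and the position window of k are standard q-gram filtering assumptions chosen for this example, not details taken from the patent, and all names are hypothetical.

```python
from collections import defaultdict


def positional_qgrams(s: str, q: int) -> list[tuple[int, str]]:
    """All overlapping q-grams of s, paired with their 1-based position."""
    return [(i + 1, s[i:i + q]) for i in range(len(s) - q + 1)]


def build_index(strings: list[str], q: int) -> dict:
    """Hash index: q-gram -> list of (string id, position) tuples."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for pos, gram in positional_qgrams(s, q):
            index[gram].append((sid, pos))
    return index


def candidates(query: str, strings: list[str], index: dict, q: int, k: int) -> list[int]:
    """Ids of strings that pass the count and position filters for k errors."""
    counts = defaultdict(int)
    for qpos, gram in positional_qgrams(query, q):
        # A query q-gram supports a string if that string has the same
        # q-gram at most k positions away; count it once per string.
        supported = {sid for sid, pos in index.get(gram, [])
                     if abs(pos - qpos) <= k}
        for sid in supported:
            counts[sid] += 1
    result = []
    for sid, s in enumerate(strings):
        threshold = max(len(query), len(s)) - q + 1 - k * q
        if counts[sid] >= threshold:
            result.append(sid)
    return result
```

The surviving ids are only candidates; as the abstract notes, a final edit distance calculation is still needed to confirm each of them.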

Book ChapterDOI
03 Jul 2002
TL;DR: This paper shows that the faster algorithm of Myers can be adapted to support all the required operations for approximate string matching, and involves extending it to compute edit distance, to search for any pattern suffix, and to detect in advance the impossibility of a later match.
Abstract: We present a new bit-parallel technique for approximate string matching. We build on two previous techniques. The first one [Myers, J. of the ACM, 1999], searches for a pattern of length m in a text of length n permitting k differences in O(mn/w) time, where w is the width of the computer word. The second one [Navarro and Raffinot, ACM JEA, 2000], extends a sublinear-time exact algorithm to approximate searching. The latter technique makes use of an O(kmn/w) time algorithm [Wu and Manber, Comm. ACM, 1992] for its internal workings. This algorithm is slow but flexible enough to support all the required operations. In this paper we show that the faster algorithm of Myers can be adapted to support all those operations. This involves extending it to compute edit distance, to search for any pattern suffix, and to detect in advance the impossibility of a later match. The result is an algorithm that performs better than the original version of Navarro and Raffinot and that is the fastest for several combinations of m, k and alphabet sizes that are useful, for example, in natural language searching and computational biology.
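
For orientation, the following Python sketch shows the basic bit-parallel scheme attributed to Myers that the abstract builds on: the dynamic programming column is encoded as bit vectors of vertical differences and updated with a constant number of word operations per text character. It follows the commonly published formulation of the search variant and omits the extensions described in the paper (pattern suffixes, early impossibility detection); Python integers stand in for machine words, and the function name is an assumption.

```python
def myers_search(pattern: str, text: str, k: int) -> list[int]:
    """End positions (0-based) in text where pattern matches with <= k differences."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    mask = (1 << m) - 1          # keeps every vector m bits wide
    high = 1 << (m - 1)          # bit for the last pattern row
    peq = {}                     # peq[c] has bit i set iff pattern[i] == c
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)

    pv, mv = mask, 0             # vertical +1 / -1 differences of the column
    score = m                    # value of the bottom cell of the column
    ends = []
    for j, ch in enumerate(text):
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = mv | (~(xh | pv) & mask)    # horizontal +1 differences
        mh = pv & xh                     # horizontal -1 differences
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        # Shifting in a 0 bit keeps the top row at zero, so a match may
        # start at any text position.
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)
        mv = ph & xv
        if score <= k:
            ends.append(j)
    return ends
```

With a real machine word of width w, a pattern of length m needs ⌈m/w⌉ words per vector, which is where the O(mn/w) bound quoted in the abstract comes from.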

Journal ArticleDOI
TL;DR: In the present study, feature-based matching techniques, in their classical and robust versions, are described, and an automatic method of fuzzy alignment (FA) is introduced that allows automatic matching of two gel images with different numbers of features with unknown correspondence.
Abstract: Automatic alignment (matching) of two-dimensional gel electrophoresis images is of primary interest in the evolving field of proteomics. In the present study, feature-based matching techniques, in their classical and robust versions, are described, and an automatic method of fuzzy alignment (FA) is introduced. This method allows automatic matching of two gel images with different numbers of features with unknown correspondence. Performance of FA is tested on simulated and real data sets.

Journal ArticleDOI
TL;DR: An algorithm is provided to find all maximal approximate palindromes in a string with up to k errors; for a string of size n over a fixed alphabet, it runs in $O(k^2 n)$ time.

Patent
Noriko Satoh
23 Aug 2002
TL;DR: A symbol string detection unit detects a second symbol string matching a first symbol string of predetermined length n in the input, a matching length detection unit detects the matching length k between the third symbol string following the first and the fourth symbol string following the second, and a coding unit codes the input based on both.
Abstract: A symbol string detection unit detects the second symbol string matching the first symbol string having a predetermined length n from input character strings. A matching length detection unit detects a matching length k between the third symbol string following the first symbol string and the fourth symbol string following the second symbol string. A coding unit codes an input symbol string based on the symbol string detected by the symbol string detection unit and the matching length k detected by the matching length detection unit.

Journal ArticleDOI
TL;DR: The vertices of the polygons are suggested as the primitives of the attributed strings so that the benefits of split and merge operations are placed in the dynamic programming algorithm for the edit distance evaluation without an extra computation-cost.

Journal Article
TL;DR: This paper introduces approximate covers, an approximate version of covers, and proposes an algorithm which finds the minimum distance t such that P is a t-approximate cover of T.
Abstract: Repetitive strings have been studied in such diverse fields as molecular biology, data compression, etc. Some important regularities that have been studied are periods, covers, seeds and squares. A natural extension of the repetition problems is to allow errors. Among the four notions above, approximate squares and approximate periods have been studied. In this paper, we introduce the notion of approximate covers, which is an approximate version of covers. Given two strings P (|P| = m) and T (|T| = n), we propose an algorithm which finds the minimum distance t such that P is a t-approximate cover of T. The algorithm takes $O(mn)$ time for the edit distance, and the problem of finding a string which is an approximate cover of T with minimum distance is NP-complete.

Journal ArticleDOI
TL;DR: It is shown that efficient vector algorithms exist for the problem of approximate string matching with arbitrary weighted distances, and a class of automata for which vector algorithms can be automatically derived from the transition table of the automata is characterized.
Abstract: Vector algorithms allow the computation of an output vector r = r1 r2 ⋯ rm given an input vector e = e1 e2 ⋯ em in a bounded number of operations, independent of m, the length of the vectors. The allowable operations are usually restricted to bit-wise operations available in processors, including shifts and binary addition with carry. These restrictions imply that the existence of a vector algorithm for a particular problem opens the way to extremely fast implementations, using the inherent parallelism of bit-wise operations. This paper presents general results on the existence and construction of vector algorithms, with a particular focus on problems arising from computational biology. We show that efficient vector algorithms exist for the problem of approximate string matching with arbitrary weighted distances, generalizing a previous result by G. Myers. We also characterize a class of automata for which vector algorithms can be automatically derived from the transition table of the automata.

Journal Article
TL;DR: In this article, the authors consider a version of pattern matching useful in processing large musical data, which consists in finding matches which are δ-approximate in the sense of the distance measured as maximum difference between symbols.
Abstract: We consider a version of pattern matching useful in processing large musical data: δ-matching, which consists in finding matches which are δ-approximate in the sense of the distance measured as maximum difference between symbols. The alphabet is an interval of integers, and the distance between two symbols a, b is measured as |a - b|. We also consider (δ, γ)-matching, where γ is a bound on the total sum of the differences. We first consider "occurrence heuristics" by adapting exact string matching algorithms to the two notions of approximate string matching. The resulting algorithms are efficient in practice. Then we consider "substring heuristics". We present δ-matching algorithms that are fast on average provided that the pattern is "non-flat" and the alphabet interval is large. The pattern is "flat" if its structure does not vary substantially. The algorithms, named δ-BM1, δ-BM2 and δ-BM3, can be thought of as members of the generalized Boyer-Moore family of algorithms. This is the first paper on the subject; previously only "occurrence heuristics" have been considered. Our substring heuristics are much stronger and refer to larger parts of texts (not only to single positions). We use δ-versions of suffix tries and subword graphs. Surprisingly, in the context of δ-matching, subword graphs appear to be superior compared with compact suffix trees.
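
The two matching notions defined above can be written down directly. The Python sketch below checks every alignment naively in O(nm) time; it only illustrates the definitions of δ-matching and (δ, γ)-matching and is not one of the Boyer-Moore style algorithms from the paper. Symbols are assumed to be integers (for example MIDI pitches), and the function name is hypothetical.

```python
from typing import Optional, Sequence


def delta_gamma_matches(pattern: Sequence[int], text: Sequence[int],
                        delta: int, gamma: Optional[int] = None) -> list[int]:
    """Start positions where pattern δ-matches (or (δ, γ)-matches) the text.

    A window matches if every symbol differs from the pattern symbol by at
    most delta and, when gamma is given, the differences sum to at most gamma.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    starts = []
    for i in range(len(text) - m + 1):
        diffs = [abs(p - t) for p, t in zip(pattern, text[i:i + m])]
        if max(diffs) <= delta and (gamma is None or sum(diffs) <= gamma):
            starts.append(i)
    return starts
```

Passing `gamma=None` gives plain δ-matching, while a finite `gamma` additionally bounds the total deviation across the window.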

Journal ArticleDOI
TL;DR: An algorithm to compute the mean shape, when the shape is represented by a string, is presented as a modification of the well-known string edit algorithm, which converts sets of mapped symbols into piecewise linear functions and computes their mean.

Journal ArticleDOI
TL;DR: A new approach to pattern discovery called string pattern regression is presented, where a data set is given that consists of a string attribute and an objective numerical attribute, and an exact but efficient branch-and-bound algorithm is presented which is applicable to various pattern classes.
Abstract: We present a new approach to pattern discovery called string pattern regression, where we are given a data set that consists of a string attribute and an objective numerical attribute. The problem is to find the best string pattern that divides the data set in such a way that the distribution of the numerical attribute values of the set for which the pattern matches the string attribute is most distinct, with respect to some appropriate measure, from the distribution of the numerical attribute values of the set for which the pattern does not match the string attribute. By solving this problem, we are able to discover, at the same time, a subset of the data whose objective numerical attributes are significantly different from the rest of the data, as well as the splitting rule in the form of a string pattern that is conserved in the subset. Although the problem can be solved in linear time for the substring pattern class, the problem is NP-hard in the general case (i.e. more complex patterns), and we present an exact but efficient branch-and-bound algorithm which is applicable to various pattern classes. We apply our algorithm to intron sequences of human, mouse, fly, and zebrafish, and show the practicality of our approach and algorithm. We also discuss possible extensions of our algorithm, as well as promising applications, such as microarray gene expression data.

Journal Article
TL;DR: This work presents new search algorithms to detect the occurrences of any pattern from a given pattern set in a text, allowing in the occurrences a limited number of spurious text characters among those of the pattern.
Abstract: We present new search algorithms to detect the occurrences of any pattern from a given pattern set in a text, allowing in the occurrences a limited number of spurious text characters among those of the pattern. This is a common requirement in intrusion detection applications. Our algorithms exploit the ability to represent the search state of one or more patterns in the bits of a single machine word and update all the search states in a single operation. We show analytically and experimentally that the algorithms are capable of fast searching for large sets of patterns allowing a wide number of spurious characters, yielding on our machine about a 75-fold improvement over the classical dynamic programming algorithm.

Book ChapterDOI
03 Jul 2002
TL;DR: This work describes the first general method for computing the threshold for q-gram filters, based on a carefully chosen precise statement of the problem which is then transformed into a constrained shortest path problem.
Abstract: A popular and much studied class of filters for approximate string matching is based on finding common q-grams, substrings of length q, between the pattern and the text. A variation of the basic idea uses gapped q-grams and has been recently shown to provide significant improvements in practice. A major difficulty with gapped q-gram filters is the computation of the so-called threshold which defines the filter criterion. We describe the first general method for computing the threshold for q-gram filters. The method is based on a carefully chosen precise statement of the problem which is then transformed into a constrained shortest path problem. In its generic form the method leaves certain parts open but is applicable to a large variety of q-gram filters and may be extensible even to other classes of filters. We also give a full algorithm for a specific subclass. For this subclass, the algorithm has been implemented and used successfully in an experimental comparison.
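
To show what the threshold of a q-gram filter means, the following Python sketch computes it by exhaustive enumeration under a simplified model: only mismatches are allowed (Hamming distance), and the threshold is the smallest number of q-gram occurrences that survive any placement of k mismatches in a pattern of length m. Both the mismatch-only model and the brute-force search are assumptions of this sketch; the paper instead handles the general case through a constrained shortest path formulation. Shapes are written with `#` for a sampled position and `.` for a gap, a notation chosen here for illustration.

```python
from itertools import combinations


def hamming_threshold(shape: str, m: int, k: int) -> int:
    """Minimum number of shape occurrences left intact by k mismatches.

    shape: e.g. "###" for contiguous 3-grams or "##.#" for a gapped 3-gram.
    m:     pattern length.
    k:     number of mismatching positions.
    """
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)      # where the shape can be placed
    best = len(starts)
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        # A placement survives if none of its sampled positions is hit.
        survivors = sum(all(s + o not in bad for o in offsets) for s in starts)
        best = min(best, survivors)
    return best
```

For contiguous shapes this agrees with the familiar bound m - q + 1 - k·q whenever that bound is non-negative; the enumeration grows as C(m, k), which is why a more careful method is needed for realistic parameters.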

Book ChapterDOI
15 Aug 2002
TL;DR: An algorithm is presented to determine if there is a member in W within edit distance d of a given query string q of length m, which takes time $O(d m^{d+1})$ in the RAM model, independent of n, and requires $O(dm)$ additional space.
Abstract: Let W be a dictionary consisting of n binary strings of length m each, represented as a trie. The usual d-query asks if there exists a string in W within Hamming distance d of a given binary query string q. We present an algorithm to determine if there is a member in W within edit distance d of a given query string q of length m. The method takes time $O(d m^{d+1})$ in the RAM model, independent of n, and requires $O(dm)$ additional space.
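
The query above can be illustrated with a simple trie traversal that carries one row of the edit distance table per trie node and prunes a branch as soon as every entry of the row exceeds d. This is a well-known generic technique, shown here only to make the problem concrete; it is not the algorithm of the paper and does not achieve its time bound. The nested-dictionary trie and the `"$"` end marker are assumptions of this sketch.

```python
def build_trie(words: list[str]) -> dict:
    """Nested-dict trie; the key "$" marks the end of a word.

    Assumes "$" never occurs inside a word (true for binary strings).
    """
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def within_edit_distance(trie: dict, query: str, d: int) -> bool:
    """True if some word in the trie is within edit distance d of query."""

    def walk(node: dict, prev_row: list[int]) -> bool:
        if "$" in node and prev_row[-1] <= d:
            return True
        if min(prev_row) > d:
            return False          # row minima never decrease further down
        for ch, child in node.items():
            if ch == "$":
                continue
            row = [prev_row[0] + 1]
            for j in range(1, len(query) + 1):
                row.append(min(row[j - 1] + 1,
                               prev_row[j] + 1,
                               prev_row[j - 1] + (query[j - 1] != ch)))
            if walk(child, row):
                return True
        return False

    return walk(trie, list(range(len(query) + 1)))
```

Because shared prefixes are processed once, the traversal avoids recomputing table rows for words that begin the same way.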

Patent
01 Aug 2002
TL;DR: A method of comparing version strings for version-specific computing tasks is presented, in which each of a first and a second version string is divided at each one of a set of predetermined delimiters to produce respective first and second sets of sequentially ordered string chunks.
Abstract: A method of comparing version strings in a computing environment for use in version-specific computing tasks. In one embodiment, the method divides each of a first and a second version string at each one of a set of predetermined delimiters to produce respective first and second sets of sequentially ordered string chunks. Next, string chunks of the same order from the first and second chunk sets are iteratively compared to determine matching of same-order string chunks, with the comparison continuing until a non-matching same-order string chunk pair is encountered. From the matching/non-matching comparisons, a determination may be made whether a specified quality relationship exists between the first and second version strings, where the quality relationship determines the propriety of a version-specific computing task.
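
The chunk-by-chunk comparison described above might look like the following Python sketch. The delimiter set, the rule that all-digit chunks compare numerically, and the rule that a version with extra trailing chunks counts as later are assumptions made for this example rather than details from the patent; the function names are hypothetical.

```python
import re


def split_version(version: str, delimiters: str = ".-_") -> list[str]:
    """Divide a version string at each of the given delimiter characters."""
    return re.split("[" + re.escape(delimiters) + "]", version)


def compare_versions(a: str, b: str, delimiters: str = ".-_") -> int:
    """Return -1, 0 or 1 as version a is older than, equal to, or newer than b."""
    chunks_a = split_version(a, delimiters)
    chunks_b = split_version(b, delimiters)
    # Compare same-order chunks until the first non-matching pair.
    for x, y in zip(chunks_a, chunks_b):
        if x.isdecimal() and y.isdecimal():
            kx, ky = int(x), int(y)      # numeric chunks: 10 > 9
        else:
            kx, ky = x, y                # otherwise plain string order
        if kx != ky:
            return -1 if kx < ky else 1
    # All shared chunks match: the version with more chunks is treated as later.
    return (len(chunks_a) > len(chunks_b)) - (len(chunks_a) < len(chunks_b))
```

A caller could then gate a version-specific task on the result, for example proceeding only when `compare_versions(installed, required) >= 0`.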