
Showing papers on "Approximate string matching" published in 1999


01 Jan 1999
TL;DR: It is argued that, in addition to previous applications that required such search, multi-pattern matching can be used in lieu of indexed or sorted data in some applications involving small to medium size datasets.
Abstract: A new algorithm to search for multiple patterns at the same time is presented. The algorithm is faster than previous algorithms and can support a very large number — tens of thousands — of patterns. Several applications of the multi-pattern matching problem are discussed. We argue that, in addition to previous applications that required such search, multi-pattern matching can be used in lieu of indexed or sorted data in some applications involving small to medium size datasets. Its advantage, of course, is that no additional search structure is needed.

564 citations


Journal ArticleDOI
Gene Myers1
TL;DR: An algorithm of comparable simplicity that requires only O(nm/w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the approximate string matching problem, and is found to be more efficient than the previous results for many choices of k and small m.
Abstract: The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k-or-fewer differences. Simple and practical bit-vector algorithms have been designed for this problem, most notably the one used in agrep. These algorithms compute a bit representation of the current state-set of the k-difference automaton for the query, and asymptotically run in either O(nmk/w) or O(nm log s/w) time where w is the word size of the machine (e.g., 32 or 64 in practice), and s is the size of the pattern alphabet. Here we present an algorithm of comparable simplicity that requires only O(nm/w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the problem. Thus, the algorithm's performance is independent of k, and it is found to be more efficient than the previous results for many choices of k and small m. Moreover, because the algorithm is not dependent on k, it can be used to rapidly compute blocks of the dynamic programming matrix as in the 4-Russians algorithm of Wu et al. (1996). This gives rise to an O(kn/w) expected-time algorithm for the case where m may be arbitrarily large. In practice this new algorithm, that computes a region of the dynamic programming (d.p.) matrix w entries at a time using the basic algorithm as a subroutine, is significantly faster than our previous 4-Russians algorithm, that computes the same region 4 or 5 entries at a time using table lookup. This performance improvement yields a code that is either superior or competitive with all existing algorithms except for some filtration algorithms that are superior when k/m is sufficiently small.
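The bit-vector idea in this abstract can be illustrated with a minimal Python sketch. It follows the commonly cited formulation of the basic algorithm, in which the vertical differences of one dynamic programming column are held in two bit vectors. The names `myers_search`, `peq`, `pv` and `mv` are illustrative choices, not taken from the paper, and a Python integer stands in for the machine word, so the sketch models only the basic case and not the block-based extension for long patterns.

```python
def myers_search(pattern, text, k):
    """Return end positions in text where pattern matches with <= k differences."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    mask = (1 << m) - 1          # keeps every vector m bits wide
    high = 1 << (m - 1)          # bit for the last pattern row

    # peq[c] has bit i set when pattern[i] == c
    peq = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)

    pv, mv = mask, 0             # vertical deltas: +1 everywhere in column 0
    score = m                    # value in the bottom cell of the column
    ends = []
    for j, ch in enumerate(text):
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = (mv | ~(xh | pv)) & mask   # horizontal delta +1
        mh = pv & xh                    # horizontal delta -1
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        ph <<= 1                 # no carry-in: a match may start anywhere
        mh <<= 1
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv & mask
        if score <= k:
            ends.append(j)
    return ends


print(myers_search("abc", "xxabcxx", 0))  # prints [4]
```

Each text character costs a constant number of word operations, which is where the independence from k comes from.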

483 citations


Journal ArticleDOI
TL;DR: This work introduces a new text-indexing data structure, the String B-Tree, that can be seen as a link between some traditional external-memory and string-matching data structures that is made more effective by adding extra pointers to speed up search and update operations.
Abstract: We introduce a new text-indexing data structure, the String B-Tree, that can be seen as a link between some traditional external-memory and string-matching data structures. In a short phrase, it is a combination of B-trees and Patricia tries for internal-node indices that is made more effective by adding extra pointers to speed up search and update operations. Consequently, the String B-Tree overcomes the theoretical limitations of inverted files, B-trees, prefix B-trees, suffix arrays, compacted tries and suffix trees. String B-trees have the same worst-case performance as B-trees but they manage unbounded-length strings and perform much more powerful search operations such as the ones supported by suffix trees. String B-trees are also effective in main memory (RAM model) because they improve the online suffix tree search on a dynamic set of strings. They also can be successfully applied to database indexing and software duplication.

364 citations


Journal ArticleDOI
TL;DR: The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input and it is shown that the algorithms are among the fastest for typical text searching, being the fastest in some cases.
Abstract: We present a new algorithm for on-line approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = Ω(log n) bits, where n is the text size. This is essentially similar to the model used in Wu and Manber's work, although we improve the search time by packing the automaton states differently. The running time achieved is O(n) for small patterns (i.e., whenever mk = O(log n)), where m is the pattern length and k

175 citations


Journal ArticleDOI
TL;DR: It is proved that computing the median string corresponds to an NP-complete decision problem, thus proving that this problem is NP-hard.

137 citations


Proceedings ArticleDOI
01 Aug 1999
TL;DR: A content-based retrieval model for tackling the mismatch problems specific to music data and a distinct function that extracts key melodies for query suggestion is developed, which improves performance over direct search of the music database.
Abstract: A content-based retrieval model for tackling the mismatch problems specific to music data is proposed and implemented. The system uses a pitch profile encoding for queries in any key and an n-note indexing method for approximate matching in sub-linear time. A distinct function that extracts key melodies for query suggestion is developed. The Web-based system provides a flexible user interface for query formulation and result browsing. Users can search the system by a short sequence of notes, by uploading a file created by singing, or by clicking suggested key melodies without input. Experiments show that the pitch profile encoding and a 3-note indexing are able to overcome the key mismatch problem and the random errors caused by pitch error, note deletion and insertion. The use of extracted key melodies improves performance over direct search of the music database. For the type of burst mismatch, a query expansion approach is applied.

135 citations


Patent
Richard Theodore Gillam1
01 Sep 1999
TL;DR: A system, method, and program for determining boundaries in a string of characters using a dictionary whose substrings may comprise words; an initial substring is selected such that the characters following it can be divided into dictionary substrings, and the boundaries follow the initial substring and each of those substrings.
Abstract: Disclosed is a system, method, and program for determining boundaries in a string of characters using a dictionary, wherein the substrings in the dictionary may comprise words. A determination is made of all possible initial substrings of the string in the dictionary. One initial substring is selected such that all the characters following the initial substring can be divided into at least one substring in the dictionary. The boundaries follow each of the initial substring and the at least one substring that includes all the characters following the initial substring.
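The selection rule in this abstract (pick an initial dictionary substring only if the remaining characters can also be divided into dictionary substrings) can be sketched as a memoized backtracking search. This is a minimal illustration, not the patented method: the function names `segment` and `boundaries` are hypothetical, and trying the longest candidate first is an assumption the abstract does not state.

```python
from functools import lru_cache


def segment(text, dictionary):
    """Split text into dictionary substrings, or return None if impossible."""
    words = frozenset(dictionary)

    @lru_cache(maxsize=None)
    def split_from(start):
        if start == len(text):
            return ()
        # Assumption: prefer the longest initial substring, backtrack otherwise.
        for end in range(len(text), start, -1):
            piece = text[start:end]
            if piece in words:
                rest = split_from(end)
                if rest is not None:      # the remainder is fully divisible
                    return (piece,) + rest
        return None

    result = split_from(0)
    return None if result is None else list(result)


def boundaries(pieces):
    """Return the character offset that follows each substring."""
    positions, offset = [], 0
    for piece in pieces:
        offset += len(piece)
        positions.append(offset)
    return positions


words = {"the", "them", "end", "men", "dine", "in", "here", "he", "re"}
pieces = segment("themendinehere", words)
print(pieces, boundaries(pieces))
```

In the usage example the longer candidate "them" is rejected because the characters after it cannot be divided into dictionary entries, so the search falls back to "the".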

131 citations



Patent
29 Dec 1999
TL;DR: In this paper, a highly accurate technique for recognizing spoken digit strings is described, in which a spoken digit string is received and analyzed by a speech recognizer, which generates a list of hypothesized digit strings arranged in ranked order based on a likelihood of matching the spoken string.
Abstract: A highly accurate technique for recognizing spoken digit strings is described. A spoken digit string is received (14) and analyzed by a speech recognizer (18), which generates a list of hypothesized digit strings arranged in ranked order (16) based on a likelihood of matching the spoken digit string (20). The individual hypothesized strings are then analyzed in order beginning with the hypothesized string having the greatest likelihood of matching the spoken string to determine whether they satisfy a given constraint. The first hypothesized string in the list satisfying the constraint is selected as the recognized string (22).
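The selection step described here (walk the ranked hypotheses and accept the first one that satisfies a constraint) is small enough to sketch directly. The abstract does not say what the constraint is; the Luhn checksum below is a hypothetical stand-in chosen only for illustration, and `first_valid` and `luhn_ok` are illustrative names.

```python
def luhn_ok(digits):
    """Hypothetical constraint: the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:           # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def first_valid(hypotheses, constraint):
    """Return the highest-ranked hypothesis that satisfies the constraint."""
    for candidate in hypotheses:  # assumed to be ordered best-first
        if constraint(candidate):
            return candidate
    return None                   # no hypothesis satisfied the constraint


ranked = ["79927398710", "79927398713", "79927398718"]
print(first_valid(ranked, luhn_ok))  # prints 79927398713
```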

73 citations


Journal ArticleDOI
TL;DR: This paper develops significantly faster algorithms for a special class of strings which emerge frequently in pattern matching problems, and compares this run-length encoded string against the ith row or column of each of the character image-models.

71 citations


Proceedings ArticleDOI
01 May 1999
TL;DR: In this paper, a pruned count-suffix tree is used to estimate the selectivity of a sub-string matching query based on all maximal substrings of the query in the tree.
Abstract: With the explosion of the Internet, LDAP directories and XML, there is an ever greater need to evaluate queries involving (sub)string matching. Effective query optimization in this context requires good selectivity estimates. In this paper, we use pruned count-suffix trees as the basic framework for substring selectivity estimation. We present a novel technique to obtain a good estimate for a given substring matching query, called MO (for Maximal Overlap), that estimates the selectivity of a query based on all maximal substrings of the query in the pruned count-suffix tree. We show that MO is provably better than the (independence-based) substring selectivity estimation technique proposed by Krishnan et al. [6], called KVI, under the natural assumption that strings exhibit the so-called “short memory” property. We complement our analysis with an experiment, using a real AT&T data set, that demonstrates that MO is substantially superior to KVI in the quality of the estimate. Finally, we develop and analyze two selectivity estimation algorithms, MOC and MOLC, based on MO and a constraint-based characterization of all possible completions of a given pruned count-suffix tree. We show that KVI, MO, MOC and MOLC illustrate an interesting tradeoff between estimation accuracy and computational efficiency. *This work was done when the author was at AT&T Labs-Research, Florham Park, NJ 07932, USA. +This work was done when the author was on sabbatical at AT&T Labs-Research, Florham Park, NJ 07932, USA.
PODS '99, Philadelphia, PA. Copyright ACM 1999.

Proceedings ArticleDOI
07 Jun 1999
TL;DR: An approach for content based music data retrieval where thematic feature strings are extracted from the original music objects and treated as the meta data to represent their contents and a new approximate string matching algorithm is proposed which provides fault tolerance ability according to the music characteristics.
Abstract: An approach for content based music data retrieval is proposed. In this approach, thematic feature strings, such as melody strings, rhythm strings, and chord strings are extracted from the original music objects and treated as the meta data to represent their contents. The problem of content based music data retrieval is then transformed into the string matching problem. A new approximate string matching algorithm is also proposed which provides fault tolerance ability according to the music characteristics. To show the efficiency of the algorithm, a set of experiments are performed to compare with the agrep and the fgrep utility on both synthetic and real music data.

Patent
Jason Zien1
19 Jan 1999
TL;DR: In this paper, a method, system, and article of manufacture for generating a list of candidate objects for a requested object is presented, wherein an identifier for the desired object is accepted, wherein the identifier comprises a target string.
Abstract: A method, system, and article of manufacture for generating a list of candidate objects for a requested object. An identifier for the requested object is accepted, wherein the identifier comprises a target string. A list of candidate objects is generated when the requested object cannot be found by performing a hierarchical string match for the target string against a set of source strings using multi-path dynamic programming, wherein the set of source strings represent a set of objects from which the list of candidate objects is generated.

Patent
22 Jul 1999
TL;DR: In this article, a data processing system has a searching mechanism for finding occurrences of a plurality of key strings within a target string, which forms a hash value from each of the key strings, and adds each key string to a collection of key strings having the same hash value.
Abstract: A data processing system has a searching mechanism for finding occurrences of a plurality of key strings within a target string. The searching mechanism forms a hash value from each of the key strings, and adds each key string to a collection of key strings having the same hash value. It then selects a plurality of symbol positions in the target string, and forms a hash value at each selected symbol position in the target string. This hash value is used to select one of the collections of key strings. Each key string in the selected collection of key strings is then compared with the target string.
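The mechanism in this abstract can be sketched in a few lines. The abstract does not say how the hash is formed or which symbol positions are selected, so the sketch makes two labeled assumptions: the hash covers the first q symbols of a string, where q is the length of the shortest key, and every position of the target is examined. The names `prefix_hash`, `find_keys` and `n_buckets` are illustrative.

```python
from collections import defaultdict


def prefix_hash(s, q, n_buckets, start=0):
    """Assumed hash: polynomial hash of the q symbols beginning at start."""
    h = 0
    for ch in s[start:start + q]:
        h = (h * 257 + ord(ch)) % n_buckets
    return h


def find_keys(keys, target, n_buckets=1024):
    """Return (position, key) pairs for every key occurrence in target."""
    keys = [key for key in keys if key]
    if not keys:
        return []
    q = min(len(key) for key in keys)

    # Group the key strings into collections that share a hash value.
    collections = defaultdict(list)
    for key in keys:
        collections[prefix_hash(key, q, n_buckets)].append(key)

    hits = []
    for pos in range(len(target) - q + 1):
        h = prefix_hash(target, q, n_buckets, pos)
        # Only keys in the selected collection are compared with the target.
        for key in collections.get(h, ()):
            if target.startswith(key, pos):
                hits.append((pos, key))
    return hits


print(find_keys(["he", "she", "hers", "his"], "ushers"))
```

Because the final comparison is exact, hash collisions only cost extra comparisons and never produce false matches.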

Patent
Feng Yang1
14 Jun 1999
TL;DR: In this article, an internal dictionary and/or an external dictionary are used to provide translations of command strings from one language to the specific language of the application program, and the test script may then be translated at run time using the dictionaries to allow the testing program to test the application program in accordance with the language of the application program.
Abstract: The present invention is a system and method for testing various language versions of an application program using a single test script. An internal dictionary and/or an external dictionary are used to provide translations of command strings from one language to the specific language of the application program. The test script may then be translated at run time using the dictionaries to allow the testing program to test the application program in accordance with the language of the application program. Fuzzy match logic may be used to provide appropriate language translation of the command string. The internal dictionary may be automatically updated at run time so that it may learn language translations of unknown command strings for future runs.

Journal ArticleDOI
TL;DR: This work improves the fastest known algorithm for approximate string matching by using a new method to verify potential matches and a new optimization technique for biased texts (such as English).

Proceedings ArticleDOI
29 Mar 1999
TL;DR: The techniques used can be used in design of efficient algorithms for a wide range of the most typical string problems, in the compressed LZW setting, including: computing a period of a word, finding repetitions, symmetries, counting subwords, and multi-pattern matching.
Abstract: Given two strings: pattern P and text T of lengths |P|=M and |T|=N, a string matching problem is to find all occurrences of pattern P in text T. A fully compressed string matching problem is the string matching problem with input strings P and T given in compressed forms p and t respectively, where |p|=m and |t|=n. We present first, almost-optimal, string matching algorithms for LZW-compressed strings running in: (1) O((n+m)log(n+m)) time on a single processor machine; and (2) Õ(n+m) work on a (n+m)-processor PRAM. The techniques used can be used in design of efficient algorithms for a wide range of the most typical string problems, in the compressed LZW setting, including: computing a period of a word, finding repetitions, symmetries, counting subwords, and multi-pattern matching.

Proceedings ArticleDOI
Daniel P. Lopresti1
20 Sep 1999
TL;DR: This paper introduces a framework for clarifying and formalizing the duplicate document detection problem and presents four distinct models, each with a corresponding algorithm for its solution derived from the realm of approximate string matching.
Abstract: This paper introduces a framework for clarifying and formalizing the duplicate document detection problem. Four distinct models are presented, each with a corresponding algorithm for its solution derived from the realm of approximate string matching. The robustness of these techniques is demonstrated through a set of experiments using data reflecting real-world degradation effects.

Book ChapterDOI
22 Jul 1999
TL;DR: A new indexing method based on a suffix tree combined with a partitioning of the pattern that outperforms by far all other algorithms for indexed approximate searching, and it is shown how this index can be implemented using much less space.
Abstract: We present a new indexing method for the approximate string matching problem. The method is based on a suffix tree combined with a partitioning of the pattern. We analyze the resulting algorithm and show that the retrieval time is O(n^λ), for 0 < λ < 1, whenever α < 1 - e/√σ, where α is the error level tolerated and σ is the alphabet size. We experimentally show that this index outperforms by far all other algorithms for indexed approximate searching, also being the first experiments that compare the different existing schemes. We finally show how this index can be implemented using much less space.

Proceedings ArticleDOI
21 Sep 1999
TL;DR: An algorithm which attempts to align pairs of subsequences from a database of genetic sequences by simulating the classical dynamic programming alignment algorithm over a suffix array of the database is presented.
Abstract: We present an algorithm which attempts to align pairs of subsequences from a database of genetic sequences. The algorithm simulates the classical dynamic programming alignment algorithm over a suffix array of the database. We provide a detailed average case analysis which shows that the running time of the algorithm is subquadratic with respect to the database size. A similar algorithm solves the approximate string matching problem in sublinear average time.

01 Jan 1999
TL;DR: In this paper, the authors introduce two new notions of approximate matching with application in computer assisted music analysis, and present algorithms for each notion of approximation: for approximate string matching and for computing approximate squares.
Abstract: Here we introduce two new notions of approximate matching with application in computer assisted music analysis. We present algorithms for each notion of approximation: for approximate string matching and for computing approximate squares.

Proceedings ArticleDOI
01 May 1999
TL;DR: It is shown that the multi-method dispatching problem can be transformed to a geometric problem on multi-dimensional integer grids, for which a data structure is developed that uses near-linear space and has log-logarithmic query time.
Abstract: Current object oriented programming languages (OOPLs) rely on mono-method dispatching. Recent research has identified multi-methods as a new, powerful feature to be added to OOPLs, and several experimental OOPLs now have multi-methods. Their ultimate success and impact in practice depends, among other things, on whether multi-method dispatching can be supported efficiently. We show that the multi-method dispatching problem can be transformed to a geometric problem on multi-dimensional integer grids, for which we then develop a data structure that uses near-linear space and has log-logarithmic query time. This gives a solution whose performance almost matches that of the best known algorithm for mono-method dispatching. In this paper we study problems from two different areas: the multi-method dispatching problem for object-oriented (OO) languages and two string matching problems. It turns out that these problems are surprisingly similar: we prove that they can all be reduced to the same problem on multi-dimensional integer grids (see below for a description of this geometric problem). We present an efficient data structure for this problem, which allows various trade-offs between space and query time. This leads to significantly improved solutions to the multi-method dispatching problem and the string matching problems. In the rest of this introduction, as well as in the remainder of the paper, we first focus on the multi-method dispatching problem and then turn our attention to the string matching problems. Our geometric data structure has other applications as well, namely in two string matching problems: matching multiple rectangular patterns against a rectangular query text, and approximate dictionary matching with edit distance at most one. Our results for the former, long-standing open problem are substantially improved, near-linear time bounds. For the latter problem, which has applications in checking password security and the design of filtering tools, we obtain a near-linear solution as well. The multi-method dispatching problem. Object-oriented languages. The OO paradigm is becoming the norm for software development; languages such as Java, C++, and Smalltalk that embody some of the basic tenets of the OO paradigm are highly popular. Recent research has identified new, powerful features that can enhance the current OO technology, and the focus now is on understanding the implication of adding these features: their power and their cost in terms of the additional complexity. One such feature is the concept of multi-methods found in the new generation OO languages such as CommonLoops [BK+86], CLOS [BD+88], Poly-Glot [A91], Kea [MHH91], Cecil [Ch92] and Dylan [Ap94]. …

Proceedings ArticleDOI
20 Sep 1999
TL;DR: The proposed method converts a two-dimensional image into a one-dimensional string and computes the edit distance by the modified approximate string matching algorithm and presents the details of applications in handwriting analysis and both online and offline character recognition.
Abstract: Given two character images, we would like to measure their similarity or difference. Such a similarity or difference measure facilitates the solution to character recognition and handwriting analysis problems. There is, however, no universal definition for similarity measure satisfying a wide range of characteristics such as the slant, deformation or other invariant constraints. For this reason, we propose a new definition for the character similarity measure. First, the proposed method converts a two-dimensional image into a one-dimensional string. Next, it computes the edit distance by the modified approximate string matching algorithm. We describe how to extract the string information and compute the distance and then present the details of applications in handwriting analysis and both online and offline character recognition.

Patent
Jeremy S. De Bonet1
13 Jul 1999
TL;DR: In this paper, an approximate string matching scheme was proposed for lossless data compression employing an entropy-based compression technique, where the residual data represents the difference between each value of an earlier occurring block of source data, whose location and length is identified by a pointer, and an equal-sized block of the source data associated with the pointer.
Abstract: A system and process for lossless data compression employing a unique approximate string matching scheme. The encoder of the system characterizes source data as a set of pointers and associated blocks of residual data. Each pointer identifies a location earlier in the source data, as well as the number of source data values associated with the identified location. The residual data represents the difference between each value of an earlier occurring block of source data, whose location and length is identified by a pointer, and an equal-sized block of source data associated with the pointer. The choice of a block of earlier occurring source data for use in forming a residual data block is based on a cost analysis which is designed to minimize the entropy of the differences between the previous block and the new block of source data to a desired degree. The encoded data, which will exhibit a significantly lower entropy, can be compressed effectively using an entropy-based compression technique. The decoder portion of the system operates by initially decompressing the encoded data. Next, the first data value is decoded by adding the first residual to a predetermined constant. Once the first data value has been decoded, subsequent data values are decoded by first finding the block in the previously decoded data indicated by a pointer, and then adding each data value in the block to its corresponding data element in the residual data block associated with the pointer. The process is repeated until all the data is decoded.
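The decoding steps in this abstract can be sketched as follows, after the entropy decompression stage has already run. The token layout is an assumption made for illustration: each token is a `((start, length), residuals)` pair whose pointer holds an absolute index into the already decoded values, and the referenced block must lie entirely inside that decoded prefix. The names `decode`, `tokens` and `constant` are hypothetical.

```python
def decode(first_residual, tokens, constant=0):
    """Rebuild source values from pointers and residual blocks."""
    # The first value is the first residual added to a predetermined constant.
    decoded = [constant + first_residual]
    for (start, length), residuals in tokens:
        if start < 0:
            raise ValueError("pointer must not be negative")
        block = decoded[start:start + length]   # earlier decoded block
        if len(block) != length or len(residuals) != length:
            raise ValueError("pointer must reference already decoded values")
        # Each new value is the earlier value plus its residual.
        decoded.extend(value + delta for value, delta in zip(block, residuals))
    return decoded


tokens = [((0, 1), [1]), ((0, 2), [0, 0]), ((1, 3), [1, -1, 2])]
print(decode(5, tokens))  # prints [5, 6, 5, 6, 7, 4, 8]
```

An all-zero residual block corresponds to an exact repeat of earlier data, while small non-zero residuals correspond to an approximate match, which is what keeps the entropy of the encoded stream low.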

Journal ArticleDOI
TL;DR: Experimental results indicate that the two-stage string matching approach significantly improves the recognition rates compared to the one-stage string matching method.
Abstract: A two-stage string matching method for the recognition of two-dimensional (2-D) objects is proposed in this work. The first stage is a global cyclic string matching. The second stage is a local matching with local dissimilarity measure computing. The dissimilarity measure function of the input shape and the reference shape are obtained by combining the global matching cost and the local dissimilarity measure. The proposed method has the advantage that there is no need to set any parameter in the recognition process. Experimental results indicate that the two-stage string matching approach significantly improves the recognition rates compared to the one-stage string matching method.

Journal ArticleDOI
TL;DR: In the approximate matching case, a modified version of the shortest common approximate matching superstring problem is analyzed and it is demonstrated that the optimal savings in this case is given approximately by n log n / I_l(Q,Q,2D).
Abstract: The shortest common superstring problem and its extension to approximate matching are considered in the probability model where each string in a given set has the same length and letters of strings are drawn independently from a finite set. In the exact matching case, several algorithms proposed in the literature are shown to be asymptotically optimal in the sense that the ratio of the savings resulting from the superstring constructed by each of these algorithms, that is the difference between the total length of the strings in the given set and the length of the superstring, to the optimal savings from the shortest superstring approaches in probability to 1 as the number of strings in the given set increases. In the approximate matching case, a modified version of the shortest common approximate matching superstring problem is analyzed; it is demonstrated that the optimal savings in this case is given approximately by n log n / I_l(Q,Q,2D), where n is the number of strings in the given set, Q is the probability distribution governing the selection of letters of strings, I_l(Q,Q,2D) is the lower mutual information between Q and Q with respect to 2D, and D ≥ 0 is the distortion allowed in approximate matching. In addition, an approximation algorithm is proposed and proved asymptotically optimal.

Journal ArticleDOI
TL;DR: This paper includes the swap operation that interchanges two adjacent characters into the set of allowable edit operations, and presents an O(t min(m,n))-time algorithm for the extended edit distance problem, where t represents the edit distance between the given strings, and also considers the extended k-differences problem.
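To make the swap operation concrete, here is a minimal sketch of an edit distance that allows adjacent swaps next to insertions, deletions and substitutions. It is the plain quadratic-time dynamic program usually called optimal string alignment, in which a swapped pair is not edited again. It is not the faster algorithm the entry describes, and it can differ from the unrestricted extended edit distance on some inputs. The function name is illustrative.

```python
def edit_distance_with_swaps(a, b):
    """Edit distance with insert, delete, substitute and adjacent swap."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete every character of a[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert every character of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Swap: the last two characters of each prefix are interchanged.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]


print(edit_distance_with_swaps("abcd", "acbd"))  # prints 1
```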

Proceedings ArticleDOI
21 Sep 1999
TL;DR: This work introduces a new formalism for a class of applications that takes two strings as input, each specified in terms of a particular domain, and performs a comparison motivated by constraints derived from a third, possibly different domain.
Abstract: Approximate string matching is an important paradigm in domains ranging from speech recognition to information retrieval and molecular biology. We introduce a new formalism for a class of applications that takes two strings as input, each specified in terms of a particular domain, and performs a comparison motivated by constraints derived from a third, possibly different domain. This issue arises, for example, when searching multimedia databases built using imperfect recognition technologies (e.g., speech, optical character, and handwriting recognition). We present a polynomial time algorithm for solving the problem, and describe several variations that can also be solved efficiently.

Book ChapterDOI
22 Jul 1999
TL;DR: These are the first search algorithms for the problem of approximate string matching in d dimensions, and the first sublinear-time (on average) searching algorithm is presented, which is O(k n^d / m^{d-1}) for k < (m/(d(log_σ m - log_σ d)))^{d-1}, where σ is the alphabet size.
Abstract: We address the problem of approximate string matching in d dimensions, that is, to find a pattern of size m^d in a text of size n^d with at most k < m^d errors (substitutions, insertions and deletions along any dimension). We use a novel and very flexible error model, for which there exists only an algorithm to evaluate the similarity between two elements in two dimensions at O(m^4) time. We extend the algorithm to d dimensions, at O(d! m^{2d}) time and O(d! m^{2d-1}) space. We also give the first search algorithm for such model, which is O(d! m^d n^d) time and O(d! m^d n^{d-1}) space. We show how to reduce the space cost to O(d! 3^d m^{2d-1}) with little time penalty. Finally, we present the first sublinear-time (on average) searching algorithm (i.e. not all text cells are inspected), which is O(k n^d / m^{d-1}) for k < (m/(d(log_σ m - log_σ d)))^{d-1}, where σ is the alphabet size. After that error level the filter still remains better than dynamic programming for k ≤ m^{d-1}/(d(log_σ m - log_σ d))^{(d-1)/d}. These are the first search algorithms for the problem. As side-effects we extend to d dimensions an already proposed algorithm for two-dimensional exact string matching, and we obtain a sublinear-time filter to search in d dimensions allowing k mismatches.

Book ChapterDOI
22 Jul 1999
TL;DR: The study of approximately periodic strings is relevant to diverse applications such as molecular biology, data compression, and computer-assisted music analysis and it is shown that the third problem is NP-complete.
Abstract: The study of approximately periodic strings is relevant to diverse applications such as molecular biology, data compression, and computer-assisted music analysis. Here we study different forms of approximate periodicity under a variety of distance rules. We consider three related problems, for two of which we derive polynomial-time algorithms; we then show that the third problem is NP-complete.