
Showing papers on "Approximate string matching published in 2001"


Journal ArticleDOI
TL;DR: This work surveys the current techniques to cope with the problem of string matching that allows errors, and focuses on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms.
Abstract: We survey the current techniques to cope with the problem of string matching that allows errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices. We conclude with some directions for future work and open problems.
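For orientation, here is a minimal sketch, not taken from the survey itself, of the classical O(mn) dynamic-programming search under edit distance that the surveyed algorithms set out to beat; function and variable names are illustrative.

```python
def approximate_search(pattern: str, text: str, k: int):
    """Report every end position in `text` where some substring matches
    `pattern` with edit distance at most k (classical O(mn) DP search)."""
    m = len(pattern)
    # col[i] = edit distance between pattern[:i] and the best-matching
    # suffix of the text read so far; col[0] stays 0 because a match may
    # start at any text position.
    col = list(range(m + 1))
    hits = []
    for j, c in enumerate(text):
        prev_diag = col[0]
        for i in range(1, m + 1):
            new = min(
                col[i] + 1,                                     # c is an extra text character
                col[i - 1] + 1,                                 # pattern[i-1] left unmatched
                prev_diag + (0 if pattern[i - 1] == c else 1),  # match / substitution
            )
            prev_diag, col[i] = col[i], new
        if col[m] <= k:
            hits.append(j)   # an approximate occurrence ends at text[j]
    return hits

# approximate_search("survey", "a short surveys of methods", 1)
# -> end positions of the approximate occurrences hiding in "surveys"
```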

2,723 citations


Proceedings Article
11 Sep 2001
TL;DR: In this article, the authors propose a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them. The technique relies on matching short substrings of length q, called q-grams, and takes into account both the positions of individual matches and the total number of such matches.
Abstract: String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data, especially for more complex queries involving joins. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string joins directly, and it is a challenge to implement this functionality efficiently with user-defined functions (UDFs). In this paper, we develop a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them. At the core, our technique relies on matching short substrings of length q, called q-grams, and taking into account both positions of individual matches and the total number of such matches. Our approach applies to both approximate full string matching and approximate substring matching, with a variety of possible edit distance functions. The approximate string match predicate, with a suitable edit distance threshold, can be mapped into a vanilla relational expression and optimized by conventional relational optimizers. We demonstrate experimentally the benefits of our technique over the direct use of UDFs, using commercial database systems and real data. To study the I/O and CPU behavior of approximate string join algorithms with variations in edit distance and q-gram length, we also describe detailed experiments based on a prototype implementation.
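A sketch of the q-gram decomposition this style of filtering rests on; the sentinel characters and the (position, q-gram) layout are assumptions for illustration, and the paper's SQL machinery and filter bounds are not reproduced here.

```python
def positional_qgrams(s: str, q: int, pad_left: str = "#", pad_right: str = "$"):
    """Decompose a string into (position, q-gram) pairs.

    The string is padded with q-1 sentinel characters on each side so that
    every character participates in exactly q q-grams; the sentinels and
    the tuple layout here are illustrative assumptions, not the paper's
    exact table schema.
    """
    padded = pad_left * (q - 1) + s + pad_right * (q - 1)
    return [(i, padded[i:i + q]) for i in range(len(padded) - q + 1)]

# positional_qgrams("smith", 3)
# -> [(0, '##s'), (1, '#sm'), (2, 'smi'), (3, 'mit'), (4, 'ith'), (5, 'th$'), (6, 'h$$')]
```

Stored in an auxiliary table, such tuples let an equi-join on the q-gram column, followed by grouping and counting, discard most non-matching pairs cheaply before an exact edit-distance verification.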

556 citations


Patent
Barry Lynn Fritchman
02 May 2001
TL;DR: In this paper, a method for matching a pattern string with a target string, where either string can contain single or multi-character wild cards, is described, which includes the steps of preprocessing the pattern string into a prefix, a suffix, and zero or more interior segments.
Abstract: The method of the present invention is useful in a computer system including at least one client. The program executes a method for matching a pattern string with a target string, where either string can contain single or multi-character wild cards. The method includes the steps of preprocessing the pattern string into a prefix segment, a suffix segment, and zero or more interior segments. The prefix segment, the suffix segment, and the interior segment(s) are then matched against the target string.
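For illustration, a minimal sketch of segment-based wildcard matching in the spirit of the description above, assuming '*' as the multi-character and '?' as the single-character wild card; it is a simplified stand-in, not the patented procedure.

```python
def wildcard_match(pattern: str, target: str) -> bool:
    """Match `pattern` against `target`, where '*' matches any run of
    characters and '?' matches exactly one character (assumed syntax)."""

    def seg_match(seg: str, s: str) -> bool:
        # Compare a '*'-free segment against an equal-length slice.
        return len(seg) == len(s) and all(p == '?' or p == c for p, c in zip(seg, s))

    segments = pattern.split('*')
    prefix, suffix = segments[0], segments[-1]
    if len(segments) == 1:                       # no '*': lengths must agree exactly
        return seg_match(pattern, target)
    if not seg_match(prefix, target[:len(prefix)]):
        return False
    if len(target) < len(prefix) + len(suffix):
        return False
    if suffix and not seg_match(suffix, target[len(target) - len(suffix):]):
        return False
    # Locate each interior segment, left to right, in the remaining window.
    pos, end = len(prefix), len(target) - len(suffix)
    for seg in segments[1:-1]:
        if not seg:
            continue
        found = next((i for i in range(pos, end - len(seg) + 1)
                      if seg_match(seg, target[i:i + len(seg)])), None)
        if found is None:
            return False
        pos = found + len(seg)
    return True

# wildcard_match("ab*c?e*f", "abXXcYeZZf") -> True
```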

217 citations



01 Jan 2001
TL;DR: It is shown that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering than contiguous q-grams, provided that the arrangement of the gaps and a filter parameter called the threshold are optimized.
Abstract: A popular and well-studied class of filters for approximate string matching compares substrings of length q, the q-grams, in the pattern and the text to identify text areas that contain potential matches. A generalization of the method that uses gapped q-grams instead of contiguous substrings is mentioned a few times in the literature but has never been analyzed in any depth. In this paper, we report the first results of a study on gapped q-grams. We show that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering than contiguous q-grams. To achieve these results, the arrangement of the gaps in the q-gram and a filter parameter called the threshold have to be optimized. Both of these tasks are nontrivial combinatorial optimization problems for which we present efficient solutions. We concentrate on the k mismatches problem, i.e., approximate string matching with the Hamming distance.
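To make the object under study concrete, a small sketch of extracting gapped q-grams for a given gap arrangement (a "shape"); the shape, names and output format are illustrative, and the threshold optimization that the paper contributes is not shown.

```python
def gapped_qgrams(s: str, shape):
    """Extract gapped q-grams of `s` for a given gap arrangement.

    `shape` lists the character offsets the q-gram samples, sorted and
    starting at 0; e.g. (0, 1, 3) takes two adjacent characters, skips one
    and takes a third.  Contiguous q-grams are the special case
    shape == (0, 1, ..., q-1).  The shape used here is only an example;
    choosing it well is exactly the optimization problem the paper studies.
    """
    span = shape[-1] + 1
    return [
        (i, "".join(s[i + off] for off in shape))
        for i in range(len(s) - span + 1)
    ]

# gapped_qgrams("approximate", (0, 1, 3))
# -> [(0, 'apr'), (1, 'ppo'), (2, 'prx'), (3, 'roi'), ...]
```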

159 citations


Journal ArticleDOI
TL;DR: Nrgrep is a new pattern-matching tool designed for efficient search of complex patterns, based on a single and uniform concept: the bit-parallel simulation of a non-deterministic suffix automaton. It can find anything from simple patterns to regular expressions, exactly or allowing errors in the matches.
Abstract: We present nrgrep (‘non-deterministic reverse grep’), a new pattern-matching tool designed for efficient search of complex patterns. Unlike previous tools of the grep family, such as agrep and Gnu grep, nrgrep is based on a single and uniform concept: the bit-parallel simulation of a non-deterministic suffix automaton. As a result, nrgrep can find from simple patterns to regular expressions, exactly or allowing errors in the matches, with an efficiency that degrades smoothly as the complexity of the searched pattern increases. Another concept that is fully integrated into nrgrep and that contributes to this smoothness is the selection of adequate subpatterns for fast scanning, which is also absent in many current tools. We show that the efficiency of nrgrep is similar to that of the fastest existing string-matching tools for the simplest patterns, and is by far unmatched for more complex patterns. Copyright © 2001 John Wiley & Sons, Ltd.
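The core concept can be seen in isolation in plain BNDM, the bit-parallel simulation of a nondeterministic suffix automaton for a single exact pattern; the sketch below is that baseline only, assuming patterns up to one machine word, and includes none of nrgrep's extensions to classes, errors or regular expressions.

```python
def bndm_search(pattern: str, text: str):
    """Exact single-pattern search with BNDM: a bit-parallel simulation of
    the nondeterministic suffix automaton of the reversed pattern."""
    m, n = len(pattern), len(text)
    assert 0 < m <= 64, "the real algorithm keeps the state in one machine word"
    mask = (1 << m) - 1
    # B[c]: bit (m-1-i) set iff pattern[i] == c, i.e. the pattern reversed.
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))
    hits, pos = [], 0
    while pos <= n - m:
        j, last = m, m
        D = mask                           # every factor is still alive
        while D:
            D &= B.get(text[pos + j - 1], 0)
            j -= 1
            if D & (1 << (m - 1)):         # a pattern prefix ends the window suffix
                if j > 0:
                    last = j               # remember it for a safe window shift
                else:
                    hits.append(pos)       # the whole pattern matched
            D = (D << 1) & mask
        pos += last                        # skip; characters in between are never read
    return hits

# bndm_search("rose", "a rose is a rose") -> [2, 12]
```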

134 citations


Journal Article
TL;DR: This paper develops a technique for building approximate string processing capabilities on top of commercial databases by exploiting facilities already available in them by relying on generating short substrings of length q, called q-grams, and processing them using standard methods available in the DBMS.
Abstract: String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string queries directly, and it is a challenge to implement this functionality efficiently with user-defined functions (UDFs). In this paper, we develop a technique for building approximate string processing capabilities on top of commercial databases by exploiting facilities already available in them. At the core, our technique relies on generating short substrings of length q, called q-grams, and processing them using standard methods available in the DBMS. The proposed technique enables various approximate string processing methods in a DBMS, for example approximate (sub)string selections and joins, and can even be used with a variety of possible edit distance functions. The approximate string match predicate, with a suitable edit distance threshold, can be mapped into a vanilla relational expression and optimized by conventional relational optimizers.

117 citations


Book ChapterDOI
01 Jul 2001
TL;DR: In this paper, the authors show that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering for approximate string matching than contiguous substrings, and they also show that the arrangement of the gaps in the q-gram and a filter parameter called the threshold have to be optimized.
Abstract: A popular and well-studied class of filters for approximate string matching compares substrings of length q, the q-grams, in the pattern and the text to identify text areas that contain potential matches. A generalization of the method that uses gapped q-grams instead of contiguous substrings is mentioned a few times in the literature but has never been analyzed in any depth. In this paper, we report the first results of a study on gapped q-grams. We show that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering than contiguous q-grams. To achieve these results, the arrangement of the gaps in the q-gram and a filter parameter called the threshold have to be optimized. Both of these tasks are nontrivial combinatorial optimization problems for which we present efficient solutions. We concentrate on the k mismatches problem, i.e., approximate string matching with the Hamming distance.

108 citations


Journal ArticleDOI
TL;DR: A spelling correction system designed specifically for OCR-generated text that selects candidate words through the use of information gathered from multiple knowledge sources is described, based on static and dynamic device mappings, approximate string matching, and n-gram analysis.
Abstract: In this paper, we describe a spelling correction system designed specifically for OCR-generated text that selects candidate words through the use of information gathered from multiple knowledge sources. This system for text correction is based on static and dynamic device mappings, approximate string matching, and n-gram analysis. Our statistically based, Bayesian system incorporates a learning feature that collects confusion information at the collection and document levels. An evaluation of the new system is presented as well.

97 citations


Patent
22 Jan 2001
TL;DR: In this paper, a character string of which a start point is each address of character string data in an input buffer is rearranged in a predetermined order, so that a rank list is generated.
Abstract: A character string of which a start point is each address of character string data in an input buffer is rearranged in the predetermined order, so that a rank list is generated. Next, the location of the matching candidate of a character string to be encoded is obtained on the basis of the rank list. Then, the character string to be encoded is compared with a matching candidate, thereby obtaining a matching length. Further, a code is generated using the location of the matching candidate and the matching length, and the code is output as compression data.

72 citations


Book ChapterDOI
19 Dec 2001
TL;DR: It is shown how to solve CLOSEST STRING in linear time for constant d (the exponential growth is O(d^d)), and this result is extended to the closely related problems d-MISMATCH and DISTINGUISHING STRING SELECTION.
Abstract: CLOSEST STRING is one of the core problems in the field of consensus word analysis, with particular importance for computational biology. Given k strings of the same length and a positive integer d, find a "closest string" s such that none of the given strings has Hamming distance greater than d from s. CLOSEST STRING is NP-complete. We show how to solve CLOSEST STRING in linear time for constant d (the exponential growth is O(d^d)). We extend this result to the closely related problems d-MISMATCH and DISTINGUISHING STRING SELECTION. Moreover, we discuss fixed-parameter tractability for parameter k and give an efficient linear-time algorithm for CLOSEST STRING when k = 3. Finally, the practical usefulness of our findings is substantiated by some experimental results.
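A compact recursive sketch in the spirit of the bounded branching behind such fixed-parameter results: if some input string is still too far from the current candidate, a valid center must agree with it on at least one of any d+1 mismatching positions. This is an illustrative implementation, not the paper's exact linear-time procedure.

```python
def closest_string(strings, d):
    """Find a center string within Hamming distance d of every string in
    `strings`, or return None if no such center exists.

    Bounded search tree: start from the first string and, while some input
    string is still more than d away, branch on d+1 of the mismatching
    positions.  At most d repairs are ever needed, so the tree stays small
    when d is small.
    """
    def mismatches(a, b):
        return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

    def search(center, budget):
        for s in strings:
            diff = mismatches(center, s)
            if len(diff) > d:
                if budget == 0:
                    return None
                # A valid center agrees with s on at least one of any d+1
                # mismatching positions, so these branches cover all cases.
                for p in diff[:d + 1]:
                    found = search(center[:p] + s[p] + center[p + 1:], budget - 1)
                    if found is not None:
                        return found
                return None
        return center      # every string is within distance d of `center`

    return search(strings[0], d)

# closest_string(["ACCT", "AGGT", "ACGT"], 1) -> "ACGT"
```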

Patent
Yaniv Shapira
21 Feb 2001
TL;DR: The strings to be searched for are divided into a plurality of two- and three-character substrings and stored in substring tables. A hash of each substring is calculated and stored in a hash table whose output is an index to a substring table, and a string is declared found if all the substrings making up the string have been received in correct consecutive order.
Abstract: An apparatus for and method of simultaneously searching an input character stream for the presence of multiple strings. The strings to be searched for are determined a priori, processed and stored in substring tables during a configuration phase. The strings to be searched for are divided into a plurality of two and three character substrings and stored in substring tables. A hash of each substring is calculated and stored in a hash table whose output is an index to a substring table. During searching, the content filter generates the hash of the input character stream and attempts to find a matching substring stored in the hash table. A string is declared found if all the substrings making up the string have been received in correct consecutive order.

Proceedings ArticleDOI
27 Mar 2001
TL;DR: This work presents a different approach to approximate string matching on compressed text, which reduces the problem to multipattern searching of pattern pieces plus local decompression and direct verification of candidate text areas, thus becoming the first practical solution to the problem.
Abstract: Approximate string matching on compressed text was an open problem for almost a decade. The two existing solutions are very new. Although they represent important complexity breakthroughs, in most practical cases they are not useful, in the sense that they are slower than uncompressing the text and then searching the uncompressed text. We present a different approach, which reduces the problem to multipattern searching of pattern pieces plus local decompression and direct verification of candidate text areas. We show experimentally that this solution is 10-30 times faster than previous work and up to three times faster than the trivial approach of uncompressing and searching, thus becoming the first practical solution to the problem.
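The reduction leans on a classical filtering fact: split the pattern into k+1 pieces and any occurrence with at most k differences contains at least one piece unchanged. A sketch of that candidate-generation step on plain (uncompressed) text follows; the multipattern search inside the compressed file and the verification step are omitted.

```python
def candidate_windows(pattern: str, text: str, k: int):
    """Filtering step for search with at most k differences: split the
    pattern into k+1 pieces; any approximate occurrence must contain at
    least one piece verbatim (pigeonhole), so exact hits of the pieces
    point at candidate windows.  Every window must still be checked with a
    standard approximate matcher (verification omitted here)."""
    m = len(pattern)
    assert m > k, "need at least one non-empty piece per allowed error"
    piece_len = m // (k + 1)
    pieces = [pattern[i * piece_len:(i + 1) * piece_len] for i in range(k)]
    pieces.append(pattern[k * piece_len:])      # last piece takes the remainder
    windows = set()
    for idx, piece in enumerate(pieces):
        offset = idx * piece_len                # piece position inside the pattern
        start = text.find(piece)
        while start != -1:
            lo = max(0, start - offset - k)                   # earliest possible start
            hi = min(len(text), start - offset + m + 2 * k)   # latest possible end
            windows.add((lo, hi))
            start = text.find(piece, start + 1)
    return sorted(windows)

# candidate_windows("algorithm", "a fast algoritm for matching", 1)
# -> [(6, 18)]   (a window around "algoritm", still to be verified)
```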

Journal ArticleDOI
TL;DR: A randomized algorithm in deterministic time O(N log M) for estimating the score vector of matches between a text string of length N and a pattern string of length M, i.e., the vector obtained when the pattern is slid along the text, and the number of matches is counted for each position.
Abstract: We give a randomized algorithm in deterministic time O(N log M) for estimating the score vector of matches between a text string of length N and a pattern string of length M, i.e., the vector obtained when the pattern is slid along the text, and the number of matches is counted for each position. A direct application is approximate string matching. The randomized algorithm uses convolution to find an estimator of the scores; the variance of the estimator is particularly small for scores that are close to M, i.e., for approximate occurrences of the pattern in the text. No assumption is made about the probabilistic characteristics of the input, or about the size of the alphabet. The solution extends to string matching with classes, class complements, "never match" and "always match" symbols, to the weighted case and to higher dimensions.
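For reference, the quantity being estimated is easy to compute exactly, one alphabet symbol at a time, with cross-correlation; the paper's point is a randomized estimator that avoids this per-symbol work. A sketch assuming numpy:

```python
import numpy as np

def match_score_vector(text: str, pattern: str) -> np.ndarray:
    """score[i] = number of positions j with text[i + j] == pattern[j],
    i.e. the match count for every alignment of the pattern in the text."""
    n, m = len(text), len(pattern)
    score = np.zeros(n - m + 1, dtype=np.int64)
    for symbol in set(pattern):
        t = np.fromiter((c == symbol for c in text), dtype=np.int64, count=n)
        p = np.fromiter((c == symbol for c in pattern), dtype=np.int64, count=m)
        # Cross-correlation: result[i] = sum_j t[i + j] * p[j]
        score += np.correlate(t, p, mode="valid")
    return score

# match_score_vector("abracadabra", "abra")
# -> array([4, 0, 1, 1, 1, 1, 0, 4])   # exact occurrences score m = 4
```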

Journal ArticleDOI
01 Oct 2001
TL;DR: Experimental results have shown that the proposed ideas of ACM, MM, fuzzy map integration, and fuzzy map matching are well suited to high-performing students and to difficult subject materials.
Abstract: A concept map, typically depicted as a connected graph, is composed of a collection of propositions. Each proposition forming a semantic unit consists of a small set of concept nodes interconnected to one another with relation links. Concept maps possess a number of appealing features which make them a promising tool for teaching, learning, evaluation, and curriculum planning. We extend concept maps by associating their concept nodes and relation links with attribute values which indicate the relative significance of concepts and relationships in knowledge representation. The resulting maps are called attributed concept maps (ACM). Students are assessed by matching their ACMs with those prebuilt by experts. The associated techniques are referred to as map matching techniques. The building of an expert ACM has in the past been done by only one specialist. We integrate a number of maps developed by separate experts into a single map, called the master map (MM), which will serve as a prototypical map in map matching. Both map integration and map matching are conceptualized in terms of fuzzy set discipline. Experimental results have shown that the proposed ideas of ACM, MM, fuzzy map integration, and fuzzy map matching are well suited to high-performing students and to difficult subject materials.

Journal ArticleDOI
Gad M. Landau, Michal Ziv-Ukelson
TL;DR: This paper describes an algorithm which is composed of an encoding stage and an alignment stage, and shows how to reduce the O(n?) alignment work, for each appearance of the common substring Y in a source string, to O-at the cost of O( n?) encoding work, which is executed only once.

Patent
26 Jul 2001
TL;DR: In this article, a pattern is partitioned into context and value components, and candidate matches for each of the components is identified by calculating an edit distance between that component and each potentially matching set (sub-string) of symbols within the string.
Abstract: A system and method for examining a string of symbols and identifying portions of the string which match a predetermined pattern using adaptively weighted, partitioned context edit distances. A pattern is partitioned into context and value components, and candidate matches for each of the components are identified by calculating an edit distance between that component and each potentially matching set (sub-string) of symbols within the string. One or more candidate matches having the lowest edit distances are selected as matches for the pattern. The weighting of each of the component matches may be adapted to optimize the pattern matching and, in one embodiment, the context components may be heavily weighted to obtain matches of a value for which the corresponding pattern is not well defined. In one embodiment, an edit distance matrix is evaluated for each of a prefix component, a value component and a suffix component of a pattern. The evaluation of the prefix matrix provides a basis for identifying indicators of the beginning of a value window, while the evaluation of the suffix matrix provides a basis for identifying the alignment of the end of the value window. The value within the value window can then be evaluated via the value matrix to determine a corresponding value match score.

Journal ArticleDOI
TL;DR: Different forms of approximate periodicity under a variety of distance functions are studied; polynomial-time algorithms are derived for two of the problems, while the third is shown to be NP-complete.

Book ChapterDOI
01 Jul 2001
TL;DR: A new notion of weak factor recognition is introduced as the foundation of new data structures and on-line string matching algorithms, together with a new automaton built on a string p = p1p2 ... pm that acts like an oracle on the set of factors pi ... pj.
Abstract: We introduce a new notion of weak factor recognition that is the foundation of new data structures and on-line string matching algorithms. We define a new automaton built on a string p = p1p2 ... pm that acts like an oracle on the set of factors pi ... pj. If a string is recognized by this automaton, it may be a factor of p. But, if it is rejected, it is surely not a factor. We call it factor oracle. More precisely, this automaton is acyclic, recognizes at least the factors of p, has m + 1 states and a linear number of transitions. We give a very simple sequential construction algorithm to build it. Using this automaton, we design an efficient experimental on-line string matching algorithm (we conjecture its optimality in regard to the experimental results) that is really simple to implement. We also extend the factor oracle to predict that a string could be a suffix (i.e. in the set pi ... pm) of p. We obtain the suffix oracle, which enables in some cases a tricky improvement of the previous string matching algorithm.
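A sketch of the sequential factor-oracle construction along the lines described here (state 0 initial, one state per character, plus a supply function); variable names are ours, and the derived string matching algorithm is not shown.

```python
def build_factor_oracle(p: str):
    """Build the factor oracle of p: an acyclic automaton with len(p) + 1
    states and a linear number of transitions that accepts at least every
    factor of p (weak factor recognition: rejection is conclusive,
    acceptance is not)."""
    m = len(p)
    trans = [dict() for _ in range(m + 1)]   # trans[state][char] -> state
    supply = [0] * (m + 1)                   # supply (suffix-link) function
    supply[0] = -1
    for i in range(1, m + 1):
        c = p[i - 1]
        trans[i - 1][c] = i                  # spine transition
        k = supply[i - 1]
        while k > -1 and c not in trans[k]:
            trans[k][c] = i                  # external transition
            k = supply[k]
        supply[i] = trans[k][c] if k > -1 else 0
    return trans

def oracle_accepts(trans, w: str) -> bool:
    """If False, w is certainly not a factor of p; if True, it may be."""
    state = 0
    for c in w:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True

oracle = build_factor_oracle("abbab")
# oracle_accepts(oracle, "bba") -> True  (a real factor)
# oracle_accepts(oracle, "abc") -> False (certainly not a factor)
```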

Patent
30 Jul 2001
TL;DR: In this article, a system and method for improving string matching in a noisy channel environment is described. The system identifies candidates within the textual file that may match the query string and analyzes the probability that the string candidate matches a user-defined string.
Abstract: Described is a system and method for improving string matching in a noisy channel environment. The invention provides a method for identifying string candidates and analyzing the probability that the string candidate matches a user-defined string. In one implementation, a find engine receives a query string, converts an image file into a textual file, and identifies each instance of the query string in the textual file. The find engine identifies candidates within the textual file that may match the query string. The find engine refers to a confusion table to help identify whether candidates that are near matches to the query string are actually matches to the query string but for a common recognition error. Candidates meeting a probability threshold are identified as matches to the query string. The invention further provides for analysis options including word heuristics, language models, and OCR confidences.

Journal ArticleDOI
TL;DR: This work shows an excellent example of a complex and theoretical analysis of algorithms used for design and for practical algorithm engineering, instead of the common practice of first designing an algorithm and then analyzing it.
Abstract: We study a recent algorithm for fast on-line approximate string matching. This is the problem of searching a pattern in a text allowing errors in the pattern or in the text. The algorithm is based on a very fast kernel which is able to search short patterns using a nondeterministic finite automaton, which is simulated using bit-parallelism. A number of techniques to extend this kernel for longer patterns are presented in that work. However, the techniques can be integrated in many ways and the optimal interplay among them is by no means obvious. The solution to this problem starts at a very low level, by obtaining basic probabilistic information about the problem which was not previously known, and ends integrating analytical results with empirical data to obtain the optimal heuristic. The conclusions obtained via analysis are experimentally confirmed. We also improve many of the techniques and obtain a combined heuristic which is faster than the original work. This work shows an excellent example of a complex and theoretical analysis of algorithms used for design and for practical algorithm engineering, instead of the common practice of first designing an algorithm and then analyzing it.

Journal ArticleDOI
TL;DR: The compressed suffix array is used, which compactly stores the suffix array at the cost of a theoretically small slowdown in access speed, and an approximate string matching algorithm is proposed which is suitable for the compressed suffix array.
Abstract: Because of the increase in the size of genome sequence databases, the importance of indexing the sequences for fast queries grows. Suffix trees and suffix arrays are used for simple queries. However, these are not suitable for complicated queries over huge amounts of sequence data because the indices are stored on disk, which has slow access speed. We propose storing the indices in memory in a compressed form. We use the compressed suffix array. It compactly stores the suffix array at the cost of a theoretically small slowdown in access speed. We experimentally show that the overhead of using the compressed suffix array is reasonable in practice. We also propose an approximate string matching algorithm which is suitable for the compressed suffix array. Furthermore, we have constructed the compressed suffix array of the whole human genome. Because its size is about 2 GB, a workstation can handle the search index for the whole data in main memory, which will accelerate the speed of solving various problems in genome informatics.
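For orientation, the uncompressed baseline the paper starts from: a plain suffix array with binary search, which answers exact queries but, at genome scale, is exactly the memory-hungry structure the compressed suffix array replaces. The construction and search below are naive illustrative versions.

```python
def build_suffix_array(s: str):
    """Naive construction (fine for illustration; genome-scale indexes use
    linear-time construction and, as in the paper, compression)."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def find_occurrences(s: str, sa, query: str):
    """All start positions of `query` in `s`, by binary search over the
    suffix array."""
    q = len(query)

    def first_at_least(strictly_greater: bool) -> int:
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = s[sa[mid]:sa[mid] + q]
            if prefix < query or (strictly_greater and prefix == query):
                lo = mid + 1
            else:
                hi = mid
        return lo

    start, end = first_at_least(False), first_at_least(True)
    return sorted(sa[start:end])

genome_piece = "ACGTACGGACGT"
sa = build_suffix_array(genome_piece)
# find_occurrences(genome_piece, sa, "ACG") -> [0, 4, 8]
```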

Patent
Yoav Ossia
21 May 2001
TL;DR: In this paper, a computer implemented method and system for selecting a string for serving as a reference string for a comparison scheme for compressing a set of strings calculates preliminary compression results for every string relative to an initial reference string, and uses the preliminary compression result to find a better reference string without additional compression tests.
Abstract: A computer implemented method and system for selecting a string for serving as a reference string for a comparison scheme for compressing a set of strings calculates preliminary compression results for every string relative to an initial reference string, and uses the preliminary compression results to find a better reference string without additional compression tests. According to one embodiment, a histogram is calculated showing the number of occurrences of each compressed length for each string in the set plotted against the initial reference string and the better reference string has a length corresponding to an average compression length or center of gravity of the histogram.

Patent
10 Oct 2001
TL;DR: In this paper, a method and device for string matching HTTP headers is presented, which typically includes identifying a predefined string, identifying an unknown string to compare with the predefined strings, performing a bitwise exclusive OR operation on an ASCII binary representation of at least one segment of the unknown string, and identifying a case-insensitive string match based on the exclusive operation.
Abstract: A method and device for string matching HTTP headers. The method typically includes identifying a predefined string, identifying an unknown string to compare with the predefined string, performing a bitwise exclusive OR operation on an ASCII binary representation of at least one segment of the unknown string and an ASCII binary representation of at least one segment of the predefined string, and identifying a case-insensitive string match based on the exclusive OR operation. The method may further include performing a bitwise operation with a predefined flag to determine the case-insensitive segment match.
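The trick relies on the ASCII layout: corresponding upper- and lower-case letters differ only in bit 0x20, so XOR-ing two segments leaves zeros where they agree exactly and 0x20 where only the case differs. A byte-level sketch with a hypothetical helper, not the patented device:

```python
def ascii_ci_equal(a: bytes, b: bytes) -> bool:
    """Case-insensitive comparison of two ASCII segments via XOR.

    In ASCII, corresponding upper- and lower-case letters differ only in
    bit 0x20 ('A' ^ 'a' == 0x20), so a nonzero XOR is acceptable only when
    it equals 0x20 and the byte really is a letter."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        diff = x ^ y
        if diff == 0:
            continue
        if diff == 0x20 and ord("a") <= (x | 0x20) <= ord("z"):
            continue                    # same letter, different case
        return False
    return True

# ascii_ci_equal(b"Content-Length", b"content-length") -> True
# ascii_ci_equal(b"Content-Length", b"Content+Length") -> False
```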

Patent
05 Dec 2001
TL;DR: In this article, an iterative search technique is used to quickly and accurately locate information in a database, such as one storing information about digital versatile discs (DVDs), where a presumably unique search key is generated for an unidentified DVD and compared with corresponding keys in the database.
Abstract: An iterative search technique is used to quickly and accurately locate information in a database, such as one storing information about digital versatile discs (DVDs). First, a presumably unique search key is generated for an unidentified DVD and compared with corresponding keys in a database. If no match is found, progressively less specific information is used to generate a series of search keys that are similarly compared with corresponding keys in the database. If at least one possibly matching record is found, it is determined whether the best matching record can be considered a match; otherwise, less specific information is used to search for a match until predefined least specific information is used.

Patent
19 Jan 2001
TL;DR: In this paper, a method for manipulation, storage, modeling, visualization, and quantification of datasets which correspond to target strings is described, which is used to generate comparison strings corresponding to some set of points that can serve as the domain of an iterative function.
Abstract: There is described a method for manipulation, storage, modeling, visualization, and quantification of datasets, which correspond to target strings. An iterative algorithm is used to generate comparison strings corresponding to some set of points that can serve as the domain of an iterative function. Preferably, these points are located in the complex plane, such as in and/or near the Mandelbrot set or a Julia set. The comparison string is scored by evaluating a function having the comparison string and one of the plurality of target strings as inputs. The evaluation may be repeated for a number of the other target strings. The score or some other property corresponding to the comparison string is used to determine the target string's placement on a map. The points are analyzed and/or compared by examining, either visually or mathematically, their relative locations, their absolute locations within the region, and/or metrics other than location.

Proceedings Article
01 Jan 2001
TL;DR: δ-approximate and (δ, γ)-approximate matching are two new notions of approximate matching that arise naturally in applications of computer-assisted music analysis; fast, efficient and practical algorithms are presented for both notions.
Abstract: Here we consider computational problems on δ-approximate and (δ, γ)-approximate string matching. These are two new notions of approximate matching that arise naturally in applications of computer-assisted music analysis. We present fast, efficient and practical algorithms for these two notions of approximate string matching.
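In this setting strings are sequences of numbers (e.g. pitch values): a pattern δ-matches a window when every aligned pair differs by at most δ, and (δ, γ)-matches when the differences additionally sum to at most γ. A naive reference implementation of the two notions (the paper's fast algorithms are not reproduced):

```python
def delta_gamma_occurrences(text, pattern, delta, gamma=None):
    """Positions i where pattern (delta, gamma)-matches text[i : i + m].

    delta bounds every per-position difference |text[i+j] - pattern[j]|;
    gamma, if given, additionally bounds the sum of those differences.
    Both text and pattern are sequences of integers (e.g. pitch values)."""
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        diffs = [abs(text[i + j] - pattern[j]) for j in range(m)]
        if max(diffs) <= delta and (gamma is None or sum(diffs) <= gamma):
            hits.append(i)
    return hits

melody = [60, 62, 64, 65, 67, 69]        # a scale fragment (MIDI-style pitches)
motif  = [63, 65, 66]
# delta_gamma_occurrences(melody, motif, delta=1)           -> [1, 2]
# delta_gamma_occurrences(melody, motif, delta=1, gamma=2)  -> [2]
```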

Proceedings ArticleDOI
01 May 2001
TL;DR: A technique for two-dimensional substring indexing based on a reduction to the geometric problem of identifying common colors in two ranges containing colored points is presented and can be practically realized using a combination of string B-trees and R-trees.
Abstract: As databases have expanded in scope to storing string data (XML documents, product catalogs), it has become increasingly important to search databases based on matching substrings, often on multiple, correlated dimensions. While string B-trees are I/O optimal in one dimension, no index structure with non-trivial query bounds is known for two-dimensional substring indexing. In this paper, we present a technique for two-dimensional substring indexing based on a reduction to the geometric problem of identifying common colors in two ranges containing colored points. We develop an I/O efficient algorithm for solving the common colors problem, and use it to obtain an I/O efficient (poly-logarithmic query time) algorithm for the two-dimensional substring indexing problem. Our techniques result in a family of secondary memory index structures that trade space for time, with no loss of accuracy. We show how our technique can be practically realized using a combination of string B-trees and R-trees.

Book ChapterDOI
TL;DR: This paper presents an elegant and very easy to implement bit-vector algorithm for answering the following incremental version of the approximate string matching problem: given an appropriate encoding of a comparison between A and bB, can one compute the answer for A and B with equal efficiency?
Abstract: The approximate string matching problem is to find all locations at which a pattern of length m matches a substring of a text of length n with at most k differences. The program agrep implements a simple and practical bit-vector algorithm for this problem. In this paper we consider the following incremental version of the problem: given an appropriate encoding of a comparison between A and bB, can one compute the answer for A and B, and the answer for A and Bc with equal efficiency, where b and c are additional symbols? Here we present an elegant and very easy to implement bit-vector algorithm for answering these questions that requires only O(n⌈m/w⌉) time, where n is the length of A, m is the length of B and w is the number of bits in a machine word. We also present an O(nm⌈h/w⌉) algorithm for the fixed-length approximate string matching problem: given a text t, a pattern p and an integer h, compute the optimal alignment of all substrings of p of length h and a substring of t.
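For background, the non-incremental bit-vector recurrence behind agrep-style approximate matching, in Shift-And form with k+1 masks R_0..R_k, where bit j of R_i records that pattern[:j+1] matches a suffix of the text read so far with at most i differences. The sketch shows that baseline, not the incremental algorithm contributed here:

```python
def bitparallel_k_differences(pattern: str, text: str, k: int):
    """agrep-style search: report every position in `text` where an
    occurrence of `pattern` with at most k differences (insertions,
    deletions, substitutions) ends."""
    m = len(pattern)
    assert m <= 64, "a real implementation keeps each R_i in one machine word"
    mask = (1 << m) - 1
    accept = 1 << (m - 1)
    B = {}
    for j, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << j)
    # R[i]: bit j set means pattern[:j+1] matches a suffix of the text read
    # so far with at most i differences; initially the first i pattern
    # characters may simply be deleted.
    R = [(1 << i) - 1 for i in range(k + 1)]
    hits = []
    for pos, c in enumerate(text):
        Bc = B.get(c, 0)
        prev = R[0]                                  # old value of R[i-1]
        R[0] = ((R[0] << 1) | 1) & Bc
        for i in range(1, k + 1):
            old = R[i]
            R[i] = ((((old << 1) | 1) & Bc)          # match current character
                    | (prev << 1) | 1                # substitution
                    | prev                           # insertion in the text
                    | (R[i - 1] << 1)) & mask        # deletion from the pattern
            prev = old
        if R[k] & accept:
            hits.append(pos)                         # occurrence ends at text[pos]
    return hits

# bitparallel_k_differences("survey", "surgery", 2)
# -> end positions of substrings of "surgery" within edit distance 2 of "survey"
```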

Patent
19 Dec 2001
TL;DR: In this article, a look-up table is used to address a chained array or list of previously matching character strings, and the array is updated if there is another matching character string found when compressing the input string.
Abstract: A system and method for compressing and decompressing data in real time begins by taking a character string from an input string (12), generating a hash value (28) of the character string (16) which is utilized in a look-up table (18) to address a chained array or list (20) of previously matching character strings. The array is updated (34) if there is another matching character string found when compressing the input string. A token generator (36) writes a code (102, 103, 105) to the output string (14) indicating whether or not there has been a match. The token generator (36) generates an indication of the length of the character string not compressed, the one or more characters not compressed, the length of a matching character string, and the number of characters processed since the last match. These values generated by the token generator are optimally represented based upon preselected criteria.