Topic
Approximate string matching
About: Approximate string matching is a research topic. Over its lifetime, 1903 publications have appeared within this topic, receiving 62352 citations. The topic is also known as: fuzzy string-searching algorithm and fuzzy string-matching algorithm.
Papers
•
29 Dec 1998 TL;DR: Selected character strings are found by automatically searching a text for character strings that match any of a list of selected strings and that end at a probable string ending.
Abstract: Selected character strings are automatically found by performing an automatic search of a text to find character strings that match any of a list of selected strings. The automatic search includes a series of iterations, each with a starting point in the text. Each iteration determines whether its starting point is followed by a character string that matches any of the list of selected strings and that ends at a probable string ending. Each iteration also finds a starting point for the next iteration that is a probable string beginning. The selected strings can be words and multiple word expressions, in which case probable string endings and beginnings are word boundaries. A finite state lexicon, such as a finite state transducer or a finite state automaton, can be used to determine whether character strings match the list of selected strings. A tokenizing automaton can be used to find starting points.
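The iteration described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented method: a plain dictionary trie stands in for the finite state lexicon, a regular expression stands in for the tokenizing automaton, and a "probable string ending" is approximated as any position that does not fall between two alphanumeric characters. The names `build_trie`, `is_boundary`, and `find_selected` are hypothetical.

```python
import re

def build_trie(entries):
    """Build a nested-dict trie; the key "" marks the end of a lexicon entry."""
    root = {}
    for entry in entries:
        node = root
        for ch in entry:
            node = node.setdefault(ch, {})
        node[""] = True
    return root

def is_boundary(text, i):
    """Assumed notion of a probable string ending: not inside a run of alphanumerics."""
    if i == 0 or i == len(text):
        return True
    return not (text[i - 1].isalnum() and text[i].isalnum())

def find_selected(text, entries):
    """Return (start, end, matched_string) for the longest entry found at each word start."""
    trie = build_trie(entries)
    matches = []
    # Each word start is treated as a probable string beginning (one iteration each).
    for m in re.finditer(r"\b\w", text):
        start = m.start()
        node, i, best_end = trie, start, None
        while i < len(text) and text[i] in node:
            node = node[text[i]]
            i += 1
            # Accept only if a lexicon entry ends here AND this is a word boundary.
            if "" in node and is_boundary(text, i):
                best_end = i
        if best_end is not None:
            matches.append((start, best_end, text[start:best_end]))
    return matches

if __name__ == "__main__":
    sample = "I saw New York and New Yorkers in York."
    for hit in find_selected(sample, ["New York", "York", "New"]):
        print(hit)
```

In this sketch every word start is tried, so overlapping matches (such as "York" inside "New York") are all reported. "Yorkers" is not reported as "York" because the candidate ending falls inside a word.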
50 citations
••
16 May 2000 TL;DR: This paper describes a general technique for adapting a multi-dimensional index structure for wild-card indexing of unbounded length string data, and instantiates its generic techniques by adapting the 2-dimensional R-tree to string data.
Abstract: As databases have expanded in scope from storing purely business data to include XML documents, product catalogs, e-mail messages, and directory data, it has become increasingly important to search databases based on wild-card string matching: prefix matching, for example, is more common (and useful) than exact matching, for such data. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the dimensions. Traditional multi-dimensional index structures, designed with (fixed length) numeric data in mind, are not suitable for matching unbounded length string data. In this paper, we describe a general technique for adapting a multi-dimensional index structure for wild-card indexing of unbounded length string data. The key ideas are (a) a carefully developed mapping function from strings to rational numbers, (b) representing an unbounded length string in an index leaf page by a fixed length offset to an external key, and (c) storing multiple elided tries, one per dimension, in an index page to prune search during traversal of index pages. These basic ideas affect all index algorithms. In this paper, we present efficient algorithms for different types of string matching. While our technique is applicable to a wide range of multi-dimensional index structures, we instantiate our generic techniques by adapting the 2-dimensional R-tree to string data. We demonstrate the space effectiveness and time benefits of using the string R-tree both analytically and experimentally.
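To make idea (a) concrete, here is one simple order-preserving mapping from strings to rational numbers. It is only an assumed stand-in, since the abstract does not specify the paper's own mapping function. Each character becomes a digit in base `len(ALPHABET) + 1`, with digit 0 reserved for "end of string", so a prefix query turns into a numeric range query of the kind a multi-dimensional index can answer. The alphabet and the names `to_rational` and `prefix_range` are hypothetical.

```python
from fractions import Fraction

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed alphabet for this sketch
BASE = len(ALPHABET) + 1                 # digit 0 is reserved for "end of string"

def to_rational(s):
    """Map a string to an exact rational in [0, 1), preserving lexicographic order."""
    value = Fraction(0)
    for i, ch in enumerate(s, start=1):
        value += Fraction(ALPHABET.index(ch) + 1, BASE ** i)
    return value

def prefix_range(prefix):
    """Half-open interval [low, high) containing exactly the strings that start with prefix."""
    low = to_rational(prefix)
    return low, low + Fraction(1, BASE ** len(prefix))

if __name__ == "__main__":
    words = ["car", "cart", "carton", "cat", "dog", "ca", "b"]
    low, high = prefix_range("car")
    print([w for w in words if low <= to_rational(w) < high])
```

Exact `Fraction` arithmetic is used so that arbitrarily long strings do not lose precision. A real index would need a bounded representation, which is where ideas (b) and (c) of the abstract come in.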
50 citations
•
28 Oct 1994 TL;DR: An improved method of matching a query string against a plurality of candidate strings replaces a highly computationally intensive string edit distance calculation with a less computationally intensive lower bound estimate.
Abstract: An improved method of matching a query string against a plurality of candidate strings replaces a highly computationally intensive string edit distance calculation with a less computationally intensive lower bound estimate. The lower bound estimate of the string edit distance between the two strings is calculated by equalising the lengths of the two strings by adding padding elements to the shorter one. The elements of the strings are then sorted and the substitution costs between corresponding elements are summed.
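The pad, sort, and sum procedure can be sketched as below. The abstract does not state the element type or the cost model, so this sketch assumes numeric elements, a substitution cost of |x - y|, and an insertion or deletion cost of |x - PAD|. Under those assumptions the sorted pairing is the cheapest possible pairing, so the sum never exceeds the true edit distance. The names `PAD`, `lower_bound`, and `edit_distance` are hypothetical.

```python
PAD = 0  # assumed padding element; inserting or deleting x costs abs(x - PAD)

def lower_bound(a, b):
    """Pad the shorter sequence, sort both, and sum the substitution costs pairwise."""
    n = max(len(a), len(b))
    a_sorted = sorted(list(a) + [PAD] * (n - len(a)))
    b_sorted = sorted(list(b) + [PAD] * (n - len(b)))
    return sum(abs(x - y) for x, y in zip(a_sorted, b_sorted))

def edit_distance(a, b):
    """Reference dynamic program with the same cost model, O(len(a) * len(b)) time."""
    prev = [0]
    for y in b:
        prev.append(prev[-1] + abs(y - PAD))      # build b from the empty sequence
    for x in a:
        cur = [prev[0] + abs(x - PAD)]            # delete everything seen so far in a
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + abs(x - PAD),           # delete x
                cur[j - 1] + abs(y - PAD),        # insert y
                prev[j - 1] + abs(x - y),         # substitute x by y
            ))
        prev = cur
    return prev[-1]

if __name__ == "__main__":
    query, candidate = [1, 5, 3], [3, 1, 5, 2]
    print(lower_bound(query, candidate))    # 2
    print(edit_distance(query, candidate))
```

A matcher could compute `lower_bound` for every candidate first and run `edit_distance` only on candidates whose bound is below the current best distance. With unit costs on characters the sorted pairing is not guaranteed to be a lower bound, which is why the absolute-difference cost is assumed here.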
50 citations
•
AT&T
TL;DR: The authors decompose each string in a database into overlapping "positional q-grams": sequences of a predetermined length q that carry information about the position of each q-gram within the string.
Abstract: Approximate substring indexing is accomplished by decomposing each string in a database into overlapping “positional q-grams”, sequences of a predetermined length q, and containing information regarding the “position” of each q-gram within the string (i.e., 1st q-gram, 4th q-gram, etc.). An index is then formed of the tuples of the positional q-gram data (such as, for example, a B-tree index or a hash index). Each query applied to the database is similarly parsed into a plurality of positional q-grams (of the same length), and a candidate set of matches is found. Position-directed filtering is used to remove the candidates which have the q-grams in the wrong order and/or too far apart to form a “verified” output of matching candidates. If errors are permitted (defined in terms of an edit distance between each candidate and the query), an edit distance calculation can then be performed to produce the final set of matching strings.
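The pipeline in this abstract (decompose, index, filter by position, verify) might look like the following sketch. It is simplified in ways the abstract does not prescribe: it matches whole strings and not substrings, uses an in-memory dictionary as the hash index, and applies the standard q-gram count bound (two strings within edit distance k share at least max(len) - q + 1 - k*q q-grams). All function names are hypothetical.

```python
from collections import defaultdict

def positional_qgrams(s, q):
    """List of (position, q-gram) pairs for every overlapping q-gram of s."""
    return [(i, s[i:i + q]) for i in range(len(s) - q + 1)]

def build_index(strings, q):
    """Hash index: q-gram -> list of (string id, position)."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for pos, gram in positional_qgrams(s, q):
            index[gram].append((sid, pos))
    return index

def edit_distance(a, b):
    """Unit-cost Levenshtein distance, used only to verify surviving candidates."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def search(query, strings, index, q, k):
    """Return the strings within edit distance k of query."""
    hits = defaultdict(int)
    for qpos, gram in positional_qgrams(query, q):
        for sid, pos in index.get(gram, ()):
            if abs(pos - qpos) <= k:          # position filter: shift of at most k
                hits[sid] += 1
    results = []
    for sid, s in enumerate(strings):
        if abs(len(s) - len(query)) > k:      # length filter
            continue
        needed = max(len(s), len(query)) - q + 1 - k * q
        if needed > 0 and hits[sid] < needed: # count filter (skipped when vacuous)
            continue
        if edit_distance(query, s) <= k:      # final verification
            results.append(s)
    return results

if __name__ == "__main__":
    data = ["kitten", "sitting", "mitten", "bitten", "kitchen"]
    idx = build_index(data, q=2)
    print(search("kitten", data, idx, q=2, k=1))
```

The filters can only produce false positives, never false negatives, so the edit distance check at the end decides the final answer. When the count threshold drops to zero or below, the sketch falls back to verifying every string of compatible length.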
49 citations
•
TL;DR: A textual problem for exponentially long strings is reduced to simple arithmetic on integers with (only) linearly many bits; the periodicity lemma allows some sets of exponentially many positions to be represented by feasibly many arithmetic progressions.
Abstract: We consider strings which are succinctly described. The description is in terms of straight-line programs in which the constants are symbols and the only operation is the concatenation. Such descriptions correspond to the systems of recurrences or to context-free grammars generating single words. The descriptive size of a string is the length n of a straight-line program (or size of a grammar) which defines this string. Usually the strings of descriptive size n are of exponential length. Fibonacci and Thue-Morse words are examples of such strings. We show that for a pattern P and text T of descriptive sizes m, n, an occurrence of P in T can be found (if there is any) in time polynomial with respect to n. This is nontrivial, since the actual lengths of P and T could be exponential, and none of the known string-matching algorithms is directly applicable. Our first tool is the periodicity lemma, which allows some sets of exponentially many positions to be represented in terms of feasibly many arithmetic progressions. The second tool is arithmetic: a simple application of Euclid's algorithm. Hence a textual problem for exponentially long strings is reduced here to simple arithmetic on integers with (only) linearly many bits. We also present an NP-complete version of the pattern-matching for shortly described strings.
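A small sketch may help show what a straight-line program is and why integer arithmetic can replace explicit expansion. It does not implement the paper's pattern-matching algorithm. It only builds a straight-line program for Fibonacci words, computes the lengths of all derived strings by addition, and reads a single character by descending the grammar. The rule encoding and the names `fibonacci_slp`, `lengths`, and `char_at` are assumptions made for this illustration.

```python
def fibonacci_slp(n):
    """Straight-line program with n rules for the Fibonacci words.

    rules[i] is either a terminal character or a pair (left, right) of
    earlier rule indices whose strings are concatenated.
    """
    rules = ["b", "a"]
    for i in range(2, n):
        rules.append((i - 1, i - 2))
    return rules

def lengths(rules):
    """Length of the string derived by each rule, using only integer addition."""
    out = []
    for rule in rules:
        out.append(1 if isinstance(rule, str) else out[rule[0]] + out[rule[1]])
    return out

def char_at(rules, lens, pos):
    """Character at index pos of the string derived by the last rule, without expanding it."""
    node = len(rules) - 1
    while not isinstance(rules[node], str):
        left, right = rules[node]
        if pos < lens[left]:
            node = left
        else:
            pos -= lens[left]
            node = right
    return rules[node]

if __name__ == "__main__":
    small = fibonacci_slp(6)
    small_lens = lengths(small)
    print("".join(char_at(small, small_lens, i) for i in range(small_lens[-1])))

    big = fibonacci_slp(100)           # 100 rules, text of roughly 3.5e20 characters
    big_lens = lengths(big)
    print(big_lens[-1], char_at(big, big_lens, 10**20))
```

With 100 rules the derived string is far too long to materialize, yet its length and any individual character are available after about 100 additions and comparisons. This is the sense in which the descriptive size n, not the actual length, governs the cost.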
49 citations