Topic

Approximate string matching

About: Approximate string matching is a research topic. Over its lifetime, 1903 publications have appeared within this topic, receiving 62352 citations. The topic is also known as fuzzy string-searching algorithm and fuzzy string-matching algorithm.

Papers

Patent

Finding selected character strings in text and providing information relating to the selected character strings

Jean-Pierre Chanod¹

Xerox¹

29 Dec 1998

TL;DR: This patent describes automatically searching a text for character strings that match any of a list of selected strings and that end at a probable string ending.

Abstract: Selected character strings are automatically found by performing an automatic search of a text to find character strings that match any of a list of selected strings. The automatic search includes a series of iterations, each with a starting point in the text. Each iteration determines whether its starting point is followed by a character string that matches any of the list of selected strings and that ends at a probable string ending. Each iteration also finds a starting point for the next iteration that is a probable string beginning. The selected strings can be words and multiple word expressions, in which case probable string endings and beginnings are word boundaries. A finite state lexicon, such as a finite state transducer or a finite state automaton, can be used to determine whether character strings match the list of selected strings. A tokenizing automaton can be used to find starting points.
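
The iteration described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: a plain Python set stands in for the finite state lexicon, regular-expression word boundaries stand in for the tokenizing automaton, and the function name find_selected is hypothetical. The sketch tries every word beginning as a starting point and keeps the longest match that ends at a word ending.

```python
import re

def find_selected(text, selected):
    """Return (start, match) pairs for selected strings found at word boundaries."""
    # Assumption: a set replaces the finite state lexicon from the abstract.
    lexicon = set(selected)
    max_len = max(map(len, lexicon), default=0)
    # Probable string beginnings: positions where a word starts.
    starts = [m.start() for m in re.finditer(r"\b\w", text)]
    # Probable string endings: positions just after a word ends.
    ends = {m.end() for m in re.finditer(r"\w\b", text)}
    matches = []
    for start in starts:
        # Try the longest candidate first so multiword expressions win.
        for end in range(min(len(text), start + max_len), start, -1):
            if end in ends and text[start:end] in lexicon:
                matches.append((start, text[start:end]))
                break
    return matches
```

Because candidates must end at a word ending, a selected string such as "tran" is not reported inside "transducer", while "finite state transducer" is preferred over the shorter "finite state" at the same starting point.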

50 citations

Journal Article

On effective multi-dimensional indexing for strings

H. V. Jagadish¹, Nick Koudas², Divesh Srivastava²

University of Michigan¹, AT&T Labs²

16 May 2000

TL;DR: This paper describes a general technique for adapting a multi-dimensional index structure for wild-card indexing of unbounded length string data, and instantiates its generic techniques by adapting the 2-dimensional R-tree to string data.

Abstract: As databases have expanded in scope from storing purely business data to include XML documents, product catalogs, e-mail messages, and directory data, it has become increasingly important to search databases based on wild-card string matching: prefix matching, for example, is more common (and useful) than exact matching, for such data. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the dimensions. Traditional multi-dimensional index structures, designed with (fixed length) numeric data in mind, are not suitable for matching unbounded length string data. In this paper, we describe a general technique for adapting a multi-dimensional index structure for wild-card indexing of unbounded length string data. The key ideas are (a) a carefully developed mapping function from strings to rational numbers, (b) representing an unbounded length string in an index leaf page by a fixed length offset to an external key, and (c) storing multiple elided tries, one per dimension, in an index page to prune search during traversal of index pages. These basic ideas affect all index algorithms. In this paper, we present efficient algorithms for different types of string matching. While our technique is applicable to a wide range of multi-dimensional index structures, we instantiate our generic techniques by adapting the 2-dimensional R-tree to string data. We demonstrate the space effectiveness and time benefits of using the string R-tree both analytically and experimentally.
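
To give a feel for idea (a), here is a minimal sketch of one possible order-preserving mapping from strings to rational numbers. It is not the mapping function from the paper, which the abstract does not spell out. The lowercase alphabet, the base, and the names to_rational and prefix_range are all assumptions made for illustration. Each character becomes a nonzero digit in base 27, so a prefix query turns into a half-open numeric interval that a numeric index could search.

```python
from fractions import Fraction
import string

# Assumption: strings use only lowercase ASCII letters.
ALPHABET = string.ascii_lowercase
BASE = len(ALPHABET) + 1  # digit 0 is reserved, so letters map to 1..26

def to_rational(s):
    """Map a string to an exact rational in [0, 1), preserving lexicographic order."""
    value = Fraction(0)
    for i, ch in enumerate(s, start=1):
        value += Fraction(ALPHABET.index(ch) + 1, BASE ** i)
    return value

def prefix_range(prefix):
    """Return (low, high) such that s starts with prefix iff low <= to_rational(s) < high."""
    low = to_rational(prefix)
    return low, low + Fraction(1, BASE ** len(prefix))
```

Using exact fractions avoids floating-point rounding for long strings, at the cost of arbitrarily large numerators and denominators.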

50 citations

Patent

Method for performing string matching

Richard Hull¹

Hewlett-Packard¹

28 Oct 1994

TL;DR: An improved method of matching a query string against a plurality of candidate strings replaces a highly computationally intensive string edit distance calculation with a less computationally intensive lower bound estimate.

Abstract: An improved method of matching a query string against a plurality of candidate strings replaces a highly computationally intensive string edit distance calculation with a less computationally intensive lower bound estimate. The lower bound estimate of the string edit distance between the two strings is calculated by equalising the lengths of the two strings by adding padding elements to the shorter one. The elements of the strings are then sorted and the substitution costs between corresponding elements are summed.
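
The three steps in this abstract (pad, sort, sum) can be sketched as below. The abstract does not state the cost model, so the sketch makes two assumptions that are not from the patent: the substitution cost is the absolute difference between character codes, and the padding element is the NUL character. The function name sorted_lower_bound is also hypothetical. Under this assumed cost model, pairing sorted elements gives the cheapest possible pairing, so the sum should not exceed an edit distance that charges the same costs.

```python
def sorted_lower_bound(query, candidate, pad="\0"):
    """Estimate a lower bound on edit distance by padding, sorting and summing costs."""
    # Step 1: equalise lengths by padding the shorter string (assumed pad: NUL).
    longest = max(len(query), len(candidate))
    a = sorted(query.ljust(longest, pad))
    b = sorted(candidate.ljust(longest, pad))
    # Steps 2 and 3: pair the sorted elements and sum the substitution costs.
    # Assumed cost: absolute difference of character codes.
    return sum(abs(ord(x) - ord(y)) for x, y in zip(a, b))
```

A candidate whose estimate already exceeds the allowed distance can be discarded without running the full edit distance calculation. Note that sorting discards order, so anagrams get an estimate of zero.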

50 citations

Patent

Method of performing approximate substring indexing

H. V. Jagadish¹, Nikolaos Koudas¹, S. Muthukrishnan¹, Divesh Srivastava¹

AT&T¹

17 Jun 2002

TL;DR: Each string in a database is decomposed into overlapping "positional q-grams", sequences of a predetermined length q that carry information about the position of each q-gram within the string.

Abstract: Approximate substring indexing is accomplished by decomposing each string in a database into overlapping “positional q-grams”, sequences of a predetermined length q, and containing information regarding the “position” of each q-gram within the string (i.e., 1st q-gram, 4th q-gram, etc.). An index is then formed of the tuples of the positional q-gram data (such as, for example, a B-tree index or a hash index). Each query applied to the database is similarly parsed into a plurality of positional q-grams (of the same length), and a candidate set of matches is found. Position-directed filtering is used to remove the candidates which have the q-grams in the wrong order and/or too far apart to form a “verified” output of matching candidates. If errors are permitted (defined in terms of an edit distance between each candidate and the query), an edit distance calculation can then be performed to produce the final set of matching strings.
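
A minimal sketch of the decomposition and indexing steps follows, assuming an in-memory hash index (a Python dict) in place of a B-tree and using 0-based positions. The position filter is deliberately simplified: it only counts query q-grams that line up at the same offset in a stored string, which does not tolerate insertions or deletions. The function names and the min_hits threshold are hypothetical, and the final edit distance verification is left out.

```python
from collections import defaultdict

def positional_qgrams(s, q):
    """Return (position, q-gram) pairs for every overlapping q-gram of s."""
    return [(i, s[i:i + q]) for i in range(len(s) - q + 1)]

def build_index(strings, q):
    """Map each q-gram to the (string id, position) pairs where it occurs."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for pos, gram in positional_qgrams(s, q):
            index[gram].append((sid, pos))
    return index

def candidate_matches(index, query, q, min_hits):
    """Return (string id, offset) pairs supported by at least min_hits query q-grams."""
    votes = defaultdict(int)
    for qpos, gram in positional_qgrams(query, q):
        for sid, pos in index.get(gram, ()):
            # Simplified position filter: q-grams must agree on one offset.
            votes[(sid, pos - qpos)] += 1
    return sorted(key for key, n in votes.items() if n >= min_hits)
```

Lowering min_hits below the number of query q-grams admits candidates with some mismatching q-grams, which is where a follow-up edit distance check would be needed.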

49 citations

Journal Article

Pattern-Matching for Strings with Short Descriptions

Marek Karpinski, Wojciech Rytter, Ayumi Shinohara

01 Jan 1995 - Electronic Colloquium on Computational Complexity

TL;DR: A textual problem for exponentially long strings is reduced here to simple arithmetic on integers with (only) linearly many bits, which makes it possible to represent some sets of exponentially many positions in terms of feasibly many arithmetic progressions.

Abstract: We consider strings which are succinctly described. The description is in terms of straight-line programs in which the constants are symbols and the only operation is the concatenation. Such descriptions correspond to the systems of recurrences or to context-free grammars generating single words. The descriptive size of a string is the length n of a straight-line program (or size of a grammar) which defines this string. Usually the strings of descriptive size n are of exponential length. Fibonacci and Thue-Morse words are examples of such strings. We show that for a pattern P and text T of descriptive sizes m, n, an occurrence of P in T can be found (if there is any) in time polynomial with respect to n. This is nontrivial, since the actual lengths of P and T could be exponential, and none of the known string-matching algorithms is directly applicable. Our first tool is the periodicity lemma, which makes it possible to represent some sets of exponentially many positions in terms of feasibly many arithmetic progressions. The second tool is arithmetic: a simple application of Euclid's algorithm. Hence a textual problem for exponentially long strings is reduced here to simple arithmetic on integers with (only) linearly many bits. We also present an NP-complete version of the pattern-matching for shortly described strings.
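
To make the notion of a succinctly described string concrete, the sketch below builds a straight-line program for Fibonacci words and answers length and character queries without expanding the string. It does not implement the pattern-matching algorithm from the paper. The rule naming (F1, F2, ...), the convention F1 = "b", F2 = "a", and the function names are assumptions chosen for illustration.

```python
def fibonacci_slp(n):
    """Build a straight-line program whose rule Fn derives the n-th Fibonacci word."""
    # A rule is either a single symbol or a pair of rule names (concatenation).
    rules = {"F1": "b", "F2": "a"}
    for i in range(3, n + 1):
        rules[f"F{i}"] = (f"F{i - 1}", f"F{i - 2}")
    return rules

def lengths(rules):
    """Compute the derived length of every rule, assuming dependencies come first."""
    out = {}
    for name, rhs in rules.items():
        out[name] = 1 if isinstance(rhs, str) else out[rhs[0]] + out[rhs[1]]
    return out

def char_at(rules, length, name, i):
    """Return character i of the string derived by rule `name`, without expanding it."""
    while not isinstance(rules[name], str):
        left, right = rules[name]
        if i < length[left]:
            name = left
        else:
            i -= length[left]
            name = right
    return rules[name]
```

With n = 100 the program has only 100 rules, yet the derived word has more than 10^20 characters, which illustrates why algorithms that scan the expanded text are not directly applicable.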

49 citations

Metrics

Papers: 1,942
Citations: 64,998

No. of papers in the topic in previous years
Year	Papers
2023	8
2022	30
2021	32
2020	30
2019	48
2018	39
