Using q-grams in a DBMS for Approximate String Processing.

Open Access Journal Article

Luis Gravano, +6 more

01 Jan 2001

IEEE Data(base) Engineering Bulletin

Vol. 24, pp. 28-34

TLDR

This paper develops a technique for building approximate string processing capabilities on top of commercial databases using only facilities they already provide: it generates short substrings of length q, called q-grams, and processes them with standard methods available in the DBMS.

Abstract:

String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string queries directly, and it is a challenge to implement this functionality efficiently with user-defined functions (UDFs). In this paper, we develop a technique for building approximate string processing capabilities on top of commercial databases by exploiting facilities already available in them. At the core, our technique relies on generating short substrings of length q, called q-grams, and processing them using standard methods available in the DBMS. The proposed technique enables various approximate string processing methods in a DBMS, for example approximate (sub)string selections and joins, and can even be used with a variety of possible edit distance functions. The approximate string match predicate, with a suitable edit distance threshold, can be mapped into a vanilla relational expression and optimized by conventional relational optimizers.
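The idea of mapping an approximate match predicate to a plain relational expression can be sketched as follows. This is a minimal illustration, not the authors' implementation: SQLite stands in for a commercial DBMS, and the table names (`l_str`, `l_qg`, `r_str`, `r_qg`), the values of `q` and `k`, and the helper functions are hypothetical. The abstract does not spell out the filters, so the `#`/`$` padding and the count, position, and length conditions below are assumptions that follow the formulation in the companion paper "Approximate String Joins in a Database (Almost) for Free", as we understand it. The filters only prune candidates; a final edit distance check, registered here as a UDF, removes false positives.

```python
import sqlite3

Q = 3  # q-gram length (illustrative choice)
K = 1  # edit distance threshold (illustrative choice)

def qgrams(s, q=Q):
    """Return (position, q-gram) pairs of s, padded with q-1 '#' and q-1 '$'."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [(i, padded[i:i + q]) for i in range(len(padded) - q + 1)]

def edit_distance(a, b):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def approximate_join(left, right, k=K, q=Q):
    """Return pairs (l, r) with edit_distance(l, r) <= k, using q-gram tables."""
    conn = sqlite3.connect(":memory:")
    # The UDF is applied only to candidates that survive the q-gram filters.
    conn.create_function("edit_distance", 2, edit_distance)
    for name, rows in (("l", left), ("r", right)):
        conn.execute(f"CREATE TABLE {name}_str (id INTEGER, s TEXT)")
        conn.execute(f"CREATE TABLE {name}_qg (id INTEGER, pos INTEGER, gram TEXT)")
        for i, s in enumerate(rows):
            conn.execute(f"INSERT INTO {name}_str VALUES (?, ?)", (i, s))
            conn.executemany(
                f"INSERT INTO {name}_qg VALUES (?, ?, ?)",
                [(i, p, g) for p, g in qgrams(s, q)],
            )
    sql = """
        SELECT a.s, b.s
        FROM l_str a, l_qg x, r_str b, r_qg y
        WHERE a.id = x.id AND b.id = y.id
          AND x.gram = y.gram                           -- shared q-gram
          AND ABS(x.pos - y.pos) <= :k                  -- position filter
          AND ABS(LENGTH(a.s) - LENGTH(b.s)) <= :k      -- length filter
        GROUP BY a.id, b.id, a.s, b.s
        HAVING COUNT(*) >= MAX(LENGTH(a.s), LENGTH(b.s)) - 1 - (:k - 1) * :q
           AND edit_distance(a.s, b.s) <= :k            -- final verification
        ORDER BY a.s, b.s
    """
    return conn.execute(sql, {"k": k, "q": q}).fetchall()

if __name__ == "__main__":
    print(approximate_join(["John Smith", "Jane Doe"],
                           ["Jon Smith", "Jane Do", "Bob Stone"]))
```

Note that when the count threshold drops to zero or below (short strings or a large `k`), pairs sharing no q-grams never appear in the join, so this sketch is only meaningful for strings long enough relative to `k` and `q`.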
