Estimating the selectivity of approximate string queries

doi:10.1145/1242524.1242529

Journal Article

Arturas Mazeika, +3 more

- 01 Jun 2007 -

ACM Transactions on Database Systems

- Vol. 32, Iss: 2, pp 12

TLDR

The VSol estimator is based on inverse strings, which makes the performance of the selectivity estimator independent of the number of strings. The paper shows that VSol is effective for large skewed databases of short strings.

Abstract:

Approximate queries on string data are important due to the prevalence of such data in databases and various conventions and errors in string data. We present the VSol estimator, a novel technique for estimating the selectivity of approximate string queries. The VSol estimator is based on inverse strings and makes the performance of the selectivity estimator independent of the number of strings. To get inverse strings we decompose all database strings into overlapping substrings of length q (q-grams) and then associate each q-gram with its inverse string: the IDs of all strings that contain the q-gram. We use signatures to compress inverse strings, and clustering to group similar signatures. We study our technique analytically and experimentally. The space complexity of our estimator only depends on the number of neighborhoods in the database and the desired estimation error. The time to estimate the selectivity is independent of the number of database strings and linear with respect to the length of query string. We give a detailed empirical performance evaluation of our solution for synthetic and real-world datasets. We show that VSol is effective for large skewed databases of short strings.
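
The q-gram decomposition and inverse strings described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `qgrams` and `inverse_strings`, the sample database, and the choice of q = 2 are assumptions made for illustration, no padding characters are added at the string boundaries, and the signature compression and clustering steps of VSol are not shown.

```python
from collections import defaultdict


def qgrams(s, q):
    """Return the overlapping substrings of length q of s, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def inverse_strings(strings, q):
    """Map each q-gram to the sorted IDs of the strings that contain it.

    The ID of a string is its position in the input list (an assumption).
    """
    index = defaultdict(set)
    for string_id, s in enumerate(strings):
        for gram in qgrams(s, q):
            index[gram].add(string_id)
    return {gram: sorted(ids) for gram, ids in index.items()}


# Hypothetical toy database of short strings.
database = ["smith", "smyth", "smithe"]
inv = inverse_strings(database, q=2)

print(qgrams("smith", 2))  # ['sm', 'mi', 'it', 'th']
print(inv["sm"])           # [0, 1, 2]
print(inv["it"])           # [0, 2]
```

Each inverse string here is a plain list of IDs, so its size grows with the number of database strings that contain the q-gram; the abstract states that VSol compresses inverse strings with signatures and groups similar signatures by clustering.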

Citations

Can we beat the prefix filtering?: an adaptive framework for similarity join and search

Efficient approximate entity extraction with edit distance constraints

Space-Constrained Gram-Based Indexing for Efficient Approximate String Search

Cost-based variable-length-gram selection for string collections to support approximate queries efficiently
