Journal Article

Estimating the selectivity of approximate string queries

TLDR
The VSol estimator is based on inverse strings, which make the performance of the selectivity estimator independent of the number of strings. VSol is shown to be effective for large skewed databases of short strings.
Abstract
Approximate queries on string data are important due to the prevalence of such data in databases and various conventions and errors in string data. We present the VSol estimator, a novel technique for estimating the selectivity of approximate string queries. The VSol estimator is based on inverse strings and makes the performance of the selectivity estimator independent of the number of strings. To get inverse strings we decompose all database strings into overlapping substrings of length q (q-grams) and then associate each q-gram with its inverse string: the IDs of all strings that contain the q-gram. We use signatures to compress inverse strings, and clustering to group similar signatures. We study our technique analytically and experimentally. The space complexity of our estimator depends only on the number of neighborhoods in the database and the desired estimation error. The time to estimate the selectivity is independent of the number of database strings and linear in the length of the query string. We give a detailed empirical performance evaluation of our solution for synthetic and real-world datasets. We show that VSol is effective for large skewed databases of short strings.
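
The decomposition step described in the abstract can be illustrated with a minimal sketch. It covers only the q-gram extraction and the construction of inverse strings; the signature compression and clustering steps are not shown. The function names `qgrams` and `build_inverse_strings`, the sample strings, and the choice to extract q-grams without padding the string ends are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def qgrams(s, q):
    """Return the overlapping substrings of length q of s, in order.

    Assumption: no padding characters are added at the string ends,
    so a string shorter than q yields no q-grams.
    """
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverse_strings(strings, q):
    """Map each q-gram to the set of IDs of the strings that contain it.

    The ID of a string is assumed to be its position in the input list.
    """
    inverse = defaultdict(set)
    for string_id, s in enumerate(strings):
        for gram in qgrams(s, q):
            inverse[gram].add(string_id)
    return dict(inverse)


# Hypothetical toy database of short strings.
database = ["smith", "smyth", "smithe"]
inverse = build_inverse_strings(database, q=2)

print(qgrams("smith", 2))     # ['sm', 'mi', 'it', 'th']
print(sorted(inverse["mi"]))  # [0, 2]
```

A query string of length n has n - q + 1 such q-grams, so any per-q-gram lookup is linear in the query length, consistent with the estimation time stated in the abstract.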


Citations
Journal Article

ACM Transactions on Database Systems

Proceedings Article

Can we beat the prefix filtering?: an adaptive framework for similarity join and search

TL;DR: This paper proposes an adaptive framework to support similarity join, together with a cost model that judiciously and efficiently selects an appropriate prefix for each object.
Proceedings Article

Efficient approximate entity extraction with edit distance constraints

TL;DR: This paper studies the problem of approximate dictionary matching with edit distance constraints and proposes an improved neighborhood generation method employing novel partitioning and prefix pruning techniques that outperforms alternative approaches by up to an order of magnitude.
Proceedings Article

Space-Constrained Gram-Based Indexing for Efficient Approximate String Search

TL;DR: Two novel approaches are proposed, one based on discarding gram lists and one based on combining correlated lists. Both are orthogonal to existing compression techniques, exploit a unique property of the authors' setting, and offer new opportunities for improving query performance.
Proceedings Article

Cost-based variable-length-gram selection for string collections to support approximate queries efficiently

TL;DR: To improve query performance, this study proposes a dynamic programming algorithm for computing a tight lower bound on the number of common grams shared by two similar strings, as well as an algorithm for automatically computing a dictionary of high-quality grams for a workload of queries.