Journal Article
Estimating the selectivity of approximate string queries
TL;DR: The VSol estimator is based on inverse strings, which makes the performance of the selectivity estimator independent of the number of strings. VSol is shown to be effective for large skewed databases of short strings.
Abstract:
Approximate queries on string data are important due to the prevalence of such data in databases and various conventions and errors in string data. We present the VSol estimator, a novel technique for estimating the selectivity of approximate string queries. The VSol estimator is based on inverse strings and makes the performance of the selectivity estimator independent of the number of strings. To get inverse strings we decompose all database strings into overlapping substrings of length q (q-grams) and then associate each q-gram with its inverse string: the IDs of all strings that contain the q-gram. We use signatures to compress inverse strings, and clustering to group similar signatures. We study our technique analytically and experimentally. The space complexity of our estimator only depends on the number of neighborhoods in the database and the desired estimation error. The time to estimate the selectivity is independent of the number of database strings and linear with respect to the length of query string. We give a detailed empirical performance evaluation of our solution for synthetic and real-world datasets. We show that VSol is effective for large skewed databases of short strings.
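The q-gram decomposition and inverse strings described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `qgrams` and `build_inverse_strings`, the use of list positions as string IDs, the choice q = 2, and the sample strings are all assumptions made for illustration. The sketch also omits padding characters, signatures, and clustering.

```python
from collections import defaultdict


def qgrams(s, q):
    """Return the overlapping substrings of length q in s (no padding; an assumption)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverse_strings(strings, q):
    """Map each q-gram to the set of IDs of the strings that contain it.

    String IDs are simply positions in the input list (an assumption).
    """
    inverse = defaultdict(set)
    for string_id, s in enumerate(strings):
        for gram in qgrams(s, q):
            inverse[gram].add(string_id)
    return dict(inverse)


# Hypothetical toy database of short strings.
database = ["smith", "smyth", "smithe"]
inverse = build_inverse_strings(database, q=2)

print(sorted(inverse["sm"]))  # IDs of strings containing "sm"
print(sorted(inverse["it"]))  # IDs of strings containing "it"
```

In this toy database the q-gram "sm" occurs in all three strings, while "it" occurs only in the first and third, so their inverse strings differ. The abstract states that such ID sets are then compressed with signatures and grouped by clustering, which this sketch leaves out.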
Citations
Journal Article
ACM Transactions on Database Systems
Dan Suciu, Gerhard Weikum +1 more
Proceedings Article
Can we beat the prefix filtering?: an adaptive framework for similarity join and search
TL;DR: This paper proposes an adaptive framework to support similarity join, together with a cost model that judiciously selects an appropriate prefix for each object.
Proceedings Article
Efficient approximate entity extraction with edit distance constraints
TL;DR: This paper studies the problem of approximate dictionary matching with edit distance constraints and proposes an improved neighborhood generation method employing novel partitioning and prefix pruning techniques that outperforms alternative approaches by up to an order of magnitude.
Proceedings Article
Space-Constrained Gram-Based Indexing for Efficient Approximate String Search
TL;DR: Two novel approaches are proposed, one based on discarding gram lists and one based on combining correlated lists. Both are orthogonal to existing compression techniques, exploit a unique property of the authors' setting, and offer new opportunities for improving query performance.
Proceedings Article
Cost-based variable-length-gram selection for string collections to support approximate queries efficiently
Xiaochun Yang, Bin Wang, Chen Li +2 more
TL;DR: This study proposes a dynamic programming algorithm for computing a tight lower bound on the number of common grams shared by two similar strings in order to improve query performance and proposes an algorithm for automatically computing a dictionary of high-quality grams for a workload of queries.
References
Journal Article
Data clustering: a review
TL;DR: An overview of pattern clustering methods from a statistical pattern recognition perspective is presented, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners.
Book
Introduction to Modern Information Retrieval
Gerard Salton, Michael J. McGill +1 more
Journal Article
A guided tour to approximate string matching
TL;DR: This work surveys the current techniques to cope with the problem of string matching that allows errors, and focuses on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms.