Open Access Journal Article
Using q-grams in a DBMS for Approximate String Processing.
Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Lauri Pietarinen, Divesh Srivastava
TL;DR
This paper develops a technique for building approximate string processing capabilities on top of commercial databases using facilities they already provide: it generates short substrings of length q, called q-grams, and processes them with standard methods available in the DBMS.
Abstract:
String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string queries directly, and it is a challenge to implement this functionality efficiently with user-defined functions (UDFs). In this paper, we develop a technique for building approximate string processing capabilities on top of commercial databases by exploiting facilities already available in them. At the core, our technique relies on generating short substrings of length q, called q-grams, and processing them using standard methods available in the DBMS. The proposed technique enables various approximate string processing methods in a DBMS, for example approximate (sub)string selections and joins, and can even be used with a variety of possible edit distance functions. The approximate string match predicate, with a suitable edit distance threshold, can be mapped into a vanilla relational expression and optimized by conventional relational optimizers.
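The mapping from an approximate match predicate to a plain relational expression can be illustrated with a minimal Python sketch. It is not the authors' implementation: the in-memory `sqlite3` database stands in for a commercial DBMS, the table names `lq` and `rq`, the function names, and the padding characters `#` and `$` are hypothetical, and the count bound in the `HAVING` clause is the one commonly stated for padded q-grams, which this page does not spell out. The SQL only produces candidate pairs; a final edit distance check removes false positives.

```python
import sqlite3

Q = 3  # q-gram length; an illustrative choice, not prescribed by the abstract


def positional_qgrams(s, q=Q):
    """Return (position, q-gram) pairs of s, padded with q-1 '#' and '$' characters."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [(i, padded[i:i + q]) for i in range(len(padded) - q + 1)]


def edit_distance(a, b):
    """Unit-cost edit distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def approximate_join(left, right, k, q=Q):
    """Return pairs (l, r) with edit_distance(l, r) <= k, filtered through SQL on q-grams."""
    conn = sqlite3.connect(":memory:")
    for table, strings in (("lq", left), ("rq", right)):
        # One row per positional q-gram; table names are hypothetical.
        conn.execute(f"CREATE TABLE {table} (id INTEGER, len INTEGER, pos INTEGER, gram TEXT)")
        rows = [(i, len(s), p, g)
                for i, s in enumerate(strings)
                for p, g in positional_qgrams(s, q)]
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?)", rows)
    # Assumed filters: matching q-grams lie within k positions of each other,
    # lengths differ by at most k, and enough q-grams are shared.
    # Caveat: when the count bound is <= 0 (very short strings), pairs that
    # share no q-gram never reach the join and are missed by this sketch.
    sql = """
        SELECT lq.id, rq.id
        FROM lq JOIN rq ON lq.gram = rq.gram
        WHERE ABS(lq.pos - rq.pos) <= :k
          AND ABS(lq.len - rq.len) <= :k
        GROUP BY lq.id, rq.id, lq.len, rq.len
        HAVING COUNT(*) >= MAX(lq.len, rq.len) - 1 - (:k - 1) * :q
    """
    candidates = conn.execute(sql, {"k": k, "q": q}).fetchall()
    conn.close()
    # Verify each candidate pair, since the filters admit false positives.
    return sorted((left[i], right[j]) for i, j in candidates
                  if edit_distance(left[i], right[j]) <= k)
```

For example, `approximate_join(["Jagadish", "Koudas"], ["Jagadesh", "Kudas"], k=1)` pairs each name with its misspelling. Keeping the q-grams in ordinary tables is what lets the filtering step run as a regular join with `GROUP BY` and `HAVING`, so a conventional optimizer can plan it like any other query.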
Citations
Journal Article
Duplicate Record Detection: A Survey
TL;DR: This paper presents an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database and covers similarity metrics that are commonly used to detect similar field entries.
Proceedings Article
Approximate String Joins in a Database (Almost) for Free
Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh Srivastava
TL;DR: The authors propose a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them. The technique relies on matching short substrings of length q, called q-grams, and takes into account both the positions of individual matches and the total number of such matches.
Approximate String Joins in a Database (Almost) for Free -- Erratum
Luis Gravano, Hosagrahar Visvesvaraya Jagadish, Panagiotis G. Ipeirotis, Divesh Srivastava, Nick Koudas, S. Muthukrishnan
TL;DR: This paper develops a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them, and demonstrates experimentally the benefits of the technique over the direct use of UDFs.
Proceedings Article
Entity Resolution with Markov Logic
Parag Singla, Pedro Domingos
TL;DR: A well-founded, integrated solution to the entity resolution problem based on Markov logic, which combines first-order logic and probabilistic graphical models by attaching weights to first-order formulas and viewing them as templates for features of Markov networks.
Proceedings Article
Substructure similarity search in graph databases
TL;DR: This paper investigates the issues of substructure similarity search using indexed features in graph databases, and develops a multi-filter composition strategy, where each filter uses a distinct and complementary subset of the features.
References
Journal Article
A guided tour to approximate string matching
TL;DR: This work surveys the current techniques to cope with the problem of string matching that allows errors, and focuses on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms.
Journal Article
Approximate string-matching with q-grams and maximal matches
TL;DR: Two string distance functions that are computable in linear time give a lower bound for the edit distance (in the unit cost model), which leads to fast hybrid algorithms for edit-distance-based string matching.
Proceedings Article
Approximate String Joins in a Database (Almost) for Free
Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh Srivastava
TL;DR: The authors propose a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them. The technique relies on matching short substrings of length q, called q-grams, and takes into account both the positions of individual matches and the total number of such matches.
Journal Article
A binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words
Book Chapter
On Using q-Gram Locations in Approximate String Matching
Erkki Sutinen, Jorma Tarhio
TL;DR: A sublinear filtration algorithm is presented based on the locations of the q-grams in the pattern, which gives better filtration efficiency than an earlier method.