Approximate String Joins in a Database (Almost) for Free -- Erratum
Luis Gravano, Hosagrahar Visvesvaraya Jagadish, Panagiotis G. Ipeirotis, Divesh Srivastava, Nick Koudas, S. Muthukrishnan
TLDR
This paper develops a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them, and demonstrates experimentally the benefits of the technique over the direct use of UDFs.
About
The article was published on 2003-01-01 and is currently open access. It has received 543 citations to date. The article focuses on the topics: String (computer science) & Joins.
Citations
Journal Article
Duplicate Record Detection: A Survey
Elmagarmid, Ipeirotis, Verykios
TL;DR: This paper presents an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database and covers similarity metrics that are commonly used to detect similar field entries.
Proceedings Article
Robust and fast similarity search for moving object trajectories
TL;DR: Analysis and comparison of Edit Distance on Real sequence (EDR) with other popular distance functions, such as Euclidean distance, Dynamic Time Warping (DTW), Edit distance with Real Penalty (ERP), and Longest Common Subsequences (LCSS), indicate that EDR is more robust than Euclidean distance, DTW, and ERP, and that it is on average 50% more accurate than LCSS.
Proceedings Article
Interactive deduplication using active learning
TL;DR: This work presents the design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning and investigates various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.
Book
Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection
TL;DR: Data matching (also known as record or data linkage, entity resolution, object identification, or field matching) is the task of identifying, matching, and merging records that correspond to the same entities from several databases or even within one database.
References
Journal Article
Identification of common molecular subsequences.
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).
Journal Article
A guided tour to approximate string matching
TL;DR: This work surveys the current techniques to cope with the problem of string matching that allows errors, and focuses on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms.
Journal Article
Approximate string-matching with q-grams and maximal matches
TL;DR: Two string distance functions that are computable in linear time give a lower bound for the edit distance (in the unit cost model), which leads to fast hybrid algorithms for edit-distance-based string matching.
Proceedings Article
Near Neighbor Search in Large Metric Spaces
TL;DR: This paper introduces a data structure called a GNAT (Geometric Near-neighbor Access Tree) for finding approximate matches in a large database. It is based on the philosophy that the data structure should act as a hierarchical geometrical model of the data, as opposed to a simple decomposition of the data that does not use its intrinsic geometry.
Proceedings Article
Approximate String Joins in a Database (Almost) for Free
Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh Srivastava
TL;DR: The authors propose a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them. The technique relies on matching short substrings of length q, called q-grams, and takes into account both the positions of individual matches and the total number of such matches.
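The q-gram idea summarized above can be illustrated outside a database with a minimal Python sketch. This is not the authors' SQL formulation. The function names, the padding characters `#` and `$`, and the default `q=3` are assumptions made for illustration. The count threshold used here is the standard bound that follows from each edit operation affecting at most q of the padded q-grams. Candidate pairs that survive the length, position, and count filters are then verified with an exact edit distance computation.

```python
def positional_qgrams(s, q=3):
    """Return (position, q-gram) pairs of s, padded with q-1 sentinels per side."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [(i, padded[i:i + q]) for i in range(len(padded) - q + 1)]


def edit_distance(a, b):
    """Unit-cost edit distance via the classic dynamic program (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def passes_filters(a, b, k, q=3):
    """Necessary (not sufficient) conditions for edit_distance(a, b) <= k."""
    # Length filter: k edits change the length by at most k.
    if abs(len(a) - len(b)) > k:
        return False
    # Position filter: count equal q-grams whose positions differ by at most k.
    grams_b = positional_qgrams(b, q)
    matches = sum(1
                  for i, ga in positional_qgrams(a, q)
                  for j, gb in grams_b
                  if ga == gb and abs(i - j) <= k)
    # Count filter: each edit destroys at most q of the padded q-grams.
    return matches >= max(len(a), len(b)) - 1 - (k - 1) * q


def approx_join(left, right, k, q=3):
    """Nested-loop approximate join: cheap filters first, then exact check."""
    return [(a, b)
            for a in left
            for b in right
            if passes_filters(a, b, k, q) and edit_distance(a, b) <= k]


if __name__ == "__main__":
    print(approx_join(["Jagadish", "Koudas"], ["Jagadesh", "Gravano"], k=1))
```

The filters can admit false positives but never discard a true match, which is why the final edit distance check is still required. In a database setting, the same three conditions would be expressed as join and grouping predicates over a table of positional q-grams rather than as nested Python loops.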