Approximate String Joins in a Database (Almost) for Free -- Erratum

TLDR
This paper develops a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them, and demonstrates experimentally the benefits of the technique over the direct use of UDFs.
About
The article was published on 2003-01-01 and is currently open access. It has received 543 citations to date. The article focuses on the topics: String (computer science) & Joins.

Citations
Journal ArticleDOI

Duplicate Record Detection: A Survey

TL;DR: This paper presents an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database and covers similarity metrics that are commonly used to detect similar field entries.
Proceedings ArticleDOI

Robust and fast similarity search for moving object trajectories

TL;DR: Analysis and comparison of EDR with other popular distance functions, such as Euclidean distance, Dynamic Time Warping (DTW), Edit distance with Real Penalty (ERP), and Longest Common Subsequences (LCSS), indicate that EDR is more robust than Euclidean distance, DTW, and ERP, and that it is on average 50% more accurate than LCSS.
Proceedings ArticleDOI

Interactive deduplication using active learning

TL;DR: This work presents the design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning and investigates various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.
Book

Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection

TL;DR: Data matching (also known as record or data linkage, entity resolution, object identification, or field matching) is the task of identifying, matching, and merging records that correspond to the same entities from several databases or even within one database.
References
Journal ArticleDOI

Identification of common molecular subsequences.

TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).
Journal ArticleDOI

A guided tour to approximate string matching

TL;DR: This work surveys the current techniques to cope with the problem of string matching that allows errors, and focuses on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms.
Journal ArticleDOI

Approximate string-matching with q-grams and maximal matches

TL;DR: Two string distance functions that are computable in linear time give a lower bound for the edit distance (in the unit cost model), which leads to fast hybrid algorithms for edit-distance-based string matching.
Proceedings Article

Near Neighbor Search in Large Metric Spaces

Sergey Brin
TL;DR: A data structure called a GNAT (Geometric Near-neighbor Access Tree) is introduced to solve the problem of finding approximate matches in a large database. It is based on the philosophy that the data structure should act as a hierarchical geometrical model of the data, as opposed to a simple decomposition of the data that does not use its intrinsic geometry.
Proceedings Article

Approximate String Joins in a Database (Almost) for Free

TL;DR: The authors propose a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them. The technique relies on matching short substrings of length q, called q-grams, and takes into account both the positions of individual matches and the total number of such matches.
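
The q-gram idea summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation (the paper works inside a relational database); the function names, the `#`/`$` padding characters, and the default `q=3` are assumptions made for illustration. The thresholds used are the length, position, and count filters commonly stated in the q-gram literature, included here as assumptions rather than taken from this summary: two strings within edit distance k differ in length by at most k, and share at least max(|s1|, |s2|) - 1 - (k - 1)q q-grams whose positions differ by at most k.

```python
def positional_qgrams(s, q=3):
    """Return (position, q-gram) pairs of s, padded with q-1 '#' and q-1 '$'."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [(i, padded[i:i + q]) for i in range(len(padded) - q + 1)]


def edit_distance(a, b):
    """Unit-cost edit distance computed with the classic dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def is_candidate(s1, s2, k, q=3):
    """Cheap filter: False only if s1 and s2 cannot be within edit distance k."""
    # Length filter (assumed): lengths may differ by at most k.
    if abs(len(s1) - len(s2)) > k:
        return False
    g1 = positional_qgrams(s1, q)
    g2 = positional_qgrams(s2, q)
    # Position filter (assumed): only count matches whose positions differ by <= k.
    matches = sum(1 for i, a in g1 for j, b in g2 if a == b and abs(i - j) <= k)
    # Count filter (assumed): enough q-grams must survive k edit operations.
    return matches >= max(len(s1), len(s2)) - 1 - (k - 1) * q


def approximate_join(left, right, k, q=3):
    """Pairs (a, b) with edit distance <= k; the filter prunes before verifying."""
    return [(a, b) for a in left for b in right
            if is_candidate(a, b, k, q) and edit_distance(a, b) <= k]
```

For example, `approximate_join(["john smith"], ["jon smith", "jane doe"], k=1)` keeps only the pair `("john smith", "jon smith")`: the second candidate is rejected by the length filter before any edit distance is computed. The filter can admit false positives, which is why each surviving pair is still verified with `edit_distance`.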