Robust and efficient fuzzy match for online data cleaning

doi:10.1145/872757.872796

Proceedings Article

- pp 313-324

TLDR

A new similarity function is proposed which overcomes limitations of commonly used similarity functions, and an efficient fuzzy match algorithm is developed which can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation.

Abstract:

To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets.
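
As a rough illustration of the fuzzy match operation the abstract describes, the sketch below looks up an incoming tuple in a small reference relation and falls back to the closest reference tuple when no exact match exists. It is not the paper's similarity function or algorithm: the similarity measure is a stand-in built on Python's standard difflib.SequenceMatcher, and the reference data, the 0.8 threshold, and the names REFERENCE, similarity, and fuzzy_match are all assumptions made for this example.

```python
from difflib import SequenceMatcher

# Hypothetical reference relation: (product name, description) tuples.
REFERENCE = [
    ("Acme Widget", "steel widget, 10 pack"),
    ("Acme Gadget", "plastic gadget, single"),
    ("Bolt Co Washer", "rubber washer, 100 pack"),
]


def similarity(a, b):
    """Average per-field character similarity in [0, 1].

    Stand-in measure for illustration only, not the paper's function.
    """
    scores = [
        SequenceMatcher(None, x.lower(), y.lower()).ratio()
        for x, y in zip(a, b)
    ]
    return sum(scores) / len(scores)


def fuzzy_match(incoming, reference, threshold=0.8):
    """Return the exact match if present; otherwise the closest reference
    tuple whose similarity reaches the threshold; otherwise None."""
    if incoming in reference:
        return incoming
    # Linear scan over the reference relation (fine for a toy example).
    best = max(reference, key=lambda ref: similarity(incoming, ref))
    return best if similarity(incoming, best) >= threshold else None


if __name__ == "__main__":
    dirty = ("Acme Widgit", "steel widget 10 pack")
    print(fuzzy_match(dirty, REFERENCE))
```

The linear scan compares the input against every reference tuple, which is exactly the cost an efficient fuzzy match algorithm would aim to avoid on large reference relations; the sketch only shows the input/output behaviour of the operation.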

Citations

Journal Article

Duplicate Record Detection: A Survey

Ahmed K. Elmagarmid, +2 more

- 01 Jan 2007 -

IEEE Transactions on Knowledge and Data ...

TL;DR: This paper presents an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database and covers similarity metrics that are commonly used to detect similar field entries.

Journal Article

Collective entity resolution in relational data

Indrajit Bhattacharya, +1 more

- 01 Mar 2007 -

ACM Transactions on Knowledge Discovery ...

TL;DR: In this article, a relational clustering algorithm that uses both attribute and relational information for determining the underlying domain entities is proposed, which improves entity resolution performance over both attribute-based baselines and over algorithms that consider relational information but do not resolve entities collectively.

Proceedings Article

A Primitive Operator for Similarity Joins in Data Cleaning

Surajit Chaudhuri, +2 more

TL;DR: This paper proposes a new primitive operator which can be used as a foundation to implement similarity joins according to a variety of popular string similarity functions, and notions of similarity which go beyond textual similarity.

Journal Article

Information Extraction

Sunita Sarawagi

TL;DR: A taxonomy of the field is created along various dimensions derived from the nature of the extraction task, the techniques used for extraction, the variety of input resources exploited, and the type of output produced to survey techniques for optimizing the various steps in an information extraction pipeline.

References

Journal Article

Identification of common molecular subsequences.

Temple F. Smith, +1 more

- 25 Mar 1981 -

Journal of Molecular Biology

TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

Book

Modern Information Retrieval

Ricardo Baeza-Yates, +1 more

TL;DR: In this article, the authors present a rigorous and complete textbook for a first course on information retrieval from the computer science (as opposed to a user-centred) perspective, which provides an up-to-date, student-oriented treatment of the subject.

Proceedings Article

Approximate nearest neighbors: towards removing the curse of dimensionality

Piotr Indyk, +1 more

TL;DR: In this paper, the authors present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces, for data sets of size n living in R^d, which require space that is only polynomial in n and d.

Book

Randomized Algorithms

Rajeev Motwani, +1 more

TL;DR: This book introduces the basic concepts in the design and analysis of randomized algorithms and presents basic tools such as probability theory and probabilistic analysis that are frequently used in algorithmic applications.

Proceedings Article

On the resemblance and containment of documents

Andrei Z. Broder

- 11 Jun 1997 -

Sequence

TL;DR: The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that could be done independently for each document.

Related Papers (5)

A Theory for Record Linkage

The merge/purge problem for large databases

Adaptive duplicate detection using learnable string similarity measures

Interactive deduplication using active learning

Efficient clustering of high-dimensional data sets with application to reference matching