Open Access Book
Trends in Cleaning Relational Data: Consistency and Deduplication
Ihab F. Ilyas, Xu Chu +1 more
TLDR
A taxonomy of current anomaly detection techniques is proposed, covering error types, the automation of the detection process, and error propagation; the survey concludes by highlighting current trends in "big data" cleaning.
Abstract
Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate data analytics results and wrong business decisions. Poor data across businesses and the government costs the U.S. economy $3.1 trillion a year, according to a 2012 report by InsightSquared. To detect data errors, data quality rules, or integrity constraints (ICs), have been proposed as a declarative way to describe legal or correct data instances. Any subset of data that does not conform to the defined rules is considered erroneous, which is also referred to as a violation. Various kinds of data repairing techniques with different objectives have been introduced, where algorithms are used to detect subsets of the data that violate the declared integrity constraints, and even to suggest updates to the database such that the new database instance conforms to these constraints. While some of these algorithms aim to minimally change the database, others involve human experts or knowledge bases to verify the repairs suggested by the automatic repairing algorithms. In this paper, we discuss the main facets and directions in designing error detection and repairing techniques. We propose a taxonomy of current anomaly detection techniques, including error types, the automation of the detection process, and error propagation. We also propose a taxonomy of current data repairing techniques, including the repair target, the automation of the repair process, and the update model. We conclude by highlighting current trends in "big data" cleaning.
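To make the abstract's notion of a violation concrete, here is a minimal sketch (with hypothetical column names and data, not taken from the book) of detecting violations of a functional dependency X → Y: any group of tuples agreeing on X but disagreeing on Y violates the constraint.

```python
# Minimal sketch: detect violations of a functional dependency lhs -> rhs.
# A violation is a group of rows sharing one lhs value but mapping it to
# more than one rhs value.
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return {lhs value: set of conflicting rhs values} for every group
    of rows that violates the FD lhs -> rhs."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[lhs]].add(row[rhs])
    return {k: v for k, v in groups.items() if len(v) > 1}

# Hypothetical dirty data: zip code should determine city.
data = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newyork"},   # inconsistent spelling
    {"zip": "60601", "city": "Chicago"},
]
print(fd_violations(data, "zip", "city"))
# one violating group: zip 10001 maps to two city spellings
```

A repairing algorithm would then choose updates (e.g. unifying the spelling of the city) so that the instance conforms to the constraint again.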
Citations
Journal Article (DOI)
Answering the Min-Cost Quality-Aware Query on Multi-Sources in Sensor-Cloud Systems.
TL;DR: This paper studies the problem of min-cost quality-aware query, which aims to find high-quality results from multiple sources at minimized cost, and proposes two methods for answering such queries.
Proceedings Article (DOI)
Efficient Bidirectional Order Dependency Discovery
Yifeng Jin, Lin Zhu, Zijing Tan +2 more
TL;DR: This paper presents carefully designed data structures and a host of algorithms and optimizations for efficient order dependency discovery, and shows experimentally that its approach outperforms state-of-the-art techniques by orders of magnitude.
Journal Article (DOI)
Cleaning Data with Forbidden Itemsets
Joeri Rammelaere, Floris Geerts +1 more
TL;DR: This work presents a different type of repairing method that avoids introducing new constraint violations, built on a discovery algorithm for a new kind of constraint, forbidden itemsets (FBIs), which capture unlikely value co-occurrences.
The Effects of Data Quality on Machine Learning Performance
Lukas Budach, Moritz Feuerpfeil, Nina Ihde, Andrea Nathansen, N. Noack, Hendrik Patzlaff, Hazar Harmouch, Felix Naumann +7 more
TL;DR: This work explores empirically the relationship between six of the traditional data quality dimensions and the performance of fifteen widely used machine learning algorithms covering the tasks of classification, regression, and clustering, with the goal of explaining their performance in terms of data quality.
Proceedings Article
Semi-supervised clustering for de-duplication
TL;DR: In this article, the authors consider a restricted version of correlation clustering, where the learning algorithm has access to an oracle, which answers whether two points belong to the same or different clusters.
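The oracle setting in that paper can be illustrated with a small sketch (the oracle and data below are hypothetical stand-ins, not the paper's algorithm): query pairs of records with "same cluster?" and merge them with union-find.

```python
# Minimal sketch: deduplication with a pairwise same-cluster oracle.
# Union-find merges records whenever the oracle answers that two
# records belong to the same cluster.

def dedup_with_oracle(items, oracle):
    """Group items into clusters using pairwise oracle queries."""
    parent = list(range(len(items)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if find(i) != find(j) and oracle(items[i], items[j]):
                parent[find(i)] = find(j)

    clusters = {}
    for i, item in enumerate(items):
        clusters.setdefault(find(i), []).append(item)
    return list(clusters.values())

# Hypothetical oracle: two strings refer to the same entity iff their
# lowercased, space-stripped forms match.
oracle = lambda a, b: a.replace(" ", "").lower() == b.replace(" ", "").lower()
print(dedup_with_oracle(["New York", "newyork", "Chicago"], oracle))
# [['New York', 'newyork'], ['Chicago']]
```

In the paper's setting the oracle is a human or ground-truth source and the goal is to bound the number of such queries; this sketch simply asks about every pair.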
References
Journal Article (DOI)
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat +1 more
TL;DR: This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets, which runs on a large cluster of commodity machines and is highly scalable.
Journal Article (DOI)
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat +1 more
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
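The model described in these two entries can be sketched in a few lines (an in-process toy, with none of the distribution, fault tolerance, or scheduling the real system provides): map emits (key, value) pairs, a shuffle groups values by key, and reduce aggregates each group.

```python
# Minimal in-process sketch of the MapReduce programming model:
# map emits (key, value) pairs, shuffle groups them by key,
# reduce aggregates each group. Word count is the classic example.
from collections import defaultdict

def map_phase(doc):
    for word in doc.split():
        yield word, 1

def reduce_phase(key, values):
    return key, sum(values)

def mapreduce(docs):
    shuffled = defaultdict(list)
    for doc in docs:                       # map + shuffle
        for k, v in map_phase(doc):
            shuffled[k].append(v)
    return dict(reduce_phase(k, vs) for k, vs in shuffled.items())

print(mapreduce(["clean data", "dirty data"]))
# {'clean': 1, 'data': 2, 'dirty': 1}
```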
Journal Article
Binary codes capable of correcting deletions, insertions, and reversals
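This reference underlies the Levenshtein edit distance, a standard string-similarity measure in deduplication; a minimal dynamic-programming sketch (counting insertions, deletions, and substitutions):

```python
# Minimal sketch: Levenshtein edit distance between two strings,
# computed row by row with dynamic programming.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```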
Active Learning Literature Survey
TL;DR: This report provides a general introduction to active learning and a survey of the literature, including a discussion of the scenarios in which queries can be formulated, and an overview of the query strategy frameworks proposed in the literature to date.