Open Access Book

Trends in Cleaning Relational Data: Consistency and Deduplication

TLDR
A taxonomy of current anomaly detection techniques is proposed, covering error types, the automation of the detection process, and error propagation; the survey concludes by highlighting current trends in "big data" cleaning.


Citations
Journal ArticleDOI

HoloClean: holistic data repairs with probabilistic inference

TL;DR: A series of optimizations is introduced to ensure that inference over HoloClean's probabilistic model scales to instances with millions of tuples; the system yields an average F1 improvement of more than 2× over state-of-the-art methods.
Proceedings ArticleDOI

Data Cleaning: Overview and Emerging Challenges

TL;DR: This work presents a taxonomy of the data cleaning literature and discusses recent work that casts such approaches into a statistical estimation framework, including using machine learning to improve the efficiency and accuracy of data cleaning and considering the effects of data cleaning on statistical analysis.
Journal ArticleDOI

Detecting data errors: where are we and what needs to be done?

TL;DR: A holistic multi-tool strategy that orders the invocations of the available tools to maximize their benefit, while minimizing human effort in verifying results is proposed.
Journal ArticleDOI

Automating large-scale data quality verification

TL;DR: This work presents a system for automating the verification of data quality at scale, which meets the requirements of production use cases and provides a declarative API, which combines common quality constraints with user-defined validation code, and thereby enables 'unit tests' for data.
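The "unit tests for data" idea described above can be sketched in a few lines. The helper functions and the shape of the checks below are purely illustrative assumptions, not the system's actual declarative API:

```python
# Minimal sketch of declarative "unit tests" for data: named constraints
# are evaluated against a dataset and failures are reported. The check
# helpers here are hypothetical, not the system's real interface.

def is_complete(rows, column):
    """True if no row has a missing value in `column`."""
    return all(row.get(column) is not None for row in rows)

def is_unique(rows, column):
    """True if every value in `column` appears exactly once."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))

def run_checks(rows, checks):
    """Evaluate (name, predicate) pairs and return pass/fail per check."""
    return {name: predicate(rows) for name, predicate in checks}

data = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "c@example.com"},
]

report = run_checks(data, [
    ("id is unique", lambda rows: is_unique(rows, "id")),
    ("email is complete", lambda rows: is_complete(rows, "email")),
])
# report -> {'id is unique': False, 'email is complete': False}
```

As in the paper's framing, the constraints read declaratively while the validation logic stays ordinary code, so the same report can gate a data pipeline the way unit tests gate a build.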
Proceedings ArticleDOI

Data Integration: After the Teenage Years

TL;DR: The evolution in the landscape of data integration since the work on rewriting queries using views in the mid-1990's is described and two important challenges for the field going forward are described.
References

Quantitative Data Cleaning for Large Databases

TL;DR: A statistical view of data quality is taken, with an emphasis on intuitive outlier detection and exploratory data analysis methods grounded in robust statistics; the work stresses algorithms and implementations that run easily and efficiently in very large databases and that are easy to understand and visualize graphically.
Journal ArticleDOI

Scorpion: explaining away outliers in aggregate queries

TL;DR: This work proposes Scorpion, a system that takes a set of user-specified outlier points in an aggregate query result as input and finds predicates that explain the outliers in terms of properties of the input tuples that are used to compute the selected outlier results.
Journal ArticleDOI

Data fusion: resolving data conflicts for integration

TL;DR: Modern data management applications often require integrating available data sources and providing a uniform interface for users to access data from different sources; such requirements have been driving fruitful research on data integration over the last two decades.
Journal ArticleDOI

Reasoning about record matching rules

TL;DR: A class of matching dependencies (MDs) for specifying the semantics of data in unreliable relations is introduced, defined in terms of similarity metrics and a dynamic semantics, and a mechanism for inferring MDs is proposed, a departure from traditional implication analysis.
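The flavor of such a rule can be sketched as follows: if two records agree closely on some attributes under a similarity metric, the dynamic semantics identifies other attributes across them. The concrete rule, threshold, and attributes below are illustrative assumptions, not the paper's formal MD syntax:

```python
# Sketch of a matching-dependency-style rule: similar name AND equal
# zip code => same entity, so the phone fields should be identified.
# The rule, threshold, and attributes are illustrative only.
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    """Normalized edit-based similarity via difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def md_matches(r1, r2):
    """Premise of the hypothetical MD: names similar, zip codes equal."""
    return similar(r1["name"], r2["name"]) and r1["zip"] == r2["zip"]

a = {"name": "Robert Smith", "zip": "02139", "phone": "555-0100"}
b = {"name": "Robert Smyth", "zip": "02139", "phone": None}

if md_matches(a, b):
    # Dynamic semantics: enforcing the rule updates the unreliable
    # record rather than merely asserting that the records match.
    b["phone"] = b["phone"] or a["phone"]
```

Note the contrast with a functional dependency: the premise tolerates typos ("Smith" vs. "Smyth") via the similarity metric, and the conclusion changes the data rather than constraining it.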
Proceedings Article

Data Curation at Scale: The Data Tamer System

TL;DR: Data Tamer, an end-to-end curation system built at M.I.T., Brandeis, and the Qatar Computing Research Institute, is described; it has been shown to lower curation cost by about 90% relative to currently deployed production software.