Open Access Book
Trends in Cleaning Relational Data: Consistency and Deduplication
Ihab F. Ilyas, Xu Chu +1 more
TL;DR: A taxonomy of current anomaly detection techniques is proposed, covering error types, the automation of the detection process, and error propagation; the survey concludes by highlighting current trends in "big data" cleaning.
Abstract:
Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate data analytics results and wrong business decisions. Poor data across businesses and the government costs the U.S. economy $3.1 trillion a year, according to a 2012 report by InsightSquared. To detect data errors, data quality rules or integrity constraints (ICs) have been proposed as a declarative way to describe legal or correct data instances. Any subset of data that does not conform to the defined rules is considered erroneous, also referred to as a violation. Various kinds of data repairing techniques with different objectives have been introduced, where algorithms are used to detect subsets of the data that violate the declared integrity constraints, and even to suggest updates to the database such that the new database instance conforms to these constraints. While some of these algorithms aim to change the database minimally, others involve human experts or knowledge bases to verify the repairs suggested by the automatic repairing algorithms. In this paper, we discuss the main facets and directions in designing error detection and repairing techniques. We propose a taxonomy of current anomaly detection techniques, including error types, the automation of the detection process, and error propagation. We also propose a taxonomy of current data repairing techniques, including the repair target, the automation of the repair process, and the update model. We conclude by highlighting current trends in "big data" cleaning.
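To make the notion of a violation concrete, the sketch below checks a functional dependency (zip → city, i.e., records agreeing on zip must agree on city) over a toy relation. The table, attribute names, and helper function are illustrative, not taken from the book.

```python
from collections import defaultdict

# Hypothetical records; "zip -> city" is the declared
# functional dependency (an integrity constraint).
rows = [
    {"name": "Ann",  "zip": "53703", "city": "Madison"},
    {"name": "Bob",  "zip": "53703", "city": "Madisson"},  # violates zip -> city
    {"name": "Cara", "zip": "60601", "city": "Chicago"},
]

def fd_violations(rows, lhs, rhs):
    """Return groups of rows that agree on `lhs` but disagree on `rhs`."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[lhs]].append(r)
    # A group is a violation if it contains more than one distinct RHS value.
    return [g for g in groups.values() if len({r[rhs] for r in g}) > 1]

for group in fd_violations(rows, "zip", "city"):
    print("violation:", group)
```

A repairing algorithm would then propose updates (e.g., fixing the misspelled city) so that the new instance satisfies the constraint, ideally changing the database minimally.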
Citations
Journal ArticleDOI
HoloClean: holistic data repairs with probabilistic inference
TL;DR: A series of optimizations is introduced to ensure that inference over HoloClean's probabilistic model scales to instances with millions of tuples, yielding an average F1 improvement of more than 2× over state-of-the-art methods.
Proceedings ArticleDOI
Data Cleaning: Overview and Emerging Challenges
TL;DR: This work presents a taxonomy of the data cleaning literature and discusses recent work that casts such approaches into a statistical estimation framework, including using machine learning to improve the efficiency and accuracy of data cleaning and considering the effects of data cleaning on statistical analysis.
Journal ArticleDOI
Detecting data errors: where are we and what needs to be done?
Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, Nan Tang +8 more
TL;DR: A holistic multi-tool strategy that orders the invocations of the available tools to maximize their benefit, while minimizing human effort in verifying results is proposed.
Journal ArticleDOI
Automating large-scale data quality verification
Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, Andreas Grafberger +5 more
TL;DR: This work presents a system for automating the verification of data quality at scale, which meets the requirements of production use cases. It provides a declarative API that combines common quality constraints with user-defined validation code, thereby enabling 'unit tests' for data.
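A minimal sketch of the "unit tests for data" idea, assuming a pandas DataFrame as the dataset; the checks, column names, and thresholds below are hypothetical and do not reproduce the system's actual API.

```python
import pandas as pd

# Hypothetical dataset; column names and thresholds are illustrative.
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount":   [10.0, 25.5, None, 12.0],
    "status":   ["paid", "paid", "open", "paid"],
})

# Declarative quality constraints: each check is a (description, predicate)
# pair, in the spirit of unit tests for data.
checks = [
    ("order_id is unique",        lambda d: d["order_id"].is_unique),
    ("amount is >= 95% complete", lambda d: d["amount"].notna().mean() >= 0.95),
    ("status has known values",   lambda d: d["status"].isin({"paid", "open"}).all()),
]

for description, predicate in checks:
    print(("PASS" if predicate(df) else "FAIL"), "-", description)
```

Running this reports the completeness check on `amount` as failing, which is exactly the kind of regression such data unit tests are meant to catch before analytics consume the data.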
Proceedings ArticleDOI
Data Integration: After the Teenage Years
TL;DR: The evolution in the landscape of data integration since the work on rewriting queries using views in the mid-1990s is described, and two important challenges for the field going forward are identified.
References
Quantitative Data Cleaning for Large Databases
TL;DR: A statistical view of data quality is taken, with an emphasis on intuitive outlier detection and exploratory data analysis methods grounded in robust statistics. The work stresses algorithms and implementations that can be applied easily and efficiently in very large databases and that are easy to understand and visualize graphically.
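One classic robust-statistics device in this spirit is the modified z-score, which measures distance from the median in units of the median absolute deviation (MAD), so the outliers cannot distort the very statistics used to find them. A minimal sketch, with illustrative data and the conventional 3.5 cutoff:

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (based on median and MAD,
    which resist distortion by the outliers themselves) exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []
    # 0.6745 scales MAD to be comparable to a standard deviation under normality.
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]

print(mad_outliers([12, 11, 13, 12, 14, 250]))  # -> [250]
```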
Journal ArticleDOI
Scorpion: explaining away outliers in aggregate queries
Eugene Wu,Samuel Madden +1 more
TL;DR: This work proposes Scorpion, a system that takes a set of user-specified outlier points in an aggregate query result as input and finds predicates that explain the outliers in terms of properties of the input tuples that are used to compute the selected outlier results.
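A much-simplified sketch of the idea (not the paper's actual algorithm): score each single-attribute predicate by how close the aggregate gets to a user-expected value once the matching input tuples are removed. The data, attributes, and expected value below are hypothetical.

```python
# Toy predicate-based outlier explanation: find the single-attribute
# predicate whose tuples, when removed, best normalize an outlier SUM.
tuples = [
    {"sensor": "s1", "room": "A", "reading": 20},
    {"sensor": "s1", "room": "B", "reading": 21},
    {"sensor": "s2", "room": "A", "reading": 500},  # faulty sensor inflates the SUM
    {"sensor": "s2", "room": "B", "reading": 480},
    {"sensor": "s3", "room": "A", "reading": 19},
]

expected_sum = 100  # what the user says a "normal" aggregate looks like

def best_predicate(tuples, expected):
    total = sum(t["reading"] for t in tuples)
    candidates = []
    for attr in ("sensor", "room"):
        for value in {t[attr] for t in tuples}:
            removed = sum(t["reading"] for t in tuples if t[attr] == value)
            # Score: how close the aggregate gets to `expected` without these tuples.
            candidates.append((abs((total - removed) - expected), attr, value))
    error, attr, value = min(candidates)
    return f"{attr} = {value!r}", error

print(best_predicate(tuples, expected_sum))  # -> ("sensor = 's2'", 40)
```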
Journal ArticleDOI
Data fusion: resolving data conflicts for integration
Xin Luna Dong, Felix Naumann +1 more
TL;DR: Modern data management applications often require integrating available data sources and providing a uniform interface for users to access data from different sources; such requirements have been driving fruitful research on data integration over the last two decades.
Journal ArticleDOI
Reasoning about record matching rules
TL;DR: A class of matching dependencies (MDs) for specifying the semantics of data in unreliable relations is introduced, defined in terms of similarity metrics and a dynamic semantics, and a mechanism for inferring MDs is proposed as a departure from traditional implication analysis.
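For illustration, an MD of the form "if two records' names are sufficiently similar, identify their phone values" could be checked as below; the similarity metric, relation, and threshold are stand-ins, not the paper's formalism.

```python
import difflib

# Hypothetical MD: if name similarity >= 0.9, the two records refer to the
# same person and their `phone` values should be identified (matched).
records = [
    {"name": "John A. Smith", "phone": "555-0101"},
    {"name": "John A Smith",  "phone": "555-0199"},
    {"name": "Mary Jones",    "phone": "555-0202"},
]

def similarity(a, b):
    # Stand-in similarity metric; a real MD could use edit distance, q-grams, etc.
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def md_matches(records, threshold=0.9):
    """Pairs whose names match under the MD's similarity metric but whose
    phones differ -- candidates for identification by a matching rule."""
    pairs = []
    for i, r in enumerate(records):
        for s in records[i + 1:]:
            if similarity(r["name"], s["name"]) >= threshold and r["phone"] != s["phone"]:
                pairs.append((r, s))
    return pairs

for r, s in md_matches(records):
    print("match:", r["name"], "~", s["name"], "-> identify phones")
```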
Proceedings Article
Data Curation at Scale: The Data Tamer System
Michael Stonebraker, Daniel Meir Bruckner, Ihab F. Ilyas, George Beskales, Mitch Cherniack, Stanley B. Zdonik, Alexander Richter Pagan, Shan Xu +7 more
TL;DR: Data Tamer, an end-to-end curation system built at M.I.T., Brandeis, and the Qatar Computing Research Institute, is described; it has been shown to lower curation cost by about 90% relative to currently deployed production software.