Open Access Book
Trends in Cleaning Relational Data: Consistency and Deduplication
Ihab F. Ilyas, Xu Chu +1 more
TLDR
A taxonomy of current anomaly detection techniques is proposed, covering error types, the automation of the detection process, and error propagation; the survey concludes by highlighting current trends in "big data" cleaning.
Abstract
Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate data analytics results and wrong business decisions. Poor data across businesses and the government costs the U.S. economy $3.1 trillion a year, according to a 2012 report by InsightSquared. To detect data errors, data quality rules, or integrity constraints (ICs), have been proposed as a declarative way to describe legal or correct data instances. Any subset of data that does not conform to the defined rules is considered erroneous, which is also referred to as a violation. Various kinds of data repairing techniques with different objectives have been introduced, where algorithms are used to detect subsets of the data that violate the declared integrity constraints, and even to suggest updates to the database such that the new database instance conforms to these constraints. While some of these algorithms aim to minimally change the database, others involve human experts or knowledge bases to verify the repairs suggested by the automatic repairing algorithms. In this paper, we discuss the main facets and directions in designing error detection and repairing techniques. We propose a taxonomy of current anomaly detection techniques, including error types, the automation of the detection process, and error propagation. We also propose a taxonomy of current data repairing techniques, including the repair target, the automation of the repair process, and the update model. We conclude by highlighting current trends in "big data" cleaning.
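To make the abstract's notion of a violation concrete, here is a minimal sketch (with hypothetical column names and data, not taken from the book) of detecting violations of a functional dependency X → Y: any group of tuples agreeing on X but disagreeing on Y violates the constraint.

```python
# Minimal sketch: detect violations of a functional dependency lhs -> rhs.
# A violation is a group of rows sharing one lhs value but mapping it to
# more than one rhs value.
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return {lhs value: set of conflicting rhs values} for every group
    of rows that violates the FD lhs -> rhs."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[lhs]].add(row[rhs])
    return {k: v for k, v in groups.items() if len(v) > 1}

# Hypothetical dirty data: zip code should determine city.
data = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newyork"},   # inconsistent spelling
    {"zip": "60601", "city": "Chicago"},
]
print(fd_violations(data, "zip", "city"))
# one violating group: zip 10001 maps to two city spellings
```

A repairing algorithm would then choose updates (e.g. unifying the spelling of the city) so that the instance conforms to the constraint again.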
Citations
Journal Article (DOI)
Answering the Min-Cost Quality-Aware Query on Multi-Sources in Sensor-Cloud Systems.
TL;DR: This paper studies the problem of min-cost quality-aware query, which aims to find high-quality results from multiple sources at minimized cost, and proposes two methods for answering such queries.
Proceedings Article (DOI)
Efficient Bidirectional Order Dependency Discovery
Yifeng Jin, Lin Zhu, Zijing Tan +2 more
TL;DR: This paper presents carefully designed data structures and a host of algorithms and optimizations for efficient order dependency discovery, and shows experimentally that its approach outperforms state-of-the-art techniques by orders of magnitude.
Journal Article (DOI)
Cleaning Data with Forbidden Itemsets
Joeri Rammelaere, Floris Geerts +1 more
TL;DR: This work presents a different type of repairing method that avoids introducing new constraint violations, built on a discovery algorithm for a new kind of constraint, forbidden itemsets (FBIs), which capture unlikely value co-occurrences.
The Effects of Data Quality on Machine Learning Performance
Lukas Budach, Moritz Feuerpfeil, Nina Ihde, Andrea Nathansen, N. Noack, Hendrik Patzlaff, Hazar Harmouch, Felix Naumann +7 more
TL;DR: This work explores empirically the relationship between six of the traditional data quality dimensions and the performance of fifteen widely used machine learning algorithms covering the tasks of classification, regression, and clustering, with the goal of explaining their performance in terms of data quality.
Proceedings Article
Semi-supervised clustering for de-duplication
TL;DR: In this article, the authors consider a restricted version of correlation clustering, where the learning algorithm has access to an oracle, which answers whether two points belong to the same or different clusters.
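The oracle setting in that paper can be illustrated with a small sketch (the oracle and data below are hypothetical stand-ins, not the paper's algorithm): query pairs of records with "same cluster?" and merge them with union-find.

```python
# Minimal sketch: deduplication with a pairwise same-cluster oracle.
# Union-find merges records whenever the oracle answers that two
# records belong to the same cluster.

def dedup_with_oracle(items, oracle):
    """Group items into clusters using pairwise oracle queries."""
    parent = list(range(len(items)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if find(i) != find(j) and oracle(items[i], items[j]):
                parent[find(i)] = find(j)

    clusters = {}
    for i, item in enumerate(items):
        clusters.setdefault(find(i), []).append(item)
    return list(clusters.values())

# Hypothetical oracle: two strings refer to the same entity iff their
# lowercased, space-stripped forms match.
oracle = lambda a, b: a.replace(" ", "").lower() == b.replace(" ", "").lower()
print(dedup_with_oracle(["New York", "newyork", "Chicago"], oracle))
# [['New York', 'newyork'], ['Chicago']]
```

In the paper's setting the oracle is a human or ground-truth source and the goal is to bound the number of such queries; this sketch simply asks about every pair.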
References
Journal Article (DOI)
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat +1 more
TL;DR: This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets, which runs on a large cluster of commodity machines and is highly scalable.
Journal Article (DOI)
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat +1 more
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
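The model described in these two entries can be sketched in a few lines (an in-process toy, with none of the distribution, fault tolerance, or scheduling the real system provides): map emits (key, value) pairs, a shuffle groups values by key, and reduce aggregates each group.

```python
# Minimal in-process sketch of the MapReduce programming model:
# map emits (key, value) pairs, shuffle groups them by key,
# reduce aggregates each group. Word count is the classic example.
from collections import defaultdict

def map_phase(doc):
    for word in doc.split():
        yield word, 1

def reduce_phase(key, values):
    return key, sum(values)

def mapreduce(docs):
    shuffled = defaultdict(list)
    for doc in docs:                       # map + shuffle
        for k, v in map_phase(doc):
            shuffled[k].append(v)
    return dict(reduce_phase(k, vs) for k, vs in shuffled.items())

print(mapreduce(["clean data", "dirty data"]))
# {'clean': 1, 'data': 2, 'dirty': 1}
```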
Journal Article
Binary codes capable of correcting deletions, insertions, and reversals
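This reference underlies the Levenshtein edit distance, a standard string-similarity measure in deduplication; a minimal dynamic-programming sketch (counting insertions, deletions, and substitutions):

```python
# Minimal sketch: Levenshtein edit distance between two strings,
# computed row by row with dynamic programming.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```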
Active Learning Literature Survey
TL;DR: This report provides a general introduction to active learning and a survey of the literature, including a discussion of the scenarios in which queries can be formulated, and an overview of the query strategy frameworks proposed in the literature to date.