Open Access Journal Article
Data Cleaning: Problems and Current Approaches.
Erhard Rahm, Hong Hai Do
TL;DR: This work classifies the data quality problems that data cleaning addresses, gives an overview of the main solution approaches, and discusses current tool support for data cleaning.
Abstract:
We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.
Citations
Book
Data Mining: Concepts and Techniques
TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.
Journal Article
The rise of big data on cloud computing
Ibrahim Abaker Targio Hashem, Ibrar Yaqoob, Nor Badrul Anuar, Salimah Binti Mokhtar, Abdullah Gani, Samee U. Khan
TL;DR: The definition, characteristics, and classification of big data along with some discussions on cloud computing are introduced, and research challenges are investigated, with focus on scalability, availability, data integrity, data transformation, data quality, data heterogeneity, privacy, legal and regulatory issues, and governance.
Journal Article
Data fusion
Jens Bleiholder, Felix Naumann
TL;DR: This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data fusion.
Book Chapter
COMA: a system for flexible combination of schema matching approaches
Hong-Hai Do, Erhard Rahm
TL;DR: This work develops the COMA schema matching system as a platform to combine multiple matchers in a flexible way and uses COMA as a framework to comprehensively evaluate the effectiveness of different matchers and their combinations for real-world schemas.
Book
Data Mining: The Textbook
TL;DR: This textbook explores the different aspects of data mining from the fundamentals to the complex data types and their applications, capturing the wide diversity of problem domains for data mining issues.
References
Journal Article
The Anatomy of a Large-Scale Hypertextual Web Search Engine.
Sergey Brin, Lawrence Page
TL;DR: This paper presents Google, a prototype of a large-scale search engine that makes heavy use of the structure present in hypertext and is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems.
Proceedings Article
Fast Algorithms for Mining Association Rules in Large Databases
Journal Article
Identification of common molecular subsequences.
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).
Journal Article
Term Weighting Approaches in Automatic Text Retrieval
Gerard Salton, Chris Buckley
TL;DR: This paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared.