Open Access Proceedings Article
Robust decision trees: removing outliers from databases
George H. John
pp. 174–179
TLDR
This paper examines C4.5, a decision tree algorithm that is already quite robust (few algorithms have been shown to consistently achieve higher accuracy), and extends its pruning method to fully remove the effect of outliers, which yields improvements on many databases.
Abstract
Finding and removing outliers is an important problem in data mining. Errors in large databases can be extremely common, so an important property of a data mining algorithm is robustness with respect to errors in the database. Most sophisticated methods in machine learning address this problem to some extent, but not fully, and can be improved by addressing it more directly. In this paper we examine C4.5, a decision tree algorithm that is already quite robust: few algorithms have been shown to consistently achieve higher accuracy. C4.5 incorporates a pruning scheme that partially addresses the outlier removal problem. In our ROBUST-C4.5 algorithm we extend the pruning method to fully remove the effect of outliers, and this results in improvement on many databases.
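The idea the abstract describes can be sketched as an iterate-prune-and-remove loop: build a pruned tree, discard training instances the pruned tree misclassifies (treating them as outliers), and retrain until the training set stabilizes. The sketch below is a rough approximation, not the paper's exact ROBUST-C4.5 procedure; it uses scikit-learn's CART-based `DecisionTreeClassifier` with cost-complexity pruning (`ccp_alpha`) as a stand-in for C4.5, and the pruning strength is an illustrative choice.

```python
# Hedged sketch of the outlier-filtering idea from the abstract, using
# scikit-learn's CART implementation as a stand-in for C4.5.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def robust_tree(X, y, ccp_alpha=0.1, max_rounds=10):
    """Fit a pruned tree, drop training instances it misclassifies
    (treating them as outliers), and refit until nothing is removed."""
    X, y = np.asarray(X), np.asarray(y)
    keep = np.ones(len(y), dtype=bool)
    tree = None
    for _ in range(max_rounds):
        tree = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=0)
        tree.fit(X[keep], y[keep])
        # Instances the pruned tree misclassifies are treated as outliers.
        agrees = tree.predict(X) == y
        new_keep = keep & agrees
        if new_keep.sum() == keep.sum():  # converged: no instance removed
            break
        keep = new_keep
    return tree, keep

# Toy usage: a clean 1-D threshold problem with one flipped label.
X = np.arange(8.0).reshape(-1, 1)
y_clean = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_noisy = y_clean.copy()
y_noisy[1] = 1  # inject a label error
tree, kept = robust_tree(X, y_noisy)
```

In this toy run the pruned tree misclassifies the flipped instance, which is then removed, and the retrained tree recovers the clean decision boundary at x = 3.5.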
Citations
Book
Data Mining: Practical Machine Learning Tools and Techniques
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Journal ArticleDOI
A Survey of Outlier Detection Methodologies
Victoria J. Hodge, Jim Austin, et al.
TL;DR: This survey introduces contemporary techniques for outlier detection, identifies their respective motivations, and distinguishes their advantages and disadvantages in a comparative review.
Journal ArticleDOI
Classification in the Presence of Label Noise: A Survey
Benoît Frénay, Michel Verleysen, et al.
TL;DR: In this survey, label noise consists of mislabeled instances; no additional information, such as confidences on labels, is assumed to be available.
Journal ArticleDOI
Identifying mislabeled training data
Carla E. Brodley, Mark A. Friedl, et al.
TL;DR: This paper uses a set of learning algorithms to create classifiers that serve as noise filters for the training data and suggests that for situations in which there is a paucity of data, consensus filters are preferred, whereas majority vote filters are preferable for situations with an abundance of data.
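The majority-vote and consensus filters summarized above can be sketched with cross-validated predictions from a small committee of learners: an instance is flagged as mislabeled when enough of the learners disagree with its recorded label. The classifier choices, fold count, and toy data below are illustrative assumptions, not the cited paper's exact experimental setup.

```python
# Hedged sketch of majority-vote vs. consensus noise filtering:
# each learner votes, via cross-validated predictions, on whether an
# instance's recorded label looks wrong.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def noise_filter(X, y, consensus=False, cv=4):
    X, y = np.asarray(X), np.asarray(y)
    learners = [DecisionTreeClassifier(random_state=0),
                GaussianNB(),
                KNeighborsClassifier(n_neighbors=3)]
    # One vote per learner: does its held-out (cross-validated)
    # prediction disagree with the recorded label?
    votes = np.stack([cross_val_predict(clf, X, y, cv=cv) != y
                      for clf in learners])
    if consensus:
        return votes.all(axis=0)                  # all learners must disagree
    return votes.sum(axis=0) > len(learners) / 2  # majority must disagree

# Toy usage: a 1-D threshold problem with one mislabeled instance.
X = np.arange(12.0).reshape(-1, 1)
y_noisy = (X.ravel() >= 6).astype(int)
y_noisy[2] = 1  # mislabel a clearly class-0 instance
flagged = noise_filter(X, y_noisy, consensus=True)
```

The consensus variant is the conservative one: it removes an instance only when every learner disagrees with its label, which is why the cited survey recommends it when data is scarce.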
Journal ArticleDOI
Class noise vs. attribute noise: a quantitative study of their impacts
Xingquan Zhu, Xindong Wu, et al.
TL;DR: A systematic evaluation of the effect of noise in machine learning that separates noise into two categories, class noise and attribute noise, and investigates the relationship between attribute noise and classification accuracy, the impact of noise at different attributes, and possible solutions for handling attribute noise.
References
Book
C4.5: Programs for Machine Learning
TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, starting from simple core learning methods and showing how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
Book
Classification and regression trees
TL;DR: This monograph focuses on the methodology used to construct tree-structured rules, covering the use of trees as a data analysis method and, in a more mathematical framework, proving some of their fundamental properties.
Book
Robust Regression and Outlier Detection
TL;DR: This book presents the statistical treatment of outliers, beginning with the one-dimensional location problem and extending to robust regression methods for detecting outliers in multivariate data.