Open Access Proceedings Article

Robust decision trees: removing outliers from databases

George H. John
pp. 174-179
TL;DR: This paper examines C4.5, a decision tree algorithm that is already quite robust (few algorithms have been shown to consistently achieve higher accuracy), and extends its pruning method to fully remove the effect of outliers, yielding improvements on many databases.
Abstract
Finding and removing outliers is an important problem in data mining. Errors in large databases can be extremely common, so an important property of a data mining algorithm is robustness with respect to errors in the database. Most sophisticated methods in machine learning address this problem to some extent, but not fully, and can be improved by addressing the problem more directly. In this paper we examine C4.5, a decision tree algorithm that is already quite robust; few algorithms have been shown to consistently achieve higher accuracy. C4.5 incorporates a pruning scheme that partially addresses the outlier removal problem. In our ROBUST-C4.5 algorithm we extend the pruning method to fully remove the effect of outliers, and this results in improvement on many databases.
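The abstract's filter-and-retrain idea can be sketched as a short loop: build a pruned tree, discard the training instances the pruned tree misclassifies (treating them as likely outliers), and refit until the remaining data is classified consistently. The sketch below uses scikit-learn's CART implementation with cost-complexity pruning (`ccp_alpha`) as a stand-in for C4.5's pessimistic pruning; the function name, parameters, and stopping rule are illustrative assumptions, not taken from the paper.

```python
# Sketch of a ROBUST-C4.5-style outlier filter: prune, drop
# misclassified training instances, and retrain until stable.
# scikit-learn's CART tree stands in for C4.5 here (assumption);
# ccp_alpha plays the role of C4.5's pruning confidence level.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def robust_tree_filter(X, y, ccp_alpha=0.1, max_iter=10):
    """Return (tree, keep): a tree fit on the retained instances,
    and a boolean mask marking which instances were kept."""
    X, y = np.asarray(X), np.asarray(y)
    keep = np.ones(len(y), dtype=bool)
    tree = None
    for _ in range(max_iter):
        # Fit a cost-complexity-pruned tree on the retained data.
        tree = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=0)
        tree.fit(X[keep], y[keep])
        correct = tree.predict(X[keep]) == y[keep]
        if correct.all():
            break  # pruned tree is consistent with remaining data
        # Drop the instances the pruned tree misclassifies.
        idx = np.flatnonzero(keep)
        keep[idx[~correct]] = False
    return tree, keep

# Toy usage: 16 points in two well-separated clusters, with one
# label flipped in the middle of the first cluster. The pruned
# tree refuses to isolate the flipped point, so it is filtered out.
X = [[i] for i in list(range(8)) + list(range(10, 18))]
y = [0] * 8 + [1] * 8
y[2] = 1  # inject a single label error
tree, keep = robust_tree_filter(X, y, ccp_alpha=0.1)
```

With `ccp_alpha=0.1`, the subtree that would carve out the single flipped point is pruned away, that point is misclassified and removed, and the second fit classifies the remaining 15 instances perfectly. The key design choice, mirroring the abstract, is that pruning alone leaves the outlier in the training set; the explicit removal step is what "fully removes" its effect on later refits.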



Citations
Book

Data Mining: Practical Machine Learning Tools and Techniques

TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Journal ArticleDOI

A Survey of Outlier Detection Methodologies

TL;DR: A survey of contemporary outlier detection techniques that identifies their respective motivations and distinguishes their advantages and disadvantages in a comparative review.
Journal ArticleDOI

Classification in the Presence of Label Noise: A Survey

TL;DR: In this survey, label noise consists of mislabeled instances; no additional information, such as confidences on labels, is assumed to be available.
Journal ArticleDOI

Identifying mislabeled training data

TL;DR: This paper uses a set of learning algorithms to create classifiers that serve as noise filters for the training data and suggests that for situations in which there is a paucity of data, consensus filters are preferred, whereas majority vote filters are preferable for situations with an abundance of data.
Journal ArticleDOI

Class noise vs. attribute noise: a quantitative study of their impacts

TL;DR: A systematic evaluation of the effect of noise in machine learning that separates noise into two categories, class noise and attribute noise, and investigates the relationship between attribute noise and classification accuracy, the impact of noise at different attributes, and possible solutions for handling attribute noise.
References
Book

C4.5: Programs for Machine Learning

TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, starting from simple core learning methods and showing how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
Book

Classification and regression trees

Leo Breiman
TL;DR: This monograph focuses on the methodology used to construct tree-structured rules, covering the use of trees as a data analysis method and, in a more mathematical framework, proving some of their fundamental properties.
Journal ArticleDOI

Generalized Additive Models.

Book

Robust Regression and Outlier Detection

TL;DR: This book presents the statistical treatment of outliers and robust estimation, beginning with the one-dimensional location problem and extending to robust regression.