Open Access Proceedings Article

A fast decision tree learning algorithm

TL;DR
The proposed algorithm is a core tree-growing algorithm that can be combined with other scaling-up techniques to achieve further speedup; it is as fast as naive Bayes but outperforms naive Bayes in accuracy according to experiments.
Abstract
There is growing interest in scaling up the widely used decision-tree learning algorithms to very large data sets. Although numerous diverse techniques have been proposed, a fast tree-growing algorithm without a substantial decrease in accuracy or a substantial increase in space complexity is essential. In this paper, we present a novel, fast decision-tree learning algorithm that is based on a conditional independence assumption. The new algorithm has a time complexity of O(m · n), where m is the size of the training data and n is the number of attributes. This is a significant asymptotic improvement over the time complexity O(m · n²) of the standard decision-tree learning algorithm C4.5, with an additional space increase of only O(n). Experiments show that our algorithm performs competitively with C4.5 in accuracy on a large number of UCI benchmark data sets, and performs even better and significantly faster than C4.5 on a large number of text classification data sets. The time complexity of our algorithm is as low as naive Bayes'. Indeed, it is as fast as naive Bayes but outperforms naive Bayes in accuracy according to our experiments. Our algorithm is a core tree-growing algorithm that can be combined with other scaling-up techniques to achieve further speedup.
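As a rough illustration of where the O(m · n) bound comes from, here is a minimal Python sketch; it is an assumption-laden illustration, not the authors' implementation. A single pass over the m examples and n attributes builds class-conditional count tables, after which every attribute can be scored with an information-gain criterion without re-scanning the data. The function names and the choice of plain information gain are illustrative.

from collections import Counter, defaultdict
from math import log2

def entropy(counts):
    # Shannon entropy of a list of class counts.
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def score_attributes(rows, labels):
    # One O(m * n) pass builds count[attribute][value][class] tables.
    counts = defaultdict(lambda: defaultdict(Counter))
    for row, y in zip(rows, labels):
        for attr, value in enumerate(row):
            counts[attr][value][y] += 1
    m = len(labels)
    base = entropy(list(Counter(labels).values()))
    gains = {}
    for attr, by_value in counts.items():
        # Expected entropy of the class after splitting on this attribute.
        remainder = sum(sum(cls.values()) / m * entropy(list(cls.values()))
                        for cls in by_value.values())
        gains[attr] = base - remainder
    return gains  # split on the argmax, e.g. max(gains, key=gains.get)

C4.5, by contrast, recomputes such statistics from the data at every node, which is what drives its O(m · n²) behavior; the paper's contribution is a way to grow the entire tree from shared statistics like these under a conditional independence assumption.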


Citations
Journal Article

Inductive learning algorithms and representations for text categorization

TL;DR: Text categorization, the assignment of natural language texts to one or more predefined categories based on their content, is an important component in many information organization and management tasks.
Proceedings Article

Stochastic gradient boosted distributed decision trees

TL;DR: Two distributed methods that generate exact stochastic GBDT models are presented: the first is a MapReduce implementation, and the second uses MPI on the Hadoop grid environment.
Journal Article

Multi-target regression via input space expansion: treating targets as inputs

TL;DR: In this article, two new methods for multi-target regression, called stacked single-target and ensemble of regressor chains, are introduced by adapting two popular multi-label classification methods of this family; a sketch of the stacked single-target idea appears below.
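A hedged sketch of the stacked single-target idea only (not the article's code; Ridge is a placeholder base regressor): fit one model per target, then refit each target on the inputs augmented with the first-stage predictions of all targets, so that targets effectively become inputs.

import numpy as np
from sklearn.linear_model import Ridge

def fit_sst(X, Y):
    # Stage 1: one independent regressor per target column of Y.
    stage1 = [Ridge().fit(X, Y[:, j]) for j in range(Y.shape[1])]
    # Stage 2: refit each target on inputs plus all stage-1 predictions.
    meta = np.column_stack([m.predict(X) for m in stage1])
    X_aug = np.hstack([X, meta])
    stage2 = [Ridge().fit(X_aug, Y[:, j]) for j in range(Y.shape[1])]
    return stage1, stage2

def predict_sst(stage1, stage2, X):
    meta = np.column_stack([m.predict(X) for m in stage1])
    return np.column_stack([m.predict(np.hstack([X, meta])) for m in stage2])

A faithful version would generate the stage-1 training predictions out-of-sample (e.g. by internal cross-validation) to reduce overfitting; fitting stage 1 on the same data, as here, is a simplification.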
Journal Article

A Correlation-Based Feature Weighting Filter for Naive Bayes

TL;DR: This paper argues that, for NB, highly predictive features should be highly correlated with the class yet uncorrelated with other features (minimum mutual redundancy), and proposes a correlation-based feature weighting (CFW) filter for NB; a sketch of this idea appears below.
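A minimal sketch of that relevance-minus-redundancy weighting, assuming mutual information as the correlation measure; the paper's actual CFW formula may differ, and cfw_weights is a hypothetical name.

from sklearn.metrics import mutual_info_score

def cfw_weights(columns, y):
    # columns: list of discrete feature columns; y: class labels.
    n = len(columns)
    relevance = [mutual_info_score(col, y) for col in columns]
    weights = []
    for i in range(n):
        # Average redundancy of feature i against the other features.
        redundancy = sum(mutual_info_score(columns[i], columns[j])
                         for j in range(n) if j != i) / max(n - 1, 1)
        weights.append(max(relevance[i] - redundancy, 0.0))
    return weights

Each weight would then scale its feature's contribution inside naive Bayes, boosting predictive features and damping redundant ones.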
Proceedings Article

A comparative study of Reduced Error Pruning method in decision tree algorithms

TL;DR: An experiment was conducted using the Weka application to compare the performance, in terms of tree-structure complexity and classification accuracy, of the J48, REPTree, PART, JRip, and Ridor algorithms on seven standard datasets from the UCI machine learning repository.
References
Book

C4.5: Programs for Machine Learning

TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
Book

Data Mining: Practical Machine Learning Tools and Techniques

TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Journal Article

Data mining: practical machine learning tools and techniques with Java implementations

TL;DR: This presentation discusses the design and implementation of machine learning algorithms in Java, as well as some of the techniques used to develop and implement these algorithms.
Journal Article

On the Optimality of the Simple Bayesian Classifier under Zero-One Loss

TL;DR: The Bayesian classifier is shown to be optimal for learning conjunctions and disjunctions even though they violate the independence assumption, and it will often outperform more powerful classifiers for common training-set sizes and numbers of attributes, even if its bias is a priori much less appropriate to the domain.
Journal Article

Very Simple Classification Rules Perform Well on Most Commonly Used Datasets

TL;DR: On most of the datasets studied, the best of a set of very simple rules that classify examples on the basis of a single attribute is as accurate as the rules induced by the majority of machine learning systems; a sketch of such a one-rule learner appears below.
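A minimal sketch of such a one-attribute rule learner, in the spirit of Holte's 1R; the original's handling of ties, missing values, and the discretization of numeric attributes is omitted here.

from collections import Counter, defaultdict

def one_r(rows, labels):
    best = None  # (errors, attribute index, value -> class rule)
    for attr in range(len(rows[0])):
        by_value = defaultdict(Counter)
        for row, y in zip(rows, labels):
            by_value[row[attr]][y] += 1
        # Each attribute value predicts its majority class.
        rule = {v: cls.most_common(1)[0][0] for v, cls in by_value.items()}
        errors = sum(y != rule[row[attr]] for row, y in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, attr, rule)
    return best[1], best[2]  # chosen attribute and its value -> class rule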