Open Access Proceedings Article
A fast decision tree learning algorithm
Jiang Su, Harry Zhang, +1 more
pp. 500–505
TL;DR: The proposed algorithm is a core tree-growing algorithm that can be combined with other scaling-up techniques to achieve further speedup; it is as fast as naive Bayes but outperforms naive Bayes in accuracy according to experiments.

Abstract: There is growing interest in scaling up the widely-used decision-tree learning algorithms to very large data sets. Although numerous diverse techniques have been proposed, a fast tree-growing algorithm without substantial decrease in accuracy and substantial increase in space complexity is essential. In this paper, we present a novel, fast decision-tree learning algorithm that is based on a conditional independence assumption. The new algorithm has a time complexity of O(m · n), where m is the size of the training data and n is the number of attributes. This is a significant asymptotic improvement over the time complexity O(m · n²) of the standard decision-tree learning algorithm C4.5, with an additional space increase of only O(n). Experiments show that our algorithm performs competitively with C4.5 in accuracy on a large number of UCI benchmark data sets, and performs even better and significantly faster than C4.5 on a large number of text classification data sets. The time complexity of our algorithm is as low as naive Bayes'. Indeed, it is as fast as naive Bayes but outperforms naive Bayes in accuracy according to our experiments. Our algorithm is a core tree-growing algorithm that can be combined with other scaling-up techniques to achieve further speedup.
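The abstract's O(m · n) claim rests on gathering per-attribute statistics in a single pass over the data, rather than repeatedly re-partitioning the training set as C4.5 does at every node. The sketch below is illustrative only (it is not the authors' exact method, and the toy weather data are invented): it ranks attributes by information gain computed from one-pass count tables.

```python
from collections import defaultdict
from math import log2

def class_conditional_counts(data, labels):
    """One pass over the data: O(m * n) total work.
    counts[j][(value, label)] = frequency for attribute j."""
    counts = [defaultdict(int) for _ in range(len(data[0]))]
    for row, y in zip(data, labels):
        for j, v in enumerate(row):
            counts[j][(v, y)] += 1
    return counts

def info_gain(counts_j, labels):
    """Information gain of one attribute, computed from its count table alone."""
    m = len(labels)

    def entropy(dist):
        total = sum(dist.values())
        return -sum(c / total * log2(c / total) for c in dist.values() if c)

    class_dist = defaultdict(int)
    for y in labels:
        class_dist[y] += 1
    base = entropy(class_dist)
    # Regroup the flat (value, label) counts by attribute value.
    by_value = defaultdict(lambda: defaultdict(int))
    for (v, y), c in counts_j.items():
        by_value[v][y] += c
    cond = sum(sum(d.values()) / m * entropy(d) for d in by_value.values())
    return base - cond

# Invented toy data: two attributes, binary class.
data = [["sunny", "hot"], ["sunny", "mild"], ["rain", "mild"], ["rain", "hot"]]
labels = ["no", "no", "yes", "yes"]
counts = class_conditional_counts(data, labels)
gains = [info_gain(c, labels) for c in counts]
# Attribute 0 perfectly predicts the label here, so it wins the split.
best = max(range(len(gains)), key=gains.__getitem__)
```

The point of the sketch is the cost profile: the expensive scan happens once, and every attribute is scored from its compact count table, which is the kind of statistic a conditional-independence-based grower can keep reusing down the tree.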
Citations
Journal Article
Inductive learning algorithms and representations for text categorization
TL;DR: Text categorization-assignment of natural language texts to one or more predefined categories based on their content-is an important component in many information organization and management tasks.
Proceedings Article
Stochastic gradient boosted distributed decision trees
TL;DR: Two distributed methods that generate exact stochastic GBDT models are presented: the first is a MapReduce implementation, and the second utilizes MPI on the Hadoop grid environment.
Journal Article
Multi-target regression via input space expansion: treating targets as inputs
TL;DR: In this article, two new methods for multi-target regression, called stacked single-target and ensemble of regressor chains, were introduced by adapting two popular multi-label classification methods of this family.
Journal Article
A Correlation-Based Feature Weighting Filter for Naive Bayes
TL;DR: This paper argues that for NB, highly predictive features should be highly correlated with the class, yet uncorrelated with other features (minimum mutual redundancy), and proposes a correlation-based feature weighting (CFW) filter for NB.
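The relevance-minus-redundancy idea in this summary can be sketched using mutual information as the correlation measure. The weighting formula below is illustrative only, not the paper's exact CFW definition, and the toy feature data are invented.

```python
from collections import Counter
from math import log2

def mutual_info(xs, ys):
    """I(X;Y) from two parallel lists of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def cfw_weights(features, labels):
    """Weight each feature by its relevance to the class minus its average
    redundancy with the other features (a sketch of the CFW principle,
    not the paper's exact normalization)."""
    k = len(features)
    weights = []
    for i, f in enumerate(features):
        relevance = mutual_info(f, labels)
        redundancy = sum(mutual_info(f, g)
                         for j, g in enumerate(features) if j != i)
        weights.append(relevance - redundancy / max(k - 1, 1))
    return weights

# Invented toy data: feature 0 copies the class, feature 1 is independent noise.
features = [[0, 0, 1, 1], [0, 1, 0, 1]]
labels = [0, 0, 1, 1]
weights = cfw_weights(features, labels)
```

On this toy data the class-predictive feature receives the larger weight, which is the behavior the filter is designed to produce before the weights are plugged into NB.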
Proceedings Article
A comparative study of Reduced Error Pruning method in decision tree algorithms
TL;DR: An experiment was conducted using the Weka application to compare performance, in terms of tree-structure complexity and classification accuracy, for the J48, REPTree, PART, JRip, and Ridor algorithms on seven standard datasets from the UCI machine learning repository.
References
Book
C4.5: Programs for Machine Learning
TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
Book
Data Mining: Practical Machine Learning Tools and Techniques
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Journal Article
Data mining: practical machine learning tools and techniques with Java implementations
Ian H. Witten, Eibe Frank, +1 more
TL;DR: This presentation discusses the design and implementation of machine learning algorithms in Java, as well as some of the techniques used to develop and implement these algorithms.
Journal Article
On the Optimality of the Simple Bayesian Classifier under Zero-One Loss
TL;DR: The Bayesian classifier is shown to be optimal for learning conjunctions and disjunctions, even though they violate the independence assumption, and will often outperform more powerful classifiers for common training set sizes and numbers of attributes, even if its bias is a priori much less appropriate to the domain.
Journal Article
Very Simple Classification Rules Perform Well on Most Commonly Used Datasets
TL;DR: On most datasets studied, the best of very simple rules that classify examples on the basis of a single attribute is as accurate as the rules induced by the majority of machine learning systems.