
Showing papers on "Random forest published in 2002"


Journal ArticleDOI
TL;DR: Simulation studies show that the performance of the combining techniques is strongly affected by the small sample size properties of the base classifier: boosting is useful for large training sample sizes, while bagging and the random subspace method are useful for critical training sample sizes.
Abstract: Recently, bagging, boosting and the random subspace method have become popular combining techniques for improving weak classifiers. These techniques are designed for, and usually applied to, decision trees. In this paper, in contrast to a common opinion, we demonstrate that they may also be useful in linear discriminant analysis. Simulation studies, carried out for several artificial and real data sets, show that the performance of the combining techniques is strongly affected by the small sample size properties of the base classifier: boosting is useful for large training sample sizes, while bagging and the random subspace method are useful for critical training sample sizes. Finally, a table describing the possible usefulness of the combining techniques for linear classifiers is presented.

449 citations
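The random subspace idea discussed in this abstract can be illustrated with a small sketch. This is not the authors' code: it substitutes a simple nearest-class-mean linear classifier for LDA, and all function names (`fit_nearest_mean`, `random_subspace_ensemble`, and so on) are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_nearest_mean(X, y):
    # A minimal linear base classifier: store each class mean.
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_nearest_mean(means, X):
    # Assign each row to the class with the closest mean.
    classes = list(means)
    d = np.stack([np.linalg.norm(X - means[c], axis=1) for c in classes])
    return np.array(classes)[d.argmin(axis=0)]

def random_subspace_ensemble(X, y, n_members=25, subspace_frac=0.5):
    # Each member is trained on a random subset of the features,
    # which is the essence of Ho's random subspace method.
    n_feats = X.shape[1]
    k = max(1, int(subspace_frac * n_feats))
    members = []
    for _ in range(n_members):
        feats = rng.choice(n_feats, size=k, replace=False)
        members.append((feats, fit_nearest_mean(X[:, feats], y)))
    return members

def predict_ensemble(members, X):
    # Combine members by unweighted majority vote (integer labels assumed).
    votes = np.stack([predict_nearest_mean(m, X[:, f]) for f, m in members])
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

On two well-separated Gaussian classes, the subspace ensemble recovers nearly perfect training accuracy, mirroring the setting where linear base classifiers benefit from combining.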


Journal ArticleDOI
Tin Kam Ho1
TL;DR: There are strong correlations between the classifier accuracies and measures of length of class boundaries, thickness of the class manifolds, and nonlinearities of decision boundaries; the bootstrapping method is better when the training samples are sparse, while the subspace method is better when the classes are compact and the boundaries are smooth.
Abstract: Using a number of measures for characterising the complexity of classification problems, we studied the comparative advantages of two methods for constructing decision forests – bootstrapping and random subspaces. We investigated a collection of 392 two-class problems from the UCI repository, and observed that there are strong correlations between the classifier accuracies and measures of length of class boundaries, thickness of the class manifolds, and nonlinearities of decision boundaries. We found characteristics of both difficult and easy cases where combination methods are no better than single classifiers. Also, we observed that the bootstrapping method is better when the training samples are sparse, and the subspace method is better when the classes are compact and the boundaries are smooth.

198 citations
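One of the complexity measures mentioned here, the length of the class boundary, can be roughly approximated by a nearest-neighbour statistic. The sketch below is only a loose proxy for the MST-based measure used in the paper; the function name and exact definition are my own:

```python
import numpy as np

def boundary_fraction(X, y):
    # Fraction of points whose nearest neighbour belongs to the other
    # class. A high value suggests a long, tangled class boundary; a
    # low value suggests compact, well-separated classes.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)   # exclude each point itself
    nn = d.argmin(axis=1)         # index of nearest neighbour
    return float((y[nn] != y).mean())
```

Two tight, distant clusters score near 0, while perfectly interleaved classes score near 1, matching the intuition of boundary-length measures.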


Book ChapterDOI
19 Aug 2002
TL;DR: In this paper, a cost-sensitive decision-making approach is proposed that minimizes the actual cost of the decisions rather than improving the overall quality of the probability estimates. The decision-making step is based on the distribution of individual scores computed by classifiers built from different types of ensembles of decision trees.
Abstract: For a variety of applications, machine learning algorithms are required to construct models that minimize the total loss associated with the decisions, rather than the number of errors. One of the most efficient approaches to building models that are sensitive to non-uniform costs of errors is to first estimate the class probabilities of the unseen instances and then to make the decision based on both the computed probabilities and the loss function. Although all classification algorithms can be converted into algorithms for learning models that compute class probabilities, in many cases the computed estimates have proven to be inaccurate. As a result, there is a large research effort to improve the accuracy of the estimates computed by different algorithms. This paper presents a novel approach to cost-sensitive learning that addresses the problem of minimizing the actual cost of the decisions rather than improving the overall quality of the probability estimates. The decision making step for our methods is based on the distribution of the individual scores computed by classifiers that are built by different types of ensembles of decision trees. The new approach relies on statistics that measure the probability that the computed estimates are on one side or the other of the decision boundary, rather than trying to improve the quality of the estimates. The experimental analysis of the new algorithms that were developed based on our approach gives new insight into cost-sensitive decision making and shows that for some tasks, the new algorithms outperform some of the best probability-based algorithms for cost-sensitive learning.

66 citations
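The baseline this paper improves on, making decisions from estimated class probabilities and a loss function, reduces to picking the class with minimum expected cost. A minimal sketch (the loss matrix and function name are illustrative, not from the paper):

```python
import numpy as np

def min_expected_cost_decision(probs, loss):
    # probs: (n, k) class-probability estimates per instance.
    # loss[i, j]: cost of predicting class j when the true class is i.
    # Expected cost of each decision is probs @ loss; choose the cheapest.
    expected = probs @ loss
    return expected.argmin(axis=1)

# Example: missing class 1 (false negative) costs 5x a false positive.
loss = np.array([[0.0, 1.0],
                 [5.0, 0.0]])
probs = np.array([[0.9, 0.1],   # confident class 0 -> predict 0
                  [0.7, 0.3]])  # 0.3 * 5 > 0.7 * 1  -> predict 1
print(min_expected_cost_decision(probs, loss))  # -> [0 1]
```

Note how the second instance is assigned to the rare class even though its estimated probability is only 0.3; the paper's point is that inaccurate probability estimates can still push such decisions to the wrong side of this boundary.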


Journal ArticleDOI
TL;DR: A combination scheme labelled ‘Bagfs’ is presented, in which new learning sets are generated on the basis of both bootstrap replicates and random subspaces; on average, Bagfs exhibits the best agreement between prediction and supervision.
Abstract: Several ways of manipulating a training set have shown that weakened classifier combination can improve prediction accuracy. In the present paper, we focus on learning set sampling (Breiman's Bagging) and random feature subset selections (Ho's Random Subspaces). We present a combination scheme labelled 'Bagfs', in which new learning sets are generated on the basis of both bootstrap replicates and random subspaces. The performances of the three methods (Bagging, Random Subspaces and Bagfs) are compared to the standard Adaboost algorithm. All four methods are assessed by means of a decision-tree inducer (C4.5). In addition, we also study whether the number of classifiers and the way in which they are created have a significant influence on the performance of their combination. To answer these two questions, we applied the McNemar test of significance and the Kappa degree-of-agreement measure. The results, obtained on 23 conventional databases, show that on average, Bagfs exhibits the best agreement between prediction and supervision.

13 citations
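The core of Bagfs, generating learning sets from bootstrap replicates and random feature subspaces at the same time, can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def bagfs_learning_sets(X, y, n_sets=10, subspace_frac=0.5):
    # Each learning set combines the two manipulations the paper merges:
    # a bootstrap replicate of the rows (Bagging) and a random subset of
    # the columns (Random Subspaces). A tree inducer such as C4.5 would
    # then be trained on each set.
    n, p = X.shape
    k = max(1, int(subspace_frac * p))
    sets = []
    for _ in range(n_sets):
        rows = rng.integers(0, n, size=n)            # sample rows with replacement
        cols = rng.choice(p, size=k, replace=False)  # sample features without replacement
        sets.append((X[np.ix_(rows, cols)], y[rows], cols))
    return sets
```

Keeping the selected column indices with each set is necessary so that, at prediction time, each member classifier sees the same feature subset it was trained on.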


Book ChapterDOI
Svante Janson1
01 Jan 2002
TL;DR: In this paper, the analysis of an algorithm by Koda and Ruskey for listing ideals in a forest poset leads to a study of random binary trees and their limits as infinite random binary trees.
Abstract: The analysis of an algorithm by Koda and Ruskey for listing ideals in a forest poset leads to a study of random binary trees and their limits as infinite random binary trees. The corresponding finite and infinite random forests are studied too. The infinite random binary trees and forests studied here have exactly one infinite path; they can be defined using suitable size-biased Galton–Watson processes. Limit theorems are proved using a version of the contraction method.

8 citations


01 Jan 2002
TL;DR: A unified and systematic framework for dealing with both problems is proposed, based on a smoothed bootstrap form of re-sampling from the data; it is supported by a theoretical framework and reduces the risk of model overfitting.
Abstract: The problem of modeling binary responses by using cross sectional data has found a number of satisfying solutions extending throughout both parametric and nonparametric methods. Examples are traditional classification models like logistic regression, discriminant analysis, classification trees or procedures at the forefront such as neural networks or combinations of classifiers (bagging, boosting, random forests). These models are based on the implicit assumption that the distribution of the responses is well balanced over the sample. However, there exist many real situations where it is a priori known that one of the two responses (usually the most interesting for the analysis) is rare. This class imbalance occurs in several domains, for example finance (detection of defaulter credit applicants), epidemiology (diagnosis of rare diseases), social sciences (analysis of anomalous behaviors), and computer science (identification of features of interest in image data). The class imbalance heavily affects both the model estimation and the evaluation of its accuracy. Classification methods are in fact conceived to estimate the model that best fits the data according to some criterion of global accuracy. When data are unbalanced the model tends to focus on the prevalent class and ignore the rare events (Japkowicz and Stephen, 2002). Moreover, when evaluating the quality of the classification, the same measures of global accuracy may lead to misleading results, and even if alternative error measures are used, the scarcity of data leads to high-bias and high-variance estimates of the error rate, especially for the rare class. In this work a unified and systematic framework for dealing with both problems is proposed, based on a smoothed bootstrap form of re-sampling from the data. The proposed technique includes some of the existing solutions as a special case, is supported by a theoretical framework, and reduces the risk of model overfitting.

5 citations
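A smoothed bootstrap, as referred to in this abstract, resamples with replacement and then adds kernel noise to each draw, which amounts to sampling from a kernel density estimate rather than the empirical distribution. A minimal sketch for oversampling a rare class (the Gaussian kernel, bandwidth choice, and function name are assumptions, not the paper's specification):

```python
import numpy as np

rng = np.random.default_rng(2)

def smoothed_bootstrap_oversample(X_rare, n_new, bandwidth=0.1):
    # Plain bootstrap step: draw rare-class rows with replacement.
    idx = rng.integers(0, len(X_rare), size=n_new)
    # Smoothing step: jitter each draw with Gaussian kernel noise,
    # so synthetic points fall near, but not exactly on, observed ones.
    noise = rng.normal(0.0, bandwidth, size=(n_new, X_rare.shape[1]))
    return X_rare[idx] + noise
```

Compared with duplicating rare observations, the jitter reduces the risk that a downstream classifier simply memorises the few rare points, which is the overfitting concern the abstract raises.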


01 Sep 2002
TL;DR: Two methods of constructing ensembles of classifiers are proposed, one of which directly injects randomness into classification tree algorithms by choosing a split randomly at each node with probabilities proportional to the measure of goodness for a split and the other perturbs the output and constructs a classifier using the perturbed data.
Abstract: We propose two methods of constructing ensembles of classifiers. One method directly injects randomness into classification tree algorithms by choosing a split randomly at each node, with probabilities proportional to the measure of goodness for a split; we combine this method with a stopping rule which uses permutation of the outputs. The other method perturbs the output and constructs a classifier using the perturbed data. In both methods, the final classifier is given by an unweighted vote of the individual classifiers. These methods are compared with bagging, Adaboost, and random forests on thirteen commonly used data sets. The results show that our methods perform better than bagging, and comparably to Adaboost and random forests on average. Additional computation shows that our perturbation method could improve its performance by perturbing both the inputs and the outputs, and by combining a sufficiently large number of trees. Plots of strength and correlation show an interesting relationship. We also explore combining sampling subsets of the training set with our proposed methods; the results of a few trials show that their performance could be improved in this way.

1 citation
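The first proposed method, choosing a split at random with probability proportional to its goodness, can be sketched for a single binary-label tree node as follows. This is an illustrative reconstruction using Gini gain as the goodness measure; it is not the authors' code, and the function names are mine:

```python
import numpy as np

rng = np.random.default_rng(3)

def gini_gain(y, mask):
    # Decrease in Gini impurity for a candidate binary split
    # (labels assumed to be 0/1 integers).
    def gini(v):
        if len(v) == 0:
            return 0.0
        p = np.bincount(v, minlength=2) / len(v)
        return 1.0 - np.sum(p ** 2)
    n = len(y)
    nl = mask.sum()
    return gini(y) - (nl / n) * gini(y[mask]) - ((n - nl) / n) * gini(y[~mask])

def pick_random_split(X, y):
    # Enumerate candidate (feature, threshold) splits, then sample one
    # with probability proportional to its goodness, instead of always
    # taking the best split as a greedy tree inducer would.
    candidates, gains = [], []
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            mask = X[:, j] <= t
            candidates.append((j, t))
            gains.append(max(gini_gain(y, mask), 0.0))
    gains = np.array(gains)
    if gains.sum() == 0.0:
        return candidates[rng.integers(len(candidates))]
    return candidates[rng.choice(len(candidates), p=gains / gains.sum())]
```

Because every split with positive gain has some chance of being chosen, trees grown this way differ from one another, which is exactly the diversity an unweighted-vote ensemble relies on.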