
Showing papers on "Random forest" published in 2004


Journal ArticleDOI
TL;DR: In this paper, the authors present two approaches for obtaining class probabilities, which can be reduced to linear systems and are easy to implement, and show conceptually and experimentally that the proposed approaches are more stable than the two existing popular methods: voting and the method by Hastie and Tibshirani (1998).
Abstract: Pairwise coupling is a popular multi-class classification method that combines all comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than the two existing popular methods: voting and the method by Hastie and Tibshirani (1998).
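
Both formulations reduce to small linear systems. As a hedged illustration, the numpy sketch below follows one common formulation of the second approach (minimizing the sum of (r_ji p_i - r_ij p_j)^2 over pairs, subject to the probabilities summing to one); the exact systems in the paper may differ in detail.

```python
import numpy as np

def pairwise_coupling(R):
    """Estimate class probabilities p from pairwise probabilities R,
    where R[i, j] approximates P(class i | class i or class j).

    Sketch of one linear-system formulation: minimize
    sum_{i != j} (R[j, i] * p[i] - R[i, j] * p[j])^2  s.t.  sum(p) = 1,
    whose KKT conditions give the linear system solved below.
    """
    k = R.shape[0]
    Q = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i == j:
                Q[i, i] = sum(R[m, i] ** 2 for m in range(k) if m != i)
            else:
                Q[i, j] = -R[j, i] * R[i, j]
    # KKT system: [Q  e; e^T  0] [p; lam] = [0; 1]
    A = np.zeros((k + 1, k + 1))
    A[:k, :k] = Q
    A[:k, k] = 1.0
    A[k, :k] = 1.0
    b = np.zeros(k + 1)
    b[k] = 1.0
    return np.linalg.solve(A, b)[:k]

# Toy example with three classes: off-diagonal R[i, j] + R[j, i] = 1.
R = np.array([[0.0, 0.7, 0.8],
              [0.3, 0.0, 0.6],
              [0.2, 0.4, 0.0]])
print(pairwise_coupling(R))  # class probabilities summing to 1
```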

1,888 citations


Proceedings Article
14 Apr 2004
TL;DR: In this article, the authors revisited the formulation of random forests and investigated prediction performance on real-world and simulated datasets for which maximally sized trees do overfit, and revealed that gains can be realized by additional tuning to regulate tree size via limiting the number of splits and/or the size of nodes for which splitting is allowed.
Abstract: Breiman (2001a,b) has recently developed an ensemble classification and regression approach that displayed outstanding performance with regard to prediction error on a suite of benchmark datasets. As the base constituents of the ensemble are tree-structured predictors, and since each of these is constructed using an injection of randomness, the method is called ‘random forests’. That the exceptional performance is attained with seemingly only a single tuning parameter, to which sensitivity is minimal, makes the methodology all the more remarkable. The individual trees comprising the forest are all grown to maximal depth. While this helps with regard to bias, there is the familiar tradeoff with variance. However, these variability concerns were potentially obscured because of an interesting feature of those benchmarking datasets extracted from the UCI machine learning repository for testing: all these datasets are hard to overfit using tree-structured methods. This raises issues about the scope of the repository. With this as motivation, and coupled with experience from boosting methods, we revisit the formulation of random forests and investigate prediction performance on real-world and simulated datasets for which maximally sized trees do overfit. These explorations reveal that gains can be realized by additional tuning to regulate tree size via limiting the number of splits and/or the size of nodes for which splitting is allowed. Nonetheless, even in these settings, good performance for random forests can be attained by using larger (than default) primary tuning parameter values.
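
For readers who want to reproduce this kind of tuning, here is a minimal scikit-learn sketch on synthetic data; the parameter grid is illustrative, not the paper's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Beyond the usual primary parameter (max_features, an mtry analogue),
# regulate tree size via node-size and depth controls, as the paper
# suggests for data that maximally grown trees can overfit.
param_grid = {
    "max_features": ["sqrt", 0.5, None],   # primary tuning parameter
    "min_samples_leaf": [1, 5, 20],        # node-size control
    "max_depth": [None, 4, 8],             # split-count/depth control
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=300, random_state=0),
    param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```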

462 citations


Proceedings ArticleDOI
02 Nov 2004
TL;DR: This paper presents a novel methodology for predicting fault prone modules, based on random forests, an extension of decision tree learning that generates hundreds or even thousands of trees using subsets of the training data.
Abstract: Accurate prediction of fault prone modules (a module is equivalent to a C function or a C++ method) in the software development process enables effective detection and identification of defects. Such prediction models are especially beneficial for large-scale systems, where verification experts need to focus their attention and resources on problem areas in the system under development. This paper presents a novel methodology for predicting fault prone modules, based on random forests. Random forests are an extension of decision tree learning. Instead of generating one decision tree, this methodology generates hundreds or even thousands of trees using subsets of the training data. The classification decision is obtained by voting. We applied random forests in five case studies based on NASA data sets. The prediction accuracy of the proposed methodology is generally higher than that achieved by logistic regression, discriminant analysis and the algorithms in two machine learning software packages, WEKA [I. H. Witten et al. (1999)] and See5. The difference in the performance of the proposed methodology over other methods is statistically significant. Further, the accuracy advantage of random forests over the other methods is more pronounced in larger data sets.
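
The tree-voting mechanism the abstract describes can be sketched directly; the synthetic features below are placeholders for real module metrics (size, complexity, and so on) with a fault / no-fault label.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder stand-in for static-code metrics of modules; the paper's
# case studies used NASA data sets instead.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.8, 0.2], random_state=1)

forest = RandomForestClassifier(n_estimators=500, random_state=1).fit(X, y)

# Each tree casts a vote and the majority decides. (scikit-learn's own
# predict() averages tree probabilities, so the two can rarely differ.)
votes = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
majority = (votes.mean(axis=0) > 0.5).astype(int)
print(majority)
print(forest.predict(X[:5]))
```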

272 citations


Book ChapterDOI
09 Jun 2004
TL;DR: The performance of Random Forest with default settings on six publicly available data sets is already as good or better than that of three other prominent QSAR methods: Decision Tree, Partial Least Squares, and Support Vector Machine.
Abstract: Leo Breiman’s Random Forest ensemble learning procedure is applied to the problem of Quantitative Structure-Activity Relationship (QSAR) modeling for pharmaceutical molecules. This entails using a quantitative description of a compound’s molecular structure to predict that compound’s biological activity as measured in an in vitro assay. Without any parameter tuning, the performance of Random Forest with default settings on six publicly available data sets is already as good or better than that of three other prominent QSAR methods: Decision Tree, Partial Least Squares, and Support Vector Machine. In addition to reliable prediction accuracy, Random Forest provides variable importance measures which can be used in a variable reduction wrapper algorithm. Comparisons of various such wrappers and between Random Forest and Bagging are presented.
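
A hedged sketch of a variable-reduction wrapper of the kind described, using random-forest importances on a synthetic stand-in for a QSAR descriptor matrix; the halving schedule and stopping size are illustrative choices, not the paper's.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Placeholder for a descriptor matrix and measured activities.
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       random_state=0)
keep = np.arange(X.shape[1])

# Simple wrapper: repeatedly drop the least important half of the
# descriptors while tracking cross-validated performance.
while len(keep) > 5:
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    score = cross_val_score(rf, X[:, keep], y, cv=5).mean()
    rf.fit(X[:, keep], y)
    order = np.argsort(rf.feature_importances_)[::-1]
    print(f"{len(keep)} descriptors, CV R^2 = {score:.3f}")
    keep = keep[order[: max(5, len(keep) // 2)]]
```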

244 citations


Journal Article
TL;DR: Using several attribute evaluation measures instead of just one gives promising results, and replacing ordinary voting with voting weighted by the margin achieved on the most similar instances yields improvements that are statistically highly significant across several data sets.
Abstract: Random forests are one of the most successful ensemble methods, exhibiting performance on the level of boosting and support vector machines. The method is fast, robust to noise, does not overfit, and offers possibilities for explanation and visualization of its output. We investigate some possibilities to increase the strength or decrease the correlation of individual trees in the forest. Using several attribute evaluation measures instead of just one gives promising results. On the other hand, replacing ordinary voting with voting weighted by the margin achieved on the most similar instances gives improvements that are statistically highly significant across several data sets.

210 citations


Proceedings ArticleDOI
01 Dec 2004
TL;DR: A new method to compute similarities for the task of classifying pairs of proteins as interacting or not is presented; it improves coverage to 20% of interacting pairs at a false positive rate of 50%.
Abstract: One of the most important, but often ignored, parts of any clustering and classification algorithm is the computation of the similarity matrix. This is especially important when integrating high-throughput biological data sources because of the high noise rates and the many missing values. In this paper we present a new method to compute such similarities for the task of classifying pairs of proteins as interacting or not. Our method uses direct and indirect information about interaction pairs to construct a random forest (a collection of decision trees) from a training set. The resulting forest is used to determine the similarity between protein pairs, and this similarity is used by a classification algorithm (a modified kNN) to classify protein pairs. Testing the algorithm on yeast data indicates that it is able to improve coverage to 20% of interacting pairs with a false positive rate of 50%. These results compare favorably with all previously suggested methods for this task, indicating the importance of robust similarity estimates.
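
The similarity derived from the forest can be read as a random-forest proximity: the fraction of trees in which two examples reach the same leaf. A minimal sketch, with synthetic stand-ins for the protein-pair features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for direct/indirect features of protein pairs.
X, y = make_classification(n_samples=300, n_features=12, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# leaves[i, t] = index of the leaf that example i reaches in tree t.
leaves = forest.apply(X)

def proximity(i, j):
    # Fraction of trees in which examples i and j share a leaf; a
    # kNN-style classifier can then consume this similarity.
    return np.mean(leaves[i] == leaves[j])

print(proximity(0, 1), proximity(0, 0))  # the latter is always 1.0
```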

197 citations


Proceedings ArticleDOI
27 Dec 2004
TL;DR: The use of random forests for classification of multisource data is investigated, and the experimental results are compared to those obtained by bagging and boosting methods.
Abstract: The use of random forests for classification of multisource data is investigated in this paper. Random Forest is a classifier that grows many classification trees. Each tree is trained on a bootstrapped sample of the training data, and at each node the algorithm searches across only a random subset of the variables to determine the split. To classify an input vector, the vector is submitted to each of the trees in the forest, and the classification is determined by a majority vote. The experiments presented in the paper were done on a multisource remote sensing and geographic data set. The experimental results obtained with random forests were compared to results obtained by bagging and boosting methods.
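
The procedure just described is easy to sketch end to end: grow each tree on a bootstrap sample, restrict every node to a random subset of variables (via max_features), and classify by majority vote. Synthetic data stands in for the multisource set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=16, random_state=0)

# Grow each tree on a bootstrap sample; max_features="sqrt" makes the
# tree consider a random subset of variables at every node.
trees = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1 << 31)))
    trees.append(tree.fit(X[idx], y[idx]))

# Submit the input to every tree and take a majority vote.
votes = np.stack([t.predict(X[:3]) for t in trees])
print((votes.mean(axis=0) > 0.5).astype(int))
```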

104 citations


Proceedings ArticleDOI
27 Dec 2004
TL;DR: The HSVM method solves a series of maxcut problems to hierarchically and recursively partition the set of classes into two subsets, until pure leaf nodes that have only one class label are obtained.
Abstract: This paper presents a new approach, called Hierarchical Support Vector Machines (HSVM), to address multiclass problems. The method solves a series of maxcut problems to hierarchically and recursively partition the set of classes into two subsets, until pure leaf nodes that have only one class label are obtained. An SVM is applied at each internal node to construct the discriminant function for a binary metaclass classifier. Because the maxcut-based unsupervised decomposition uses distance measures to investigate natural class groupings, HSVM has a fast and intuitive SVM training process that requires little tuning and yields both high accuracy levels and good generalization. The HSVM method was applied to Hyperion hyperspectral data collected over the Okavango Delta of Botswana. Classification accuracies and generalization capability are compared to those achieved by the Best Basis Binary Hierarchical Classifier, a Random Forest CART binary decision tree classifier, and Binary Hierarchical Support Vector Machines.
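
A heavily simplified sketch of the hierarchical idea follows, with 2-means on class centroids standing in for the paper's maxcut decomposition (an assumption made purely for brevity) and an RBF SVM at each internal node.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.svm import SVC

def build_hsvm(X, y, classes):
    # Leaf: a single class remains.
    if len(classes) == 1:
        return classes[0]
    # Stand-in for the maxcut decomposition: split the class centroids
    # into two natural groups (2-means here, a simplification).
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    side = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(centroids)
    left = [c for c, s in zip(classes, side) if s == 0]
    right = [c for c, s in zip(classes, side) if s == 1]
    # Binary metaclass SVM at this internal node.
    mask = np.isin(y, classes)
    meta = np.isin(y[mask], right).astype(int)
    svm = SVC(kernel="rbf", gamma="scale").fit(X[mask], meta)
    return (svm, build_hsvm(X, y, left), build_hsvm(X, y, right))

def predict_hsvm(node, x):
    # Descend until a pure leaf (a single class label) is reached.
    while not np.isscalar(node):
        svm, left, right = node
        node = right if svm.predict(x[None, :])[0] == 1 else left
    return node

X, y = load_iris(return_X_y=True)
tree = build_hsvm(X, y, list(np.unique(y)))
print(predict_hsvm(tree, X[0]), y[0])
```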

96 citations


Journal ArticleDOI
01 Aug 2004
TL;DR: A way to assign probability after multiclass SVM classification is developed and a system to recognize underwater plankton images from the shadow image particle profiling evaluation recorder (SIPPER) is presented.
Abstract: We present a system to recognize underwater plankton images from the shadow image particle profiling evaluation recorder (SIPPER). The challenge of the SIPPER image set is that many images do not have clear contours. To address that, shape features that do not heavily depend on contour information were developed. A soft margin support vector machine (SVM) was used as the classifier. We developed a way to assign probability after multiclass SVM classification. Our approach achieved approximately 90% accuracy on a collection of plankton images. On another larger image set containing manually unidentifiable particles, it also provided 75.6% overall accuracy. The proposed approach was statistically significantly more accurate on the two data sets than a C4.5 decision tree and a cascade correlation neural network. The single SVM significantly outperformed ensembles of decision trees created by bagging and random forests on the smaller data set and was slightly better on the other data set. The 15-feature subset produced by our feature selection approach provided slightly better accuracy than using all 29 features. Our probability model gave us a reasonable rejection curve on the larger data set.

87 citations


Proceedings Article
01 Jul 2004
TL;DR: It is shown that the RF language models are superior to regular n-gram language models in reducing both the perplexity (PPL) and word error rate (WER) in a large vocabulary speech recognition system.
Abstract: In this paper, we explore the use of Random Forests (RFs) (Amit and Geman, 1997; Breiman, 2001) in language modeling, the problem of predicting the next word based on words already seen before. The goal in this work is to develop a new language modeling approach based on randomly grown Decision Trees (DTs) and apply it to automatic speech recognition. We study our RF approach in the context of n-gram type language modeling. Unlike regular n-gram language models, RF language models have the potential to generalize well to unseen data, even when a complicated history is used. We show that our RF language models are superior to regular n-gram language models in reducing both the perplexity (PPL) and word error rate (WER) in a large vocabulary speech recognition system.

Journal ArticleDOI
TL;DR: A QSAR modeling study has been done with a set of 79 piperazinylquinazoline analogues which exhibit PDGFR inhibition, and the linear regression and nonlinear computational neural network models developed contain the two most important descriptors indicated by the random forest model.
Abstract: A QSAR modeling study has been done with a set of 79 piperazinylquinazoline analogues which exhibit PDGFR inhibition. Linear regression and nonlinear computational neural network models were developed. The regression model was developed with a focus on interpretative ability using a PLS technique. However, it also exhibits a good predictive ability after outlier removal. The nonlinear CNN model had superior predictive ability compared to the linear model, with a training set error of 0.22 log(IC50) units (R2 = 0.93) and a prediction set error of 0.32 log(IC50) units (R2 = 0.61). A random forest model was also developed to provide an alternate measure of descriptor importance. This approach ranks descriptors, and its results confirm the importance of specific descriptors as characterized by the PLS technique. In addition, the neural network model contains the two most important descriptors indicated by the random forest model.

01 Jul 2004
TL;DR: This paper addresses the problem of generating sets of conditions (inputs, disturbances, and parameters) that might be used to "test" a given hybrid system and extends the method of Rapidly exploring Random Trees to obtain test inputs, and introduces new measures for coverage and tree growth.
Abstract: Most robot control and planning algorithms are complex, involving a combination of reactive controllers, behavior-based controllers, and deliberative controllers. The switching between different behaviors or controllers makes such systems hybrid, i.e., combining discrete and continuous dynamics. While proofs of convergence, robustness and stability are often available for simple controllers under a carefully crafted set of operating conditions, there is no systematic approach to experimenting with, testing, and validating the performance of complex hybrid control systems. In this paper we address the problem of generating sets of conditions (inputs, disturbances, and parameters) that might be used to "test" a given hybrid system. We use the method of Rapidly exploring Random Trees (RRTs) to obtain test inputs. We extend the traditional RRT, which only searches over continuous inputs, to a new algorithm called the Rapidly exploring Random Forest of Trees (RRFT), which can also search over time-invariant parameters by growing a set of trees for each parameter value choice. We introduce new measures for coverage and tree growth that allow us to dynamically allocate our resources among the set of trees and to plant new trees when the growth rate of existing ones slows to an unacceptable level. We demonstrate the application of RRFT to testing and validation of aerial robotic control systems.

01 Mar 2004
TL;DR: This work investigates an instance of a problem where the phenotype of interest is HIV-1 replication capacity and contiguous segments of protease and reverse transcriptase sequence constitute genotype; it evaluates random forests as applied in this setting and details why prediction gains obtained in other situations are not realized.
Abstract: The problem of relating genotype (as represented by amino acid sequence) to phenotypes is distinguished from standard regression problems by the nature of sequence data. Here we investigate an instance of such a problem where the phenotype of interest is HIV-1 replication capacity and contiguous segments of protease and reverse transcriptase sequence constitute genotype. A variety of data analytic methods have been proposed in this context. Shortcomings of select techniques are contrasted with the advantages afforded by tree-structured methods. However, tree-structured methods, in turn, have been criticized on grounds of only enjoying modest predictive performance. A number of ensemble approaches (bagging, boosting, random forests) have recently emerged, devised to overcome this deficiency. We evaluate random forests as applied in this setting, and detail why prediction gains obtained in other situations are not realized. Other approaches including logic regression, support vector machines and neural networks are also applied. We interpret results in terms of HIV-1 reverse transcriptase structure and function.


Book ChapterDOI
09 Jun 2004
TL;DR: Bagging and six other randomization-based ensemble tree methods are evaluated, and it is found that none of the other six is consistently more accurate than standard bagging when tested for statistical significance.
Abstract: We experimentally evaluated bagging and six other randomization-based ensemble tree methods. Bagging uses randomization to create multiple training sets. Other approaches, such as Randomized C4.5, apply randomization in selecting a test at a given node of a tree. Then there are approaches, such as random forests and random subspaces, that apply randomization in the selection of attributes to be used in building the tree. On the other hand, boosting incrementally builds classifiers by focusing on examples misclassified by existing classifiers. Experiments were performed on 34 publicly available data sets. While each of the other six approaches has some strengths, we find that none of them is consistently more accurate than standard bagging when tested for statistical significance.

Journal ArticleDOI
TL;DR: Assessment of classification methods that can potentially handle genetic interactions suggests that SNP data might be useful for the classification of individuals into categories of high or low risk of diseases.

Journal ArticleDOI
TL;DR: A novel classification method that integrates gene selection and model development, and thus eliminates the bias of gene preselection in cross-validation, is presented, and it is demonstrated that the multiclass DF is an effective classification method for analysis of gene expression data for the purpose of molecular diagnostics.
Abstract: The wealth of knowledge embedded in gene expression data from DNA microarrays portends rapid advances in both research and clinic. Turning the prodigious and noisy data into knowledge is a challenge to the field of bioinformatics, and development of classifiers using supervised learning techniques is the primary methodological approach for clinical application using gene expression data. In this paper, we present a novel classification method, multiclass Decision Forest (DF), that is the direct extension of the two-class DF previously developed in our lab. Central to DF is the synergistic combining of multiple heterogenic but comparable decision trees to reach a more accurate and robust classification model. The computationally inexpensive multiclass DF algorithm integrates gene selection and model development, and thus eliminates the bias of gene preselection in cross-validation. Importantly, the method provides several statistical means for assessment of prediction accuracy, prediction confidence, and di...
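
The cross-validation point generalizes beyond this particular method: to avoid preselection bias, gene selection must be redone inside every fold. A sketch of the unbiased protocol, with a random forest standing in for the paper's Decision Forest and synthetic data for the expression matrix:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for expression data: many genes, few samples.
X, y = make_classification(n_samples=60, n_features=2000,
                           n_informative=20, random_state=0)

# Gene selection lives inside the pipeline, so each CV fold reselects
# genes from its own training split only; no preselection bias.
model = Pipeline([
    ("select", SelectKBest(f_classif, k=50)),
    ("forest", RandomForestClassifier(n_estimators=200, random_state=0)),
])
print(cross_val_score(model, X, y, cv=5).mean())
```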

Journal ArticleDOI
TL;DR: It is shown here how low-level features can be related to semantic photo categories, such as indoor, outdoor and close-up, using decision forests consisting of trees constructed according to CART methodology.
Abstract: Annotating photographs with broad semantic labels can be useful in both image processing and content-based image retrieval. We show here how low-level features can be related to semantic photo categories, such as indoor, outdoor and close-up, using decision forests consisting of trees constructed according to CART methodology. We also show how the results can be improved by introducing a rejection option in the classification process. Experimental results on a test set of 4,500 photographs are reported and discussed.
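
A rejection option of this kind can be realized by thresholding the ensemble's class-probability estimates; a minimal sketch on synthetic stand-ins for the low-level photo features (the 0.7 threshold is an arbitrary illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in for low-level photo features with categories such as
# indoor / outdoor / close-up.
X, y = make_classification(n_samples=600, n_features=10, n_classes=3,
                           n_informative=6, random_state=0)
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

proba = forest.predict_proba(X)
confident = proba.max(axis=1) >= 0.7  # reject anything below threshold
labels = np.where(confident, proba.argmax(axis=1), -1)  # -1 = rejected
print(f"rejected {np.mean(labels == -1):.1%} of photos")
```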

Book ChapterDOI
06 Sep 2004
TL;DR: A random forest based approach to learning first order theories with aggregates is introduced and several variants are experimentally validated: first order random forests without aggregates, with simple aggregates, and with complex aggregates in the feature set.
Abstract: Random forest induction is a bagging method that randomly samples the feature set at each node in a decision tree. In propositional learning, the method has been shown to work well when lots of features are available. This certainly is the case in first order learning, especially when aggregate functions, combined with selection conditions on the set to be aggregated, are included in the feature space. In this paper, we introduce a random forest based approach to learning first order theories with aggregates. We experimentally validate and compare several variants: first order random forests without aggregates, with simple aggregates, and with complex aggregates in the feature set.

Proceedings Article
Wei Fan
25 Jul 2004
TL;DR: This study shows that the actual reason for the random tree's superior performance is its optimal approximation to each example's true probability of being a member of a given class.
Abstract: Random decision trees form an ensemble of decision trees. The feature at any node of a tree in the ensemble is chosen randomly from the remaining features. A chosen discrete feature cannot be chosen again on the same decision path; a continuous feature can be chosen multiple times, with a different splitting value each time. During classification, each tree outputs a raw posterior probability, and the probabilities from all trees in the ensemble are averaged as the final posterior probability estimate. Although remarkably simple and somewhat counter-intuitive, random decision trees have been shown to be highly accurate under 0-1 loss and cost-sensitive loss functions. A preliminary explanation attributed the high accuracy to the "error-tolerance" property of probabilistic decision making. Our study shows that the actual reason for the method's superior performance is its optimal approximation to each example's true probability of being a member of a given class.
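
A hedged approximation of this scheme using scikit-learn building blocks: ExtraTreeClassifier with splitter="random" and max_features=1 chooses split features and thresholds essentially at random, which is close to, though not identical with, the random decision trees described. The raw posteriors are then averaged across the ensemble.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import ExtraTreeClassifier

X, y = make_classification(n_samples=400, n_features=12, random_state=0)

# Near-random trees: one randomly chosen feature and a random threshold
# per node; labels mainly determine the leaf posteriors.
ensemble = [
    ExtraTreeClassifier(splitter="random", max_features=1,
                        random_state=s).fit(X, y)
    for s in range(30)
]

# Each tree outputs a raw posterior; the final estimate is the average.
posterior = np.mean([t.predict_proba(X[:3]) for t in ensemble], axis=0)
print(posterior)
```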

Posted Content
TL;DR: The boosting approach can be used to quantify corporate governance risk in the case of Latin American markets; using Adaboost, an alternating decision tree (ADT) was selected that explains the relationship between the corporate variables that determine performance and efficiency.
Abstract: The objective of this paper is to demonstrate how the boosting approach can be used to quantify the corporate governance risk in the case of Latin American markets. We compare our results using Adaboost with logistic regression, bagging, and random forests. We conduct tenfold cross-validation experiments on one sample of Latin American Depository Receipts (ADRs), and on another sample of Latin American banks. We find that if the dataset is uniform (similar types of companies and same source of information), as is the case with the Latin American ADRs dataset, the results of Adaboost are similar to the results of bagging and random forests. Only when the dataset shows significant non-uniformity does bagging improve the results. Additionally, the uniformity of the dataset affects the interpretability of the results. Using Adaboost, we were able to select an alternating decision tree (ADT) that explained the relationship between the corporate variables that determined performance and efficiency.

Proceedings ArticleDOI
01 Nov 2004
TL;DR: This paper introduces orthogonal decision trees that offer an effective way to construct a redundancy-free, accurate, and meaningful representation of large decision-tree-ensembles often created by popular techniques such as bagging, boosting, random forests and many distributed and data stream mining algorithms.
Abstract: This paper introduces orthogonal decision trees that offer an effective way to construct a redundancy-free, accurate, and meaningful representation of large decision-tree-ensembles often created by popular techniques such as bagging, boosting, random forests and many distributed and data stream mining algorithms. Orthogonal decision trees are functionally orthogonal to each other and they correspond to the principal components of the underlying function space. This paper offers a technique to construct such trees based on eigen-analysis of the ensemble and offers experimental results to document the performance of orthogonal trees on grounds of accuracy and model complexity.
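
A sketch of the eigen-analysis step only: represent each tree by its predictions on a reference sample and extract principal components of that function space. The paper's further step of converting components back into usable trees is not shown, and the bagged forest here is merely a convenient example ensemble.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
ensemble = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                             random_state=0).fit(X, y)

# One column per tree in the "function space" matrix F: each tree is
# represented by its predictions on a reference sample.
F = np.stack([t.predict(X) for t in ensemble.estimators_], axis=1)

# Eigen-analysis of the ensemble: a few principal components capture
# most of its functional variation, exposing the redundancy.
pca = PCA(n_components=10).fit(F)
print(pca.explained_variance_ratio_.cumsum())
```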

Proceedings ArticleDOI
22 Aug 2004
TL;DR: This study investigates support vector machines (SVM) and random forests (RF) for outlier detection, variable selection and classification, and links the selected predictors to the biochemistry of the disease.
Abstract: Metabolomics is the "omics" science of biochemistry. The associated data include the quantitative measurements of all small molecule metabolites in a biological sample. These datasets provide a window into dynamic biochemical networks and, conjointly with other "omic" data on genes and proteins, have great potential to unravel complex human diseases. The dataset used in this study has 63 individuals, normal and diseased, and the diseased are drug treated or not, so there are three classes. The goal is to classify these individuals using the observed metabolite levels for 317 measured metabolites. There are a number of statistical challenges: non-normal data; the number of samples is less than the number of metabolites; there are missing data, and the fact that data are missing is informative (assay values below detection limits can point to a specific class); and there are high correlations among the metabolites. We investigate support vector machines (SVM) and random forests (RF) for outlier detection, variable selection and classification. We use the variables selected with RF in SVM and vice versa. The benefit of this study is insight into the interplay of variable selection and classification methods. We link our selected predictors to the biochemistry of the disease.

01 Jan 2004
TL;DR: In this paper, the authors evaluated three statistical models: Regression Tree Analysis (RTA), Bagging Trees (BT) and Random Forest (RF) for their utility in predicting the distributions of four tree species under current and future climate.
Abstract: More and better machine learning tools are becoming available for landscape ecologists to aid in understanding species-environment relationships and to map probable species occurrence now and potentially into the future. To that end, we evaluated three statistical models: Regression Tree Analysis (RTA), Bagging Trees (BT) and Random Forest (RF) for their utility in predicting the distributions of four tree species under current and future climate. RTA's single tree was the easiest to interpret but is less accurate compared to BT and RF, which use multiple regression trees with resampling and resampling-randomisation respectively. Future estimates of suitable habitat following climate change were also improved with BT and RF, with a slight edge to RF because it better smoothes the outputs in a logical gradient fashion. We recommend widespread use of these tools for GIS-based vegetation mapping.

Journal ArticleDOI
TL;DR: A novel interactive pattern analysis method is presented that reduces relevance feedback to a two-class classification problem and classifies multimedia objects as relevant or irrelevant and proposes two online pattern classification methods that adapt a composite classifier known as random forests for relevance feedback.
Abstract: Relevance feedback is a mechanism to interactively learn a user's query concept online. It has been extensively used to improve the performance of multimedia information retrieval. In this paper, we present a novel interactive pattern analysis method that reduces relevance feedback to a two-class classification problem and classifies multimedia objects as relevant or irrelevant. To perform interactive pattern analysis, we propose two online pattern classification methods, called interactive random forests (IRF) and adaptive random forests (ARF), that adapt a composite classifier known as random forests for relevance feedback. IRF improves the efficiency of regular random forests (RRF) with a novel two-level resampling technique called biased random sample reduction, while ARF boosts the performance of RRF with two adaptive learning techniques called dynamic feature extraction and adaptive sample selection. During interactive multimedia retrieval, both ARF and IRF run two to three times faster than RRF while achieving comparable precision and recall against the latter. Extensive experiments on a COREL image set (with 31,438 images) demonstrate that our methods (i.e., IRF and ARF) achieve at least a 20% improvement on average precision and recall over the state-of-the-art approaches.

Book ChapterDOI
20 Sep 2004
TL;DR: An experimental study of the performance of three machine learning algorithms applied to the difficult problem of galaxy classification using the Naive Bayes classifier, the rule-induction algorithm C4.5 and a recently introduced classifier named random forest.
Abstract: In this paper we present an experimental study of the performance of three machine learning algorithms applied to the difficult problem of galaxy classification. We use the Naive Bayes classifier, the rule-induction algorithm C4.5 and a recently introduced classifier named random forest (RF). We first employ image processing to standardize the images, eliminating the effects of orientation and scale, then perform principal component analysis to reduce the dimensionality of the data, and finally, classify the galaxy images. Our experiments show that RF obtains the best results considering three, five and seven galaxy types.
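
The processing pipeline described, dimensionality reduction by PCA followed by a random-forest classifier, is straightforward to sketch; the digits data set below is only a stand-in for the standardized galaxy images, and the component count is illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Stand-in for standardized galaxy images: reduce dimensionality with
# PCA, then classify with a random forest.
X, y = load_digits(return_X_y=True)
model = Pipeline([
    ("pca", PCA(n_components=20)),
    ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
])
print(cross_val_score(model, X, y, cv=5).mean())
```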

Book ChapterDOI
09 Jun 2004
TL;DR: Experimental results with small samples show that random aggregating, implemented through subsampled bagging, reduces the variance component of the error by about 90%, while bagging, as expected, achieves a lower reduction.
Abstract: Bagging can be interpreted as an approximation of random aggregating, an ideal ensemble method by which base learners are trained using data sets randomly drawn according to an unknown probability distribution. An approximate realization of random aggregating can be obtained through subsampled bagging, when large training sets are available. In this paper we perform an experimental bias-variance analysis of bagged and random aggregated ensembles of Support Vector Machines, in order to quantitatively evaluate their theoretical variance-reduction properties. Experimental results with small samples show that random aggregating, implemented through subsampled bagging, reduces the variance component of the error by about 90%, while bagging, as expected, achieves a lower reduction. Bias-variance analysis also explains why ensemble methods based on subsampling techniques can be successfully applied to large data mining problems.
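
Subsampled bagging can be emulated by drawing small training fractions without replacement; a sketch comparing it with ordinary bagging for SVM base learners on synthetic data (the 10% subsample fraction is an illustrative choice, not the paper's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Ordinary bagging: bootstrap samples the size of the training set.
bagged = BaggingClassifier(SVC(), n_estimators=50, random_state=0)

# Subsampled bagging (approximating random aggregating): each SVM sees
# a small subsample drawn without replacement.
subsampled = BaggingClassifier(SVC(), n_estimators=50, max_samples=0.1,
                               bootstrap=False, random_state=0)

for name, model in [("bagging", bagged), ("subsampled", subsampled)]:
    print(name, cross_val_score(model, X, y, cv=3).mean())
```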