
Showing papers on "Random forest" published in 2004


Journal ArticleDOI
TL;DR: In this paper, the authors present two approaches for obtaining class probabilities, which can be reduced to linear systems and are easy to implement, and show conceptually and experimentally that the proposed approaches are more stable than the two existing popular methods: voting and the method by Hastie and Tibshirani (1998).
Abstract: Pairwise coupling is a popular multi-class classification method that combines all comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than the two existing popular methods: voting and the method by Hastie and Tibshirani (1998).
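
Both formulations reduce to small linear systems. As a hedged illustration, the numpy sketch below follows one common formulation of the second approach (minimizing the sum of (r_ji p_i - r_ij p_j)^2 over pairs, subject to the probabilities summing to one); the exact systems in the paper may differ in detail.

```python
import numpy as np

def pairwise_coupling(R):
    """Estimate class probabilities p from pairwise probabilities R,
    where R[i, j] approximates P(class i | class i or class j).

    Sketch of one linear-system formulation: minimize
    sum_{i != j} (R[j, i] * p[i] - R[i, j] * p[j])^2  s.t.  sum(p) = 1,
    whose KKT conditions give the linear system solved below.
    """
    k = R.shape[0]
    Q = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i == j:
                Q[i, i] = sum(R[m, i] ** 2 for m in range(k) if m != i)
            else:
                Q[i, j] = -R[j, i] * R[i, j]
    # KKT system: [Q  e; e^T  0] [p; lam] = [0; 1]
    A = np.zeros((k + 1, k + 1))
    A[:k, :k] = Q
    A[:k, k] = 1.0
    A[k, :k] = 1.0
    b = np.zeros(k + 1)
    b[k] = 1.0
    return np.linalg.solve(A, b)[:k]

# Toy example with three classes: off-diagonal R[i, j] + R[j, i] = 1.
R = np.array([[0.0, 0.7, 0.8],
              [0.3, 0.0, 0.6],
              [0.2, 0.4, 0.0]])
print(pairwise_coupling(R))  # class probabilities summing to 1
```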

1,888 citations


Proceedings Article
14 Apr 2004
TL;DR: In this article, the authors revisited the formulation of random forests and investigated prediction performance on real-world and simulated datasets for which maximally sized trees do overfit, and revealed that gains can be realized by additional tuning to regulate tree size via limiting the number of splits and/or the size of nodes for which splitting is allowed.
Abstract: Breiman (2001a,b) has recently developed an ensemble classification and regression approach that displayed outstanding performance with regard to prediction error on a suite of benchmark datasets. As the base constituents of the ensemble are tree-structured predictors, and since each of these is constructed using an injection of randomness, the method is called ‘random forests’. That the exceptional performance is attained with seemingly only a single tuning parameter, to which sensitivity is minimal, makes the methodology all the more remarkable. The individual trees comprising the forest are all grown to maximal depth. While this helps with regard to bias, there is the familiar tradeoff with variance. However, these variability concerns were potentially obscured because of an interesting feature of those benchmarking datasets extracted from the UCI machine learning repository for testing: all these datasets are hard to overfit using tree-structured methods. This raises issues about the scope of the repository. With this as motivation, and coupled with experience from boosting methods, we revisit the formulation of random forests and investigate prediction performance on real-world and simulated datasets for which maximally sized trees do overfit. These explorations reveal that gains can be realized by additional tuning to regulate tree size via limiting the number of splits and/or the size of nodes for which splitting is allowed. Nonetheless, even in these settings, good performance for random forests can be attained by using larger (than default) primary tuning parameter values.
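
For readers who want to reproduce this kind of tuning, here is a minimal scikit-learn sketch on synthetic data; the parameter grid is illustrative, not the paper's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Beyond the usual primary parameter (max_features, an mtry analogue),
# regulate tree size via node-size and depth controls, as the paper
# suggests for data that maximally grown trees can overfit.
param_grid = {
    "max_features": ["sqrt", 0.5, None],   # primary tuning parameter
    "min_samples_leaf": [1, 5, 20],        # node-size control
    "max_depth": [None, 4, 8],             # split-count/depth control
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=300, random_state=0),
    param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```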

462 citations


Proceedings ArticleDOI
02 Nov 2004
TL;DR: This paper presents a novel methodology for predicting fault prone modules, based on random forests, an extension of decision tree learning that generates hundreds or even thousands of trees using subsets of the training data.
Abstract: Accurate prediction of fault prone modules (a module is equivalent to a C function or a C++ method) in the software development process enables effective detection and identification of defects. Such prediction models are especially beneficial for large-scale systems, where verification experts need to focus their attention and resources on problem areas in the system under development. This paper presents a novel methodology for predicting fault prone modules, based on random forests. Random forests are an extension of decision tree learning. Instead of generating one decision tree, this methodology generates hundreds or even thousands of trees using subsets of the training data. The classification decision is obtained by voting. We applied random forests in five case studies based on NASA data sets. The prediction accuracy of the proposed methodology is generally higher than that achieved by logistic regression, discriminant analysis and the algorithms in two machine learning software packages, WEKA [I. H. Witten et al. (1999)] and See5. The difference in the performance of the proposed methodology over other methods is statistically significant. Further, the accuracy advantage of random forests over the other methods is more pronounced in larger data sets.
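
The tree-voting mechanism the abstract describes can be sketched directly; the synthetic features below are placeholders for real module metrics (size, complexity, and so on) with a fault / no-fault label.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder stand-in for static-code metrics of modules; the paper's
# case studies used NASA data sets instead.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.8, 0.2], random_state=1)

forest = RandomForestClassifier(n_estimators=500, random_state=1).fit(X, y)

# Each tree casts a vote and the majority decides. (scikit-learn's own
# predict() averages tree probabilities, so the two can rarely differ.)
votes = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
majority = (votes.mean(axis=0) > 0.5).astype(int)
print(majority)
print(forest.predict(X[:5]))
```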

272 citations


Book ChapterDOI
09 Jun 2004
TL;DR: The performance of Random Forest with default settings on six publicly available data sets is already as good or better than that of three other prominent QSAR methods: Decision Tree, Partial Least Squares, and Support Vector Machine.
Abstract: Leo Breiman’s Random Forest ensemble learning procedure is applied to the problem of Quantitative Structure-Activity Relationship (QSAR) modeling for pharmaceutical molecules. This entails using a quantitative description of a compound’s molecular structure to predict that compound’s biological activity as measured in an in vitro assay. Without any parameter tuning, the performance of Random Forest with default settings on six publicly available data sets is already as good or better than that of three other prominent QSAR methods: Decision Tree, Partial Least Squares, and Support Vector Machine. In addition to reliable prediction accuracy, Random Forest provides variable importance measures which can be used in a variable reduction wrapper algorithm. Comparisons of various such wrappers and between Random Forest and Bagging are presented.
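
A hedged sketch of a variable-reduction wrapper of the kind described, using random-forest importances on a synthetic stand-in for a QSAR descriptor matrix; the halving schedule and stopping size are illustrative choices, not the paper's.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Placeholder for a descriptor matrix and measured activities.
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       random_state=0)
keep = np.arange(X.shape[1])

# Simple wrapper: repeatedly drop the least important half of the
# descriptors while tracking cross-validated performance.
while len(keep) > 5:
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    score = cross_val_score(rf, X[:, keep], y, cv=5).mean()
    rf.fit(X[:, keep], y)
    order = np.argsort(rf.feature_importances_)[::-1]
    print(f"{len(keep)} descriptors, CV R^2 = {score:.3f}")
    keep = keep[order[: max(5, len(keep) // 2)]]
```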

244 citations


Journal Article
TL;DR: Using several attribute evaluation measures instead of just one gives promising results, and replacing ordinary voting with voting weighted by the margin achieved on the most similar instances yields improvements that are statistically highly significant across several data sets.
Abstract: Random forests are one of the most successful ensemble methods, exhibiting performance on the level of boosting and support vector machines. The method is fast, robust to noise, does not overfit, and offers possibilities for explanation and visualization of its output. We investigate some possibilities to increase the strength or decrease the correlation of individual trees in the forest. Using several attribute evaluation measures instead of just one gives promising results. On the other hand, replacing ordinary voting with voting weighted by the margin achieved on the most similar instances gives improvements that are statistically highly significant across several data sets.

210 citations


Proceedings ArticleDOI
01 Dec 2004
TL;DR: A new method to compute similarities for the task of classifying pairs of proteins as interacting or not is presented; it improves coverage to 20% of interacting pairs at a false positive rate of 50%.
Abstract: One of the most important, but often ignored, parts of any clustering and classification algorithm is the computation of the similarity matrix. This is especially important when integrating high-throughput biological data sources because of the high noise rates and the many missing values. In this paper we present a new method to compute such similarities for the task of classifying pairs of proteins as interacting or not. Our method uses direct and indirect information about interaction pairs to construct a random forest (a collection of decision trees) from a training set. The resulting forest is used to determine the similarity between protein pairs, and this similarity is used by a classification algorithm (a modified kNN) to classify protein pairs. Testing the algorithm on yeast data indicates that it is able to improve coverage to 20% of interacting pairs with a false positive rate of 50%. These results compare favorably with all previously suggested methods for this task, indicating the importance of robust similarity estimates.
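
The similarity derived from the forest can be read as a random-forest proximity: the fraction of trees in which two examples reach the same leaf. A minimal sketch, with synthetic stand-ins for the protein-pair features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for direct/indirect features of protein pairs.
X, y = make_classification(n_samples=300, n_features=12, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# leaves[i, t] = index of the leaf that example i reaches in tree t.
leaves = forest.apply(X)

def proximity(i, j):
    # Fraction of trees in which examples i and j share a leaf; a
    # kNN-style classifier can then consume this similarity.
    return np.mean(leaves[i] == leaves[j])

print(proximity(0, 1), proximity(0, 0))  # the latter is always 1.0
```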

197 citations


Proceedings ArticleDOI
27 Dec 2004
TL;DR: The use of random forests for classification of multisource data is investigated, and the experimental results are compared to those obtained by bagging and boosting methods.
Abstract: The use of random forests for classification of multisource data is investigated in this paper. Random Forest is a classifier that grows many classification trees. Each tree is trained on a bootstrapped sample of the training data, and at each node the algorithm searches across only a random subset of the variables to determine the split. To classify an input vector, the vector is submitted to each of the trees in the forest, and the classification is determined by a majority vote. The experiments presented in the paper were done on a multisource remote sensing and geographic data set. The experimental results obtained with random forests were compared to results obtained by bagging and boosting methods.
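
The procedure just described is easy to sketch end to end: grow each tree on a bootstrap sample, restrict every node to a random subset of variables (via max_features), and classify by majority vote. Synthetic data stands in for the multisource set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=16, random_state=0)

# Grow each tree on a bootstrap sample; max_features="sqrt" makes the
# tree consider a random subset of variables at every node.
trees = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1 << 31)))
    trees.append(tree.fit(X[idx], y[idx]))

# Submit the input to every tree and take a majority vote.
votes = np.stack([t.predict(X[:3]) for t in trees])
print((votes.mean(axis=0) > 0.5).astype(int))
```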

104 citations


Proceedings ArticleDOI
27 Dec 2004
TL;DR: The HSVM method solves a series of maxcut problems to hierarchically and recursively partition the set of classes into two subsets, until pure leaf nodes that have only one class label are obtained.
Abstract: This paper presents a new approach, called Hierarchical Support Vector Machines (HSVM), to address multiclass problems. The method solves a series of maxcut problems to hierarchically and recursively partition the set of classes into two subsets, until pure leaf nodes that have only one class label are obtained. An SVM is applied at each internal node to construct the discriminant function for a binary metaclass classifier. Because the maxcut-based unsupervised decomposition uses distance measures to investigate natural class groupings, HSVM has a fast and intuitive SVM training process that requires little tuning and yields both high accuracy levels and good generalization. The HSVM method was applied to Hyperion hyperspectral data collected over the Okavango Delta of Botswana. Classification accuracies and generalization capability are compared to those achieved by the Best Basis Binary Hierarchical Classifier, a Random Forest CART binary decision tree classifier, and Binary Hierarchical Support Vector Machines.
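
A heavily simplified sketch of the hierarchical idea follows, with 2-means on class centroids standing in for the paper's maxcut decomposition (an assumption made purely for brevity) and an RBF SVM at each internal node.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.svm import SVC

def build_hsvm(X, y, classes):
    # Leaf: a single class remains.
    if len(classes) == 1:
        return classes[0]
    # Stand-in for the maxcut decomposition: split the class centroids
    # into two natural groups (2-means here, a simplification).
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    side = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(centroids)
    left = [c for c, s in zip(classes, side) if s == 0]
    right = [c for c, s in zip(classes, side) if s == 1]
    # Binary metaclass SVM at this internal node.
    mask = np.isin(y, classes)
    meta = np.isin(y[mask], right).astype(int)
    svm = SVC(kernel="rbf", gamma="scale").fit(X[mask], meta)
    return (svm, build_hsvm(X, y, left), build_hsvm(X, y, right))

def predict_hsvm(node, x):
    # Descend until a pure leaf (a single class label) is reached.
    while not np.isscalar(node):
        svm, left, right = node
        node = right if svm.predict(x[None, :])[0] == 1 else left
    return node

X, y = load_iris(return_X_y=True)
tree = build_hsvm(X, y, list(np.unique(y)))
print(predict_hsvm(tree, X[0]), y[0])
```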

96 citations


Journal ArticleDOI
01 Aug 2004
TL;DR: A way to assign probability after multiclass SVM classification is developed and a system to recognize underwater plankton images from the shadow image particle profiling evaluation recorder (SIPPER) is presented.
Abstract: We present a system to recognize underwater plankton images from the shadow image particle profiling evaluation recorder (SIPPER). The challenge of the SIPPER image set is that many images do not have clear contours. To address that, shape features that do not heavily depend on contour information were developed. A soft margin support vector machine (SVM) was used as the classifier. We developed a way to assign probability after multiclass SVM classification. Our approach achieved approximately 90% accuracy on a collection of plankton images. On another larger image set containing manually unidentifiable particles, it also provided 75.6% overall accuracy. The proposed approach was statistically significantly more accurate on the two data sets than a C4.5 decision tree and a cascade correlation neural network. The single SVM significantly outperformed ensembles of decision trees created by bagging and random forests on the smaller data set and was slightly better on the other data set. The 15-feature subset produced by our feature selection approach provided slightly better accuracy than using all 29 features. Our probability model gave us a reasonable rejection curve on the larger data set.

87 citations


Proceedings Article
01 Jul 2004
TL;DR: It is shown that the RF language models are superior to regular n-gram language models in reducing both the perplexity (PPL) and word error rate (WER) in a large vocabulary speech recognition system.
Abstract: In this paper, we explore the use of Random Forests (RFs) (Amit and Geman, 1997; Breiman, 2001) in language modeling, the problem of predicting the next word based on words already seen before. The goal in this work is to develop a new language modeling approach based on randomly grown Decision Trees (DTs) and apply it to automatic speech recognition. We study our RF approach in the context of n-gram type language modeling. Unlike regular n-gram language models, RF language models have the potential to generalize well to unseen data, even when a complicated history is used. We show that our RF language models are superior to regular n-gram language models in reducing both the perplexity (PPL) and word error rate (WER) in a large vocabulary speech recognition system.

Journal ArticleDOI
TL;DR: A QSAR modeling study has been done with a set of 79 piperazinylquinazoline analogues which exhibit PDGFR inhibition, and the linear regression and nonlinear computational neural network models developed contain the two most important descriptors indicated by the random forest model.
Abstract: A QSAR modeling study has been done with a set of 79 piperazinylquinazoline analogues which exhibit PDGFR inhibition. Linear regression and nonlinear computational neural network models were developed. The regression model was developed with a focus on interpretative ability using a PLS technique. However, it also exhibits a good predictive ability after outlier removal. The nonlinear CNN model had superior predictive ability compared to the linear model, with a training set error of 0.22 log(IC50) units (R2 = 0.93) and a prediction set error of 0.32 log(IC50) units (R2 = 0.61). A random forest model was also developed to provide an alternate measure of descriptor importance. This approach ranks descriptors, and its results confirm the importance of specific descriptors as characterized by the PLS technique. In addition, the neural network model contains the two most important descriptors indicated by the random forest model.

01 Jul 2004
TL;DR: This paper addresses the problem of generating sets of conditions (inputs, disturbances, and parameters) that might be used to "test" a given hybrid system and extends the method of Rapidly exploring Random Trees to obtain test inputs, and introduces new measures for coverage and tree growth.
Abstract: Most robot control and planning algorithms are complex, involving a combination of reactive controllers, behavior-based controllers, and deliberative controllers. The switching between different behaviors or controllers makes such systems hybrid, i.e., combining discrete and continuous dynamics. While proofs of convergence, robustness and stability are often available for simple controllers under a carefully crafted set of operating conditions, there is no systematic approach to experimenting with, testing, and validating the performance of complex hybrid control systems. In this paper we address the problem of generating sets of conditions (inputs, disturbances, and parameters) that might be used to "test" a given hybrid system. We use the method of Rapidly exploring Random Trees (RRTs) to obtain test inputs. We extend the traditional RRT, which only searches over continuous inputs, to a new algorithm called the Rapidly exploring Random Forest of Trees (RRFT), which can also search over time-invariant parameters by growing a set of trees for each parameter value choice. We introduce new measures for coverage and tree growth that allow us to dynamically allocate our resources among the set of trees and to plant new trees when the growth rate of existing ones slows to an unacceptable level. We demonstrate the application of RRFT to testing and validation of aerial robotic control systems.

01 Mar 2004
TL;DR: This work investigates an instance of a problem where the phenotype of interest is HIV-1 replication capacity and contiguous segments of protease and reverse transcriptase sequence constitute genotype; it evaluates random forests as applied in this setting and details why prediction gains obtained in other situations are not realized.
Abstract: The problem of relating genotype (as represented by amino acid sequence) to phenotypes is distinguished from standard regression problems by the nature of sequence data. Here we investigate an instance of such a problem where the phenotype of interest is HIV-1 replication capacity and contiguous segments of protease and reverse transcriptase sequence constitute genotype. A variety of data analytic methods have been proposed in this context. Shortcomings of select techniques are contrasted with the advantages afforded by tree-structured methods. However, tree-structured methods, in turn, have been criticized on grounds of only enjoying modest predictive performance. A number of ensemble approaches (bagging, boosting, random forests) have recently emerged, devised to overcome this deficiency. We evaluate random forests as applied in this setting, and detail why prediction gains obtained in other situations are not realized. Other approaches including logic regression, support vector machines and neural networks are also applied. We interpret results in terms of HIV-1 reverse transcriptase structure and function.


Book ChapterDOI
09 Jun 2004
TL;DR: Bagging and six other randomization-based ensemble tree methods are evaluated, and it is found that none of the other six is consistently more accurate than standard bagging when tested for statistical significance.
Abstract: We experimentally evaluated bagging and six other randomization-based ensemble tree methods. Bagging uses randomization to create multiple training sets. Other approaches, such as Randomized C4.5, apply randomization in selecting a test at a given node of a tree. Then there are approaches, such as random forests and random subspaces, that apply randomization in the selection of attributes to be used in building the tree. On the other hand, boosting incrementally builds classifiers by focusing on examples misclassified by existing classifiers. Experiments were performed on 34 publicly available data sets. While each of the other six approaches has some strengths, we find that none of them is consistently more accurate than standard bagging when tested for statistical significance.

Journal ArticleDOI
TL;DR: Assessment of classification methods that can potentially handle genetic interactions suggests that SNP data might be useful for the classification of individuals into categories of high or low risk of diseases.

Journal ArticleDOI
TL;DR: A novel classification method that integrates gene selection and model development, and thus eliminates the bias of gene preselection in cross-validation, is presented, and it is demonstrated that the multiclass DF is an effective classification method for analysis of gene expression data for the purpose of molecular diagnostics.
Abstract: The wealth of knowledge embedded in gene expression data from DNA microarrays portends rapid advances in both research and clinic. Turning the prodigious and noisy data into knowledge is a challenge to the field of bioinformatics, and development of classifiers using supervised learning techniques is the primary methodological approach for clinical application using gene expression data. In this paper, we present a novel classification method, multiclass Decision Forest (DF), that is the direct extension of the two-class DF previously developed in our lab. Central to DF is the synergistic combining of multiple heterogenic but comparable decision trees to reach a more accurate and robust classification model. The computationally inexpensive multiclass DF algorithm integrates gene selection and model development, and thus eliminates the bias of gene preselection in cross-validation. Importantly, the method provides several statistical means for assessment of prediction accuracy, prediction confidence, and di...
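
The cross-validation point generalizes beyond this particular method: to avoid preselection bias, gene selection must be redone inside every fold. A sketch of the unbiased protocol, with a random forest standing in for the paper's Decision Forest and synthetic data for the expression matrix:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for expression data: many genes, few samples.
X, y = make_classification(n_samples=60, n_features=2000,
                           n_informative=20, random_state=0)

# Gene selection lives inside the pipeline, so each CV fold reselects
# genes from its own training split only; no preselection bias.
model = Pipeline([
    ("select", SelectKBest(f_classif, k=50)),
    ("forest", RandomForestClassifier(n_estimators=200, random_state=0)),
])
print(cross_val_score(model, X, y, cv=5).mean())
```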

Journal ArticleDOI
TL;DR: It is shown here how low-level features can be related to semantic photo categories, such as indoor, outdoor and close-up, using decision forests consisting of trees constructed according to CART methodology.
Abstract: Annotating photographs with broad semantic labels can be useful in both image processing and content-based image retrieval. We show here how low-level features can be related to semantic photo categories, such as indoor, outdoor and close-up, using decision forests consisting of trees constructed according to CART methodology. We also show how the results can be improved by introducing a rejection option in the classification process. Experimental results on a test set of 4,500 photographs are reported and discussed.
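
A rejection option of this kind can be realized by thresholding the ensemble's class-probability estimates; a minimal sketch on synthetic stand-ins for the low-level photo features (the 0.7 threshold is an arbitrary illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in for low-level photo features with categories such as
# indoor / outdoor / close-up.
X, y = make_classification(n_samples=600, n_features=10, n_classes=3,
                           n_informative=6, random_state=0)
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

proba = forest.predict_proba(X)
confident = proba.max(axis=1) >= 0.7  # reject anything below threshold
labels = np.where(confident, proba.argmax(axis=1), -1)  # -1 = rejected
print(f"rejected {np.mean(labels == -1):.1%} of photos")
```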

Book ChapterDOI
06 Sep 2004
TL;DR: A random forest based approach to learning first order theories with aggregates is introduced and several variants are experimentally validated: first order random forests without aggregates, with simple aggregates, and with complex aggregates in the feature set.
Abstract: Random forest induction is a bagging method that randomly samples the feature set at each node in a decision tree. In propositional learning, the method has been shown to work well when lots of features are available. This certainly is the case in first order learning, especially when aggregate functions, combined with selection conditions on the set to be aggregated, are included in the feature space. In this paper, we introduce a random forest based approach to learning first order theories with aggregates. We experimentally validate and compare several variants: first order random forests without aggregates, with simple aggregates, and with complex aggregates in the feature set.

Proceedings Article
Wei Fan
25 Jul 2004
TL;DR: This study shows that the actual reason for the random tree's superior performance is its optimal approximation to each example's true probability of being a member of a given class.
Abstract: Random decision trees form an ensemble of decision trees. The feature at any node of a tree in the ensemble is chosen randomly from the remaining features. A chosen discrete feature cannot be chosen again on the same decision path; a continuous feature can be chosen multiple times, with a different splitting value each time. During classification, each tree outputs a raw posterior probability, and the probabilities from all trees in the ensemble are averaged as the final posterior probability estimate. Although remarkably simple and somewhat counter-intuitive, random decision trees have been shown to be highly accurate under 0-1 loss and cost-sensitive loss functions. A preliminary explanation attributed the high accuracy to the "error-tolerance" property of probabilistic decision making. Our study shows that the actual reason for the method's superior performance is its optimal approximation to each example's true probability of being a member of a given class.
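
A hedged approximation of this scheme using scikit-learn building blocks: ExtraTreeClassifier with splitter="random" and max_features=1 chooses split features and thresholds essentially at random, which is close to, though not identical with, the random decision trees described. The raw posteriors are then averaged across the ensemble.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import ExtraTreeClassifier

X, y = make_classification(n_samples=400, n_features=12, random_state=0)

# Near-random trees: one randomly chosen feature and a random threshold
# per node; labels mainly determine the leaf posteriors.
ensemble = [
    ExtraTreeClassifier(splitter="random", max_features=1,
                        random_state=s).fit(X, y)
    for s in range(30)
]

# Each tree outputs a raw posterior; the final estimate is the average.
posterior = np.mean([t.predict_proba(X[:3]) for t in ensemble], axis=0)
print(posterior)
```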

Posted Content
TL;DR: The boosting approach can be used to quantify corporate governance risk in the case of Latin American markets; using Adaboost, an alternating decision tree (ADT) was selected that explains the relationship between the corporate variables that determine performance and efficiency.
Abstract: The objective of this paper is to demonstrate how the boosting approach can be used to quantify the corporate governance risk in the case of Latin American markets. We compare our results using Adaboost with logistic regression, bagging, and random forests. We conduct tenfold cross-validation experiments on one sample of Latin American Depository Receipts (ADRs), and on another sample of Latin American banks. We find that if the dataset is uniform (similar types of companies and same source of information), as is the case with the Latin American ADRs dataset, the results of Adaboost are similar to the results of bagging and random forests. Only when the dataset shows significant non-uniformity does bagging improve the results. Additionally, the uniformity of the dataset affects the interpretability of the results. Using Adaboost, we were able to select an alternating decision tree (ADT) that explained the relationship between the corporate variables that determined performance and efficiency.

Proceedings ArticleDOI
01 Nov 2004
TL;DR: This paper introduces orthogonal decision trees that offer an effective way to construct a redundancy-free, accurate, and meaningful representation of large decision-tree-ensembles often created by popular techniques such as bagging, boosting, random forests and many distributed and data stream mining algorithms.
Abstract: This paper introduces orthogonal decision trees that offer an effective way to construct a redundancy-free, accurate, and meaningful representation of large decision-tree-ensembles often created by popular techniques such as bagging, boosting, random forests and many distributed and data stream mining algorithms. Orthogonal decision trees are functionally orthogonal to each other and they correspond to the principal components of the underlying function space. This paper offers a technique to construct such trees based on eigen-analysis of the ensemble and offers experimental results to document the performance of orthogonal trees on grounds of accuracy and model complexity.
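
A sketch of the eigen-analysis step only: represent each tree by its predictions on a reference sample and extract principal components of that function space. The paper's further step of converting components back into usable trees is not shown, and the bagged forest here is merely a convenient example ensemble.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
ensemble = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                             random_state=0).fit(X, y)

# One column per tree in the "function space" matrix F: each tree is
# represented by its predictions on a reference sample.
F = np.stack([t.predict(X) for t in ensemble.estimators_], axis=1)

# Eigen-analysis of the ensemble: a few principal components capture
# most of its functional variation, exposing the redundancy.
pca = PCA(n_components=10).fit(F)
print(pca.explained_variance_ratio_.cumsum())
```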

Proceedings ArticleDOI
22 Aug 2004
TL;DR: This study investigates support vector machines (SVM) and random forests (RF) for outlier detection, variable selection and classification, and links the selected predictors to the biochemistry of the disease.
Abstract: Metabolomics is the "omics" science of biochemistry. The associated data include the quantitative measurements of all small molecule metabolites in a biological sample. These datasets provide a window into dynamic biochemical networks and, conjointly with other "omic" data on genes and proteins, have great potential to unravel complex human diseases. The dataset used in this study has 63 individuals, normal and diseased, and the diseased are drug treated or not, so there are three classes. The goal is to classify these individuals using the observed metabolite levels for 317 measured metabolites. There are a number of statistical challenges: non-normal data; the number of samples is less than the number of metabolites; there are missing data, and the fact that data are missing is informative (assay values below detection limits can point to a specific class); and there are high correlations among the metabolites. We investigate support vector machines (SVM) and random forests (RF) for outlier detection, variable selection and classification. We use the variables selected with RF in SVM and vice versa. The benefit of this study is insight into the interplay of variable selection and classification methods. We link our selected predictors to the biochemistry of the disease.

01 Jan 2004
TL;DR: In this paper, the authors evaluated three statistical models: Regression Tree Analysis (RTA), Bagging Trees (BT) and Random Forest (RF) for their utility in predicting the distributions of four tree species under current and future climate.
Abstract: More and better machine learning tools are becoming available for landscape ecologists to aid in understanding species-environment relationships and to map probable species occurrence now and potentially into the future. To that end, we evaluated three statistical models: Regression Tree Analysis (RTA), Bagging Trees (BT) and Random Forest (RF) for their utility in predicting the distributions of four tree species under current and future climate. RTA's single tree was the easiest to interpret but is less accurate compared to BT and RF, which use multiple regression trees with resampling and resampling-randomisation respectively. Future estimates of suitable habitat following climate change were also improved with BT and RF, with a slight edge to RF because it better smoothes the outputs in a logical gradient fashion. We recommend widespread use of these tools for GIS-based vegetation mapping.

Journal ArticleDOI
TL;DR: A novel interactive pattern analysis method is presented that reduces relevance feedback to a two-class classification problem and classifies multimedia objects as relevant or irrelevant and proposes two online pattern classification methods that adapt a composite classifier known as random forests for relevance feedback.
Abstract: Relevance feedback is a mechanism to interactively learn a user's query concept online. It has been extensively used to improve the performance of multimedia information retrieval. In this paper, we present a novel interactive pattern analysis method that reduces relevance feedback to a two-class classification problem and classifies multimedia objects as relevant or irrelevant. To perform interactive pattern analysis, we propose two online pattern classification methods, called interactive random forests (IRF) and adaptive random forests (ARF), that adapt a composite classifier known as random forests for relevance feedback. IRF improves the efficiency of regular random forests (RRF) with a novel two-level resampling technique called biased random sample reduction, while ARF boosts the performance of RRF with two adaptive learning techniques called dynamic feature extraction and adaptive sample selection. During interactive multimedia retrieval, both ARF and IRF run two to three times faster than RRF while achieving comparable precision and recall against the latter. Extensive experiments on a COREL image set (with 31,438 images) demonstrate that our methods (i.e., IRF and ARF) achieve at least a 20% improvement on average precision and recall over the state-of-the-art approaches.

Book ChapterDOI
20 Sep 2004
TL;DR: An experimental study of the performance of three machine learning algorithms applied to the difficult problem of galaxy classification using the Naive Bayes classifier, the rule-induction algorithm C4.5 and a recently introduced classifier named random forest.
Abstract: In this paper we present an experimental study of the performance of three machine learning algorithms applied to the difficult problem of galaxy classification. We use the Naive Bayes classifier, the rule-induction algorithm C4.5 and a recently introduced classifier named random forest (RF). We first employ image processing to standardize the images, eliminating the effects of orientation and scale, then perform principal component analysis to reduce the dimensionality of the data, and finally, classify the galaxy images. Our experiments show that RF obtains the best results considering three, five and seven galaxy types.
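
The processing pipeline described, dimensionality reduction by PCA followed by a random-forest classifier, is straightforward to sketch; the digits data set below is only a stand-in for the standardized galaxy images, and the component count is illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Stand-in for standardized galaxy images: reduce dimensionality with
# PCA, then classify with a random forest.
X, y = load_digits(return_X_y=True)
model = Pipeline([
    ("pca", PCA(n_components=20)),
    ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
])
print(cross_val_score(model, X, y, cv=5).mean())
```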

Book ChapterDOI
09 Jun 2004
TL;DR: Experimental results with small samples show that random aggregating, implemented through subsampled bagging, reduces the variance component of the error by about 90%, while bagging, as expected, achieves a lower reduction.
Abstract: Bagging can be interpreted as an approximation of random aggregating, an ideal ensemble method by which base learners are trained using data sets randomly drawn according to an unknown probability distribution. An approximate realization of random aggregating can be obtained through subsampled bagging, when large training sets are available. In this paper we perform an experimental bias-variance analysis of bagged and random aggregated ensembles of Support Vector Machines, in order to quantitatively evaluate their theoretical variance-reduction properties. Experimental results with small samples show that random aggregating, implemented through subsampled bagging, reduces the variance component of the error by about 90%, while bagging, as expected, achieves a lower reduction. Bias-variance analysis also explains why ensemble methods based on subsampling techniques can be successfully applied to large data mining problems.
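
Subsampled bagging can be emulated by drawing small training fractions without replacement; a sketch comparing it with ordinary bagging for SVM base learners on synthetic data (the 10% subsample fraction is an illustrative choice, not the paper's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Ordinary bagging: bootstrap samples the size of the training set.
bagged = BaggingClassifier(SVC(), n_estimators=50, random_state=0)

# Subsampled bagging (approximating random aggregating): each SVM sees
# a small subsample drawn without replacement.
subsampled = BaggingClassifier(SVC(), n_estimators=50, max_samples=0.1,
                               bootstrap=False, random_state=0)

for name, model in [("bagging", bagged), ("subsampled", subsampled)]:
    print(name, cross_val_score(model, X, y, cv=3).mean())
```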