
Showing papers on "Random forest published in 2008"


Journal ArticleDOI
TL;DR: This study provides a working guide to boosted regression trees (BRT), an ensemble method for fitting statistical models that differs fundamentally from conventional techniques that aim to fit a single parsimonious model.
Abstract: Summary. 1. Ecologists use statistical models for both explanation and prediction, and need techniques that are flexible enough to express typical features of their data, such as nonlinearities and interactions. 2. This study provides a working guide to boosted regression trees (BRT), an ensemble method for fitting statistical models that differs fundamentally from conventional techniques that aim to fit a single parsimonious model. Boosted regression trees combine the strengths of two algorithms: regression trees (models that relate a response to their predictors by recursive binary splits) and boosting (an adaptive method for combining many simple models to give improved predictive performance). The final BRT model can be understood as an additive regression model in which individual terms are simple trees, fitted in a forward, stagewise fashion. 3. Boosted regression trees incorporate important advantages of tree-based methods, handling different types of predictor variables and accommodating missing data. They have no need for prior data transformation or elimination of outliers, can fit complex nonlinear relationships, and automatically handle interaction effects between predictors. Fitting multiple trees in BRT overcomes the biggest drawback of single tree models: their relatively poor predictive performance. Although BRT models are complex, they can be summarized in ways that give powerful ecological insight, and their predictive performance is superior to most traditional modelling methods. 4. The unique features of BRT raise a number of practical issues in model fitting. We demonstrate the practicalities and advantages of using BRT through a distributional analysis of the short-finned eel (Anguilla australis Richardson), a native freshwater fish of New Zealand. We use a data set of over 13,000 sites to illustrate effects of several settings, and then fit and interpret a model using a subset of the data. We provide code and a tutorial to enable the wider use of BRT by ecologists.
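
The paper ships its own code and tutorial; as a rough Python analogue of the forward stagewise fitting it describes, here is a minimal sketch using scikit-learn's GradientBoostingRegressor, with synthetic data and illustrative settings standing in for the eel analysis:

```python
# Minimal boosted-regression-tree sketch; settings (tree depth, learning rate,
# subsampling) are illustrative, not the paper's recommended values.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # stand-in for site predictors
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each term of the additive model is a small tree, fitted forward stagewise;
# subsample < 1 gives the stochastic variant of boosting.
brt = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                max_depth=3, subsample=0.5, random_state=0)
brt.fit(X_train, y_train)
print("R^2 on held-out data:", brt.score(X_test, y_test))
```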

4,787 citations


Proceedings ArticleDOI
15 Dec 2008
TL;DR: The use of isolation enables the proposed method, iForest, to exploit sub-sampling to an extent that is not feasible in existing methods, creating an algorithm which has a linear time complexity with a low constant and a low memory requirement.
Abstract: Most existing model-based approaches to anomaly detection construct a profile of normal instances, then identify instances that do not conform to the normal profile as anomalies. This paper proposes a fundamentally different model-based method that explicitly isolates anomalies instead of profiling normal points. To the best of our knowledge, the concept of isolation has not been explored in the current literature. The use of isolation enables the proposed method, iForest, to exploit sub-sampling to an extent that is not feasible in existing methods, creating an algorithm which has a linear time complexity with a low constant and a low memory requirement. Our empirical evaluation shows that iForest compares favourably with ORCA (a near-linear time complexity distance-based method), LOF, and random forests in terms of AUC and processing time, especially on large data sets. iForest also works well in high-dimensional problems with a large number of irrelevant attributes, and in situations where the training set does not contain any anomalies.
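
The iForest algorithm is available in scikit-learn as IsolationForest; a minimal sketch on synthetic data, with an illustrative sub-sample size:

```python
# Isolation-based anomaly detection: anomalies are isolated in few random
# splits, so they receive short average path lengths and low scores; no
# profile of normal instances is built.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(1000, 2))
anomalies = rng.uniform(-6, 6, size=(20, 2))
X = np.vstack([normal, anomalies])

iforest = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
iforest.fit(X)
scores = iforest.decision_function(X)               # lower = more anomalous
print("mean score, normal vs anomalous:",
      scores[:1000].mean(), scores[1000:].mean())
```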

3,455 citations


Journal ArticleDOI
TL;DR: Random survival forests (RSF), as discussed by the authors, is a random forests method for the analysis of right-censored survival data; a conservation-of-events principle is used to define ensemble mortality as a predicted outcome.
Abstract: We introduce random survival forests, a random forests method for the analysis of right-censored survival data. New survival splitting rules for growing survival trees are introduced, as is a new missing data algorithm for imputing missing data. A conservation-of-events principle for survival forests is introduced and used to define ensemble mortality, a simple interpretable measure of mortality that can be used as a predicted outcome. Several illustrative examples are given, including a case study of the prognostic implications of body mass for individuals with coronary artery disease. Computations for all examples were implemented using the freely available R-software package, randomSurvivalForest.
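
The paper's implementation is the R package randomSurvivalForest; as a hedged Python analogue, the scikit-survival package (an assumption here, not part of the paper) provides a RandomSurvivalForest, sketched on synthetic right-censored data:

```python
# Sketch only: synthetic covariates and survival times; the predictions play
# the role of the paper's ensemble mortality measure.
import numpy as np
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                       # covariates, e.g. body mass
time = rng.exponential(scale=np.exp(X[:, 0]))       # survival times
event = rng.random(200) < 0.7                       # False = right-censored

y = Surv.from_arrays(event=event, time=time)        # structured survival outcome
rsf = RandomSurvivalForest(n_estimators=200, random_state=0)
rsf.fit(X, y)
print(rsf.predict(X[:5]))                           # higher = higher predicted risk
```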

1,562 citations




Journal ArticleDOI
TL;DR: The RF methodology is attractive for use in classification problems when the goals of the study are to produce an accurate classifier and to provide insight regarding the discriminative ability of individual predictor variables.

854 citations


Journal ArticleDOI
TL;DR: Two tree-based ensemble classification algorithms, AdaBoost and Random Forest, are assessed in terms of standard classification accuracy, training time, and classification stability; both outperform a neural network classifier in dealing with hyperspectral data.

720 citations


Journal ArticleDOI
TL;DR: Both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines, both when no gene selection is performed and when several popular gene selection methods are used.
Abstract: Cancer diagnosis and clinical outcome prediction are among the most important emerging applications of gene expression microarray technology, with several molecular signatures on their way toward clinical deployment. Use of the most accurate classification algorithms available for microarray gene expression data is a critical ingredient in order to develop the best possible molecular signatures for patient care. As suggested by a large body of literature to date, support vector machines can be considered "best of class" algorithms for classification of such data. Recent work, however, suggests that random forest classifiers may outperform support vector machines in this domain. In the present paper we identify methodological biases of prior work comparing random forests and support vector machines and conduct a new rigorous evaluation of the two algorithms that corrects these limitations. Our experiments use 22 diagnostic and prognostic datasets and show that support vector machines outperform random forests, often by a large margin. Our results also underline the importance of sound research design in benchmarking and comparison of bioinformatics algorithms. We found that, both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines, both when no gene selection is performed and when several popular gene selection methods are used.
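
A minimal sketch of such a comparison with scikit-learn, on synthetic data of microarray-like shape (many features, few samples); the protocol and settings are illustrative, not the paper's exact design:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Microarray-like shape: few samples, thousands of features, few informative.
X, y = make_classification(n_samples=100, n_features=2000, n_informative=20,
                           random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for name, clf in [("SVM", SVC(kernel="linear", C=1.0)),
                  ("RF", RandomForestClassifier(n_estimators=500, random_state=0))]:
    print(name, cross_val_score(clf, X, y, cv=cv).mean())
```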

616 citations


Proceedings ArticleDOI
05 Jul 2008
TL;DR: Surprisingly, the method that performs consistently well across all dimensions is random forests, followed by neural nets, boosted trees, and SVMs; as dimensionality increases, the relative performance of the learning algorithms changes.
Abstract: In this paper we perform an empirical evaluation of supervised learning on high-dimensional data. We evaluate performance on three metrics: accuracy, AUC, and squared loss and study the effect of increasing dimensionality on the performance of the learning algorithms. Our findings are consistent with previous studies for problems of relatively low dimension, but suggest that as dimensionality increases the relative performance of the learning algorithms changes. To our surprise, the method that performs consistently well across all dimensions is random forests, followed by neural nets, boosted trees, and SVMs.

555 citations


Journal Article
TL;DR: A number of theorems are given that establish the universal consistency of averaging rules, and it is shown that some popular classifiers, including one suggested by Breiman, are not universally consistent.
Abstract: In the last years of his life, Leo Breiman promoted random forests for use in classification. He suggested using averaging as a means of obtaining good discrimination rules. The base classifiers used for averaging are simple and randomized, often based on random samples from the data. He left a few questions unanswered regarding the consistency of such rules. In this paper, we give a number of theorems that establish the universal consistency of averaging rules. We also show that some popular classifiers, including one suggested by Breiman, are not universally consistent.

521 citations


Journal ArticleDOI
01 Sep 2008
TL;DR: The experimental results demonstrate that the performance provided by the proposed misuse approach is better than the best KDD'99 result; compared to other reported unsupervised anomaly detection approaches, the anomaly detection approach achieves a higher detection rate when the false positive rate is low; and the presented hybrid system can improve the overall performance of the aforementioned IDSs.
Abstract: Completely preventing security breaches with existing security technologies is unrealistic. As a result, intrusion detection is an important component in network security. However, many current intrusion detection systems (IDSs) are rule-based systems, which have limited ability to detect novel intrusions. Moreover, encoding rules is time-consuming and depends heavily on knowledge of known intrusions. Therefore, we propose new systematic frameworks that apply a data mining algorithm called random forests in misuse, anomaly, and hybrid-network-based IDSs. In misuse detection, patterns of intrusions are built automatically by the random forests algorithm over training data. After that, intrusions are detected by matching network activities against the patterns. In anomaly detection, novel intrusions are detected by the outlier detection mechanism of the random forests algorithm. After building the patterns of network services by the random forests algorithm, outliers related to the patterns are determined by the outlier detection algorithm. The hybrid detection system improves the detection performance by combining the advantages of the misuse and anomaly detection. We evaluate our approaches over the Knowledge Discovery and Data Mining 1999 (KDD'99) dataset. The experimental results demonstrate that the performance provided by the proposed misuse approach is better than the best KDD'99 result; compared to other reported unsupervised anomaly detection approaches, our anomaly detection approach achieves a higher detection rate when the false positive rate is low; and the presented hybrid system can improve the overall performance of the aforementioned IDSs.
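
A sketch of the misuse-detection step only, with scikit-learn and placeholder features standing in for KDD'99 connection records; the anomaly and hybrid components are omitted:

```python
# Misuse detection: a random forest learns intrusion patterns from labeled
# records, then new activity is matched against them. The features and label
# rule below are placeholders, not actual KDD'99 fields.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 10))               # connection features
y_train = (X_train[:, 0] + X_train[:, 1] > 1).astype(int)  # 0 = normal, 1 = intrusion

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)                        # patterns built automatically

X_new = rng.normal(size=(3, 10))                    # incoming network activity
print(forest.predict(X_new))
```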

440 citations


Book
01 Jan 2008
TL;DR: This book treats statistical learning as a regression problem, covering regression splines and smoothers, classification and regression trees (CART), bagging, random forests, boosting, and support vector machines.
Abstract: Contents: Statistical Learning as a Regression Problem; Regression Splines and Regression Smoothers; Classification and Regression Trees (CART); Bagging; Random Forests; Boosting; Support Vector Machines; Broader Implications and a Bit of Craft Lore.

Journal ArticleDOI
TL;DR: Several kinds of decision trees for finding active objects from multi-wavelength data (REPTree, Random Tree, Decision Stump, Random Forest, J48, NBTree, and ADTree) are described; experimental results show that ADTree is the best in terms of accuracy alone.

Journal ArticleDOI
TL;DR: It is demonstrated that the generated explanations closely follow the learned models and a visualization technique is presented that shows the utility of the approach and enables the comparison of different prediction methods.
Abstract: We present a method for explaining predictions for individual instances. The presented approach is general and can be used with all classification models that output probabilities. It is based on the decomposition of a model's prediction into the individual contributions of each attribute. Our method works for the so-called black box models such as support vector machines, neural networks, and nearest neighbor algorithms, as well as for ensemble methods such as boosting and random forests. We demonstrate that the generated explanations closely follow the learned models and present a visualization technique that shows the utility of our approach and enables the comparison of different prediction methods.
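
A simplified sketch of the decomposition idea: estimate an attribute's contribution for one instance as the change in predicted probability when that attribute is marginalized out by resampling from the training data. This illustrates the principle, not the authors' exact estimator:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

def contributions(model, X_train, x, n_samples=50, seed=0):
    """Per-attribute contributions for one instance x (simplified estimator)."""
    rng = np.random.default_rng(seed)
    base = model.predict_proba(x[None, :])[0, 1]    # prediction with all attributes
    out = np.empty(len(x))
    for i in range(len(x)):
        xs = np.tile(x, (n_samples, 1))
        xs[:, i] = rng.choice(X_train[:, i], size=n_samples)  # marginalize attribute i
        out[i] = base - model.predict_proba(xs)[:, 1].mean()
    return out                                      # > 0: attribute pushed toward class 1

print(np.round(contributions(model, X, X[0]), 3))
```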

Proceedings ArticleDOI
01 Jan 2008
TL;DR: It is shown that the ability of Random Forests to combine multiple features leads to a further increase in performance when textons, colour, filterbanks, and HOG features are used simultaneously.
Abstract: This work investigates the use of Random Forests for class based pixel-wise segmentation of images. The contribution of this paper is three-fold. First, we show that apparently quite dissimilar classifiers (such as nearest neighbour matching to texton class histograms) can be mapped onto a Random Forest architecture. Second, based on this insight, we show that the performance of such classifiers can be improved by incorporating the spatial context and discriminative learning that arises naturally in the Random Forest framework. Finally, we show that the ability of Random Forests to combine multiple features leads to a further increase in performance when textons, colour, filterbanks, and HOG features are used simultaneously. The benefit of the multi-feature classifier is demonstrated with extensive experimentation on existing labelled image datasets. The method equals or exceeds the state of the art on these datasets.

Journal ArticleDOI
TL;DR: Results reveal the importance of feature selection in accurately classifying new samples, and show that an integrated feature selection and classification algorithm performs well and is capable of identifying significant genes.
Abstract: Several classification and feature selection methods have been studied for the identification of differentially expressed genes in microarray data. Classification methods such as SVM, RBF Neural Nets, MLP Neural Nets, Bayesian, Decision Tree and Random Forest methods have been used in recent studies. The accuracy of these methods has been calculated with validation methods such as v-fold validation. However, there is a lack of comparison between these methods to find a better framework for classification, clustering and analysis of microarray gene expression results. In this study, we compared the efficiency of the classification methods SVM, RBF Neural Nets, MLP Neural Nets, Bayesian, Decision Tree and Random Forest. The v-fold cross-validation was used to calculate the accuracy of the classifiers. Some of the common clustering methods, including K-means, DBC, and EM clustering, were applied to the datasets and the efficiency of these methods has been analysed. Further, the efficiency of the feature selection methods support vector machine recursive feature elimination (SVM-RFE), Chi Squared, and CFS was compared. In each case these methods were applied to eight different binary (two class) microarray datasets. We evaluated the class prediction efficiency of each gene list in training and test cross-validation using supervised classifiers. We presented a study in which we compared some of the commonly used classification, clustering, and feature selection methods. We applied these methods to eight publicly available datasets, and compared how these methods performed in class prediction of test datasets. We reported that the choice of feature selection method, the number of genes in the gene list, and the number of cases (samples) substantially influence classification success. Based on features chosen by these methods, error rates and accuracy of several classification algorithms were obtained. Results revealed the importance of feature selection in accurately classifying new samples, and showed that an integrated feature selection and classification algorithm is capable of identifying significant genes.
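
One of the compared feature-selection methods, SVM-RFE, sketched with scikit-learn on synthetic two-class data; note that, unlike a proper protocol, selecting features on the full data before cross-validation is optimistically biased:

```python
# SVM-RFE: recursively eliminate features using a linear SVM's weights,
# then classify on the reduced gene list. Sizes are illustrative.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=80, n_features=1000, n_informative=10,
                           random_state=0)
rfe = RFE(SVC(kernel="linear"), n_features_to_select=50, step=0.1)
X_sel = rfe.fit_transform(X, y)                     # reduced gene list
print("CV accuracy on selected genes:",
      cross_val_score(SVC(kernel="linear"), X_sel, y, cv=5).mean())
```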

Book ChapterDOI
Toby Sharp
12 Oct 2008
TL;DR: A method for implementing the evaluation and training of decision trees and forests entirely on a GPU is described, and it is shown how this method can be used in the context of object recognition.
Abstract: We describe a method for implementing the evaluation and training of decision trees and forests entirely on a GPU, and show how this method can be used in the context of object recognition.

Journal ArticleDOI
TL;DR: This work proposes a novel, yet simple, adjustment that has demonstrably superior performance: choose the eligible subsets at each node by weighted random sampling instead of simple random sampling, with the weights tilted in favor of the informative features.
Abstract: Although the random forest classification procedure works well in datasets with many features, when the number of features is huge and the percentage of truly informative features is small, such as with DNA microarray data, its performance tends to decline significantly. In such instances, the procedure can be improved by reducing the contribution of trees whose nodes are populated by non-informative features. To some extent, this can be achieved by prefiltering, but we propose a novel, yet simple, adjustment that has demonstrably superior performance: choose the eligible subsets at each node by weighted random sampling instead of simple random sampling, with the weights tilted in favor of the informative features. This results in an ‘enriched random forest’. We illustrate the superior performance of this procedure in several actual microarray datasets. Contact: damaratu@prdus.jnj.com
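
A hand-rolled sketch of the enrichment idea: sample each tree's eligible features with weights derived from univariate F-scores rather than uniformly. For brevity the weighting here is per tree rather than per node, unlike the paper:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=1000, n_informative=10,
                           random_state=0)
f_scores, _ = f_classif(X, y)
weights = f_scores / f_scores.sum()                 # tilt toward informative features

rng = np.random.default_rng(0)
trees, subsets = [], []
for _ in range(100):
    cols = rng.choice(X.shape[1], size=50, replace=False, p=weights)
    boot = rng.integers(0, len(y), size=len(y))     # bootstrap sample
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[boot][:, cols], y[boot]))
    subsets.append(cols)

votes = np.mean([t.predict(X[:, c]) for t, c in zip(trees, subsets)], axis=0)
print("training accuracy:", ((votes > 0.5) == y).mean())
```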

Journal ArticleDOI
TL;DR: It is shown that the classification of multilevel-multisource data sets with SVM and RF is feasible and does not require a definition of ideal aggregation levels.
Abstract: A strategy for the joint classification of multiple segmentation levels from multisensor imagery is introduced by using synthetic aperture radar and optical data. At first, the two data sets are separately segmented, creating independent aggregation levels at different scales. Each individual level from the two sensors is then preclassified by a support vector machine (SVM). The original outputs of each SVM, i.e., images showing the distances of the pixels to the hyperplane fitted by the SVM, are used in a decision fusion to determine the final classes. The fusion strategy is based on the application of an additional classifier, which is applied on the preclassification results. Both a second SVM and random forests (RF) were tested for the decision fusion. The results are compared with SVM and RF applied to the full data set without preclassification. Both the integration of multilevel information and the use of multisensor imagery increase the overall accuracy. It is shown that the classification of multilevel-multisource data sets with SVM and RF is feasible and does not require a definition of ideal aggregation levels. The proposed decision fusion approach that applies RF to the preclassification outperforms all other approaches.
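
A minimal sketch of the decision-fusion step: per-source SVMs emit hyperplane distances via decision_function, and a random forest fuses them into final classes; the two synthetic feature blocks stand in for the SAR and optical segmentation levels:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
level1, level2 = X[:, :10], X[:, 10:]               # stand-ins for two levels

# Pre-classify each level; keep the raw hyperplane distances, not hard labels.
d1 = SVC(kernel="rbf").fit(level1, y).decision_function(level1)
d2 = SVC(kernel="rbf").fit(level2, y).decision_function(level2)
fused = np.hstack([d1, d2])

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(fused, y)
print("fusion training accuracy:", rf.score(fused, y))
```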

Proceedings ArticleDOI
23 Jun 2008
TL;DR: This paper addresses human pose recognition from video sequences by formulating it as a classification problem, proposing a pose detection algorithm based on random forests with Histograms of Orientated Gradient (HOG) as features.
Abstract: This paper addresses human pose recognition from video sequences by formulating it as a classification problem. Unlike much previous work we do not make any assumptions on the availability of clean segmentation. The first step of this work consists in a novel method of aligning the training images using 3D Mocap data. Next we define classes by discretizing a 2D manifold whose two dimensions are camera viewpoint and actions. Our main contribution is a pose detection algorithm based on random forests. A bottom-up approach is followed to build a decision tree by recursively clustering and merging the classes at each level. For each node of the decision tree we build a list of potentially discriminative features using the alignment of training images; in this paper we consider Histograms of Orientated Gradient (HOG). We finally grow an ensemble of trees by randomly sampling one of the selected HOG blocks at each node. Our proposed approach gives promising results with both fixed and moving cameras.

Journal ArticleDOI
TL;DR: This paper proposes the Random MultiNomial Logit (RMNL), i.e. a random forest of MNLs, and compares its predictive performance to that of (a) MNL with expert feature selection and (b) Random Forests of classification trees; the results indicate a substantial increase in model accuracy for the RMNL model.
Abstract: Several supervised learning algorithms are suited to classify instances into a multiclass value space. MultiNomial Logit (MNL) is recognized as a robust classifier and is commonly applied within the CRM (Customer Relationship Management) domain. Unfortunately, to date, it is unable to handle huge feature spaces typical of CRM applications. Hence, the analyst is forced to immerse himself in feature selection. Surprisingly, in sharp contrast with binary logit, current software packages lack any feature-selection algorithm for MultiNomial Logit. Conversely, Random Forests, another algorithm for learning multiclass problems, is, like MNL, robust, but unlike MNL it easily handles high-dimensional feature spaces. This paper investigates the potential of applying the Random Forests principles to the MNL framework. We propose the Random MultiNomial Logit (RMNL), i.e. a random forest of MNLs, and compare its predictive performance to that of (a) MNL with expert feature selection and (b) Random Forests of classification trees. We illustrate the Random MultiNomial Logit on a cross-sell CRM problem within the home-appliances industry. The results indicate a substantial increase in model accuracy of the RMNL model over that of the MNL model with expert feature selection.
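
A sketch of the RMNL idea: a 'forest' of multinomial logit models, each fitted on a bootstrap sample and a random feature subset, with predicted probabilities averaged; all sizes are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=200, n_informative=15,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
rng = np.random.default_rng(0)
models, subsets = [], []
for _ in range(50):
    cols = rng.choice(X.shape[1], size=20, replace=False)   # random feature subset
    boot = rng.integers(0, len(y), size=len(y))             # bootstrap sample
    mnl = LogisticRegression(max_iter=1000).fit(X[boot][:, cols], y[boot])
    models.append(mnl); subsets.append(cols)

proba = np.mean([m.predict_proba(X[:, c]) for m, c in zip(models, subsets)], axis=0)
print("training accuracy:", (proba.argmax(axis=1) == y).mean())
```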

Proceedings ArticleDOI
01 Dec 2008
TL;DR: The results indicate that random forests do poorly when faced with irrelevant attributes, while the heterogeneous technique handles them robustly; it is also shown that large ensembles of random trees are more susceptible to diminishing returns than the heterogeneous technique.
Abstract: Using decision trees that split on randomly selected attributes is one way to increase the diversity within an ensemble of decision trees. Another approach increases diversity by combining multiple tree algorithms. The random forest approach has become popular because it is simple and yields good results with common datasets. We present a technique that combines heterogeneous tree algorithms and contrast it with homogeneous forest algorithms. Our results indicate that random forests do poorly when faced with irrelevant attributes, while our heterogeneous technique handles them robustly. Further, we show that large ensembles of random trees are more susceptible to diminishing returns than our technique. We are able to obtain better results across a large number of common datasets with a significantly smaller ensemble.

Journal ArticleDOI
TL;DR: In this paper, a measure of variable importance is defined as the total heterogeneity reduction produced by a given covariate on the response variable when the sample space is recursively partitioned.
Abstract: This article considers a measure of variable importance frequently used in variable-selection methods based on decision trees and tree-based ensemble models. These models include CART, random forests, and the gradient boosting machine. The measure of variable importance is defined as the total heterogeneity reduction produced by a given covariate on the response variable when the sample space is recursively partitioned. Despite its popularity, some authors have shown that this measure is biased to the extent that, under certain conditions, there may be dangerous effects on variable selection. Here we present a simple and effective method for bias correction, focusing on the easily generalizable case of the Gini index as a measure of heterogeneity.
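
The bias can be seen by contrasting impurity-based (Gini) importance with permutation importance on a toy example in which a continuous noise feature competes with a truly predictive binary one; this demonstrates the problem the article addresses, not its correction method:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 1000
informative = rng.integers(0, 2, size=n)            # binary, truly predictive
noise = rng.random(n)                               # continuous, pure noise
flip = rng.random(n) < 0.1                          # 10% label noise
y = np.where(flip, 1 - informative, informative)
X = np.column_stack([informative, noise])

# The noise feature offers many candidate split points, inflating its Gini
# importance; permutation importance shows it is useless.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("Gini importances:       ", rf.feature_importances_)
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print("Permutation importances:", perm.importances_mean)
```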

Posted Content
TL;DR: This paper aims at confirming known but sparse advice for using random forests and at proposing some complementary remarks, both for standard problems and for high-dimensional ones in which the number of variables hugely exceeds the sample size.
Abstract: This paper examines, from an experimental perspective, random forests, the increasingly used statistical method for classification and regression problems introduced by Leo Breiman in 2001. It first aims at confirming known but sparse advice for using random forests and at proposing some complementary remarks, both for standard problems and for high-dimensional ones in which the number of variables hugely exceeds the sample size. But the main contribution of this paper is twofold: to provide some insights about the behavior of the variable importance index based on random forests and, in addition, to investigate two classical issues of variable selection. The first one is to find important variables for interpretation, and the second one is more restrictive and tries to design a good prediction model. The strategy involves a ranking of explanatory variables using the random forests score of importance and a stepwise ascending variable introduction strategy.
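
A simplified sketch of the two-stage strategy: rank variables by random-forest importance, then introduce them stepwise and keep the subset with the lowest out-of-bag error; sizes and the search range are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=100, n_informative=8,
                           random_state=0)
# Stage 1: rank variables by the random-forest importance score.
rank = np.argsort(RandomForestClassifier(n_estimators=300, random_state=0)
                  .fit(X, y).feature_importances_)[::-1]

# Stage 2: stepwise ascending variable introduction, scored by OOB error.
best_err, best_k = 1.0, 0
for k in range(1, 31):
    rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
    rf.fit(X[:, rank[:k]], y)
    err = 1 - rf.oob_score_
    if err < best_err:
        best_err, best_k = err, k
print(f"selected {best_k} variables, OOB error {best_err:.3f}")
```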

Journal ArticleDOI
TL;DR: This paper investigates the possibilities of applying the random forests algorithm (RF) in machine fault diagnosis, and proposes a hybrid method combined with genetic algorithm to improve the classification accuracy.
Abstract: This paper investigates the possibilities of applying the random forests algorithm (RF) in machine fault diagnosis, and proposes a hybrid method combined with a genetic algorithm to improve the classification accuracy. The proposed method is based on RF, a novel ensemble classifier which builds a number of decision trees to improve on the single tree classifier. Although there are several existing techniques for fault diagnosis, application research on RF is meaningful and necessary because of its fast execution speed, the characteristics of a tree-based classifier, and its high performance in machine fault diagnosis. The proposed method is demonstrated by a case study on induction motor fault diagnosis. Experimental results indicate the validity and reliability of the RF-based diagnosis method.

Journal ArticleDOI
TL;DR: An ensemble system incorporating majority voting and involving Multilayer Perceptron, Logistic Regression, decision trees, Random Forest, Radial Basis Function, and Support Vector Machine as the constituents is developed to solve the customer credit card churn prediction via data mining.
Abstract: In this paper, we solve the customer credit card churn prediction via data mining. We developed an ensemble system incorporating majority voting and involving Multilayer Perceptron (MLP), Logistic Regression (LR), decision trees (J48), Random Forest (RF), Radial Basis Function (RBF) network and Support Vector Machine (SVM) as the constituents. The dataset was taken from the Business Intelligence Cup organised by the University of Chile in 2004. Since it is a highly unbalanced dataset with 93% loyal and 7% churned customers, we employed (1) undersampling, (2) oversampling, (3) a combination of undersampling and oversampling and (4) the Synthetic Minority Oversampling Technique (SMOTE) for balancing it. Furthermore, tenfold cross-validation was employed. The results indicated that SMOTE achieved good overall accuracy. Also, SMOTE and a combination of undersampling and oversampling improved the sensitivity and overall accuracy in majority voting. In addition, the Classification and Regression Tree (CART) was used for the purpose of feature selection. The reduced feature set was fed to the classifiers mentioned above. Thus, this paper outlines the most important predictor variables in solving the credit card churn prediction problem. Moreover, the rules generated by decision tree J48 act as an early warning expert system.
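
A sketch of two pieces of this pipeline, SMOTE rebalancing and majority voting, using the imbalanced-learn package and a subset of the paper's constituent classifiers with default settings:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, weights=[0.93, 0.07],  # 93/7 imbalance
                           random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)           # rebalance

vote = VotingClassifier([("mlp", MLPClassifier(max_iter=500, random_state=0)),
                         ("lr", LogisticRegression(max_iter=1000)),
                         ("rf", RandomForestClassifier(random_state=0))],
                        voting="hard")                            # majority voting
print("CV accuracy:", cross_val_score(vote, X_res, y_res, cv=5).mean())
```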

DOI
30 Jan 2008
TL;DR: The authors highlight the advantages and limitations of different variable importance scores and associated testing procedures, especially in the context of correlated predictor variables, and investigate the statistical properties of the test of Breiman and Cutler (2008).
Abstract: Random forests have become a widely-used predictive model in many scientific disciplines within the past few years. Additionally, they are increasingly popular for assessing variable importance, e.g., in genetics and bioinformatics. We highlight both advantages and limitations of different variable importance scores and associated testing procedures, especially in the context of correlated predictor variables. For the test of Breiman and Cutler (2008), we investigate the statistical properties and find that the power of the test depends both on the sample size and the number of trees, an arbitrarily chosen tuning parameter, leading to undesired results that nullify any significance judgments. Moreover, the specification of the null hypothesis of this test is discussed in the context of correlated predictor variables.

Book ChapterDOI
10 Jun 2008
TL;DR: This work proposes a hierarchical segmentation procedure based on statistical learning and topology-preserving grouping whose results are close to human labelings of three-dimensional electron-microscopic image stacks with almost isotropic resolution.
Abstract: Three-dimensional electron-microscopic image stacks with almost isotropic resolution make it possible, for the first time, to determine the complete connection matrix of parts of the brain. In spite of major advances in staining, correct segmentation of these stacks remains challenging, because even a few local mistakes can lead to severe global errors. We propose a hierarchical segmentation procedure based on statistical learning and topology-preserving grouping. Edge probability maps are computed by a random forest classifier (trained on hand-labeled data) and partitioned into supervoxels by the watershed transform. Over-segmentation is then resolved by another random forest. Careful validation shows that the results of our algorithm are close to human labelings.

Journal ArticleDOI
TL;DR: An interaction tree (IT) procedure to optimize the subgroup analysis in comparative studies that involve censored survival times and follows the standard CART (Breiman, et al., 1984) methodology to develop the interaction tree structure.
Abstract: We propose an interaction tree (IT) procedure to optimize the subgroup analysis in comparative studies that involve censored survival times. The proposed method recursively partitions the data into two subsets that show the greatest interaction with the treatment, which results in a number of objectively defined subgroups: in some of them the treatment effect is prominent while in others the treatment may have a negligible or even negative effect. The resultant tree structure can be used to explore the overall interaction between treatment and other covariates and help identify and describe possible target populations on which an experimental treatment demonstrates desired efficacy. We follow the standard CART (Breiman et al., 1984) methodology to develop the interaction tree structure. Variable importance information is extracted via random forests of interaction trees. Both simulated experiments and an analysis of the primary biliary cirrhosis (PBC) data are provided for evaluation and illustration of the proposed procedure.

Journal ArticleDOI
TL;DR: The authors used the recently developed statistical method of random forests to obtain a new perspective on variables that are associated with persistence to a science or engineering degree, and compared the results from classification trees and random forests with results from the more commonly used method of logistic regression.
Abstract: Many students who start college intending to major in science or engineering do not graduate, or decide to switch to a non-science major. We used the recently developed statistical method of random forests to obtain a new perspective on variables that are associated with persistence to a science or engineering degree. We describe classification trees and random forests and contrast the results from these methods with results from the more commonly used method of logistic regression. Among the variables available in Arizona State University data, high school and freshman year GPAs have the highest importance for predicting persistence; other variables, such as the number of science and engineering courses taken freshman year, are important for subgroups of the student population. The method used in this study could be employed in other settings to identify faculty practices, teaching methods, and other factors that are associated with high persistence to a degree.

Journal ArticleDOI
TL;DR: This work proposes a novel but simple two-step approach based on random forests and partial least squares (PLS) dimension reduction embedding the idea of pre-validation suggested by Tibshirani and colleagues, which is based on an internal cross-validation for avoiding overfitting.
Abstract: Motivation: In the context of clinical bioinformatics methods are needed for assessing the additional predictive value of microarray data compared to simple clinical parameters alone. Such methods should also provide an optimal prediction rule making use of all potentialities of both types of data: they should ideally be able to catch subtypes which are not identified by clinical parameters alone. Moreover, they should address the question of the additional predictive value of microarray data in a fair framework. Results: We propose a novel but simple two-step approach based on random forests and partial least squares (PLS) dimension reduction embedding the idea of pre-validation suggested by Tibshirani and colleagues, which is based on an internal cross-validation for avoiding overfitting. Our approach is fast, flexible and can be used both for assessing the overall additional significance of the microarray data and for building optimal hybrid classification rules. Its efficiency is demonstrated through simulations and an application to breast cancer and colorectal cancer data. Availability: Our method is implemented in the freely available R package ‘MAclinical’ which can be downloaded from http://www.stat.uni-muenchen.de/~socher/MAclinical Contact: boulesteix@slcmsr.org
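
A hedged sketch of the two-step idea: PLS compresses the expression block, and a random forest is fitted on the components together with the clinical covariates. The pre-validation step the authors emphasize is omitted for brevity (fitting PLS on the full data, as done here, would leak information in a real analysis):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 120
expression = rng.normal(size=(n, 3000))             # microarray block
clinical = rng.normal(size=(n, 5))                  # clinical parameters
y = (clinical[:, 0] + expression[:, 0] > 0).astype(int)

pls = PLSRegression(n_components=3).fit(expression, y)
components = pls.transform(expression)              # step 1: dimension reduction
X = np.hstack([clinical, components])               # step 2: hybrid predictor set

rf = RandomForestClassifier(n_estimators=300, random_state=0)
print("CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```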