
Showing papers on "Random forest published in 2006"


Journal ArticleDOI
TL;DR: It is shown that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy.
Abstract: Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection. We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated and nine microarray data sets we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy. Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.

2,610 citations
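
To make the selection procedure concrete, here is a minimal sketch of importance-based backward gene elimination with a random forest. It uses scikit-learn on synthetic toy data; the 20% drop fraction and the stopping size are illustrative assumptions, not the paper's tuned parameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))          # 60 samples, 500 "genes" (toy data)
y = rng.integers(0, 2, size=60)         # two-class labels

genes = np.arange(X.shape[1])
while len(genes) > 10:                  # illustrative stopping size
    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(X[:, genes], y)
    # Drop the least important 20% of the remaining genes each round.
    order = np.argsort(rf.feature_importances_)    # ascending importance
    n_drop = max(1, int(0.2 * len(genes)))
    genes = genes[order[n_drop:]]

print("selected genes:", sorted(genes.tolist()))
```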


Proceedings ArticleDOI
25 Jun 2006
TL;DR: A large-scale empirical comparison between ten supervised learning methods: SVMs, neural nets, logistic regression, naive bayes, memory-based learning, random forests, decision trees, bagged trees, boosted trees, and boosted stumps is presented.
Abstract: A number of supervised learning methods have been introduced in the last decade. Unfortunately, the last comprehensive empirical evaluation of supervised learning was the Statlog Project in the early 90's. We present a large-scale empirical comparison between ten supervised learning methods: SVMs, neural nets, logistic regression, naive bayes, memory-based learning, random forests, decision trees, bagged trees, boosted trees, and boosted stumps. We also examine the effect that calibrating the models via Platt Scaling and Isotonic Regression has on their performance. An important aspect of our study is the use of a variety of performance criteria to evaluate the learning methods.

2,450 citations
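
As an illustration of the calibration step the study examines, the sketch below applies Platt scaling ("sigmoid") and isotonic regression to a random forest via scikit-learn's CalibratedClassifierCV; the synthetic dataset and model settings are placeholders, not those of the paper.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for method in ("sigmoid", "isotonic"):   # Platt scaling vs. isotonic regression
    calibrated = CalibratedClassifierCV(
        RandomForestClassifier(n_estimators=200, random_state=0),
        method=method, cv=5)
    calibrated.fit(X_tr, y_tr)
    print(method, calibrated.predict_proba(X_te)[:3, 1])  # calibrated P(class 1)
```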


Journal ArticleDOI
TL;DR: In this article, the authors evaluated four statistical models (Regression Tree Analysis (RTA), Bagging Trees (BT), Random Forests (RF), and Multivariate Adaptive Regression Splines (MARS) for predictive vegetation mapping under current and future climate scenarios according to the Canadian Climate Centre global circulation model.
Abstract: The task of modeling the distribution of a large number of tree species under future climate scenarios presents unique challenges. First, the model must be robust enough to handle climate data outside the current range without producing unacceptable instability in the output. In addition, the technique should have automatic search mechanisms built in to select the most appropriate values for input model parameters for each species so that minimal effort is required when these parameters are fine-tuned for individual tree species. We evaluated four statistical models—Regression Tree Analysis (RTA), Bagging Trees (BT), Random Forests (RF), and Multivariate Adaptive Regression Splines (MARS)—for predictive vegetation mapping under current and future climate scenarios according to the Canadian Climate Centre global circulation model. To test, we applied these techniques to four tree species common in the eastern United States: loblolly pine (Pinus taeda), sugar maple (Acer saccharum), American beech (Fagus grandifolia), and white oak (Quercus alba). When the four techniques were assessed with Kappa and fuzzy Kappa statistics, RF and BT were superior in reproducing current importance value (a measure of basal area in addition to abundance) distributions for the four tree species, as derived from approximately 100,000 USDA Forest Service’s Forest Inventory and Analysis plots. Future estimates of suitable habitat after climate change were visually more reasonable with BT and RF, with slightly better performance by RF as assessed by Kappa statistics, correlation estimates, and spatial distribution of importance values. Although RTA did not perform as well as BT and RF, it provided interpretive models for species whose distributions were captured well by our current set of predictors. MARS was adequate for predicting current distributions but unacceptable for future climate. We consider RTA, BT, and RF modeling approaches, especially when used together to take advantage of their individual strengths, to be robust for predictive mapping and recommend their inclusion in the ecological toolbox.

1,879 citations
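
A minimal sketch of how such a model comparison can be set up, under simplifying assumptions: synthetic data stand in for the FIA plots, cross-validated R² replaces the Kappa-based map evaluation, and MARS is omitted because scikit-learn has no implementation.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
models = {
    "RTA (single regression tree)": DecisionTreeRegressor(random_state=0),
    "BT (bagged trees)": BaggingRegressor(DecisionTreeRegressor(),
                                          n_estimators=100, random_state=0),
    "RF (random forest)": RandomForestRegressor(n_estimators=100, random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())   # mean R^2
```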


Journal ArticleDOI
TL;DR: This work examined the rotation forest ensemble on a random selection of 33 benchmark data sets from the UCI repository, compared it with bagging, AdaBoost, and random forest, and prompted an investigation into the diversity-accuracy landscape of the ensemble models.
Abstract: We propose a method for generating classifier ensembles based on feature extraction. To create the training data for a base classifier, the feature set is randomly split into K subsets (K is a parameter of the algorithm) and principal component analysis (PCA) is applied to each subset. All principal components are retained in order to preserve the variability information in the data. Thus, K axis rotations take place to form the new features for a base classifier. The idea of the rotation approach is to encourage simultaneously individual accuracy and diversity within the ensemble. Diversity is promoted through the feature extraction for each base classifier. Decision trees were chosen here because they are sensitive to rotation of the feature axes, hence the name "forest". Accuracy is sought by keeping all principal components and also using the whole data set to train each base classifier. Using WEKA, we examined the rotation forest ensemble on a random selection of 33 benchmark data sets from the UCI repository and compared it with bagging, AdaBoost, and random forest. The results were favorable to rotation forest and prompted an investigation into the diversity-accuracy landscape of the ensemble models. Diversity-error diagrams revealed that rotation forest ensembles construct individual classifiers which are more accurate than those in AdaBoost and random forest, and more diverse than those in bagging, sometimes more accurate as well.

1,708 citations
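
The rotation step is easy to sketch. Below is a compact, hedged Python rendition: the feature set is randomly split into K subsets, PCA (all components retained) is fitted per subset, and each tree trains on the rotated data. Refinements from the paper, such as bootstrapping and class subsampling before each PCA, are omitted, and majority voting assumes 0/1 labels.

```python
import numpy as np
from scipy.linalg import block_diag
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def fit_rotation_forest(X, y, n_trees=10, K=3, seed=0):
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_trees):
        perm = rng.permutation(X.shape[1])
        subsets = np.array_split(perm, K)     # random split into K feature subsets
        # One PCA per subset, all principal components kept
        # (assumes more samples than features per subset).
        rotations = [PCA().fit(X[:, s]).components_.T for s in subsets]
        R = block_diag(*rotations)            # block-diagonal rotation matrix
        tree = DecisionTreeClassifier(random_state=0).fit(X[:, perm] @ R, y)
        ensemble.append((perm, R, tree))
    return ensemble

def predict_rotation_forest(ensemble, X):
    votes = np.array([tree.predict(X[:, perm] @ R) for perm, R, tree in ensemble])
    return np.round(votes.mean(axis=0))       # majority vote over trees
```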


Journal ArticleDOI
TL;DR: The Random Forest classifier uses bagging, or bootstrap aggregating, to form an ensemble of classification and regression tree (CART)-like classifiers, which is computationally much lighter than methods based on boosting and somewhat lighter than simple bagging.

1,634 citations


Journal Article
Nicolai Meinshausen1
TL;DR: It is shown here that random forests provide information about the full conditional distribution of the response variable, not only about the conditional mean, and that conditional quantiles inferred with quantile regression forests are competitive in terms of predictive power.
Abstract: Random forests were introduced as a machine learning tool in Breiman (2001) and have since proven to be very popular and powerful for high-dimensional regression and classification. For regression, random forests give an accurate approximation of the conditional mean of a response variable. It is shown here that random forests provide information about the full conditional distribution of the response variable, not only about the conditional mean. Conditional quantiles can be inferred with quantile regression forests, a generalisation of random forests. Quantile regression forests give a non-parametric and accurate way of estimating conditional quantiles for high-dimensional predictor variables. The algorithm is shown to be consistent. Numerical examples suggest that the algorithm is competitive in terms of predictive power.

1,284 citations
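
A rough approximation of the estimator can be sketched with scikit-learn's forest API: pool, tree by tree, the training responses that fall in the same leaf as the query point, then take empirical quantiles of the pool. This is a simplified stand-in for Meinshausen's exact leaf-weighted formulation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def approx_quantiles(rf, X_train, y_train, X_query, q=(0.1, 0.5, 0.9)):
    """Approximate conditional quantiles from a fitted RandomForestRegressor."""
    train_leaves = rf.apply(X_train)     # (n_train, n_trees) leaf indices
    query_leaves = rf.apply(X_query)     # (n_query, n_trees)
    out = []
    for leaves in query_leaves:
        # Pool the training responses co-located with the query in each tree.
        pooled = np.concatenate([y_train[train_leaves[:, t] == leaves[t]]
                                 for t in range(train_leaves.shape[1])])
        out.append(np.quantile(pooled, q))
    return np.array(out)

# Usage sketch: rf = RandomForestRegressor(n_estimators=100).fit(X, y), then
# approx_quantiles(rf, X, y, X_new) returns 10%/50%/90% estimates per query row.
```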


Journal ArticleDOI
TL;DR: It is shown that random forests with adaptive splitting schemes assign weights to k-PNNs in a desirable way: for the estimation at a given target point, these random forests assign voting weights to the k- PNNs of the target point according to the local importance of different input variables.
Abstract: In this article we study random forests through their connection with a new framework of adaptive nearest-neighbor methods. We introduce a concept of potential nearest neighbors (k-PNNs) and show that random forests can be viewed as adaptively weighted k-PNN methods. Various aspects of random forests can be studied from this perspective. We study the effect of terminal node sizes on the prediction accuracy of random forests. We further show that random forests with adaptive splitting schemes assign weights to k-PNNs in a desirable way: for the estimation at a given target point, these random forests assign voting weights to the k-PNNs of the target point according to the local importance of different input variables. We propose a new simple splitting scheme that achieves desirable adaptivity in a straightforward fashion. This simple scheme can be combined with existing algorithms. The resulting algorithm is computationally faster and gives comparable results. Other possible aspects of random forests, such...

504 citations
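
In symbols (our notation, not a quotation from the paper), the weighted-neighbor view of a random forest regression estimate reads:

```latex
\hat{y}(x) = \sum_{i=1}^{n} w_i(x)\, y_i,
\qquad
w_i(x) = \frac{1}{T} \sum_{t=1}^{T}
         \frac{\mathbf{1}\{x_i \in \ell_t(x)\}}{|\ell_t(x)|},
```

where \ell_t(x) is the terminal node of tree t containing x. Only potential nearest neighbors of x receive nonzero weight, and adaptive splitting concentrates that weight according to the local importance of the input variables.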


Journal ArticleDOI
TL;DR: The RF dissimilarity is useful for detecting tumor sample clusters on the basis of tumor marker expressions and can be described with simple thresholding rules in this application.
Abstract: A random forest (RF) predictor is an ensemble of individual tree predictors. As part of their construction, RF predictors naturally lead to a dissimilarity measure between the observations. One can also define an RF dissimilarity measure between unlabeled data: the idea is to construct an RF predictor that distinguishes the “observed” data from suitably generated synthetic data. The observed data are the original unlabeled data and the synthetic data are drawn from a reference distribution. Here we describe the properties of the RF dissimilarity and make recommendations on how to use it in practice. An RF dissimilarity can be attractive because it handles mixed variable types well, is invariant to monotonic transformations of the input variables, and is robust to outlying observations. The RF dissimilarity easily deals with a large number of variables due to its intrinsic variable selection; for example, the Addcl 1 RF dissimilarity weighs the contribution of each variable according to how dependent it is ...

460 citations
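
A hedged sketch of the unlabeled-data construction: synthetic observations are drawn by independently permuting each column (one common choice of reference distribution, in the spirit of the Addcl 1 variant), a forest learns to separate observed from synthetic, and dissimilarity between observed points follows from how often they share terminal nodes. Feasible as written only for small n, since it builds an n-by-n proximity matrix.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_dissimilarity(X, n_estimators=500, seed=0):
    rng = np.random.default_rng(seed)
    # Synthetic data: each column permuted independently (reference distribution).
    synth = np.column_stack([rng.permutation(col) for col in X.T])
    data = np.vstack([X, synth])
    labels = np.r_[np.ones(len(X)), np.zeros(len(synth))]  # observed vs. synthetic
    rf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    rf.fit(data, labels)
    leaves = rf.apply(X)     # leaf indices of observed points, (n, n_trees)
    # Proximity: fraction of trees in which two points share a leaf.
    prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
    return np.sqrt(1.0 - prox)      # one common dissimilarity convention
```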


Book ChapterDOI
01 Jan 2006
TL;DR: An alternative selection method, based on the technique of Random Forests, is proposed in the context of classification, with an application to a real dataset.
Abstract: One of the main topics in the development of predictive models is the identification of variables which are predictors of a given outcome. Automated model selection methods, such as backward or forward stepwise regression, are classical solutions to this problem, but are generally based on strong assumptions about the functional form of the model or the distribution of residuals. In this paper an alternative selection method, based on the technique of Random Forests, is proposed in the context of classification, with an application to a real dataset.

440 citations


Journal ArticleDOI
TL;DR: RF-RFE (recursive feature elimination based on random forests) outperforms SVM-RFE and KWS on the task of finding small subsets of features with high discrimination levels on PTR-MS data sets, and it is shown how selection probabilities and feature co-occurrence can be used to highlight the most relevant features for discrimination.

429 citations
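
For illustration, recursive feature elimination driven by random forest importances can be assembled in a few lines with scikit-learn; the synthetic data, step size, and target subset size are placeholders rather than the PTR-MS setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)
selector = RFE(RandomForestClassifier(n_estimators=200, random_state=0),
               n_features_to_select=5, step=0.2)    # drop 20% per iteration
selector.fit(X, y)
print("kept features:", selector.support_.nonzero()[0])
```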


Journal ArticleDOI
15 May 2006-Proteins
TL;DR: In this article, the authors investigated the utility of different data sources and the way the data is encoded as features for predicting each of these types of protein interactions, and assembled a large set of biological features and varied their encoding for use in each of the three prediction tasks.
Abstract: Protein-protein interactions play a key role in many biological systems. High-throughput methods can directly detect the set of interacting proteins in yeast, but the results are often incomplete and exhibit high false-positive and false-negative rates. Recently, many different research groups independently suggested using supervised learning methods to integrate direct and indirect biological data sources for the protein interaction prediction task. However, the data sources, approaches, and implementations varied. Furthermore, the protein interaction prediction task itself can be subdivided into prediction of (1) physical interaction, (2) co-complex relationship, and (3) pathway co-membership. To investigate systematically the utility of different data sources and the way the data is encoded as features for predicting each of these types of protein interactions, we assembled a large set of biological features and varied their encoding for use in each of the three prediction tasks. Six different classifiers were used to assess the accuracy in predicting interactions: Random Forest (RF), RF similarity-based k-Nearest-Neighbor, Naive Bayes, Decision Tree, Logistic Regression, and Support Vector Machine. For all classifiers, the three prediction tasks had different success rates, and co-complex prediction appears to be an easier task than the other two. Independently of prediction task, however, the RF classifier consistently ranked as one of the top two classifiers for all combinations of feature sets. Therefore, we used this classifier to study the importance of different biological datasets. First, we used the splitting function of the RF tree structure, the Gini index, to estimate feature importance. Second, we determined classification accuracy when only the top-ranking features were used as an input in the classifier. We find that the importance of different features depends on the specific prediction task and the way they are encoded. Strikingly, gene expression is consistently the most important feature for all three prediction tasks, while the protein interactions identified using the yeast-2-hybrid system were not among the top-ranking features under any condition.

01 Jan 2006
TL;DR: A Bayesian "sum-of-trees" model where each tree is constrained by a regularization prior to be a weak learner, and fitting and inference are accomplished via an iterative Bayesian backfitting MCMC algorithm that generates samples from a posterior.
Abstract: We develop a Bayesian "sum-of-trees" model where each tree is constrained by a regularization prior to be a weak learner, and fitting and inference are accomplished via an iterative Bayesian backfitting MCMC algorithm that generates samples from a posterior. Effectively, BART is a nonparametric Bayesian regression approach which uses dynamic random basis elements that are dimensionally adaptive. BART is motivated by ensemble methods in general, and boosting algorithms in particular. However, BART is defined by a statistical model: a prior and a likelihood, while boosting is defined by an algorithm. This model-based approach enables a full assessment of prediction uncertainty while remaining highly competitive in terms of prediction accuracy. The potential of BART is illustrated on examples where it compares favorably with competing methods including gradient boosting, neural nets and random forests. It is also seen that BART is remarkably effective at finding low dimensional structure in high dimensional data.
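
In symbols (our notation, following the abstract), the sum-of-trees model is:

```latex
y = \sum_{j=1}^{m} g(x;\, T_j, M_j) + \varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0, \sigma^2),
```

where each g(x; T_j, M_j) is a regression tree with structure T_j and leaf parameters M_j, kept weak by the regularization prior; backfitting MCMC then cycles over the (T_j, M_j) pairs and σ to draw posterior samples.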

Proceedings ArticleDOI
11 Dec 2006
TL;DR: This paper applies an efficient data mining algorithm, random forests, to anomaly-based NIDSs and presents a modification of its outlier detection algorithm that is comparable to previously reported unsupervised anomaly detection approaches evaluated over the KDD'99 dataset.
Abstract: Anomaly detection is a critical issue in Network Intrusion Detection Systems (NIDSs). Most anomaly based NIDSs employ supervised algorithms, whose performances highly depend on attack-free training data. However, this kind of training data is difficult to obtain in real world network environments. Moreover, with changing network environments or services, patterns of normal traffic will be changed. This leads to a high false positive rate of supervised NIDSs. Unsupervised outlier detection can overcome the drawbacks of supervised anomaly detection. Therefore, we apply an efficient data mining algorithm, random forests, in anomaly based NIDSs. Without attack-free training data, the random forests algorithm can detect outliers in datasets of network traffic. In this paper, we discuss our framework of anomaly based network intrusion detection. In the framework, patterns of network services are built by the random forests algorithm over traffic data. Intrusions are detected by determining outliers related to the built patterns. We present a modification of the outlier detection algorithm of random forests. We also report our experimental results over the KDD'99 dataset. The results show that the proposed approach is comparable to previously reported unsupervised anomaly detection approaches evaluated over the KDD'99 dataset.
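
One way to realize proximity-based outlier detection is sketched below, assuming Breiman-style outlier scores (a point is outlying within its class when its proximities to classmates are uniformly small); this is a generic rendition, not the authors' modified algorithm.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_outlier_scores(rf, X, y):
    """rf: a fitted RandomForestClassifier; X, y: NumPy arrays.
    Returns one outlier score per row of X."""
    leaves = rf.apply(X)     # (n, n_trees) leaf indices
    prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
    scores = np.empty(len(X))
    for i in range(len(X)):
        same = (y == y[i])
        same[i] = False                      # exclude self-proximity
        scores[i] = len(X) / (prox[i, same] ** 2).sum()
    return scores                            # larger score = more outlying
```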

Journal ArticleDOI
TL;DR: A pathway-based classification and regression method using Random Forests to analyze gene expression data and can provide biological insight into the study of microarray data is described.
Abstract: Motivation: Although numerous methods have been developed to better capture biological information from microarray data, commonly used single gene-based methods neglect interactions among genes and leave room for other novel approaches. For example, most classification and regression methods for microarray data are based on the whole set of genes and have not made use of pathway information. Pathway-based analysis in microarray studies may lead to more informative and relevant knowledge for biological researchers. Results: In this paper, we describe a pathway-based classification and regression method using Random Forests to analyze gene expression data. The proposed methods allow researchers to rank important pathways from externally available databases, discover important genes, find pathway-based outlying cases and make full use of a continuous outcome variable in the regression setting. We also compared Random Forests with other machine learning methods using several datasets and found that Random Forests classification error rates were either the lowest or the second-lowest. By combining pathway information and novel statistical methods, this procedure represents a promising computational strategy in dissecting pathways and can provide biological insight into the study of microarray data. Availability: Source code written in R is available from http://bioinformatics.med.yale.edu/pathway-analysis/rf.htm Contact: hongyu.zhao@yale.edu Supplementary Information: Supplementary Data are available at http://bioinformatics.med.yale.edu/pathway-analysis/rf.htm
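
A minimal sketch of the pathway-ranking idea (not the authors' R code, which is linked above): fit one forest per pathway, restricted to that pathway's genes, and rank pathways by out-of-bag accuracy. The pathways mapping is a hypothetical placeholder.

```python
from sklearn.ensemble import RandomForestClassifier

def rank_pathways(X, y, pathways):
    """pathways: dict mapping pathway name -> list of gene column indices."""
    scores = {}
    for name, genes in pathways.items():
        rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                                    random_state=0)
        rf.fit(X[:, genes], y)
        scores[name] = rf.oob_score_    # out-of-bag accuracy for this pathway
    return sorted(scores.items(), key=lambda kv: -kv[1])
```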

Journal ArticleDOI
TL;DR: An ensemble learning framework based on random sampling on all three key components of a classification system: the feature space, training samples, and subspace parameters is developed, and a robust random sampling face recognition system integrating shape, texture, and Gabor responses is constructed.
Abstract: Subspace face recognition often suffers from two problems: (1) the training sample set is small compared with the high dimensional feature vector; (2) the performance is sensitive to the subspace dimension. Instead of pursuing a single optimal subspace, we develop an ensemble learning framework based on random sampling on all three key components of a classification system: the feature space, training samples, and subspace parameters. Fisherface and Null Space LDA (N-LDA) are two conventional approaches to address the small sample size problem. But in many cases, these LDA classifiers are overfitted to the training set and discard some useful discriminative information. By analyzing different overfitting problems for the two kinds of LDA classifiers, we use random subspace and bagging to improve them respectively. By random sampling on feature vectors and training samples, multiple stabilized Fisherface and N-LDA classifiers are constructed and the two groups of complementary classifiers are integrated using a fusion rule, so nearly all the discriminative information is preserved. In addition, we further apply random sampling on parameter selection in order to overcome the difficulty of selecting optimal parameters in our algorithms. Then, we use the developed random sampling framework for the integration of multiple features. A robust random sampling face recognition system integrating shape, texture, and Gabor responses is finally constructed.
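
Two of the three sampling levels can be loosely mimicked with scikit-learn's BaggingClassifier, as sketched below with plain LDA standing in for the Fisherface/N-LDA base classifiers and synthetic data standing in for face features; parameter-level sampling and the fusion rule are omitted.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=300, n_features=40, n_informative=10,
                           random_state=0)
ensemble = BaggingClassifier(
    LinearDiscriminantAnalysis(),
    n_estimators=50,
    max_features=0.5,            # random subspace: half the features per LDA
    bootstrap=True,              # bagging on the training samples
    random_state=0,
)
ensemble.fit(X, y)
print("training accuracy:", ensemble.score(X, y))
```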

Journal ArticleDOI
TL;DR: The most accurate algorithm is Breiman's random forest, an ensemble method which provides automatic combination of tree classifiers trained on bootstrapped subsamples and randomised variable sets; it indicates a potential area of P. sylvestris in the Iberian Peninsula that is larger than the present one.

Journal ArticleDOI
TL;DR: Empirical experimentation suggests that the SVM outperforms the other classification methods in terms of predicting the direction of the stock market movement, and the random forest method outperforms the neural network, discriminant analysis and logit models used in this study.
Abstract: There exist many research articles which predict the stock market as well as the pricing of stock index financial instruments, but most of the proposed models focus on the accurate forecasting of the levels (i.e. values) of the underlying stock index. There is a lack of studies examining the predictability of the direction/sign of stock index movement. Given the notion that a prediction with little forecast error does not necessarily translate into capital gain, this study is an attempt to predict the direction of the S&P CNX NIFTY Market Index of the National Stock Exchange, one of the fastest growing financial exchanges in developing Asian countries. Random forest and Support Vector Machines (SVM) are specific types of machine learning methods and are promising tools for the prediction of financial time series. The tested classification models, which predict direction, include linear discriminant analysis, logit, artificial neural network, random forest and SVM. Empirical experimentation suggests that the SVM outperforms the other classification methods in terms of predicting the direction of the stock market movement, and the random forest method outperforms the neural network, discriminant analysis and logit models used in this study.

Journal ArticleDOI
TL;DR: In this paper, the authors provide an introduction to ensemble statistical procedures as a special case of algorithmic methods, using classification and regression trees (CART) as a didactic device to introduce many of the key issues.
Abstract: This paper provides an introduction to ensemble statistical procedures as a special case of algorithmic methods. The discussion begins with classification and regression trees (CART) as a didactic device to introduce many of the key issues. Following the material on CART is a consideration of cross-validation, bagging, random forests and boosting. Major points are illustrated with analyses of real data.

Journal ArticleDOI
TL;DR: It is shown that these methods do surprisingly well in predicting recently published ligands of a target on the basis of initial leads and that a combination of the results of different methods in certain cases can improve results compared to the most consistent method.
Abstract: How well do different classification methods perform in selecting the ligands of a protein target out of large compound collections not used to train the model? Support vector machines, random forest, artificial neural networks, k-nearest-neighbor classification with genetic-algorithm-optimized feature selection, trend vectors, naive Bayesian classification, and decision tree were used to divide databases into molecules predicted to be active and those predicted to be inactive. Training and predicted activities were treated as binary. The database was generated for the ligands of five different biological targets which have been the object of intense drug discovery efforts: HIV-reverse transcriptase, COX2, dihydrofolate reductase, estrogen receptor, and thrombin. We report significant differences in the performance of the methods independent of the biological target and compound class. Different methods can have different applications; some provide particularly high enrichment, others are strong in retrieving the maximum number of actives. We also show that these methods do surprisingly well in predicting recently published ligands of a target on the basis of initial leads and that a combination of the results of different methods in certain cases can improve results compared to the most consistent method.

Journal ArticleDOI
TL;DR: The results show the usefulness of random forest‐ and SVM‐based feature selection approaches in comparison to the SVM/GA approach for land cover classification problems with hyperspectral data, and an improved performance using these techniques in relation to the maximum noise transformation based feature extraction technique.
Abstract: This paper presents the results of a support vector machine (SVM) technique and a genetic algorithm (GA) technique using generalization error bounds derived for SVMs as fitness functions (SVM/GA) for feature selection using hyperspectral data. Results obtained with the SVM/GA‐based technique were compared with those produced by random forest‐ and SVM‐based feature selection techniques in terms of classification accuracy and computational cost. The classification accuracy using SVM‐based feature selection was 91.89%. The number of features selected was 15. For comparison, the accuracy produced by the use of the full set of 65 features was 91.76%. The level of classification accuracy achieved by the SVM/GA approach using 15 features varied from 91.87% to 92.44% with different fitness functions but required a large training time. The performance of the random forest‐based feature selection approach gave a classification accuracy of 91.89%, which is comparable to the accuracy achieved by using the SVM and SVM/...

Journal ArticleDOI
TL;DR: Information on clinical diagnoses can be used to accurately predict mortality among hospitalized patients with SLE and random forests represent a useful technique to identify the most important predictors from a larger (often much larger) number and to validate the classification.
Abstract: Objective To identify demographic and clinical characteristics that classify patients with systemic lupus erythematosus (SLE) at risk for in-hospital mortality. Methods Patients hospitalized in California from 1996 to 2000 with a principal diagnosis of SLE (N = 3,839) were identified from a state hospitalization database. As candidate predictors of mortality, we used patient demographic characteristics; the presence or absence of 40 different clinical conditions listed among the discharge diagnoses; and 2 summary indexes derived from the discharge diagnoses, the Charlson Index and the SLE Comorbidity Index. Predictors of patients at increased risk of mortality were identified and validated using random forests, a statistical procedure that is a generalization of single classification trees. Random forests use bootstrapped samples of patients and randomly selected subsets of predictors to create individual classification trees, and this process is repeated to generate multiple trees (a forest). Classification is then done by majority vote across all trees. Results Of the 3,839 patients, 109 died during hospitalization. Selecting from all available predictors, the random forests had excellent predictive accuracy for classification of death. The mean classification error rate, averaged over 10 forests of 500 trees each, was 11.9%. The most important predictors were the Charlson Index, respiratory failure, SLE Comorbidity Index, age, sepsis, nephritis, and thrombocytopenia. Conclusion Information on clinical diagnoses can be used to accurately predict mortality among hospitalized patients with SLE. Random forests represent a useful technique to identify the most important predictors from a larger (often much larger) number and to validate the classification.


Journal ArticleDOI
01 Sep 2006
TL;DR: First order random forests with complex aggregates are an efficient and effective approach towards learning relational classifiers that involve aggregates over complex selections.
Abstract: In relational learning, predictions for an individual are based not only on its own properties but also on the properties of a set of related individuals. Relational classifiers differ with respect to how they handle these sets: some use properties of the set as a whole (using aggregation), some refer to properties of specific individuals of the set, however, most classifiers do not combine both. This imposes an undesirable bias on these learners. This article describes a learning approach that avoids this bias, using first order random forests. Essentially, an ensemble of decision trees is constructed in which tests are first order logic queries. These queries may contain aggregate functions, the argument of which may again be a first order logic query. The introduction of aggregate functions in first order logic, as well as upgrading the forest's uniform feature sampling procedure to the space of first order logic, generates a number of complications. We address these and propose a solution for them. The resulting first order random forest induction algorithm has been implemented and integrated in the ACE-ilProlog system, and experimentally evaluated on a variety of datasets. The results indicate that first order random forests with complex aggregates are an efficient and effective approach towards learning relational classifiers that involve aggregates over complex selections.

Book ChapterDOI
TL;DR: This chapter summarizes the Random Forests methodology and illustrates its use on freely available data sets.
Abstract: Random Forests is a powerful multipurpose tool for predicting and understanding data. If gene expression data come from known groups or classes (e.g., tumor patients and controls), Random Forests can rank the genes in terms of their usefulness in separating the groups. When the groups are unknown, Random Forests uses an intrinsic measure of the similarity of the genes to extract useful multivariate structure, including clusters. This chapter summarizes the Random Forests methodology and illustrates its use on freely available data sets.

Proceedings ArticleDOI
01 Sep 2006
TL;DR: The results indicate that utilizing multiple data types is beneficial when the disease model is complex and the phenotypic outcome-associated data type is unknown, and RF is adept at identifying relevant features in high-dimensional data with small main effects and low heritability.
Abstract: Complex clinical phenotypes arise from the concerted interactions among the myriad components of a biological system. Therefore, comprehensive models can only be developed through the integrated study of multiple types of experimental data gathered from the system in question. The Random Forests™ (RF) method is adept at identifying relevant features having only slight main effects in high-dimensional data. This method is well-suited to integrated analysis, as relevant attributes may be selected from categorical or continuous data, and there may be interactions across data types. RF is a natural approach for studying gene-gene, gene-protein, or protein-protein interactions because importance scores for particular attributes take interactions into account. Thus, Random Forests is a promising solution to the analysis challenge posed by high-dimensional datasets including interactions among attributes of different types. In this study, we characterize the performance of RF on a range of simulated genetic and/or proteomic datasets. We compare the performance of RF in identifying relevant attributes when given genetic data alone, proteomic data alone, or a combined dataset of genetic plus proteomic data. Our results indicate that utilizing multiple data types is beneficial when the disease model is complex and the phenotypic outcome-associated data type is unknown. The results of this study also show that RF is adept at identifying relevant features in high-dimensional data with small main effects and low heritability.

Journal ArticleDOI
TL;DR: A global sensitivity analysis with regional properties is introduced and it can be shown that an uncertainty analysis based on one-dimensional scatter plots and correlation analyses such as the Spearman Rank Correlation coefficient can lead to misinterpretations of any model results.
Abstract: A global sensitivity analysis with regional properties is introduced. This method is demonstrated on two synthetic and one hydraulic example. It can be shown that an uncertainty analysis based on one-dimensional scatter plots and correlation analyses such as the Spearman Rank Correlation coefficient can lead to misinterpretations of any model results. The method which has been proposed in this paper is based on multiple regression trees (so called Random Forests). The splits at each node of the regression tree are sampled from a probability distribution. Several criteria are enforced at each level of splitting to ensure positive information gain and also to distinguish between behavioural and non-behavioural model representations. The latter distinction is applied in the generalized likelihood uncertainty estimation (GLUE) and regional sensitivity analysis (RSA) framework to analyse model results and is used here to derive regression tree (model) structures. Two methods of sensitivity analysis are used: in the first method the total information gain achieved by each parameter is evaluated. In the second method parameters and parameter sets are permuted and an error rate computed. This error rate is compared to values without permutation. This latter method allows the evaluation of the sensitivity of parameter combinations and thus gives an insight into the structure of the response surface. The examples demonstrate the capability of this methodology and stress the importance of the application of sensitivity analysis.
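
The second method, permuting parameters and comparing error rates with and without permutation, corresponds closely to permutation importance; a minimal sketch with scikit-learn, using a random forest surrogate on synthetic data, follows.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=400, n_features=6, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Score drop when each parameter (column) is permuted, averaged over repeats.
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
for i, (mean, std) in enumerate(zip(result.importances_mean,
                                    result.importances_std)):
    print(f"parameter {i}: {mean:.3f} +/- {std:.3f}")
```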


01 Jan 2006
TL;DR: A collection of invited and contributed chapters on subspace, latent structure and feature selection methods, including a simple feature extraction for high-dimensional image representations and a chapter on identifying feature relevance using a random forest.
Abstract: Invited Contributions: Discrete Component Analysis; Overview and Recent Advances in Partial Least Squares; Random Projection, Margins, Kernels, and Feature-Selection; Some Aspects of Latent Structure Analysis; Feature Selection for Dimensionality Reduction. Contributed Papers: Auxiliary Variational Information Maximization for Dimensionality Reduction; Constructing Visual Models with a Latent Space Approach; Is Feature Selection Still Necessary?; Class-Specific Subspace Discriminant Analysis for High-Dimensional Data; Incorporating Constraints and Prior Knowledge into Factorization Algorithms - An Application to 3D Recovery; A Simple Feature Extraction for High Dimensional Image Representations; Identifying Feature Relevance Using a Random Forest; Generalization Bounds for Subspace Selection and Hyperbolic PCA; Less Biased Measurement of Feature Selection Benefits.

Book ChapterDOI
18 Sep 2006
TL;DR: In this paper, the authors demonstrate that the prediction performance of RF may still be improved in some domains by replacing the combination function with dynamic integration, which is based on local performance estimates, and demonstrate that RF Intrinsic Similarity is better than the commonly used Heterogeneous Euclidean/Overlap Metric in finding a neighbourhood for local estimates in the context of dynamic integration of classification random forests.
Abstract: Random Forests (RF) are a successful ensemble prediction technique that uses majority voting or averaging as a combination function. However, it is clear that each tree in a random forest may have a different contribution in processing a certain instance. In this paper, we demonstrate that the prediction performance of RF may still be improved in some domains by replacing the combination function with dynamic integration, which is based on local performance estimates. Our experiments also demonstrate that the RF Intrinsic Similarity is better than the commonly used Heterogeneous Euclidean/Overlap Metric in finding a neighbourhood for local estimates in the context of dynamic integration of classification random forests.

Proceedings Article
01 Nov 2006
TL;DR: The experimental results show that all ensemble methods outperform C4.5, that all five methods benefit from data preprocessing, including gene selection and discretization, in classification accuracy, and that the Wilcoxon signed rank test is better than the sign test for validating such findings.
Abstract: In response to the rapid development of DNA Microarray technology, many classification methods have been used for Microarray classification. SVMs, decision trees, Bagging, Boosting and Random Forest are commonly used methods. In this paper, we conduct an experimental comparison of LibSVMs, C4.5, BaggingC4.5, AdaBoostingC4.5, and Random Forest on seven Microarray cancer data sets. The experimental results show that all ensemble methods outperform C4.5. The experimental results also show that all five methods benefit from data preprocessing, including gene selection and discretization, in classification accuracy. In addition to comparing the average accuracies of ten-fold cross validation tests on seven data sets, we use two statistical tests to validate findings. We observe that the Wilcoxon signed rank test is better than the sign test for this purpose.
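
To make the validation step concrete, the sketch below runs a Wilcoxon signed-rank test and a simple sign test on paired per-dataset accuracies of two classifiers; the accuracy values are illustrative placeholders, not the paper's results.

```python
import numpy as np
from scipy.stats import binomtest, wilcoxon

# Hypothetical paired accuracies on seven datasets (toy values).
acc_rf  = np.array([0.92, 0.88, 0.95, 0.90, 0.93, 0.91, 0.94])
acc_c45 = np.array([0.85, 0.86, 0.90, 0.88, 0.89, 0.84, 0.91])

print("Wilcoxon signed rank p =", wilcoxon(acc_rf, acc_c45).pvalue)
wins = int((acc_rf > acc_c45).sum())
print("sign test p =", binomtest(wins, n=len(acc_rf), p=0.5).pvalue)
```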