
Showing papers on "Random forest published in 2011"


Journal ArticleDOI
TL;DR: Random forests have become a popular technique for classification, prediction, studying variable importance, variable selection, and outlier detection, and results of new tests regarding variable rankings based on RF variable importance measures are presented.

622 citations


Journal ArticleDOI
TL;DR: A supervised workflow is proposed in this study to reduce manual labor and objectify the choice of significant object features and classification thresholds; it resulted in accuracies between 73% and 87% for the affected areas, with approximately balanced commission and omission errors.

569 citations


Journal ArticleDOI
TL;DR: Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC), and has the advantage of computing the importance of each variable in the classification process.
Abstract: We present a method utilizing the Healthcare Cost and Utilization Project (HCUP) dataset for predicting disease risk of individuals based on their medical diagnosis history. The presented methodology may be incorporated in a variety of applications such as risk management, tailored health communication and decision support systems in healthcare. We employed the National Inpatient Sample (NIS) data, which is publicly available through HCUP, to train random forest classifiers for disease prediction. Since the HCUP data is highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples, while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting and RF in predicting the risk of eight chronic diseases. Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process. By combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP data set, we predicted eight disease categories with an average AUC of 88.79%.

521 citations
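The repeated random sub-sampling idea above can be sketched with scikit-learn; this is an illustrative toy on synthetic data standing in for the HCUP/NIS records, not the authors' exact pipeline (the number of sub-samples and forest sizes are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data standing in for the HCUP/NIS records.
X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rng = np.random.default_rng(0)
pos = np.where(y_tr == 1)[0]
neg = np.where(y_tr == 0)[0]

# Repeated random sub-sampling: each sub-sample is fully balanced,
# pairing all positives with an equal-sized random draw of negatives.
probs = []
for _ in range(10):
    idx = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_tr[idx], y_tr[idx])
    probs.append(rf.predict_proba(X_te)[:, 1])

# Average the per-forest probabilities and score by AUC, as in the paper.
auc = roc_auc_score(y_te, np.mean(probs, axis=0))
```

Averaging the sub-sample forests' probabilities keeps all majority-class information across the ensemble while each individual forest trains on balanced data.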


Journal ArticleDOI
TL;DR: The Random Forests algorithm is chosen as a classifier: it runs efficiently on large datasets, and provides measures of feature importance for each class, and the relevance of full-waveform lidar features is demonstrated for building and vegetation area discrimination.
Abstract: Airborne lidar systems have become a source for the acquisition of elevation data. They provide georeferenced, irregularly distributed 3D point clouds of high altimetric accuracy. Moreover, these systems can provide, for a single laser pulse, multiple returns or echoes, which correspond to different illuminated objects. In addition to multi-echo laser scanners, full-waveform systems are able to record 1D signals representing a train of echoes caused by reflections at different targets. These systems provide more information about the structure and the physical characteristics of the targets. Many approaches have been developed for urban mapping, based solely on aerial lidar or combined with multispectral image data. However, they have not assessed the importance of input features. In this paper, we focus on a multi-source framework using aerial lidar (multi-echo and full waveform) and aerial multispectral image data. We aim to study the feature relevance for dense urban scenes. The Random Forests algorithm is chosen as a classifier: it runs efficiently on large datasets, and provides measures of feature importance for each class. The margin theory is used as a confidence measure of the classifier, and to confirm the relevance of input features for urban classification. The quantitative results confirm the importance of the joint use of optical multispectral and lidar data. Moreover, the relevance of full-waveform lidar features is demonstrated for building and vegetation area discrimination.

394 citations


Journal ArticleDOI
TL;DR: It is demonstrated that Lasso logistic regression, fused support vector machine, group Lasso and random forest models suffer from correlation bias, and that two related methods for group selection based on feature clustering can be used to correct it.
Abstract: Motivation: Classification and feature selection of genomics or transcriptomics data is often hampered by the large number of features as compared with the small number of samples available. Moreover, features represented by probes that either have similar molecular functions (gene expression analysis) or genomic locations (DNA copy number analysis) are highly correlated. Classical model selection methods such as penalized logistic regression or random forest become unstable in the presence of high feature correlations. Sophisticated penalties such as group Lasso or fused Lasso can force the models to assign similar weights to correlated features and thus improve model stability and interpretability. In this article, we show that the measures of feature relevance corresponding to the above-mentioned methods are biased such that the weights of the features belonging to groups of correlated features decrease as the sizes of the groups increase, which leads to incorrect model interpretation and misleading feature ranking. Results: With simulation experiments, we demonstrate that Lasso logistic regression, fused support vector machine, group Lasso and random forest models suffer from correlation bias. Using simulations, we show that two related methods for group selection based on feature clustering can be used for correcting the correlation bias. These techniques also improve the stability and the accuracy of the baseline models. We apply all methods investigated to a breast cancer and a bladder cancer arrayCGH dataset in order to identify copy number aberrations predictive of tumor phenotype. Availability: R code can be found at: http://www.mpi-inf.mpg.de/~laura/Clustering.r. Contact: laura.tolosi@mpi-inf.mpg.de Supplementary information: Supplementary data are available at Bioinformatics online.

356 citations
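The group-selection idea — cluster correlated features, then fit the forest on cluster representatives — can be sketched as follows. This is a hedged toy, not the paper's exact algorithm: the clustering cutoff and the choice of the first cluster member as representative are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=40, n_informative=5,
                           random_state=0)
# Duplicate a few columns with small noise to create a correlated group.
noise = 0.05 * np.random.default_rng(0).normal(size=(300, 5))
X = np.hstack([X, X[:, :5] + noise])

# Hierarchical clustering of features on the distance 1 - |correlation|.
corr = np.corrcoef(X, rowvar=False)
dist = 1.0 - np.abs(corr)
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
clusters = fcluster(Z, t=0.3, criterion="distance")

# Keep one representative feature per cluster before fitting the forest,
# so a large group of correlated probes counts as a single candidate.
reps = [np.where(clusters == c)[0][0] for c in np.unique(clusters)]
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X[:, reps], y)
```

Because correlated duplicates collapse into one representative, the forest's importance scores are no longer diluted across the group.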


Book ChapterDOI
31 Aug 2011
TL;DR: A system is presented for estimating the location and orientation of a person's head from depth data acquired by a low-quality device; it is built on discriminative random regression forests: ensembles of random trees trained by splitting each node so as to simultaneously reduce the entropy of the class labels distribution and the variance of the head position and orientation.
Abstract: We present a system for estimating the location and orientation of a person's head from depth data acquired by a low-quality device. Our approach is based on discriminative random regression forests: ensembles of random trees trained by splitting each node so as to simultaneously reduce the entropy of the class labels distribution and the variance of the head position and orientation. We evaluate three different approaches to jointly take classification and regression performance into account during training. For evaluation, we acquired a new dataset and propose a method for its automatic annotation.

336 citations
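The joint entropy-plus-variance split objective described above can be sketched as a toy scoring function; the equal weighting `alpha` and the simple threshold form are illustrative assumptions, not the authors' exact training objective:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def joint_split_gain(x, y_cls, y_reg, thresh, alpha=0.5):
    """Score a candidate split by simultaneously reducing class-label
    entropy and regression-target variance, blended by alpha."""
    left = x < thresh
    right = ~left
    if left.sum() == 0 or right.sum() == 0:
        return -np.inf                      # degenerate split
    w_l, w_r = left.mean(), right.mean()
    ent_gain = entropy(y_cls) - (w_l * entropy(y_cls[left])
                                 + w_r * entropy(y_cls[right]))
    var_gain = y_reg.var() - (w_l * y_reg[left].var()
                              + w_r * y_reg[right].var())
    return alpha * ent_gain + (1 - alpha) * var_gain

# Toy data: class flips and the regression target both change at x = 10,
# so the split at 10 should score far better than a split at 2.
x = np.arange(20.0)
y_cls = (x >= 10).astype(int)
y_reg = x.copy()
good = joint_split_gain(x, y_cls, y_reg, 10.0)
bad = joint_split_gain(x, y_cls, y_reg, 2.0)
```

A full regression forest would evaluate this gain over many random feature/threshold candidates at every node.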


Journal ArticleDOI
TL;DR: When sensitivity, specificity and overall classification accuracy are taken into account, Random Forests and Linear Discriminant Analysis rank first among all the classifiers tested for predicting dementia from several neuropsychological tests.
Abstract: Dementia and cognitive impairment associated with aging are a major medical and social concern. Neuropsychological testing is a key element in the diagnostic procedures of Mild Cognitive Impairment (MCI), but at present it has limited value in predicting progression to dementia. We advance the hypothesis that newer statistical classification methods derived from data mining and machine learning, such as Neural Networks, Support Vector Machines and Random Forests, can improve the accuracy, sensitivity and specificity of predictions obtained from neuropsychological testing. Seven nonparametric classifiers derived from data mining methods (Multilayer Perceptron Neural Networks, Radial Basis Function Neural Networks, Support Vector Machines, CART, CHAID and QUEST Classification Trees, and Random Forests) were compared to three traditional classifiers (Linear Discriminant Analysis, Quadratic Discriminant Analysis and Logistic Regression) in terms of overall classification accuracy, specificity, sensitivity, area under the ROC curve and Press' Q. Model predictors were 10 neuropsychological tests currently used in the diagnosis of dementia. Statistical distributions of classification parameters obtained from a 5-fold cross-validation were compared using Friedman's nonparametric test. The Press' Q test showed that all classifiers performed better than chance alone (p < 0.05). Support Vector Machines showed the largest overall classification accuracy (Median (Me) = 0.76) and an area under the ROC curve of Me = 0.90; however, this method showed high specificity (Me = 1.0) but low sensitivity (Me = 0.3). Random Forests ranked second in overall accuracy (Me = 0.73), with a high area under the ROC curve (Me = 0.73), specificity (Me = 0.73) and sensitivity (Me = 0.64). Linear Discriminant Analysis also showed acceptable overall accuracy (Me = 0.66), with an acceptable area under the ROC curve (Me = 0.72), specificity (Me = 0.66) and sensitivity (Me = 0.64).
The remaining classifiers showed overall classification accuracy above a median value of 0.63, but for most of them sensitivity was around or even below a median value of 0.5. When sensitivity, specificity and overall classification accuracy are taken into account, Random Forests and Linear Discriminant Analysis rank first among all the classifiers tested for predicting dementia from several neuropsychological tests. These methods may be used to improve the accuracy, sensitivity and specificity of dementia predictions from neuropsychological testing.

331 citations
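A comparison protocol of this kind is easy to reproduce with scikit-learn. The sketch below uses the public breast-cancer dataset as a stand-in, since the study's 10 neuropsychological test scores are not available; the model hyperparameters are assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Stand-in data: the study's neuropsychological test battery is not public.
X, y = load_breast_cancer(return_X_y=True)

models = {
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": SVC(),
    "LDA": LinearDiscriminantAnalysis(),
}

# 5-fold cross-validated AUC, mirroring the paper's comparison protocol.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
          for name, m in models.items()}
```

As in the paper, a single metric such as AUC should be read alongside sensitivity and specificity before declaring a winner.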


Journal ArticleDOI
TL;DR: In this paper, an approach is presented for predicting individual tree attributes, i.e., tree height, diameter at breast height (DBH) and stem volume, based on both physical and statistical features derived from airborne laser-scanning data, utilizing a new detection method for finding individual trees together with random forests as an estimation method.
Abstract: This paper describes an approach for predicting individual tree attributes, i.e., tree height, diameter at breast height (DBH) and stem volume, based on both physical and statistical features derived from airborne laser-scanning data, utilizing a new detection method for finding individual trees together with random forests as an estimation method. The random forests (also called regression forests) technique is a nonparametric regression method consisting of a set of individual regression trees. Tests of the method were performed using 1476 trees in a boreal forest area in southern Finland and laser data with a density of 2.6 points per m². Correlation coefficients (R) between the observed and predicted values of 0.93, 0.79 and 0.87 for individual tree height, DBH and stem volume, respectively, were achieved, based on 26 laser-derived features. The corresponding relative root-mean-squared errors (RMSEs) were 10.03%, 21.35% and 45.77% (38% in best cases), which are similar to those obtained with the linear regression method, with maximum laser heights, laser-estimated DBH or crown diameters as predictors. With random forests, however, the forest models currently used for deriving the tree attributes are not needed. Based on the results, we conclude that the method is capable of providing a stable and consistent solution for determining individual tree attributes using small-footprint laser data.

324 citations
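The evaluation above (correlation coefficient R and relative RMSE from a regression forest) can be sketched on synthetic stand-ins for the 26 laser-derived features; the data-generating model and forest size here are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for 26 laser-derived features and tree heights (m).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 26))
height = 20 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=500)

rf = RandomForestRegressor(n_estimators=200, random_state=0)
pred = cross_val_predict(rf, X, height, cv=5)   # out-of-fold predictions

# The paper's two headline metrics: correlation R and relative RMSE (%).
r = np.corrcoef(height, pred)[0, 1]
rel_rmse = 100 * np.sqrt(np.mean((height - pred) ** 2)) / height.mean()
```

Using out-of-fold predictions avoids the optimistic bias of scoring a forest on its own training trees.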


Proceedings ArticleDOI
20 Jun 2011
TL;DR: Results show that the proposed random forest with discriminative decision trees algorithm identifies semantically meaningful visual information and outperforms state-of-the-art algorithms on various datasets.
Abstract: In this paper, we study the problem of fine-grained image categorization. The goal of our method is to explore fine image statistics and identify the discriminative image patches for recognition. We achieve this goal by combining two ideas, discriminative feature mining and randomization. Discriminative feature mining allows us to model the detailed information that distinguishes different classes of images, while randomization allows us to handle the huge feature space and prevents over-fitting. We propose a random forest with discriminative decision trees algorithm, where every tree node is a discriminative classifier that is trained by combining the information in this node as well as all upstream nodes. Our method is tested on both subordinate categorization and activity recognition datasets. Experimental results show that our method identifies semantically meaningful visual information and outperforms state-of-the-art algorithms on various datasets.

297 citations


Journal ArticleDOI
TL;DR: In this article, a methodology for variable-star classification using machine learning techniques is proposed, which can quickly and automatically produce calibrated classification probabilities for newly observed variables based on small numbers of time-series measurements.
Abstract: With the coming data deluge from synoptic surveys, there is a need for frameworks that can quickly and automatically produce calibrated classification probabilities for newly observed variables based on small numbers of time-series measurements. In this paper, we introduce a methodology for variable-star classification, drawing from modern machine-learning techniques. We describe how to homogenize the information gleaned from light curves by selection and computation of real-numbered metrics (features), detail methods to robustly estimate periodic features, introduce tree-ensemble methods for accurate variable-star classification, and show how to rigorously evaluate a classifier using cross validation. On a 25-class data set of 1542 well-studied variable stars, we achieve a 22.8% error rate using the random forest (RF) classifier; this represents a 24% improvement over the best previous classifier on these data. This methodology is effective for identifying samples of specific science classes: for pulsational variables used in Milky Way tomography we obtain a discovery efficiency of 98.2% and for eclipsing systems we find an efficiency of 99.1%, both at 95% purity. The RF classifier is superior to other methods in terms of accuracy, speed, and relative immunity to irrelevant features; the RF can also be used to estimate the importance of each feature in classification. Additionally, we present the first astronomical use of hierarchical classification methods to incorporate a known class taxonomy in the classifier, which reduces the catastrophic error rate from 8% to 7.8%. Excluding low-amplitude sources, the overall error rate improves to 14%, with a catastrophic error rate of 3.5%.

281 citations


Journal ArticleDOI
TL;DR: In an a posteriori analysis, it is shown how the features selected during classification can be ranked according to their discriminative power, revealing the most important ones.

Book ChapterDOI
05 Sep 2011
TL;DR: This work proposes to employ "oblique" random forests (oRF) built from multivariate trees which explicitly learn optimal split directions at internal nodes using linear discriminative models, rather than using random coefficients as the original oRF.
Abstract: In his original paper on random forests, Breiman proposed two different decision tree ensembles: one generated from "orthogonal" trees with thresholds on individual features in every split, and one from "oblique" trees separating the feature space by randomly oriented hyperplanes. In spite of a rising interest in the random forest framework, however, ensembles built from orthogonal trees (RF) have gained most, if not all, attention so far. In the present work we propose to employ "oblique" random forests (oRF) built from multivariate trees which explicitly learn optimal split directions at internal nodes using linear discriminative models, rather than using random coefficients as in the original oRF. This oRF outperforms RF, as well as other classifiers, on nearly all data sets but those with discrete factorial features. Learned node models perform distinctively better than random splits. An oRF feature importance score proves preferable to standard RF feature importance scores such as Gini or permutation importance. The topology of the oRF decision space appears to be smoother and better adapted to the data, resulting in improved generalization performance. Overall, the oRF proposed here may be preferred over standard RF on most learning tasks involving numerical and spectral data.
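A single "oblique" node of the kind described can be sketched with scikit-learn: a linear discriminative model (here LDA) learns a split direction on a random feature subset, and the 1-D projection is thresholded. This is a hedged toy of one node, not the full oRF training procedure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Like RF's mtry, each node only sees a random subset of features.
rng = np.random.default_rng(0)
subset = rng.choice(X.shape[1], size=3, replace=False)

# Learn the split direction discriminatively instead of drawing
# random hyperplane coefficients as in Breiman's original oblique trees.
lda = LinearDiscriminantAnalysis(n_components=1).fit(X[:, subset], y)
projection = lda.transform(X[:, subset]).ravel()

# The node routes samples by thresholding the learned 1-D projection.
goes_left = projection < 0.0
```

A full oblique forest would recurse this construction on each side of the hyperplane and bag many such trees.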

Proceedings ArticleDOI
06 Nov 2011
TL;DR: This work provides a way to incorporate structural information in the popular random forest framework for performing low-level, unary classification and provides two possibilities for integrating the structured output predictions into concise, semantic labellings.
Abstract: In this paper we propose a simple and effective way to integrate structural information in random forests for semantic image labelling. By structural information we refer to the inherently available, topological distribution of object classes in a given image. Different object class labels will not be randomly distributed over an image but usually form coherently labelled regions. In this work we provide a way to incorporate this topological information in the popular random forest framework for performing low-level, unary classification. Our paper has several contributions: First, we show how random forests can be augmented with structured label information. In the second part, we introduce a novel data splitting function that exploits the joint distributions observed in the structured label space for learning typical label transitions between object classes. Finally, we provide two possibilities for integrating the structured output predictions into concise, semantic labellings. In our experiments on the challenging MSRC and CamVid databases, we compare our method to standard random forest and conditional random field classification results.


Journal ArticleDOI
TL;DR: The genesis of, and motivation for, the random forest paradigm as an outgrowth from earlier tree‐structured techniques is outlined and an illustrative example from ecology is provided that showcases the improved fit and enhanced interpretation afforded by the random Forest framework.
Abstract: Random forests have emerged as a versatile and highly accurate classification and regression methodology, requiring little tuning and providing interpretable outputs. Here, we briefly outline the genesis of, and motivation for, the random forest paradigm as an outgrowth from earlier tree-structured techniques. We elaborate on aspects of prediction error and attendant tuning parameter issues. However, our emphasis is on extending the random forest schema to the multiple response setting. We provide a simple illustrative example from ecology that showcases the improved fit and enhanced interpretation afforded by the random forest framework.

Journal ArticleDOI
TL;DR: Random forest algorithms as well as nearest neighbor approaches are valid machine learning methods for estimating individual probabilities for binary responses.
Abstract: Summary Background—Most machine learning approaches only provide a classification for binary responses. However, probabilities are required for risk estimation using individual patient characteristics. It has been shown recently that every statistical learning machine known to be consistent for a nonparametric regression problem is a probability machine that is provably consistent for this estimation problem. Objectives—The aim of this paper is to show how random forests and nearest neighbors can be used for consistent estimation of individual probabilities. Methods—Two random forest algorithms and two nearest neighbor algorithms are described in detail for estimation of individual probabilities. We discuss the consistency of random forests, nearest neighbors and other learning machines in detail. We conduct a simulation study to illustrate the validity of the methods. We exemplify the algorithms by analyzing two well-known data sets on the diagnosis of appendicitis and the diagnosis of diabetes in Pima Indians. Results—Simulations demonstrate the validity of the method. With the real data application, we show the accuracy and practicality of this approach. We provide sample code from R packages in which the probability estimation is already available. This means that all calculations can be performed using existing software. Conclusions—Random forest algorithms as well as nearest neighbor approaches are valid machine learning methods for estimating individual probabilities for binary responses. Freely available implementations are available in R and may be used for applications.
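The "probability machine" idea above — a consistent nonparametric regression learner applied to a 0/1 response estimates P(y = 1 | x) — can be sketched with a regression forest; the synthetic data is a stand-in for the appendicitis/diabetes sets, and the forest size is an assumption:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic binary-outcome data standing in for the paper's clinical sets.
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Regression forest on the 0/1 response: leaf averages of the binary
# target are direct estimates of the individual probability P(y=1 | x).
rf = RandomForestRegressor(n_estimators=300, random_state=0)
rf.fit(X_tr, y_tr.astype(float))
p_hat = np.clip(rf.predict(X_te), 0.0, 1.0)

# Brier score: mean squared error of the probability estimates.
brier = np.mean((p_hat - y_te) ** 2)
```

A classification forest's `predict_proba` (vote fractions) is a related but distinct estimator; the regression formulation is the one the consistency argument covers.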

Journal ArticleDOI
TL;DR: In this paper, an automated classification of the Hipparcos periodic variable stars into 26 types is evaluated; the main misclassification cases arise from confusion between SPB and ACV blue variables and between eclipsing binaries, ellipsoidal variables and other variability types.
Abstract: We present an evaluation of the performance of an automated classification of the Hipparcos periodic variable stars into 26 types. The sub-sample with the most reliable variability types available in the literature is used to train supervised algorithms to characterize the type dependencies on a number of attributes. The most useful attributes evaluated with the random forest methodology include, in decreasing order of importance, the period, the amplitude, the V − I colour index, the absolute magnitude, the residual around the folded light-curve model, the magnitude distribution skewness and the amplitude of the second harmonic of the Fourier series model relative to that of the fundamental frequency. Random forests and a multistage scheme involving Bayesian network and Gaussian mixture methods lead to statistically equivalent results. In standard 10-fold cross-validation (CV) experiments, the rate of correct classification is between 90 and 100 per cent, depending on the variability type. The main mis-classification cases, up to a rate of about 10 per cent, arise due to confusion between SPB and ACV blue variables and between eclipsing binaries, ellipsoidal variables and other variability types. Our training set and the predicted types for the other Hipparcos periodic stars are available online.

Proceedings ArticleDOI
03 Oct 2011
TL;DR: This paper uses the German Traffic Sign Benchmark data set to evaluate the performance of K-d trees and Random Forests for traffic sign classification using different size Histogram of Oriented Gradients (HOG) descriptors and Distance Transforms.
Abstract: In this paper, we evaluate the performance of K-d trees and Random Forests for traffic sign classification using different size Histogram of Oriented Gradients (HOG) descriptors and Distance Transforms. We use the German Traffic Sign Benchmark data set [1] containing 43 classes and more than 50,000 images. The K-d tree is fast to build and search in. We combine the tree classifiers with the HOG descriptors as well as the Distance Transforms and achieve classification rates of up to 97% and 81.8% respectively.
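The k-d tree classification step can be sketched with scikit-learn's nearest-neighbour classifier backed by a k-d tree. As a hedged stand-in, the digits dataset replaces the GTSRB traffic signs and raw pixel intensities replace the HOG descriptors; the search mechanics are the same:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in: 8x8 digit images instead of GTSRB signs, raw pixels
# instead of HOG descriptors.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Build a k-d tree over the training descriptors; classification is a
# majority vote among the k nearest neighbours found by tree search.
knn = KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree")
knn.fit(X_tr, y_tr)
acc = knn.score(X_te, y_te)
```

In high-dimensional descriptor spaces, k-d tree search degrades toward brute force, which is one reason the paper also evaluates Random Forests.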

Book ChapterDOI
03 Jul 2011
TL;DR: The entangled decision forest (EDF) is proposed as a new discriminative classifier that augments the state-of-the-art decision forest, resulting in higher prediction accuracy and shortened decision time; randomness is injected in a guided way, with node feature types and parameters randomly drawn from a learned (non-uniform) distribution.
Abstract: This work addresses the challenging problem of simultaneously segmenting multiple anatomical structures in highly varied CT scans. We propose the entangled decision forest (EDF) as a new discriminative classifier which augments the state of the art decision forest, resulting in higher prediction accuracy and shortened decision time. Our main contribution is two-fold. First, we propose entangling the binary tests applied at each tree node in the forest, such that the test result can depend on the result of tests applied earlier in the same tree and at image points offset from the voxel to be classified. This is demonstrated to improve accuracy and capture long-range semantic context. Second, during training, we propose injecting randomness in a guided way, in which node feature types and parameters are randomly drawn from a learned (non-uniform) distribution. This further improves classification accuracy. We assess our probabilistic anatomy segmentation technique using a labeled database of CT image volumes of 250 different patients from various scan protocols and scanner vendors. In each volume, 12 anatomical structures have been manually segmented. The database comprises highly varied body shapes and sizes, a wide array of pathologies, scan resolutions, and diverse contrast agents. Quantitative comparisons with state of the art algorithms demonstrate both superior test accuracy and computational efficiency.

Journal ArticleDOI
TL;DR: This work discusses effective ways to regularize forests and how to properly tune the RF parameters ‘nodesize’ and ‘mtry’, and introduces new graphical ways of using minimal depth for exploring variable relationships.
Abstract: Minimal depth is a dimensionless order statistic that measures the predictiveness of a variable in a survival tree. It can be used to select variables in high-dimensional problems using Random Survival Forests (RSF), a new extension of Breiman's Random Forests (RF) to survival settings. We review this methodology and demonstrate its use in high-dimensional survival problems using the public domain R-language package randomSurvivalForest. We discuss effective ways to regularize forests and how to properly tune the RF parameters ‘nodesize’ and ‘mtry’. We also introduce new graphical ways of using minimal depth for exploring variable relationships. © 2011 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 4: 115–132, 2011
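The minimal-depth statistic can be sketched by traversing fitted trees: for each variable, record the depth of the first node that splits on it along any root-to-leaf path, then average over the forest. As a hedged simplification, this runs on a scikit-learn classification forest rather than a Random Survival Forest (the survival setting is not reproduced):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def minimal_depth(tree, feature):
    """Depth of the shallowest split on `feature` in one fitted tree
    (np.inf if the tree never uses it)."""
    t = tree.tree_
    best = np.inf
    stack = [(0, 0)]                       # (node_id, depth)
    while stack:
        node, depth = stack.pop()
        if t.children_left[node] == -1:    # leaf node
            continue
        if t.feature[node] == feature:
            best = min(best, depth)
            continue                       # deeper hits cannot be shallower
        stack.append((t.children_left[node], depth + 1))
        stack.append((t.children_right[node], depth + 1))
    return best

X, y = make_classification(n_samples=500, n_features=10, n_informative=2,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Forest-averaged minimal depth per feature; small values = predictive.
avg_depth = []
for f in range(X.shape[1]):
    d = np.array([minimal_depth(est, f) for est in rf.estimators_])
    finite = d[np.isfinite(d)]
    avg_depth.append(finite.mean() if finite.size else np.inf)
```

Variables with small average minimal depth split near the roots of many trees, which is the basis of minimal-depth variable selection.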

Journal ArticleDOI
TL;DR: A novel multistep algorithm is developed that builds regression models of drug response using Random Forest, an ensemble approach based on classification and regression trees (CART) and has general utility for any application that seeks to relate gene expression data to a continuous output variable.
Abstract: Motivation: Panels of cell lines such as the NCI-60 have long been used to test drug candidates for their ability to inhibit proliferation. Predictive models of in vitro drug sensitivity have previously been constructed using gene expression signatures generated from gene expression microarrays. These statistical models allow the prediction of drug response for cell lines not in the original NCI-60. We improve on existing techniques by developing a novel multistep algorithm that builds regression models of drug response using Random Forest, an ensemble approach based on classification and regression trees (CART). Results: This method proved successful in predicting drug response for panels of 19 breast cancer and 7 glioma cell lines, outperformed other methods based on differential gene expression, and has general utility for any application that seeks to relate gene expression data to a continuous output variable. Implementation: Software was written in the R language and will be available together with associated gene expression and drug response data as the package ivDrug at http://r-forge.r-project.org. Contact: riddickgp@mail.nih.gov Supplementary Information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: The experiments show that the proposed HMC system outperforms the best-performing approach from the literature (a collection of SVMs, each predicting one label at the lowest level of the hierarchy), both in terms of error and efficiency.

Book ChapterDOI
TL;DR: This chapter briefly reviews decision trees and related ensemble algorithms, shows successful applications of such approaches to biological problems, and aims to provide a platform to bridge the gap between biologists and computer scientists.
Abstract: Machine learning approaches have wide applications in bioinformatics, and decision tree is one of the successful approaches applied in this field. In this chapter, we briefly review decision tree and related ensemble algorithms and show the successful applications of such approaches on solving biological problems. We hope that by learning the algorithms of decision trees and ensemble classifiers, biologists can get the basic ideas of how machine learning algorithms work. On the other hand, by being exposed to the applications of decision trees and ensemble algorithms in bioinformatics, computer scientists can get better ideas of which bioinformatics topics they may work on in their future research directions. We aim to provide a platform to bridge the gap between biologists and computer scientists.

26 Jan 2011
TL;DR: In this paper, the authors investigate Random Forests as a low-cost alternative to Gradient Boosted Regression Trees, show that it yields surprisingly accurate ranking results, and combine the two algorithms by first learning a ranking function with Random Forests and using it as initialization for GBRT.
Abstract: In May 2010 Yahoo! Inc. hosted the Learning to Rank Challenge. This paper summarizes the approach by the highly placed team Washington University in St. Louis. We investigate Random Forests (RF) as a low-cost alternative algorithm to Gradient Boosted Regression Trees (GBRT) (the de facto standard of web-search ranking). We demonstrate that it yields surprisingly accurate ranking results -- comparable to or better than GBRT. We combine the two algorithms by first learning a ranking function with RF and using it as initialization for GBRT. We refer to this setting as iGBRT. Following a recent discussion by Li et al. (2007), we show that the results of iGBRT can be improved upon even further when the web-search ranking task is cast as classification instead of regression. We provide an upper bound of the Expected Reciprocal Rank (Chapelle et al., 2009) in terms of classification error and demonstrate that iGBRT outperforms GBRT and RF on the Microsoft Learning to Rank and Yahoo Ranking Competition data sets with surprising consistency.
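The iGBRT idea — run gradient boosting, but start from a pre-trained Random Forest instead of from zero — reduces, for squared loss, to fitting each new tree on the residuals of the current combined model. The following is a simplified stdlib sketch of that scheme (one-split stumps instead of full trees, a constant stand-in for the trained RF, and invented data), not the authors' implementation:

```python
def fit_stump(x, y):
    """Best single threshold on a 1-D feature, squared-error criterion."""
    best = None
    for t in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((yi - ml) ** 2 for yi in left)
               + sum((yi - mr) ** 2 for yi in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:
        m = sum(y) / len(y)
        return lambda xi: m
    _, t, ml, mr = best
    return lambda xi: ml if xi <= t else mr

def boost_from_init(x, y, init_model, n_rounds=20, lr=0.3):
    """Squared-loss gradient boosting started from a given base model
    (the iGBRT idea: init_model would be a trained Random Forest)."""
    stumps = []
    def predict(xi):
        return init_model(xi) + sum(lr * s(xi) for s in stumps)
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is just the residual.
        residuals = [yi - predict(xi) for xi, yi in zip(x, y)]
        stumps.append(fit_stump(x, residuals))
    return predict

# Stand-in for a pre-trained RF ranker: a crude constant model.
rf_like = lambda xi: 0.5
x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 0.0, 1.0, 1.0]
model = boost_from_init(x, y, rf_like)
```

Starting from a good initialization means the boosting rounds only need to correct the RF's residual errors, which is why the combination can outperform either method alone.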

Proceedings ArticleDOI
06 Nov 2011
TL;DR: This paper introduces a new formulation for discrete image labeling tasks, the Decision Tree Field (DTF), that combines and generalizes random forests and conditional random fields which have been widely used in computer vision.
Abstract: This paper introduces a new formulation for discrete image labeling tasks, the Decision Tree Field (DTF), that combines and generalizes random forests and conditional random fields (CRF) which have been widely used in computer vision. In a typical CRF model the unary potentials are derived from sophisticated random forest or boosting based classifiers, however, the pairwise potentials are assumed to (1) have a simple parametric form with a pre-specified and fixed dependence on the image data, and (2) to be defined on the basis of a small and fixed neighborhood. In contrast, in DTF, local interactions between multiple variables are determined by means of decision trees evaluated on the image data, allowing the interactions to be adapted to the image content. This results in powerful graphical models which are able to represent complex label structure. Our key technical contribution is to show that the DTF model can be trained efficiently and jointly using a convex approximate likelihood function, enabling us to learn over a million free model parameters. We show experimentally that for applications which have a rich and complex label structure, our model achieves excellent results.

Journal ArticleDOI
TL;DR: The results of this study suggest that the best method for genome-wide prediction may depend on the genetic basis of the population analyzed, and among the different alternatives proposed to analyze discrete traits, machine-learning showed some advantages over Bayesian regressions.
Abstract: Genomic selection has gained much attention and the main goal is to increase the predictive accuracy and the genetic gain in livestock using dense marker information. Most methods dealing with the large p (number of covariates) small n (number of observations) problem have dealt only with continuous traits, but there are many important traits in livestock that are recorded in a discrete fashion (e.g. pregnancy outcome, disease resistance). It is necessary to evaluate alternatives to analyze discrete traits in a genome-wide prediction context. This study shows two threshold versions of Bayesian regressions (Bayes A and Bayesian LASSO) and two machine learning algorithms (boosting and random forest) to analyze discrete traits in a genome-wide prediction context. These methods were evaluated using simulated and field data to predict yet-to-be observed records. Performances were compared based on the models' predictive ability. The simulation showed that machine learning had some advantages over Bayesian regressions when a small number of QTL regulated the trait under pure additivity. However, differences were small and disappeared with a large number of QTL. Bayesian threshold LASSO and boosting achieved the highest accuracies, whereas Random Forest presented the highest classification performance. Random Forest was the most consistent method in detecting resistant and susceptible animals; its phi correlation was up to 81% greater than that of the Bayesian regressions. Random Forest outperformed other methods in correctly classifying resistant and susceptible animals in the two pure swine lines evaluated. Boosting and Bayes A were more accurate with crossbred data. The results of this study suggest that the best method for genome-wide prediction may depend on the genetic basis of the population analyzed. All methods were less accurate at correctly classifying intermediate animals than extreme animals.
Among the different alternatives proposed to analyze discrete traits, machine learning showed some advantages over Bayesian regressions. Boosting with a pseudo-Huber loss function showed high accuracy, whereas Random Forest produced more consistent results and an interesting predictive ability. Nonetheless, the best method may be case-dependent, and an initial evaluation of different methods is recommended when dealing with a particular problem.
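The phi correlation used above to compare classifiers is the Matthews correlation coefficient of the 2x2 confusion matrix between predicted and observed classes. An illustrative stdlib computation (not the paper's code):

```python
import math

def phi_coefficient(y_true, y_pred):
    """Phi (Matthews) correlation for binary labels in {0, 1}:
    (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Perfect agreement -> 1; perfect disagreement -> -1; chance -> near 0.
print(phi_coefficient([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
```

Unlike raw accuracy, phi stays informative when the two classes (e.g. resistant vs. susceptible animals) are unbalanced, which is why it suits this comparison.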

Journal ArticleDOI
01 Jan 2011 - Surgery
TL;DR: It is demonstrated that random forest can predict acute appendicitis with good accuracy and, deployed appropriately, can be an effective tool in clinical decision making.

Journal ArticleDOI
TL;DR: In this article, two rotation-based ensemble classifiers are proposed as modeling techniques for customer churn prediction, namely Rotation Forests and RotBoost, which are applied to feature subsets in order to rotate the input data for training base classifiers, while RotBoost combines Rotation Forest with AdaBoost.
Abstract: Several studies have demonstrated the superior performance of ensemble classification algorithms, whereby multiple member classifiers are combined into one aggregated and powerful classification model, over single models. In this paper, two rotation-based ensemble classifiers are proposed as modeling techniques for customer churn prediction. In Rotation Forests, feature extraction is applied to feature subsets in order to rotate the input data for training base classifiers, while RotBoost combines Rotation Forest with AdaBoost. In an experimental validation based on data sets from four real-life customer churn prediction projects, Rotation Forest and RotBoost are compared to a set of well-known benchmark classifiers. Moreover, variations of Rotation Forest and RotBoost are compared, implementing three alternative feature extraction algorithms: principal component analysis (PCA), independent component analysis (ICA) and sparse random projections (SRP). The performance of the rotation-based ensemble classifiers is found to depend upon: (i) the performance criterion used to measure classification performance, and (ii) the implemented feature extraction algorithm. In terms of accuracy, RotBoost outperforms Rotation Forest, but none of the considered variations offers a clear advantage over the benchmark algorithms. However, in terms of AUC and top-decile lift, results clearly demonstrate the competitive performance of Rotation Forests compared to the benchmark algorithms. Moreover, ICA-based Rotation Forests outperform all other considered classifiers and are therefore recommended as a well-suited alternative classification technique for the prediction of customer churn that allows for improved marketing decision making.
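Of the three feature-extraction choices, sparse random projections are the simplest to sketch: an Achlioptas-style matrix with entries in {+sqrt(3), 0, -sqrt(3)} (probabilities 1/6, 2/3, 1/6) is applied to each feature subset before the base classifiers are trained. The following stdlib sketch is illustrative only; the function names and data are invented, and the paper's actual pipeline also trains a tree on each rotated data set:

```python
import random

def sparse_projection_matrix(d_in, d_out, rng):
    """Achlioptas-style sparse random projection matrix: entries are
    +sqrt(3), 0, -sqrt(3) with probabilities 1/6, 2/3, 1/6."""
    scale = 3 ** 0.5
    def entry():
        u = rng.random()
        return scale if u < 1 / 6 else (-scale if u > 5 / 6 else 0.0)
    return [[entry() for _ in range(d_in)] for _ in range(d_out)]

def rotate_subset(X, subset, rng):
    """Project one feature subset through a random matrix: a
    Rotation-Forest-style transform with SRP standing in for PCA."""
    R = sparse_projection_matrix(len(subset), len(subset), rng)
    out = []
    for row in X:
        v = [row[j] for j in subset]
        out.append([sum(rij * vj for rij, vj in zip(r, v)) for r in R])
    return out

rng = random.Random(42)
X = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.1, 0.2, 0.9]]
rotated = rotate_subset(X, subset=[0, 2], rng=rng)
```

Because each base classifier gets its own random rotation of its own feature subsets, the ensemble members stay diverse while each still sees all the training examples.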

Journal ArticleDOI
TL;DR: This work proposes a new algorithm for genomic profiling based on optimizing the area under the receiver operating characteristic curve (AUC) of the random forest (RF), which implements a backward elimination process based on the initial ranking of variables.
Abstract: Objective: Genomic profiling, the use of genetic variants at multiple loci simultaneously for the prediction of disease risk, requires the selection of a set of genetic variants that best predicts disease status. The goal of this work was to provide a new selection algorithm for genomic profiling. Methods: We propose a new algorithm for genomic profiling based on optimizing the area under the receiver operating characteristic curve (AUC) of the random forest (RF). The proposed strategy implements a backward elimination process based on the initial ranking of variables. Results and Conclusions: We demonstrate the advantage of using the AUC instead of the classification error as a measure of predictive accuracy of RF. In particular, we show that the use of the classification error is especially inappropriate when dealing with unbalanced data sets. The new procedure for variable selection and prediction, namely AUC-RF, is illustrated with data from a bladder cancer study and also with simulated data. The algorithm is publicly available as an R package, named AUCRF, at http://cran.r-project.org/.
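The AUC that AUC-RF optimizes equals the normalized Mann-Whitney U statistic: the probability that a randomly chosen case is scored above a randomly chosen control. This also shows why classification error misleads on unbalanced data, as the abstract notes: with 95% controls, the all-control classifier has only 5% error yet an AUC of 0.5. An illustrative stdlib computation (quadratic in sample size; not the AUCRF package code):

```python
def auc(scores, labels):
    """AUC as the normalized Mann-Whitney U statistic: the fraction of
    (positive, negative) pairs ranked correctly, counting ties as 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(pos) * len(neg))

# Ranking every positive above every negative gives AUC 1;
# reversing the ranking gives AUC 0; ties give 0.5.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
```

In the paper's backward-elimination loop, this quantity (computed on out-of-bag RF votes) is what decides which low-ranked variables are dropped at each step.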

Book ChapterDOI
18 Sep 2011
TL;DR: This work proposes an efficient approach for estimating location and size of multiple anatomical structures in MR scans, and adapts random ferns to produce multidimensional regression output and compares them with random regression forests.
Abstract: Automatic localization of multiple anatomical structures in medical images provides important semantic information with potential benefits to diverse clinical applications. Aiming at organ-specific attenuation correction in PET/MR imaging, we propose an efficient approach for estimating location and size of multiple anatomical structures in MR scans. Our contribution is three-fold: (1) we apply supervised regression techniques to the problem of anatomy detection and localization in whole-body MR, (2) we adapt random ferns to produce multidimensional regression output and compare them with random regression forests, and (3) introduce the use of 3D LBP descriptors in multi-channel MR Dixon sequences. The localization accuracy achieved with both fern- and forest-based approaches is evaluated by direct comparison with state of the art atlas-based registration, on ground-truth data from 33 patients. Our results demonstrate improved anatomy localization accuracy with higher efficiency and robustness.
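A random fern, the regressor the paper adapts, is much simpler than a tree: a fixed list of binary threshold tests maps each input to one of 2^depth bins, and each bin stores the mean training target. A toy stdlib sketch with invented data (real inputs would be 3D LBP descriptors from MR volumes, and the output would be a multidimensional organ location and size):

```python
import random

class RegressionFern:
    """A random fern for regression: `depth` random threshold tests map
    each input to one of 2**depth bins; each bin stores the mean target."""
    def __init__(self, n_features, depth, rng):
        self.tests = [(rng.randrange(n_features), rng.random())
                      for _ in range(depth)]
        self.bins = {}

    def _bin(self, row):
        idx = 0
        for f, t in self.tests:
            idx = (idx << 1) | (1 if row[f] > t else 0)
        return idx

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, target in zip(X, y):
            b = self._bin(row)
            sums[b] = sums.get(b, 0.0) + target
            counts[b] = counts.get(b, 0) + 1
        self.global_mean = sum(y) / len(y)  # fallback for empty bins
        self.bins = {b: sums[b] / counts[b] for b in sums}
        return self

    def predict(self, row):
        return self.bins.get(self._bin(row), self.global_mean)

rng = random.Random(7)
X = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]]
y = [10.0, 10.0, 50.0, 50.0]  # e.g. one coordinate of an organ centre
ferns = [RegressionFern(2, 3, rng).fit(X, y) for _ in range(10)]
predict = lambda row: sum(f.predict(row) for f in ferns) / len(ferns)
```

Because evaluating a fern is just `depth` comparisons and a table lookup, an ensemble of ferns is typically much cheaper than a regression forest, which is the efficiency trade-off the paper evaluates.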