
Showing papers on "Random forest published in 2011"


Journal ArticleDOI
TL;DR: Random forests have become a popular technique for classification, prediction, studying variable importance, variable selection, and outlier detection, and results of new tests regarding variable rankings based on RF variable importance measures are presented.

622 citations


Journal ArticleDOI
TL;DR: A supervised workflow is proposed in this study to reduce manual labor and objectify the choice of significant object features and classification thresholds; it resulted in accuracies between 73% and 87% for the affected areas, with approximately balanced commission and omission errors.

569 citations


Journal ArticleDOI
TL;DR: Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC), and has the advantage of computing the importance of each variable in the classification process.
Abstract: We present a method utilizing the Healthcare Cost and Utilization Project (HCUP) dataset for predicting disease risk of individuals based on their medical diagnosis history. The presented methodology may be incorporated in a variety of applications such as risk management, tailored health communication and decision support systems in healthcare. We employed the National Inpatient Sample (NIS) data, which is publicly available through HCUP, to train random forest classifiers for disease prediction. Since the HCUP data is highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples, while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting and RF in predicting the risk of eight chronic diseases. Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process. By combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP data set, we predicted eight disease categories with an average AUC of 88.79%.

521 citations
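The repeated random sub-sampling idea above can be sketched with scikit-learn; this is an illustrative toy on synthetic data standing in for the HCUP/NIS records, not the authors' exact pipeline (the number of sub-samples and forest sizes are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data standing in for the HCUP/NIS records.
X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rng = np.random.default_rng(0)
pos = np.where(y_tr == 1)[0]
neg = np.where(y_tr == 0)[0]

# Repeated random sub-sampling: each sub-sample is fully balanced,
# pairing all positives with an equal-sized random draw of negatives.
probs = []
for _ in range(10):
    idx = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_tr[idx], y_tr[idx])
    probs.append(rf.predict_proba(X_te)[:, 1])

# Average the per-forest probabilities and score by AUC, as in the paper.
auc = roc_auc_score(y_te, np.mean(probs, axis=0))
```

Averaging the sub-sample forests' probabilities keeps all majority-class information across the ensemble while each individual forest trains on balanced data.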


Journal ArticleDOI
TL;DR: The Random Forests algorithm is chosen as a classifier: it runs efficiently on large datasets, and provides measures of feature importance for each class, and the relevance of full-waveform lidar features is demonstrated for building and vegetation area discrimination.
Abstract: Airborne lidar systems have become a source for the acquisition of elevation data. They provide georeferenced, irregularly distributed 3D point clouds of high altimetric accuracy. Moreover, these systems can provide, for a single laser pulse, multiple returns or echoes, which correspond to different illuminated objects. In addition to multi-echo laser scanners, full-waveform systems are able to record 1D signals representing a train of echoes caused by reflections at different targets. These systems provide more information about the structure and the physical characteristics of the targets. Many approaches have been developed for urban mapping, based solely on aerial lidar or combined with multispectral image data. However, they have not assessed the importance of input features. In this paper, we focus on a multi-source framework using aerial lidar (multi-echo and full waveform) and aerial multispectral image data. We aim to study the feature relevance for dense urban scenes. The Random Forests algorithm is chosen as a classifier: it runs efficiently on large datasets, and provides measures of feature importance for each class. The margin theory is used as a confidence measure of the classifier, and to confirm the relevance of input features for urban classification. The quantitative results confirm the importance of the joint use of optical multispectral and lidar data. Moreover, the relevance of full-waveform lidar features is demonstrated for building and vegetation area discrimination.

394 citations


Journal ArticleDOI
TL;DR: It is demonstrated that Lasso logistic regression, fused support vector machine, group Lasso and random forest models suffer from correlation bias, and that two related methods for group selection based on feature clustering can be used to correct it.
Abstract: Motivation: Classification and feature selection of genomics or transcriptomics data is often hampered by the large number of features as compared with the small number of samples available. Moreover, features represented by probes that either have similar molecular functions (gene expression analysis) or genomic locations (DNA copy number analysis) are highly correlated. Classical model selection methods such as penalized logistic regression or random forest become unstable in the presence of high feature correlations. Sophisticated penalties such as group Lasso or fused Lasso can force the models to assign similar weights to correlated features and thus improve model stability and interpretability. In this article, we show that the measures of feature relevance corresponding to the above-mentioned methods are biased such that the weights of the features belonging to groups of correlated features decrease as the sizes of the groups increase, which leads to incorrect model interpretation and misleading feature ranking. Results: With simulation experiments, we demonstrate that Lasso logistic regression, fused support vector machine, group Lasso and random forest models suffer from correlation bias. Using simulations, we show that two related methods for group selection based on feature clustering can be used for correcting the correlation bias. These techniques also improve the stability and the accuracy of the baseline models. We apply all methods investigated to a breast cancer and a bladder cancer arrayCGH dataset in order to identify copy number aberrations predictive of tumor phenotype. Availability: R code can be found at: http://www.mpi-inf.mpg.de/~laura/Clustering.r. Contact: laura.tolosi@mpi-inf.mpg.de Supplementary information: Supplementary data are available at Bioinformatics online.

356 citations
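The group-selection idea — cluster correlated features, then fit the forest on cluster representatives — can be sketched as follows. This is a hedged toy, not the paper's exact algorithm: the clustering cutoff and the choice of the first cluster member as representative are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=40, n_informative=5,
                           random_state=0)
# Duplicate a few columns with small noise to create a correlated group.
noise = 0.05 * np.random.default_rng(0).normal(size=(300, 5))
X = np.hstack([X, X[:, :5] + noise])

# Hierarchical clustering of features on the distance 1 - |correlation|.
corr = np.corrcoef(X, rowvar=False)
dist = 1.0 - np.abs(corr)
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
clusters = fcluster(Z, t=0.3, criterion="distance")

# Keep one representative feature per cluster before fitting the forest,
# so a large group of correlated probes counts as a single candidate.
reps = [np.where(clusters == c)[0][0] for c in np.unique(clusters)]
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X[:, reps], y)
```

Because correlated duplicates collapse into one representative, the forest's importance scores are no longer diluted across the group.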


Book ChapterDOI
31 Aug 2011
TL;DR: A system is presented for estimating the location and orientation of a person's head from depth data acquired by a low-quality device; it is built on discriminative random regression forests: ensembles of random trees trained by splitting each node so as to simultaneously reduce the entropy of the class labels distribution and the variance of the head position and orientation.
Abstract: We present a system for estimating the location and orientation of a person's head from depth data acquired by a low-quality device. Our approach is based on discriminative random regression forests: ensembles of random trees trained by splitting each node so as to simultaneously reduce the entropy of the class labels distribution and the variance of the head position and orientation. We evaluate three different approaches to jointly take classification and regression performance into account during training. For evaluation, we acquired a new dataset and propose a method for its automatic annotation.

336 citations
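The joint entropy-plus-variance split objective described above can be sketched as a toy scoring function; the equal weighting `alpha` and the simple threshold form are illustrative assumptions, not the authors' exact training objective:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def joint_split_gain(x, y_cls, y_reg, thresh, alpha=0.5):
    """Score a candidate split by simultaneously reducing class-label
    entropy and regression-target variance, blended by alpha."""
    left = x < thresh
    right = ~left
    if left.sum() == 0 or right.sum() == 0:
        return -np.inf                      # degenerate split
    w_l, w_r = left.mean(), right.mean()
    ent_gain = entropy(y_cls) - (w_l * entropy(y_cls[left])
                                 + w_r * entropy(y_cls[right]))
    var_gain = y_reg.var() - (w_l * y_reg[left].var()
                              + w_r * y_reg[right].var())
    return alpha * ent_gain + (1 - alpha) * var_gain

# Toy data: class flips and the regression target both change at x = 10,
# so the split at 10 should score far better than a split at 2.
x = np.arange(20.0)
y_cls = (x >= 10).astype(int)
y_reg = x.copy()
good = joint_split_gain(x, y_cls, y_reg, 10.0)
bad = joint_split_gain(x, y_cls, y_reg, 2.0)
```

A full regression forest would evaluate this gain over many random feature/threshold candidates at every node.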


Journal ArticleDOI
TL;DR: When sensitivity, specificity and overall classification accuracy are taken into account, Random Forests and Linear Discriminant Analysis rank first among all the classifiers tested for predicting dementia from several neuropsychological tests.
Abstract: Dementia and cognitive impairment associated with aging are a major medical and social concern. Neuropsychological testing is a key element in the diagnostic procedures of Mild Cognitive Impairment (MCI), but at present it has limited value in predicting progression to dementia. We advance the hypothesis that newer statistical classification methods derived from data mining and machine learning, such as Neural Networks, Support Vector Machines and Random Forests, can improve the accuracy, sensitivity and specificity of predictions obtained from neuropsychological testing. Seven nonparametric classifiers derived from data mining methods (Multilayer Perceptron Neural Networks, Radial Basis Function Neural Networks, Support Vector Machines, CART, CHAID and QUEST Classification Trees, and Random Forests) were compared to three traditional classifiers (Linear Discriminant Analysis, Quadratic Discriminant Analysis and Logistic Regression) in terms of overall classification accuracy, specificity, sensitivity, area under the ROC curve and Press' Q. Model predictors were 10 neuropsychological tests currently used in the diagnosis of dementia. Statistical distributions of classification parameters obtained from a 5-fold cross-validation were compared using Friedman's nonparametric test. The Press' Q test showed that all classifiers performed better than chance alone (p < 0.05). Support Vector Machines showed the largest overall classification accuracy (Median (Me) = 0.76) and an area under the ROC curve of Me = 0.90; however, this method showed high specificity (Me = 1.0) but low sensitivity (Me = 0.3). Random Forests ranked second in overall accuracy (Me = 0.73), with a high area under the ROC curve (Me = 0.73), specificity (Me = 0.73) and sensitivity (Me = 0.64). Linear Discriminant Analysis also showed acceptable overall accuracy (Me = 0.66), with an acceptable area under the ROC curve (Me = 0.72), specificity (Me = 0.66) and sensitivity (Me = 0.64).
The remaining classifiers showed overall classification accuracy above a median value of 0.63, but for most of them sensitivity was around or even below a median value of 0.5. When sensitivity, specificity and overall classification accuracy are taken into account, Random Forests and Linear Discriminant Analysis rank first among all the classifiers tested for predicting dementia from several neuropsychological tests. These methods may be used to improve the accuracy, sensitivity and specificity of dementia predictions from neuropsychological testing.

331 citations
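A comparison protocol of this kind is easy to reproduce with scikit-learn. The sketch below uses the public breast-cancer dataset as a stand-in, since the study's 10 neuropsychological test scores are not available; the model hyperparameters are assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Stand-in data: the study's neuropsychological test battery is not public.
X, y = load_breast_cancer(return_X_y=True)

models = {
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": SVC(),
    "LDA": LinearDiscriminantAnalysis(),
}

# 5-fold cross-validated AUC, mirroring the paper's comparison protocol.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
          for name, m in models.items()}
```

As in the paper, a single metric such as AUC should be read alongside sensitivity and specificity before declaring a winner.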


Journal ArticleDOI
TL;DR: In this paper, an approach is presented for predicting individual tree attributes, i.e., tree height, diameter at breast height (DBH) and stem volume, based on both physical and statistical features derived from airborne laser-scanning data, utilizing a new detection method for finding individual trees together with random forests as an estimation method.
Abstract: This paper describes an approach for predicting individual tree attributes, i.e., tree height, diameter at breast height (DBH) and stem volume, based on both physical and statistical features derived from airborne laser-scanning data, utilizing a new detection method for finding individual trees together with random forests as an estimation method. The random forests (also called regression forests) technique is a nonparametric regression method consisting of a set of individual regression trees. Tests of the method were performed using 1476 trees in a boreal forest area in southern Finland and laser data with a density of 2.6 points per m². Correlation coefficients (R) between the observed and predicted values of 0.93, 0.79 and 0.87 for individual tree height, DBH and stem volume, respectively, were achieved, based on 26 laser-derived features. The corresponding relative root-mean-squared errors (RMSEs) were 10.03%, 21.35% and 45.77% (38% in best cases), which are similar to those obtained with the linear regression method, with maximum laser heights, laser-estimated DBH or crown diameters as predictors. With random forests, however, the forest models currently used for deriving the tree attributes are not needed. Based on the results, we conclude that the method is capable of providing a stable and consistent solution for determining individual tree attributes using small-footprint laser data.

324 citations
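The evaluation above (correlation coefficient R and relative RMSE from a regression forest) can be sketched on synthetic stand-ins for the 26 laser-derived features; the data-generating model and forest size here are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for 26 laser-derived features and tree heights (m).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 26))
height = 20 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=500)

rf = RandomForestRegressor(n_estimators=200, random_state=0)
pred = cross_val_predict(rf, X, height, cv=5)   # out-of-fold predictions

# The paper's two headline metrics: correlation R and relative RMSE (%).
r = np.corrcoef(height, pred)[0, 1]
rel_rmse = 100 * np.sqrt(np.mean((height - pred) ** 2)) / height.mean()
```

Using out-of-fold predictions avoids the optimistic bias of scoring a forest on its own training trees.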


Proceedings ArticleDOI
20 Jun 2011
TL;DR: Results show that the proposed random forest with discriminative decision trees algorithm identifies semantically meaningful visual information and outperforms state-of-the-art algorithms on various datasets.
Abstract: In this paper, we study the problem of fine-grained image categorization. The goal of our method is to explore fine image statistics and identify the discriminative image patches for recognition. We achieve this goal by combining two ideas, discriminative feature mining and randomization. Discriminative feature mining allows us to model the detailed information that distinguishes different classes of images, while randomization allows us to handle the huge feature space and prevents over-fitting. We propose a random forest with discriminative decision trees algorithm, where every tree node is a discriminative classifier that is trained by combining the information in this node as well as all upstream nodes. Our method is tested on both subordinate categorization and activity recognition datasets. Experimental results show that our method identifies semantically meaningful visual information and outperforms state-of-the-art algorithms on various datasets.

297 citations


Journal ArticleDOI
TL;DR: In this article, a methodology for variable-star classification using machine learning techniques is proposed, which can quickly and automatically produce calibrated classification probabilities for newly observed variables based on small numbers of time-series measurements.
Abstract: With the coming data deluge from synoptic surveys, there is a need for frameworks that can quickly and automatically produce calibrated classification probabilities for newly observed variables based on small numbers of time-series measurements. In this paper, we introduce a methodology for variable-star classification, drawing from modern machine-learning techniques. We describe how to homogenize the information gleaned from light curves by selection and computation of real-numbered metrics (features), detail methods to robustly estimate periodic features, introduce tree-ensemble methods for accurate variable-star classification, and show how to rigorously evaluate a classifier using cross validation. On a 25-class data set of 1542 well-studied variable stars, we achieve a 22.8% error rate using the random forest (RF) classifier; this represents a 24% improvement over the best previous classifier on these data. This methodology is effective for identifying samples of specific science classes: for pulsational variables used in Milky Way tomography we obtain a discovery efficiency of 98.2% and for eclipsing systems we find an efficiency of 99.1%, both at 95% purity. The RF classifier is superior to other methods in terms of accuracy, speed, and relative immunity to irrelevant features; the RF can also be used to estimate the importance of each feature in classification. Additionally, we present the first astronomical use of hierarchical classification methods to incorporate a known class taxonomy in the classifier, which reduces the catastrophic error rate from 8% to 7.8%. Excluding low-amplitude sources, the overall error rate improves to 14%, with a catastrophic error rate of 3.5%.

281 citations


Journal ArticleDOI
TL;DR: In an a posteriori analysis, it is shown how the features selected during classification can be ranked according to their discriminative power, revealing the most important ones.

Book ChapterDOI
05 Sep 2011
TL;DR: This work proposes to employ "oblique" random forests (oRF) built from multivariate trees which explicitly learn optimal split directions at internal nodes using linear discriminative models, rather than using random coefficients as the original oRF.
Abstract: In his original paper on random forests, Breiman proposed two different decision tree ensembles: one generated from "orthogonal" trees with thresholds on individual features in every split, and one from "oblique" trees separating the feature space by randomly oriented hyperplanes. In spite of a rising interest in the random forest framework, however, ensembles built from orthogonal trees (RF) have gained most, if not all, attention so far. In the present work we propose to employ "oblique" random forests (oRF) built from multivariate trees which explicitly learn optimal split directions at internal nodes using linear discriminative models, rather than using random coefficients as in the original oRF. This oRF outperforms RF, as well as other classifiers, on nearly all data sets but those with discrete factorial features. Learned node models perform distinctively better than random splits. An oRF feature importance score proves preferable to standard RF feature importance scores such as Gini or permutation importance. The topology of the oRF decision space appears to be smoother and better adapted to the data, resulting in improved generalization performance. Overall, the oRF proposed here may be preferred over standard RF on most learning tasks involving numerical and spectral data.
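A single "oblique" node of the kind described can be sketched with scikit-learn: a linear discriminative model (here LDA) learns a split direction on a random feature subset, and the 1-D projection is thresholded. This is a hedged toy of one node, not the full oRF training procedure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Like RF's mtry, each node only sees a random subset of features.
rng = np.random.default_rng(0)
subset = rng.choice(X.shape[1], size=3, replace=False)

# Learn the split direction discriminatively instead of drawing
# random hyperplane coefficients as in Breiman's original oblique trees.
lda = LinearDiscriminantAnalysis(n_components=1).fit(X[:, subset], y)
projection = lda.transform(X[:, subset]).ravel()

# The node routes samples by thresholding the learned 1-D projection.
goes_left = projection < 0.0
```

A full oblique forest would recurse this construction on each side of the hyperplane and bag many such trees.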

Proceedings ArticleDOI
06 Nov 2011
TL;DR: This work provides a way to incorporate structural information in the popular random forest framework for performing low-level, unary classification and provides two possibilities for integrating the structured output predictions into concise, semantic labellings.
Abstract: In this paper we propose a simple and effective way to integrate structural information in random forests for semantic image labelling. By structural information we refer to the inherently available, topological distribution of object classes in a given image. Different object class labels will not be randomly distributed over an image but usually form coherently labelled regions. In this work we provide a way to incorporate this topological information in the popular random forest framework for performing low-level, unary classification. Our paper has several contributions: First, we show how random forests can be augmented with structured label information. In the second part, we introduce a novel data splitting function that exploits the joint distributions observed in the structured label space for learning typical label transitions between object classes. Finally, we provide two possibilities for integrating the structured output predictions into concise, semantic labellings. In our experiments on the challenging MSRC and CamVid databases, we compare our method to standard random forest and conditional random field classification results.


Journal ArticleDOI
TL;DR: The genesis of, and motivation for, the random forest paradigm as an outgrowth from earlier tree‐structured techniques is outlined and an illustrative example from ecology is provided that showcases the improved fit and enhanced interpretation afforded by the random Forest framework.
Abstract: Random forests have emerged as a versatile and highly accurate classification and regression methodology, requiring little tuning and providing interpretable outputs. Here, we briefly outline the genesis of, and motivation for, the random forest paradigm as an outgrowth from earlier tree-structured techniques. We elaborate on aspects of prediction error and attendant tuning parameter issues. However, our emphasis is on extending the random forest schema to the multiple response setting. We provide a simple illustrative example from ecology that showcases the improved fit and enhanced interpretation afforded by the random forest framework.

Journal ArticleDOI
TL;DR: Random forest algorithms as well as nearest neighbor approaches are valid machine learning methods for estimating individual probabilities for binary responses.
Abstract: Summary Background—Most machine learning approaches only provide a classification for binary responses. However, probabilities are required for risk estimation using individual patient characteristics. It has been shown recently that every statistical learning machine known to be consistent for a nonparametric regression problem is a probability machine that is provably consistent for this estimation problem. Objectives—The aim of this paper is to show how random forests and nearest neighbors can be used for consistent estimation of individual probabilities. Methods—Two random forest algorithms and two nearest neighbor algorithms are described in detail for estimation of individual probabilities. We discuss the consistency of random forests, nearest neighbors and other learning machines in detail. We conduct a simulation study to illustrate the validity of the methods. We exemplify the algorithms by analyzing two well-known data sets on the diagnosis of appendicitis and the diagnosis of diabetes in Pima Indians. Results—Simulations demonstrate the validity of the method. With the real data application, we show the accuracy and practicality of this approach. We provide sample code from R packages in which the probability estimation is already available. This means that all calculations can be performed using existing software. Conclusions—Random forest algorithms as well as nearest neighbor approaches are valid machine learning methods for estimating individual probabilities for binary responses. Freely available implementations are available in R and may be used for applications.
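The "probability machine" idea above — a consistent nonparametric regression learner applied to a 0/1 response estimates P(y = 1 | x) — can be sketched with a regression forest; the synthetic data is a stand-in for the appendicitis/diabetes sets, and the forest size is an assumption:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic binary-outcome data standing in for the paper's clinical sets.
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Regression forest on the 0/1 response: leaf averages of the binary
# target are direct estimates of the individual probability P(y=1 | x).
rf = RandomForestRegressor(n_estimators=300, random_state=0)
rf.fit(X_tr, y_tr.astype(float))
p_hat = np.clip(rf.predict(X_te), 0.0, 1.0)

# Brier score: mean squared error of the probability estimates.
brier = np.mean((p_hat - y_te) ** 2)
```

A classification forest's `predict_proba` (vote fractions) is a related but distinct estimator; the regression formulation is the one the consistency argument covers.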

Journal ArticleDOI
TL;DR: In this paper, an automated classification of the Hipparcos periodic variable stars into 26 types is evaluated; the main misclassification cases arise from confusion between SPB and ACV blue variables and between eclipsing binaries, ellipsoidal variables and other variability types.
Abstract: We present an evaluation of the performance of an automated classification of the Hipparcos periodic variable stars into 26 types. The sub-sample with the most reliable variability types available in the literature is used to train supervised algorithms to characterize the type dependencies on a number of attributes. The most useful attributes evaluated with the random forest methodology include, in decreasing order of importance, the period, the amplitude, the V − I colour index, the absolute magnitude, the residual around the folded light-curve model, the magnitude distribution skewness and the amplitude of the second harmonic of the Fourier series model relative to that of the fundamental frequency. Random forests and a multistage scheme involving Bayesian network and Gaussian mixture methods lead to statistically equivalent results. In standard 10-fold cross-validation (CV) experiments, the rate of correct classification is between 90 and 100 per cent, depending on the variability type. The main mis-classification cases, up to a rate of about 10 per cent, arise due to confusion between SPB and ACV blue variables and between eclipsing binaries, ellipsoidal variables and other variability types. Our training set and the predicted types for the other Hipparcos periodic stars are available online.

Proceedings ArticleDOI
03 Oct 2011
TL;DR: This paper uses the German Traffic Sign Benchmark data set to evaluate the performance of K-d trees and Random Forests for traffic sign classification using different size Histogram of Oriented Gradients (HOG) descriptors and Distance Transforms.
Abstract: In this paper, we evaluate the performance of K-d trees and Random Forests for traffic sign classification using different size Histogram of Oriented Gradients (HOG) descriptors and Distance Transforms. We use the German Traffic Sign Benchmark data set [1] containing 43 classes and more than 50,000 images. The K-d tree is fast to build and search in. We combine the tree classifiers with the HOG descriptors as well as the Distance Transforms and achieve classification rates of up to 97% and 81.8% respectively.
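The k-d tree classification step can be sketched with scikit-learn's nearest-neighbour classifier backed by a k-d tree. As a hedged stand-in, the digits dataset replaces the GTSRB traffic signs and raw pixel intensities replace the HOG descriptors; the search mechanics are the same:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in: 8x8 digit images instead of GTSRB signs, raw pixels
# instead of HOG descriptors.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Build a k-d tree over the training descriptors; classification is a
# majority vote among the k nearest neighbours found by tree search.
knn = KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree")
knn.fit(X_tr, y_tr)
acc = knn.score(X_te, y_te)
```

In high-dimensional descriptor spaces, k-d tree search degrades toward brute force, which is one reason the paper also evaluates Random Forests.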

Book ChapterDOI
03 Jul 2011
TL;DR: The entangled decision forest (EDF) is proposed as a new discriminative classifier that augments the state-of-the-art decision forest, resulting in higher prediction accuracy and shortened decision time; randomness is injected in a guided way, with node feature types and parameters randomly drawn from a learned (non-uniform) distribution.
Abstract: This work addresses the challenging problem of simultaneously segmenting multiple anatomical structures in highly varied CT scans. We propose the entangled decision forest (EDF) as a new discriminative classifier which augments the state of the art decision forest, resulting in higher prediction accuracy and shortened decision time. Our main contribution is two-fold. First, we propose entangling the binary tests applied at each tree node in the forest, such that the test result can depend on the result of tests applied earlier in the same tree and at image points offset from the voxel to be classified. This is demonstrated to improve accuracy and capture long-range semantic context. Second, during training, we propose injecting randomness in a guided way, in which node feature types and parameters are randomly drawn from a learned (non-uniform) distribution. This further improves classification accuracy. We assess our probabilistic anatomy segmentation technique using a labeled database of CT image volumes of 250 different patients from various scan protocols and scanner vendors. In each volume, 12 anatomical structures have been manually segmented. The database comprises highly varied body shapes and sizes, a wide array of pathologies, scan resolutions, and diverse contrast agents. Quantitative comparisons with state of the art algorithms demonstrate both superior test accuracy and computational efficiency.

Journal ArticleDOI
TL;DR: This work discusses effective ways to regularize forests and how to properly tune the RF parameters ‘nodesize’ and ‘mtry’, and introduces new graphical ways of using minimal depth for exploring variable relationships.
Abstract: Minimal depth is a dimensionless order statistic that measures the predictiveness of a variable in a survival tree. It can be used to select variables in high-dimensional problems using Random Survival Forests (RSF), a new extension of Breiman's Random Forests (RF) to survival settings. We review this methodology and demonstrate its use in high-dimensional survival problems using the public domain R-language package randomSurvivalForest. We discuss effective ways to regularize forests and how to properly tune the RF parameters ‘nodesize’ and ‘mtry’. We also introduce new graphical ways of using minimal depth for exploring variable relationships. © 2011 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 4: 115–132, 2011
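The minimal-depth statistic can be sketched by traversing fitted trees: for each variable, record the depth of the first node that splits on it along any root-to-leaf path, then average over the forest. As a hedged simplification, this runs on a scikit-learn classification forest rather than a Random Survival Forest (the survival setting is not reproduced):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def minimal_depth(tree, feature):
    """Depth of the shallowest split on `feature` in one fitted tree
    (np.inf if the tree never uses it)."""
    t = tree.tree_
    best = np.inf
    stack = [(0, 0)]                       # (node_id, depth)
    while stack:
        node, depth = stack.pop()
        if t.children_left[node] == -1:    # leaf node
            continue
        if t.feature[node] == feature:
            best = min(best, depth)
            continue                       # deeper hits cannot be shallower
        stack.append((t.children_left[node], depth + 1))
        stack.append((t.children_right[node], depth + 1))
    return best

X, y = make_classification(n_samples=500, n_features=10, n_informative=2,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Forest-averaged minimal depth per feature; small values = predictive.
avg_depth = []
for f in range(X.shape[1]):
    d = np.array([minimal_depth(est, f) for est in rf.estimators_])
    finite = d[np.isfinite(d)]
    avg_depth.append(finite.mean() if finite.size else np.inf)
```

Variables with small average minimal depth split near the roots of many trees, which is the basis of minimal-depth variable selection.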

Journal ArticleDOI
TL;DR: A novel multistep algorithm is developed that builds regression models of drug response using Random Forest, an ensemble approach based on classification and regression trees (CART) and has general utility for any application that seeks to relate gene expression data to a continuous output variable.
Abstract: Motivation: Panels of cell lines such as the NCI-60 have long been used to test drug candidates for their ability to inhibit proliferation. Predictive models of in vitro drug sensitivity have previously been constructed using gene expression signatures generated from gene expression microarrays. These statistical models allow the prediction of drug response for cell lines not in the original NCI-60. We improve on existing techniques by developing a novel multistep algorithm that builds regression models of drug response using Random Forest, an ensemble approach based on classification and regression trees (CART). Results: This method proved successful in predicting drug response for panels of 19 breast cancer and 7 glioma cell lines, outperformed other methods based on differential gene expression, and has general utility for any application that seeks to relate gene expression data to a continuous output variable. Implementation: Software was written in the R language and will be available together with associated gene expression and drug response data as the package ivDrug at http://r-forge.r-project.org. Contact: riddickgp@mail.nih.gov Supplementary Information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: The experiments show that the proposed HMC system outperforms the best-performing approach from the literature (a collection of SVMs, each predicting one label at the lowest level of the hierarchy), both in terms of error and efficiency.

Book ChapterDOI
TL;DR: This chapter briefly reviews decision trees and related ensemble algorithms, shows successful applications of such approaches to biological problems, and aims to provide a platform to bridge the gap between biologists and computer scientists.
Abstract: Machine learning approaches have wide applications in bioinformatics, and decision tree is one of the successful approaches applied in this field. In this chapter, we briefly review decision tree and related ensemble algorithms and show the successful applications of such approaches on solving biological problems. We hope that by learning the algorithms of decision trees and ensemble classifiers, biologists can get the basic ideas of how machine learning algorithms work. On the other hand, by being exposed to the applications of decision trees and ensemble algorithms in bioinformatics, computer scientists can get better ideas of which bioinformatics topics they may work on in their future research directions. We aim to provide a platform to bridge the gap between biologists and computer scientists.

26 Jan 2011
TL;DR: In this paper, the authors investigate Random Forests as a low-cost alternative to Gradient Boosted Regression Trees, show that it yields surprisingly accurate ranking results, and combine the two algorithms by first learning a ranking function with Random Forests and using it as initialization for GBRT.
Abstract: In May 2010 Yahoo! Inc. hosted the Learning to Rank Challenge. This paper summarizes the approach by the highly placed team Washington University in St. Louis. We investigate Random Forests (RF) as a low-cost alternative algorithm to Gradient Boosted Regression Trees (GBRT) (the de facto standard of web-search ranking). We demonstrate that it yields surprisingly accurate ranking results -- comparable to or better than GBRT. We combine the two algorithms by first learning a ranking function with RF and using it as initialization for GBRT. We refer to this setting as iGBRT. Following a recent discussion by Li et al. (2007), we show that the results of iGBRT can be improved upon even further when the web-search ranking task is cast as classification instead of regression. We provide an upper bound of the Expected Reciprocal Rank (Chapelle et al., 2009) in terms of classification error and demonstrate that iGBRT outperforms GBRT and RF on the Microsoft Learning to Rank and Yahoo Ranking Competition data sets with surprising consistency.
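The iGBRT idea — run gradient boosting, but start from a pre-trained Random Forest instead of from zero — reduces, for squared loss, to fitting each new tree on the residuals of the current combined model. The following is a simplified stdlib sketch of that scheme (one-split stumps instead of full trees, a constant stand-in for the trained RF, and invented data), not the authors' implementation:

```python
def fit_stump(x, y):
    """Best single threshold on a 1-D feature, squared-error criterion."""
    best = None
    for t in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((yi - ml) ** 2 for yi in left)
               + sum((yi - mr) ** 2 for yi in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:
        m = sum(y) / len(y)
        return lambda xi: m
    _, t, ml, mr = best
    return lambda xi: ml if xi <= t else mr

def boost_from_init(x, y, init_model, n_rounds=20, lr=0.3):
    """Squared-loss gradient boosting started from a given base model
    (the iGBRT idea: init_model would be a trained Random Forest)."""
    stumps = []
    def predict(xi):
        return init_model(xi) + sum(lr * s(xi) for s in stumps)
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is just the residual.
        residuals = [yi - predict(xi) for xi, yi in zip(x, y)]
        stumps.append(fit_stump(x, residuals))
    return predict

# Stand-in for a pre-trained RF ranker: a crude constant model.
rf_like = lambda xi: 0.5
x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 0.0, 1.0, 1.0]
model = boost_from_init(x, y, rf_like)
```

Starting from a good initialization means the boosting rounds only need to correct the RF's residual errors, which is why the combination can outperform either method alone.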

Proceedings ArticleDOI
06 Nov 2011
TL;DR: This paper introduces a new formulation for discrete image labeling tasks, the Decision Tree Field (DTF), that combines and generalizes random forests and conditional random fields which have been widely used in computer vision.
Abstract: This paper introduces a new formulation for discrete image labeling tasks, the Decision Tree Field (DTF), that combines and generalizes random forests and conditional random fields (CRF) which have been widely used in computer vision. In a typical CRF model the unary potentials are derived from sophisticated random forest or boosting based classifiers, however, the pairwise potentials are assumed to (1) have a simple parametric form with a pre-specified and fixed dependence on the image data, and (2) to be defined on the basis of a small and fixed neighborhood. In contrast, in DTF, local interactions between multiple variables are determined by means of decision trees evaluated on the image data, allowing the interactions to be adapted to the image content. This results in powerful graphical models which are able to represent complex label structure. Our key technical contribution is to show that the DTF model can be trained efficiently and jointly using a convex approximate likelihood function, enabling us to learn over a million free model parameters. We show experimentally that for applications which have a rich and complex label structure, our model achieves excellent results.

Journal ArticleDOI
TL;DR: The results of this study suggest that the best method for genome-wide prediction may depend on the genetic basis of the population analyzed, and among the different alternatives proposed to analyze discrete traits, machine-learning showed some advantages over Bayesian regressions.
Abstract: Genomic selection has gained much attention and the main goal is to increase the predictive accuracy and the genetic gain in livestock using dense marker information. Most methods dealing with the large p (number of covariates) small n (number of observations) problem have dealt only with continuous traits, but there are many important traits in livestock that are recorded in a discrete fashion (e.g. pregnancy outcome, disease resistance). It is necessary to evaluate alternatives to analyze discrete traits in a genome-wide prediction context. This study shows two threshold versions of Bayesian regressions (Bayes A and Bayesian LASSO) and two machine learning algorithms (boosting and random forest) to analyze discrete traits in a genome-wide prediction context. These methods were evaluated using simulated and field data to predict yet-to-be observed records. Performances were compared based on the models' predictive ability. The simulation showed that machine learning had some advantages over Bayesian regressions when a small number of QTL regulated the trait under pure additivity. However, differences were small and disappeared with a large number of QTL. Bayesian threshold LASSO and boosting achieved the highest accuracies, whereas Random Forest presented the highest classification performance. Random Forest was the most consistent method in detecting resistant and susceptible animals; its phi correlation was up to 81% greater than that of the Bayesian regressions. Random Forest outperformed other methods in correctly classifying resistant and susceptible animals in the two pure swine lines evaluated. Boosting and Bayes A were more accurate with crossbred data. The results of this study suggest that the best method for genome-wide prediction may depend on the genetic basis of the population analyzed. All methods were less accurate at correctly classifying intermediate animals than extreme animals.
Among the different alternatives proposed to analyze discrete traits, machine learning showed some advantages over Bayesian regressions. Boosting with a pseudo-Huber loss function showed high accuracy, whereas Random Forest produced more consistent results and an interesting predictive ability. Nonetheless, the best method may be case-dependent, and an initial evaluation of different methods is recommended when dealing with a particular problem.
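The phi correlation used above to compare classifiers is the Matthews correlation coefficient of the 2x2 confusion matrix between predicted and observed classes. An illustrative stdlib computation (not the paper's code):

```python
import math

def phi_coefficient(y_true, y_pred):
    """Phi (Matthews) correlation for binary labels in {0, 1}:
    (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Perfect agreement -> 1; perfect disagreement -> -1; chance -> near 0.
print(phi_coefficient([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
```

Unlike raw accuracy, phi stays informative when the two classes (e.g. resistant vs. susceptible animals) are unbalanced, which is why it suits this comparison.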

Journal ArticleDOI
01 Jan 2011 - Surgery
TL;DR: It is demonstrated that random forest can predict acute appendicitis with good accuracy and, deployed appropriately, can be an effective tool in clinical decision making.

Journal ArticleDOI
TL;DR: In this article, two rotation-based ensemble classifiers are proposed as modeling techniques for customer churn prediction, namely Rotation Forests and RotBoost, which are applied to feature subsets in order to rotate the input data for training base classifiers, while RotBoost combines Rotation Forest with AdaBoost.
Abstract: Several studies have demonstrated the superior performance of ensemble classification algorithms, whereby multiple member classifiers are combined into one aggregated and powerful classification model, over single models. In this paper, two rotation-based ensemble classifiers are proposed as modeling techniques for customer churn prediction. In Rotation Forests, feature extraction is applied to feature subsets in order to rotate the input data for training base classifiers, while RotBoost combines Rotation Forest with AdaBoost. In an experimental validation based on data sets from four real-life customer churn prediction projects, Rotation Forest and RotBoost are compared to a set of well-known benchmark classifiers. Moreover, variations of Rotation Forest and RotBoost are compared, implementing three alternative feature extraction algorithms: principal component analysis (PCA), independent component analysis (ICA) and sparse random projections (SRP). The performance of the rotation-based ensemble classifiers is found to depend upon: (i) the performance criterion used to measure classification performance, and (ii) the implemented feature extraction algorithm. In terms of accuracy, RotBoost outperforms Rotation Forest, but none of the considered variations offers a clear advantage over the benchmark algorithms. However, in terms of AUC and top-decile lift, results clearly demonstrate the competitive performance of Rotation Forests compared to the benchmark algorithms. Moreover, ICA-based Rotation Forests outperform all other considered classifiers and are therefore recommended as a well-suited alternative classification technique for the prediction of customer churn that allows for improved marketing decision making.
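Of the three feature-extraction choices, sparse random projections are the simplest to sketch: an Achlioptas-style matrix with entries in {+sqrt(3), 0, -sqrt(3)} (probabilities 1/6, 2/3, 1/6) is applied to each feature subset before the base classifiers are trained. The following stdlib sketch is illustrative only; the function names and data are invented, and the paper's actual pipeline also trains a tree on each rotated data set:

```python
import random

def sparse_projection_matrix(d_in, d_out, rng):
    """Achlioptas-style sparse random projection matrix: entries are
    +sqrt(3), 0, -sqrt(3) with probabilities 1/6, 2/3, 1/6."""
    scale = 3 ** 0.5
    def entry():
        u = rng.random()
        return scale if u < 1 / 6 else (-scale if u > 5 / 6 else 0.0)
    return [[entry() for _ in range(d_in)] for _ in range(d_out)]

def rotate_subset(X, subset, rng):
    """Project one feature subset through a random matrix: a
    Rotation-Forest-style transform with SRP standing in for PCA."""
    R = sparse_projection_matrix(len(subset), len(subset), rng)
    out = []
    for row in X:
        v = [row[j] for j in subset]
        out.append([sum(rij * vj for rij, vj in zip(r, v)) for r in R])
    return out

rng = random.Random(42)
X = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.1, 0.2, 0.9]]
rotated = rotate_subset(X, subset=[0, 2], rng=rng)
```

Because each base classifier gets its own random rotation of its own feature subsets, the ensemble members stay diverse while each still sees all the training examples.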

Journal ArticleDOI
TL;DR: This work proposes a new algorithm for genomic profiling based on optimizing the area under the receiver operating characteristic curve (AUC) of the random forest (RF), which implements a backward elimination process based on the initial ranking of variables.
Abstract: Objective: Genomic profiling, the use of genetic variants at multiple loci simultaneously for the prediction of disease risk, requires the selection of a set of genetic variants that best predicts disease status. The goal of this work was to provide a new selection algorithm for genomic profiling. Methods: We propose a new algorithm for genomic profiling based on optimizing the area under the receiver operating characteristic curve (AUC) of the random forest (RF). The proposed strategy implements a backward elimination process based on the initial ranking of variables. Results and Conclusions: We demonstrate the advantage of using the AUC instead of the classification error as a measure of predictive accuracy of RF. In particular, we show that the use of the classification error is especially inappropriate when dealing with unbalanced data sets. The new procedure for variable selection and prediction, namely AUC-RF, is illustrated with data from a bladder cancer study and also with simulated data. The algorithm is publicly available as an R package, named AUCRF, at http://cran.r-project.org/.
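The AUC that AUC-RF optimizes equals the normalized Mann-Whitney U statistic: the probability that a randomly chosen case is scored above a randomly chosen control. This also shows why classification error misleads on unbalanced data, as the abstract notes: with 95% controls, the all-control classifier has only 5% error yet an AUC of 0.5. An illustrative stdlib computation (quadratic in sample size; not the AUCRF package code):

```python
def auc(scores, labels):
    """AUC as the normalized Mann-Whitney U statistic: the fraction of
    (positive, negative) pairs ranked correctly, counting ties as 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(pos) * len(neg))

# Ranking every positive above every negative gives AUC 1;
# reversing the ranking gives AUC 0; ties give 0.5.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
```

In the paper's backward-elimination loop, this quantity (computed on out-of-bag RF votes) is what decides which low-ranked variables are dropped at each step.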

Book ChapterDOI
18 Sep 2011
TL;DR: This work proposes an efficient approach for estimating location and size of multiple anatomical structures in MR scans, and adapts random ferns to produce multidimensional regression output and compares them with random regression forests.
Abstract: Automatic localization of multiple anatomical structures in medical images provides important semantic information with potential benefits to diverse clinical applications. Aiming at organ-specific attenuation correction in PET/MR imaging, we propose an efficient approach for estimating location and size of multiple anatomical structures in MR scans. Our contribution is three-fold: (1) we apply supervised regression techniques to the problem of anatomy detection and localization in whole-body MR, (2) we adapt random ferns to produce multidimensional regression output and compare them with random regression forests, and (3) introduce the use of 3D LBP descriptors in multi-channel MR Dixon sequences. The localization accuracy achieved with both fern- and forest-based approaches is evaluated by direct comparison with state of the art atlas-based registration, on ground-truth data from 33 patients. Our results demonstrate improved anatomy localization accuracy with higher efficiency and robustness.
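A random fern, the regressor the paper adapts, is much simpler than a tree: a fixed list of binary threshold tests maps each input to one of 2^depth bins, and each bin stores the mean training target. A toy stdlib sketch with invented data (real inputs would be 3D LBP descriptors from MR volumes, and the output would be a multidimensional organ location and size):

```python
import random

class RegressionFern:
    """A random fern for regression: `depth` random threshold tests map
    each input to one of 2**depth bins; each bin stores the mean target."""
    def __init__(self, n_features, depth, rng):
        self.tests = [(rng.randrange(n_features), rng.random())
                      for _ in range(depth)]
        self.bins = {}

    def _bin(self, row):
        idx = 0
        for f, t in self.tests:
            idx = (idx << 1) | (1 if row[f] > t else 0)
        return idx

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, target in zip(X, y):
            b = self._bin(row)
            sums[b] = sums.get(b, 0.0) + target
            counts[b] = counts.get(b, 0) + 1
        self.global_mean = sum(y) / len(y)  # fallback for empty bins
        self.bins = {b: sums[b] / counts[b] for b in sums}
        return self

    def predict(self, row):
        return self.bins.get(self._bin(row), self.global_mean)

rng = random.Random(7)
X = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]]
y = [10.0, 10.0, 50.0, 50.0]  # e.g. one coordinate of an organ centre
ferns = [RegressionFern(2, 3, rng).fit(X, y) for _ in range(10)]
predict = lambda row: sum(f.predict(row) for f in ferns) / len(ferns)
```

Because evaluating a fern is just `depth` comparisons and a table lookup, an ensemble of ferns is typically much cheaper than a regression forest, which is the efficiency trade-off the paper evaluates.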