
Showing papers on "Random forest" published in 2010


Journal ArticleDOI
TL;DR: The Boruta package provides a convenient interface to the Boruta algorithm, implementing a novel feature selection algorithm for finding all relevant variables.
Abstract: This article describes an R package, Boruta, implementing a novel feature selection algorithm for finding all relevant variables. The algorithm is designed as a wrapper around a Random Forest classification algorithm. It iteratively removes the features which are shown by a statistical test to be less relevant than random probes. The Boruta package provides a convenient interface to the algorithm. A short description of the algorithm and examples of its application are presented.

2,832 citations
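
A minimal sketch of the shadow-feature idea behind Boruta, written in Python with scikit-learn rather than the package's own R implementation; the number of rounds and the decision cutoff are illustrative assumptions (Boruta itself uses a formal statistical test):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def boruta_style_selection(X, y, n_rounds=20, seed=0):
    """Keep features whose importance repeatedly beats the best shadow probe."""
    rng = np.random.default_rng(seed)
    hits = np.zeros(X.shape[1], dtype=int)
    for _ in range(n_rounds):
        shadows = rng.permuted(X, axis=0)          # permute each column: random probes
        rf = RandomForestClassifier(n_estimators=300, random_state=seed)
        rf.fit(np.hstack([X, shadows]), y)
        imp = rf.feature_importances_
        real, shadow = imp[:X.shape[1]], imp[X.shape[1]:]
        hits += (real > shadow.max()).astype(int)
    # Boruta uses a binomial test here; a fixed cutoff keeps the sketch short.
    return hits > 0.6 * n_rounds
```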


Journal ArticleDOI
TL;DR: Focusing on random forests, the increasingly used statistical method for classification and regression problems introduced by Leo Breiman in 2001, this paper investigates two classical issues of variable selection and proposes a strategy that ranks explanatory variables by the random forests importance score and then introduces them through a stepwise ascending procedure.

1,766 citations
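
A rough Python sketch of the two-step strategy summarized above, ranking variables by forest importance and then introducing them in ascending fashion; the paper's actual thresholds and its separate interpretation and prediction steps are not reproduced here:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rank_then_stepwise(X, y, seed=0):
    """Rank variables by importance, then grow nested models and keep the
    subset with the lowest out-of-bag error (a simplified ascending search)."""
    rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=seed)
    rf.fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]   # most important first
    oob_errors = []
    for k in range(1, X.shape[1] + 1):
        sub = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=seed)
        sub.fit(X[:, order[:k]], y)
        oob_errors.append(1.0 - sub.oob_score_)          # 1 - OOB R^2 as an error proxy
    best_k = int(np.argmin(oob_errors)) + 1
    return order[:best_k], oob_errors
```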


Journal ArticleDOI
TL;DR: The authors examine the performance of various CART‐based propensity score models using simulated data and suggest that ensemble methods, especially boosted CART, may be useful for propensity score weighting.
Abstract: Machine learning techniques such as classification and regression trees (CART) have been suggested as promising alternatives to logistic regression for the estimation of propensity scores. The authors examined the performance of various CART-based propensity score models using simulated data. Hypothetical studies of varying sample sizes (n=500, 1000, 2000) with a binary exposure, continuous outcome, and 10 covariates were simulated under seven scenarios differing by degree of non-linear and non-additive associations between covariates and the exposure. Propensity score weights were estimated using logistic regression (all main effects), CART, pruned CART, and the ensemble methods of bagged CART, random forests, and boosted CART. Performance metrics included covariate balance, standard error, per cent absolute bias, and 95 per cent confidence interval (CI) coverage. All methods displayed generally acceptable performance under conditions of either non-linearity or non-additivity alone. However, under conditions of both moderate non-additivity and moderate non-linearity, logistic regression had subpar performance, whereas ensemble methods provided substantially better bias reduction and more consistent 95 per cent CI coverage. The results suggest that ensemble methods, especially boosted CART, may be useful for propensity score weighting.

713 citations
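
A hedged scikit-learn sketch of boosted-tree propensity score weighting with a simple balance check; the simulation settings, tuning values, and the exact boosting implementation used in the paper are not reproduced here:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def ipw_weights(X, treat):
    """Propensity scores from boosted trees, turned into inverse-probability weights.
    The clipping bounds and boosting settings are illustrative choices."""
    gbm = GradientBoostingClassifier(n_estimators=500, max_depth=3, learning_rate=0.01)
    ps = gbm.fit(X, treat).predict_proba(X)[:, 1]
    ps = np.clip(ps, 0.01, 0.99)                       # guard against extreme weights
    return np.where(treat == 1, 1.0 / ps, 1.0 / (1.0 - ps))

def standardized_diff(x, treat, w):
    """Weighted standardized mean difference of one covariate, a common balance metric."""
    m1 = np.average(x[treat == 1], weights=w[treat == 1])
    m0 = np.average(x[treat == 0], weights=w[treat == 0])
    pooled_sd = np.sqrt((x[treat == 1].var() + x[treat == 0].var()) / 2.0)
    return (m1 - m0) / pooled_sd
```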


Posted Content
TL;DR: In this paper, an in-depth analysis of a random forests model suggested by Breiman in the early 2000s is presented, showing that the procedure is consistent and adapts to sparsity, and that its rate of convergence depends only on the number of strong features and not on how many noise variables are present.
Abstract: Random forests are a scheme proposed by Leo Breiman in the 2000s for building a predictor ensemble with a set of decision trees that grow in randomly selected subspaces of data. Despite growing interest and practical use, there has been little exploration of the statistical properties of random forests, and little is known about the mathematical forces driving the algorithm. In this paper, we offer an in-depth analysis of a random forests model suggested by Breiman in [Bre04], which is very close to the original algorithm. We show in particular that the procedure is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present.

667 citations


Book
24 Feb 2010
TL;DR: IS reveals classic ensemble methods -- bagging, random forests, and boosting -- to be special cases of a single algorithm, thereby showing how to improve their accuracy and speed, and explains the paradox of how ensembles achieve greater accuracy on new data despite their (apparently much greater) complexity.
Abstract: Ensemble methods have been called the most influential development in Data Mining and Machine Learning in the past decade. They combine multiple models into one usually more accurate than the best of its components. Ensembles can provide a critical boost to industrial challenges -- from investment timing to drug discovery, and fraud detection to recommendation systems -- where predictive accuracy is more vital than model interpretability. Ensembles are useful with all modeling algorithms, but this book focuses on decision trees to explain them most clearly. After describing trees and their strengths and weaknesses, the authors provide an overview of regularization -- today understood to be a key reason for the superior performance of modern ensembling algorithms. The book continues with a clear description of two recent developments: Importance Sampling (IS) and Rule Ensembles (RE). IS reveals classic ensemble methods -- bagging, random forests, and boosting -- to be special cases of a single algorithm, thereby showing how to improve their accuracy and speed. REs are linear rule models derived from decision tree ensembles. They are the most interpretable version of ensembles, which is essential to applications such as credit scoring and fault diagnosis. Lastly, the authors explain the paradox of how ensembles achieve greater accuracy on new data despite their (apparently much greater) complexity. This book is aimed at novice and advanced analytic researchers and practitioners -- especially in Engineering, Statistics, and Computer Science. Those with little exposure to ensembles will learn why and how to employ this breakthrough method, and advanced practitioners will gain insight into building even more powerful models. Throughout, snippets of code in R are provided to illustrate the algorithms described and to encourage the reader to try the techniques. (edited by author)

471 citations


Book ChapterDOI
20 Sep 2010
TL;DR: This paper introduces a new, continuous parametrization of the anatomy localization task which is effectively addressed by regression forests; this is shown to be a more natural approach than classification.
Abstract: This paper proposes multi-class random regression forests as an algorithm for the efficient, automatic detection and localization of anatomical structures within three-dimensional CT scans. Regression forests are similar to the more popular classification forests, but trained to predict continuous outputs. We introduce a new, continuous parametrization of the anatomy localization task which is effectively addressed by regression forests. This is shown to be a more natural approach than classification. A single pass of our probabilistic algorithm enables the direct mapping from voxels to organ location and size, with training focusing on maximizing the confidence of output predictions. As a by-product, our method produces salient anatomical landmarks, i.e. automatically selected "anchor" regions which help localize organs of interest with high confidence. Quantitative validation is performed on a database of 100 highly variable CT scans. Localization errors are shown to be lower (and more stable) than those from global affine registration approaches. The regressor's parallelism and the simplicity of its context-rich visual features yield typical runtimes of only 1 s. Applications include semantic visual navigation, image tagging for retrieval, and initializing organ-specific processing.

343 citations
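
To illustrate the regression-forest idea of mapping voxels to continuous organ locations, here is a toy Python sketch; the features, the offset target (organ centre rather than bounding-box faces), and the confidence measure are simplifications, not the paper's algorithm:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
organ_centre = np.array([60.0, 40.0, 30.0])
voxel_pos = rng.uniform(0, 100, size=(5000, 3))                  # toy voxel coordinates
features = np.hstack([voxel_pos, rng.normal(size=(5000, 5))])    # stand-in context features
offsets = organ_centre - voxel_pos                               # continuous regression target

forest = RandomForestRegressor(n_estimators=50, min_samples_leaf=25, random_state=0)
forest.fit(features, offsets)                                    # voxels -> offsets

# Each voxel votes for the organ location; the spread of votes is a crude
# confidence proxy for the probabilistic output described in the abstract.
votes = voxel_pos + forest.predict(features)
print("estimated centre:", votes.mean(axis=0), "spread:", votes.std(axis=0))
```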


Book ChapterDOI
20 Sep 2010
TL;DR: A new variant of bagging is proposed, called leveraging bagging, which combines the simplicity of bagging with added randomization of the input and output of the classifiers.
Abstract: Bagging, boosting and Random Forests are classical ensemble methods used to improve the performance of single classifiers. They obtain superior performance by increasing the accuracy and diversity of the single classifiers. Attempts have been made to reproduce these methods in the more challenging context of evolving data streams. In this paper, we propose a new variant of bagging, called leveraging bagging. This method combines the simplicity of bagging with added randomization of the input and output of the classifiers. We test our method by performing an evaluation study on synthetic and real-world datasets comprising up to ten million examples.

305 citations
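
A simplified Python sketch of the input-randomization half of leveraging bagging: each streaming example is weighted, for every ensemble member, by a Poisson draw with mean larger than 1. The output randomization and drift handling from the paper are omitted, and the base learner and lambda value are assumptions:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

class LeveragingBaggingSketch:
    def __init__(self, n_models=10, lam=6.0, classes=(0, 1)):
        self.lam = lam
        self.classes = np.array(classes)
        self.models = [SGDClassifier() for _ in range(n_models)]   # incremental base learners

    def learn_one(self, x, y):
        x = x.reshape(1, -1)
        for m in self.models:
            k = rng.poisson(self.lam)                 # Poisson(lam) weight, lam > 1
            if k > 0:
                m.partial_fit(x, [y], classes=self.classes, sample_weight=[float(k)])

    def predict_one(self, x):
        votes = [int(m.predict(x.reshape(1, -1))[0]) for m in self.models]
        return np.bincount(votes).argmax()            # plain majority vote

# Toy stream usage
X = rng.normal(size=(1000, 5)); y = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LeveragingBaggingSketch()
for xi, yi in zip(X, y):
    clf.learn_one(xi, yi)
print(clf.predict_one(X[0]), y[0])
```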


Journal ArticleDOI
TL;DR: Experimental results clearly demonstrate that the generation of an SVM-based classifier system with RFS significantly improves overall classification accuracy as well as producer's and user's accuracies.
Abstract: The accuracy of supervised land cover classifications depends on factors such as the chosen classification algorithm, adequate training data, the input data characteristics, and the selection of features. Hyperspectral imaging provides more detailed spectral and spatial information on the land cover than other remote sensing resources. Over the past ten years, traditional and formerly widely accepted statistical classification methods have been superseded by more recent machine learning algorithms, e.g., support vector machines (SVMs), or by multiple classifier systems (MCS). This can be explained by limitations of statistical approaches with regard to high-dimensional data, multimodal classes, and often limited availability of training data. In the presented study, MCSs based on SVM and random feature selection (RFS) are applied to explore the potential of a synergetic use of the two concepts. We investigated how the number of selected features and the size of the MCS influence classification accuracy using two hyperspectral data sets, from different environmental settings. In addition, experiments were conducted with a varying number of training samples. Accuracies are compared with regular SVM and random forests. Experimental results clearly demonstrate that the generation of an SVM-based classifier system with RFS significantly improves overall classification accuracy as well as producer's and user's accuracies. In addition, the ensemble strategy results in smoother, i.e., more realistic, classification maps than those from stand-alone SVM. Findings from the experiments were successfully transferred onto an additional hyperspectral data set.

294 citations
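
A small Python sketch of the SVM-plus-random-feature-selection ensemble described above, using synthetic data in place of hyperspectral bands; the ensemble size, feature-subset size, and kernel settings are illustrative, whereas the study tunes them:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=100, n_informative=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

n_members, n_feats = 25, 15                       # illustrative MCS size / feature-subset size
members = []
for _ in range(n_members):
    idx = rng.choice(X.shape[1], size=n_feats, replace=False)   # random feature selection
    members.append((idx, SVC(kernel="rbf").fit(Xtr[:, idx], ytr)))

votes = np.array([svm.predict(Xte[:, idx]) for idx, svm in members])
pred = (votes.mean(axis=0) > 0.5).astype(int)     # majority vote of the SVM members
print("ensemble accuracy:", (pred == yte).mean())
```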


Journal ArticleDOI
TL;DR: In this article, the authors proposed a semi-ITC method for forest inventory, which overcomes the main problems related to ITC by imputing ground truth data within crown segments from the nearest neighboring segment.

246 citations


Journal ArticleDOI
TL;DR: It is found that RS with support vector machines (SVM) as the base classifier outperformed single classifiers as well as some of the most widely used classifier ensembles such as bagging, AdaBoost, random forest, and rotation forest.
Abstract: Classification of brain images obtained through functional magnetic resonance imaging (fMRI) poses a serious challenge to pattern recognition and machine learning due to the extremely large feature-to-instance ratio. This calls for revision and adaptation of the current state-of-the-art classification methods. We investigate the suitability of the random subspace (RS) ensemble method for fMRI classification. RS samples from the original feature set and builds one (base) classifier on each subset. The ensemble assigns a class label by either majority voting or averaging of output probabilities. Looking for guidelines for setting the two parameters of the method, ensemble size and feature sample size, we introduce three criteria calculated through these parameters: usability of the selected feature sets, coverage of the set of "important" features, and feature set diversity. Optimized together, these criteria work toward producing accurate and diverse individual classifiers. RS was tested on three fMRI datasets from single-subject experiments: the Haxby data (Haxby, 2001) and two datasets collected in-house. We found that RS with support vector machines (SVM) as the base classifier outperformed single classifiers as well as some of the most widely used classifier ensembles such as bagging, AdaBoost, random forest, and rotation forest. The closest rivals were the single SVM and bagging of SVM classifiers. We use kappa-error diagrams to understand the success of RS.

222 citations


Journal ArticleDOI
TL;DR: In this paper, the authors used the k-Most Similar Neighbor (k-MSN) and the Random Forest (RF) methods for the simultaneous estimation of species, diameter at breast height (DBH), height and stem volume using airborne laser scanning (ALS) data.

Journal ArticleDOI
01 Oct 2010 - Forestry
TL;DR: In a mixed temperate forest landscape in southwestern Germany, multiple remote sensing variables from aerial orthoimages, Thematic Mapper data and small footprint light detection and ranging (LiDAR) were used for plot-level nonparametric predictions of total volume and biomass, using three distance measures (Euclidean, Mahalanobis and Most Similar Neighbour) as well as a regression tree-based classifier (Random Forest).
Abstract: In a mixed temperate forest landscape in southwestern Germany, multiple remote sensing variables from aerial orthoimages, Thematic Mapper data and small footprint light detection and ranging (LiDAR) were used for plot-level nonparametric predictions of the total volume and biomass using three distance measures of Euclidean, Mahalanobis and Most Similar Neighbour as well as a regression tree-based classifier (Random Forest). The performances of nearest neighbour (NN) approaches were examined by means of relative bias and root mean squared error. The original high-dimensional dataset was pruned using an evolutionary genetic algorithm search with a NN classification scenario, as well as by a stepwise selection. The genetic algorithm (GA)-selected variables showed improved performance when applying Euclidean and Mahalanobis distances for predictions, whereas the Most Similar Neighbour and Random Forests worked more precisely with the full dataset. The GA search proved to be unstable in multiple runs because of intercorrelations among the high-dimensional predictors. The selected datasets are dominated by LiDAR height metrics. Furthermore, the LiDAR-based metrics showed major relevance in predicting both response variables examined here. The Random Forest proved to be superior to the other examined NN methods and was eventually used for a wall-to-wall mapping of predictions on a grid of 20 × 20 m spatial resolution.

Journal ArticleDOI
TL;DR: In this paper, a novel way of incorporating spatial dependence in a heterogeneous region is tested using an ensemble learning technique called random forests and a measure of local spatial dependence called the Getis statistic.
Abstract: Land-cover characterization of large heterogeneous landscapes is challenging because of the confusion caused by high intra-class variability and heterogeneous landscape artefacts. Neighbourhood context can be used to supplement spectral information, and a novel way of incorporating spatial dependence in a heterogeneous region is tested here using an ensemble learning technique called random forests and a measure of local spatial dependence called the Getis statistic. The overall Kappa accuracy of the random forest classifier that used a combination of spectral and local spatial (Getis) variables at three different neighbourhood sizes (3 × 3, 7 × 7, and 11 × 11) ranged from 0.85 to 0.92. This accuracy was higher than that of a non-spatial random forest classifier having an overall Kappa accuracy of 0.78, which was run using the spectral variables only. This study demonstrated that the use of the Getis statistic with different neighbourhood sizes leads to a substantial increase in per-class classification accuracy.
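
The pipeline can be pictured with a short Python sketch: neighbourhood statistics at several window sizes are stacked with the spectral bands and fed to a random forest. A plain focal mean is used below as a crude stand-in for the local Getis statistic, and the toy image, labels, and window sizes are assumptions:

```python
import numpy as np
from scipy.ndimage import uniform_filter
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
bands = rng.random((4, 120, 120))                     # toy image with 4 spectral bands
labels = (bands[0] + bands[1] > 1.0).astype(int)      # toy land-cover classes

# Neighbourhood layers at 3x3, 7x7 and 11x11 windows (focal mean as a simple
# surrogate for the Getis statistic used in the paper).
spatial = [uniform_filter(b, size=w) for b in bands for w in (3, 7, 11)]
stack = np.stack(list(bands) + spatial)
X = stack.reshape(stack.shape[0], -1).T               # one row of features per pixel
y = labels.ravel()

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print("OOB accuracy, spectral + spatial features:", rf.oob_score_)
```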

Book ChapterDOI
05 Sep 2010
TL;DR: The results show that, using only dense depth information, this framework for semantic scene parsing and object recognition based on dense depth maps achieves more accurate overall segmentation and recognition than sparse 3D features or appearance, advancing state-of-the-art performance.
Abstract: In this paper we present a framework for semantic scene parsing and object recognition based on dense depth maps. Five view-independent 3D features that vary with object class are extracted from dense depth maps at a superpixel level for training a classifier using the randomized decision forest technique. Our formulation integrates multiple features in a Markov Random Field (MRF) framework to segment and recognize different object classes in query street scene images. We evaluate our method both quantitatively and qualitatively on the challenging Cambridge-driving Labeled Video Database (CamVid). The results show that, using only dense depth information, we can achieve more accurate overall segmentation and recognition than from sparse 3D features or appearance, or even the combination of sparse 3D features and appearance, advancing state-of-the-art performance. Furthermore, by aligning 3D dense depth based features into a unified coordinate frame, our algorithm can handle the special case of view changes between training and testing scenarios. Preliminary evaluation in cross training and testing shows promising results.

Book
14 Jul 2010
TL;DR: A practical guide to tree-based analysis covering tree construction, classification trees for a binary response, random and deterministic forests, survival trees for censored data, and regression trees and adaptive splines for a continuous response.
Abstract: Contents: A Practical Guide to Tree Construction; Logistic Regression; Classification Trees for a Binary Response; Examples Using Tree-Based Analysis; Random and Deterministic Forests; Analysis of Censored Data: Examples; Analysis of Censored Data: Concepts and Classical Methods; Analysis of Censored Data: Survival Trees and Random Forests; Regression Trees and Adaptive Splines for a Continuous Response; Analysis of Longitudinal Data; Analysis of Multiple Discrete Responses.

Journal ArticleDOI
TL;DR: For the four invasive plant species tested, ensemble models were the only models that ranked in the top three models for both field validation and test data, suggesting that ensemble models may be more robust than individual species-environment matching models for risk analysis.
Abstract: Ensemble species distribution models combine the strengths of several species-environment matching models, while minimizing the weakness of any one model. Ensemble models may be particularly useful in risk analysis of recently arrived, harmful invasive species because such species may not yet have spread to all suitable habitats, leaving species-environment relationships difficult to determine. We tested five individual models (logistic regression, boosted regression trees, random forest, multivariate adaptive regression splines (MARS), and the maximum entropy model or Maxent) and ensemble modeling for selected nonnative plant species in Yellowstone and Grand Teton National Parks, Wyoming; Sequoia and Kings Canyon National Parks, California; and areas of interior Alaska. The models are based on field data provided by the park staffs, combined with topographic, climatic, and vegetation predictors derived from satellite data. For the four invasive plant species tested, ensemble models were the only models that ranked in the top three models for both field validation and test data. Ensemble models may be more robust than individual species-environment matching models for risk analysis.
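
A minimal Python sketch of the ensemble idea, averaging habitat-suitability predictions from several individual models; only three of the five model types are shown (MARS and Maxent have no direct scikit-learn equivalent), and the toy data and unweighted average are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Toy presence/absence records with environmental predictors.
X, y = make_classification(n_samples=500, n_features=12, n_informative=6, random_state=1)

models = [
    LogisticRegression(max_iter=1000),
    GradientBoostingClassifier(),                 # stands in for boosted regression trees
    RandomForestClassifier(n_estimators=300, random_state=1),
]
probs = np.column_stack([m.fit(X, y).predict_proba(X)[:, 1] for m in models])
ensemble_suitability = probs.mean(axis=1)         # simple unweighted ensemble prediction
```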

Proceedings Article
01 Jan 2010
TL;DR: The team combined the results of its student sub-teams by regularized linear regression and won first prize in both tracks (all teams and student teams) of KDD Cup 2010.
Abstract: KDD Cup 2010 is an educational data mining competition. Participants are asked to learn a model from students' past behavior and then predict their future performance. At National Taiwan University, we organized a course for this competition. Most student sub-teams expanded features by various binarization and discretization techniques. The resulting sparse feature sets were trained by logistic regression (using LIBLINEAR). One sub-team considered condensed features using simple statistical techniques and applied Random Forest (through Weka) for training. Initial development was conducted on an internal split of training data for training and validation. We identified some useful feature combinations to improve performance. For the final submission, we combined results of student sub-teams by regularized linear regression. Our team is the first prize winner of both tracks (all teams and student teams) of KDD Cup 2010.
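
A toy Python sketch of the final combination step, blending sub-team predictions with regularized linear regression; the validation matrix is simulated and the ridge penalty is an illustrative value, not the team's actual setup:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
truth = rng.integers(0, 2, size=2000)                               # correct/incorrect answers
team_preds = np.clip(truth[:, None] * 0.6 + rng.normal(0.2, 0.3, (2000, 5)), 0, 1)

blender = Ridge(alpha=1.0).fit(team_preds, truth)                   # learn how to weight sub-teams
blended = np.clip(blender.predict(team_preds), 0, 1)                # combined probability estimates
```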

Journal ArticleDOI
TL;DR: In this paper, an empirical method based on Random Forests is proposed to estimate the photometric redshift of each galaxy by building a set of optimal decision trees on subsets of the available spectroscopic sample.
Abstract: The main challenge today in photometric redshift estimation is not in the accuracy but in understanding the uncertainties. We introduce an empirical method based on Random Forests to address these issues. The training algorithm builds a set of optimal decision trees on subsets of the available spectroscopic sample, which provide independent constraints on the redshift of each galaxy. The combined forest estimates have intriguing statistical properties, notable among which are Gaussian errors. We demonstrate the power of our approach on multi-color measurements of the Sloan Digital Sky Survey.
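
A toy Python sketch of the approach: a random forest regresses redshift on photometry, and the scatter of the individual tree predictions serves as an empirical, per-galaxy error estimate. The simulated colours and the simple mean/standard-deviation summary are assumptions, not the paper's analysis:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
colours = rng.normal(size=(3000, 5))                                    # toy photometry
z_spec = 0.3 + 0.1 * colours[:, 0] - 0.05 * colours[:, 1] + rng.normal(0, 0.02, 3000)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(colours, z_spec)

# Each tree was grown on a different bootstrap subset of the spectroscopic
# sample, so the per-tree spread gives an error estimate for every galaxy.
per_tree = np.stack([tree.predict(colours) for tree in forest.estimators_])
z_phot, z_err = per_tree.mean(axis=0), per_tree.std(axis=0)
```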

Book ChapterDOI
05 Sep 2010
TL;DR: A novel multiple-instance learning algorithm for randomized trees called MIForests, which achieves state-of-the-art results while being faster than previous approaches and being able to inherently solve multi-class problems.
Abstract: Multiple-instance learning (MIL) allows for training classifiers from ambiguously labeled data. In computer vision, this learning paradigm has been recently used in many applications such as object classification, detection and tracking. This paper presents a novel multiple-instance learning algorithm for randomized trees called MIForests. Randomized trees are fast, inherently parallel and multi-class and are thus increasingly popular in computer vision. MIForests combine the advantages of these classifiers with the flexibility of multiple instance learning. In order to leverage the randomized trees for MIL, we define the hidden class labels inside target bags as random variables. These random variables are optimized by training random forests and using a fast iterative homotopy method for solving the non-convex optimization problem. Additionally, most previously proposed MIL approaches operate in batch or off-line mode and thus assume access to the entire training set. This limits their applicability in scenarios where the data arrives sequentially and in dynamic environments. We show that MIForests are not limited to off-line problems and present an on-line extension of our approach. In the experiments, we evaluate MIForests on standard visual MIL benchmark datasets where we achieve state-of-the-art results while being faster than previous approaches and being able to inherently solve multi-class problems. The on-line version of MIForests is evaluated on visual object tracking where we outperform the state-of-the-art method based on boosting.

Journal ArticleDOI
TL;DR: This paper reviews techniques to accelerate concept classification, showing the trade-off between computational efficiency and accuracy; the results lead to a 7-fold speed increase without accuracy loss, and a 70-fold speed increase with 3% accuracy loss.
Abstract: As datasets grow increasingly large in content-based image and video retrieval, computational efficiency of concept classification is important. This paper reviews techniques to accelerate concept classification, where we show the trade-off between computational efficiency and accuracy. As a basis, we use the Bag-of-Words algorithm that led to the best performance scores in the 2008 benchmarks of TRECVID and PASCAL. We divide the evaluation in three steps: 1) Descriptor Extraction, where we evaluate SIFT, SURF, DAISY, and Semantic Textons. 2) Visual Word Assignment, where we compare a k-means visual vocabulary with a Random Forest and evaluate subsampling, dimension reduction with PCA, and division strategies of the Spatial Pyramid. 3) Classification, where we evaluate the χ2, RBF, and Fast Histogram Intersection kernel for the SVM. Apart from the evaluation, we accelerate the calculation of densely sampled SIFT and SURF, accelerate nearest neighbor assignment, and improve accuracy of the Histogram Intersection kernel. We conclude by discussing whether further acceleration of the Bag-of-Words pipeline is possible. Our results lead to a 7-fold speed increase without accuracy loss, and a 70-fold speed increase with 3% accuracy loss. The latter system does classification in real-time, which opens up new applications for automatic concept classification. For example, this system permits five standard desktop PCs to automatically tag all images currently uploaded to Flickr for 20 classes.

Journal ArticleDOI
TL;DR: In this article, the authors prove uniform consistency of Random Survival Forests (RSF) under general splitting rules, bootstrapping, and random selection of variables, and show that the forest ensemble survival function converges uniformly to the true population survival function.

Journal ArticleDOI
TL;DR: Characteristics such as time effort, classifier comprehensibility and method intricacy are evaluated—aspects that determine the success of a classification technique among ecologists and conservation biologists as well as for the communication with managers and decision makers.

Posted Content
TL;DR: In this article, the use of Random Forest as a potential technique for residential estate mass appraisal has been attempted for the first time and the method performed better than such techniques as CHAID, CART, KNN, multiple regression analysis, Artificial Neural Networks (MLP and RBF) and Boosted Trees.
Abstract: To the best of the authors' knowledge, the use of Random Forest as a potential technique for residential estate mass appraisal has been attempted here for the first time. In an empirical study using data on residential apartments, the method performed better than such techniques as CHAID, CART, KNN, multiple regression analysis, Artificial Neural Networks (MLP and RBF) and Boosted Trees. An approach is introduced for automatically detecting segments where a model significantly underperforms and segments with systematically under- or overestimated predictions. This segmentational approach is applicable to various expert systems including, but not limited to, those used for mass appraisal.
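
A rough Python sketch of the two ingredients described above, a random forest price model plus per-segment diagnostics that flag systematic under- or overestimation; the toy apartment data and the choice of district as the segmentation variable are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 2000
flats = pd.DataFrame({
    "area": rng.uniform(30, 120, n),
    "rooms": rng.integers(1, 5, n),
    "floor": rng.integers(1, 20, n),
    "district": rng.integers(0, 10, n),
})
price = 1000 * flats["area"] + 5000 * flats["rooms"] + rng.normal(0, 20000, n)

pred = cross_val_predict(RandomForestRegressor(n_estimators=300, random_state=0), flats, price, cv=5)
resid = pred - price                                   # positive = overestimated price

# Segment-level report: mean residual (systematic bias) and mean absolute error.
report = (pd.DataFrame({"district": flats["district"], "resid": resid})
          .groupby("district")["resid"]
          .agg(bias="mean", mae=lambda r: r.abs().mean()))
print(report.sort_values("mae", ascending=False).head())
```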

Journal ArticleDOI
TL;DR: A multiple classifier system based on a "forest" of fuzzy decision trees, i.e., a fuzzy random forest, is proposed; it exhibits good classification accuracy, comparable to that of the best classifiers, when tested on conventional data sets.

Journal ArticleDOI
Alan Smith1
TL;DR: In this article, an approach is described to using the Random Forest classification algorithm to quantitatively evaluate a range of potential image segmentation scales in order to identify the segmentation scale(s) that best predict land cover classes of interest.
Abstract: This paper describes an approach to using the Random Forest classification algorithm to quantitatively evaluate a range of potential image segmentation scale alternatives in order to identify the segmentation scale(s) that best predict land cover classes of interest. The image segmentation scale selection process was used to identify three critical image object scales that when combined produced an optimal level of land cover classification accuracy. Following segmentation scale optimization, the Random Forest classifier was then used to assign land cover classes to 11 scenes of SPOT satellite imagery in North and South Dakota with an average overall accuracy of 85.2 percent.
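
The scale-selection step can be sketched in a few lines of Python: train a random forest on the object features produced at each candidate segmentation scale and compare out-of-bag accuracies. The per-scale feature matrices and labels below are placeholders for the object-based features in the paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=400)                          # toy land-cover labels
candidate_scales = {10: rng.normal(size=(400, 8)),             # object features per scale
                    25: rng.normal(size=(400, 8)),
                    50: rng.normal(size=(400, 8))}

oob = {}
for scale, feats in candidate_scales.items():
    rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0).fit(feats, labels)
    oob[scale] = rf.oob_score_                                  # how well this scale predicts classes

best_scale = max(oob, key=oob.get)
```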

Journal ArticleDOI
TL;DR: The method takes advantage of the random forest algorithm and offers a structure for hybrid random forest-based lung nodule classification aided by clustering; a high receiver operating characteristic (ROC) A(z) of 0.9786 has been achieved.

Journal ArticleDOI
TL;DR: The water index and Ratio975 had the best ability to assay the water status of S. noctilio-infested trees, thus making it possible to remotely predict and quantify the severity of damage caused by the wasp.

01 Oct 2010
TL;DR: Simulation results show that the proposed MERF method provides substantial improvements over standard RF when the random effects are non-negligible.
Abstract: This paper presents an extension of the random forest (RF) method to the case of clustered data. The proposed 'mixed-effects random forest' (MERF) is implemented using a standard RF algorithm within the framework of the expectation–maximization algorithm. Simulation results show that the proposed MERF method provides substantial improvements over standard RF when the random effects are non-negligible. The use of the method is illustrated by predicting the first-week box office revenues of movies.
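
A heavily simplified Python sketch of the alternating scheme behind a mixed-effects random forest: fit a forest to the data with the current random intercepts removed, then re-estimate an intercept per cluster from the residuals. The full MERF uses an EM-style update with estimated variance components, which this sketch does not implement:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def merf_sketch(X, y, groups, n_iter=10):
    b = {g: 0.0 for g in np.unique(groups)}               # random intercept per cluster
    rf = RandomForestRegressor(n_estimators=300, random_state=0)
    for _ in range(n_iter):
        offset = np.array([b[g] for g in groups])
        rf.fit(X, y - offset)                              # fixed-effects (forest) part
        resid = y - rf.predict(X)
        for g in b:
            b[g] = resid[groups == g].mean()               # crude random-effect update
    return rf, b

# Toy usage: clusters share an intercept on top of a common signal.
rng = np.random.default_rng(0)
groups = rng.integers(0, 20, 1000)
X = rng.normal(size=(1000, 4))
y = X[:, 0] + 0.5 * groups + rng.normal(0, 0.1, 1000)
model, intercepts = merf_sketch(X, y, groups)
```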

Journal ArticleDOI
TL;DR: This letter presents a computer-aided diagnosis technique for the early detection of Alzheimer's disease (AD) by means of single photon emission computed tomography (SPECT) image classification, based on a partial least squares (PLS) regression model and a random forest (RF) predictor.
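
A compact Python sketch of the two-stage idea, partial least squares for feature extraction followed by a random forest classifier; the simulated voxel data, the number of PLS components, and the forest size are illustrative assumptions:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5000))                      # toy SPECT voxel intensities per subject
y = rng.integers(0, 2, size=80)                      # 1 = AD, 0 = control (simulated)

pls = PLSRegression(n_components=5).fit(X, y)        # supervised dimensionality reduction
scores = pls.transform(X)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(scores, y)
```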

Journal ArticleDOI
TL;DR: This work investigates how illuminant estimation techniques can be improved by taking into account intrinsic, low-level properties of the images, and shows how these properties can be used to drive the selection of the best algorithm for a given image.