Journal ArticleDOI

Random Forests for Classification in Ecology

01 Nov 2007-Ecology (John Wiley & Sons, Ltd)-Vol. 88, Iss: 11, pp 2783-2792
TL;DR: High classification accuracy was observed in all applications, as measured by cross-validation and, for the lichen data, by independent test data, when comparing RF to other common classification methods.
Abstract: Classification procedures are some of the most widely used statistical methods in ecology. Random forests (RF) is a new and powerful statistical classifier that is well established in other disciplines but is relatively unknown in ecology. Advantages of RF compared to other statistical classifiers include (1) very high classification accuracy; (2) a novel method of determining variable importance; (3) ability to model complex interactions among predictor variables; (4) flexibility to perform several types of statistical data analysis, including regression, classification, survival analysis, and unsupervised learning; and (5) an algorithm for imputing missing values. We compared the accuracies of RF and four other commonly used statistical classifiers using data on invasive plant species presence in Lava Beds National Monument, California, USA, rare lichen species presence in the Pacific Northwest, USA, and nest sites for cavity nesting birds in the Uinta Mountains, Utah, USA. We observed high classification accuracy in all applications as measured by cross-validation and, in the case of the lichen data, by independent test data, when comparing RF to other common classification methods. We also observed that the variables that RF identified as most important for classifying invasive plant species coincided with expectations based on the literature.
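
The workflow described in this abstract maps directly onto the randomForest package in R (see the randomForest package article in the References below). As a hedged illustration, not the authors' actual analysis, a minimal presence/absence classification with variable importance might look like this, assuming a hypothetical data frame `plots` with a binary factor column `present` and environmental predictor columns:

```r
# Minimal RF classification sketch; `plots`, `present`, and the predictor
# columns are hypothetical stand-ins for the paper's species data.
library(randomForest)

set.seed(42)
rf <- randomForest(present ~ ., data = plots,
                   ntree = 500,        # number of trees in the forest
                   importance = TRUE)  # compute permutation-based importance

print(rf)       # confusion matrix and out-of-bag (OOB) error estimate
importance(rf)  # per-variable importance scores
varImpPlot(rf)  # mean decrease in accuracy and in Gini impurity
```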


Citations
Journal ArticleDOI
TL;DR: In this paper, the performance of the random forest classifier for land cover classification of a complex area is explored, based on several criteria: mapping accuracy and sensitivity to data-set size and noise.
Abstract: Land cover monitoring using remotely sensed data requires robust classification methods which allow for the accurate mapping of complex land cover and land use categories. Random forest (RF) is a powerful machine learning classifier that is relatively unknown in land remote sensing and has not been evaluated thoroughly by the remote sensing community compared to more conventional pattern recognition techniques. Key advantages of RF include its non-parametric nature, high classification accuracy, and capability to determine variable importance. However, the split rules for classification are unknown, so RF can be considered a black-box classifier. RF also provides an algorithm for estimating missing values and the flexibility to perform several types of data analysis, including regression, classification, survival analysis, and unsupervised learning. In this paper, the performance of the RF classifier for land cover classification of a complex area is explored. Evaluation was based on several criteria: mapping accuracy and sensitivity to data-set size and noise. Landsat-5 Thematic Mapper data captured in European spring and summer were used with auxiliary variables derived from a digital terrain model to classify 14 different land categories in the south of Spain. Results show that the RF algorithm yields accurate land cover classifications, with 92% overall accuracy and a kappa index of 0.92. RF is robust to training-data reduction and noise: significant differences in kappa values were observed only when data reduction exceeded 50% or added noise exceeded 20%. Additionally, the variables that RF identified as most important for classifying land cover coincided with expectations. A McNemar test indicates an overall better performance of the random forest model over a single decision tree at the 0.00001 significance level.
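
The McNemar comparison reported in this abstract pairs the predictions of the random forest and a single decision tree on the same test cases. A hedged sketch of that comparison in R, assuming hypothetical data frames `train` and `test` with a factor response `class` and predictor columns:

```r
# Illustrative RF vs. single-tree comparison; data frames are hypothetical.
library(randomForest)
library(rpart)

rf   <- randomForest(class ~ ., data = train, ntree = 500)
tree <- rpart(class ~ ., data = train, method = "class")

pred_rf   <- predict(rf, newdata = test)
pred_tree <- predict(tree, newdata = test, type = "class")

correct_rf   <- pred_rf   == test$class
correct_tree <- pred_tree == test$class

# 2x2 table of agreement/disagreement in correctness; McNemar's test asks
# whether the two classifiers' error rates differ significantly.
mcnemar.test(table(correct_rf, correct_tree))
```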

1,901 citations

Journal ArticleDOI
19 Apr 2016-Test
TL;DR: The present article reviews the most recent theoretical and methodological developments for random forests, with special attention given to the selection of parameters, the resampling mechanism, and variable importance measures.
Abstract: The random forest algorithm, proposed by L. Breiman in 2001, has been extremely successful as a general-purpose classification and regression method. The approach, which combines several randomized decision trees and aggregates their predictions by averaging, has shown excellent performance in settings where the number of variables is much larger than the number of observations. Moreover, it is versatile enough to be applied to large-scale problems, is easily adapted to various ad hoc learning tasks, and returns measures of variable importance. The present article reviews the most recent theoretical and methodological developments for random forests. Emphasis is placed on the mathematical forces driving the algorithm, with special attention given to the selection of parameters, the resampling mechanism, and variable importance measures. This review is intended to provide non-experts easy access to the main ideas.
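
For the parameter-selection questions the review emphasizes, the randomForest package exposes both main knobs directly. A rough sketch, assuming a hypothetical predictor matrix `x` and factor response `y`:

```r
# Illustrative tuning of the two main RF parameters; `x` and `y` are
# hypothetical inputs.
library(randomForest)

# Search over mtry (variables tried at each split) using the OOB error;
# tuneRF grows/shrinks mtry by stepFactor until improvement stalls.
tuned <- tuneRF(x, y, ntreeTry = 500, stepFactor = 2, improve = 0.01)

# ntree mainly needs to be "large enough": the OOB error stabilizes as
# trees are added, which can be checked by plotting the fitted forest.
rf <- randomForest(x, y, ntree = 1000)
plot(rf)  # OOB error rate as a function of the number of trees
```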

1,279 citations


Cites background from "Random Forests for Classification i..."

  • …hackathon on air quality prediction (http://www.kaggle.com/c/dsg-hackathon), chemoinformatics (Svetnik et al., 2003), ecology (Prasad et al., 2006; Cutler et al., 2007), 3D object recognition (Shotton et al., 2011), and bioinformatics (Díaz-Uriarte and de Andrés, 2006), just to name a few....



Journal ArticleDOI
TL;DR: In this article, the authors tested the predictive accuracies of five consensus methods, namely Weighted Average (WA), Mean(All), Median(All), Median(PCA), and Best, for 28 threatened plant species.
Abstract: Aim Spatial modelling techniques are increasingly used in species distribution modelling. However, the implemented techniques differ in their modelling performance, and some consensus methods are needed to reduce the uncertainty of predictions. In this study, we tested the predictive accuracies of five consensus methods, namely Weighted Average (WA), Mean(All), Median(All), Median(PCA), and Best, for 28 threatened plant species. Location North-eastern Finland, Europe. Methods The spatial distributions of the plant species were forecasted using eight state-of-the-art single-modelling techniques providing an ensemble of predictions. The probability values of occurrence were then combined using five consensus algorithms. The predictive accuracies of the single-model and consensus methods were assessed by computing the area under the curve (AUC) of the receiver-operating characteristic plot. Results The mean AUC values varied between 0.697 (classification tree analysis) and 0.813 (random forest) for the single-models, and from 0.757 to 0.850 for the consensus methods. WA and Mean(All) consensus methods provided significantly more robust predictions than all the single-models and the other consensus methods. Main conclusions Consensus methods based on average function algorithms may significantly increase the accuracy of species distribution forecasts, and thus they show considerable promise for different conservation biological and biogeographical applications.
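
The consensus step itself is simple arithmetic over the single-model outputs. The sketch below, with hypothetical inputs (a site-by-model probability matrix `probs`, per-model AUC weights `auc_single`, and observed presence/absence `obs` coded 0/1), illustrates the WA, Mean(All), and Median(All) combinations and a package-free AUC via the rank (Mann-Whitney) formulation; it is an illustration of the idea, not the study's code:

```r
# Weighted Average (WA): weight each model's prediction by its skill.
w       <- auc_single / sum(auc_single)
pred_wa <- as.vector(probs %*% w)

# Mean(All) and Median(All): unweighted consensus across all models.
pred_mean   <- rowMeans(probs)
pred_median <- apply(probs, 1, median)

# AUC via the rank (Mann-Whitney) formulation, no extra packages needed.
auc <- function(pred, obs) {
  r  <- rank(pred)
  n1 <- sum(obs == 1); n0 <- sum(obs == 0)
  (sum(r[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc(pred_wa, obs)
```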

1,097 citations


Cites methods or results from "Random Forests for Classification i..."

  • ...A general outcome of the model comparisons has been that novel modelling techniques, such as RF and GBM, consistently outperform more established techniques (Cutler et al., 2007)....


  • ...Some single-models such as RF are accurate in interpolation modelling (Cutler et al., 2007)....


Journal ArticleDOI
TL;DR: It is shown that there is no universal measure of impact, as the pattern observed depends on the ecological measure examined; however, some species traits, especially life form, stature, and pollination syndrome, may provide a means to predict impact regardless of the particular habitat and geographical region invaded.
Abstract: With the growing body of literature assessing the impact of invasive alien plants on resident species and ecosystems, a comprehensive assessment of the relationship between invasive species traits and environmental settings of invasion on the characteristics of impacts is needed. Based on 287 publications with 1551 individual cases that addressed the impact of 167 invasive plant species belonging to 49 families, we present the first global overview of frequencies of significant and non-significant ecological impacts and their directions on 15 outcomes related to the responses of resident populations, species, communities and ecosystems. Species and community outcomes tend to decline following invasions, especially those for plants, but the abundance and richness of the soil biota, as well as concentrations of soil nutrients and water, more often increase than decrease following invasion. Data mining tools revealed that invasive plants exert consistent significant impacts on some outcomes (survival of resident biota, activity of resident animals, resident community productivity, mineral and nutrient content in plant tissues, and fire frequency and intensity), whereas for outcomes at the community level, such as species richness, diversity and soil resources, the significance of impacts is determined by interactions between species traits and the biome invaded. The latter outcomes are most likely to be impacted by annual grasses, and by wind-pollinated trees invading Mediterranean or tropical biomes. One of the clearest signals in this analysis is that invasive plants are far more likely to cause significant impacts on resident plant and animal richness on islands than on the mainland. This study shows that there is no universal measure of impact and the pattern observed depends on the ecological measure examined. Although impact is strongly context-dependent, some species traits, especially life form, stature and pollination syndrome, may provide a means to predict impact, regardless of the particular habitat and geographical region invaded.

1,067 citations


Cites background from "Random Forests for Classification i..."

  • …unlike parametric linear models, collinearity does not prevent reliable parameter estimates because the method guards against the elimination of variables which are good predictors of the response, and may be ecologically important, but are correlated with other predictors (Cutler et al., 2007)....


References
Journal ArticleDOI
01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation; these are used to show the response to increasing the number of features used in the splitting, and are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation, and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
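
The core intuition here, that aggregating many reasonably strong, weakly correlated classifiers by voting drives error down, can be seen in a toy simulation (an illustration only; Breiman's actual result concerns the strength/correlation trade-off, and real trees are not independent):

```r
# Toy simulation: majority vote over many independent weak classifiers.
set.seed(1)
n <- 1000   # test cases
B <- 251    # number of classifiers (odd, to avoid ties)
p <- 0.65   # each classifier is right 65% of the time, independently

votes <- matrix(rbinom(n * B, 1, p), nrow = n)  # 1 = correct vote
majority_correct <- rowMeans(votes) > 0.5

mean(majority_correct)  # close to 1 for the ensemble...
1 - p                   # ...versus 0.35 error for a single classifier
```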

79,257 citations


"Random Forests for Classification i..." refers background or methods in this paper

  • ...Random forests (hereafter RF) is one such method (Breiman 2001)....


  • ...For the classification situation, Breiman (2001) showed that classification accuracy can be significantly improved by aggregating the results of many classifiers that have little bias by averaging or voting, if the classifiers have low pairwise correlations....


Book
28 Jul 2013
TL;DR: In this book, the authors describe the important ideas in data mining, machine learning, and bioinformatics within a common conceptual framework; the emphasis is on concepts rather than mathematics, with a liberal use of color graphics.
Abstract: During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting (the first comprehensive treatment of this topic in any book). This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression and path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for "wide" data (p bigger than n), including multiple testing and false discovery rates. Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.

19,261 citations

Journal ArticleDOI
TL;DR: A general gradient descent boosting paradigm is developed for additive expansions based on any fitting criterion, and specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification.
Abstract: Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient descent “boosting” paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are regression trees, and tools for interpreting such “TreeBoost” models are presented. Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Schapire, and of Friedman, Hastie and Tibshirani, are discussed.
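
The stagewise least-squares case in this abstract reduces to repeatedly fitting a small tree to the current residuals (the negative gradient of squared-error loss) and adding a shrunken copy to the ensemble. A minimal, hedged sketch in R; all names are illustrative, with `d` a hypothetical data frame holding a numeric response `y` and predictor columns:

```r
# Minimal least-squares gradient boosting with small rpart trees.
library(rpart)

M     <- 200    # number of boosting stages
nu    <- 0.1    # shrinkage (learning rate)
fit   <- rep(mean(d$y), nrow(d))   # initialize with the mean response
trees <- vector("list", M)

for (m in 1:M) {
  d$resid    <- d$y - fit                     # negative gradient for L2 loss
  trees[[m]] <- rpart(resid ~ . - y, data = d, maxdepth = 2)
  fit        <- fit + nu * predict(trees[[m]], d)  # stagewise additive update
}

mean((d$y - fit)^2)  # training error shrinks as stages are added
```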

17,764 citations

01 Jan 2007
TL;DR: Random forests, which add an additional layer of randomness to bagging and are robust against overfitting, are proposed; the randomForest package provides an R interface to the Fortran programs by Breiman and Cutler.
Abstract: Recently there has been a lot of interest in “ensemble learning” — methods that generate many classifiers and aggregate their results. Two well-known methods are boosting (see, e.g., Schapire et al., 1998) and bagging (Breiman, 1996) of classification trees. In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors. In the end, a weighted vote is taken for prediction. In bagging, successive trees do not depend on earlier trees — each is independently constructed using a bootstrap sample of the data set. In the end, a simple majority vote is taken for prediction. Breiman (2001) proposed random forests, which add an additional layer of randomness to bagging. In addition to constructing each tree using a different bootstrap sample of the data, random forests change how the classification or regression trees are constructed. In standard trees, each node is split using the best split among all variables. In a random forest, each node is split using the best among a subset of predictors randomly chosen at that node. This somewhat counterintuitive strategy turns out to perform very well compared to many other classifiers, including discriminant analysis, support vector machines and neural networks, and is robust against overfitting (Breiman, 2001). In addition, it is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest), and is usually not very sensitive to their values. The randomForest package provides an R interface to the Fortran programs by Breiman and Cutler (available at http://www.stat.berkeley.edu/users/breiman/). This article provides a brief introduction to the usage and features of the R functions.
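
Since this entry notes that random forests "add an additional layer of randomness to bagging," the relationship is easy to demonstrate: setting mtry to the full number of predictors removes the random subsetting, recovering plain bagging of trees. A short sketch using the iris data that ships with R:

```r
library(randomForest)

p <- ncol(iris) - 1   # number of predictors

# Default RF: mtry is about sqrt(p) for classification.
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

# mtry = p means every split considers all variables: plain bagging.
bag <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = p)

rf$err.rate[500, "OOB"]   # OOB error of the random forest
bag$err.rate[500, "OOB"]  # OOB error of the bagged trees
```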

14,830 citations

Book
01 Jan 1983
TL;DR: The methodology used to construct tree-structured rules is the focus of this monograph, which covers the use of trees as a data analysis method and, in a more mathematical framework, proves some of their fundamental properties.
Abstract: The methodology used to construct tree-structured rules is the focus of this monograph. Unlike many other statistical procedures, which moved from pencil and paper to calculators, this text's use of trees was unthinkable before computers. Both the practical and theoretical sides have been developed in the authors' study of tree methods. Classification and Regression Trees reflects these two sides, covering the use of trees as a data analysis method and, in a more mathematical framework, proving some of their fundamental properties.
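
For readers who want to try the tree-structured rules the monograph describes, the rpart package implements the CART methodology in R. A minimal example on the built-in iris data (a sketch, not code from the book):

```r
library(rpart)

# Fit a classification tree and inspect the resulting rule structure.
tree <- rpart(Species ~ ., data = iris, method = "class")
print(tree)              # the fitted tree-structured rule
plot(tree); text(tree)   # simple plot of splits and leaf labels

predict(tree, iris[1:5, ], type = "class")  # classify a few cases
```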

14,825 citations