Topic

Random forest

About: Random forest is a research topic. Over its lifetime, 13,345 publications have been published within this topic, receiving 345,395 citations. The topic is also known as: random forests and randomized trees.


Papers

Open access · Journal Article · DOI: 10.1023/A:1010933404324
Leo Breiman
01 Oct 2001-Machine Learning
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
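As a concrete illustration of the construction Breiman describes, here is a minimal sketch using scikit-learn's RandomForestClassifier, an independent implementation rather than Breiman's original code; the synthetic dataset and all parameter values are illustrative assumptions. The out-of-bag score stands in for the internal error estimates, and feature_importances_ for the variable importance measure.

```python
# Minimal random-forest sketch (scikit-learn stands in for Breiman's
# original Fortran implementation; data and settings are illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data, assumed purely for demonstration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,     # generalization error converges as trees are added
    max_features="sqrt",  # random subset of features tried at each split
    oob_score=True,       # out-of-bag samples give an internal error estimate
    random_state=0,
)
forest.fit(X, y)

print("OOB accuracy:", forest.oob_score_)                  # internal monitor
print("variable importances:", forest.feature_importances_)
```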


Topics: Random forest (63%), Multivariate random variable (57%), Random subspace method (57%)

58,232 Citations


Open access
01 Jan 2007
Abstract: Recently there has been a lot of interest in “ensemble learning” — methods that generate many classifiers and aggregate their results. Two well-known methods are boosting (see, e.g., Schapire et al., 1998) and bagging (Breiman, 1996) of classification trees. In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors. In the end, a weighted vote is taken for prediction. In bagging, successive trees do not depend on earlier trees — each is independently constructed using a bootstrap sample of the data set. In the end, a simple majority vote is taken for prediction. Breiman (2001) proposed random forests, which add an additional layer of randomness to bagging. In addition to constructing each tree using a different bootstrap sample of the data, random forests change how the classification or regression trees are constructed. In standard trees, each node is split using the best split among all variables. In a random forest, each node is split using the best among a subset of predictors randomly chosen at that node. This somewhat counterintuitive strategy turns out to perform very well compared to many other classifiers, including discriminant analysis, support vector machines and neural networks, and is robust against overfitting (Breiman, 2001). In addition, it is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest), and is usually not very sensitive to their values. The randomForest package provides an R interface to the Fortran programs by Breiman and Cutler (available at http://www.stat.berkeley.edu/users/breiman/). This article provides a brief introduction to the usage and features of the R functions.
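The two user-set parameters the article highlights, mtry (the number of variables in the random subset at each node) and ntree, correspond roughly to max_features and n_estimators in scikit-learn. The sketch below, which is not the article's R code, contrasts plain bagging (best split over all variables) with a random forest (best split over a random subset) on assumed synthetic data.

```python
# Bagging vs. random forest: the only change is how many variables each
# node may split on. Data and parameter values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=30, n_informative=10,
                           random_state=1)

# max_features=None considers every variable at each split, i.e. bagged trees;
# max_features="sqrt" is the random-forest-style random subset ("mtry").
bagging = RandomForestClassifier(n_estimators=200, max_features=None,
                                 random_state=1)
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=1)

for name, model in [("bagging", bagging), ("random forest", forest)]:
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean 5-fold CV accuracy = {acc:.3f}")
```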


Topics: Random forest (69%), Overfitting (52%), Ensemble learning (52%)

12,765 Citations


Journal Article · DOI: 10.1109/34.709601
Tin Kam Ho
Abstract: Much of previous attention on decision trees focuses on the splitting criteria and optimization of tree sizes. The dilemma between overfitting and achieving maximum accuracy is seldom resolved. A method to construct a decision tree based classifier is proposed that maintains highest accuracy on training data and improves on generalization accuracy as it grows in complexity. The classifier consists of multiple trees constructed systematically by pseudorandomly selecting subsets of components of the feature vector, that is, trees constructed in randomly chosen subspaces. The subspace method is compared to single-tree classifiers and other forest construction methods by experiments on publicly available datasets, where the method's superiority is demonstrated. We also discuss independence between trees in a forest and relate that to the combined classification accuracy.
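A from-scratch sketch of the random subspace idea may help: each fully grown tree is trained on the complete training set but sees only a pseudorandomly selected subset of feature components, and the forest predicts by majority vote. The data, the number of trees, and the subspace dimension below are all illustrative assumptions, not values from the paper.

```python
# Random subspace method: grow each tree in a randomly chosen feature
# subspace, then combine by majority vote. All settings are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=40, random_state=2)
rng = np.random.default_rng(2)

n_trees, subspace_dim = 50, 20  # 50 trees, each in a 20-dim subspace
ensemble = []
for _ in range(n_trees):
    dims = rng.choice(X.shape[1], size=subspace_dim, replace=False)
    # Fully grown tree: maintains maximum accuracy on the training data.
    tree = DecisionTreeClassifier(random_state=0).fit(X[:, dims], y)
    ensemble.append((dims, tree))

# Majority vote across the subspace-trained trees.
votes = np.array([tree.predict(X[:, dims]) for dims, tree in ensemble])
pred = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the combined vote:", (pred == y).mean())
```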


Topics: Random forest (65%), Random subspace method (62%), Decision tree (56%)

5,108 Citations


Open access · Journal Article · DOI: 10.1111/J.1365-2656.2008.01390.X
Abstract: Summary
1. Ecologists use statistical models for both explanation and prediction, and need techniques that are flexible enough to express typical features of their data, such as nonlinearities and interactions.
2. This study provides a working guide to boosted regression trees (BRT), an ensemble method for fitting statistical models that differs fundamentally from conventional techniques that aim to fit a single parsimonious model. Boosted regression trees combine the strengths of two algorithms: regression trees (models that relate a response to their predictors by recursive binary splits) and boosting (an adaptive method for combining many simple models to give improved predictive performance). The final BRT model can be understood as an additive regression model in which individual terms are simple trees, fitted in a forward, stagewise fashion.
3. Boosted regression trees incorporate important advantages of tree-based methods, handling different types of predictor variables and accommodating missing data. They have no need for prior data transformation or elimination of outliers, can fit complex nonlinear relationships, and automatically handle interaction effects between predictors. Fitting multiple trees in BRT overcomes the biggest drawback of single tree models: their relatively poor predictive performance. Although BRT models are complex, they can be summarized in ways that give powerful ecological insight, and their predictive performance is superior to most traditional modelling methods.
4. The unique features of BRT raise a number of practical issues in model fitting. We demonstrate the practicalities and advantages of using BRT through a distributional analysis of the short-finned eel (Anguilla australis Richardson), a native freshwater fish of New Zealand. We use a data set of over 13,000 sites to illustrate effects of several settings, and then fit and interpret a model using a subset of the data. We provide code and a tutorial to enable the wider use of BRT by ecologists.
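The paper's own code and tutorial are in R; the following is only an analogous sketch in Python using scikit-learn's gradient-boosted trees, with synthetic data standing in for the eel survey and illustrative settings for the knobs the guide discusses (tree size, learning rate, number of trees, stochastic subsampling).

```python
# Boosted regression trees as an additive model of small trees fitted in a
# forward, stagewise fashion. Data and settings are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0,
                       random_state=3)

brt = GradientBoostingRegressor(
    n_estimators=1000,   # many simple trees combined additively
    max_depth=3,         # small trees, so each term stays a simple model
    learning_rate=0.01,  # slow stagewise fitting, typical for BRT
    subsample=0.5,       # stochastic boosting: each tree sees a random half
    random_state=3,
)
brt.fit(X, y)
print("training R^2:", brt.score(X, y))
```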


Topics: Statistical model (52%), Random forest (52%), Regression analysis (52%)

3,816 Citations


Open access · Journal Article · DOI: 10.1890/07-0539.1
01 Nov 2007-Ecology
Abstract: Classification procedures are some of the most widely used statistical methods in ecology. Random forests (RF) is a new and powerful statistical classifier that is well established in other disciplines but is relatively unknown in ecology. Advantages of RF compared to other statistical classifiers include (1) very high classification accuracy; (2) a novel method of determining variable importance; (3) ability to model complex interactions among predictor variables; (4) flexibility to perform several types of statistical data analysis, including regression, classification, survival analysis, and unsupervised learning; and (5) an algorithm for imputing missing values. We compared the accuracies of RF and four other commonly used statistical classifiers using data on invasive plant species presence in Lava Beds National Monument, California, USA, rare lichen species presence in the Pacific Northwest, USA, and nest sites for cavity nesting birds in the Uinta Mountains, Utah, USA. We observed high classification accuracy in all applications as measured by cross-validation and, in the case of the lichen data, by independent test data, when comparing RF to other common classification methods. We also observed that the variables that RF identified as most important for classifying invasive plant species coincided with expectations based on the literature.
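A hedged sketch of the kind of comparison reported here: cross-validated accuracy of a random forest against a conventional classifier (discriminant analysis), followed by the forest's variable importance ranking. The synthetic presence/absence data below is an assumption standing in for the ecological surveys.

```python
# Compare RF with discriminant analysis by cross-validation, then rank
# predictors by RF variable importance. Data is synthetic and illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=800, n_features=15, n_informative=5,
                           random_state=4)

for name, model in [("discriminant analysis", LinearDiscriminantAnalysis()),
                    ("random forest", RandomForestClassifier(random_state=4))]:
    acc = cross_val_score(model, X, y, cv=10).mean()
    print(f"{name}: 10-fold CV accuracy = {acc:.3f}")

rf = RandomForestClassifier(random_state=4).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]
print("most important predictor indices:", ranking[:5])
```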


Topics: Random forest (54%), Missing data (53%), Regression analysis (51%)

2,794 Citations


Performance Metrics

Number of papers in the topic in previous years:

Year    Papers
2022    68
2021    2,225
2020    2,238
2019    1,960
2018    1,508
2017    1,168

Top Attributes

Topic's top 5 most impactful authors

Antonio Criminisi

19 papers, 3.2K citations

Jon Atli Benediktsson

17 papers, 2.2K citations

Henrik Boström

15 papers, 244 citations

Sašo Džeroski

14 papers, 538 citations

Robin Genuer

14 papers, 1.9K citations

Network Information
Related Topics (5)
Support vector machine

73.6K papers, 1.7M citations

92% related
Feature selection

41.4K papers, 1M citations

91% related
Decision tree

26.1K papers, 588.1K citations

91% related
Deep learning

79.8K papers, 2.1M citations

90% related
Overfitting

11.6K papers, 441.8K citations

90% related