Book Chapter
How many trees in a random forest
Thais Mayumi Oshiro, Pedro Santoro Perez, José Augusto Baranauskas, et al.
pp. 154–168
TL;DR: Analysis of whether there is an optimal number of trees within a Random Forest finds an experimental relationship for the AUC gain when doubling the number of trees in any forest, and states there is a threshold beyond which there is no significant gain unless a huge computational environment is available.
Abstract
Random Forest is a computationally efficient technique that can operate quickly over large datasets. It has been used in many recent research projects and real-world applications in diverse domains. However, the associated literature provides almost no guidance on how many trees should be used to compose a Random Forest. The research reported here analyzes whether there is an optimal number of trees within a Random Forest, i.e., a threshold beyond which increasing the number of trees brings no significant performance gain and only increases the computational cost. Our main conclusions are: growing the number of trees does not always make the forest perform significantly better than smaller forests, and doubling the number of trees can be worthless. It is also possible to state there is a threshold beyond which there is no significant gain, unless a huge computational environment is available. In addition, an experimental relationship was found for the AUC gain when doubling the number of trees in any forest. Furthermore, as the number of trees grows, the full set of attributes tends to be used within a Random Forest, which may not be desirable in the biomedical domain. Additionally, the density-based dataset metrics proposed here probably capture some aspects of the VC dimension of decision trees: low-density datasets may require large-capacity machines, whilst the opposite also seems to be true.
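The chapter's central experiment, measuring how much AUC is gained each time the number of trees doubles, can be sketched with scikit-learn. The synthetic dataset, the doubling schedule, and the single train/test split below are illustrative assumptions, not the authors' exact protocol.

```python
# Sketch: AUC as the number of trees doubles (illustrative setup, not the
# chapter's actual datasets or evaluation protocol).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

aucs = {}
for n_trees in [2, 4, 8, 16, 32, 64, 128]:  # doubling schedule
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    rf.fit(X_tr, y_tr)
    aucs[n_trees] = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])

# Marginal AUC gain from each doubling; per the chapter's conclusion, these
# gains typically shrink toward zero past some threshold.
gains = [aucs[2 * n] - aucs[n] for n in [2, 4, 8, 16, 32, 64]]
```

Plotting `gains` against the number of trees is one way to see the threshold the abstract describes: beyond it, another doubling buys little AUC at double the training cost.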
Citations
Journal Article
Machine learning
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Journal Article
A survey on semi-supervised learning
TL;DR: This survey aims to provide researchers and practitioners new to the field as well as more advanced readers with a solid understanding of the main approaches and algorithms developed over the past two decades, with an emphasis on the most prominent and currently relevant work.
Journal Article
Hyperparameters and tuning strategies for random forest
TL;DR: A literature review on the parameters' influence on the prediction performance and on variable importance measures is provided, and the application of one of the most established tuning strategies, model‐based optimization (MBO), is demonstrated.
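The survey above concerns tuning the random-forest hyperparameters (mtry, node size, number of trees) via model-based optimization (MBO). As a rough sketch of the same idea, the example below uses scikit-learn's `RandomizedSearchCV` in place of MBO; the dataset and search space are illustrative assumptions.

```python
# Sketch: tuning random-forest hyperparameters. RandomizedSearchCV stands in
# for the model-based optimization (MBO) the survey demonstrates.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_dist = {
    "max_features": [2, 4, 8, "sqrt"],  # mtry
    "min_samples_leaf": [1, 5, 10],     # node size
    "n_estimators": [50, 100, 200],     # number of trees
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_dist, n_iter=10, cv=3, random_state=0,
)
search.fit(X, y)
best = search.best_params_
```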
Journal Article
A machine learning calibration model using random forests to improve sensor performance for lower-cost air quality monitoring
Naomi Zimmerman, Albert A. Presto, Sriniwasa P. N. Kumar, Jason Gu, Aliaksei Hauryliuk, Ellis S. Robinson, Allen L. Robinson, R. Subramanian, et al.
TL;DR: In this paper, the Real-time Affordable Multi-Pollutant (RAMP) sensor package is used to measure CO, NO2, O3, and CO2.
References
Journal Article
Controlling the false discovery rate: a practical and powerful approach to multiple testing
Yoav Benjamini, Yosef Hochberg
TL;DR: In this paper, a different approach to problems of multiple significance testing is presented, which calls for controlling the expected proportion of falsely rejected hypotheses, the false discovery rate; this is equivalent to the FWER when all hypotheses are true but is smaller otherwise.
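The step-up procedure this reference introduces is simple enough to sketch directly: sort the p-values, find the largest k with p(k) ≤ (k/m)·α, and reject the k smallest. The implementation and example p-values below are illustrative, not taken from the paper.

```python
# Sketch: the Benjamini–Hochberg step-up procedure for FDR control.
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean rejection mask controlling the FDR at level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order]
    # Largest k (1-indexed) with p_(k) <= (k / m) * alpha.
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])
        reject[order[: k + 1]] = True  # reject the k smallest p-values
    return reject

# Illustrative p-values: only the two smallest survive at alpha = 0.05.
mask = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 1.0])
```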
Journal Article
Random Forests
TL;DR: Internal estimates monitor error, strength, and correlation, and these are used to show the response to increasing the number of features used in the forest; the estimates are also applicable to regression.
Journal Article
The WEKA data mining software: an update
TL;DR: This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Journal Article
Bagging predictors
TL;DR: Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy.
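Bagging, as this reference describes it, trains many trees on bootstrap resamples and votes their predictions. The comparison below, a single tree versus fifty bagged trees on a noisy synthetic dataset, is an illustrative sketch of that variance-reduction effect, not the paper's experiments.

```python
# Sketch: a single decision tree vs. bagged trees (Breiman-style bagging).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Label noise (flip_y) makes single trees overfit, which bagging smooths out.
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1,
                           random_state=0)

single = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, y, cv=5).mean()
bagged = cross_val_score(
    BaggingClassifier(DecisionTreeClassifier(random_state=0),
                      n_estimators=50, random_state=0),
    X, y, cv=5,
).mean()
# On noisy data, bagging typically yields higher mean accuracy than a
# single tree by averaging away variance.
```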