Book ChapterDOI

New machine learning algorithm: random forest

14 Sep 2012, pp. 246–252
TL;DR: This paper gives an introduction to Random Forest, a machine learning algorithm and ensemble (combination) algorithm that has been widely used in classification and prediction, and in regression too.
Abstract: This paper gives an introduction to Random Forest. Random Forest is a machine learning algorithm and an ensemble (combination) algorithm: it combines a series of tree-structure classifiers. Random Forest has many good characteristics and has been widely used in classification and prediction, as well as in regression. Compared with traditional algorithms, Random Forest has many virtues, so its scope of application is very extensive.
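As the abstract says, a random forest combines a series of tree-structure classifiers and predicts by majority vote. A minimal sketch of the voting step, with hypothetical threshold "stumps" standing in for trained trees:

```python
from collections import Counter

def majority_vote(trees, x):
    """Predict by plurality vote over an iterable of classifiers,
    each a callable mapping a sample to a class label."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical one-feature threshold rules, not real trained trees.
stumps = [
    lambda x: "A" if x < 5 else "B",
    lambda x: "A" if x < 7 else "B",
    lambda x: "A" if x < 3 else "B",
]
print(majority_vote(stumps, 4))  # two of the three stumps vote "A"
```

For regression, the same combination averages the trees' numeric outputs instead of voting.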
Citations
Journal ArticleDOI
TL;DR: Well-known machine learning techniques, namely, SVM, random forest, and extreme learning machine (ELM) are applied and the results indicate that ELM outperforms other approaches in intrusion detection mechanisms.
Abstract: Intrusion detection is a fundamental part of security tools, such as adaptive security appliances, intrusion detection systems, intrusion prevention systems, and firewalls. Various intrusion detection techniques are used, but their performance is an issue. Intrusion detection performance depends on accuracy, which needs to be improved to decrease false alarms and to increase the detection rate. To resolve concerns about performance, multilayer perceptron, support vector machine (SVM), and other techniques have been used in recent work. Such techniques show limitations and are not efficient for large data sets, such as system and network data. An intrusion detection system analyzes huge volumes of traffic data; thus, an efficient classification technique is necessary to overcome this issue. This problem is considered in this paper. Well-known machine learning techniques, namely SVM, random forest, and extreme learning machine (ELM), are applied; these techniques are well known for their capability in classification. The NSL-KDD (knowledge discovery and data mining) data set is used, which is considered a benchmark in the evaluation of intrusion detection mechanisms. The results indicate that ELM outperforms the other approaches.

379 citations


Cites background from "New machine learning algorithm: ran..."

  • ...RF works by creating various decision trees in the training phase and outputting the class label that has the majority vote [12]....


Journal ArticleDOI
TL;DR: Analysis using a random forest method shows that the choices of emission inventory, grid resolution, and aerosol- and gas-phase chemistry are the top three factors affecting model performance for PM2.5 and its chemical species in China.
Abstract: Numerical air quality models (AQMs) have been applied more frequently over the past decade to address diverse scientific and regulatory issues associated with deteriorated air quality in China. Thorough evaluation of a model's ability to replicate monitored conditions (i.e., a model performance evaluation or MPE) helps to illuminate the robustness and reliability of the baseline modeling results and subsequent analyses. However, with numerous input data requirements, diverse model configurations, and the scientific evolution of the models themselves, no two AQM applications are the same and their performance results should be expected to differ. MPE procedures have been developed for Europe and North America, but there is currently no uniform set of MPE procedures and associated benchmarks for China. Here we present an extensive review of model performance for fine particulate matter (PM2.5) AQM applications to China and, from this context, propose a set of statistical benchmarks that can be used to objectively evaluate model performance for PM2.5 AQM applications in China. We compiled MPE results from 307 peer-reviewed articles published between 2006 and 2019, which applied five of the most frequently used AQMs in China. We analyze influences on the range of reported statistics from different model configurations, including modeling regions and seasons, spatial resolution of modeling grids, temporal resolution of the MPE, etc. Analysis using a random forest method shows that the choices of emission inventory, grid resolution, and aerosol- and gas-phase chemistry are the top three factors affecting model performance for PM2.5. We propose benchmarks for six frequently used evaluation metrics for AQM applications in China, including two tiers – "goals" and "criteria" – where goals represent the best model performance that a model is currently expected to achieve and criteria represent the model performance that the majority of studies can meet.
Our results form a benchmark framework for the modeling performance of PM2.5 and its chemical species in China. For instance, in order to meet the goal and criteria, the normalized mean bias (NMB) for total PM2.5 should be within 10% and 20%, while the normalized mean error (NME) should be within 35% and 45%, respectively. The goal and criteria values of correlation coefficients for evaluating hourly and daily PM2.5 are 0.70 and 0.60, respectively; corresponding values are higher when the index of agreement (IOA) is used (0.80 for goal and 0.70 for criteria). Results from this study will support the ever-growing modeling community in China by providing a more objective assessment and context for how well their results compare with previous studies and to better demonstrate the credibility and robustness of their AQM applications prior to subsequent regulatory assessments.
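The NMB and NME benchmarks above follow the standard definitions, which normalize the summed bias or summed absolute error by the summed observations. A minimal sketch (the PM2.5 values are hypothetical, not from the study):

```python
def nmb(model, obs):
    """Normalized mean bias: sum of (model - obs) over sum of obs."""
    return sum(m - o for m, o in zip(model, obs)) / sum(obs)

def nme(model, obs):
    """Normalized mean error: sum of |model - obs| over sum of obs."""
    return sum(abs(m - o) for m, o in zip(model, obs)) / sum(obs)

obs   = [40.0, 50.0, 60.0]   # hypothetical observed PM2.5, ug/m3
model = [44.0, 45.0, 63.0]   # hypothetical modeled PM2.5, ug/m3
print(f"NMB = {nmb(model, obs):+.1%}, NME = {nme(model, obs):.1%}")
```

These hypothetical values (NMB about +1.3%, NME about 8%) would meet the proposed goals of NMB within 10% and NME within 35%.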

41 citations


Cites background or methods from "New machine learning algorithm: ran..."

  • ...MNB) and MNE were only used in 15 and 10 studies, respectively, since as mentioned in Simon et al. (2012), these two metrics tend to give more weight to data at low values....


  • ...Random forest is a machine learning method suitable for classification and regression (Liu et al., 2012)....


Journal ArticleDOI
20 Mar 2021
TL;DR: The SVM algorithm is found to be the best algorithm that can be used to detect an intrusion in the system and gets a higher average precision compared with the other two algorithms.
Abstract: Intrusion detection systems (IDS) are used to analyze huge data and diagnose anomalous traffic such as DDoS attacks; thus, an efficient traffic classification method is necessary for the IDS. IDS models attempt to decrease false-alarm rates and increase true-alarm rates in order to improve the accuracy of the system. To resolve this concern, three machine learning algorithms have been tested and evaluated in this research: decision jungle (DJ), random forest (RF), and support vector machine (SVM). The main objective is to propose an ML-based network intrusion detection system (ML-based NIDS) model that compares the performance of the three algorithms based on their accuracy and precision on anomalous traffic. The knowledge discovery in databases (KDD) methodology and the intrusion detection evaluation dataset (CIC-IDS2017), both considered benchmarks in the evaluation of IDS, are used in the testing. The average accuracy of the SVM is 98.18%, of the RF 96.76%, and of the DJ 96.50%; the highest accuracy is achieved by the SVM. The average precision of the SVM is 98.74, of the RF 97.96, and of the DJ 97.82, so the SVM has the highest average precision. The average recall of the SVM is 95.63, of the RF 97.62, and of the DJ 95.77; the RF achieves the highest average recall. Overall, the SVM is found to be the best algorithm for detecting an intrusion in the system.
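The accuracy, precision, and recall figures above rest on the standard confusion-matrix definitions; a minimal sketch with hypothetical counts (not the paper's data):

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, precision, and recall from binary confusion counts."""
    accuracy  = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)   # fraction of flagged flows that were attacks
    recall    = tp / (tp + fn)   # fraction of attacks that were flagged
    return accuracy, precision, recall

# Hypothetical confusion counts for an anomaly-traffic classifier.
acc, prec, rec = metrics(tp=950, fp=12, tn=980, fn=43)
print(f"accuracy={acc:.2%} precision={prec:.2%} recall={rec:.2%}")
```

A classifier can trade precision against recall, which is why the SVM leads on precision while the RF leads on recall in the results above.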

33 citations


Cites result from "New machine learning algorithm: ran..."

  • ...The irregular, or random, process of the RF has a low characteristic error, in contrast to some other classification algorithms [18]....


Journal ArticleDOI
TL;DR: This paper investigates the impact of scanning systems (scanners) and cycle-GAN-based normalization on algorithm performance by comparing different deep learning models for automatically detecting prostate cancer in whole-slide images, and examines the application of normalization as a pre-processing step by two techniques.
Abstract: Algorithms can improve the objectivity and efficiency of histopathologic slide analysis. In this paper, we investigated the impact of scanning systems (scanners) and cycle-GAN-based normalization on algorithm performance, by comparing different deep learning models to automatically detect prostate cancer in whole-slide images. Specifically, we compare U-Net, DenseNet and EfficientNet. Models were developed on a multi-center cohort with 582 WSIs and subsequently evaluated on two independent test sets including 85 and 50 WSIs, respectively, to show the robustness of the proposed method to differing staining protocols and scanner types. We also investigated the application of normalization as a pre-processing step by two techniques, the whole-slide image color standardizer (WSICS) algorithm, and a cycle-GAN based method. For the two independent datasets we obtained an AUC of 0.92 and 0.83 respectively. After rescanning the AUC improves to 0.91/0.88 and after style normalization to 0.98/0.97. In the future our algorithm could be used to automatically pre-screen prostate biopsies to alleviate the workload of pathologists.

31 citations

Journal ArticleDOI
24 Apr 2022-Water
TL;DR: This review offers a cross-section of peer-reviewed, critical water-based applications that have been coupled with AI or ML, including chlorination, adsorption, membrane filtration, water-quality-index monitoring, water-quality-parameter modeling, river-level monitoring, and aquaponics/hydroponics automation/monitoring.
Abstract: Artificial-intelligence methods and machine-learning models have demonstrated their ability to optimize, model, and automate critical water- and wastewater-treatment applications, natural-systems monitoring and management, and water-based agriculture such as hydroponics and aquaponics. In addition to providing computer-assisted aid to complex issues surrounding water chemistry and physical/biological processes, artificial intelligence and machine-learning (AI/ML) applications are anticipated to further optimize water-based applications and decrease capital expenses. This review offers a cross-section of peer reviewed, critical water-based applications that have been coupled with AI or ML, including chlorination, adsorption, membrane filtration, water-quality-index monitoring, water-quality-parameter modeling, river-level monitoring, and aquaponics/hydroponics automation/monitoring. Although success in control, optimization, and modeling has been achieved with the AI methods, ML models, and smart technologies (including the Internet of Things (IoT), sensors, and systems based on these technologies) that are reviewed herein, key challenges and limitations were common and pervasive throughout. Poor data management, low explainability, poor model reproducibility and standardization, as well as a lack of academic transparency are all important hurdles to overcome in order to successfully implement these intelligent applications. Recommendations to aid explainability, data management, reproducibility, and model causality are offered in order to overcome these hurdles and continue the successful implementation of these powerful tools.

30 citations

References
Journal ArticleDOI
01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to AdaBoost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.

79,257 citations


"New machine learning algorithm: ran..." refers background in this paper

  • ...s is the strength of the set of classifiers {h(x, θ)}, ρ is the mean value of the correlation [6]....

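The strength s and mean correlation ρ in the quoted line combine in Breiman's upper bound on the generalization error PE* of the forest:

```latex
PE^{*} \;\le\; \frac{\bar{\rho}\,\left(1 - s^{2}\right)}{s^{2}}
```

So the forest improves as the individual trees grow stronger and less correlated, which is exactly what the random feature selection aims at.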

Journal ArticleDOI
01 Aug 1996
TL;DR: Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy.
Abstract: Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.
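The bootstrap-and-aggregate procedure described above can be sketched as follows; the one-feature stump base learner is a hypothetical illustration, not the trees or subset-selection regressions tested in the paper:

```python
import random
from collections import Counter

def bagging_predict(train, learn, x, n_models=25, seed=0):
    """Bagging: fit `learn` on bootstrap replicates of `train`, then
    aggregate the fitted versions by plurality vote on sample `x`."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        boot = rng.choices(train, k=len(train))  # bootstrap replicate
        votes.append(learn(boot)(x))
    return Counter(votes).most_common(1)[0][0]

def stump_learner(sample):
    """Hypothetical unstable base learner: a one-feature stump that
    splits at the mean feature value of its (bootstrap) sample."""
    cut = sum(f for f, _ in sample) / len(sample)
    left  = Counter(y for f, y in sample if f <  cut)
    right = Counter(y for f, y in sample if f >= cut)
    l = left.most_common(1)[0][0] if left else right.most_common(1)[0][0]
    r = right.most_common(1)[0][0] if right else l
    return lambda x: l if x < cut else r

train = [(1.0, "A"), (1.5, "A"), (2.0, "A"),
         (8.0, "B"), (9.0, "B"), (9.5, "B")]
print(bagging_predict(train, stump_learner, 1.2))  # votes settle on "A"
```

The plurality vote stabilizes the base learner's predictions, and as the abstract notes, that instability is the vital element for bagging to give a gain.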

16,118 citations


"New machine learning algorithm: ran..." refers methods in this paper

  • ...k_i ∈ [−1, 1]....


  • ...In 1996, Leo Breiman proposed the Bagging algorithm, one of the early-stage ensemble algorithms [1]....


Proceedings Article
Yoav Freund1, Robert E. Schapire1
03 Jul 1996
TL;DR: This paper describes experiments carried out to assess how well AdaBoost, with and without pseudo-loss, performs on real learning problems, and compares boosting to Breiman's "bagging" method when used to aggregate various classifiers.
Abstract: In an earlier paper, we introduced a new "boosting" algorithm called AdaBoost which, theoretically, can be used to significantly reduce the error of any learning algorithm that consistently generates classifiers whose performance is a little better than random guessing. We also introduced the related notion of a "pseudo-loss" which is a method for forcing a learning algorithm of multi-label concepts to concentrate on the labels that are hardest to discriminate. In this paper, we describe experiments we carried out to assess how well AdaBoost with and without pseudo-loss, performs on real learning problems. We performed two sets of experiments. The first set compared boosting to Breiman's "bagging" method when used to aggregate various classifiers (including decision trees and single attribute-value tests). We compared the performance of the two methods on a collection of machine-learning benchmarks. In the second set of experiments, we studied in more detail the performance of boosting using a nearest-neighbor classifier on an OCR problem.

7,601 citations


Journal ArticleDOI
Tin Kam Ho1
TL;DR: A method to construct a decision-tree-based classifier is proposed that maintains the highest accuracy on training data and improves generalization accuracy as it grows in complexity.
Abstract: Much of previous attention on decision trees focuses on the splitting criteria and optimization of tree sizes. The dilemma between overfitting and achieving maximum accuracy is seldom resolved. A method to construct a decision tree based classifier is proposed that maintains highest accuracy on training data and improves on generalization accuracy as it grows in complexity. The classifier consists of multiple trees constructed systematically by pseudorandomly selecting subsets of components of the feature vector, that is, trees constructed in randomly chosen subspaces. The subspace method is compared to single-tree classifiers and other forest construction methods by experiments on publicly available datasets, where the method's superiority is demonstrated. We also discuss independence between trees in a forest and relate that to the combined classification accuracy.
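The pseudorandom feature-subset selection at the heart of the random subspace method can be sketched as follows (function names and dimensions are illustrative, not from the paper):

```python
import random

def random_subspaces(n_features, subspace_dim, n_trees, seed=0):
    """Pseudorandomly select the subset of feature-vector components
    (the subspace) in which each tree will be constructed."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), subspace_dim))
            for _ in range(n_trees)]

def project(x, subspace):
    """Restrict a feature vector to one tree's chosen subspace."""
    return [x[i] for i in subspace]

# Five trees, each grown in a 4-dimensional subspace of a 10-feature space.
subspaces = random_subspaces(n_features=10, subspace_dim=4, n_trees=5)
```

Each tree would then be trained only on its projected features, and the trees' outputs combined, as in the other forest constructions above; different subspaces keep the trees relatively independent.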

5,984 citations


"New machine learning algorithm: ran..." refers background in this paper

  • ...Ho [4] has studied the "random subspace" method extensively, which grows each tree by a random selection of a subset of features....
