Showing papers on "Random forest" published in 2018

Journal Article•DOI•

01 Jul 2018-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: The concept of ensemble learning is introduced, traditional, novel and state‐of‐the‐art ensemble methods are reviewed and current challenges and trends in the field are discussed.

Abstract: Ensemble methods are considered the state‐of‐the‐art solution for many machine learning challenges. Such methods improve the predictive performance of a single model by training multiple models and combining their predictions. This paper introduces the concept of ensemble learning, reviews traditional, novel and state‐of‐the‐art ensemble methods and discusses current challenges and trends in the field.

1,381 citations
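
The "train multiple models and combine their predictions" idea in the abstract above can be illustrated with a minimal sketch. The three threshold rules below are hypothetical stand-ins for trained base models (nothing here comes from the paper), and the two combiners shown, plurality voting and averaging, are only two of the many combination schemes such a survey would cover.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Combine class labels from several models by plurality vote."""
    # most_common(1) returns the most frequent label; on a tie the label
    # that was seen first wins.
    return Counter(labels).most_common(1)[0][0]


def average_prediction(values):
    """Combine numeric predictions from several models by averaging."""
    return mean(values)


# Hypothetical base models: hand-written rules standing in for trained models.
base_models = [
    lambda x: "high" if x[0] > 0.5 else "low",
    lambda x: "high" if x[1] > 0.3 else "low",
    lambda x: "high" if x[0] + x[1] > 1.0 else "low",
]


def ensemble_predict(x):
    """Ask every base model for a label and return the plurality label."""
    return majority_vote([model(x) for model in base_models])
```

For the input `(0.6, 0.2)` only the first rule votes "high", so the ensemble returns "low"; for `(0.7, 0.4)` all three rules agree on "high".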

Journal Article•DOI•

Estimation and Inference of Heterogeneous Treatment Effects using Random Forests

Stefan Wager¹, Susan Athey¹•Institutions (1)

Stanford University¹

06 Jun 2018-Journal of the American Statistical Association

TL;DR: This paper developed a non-parametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm, and showed that causal forests are pointwise consistent for the true treatment effect and have an asymptotically Gaussian and centered sampling distribution.

Abstract: Many scientific and engineering challenges—ranging from personalized medicine to customized marketing recommendations—require an understanding of treatment effect heterogeneity. In this paper, we develop a non-parametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm. In the potential outcomes framework with unconfoundedness, we show that causal forests are pointwise consistent for the true treatment effect, and have an asymptotically Gaussian and centered sampling distribution. We also discuss a practical method for constructing asymptotic confidence intervals for the true treatment effect that are centered at the causal forest estimates. Our theoretical results rely on a generic Gaussian theory for a large family of random forest algorithms. To our knowledge, this is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical infe...

1,156 citations

Journal Article•DOI•

Implementation of machine-learning classification in remote sensing: an applied review

Aaron E. Maxwell¹, Timothy A. Warner¹, Fang Fang¹•Institutions (1)

West Virginia University¹

02 Feb 2018-International Journal of Remote Sensing

TL;DR: An overview of machine learning from an applied perspective focuses on the relatively mature methods of support vector machines, single decision trees (DTs), Random Forests, boosted DTs, artificial neural networks, and k-nearest neighbours (k-NN).

Abstract: Machine learning offers the potential for effective and efficient classification of remotely sensed imagery. The strengths of machine learning include the capacity to handle data of high dimensionality and to map classes with very complex characteristics. Nevertheless, implementing a machine-learning classification is not straightforward, and the literature provides conflicting advice regarding many key issues. This article therefore provides an overview of machine learning from an applied perspective. We focus on the relatively mature methods of support vector machines, single decision trees (DTs), Random Forests, boosted DTs, artificial neural networks, and k-nearest neighbours (k-NN). Issues considered include the choice of algorithm, training data requirements, user-defined parameter selection and optimization, feature space impacts and reduction, and computational costs. We illustrate these issues through applying machine-learning classification to two publically available remotely sensed dat...

919 citations

Journal Article•DOI•

Sentinel-2 cropland mapping using pixel-based and object-based time-weighted dynamic time warping analysis

Mariana Belgiu¹, Ovidiu Csillik²,³•Institutions (3)

University of Twente¹, University of California, Berkeley², University of Salzburg³

01 Jan 2018-Remote Sensing of Environment

TL;DR: Object-based time-weighted dynamic time warping (TWDTW) method achieved comparable classification results to RF in Romania and Italy, but RF achieved better results in the USA, where the classified crops present high intra-class spectral variability.

556 citations

Journal Article•DOI•

Predicting reaction performance in C–N cross-coupling using machine learning

Derek T. Ahneman¹, Jesús G. Estrada¹, Shishi Lin², Spencer D. Dreher², Abigail G. Doyle¹•Institutions (2)

Princeton University¹, Merck & Co.²

13 Apr 2018-Science

TL;DR: It is demonstrated that machine learning can be used to predict the performance of a synthetic reaction in multidimensional chemical space using data obtained via high-throughput experimentation, and that a random forest algorithm provides significantly improved predictive performance over linear regression analysis.

Abstract: Machine learning methods are becoming integral to scientific inquiry in numerous disciplines. We demonstrated that machine learning can be used to predict the performance of a synthetic reaction in multidimensional chemical space using data obtained via high-throughput experimentation. We created scripts to compute and extract atomic, molecular, and vibrational descriptors for the components of a palladium-catalyzed Buchwald-Hartwig cross-coupling of aryl halides with 4-methylaniline in the presence of various potentially inhibitory additives. Using these descriptors as inputs and reaction yield as output, we showed that a random forest algorithm provides significantly improved predictive performance over linear regression analysis. The random forest model was also successfully applied to sparse training sets and out-of-sample prediction, suggesting its value in facilitating adoption of synthetic methodology.

536 citations

Journal Article•DOI•

Random Forest as a generic framework for predictive modeling of spatial and spatio-temporal variables

Tomislav Hengl, Madlene Nussbaum¹, Marvin N. Wright², Gerard B. M. Heuvelink, Benedikt Gräler•Institutions (2)

Bern University of Applied Sciences¹, Leibniz Association²

29 Aug 2018-PeerJ

TL;DR: This paper presents a random forest for spatial predictions framework (RFsp) in which buffer distances from observation points are used as explanatory variables, thus incorporating geographical proximity effects into the prediction process. RFsp appears to be especially attractive for building multivariate spatial prediction models that can be used as “knowledge engines” in various geoscience fields.

Abstract: Random forest and similar Machine Learning techniques are already used to generate spatial predictions, but spatial location of points (geography) is often ignored in the modeling process. Spatial auto-correlation, especially if still existent in the cross-validation residuals, indicates that the predictions may be biased, and this is suboptimal. This paper presents a random forest for spatial predictions framework (RFsp) where buffer distances from observation points are used as explanatory variables, thus incorporating geographical proximity effects into the prediction process. The RFsp framework is illustrated with examples that use textbook datasets and apply spatial and spatio-temporal prediction to numeric, binary, categorical, multivariate and spatiotemporal variables. Performance of the RFsp framework is compared with the state-of-the-art kriging techniques using fivefold cross-validation with refitting. The results show that RFsp can obtain equally accurate and unbiased predictions as different versions of kriging. Advantages of using RFsp over kriging are that it needs no rigid statistical assumptions about the distribution and stationarity of the target variable, it is more flexible towards incorporating, combining and extending covariates of different types, and it possibly yields more informative maps characterizing the prediction error. RFsp appears to be especially attractive for building multivariate spatial prediction models that can be used as "knowledge engines" in various geoscience fields. Some disadvantages of RFsp are the exponentially growing computational intensity with increase of calibration data and covariates and the high sensitivity of predictions to input data quality.
The key to the success of the RFsp framework might be the training data quality, especially quality of spatial sampling (to minimize extrapolation problems and any type of bias in data), and quality of model validation (to ensure that accuracy is not affected by overfitting). For many data sets, especially those with lower number of points and covariates and close-to-linear relationships, model-based geostatistics can still lead to more accurate predictions than RFsp.

453 citations
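
The central idea of the RFsp abstract above, using distances from observation points as explanatory variables, can be sketched as a feature-construction step. This is an illustration under stated assumptions, not the authors' implementation: it assumes planar coordinates and plain Euclidean distance, and the function name `buffer_distance_features` is hypothetical.

```python
import math


def buffer_distance_features(obs_points, targets):
    """Return, for each target location, its distance to every observation point.

    Each row holds one distance per observation point, so a model fed these
    rows can learn geographical proximity effects. Assumes planar (x, y)
    coordinates and Euclidean distance.
    """
    return [
        [math.hypot(tx - ox, ty - oy) for ox, oy in obs_points]
        for tx, ty in targets
    ]


# Two observation points and two prediction locations (made-up coordinates).
observations = [(0.0, 0.0), (3.0, 4.0)]
features = buffer_distance_features(observations, [(0.0, 0.0), (3.0, 0.0)])
```

Each row of `features` could then be joined with other covariates and passed to any random forest implementation. Note that the number of distance columns grows with the number of observation points, which is one plausible reading of the computational cost the abstract mentions.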

Journal Article•DOI•

Random forest versus logistic regression: a large-scale benchmark experiment.

Raphaël Couronné¹, Philipp Probst¹, Anne-Laure Boulesteix¹•Institutions (1)

Ludwig Maximilian University of Munich¹

01 Jan 2018-BMC Bioinformatics

TL;DR: A large scale benchmarking experiment based on 243 real datasets comparing the prediction performance of the original version of RF with default parameters and LR as binary classification tools suggests a significantly better performance of RF.

Abstract: The Random Forest (RF) algorithm for regression and classification has considerably gained popularity since its introduction in 2001. Meanwhile, it has grown to a standard classification approach competing with logistic regression in many innovation-friendly scientific fields. In this context, we present a large scale benchmarking experiment based on 243 real datasets comparing the prediction performance of the original version of RF with default parameters and LR as binary classification tools. Most importantly, the design of our benchmark experiment is inspired by clinical trial methodology, thus avoiding common pitfalls and major sources of biases. RF performed better than LR according to the considered accuracy measure in approximately 69% of the datasets. The mean difference between RF and LR was 0.029 (95%-CI = [0.022, 0.038]) for the accuracy, 0.041 (95%-CI = [0.031, 0.053]) for the Area Under the Curve, and −0.027 (95%-CI = [−0.034, −0.021]) for the Brier score, all measures thus suggesting a significantly better performance of RF. As a side-result of our benchmarking experiment, we observed that the results were noticeably dependent on the inclusion criteria used to select the example datasets, thus emphasizing the importance of clear statements regarding this dataset selection process. We also stress that neutral studies similar to ours, based on a high number of datasets and carefully designed, will be necessary in the future to evaluate further variants, implementations or parameters of random forests which may yield improved accuracy compared to the original version with default values.

449 citations
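
Two of the measures reported in the abstract above, accuracy and the Brier score, are simple to compute from predicted probabilities. The sketch below uses their standard definitions; the labels and probabilities are invented for illustration and the 0.5 decision threshold is an assumption, not a detail taken from the benchmark.

```python
from statistics import mean


def accuracy(y_true, p_hat, threshold=0.5):
    """Share of cases where the thresholded probability matches the 0/1 label."""
    return mean(
        [1.0 if (p >= threshold) == bool(y) else 0.0 for y, p in zip(y_true, p_hat)]
    )


def brier_score(y_true, p_hat):
    """Mean squared gap between predicted probability and 0/1 label (lower is better)."""
    return mean([(p - y) ** 2 for y, p in zip(y_true, p_hat)])


# Invented labels and predicted probabilities for two hypothetical models.
y = [1, 0, 1, 0]
p_model_a = [0.9, 0.2, 0.6, 0.4]
p_model_b = [0.7, 0.4, 0.4, 0.3]

acc_diff = accuracy(y, p_model_a) - accuracy(y, p_model_b)
brier_diff = brier_score(y, p_model_a) - brier_score(y, p_model_b)
```

Because a lower Brier score is better, a negative Brier difference and a positive accuracy difference both point in favor of the first model, which matches the sign pattern reported in the abstract.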

Journal Article•DOI•

The revival of the Gini importance

Stefano Nembrini¹, Inke R. König², Marvin N. Wright²,³•Institutions (3)

University of Florida¹, University of Lübeck², Leibniz Association³

01 Nov 2018-Bioinformatics

TL;DR: A fast approach to debias impurity‐based variable importance measures for classification, regression and survival forests is set up, showing that it creates a variable importance measure which is unbiased with regard to the number of categories and minor allele frequency and almost as fast as the standard impurity importance.

Abstract: Motivation: Random forests are fast, flexible and represent a robust approach to analyze high dimensional data. A key advantage over alternative machine learning algorithms is the availability of variable importance measures, which can be used to identify relevant features or perform variable selection. Measures based on the impurity reduction of splits, such as the Gini importance, are popular because they are simple and fast to compute. However, they are biased in favor of variables with many possible split points and high minor allele frequency. Results: We set up a fast approach to debias impurity-based variable importance measures for classification, regression and survival forests. We show that it creates a variable importance measure which is unbiased with regard to the number of categories and minor allele frequency and almost as fast as the standard impurity importance. As a result, it is now possible to compute reliable importance estimates without the extra computing cost of permutations. Further, we combine the importance measure with a fast testing procedure, producing p-values for variable importance with almost no computational overhead to the creation of the random forest. Applications to gene expression and genome-wide association data show that the proposed method is powerful and computationally efficient. Availability and implementation: The procedure is included in the ranger package, available at https://cran.r-project.org/package=ranger and https://github.com/imbs-hl/ranger. Supplementary information: Supplementary data are available at Bioinformatics online.

374 citations
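
The "impurity reduction of splits" that the abstract above refers to can be made concrete with a short sketch. It shows only the standard Gini impurity and the decrease produced by one split, which is the building block of the conventional (biased) Gini importance; it is not the debiased measure the paper proposes, and the function names are illustrative.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a set of class labels: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    return (
        gini(parent)
        - (len(left) / n) * gini(left)
        - (len(right) / n) * gini(right)
    )


parent = [0, 0, 1, 1]
perfect_split = impurity_decrease(parent, [0, 0], [1, 1])
useless_split = impurity_decrease(parent, [0, 1], [0, 1])
```

A split that separates the classes completely removes all of the parent's impurity, while a split that leaves both children as mixed as the parent removes none. Summing such decreases over all splits on a variable gives its impurity importance, and a variable with more candidate split points has more chances to achieve a decrease by chance, which is the source of the bias described in the abstract.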

Journal Article•DOI•

Hyperparameters and Tuning Strategies for Random Forest

Philipp Probst¹, Marvin N. Wright², Anne-Laure Boulesteix¹•Institutions (2)

Ludwig Maximilian University of Munich¹, Leibniz Association²

10 Apr 2018-arXiv: Machine Learning

TL;DR: In this article, the authors provide a literature review on the parameters' influence on the prediction performance and on variable importance measures, and demonstrate the application of one of the most established tuning strategies, model-based optimization (MBO).

Abstract: The random forest algorithm (RF) has several hyperparameters that have to be set by the user, e.g., the number of observations drawn randomly for each tree and whether they are drawn with or without replacement, the number of variables drawn randomly for each split, the splitting rule, the minimum number of samples that a node must contain and the number of trees. In this paper, we first provide a literature review on the parameters' influence on the prediction performance and on variable importance measures. It is well known that in most cases RF works reasonably well with the default values of the hyperparameters specified in software packages. Nevertheless, tuning the hyperparameters can improve the performance of RF. In the second part of this paper, after a brief overview of tuning strategies we demonstrate the application of one of the most established tuning strategies, model-based optimization (MBO). To make it easier to use, we provide the tuneRanger R package that tunes RF with MBO automatically. In a benchmark study on several datasets, we compare the prediction performance and runtime of tuneRanger with other tuning implementations in R and RF with default hyperparameters.

366 citations

Journal Article•DOI•

A machine learning method to estimate PM2.5 concentrations across China with remote sensing, meteorological and land use information.

Gongbo Chen¹, Shanshan Li¹, Luke D. Knibbs², Nicholas A. S. Hamm³, Wei Cao⁴, Tiantian Li⁵, Jianping Guo, Hongyan Ren⁴, Michael J. Abramson¹, Yuming Guo¹•Institutions (5)

Monash University¹, University of Queensland², The University of Nottingham Ningbo China³, Chinese Academy of Sciences⁴, Chinese Center for Disease Control and Prevention⁵

15 Sep 2018-Science of The Total Environment

TL;DR: Taking advantage of a novel application of modeling framework and the most recent ground-level PM2.5 observations, the machine learning method showed higher predictive ability than previous studies.

331 citations

Journal Article•DOI•

A new hybrid classification algorithm for customer churn prediction based on logistic regression and decision trees

Arno De Caigny¹, Kristof Coussement¹, Koen W. De Bock•Institutions (1)

Lille Catholic University¹

01 Sep 2018-European Journal of Operational Research

TL;DR: In this article, the authors proposed a new hybrid algorithm, the logit leaf model (LLM), which consists of two stages: a segmentation phase and a prediction phase, where in the first stage customer segments are identified using decision rules and in the second stage a model is created for every leaf of this tree.

Journal Article•DOI•

Prediction of the landslide susceptibility: Which algorithm, which precision?

Hamid Reza Pourghasemi¹, Omid Rahmati²•Institutions (2)

Shiraz University¹, Lorestan University²

01 Mar 2018-Catena

TL;DR: The first comprehensive comparison among the performances of ten advanced machine learning techniques (MLTs) including artificial neural networks (ANNs), boosted regression tree (BRT), classification and regression trees (CART), generalized linear model (GLM), generalized additive model (GAM), multivariate adaptive regression splines (MARS), naive Bayes (NB), quadratic discriminant analysis (QDA), random forest (RF), and support vector machines (SVM) is presented.

Abstract: Coupling machine learning algorithms with spatial analytical techniques for landslide susceptibility modeling is an issue worth considering. The current research therefore intends to present the first comprehensive comparison among the performances of ten advanced machine learning techniques (MLTs) including artificial neural networks (ANNs), boosted regression tree (BRT), classification and regression trees (CART), generalized linear model (GLM), generalized additive model (GAM), multivariate adaptive regression splines (MARS), naive Bayes (NB), quadratic discriminant analysis (QDA), random forest (RF), and support vector machines (SVM) for modeling landslide susceptibility and evaluating the importance of variables in GIS and R open source software. This study was carried out in the Ghaemshahr Region, Iran. The performance of the MLTs has been evaluated using the area under ROC curve (AUC-ROC) approach. The results showed that AUC values for the ten MLTs vary from 62.4 to 83.7%. It has been found that RF (AUC = 83.7%) and BRT (AUC = 80.7%) perform best compared with the other MLTs.

Journal Article•DOI•

Performance evaluation of the GIS-based data mining techniques of best-first decision tree, random forest, and naïve Bayes tree for landslide susceptibility modeling

Wei Chen¹, Shuai Zhang¹, Renwei Li¹, Himan Shahabi²•Institutions (2)

Xi'an University of Science and Technology¹, University of Kurdistan²

10 Dec 2018-Science of The Total Environment

TL;DR: The main aim of the present study is to explore and compare three state-of-the-art data mining techniques, best-first decision tree, random forest, and naïve Bayes tree, for landslide susceptibility assessment in the Longhai area of China.

Journal Article•DOI•

Deep Learning Approach Combining Sparse Autoencoder With SVM for Network Intrusion Detection

Majjed Al-Qatf¹, Yu Lasheng¹, Mohammed Al-Habib¹, Kamal Al-Sabahi¹•Institutions (1)

Central South University¹

24 Sep 2018-IEEE Access

TL;DR: The proposed STL-IDS approach improves network intrusion detection and provides a new research method for intrusion detection, and has accelerated SVM training and testing times and performed better than most of the previous approaches in terms of performance metrics in binary and multiclass classification.

Abstract: Network intrusion detection systems (NIDSs) provide a better solution to network security than other traditional network defense technologies, such as firewall systems. The success of NIDS is highly dependent on the performance of the algorithms and improvement methods used to increase the classification accuracy and decrease the training and testing times of the algorithms. We propose an effective deep learning approach, self-taught learning (STL)-IDS, based on the STL framework. The proposed approach is used for feature learning and dimensionality reduction. It reduces training and testing time considerably and effectively improves the prediction accuracy of support vector machines (SVM) with regard to attacks. The proposed model is built using the sparse autoencoder mechanism, which is an effective learning algorithm for reconstructing a new feature representation in an unsupervised manner. After the pre-training stage, the new features are fed into the SVM algorithm to improve its detection capability for intrusion and classification accuracy. Moreover, the efficiency of the approach in binary and multiclass classification is studied and compared with that of shallow classification methods, such as J48, naive Bayesian, random forest, and SVM. Results show that our approach has accelerated SVM training and testing times and performed better than most of the previous approaches in terms of performance metrics in binary and multiclass classification. The proposed STL-IDS approach improves network intrusion detection and provides a new research method for intrusion detection.

Book Chapter•DOI•

Gradient Boosting Machine

V Kishore Ayyadevara

01 Jan 2018

TL;DR: This chapter builds on decision trees and random forest, a bagging (bootstrap aggregating) algorithm that combines the output of multiple decision trees to give the prediction.

Abstract: So far, we’ve considered decision trees and random forest algorithms. We saw that random forest is a bagging (bootstrap aggregating) algorithm—it combines the output of multiple decision trees to give the prediction. Typically, in a bagging algorithm trees are grown in parallel to get the average prediction across all trees, where each tree is built on a sample of original data.
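
The bagging procedure described in the abstract above (each tree built on a sample of the original data, predictions averaged across trees) can be sketched as follows. This is a simplified illustration, not the chapter's code: a one-split regression stump stands in for a full decision tree, the data are invented, and the per-split feature sampling that a random forest adds on top of bagging is left out.

```python
import random
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) rows with replacement."""
    return [rng.choice(data) for _ in data]


def fit_stump(sample):
    """Fit a one-split regression stump on (x, y) pairs by minimizing squared error."""
    best = None
    xs = sorted({x for x, _ in sample})
    for t in xs[:-1]:  # candidate thresholds; the split rule is x <= t
        left = [y for x, y in sample if x <= t]
        right = [y for x, y in sample if x > t]
        l_mean, r_mean = mean(left), mean(right)
        sse = sum((y - l_mean) ** 2 for y in left) + sum(
            (y - r_mean) ** 2 for y in right
        )
        if best is None or sse < best[0]:
            best = (sse, t, l_mean, r_mean)
    if best is None:  # only one distinct x value in this bootstrap sample
        overall = mean(y for _, y in sample)
        return lambda x: overall
    _, t, l_mean, r_mean = best
    return lambda x: l_mean if x <= t else r_mean


def bagged_predict(data, x, n_models=25, seed=0):
    """Average the predictions of stumps fitted on independent bootstrap samples."""
    rng = random.Random(seed)
    models = [fit_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]
    return mean(model(x) for model in models)


# Invented data: y jumps from 1.0 to 5.0 between x = 3 and x = 4.
data = [(1, 1.0), (2, 1.0), (3, 1.0), (4, 5.0), (5, 5.0), (6, 5.0)]
low = bagged_predict(data, 1)
high = bagged_predict(data, 6)
```

Because each stump is fitted independently of the others, the loop in `bagged_predict` could run in parallel, which is the contrast with boosting that the chapter title points toward.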

Journal Article•DOI•

Predictive modelling for solar thermal energy systems: A comparison of support vector regression, random forest, extra trees and regression trees

Muhammad Waseem Ahmad¹, Jonathan Reynolds¹, Yacine Rezgui¹•Institutions (1)

Cardiff University¹

01 Dec 2018-Journal of Cleaner Production

TL;DR: It was found that RF and ET have comparable predictive power and are equally applicable for predicting useful solar thermal energy (USTE), with root mean square error values of 6.86 and 7.12 on the testing dataset, respectively.

Proceedings Article•DOI•

Plant Disease Detection Using Machine Learning

Shima Ramesh Maniyath¹, Vinod P², Niveditha M¹, Pooja R¹, Prasad Bhat N¹, Shashank N¹, Ramachandra Hebbar²•Institutions (2)

MVJ College of Engineering¹, Indian Space Research Organisation²

25 Apr 2018

TL;DR: Using machine learning to train on the large publicly available data sets gives a clear way to detect the disease present in plants on a large scale.

Abstract: Crop diseases are a significant risk to food security, yet their rapid identification remains difficult in many parts of the world because the necessary infrastructure is lacking. The emergence of accurate techniques in the field of leaf-based image classification has shown impressive results. This paper makes use of Random Forest to distinguish between healthy and diseased leaves in the data sets created. Our proposed paper includes various phases of implementation, namely dataset creation, feature extraction, training the classifier and classification. The created datasets of diseased and healthy leaves are collectively trained under Random Forest to classify the diseased and healthy images. For extracting features of an image we use Histogram of Oriented Gradients (HOG). Overall, using machine learning to train on the large publicly available data sets gives us a clear way to detect the disease present in plants on a large scale.

Journal Article•DOI•

Performance Analysis of IoT-Based Sensor, Big Data Processing, and Machine Learning Model for Real-Time Monitoring System in Automotive Manufacturing.

Muhammad Syafrudin¹, Ganjar Alfian¹, Norma Latif Fitriyani¹, Jongtae Rhee¹•Institutions (1)

Dongguk University¹

04 Sep 2018-Sensors

TL;DR: The results showed that IoT-based sensors and the proposed big data processing system are sufficiently efficient to monitor the manufacturing process and that the proposed hybrid prediction model has better fault prediction accuracy than other models given the sensor data as input.

Abstract: With the increase in the amount of data captured during the manufacturing process, monitoring systems are becoming important factors in decision making for management. Current technologies such as Internet of Things (IoT)-based sensors can be considered a solution to provide efficient monitoring of the manufacturing process. In this study, a real-time monitoring system that utilizes IoT-based sensors, big data processing, and a hybrid prediction model is proposed. Firstly, an IoT-based sensor that collects temperature, humidity, accelerometer, and gyroscope data was developed. The characteristics of IoT-generated sensor data from the manufacturing process are: real-time, large amounts, and unstructured type. The proposed big data processing platform utilizes Apache Kafka as a message queue, Apache Storm as a real-time processing engine and MongoDB to store the sensor data from the manufacturing process. Secondly, for the proposed hybrid prediction model, Density-Based Spatial Clustering of Applications with Noise (DBSCAN)-based outlier detection and Random Forest classification were used to remove outlier sensor data and provide fault detection during the manufacturing process, respectively. The proposed model was evaluated and tested at an automotive manufacturing assembly line in Korea. The results showed that IoT-based sensors and the proposed big data processing system are sufficiently efficient to monitor the manufacturing process. Furthermore, the proposed hybrid prediction model has better fault prediction accuracy than other models given the sensor data as input. The proposed system is expected to support management by improving decision-making and will help prevent unexpected losses caused by faults during the manufacturing process.

Journal Article•DOI•

A Survey of Random Forest Based Methods for Intrusion Detection Systems

Paulo Angelo Alves Resende¹, André C. Drummond¹•Institutions (1)

University of Brasília¹

23 May 2018-ACM Computing Surveys

TL;DR: This work provides a comprehensive review of the general basic concepts related to Intrusion Detection Systems, including taxonomies, attacks, data collection, modelling, evaluation metrics, and commonly used methods.

Abstract: Over the past decades, researchers have been proposing different Intrusion Detection approaches to deal with the increasing number and complexity of threats for computer systems. In this context, Random Forest models have been providing a notable performance on their applications in the realm of the behaviour-based Intrusion Detection Systems. Specificities of the Random Forest model are used to provide classification, feature selection, and proximity metrics. This work provides a comprehensive review of the general basic concepts related to Intrusion Detection Systems, including taxonomies, attacks, data collection, modelling, evaluation metrics, and commonly used methods. It also provides a survey of Random Forest based methods applied in this context, considering the particularities involved in these models. Finally, some open questions and challenges are posed combined with possible directions to deal with them, which may guide future works on the area.

Journal Article•DOI•

Comparison of random forest, artificial neural networks and support vector machine for intelligent diagnosis of rotating machinery

Te Han¹, Dongxiang Jiang¹, Qi Zhao², Lei Wang², Kai Yin²•Institutions (2)

Tsinghua University¹, Arkansas Electric Cooperative Corporation²

01 May 2018-Transactions of the Institute of Measurement and Control

TL;DR: Random forest has been proven to outperform the comparative classifiers in terms of recognition accuracy, stability and robustness to features, especially with a small training set, and the user-friendly parameters in random forest offer great convenience for practical engineering.

Abstract: Nowadays, the data-driven diagnosis method, exploiting pattern recognition method to diagnose the fault patterns automatically, achieves much success for rotating machinery. Some popular classifica

Journal Article•DOI•

Machine learning algorithms for outcome prediction in (chemo)radiotherapy: An empirical comparison of classifiers

Timo M. Deist¹,², Frank J. W. M. Dankers²,³, Gilmer Valdes⁴, R. Wijsman³, I-Chow Hsu⁴, Cary Oberije², Tim Lustberg², Johan van Soest², Frank J. P. Hoebers², Arthur Jochems¹,², Issam El Naqa⁵, Leonard Wee², Olivier Morin⁴, David R. Raleigh⁴, Wouter Bots³, Johannes H.A.M. Kaanders³, José Belderbos⁶, Margriet Kwint⁶, Timothy D. Solberg⁴, René Monshouwer³, Johan Bussink³, Andre Dekker², Philippe Lambin¹•Institutions (6)

Maastricht University Medical Centre¹, Maastricht University², Radboud University Nijmegen³, University of California, San Francisco⁴, University of Michigan⁵, Netherlands Cancer Institute⁶

01 Jul 2018-Medical Physics

TL;DR: Random forest and elastic net logistic regression yield higher discriminative performance in (chemo)radiotherapy outcome and toxicity prediction than other studied classifiers, and one of these two classifiers should be the first choice for investigators when building classification models or to benchmark one's own modeling results against.

Abstract: Purpose: Machine learning classification algorithms (classifiers) for prediction of treatment response are becoming more popular in radiotherapy literature. General machine learning literature provides evidence in favor of some classifier families (random forest, support vector machine, gradient boosting) in terms of classification performance. The purpose of this study is to compare such classifiers specifically for (chemo)radiotherapy datasets and to estimate their average discriminative performance for radiation treatment outcome prediction.

Journal Article•DOI•

Improved Random Forest for Classification.

Angshuman Paul¹, Dipti Prasad Mukherjee¹, Prasun Das¹, Abhinandan Gangopadhyay², Appa Rao Chintha³, Saurabh Kundu³•Institutions (3)

Indian Statistical Institute¹, Arizona State University², Tata Steel³

10 May 2018-IEEE Transactions on Image Processing

TL;DR: It is proved that further addition of trees or further reduction of features does not improve classification performance, and a novel theoretical upper limit on the number of trees to be added to the forest is formulated to ensure improvement in classification accuracy.

Abstract: We propose an improved random forest classifier that performs classification with minimum number of trees. The proposed method iteratively removes some unimportant features. Based on the number of important and unimportant features, we formulate a novel theoretical upper limit on the number of trees to be added to the forest to ensure improvement in classification accuracy. Our algorithm converges with a reduced but important set of features. We prove that further addition of trees or further reduction of features does not improve classification performance. The efficacy of the proposed approach is demonstrated through experiments on benchmark datasets. We further use the proposed classifier to detect mitotic nuclei in the histopathological datasets of breast tissues. We also apply our method on the industrial dataset of dual phase steel microstructures to classify different phases. Results of our method on different datasets show significant reduction in average classification error compared to a number of competing methods.


Journal Article•DOI•

An Ensemble Machine-Learning Model To Predict Historical PM2.5 Concentrations in China from Satellite Data.


Qingyang Xiao, Howard H. Chang, Guannan Geng, Yang Liu

24 Oct 2018-Environmental Science & Technology

TL;DR: This study proposed an ensemble machine learning approach that provided reliable PM2.5 hindcast capabilities and provided more accurate out-of-range predictions at the daily level and monthly level.


Abstract: The long satellite aerosol data record enables assessments of historical PM2.5 level in regions where routine PM2.5 monitoring began only recently. However, most previous models reported decreased prediction accuracy when predicting PM2.5 levels outside the model-training period. In this study, we proposed an ensemble machine learning approach that provided reliable PM2.5 hindcast capabilities. The missing satellite data were first filled by multiple imputation. Then the modeling domain, China, was divided into seven regions using a spatial clustering method to control for unobserved spatial heterogeneity. A set of machine learning models including random forest, generalized additive model, and extreme gradient boosting were trained in each region separately. Finally, a generalized additive ensemble model was developed to combine predictions from different algorithms. The ensemble prediction characterized the spatiotemporal distribution of daily PM2.5 well with the cross-validation (CV) R2 (RMSE) of 0.79 ...
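The cross-validation R2 and RMSE named in this abstract can be illustrated with a minimal sketch. It assumes the coefficient-of-determination definition of R2 (the paper may define its CV R2 differently, for example as a squared correlation), and the PM2.5 values are hypothetical, not taken from the study.

```python
import math


def r2_and_rmse(observed, predicted):
    """Return (R^2, RMSE) for paired observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(observed)
    mean_obs = sum(observed) / n
    # Residual and total sums of squares.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / n)


# Hypothetical held-out daily PM2.5 values, for illustration only.
observed = [35.0, 50.0, 80.0, 120.0]
predicted = [40.0, 45.0, 85.0, 110.0]
r2, rmse = r2_and_rmse(observed, predicted)
```

In a cross-validation setting, `observed` and `predicted` would hold the pooled held-out values across folds.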


Proceedings Article•DOI•

Toward Developing a Systematic Approach to Generate Benchmark Android Malware Datasets and Classification


Arash Habibi Lashkari¹, Andi Fitriah Abdul Kadir¹, Laya Taheri¹, Ali A. Ghorbani¹•Institutions (1)

University of New Brunswick¹

01 Oct 2018

TL;DR: The main goal of this paper is to propose a systematic approach to generate Android malware datasets using real smartphones instead of emulators and develop a new dataset, namely CICAndMal2017, which covers all the shortcomings and limitations of previous datasets.


Abstract: Malware detection is one of the most important factors in the security of smartphones. Academic researchers have extensively studied Android malware detection problems. Machine learning methods proposed in previous work typically reported high detection performance and fast prediction times on fixed and defective datasets. Because of these shortcomings, most datasets are not suitable for real-world deployment. The main goal of this paper is to propose a systematic approach to generate Android malware datasets using real smartphones instead of emulators and develop a new dataset, namely CICAndMal2017, which covers all the shortcomings and limitations of previous datasets. Also, we offer 80 traffic features to select the best feature sets for detecting and classifying the malicious families just by traffic analysis. The proposed method showed an average precision of 85% and recall of 88% for three classifiers, namely Random Forest (RF), K-Nearest Neighbor (KNN), and Decision Tree (DT).


Journal Article•DOI•

Mapping forest change using stacked generalization: An ensemble approach


Sean P. Healey¹, Warren B. Cohen¹, Zhiqiang Yang², C. Kenneth Brewer¹, Evan B. Brooks³, Noel Gorelick⁴, Alexander J. Hernandez⁵, Chengquan Huang⁶, M. Joseph Hughes², Robert E. Kennedy², Thomas R. Loveland⁷, Gretchen G. Moisen¹, Todd A. Schroeder¹, Stephen V. Stehman⁸, James E. Vogelmann⁷, Curtis E. Woodcock⁹, Limin Yang⁷, Zhe Zhu¹⁰•Institutions (10)

United States Forest Service¹, Oregon State University², Virginia Tech³, Google⁴, Utah State University⁵, University of Maryland, College Park⁶, United States Geological Survey⁷, State University of New York College of Environmental Science and Forestry⁸, Boston University⁹, Texas Tech University¹⁰

01 Jan 2018-Remote Sensing of Environment

TL;DR: Stacking using a Random Forests model cut omission and commission error rates in half in many cases in relation to individual change detection algorithms, and cut error rates by one quarter compared to more conventional parametric stacking.


Journal Article•DOI•

Landslide susceptibility mapping using random forest and boosted tree models in Pyeong-Chang, Korea


Jeong-Cheol Kim¹, Sunmin Lee¹, Hyung-Sup Jung¹, Saro Lee²•Institutions (2)

Seoul National University¹, Korea University of Science and Technology²

02 Sep 2018-Geocarto International

TL;DR: In this article, landslide susceptibility maps were constructed in the Pyeong-Chang area, Korea, using the Random Forest and Boosted Tree models, where landslide locations were randomly selected in a 50/50 ratio for training and validation of the models.


Abstract: Landslide susceptibility maps were constructed in the Pyeong-Chang area, Korea, using the Random Forest and Boosted Tree models. Landslide locations were randomly selected in a 50/50 ratio for training and validation of the models. Seventeen landslide-related factors were extracted and constructed in a spatial database. The relationships between the observed landslide locations and these factors were identified by using the two models. The models were used to generate a landslide susceptibility map and the importance of the factors was calculated. Finally, the landslide susceptibility maps were validated. For the Random Forest model, the validation accuracies of the regression and classification algorithms were 79.34% and 79.18%, respectively, and for the Boosted Tree model, these were 84.87% and 85.98%, respectively. The two models showed satisfactory accuracies, and the Boosted Tree model showed better results than the Random Forest model.
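A random 50/50 split of locations into training and validation sets, as mentioned in this abstract, might look like the sketch below. The function name `split_half`, the seed, and the integer location IDs are all hypothetical; the paper does not state how its split was implemented.

```python
import random


def split_half(locations, seed=0):
    """Shuffle a copy of the locations and split it into two halves."""
    shuffled = list(locations)
    # A seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


# Hypothetical landslide location IDs, for illustration only.
locations = list(range(10))
train, valid = split_half(locations, seed=42)
```

With an odd number of locations, the validation half receives the extra item.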


Journal Article•DOI•

Classification of focal and non-focal EEG signals using neighborhood component analysis and machine learning algorithms


Shivarudhrappa Raghu¹, Natarajan Sriraam¹•Institutions (1)

M. S. Ramaiah Institute of Technology¹

15 Dec 2018-Expert Systems With Applications

TL;DR: A computerized automated detection of focal epileptic seizures in real-time using a MATLAB-based software tool referred to as CADFES, which is expected to perform better at the hospitals for automated classification of focal and non-focal seizures.


Abstract: Background: Classification and localization of focal epileptic seizures provide a proper diagnostic procedure for epilepsy patients. Visual identification of seizure activity from long-term electroencephalography (EEG) is tedious, time-consuming and leads to human error. Therefore, there is a need for an automated classification system. Methods: In this paper, we introduce a tool called CADFES: computerized automated detection of focal epileptic seizures. For the study, a total of 41.66 hours of EEG data from the Bern-Barcelona database was used. A set of 28 features was extracted from the time, frequency, and statistical domains, and significant features were selected using neighborhood component analysis (NCA). In NCA, optimization of the regularization parameter ensured better classification accuracy (less classification loss) with seven features. The performance of the algorithm was assessed using support vector machine (SVM), K-nearest neighbor (K-NN), random forest and adaptive boosting (AdaBoost) classifiers. Results: Experimental results revealed sensitivity, specificity, accuracy, positive predictive rate, negative predictive rate, and area under the curve of 97.6%, 94.4%, 96.1%, 92.9%, 98.8% and 0.96, respectively, using the SVM classifier. Finally, a MATLAB-based software tool referred to as CADFES was introduced for automated classification of focal and non-focal seizures. Comparison results indicate that the proposed study is superior to existing methods. Hence, it is expected to perform better at the hospitals for automated classification of focal epileptic seizures in real-time.
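Most of the measures listed in these results follow directly from a binary confusion matrix. The sketch below uses the standard definitions and assumes every denominator is non-zero; the function name `binary_metrics` and the counts are hypothetical and do not reproduce the paper's data. Area under the curve is omitted because it needs ranked scores rather than counts.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard rates from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),  # positive predictive rate
        "npv": tn / (tn + fn),  # negative predictive rate
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=90, fp=20, tn=80, fn=10)
```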


Journal Article•DOI•

Deep Convolutional Neural Network for Complex Wetland Classification Using Optical Remote Sensing Imagery


Mohammad Rezaee¹, Masoud Mahdianpari², Yun Zhang¹, Bahram Salehi²•Institutions (2)

University of New Brunswick¹, St. John's University²

02 Jul 2018-IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

TL;DR: The proposed classification scheme is the first attempt to investigate the potential of fine-tuning a pre-existing CNN for land cover mapping and serves as a baseline framework to facilitate further scientific research using the latest state-of-the-art machine learning tools for processing remote sensing data.


Abstract: The synergistic use of spatial features with spectral properties of satellite images enhances thematic land cover information, which is of great significance for complex land cover mapping. Incorporating spatial features within the classification scheme has mainly been carried out by applying just low-level features, which have shown improvement in the classification result. By contrast, the application of high-level spatial features for classification of satellite imagery has been underrepresented. This study aims to address the lack of high-level features by proposing a classification framework based on a convolutional neural network (CNN) to learn deep spatial features for wetland mapping using optical remote sensing data. Designing a fully trained new convolutional network is infeasible due to the limited amount of training data in most remote sensing studies. Thus, we applied fine-tuning of a pre-existing CNN. Specifically, AlexNet was used for this purpose. The classification results obtained by the deep CNN were compared with those based on a well-known ensemble classifier, namely random forest (RF), to evaluate the efficiency of CNN. Experimental results demonstrated that CNN was superior to RF for complex wetland mapping even with a smaller number of input features (i.e., three features) for CNN compared to RF (i.e., eight features). The proposed classification scheme is the first attempt to investigate the potential of fine-tuning a pre-existing CNN for land cover mapping. It also serves as a baseline framework to facilitate further scientific research using the latest state-of-the-art machine learning tools for processing remote sensing data.


Journal Article•DOI•

Evaluation of machine learning methods for formation lithology identification: A comparison of tuning processes and model performances


Yunxin Xie¹, Chenyang Zhu², Wen Zhou¹, Zhongdong Li¹, Xuan Liu¹, Mei Tu³•Institutions (3)

Chengdu University of Technology¹, University of Southampton², Sinopec³

01 Jan 2018-Journal of Petroleum Science and Engineering

TL;DR: The results suggest that ensemble methods are good algorithm choices for supervised classification of lithology using well log data because the classification accuracy is remarkably similar across the lithology classes for both the Random Forest and Gradient Tree Boosting models.


Journal Article•DOI•

A novel ensemble method for credit scoring: Adaption of different imbalance ratios


He Hongliang¹, Wenyu Zhang¹, Shuai Zhang¹•Institutions (1)

Zhejiang University of Finance and Economics¹

15 May 2018-Expert Systems With Applications

TL;DR: This study aims to generate a novel ensemble model for credit scoring to obtain superior performance and high robustness, adapting to different imbalance ratio datasets, and demonstrates that the proposed model is robust and represents a positive development in credit scoring.


Abstract: In the past few decades, credit scoring has become an increasing concern for financial institutions and is currently a popular topic of research. This study aims to generate a novel ensemble model for credit scoring, to obtain superior performance and high robustness, adapting to different imbalance ratio datasets. First, according to the credit scoring data characteristics, the proposed model extends the BalanceCascade approach to generate adjustable balanced subsets based on the imbalance ratios of training data. Further, it reduces the negative effect of imbalanced data and improves the comprehensive performance of the predictive model. Second, the proposed model adopts two kinds of tree-based classifiers, random forest and extreme gradient boosting, as the base classifiers for a three-stage ensemble model. This includes the use of stacking to generate predicted results of the former layer as new explanatory features in the latter layer, and the use of a particle swarm optimization algorithm for parameter optimization of the base classifiers. Finally, the results indicate that the average performance of the proposed model is superior to other comparative algorithms as reflected in most evaluation measures for different datasets. It demonstrates that the proposed model is robust and represents a positive development in credit scoring.
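The idea of generating balanced subsets from imbalanced training data can be illustrated with plain random undersampling. This sketch is not BalanceCascade and not the authors' extension of it: BalanceCascade additionally discards majority samples that the ensemble built so far already classifies correctly, which is omitted here. The function name `balanced_subsets`, its parameters, and the labels are hypothetical.

```python
import random


def balanced_subsets(labels, n_subsets, minority_label=1, seed=0):
    """Return index lists that pair every minority sample with an equally
    sized random draw (without replacement) of majority samples."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    if not minority or len(majority) < len(minority):
        raise ValueError("need a non-empty minority class smaller than the majority")
    # Each subset reuses all minority indices and redraws the majority part.
    return [minority + rng.sample(majority, len(minority))
            for _ in range(n_subsets)]


# Hypothetical labels with a 1:3 imbalance (1 = default, 0 = non-default).
labels = [1] * 3 + [0] * 9
subsets = balanced_subsets(labels, n_subsets=4, seed=1)
```

Each index list could then train one base classifier, so that every classifier sees a balanced sample while the ensemble as a whole covers more of the majority class.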

