Showing papers on "Random forest" published in 2018


Journal ArticleDOI
TL;DR: The concept of ensemble learning is introduced, traditional, novel and state‐of‐the‐art ensemble methods are reviewed and current challenges and trends in the field are discussed.
Abstract: Ensemble methods are considered the state‐of‐the‐art solution for many machine learning challenges. Such methods improve the predictive performance of a single model by training multiple models and combining their predictions. This paper introduces the concept of ensemble learning, reviews traditional, novel and state‐of‐the‐art ensemble methods and discusses current challenges and trends in the field.

1,381 citations


Journal ArticleDOI
TL;DR: This paper developed a non-parametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm, and showed that causal forests are pointwise consistent for the true treatment effect and have an asymptotically Gaussian and centered sampling distribution.
Abstract: Many scientific and engineering challenges—ranging from personalized medicine to customized marketing recommendations—require an understanding of treatment effect heterogeneity. In this paper, we develop a non-parametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm. In the potential outcomes framework with unconfoundedness, we show that causal forests are pointwise consistent for the true treatment effect, and have an asymptotically Gaussian and centered sampling distribution. We also discuss a practical method for constructing asymptotic confidence intervals for the true treatment effect that are centered at the causal forest estimates. Our theoretical results rely on a generic Gaussian theory for a large family of random forest algorithms. To our knowledge, this is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical infe...

1,156 citations


Journal ArticleDOI
TL;DR: An overview of machine learning from an applied perspective focuses on the relatively mature methods of support vector machines, single decision trees (DTs), Random Forests, boosted DTs, artificial neural networks, and k-nearest neighbours (k-NN).
Abstract: Machine learning offers the potential for effective and efficient classification of remotely sensed imagery. The strengths of machine learning include the capacity to handle data of high dimensionality and to map classes with very complex characteristics. Nevertheless, implementing a machine-learning classification is not straightforward, and the literature provides conflicting advice regarding many key issues. This article therefore provides an overview of machine learning from an applied perspective. We focus on the relatively mature methods of support vector machines, single decision trees (DTs), Random Forests, boosted DTs, artificial neural networks, and k-nearest neighbours (k-NN). Issues considered include the choice of algorithm, training data requirements, user-defined parameter selection and optimization, feature space impacts and reduction, and computational costs. We illustrate these issues through applying machine-learning classification to two publicly available remotely sensed dat...

919 citations


Journal ArticleDOI
TL;DR: Object-based time-weighted dynamic time warping (TWDTW) method achieved comparable classification results to RF in Romania and Italy, but RF achieved better results in the USA, where the classified crops present high intra-class spectral variability.

556 citations


Journal ArticleDOI
13 Apr 2018-Science
TL;DR: It is demonstrated that machine learning can be used to predict the performance of a synthetic reaction in multidimensional chemical space using data obtained via high-throughput experimentation and provides significantly improved predictive performance over linear regression analysis.
Abstract: Machine learning methods are becoming integral to scientific inquiry in numerous disciplines. We demonstrated that machine learning can be used to predict the performance of a synthetic reaction in multidimensional chemical space using data obtained via high-throughput experimentation. We created scripts to compute and extract atomic, molecular, and vibrational descriptors for the components of a palladium-catalyzed Buchwald-Hartwig cross-coupling of aryl halides with 4-methylaniline in the presence of various potentially inhibitory additives. Using these descriptors as inputs and reaction yield as output, we showed that a random forest algorithm provides significantly improved predictive performance over linear regression analysis. The random forest model was also successfully applied to sparse training sets and out-of-sample prediction, suggesting its value in facilitating adoption of synthetic methodology.

536 citations


Journal ArticleDOI
29 Aug 2018-PeerJ
TL;DR: A random forest for spatial predictions framework (RFsp) where buffer distances from observation points are used as explanatory variables, thus incorporating geographical proximity effects into the prediction process, and appears to be especially attractive for building multivariate spatial prediction models that can be used as “knowledge engines” in various geoscience fields.
Abstract: Random forest and similar machine learning techniques are already used to generate spatial predictions, but the spatial location of points (geography) is often ignored in the modeling process. Spatial autocorrelation, especially if it persists in the cross-validation residuals, indicates that the predictions may be biased, which is suboptimal. This paper presents a random forest for spatial predictions framework (RFsp) where buffer distances from observation points are used as explanatory variables, thus incorporating geographical proximity effects into the prediction process. The RFsp framework is illustrated with examples that use textbook datasets and apply spatial and spatio-temporal prediction to numeric, binary, categorical, multivariate and spatiotemporal variables. Performance of the RFsp framework is compared with the state-of-the-art kriging techniques using fivefold cross-validation with refitting. The results show that RFsp can obtain equally accurate and unbiased predictions as different versions of kriging. Advantages of using RFsp over kriging are that it needs no rigid statistical assumptions about the distribution and stationarity of the target variable, it is more flexible towards incorporating, combining and extending covariates of different types, and it possibly yields more informative maps characterizing the prediction error. RFsp appears to be especially attractive for building multivariate spatial prediction models that can be used as "knowledge engines" in various geoscience fields. Some disadvantages of RFsp are the exponentially growing computational intensity with increase of calibration data and covariates and the high sensitivity of predictions to input data quality.
The key to the success of the RFsp framework might be the training data quality, especially the quality of spatial sampling (to minimize extrapolation problems and any type of bias in data) and the quality of model validation (to ensure that accuracy is not affected by overfitting). For many data sets, especially those with a lower number of points and covariates and close-to-linear relationships, model-based geostatistics can still lead to more accurate predictions than RFsp.
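The central idea of this abstract, using distances to observation points as explanatory variables, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `buffer_distance_features` and the toy coordinates are hypothetical, and plain Euclidean distance is assumed.

```python
import math

def buffer_distance_features(points, observations):
    """Return one feature row per point: its distance to every observation point."""
    return [
        [math.hypot(px - ox, py - oy) for (ox, oy) in observations]
        for (px, py) in points
    ]

# Hypothetical observation locations (x, y) and two prediction locations.
observations = [(0.0, 0.0), (3.0, 4.0)]
grid = [(0.0, 0.0), (3.0, 0.0)]

# One row per prediction location, one column per observation point.
features = buffer_distance_features(grid, observations)
```

In a setup like this, each row would be joined with any other covariates before a random forest is fitted, so the number of distance columns grows with the number of observation points.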

453 citations


Journal ArticleDOI
TL;DR: A large scale benchmarking experiment based on 243 real datasets comparing the prediction performance of the original version of RF with default parameters and LR as binary classification tools suggests a significantly better performance of RF.
Abstract: The Random Forest (RF) algorithm for regression and classification has gained considerable popularity since its introduction in 2001. Meanwhile, it has grown into a standard classification approach competing with logistic regression in many innovation-friendly scientific fields. In this context, we present a large scale benchmarking experiment based on 243 real datasets comparing the prediction performance of the original version of RF with default parameters and LR as binary classification tools. Most importantly, the design of our benchmark experiment is inspired by clinical trial methodology, thus avoiding common pitfalls and major sources of biases. RF performed better than LR according to the considered accuracy measure in approximately 69% of the datasets. The mean difference between RF and LR was 0.029 (95%-CI = [0.022, 0.038]) for the accuracy, 0.041 (95%-CI = [0.031, 0.053]) for the Area Under the Curve, and −0.027 (95%-CI = [−0.034, −0.021]) for the Brier score, all measures thus suggesting a significantly better performance of RF. As a side-result of our benchmarking experiment, we observed that the results were noticeably dependent on the inclusion criteria used to select the example datasets, thus emphasizing the importance of clear statements regarding this dataset selection process. We also stress that neutral studies similar to ours, based on a high number of datasets and carefully designed, will be necessary in the future to evaluate further variants, implementations or parameters of random forests which may yield improved accuracy compared to the original version with default values.
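To make the reported differences easier to read, here is a minimal sketch of two of the measures named above, accuracy and the Brier score, computed for two sets of predicted probabilities. The labels and probabilities are invented for illustration and are unrelated to the benchmark data. The 0.5 decision threshold is an assumption.

```python
def accuracy(y_true, prob, threshold=0.5):
    """Fraction of cases where the thresholded probability matches the 0/1 label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, prob))
    return hits / len(y_true)

def brier_score(y_true, prob):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, prob)) / len(y_true)

# Hypothetical labels and predicted probabilities from two classifiers A and B.
y = [1, 0, 1, 0]
prob_a = [0.9, 0.2, 0.6, 0.4]
prob_b = [0.7, 0.4, 0.4, 0.6]

# Differences A minus B, analogous in sign convention to "RF minus LR".
diff_acc = accuracy(y, prob_a) - accuracy(y, prob_b)
diff_brier = brier_score(y, prob_a) - brier_score(y, prob_b)
```

A lower Brier score is better, which is why a negative Brier difference and a positive accuracy difference point in the same direction.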

449 citations


Journal ArticleDOI
TL;DR: A fast approach to debias impurity‐based variable importance measures for classification, regression and survival forests is set up, showing that it creates a variable importance measure which is unbiased with regard to the number of categories and minor allele frequency and almost as fast as the standard impurity importance.
Abstract: Motivation: Random forests are fast, flexible and represent a robust approach to analyze high dimensional data. A key advantage over alternative machine learning algorithms is variable importance measures, which can be used to identify relevant features or perform variable selection. Measures based on the impurity reduction of splits, such as the Gini importance, are popular because they are simple and fast to compute. However, they are biased in favor of variables with many possible split points and high minor allele frequency. Results: We set up a fast approach to debias impurity-based variable importance measures for classification, regression and survival forests. We show that it creates a variable importance measure which is unbiased with regard to the number of categories and minor allele frequency and almost as fast as the standard impurity importance. As a result, it is now possible to compute reliable importance estimates without the extra computing cost of permutations. Further, we combine the importance measure with a fast testing procedure, producing p-values for variable importance with almost no computational overhead to the creation of the random forest. Applications to gene expression and genome-wide association data show that the proposed method is powerful and computationally efficient. Availability and implementation: The procedure is included in the ranger package, available at https://cran.r-project.org/package=ranger and https://github.com/imbs-hl/ranger. Supplementary information: Supplementary data are available at Bioinformatics online.
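For readers unfamiliar with impurity-based importance, the following sketch shows the quantity that the standard (uncorrected) Gini importance accumulates: the weighted impurity decrease of a single split. It does not implement the debiasing procedure described in the abstract. The function names and toy labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Weighted impurity reduction achieved by splitting parent into left and right."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

# Toy node with two balanced classes.
parent = ["a", "a", "b", "b"]
perfect = impurity_decrease(parent, ["a", "a"], ["b", "b"])  # split separates classes
useless = impurity_decrease(parent, ["a", "b"], ["a", "b"])  # split changes nothing
```

Summing such decreases over every split that uses a given variable, across all trees, gives that variable's impurity importance. Variables offering more candidate split points have more chances to produce a large decrease by chance, which is the bias the abstract refers to.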

374 citations


Journal ArticleDOI
TL;DR: In this article, the authors provide a literature review on the parameters' influence on the prediction performance and on variable importance measures, and demonstrate the application of one of the most established tuning strategies, model-based optimization (MBO).
Abstract: The random forest algorithm (RF) has several hyperparameters that have to be set by the user, e.g., the number of observations drawn randomly for each tree and whether they are drawn with or without replacement, the number of variables drawn randomly for each split, the splitting rule, the minimum number of samples that a node must contain and the number of trees. In this paper, we first provide a literature review on the parameters' influence on the prediction performance and on variable importance measures. It is well known that in most cases RF works reasonably well with the default values of the hyperparameters specified in software packages. Nevertheless, tuning the hyperparameters can improve the performance of RF. In the second part of this paper, after a brief overview of tuning strategies we demonstrate the application of one of the most established tuning strategies, model-based optimization (MBO). To make it easier to use, we provide the tuneRanger R package that tunes RF with MBO automatically. In a benchmark study on several datasets, we compare the prediction performance and runtime of tuneRanger with other tuning implementations in R and RF with default hyperparameters.

366 citations


Journal ArticleDOI
TL;DR: Taking advantage of a novel application of modeling framework and the most recent ground-level PM2.5 observations, the machine learning method showed higher predictive ability than previous studies.

331 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed a new hybrid algorithm, the logit leaf model (LLM), which consists of two stages: a segmentation phase and a prediction phase, where in the first stage customer segments are identified using decision rules and in the second stage a model is created for every leaf of this tree.

Journal ArticleDOI
01 Mar 2018-Catena
TL;DR: The first comprehensive comparison among the performances of ten advanced machine learning techniques (MLTs) including artificial neural networks (ANNs), boosted regression tree (BRT), classification and regression trees (CART), generalized linear model (GLM), generalized additive model (GAM), multivariate adaptive regression splines (MARS), naive Bayes (NB), quadratic discriminant analysis (QDA), random forest (RF), and support vector machines (SVM) is presented.
Abstract: Coupling machine learning algorithms with spatial analytical techniques for landslide susceptibility modeling is an issue worth considering. The current research therefore intends to present the first comprehensive comparison among the performances of ten advanced machine learning techniques (MLTs) including artificial neural networks (ANNs), boosted regression tree (BRT), classification and regression trees (CART), generalized linear model (GLM), generalized additive model (GAM), multivariate adaptive regression splines (MARS), naive Bayes (NB), quadratic discriminant analysis (QDA), random forest (RF), and support vector machines (SVM) for modeling landslide susceptibility and evaluating the importance of variables in GIS and R open source software. This study was carried out in the Ghaemshahr Region, Iran. The performance of MLTs has been evaluated using the area under ROC curve (AUC-ROC) approach. The results showed that AUC values for ten MLTs vary from 62.4 to 83.7%. It was found that RF (AUC = 83.7%) and BRT (AUC = 80.7%) perform best in comparison to the other MLTs.

Journal ArticleDOI
TL;DR: The main aim of the present study is to explore and compare three state-of-the-art data mining techniques, best-first decision tree, random forest, and naïve Bayes tree, for landslide susceptibility assessment in the Longhai area of China.

Journal ArticleDOI
TL;DR: The proposed STL-IDS approach improves network intrusion detection and provides a new research method for intrusion detection, and has accelerated SVM training and testing times and performed better than most of the previous approaches in terms of performance metrics in binary and multiclass classification.
Abstract: Network intrusion detection systems (NIDSs) provide a better solution to network security than other traditional network defense technologies, such as firewall systems. The success of NIDS is highly dependent on the performance of the algorithms and improvement methods used to increase the classification accuracy and decrease the training and testing times of the algorithms. We propose an effective deep learning approach, self-taught learning (STL)-IDS, based on the STL framework. The proposed approach is used for feature learning and dimensionality reduction. It reduces training and testing time considerably and effectively improves the prediction accuracy of support vector machines (SVM) with regard to attacks. The proposed model is built using the sparse autoencoder mechanism, which is an effective learning algorithm for reconstructing a new feature representation in an unsupervised manner. After the pre-training stage, the new features are fed into the SVM algorithm to improve its detection capability for intrusion and classification accuracy. Moreover, the efficiency of the approach in binary and multiclass classification is studied and compared with that of shallow classification methods, such as J48, naive Bayesian, random forest, and SVM. Results show that our approach has accelerated SVM training and testing times and performed better than most of the previous approaches in terms of performance metrics in binary and multiclass classification. The proposed STL-IDS approach improves network intrusion detection and provides a new research method for intrusion detection.

Book ChapterDOI
01 Jan 2018
TL;DR: This chapter considers decision trees and random forest algorithms, which are bagging (bootstrap aggregating) algorithms that combine the output of multiple decision trees to give the prediction.
Abstract: So far, we’ve considered decision trees and random forest algorithms. We saw that random forest is a bagging (bootstrap aggregating) algorithm—it combines the output of multiple decision trees to give the prediction. Typically, in a bagging algorithm trees are grown in parallel to get the average prediction across all trees, where each tree is built on a sample of original data.
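The bagging procedure described here (bootstrap samples, one tree per sample, averaged predictions) can be sketched with the standard library alone. This is an illustrative toy, not the chapter's code: one-split regression stumps stand in for full decision trees, and `fit_stump`, `bagged_stumps`, and the data are hypothetical. A random forest would additionally sample a subset of features at each split, which a single-feature example cannot show.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression stump by minimizing squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # every x in this sample is identical: predict the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)

# Toy step-function data: low x maps to 0, high x maps to 1.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
predict = bagged_stumps(xs, ys)
```

Because the stumps are fitted independently of one another, the loop in `bagged_stumps` could run in parallel, which is the property the abstract highlights.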

Journal ArticleDOI
TL;DR: It was found that RF and ET have comparable predictive power and are equally applicable for predicting useful solar thermal energy (USTE), with root mean square error values of 6.86 and 7.12 on the testing dataset, respectively.

Proceedings ArticleDOI
25 Apr 2018
TL;DR: Using machine learning to train on the large data sets available publicly gives a clear way to detect the disease present in plants on a large scale.
Abstract: Crop diseases are a significant risk to food security, yet their rapid identification remains difficult in many parts of the world because the necessary infrastructure is absent. Accurate techniques that have emerged in the field of leaf-based image classification have shown impressive results. This paper uses Random Forest to distinguish healthy from diseased leaves in the data sets created. Our approach includes several implementation phases, namely dataset creation, feature extraction, training the classifier and classification. The created datasets of diseased and healthy leaves are collectively trained under Random Forest to classify the diseased and healthy images. For extracting features of an image we use Histogram of Oriented Gradients (HOG). Overall, using machine learning to train on the large data sets available publicly gives us a clear way to detect the disease present in plants on a large scale.

Journal ArticleDOI
04 Sep 2018-Sensors
TL;DR: The results showed that IoT-based sensors and the proposed big data processing system are sufficiently efficient to monitor the manufacturing process and that the proposed hybrid prediction model has better fault prediction accuracy than other models given the sensor data as input.
Abstract: With the increase in the amount of data captured during the manufacturing process, monitoring systems are becoming important factors in decision making for management. Current technologies such as Internet of Things (IoT)-based sensors can be considered a solution to provide efficient monitoring of the manufacturing process. In this study, a real-time monitoring system that utilizes IoT-based sensors, big data processing, and a hybrid prediction model is proposed. Firstly, an IoT-based sensor that collects temperature, humidity, accelerometer, and gyroscope data was developed. The characteristics of IoT-generated sensor data from the manufacturing process are: real-time, large amounts, and unstructured type. The proposed big data processing platform utilizes Apache Kafka as a message queue, Apache Storm as a real-time processing engine and MongoDB to store the sensor data from the manufacturing process. Secondly, for the proposed hybrid prediction model, Density-Based Spatial Clustering of Applications with Noise (DBSCAN)-based outlier detection and Random Forest classification were used to remove outlier sensor data and provide fault detection during the manufacturing process, respectively. The proposed model was evaluated and tested at an automotive manufacturing assembly line in Korea. The results showed that IoT-based sensors and the proposed big data processing system are sufficiently efficient to monitor the manufacturing process. Furthermore, the proposed hybrid prediction model has better fault prediction accuracy than other models given the sensor data as input. The proposed system is expected to support management by improving decision-making and will help prevent unexpected losses caused by faults during the manufacturing process.

Journal ArticleDOI
TL;DR: This work provides a comprehensive review of the general basic concepts related to Intrusion Detection Systems, including taxonomies, attacks, data collection, modelling, evaluation metrics, and commonly used methods.
Abstract: Over the past decades, researchers have been proposing different Intrusion Detection approaches to deal with the increasing number and complexity of threats for computer systems. In this context, Random Forest models have been providing a notable performance on their applications in the realm of the behaviour-based Intrusion Detection Systems. Specificities of the Random Forest model are used to provide classification, feature selection, and proximity metrics. This work provides a comprehensive review of the general basic concepts related to Intrusion Detection Systems, including taxonomies, attacks, data collection, modelling, evaluation metrics, and commonly used methods. It also provides a survey of Random Forest based methods applied in this context, considering the particularities involved in these models. Finally, some open questions and challenges are posed combined with possible directions to deal with them, which may guide future works on the area.

Journal ArticleDOI
TL;DR: Random forest has been proven to outperform the comparative classifiers in terms of recognition accuracy, stability and robustness to features, especially with a small training set, and the user-friendly parameters in random forest offer great convenience for practical engineering.
Abstract: Nowadays, the data-driven diagnosis method, exploiting pattern recognition method to diagnose the fault patterns automatically, achieves much success for rotating machinery. Some popular classifica

Journal ArticleDOI
TL;DR: Random forest and elastic net logistic regression yield higher discriminative performance in (chemo)radiotherapy outcome and toxicity prediction than other studied classifiers, and one of these two classifiers should be the first choice for investigators when building classification models or to benchmark one's own modeling results against.
Abstract: Purpose: Machine learning classification algorithms (classifiers) for prediction of treatment response are becoming more popular in radiotherapy literature. The general machine learning literature provides evidence in favor of some classifier families (random forest, support vector machine, gradient boosting) in terms of classification performance. The purpose of this study is to compare such classifiers specifically for (chemo)radiotherapy datasets and to estimate their average discriminative performance for radiation treatment outcome prediction.

Journal ArticleDOI
TL;DR: It is proved that further addition of trees or further reduction of features does not improve classification performance, and a novel theoretical upper limit on the number of trees to be added to the forest is formulated to ensure improvement in classification accuracy.
Abstract: We propose an improved random forest classifier that performs classification with a minimum number of trees. The proposed method iteratively removes some unimportant features. Based on the number of important and unimportant features, we formulate a novel theoretical upper limit on the number of trees to be added to the forest to ensure improvement in classification accuracy. Our algorithm converges with a reduced but important set of features. We prove that further addition of trees or further reduction of features does not improve classification performance. The efficacy of the proposed approach is demonstrated through experiments on benchmark datasets. We further use the proposed classifier to detect mitotic nuclei in the histopathological datasets of breast tissues. We also apply our method on the industrial dataset of dual phase steel microstructures to classify different phases. Results of our method on different datasets show significant reduction in average classification error compared to a number of competing methods.

Journal ArticleDOI
TL;DR: This study proposed an ensemble machine learning approach that provided reliable PM2.5 hindcast capabilities and provided more accurate out-of-range predictions at the daily level and monthly level.
Abstract: The long satellite aerosol data record enables assessments of historical PM2.5 level in regions where routine PM2.5 monitoring began only recently. However, most previous models reported decreased prediction accuracy when predicting PM2.5 levels outside the model-training period. In this study, we proposed an ensemble machine learning approach that provided reliable PM2.5 hindcast capabilities. The missing satellite data were first filled by multiple imputation. Then the modeling domain, China, was divided into seven regions using a spatial clustering method to control for unobserved spatial heterogeneity. A set of machine learning models including random forest, generalized additive model, and extreme gradient boosting were trained in each region separately. Finally, a generalized additive ensemble model was developed to combine predictions from different algorithms. The ensemble prediction characterized the spatiotemporal distribution of daily PM2.5 well with the cross-validation (CV) R2 (RMSE) of 0.79 ...

Proceedings ArticleDOI
01 Oct 2018
TL;DR: The main goal of this paper is to propose a systematic approach to generate Android malware datasets using real smartphones instead of emulators and develop a new dataset, namely CICAndMal2017, which covers all the shortcomings and limitations of previous datasets.
Abstract: Malware detection is one of the most important factors in the security of smartphones. Academic researchers have extensively studied Android malware detection problems. Machine learning methods proposed in previous work typically reported high detection performance and fast prediction times on fixed and defective datasets. Therefore, based on these shortcomings most datasets are not suitable for real-world deployment. The main goal of this paper is to propose a systematic approach to generate Android malware datasets using real smartphones instead of emulators and develop a new dataset, namely CICAndMal2017, which covers all the shortcomings and limitations of previous datasets. Also, we offer 80 traffic features to select the best feature sets for detecting and classifying the malicious families just by traffic analysis. The proposed method showed an average precision of 85% and recall of 88% for three classifiers, namely Random Forest (RF), K-Nearest Neighbor (KNN), and Decision Tree (DT).


Journal ArticleDOI
TL;DR: In this article, landslide susceptibility maps were constructed in the Pyeong-Chang area, Korea, using the Random Forest and Boosted Tree models, where landslide locations were randomly selected in a 50/50 ratio for training and validation of the models.
Abstract: Landslide susceptibility maps were constructed in the Pyeong-Chang area, Korea, using the Random Forest and Boosted Tree models. Landslide locations were randomly selected in a 50/50 ratio for training and validation of the models. Seventeen landslide-related factors were extracted and constructed in a spatial database. The relationships between the observed landslide locations and these factors were identified by using the two models. The models were used to generate a landslide susceptibility map and the importance of the factors was calculated. Finally, the landslide susceptibility maps were validated. For the Random Forest model, the validation accuracy in regression and classification algorithms showed 79.34 and 79.18%, respectively, and for the Boosted Tree model, these were 84.87 and 85.98%, respectively. The two models showed satisfactory accuracies, and the Boosted Tree model showed better results than the Random Forest model.

Journal ArticleDOI
TL;DR: CADFES, a MATLAB-based software tool for computerized automated detection of focal epileptic seizures, is introduced and is expected to perform better in hospitals for automated real-time classification of focal and non-focal seizures.
Abstract: Background: Classification and localization of focal epileptic seizures provide a proper diagnostic procedure for epilepsy patients. Visual identification of seizure activity from long-term electroencephalography (EEG) is tedious, time-consuming and leads to human error. Therefore, there is a need for an automated classification system. Methods: In this paper, we introduce a tool called CADFES: computerized automated detection of focal epileptic seizures. For the study, a total of 41.66 hours of EEG data from the Bern-Barcelona database was used. A set of 28 features was extracted from the time, frequency, and statistical domains, and significant features were selected using neighborhood component analysis (NCA). In NCA, optimization of the regularization parameter ensured better classification accuracy (less classification loss) with seven features. The performance of the algorithm was assessed using support vector machine (SVM), K-nearest neighbor (K-NN), random forest and adaptive boosting (AdaBoost) classifiers. Results: Experimental results revealed sensitivity, specificity, accuracy, positive predictive rate, negative predictive rate, and area under the curve of 97.6%, 94.4%, 96.1%, 92.9%, 98.8% and 0.96 respectively using the SVM classifier. Finally, a MATLAB-based software tool referred to as CADFES was introduced for automated classification of focal and non-focal seizures. Comparison results indicate that the proposed approach is superior to existing methods. Hence, it is expected to perform better in hospitals for automated classification of focal epileptic seizures in real-time.
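The performance measures listed in the results all derive from a binary confusion matrix. The sketch below shows the usual definitions, assuming that "positive predictive rate" and "negative predictive rate" denote positive and negative predictive value. The counts are invented for illustration and are not taken from the study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard binary diagnostic measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),            # true positive rate
        "specificity": tn / (tn + fp),            # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),                    # positive predictive value
        "npv": tn / (tn + fn),                    # negative predictive value
    }

# Hypothetical counts: 50 focal and 50 non-focal segments.
m = diagnostic_metrics(tp=40, fp=5, tn=45, fn=10)
```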

Journal ArticleDOI
TL;DR: The proposed classification scheme is the first attempt, investigating the potential of fine-tuning pre-existing CNN, for land cover mapping and serves as a baseline framework to facilitate further scientific research using the latest state-of-the-art machine learning tools for processing remote sensing data.
Abstract: The synergistic use of spatial features with spectral properties of satellite images enhances thematic land cover information, which is of great significance for complex land cover mapping. Incorporating spatial features within the classification scheme has been mainly carried out by applying just low-level features, which have shown improvement in the classification result. By contrast, the application of high-level spatial features for classification of satellite imagery has been underrepresented. This study aims to address the lack of high-level features by proposing a classification framework based on convolutional neural network (CNN) to learn deep spatial features for wetland mapping using optical remote sensing data. Designing a fully trained new convolutional network is infeasible due to the limited amount of training data in most remote sensing studies. Thus, we applied fine tuning of a pre-existing CNN. Specifically, AlexNet was used for this purpose. The classification results obtained by the deep CNN were compared with those based on well-known ensemble classifiers, namely random forest (RF), to evaluate the efficiency of CNN. Experimental results demonstrated that CNN was superior to RF for complex wetland mapping even by incorporating the small number of input features (i.e., three features) for CNN compared to RF (i.e., eight features). The proposed classification scheme is the first attempt, investigating the potential of fine-tuning pre-existing CNN, for land cover mapping. It also serves as a baseline framework to facilitate further scientific research using the latest state-of-the-art machine learning tools for processing remote sensing data.

Journal ArticleDOI
TL;DR: The results suggest that ensemble methods are good algorithm choices for supervised classification of lithology using well log data because the classification accuracy is remarkably similar across the lithology classes for both the Random Forest and Gradient Tree Boosting models.

Journal ArticleDOI
TL;DR: This study aims to generate a novel ensemble model for credit scoring to obtain superior performance and high robustness, adapting to different imbalance ratio datasets, and demonstrates that the proposed model is robust and represents a positive development in credit scoring.
Abstract: In the past few decades, credit scoring has become an increasing concern for financial institutions and is currently a popular topic of research. This study aims to generate a novel ensemble model for credit scoring, to obtain superior performance and high robustness, adapting to different imbalance ratio datasets. First, according to the credit scoring data characteristics, the proposed model extends the BalanceCascade approach to generate adjustable balanced subsets based on the imbalance ratios of training data. Further, it reduces the negative effect of imbalanced data and improves the comprehensive performance of the predictive model. Second, the proposed model adopts two kinds of tree-based classifiers, random forest and extreme gradient boosting, as the base classifiers for a three-stage ensemble model. This includes the use of stacking to generate predicted results of the former layer as new explanatory features in the latter layer, and the use of a particle swarm optimization algorithm for parameters optimization of the base classifiers. Finally, the results indicate that the average performance of the proposed model is superior to other comparative algorithms as reflected in most evaluation measures for different datasets. It demonstrates that the proposed model is robust and represents a positive development in credit scoring.