
Showing papers on "Random forest" published in 2017


Journal ArticleDOI
TL;DR: It is shown that ranger is the fastest and most memory efficient implementation of random forests to analyze data on the scale of a genome-wide association study.
Abstract: We introduce the C++ application and R package ranger. The software is a fast implementation of random forests for high dimensional data. Ensembles of classification, regression and survival trees are supported. We describe the implementation, provide examples, validate the package with a reference implementation, and compare runtime and memory usage with other implementations. The new software proves to scale best with the number of features, samples, trees, and features tried for splitting. Finally, we show that ranger is the fastest and most memory efficient implementation of random forests to analyze data on the scale of a genome-wide association study.

1,512 citations


Journal ArticleDOI
TL;DR: The experimental results show that RNN-IDS is very suitable for modeling a classification model with high accuracy and that its performance is superior to that of traditional machine learning classification methods in both binary and multiclass classification.
Abstract: Intrusion detection plays an important role in ensuring information security, and the key technology is to accurately identify various attacks in the network. In this paper, we explore how to model an intrusion detection system based on deep learning, and we propose a deep learning approach for intrusion detection using recurrent neural networks (RNN-IDS). Moreover, we study the performance of the model in binary classification and multiclass classification, and the number of neurons and different learning rate impacts on the performance of the proposed model. We compare it with those of J48, artificial neural network, random forest, support vector machine, and other machine learning methods proposed by previous researchers on the benchmark data set. The experimental results show that RNN-IDS is very suitable for modeling a classification model with high accuracy and that its performance is superior to that of traditional machine learning classification methods in both binary and multiclass classification. The RNN-IDS model improves the accuracy of the intrusion detection and provides a new research method for intrusion detection.

1,123 citations


Journal ArticleDOI
22 Dec 2017-Sensors
TL;DR: This study examined and compared the performances of the RF, kNN, and SVM classifiers for land use/cover classification using Sentinel-2 image data and found that SVM produced the highest OA with the least sensitivity to the training sample sizes.
Abstract: In previous classification studies, three non-parametric classifiers, Random Forest (RF), k-Nearest Neighbor (kNN), and Support Vector Machine (SVM), were reported as the foremost classifiers at producing high accuracies. However, only a few studies have compared the performances of these classifiers with different training sample sizes for the same remote sensing images, particularly the Sentinel-2 Multispectral Imager (MSI). In this study, we examined and compared the performances of the RF, kNN, and SVM classifiers for land use/cover classification using Sentinel-2 image data. An area of 30 × 30 km2 within the Red River Delta of Vietnam with six land use/cover types was classified using 14 different training sample sizes, including balanced and imbalanced, from 50 to over 1250 pixels/class. All classification results showed a high overall accuracy (OA) ranging from 90% to 95%. Among the three classifiers and 14 sub-datasets, SVM produced the highest OA with the least sensitivity to the training sample sizes, followed consecutively by RF and kNN. In relation to the sample size, all three classifiers showed a similar and high OA (over 93.85%) when the training sample size was large enough, i.e., greater than 750 pixels/class or representing an area of approximately 0.25% of the total study area. The high accuracy was achieved with both imbalanced and balanced datasets.

777 citations


Journal ArticleDOI
TL;DR: It is found that supervised object-based classification is currently experiencing rapid advances, while development of the fuzzy technique is limited in the object-based framework; spatial resolution correlates with the optimal segmentation scale and study area, and Random Forest shows the best performance in object-based classification.
Abstract: Object-based image classification for land-cover mapping purposes using remote-sensing imagery has attracted significant attention in recent years. Numerous studies conducted over the past decade have investigated a broad array of sensors, feature selection, classifiers, and other factors of interest. However, these research results have not yet been synthesized to provide coherent guidance on the effect of different supervised object-based land-cover classification processes. In this study, we first construct a database with 28 fields using qualitative and quantitative information extracted from 254 experimental cases described in 173 scientific papers. Second, the results of the meta-analysis are reported, including general characteristics of the studies (e.g., the geographic range of relevant institutes, preferred journals) and the relationships between factors of interest (e.g., spatial resolution and study area or optimal segmentation scale, accuracy and number of targeted classes), especially with respect to the classification accuracy of different sensors, segmentation scale, training set size, supervised classifiers, and land-cover types. Third, useful data on supervised object-based image classification are determined from the meta-analysis. For example, we find that supervised object-based classification is currently experiencing rapid advances, while development of the fuzzy technique is limited in the object-based framework. Furthermore, spatial resolution correlates with the optimal segmentation scale and study area, and Random Forest (RF) shows the best performance in object-based classification. The area-based accuracy assessment method can obtain stable classification performance, and indicates a strong correlation between accuracy and training set size, while the accuracy of the point-based method is likely to be unstable due to mixed objects. 
In addition, the overall accuracy benefits from higher spatial resolution images (e.g., unmanned aerial vehicle) or agricultural sites where it also correlates with the number of targeted classes. More than 95.6% of studies involve an area less than 300 ha, and the spatial resolution of images is predominantly between 0 and 2 m. Furthermore, we identify some methods that may advance supervised object-based image classification. For example, deep learning and type-2 fuzzy techniques may further improve classification accuracy. Lastly, scientists are strongly encouraged to report results of uncertainty studies to further explore the effects of varied factors on supervised object-based image classification.

608 citations


Journal ArticleDOI
TL;DR: In this article, the authors compared the performance of feed-forward back-propagation artificial neural network (ANN) with random forest (RF), an ensemble-based method gaining popularity in prediction, for predicting the hourly HVAC energy consumption of a hotel in Madrid, Spain.

600 citations


Journal ArticleDOI
01 Apr 2017-Catena
TL;DR: In this article, the authors used three state-of-the-art data mining techniques, namely, logistic model tree (LMT), random forest (RF), and classification and regression tree (CART) models, to map landslide susceptibility.
Abstract: The main purpose of the present study is to use three state-of-the-art data mining techniques, namely, logistic model tree (LMT), random forest (RF), and classification and regression tree (CART) models, to map landslide susceptibility. Long County was selected as the study area. First, a landslide inventory map was constructed using history reports, interpretation of aerial photographs, and extensive field surveys. A total of 171 landslide locations were identified in the study area. Twelve landslide-related parameters were considered for landslide susceptibility mapping, including slope angle, slope aspect, plan curvature, profile curvature, altitude, NDVI, land use, distance to faults, distance to roads, distance to rivers, lithology, and rainfall. The 171 landslides were randomly separated into two groups with a 70/30 ratio for training and validation purposes, and different ratios of non-landslides to landslides grid cells were used to obtain the highest classification accuracy. The linear support vector machine algorithm (LSVM) was used to evaluate the predictive capability of the 12 landslide conditioning factors. Second, LMT, RF, and CART models were constructed using training data. Finally, the applied models were validated and compared using receiver operating characteristics (ROC), and predictive accuracy (ACC) methods. Overall, all three models exhibit reasonably good performances; the RF model exhibits the highest predictive capability compared with the LMT and CART models. The RF model, with a success rate of 0.837 and a prediction rate of 0.781, is a promising technique for landslide susceptibility mapping. Therefore, these three models are useful tools for spatial prediction of landslide susceptibility.

591 citations
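
The landslide susceptibility abstract above validates its models with receiver operating characteristics. As a minimal sketch of how an area under the ROC curve can be computed, the hypothetical `roc_auc` helper below counts the fraction of (positive, negative) pairs that the scores rank correctly, with ties counted as half. The labels and scores are made-up illustration data, not values from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking (Mann-Whitney form)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))


# Illustrative data only: 1 = landslide cell, 0 = non-landslide cell.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

The pairwise form is quadratic in the number of samples, so it suits small validation sets; a sort-based version would be preferable for large rasters.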


Journal ArticleDOI
TL;DR: This paper provides a theoretical study of the permutation importance measure for an additive regression model and motivates the use of the recursive feature elimination (RFE) algorithm for variable selection in this context.
Abstract: This paper is about variable selection with the random forests algorithm in presence of correlated predictors. In high-dimensional regression or classification frameworks, variable selection is a difficult task, that becomes even more challenging in the presence of highly correlated predictors. Firstly we provide a theoretical study of the permutation importance measure for an additive regression model. This allows us to describe how the correlation between predictors impacts the permutation importance. Our results motivate the use of the recursive feature elimination (RFE) algorithm for variable selection in this context. This algorithm recursively eliminates the variables using permutation importance measure as a ranking criterion. Next various simulation experiments illustrate the efficiency of the RFE algorithm for selecting a small number of variables together with a good prediction error. Finally, this selection algorithm is tested on the Landsat Satellite data from the UCI Machine Learning Repository.

525 citations
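
The permutation importance measure studied in the abstract above can be illustrated with a short sketch: shuffle one predictor at a time and record how much the prediction error grows. The code below assumes a fixed prediction function in place of a fitted random forest, uses mean squared error, and runs on synthetic data; the name `permutation_importance` and all parameters are illustrative choices, not the authors' implementation. The recursive elimination step (RFE) would repeatedly drop the lowest-ranked variable and refit, which is not shown here.

```python
import random


def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE after shuffling each column of X in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, [predict(r) for r in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Synthetic additive model: feature 0 matters most, feature 2 not at all.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [3.0 * r[0] + 1.0 * r[1] for r in X]


def model(row):
    # Stand-in for a trained forest (assumption): the true additive function.
    return 3.0 * row[0] + 1.0 * row[1]


scores = permutation_importance(model, X, y)
print(scores)
```

Because the stand-in model ignores the third feature, its importance is exactly zero here; with a real forest and correlated predictors the scores would be noisier, which is the situation the paper analyses.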


Posted Content
01 Jan 2017
TL;DR: This article develops a non-parametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm, and shows that causal forests are pointwise consistent for the true treatment effect and have an asymptotically Gaussian and centered sampling distribution.
Abstract: Many scientific and engineering challenges--ranging from personalized medicine to customized marketing recommendations--require an understanding of treatment effect heterogeneity. In this paper, we develop a non-parametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm. In the potential outcomes framework with unconfoundedness, we show that causal forests are pointwise consistent for the true treatment effect, and have an asymptotically Gaussian and centered sampling distribution. We also discuss a practical method for constructing asymptotic confidence intervals for the true treatment effect that are centered at the causal forest estimates. Our theoretical results rely on a generic Gaussian theory for a large family of random forest algorithms. To our knowledge, this is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical inference. In experiments, we find causal forests to be substantially more powerful than classical methods based on nearest-neighbor matching, especially in the presence of irrelevant covariates.

485 citations


Journal ArticleDOI
TL;DR: A novel fuzziness based semi-supervised learning approach by utilizing unlabeled samples assisted with supervised learning algorithm to improve the classifier's performance for the IDSs is proposed.

460 citations


Journal ArticleDOI
TL;DR: This work presents the adaptive random forest (ARF) algorithm, which includes an effective resampling method and adaptive operators that can cope with different types of concept drifts without complex optimizations for different data sets.
Abstract: Random forests is currently one of the most used machine learning algorithms in the non-streaming (batch) setting. This preference is attributable to its high learning performance and low demands with respect to input preparation and hyper-parameter tuning. However, in the challenging context of evolving data streams, there is no random forests algorithm that can be considered state-of-the-art in comparison to bagging and boosting based algorithms. In this work, we present the adaptive random forest (ARF) algorithm for classification of evolving data streams. In contrast to previous attempts of replicating random forests for data stream learning, ARF includes an effective resampling method and adaptive operators that can cope with different types of concept drifts without complex optimizations for different data sets. We present experiments with a parallel implementation of ARF which has no degradation in terms of classification performance in comparison to a serial implementation, since trees and adaptive operators are independent from one another. Finally, we compare ARF with state-of-the-art algorithms in a traditional test-then-train evaluation and a novel delayed labelling evaluation, and show that ARF is accurate and uses a feasible amount of resources.

442 citations
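
The test-then-train evaluation mentioned in the abstract above can be sketched in a few lines: each arriving example is first used to test the current model and only afterwards used to update it. The `MajorityClassLearner` below is a deliberately trivial stand-in for a streaming learner such as ARF, and the stream is made-up data; both are assumptions for illustration.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner: always predicts the most frequent label seen."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        if not self.counts:
            return None  # nothing learned yet
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(learner, stream):
    """Test on each example before training on it; return overall accuracy."""
    correct = 0
    total = 0
    for x, y in stream:
        if learner.predict(x) == y:  # test first ...
            correct += 1
        learner.learn(x, y)          # ... then train
        total += 1
    return correct / total if total else 0.0


stream = [(i, "normal") for i in range(9)] + [(9, "attack")]
accuracy = prequential_accuracy(MajorityClassLearner(), stream)
print(accuracy)  # 0.8: the first and the last example are misclassified
```

A delayed-labelling variant would hold each label back for some number of steps before calling `learn`; that extension is not shown.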


Journal ArticleDOI
TL;DR: This study tests machine learning models to predict bankruptcy one year prior to the event, and finds that bagging, boosting, and random forest models outperform the other techniques, and that all prediction accuracy in the testing sample improves when the additional variables are included.
Abstract: Machine learning models show improved bankruptcy prediction accuracy over traditional models. Various models were tested using different accuracy metrics. Boosting, bagging, and random forest models provide better results. There has been intensive research from academics and practitioners regarding models for predicting bankruptcy and default events, for credit risk management. Seminal academic research has evaluated bankruptcy using traditional statistics techniques (e.g. discriminant analysis and logistic regression) and early artificial intelligence models (e.g. artificial neural networks). In this study, we test machine learning models (support vector machines, bagging, boosting, and random forest) to predict bankruptcy one year prior to the event, and compare their performance with results from discriminant analysis, logistic regression, and neural networks. We use data from 1985 to 2013 on North American firms, integrating information from the Salomon Center database and Compustat, analysing more than 10,000 firm-year observations. The key insight of the study is a substantial improvement in prediction accuracy using machine learning techniques especially when, in addition to the original Altman's Z-score variables, we include six complementary financial indicators. Based on Carton and Hofer (2006), we use new variables, such as the operating margin, change in return-on-equity, change in price-to-book, and growth measures related to assets, sales, and number of employees, as predictive variables. Machine learning models show, on average, approximately 10% more accuracy in relation to traditional models. Comparing the best models, with all predictive variables, the machine learning technique related to random forest led to 87% accuracy, whereas logistic regression and linear discriminant analysis led to 69% and 50% accuracy, respectively, in the testing sample.
We find that bagging, boosting, and random forest models outperform the other techniques, and that all prediction accuracy in the testing sample improves when the additional variables are included. Our research adds to the discussion of the continuing debate about superiority of computational methods over statistical techniques such as in Tsai, Hsu, and Yen (2014) and Yeh, Chi, and Lin (2014). In particular, for machine learning mechanisms, we do not find SVM to lead to higher accuracy rates than other models. This result contradicts outcomes from Danenas and Garsva (2015) and Cleofas-Sanchez, Garcia, Marques, and Senchez (2016), but corroborates, for instance, Wang, Ma, and Yang (2014), Liang, Lu, Tsai, and Shih (2016), and Cano et al. (2017). Our study supports the applicability of the expert systems by practitioners as in Heo and Yang (2014), Kim, Kang, and Kim (2015) and Xiao, Xiao, and Wang (2016).

Book ChapterDOI
14 Sep 2017
TL;DR: In this paper, a convolutional neural network (CNN) was used for brain tumor segmentation and a dice loss function was used to cope with class imbalances and extensive data augmentation to successfully prevent overfitting.
Abstract: Quantitative analysis of brain tumors is critical for clinical decision making. While manual segmentation is tedious, time consuming and subjective, this task is at the same time very challenging to solve for automatic segmentation methods. In this paper we present our most recent effort on developing a robust segmentation algorithm in the form of a convolutional neural network. Our network architecture was inspired by the popular U-Net and has been carefully modified to maximize brain tumor segmentation performance. We use a dice loss function to cope with class imbalances and use extensive data augmentation to successfully prevent overfitting. Our method beats the current state of the art on BraTS 2015, is one of the leading methods on the BraTS 2017 validation set (dice scores of 0.896, 0.797 and 0.732 for whole tumor, tumor core and enhancing tumor, respectively) and achieves very good Dice scores on the test set (0.858 for whole, 0.775 for core and 0.647 for enhancing tumor). We furthermore take part in the survival prediction subchallenge by training an ensemble of a random forest regressor and multilayer perceptrons on shape features describing the tumor subregions. Our approach achieves 52.6% accuracy, a Spearman correlation coefficient of 0.496 and a mean square error of 209607 on the test set.
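
The Dice score and Dice loss referred to in the abstract above can be written down compactly. The sketch below is a simplified binary version operating on flat lists: `dice_score` compares two 0/1 masks, and `soft_dice_loss` is a common differentiable relaxation on probabilities. The function names, the smoothing constant `eps`, and the binary (rather than multiclass) formulation are assumptions for illustration, not the authors' exact loss.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two binary masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


def soft_dice_loss(probs, target, eps=1e-6):
    """1 - soft Dice, using predicted probabilities instead of hard labels."""
    intersection = sum(p * t for p, t in zip(probs, target))
    denom = sum(probs) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


# Illustrative masks: one of two predicted foreground voxels is correct.
print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))    # 2*1 / (2+1)
print(soft_dice_loss([0.9, 0.1], [1, 0]))        # confident and correct: low loss
print(soft_dice_loss([0.1, 0.9], [1, 0]))        # confident and wrong: high loss
```

Because the score is normalised by the foreground sizes rather than the number of voxels, a small structure contributes as much as a large one, which is why this loss is often chosen under class imbalance.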

Journal ArticleDOI
TL;DR: A random forest model incorporating aerosol optical depth data, meteorological fields, and land use variables to estimate daily 24 h averaged ground-level PM2.5 concentrations over the conterminous United States in 2011 is developed.
Abstract: To estimate PM2.5 concentrations, many parametric regression models have been developed, while nonparametric machine learning algorithms are used less often and national-scale models are rare. In this paper, we develop a random forest model incorporating aerosol optical depth (AOD) data, meteorological fields, and land use variables to estimate daily 24 h averaged ground-level PM2.5 concentrations over the conterminous United States in 2011. Random forests are an ensemble learning method that provides predictions with high accuracy and interpretability. Our results achieve an overall cross-validation (CV) R2 value of 0.80. Mean prediction error (MPE) and root mean squared prediction error (RMSPE) for daily predictions are 1.78 and 2.83 μg/m3, respectively, indicating a good agreement between CV predictions and observations. The prediction accuracy of our model is similar to those reported in previous studies using neural networks or regression models on both national and regional scales. In addition, the...

Journal ArticleDOI
TL;DR: Results indicate that the proposed Bagging-LMT model can be used for sustainable management of flood-prone areas and outperformed all state-of-the-art benchmark soft computing models.
Abstract: A new artificial intelligence (AI) model, called Bagging-LMT - a combination of bagging ensemble and Logistic Model Tree (LMT) - is introduced for mapping flood susceptibility. A spatial database was generated for the Haraz watershed, northern Iran, that included a flood inventory map and eleven flood conditioning factors based on the Information Gain Ratio (IGR). The model was evaluated using precision, sensitivity, specificity, accuracy, Root Mean Square Error, Mean Absolute Error, Kappa and area under the receiver operating characteristic curve criteria. The model was also compared with four state-of-the-art benchmark soft computing models, including LMT, logistic regression, Bayesian logistic regression, and random forest. Results revealed that the proposed model outperformed all these models and indicate that the proposed model can be used for sustainable management of flood-prone areas.

Journal ArticleDOI
TL;DR: Experimental results show that RFs can generate more accurate predictions than FFBP ANNs with a single hidden layer and support vector regression (SVR).
Abstract: Manufacturers have faced an increasing need for the development of predictive models that predict mechanical failures and the remaining useful life (RUL) of manufacturing systems or components. Classical model-based or physics-based prognostics often require an in-depth physical understanding of the system of interest to develop closed-form mathematical models. However, prior knowledge of system behavior is not always available, especially for complex manufacturing systems and processes. To complement model-based prognostics, data-driven methods have been increasingly applied to machinery prognostics and maintenance management, transforming legacy manufacturing systems into smart manufacturing systems with artificial intelligence. While previous research has demonstrated the effectiveness of data-driven methods, most of these prognostic methods are based on classical machine learning techniques, such as artificial neural networks (ANNs) and support vector regression (SVR). With the rapid advancement in artificial intelligence, various machine learning algorithms have been developed and widely applied in many engineering fields. The objective of this research is to introduce a random forests (RFs)-based prognostic method for tool wear prediction as well as compare the performance of RFs with feed-forward back propagation (FFBP) ANNs and SVR. Specifically, the performance of FFBP ANNs, SVR, and RFs are compared using experimental data collected from 315 milling tests. Experimental results have shown that RFs can generate more accurate predictions than FFBP ANNs with a single hidden layer and SVR. [DOI: 10.1115/1.4036350]

Journal ArticleDOI
TL;DR: The lessons learnt suggest that when RF is applied to multi-modal data for predicting Alzheimer's disease (AD) conversion from Mild Cognitive Impairment (MCI), it produces one of the best accuracies to date.
Abstract: Objective: Machine learning classification has been the most important computational development in the last years to satisfy the primary need of clinicians for automatic early diagnosis and prognosis. Nowadays, the Random Forest (RF) algorithm has been successfully applied for reducing high dimensional and multi-source data in many scientific realms. Our aim was to explore the state of the art of the application of RF on single and multi-modal neuroimaging data for the prediction of Alzheimer's disease. Methods: A systematic review following PRISMA guidelines was conducted on this field of study. In particular, we constructed an advanced query using boolean operators as follows: ("random forest" OR "random forests") AND neuroimaging AND ("alzheimer's disease" OR alzheimer's OR alzheimer) AND (prediction OR classification). The query was then searched in four well-known scientific databases: Pubmed, Scopus, Google Scholar and Web of Science. Results: Twelve articles, published between 2007 and 2017, have been included in this systematic review after a quantitative and qualitative selection. The lessons learnt from these works suggest that when RF is applied to multi-modal data for prediction of Alzheimer's disease (AD) conversion from Mild Cognitive Impairment (MCI), it produces one of the best accuracies to date. Moreover, the RF has important advantages in terms of robustness to overfitting, ability to handle highly non-linear data, stability in the presence of outliers and opportunity for efficient parallel processing, mainly when applied on multi-modality neuroimaging data such as MRI morphometric, diffusion tensor imaging, and PET images. Conclusions: We discussed the strengths of RF, considering also possible limitations and encouraging further studies on the comparisons of this algorithm with other commonly used classification approaches, particularly in the early prediction of the progression from MCI to AD.

Journal ArticleDOI
TL;DR: In this article, the effectiveness of deep neural networks (DNN), gradient-boosted-trees (GBT), random forests (RAF), and several ensembles of these methods in the context of statistical arbitrage was evaluated.

Journal ArticleDOI
TL;DR: In this paper, a Parallel Random Forest (PRF) algorithm for big data on the Apache Spark platform is presented. The PRF algorithm is optimized based on a hybrid approach combining data-parallel and task-parallel optimization; a dual parallel approach is carried out in the training process of RF, and a task Directed Acyclic Graph (DAG) is created according to the parallel training process.
Abstract: With the emergence of the big data age, the issue of how to obtain valuable knowledge from a dataset efficiently and accurately has attracted increasing attention from both academia and industry. This paper presents a Parallel Random Forest (PRF) algorithm for big data on the Apache Spark platform. The PRF algorithm is optimized based on a hybrid approach combining data-parallel and task-parallel optimization. From the perspective of data-parallel optimization, a vertical data-partitioning method is performed to reduce the data communication cost effectively, and a data-multiplexing method is performed to allow the training dataset to be reused and diminish the volume of data. From the perspective of task-parallel optimization, a dual parallel approach is carried out in the training process of RF, and a task Directed Acyclic Graph (DAG) is created according to the parallel training process of PRF and the dependence of the Resilient Distributed Datasets (RDD) objects. Then, different task schedulers are invoked for the tasks in the DAG. Moreover, to improve the algorithm's accuracy for large, high-dimensional, and noisy data, we perform a dimension-reduction approach in the training process and a weighted voting approach in the prediction process prior to parallelization. Extensive experimental results indicate the superiority and notable advantages of the PRF algorithm over the relevant algorithms implemented by Spark MLlib and other studies in terms of the classification accuracy, performance, and scalability. With the expansion of the scale of the random forest model and the Spark cluster, the advantage of the PRF algorithm is more obvious.
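
The weighted voting step named in the abstract above can be illustrated with a small sketch. It assumes each tree carries a weight, for example an accuracy estimate obtained during training; the abstract does not specify how PRF derives its weights, so the weighting scheme, the `weighted_vote` name, and the sample values are all hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label whose supporting trees have the largest total weight."""
    if len(predictions) != len(weights):
        raise ValueError("one weight is required per tree prediction")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Three trees vote; the single high-weight tree outvotes two weak ones.
tree_predictions = ["a", "b", "b"]
tree_weights = [0.9, 0.3, 0.3]          # hypothetical per-tree accuracies
print(weighted_vote(tree_predictions, tree_weights))   # a
print(weighted_vote(tree_predictions, [1.0, 1.0, 1.0]))  # b (plain majority)
```

With equal weights the function reduces to ordinary majority voting, which makes it easy to compare the two schemes on the same set of trees.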

Journal ArticleDOI
TL;DR: It is found that Stochastic Gradient Boosting Trees (GBDT) matches or exceeds the prediction performance of Support Vector Machines and Random Forests, while being the fastest algorithm in terms of prediction efficiency.
Abstract: Up-to-date report on the accuracy and efficiency of state-of-the-art classifiers. We compare the accuracy of 11 classification algorithms pairwise and groupwise. We examine separately the training, parameter-tuning, and testing time. GBDT and Random Forests yield highest accuracy, outperforming SVM. GBDT is the fastest in testing, Naive Bayes the fastest in training. Current benchmark reports of classification algorithms generally concern common classifiers and their variants but do not include many algorithms that have been introduced in recent years. Moreover, important properties such as the dependency on number of classes and features and CPU running time are typically not examined. In this paper, we carry out a comparative empirical study on both established classifiers and more recently proposed ones on 71 data sets originating from different domains, publicly available at UCI and KEEL repositories. The list of 11 algorithms studied includes Extreme Learning Machine (ELM), Sparse Representation based Classification (SRC), and Deep Learning (DL), which have not been thoroughly investigated in existing comparative studies. It is found that Stochastic Gradient Boosting Trees (GBDT) matches or exceeds the prediction performance of Support Vector Machines (SVM) and Random Forests (RF), while being the fastest algorithm in terms of prediction efficiency. ELM also yields good accuracy results, ranking in the top-5, alongside GBDT, RF, SVM, and C4.5 but this performance varies widely across all data sets. Unsurprisingly, top accuracy performers have average or slow training time efficiency. DL is the worst performer in terms of accuracy but second fastest in prediction efficiency. SRC shows good accuracy performance but it is the slowest classifier in both training and testing.

Journal ArticleDOI
TL;DR: In this paper, the authors applied support vector machine (SVM), random forest (RF), and genetic algorithm optimized random forests (RFGA) methods to assess groundwater potential by spring locations.
Abstract: Regarding the ever increasing issue of water scarcity in different countries, the current study plans to apply support vector machine (SVM), random forest (RF), and genetic algorithm optimized random forest (RFGA) methods to assess groundwater potential by spring locations. To this end, 14 effective variables including DEM-derived, river-based, fault-based, land use, and lithology factors were provided. Of 842 spring locations found, 70% (589) were used for model training, and the rest were used to evaluate the models. The mentioned models were run and groundwater potential maps (GPMs) were produced. Finally, the receiver operating characteristics (ROC) curve was plotted to evaluate the efficiency of the methods. The results of the current study indicated that the RFGA and RF methods had better efficacy than the different kernels of the SVM model. The area under the ROC curve (AUC) for RF and RFGA was estimated as 84.6% and 85.6%, respectively. The AUC was computed as 78.6% for SVM-linear, 76.8% for SVM-polynomial, 77.1% for SVM-sigmoid, and 77% for SVM-radial basis function. Furthermore, the results showed the higher importance of altitude, TWI, and slope angle in groundwater assessment. The methodology created in the current study could be transferred to other places with water scarcity issues for groundwater potential assessment and management.

Journal ArticleDOI
TL;DR: The proposed PSO-NF model is a valid alternative tool that should be considered for tropical forest fire susceptibility modeling and is useful for forest planning and management in forest fire prone areas.

Journal ArticleDOI
TL;DR: In this paper, a comparison of three classification algorithms: support vector machines (SVM), random forest (RF) and artificial neural networks (ANN) for tree species classification using airborne hyperspectral data from the Airborne Prism EXperiment sensor is presented.
Abstract: Knowledge of tree species composition in a forest is an important topic in forest management. Accurate tree species maps allow for much more detailed and in-depth analysis of biophysical forest variables. The paper presents a comparison of three classification algorithms: support vector machines (SVM), random forest (RF) and artificial neural networks (ANN) for tree species classification using airborne hyperspectral data from the Airborne Prism EXperiment sensor. The aim of this paper is to evaluate the three nonparametric classification algorithms (SVM, RF and ANN) in an attempt to classify the five most common tree species of the Szklarska Poreba area: spruce (Picea alba L. Karst), larch (Larix decidua Mill.), alder (Alnus Mill), beech (Fagus sylvatica L.) and birch (Betula pendula Roth). To avoid human introduced biases a 0.632 bootstrap procedure was used during evaluation of each compared classifier. Of all compared classification results, ANN achieved the highest median overall classificati...
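The 0.632 bootstrap mentioned above combines the optimistic resubstitution error with the pessimistic out-of-bag error, weighted 0.368 and 0.632. The sketch below is one common simplified form, assuming a hypothetical `fit(X, y)` callable that returns a prediction function; it averages the out-of-bag error per bootstrap round, which may differ from the exact variant used in the paper.

```python
import random


def bootstrap_632_error(X, y, fit, n_rounds=100, seed=0):
    """Estimate classification error with the 0.632 bootstrap.

    fit(X, y) must return a function mapping one sample to a label.
    """
    rng = random.Random(seed)
    n = len(X)

    # Resubstitution error: train and evaluate on the full data set.
    model = fit(X, y)
    resub = sum(model(x) != t for x, t in zip(X, y)) / n

    oob_errors = []
    for _ in range(n_rounds):
        # Draw n indices with replacement; the rest are out-of-bag.
        idx = [rng.randrange(n) for _ in range(n)]
        in_bag = set(idx)
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:
            continue  # rare: every sample was drawn at least once
        model = fit([X[i] for i in idx], [y[i] for i in idx])
        oob_errors.append(sum(model(X[i]) != y[i] for i in oob) / len(oob))

    if not oob_errors:
        raise ValueError("no bootstrap round produced out-of-bag samples")
    oob_err = sum(oob_errors) / len(oob_errors)
    return 0.368 * resub + 0.632 * oob_err


def fit_nearest_mean(X, y):
    """Placeholder 1-D classifier: predict the class with the closest mean."""
    means = {}
    for label in set(y):
        values = [x for x, t in zip(X, y) if t == label]
        means[label] = sum(values) / len(values)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


X = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
y = [0, 0, 0, 0, 1, 1, 1, 1]
print(bootstrap_632_error(X, y, fit_nearest_mean))
```

Any classifier (SVM, RF, ANN) could be plugged in through `fit`; the nearest-mean model is only a stand-in to keep the example self-contained.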

Journal ArticleDOI
TL;DR: This paper proposes a novel hybrid approach of a random forests classifier for the fault diagnosis in rolling bearings that reached 88.23% classification accuracy, with high efficiency and robustness.
Abstract: Faults in rolling bearings can result in the deterioration of rotating machine operating conditions. How to extract the fault feature parameters and identify the fault of the rolling bearing has therefore become a key issue for ensuring the safe operation of modern rotating machinery. This paper proposes a novel hybrid approach of a random forests classifier for the fault diagnosis in rolling bearings. The fault feature parameters are extracted by applying the wavelet packet decomposition, and the best set of mother wavelets for the signal pre-processing is identified by the values of signal-to-noise ratio and mean square error. Then, the mutual dimensionless index is used for the first time as the input feature for the classification problem. In this way, the best features of the five mutual dimensionless indices for the fault diagnosis are selected through the internal voting of the random forests classifier. The approach is tested on simulated and practical bearing vibration signals by considering several fault classes. The comparative experimental results show that the proposed method reached 88.23% classification accuracy, with high efficiency and robustness.
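The selection of a mother wavelet by signal-to-noise ratio and mean square error can be sketched as a ranking of candidate reconstructions against a reference signal. The definitions below (SNR in decibels as signal power over residual power, MSE as the mean squared residual) are the usual textbook ones and are an assumption here, since the abstract does not spell them out; the candidate names are hypothetical placeholders, and the wavelet packet decomposition itself is not implemented.

```python
import math


def mse(reference, estimate):
    """Mean squared error between a reference signal and its reconstruction."""
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB, treating the residual as noise."""
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise_power == 0:
        return math.inf
    return 10 * math.log10(signal_power / noise_power)


def rank_candidates(reference, reconstructions):
    """Sort candidate names by highest SNR, breaking ties by lowest MSE."""
    return sorted(
        reconstructions,
        key=lambda name: (
            -snr_db(reference, reconstructions[name]),
            mse(reference, reconstructions[name]),
        ),
    )


reference = [1.0, 0.0, -1.0, 0.0]
reconstructions = {
    "wavelet_A": [0.9, 0.1, -1.1, 0.0],  # hypothetical candidate outputs
    "wavelet_B": [0.5, 0.5, -0.5, 0.5],
}
print(rank_candidates(reference, reconstructions))
```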

Journal ArticleDOI
TL;DR: In this paper, a random forest method is proposed to build an hour-ahead wind power predictor, which is based on spatially averaged wind speed and wind direction, and the random forest does not need to be tuned or optimized, unlike most other learning machines.

Journal ArticleDOI
TL;DR: The results suggest that the analysis of variable importance with respect to the different classifiers and travel modes is essential for a better understanding and effective modeling of people's travel behavior.
Abstract: A comparison of 7 classifiers for travel mode prediction is performed. Prediction accuracy and variable importance for each travel mode is investigated. Among the investigated classifiers, random forest performs best. Trip distance followed by the number of cars are the most important variables. The importance of other variables varies with travel mode and classifier. The analysis of travel mode choice is an important task in transportation planning and policy making in order to understand and predict travel demands. While advances in machine learning have led to numerous powerful classifiers, their usefulness for modeling travel mode choice remains largely unexplored. Using extensive Dutch travel diary data from the years 2010 to 2012, enriched with variables on the built and natural environment as well as on weather conditions, this study compares the predictive performance of seven selected machine learning classifiers for travel mode choice analysis and makes recommendations for model selection. In addition, it addresses the importance of different variables and how they relate to different travel modes. The results show that random forest performs significantly better than any other of the investigated classifiers, including the commonly used multinomial logit model. While trip distance is found to be the most important variable, the importance of the other variables varies with classifiers and travel modes. The importance of the meteorological variables is highest for support vector machine, while temperature is particularly important for predicting bicycle and public transport trips. The results suggest that the analysis of variable importance with respect to the different classifiers and travel modes is essential for a better understanding and effective modeling of people's travel behavior.

Journal ArticleDOI
TL;DR: The experimental results demonstrate that the proposed improved ICFS method shows better performance compared to the conventional Correlation-based method and also outperforms some other state-of-the-art methods of epileptic seizure detection using the same benchmark EEG dataset.

Journal ArticleDOI
TL;DR: A standardized set to test and evaluate different machine learning algorithms in the context of multi-task learning is offered by providing the data and the protocols.
Abstract: The increase of publicly available bioactivity data in recent years has fueled and catalyzed research in chemogenomics, data mining, and modeling approaches. As a direct result, over the past few years a multitude of different methods have been reported and evaluated, such as target fishing, nearest neighbor similarity-based methods, and Quantitative Structure Activity Relationship (QSAR)-based protocols. However, such studies are typically conducted on different datasets, using different validation strategies, and different metrics. In this study, different methods were compared using one single standardized dataset obtained from ChEMBL, which is made available to the public, using standardized metrics (BEDROC and Matthews Correlation Coefficient). Specifically, the performance of Naive Bayes, Random Forests, Support Vector Machines, Logistic Regression, and Deep Neural Networks was assessed using QSAR and proteochemometric (PCM) methods. All methods were validated using both a random split validation and a temporal validation, with the latter being a more realistic benchmark of expected prospective execution. Deep Neural Networks are the top performing classifiers, highlighting the added value of Deep Neural Networks over other more conventional methods. Moreover, the best method (‘DNN_PCM’) performed significantly better at almost one standard deviation higher than the mean performance. Furthermore, Multi-task and PCM implementations were shown to improve performance over single task Deep Neural Networks. Conversely, target prediction performed almost two standard deviations under the mean performance. Random Forests, Support Vector Machines, and Logistic Regression performed around mean performance. Finally, using an ensemble of DNNs, alongside additional tuning, enhanced the relative performance by another 27% (compared with unoptimized ‘DNN_PCM’). 
Here, a standardized set to test and evaluate different machine learning algorithms in the context of multi-task learning is offered by providing the data and the protocols.
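Of the two standardized metrics named in this abstract, the Matthews Correlation Coefficient is simple enough to show directly. The following is a minimal sketch for binary labels coded as 0 and 1; returning 0.0 when a row or column of the confusion matrix is empty is a common convention and is assumed here, not taken from the paper.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when the coefficient is undefined
    return (tp * tn - fp * fn) / denominator


print(matthews_corrcoef([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
print(matthews_corrcoef([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.0
```

Unlike plain accuracy, the coefficient stays near zero for a classifier that always predicts the majority class, which matters for bioactivity data where inactive compounds usually dominate.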

Journal ArticleDOI
24 Jul 2017-PLOS ONE
TL;DR: The study shows the potential of ensembling and SMOTE approaches for predicting incident diabetes using cardiorespiratory fitness data and applies different techniques to uncover potential predictors of diabetes.
Abstract: Machine learning is becoming a popular and important approach in the field of medical research. In this study, we investigate the relative performance of various machine learning methods such as Decision Tree, Naive Bayes, Logistic Regression, Logistic Model Tree and Random Forests for predicting incident diabetes using medical records of cardiorespiratory fitness. In addition, we apply different techniques to uncover potential predictors of diabetes. This FIT project study used data of 32,555 patients who are free of any known coronary artery disease or heart failure who underwent clinician-referred exercise treadmill stress testing at Henry Ford Health Systems between 1991 and 2009 and had a complete 5-year follow-up. At the completion of the fifth year, 5,099 of those patients have developed diabetes. The dataset contained 62 attributes classified into four categories: demographic characteristics, disease history, medication use history, and stress test vital signs. We developed an Ensembling-based predictive model using 13 attributes that were selected based on their clinical importance, Multiple Linear Regression, and Information Gain Ranking methods. The negative effect of the imbalance class of the constructed model was handled by Synthetic Minority Oversampling Technique (SMOTE). The overall performance of the predictive model classifier was improved by the Ensemble machine learning approach using the Vote method with three Decision Trees (Naive Bayes Tree, Random Forest, and Logistic Model Tree) and achieved high accuracy of prediction (AUC = 0.92). The study shows the potential of ensembling and SMOTE approaches for predicting incident diabetes using cardiorespiratory fitness data.
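The core idea of SMOTE, used above to handle class imbalance, is to create synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below is a simplified standard-library illustration of that idea, not the implementation used in the study; the function name `smote` and its parameters are chosen for this example.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by neighbour interpolation.

    minority: list of equal-length numeric tuples (minority-class samples).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(
            key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j]))
        )
        neighbour = minority[rng.choice(others[:k])]
        # Place the new point at a random position on the connecting segment.
        gap = rng.random()
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_synthetic=10, k=2)
print(len(new_points))  # 10
```

In the study's setting the minority class would be the 5,099 patients who developed diabetes out of 32,555; oversampling should be applied to training data only, so that evaluation data stays free of synthetic samples.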

Journal Article
TL;DR: The goal of this paper is providing theoretical results showing that the expected error rate may be a non-monotonous function of the number of trees and arguing in favor of setting it to a computationally feasible large number, depending on convergence properties of the desired performance measure.
Abstract: The number of trees T in the random forest (RF) algorithm for supervised learning has to be set by the user. It is controversial whether T should simply be set to the largest computationally manageable value or whether a smaller T may in some cases be better. While the principle underlying bagging is that "more trees are better", in practice the classification error rate sometimes reaches a minimum before increasing again for increasing number of trees. The goal of this paper is four-fold: (i) providing theoretical results showing that the expected error rate may be a non-monotonous function of the number of trees and explaining under which circumstances this happens; (ii) providing theoretical results showing that such non-monotonous patterns cannot be observed for other performance measures such as the Brier score and the logarithmic loss (for classification) and the mean squared error (for regression); (iii) illustrating the extent of the problem through an application to a large number (n = 306) of datasets from the public database OpenML; (iv) finally arguing in favor of setting it to a computationally feasible large number, depending on convergence properties of the desired performance measure.
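A toy calculation can show how the expected error rate may fall and then rise again as trees are added. The sketch assumes, as a deliberate simplification that may not match the paper's exact setup, that each tree votes independently and is correct with a fixed probability per observation; the two observation groups and their probabilities are invented for illustration. Observations whose trees are right more than half the time benefit from more trees, while observations whose trees are right slightly less than half the time are misclassified more and more reliably.

```python
from math import comb


def majority_error(p_correct, n_trees):
    """Probability that a majority of independent tree votes is wrong.

    n_trees must be odd so that ties cannot occur.
    """
    if n_trees % 2 == 0:
        raise ValueError("use an odd number of trees to avoid ties")
    # The majority is wrong when at most n_trees // 2 trees vote correctly.
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(n_trees // 2 + 1)
    )


def expected_error(groups, n_trees):
    """groups: list of (share of observations, per-tree probability correct)."""
    return sum(share * majority_error(p, n_trees) for share, p in groups)


# Invented mixture: 90% easy observations, 10% slightly-worse-than-chance ones.
groups = [(0.9, 0.9), (0.1, 0.45)]
for n_trees in (1, 11, 101):
    print(n_trees, round(expected_error(groups, n_trees), 3))
```

In this toy mixture the error is lower at 11 trees than at either 1 or 101 trees, which is the non-monotonous pattern the abstract describes for the error rate.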

Journal ArticleDOI
TL;DR: A novel hybrid model for predicting hourly global solar radiation using random forests technique and firefly algorithm is presented, which shows better performance as compared to the aforementioned models in terms of prediction accuracy and prediction speed.