
Showing papers on "Random forest published in 2019"


Journal ArticleDOI
TL;DR: A flexible, computationally efficient algorithm for growing generalized random forests, an adaptive weighting function derived from a forest designed to express heterogeneity in the specified quantity of interest, and an estimator for their asymptotic variance that enables valid confidence intervals are proposed.
Abstract: We propose generalized random forests, a method for nonparametric statistical estimation based on random forests (Breiman [Mach. Learn. 45 (2001) 5–32]) that can be used to fit any quantity of interest identified as the solution to a set of local moment equations. Following the literature on local maximum likelihood estimation, our method considers a weighted set of nearby training examples; however, instead of using classical kernel weighting functions that are prone to a strong curse of dimensionality, we use an adaptive weighting function derived from a forest designed to express heterogeneity in the specified quantity of interest. We propose a flexible, computationally efficient algorithm for growing generalized random forests, develop a large-sample theory for our method showing that our estimates are consistent and asymptotically Gaussian, and provide an estimator for their asymptotic variance that enables valid confidence intervals. We use our approach to develop new methods for three statistical tasks: nonparametric quantile regression, conditional average partial effect estimation and heterogeneous treatment effect estimation via instrumental variables. A software implementation, grf for R and C++, is available from CRAN.
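
As a minimal sketch of the adaptive-weighting idea (not the grf package's algorithm), the Python example below derives weights for a test point from leaf co-membership in a scikit-learn forest and plugs them into a weighted quantile estimate; the data are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 5))
y = X[:, 0] + rng.normal(0, 0.5 + 0.5 * (X[:, 0] > 0), size=2000)

forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=20,
                               random_state=0).fit(X, y)

def forest_weights(forest, X_train, x):
    """alpha_i(x): average over trees of 1{i shares x's leaf} / leaf size."""
    train_leaves = forest.apply(X_train)          # (n, n_trees) leaf indices
    test_leaves = forest.apply(x.reshape(1, -1))  # (1, n_trees)
    same = train_leaves == test_leaves            # co-membership per tree
    return (same / same.sum(axis=0)).mean(axis=1)

def weighted_quantile(y, w, q):
    order = np.argsort(y)
    cum = np.cumsum(w[order])
    return y[order][min(np.searchsorted(cum, q), len(y) - 1)]

w = forest_weights(forest, X, np.zeros(5))
print("estimated conditional median at x = 0:", weighted_quantile(y, w, 0.5))
```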

840 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel method that aims at finding significant features by applying machine learning techniques, thereby improving the accuracy of cardiovascular disease prediction with a hybrid random forest with a linear model (HRFLM).
Abstract: Heart disease is one of the most significant causes of mortality in the world today. Prediction of cardiovascular disease is a critical challenge in the area of clinical data analysis. Machine learning (ML) has been shown to be effective in assisting in making decisions and predictions from the large quantity of data produced by the healthcare industry. We have also seen ML techniques being used in recent developments in different areas of the Internet of Things (IoT). Various studies give only a glimpse into predicting heart disease with ML techniques. In this paper, we propose a novel method that aims at finding significant features by applying machine learning techniques, thereby improving the accuracy of cardiovascular disease prediction. The prediction model is introduced with different combinations of features and several known classification techniques. The prediction model for heart disease with the hybrid random forest with a linear model (HRFLM) achieves an enhanced performance level, with an accuracy of 88.7%.
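
A hedged sketch of one way to hybridize a random forest with a linear model: scikit-learn's StackingClassifier feeds out-of-fold RF predictions into a logistic regression. The paper's exact HRFLM pipeline is not reproduced here, and the 13-feature synthetic data merely mimics the shape of common heart-disease datasets.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data; swap in the real clinical features and labels.
X, y = make_classification(n_samples=500, n_features=13, random_state=0)

hybrid = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),  # linear model on top of RF outputs
)
print("CV accuracy:", cross_val_score(hybrid, X, y, cv=5).mean())
```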

783 citations


Journal ArticleDOI
TL;DR: It is found that the Support Vector Machine (SVM) algorithm is applied most frequently (in 29 studies), followed by the Naïve Bayes algorithm (in 23 studies); however, the Random Forest algorithm showed comparatively superior accuracy.
Abstract: Supervised machine learning algorithms have been a dominant method in the data mining field. Disease prediction using health data has recently shown a potential application area for these methods. This study aims to identify the key trends among different types of supervised machine learning algorithms, and their performance and usage for disease risk prediction. In this study, extensive research efforts were made to identify those studies that applied more than one supervised machine learning algorithm on single disease prediction. Two databases (i.e., Scopus and PubMed) were searched using various search terms. Thus, we selected 48 articles in total for the comparison among variants of supervised machine learning algorithms for disease prediction. We found that the Support Vector Machine (SVM) algorithm is applied most frequently (in 29 studies) followed by the Naive Bayes algorithm (in 23 studies). However, the Random Forest (RF) algorithm showed comparatively superior accuracy. Of the 17 studies where it was applied, RF showed the highest accuracy in 9 of them, i.e., 53%. This was followed by SVM, which performed best in 41% of the studies in which it was considered. This study provides a wide overview of the relative performance of different variants of supervised machine learning algorithms for disease prediction. This important information on relative performance can be used to aid researchers in the selection of an appropriate supervised machine learning algorithm for their studies.

580 citations


Journal ArticleDOI
TL;DR: A literature review on the parameters' influence on the prediction performance and on variable importance measures is provided, and the application of one of the most established tuning strategies, model‐based optimization (MBO), is demonstrated.
Abstract: The random forest algorithm (RF) has several hyperparameters that have to be set by the user, e.g., the number of observations drawn randomly for each tree and whether they are drawn with or without replacement, the number of variables drawn randomly for each split, the splitting rule, the minimum number of samples that a node must contain and the number of trees. In this paper, we first provide a literature review on the parameters' influence on the prediction performance and on variable importance measures. It is well known that in most cases RF works reasonably well with the default values of the hyperparameters specified in software packages. Nevertheless, tuning the hyperparameters can improve the performance of RF. In the second part of this paper, after a brief overview of tuning strategies we demonstrate the application of one of the most established tuning strategies, model-based optimization (MBO). To make it easier to use, we provide the tuneRanger R package that tunes RF with MBO automatically. In a benchmark study on several datasets, we compare the prediction performance and runtime of tuneRanger with other tuning implementations in R and RF with default hyperparameters.
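
tuneRanger itself is an R package; as a rough Python analogue for tuning the same hyperparameters, the sketch below uses random search (a simpler strategy than model-based optimization) over the tree count, split candidates, node size and sampling scheme, on placeholder data.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(100, 1000),   # number of trees
        "max_features": uniform(0.1, 0.9),    # fraction of variables tried per split
        "min_samples_leaf": randint(1, 20),   # minimum node size
        "bootstrap": [True, False],           # draw samples with or without replacement
    },
    n_iter=30, cv=5, random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```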

559 citations


Journal ArticleDOI
TL;DR: A metalearner, the X-learner, is proposed, which can adapt to structural properties, such as the smoothness and sparsity of the underlying treatment effect, and is shown to be easy to use and to produce results that are interpretable.
Abstract: There is growing interest in estimating and analyzing heterogeneous treatment effects in experimental and observational studies. We describe a number of metaalgorithms that can take advantage of any supervised learning or regression method in machine learning and statistics to estimate the conditional average treatment effect (CATE) function. Metaalgorithms build on base algorithms, such as random forests (RFs), Bayesian additive regression trees (BARTs), or neural networks, to estimate the CATE, a function that the base algorithms are not designed to estimate directly. We introduce a metaalgorithm, the X-learner, that is provably efficient when the number of units in one treatment group is much larger than in the other and can exploit structural properties of the CATE function. For example, if the CATE function is linear and the response functions in treatment and control are Lipschitz-continuous, the X-learner can still achieve the parametric rate under regularity conditions. We then introduce versions of the X-learner that use RF and BART as base learners. In extensive simulation studies, the X-learner performs favorably, although none of the metalearners is uniformly the best. In two persuasion field experiments from political science, we demonstrate how our X-learner can be used to target treatment regimes and to shed light on underlying mechanisms. A software package is provided that implements our methods.
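
The X-learner recipe is concrete enough to sketch directly. The Python example below, with synthetic data and RF base learners, follows the three stages described above: per-arm outcome models, imputed individual effects, and a propensity-weighted combination. It is an illustration, not the authors' software package.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
w = rng.binomial(1, 0.3, size=n)          # unbalanced treatment assignment
tau_true = np.maximum(X[:, 0], 0)         # true CATE (known only in simulation)
y = X[:, 1] + w * tau_true + rng.normal(size=n)

# Stage 1: outcome models, one per treatment arm.
mu0 = RandomForestRegressor(random_state=0).fit(X[w == 0], y[w == 0])
mu1 = RandomForestRegressor(random_state=0).fit(X[w == 1], y[w == 1])

# Stage 2: imputed individual treatment effects, then CATE models per arm.
d1 = y[w == 1] - mu0.predict(X[w == 1])
d0 = mu1.predict(X[w == 0]) - y[w == 0]
tau1 = RandomForestRegressor(random_state=0).fit(X[w == 1], d1)
tau0 = RandomForestRegressor(random_state=0).fit(X[w == 0], d0)

# Stage 3: combine the two CATE estimates, weighting by the propensity score.
g = RandomForestClassifier(random_state=0).fit(X, w).predict_proba(X)[:, 1]
cate = g * tau0.predict(X) + (1 - g) * tau1.predict(X)
print("RMSE of CATE estimate:", np.sqrt(np.mean((cate - tau_true) ** 2)))
```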

546 citations


Journal ArticleDOI
TL;DR: This study shows that the Conv1D-based deep learning framework provides an effective and efficient method of time series representation in multi-temporal classification tasks.

509 citations


Journal ArticleDOI
01 Sep 2019
TL;DR: The performance of several machine learning models is compared for accurately predicting attacks and anomalies on IoT systems; although several models reach the same accuracy, other metrics show that Random Forest performs comparatively better.
Abstract: Attack and anomaly detection in the Internet of Things (IoT) infrastructure is a rising concern in the domain of IoT. With the increased use of IoT infrastructure in every domain, threats and attacks in these infrastructures are also growing commensurately. Denial of Service, Data Type Probing, Malicious Control, Malicious Operation, Scan, Spying and Wrong Setup are such attacks and anomalies which can cause an IoT system failure. In this paper, the performances of several machine learning models are compared for predicting attacks and anomalies on IoT systems accurately. The machine learning (ML) algorithms used here are Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), and Artificial Neural Network (ANN). The evaluation metrics used in the comparison of performance are accuracy, precision, recall, F1 score, and area under the Receiver Operating Characteristic curve. The system obtained 99.4% test accuracy for Decision Tree, Random Forest, and ANN. Though these techniques have the same accuracy, the other metrics show that Random Forest performs comparatively better.
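
A minimal sketch of the comparison protocol on placeholder data: the same five model families evaluated with accuracy, precision, recall, F1 score and ROC AUC. The synthetic dataset stands in for the paper's IoT traffic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "ANN": MLPClassifier(max_iter=1000, random_state=0),
}
for name, model in models.items():
    proba = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    pred = (proba >= 0.5).astype(int)
    print(name,
          accuracy_score(y_te, pred), precision_score(y_te, pred),
          recall_score(y_te, pred), f1_score(y_te, pred),
          roc_auc_score(y_te, proba))
```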

460 citations


Journal ArticleDOI
TL;DR: Based on this study, the best variable selection methods for most datasets are Jiang's method and the method implemented in the VSURF R package; for datasets with many predictors, the methods implemented in the R packages varSelRF and Boruta are preferable due to computational efficiency.
Abstract: Random forest classification is a popular machine learning method for developing prediction models in many research settings. Often in prediction modeling, a goal is to reduce the number of variables needed to obtain a prediction in order to reduce the burden of data collection and improve efficiency. Several variable selection methods exist for the setting of random forest classification; however, there is a paucity of literature to guide users as to which method may be preferable for different types of datasets. Using 311 classification datasets freely available online, we evaluate the prediction error rates, number of variables, computation times and area under the receiver operating characteristic curve for many random forest variable selection methods. We compare random forest variable selection methods for different types of datasets (datasets with binary outcomes, datasets with many predictors, and datasets with imbalanced outcomes) and for different types of methods (standard random forest versus conditional random forest methods and test based versus performance based methods). Based on our study, the best variable selection methods for most datasets are Jiang's method and the method implemented in the VSURF R package. For datasets with many predictors, the methods implemented in the R packages varSelRF and Boruta are preferable due to computational efficiency. A significant contribution of this study is the ability to assess different variable selection techniques in the setting of random forest classification in order to identify preferable methods based on applications in expert and intelligent systems.
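
VSURF, varSelRF and Boruta are R packages; as a generic performance-based analogue in Python, the sketch below runs cross-validated recursive feature elimination with a random forest on placeholder data. It illustrates the variable-selection setting rather than any one of the compared methods.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

# Drop 5 variables per round, keeping whatever maximizes cross-validated AUC.
selector = RFECV(RandomForestClassifier(n_estimators=200, random_state=0),
                 step=5, cv=5, scoring="roc_auc")
selector.fit(X, y)
print("variables kept:", selector.n_features_)
```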

446 citations


Journal ArticleDOI
TL;DR: In this article, the authors study the impact of parameter optimization on defect prediction models and find that automated parameter optimization can substantially shift the importance ranking of variables, with as few as 28 percent of the top-ranked variables in optimized classifiers also being top-ranked in non-optimized classifiers.
Abstract: Defect prediction models—classifiers that identify defect-prone software modules—have configurable parameters that control their characteristics (e.g., the number of trees in a random forest). Recent studies show that these classifiers underperform when default settings are used. In this paper, we study the impact of automated parameter optimization on defect prediction models. Through a case study of 18 datasets, we find that automated parameter optimization: (1) improves AUC performance by up to 40 percentage points; (2) yields classifiers that are at least as stable as those trained using default settings; (3) substantially shifts the importance ranking of variables, with as few as 28 percent of the top-ranked variables in optimized classifiers also being top-ranked in non-optimized classifiers; (4) yields optimized settings for 17 of the 20 most sensitive parameters that transfer among datasets without a statistically significant drop in performance; and (5) adds less than 30 minutes of additional computation to 12 of the 26 studied classification techniques. While widely-used classification techniques like random forest and support vector machines are not optimization-sensitive, traditionally overlooked techniques like C5.0 and neural networks can actually outperform widely-used techniques after optimization is applied. This highlights the importance of exploring the parameter space when using parameter-sensitive classification techniques.
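
Finding (3) can be illustrated with a small sketch: fit a default and a tuned random forest, then measure how many top-ranked variables the two importance rankings share. The data and the small tuning grid below are placeholders, not the paper's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

default = RandomForestClassifier(random_state=0).fit(X, y)
tuned = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [100, 500], "max_features": [0.2, 0.5, "sqrt"]},
    scoring="roc_auc", cv=5,
).fit(X, y).best_estimator_

k = 10  # compare the top-k variables by impurity importance
top_default = set(np.argsort(default.feature_importances_)[-k:])
top_tuned = set(np.argsort(tuned.feature_importances_)[-k:])
print("top-k overlap:", len(top_default & top_tuned) / k)
```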

290 citations


Journal ArticleDOI
Gao Xianwei, Chun Shan, Changzhen Hu, Zequn Niu, Liu Zhen
TL;DR: It is shown that the ensemble model effectively improves detection accuracy, and it is found that the quality of the data features is an important factor in determining the detection effect.
Abstract: In recent years, advanced threat attacks are increasing, but the traditional network intrusion detection system based on feature filtering has some drawbacks which make it difficult to find new attacks in time. This paper takes the NSL-KDD data set as the research object, analyses the latest progress and existing problems in the field of intrusion detection technology, and proposes an adaptive ensemble learning model. By adjusting the proportion of training data and setting up multiple decision trees, we construct a MultiTree algorithm. In order to improve the overall detection effect, we choose several base classifiers, including decision tree, random forest, kNN, and DNN, and design an adaptive ensemble voting algorithm. We use the NSL-KDD Test+ set to verify our approach: the accuracy of the MultiTree algorithm is 84.2%, while the final accuracy of the adaptive voting algorithm reaches 85.2%. Compared with other research papers, our ensemble model effectively improves detection accuracy. In addition, analysis of the data shows that the quality of the data features is an important factor in determining the detection effect. In the future, we should optimize the feature selection and preprocessing of intrusion detection data to achieve better results.
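
A hedged sketch of the ensemble-voting idea in scikit-learn on placeholder data: the same four base learners, with per-classifier weights set from held-out accuracy as a simple stand-in for the paper's adaptive voting scheme.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

base = [("dt", DecisionTreeClassifier(random_state=0)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
        ("dnn", MLPClassifier(max_iter=1000, random_state=0))]

# "Adaptive" weights: each base learner's accuracy on a held-out split.
weights = [clf.fit(X_tr, y_tr).score(X_val, y_val) for _, clf in base]
ensemble = VotingClassifier(base, voting="soft", weights=weights).fit(X_tr, y_tr)
print("ensemble accuracy:", ensemble.score(X_val, y_val))
```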

238 citations


Journal ArticleDOI
TL;DR: It is concluded that the combination of machine learning with UAV remote sensing is a promising alternative for estimating AGB and suggests that structural and spectral information can be considered simultaneously rather than separately when estimating biophysical crop parameters.
Abstract: Above-ground biomass (AGB) is a basic agronomic parameter for field investigation and is frequently used to indicate crop growth status, the effects of agricultural management practices, and the ability to sequester carbon above and below ground. The conventional way to obtain AGB is to use destructive sampling methods that require manual harvesting of crops, weighing, and recording, which makes large-area, long-term measurements challenging and time-consuming. However, with the diversity of platforms and sensors and the improvements in spatial and spectral resolution, remote sensing is now regarded as the best technical means for monitoring and estimating AGB over large areas. In this study, we used structural and spectral information provided by remote sensing from an unmanned aerial vehicle (UAV) in combination with machine learning to estimate maize biomass. Of the 14 predictor variables, six were selected to create a model by using a recursive feature elimination algorithm. Four machine-learning regression algorithms (multiple linear regression, support vector machine, artificial neural network, and random forest) were evaluated and compared to create a suitable model, following which we tested whether the two sampling methods influence the training model. To estimate the AGB of maize, we propose an improved method for extracting plant height from UAV images and a volumetric indicator (i.e., BIOVP). The results show that (1) the random forest model gave the most balanced results, with low error and a high ratio of the explained variance for both the training set and the test set. (2) Importance analysis of the predictors shows that BIOVP has the strongest effect on the AGB estimate across the four machine-learning models. (3) Comparing the plant heights calculated by the three methods with manual ground-based measurements shows that the proposed method increased the ratio of the explained variance and reduced errors. These results lead us to conclude that the combination of machine learning with UAV remote sensing is a promising alternative for estimating AGB. This work suggests that structural and spectral information can be considered simultaneously rather than separately when estimating biophysical crop parameters.
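
As a compact illustration of pairing an RF regressor with predictor-importance analysis, the sketch below computes permutation importance on a held-out set; the synthetic data stand in for the UAV-derived predictors.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=6, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)
print("R^2 on test:", rf.score(X_te, y_te))

# Permutation importance: drop in score when each predictor is shuffled.
result = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=0)
for i, (m, s) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"predictor {i}: {m:.3f} +/- {s:.3f}")
```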

Journal ArticleDOI
TL;DR: It is found that random forests trained on regions and years similar in growing degree days transfer to the target region with accuracies consistently exceeding 80%.

Journal ArticleDOI
TL;DR: A robust random forest method is proposed to analyze travel mode choices for examining the prediction capability and model interpretability of people’s travel behavior and results show that the random Forest method performs significantly better in travel mode choice prediction for higher accuracy and less computation cost.
Abstract: The analysis of travel mode choice is important in transportation planning and policy-making in order to understand and forecast travel demands. Research in the field of machine learning has been exploring the use of random forest as a framework within which many traffic and transport problems can be investigated. The random forest (RF) is a powerful method for constructing an ensemble of random decision trees. It de-correlates the decision trees in the ensemble via randomization that leads to an improvement of forecasting and reduces the variance when averaged over the trees. However, the usefulness of RF for travel mode choice behavior remains largely unexplored. This paper proposes a robust random forest method to analyze travel mode choices for examining the prediction capability and model interpretability. Using the travel diary data from Nanjing, China in 2013, enriched with variables on the built environment, the effects of different model parameters on the prediction performance are investigated. The comparison results show that the random forest method performs significantly better in travel mode choice prediction for higher accuracy and less computation cost. In addition, the proposed method estimates the relative importance of explanatory variables and how they relate to mode choices. This is fundamental for a better understanding and effective modeling of people's travel behavior.

Journal ArticleDOI
TL;DR: The overall experimental results suggest that HEFS performs best when it is integrated with Random Forest classifier, where the baseline features correctly distinguish 94.6% of phishing and legitimate websites using only 20.8% of the original features.

Posted Content
TL;DR: This work improves the interpretability of tree-based models through the first polynomial-time algorithm to compute optimal explanations based on game theory, and a new type of explanation that directly measures local feature interaction effects.
Abstract: Tree-based machine learning models such as random forests, decision trees, and gradient boosted trees are the most popular non-linear predictive models used in practice today, yet comparatively little attention has been paid to explaining their predictions. Here we significantly improve the interpretability of tree-based models through three main contributions: 1) The first polynomial time algorithm to compute optimal explanations based on game theory. 2) A new type of explanation that directly measures local feature interaction effects. 3) A new set of tools for understanding global model structure based on combining many local explanations of each prediction. We apply these tools to three medical machine learning problems and show how combining many high-quality local explanations allows us to represent global structure while retaining local faithfulness to the original model. These tools enable us to i) identify high magnitude but low frequency non-linear mortality risk factors in the general US population, ii) highlight distinct population sub-groups with shared risk characteristics, iii) identify non-linear interaction effects among risk factors for chronic kidney disease, and iv) monitor a machine learning model deployed in a hospital by identifying which features are degrading the model's performance over time. Given the popularity of tree-based machine learning models, these improvements to their interpretability have implications across a broad set of domains.
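
The shap Python package implements these tree explanations; a minimal sketch on a placeholder random forest (model and data are illustrative, not from the paper's case studies):

```python
import shap  # pip install shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)                 # polynomial-time TreeSHAP
shap_values = explainer.shap_values(X)                # local per-feature attributions
interactions = explainer.shap_interaction_values(X)   # local pairwise interaction effects
```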

Journal ArticleDOI
TL;DR: In this article, a clinical text classification paradigm using weak supervision and deep representation was proposed to reduce the human effort of labeled training data creation and feature engineering when applying machine learning to clinical text classification.
Abstract: Automatic clinical text classification is a natural language processing (NLP) technology that unlocks information embedded in clinical narratives. Machine learning approaches have been shown to be effective for clinical text classification tasks. However, a successful machine learning model usually requires extensive human effort to create labeled training data and conduct feature engineering. In this study, we propose a clinical text classification paradigm using weak supervision and deep representation to reduce these human efforts. We develop a rule-based NLP algorithm to automatically generate labels for the training data, and then use the pre-trained word embeddings as deep representation features for training machine learning models. Since the machine learning models are trained on labels generated by the automatic NLP algorithm, this training process is called weak supervision. We evaluate the effectiveness of the paradigm on two institutional case studies at Mayo Clinic: smoking status classification and proximal femur (hip) fracture classification, and one case study using a public dataset: the i2b2 2006 smoking status classification shared task. We test four widely used machine learning models, namely, Support Vector Machine (SVM), Random Forest (RF), Multilayer Perceptron Neural Networks (MLPNN), and Convolutional Neural Networks (CNN), using this paradigm. Precision, recall, and F1 score are used as metrics to evaluate performance. CNN achieves the best performance in both institutional tasks (F1 score: 0.92 for Mayo Clinic smoking status classification and 0.97 for fracture classification). We show that word embeddings significantly outperform tf-idf and topic modeling features in the paradigm, and that CNN captures additional patterns from the weak supervision compared to the rule-based NLP algorithms. We also observe two drawbacks of the proposed paradigm: CNN is more sensitive to the size of the training data, and the paradigm might not be effective for complex multiclass classification tasks. The proposed clinical text classification paradigm could reduce the human effort of labeled training data creation and feature engineering when applying machine learning to clinical text classification, by leveraging weak supervision and deep representation. The experiments validated the effectiveness of the paradigm on two institutional tasks and one shared clinical text classification task.
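
The weak-supervision pattern itself is easy to sketch: a rule labels unlabeled notes, and a classifier is trained on the noisy labels. The keyword rule and tf-idf features below are deliberately simplified stand-ins for the paper's rule-based NLP algorithm and pre-trained word embeddings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

notes = [
    "patient is a current smoker, one pack per day",
    "denies tobacco use; never smoked",
    "former smoker, quit 10 years ago",
    "no history of smoking or alcohol use",
]

def rule_label(text):
    # Hypothetical keyword rule standing in for the rule-based NLP labeler.
    return int("smoker" in text and "former" not in text)

weak_labels = [rule_label(t) for t in notes]   # noisy, automatically generated labels
X = TfidfVectorizer().fit_transform(notes)     # stand-in for embedding features
clf = LinearSVC().fit(X, weak_labels)          # supervised model trained on weak labels
```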

Journal ArticleDOI
03 Mar 2019-Sensors
TL;DR: The experimental results indicate that the proposed method achieves high accuracy in bearing fault diagnosis under complex operational conditions and is superior to traditional methods and standard deep learning methods.
Abstract: Recently, research on data-driven bearing fault diagnosis methods has attracted increasing attention due to the availability of massive condition monitoring data. However, most existing methods still have difficulties in learning representative features from the raw data. In addition, they assume that the feature distribution of training data in the source domain is the same as that of testing data in the target domain, which is invalid in many real-world bearing fault diagnosis problems. Since deep learning has the automatic feature extraction ability and ensemble learning can improve the accuracy and generalization performance of classifiers, this paper proposes a novel bearing fault diagnosis method based on deep convolutional neural network (CNN) and random forest (RF) ensemble learning. Firstly, time domain vibration signals are converted into two-dimensional (2D) gray-scale images containing abundant fault information by continuous wavelet transform (CWT). Secondly, a CNN model based on LeNet-5 is built to automatically extract multi-level features that are sensitive to the detection of faults from the images. Finally, the multi-level features containing both local and global information are utilized to diagnose bearing faults by the ensemble of multiple RF classifiers. In particular, low-level features containing local characteristics and accurate details in the hidden layers are combined to improve the diagnostic performance. The effectiveness of the proposed method is validated by two sets of bearing data collected from a Reliance Electric motor and a rolling mill, respectively. The experimental results indicate that the proposed method achieves high accuracy in bearing fault diagnosis under complex operational conditions and is superior to traditional methods and standard deep learning methods.
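
A hedged sketch of the pipeline's core step in PyTorch: a LeNet-5-style CNN (assumed already trained on CWT images) supplies flattened convolutional features to a random forest. The paper's multi-level feature fusion and ensemble of multiple RFs are reduced here to a single feature map and a single forest, and all data are placeholders.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import RandomForestClassifier

class LeNetLike(nn.Module):
    """LeNet-5-style CNN; CWT images of vibration signals are assumed precomputed."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(nn.Flatten(),
                                        nn.Linear(16 * 13 * 13, n_classes))

    def forward(self, x):
        return self.classifier(self.features(x))

cnn = LeNetLike()
# ... train cnn on labelled CWT images here ...

images = torch.randn(100, 1, 64, 64)        # placeholder batch of CWT images
labels = np.random.randint(0, 4, size=100)  # placeholder fault labels
with torch.no_grad():
    feats = cnn.features(images).flatten(1).numpy()  # CNN-learned features
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(feats, labels)
```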

Journal ArticleDOI
TL;DR: In this article, the effects of spatial autocorrelation on hyperparameter tuning and performance estimation by comparing several widely used machine-learning algorithms such as boosted regression trees (BRT), k-nearest neighbor (KNN), random forest (RF) and support vector machine (SVM) with traditional parametric algorithms, such as logistic regression (GLM) and semi-parametric ones like generalized additive models (GAM) in terms of predictive performance.

Journal ArticleDOI
TL;DR: The main goal of the present paper is to use deep convolutional neural networks (CNNs) and a random forest to automatically optimize feature selection and classification.
Abstract: Identifying a core set of features is one of the most important steps in the development of an automated seizure detector. In most of the published studies describing features and seizure classifiers, the features were hand-engineered, which may not be optimal. The main goal of the present paper is to use deep convolutional neural networks (CNNs) and a random forest to automatically optimize feature selection and classification. The input of the proposed classifier is raw multi-channel EEG and the output is the class label: seizure/nonseizure. By training this network, the required features are optimized, while fitting a nonlinear classifier on the features. After training the network with EEG recordings of 26 neonates, the five end layers performing the classification were replaced with a random forest classifier in order to improve the performance. This resulted in a false alarm rate of 0.9 per hour and a seizure detection rate of 77% using a test set of EEG recordings of 22 neonates that also included dubious seizures. The newly proposed CNN classifier outperformed three data-driven feature-based approaches and performed similarly to a previously developed heuristic method.

Journal ArticleDOI
TL;DR: The performance of the proposed ensemble models is evaluated on Spanish electricity consumption data covering 10 years at a 10-minute frequency, and the results show that both the dynamic and static ensembles perform well, outperforming the individual ensemble members they combine.
Abstract: This paper presents ensemble models for forecasting big data time series. An ensemble composed of three methods (decision tree, gradient boosted trees and random forest) is proposed due to the good results these methods have achieved in previous big data applications. The weights of the ensemble are computed by a weighted least squares method. Two strategies related to the weight update are considered, leading to a static or dynamic ensemble model. The predictions of each ensemble member are obtained by dividing the forecasting problem into h forecasting sub-problems, one for each value of the prediction horizon. These sub-problems have been solved using machine learning algorithms from the big data engine Apache Spark, ensuring the scalability of our methodology. The performance of the proposed ensemble models is evaluated on Spanish electricity consumption data covering 10 years at a 10-minute frequency. The results showed that both the dynamic and static ensembles performed well, outperforming the individual ensemble members they combine. The dynamic ensemble was the most accurate model, achieving an MRE of 2%, which is a very promising result for the prediction of big time series. The proposed ensembles are also evaluated on two years of Australian solar power data measured at a 30-minute frequency. The results compare favorably with Artificial Neural Networks, Pattern Sequence-based Forecasting and Deep Learning.
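
The static weight computation reduces to an ordinary least-squares fit of the observed series on the members' forecasts. A minimal sketch with placeholder forecasts follows; the dynamic variant would re-estimate the weights over a sliding window.

```python
import numpy as np

# preds: (n, 3) matrix of member forecasts (e.g., decision tree, GBT, RF)
# on a validation window; y: observed series values. Placeholder data here.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
preds = y[:, None] + rng.normal(0, [0.5, 0.3, 0.4], size=(200, 3))

w, *_ = np.linalg.lstsq(preds, y, rcond=None)  # static least-squares weights
combined = preds @ w                           # weighted ensemble forecast
print("weights:", w, "MSE:", np.mean((combined - y) ** 2))
```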

Journal ArticleDOI
TL;DR: Interestingly, the various machine learning algorithms used in this study yielded similar accuracies; hence, these methods could be used as alternative predictive tools in breast cancer survival studies, particularly in the Asian region.
Abstract: Breast cancer is one of the most common diseases in women worldwide. Many studies have been conducted to predict the survival indicators, however most of these analyses were predominantly performed using basic statistical methods. As an alternative, this study used machine learning techniques to build models for detecting and visualising significant prognostic indicators of breast cancer survival rate. A large hospital-based breast cancer dataset retrieved from the University Malaya Medical Centre, Kuala Lumpur, Malaysia (n = 8066) with diagnosis information between 1993 and 2016 was used in this study. The dataset contained 23 predictor variables and one dependent variable, which referred to the survival status of the patients (alive or dead). In determining the significant prognostic factors of breast cancer survival rate, prediction models were built using decision tree, random forest, neural networks, extreme boost, logistic regression, and support vector machine. Next, the dataset was clustered based on the receptor status of breast cancer patients identified via immunohistochemistry to perform advanced modelling using random forest. Subsequently, the important variables were ranked via variable selection methods in random forest. Finally, decision trees were built and validation was performed using survival analysis. In terms of both model accuracy and calibration measure, all algorithms produced close outcomes, with the lowest obtained from decision tree (accuracy = 79.8%) and the highest from random forest (accuracy = 82.7%). The important variables identified in this study were cancer stage classification, tumour size, number of total axillary lymph nodes removed, number of positive lymph nodes, types of primary treatment, and methods of diagnosis. Interestingly the various machine learning algorithms used in this study yielded close accuracy hence these methods could be used as alternative predictive tools in the breast cancer survival studies, particularly in the Asian region. The important prognostic factors influencing survival rate of breast cancer identified in this study, which were validated by survival curves, are useful and could be translated into decision support tools in the medical domain.

Journal ArticleDOI
TL;DR: A Multi-Class Combined performance metric is proposed to compare various multi-class and binary classification systems by incorporating FAR, DR, Accuracy, and class distribution parameters; a uniform-distribution-based balancing approach is also developed to handle the imbalanced distribution of the minority class instances in the CICIDS2017 network intrusion dataset.
Abstract: The security of networked systems has become a critical universal issue that influences individuals, enterprises and governments. The rate of attacks against networked systems has increased dramatically, and the tactics used by the attackers are continuing to evolve. Intrusion detection is one of the solutions against these attacks. A common and effective approach for designing Intrusion Detection Systems (IDS) is Machine Learning. The performance of an IDS is significantly improved when the features are more discriminative and representative. This study uses two feature dimensionality reduction approaches: (i) Auto-Encoder (AE), an instance of deep learning, and (ii) Principal Component Analysis (PCA). The resulting low-dimensional features from both techniques are then used to build various classifiers such as Random Forest (RF), Bayesian Network, Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) for designing an IDS. The experimental findings with low-dimensional features in binary and multi-class classification show better performance in terms of Detection Rate (DR), F-Measure, False Alarm Rate (FAR), and Accuracy. This research effort is able to reduce the CICIDS2017 dataset's feature dimensions from 81 to 10, while maintaining a high accuracy of 99.6% in multi-class and binary classification. Furthermore, in this paper, we propose a Multi-Class Combined performance metric, Combined_Mc, with respect to class distribution to compare various multi-class and binary classification systems through incorporating FAR, DR, Accuracy, and class distribution parameters. In addition, we developed a uniform-distribution-based balancing approach to handle the imbalanced distribution of the minority class instances in the CICIDS2017 network intrusion dataset.
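
A minimal sketch of the PCA branch on placeholder data: scale, project 81 features down to 10 components, then classify with a random forest. The auto-encoder branch and the combined metric are omitted.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 81 synthetic features, mirroring the CICIDS2017 dimensionality.
X, y = make_classification(n_samples=2000, n_features=81, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),  # 81 -> 10 dimensions
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
print("CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())
```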

Journal ArticleDOI
TL;DR: The proposed two-stage method, which selects proper variables and simplifies parameter settings to predict HPCCS, shows a strong generalization capacity; the prediction performance of the model is better when the input variables are expressed as absolute mass.

Journal ArticleDOI
TL;DR: This work proposes to conduct likelihood-free Bayesian inference about parameters with no prior selection of the relevant components of the summary statistics, bypassing the derivation of the associated tolerance level, using the random forest methodology of Breiman (2001).
Abstract: Approximate Bayesian Computation (ABC) has grown into a standard methodology to handle Bayesian inference in models associated with intractable likelihood functions. Most ABC implementations require the selection of a summary statistic as the data itself is too large or too complex to be compared to simulated realisations from the assumed model. The dimension of this statistic is generally constrained to be close to the dimension of the model parameter for efficiency reasons. Furthermore, the tolerance level that governs the acceptance or rejection of parameter values needs to be calibrated and the range of calibration techniques available so far is mostly based on asymptotic arguments. We propose here to conduct Bayesian inference based on an arbitrarily large vector of summary statistics without imposing a selection of the relevant components and bypassing the derivation of a tolerance. The approach relies on the random forest methodology of Breiman (2001) when applied to regression. We advocate the derivation of a new random forest for each component of the parameter vector, a tool from which an approximation to the marginal posterior distribution can be derived. Correlations between parameter components are handled by separate random forests. This technology offers significant gains in terms of robustness to the choice of the summary statistics and of computing time, when compared with more standard ABC solutions.
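
The core of the approach can be sketched in a few lines: build a reference table by simulating summaries under parameters drawn from the prior, then train one regression forest per parameter component and predict at the observed summaries. The toy normal model below is a placeholder for a model with an intractable likelihood.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def simulate(theta, n=100):
    """Toy model: data ~ Normal(theta, 1); returns a vector of summary statistics."""
    x = rng.normal(theta, 1, size=n)
    return [x.mean(), x.std(), np.median(x), x.min(), x.max()]

# Reference table: parameters from the prior, paired with simulated summaries.
thetas = rng.uniform(-5, 5, size=5000)
summaries = np.array([simulate(t) for t in thetas])

# One regression forest per parameter component, as the paper advocates
# (here the parameter is one-dimensional, so a single forest suffices).
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(summaries, thetas)
obs = simulate(2.0)  # "observed" summaries (truth: theta = 2)
print("posterior expectation estimate:", rf.predict([obs])[0])
```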

Journal ArticleDOI
TL;DR: This paper proposes a novel classification model based on heuristic features extracted from the URL, source code, and third-party services to overcome the disadvantages of existing anti-phishing techniques; the model outperformed the baseline methods and also detected zero-day phishing attacks.
Abstract: Phishing is a cyber-attack that targets naive online users, tricking them into revealing sensitive information such as usernames, passwords, social security numbers, or credit card numbers. Attackers fool Internet users by masking a webpage as a trustworthy or legitimate page to retrieve personal information. Many anti-phishing solutions, such as blacklist- or whitelist-based, heuristic, and visual similarity-based methods, have been proposed to date, but online users are still getting trapped into revealing sensitive information on phishing websites. In this paper, we propose a novel classification model based on heuristic features that are extracted from the URL, source code, and third-party services to overcome the disadvantages of existing anti-phishing techniques. Our model has been evaluated using eight different machine learning algorithms, of which the Random Forest (RF) algorithm performed the best with an accuracy of 99.31%. The experiments were repeated with different (orthogonal and oblique) random forest classifiers to find the best classifier for phishing website detection. Principal component analysis Random Forest (PCA-RF) performed the best of all oblique Random Forests (oRFs) with an accuracy of 99.55%. We have also tested our model with and without third-party-based features to determine the effectiveness of third-party services in the classification of suspicious websites. We also compared our results with the baseline models (CANTINA and CANTINA+). Our proposed technique outperformed these methods and also detected zero-day phishing attacks.

Journal ArticleDOI
TL;DR: In this article, the authors present a comparison of a large collection of 77 popular regression models belonging to 19 families: linear and generalized linear models, generalized additive models, least squares, projection methods, LASSO and ridge regression, Bayesian models, Gaussian processes, nearest neighbors, regression trees and rules, neural networks, bagging and boosting, deep learning, and support vector regression.

Journal ArticleDOI
TL;DR: This work proposes a subsampling approach that can be used to estimate the variance of VIMP and to construct confidence intervals, and finds that the delete-d jackknife variance estimator, a close cousin, is especially effective under low subsampling rates due to its bias-correction properties.
Abstract: Random forests are a popular nonparametric tree ensemble procedure with broad applications to data analysis. While its widespread popularity stems from its prediction performance, an equally important feature is that it provides a fully nonparametric measure of variable importance (VIMP). A current limitation of VIMP, however, is that no systematic method exists for estimating its variance. As a solution, we propose a subsampling approach that can be used to estimate the variance of VIMP and for constructing confidence intervals. The method is general enough that it can be applied to many useful settings, including regression, classification, and survival problems. Using extensive simulations, we demonstrate the effectiveness of the subsampling estimator and in particular find that the delete-d jackknife variance estimator, a close cousin, is especially effective under low subsampling rates due to its bias correction properties. These 2 estimators are highly competitive when compared with the .164 bootstrap estimator, a modified bootstrap procedure designed to deal with ties in out-of-sample data. Most importantly, subsampling is computationally fast, thus making it especially attractive for big data settings.
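
A simplified sketch of the subsampling idea on placeholder data: refit the forest on many small subsamples drawn without replacement, rescale the variance of the replicated importances by b/n, and form normal-approximation intervals. Gini importance stands in here for the permutation VIMP the paper targets, and the scaling is a textbook simplification of the paper's estimator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
n, b = len(y), 200  # subsample size b much smaller than n
rng = np.random.default_rng(0)

vimps = []
for _ in range(100):  # subsampled replicates
    idx = rng.choice(n, size=b, replace=False)
    rf = RandomForestClassifier(n_estimators=100).fit(X[idx], y[idx])
    vimps.append(rf.feature_importances_)
vimps = np.array(vimps)

# Subsampling variance estimate and normal-approximation 95% CIs per variable.
full = RandomForestClassifier(n_estimators=500,
                              random_state=0).fit(X, y).feature_importances_
half = 1.96 * np.sqrt((b / n) * vimps.var(axis=0))
for j in range(X.shape[1]):
    print(f"x{j}: {full[j]:.3f} [{full[j] - half[j]:.3f}, {full[j] + half[j]:.3f}]")
```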

Journal ArticleDOI
TL;DR: Alternative data treatment methods, such as support vector machine (SVM), classification and regression tree (CART) and random forest (RF), show great potential and offer more advantages than conventional ones.

Journal ArticleDOI
TL;DR: Two computer algorithms are presented: one designed for segmentation of nuclei and the other for classification of whole slide tissue images; both were evaluated in the MICCAI 2017 Digital Pathology challenge.
Abstract: High-resolution microscopy images of tissue specimens provide detailed information about the morphology of normal and diseased tissue. Image analysis of tissue morphology can help cancer researchers develop a better understanding of cancer biology. Segmentation of nuclei and classification of tissue images are two common tasks in tissue image analysis. Development of accurate and efficient algorithms for these tasks is a challenging problem because of the complexity of tissue morphology and tumor heterogeneity. In this paper we present two computer algorithms: one designed for segmentation of nuclei and the other for classification of whole slide tissue images. The segmentation algorithm implements a multiscale deep residual aggregation network to accurately segment nuclear material and then separate clumped nuclei into individual nuclei. The classification algorithm initially carries out patch-level classification via a deep learning method, then patch-level statistical and morphological features are used as input to a random forest regression model for whole slide image classification. The segmentation and classification algorithms were evaluated in the MICCAI 2017 Digital Pathology challenge. The segmentation algorithm achieved an accuracy score of 0.78. The classification algorithm achieved an accuracy score of 0.81. These scores were the highest in the challenge.

Posted Content
02 Apr 2019
TL;DR: A novel methodology combining the benefits of correlation-based feature selection and the bat algorithm with an ensemble classifier based on C4.5, Random Forest, and Forest by Penalizing Attributes, which can classify both common and rare types of attacks with high accuracy and efficiency.
Abstract: Intrusion detection system (IDS) is one of extensively used techniques in a network topology to safeguard the integrity and availability of sensitive assets in the protected systems. Although many supervised and unsupervised learning approaches from the field of machine learning have been used to increase the efficacy of IDSs, it is still a problem for existing intrusion detection algorithms to achieve good performance. First, lots of redundant and irrelevant data in high-dimensional datasets interfere with the classification process of an IDS. Second, an individual classifier may not perform well in the detection of each type of attacks. Third, many models are built for stale datasets, making them less adaptable for novel attacks. Thus, we propose a new intrusion detection framework in this paper, and this framework is based on the feature selection and ensemble learning techniques. In the first step, a heuristic algorithm called CFS-BA is proposed for dimensionality reduction, which selects the optimal subset based on the correlation between features. Then, we introduce an ensemble approach that combines C4.5, Random Forest (RF), and Forest by Penalizing Attributes (Forest PA) algorithms. Finally, a voting technique is used to combine the probability distributions of the base learners for attack recognition. The experimental results, using the NSL-KDD, AWID, and CIC-IDS2017 datasets, reveal that the proposed CFS-BA-Ensemble method is able to exhibit better performance than other related and state-of-the-art approaches under several metrics.