
Showing papers on "Random forest published in 2020"


Journal ArticleDOI
TL;DR: An explanation method for trees is presented that enables the computation of optimal local explanations for individual predictions, and the authors demonstrate their method on three medical datasets.
Abstract: Tree-based machine learning models such as random forests, decision trees and gradient boosted trees are popular nonlinear predictive models, yet comparatively little attention has been paid to explaining their predictions. Here we improve the interpretability of tree-based models through three main contributions. (1) A polynomial time algorithm to compute optimal explanations based on game theory. (2) A new type of explanation that directly measures local feature interaction effects. (3) A new set of tools for understanding global model structure based on combining many local explanations of each prediction. We apply these tools to three medical machine learning problems and show how combining many high-quality local explanations allows us to represent global structure while retaining local faithfulness to the original model. These tools enable us to (1) identify high-magnitude but low-frequency nonlinear mortality risk factors in the US population, (2) highlight distinct population subgroups with shared risk characteristics, (3) identify nonlinear interaction effects among risk factors for chronic kidney disease and (4) monitor a machine learning model deployed in a hospital by identifying which features are degrading the model’s performance over time. Given the popularity of tree-based machine learning models, these improvements to their interpretability have implications across a broad set of domains. Tree-based machine learning models are widely used in domains such as healthcare, finance and public services. The authors present an explanation method for trees that enables the computation of optimal local explanations for individual predictions, and demonstrate their method on three medical datasets.
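
The explanation approach described here is available in the open-source shap Python package as TreeExplainer. Purely for orientation, the sketch below applies it to a public stand-in dataset and a generic random forest; the paper's mortality, kidney-disease and hospital models are not reproduced.

    import shap
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor

    # Public stand-in data; the paper's case studies use mortality, kidney-disease
    # and hospital datasets that are not reproduced here.
    X, y = load_diabetes(return_X_y=True, as_frame=True)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

    explainer = shap.TreeExplainer(model)                 # exact, polynomial-time Shapley values for trees
    shap_values = explainer.shap_values(X)                # one local explanation per prediction
    interactions = explainer.shap_interaction_values(X)   # local feature-interaction effects
    shap.summary_plot(shap_values, X)                     # global structure from many local explanations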

2,548 citations


Journal ArticleDOI
TL;DR: Two of the prominent dimensionality reduction techniques, Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA), are investigated on four popular Machine Learning (ML) algorithms using the publicly available Cardiotocography dataset from the University of California, Irvine Machine Learning Repository, showing that PCA outperforms LDA in all the measures.
Abstract: Due to digitization, a huge volume of data is being generated across several sectors such as healthcare, production, sales, IoT devices, the Web, and organizations. Machine learning algorithms are used to uncover patterns among the attributes of this data. Hence, they can be used to make predictions that medical practitioners and people at the managerial level can use to make executive decisions. Not all the attributes in the datasets generated are important for training the machine learning algorithms. Some attributes might be irrelevant and some might not affect the outcome of the prediction. Ignoring or removing these irrelevant or less important attributes reduces the burden on machine learning algorithms. In this work, two of the prominent dimensionality reduction techniques, Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA), are investigated on four popular Machine Learning (ML) algorithms, Decision Tree Induction, Support Vector Machine (SVM), Naive Bayes Classifier and Random Forest Classifier, using the publicly available Cardiotocography (CTG) dataset from the University of California, Irvine Machine Learning Repository. The experimentation results show that PCA outperforms LDA in all the measures. Also, the performance of the Decision Tree and Random Forest classifiers examined is not affected much by using PCA and LDA. To further analyze the performance of PCA and LDA, the experimentation is carried out on Diabetic Retinopathy (DR) and Intrusion Detection System (IDS) datasets. Experimentation results show that ML algorithms with PCA produce better results when the dimensionality of the datasets is high. When the dimensionality of the datasets is low, it is observed that the ML algorithms without dimensionality reduction yield better results.
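
For readers who want to reproduce this kind of comparison, the following is a minimal scikit-learn sketch (not from the paper) that places PCA or LDA in front of the same four classifier families; the dataset, component counts and parameter settings are placeholders rather than those of the CTG study.

    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    # Public stand-in data; the paper uses the UCI Cardiotocography (CTG) dataset.
    X, y = load_breast_cancer(return_X_y=True)

    reducers = {"PCA": PCA(n_components=5),
                "LDA": LinearDiscriminantAnalysis(n_components=1)}   # binary target: at most 1 component
    classifiers = {"Decision Tree": DecisionTreeClassifier(random_state=0),
                   "SVM": SVC(),
                   "Naive Bayes": GaussianNB(),
                   "Random Forest": RandomForestClassifier(random_state=0)}

    for r_name, reducer in reducers.items():
        for c_name, clf in classifiers.items():
            pipe = make_pipeline(StandardScaler(), reducer, clf)
            acc = cross_val_score(pipe, X, y, cv=5).mean()
            print(f"{r_name} + {c_name}: {acc:.3f}")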

414 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present a new visualization approach that is called accumulated local effects plots, which do not require this unreliable extrapolation with correlated predictors and are far less computationally expensive than partial dependence plots.
Abstract: In many supervised learning applications, understanding and visualizing the effects of the predictor variables on the predicted response is of paramount importance. A shortcoming of black box supervised learning models (e.g. complex trees, neural networks, boosted trees, random forests, nearest neighbours, local kernel‐weighted methods and support vector regression) in this regard is their lack of interpretability or transparency. Partial dependence plots, which are the most popular approach for visualizing the effects of the predictors with black box supervised learning models, can produce erroneous results if the predictors are strongly correlated, because they require extrapolation of the response at predictor values that are far outside the multivariate envelope of the training data. As an alternative to partial dependence plots, we present a new visualization approach that we term accumulated local effects plots, which do not require this unreliable extrapolation with correlated predictors. Moreover, accumulated local effects plots are far less computationally expensive than partial dependence plots. We also provide an R package ALEPlot as supplementary material to implement our proposed method.
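
The authors provide their implementation as the R package ALEPlot. Purely for illustration, the Python sketch below computes a rough first-order ALE curve by hand (quantile bins, averaged local prediction differences, accumulation and centering); it is not the authors' code and omits refinements such as count-weighted centering.

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor

    def ale_1d(model, X, j, n_bins=20):
        """Rough first-order ALE curve for feature j of a fitted regressor."""
        x = X[:, j]
        edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)))
        local_effects = []
        for k, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
            last = k == len(edges) - 2
            in_bin = (x >= lo) & (x <= hi) if last else (x >= lo) & (x < hi)
            if not in_bin.any():
                local_effects.append(0.0)
                continue
            X_lo, X_hi = X[in_bin].copy(), X[in_bin].copy()
            X_lo[:, j], X_hi[:, j] = lo, hi            # move only feature j to the bin edges
            local_effects.append(np.mean(model.predict(X_hi) - model.predict(X_lo)))
        ale = np.cumsum(local_effects)                 # accumulate the averaged local differences
        return edges[1:], ale - ale.mean()             # roughly center the curve

    X, y = load_diabetes(return_X_y=True)
    rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    grid, effects = ale_1d(rf, X, j=2)                 # ALE curve for the BMI feature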

410 citations


Journal ArticleDOI
TL;DR: This article overviews the random forest algorithm, illustrates its use with two examples, and introduces a corresponding new command, rforest, which is used to predict the log-scaled number of shares of online news articles.
Abstract: Random forests (Breiman, 2001, Machine Learning 45: 5–32) is a statistical- or machine-learning algorithm for prediction. In this article, we introduce a corresponding new command, rforest. We over...

290 citations


Journal ArticleDOI
TL;DR: A meta-analysis of 251 peer-reviewed journal papers relevant to remote sensing image classification is presented, together with a comparative analysis of the performance of RF and SVM classification against various parameters.
Abstract: Several machine-learning algorithms have been proposed for remote sensing image classification during the past two decades. Among these machine learning algorithms, Random Forest (RF) and Support Vector Machines (SVM) have drawn attention to image classification in several remote sensing applications. This article reviews RF and SVM concepts relevant to remote sensing image classification and applies a meta-analysis of 251 peer-reviewed journal papers. A database with more than 40 quantitative and qualitative fields was constructed from these reviewed papers. The meta-analysis mainly focuses on 1) the analysis regarding the general characteristics of the studies, such as geographical distribution, frequency of the papers considering time, journals, application domains, and remote sensing software packages used in the case studies, and 2) a comparative analysis regarding the performances of RF and SVM classification against various parameters, such as data type, RS applications, spatial resolution, and the number of extracted features in the feature engineering step. The challenges, recommendations, and potential directions for future research are also discussed in detail. Moreover, a summary of the results is provided to aid researchers to customize their efforts in order to achieve the most accurate results based on their thematic applications.

275 citations


Journal ArticleDOI
TL;DR: This paper adopts Random Forest to select the important features for classification and compares results on each dataset with and without feature selection by the RF-based methods varImp(), Boruta, and Recursive Feature Elimination, to obtain the best percentage accuracy and kappa.
Abstract: Feature selection becomes prominent, especially in data sets with many variables and features. It eliminates unimportant variables and improves the accuracy as well as the performance of classification. Random Forest has emerged as a quite useful algorithm that can handle the feature selection issue even with a higher number of variables. In this paper, we use three popular datasets with a higher number of variables (Bank Marketing, Car Evaluation Database, Human Activity Recognition Using Smartphones) to conduct the experiment. There are four main reasons why feature selection is essential: to simplify the model by reducing the number of parameters, to decrease the training time, to reduce overfitting by enhancing generalization, and to avoid the curse of dimensionality. Besides, we evaluate and compare the accuracy and performance of each classification model, such as Random Forest (RF), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Linear Discriminant Analysis (LDA). The model with the highest accuracy is considered the best classifier. Practically, this paper adopts Random Forest to select the important features for classification. Our experiments clearly show the comparative study of the RF algorithm from different perspectives. Furthermore, we compare the results on each dataset with and without feature selection by the RF methods varImp(), Boruta, and Recursive Feature Elimination (RFE) to get the best percentage accuracy and kappa. Experimental results demonstrate that Random Forest achieves a better performance in all experiment groups.
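
varImp(), Boruta and RFE are R-side tools (caret and the Boruta package). As a loosely analogous Python sketch, the snippet below compares a random forest trained on all features, on the top impurity-based importances, and on an RFE-selected subset; the data and feature counts are synthetic placeholders, not the paper's datasets.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE
    from sklearn.model_selection import cross_val_score

    # Synthetic placeholder data; the paper uses the Bank Marketing, Car Evaluation
    # and Human Activity Recognition datasets.
    X, y = make_classification(n_samples=1000, n_features=30, n_informative=6, random_state=0)

    rf = RandomForestClassifier(n_estimators=200, random_state=0)

    # Importance ranking (loosely analogous to caret's varImp() on a fitted RF)
    rf.fit(X, y)
    top = np.argsort(rf.feature_importances_)[::-1][:10]

    # Recursive Feature Elimination driven by the same RF
    rfe = RFE(rf, n_features_to_select=10).fit(X, y)

    for name, cols in [("all features", np.arange(X.shape[1])),
                       ("top RF importances", top),
                       ("RFE subset", np.where(rfe.support_)[0])]:
        acc = cross_val_score(RandomForestClassifier(random_state=0), X[:, cols], y, cv=5).mean()
        print(f"{name}: accuracy {acc:.3f}")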

271 citations


Journal ArticleDOI
TL;DR: The results indicated that the RF method is an efficient and reliable model in flood susceptibility assessment, with the highest AUC values, positive predictive rate, negative predictive rates, specificity, and accuracy for the training and validation datasets.

256 citations


Journal ArticleDOI
TL;DR: The use of ensembles is recommended to forecast agricultural commodity prices one month ahead, since more assertive performance is observed, which increases the accuracy of the constructed model and reduces decision-making risk.

244 citations


Journal ArticleDOI
TL;DR: A new intrusion detection framework based on feature selection and ensemble learning techniques is proposed, and this framework exhibits better performance than other related and state-of-the-art approaches under several metrics.

244 citations


Journal ArticleDOI
Sheng Shen1, Mohammadkazem Sadoughi1, Meng Li1, Zhengdao Wang1, Chao Hu1 
TL;DR: The verification and comparison results demonstrate that the proposed DCNN-ETL method can produce a higher accuracy and robustness than these other data-driven methods in estimating the capacities of the Li-ion cells in the target task.

218 citations


Journal ArticleDOI
01 Dec 2020
TL;DR: The experimental results show that the BBC news text classification model obtains satisfactory results with the algorithms tested on the data set, and the classifier that scores highest is identified as the best machine learning algorithm for the BBC news data set.
Abstract: In the current generation, a huge number of textual documents is generated, and there is an urgent need to organize them in a proper structure so that classification can be performed and categories can be properly defined. The key technology for gaining insight into text information and organizing it is known as text classification. Documents are then assigned to classes by determining the text type of their content. Based on the different machine learning algorithms used in the current paper, the text classification system is divided into four sections, namely text preprocessing, text representation, classifier implementation and classification. In this paper, a BBC news text classification system is designed. In the classifier implementation section, the authors separately chose and compared logistic regression, random forest and K-nearest neighbour as the classification algorithms. These classifiers were then tested, analysed and compared with each other, and a conclusion was drawn. The experimental results show that the BBC news text classification model obtains satisfactory results with the algorithms tested on the data set. The authors present the comparison based on five parameters, namely precision, accuracy, F1-score, support and confusion matrix. The classifier that scores highest on these parameters is identified as the best machine learning algorithm for the BBC news data set.
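
A minimal sketch of such a pipeline with scikit-learn is shown below: a TF-IDF representation feeding each of the three classifiers in turn. The toy corpus and labels are placeholders assumed for illustration, not the BBC news dataset used in the paper.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    # Tiny placeholder corpus; the paper uses the BBC news dataset with five categories.
    texts = ["stocks rallied on strong quarterly earnings",
             "the striker scored twice in the cup final",
             "new phone unveiled with a faster chip",
             "parliament debates the annual budget",
             "film wins top award at the festival",
             "the club signed a promising young goalkeeper"]
    labels = ["business", "sport", "tech", "politics", "entertainment", "sport"]

    for clf in (LogisticRegression(max_iter=1000),
                RandomForestClassifier(random_state=0),
                KNeighborsClassifier(n_neighbors=3)):
        pipe = make_pipeline(TfidfVectorizer(), clf)    # bag-of-words TF-IDF representation
        pipe.fit(texts, labels)
        print(type(clf).__name__, pipe.predict(["the midfielder joined the club on loan"]))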

Journal ArticleDOI
TL;DR: A robust framework for diabetes prediction is proposed in which outlier rejection, filling of missing values, data standardization, feature selection, K-fold cross-validation, and different Machine Learning (ML) classifiers are employed, together with a weighted ensembling of the ML models to improve the prediction of diabetes.
Abstract: Diabetes, also known as chronic illness, is a group of metabolic diseases due to a high level of sugar in the blood over a long period. The risk factor and severity of diabetes can be reduced significantly if precise early prediction is possible. Robust and accurate prediction of diabetes is highly challenging due to the limited number of labeled data and the presence of outliers (or missing values) in the diabetes datasets. In this work, we propose a robust framework for diabetes prediction in which outlier rejection, filling of missing values, data standardization, feature selection, K-fold cross-validation, different Machine Learning (ML) classifiers (k-nearest Neighbour, Decision Trees, Random Forest, AdaBoost, Naive Bayes, and XGBoost) and a Multilayer Perceptron (MLP) are employed. A weighted ensembling of the different ML models is also proposed to improve the prediction of diabetes, where the weights are estimated from the corresponding Area Under the ROC Curve (AUC) of each ML model. AUC is chosen as the performance metric, which is then maximized during hyperparameter tuning using the grid search technique. All the experiments were conducted under the same experimental conditions using the Pima Indian Diabetes Dataset. Across all the extensive experiments, our proposed ensembling classifier is the best performing classifier, with sensitivity, specificity, false omission rate, diagnostic odds ratio, and AUC of 0.789, 0.934, 0.092, 66.234, and 0.950 respectively, which outperforms the state-of-the-art results by 2.00% in AUC. Our proposed framework for diabetes prediction outperforms the other methods discussed in the article and can lead to better performance in diabetes prediction on the same dataset. Our source code for diabetes prediction is made publicly available.
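
The AUC-weighted ensembling can be sketched as follows: each base model's out-of-sample predicted probabilities are averaged with weights proportional to its AUC. The snippet uses synthetic data, a single validation split and a reduced model list for brevity, whereas the paper estimates the weights from cross-validated AUC on the Pima dataset.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    # Synthetic stand-in for the Pima Indian Diabetes data.
    X, y = make_classification(n_samples=800, weights=[0.65, 0.35], random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

    models = [GaussianNB(), RandomForestClassifier(random_state=0), AdaBoostClassifier(random_state=0)]
    probas, weights = [], []
    for m in models:
        p = m.fit(X_tr, y_tr).predict_proba(X_val)[:, 1]
        probas.append(p)
        weights.append(roc_auc_score(y_val, p))          # each model is weighted by its AUC

    weights = np.array(weights) / np.sum(weights)
    ensemble = np.average(np.vstack(probas), axis=0, weights=weights)
    print("weighted-ensemble AUC:", round(roc_auc_score(y_val, ensemble), 3))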

Journal ArticleDOI
TL;DR: A data fusion enabled ensemble approach is proposed to work with medical data obtained from BSNs in a fog computing environment, and the obtained results are promising, reaching 98% accuracy when the tree depth is 15, the number of estimators is 40, and 8 features are considered for the prediction task.

Journal ArticleDOI
TL;DR: Results show that for the continuous data, RNN and LSTM outperform the other prediction models by a considerable margin; in the binary data evaluation these deep learning methods remain the best, but the difference becomes smaller because of the noticeable improvement of the other models' performance in the second approach.
Abstract: The nature of stock market movement has always been ambiguous for investors because of various influential factors. This study aims to significantly reduce the risk of trend prediction with machine learning and deep learning algorithms. Four stock market groups, namely diversified financials, petroleum, non-metallic minerals and basic metals from the Tehran stock exchange, are chosen for experimental evaluations. This study compares nine machine learning models (Decision Tree, Random Forest, Adaptive Boosting (Adaboost), eXtreme Gradient Boosting (XGBoost), Support Vector Classifier (SVC), Naive Bayes, K-Nearest Neighbors (KNN), Logistic Regression and Artificial Neural Network (ANN)) and two powerful deep learning methods (Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM)). Ten technical indicators from ten years of historical data are used as input values, and two approaches are considered for employing them: first, calculating the indicators from stock trading values as continuous data, and second, converting the indicators to binary data before use. Each prediction model is evaluated by three metrics for each input approach. The evaluation results indicate that for the continuous data, RNN and LSTM outperform the other prediction models by a considerable margin. The results also show that in the binary data evaluation, those deep learning methods are the best; however, the difference becomes smaller because of the noticeable improvement of the other models' performance in the second approach.

Journal ArticleDOI
TL;DR: The experimental results show that the combination of chi-square with PCA achieves better performance in most classifiers, whereas applying PCA directly to the raw data produced lower results and would require greater dimensionality to improve them.

Journal ArticleDOI
TL;DR: A broader scope of factors that may potentially influence flight delay is explored, several machine learning-based models are compared on designed generalized flight delay prediction tasks, and the proposed random forest-based model obtains higher prediction accuracy and overcomes the overfitting problem.
Abstract: Accurate flight delay prediction is fundamental to establishing a more efficient airline business. Recent studies have focused on applying machine learning methods to predict flight delay. Most of the previous prediction methods are conducted on a single route or airport. This paper explores a broader scope of factors which may potentially influence flight delay, and compares several machine learning-based models on designed generalized flight delay prediction tasks. To build a dataset for the proposed scheme, automatic dependent surveillance-broadcast (ADS-B) messages are received, pre-processed, and integrated with other information such as weather conditions, flight schedules, and airport information. The designed prediction tasks contain different classification tasks and a regression task. Experimental results show that long short-term memory (LSTM) is capable of handling the obtained aviation sequence data, but an overfitting problem occurs on our limited dataset. Compared with the previous schemes, the proposed random forest-based model obtains higher prediction accuracy (90.2% for the binary classification) and overcomes the overfitting problem.

Journal ArticleDOI
TL;DR: This work compares the strengths and weaknesses of multiple linear regression (MLR), k-nearest neighbors (KNN), support vector regression (SVR), Cubist, random forest (RF), and artificial neural networks (ANN) for DSM.

Journal ArticleDOI
TL;DR: The results indicate that temporal aggregation (e.g., median) is a promising method, which not only significantly reduces data volume (resulting in an easier and faster analysis) but also produces accuracy as high as that of time series data.
Abstract: Land cover information plays a vital role in many aspects of life, from scientific and economic to political. Accurate information about land cover affects the accuracy of all subsequent applications, therefore accurate and timely land cover information is in high demand. In land cover classification studies over the past decade, higher accuracies were produced when using time series satellite images than when using single date images. Recently, the availability of the Google Earth Engine (GEE), a cloud-based computing platform, has gained the attention of remote sensing based applications where temporal aggregation methods derived from time series images are widely applied (i.e., the use of metrics such as mean or median) instead of time series images. In GEE, many studies simply select as many images as possible to fill gaps without considering how different year/season images might affect the classification accuracy. This study aims to analyze the effect of different composition methods, as well as different input images, on the classification results. We use Landsat 8 surface reflectance (L8sr) data with eight different combination strategies to produce and evaluate land cover maps for a study area in Mongolia. We implemented the experiment on the GEE platform with a widely applied algorithm, the Random Forest (RF) classifier. Our results show that all the eight datasets produced moderately to highly accurate land cover maps, with overall accuracy over 84.31%. Among the eight datasets, two time series datasets of summer scenes (images from 1 June to 30 September) produced the highest accuracy (89.80% and 89.70%), followed by the median composite of the same input images (88.74%). The difference between these three classifications was not significant based on the McNemar test (p > 0.05). However, a significant difference (p < 0.05) was observed for all other pairs involving one of these three datasets. The results indicate that temporal aggregation (e.g., median) is a promising method, which not only significantly reduces data volume (resulting in an easier and faster analysis) but also produces accuracy as high as that of time series data. The spatial consistency among the classification results was relatively low compared to the generally high accuracy, showing that the selection of the dataset used in any classification on GEE is an important and crucial step, because the input images for the composition play an essential role in land cover classification, particularly in snowy, cloudy and expansive areas like Mongolia.
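
A compressed sketch of this workflow with the Earth Engine Python API might look like the following; the region, date window, asset path and label property are assumptions for illustration and do not reproduce the study's Mongolian training data or band selection.

    import ee
    ee.Initialize()

    # Hypothetical region, dates and asset path; band names follow Landsat 8 SR conventions.
    region = ee.Geometry.Rectangle([103.0, 47.0, 108.0, 49.0])
    bands = ['B2', 'B3', 'B4', 'B5', 'B6', 'B7']

    # Median composite of Landsat 8 surface reflectance summer scenes (1 June - 30 September)
    composite = (ee.ImageCollection('LANDSAT/LC08/C01/T1_SR')
                 .filterBounds(region)
                 .filterDate('2019-06-01', '2019-09-30')
                 .median()
                 .select(bands))

    # 'samples' stands for a labelled FeatureCollection of training points (hypothetical asset).
    samples = ee.FeatureCollection('users/example/mongolia_training_points')
    training = composite.sampleRegions(collection=samples, properties=['landcover'], scale=30)

    classifier = ee.Classifier.smileRandomForest(100).train(
        features=training, classProperty='landcover', inputProperties=bands)
    classified = composite.classify(classifier)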

Journal ArticleDOI
TL;DR: The performance of 14 different bagging and boosting based ensembles, including XGBoost, LightGBM and Random Forest, is empirically analyzed in terms of predictive capability and efficiency.

Journal ArticleDOI
TL;DR: The purpose of this study is to optimize the hyperparameters based on a Bayesian optimization algorithm and to obtain a high-accuracy random forest landslide susceptibility evaluation model.
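
As one way to realize this idea, the sketch below tunes a random forest with Optuna, whose default TPE sampler is a Bayesian-style optimizer; the search space, data and metric are illustrative assumptions, not the study's landslide inventory or exact optimization algorithm.

    import optuna
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Synthetic placeholder data; the study works with landslide conditioning factors.
    X, y = make_classification(n_samples=1000, n_features=15, n_informative=6, random_state=0)

    def objective(trial):
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 100, 500),
            "max_depth": trial.suggest_int("max_depth", 3, 20),
            "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
            "max_features": trial.suggest_float("max_features", 0.2, 1.0),
        }
        clf = RandomForestClassifier(random_state=0, **params)
        return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

    study = optuna.create_study(direction="maximize")    # TPE sampler, a Bayesian-style optimizer
    study.optimize(objective, n_trials=30)
    print(study.best_params)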

Journal ArticleDOI
TL;DR: The results suggested that the proposed Feed-Forward Deep Neural Network (FFDNN) wireless IDS system using a Wrapper Based Feature Extraction Unit (WFEU) has greater detection accuracy than other approaches.

Journal ArticleDOI
TL;DR: The experimental results show that the number of relevant and significant features yielded by Information Gain significantly affects the improvement of detection accuracy and execution time.
Abstract: Feature selection (FS) is one of the important tasks of data preprocessing in data analytics. Data with a large number of features increase the computational complexity and require a huge amount of resource usage and time for data analytics. The objective of this study is to analyze the relevant and significant features of huge network traffic to be used to improve the accuracy of traffic anomaly detection and to decrease its execution time. Information Gain is the most commonly used feature selection technique in Intrusion Detection System (IDS) research. This study uses Information Gain, ranking and grouping the features according to minimum weight values, to select relevant and significant features, and then implements Random Forest (RF), Bayes Net (BN), Random Tree (RT), Naive Bayes (NB) and J48 classifier algorithms in experiments on the CICIDS-2017 dataset. The experimental results show that the number of relevant and significant features yielded by Information Gain significantly affects the improvement of detection accuracy and execution time. Specifically, the Random Forest algorithm has the highest accuracy of 99.86% using 22 relevant selected features, whereas the J48 classifier algorithm provides an accuracy of 99.87% using 52 relevant selected features with a longer execution time.
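
Information gain of a feature with respect to the class label is its mutual information, so a compact way to mimic this ranking in Python is scikit-learn's mutual_info_classif, sketched below on synthetic data; the CICIDS-2017 features and the paper's grouping by minimum weight values are not reproduced.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.model_selection import cross_val_score

    # Synthetic placeholder traffic features; the study uses the CICIDS-2017 dataset.
    X, y = make_classification(n_samples=1000, n_features=40, n_informative=8, random_state=0)

    # Information gain of a feature with respect to the class label is its mutual information.
    ig = mutual_info_classif(X, y, random_state=0)
    top = np.argsort(ig)[::-1][:22]                      # keep the 22 highest-ranked features

    acc = cross_val_score(RandomForestClassifier(random_state=0), X[:, top], y, cv=5).mean()
    print(f"RF accuracy on the selected features: {acc:.3f}")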

Journal ArticleDOI
TL;DR: The experimental results demonstrated that the performance of the machine learning classifiers previously mentioned can be effectively improved by integrating the CNN technique and can be recommended for landslide spatial modelling in other prone areas with similar geo-environmental conditions.

Journal ArticleDOI
TL;DR: An ensemble model that integrates multiple machine learning algorithms, including neural network, random forest, and gradient boosting, with a variety of predictor variables, including chemical transport models, is proposed to assess NO2 levels with high accuracy.
Abstract: NO2 is a combustion byproduct that has been associated with multiple adverse health outcomes. To assess NO2 levels with high accuracy, we propose the use of an ensemble model to integrate multiple machine learning algorithms, including neural network, random forest, and gradient boosting, with a variety of predictor variables, including chemical transport models. This NO2 model covers the entire contiguous U.S. with daily predictions on 1-km-level grid cells from 2000 to 2016. The ensemble produced a cross-validated R2 of 0.788 overall, a spatial R2 of 0.844, and a temporal R2 of 0.729. The relationship between daily monitored and predicted NO2 is almost linear. We also estimated the associated monthly uncertainty level for the predictions and address-specific NO2 levels. This NO2 estimation has a very high spatiotemporal resolution and allows the examination of the health effects of NO2 in unmonitored areas. We found the highest NO2 levels along highways and in cities. We also observed that nationwide NO2 levels declined in early years and stagnated after 2007, in contrast to the trend at monitoring sites in urban areas, where the decline continued. Our research indicates that the integration of different predictor variables and fitting algorithms can achieve an improved air pollution modeling framework.
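
A generic way to combine such base learners is stacking, sketched below with scikit-learn; note that the paper's ensemble weights the algorithms differently (and uses far richer spatiotemporal predictors), so this is only an illustrative approximation.

    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, StackingRegressor
    from sklearn.linear_model import RidgeCV
    from sklearn.neural_network import MLPRegressor

    # Synthetic predictors standing in for meteorology, land use and chemical-transport-model outputs.
    X, y = make_regression(n_samples=500, n_features=12, noise=5.0, random_state=0)

    ensemble = StackingRegressor(
        estimators=[("nn", MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)),
                    ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
                    ("gb", GradientBoostingRegressor(random_state=0))],
        final_estimator=RidgeCV(),                        # simple learner to combine the base predictions
    )
    ensemble.fit(X, y)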

Journal ArticleDOI
TL;DR: It is concluded that single-algorithm modelling methods, particularly MaxEnt, are capable of producing distribution maps of comparable accuracy to ensemble methods, and the ease of use, reduced computational time and simplicity of methods like MaxEnt support their use in scenarios where the choice of modelling methods is limited but the need for robust and accurate conservation predictions is urgent.

Journal ArticleDOI
TL;DR: The experimental results show that the proposed AE-IDS (Auto-Encoder Intrusion Detection System) is superior to traditional machine learning based intrusion detection methods in terms of easy training, strong adaptability, and high detection accuracy.

Journal ArticleDOI
TL;DR: Six machine-learning-based IDSs are proposed using the K Nearest Neighbor, Random Forest, Gradient Boosting, Adaboost, Decision Tree, and Linear Discriminant Analysis algorithms to increase the efficiency of the system depending on attack types and to decrease missed intrusions and false alarms.
Abstract: In recent years, due to the extensive use of the Internet, the number of networked computers has been increasing in our daily lives. Weaknesses of servers enable hackers to intrude on computers by using not only known but also new attack types, which are more sophisticated and harder to detect. To protect computers from them, an Intrusion Detection System (IDS), trained with machine learning techniques on a pre-collected dataset, is one of the most preferred protection mechanisms. The commonly used datasets were collected during a limited period in specific networks and generally do not contain up-to-date data. Additionally, they are imbalanced and do not hold sufficient data for all types of attacks. These imbalanced and outdated datasets decrease the efficiency of current IDSs, especially for rarely encountered attack types. In this paper, we propose six machine-learning-based IDSs using the K Nearest Neighbor, Random Forest, Gradient Boosting, Adaboost, Decision Tree, and Linear Discriminant Analysis algorithms. To implement a more realistic IDS, an up-to-date security dataset, CSE-CIC-IDS2018, is used instead of the older and more commonly studied datasets. The selected dataset is also imbalanced. Therefore, to increase the efficiency of the system depending on attack types and to decrease missed intrusions and false alarms, the imbalance ratio is reduced by using a synthetic data generation model called the Synthetic Minority Oversampling TEchnique (SMOTE). Data generation is performed for the minority classes, and their numbers are increased to the average data size via this technique. Experimental results demonstrate that the proposed approach considerably increases the detection rate for rarely encountered intrusions.
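
The SMOTE step can be sketched with the imbalanced-learn package as follows; the synthetic two-class data stands in for CSE-CIC-IDS2018, where oversampling is applied per minority attack class.

    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic imbalanced data standing in for CSE-CIC-IDS2018.
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], n_informative=5, random_state=0)
    print("before:", Counter(y))

    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)   # synthesize new minority-class samples
    print("after:", Counter(y_res))

    RandomForestClassifier(n_estimators=100, random_state=0).fit(X_res, y_res)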

Journal ArticleDOI
TL;DR: An integrated model that combines logistic regression and random forest is proposed by analyzing the misjudgments generated by the established models; it achieves an average accuracy of 99.83% over ten simulation runs.
Abstract: Chronic kidney disease (CKD) is a global health problem with high morbidity and mortality rates, and it induces other diseases. Since there are no obvious symptoms during the early stages of CKD, patients often fail to notice the disease. Early detection of CKD enables patients to receive timely treatment to ameliorate the progression of the disease. Machine learning models can effectively aid clinicians in achieving this goal due to their fast and accurate recognition performance. In this study, we propose a machine learning methodology for diagnosing CKD. The CKD data set was obtained from the University of California Irvine (UCI) machine learning repository, and it contains a large number of missing values. KNN imputation was used to fill in the missing values; for each incomplete sample, it selects several complete samples with the most similar measurements to impute the missing data. Missing values are commonly seen in real-life medical situations because patients may miss some measurements for various reasons. After effectively filling in the incomplete data set, six machine learning algorithms (logistic regression, random forest, support vector machine, k-nearest neighbor, naive Bayes classifier and feed forward neural network) were used to establish models. Among these machine learning models, random forest achieved the best performance with 99.75% diagnosis accuracy. By analyzing the misjudgments generated by the established models, we propose an integrated model that combines logistic regression and random forest by using a perceptron, which achieves an average accuracy of 99.83% over ten simulation runs. Hence, we speculate that this methodology could be applicable to more complicated clinical data for disease diagnosis.
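
The KNN imputation step is available in scikit-learn as KNNImputer; a minimal sketch on placeholder data is shown below, with a plain average standing in for the paper's perceptron-based fusion of the logistic regression and random forest outputs.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import KNNImputer
    from sklearn.linear_model import LogisticRegression

    # Tiny placeholder matrix with missing entries, standing in for the UCI CKD data.
    X = np.array([[1.020, np.nan, 80.0],
                  [1.010, 120.0, np.nan],
                  [1.020, 118.0, 76.0],
                  [1.000, np.nan, 88.0],
                  [1.010, 121.0, 79.0],
                  [1.020, 119.0, 81.0]])
    y = np.array([1, 1, 0, 1, 0, 0])

    # Fill each missing value from the most similar complete samples.
    X_filled = KNNImputer(n_neighbors=2).fit_transform(X)

    rf = RandomForestClassifier(random_state=0).fit(X_filled, y)
    lr = LogisticRegression(max_iter=1000).fit(X_filled, y)
    # The paper fuses the LR and RF outputs with a perceptron; a plain average is shown here.
    combined = (rf.predict_proba(X_filled)[:, 1] + lr.predict_proba(X_filled)[:, 1]) / 2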

Journal ArticleDOI
TL;DR: A disease is an unusual condition that affects one or more parts of the human body; because of lifestyle and hereditary factors, different kinds of disease are increasing day by day.
Abstract: A disease is an unusual condition that affects one or more parts of the human body. Because of lifestyle and hereditary factors, different kinds of disease are increasing day by day. Among all those ...

Journal ArticleDOI
TL;DR: A novel internet of health things (IoHT)-driven deep learning framework for the detection and classification of cervical cancer in Pap smear images using the concept of transfer learning is proposed.
Abstract: Cervical cancer is one of the fastest growing global health problems and a leading cause of mortality among women in developing countries. Automated Pap smear cell recognition and classification in the early stage of cell development is crucial for effective disease diagnosis and immediate treatment. Thus, in this article, we propose a novel internet of health things (IoHT)-driven deep learning framework for the detection and classification of cervical cancer in Pap smear images using the concept of transfer learning. Following transfer learning, a convolutional neural network (CNN) was combined with different conventional machine learning techniques like K nearest neighbor, naive Bayes, logistic regression, random forest and support vector machines. In the proposed framework, feature extraction from cervical images is performed using pre-trained CNN models like InceptionV3, VGG19, SqueezeNet and ResNet50, which are fed into flattened and dense layers for normal and abnormal cervical cell classification. The performance of the proposed IoHT framework is evaluated using the standard Pap smear Herlev dataset. The proposed approach was validated by analyzing precision, recall, F1-score, training–testing time and support parameters. The obtained results conclude that the CNN pre-trained model ResNet50 achieved the highest classification rate of 97.89% with the involvement of the random forest classifier for effective and reliable disease detection and classification. The minimum training time and testing time required to train the model were 0.032 s and 0.006 s, respectively.
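
A stripped-down sketch of the ResNet50-plus-random-forest path is shown below, assuming Keras and 224x224 RGB inputs; the Herlev images, preprocessing pipeline and the flattened/dense layers of the full framework are not reproduced.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

    # Pre-trained CNN used as a fixed feature extractor (assumes 224x224 RGB cell images).
    backbone = ResNet50(weights="imagenet", include_top=False, pooling="avg")

    def extract_features(images):
        """images: float array of shape (n, 224, 224, 3) with values in [0, 255]."""
        return backbone.predict(preprocess_input(images.copy()), verbose=0)

    # Random arrays standing in for labelled Herlev Pap smear images.
    X_img = np.random.rand(8, 224, 224, 3) * 255
    y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(extract_features(X_img), y)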