
Showing papers on "Random forest published in 2020"


Journal ArticleDOI
TL;DR: An explanation method for trees is presented that enables the computation of optimal local explanations for individual predictions, and the authors demonstrate their method on three medical datasets.
Abstract: Tree-based machine learning models such as random forests, decision trees and gradient boosted trees are popular nonlinear predictive models, yet comparatively little attention has been paid to explaining their predictions. Here we improve the interpretability of tree-based models through three main contributions. (1) A polynomial time algorithm to compute optimal explanations based on game theory. (2) A new type of explanation that directly measures local feature interaction effects. (3) A new set of tools for understanding global model structure based on combining many local explanations of each prediction. We apply these tools to three medical machine learning problems and show how combining many high-quality local explanations allows us to represent global structure while retaining local faithfulness to the original model. These tools enable us to (1) identify high-magnitude but low-frequency nonlinear mortality risk factors in the US population, (2) highlight distinct population subgroups with shared risk characteristics, (3) identify nonlinear interaction effects among risk factors for chronic kidney disease and (4) monitor a machine learning model deployed in a hospital by identifying which features are degrading the model’s performance over time. Given the popularity of tree-based machine learning models, these improvements to their interpretability have implications across a broad set of domains. Tree-based machine learning models are widely used in domains such as healthcare, finance and public services. The authors present an explanation method for trees that enables the computation of optimal local explanations for individual predictions, and demonstrate their method on three medical datasets.
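
The explanation approach described here is available in the open-source shap Python package as TreeExplainer. Purely for orientation, the sketch below applies it to a public stand-in dataset and a generic random forest; the paper's mortality, kidney-disease and hospital models are not reproduced.

    import shap
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor

    # Public stand-in data; the paper's case studies use mortality, kidney-disease
    # and hospital datasets that are not reproduced here.
    X, y = load_diabetes(return_X_y=True, as_frame=True)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

    explainer = shap.TreeExplainer(model)                 # exact, polynomial-time Shapley values for trees
    shap_values = explainer.shap_values(X)                # one local explanation per prediction
    interactions = explainer.shap_interaction_values(X)   # local feature-interaction effects
    shap.summary_plot(shap_values, X)                     # global structure from many local explanations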

2,548 citations


Journal ArticleDOI
TL;DR: Two of the prominent dimensionality reduction techniques, Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA), are investigated on four popular Machine Learning (ML) algorithms using the publicly available Cardiotocography dataset from the University of California, Irvine Machine Learning Repository, showing that PCA outperforms LDA in all the measures.
Abstract: Due to digitization, a huge volume of data is being generated across several sectors such as healthcare, production, sales, IoT devices, the Web, and organizations. Machine learning algorithms are used to uncover patterns among the attributes of this data. Hence, they can be used to make predictions that medical practitioners and people at the managerial level can use to make executive decisions. Not all the attributes in the datasets generated are important for training the machine learning algorithms. Some attributes might be irrelevant and some might not affect the outcome of the prediction. Ignoring or removing these irrelevant or less important attributes reduces the burden on machine learning algorithms. In this work, two of the prominent dimensionality reduction techniques, Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA), are investigated on four popular Machine Learning (ML) algorithms, Decision Tree Induction, Support Vector Machine (SVM), Naive Bayes Classifier and Random Forest Classifier, using the publicly available Cardiotocography (CTG) dataset from the University of California, Irvine Machine Learning Repository. The experimentation results show that PCA outperforms LDA in all the measures. Also, the performance of the Decision Tree and Random Forest classifiers examined is not affected much by using PCA and LDA. To further analyze the performance of PCA and LDA, the experimentation is carried out on Diabetic Retinopathy (DR) and Intrusion Detection System (IDS) datasets. Experimentation results show that ML algorithms with PCA produce better results when the dimensionality of the datasets is high. When the dimensionality of the datasets is low, it is observed that the ML algorithms without dimensionality reduction yield better results.
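
For readers who want to reproduce this kind of comparison, the following is a minimal scikit-learn sketch (not from the paper) that places PCA or LDA in front of the same four classifier families; the dataset, component counts and parameter settings are placeholders rather than those of the CTG study.

    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    # Public stand-in data; the paper uses the UCI Cardiotocography (CTG) dataset.
    X, y = load_breast_cancer(return_X_y=True)

    reducers = {"PCA": PCA(n_components=5),
                "LDA": LinearDiscriminantAnalysis(n_components=1)}   # binary target: at most 1 component
    classifiers = {"Decision Tree": DecisionTreeClassifier(random_state=0),
                   "SVM": SVC(),
                   "Naive Bayes": GaussianNB(),
                   "Random Forest": RandomForestClassifier(random_state=0)}

    for r_name, reducer in reducers.items():
        for c_name, clf in classifiers.items():
            pipe = make_pipeline(StandardScaler(), reducer, clf)
            acc = cross_val_score(pipe, X, y, cv=5).mean()
            print(f"{r_name} + {c_name}: {acc:.3f}")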

414 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present a new visualization approach that is called accumulated local effects plots, which do not require this unreliable extrapolation with correlated predictors and are far less computationally expensive than partial dependence plots.
Abstract: In many supervised learning applications, understanding and visualizing the effects of the predictor variables on the predicted response is of paramount importance. A shortcoming of black box supervised learning models (e.g. complex trees, neural networks, boosted trees, random forests, nearest neighbours, local kernel‐weighted methods and support vector regression) in this regard is their lack of interpretability or transparency. Partial dependence plots, which are the most popular approach for visualizing the effects of the predictors with black box supervised learning models, can produce erroneous results if the predictors are strongly correlated, because they require extrapolation of the response at predictor values that are far outside the multivariate envelope of the training data. As an alternative to partial dependence plots, we present a new visualization approach that we term accumulated local effects plots, which do not require this unreliable extrapolation with correlated predictors. Moreover, accumulated local effects plots are far less computationally expensive than partial dependence plots. We also provide an R package ALEPlot as supplementary material to implement our proposed method.
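
The authors provide their implementation as the R package ALEPlot. Purely for illustration, the Python sketch below computes a rough first-order ALE curve by hand (quantile bins, averaged local prediction differences, accumulation and centering); it is not the authors' code and omits refinements such as count-weighted centering.

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor

    def ale_1d(model, X, j, n_bins=20):
        """Rough first-order ALE curve for feature j of a fitted regressor."""
        x = X[:, j]
        edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)))
        local_effects = []
        for k, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
            last = k == len(edges) - 2
            in_bin = (x >= lo) & (x <= hi) if last else (x >= lo) & (x < hi)
            if not in_bin.any():
                local_effects.append(0.0)
                continue
            X_lo, X_hi = X[in_bin].copy(), X[in_bin].copy()
            X_lo[:, j], X_hi[:, j] = lo, hi            # move only feature j to the bin edges
            local_effects.append(np.mean(model.predict(X_hi) - model.predict(X_lo)))
        ale = np.cumsum(local_effects)                 # accumulate the averaged local differences
        return edges[1:], ale - ale.mean()             # roughly center the curve

    X, y = load_diabetes(return_X_y=True)
    rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    grid, effects = ale_1d(rf, X, j=2)                 # ALE curve for the BMI feature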

410 citations


Journal ArticleDOI
TL;DR: This article overviews the random forest algorithm, illustrates its use with two examples, and introduces a corresponding new command, rforest, which is used to predict the log-scaled number of shares of online news articles.
Abstract: Random forests (Breiman, 2001, Machine Learning 45: 5–32) is a statistical- or machine-learning algorithm for prediction. In this article, we introduce a corresponding new command, rforest. We over...

290 citations


Journal ArticleDOI
TL;DR: A meta-analysis of 251 peer-reviewed journal papers relevant to remote sensing image classification is presented, together with a comparative analysis of the performance of RF and SVM classification against various parameters.
Abstract: Several machine-learning algorithms have been proposed for remote sensing image classification during the past two decades. Among these machine learning algorithms, Random Forest (RF) and Support Vector Machines (SVM) have drawn attention to image classification in several remote sensing applications. This article reviews RF and SVM concepts relevant to remote sensing image classification and applies a meta-analysis of 251 peer-reviewed journal papers. A database with more than 40 quantitative and qualitative fields was constructed from these reviewed papers. The meta-analysis mainly focuses on 1) the analysis regarding the general characteristics of the studies, such as geographical distribution, frequency of the papers considering time, journals, application domains, and remote sensing software packages used in the case studies, and 2) a comparative analysis regarding the performances of RF and SVM classification against various parameters, such as data type, RS applications, spatial resolution, and the number of extracted features in the feature engineering step. The challenges, recommendations, and potential directions for future research are also discussed in detail. Moreover, a summary of the results is provided to aid researchers to customize their efforts in order to achieve the most accurate results based on their thematic applications.

275 citations


Journal ArticleDOI
TL;DR: This paper adopts Random Forest to select the important features for classification and compares results on each dataset with and without feature selection by the RF-based methods varImp(), Boruta, and Recursive Feature Elimination, to obtain the best percentage accuracy and kappa.
Abstract: Feature selection becomes prominent, especially in data sets with many variables and features. It eliminates unimportant variables and improves the accuracy as well as the performance of classification. Random Forest has emerged as a quite useful algorithm that can handle the feature selection issue even with a higher number of variables. In this paper, we use three popular datasets with a higher number of variables (Bank Marketing, Car Evaluation Database, Human Activity Recognition Using Smartphones) to conduct the experiment. There are four main reasons why feature selection is essential: to simplify the model by reducing the number of parameters, to decrease the training time, to reduce overfitting by enhancing generalization, and to avoid the curse of dimensionality. Besides, we evaluate and compare the accuracy and performance of each classification model, such as Random Forest (RF), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Linear Discriminant Analysis (LDA). The model with the highest accuracy is considered the best classifier. Practically, this paper adopts Random Forest to select the important features for classification. Our experiments clearly show the comparative study of the RF algorithm from different perspectives. Furthermore, we compare the results on each dataset with and without feature selection by the RF methods varImp(), Boruta, and Recursive Feature Elimination (RFE) to get the best percentage accuracy and kappa. Experimental results demonstrate that Random Forest achieves a better performance in all experiment groups.
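
varImp(), Boruta and RFE are R-side tools (caret and the Boruta package). As a loosely analogous Python sketch, the snippet below compares a random forest trained on all features, on the top impurity-based importances, and on an RFE-selected subset; the data and feature counts are synthetic placeholders, not the paper's datasets.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE
    from sklearn.model_selection import cross_val_score

    # Synthetic placeholder data; the paper uses the Bank Marketing, Car Evaluation
    # and Human Activity Recognition datasets.
    X, y = make_classification(n_samples=1000, n_features=30, n_informative=6, random_state=0)

    rf = RandomForestClassifier(n_estimators=200, random_state=0)

    # Importance ranking (loosely analogous to caret's varImp() on a fitted RF)
    rf.fit(X, y)
    top = np.argsort(rf.feature_importances_)[::-1][:10]

    # Recursive Feature Elimination driven by the same RF
    rfe = RFE(rf, n_features_to_select=10).fit(X, y)

    for name, cols in [("all features", np.arange(X.shape[1])),
                       ("top RF importances", top),
                       ("RFE subset", np.where(rfe.support_)[0])]:
        acc = cross_val_score(RandomForestClassifier(random_state=0), X[:, cols], y, cv=5).mean()
        print(f"{name}: accuracy {acc:.3f}")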

271 citations


Journal ArticleDOI
TL;DR: The results indicated that the RF method is an efficient and reliable model in flood susceptibility assessment, with the highest AUC values, positive predictive rate, negative predictive rates, specificity, and accuracy for the training and validation datasets.

256 citations


Journal ArticleDOI
TL;DR: The use of ensembles is recommended to forecast agricultural commodity prices one month ahead, since more assertive performance is observed, which increases the accuracy of the constructed model and reduces decision-making risk.

244 citations


Journal ArticleDOI
TL;DR: A new intrusion detection framework based on feature selection and ensemble learning techniques is proposed, and this framework exhibits better performance than other related and state-of-the-art approaches under several metrics.

244 citations


Journal ArticleDOI
Sheng Shen1, Mohammadkazem Sadoughi1, Meng Li1, Zhengdao Wang1, Chao Hu1 
TL;DR: The verification and comparison results demonstrate that the proposed DCNN-ETL method can produce a higher accuracy and robustness than these other data-driven methods in estimating the capacities of the Li-ion cells in the target task.

218 citations


Journal ArticleDOI
01 Dec 2020
TL;DR: The experimental results show that the BBC news text classification model obtains satisfactory results with the algorithms tested on the data set, and the classifier that scores highest is identified as the best machine learning algorithm for the BBC news data set.
Abstract: In the current generation, a huge number of textual documents is generated, and there is an urgent need to organize them in a proper structure so that classification can be performed and categories can be properly defined. The key technology for gaining insight into text information and organizing it is known as text classification. Documents are then assigned to classes by determining the text type of their content. Based on the different machine learning algorithms used in the current paper, the text classification system is divided into four sections, namely text preprocessing, text representation, classifier implementation and classification. In this paper, a BBC news text classification system is designed. In the classifier implementation section, the authors separately chose and compared logistic regression, random forest and K-nearest neighbour as the classification algorithms. These classifiers were then tested, analysed and compared with each other, and a conclusion was drawn. The experimental results show that the BBC news text classification model obtains satisfactory results with the algorithms tested on the data set. The authors present the comparison based on five parameters, namely precision, accuracy, F1-score, support and confusion matrix. The classifier that scores highest on these parameters is identified as the best machine learning algorithm for the BBC news data set.
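
A minimal sketch of such a pipeline with scikit-learn is shown below: a TF-IDF representation feeding each of the three classifiers in turn. The toy corpus and labels are placeholders assumed for illustration, not the BBC news dataset used in the paper.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    # Tiny placeholder corpus; the paper uses the BBC news dataset with five categories.
    texts = ["stocks rallied on strong quarterly earnings",
             "the striker scored twice in the cup final",
             "new phone unveiled with a faster chip",
             "parliament debates the annual budget",
             "film wins top award at the festival",
             "the club signed a promising young goalkeeper"]
    labels = ["business", "sport", "tech", "politics", "entertainment", "sport"]

    for clf in (LogisticRegression(max_iter=1000),
                RandomForestClassifier(random_state=0),
                KNeighborsClassifier(n_neighbors=3)):
        pipe = make_pipeline(TfidfVectorizer(), clf)    # bag-of-words TF-IDF representation
        pipe.fit(texts, labels)
        print(type(clf).__name__, pipe.predict(["the midfielder joined the club on loan"]))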

Journal ArticleDOI
TL;DR: A robust framework for diabetes prediction is proposed in which outlier rejection, filling of missing values, data standardization, feature selection, K-fold cross-validation, and different Machine Learning (ML) classifiers are employed, together with a weighted ensembling of the ML models to improve the prediction of diabetes.
Abstract: Diabetes, also known as chronic illness, is a group of metabolic diseases due to a high level of sugar in the blood over a long period. The risk factor and severity of diabetes can be reduced significantly if precise early prediction is possible. Robust and accurate prediction of diabetes is highly challenging due to the limited number of labeled data and the presence of outliers (or missing values) in the diabetes datasets. In this work, we propose a robust framework for diabetes prediction in which outlier rejection, filling of missing values, data standardization, feature selection, K-fold cross-validation, different Machine Learning (ML) classifiers (k-nearest Neighbour, Decision Trees, Random Forest, AdaBoost, Naive Bayes, and XGBoost) and a Multilayer Perceptron (MLP) are employed. A weighted ensembling of the different ML models is also proposed to improve the prediction of diabetes, where the weights are estimated from the corresponding Area Under the ROC Curve (AUC) of each ML model. AUC is chosen as the performance metric, which is then maximized during hyperparameter tuning using the grid search technique. All the experiments were conducted under the same experimental conditions using the Pima Indian Diabetes Dataset. Across all the extensive experiments, our proposed ensembling classifier is the best performing classifier, with sensitivity, specificity, false omission rate, diagnostic odds ratio, and AUC of 0.789, 0.934, 0.092, 66.234, and 0.950 respectively, which outperforms the state-of-the-art results by 2.00% in AUC. Our proposed framework for diabetes prediction outperforms the other methods discussed in the article and can lead to better performance in diabetes prediction on the same dataset. Our source code for diabetes prediction is made publicly available.
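
The AUC-weighted ensembling can be sketched as follows: each base model's out-of-sample predicted probabilities are averaged with weights proportional to its AUC. The snippet uses synthetic data, a single validation split and a reduced model list for brevity, whereas the paper estimates the weights from cross-validated AUC on the Pima dataset.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    # Synthetic stand-in for the Pima Indian Diabetes data.
    X, y = make_classification(n_samples=800, weights=[0.65, 0.35], random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

    models = [GaussianNB(), RandomForestClassifier(random_state=0), AdaBoostClassifier(random_state=0)]
    probas, weights = [], []
    for m in models:
        p = m.fit(X_tr, y_tr).predict_proba(X_val)[:, 1]
        probas.append(p)
        weights.append(roc_auc_score(y_val, p))          # each model is weighted by its AUC

    weights = np.array(weights) / np.sum(weights)
    ensemble = np.average(np.vstack(probas), axis=0, weights=weights)
    print("weighted-ensemble AUC:", round(roc_auc_score(y_val, ensemble), 3))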

Journal ArticleDOI
TL;DR: A data fusion enabled ensemble approach is proposed to work with medical data obtained from BSNs in a fog computing environment, and the obtained results are promising, reaching 98% accuracy when the tree depth is 15, the number of estimators is 40, and 8 features are considered for the prediction task.

Journal ArticleDOI
TL;DR: Results show that for the continuous data, RNN and LSTM outperform the other prediction models by a considerable margin; in the binary data evaluation these deep learning methods remain the best, but the difference becomes smaller because of the noticeable improvement of the other models' performance in the second approach.
Abstract: The nature of stock market movement has always been ambiguous for investors because of various influential factors. This study aims to significantly reduce the risk of trend prediction with machine learning and deep learning algorithms. Four stock market groups, namely diversified financials, petroleum, non-metallic minerals and basic metals from the Tehran stock exchange, are chosen for experimental evaluations. This study compares nine machine learning models (Decision Tree, Random Forest, Adaptive Boosting (Adaboost), eXtreme Gradient Boosting (XGBoost), Support Vector Classifier (SVC), Naive Bayes, K-Nearest Neighbors (KNN), Logistic Regression and Artificial Neural Network (ANN)) and two powerful deep learning methods (Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM)). Ten technical indicators from ten years of historical data are used as input values, and two approaches are considered for employing them: first, calculating the indicators from stock trading values as continuous data, and second, converting the indicators to binary data before use. Each prediction model is evaluated by three metrics for each input approach. The evaluation results indicate that for the continuous data, RNN and LSTM outperform the other prediction models by a considerable margin. The results also show that in the binary data evaluation, those deep learning methods are the best; however, the difference becomes smaller because of the noticeable improvement of the other models' performance in the second approach.

Journal ArticleDOI
TL;DR: The experimental results show that the combination of chi-square with PCA achieves better performance in most classifiers, whereas applying PCA directly to the raw data produced lower results and would require greater dimensionality to improve them.

Journal ArticleDOI
TL;DR: A broader scope of factors that may potentially influence flight delay is explored, several machine learning-based models are compared on designed generalized flight delay prediction tasks, and the proposed random forest-based model obtains higher prediction accuracy and overcomes the overfitting problem.
Abstract: Accurate flight delay prediction is fundamental to establishing a more efficient airline business. Recent studies have focused on applying machine learning methods to predict flight delay. Most of the previous prediction methods are conducted on a single route or airport. This paper explores a broader scope of factors which may potentially influence flight delay, and compares several machine learning-based models on designed generalized flight delay prediction tasks. To build a dataset for the proposed scheme, automatic dependent surveillance-broadcast (ADS-B) messages are received, pre-processed, and integrated with other information such as weather conditions, flight schedules, and airport information. The designed prediction tasks contain different classification tasks and a regression task. Experimental results show that long short-term memory (LSTM) is capable of handling the obtained aviation sequence data, but an overfitting problem occurs on our limited dataset. Compared with the previous schemes, the proposed random forest-based model obtains higher prediction accuracy (90.2% for the binary classification) and overcomes the overfitting problem.

Journal ArticleDOI
TL;DR: This work compares the strengths and weaknesses of multiple linear regression (MLR), k-nearest neighbors (KNN), support vector regression (SVR), Cubist, random forest (RF), and artificial neural networks (ANN) for DSM.

Journal ArticleDOI
TL;DR: The results indicate that temporal aggregation (e.g., median) is a promising method, which not only significantly reduces data volume (resulting in an easier and faster analysis) but also produces accuracy as high as that of time series data.
Abstract: Land cover information plays a vital role in many aspects of life, from scientific and economic to political. Accurate information about land cover affects the accuracy of all subsequent applications, therefore accurate and timely land cover information is in high demand. In land cover classification studies over the past decade, higher accuracies were produced when using time series satellite images than when using single date images. Recently, the availability of the Google Earth Engine (GEE), a cloud-based computing platform, has gained the attention of remote sensing based applications where temporal aggregation methods derived from time series images are widely applied (i.e., the use of metrics such as mean or median) instead of time series images. In GEE, many studies simply select as many images as possible to fill gaps without considering how different year/season images might affect the classification accuracy. This study aims to analyze the effect of different composition methods, as well as different input images, on the classification results. We use Landsat 8 surface reflectance (L8sr) data with eight different combination strategies to produce and evaluate land cover maps for a study area in Mongolia. We implemented the experiment on the GEE platform with a widely applied algorithm, the Random Forest (RF) classifier. Our results show that all the eight datasets produced moderately to highly accurate land cover maps, with overall accuracy over 84.31%. Among the eight datasets, two time series datasets of summer scenes (images from 1 June to 30 September) produced the highest accuracy (89.80% and 89.70%), followed by the median composite of the same input images (88.74%). The difference between these three classifications was not significant based on the McNemar test (p > 0.05). However, a significant difference (p < 0.05) was observed for all other pairs involving one of these three datasets. The results indicate that temporal aggregation (e.g., median) is a promising method, which not only significantly reduces data volume (resulting in an easier and faster analysis) but also produces accuracy as high as that of time series data. The spatial consistency among the classification results was relatively low compared to the generally high accuracy, showing that the selection of the dataset used in any classification on GEE is an important and crucial step, because the input images for the composition play an essential role in land cover classification, particularly in snowy, cloudy and expansive areas like Mongolia.
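
A compressed sketch of this workflow with the Earth Engine Python API might look like the following; the region, date window, asset path and label property are assumptions for illustration and do not reproduce the study's Mongolian training data or band selection.

    import ee
    ee.Initialize()

    # Hypothetical region, dates and asset path; band names follow Landsat 8 SR conventions.
    region = ee.Geometry.Rectangle([103.0, 47.0, 108.0, 49.0])
    bands = ['B2', 'B3', 'B4', 'B5', 'B6', 'B7']

    # Median composite of Landsat 8 surface reflectance summer scenes (1 June - 30 September)
    composite = (ee.ImageCollection('LANDSAT/LC08/C01/T1_SR')
                 .filterBounds(region)
                 .filterDate('2019-06-01', '2019-09-30')
                 .median()
                 .select(bands))

    # 'samples' stands for a labelled FeatureCollection of training points (hypothetical asset).
    samples = ee.FeatureCollection('users/example/mongolia_training_points')
    training = composite.sampleRegions(collection=samples, properties=['landcover'], scale=30)

    classifier = ee.Classifier.smileRandomForest(100).train(
        features=training, classProperty='landcover', inputProperties=bands)
    classified = composite.classify(classifier)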

Journal ArticleDOI
TL;DR: The performance of 14 different bagging and boosting based ensembles, including XGBoost, LightGBM and Random Forest, is empirically analyzed in terms of predictive capability and efficiency.

Journal ArticleDOI
TL;DR: The purpose of this study is to optimize the hyperparameters based on a Bayesian optimization algorithm and to obtain a high-accuracy random forest landslide susceptibility evaluation model.
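
As one way to realize this idea, the sketch below tunes a random forest with Optuna, whose default TPE sampler is a Bayesian-style optimizer; the search space, data and metric are illustrative assumptions, not the study's landslide inventory or exact optimization algorithm.

    import optuna
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Synthetic placeholder data; the study works with landslide conditioning factors.
    X, y = make_classification(n_samples=1000, n_features=15, n_informative=6, random_state=0)

    def objective(trial):
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 100, 500),
            "max_depth": trial.suggest_int("max_depth", 3, 20),
            "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
            "max_features": trial.suggest_float("max_features", 0.2, 1.0),
        }
        clf = RandomForestClassifier(random_state=0, **params)
        return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

    study = optuna.create_study(direction="maximize")    # TPE sampler, a Bayesian-style optimizer
    study.optimize(objective, n_trials=30)
    print(study.best_params)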

Journal ArticleDOI
TL;DR: The results suggested that the proposed Feed-Forward Deep Neural Network (FFDNN) wireless IDS system using a Wrapper Based Feature Extraction Unit (WFEU) has greater detection accuracy than other approaches.

Journal ArticleDOI
TL;DR: The experimental results show that the number of relevant and significant features yielded by Information Gain significantly affects the improvement of detection accuracy and execution time.
Abstract: Feature selection (FS) is one of the important tasks of data preprocessing in data analytics. Data with a large number of features increase the computational complexity and require a huge amount of resource usage and time for data analytics. The objective of this study is to analyze the relevant and significant features of huge network traffic to be used to improve the accuracy of traffic anomaly detection and to decrease its execution time. Information Gain is the most commonly used feature selection technique in Intrusion Detection System (IDS) research. This study uses Information Gain, ranking and grouping the features according to minimum weight values, to select relevant and significant features, and then implements Random Forest (RF), Bayes Net (BN), Random Tree (RT), Naive Bayes (NB) and J48 classifier algorithms in experiments on the CICIDS-2017 dataset. The experimental results show that the number of relevant and significant features yielded by Information Gain significantly affects the improvement of detection accuracy and execution time. Specifically, the Random Forest algorithm has the highest accuracy of 99.86% using 22 relevant selected features, whereas the J48 classifier algorithm provides an accuracy of 99.87% using 52 relevant selected features with a longer execution time.
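
Information gain of a feature with respect to the class label is its mutual information, so a compact way to mimic this ranking in Python is scikit-learn's mutual_info_classif, sketched below on synthetic data; the CICIDS-2017 features and the paper's grouping by minimum weight values are not reproduced.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.model_selection import cross_val_score

    # Synthetic placeholder traffic features; the study uses the CICIDS-2017 dataset.
    X, y = make_classification(n_samples=1000, n_features=40, n_informative=8, random_state=0)

    # Information gain of a feature with respect to the class label is its mutual information.
    ig = mutual_info_classif(X, y, random_state=0)
    top = np.argsort(ig)[::-1][:22]                      # keep the 22 highest-ranked features

    acc = cross_val_score(RandomForestClassifier(random_state=0), X[:, top], y, cv=5).mean()
    print(f"RF accuracy on the selected features: {acc:.3f}")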

Journal ArticleDOI
TL;DR: The experimental results demonstrated that the performance of the machine learning classifiers previously mentioned can be effectively improved by integrating the CNN technique and can be recommended for landslide spatial modelling in other prone areas with similar geo-environmental conditions.

Journal ArticleDOI
TL;DR: An ensemble model that integrates multiple machine learning algorithms, including neural network, random forest, and gradient boosting, with a variety of predictor variables, including chemical transport models, is proposed to assess NO2 levels with high accuracy.
Abstract: NO2 is a combustion byproduct that has been associated with multiple adverse health outcomes. To assess NO2 levels with high accuracy, we propose the use of an ensemble model to integrate multiple machine learning algorithms, including neural network, random forest, and gradient boosting, with a variety of predictor variables, including chemical transport models. This NO2 model covers the entire contiguous U.S. with daily predictions on 1-km-level grid cells from 2000 to 2016. The ensemble produced a cross-validated R2 of 0.788 overall, a spatial R2 of 0.844, and a temporal R2 of 0.729. The relationship between daily monitored and predicted NO2 is almost linear. We also estimated the associated monthly uncertainty level for the predictions and address-specific NO2 levels. This NO2 estimation has a very high spatiotemporal resolution and allows the examination of the health effects of NO2 in unmonitored areas. We found the highest NO2 levels along highways and in cities. We also observed that nationwide NO2 levels declined in early years and stagnated after 2007, in contrast to the trend at monitoring sites in urban areas, where the decline continued. Our research indicates that the integration of different predictor variables and fitting algorithms can achieve an improved air pollution modeling framework.
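
A generic way to combine such base learners is stacking, sketched below with scikit-learn; note that the paper's ensemble weights the algorithms differently (and uses far richer spatiotemporal predictors), so this is only an illustrative approximation.

    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, StackingRegressor
    from sklearn.linear_model import RidgeCV
    from sklearn.neural_network import MLPRegressor

    # Synthetic predictors standing in for meteorology, land use and chemical-transport-model outputs.
    X, y = make_regression(n_samples=500, n_features=12, noise=5.0, random_state=0)

    ensemble = StackingRegressor(
        estimators=[("nn", MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)),
                    ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
                    ("gb", GradientBoostingRegressor(random_state=0))],
        final_estimator=RidgeCV(),                        # simple learner to combine the base predictions
    )
    ensemble.fit(X, y)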

Journal ArticleDOI
TL;DR: It is concluded that single-algorithm modelling methods, particularly MaxEnt, are capable of producing distribution maps of comparable accuracy to ensemble methods, and the ease of use, reduced computational time and simplicity of methods like MaxEnt support their use in scenarios where the choice of modelling methods is limited but the need for robust and accurate conservation predictions is urgent.

Journal ArticleDOI
TL;DR: The experimental results show that the proposed AE-IDS (Auto-Encoder Intrusion Detection System) is superior to traditional machine learning based intrusion detection methods in terms of easy training, strong adaptability, and high detection accuracy.

Journal ArticleDOI
TL;DR: Six machine-learning-based IDSs are proposed using the K Nearest Neighbor, Random Forest, Gradient Boosting, Adaboost, Decision Tree, and Linear Discriminant Analysis algorithms to increase the efficiency of the system depending on attack types and to decrease missed intrusions and false alarms.
Abstract: In recent years, due to the extensive use of the Internet, the number of networked computers has been increasing in our daily lives. Weaknesses of servers enable hackers to intrude on computers by using not only known but also new attack types, which are more sophisticated and harder to detect. To protect computers from them, an Intrusion Detection System (IDS), trained with machine learning techniques on a pre-collected dataset, is one of the most preferred protection mechanisms. The commonly used datasets were collected during a limited period in specific networks and generally do not contain up-to-date data. Additionally, they are imbalanced and do not hold sufficient data for all types of attacks. These imbalanced and outdated datasets decrease the efficiency of current IDSs, especially for rarely encountered attack types. In this paper, we propose six machine-learning-based IDSs using the K Nearest Neighbor, Random Forest, Gradient Boosting, Adaboost, Decision Tree, and Linear Discriminant Analysis algorithms. To implement a more realistic IDS, an up-to-date security dataset, CSE-CIC-IDS2018, is used instead of the older and more commonly studied datasets. The selected dataset is also imbalanced. Therefore, to increase the efficiency of the system depending on attack types and to decrease missed intrusions and false alarms, the imbalance ratio is reduced by using a synthetic data generation model called the Synthetic Minority Oversampling TEchnique (SMOTE). Data generation is performed for the minority classes, and their numbers are increased to the average data size via this technique. Experimental results demonstrate that the proposed approach considerably increases the detection rate for rarely encountered intrusions.
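
The SMOTE step can be sketched with the imbalanced-learn package as follows; the synthetic two-class data stands in for CSE-CIC-IDS2018, where oversampling is applied per minority attack class.

    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic imbalanced data standing in for CSE-CIC-IDS2018.
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], n_informative=5, random_state=0)
    print("before:", Counter(y))

    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)   # synthesize new minority-class samples
    print("after:", Counter(y_res))

    RandomForestClassifier(n_estimators=100, random_state=0).fit(X_res, y_res)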

Journal ArticleDOI
TL;DR: An integrated model that combines logistic regression and random forest is proposed by analyzing the misjudgments generated by the established models; it achieves an average accuracy of 99.83% over ten simulation runs.
Abstract: Chronic kidney disease (CKD) is a global health problem with high morbidity and mortality rates, and it induces other diseases. Since there are no obvious symptoms during the early stages of CKD, patients often fail to notice the disease. Early detection of CKD enables patients to receive timely treatment to ameliorate the progression of the disease. Machine learning models can effectively aid clinicians in achieving this goal due to their fast and accurate recognition performance. In this study, we propose a machine learning methodology for diagnosing CKD. The CKD data set was obtained from the University of California Irvine (UCI) machine learning repository, and it contains a large number of missing values. KNN imputation was used to fill in the missing values; for each incomplete sample, it selects several complete samples with the most similar measurements to impute the missing data. Missing values are commonly seen in real-life medical situations because patients may miss some measurements for various reasons. After effectively filling in the incomplete data set, six machine learning algorithms (logistic regression, random forest, support vector machine, k-nearest neighbor, naive Bayes classifier and feed forward neural network) were used to establish models. Among these machine learning models, random forest achieved the best performance with 99.75% diagnosis accuracy. By analyzing the misjudgments generated by the established models, we propose an integrated model that combines logistic regression and random forest by using a perceptron, which achieves an average accuracy of 99.83% over ten simulation runs. Hence, we speculate that this methodology could be applicable to more complicated clinical data for disease diagnosis.
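
The KNN imputation step is available in scikit-learn as KNNImputer; a minimal sketch on placeholder data is shown below, with a plain average standing in for the paper's perceptron-based fusion of the logistic regression and random forest outputs.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import KNNImputer
    from sklearn.linear_model import LogisticRegression

    # Tiny placeholder matrix with missing entries, standing in for the UCI CKD data.
    X = np.array([[1.020, np.nan, 80.0],
                  [1.010, 120.0, np.nan],
                  [1.020, 118.0, 76.0],
                  [1.000, np.nan, 88.0],
                  [1.010, 121.0, 79.0],
                  [1.020, 119.0, 81.0]])
    y = np.array([1, 1, 0, 1, 0, 0])

    # Fill each missing value from the most similar complete samples.
    X_filled = KNNImputer(n_neighbors=2).fit_transform(X)

    rf = RandomForestClassifier(random_state=0).fit(X_filled, y)
    lr = LogisticRegression(max_iter=1000).fit(X_filled, y)
    # The paper fuses the LR and RF outputs with a perceptron; a plain average is shown here.
    combined = (rf.predict_proba(X_filled)[:, 1] + lr.predict_proba(X_filled)[:, 1]) / 2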

Journal ArticleDOI
TL;DR: A disease is an unusual condition that affects one or more parts of the human body; because of lifestyle and hereditary factors, different kinds of disease are increasing day by day.
Abstract: A disease is an unusual condition that affects one or more parts of the human body. Because of lifestyle and hereditary factors, different kinds of disease are increasing day by day. Among all those ...

Journal ArticleDOI
TL;DR: A novel internet of health things (IoHT)-driven deep learning framework for the detection and classification of cervical cancer in Pap smear images using the concept of transfer learning is proposed.
Abstract: Cervical cancer is one of the fastest growing global health problems and a leading cause of mortality among women in developing countries. Automated Pap smear cell recognition and classification in the early stage of cell development is crucial for effective disease diagnosis and immediate treatment. Thus, in this article, we propose a novel internet of health things (IoHT)-driven deep learning framework for the detection and classification of cervical cancer in Pap smear images using the concept of transfer learning. Following transfer learning, a convolutional neural network (CNN) was combined with different conventional machine learning techniques like K nearest neighbor, naive Bayes, logistic regression, random forest and support vector machines. In the proposed framework, feature extraction from cervical images is performed using pre-trained CNN models like InceptionV3, VGG19, SqueezeNet and ResNet50, which are fed into flattened and dense layers for normal and abnormal cervical cell classification. The performance of the proposed IoHT framework is evaluated using the standard Pap smear Herlev dataset. The proposed approach was validated by analyzing precision, recall, F1-score, training–testing time and support parameters. The obtained results conclude that the CNN pre-trained model ResNet50 achieved the highest classification rate of 97.89% with the involvement of the random forest classifier for effective and reliable disease detection and classification. The minimum training time and testing time required to train the model were 0.032 s and 0.006 s, respectively.
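
A stripped-down sketch of the ResNet50-plus-random-forest path is shown below, assuming Keras and 224x224 RGB inputs; the Herlev images, preprocessing pipeline and the flattened/dense layers of the full framework are not reproduced.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

    # Pre-trained CNN used as a fixed feature extractor (assumes 224x224 RGB cell images).
    backbone = ResNet50(weights="imagenet", include_top=False, pooling="avg")

    def extract_features(images):
        """images: float array of shape (n, 224, 224, 3) with values in [0, 255]."""
        return backbone.predict(preprocess_input(images.copy()), verbose=0)

    # Random arrays standing in for labelled Herlev Pap smear images.
    X_img = np.random.rand(8, 224, 224, 3) * 255
    y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(extract_features(X_img), y)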