Showing papers on "AdaBoost" published in 2021


Journal ArticleDOI
TL;DR: In this article, the authors proposed a model that incorporates different methods to achieve effective prediction of heart disease, which used efficient Data Collection, Data Pre-processing and Data Transformation methods to create accurate information for the training model.
Abstract: Cardiovascular diseases (CVD) are among the most common serious illnesses affecting human health. CVDs may be prevented or mitigated by early diagnosis, and this may reduce mortality rates. Identifying risk factors using machine learning models is a promising approach. We would like to propose a model that incorporates different methods to achieve effective prediction of heart disease. For our proposed model to be successful, we have used efficient Data Collection, Data Pre-processing and Data Transformation methods to create accurate information for the training model. We have used a combined dataset (Cleveland, Long Beach VA, Switzerland, Hungarian and Stat log). Suitable features are selected by using the Relief, and Least Absolute Shrinkage and Selection Operator (LASSO) techniques. New hybrid classifiers like Decision Tree Bagging Method (DTBM), Random Forest Bagging Method (RFBM), K-Nearest Neighbors Bagging Method (KNNBM), AdaBoost Boosting Method (ABBM), and Gradient Boosting Boosting Method (GBBM) are developed by integrating the traditional classifiers with bagging and boosting methods, which are used in the training process. We have also instrumented some machine learning algorithms to calculate the Accuracy (ACC), Sensitivity (SEN), Error Rate, Precision (PRE) and F1 Score (F1) of our model, along with the Negative Predictive Value (NPR), False Positive Rate (FPR), and False Negative Rate (FNR). The results are shown separately to provide comparisons. Based on the result analysis, we can conclude that our proposed model produced the highest accuracy while using RFBM and Relief feature selection methods (99.05%).

169 citations
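
The evaluation measures named in the abstract above can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch, not the authors' code: the function name `binary_metrics` and the example counts are hypothetical, and it assumes every denominator is non-zero.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return the measures listed in the abstract from raw confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # recall, i.e. true positive rate
    precision = tp / (tp + fp)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "npv": tn / (tn + fn),  # negative predictive value
        "fpr": fp / (fp + tn),  # false positive rate
        "fnr": fn / (fn + tp),  # false negative rate
    }


# Hypothetical counts, chosen only for illustration.
m = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

Note that the false negative rate and sensitivity always sum to 1, which is a quick sanity check on reported figures.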


Journal ArticleDOI
TL;DR: In this paper, the authors analyzed heart failure survivors in a dataset of 299 hospitalized patients and identified significant features and effective data mining techniques that can boost the accuracy of survival prediction for cardiovascular patients.
Abstract: Cardiovascular disease is a substantial cause of mortality and morbidity in the world. In clinical data analytics, it is a great challenge to predict the survival of heart disease patients. Data mining transforms huge amounts of raw data generated by the health industry into useful information that can help in making informed decisions. Various studies proved that significant features play a key role in improving performance of machine learning models. This study analyzes the heart failure survivors from the dataset of 299 patients admitted in hospital. The aim is to find significant features and effective data mining techniques that can boost the accuracy of survival prediction for cardiovascular patients. To predict patient survival, this study employs nine classification models: Decision Tree (DT), Adaptive boosting classifier (AdaBoost), Logistic Regression (LR), Stochastic Gradient classifier (SGD), Random Forest (RF), Gradient Boosting classifier (GBM), Extra Tree Classifier (ETC), Gaussian Naive Bayes classifier (G-NB) and Support Vector Machine (SVM). The class imbalance problem is handled by Synthetic Minority Oversampling Technique (SMOTE). Furthermore, machine learning models are trained on the highest ranked features selected by RF. The results are compared with those provided by machine learning algorithms using the full set of features. Experimental results demonstrate that ETC outperforms other models and achieves an accuracy of 0.9262 with SMOTE in predicting heart patients' survival.

162 citations
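
The SMOTE step mentioned in the abstract above creates synthetic minority-class samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below illustrates only that core idea in plain Python. It is not the authors' pipeline, and the function name `smote_sketch`, the value `k=2`, and the toy points are assumptions made for illustration.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy two-feature minority class (hypothetical values).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority, n_new=4)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.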


Journal ArticleDOI
01 Jun 2021
TL;DR: The proposed ensemble soft voting classifier gives binary classification and uses the ensemble of three machine learning algorithms viz. random forest, logistic regression, and Naive Bayes for the classification.
Abstract: Diabetes is a dreadful disease identified by escalated levels of glucose in the blood. Machine learning algorithms help in identification and prediction of diabetes at an early stage. The main objective of this study is to predict diabetes mellitus with better accuracy using an ensemble of machine learning algorithms. The Pima Indians Diabetes dataset has been considered for experimentation, which gathers details of patients with and without diabetes. The proposed ensemble soft voting classifier gives binary classification and uses the ensemble of three machine learning algorithms, viz. random forest, logistic regression, and Naive Bayes, for the classification. Empirical evaluation of the proposed methodology has been conducted with state-of-the-art methodologies and base classifiers such as AdaBoost, Logistic Regression, Support Vector Machine, Random Forest, Naive Bayes, Bagging, GradientBoost, XGBoost, and CatBoost by taking accuracy, precision, recall, and F1-score as the evaluation criteria. The proposed ensemble approach gives the highest accuracy, precision, recall, and F1-score values with 79.04%, 73.48%, 71.45% and 80.6% respectively on the PIMA diabetes dataset. Further, the efficiency of the proposed methodology has also been compared and analysed with a breast cancer dataset. The proposed ensemble soft voting classifier has given 97.02% accuracy on the breast cancer dataset.

141 citations
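
Soft voting, as used by the ensemble in the abstract above, averages the class probabilities predicted by each base model and picks the class with the highest average. The sketch below shows that rule in isolation. The three probability vectors are invented for illustration and do not come from the paper, and `soft_vote` is a hypothetical helper name.

```python
def soft_vote(prob_lists):
    """Average per-class probabilities across models and pick the top class."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    label = max(range(n_classes), key=avg.__getitem__)
    return label, avg


# Hypothetical [P(no diabetes), P(diabetes)] outputs for one patient.
rf_prob = [0.45, 0.55]  # random forest
lr_prob = [0.45, 0.55]  # logistic regression
nb_prob = [0.90, 0.10]  # naive Bayes
label, avg = soft_vote([rf_prob, lr_prob, nb_prob])
```

In this example two of the three models lean towards class 1, so a hard majority vote would return class 1, while soft voting returns class 0 because the third model is far more confident. That sensitivity to confidence is the design difference between the two voting schemes.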


Journal ArticleDOI
01 Jan 2021
TL;DR: CART, along with RS or QT, outperforms all other ML algorithms with 100% accuracy, 100% precision, 99% recall, and 100% F1 score, and the study outcomes demonstrate that the model’s performance varies depending on the data scaling method.
Abstract: Heart disease, one of the main reasons behind the high mortality rate around the world, requires a sophisticated and expensive diagnosis process. In the recent past, much literature has demonstrated machine learning approaches as an opportunity to efficiently diagnose heart disease patients. However, challenges associated with datasets such as missing data, inconsistent data, and mixed data (containing inconsistent missing data both as numerical and categorical) are often obstacles in medical diagnosis. This inconsistency led to a higher probability of misprediction and a misled result. Data preprocessing steps like feature reduction, data conversion, and data scaling are employed to form a standard dataset—such measures play a crucial role in reducing inaccuracy in final prediction. This paper aims to evaluate eleven machine learning (ML) algorithms—Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Naive Bayes (NB), Support Vector Machine (SVM), XGBoost (XGB), Random Forest Classifier (RF), Gradient Boost (GB), AdaBoost (AB), Extra Tree Classifier (ET)—and six different data scaling methods—Normalization (NR), Standscale (SS), MinMax (MM), MaxAbs (MA), Robust Scaler (RS), and Quantile Transformer (QT) on a dataset comprising of information of patients with heart disease. The result shows that CART, along with RS or QT, outperforms all other ML algorithms with 100% accuracy, 100% precision, 99% recall, and 100% F1 score. The study outcomes demonstrate that the model’s performance varies depending on the data scaling method.

128 citations
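
Four of the six scaling methods compared in the abstract above have simple closed forms, sketched here for a single feature column in plain Python. This is an illustrative sketch rather than the paper's implementation: the function names and the sample values are assumptions, and Normalization and the Quantile Transformer are omitted for brevity.

```python
import statistics


def min_max(xs):
    """Map values linearly onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def standard(xs):
    """Centre on the mean and divide by the population standard deviation."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def max_abs(xs):
    """Divide by the largest absolute value, so results lie in [-1, 1]."""
    m = max(abs(x) for x in xs)
    return [x / m for x in xs]


def robust(xs):
    """Centre on the median and divide by the interquartile range."""
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - med) / (q3 - q1) for x in xs]


# Hypothetical feature column containing one outlier.
xs = [1.0, 2.0, 3.0, 4.0, 100.0]
robust_scaled = robust(xs)
```

With the outlier present, min-max scaling squeezes the first four values into a narrow band near 0, whereas the robust variant keeps them evenly spaced. This is one way the choice of scaler can change what a downstream model sees.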


Journal ArticleDOI
TL;DR: Groundwater potential maps predicted in this study can help water resources managers and policymakers in the fields of watershed and aquifer management to preserve an optimal exploit from this important freshwater.
Abstract: Due to the rapidly increasing demand for groundwater, as one of the principal freshwater resources, there is an urge to advance novel prediction systems to more accurately estimate the groundwater potential for an informed groundwater resource management. Ensemble machine learning methods are generally reported to produce more accurate results. However, proposing the novel ensemble models along with running comparative studies for performance evaluation of these models would be equally essential to precisely identify the suitable methods. Thus, the current study is designed to provide knowledge on the performance of the four ensemble models i.e., Boosted generalized additive model (GamBoost), adaptive Boosting classification trees (AdaBoost), Bagged classification and regression trees (Bagged CART), and random forest (RF). To build the models, 339 groundwater resources’ locations and the spatial groundwater potential conditioning factors were used. Thereafter, the recursive feature elimination (RFE) method was applied to identify the key features. The RFE specified that the best number of features for groundwater potential modeling was 12 variables among 15 (with a mean Accuracy of about 0.84). The modeling results indicated that the Bagging models (i.e., RF and Bagged CART) had a higher performance than the Boosting models (i.e., AdaBoost and GamBoost). Overall, the RF model outperformed the other models (with accuracy = 0.86, Kappa = 0.67, Precision = 0.85, and Recall = 0.91). Also, the topographic position index’s predictive variables, valley depth, drainage density, elevation, and distance from stream had the highest contribution in the modeling process. Groundwater potential maps predicted in this study can help water resources managers and policymakers in the fields of watershed and aquifer management to preserve an optimal exploit from this important freshwater.

110 citations


Journal ArticleDOI
TL;DR: It is found that the most robust models for limited medical data are AB and NB, followed by SVM, and then RF and NN, while the least robust model is DT.
Abstract: Dataset size is considered a major concern in the medical domain, where lack of data is a common occurrence. This study aims to investigate the impact of dataset size on the overall performance of supervised classification models. We examined the performance of six widely-used models in the medical field, including support vector machine (SVM), neural networks (NN), C4.5 decision tree (DT), random forest (RF), adaboost (AB), and naive Bayes (NB) on eighteen small medical UCI datasets. We further implemented three dataset size reduction scenarios on two large datasets and analyzed the performance of the models when trained on each resulting dataset with respect to accuracy, precision, recall, f-score, specificity, and area under the ROC curve (AUC). Our results indicated that the overall performance of classifiers depends on how well a dataset represents the original distribution rather than on its size. Moreover, we found that the most robust models for limited medical data are AB and NB, followed by SVM, and then RF and NN, while the least robust model is DT. Furthermore, an interesting observation is that a machine learning model's robustness to limited data does not necessarily imply that it provides the best performance compared to other models.

104 citations


Journal ArticleDOI
TL;DR: In this paper, a convolutional neural network was developed focusing on the simplicity of the model to extract deep and high-level features from X-ray images of patients infected with COVID-19.

96 citations


Journal ArticleDOI
TL;DR: Findings reveal that the proposed Jaya-XGBoost emerged as the most reliable model in contrast to other machine learning models and traditional empirical models.

85 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed a six-phase methodology that includes data pre-processing and feature analysis, feature selection using the gravitational search algorithm, and a split of the data into train and test sets in the ratio of 80% and 20% respectively.
Abstract: Customer churn prediction (CCP) is one of the challenging problems in the telecom industry. With the advancement in the field of machine learning and artificial intelligence, the possibilities to predict customer churn have increased significantly. Our proposed methodology consists of six phases. In the first two phases, data pre-processing and feature analysis are performed. In the third phase, feature selection is carried out using the gravitational search algorithm. Next, the data is split into train and test sets in the ratio of 80% and 20% respectively. In the prediction process, the most popular predictive models are applied to the train set, namely logistic regression, naive Bayes, support vector machine, random forest, decision trees, etc., and boosting and ensemble techniques are applied to see the effect on the accuracy of the models. In addition, K-fold cross validation is used over the train set for hyperparameter tuning and to prevent overfitting of models. Finally, the results obtained on the test set are evaluated using the confusion matrix and AUC curve. It was found that the Adaboost and XGBoost classifiers give the highest accuracies of 81.71% and 80.8% respectively. The highest AUC score of 84% is achieved by both the Adaboost and XGBoost classifiers, which outperform the others.

81 citations
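
The 80/20 hold-out split followed by K-fold cross-validation on the training portion, as described in the abstract above, can be sketched with index lists alone. The abstract does not state the value of K, so `k=5`, the seed, the sample count of 100, and both function names are assumptions made for illustration.

```python
import random


def train_test_split_indices(n, test_ratio=0.2, seed=0):
    """Shuffle 0..n-1 and return (train, test) index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_ratio))
    return idx[n_test:], idx[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists; each index is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        yield train, val


train_idx, test_idx = train_test_split_indices(100)
splits = list(k_fold(train_idx, k=5))
```

The test indices never enter any fold, so hyperparameters tuned on the folds are evaluated on data that played no part in the tuning.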


Journal ArticleDOI
TL;DR: Ten different feature encoding schemes were explored, with the goal of capturing key characteristics around 6mA sites and Meta-i6mA was proposed that combined the baseline models using the meta-predictor approach and outperformed the existing predictors.
Abstract: DNA N6-methyladenine (6mA) represents important epigenetic modifications, which are responsible for various cellular processes. The accurate identification of 6mA sites is one of the challenging tasks in genome analysis, which leads to an understanding of their biological functions. To date, several species-specific machine learning (ML)-based models have been proposed, but majority of them did not test their model to other species. Hence, their practical application to other plant species is quite limited. In this study, we explored 10 different feature encoding schemes, with the goal of capturing key characteristics around 6mA sites. We selected five feature encoding schemes based on physicochemical and position-specific information that possesses high discriminative capability. The resultant feature sets were inputted to six commonly used ML methods (random forest, support vector machine, extremely randomized tree, logistic regression, naive Bayes and AdaBoost). The Rosaceae genome was employed to train the above classifiers, which generated 30 baseline models. To integrate their individual strength, Meta-i6mA was proposed that combined the baseline models using the meta-predictor approach. In extensive independent test, Meta-i6mA showed high Matthews correlation coefficient values of 0.918, 0.827 and 0.635 on Rosaceae, rice and Arabidopsis thaliana, respectively and outperformed the existing predictors. We anticipate that the Meta-i6mA can be applied across different plant species. Furthermore, we developed an online user-friendly web server, which is available at http://kurata14.bio.kyutech.ac.jp/Meta-i6mA/.

81 citations
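
The Matthews correlation coefficient reported in the abstract above is computed from the four confusion-matrix counts and ranges from -1 to 1. The sketch below uses the standard definition. The counts are hypothetical, and returning 0.0 for a zero denominator is a common convention adopted here as an assumption.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention used in this sketch when a row or column is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, not taken from the paper.
score = mcc(tp=45, fp=5, tn=40, fn=10)
```

Unlike accuracy, the coefficient uses all four cells symmetrically, so a predictor that labels everything as the majority class scores 0 rather than a misleadingly high value.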


Journal ArticleDOI
TL;DR: The experimental test results indicated that the proposed deep learning ensemble model was generally more competitive when addressing imbalanced credit risk evaluation problems than other models.

Journal ArticleDOI
TL;DR: In this paper, the authors have proposed a hybrid decision support system that can assist in the early detection of heart disease based on the clinical parameters of the patient using multivariate imputation by chained equations algorithm to handle the missing values.
Abstract: Detection of heart disease through early-stage symptoms is a great challenge in the current world scenario. If not diagnosed timely then this may become the cause of death. In developing countries where heart specialist doctors are not available in remote, semi-urban, and rural areas; an accurate decision support system can play a vital role in early-stage detection of heart disease. In this paper, the authors have proposed a hybrid decision support system that can assist in the early detection of heart disease based on the clinical parameters of the patient. Authors have used multivariate imputation by chained equations algorithm to handle the missing values. A hybridized feature selection algorithm combining the Genetic Algorithm (GA) and recursive feature elimination has been used for the selection of suitable features from the available dataset. Further for pre-processing of data, SMOTE (Synthetic Minority Oversampling Technique) and standard scalar methods have been used. In the last step of the development of the proposed hybrid system, authors have used support vector machine, naive bayes, logistic regression, random forest, and adaboost classifiers. It has been found that the system has given the most accurate results with random forest classifier. The proposed hybrid system was tested in the simulation environment developed using Python. It was tested on the Cleveland heart disease dataset available at UCI (University of California, Irvine) machine learning repository. It has achieved an accuracy of 86.6%, which is superior to some of the existing heart disease prediction systems found in the literature.

Journal ArticleDOI
TL;DR: The proposed motor fault diagnosis method using attention mechanism and improved AdaBoost driven by multi-sensor information can enhance the robustness, generalization ability and accuracy of fault diagnosis.

Journal ArticleDOI
TL;DR: Enhanced Adaboost models were used to classify soil types based on tree algorithm models that are less commonly used in this area, in order to increase the accuracy and reduce the cost of projects.
Abstract: This research focuses on presenting new models based on classifiers that can be applied to various problems. Adaboost is a type of ensemble learning machine that uses classifiers that contain a range of base models. This study used enhanced Adaboost models to classify soil types based on tree algorithm models that are less commonly used in this area. Determining the type of soil in different geotechnical projects is very important. Using soil classification, soil properties such as mechanical properties and performance against static and dynamic loads can be found. Regarding the importance of the subject, 440 samples from an actual project were used to design this new methodology. The dataset included clay content, moisture content, specific gravity, void ratio, plastic limit, and liquid limit parameters to determine the type of soil classification. These samples were tested with high precision and the actual type of classification was obtained. For comparison, two models, an enhanced tree and a neural network, were designed and developed according to these conditions. The results of this classification were presented for different soil samples. The developed Adaboost model showed that it could classify the soil well. This model showed that only 11 samples were not correctly identified among the total data (88 data). Therefore, this new technique can be used to increase the accuracy and reduce the cost of projects.

Journal ArticleDOI
TL;DR: It is demonstrated that the AdaBoost-LSTM and the AdaBoost-GRU models outperform the benchmarking models as expected, and the empirical results show that the GRU is superior to all models studied in this research.

Journal ArticleDOI
TL;DR: Three ensemble learning models are developed and their results compared: gradient boosted regression trees, random forests, and an adaptation of Adaboost. Results show that the adapted Adaboost model outperforms the reference models for hour-ahead load forecasting.

Journal ArticleDOI
TL;DR: This paper proposes a method to improve the AdaBoost algorithm using new weighted vote parameters for the weak classifiers, based on the global error rate and the classification accuracy rate of the positive class, which is the primary interest.
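
For context, the sketch below shows one round of standard discrete AdaBoost, in which a weak classifier's vote weight depends only on its weighted (global) error rate. It is the textbook baseline, not the modified vote parameters the paper proposes, which are not given in the summary above. The function name `adaboost_round` and the toy labels are assumptions, and the code assumes the weights sum to 1 and the error lies strictly between 0 and 1.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One round of discrete AdaBoost with labels in {-1, +1}.

    Returns the weak classifier's vote weight and the renormalised sample weights.
    """
    # Weighted error: total weight of the misclassified samples.
    err = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (y * h == -1) are up-weighted, correct ones down-weighted.
    new_w = [w * math.exp(-alpha * y * h) for w, y, h in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]


# Four samples with uniform weights; the weak classifier misses the second one.
alpha, weights = adaboost_round([0.25] * 4, [1, 1, -1, -1], [1, -1, -1, -1])
```

After the update the single misclassified sample carries half of the total weight, which forces the next weak classifier to focus on it.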

Journal ArticleDOI
TL;DR: Experiments show that the improved algorithm can obtain the optimal parameter combination of SVR faster and better and can effectively improve the accuracy of traffic flow prediction.
Abstract: With the development of human society, the shortcomings of the existing transportation system become increasingly prominent, so people hope to use advanced technology to achieve intelligent transportation. However, the recognition rate of most methods of detecting video vehicles is too low and the process is complicated. This paper uses machine learning theory to design a variety of pattern classifiers, including Adaboost, SVM, RF, and SVR algorithms, to classify vehicles. Support vector regression (SVR) is a support vector regression algorithm based on the basic principles of support vector machine (SVM) and then generalized to the regression problem. This paper proposes a short-term traffic flow prediction model based on SVR and optimizes SVM parameters to form an improved SVR short-term traffic flow prediction model. It can be obtained from experiments that the classification error rate of support vector regression (SVR) is the lowest (3.22%). According to the prediction of morning and night peak hours, this paper concludes that the MAPE of SVR is reduced by 19.94% and 42.86%, respectively, and the RMSE is reduced by 29.71% and 47.22%, respectively. Experiments show that the improved algorithm can obtain the optimal parameter combination of SVR faster and better and can effectively improve the accuracy of traffic flow prediction. The target tracking pedestrian counting method proposed in this paper has significantly improved the counting accuracy. The calculation of HOG features can be further expanded, such as the selection of neighborhoods when calculating HOG features, and finally a more efficient pedestrian counting framework is implemented.
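
The two error measures quoted in the abstract above, MAPE and RMSE, follow directly from their standard definitions. The sketch below is illustrative only: the traffic counts are invented, and the MAPE helper assumes no actual value is zero.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the units of the data."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical vehicle counts per interval.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
mape_value = mape(actual, predicted)
rmse_value = rmse(actual, predicted)
```

MAPE is scale-free, so it is comparable across peak and off-peak periods, while RMSE penalises large absolute misses more heavily. Reporting both, as the abstract does, covers the two views.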

Journal ArticleDOI
TL;DR: The results demonstrated that machine learning techniques can be successfully applied to predict malaria using patient information and nationality was found to be the most important feature in malaria prediction.

Journal ArticleDOI
TL;DR: An ensemble classification-based methodology for malware detection is proposed, with the best performance achieved by an ensemble of five dense and CNN neural networks, and the ExtraTrees classifier as a meta-learner.
Abstract: The security of information is among the greatest challenges facing organizations and institutions. Cybercrime has risen in frequency and magnitude in recent years, with new ways to steal, change and destroy information or disable information systems appearing every day. Among the types of penetration into the information systems where confidential information is processed is malware. An attacker injects malware into a computer system, after which he has full or partial access to critical information in the information system. This paper proposes an ensemble classification-based methodology for malware detection. The first-stage classification is performed by a stacked ensemble of dense (fully connected) and convolutional neural networks (CNN), while the final stage classification is performed by a meta-learner. For a meta-learner, we explore and compare 14 classifiers. For a baseline comparison, 13 machine learning methods are used: K-Nearest Neighbors, Linear Support Vector Machine (SVM), Radial basis function (RBF) SVM, Random Forest, AdaBoost, Decision Tree, ExtraTrees, Linear Discriminant Analysis, Logistic, Neural Net, Passive Classifier, Ridge Classifier and Stochastic Gradient Descent classifier. We present the results of experiments performed on the Classification of Malware with PE headers (ClaMP) dataset. The best performance is achieved by an ensemble of five dense and CNN neural networks, and the ExtraTrees classifier as a meta-learner.

Journal ArticleDOI
TL;DR: Landslide susceptibility mapping (LSM) is a major area of interest within disaster risk management, which involves planning and decision-making activities.
Abstract: Landslide susceptibility mapping (LSM) is a major area of interest within the field of disaster risk management that involves planning and decision-making activities. Therefore, preparation...

Journal ArticleDOI
TL;DR: The results revealed that the hybrid wavelet models outperformed the stand-alone models, while a significant improvement was also observed in the M5 ensemble models as the highest Nash–Sutcliffe efficiency coefficient values were obtained in M5 hybrid wavelet multi-stage ensemble models for each lead time prediction.
Abstract: In this research, the monthly wind speed time series of Kirsehir was investigated using stand-alone, hybrid and ensemble models. The artificial neural networks, Gaussian process regression, support vector machines and multivariate adaptive regression splines were employed as stand-alone machine learning models, while the discrete wavelet transform was utilized as a pre-processing technique to create hybrid models. Moreover, for the first time in wind speed predictions, we generated a multi-stage ensemble model by using the M5 Model Tree (M5) algorithm to increase the model accuracies. Two major tasks were considered necessary: the first is to obtain the lag times by using autocorrelation functions, and the second is to determine the optimum mother wavelet as well as the decomposition level to reduce the uncertainties in wavelet modeling. The results revealed that the hybrid wavelet models outperformed the stand-alone models, while a significant improvement was also observed in M5 ensemble models as the highest Nash–Sutcliffe efficiency coefficient values were obtained in M5 hybrid wavelet multi-stage ensemble models for each lead time prediction. The findings of the study were assessed with respect to various performance indicators and the Kruskal–Wallis test to indicate whether the results are statistically significant. The proposed multi-stage ensemble framework was also benchmarked against classical tree-based ensembles, such as Random forest, AdaBoost and XGBoost.

Journal ArticleDOI
TL;DR: The present research highlights the potential of AdaBoost and RF models as useful tools that can assist in mortar design and/or optimization, and shows that mapping the development of mortar characteristics can assist in revealing the influence of the different mortar mix parameters on the compressive strength.
Abstract: The application of artificial neural networks in mapping the mechanical characteristics of the cement-based materials is underlined in previous investigations. However, this machine learning technique includes several major deficiencies highlighted in the literature, such as the overfitting problem and the inability to explain the decisions. Hence, the present study investigates the applicability of other common machine learning techniques, i.e., support vector machine, random forest (RF), decision tree, AdaBoost and k-nearest neighbors in mapping the behavior of the compressive strength (CS) of cement-based mortars. To this end, a big experimental database has been compiled based on experimental data available in the literature considering, namely the cement grade, which is an important parameter for the modeling of mortar’s CS. Other important parameters are namely the age, the water-to-binder ratio, the particle size distribution of the sand and the amount of plasticizer. Many models based on the influential factors affecting machine learning techniques have been developed, and their prediction capacities have been assessed using performance indexes. The present research highlights the potential of AdaBoost and RF models as useful tools which can assist in mortar design and/or optimization. In addition, mapping the development of mortar characteristics can assist in revealing the influence of the different mortar mix parameters on the compressive strength.

Journal ArticleDOI
TL;DR: In this article, the authors used various machine learning models, such as SVM, Naive Bayes, K-nearest neighbors, AdaBoost, Gradient boosting, XGBoost, Random Forest, ensembles, and neural networks, to predict SARS-CoV and SARS-CoV-2.
Abstract: Coronavirus is a pandemic that has become a concern for the whole world. This disease has stepped out to its greatest extent and is expanding day by day. Coronavirus, termed as a worldwide disease, has caused more than 8 lakh deaths worldwide. The foremost cause of the spread of coronavirus is SARS-CoV and SARS-CoV-2, which are part of the coronavirus family. Thus, predicting the patients suffering from such pandemic diseases would help to formulate the difference in inaccurate and infeasible time duration. This paper mainly focuses on the prediction of SARS-CoV and SARS-CoV-2 using the B-cells dataset. The paper also proposes different ensemble learning strategies that came out to be beneficial while making predictions. The predictions are made using various machine learning models. The numerous machine learning models, such as SVM, Naive Bayes, K-nearest neighbors, AdaBoost, Gradient boosting, XGBoost, Random forest, ensembles, and neural networks are used in predicting and analyzing the dataset. The most accurate result was obtained using the proposed algorithm with 0.919 AUC score and 87.248% validation accuracy for predicting SARS-CoV and 0.923 AUC and 87.7934% validation accuracy for predicting SARS-CoV-2 virus.

Journal ArticleDOI
TL;DR: The prediction results show that the stacking ensemble classifier has a better performance than individual classifiers, and it shows a more powerful learning and generalisation ability for small and imbalanced samples.
Abstract: Real-time prediction of the rock mass class in front of the tunnel face is essential for the adaptive adjustment of tunnel boring machines (TBMs). During the TBM tunnelling process, a large number of operation data are generated, reflecting the interaction between the TBM system and surrounding rock, and these data can be used to evaluate the rock mass quality. This study proposed a stacking ensemble classifier for the real-time prediction of the rock mass classification using TBM operation data. Based on the Songhua River water conveyance project, a total of 7538 TBM tunnelling cycles and the corresponding rock mass classes are obtained after data preprocessing. Then, through the tree-based feature selection method, 10 key TBM operation parameters are selected, and the mean values of the 10 selected features in the stable phase after removing outliers are calculated as the inputs of classifiers. The preprocessed data are randomly divided into the training set (90%) and test set (10%) using simple random sampling. Besides stacking ensemble classifier, seven individual classifiers are established as the comparison. These classifiers include support vector machine (SVM), k-nearest neighbors (KNN), random forest (RF), gradient boosting decision tree (GBDT), decision tree (DT), logistic regression (LR) and multi-layer perceptron (MLP), where the hyper-parameters of each classifier are optimised using the grid search method. The prediction results show that the stacking ensemble classifier has a better performance than individual classifiers, and it shows a more powerful learning and generalisation ability for small and imbalanced samples. Additionally, a relative balance training set is obtained by the synthetic minority oversampling technique (SMOTE), and the influence of sample imbalance on the prediction performance is discussed.

Journal ArticleDOI
TL;DR: In this paper, the authors developed a benchmark dataset for the resource-deprived Urdu language for sentiment analysis and evaluated various machine and deep learning algorithms for sentiment analysis of Urdu data.
Abstract: Although over 169 million people in the world are familiar with the Urdu language and a large quantity of Urdu data is being generated on different social websites daily, very few research studies and efforts have been completed to build language resources for the Urdu language and examine user sentiments. The primary objective of this study is twofold: (1) develop a benchmark dataset for the resource-deprived Urdu language for sentiment analysis and (2) evaluate various machine and deep learning algorithms for sentiment analysis. To find the best technique, we compare two modes of text representation: the first is count-based, where the text is represented using word n-gram feature vectors, and the second is based on fastText pre-trained word embeddings for Urdu. We consider a set of machine learning classifiers (RF, NB, SVM, AdaBoost, MLP, LR) and deep learning classifiers (1D-CNN and LSTM) to run the experiments for all the feature types. Our study shows that the combination of word n-gram features with LR outperformed other classifiers for the sentiment analysis task, obtaining the highest F1 score of 82.05% using a combination of features.

Journal ArticleDOI
TL;DR: This paper presents ensemble learning algorithms for counterfeit banknote detection, in which AdaBoost and voting ensembles are combined with machine learning algorithms; simulation results show that the AdaBoost and voting ensemble models reach accuracies of up to 100% for counterfeit banknotes.
Abstract: The movement of cash flow transactions, whether through electronic channels or physically, has created openings for the influx of counterfeit banknotes in financial markets. Because this influx is aided by global economic integration and expanding international trade, attention must be directed toward robust techniques for the recognition and detection of counterfeit banknotes. This paper presents ensemble learning algorithms for banknote detection. The AdaBoost and voting ensembles are deployed in combination with machine learning algorithms. The ensemble methods produce improved detection accuracies. Simulation results show that the ensemble models of AdaBoost and voting provided accuracies of up to 100% for counterfeit banknotes.
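
Since AdaBoost is the common thread of this listing, a minimal sketch of discrete AdaBoost with one-dimensional decision stumps may help make the reweighting idea concrete. It is a textbook-style illustration, not the method of this or any other listed paper; the toy data, the helper names (`fit_stump`, `adaboost`, `predict`) and the choice of three rounds are assumptions.

```python
import math

def fit_stump(xs, ys, w):
    """Return (weighted_error, threshold, polarity) of the best 1-D stump.
    A stump predicts `polarity` when x <= threshold and `-polarity` otherwise."""
    best = None
    for t in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x <= t else -polarity for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    w = [1.0 / n] * n  # start from uniform sample weights
    model = []
    for _ in range(rounds):
        err, t, pol = fit_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this stump
        model.append((alpha, t, pol))
        # Up-weight misclassified points, down-weight the rest, renormalise.
        preds = [pol if x <= t else -pol for x in xs]
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, x):
    """Sign of the alpha-weighted sum of stump votes."""
    score = sum(a * (pol if x <= t else -pol) for a, t, pol in model)
    return 1 if score >= 0 else -1

# Toy labels (+1 / -1) that no single threshold can separate.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, rounds=3)
train_preds = [predict(model, x) for x in xs]
```

On this toy set no single stump fits every point, but the weighted vote of three stumps does, which is the basic argument for boosting weak learners.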

Journal ArticleDOI
TL;DR: In this article, the authors propose an ensemble-based Max Voting Classifier built on top of the three best-performing machine learning classifiers; the proposed model achieves an accuracy score of 99.41%.
Abstract: Healthcare systems around the world are facing huge challenges in responding to the rise of chronic diseases. The objective of our research study is the adaptation of data science and its approaches for the prediction of various diseases in their early stages. In this study we review recently proposed approaches, their limitations, and possible solutions for future work. This study also shows the importance of finding significant features that improve on the results of existing methodologies. This work aims to build classification models such as Naive Bayes, Logistic Regression, k-Nearest Neighbor, Support Vector Machine, Decision Tree, Random Forest, Artificial Neural Network, AdaBoost, XGBoost and Gradient Boosting. The experimental study chooses groups of features by means of three feature selection approaches: correlation-based selection, information-gain-based selection and sequential feature selection. Various machine learning classifiers are applied to these feature subsets, and the best feature subset is selected based on their performance. Finally, an ensemble-based Max Voting Classifier is proposed on top of the three best-performing models. The proposed model produces enhanced performance, with an accuracy score of 99.41%.
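
Max voting over three base models, as described above, amounts to taking the most common predicted label per sample. A minimal sketch follows; the three prediction lists are invented for illustration and `max_vote` is a hypothetical helper name, not something taken from the paper.

```python
from collections import Counter

def max_vote(predictions_per_model):
    """Hard (majority) voting: return the most common label for each sample."""
    voted = []
    for sample_preds in zip(*predictions_per_model):
        voted.append(Counter(sample_preds).most_common(1)[0][0])
    return voted

# Made-up binary predictions from three hypothetical base classifiers.
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
ensemble = max_vote([model_a, model_b, model_c])
```

With an odd number of binary voters there are no ties; for multi-class problems or an even number of models, a tie-breaking rule would have to be chosen explicitly.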

Journal ArticleDOI
TL;DR: In this article, a semantic model with term frequency and inverse document frequency weighting for data representation was used to predict the sentiment class of each fake news item on COVID-19.
Abstract: A vast amount of data is generated every second through microblogs, content sharing via social media sites, and social networking. Twitter is a popular microblogging platform where people voice their opinions about daily issues. Analyzing these opinions is the primary concern of sentiment analysis (SA), or opinion mining. Efficiently capturing, gathering, and analyzing sentiments has been challenging for researchers. To deal with these challenges, in this research work we propose a highly accurate approach for SA of fake news on COVID-19. The dataset contains fake news on COVID-19; we started with data preprocessing (replacing missing values, noise removal, tokenization, and stemming). We applied a semantic model with term frequency and inverse document frequency weighting for data representation. In the measuring and evaluation step, we applied eight machine learning algorithms (Naive Bayes, AdaBoost, K-nearest neighbors, random forest, logistic regression, decision tree, neural networks, and support vector machine) and four deep learning models (CNN, LSTM, RNN, and GRU). Afterward, based on the results, we built a highly efficient prediction model with Python, trained and evaluated the classification model according to the performance measures (confusion matrix, classification rate, true positive rate...), and then tested the model on a set of unclassified fake news on COVID-19 to predict the sentiment class of each item. The obtained results demonstrate a high accuracy compared to the other models. Finally, a set of recommendations is provided with future directions for this research to help researchers select an efficient sentiment analysis model for Twitter data.
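
The term frequency and inverse document frequency weighting mentioned above can be sketched as follows. This uses one common unsmoothed variant (relative term frequency times `log(N / df)`); the paper does not state which variant it used, so the formula, the function name `tf_idf` and the three toy documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using
    tf = count / doc_length and idf = log(n_docs / document_frequency)."""
    tokenised = [d.lower().split() for d in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenised for term in set(toks))
    weights = []
    for toks in tokenised:
        tf = Counter(toks)
        weights.append(
            {t: (c / len(toks)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights

# Invented mini-corpus, not taken from the paper's dataset.
docs = ["vaccine rumor spreads", "vaccine news today", "rumor debunked today"]
w = tf_idf(docs)
```

Terms that occur in a single document ("spreads") receive a higher weight than terms shared across documents ("vaccine"), which is what makes the representation useful for classification.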

Journal ArticleDOI
01 Jan 2021
TL;DR: The proposed offline Computer-Aided Diagnosis (CAD) system for glaucoma diagnosis using retinal fundus images is developed using image processing, deep learning and machine learning approaches and proves to be robust and more accurate.
Abstract: Background and objective: Glaucoma is a neuro-degenerative eye disease developed due to an increase in the Intra-ocular Pressure inside the retina. Being the second largest cause of blindness worldwide, it can lead to complete blindness if an early diagnosis does not take place. Given this underlying issue, there is an immense need for a system that can work effectively without excessive equipment or skilled medical practitioners and that is also less time consuming. Methods: This work proposes an offline Computer-Aided Diagnosis (CAD) system for glaucoma diagnosis using retinal fundus images. This application is developed using image processing, deep learning and machine learning approaches. The Le-Net architecture is used for input image validation, and Region of Interest (ROI) detection is done using the brightest spot algorithm. Further, the optic disc and optic cup segmentation is performed with the help of the U-Net architecture, and classification is done using SVM, Neural Network and Adaboost classifiers. Results: An accuracy of 99% is achieved using Le-Net for input image validation. With the brightest spot algorithm, an accuracy of 98.67% is achieved for ROI extraction. Further, dice coefficients of 0.93 and 0.87 were attained for the segmentation of the optic disc and optic cup respectively using the U-Net architecture. Using SVM, Neural Network and Adaboost classifiers, the proposed methodology managed to achieve a classification accuracy, recall, specificity and sensitivity of 100%, thus proving the system to be reliable and promising. Conclusions: The proposed desktop application is easy to use and can play a major role in early glaucoma detection. The modular design of the CAD system is made up of a set of standalone components that can be used for a variety of different tasks for detecting and classifying glaucoma. Because the model is trained on a variety of different datasets, the system proves to be robust and more accurate.
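
The segmentation results above are reported as dice coefficients. For reference, the standard definition, Dice = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a ground-truth mask B, can be computed as in the sketch below. The tiny masks are invented and `dice_coefficient` is an illustrative helper, not code from the paper.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks given as nested lists."""
    pred = [v for row in pred_mask for v in row]
    true = [v for row in true_mask for v in row]
    intersection = sum(p and t for p, t in zip(pred, true))
    total = sum(pred) + sum(true)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Toy 2x3 binary masks (1 = pixel belongs to the segmented region).
pred = [[0, 1, 1],
        [0, 1, 0]]
true = [[0, 1, 0],
        [0, 1, 1]]
score = dice_coefficient(pred, true)
```

Here two of the three foreground pixels overlap, so the score is 2·2 / (3 + 3) = 2/3; a value of 1.0 means the masks coincide exactly.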