Journal ArticleDOI

Data Dimensionality Reduction Using the Wrapper Sequential Feature Selection Method to Improve Naïve Bayes Performance on Medical Datasets

TL;DR: This article investigates the performance of the Naïve Bayes algorithm when the Wrapper Sequential Feature Selection (WSFS) method is applied.
Abstract: The use of machine learning as a supporting tool in medical care is growing rapidly. One medical condition addressed with computational algorithms is cardiovascular disease (CVD). The machine learning models applied are based on medical-record datasets. The aim of this study is to investigate the performance of the Naïve Bayes algorithm when the Wrapper Sequential Feature Selection (WSFS) method is applied. The research method proceeds from dataset collection, data preprocessing, and application of the Naïve Bayes model, through attribute scoring using Wrapper SFS, to performance validation using 10-Fold Cross-Validation. The historical data used is the Heart Failure Clinical Records dataset, consisting of 299 instances with 13 features. The results show that the Wrapper SFS method improves the performance of the Naïve Bayes algorithm in terms of accuracy, precision, and recall. The performance gain was obtained with a combination of six features ('anaemia', 'diabetes', 'ejection_fraction', 'serum_creatinine', 'gender', 'time') selected by WSFS: accuracy increased by 6.334%, recall by 11.333%, and precision by 20.07% compared with the plain Naïve Bayes algorithm.
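The pipeline the abstract describes (Naïve Bayes, wrapper sequential feature selection, 10-fold cross-validation) can be sketched with scikit-learn. This is a minimal illustration on synthetic data standing in for the Heart Failure Clinical Records dataset (299 instances, 13 features); the exact scoring and preprocessing of the paper are not reproduced.

```python
# Sketch of the paper's pipeline: Gaussian Naive Bayes wrapped in
# Sequential Feature Selection, scored by 10-fold cross-validation.
# Synthetic data is used as a stand-in for the Heart Failure
# Clinical Records dataset described in the abstract.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=299, n_features=13,
                           n_informative=6, random_state=0)

nb = GaussianNB()

# Baseline: all 13 features, mean 10-fold CV accuracy.
baseline = cross_val_score(nb, X, y, cv=10).mean()

# Wrapper SFS: greedily add the feature that most improves CV
# accuracy, stopping at 6 features (the subset size in the paper).
sfs = SequentialFeatureSelector(nb, n_features_to_select=6,
                                direction="forward", cv=10)
X_sel = sfs.fit_transform(X, y)
selected = cross_val_score(nb, X_sel, y, cv=10).mean()

print(f"baseline accuracy:        {baseline:.3f}")
print(f"after WSFS (6 features):  {selected:.3f}")
```

On real data the same code applies unchanged once `X` and `y` are loaded from the medical records.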


References
Journal ArticleDOI
TL;DR: Analysis of a dataset of 299 patients with heart failure collected in 2015 shows that serum creatinine and ejection fraction are sufficient to predict survival of heart failure patients from medical records, and that using these two features alone can lead to more accurate predictions than using the original dataset features in its entirety.
Abstract: Cardiovascular diseases kill approximately 17 million people globally every year, and they mainly exhibit as myocardial infarctions and heart failures. Heart failure (HF) occurs when the heart cannot pump enough blood to meet the needs of the body. Available electronic medical records of patients quantify symptoms, body features, and clinical laboratory test values, which can be used to perform biostatistics analysis aimed at highlighting patterns and correlations otherwise undetectable by medical doctors. Machine learning, in particular, can predict patients’ survival from their data and can identify the most important features among those included in their medical records. In this paper, we analyze a dataset of 299 patients with heart failure collected in 2015. We apply several machine learning classifiers both to predict the patients’ survival and to rank the features corresponding to the most important risk factors. We also perform an alternative feature ranking analysis by employing traditional biostatistics tests, and compare these results with those provided by the machine learning algorithms. Since both feature ranking approaches clearly identify serum creatinine and ejection fraction as the two most relevant features, we then build the machine learning survival prediction models on these two factors alone. Our results with these two-feature models show not only that serum creatinine and ejection fraction are sufficient to predict survival of heart failure patients from medical records, but also that using these two features alone can lead to more accurate predictions than using the original dataset features in their entirety. We also carry out an analysis including the follow-up month of each patient: even in this case, serum creatinine and ejection fraction are the most predictive clinical features of the dataset, and are sufficient to predict patients’ survival.
This discovery has the potential to impact on clinical practice, becoming a new supporting tool for physicians when predicting if a heart failure patient will survive or not. Indeed, medical doctors aiming at understanding if a patient will survive after heart failure may focus mainly on serum creatinine and ejection fraction.
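The two-feature idea in this reference (rank features, then train on the top two alone) can be illustrated briefly. This is a sketch on synthetic data, not the 2015 cohort; columns 0 and 1 stand in for serum creatinine and ejection fraction, and a random forest stands in for the paper's classifier suite.

```python
# Sketch: rank features by importance, then compare a model
# trained on all features against one trained on the top two.
# Synthetic data; the two informative columns play the role of
# serum creatinine and ejection fraction in the reference.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=299, n_features=13,
                           n_informative=2, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)

# Feature ranking via impurity-based importances.
rf.fit(X, y)
top_two = np.argsort(rf.feature_importances_)[-2:]

all_feats = cross_val_score(rf, X, y, cv=10).mean()
two_feats = cross_val_score(rf, X[:, top_two], y, cv=10).mean()
print(f"all 13 features: {all_feats:.3f}")
print(f"top 2 features:  {two_feats:.3f}")
```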

287 citations

Journal ArticleDOI
TL;DR: An effort to compile and analyze epidemiological outbreak information on COVID‐19 based on the several open datasets on 2019‐nCoV provided by the Johns Hopkins University, World Health Organization, Chinese Center for Disease Control and Prevention, National Health Commission, and DXY.
Abstract: There is an obvious concern globally regarding the emerging 2019 novel coronavirus (2019-nCoV) as a worldwide public health threat. As the outbreak of COVID-19, caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), progresses within China and beyond, rapidly available epidemiological data are needed to guide strategies for situational awareness and intervention. The recent outbreak of pneumonia in Wuhan, China, caused by SARS-CoV-2, emphasizes the importance of analyzing the epidemiological data of this novel virus and predicting its risks of infecting people all around the globe. In this study, we present an effort to compile and analyze epidemiological outbreak information on COVID-19 based on several open datasets on 2019-nCoV provided by the Johns Hopkins University, World Health Organization, Chinese Center for Disease Control and Prevention, National Health Commission, and DXY. An exploratory data analysis with visualizations has been made to understand the number of different cases reported (confirmed, death, and recovered) in different provinces of China and outside of China. Overall, at the outset of an outbreak like this, it is highly important to readily provide information to begin the evaluation necessary to understand the risks and begin containment activities.

146 citations

13 Apr 2019
TL;DR: A survey of various models based on supervised learning algorithms such as Support Vector Machines (SVM), K-Nearest Neighbour (KNN), Naive Bayes, Decision Trees (DT), Random Forest (RF) and ensemble models are discovered extremely prominent among the researchers.
Abstract: Diseases related to the heart, i.e. cardiovascular diseases (CVDs), have been the main cause of deaths over the last couple of decades and have developed into the most perilous ailment, in India and in the entire world. There is therefore a need for an accurate, feasible, and reliable system to diagnose such maladies in time for proper treatment. Machine learning algorithms and procedures have been implemented on various medical datasets to investigate extensive and complex information. Many researchers have recently been using several methods to support the health care industry and professionals in the diagnosis of heart-related diseases. This paper presents a survey of various models based on such algorithms and techniques and analyzes their performance. Models based on supervised learning algorithms such as Support Vector Machines (SVM), K-Nearest Neighbour (KNN), Naive Bayes, Decision Trees (DT), Random Forest (RF), and ensemble models are found to be extremely prominent among researchers.

103 citations

Journal ArticleDOI
07 Jan 2019
TL;DR: The purpose of this paper is to summarize and organize the current developments in the field into three main classes: PCA-based, Non-negative Matrix Factorization (NMF)-based, and manifold-based supervised dimension reduction methods, as well as provide elaborated discussions on their advantages and disadvantages.
Abstract: Recently, we have witnessed an explosive growth in both the quantity and dimension of data generated, which aggravates the high dimensionality challenge in tasks such as predictive modeling and decision support. Up to now, a large amount of unsupervised dimension reduction methods have been proposed and studied. However, there is no specific review focusing on the supervised dimension reduction problem. Most studies performed classification or regression after unsupervised dimension reduction methods. However, we recognize the following advantages if learning the low-dimensional representation and the classification/regression model simultaneously: high accuracy and effective representation. Considering classification or regression as being the main goal of dimension reduction, the purpose of this paper is to summarize and organize the current developments in the field into three main classes: PCA-based, Non-negative Matrix Factorization (NMF)-based, and manifold-based supervised dimension reduction methods, as well as provide elaborated discussions on their advantages and disadvantages. Moreover, we outline a dozen open problems that can be further explored to advance the development of this topic.
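The survey's core distinction, unsupervised versus supervised dimension reduction, can be shown in a few lines. This sketch contrasts PCA (which ignores labels) with Linear Discriminant Analysis (which uses them) on synthetic data; the specific methods in the survey's three classes are not reproduced here.

```python
# Sketch: unsupervised reduction (PCA) vs. supervised reduction
# (LDA) before a downstream classifier, the distinction the
# survey is organised around. Synthetic data, 50 -> 1 dimensions.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=5, random_state=0)

# PCA picks directions of maximum variance, blind to y.
unsup = make_pipeline(PCA(n_components=1), LogisticRegression())
# LDA picks the direction that best separates the classes.
sup = make_pipeline(LinearDiscriminantAnalysis(n_components=1),
                    LogisticRegression())

pca_acc = cross_val_score(unsup, X, y, cv=5).mean()
lda_acc = cross_val_score(sup, X, y, cv=5).mean()
print(f"PCA -> logistic regression: {pca_acc:.3f}")
print(f"LDA -> logistic regression: {lda_acc:.3f}")
```

Learning the low-dimensional representation with the labels in hand, as the survey argues, typically preserves class-discriminative structure that variance-only criteria can discard.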

88 citations

Journal Article
TL;DR: Simulation results showed that the wrapper methods (sequential forward selection and sequential backward elimination) were better than the filter methods at selecting the correct features.
Abstract: Feature selection has been widely applied in many areas such as classification of spam emails, cancer cells, fraudulent claims, credit risk, text categorisation, and DNA microarray analysis. Classification involves building predictive models to predict the target variable based on several input variables (features). This study compares filter and wrapper feature selection methods to maximise classifier accuracy. Logistic regression was used as the classifier, while the performance of the feature selection methods was assessed by classification accuracy, the Akaike information criterion (AIC), the Bayesian information criterion (BIC), the area under the receiver operating characteristic curve (AUC), and the sensitivity and specificity of the classifier. The simulation study involves generating data for continuous features and one binary dependent variable for different sample sizes. The filter methods used are correlation-based feature selection and information gain, while the wrapper methods are sequential forward selection and sequential backward elimination. The simulation was carried out using R, an open-source programming language. Simulation results showed that the wrapper methods (sequential forward selection and sequential backward elimination) were better than the filter methods at selecting the correct features.
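The filter-versus-wrapper comparison in this reference can be sketched in scikit-learn rather than R. Mutual information stands in for information gain on the filter side, and forward/backward sequential selection wraps a logistic regression, as in the study; the data here is simulated, and the study's AIC/BIC/AUC criteria are reduced to cross-validated accuracy.

```python
# Sketch of a filter-vs-wrapper comparison around a logistic
# regression classifier. Filter: mutual information (a stand-in
# for information gain). Wrappers: forward and backward
# sequential selection. Simulated data, 5 informative features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest,
                                       SequentialFeatureSelector,
                                       mutual_info_classif)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)
clf = LogisticRegression(max_iter=1000)

methods = {
    "filter (mutual info)": SelectKBest(mutual_info_classif, k=5),
    "wrapper (forward)": SequentialFeatureSelector(
        clf, n_features_to_select=5, direction="forward", cv=5),
    "wrapper (backward)": SequentialFeatureSelector(
        clf, n_features_to_select=5, direction="backward", cv=5),
}

results = {}
for name, selector in methods.items():
    pipe = make_pipeline(selector, clf)
    results[name] = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{name}: {results[name]:.3f}")
```

The wrapper scores each candidate subset by refitting the classifier itself, which is why it tends to find subsets better matched to that classifier than a model-agnostic filter, at a much higher computational cost.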

63 citations