International Journal of Computer Applications (0975 – 8887)
Volume 24– No.3, June 2011
Diagnosis of Heart Disease for Diabetic Patients using
Naive Bayes Method
G. Parthiban
Research Scholar,
Dr. MGR Educational
and Research Institute,
Maduravoyal,
Chennai, India.
A. Rajesh
Professor, Dept of CSE
C.Abdul Hakkeem College
of Engineering and
Technology,
Melvishram,
Vellore, India.
S.K.Srivatsa
Sr. Professor, Dept of
E & I,
St.Joseph’s College of
Engineering,
Chennai, India.
ABSTRACT
The objective of this paper is to predict the chances of a diabetic
patient getting heart disease. In this study, we apply the Naïve
Bayes data mining classifier, which builds a prediction model
from a small training set. Data mining is the analysis step of the
Knowledge Discovery in Databases (KDD) process; it involves the
use of techniques to find underlying structures and relationships
in a large database. Diabetes is a set of related diseases in which
the body cannot regulate the amount of sugar, specifically glucose,
in the blood, resulting in hyperglycemia. Diagnosis of diseases
plays a vital role in the medical field. Using attributes from a
diabetic's diagnosis, such as age, sex, blood pressure and blood
sugar, the proposed system predicts the chances of a diabetic
patient getting heart disease.
Keywords: Knowledge Discovery, Data Mining, Diabetes,
Heart disease, Naïve Bayes Method.
1. INTRODUCTION
Knowledge discovery in medical databases is a well-defined
process in which data mining is an essential step. Data mining is
the non-trivial extraction of potentially useful information from
data [1][2]. Thus data mining might have been more appropriately
named "knowledge mining from data" [3]. Diabetes mellitus is a
chronic disease which causes serious health complications
including renal (kidney) failure, heart disease, stroke, and
blindness [4]. People with diabetes either do not produce enough
insulin (type 1 diabetes), cannot use insulin properly (type 2
diabetes), or both. Type 1 diabetes was formerly called Insulin
Dependent Diabetes Mellitus (IDDM) or childhood-onset
diabetes. Type 2 diabetes was formerly referred to as Non-Insulin
Dependent Diabetes Mellitus (NIDDM) or adult-onset diabetes
[5]. Type 1 diabetes is typically recognized in childhood or
adolescence. At least 90% of patients with diabetes have type 2
diabetes, which is typically recognized in adulthood and in which
the body cannot effectively use the insulin it produces [6][13].
The causes of diabetes mellitus are unclear; however, both
hereditary factors (genetic factors passed on in families) and
environmental factors seem to be involved.
The risk factors for type 2 diabetes are being 45 years of age or
older, being overweight, having a parent or sibling with diabetes
(family heredity), having high blood pressure (140/90 or higher),
having abnormal cholesterol levels (HDL 35 or lower;
triglycerides 250 or higher) and acute stress [7]. Over 80 per cent
of people with type 2 diabetes are overweight. The condition is
treated with diet and exercise, and the blood sugar level is
lowered with drugs [8][15]. Research on family history has shown
that people are more at risk if there is a history of diabetes in
close family members. Research on physical inactivity has shown
that people who do not lead an active life are more at risk of
developing type 2 diabetes [9][14].
Diabetes also increases the risk of micro-vascular damage and
macro-vascular complications. People with diabetes are two to
four times more likely to develop cardiovascular diseases. Thus
diabetes is one of the leading causes of death by disease
worldwide. The literature offers several methods for diagnosing
diabetes or heart disease individually. To our knowledge,
however, there is no automated method for diagnosing heart
disease in diabetic patients based on diabetes diagnosis
attributes.
In this paper, we propose a Naïve Bayes based method to
diagnose heart disease for diabetic patients. It should be noted
that the attributes used in our proposed method are those used
for diagnosis of diabetes and are not direct indicators of heart
disease.
2. BACKGROUND
A Naïve Bayes classifier is a simple probabilistic classifier
based on applying Bayes' theorem with strong independence
assumptions. It assumes that, given the class, the presence or
absence of a particular feature is unrelated to the presence or
absence of any other feature [10].
The Naive Bayes algorithm is based on conditional probabilities.
It uses Bayes' theorem, a formula that calculates a probability by
counting the frequency of values and combinations of values in
the historical data. Bayes' Theorem finds the probability of an
event occurring given the probability of another event that has
already occurred. If B represents the dependent event and A
represents the prior event, Bayes' theorem can be stated as
follows.
Prob (B given A) = Prob(A and B)/Prob(A)
To calculate the probability of B given A, the algorithm counts
the number of cases where A and B occur together and divides it
by the total number of cases where A occurs, with or without B.
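The counting rule can be illustrated with a minimal sketch. The records, attribute values and the function name conditional_probability are hypothetical and are not taken from the clinical data set; here A is the family heredity value and B is the high-cholesterol label.

```python
# Hypothetical toy records: (family_heredity, high_cholesterol).
# These values are illustrative only and do not come from the study.
records = [
    ("Father", "Yes"),
    ("Father", "Yes"),
    ("Father", "No"),
    ("No", "No"),
    ("No", "Yes"),
    ("Mother", "No"),
]


def conditional_probability(records, a_value, b_value):
    """Estimate Prob(B = b_value given A = a_value) by counting."""
    # Cases where A occurs, with or without B.
    count_a = sum(1 for a, _ in records if a == a_value)
    # Cases where A and B occur together.
    count_a_and_b = sum(1 for a, b in records if a == a_value and b == b_value)
    if count_a == 0:
        raise ValueError("A never occurs, so the probability is undefined")
    return count_a_and_b / count_a


p = conditional_probability(records, "Father", "Yes")
print(f"Prob(high cholesterol = Yes given heredity = Father) = {p:.3f}")
```

In this toy set, 'Father' occurs in three records and two of them carry the label 'Yes', so the estimate is two thirds.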
An advantage of the Naive Bayes classifier is that it requires
only a small amount of training data to estimate the parameters
(means and variances of the variables) necessary for
classification. Since the variables are assumed to be independent,
only the variances of the variables for each class need to be
determined and not the entire covariance matrix. It can be used
for both binary and multi-class classification problems.
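As a rough illustration of why only per-class means and variances are needed, the following sketch fits a Gaussian Naive Bayes model to numeric attributes. It is not the implementation used in this work; the function names, the small variance-smoothing constant and the sample values are assumptions made for the example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus per-attribute means and variances for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # The 1e-9 term keeps every variance strictly positive
        # (an assumption of this sketch, not part of the paper).
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest log posterior score."""
    def log_score(prior, means, variances):
        score = math.log(prior)
        # Independence assumption: per-attribute log densities simply add up.
        for x, m, v in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return score

    return max(model, key=lambda label: log_score(*model[label]))


# Hypothetical (age, fasting sugar) values, not taken from the study's data.
rows = [(48, 130), (52, 125), (50, 120), (30, 95), (35, 100), (28, 90)]
labels = ["Yes", "Yes", "Yes", "No", "No", "No"]
model = fit_gaussian_nb(rows, labels)
print(predict(model, (49, 128)))
print(predict(model, (31, 96)))
```

With d attributes and c classes, the sketch stores c priors and 2 x d x c numbers, whereas a full covariance model would need on the order of d x d values per class.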
3. EXPERIMENTAL METHODOLOGY
3.1 Data set and used variables
The data set used in this work is a clinical data set collected
from one of the leading diabetic research institutes in Chennai
and contains records of about 500 patients. The clinical data set
specification provides concise, unambiguous definitions for items
related to diabetes.
The diabetes data set is developed to ensure that people with
diabetes have up-to-date records of their risk factors, current
management, treatment target achievements, and the arrangements
and outcomes of regular surveillance for complications, to help
them monitor their care and make informed choices about their
management. It also ensures that when people with diabetes
meet health care professionals, the consultation is fully informed
by comprehensive, up-to-date and accurate information.
The diabetes attributes used in our proposed system and their
descriptions are shown in Table 1.
Table 1 Diabetes attributes used in the experimentation

Attribute             Description
--------------------  ----------------------------------------------
Sex                   A classification of the sex of the person
Age                   Age of the patient
Family Heredity       Previous history (Father / Mother)
Weight                Patient's weight
BP                    Blood pressure
Fasting               Sugar level after fasting
PP                    Post Prandial blood glucose level
A1C                   Glycosylated HbA1c level
                      (last 4 months sugar level)
LP Tot Cholesterol    Total cholesterol level
3.2 Preprocessing and Sampling
Except for the attributes sex and family heredity, all the other
attributes listed in Table 1 have numeric values. The attribute
sex takes on the value 'M' or 'F' to denote male or female
respectively. The attribute family heredity takes on the value
'Father', 'Mother' or 'Both'. If there is no previous diabetes
history for the patient, the attribute is left empty in the source
data.
Since no attribute value should be left empty for the mining
algorithm to work properly, we have used the value 'No' for
patients without any previous diabetes history. Likewise, we
need a categorical attribute on which the data set is to be
classified. The aim of our work is to predict the chances of a
diabetic patient getting heart disease. Hence, we have taken the
'LP Tot Y/N' attribute as the class attribute. Since the
underlying 'LP Tot' attribute is numeric, we have categorized its
values into high cholesterol ('Yes') or low cholesterol ('No').
This categorization is based on the fact that a cholesterol value
of 180 or more is taken to be high cholesterol for Indians.
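The two preprocessing steps described above (replacing an empty family heredity value with 'No' and thresholding total cholesterol at 180) can be sketched as follows. The field names and sample records are hypothetical; the actual preprocessing in this work was carried out on the clinical data set, and its exact tooling is not specified here.

```python
HIGH_CHOLESTEROL_THRESHOLD = 180  # cut-off stated in the text


def preprocess(record):
    """Return a copy with no empty heredity value and a Yes/No class label."""
    cleaned = dict(record)
    # An empty family heredity value means no previous diabetes history.
    if not cleaned.get("family_heredity"):
        cleaned["family_heredity"] = "No"
    # Turn the numeric total cholesterol into the categorical class attribute.
    cholesterol = cleaned.pop("lp_tot")
    cleaned["lp_tot_yn"] = (
        "Yes" if cholesterol >= HIGH_CHOLESTEROL_THRESHOLD else "No"
    )
    return cleaned


# Hypothetical records with made-up field names, not the study's data.
raw = [
    {"sex": "M", "age": 50, "family_heredity": "Father", "lp_tot": 210},
    {"sex": "F", "age": 41, "family_heredity": "", "lp_tot": 165},
]
processed = [preprocess(r) for r in raw]
for row in processed:
    print(row)
```

The sketch copies each record rather than modifying it in place, so the raw values remain available for checking the categorization.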
3.3 Data Analysis
The distribution of the attribute values with respect to the class
attribute 'LP Tot Y/N' is shown in Figure 1.
Figure 1 Attribute value distributions with respect to the class attribute LP Tot Y/N
The blue colored regions in the graphs in Figure 1 denote high
cholesterol values. From the graphs we can see that most of the
diabetic patients with high cholesterol values are in the age
group of 45 – 55, have a body weight in the range of 60 – 71,
have a BP value of 148 or 230, have a Fasting value in the range
of 102 – 135, have a PP value in the range of 88 – 107, and have
an A1C value in the range of 7.7 – 9.6.
3.4 Using Data Mining on the data set
The WEKA ("Waikato Environment for Knowledge Analysis")
tool is used for data mining. Data mining finds valuable
information hidden in large volumes of data. Weka is a
collection of machine learning algorithms for data mining tasks,
written in Java, and it contains tools for data pre-processing,
classification, regression, clustering, association rules, and
visualization [11]. The key features of Weka are that it is open
source and platform independent. It provides many different
algorithms for data mining and machine learning [12]. We have
used the Naïve Bayes method to perform the mining and
classification process. We have used 10-fold cross validation to
minimize any bias in the process and to obtain a more reliable
estimate of the classifier's performance.
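The idea behind 10-fold cross validation can be sketched in a few lines. This is not WEKA's implementation (which, among other things, stratifies the folds); the function names are hypothetical, and a majority-class placeholder stands in for the Naïve Bayes learner so that the example runs on its own.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, labels, train_fn, predict_fn, k=10):
    """Pooled accuracy over k folds; each record is held out exactly once."""
    correct = 0
    for held_out in k_fold_indices(len(rows), k):
        held = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in held]
        model = train_fn(
            [rows[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        correct += sum(predict_fn(model, rows[i]) == labels[i] for i in held_out)
    return correct / len(rows)


# Placeholder classifier (always predicts the majority class); a Naive Bayes
# learner would be plugged in through train_fn and predict_fn instead.
def train_majority(rows, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, row):
    return model


# Hypothetical data: 20 records, 14 labelled "Yes" and 6 labelled "No".
rows = [[i] for i in range(20)]
labels = ["Yes"] * 14 + ["No"] * 6
accuracy = cross_validate(rows, labels, train_majority, predict_majority, k=10)
print(f"10-fold accuracy: {accuracy:.2f}")
```

Because every record is used for testing exactly once and never appears in the training portion of its own fold, the resulting accuracy is less dependent on one particular train/test split.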
4. RESULTS AND DISCUSSION
The results of our experimentation are shown in Figure 2.
Figure 2 Result window of the data mining process
The proposed Naïve Bayes model was able to classify 74% of the
input instances correctly. It exhibited an average precision of
71%, an average recall of 74%, and an average F-measure of
71.2%. The results suggest that the proposed method performs
well compared to other similar methods in the literature, taking
into account the fact that the attributes taken for analysis are
not direct indicators of heart disease.
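For readers who wish to reproduce such summary figures, the following sketch computes accuracy and class-weighted precision, recall and F-measure from a confusion matrix. It assumes the reported averages are weighted by the number of instances in each class, as in WEKA's weighted-average summary row. The confusion matrix counts below are hypothetical and are NOT the counts obtained in this study.

```python
def weighted_metrics(confusion):
    """confusion[actual][predicted] -> accuracy, weighted P, R and F-measure."""
    classes = list(confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    accuracy = sum(confusion[c][c] for c in classes) / total
    precision = recall = f_measure = 0.0
    for c in classes:
        tp = confusion[c][c]
        support = sum(confusion[c].values())  # actual instances of class c
        predicted = sum(confusion[a][c] for a in classes)  # predicted as c
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support / total  # weight each class by its share of the data
        precision += weight * p
        recall += weight * r
        f_measure += weight * f
    return accuracy, precision, recall, f_measure


# Hypothetical counts for illustration only, NOT the study's confusion matrix.
confusion = {
    "Yes": {"Yes": 60, "No": 10},
    "No": {"Yes": 16, "No": 14},
}
accuracy, precision, recall, f_measure = weighted_metrics(confusion)
print(
    f"accuracy={accuracy:.3f} precision={precision:.3f} "
    f"recall={recall:.3f} f_measure={f_measure:.3f}"
)
```

Under this weighting, the averaged recall always equals the overall accuracy, which is consistent with the matching 74% figures reported above.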
5. CONCLUSIONS AND FUTURE ENHANCEMENTS
Applying data mining to medical data is a good method for
examining the existing relationships between variables. With our
proposed approach we have shown that mining helps to retrieve
useful correlations even from attributes which are not direct
indicators of the class we are trying to predict.
In our work we have tried to predict the chances of getting
heart disease using attributes from a diabetic's diagnosis. In
future, this can be extended to predict other types of ailments
which arise from diabetes, such as visual impairment. Further,
the data analysis results can be used in further research on
enhancing the accuracy of the prediction system.
6. ACKNOWLEDGEMENTS
We are grateful to Dr. V. Shesiah, Chairman and Managing
Director of Dr. V. Shesiah Diabetic Research Institute, Chennai,
for providing access to the medical diabetic data and for his
involvement in this domain.
7. REFERENCES
[1] Frawley and Piatetsky-Shapiro, 1996. Knowledge Discovery
in Databases – An Overview. The AAAI/MIT Press, Menlo
Park, CA.
[2] Cios, K. J., Pedrycz, W., Swiniarski, R. W. and Kurgan, L. A.
2007. Data Mining: A Knowledge Discovery Approach,
New York: Springer.
[3] Han, J., Kamber, M. 2006. Data Mining: Concepts and
Techniques, 2nd ed. San Francisco: Morgan Kaufmann.
[4] World Health Organization. Definition and diagnosis of
diabetes mellitus and intermediate hyperglycemia:
http://www.who.int/topics/diabetes_mellitus/en/
[5] Diabetes mellitus, doctor's knowledge in MedicineNet:
http://www.medicinenet.com/diabetes_mellitus/page2.htm#toce.
[6] International Diabetes Federation, "Diabetes Atlas third
edition", IDF 2007.
[7] M. Franciosi and M. Sacco, "Use of the diabetes risk score
and impaired glucose tolerance", Diabetes Care,
Vol. 28, no. 5, pp. 1187, 2005.
[8] Kelling, D. G. and J. A. Wentworth et al., 1997. Diabetes
mellitus. Using a database to implement a systematic
management program. NC. Med. J., 58:368-371.
[9] International Diabetes Federation (IDF),
http://www.idf.org/about-diabetes
[10] Naïve Bayes classifier based on applying Bayes' theorem:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
[11] Weka Data mining software:
http://www.cs.waikato.ac.nz/ml/weka
[12] An Introduction to the WEKA Data mining system:
http://www.cs.ccsu.edu/~markov/weka-tutorial.pdf
[13] Jianchao Han, Juan C. Rodriguez, and Mohsen Beheshti,
2008. Diabetes Data Analysis and Prediction Model
Discovery Using RapidMiner. In Proceedings of the
Second International Conference on Future Generation
Communication and Networking.
[14] Asuncion, A., Newman, D. J. 2007. Pima Indians Diabetes
Data Set, UCI Machine Learning Repository,
http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes,
Irvine, CA: University of California, School of
Information and Computer Science.
[15] Eleni Georga et al., 2009. Data Mining for Blood Glucose
Prediction and Knowledge Discovery in Diabetic Patients:
The METABO Diabetes Modeling and Management
System. In Proceedings of the 31st Annual International
Conference of the IEEE EMBS, Minneapolis, Minnesota,
USA.