
ISSN: 2277-3754
ISO 9001:2008 Certified
International Journal of Engineering and Innovative Technology (IJEIT)
Volume 2, Issue 3, September 2012
Application of Data Mining Methods and
Techniques for Diabetes Diagnosis
K. Rajesh, V. Sangeetha
Abstract-- Medical professionals need a reliable prediction methodology to diagnose diabetes. Data mining is the process of analysing data from different perspectives and summarizing it into useful information. Its main goal is to discover new patterns and to interpret those patterns so that they provide meaningful, useful information to users. Data mining is applied to find useful patterns that support the important tasks of medical diagnosis and treatment. This work aims at mining the relationships in diabetes data for efficient classification. Data mining methods and techniques are explored to identify those best suited to efficiently classifying the diabetes dataset and to mining useful patterns.
Index Terms: Data Mining, Healthcare, Diabetes Research, Clinical Data, Classification, Diabetes Dataset.
I. INTRODUCTION
Diabetes mellitus, or simply diabetes, is a set of related
diseases in which the body cannot regulate the amount of
sugar in the blood [1]. It is a group of metabolic diseases
in which a person has high blood sugar, either because
the body does not produce enough insulin, or because
cells do not respond to the insulin that is produced. This
high blood sugar produces the classical symptoms of
polyuria, polydipsia and polyphagia [2]. There are three
main types of diabetes mellitus (DM). Type 1 DM results
from the body's failure to produce insulin, and presently
requires the person to inject insulin or wear an insulin
pump. This form was previously referred to as "insulin-
dependent diabetes mellitus" (IDDM) or "juvenile
diabetes". Type 2 DM results from insulin resistance, a
condition in which cells fail to use insulin properly,
sometimes combined with an absolute insulin deficiency.
This form was previously referred to as "non-insulin-
dependent diabetes mellitus" (NIDDM) or "adult-onset
diabetes". The third main form, gestational diabetes,
occurs when pregnant women without a previous
diagnosis of diabetes develop a high blood glucose level.
It may precede development of type 2 DM. As of 2000, an
estimated 171 million people globally (2.8% of the
population) suffered from diabetes. Type 2 diabetes
is the most common type worldwide [3]. Figures for the
year 2007 show that the five countries with the largest
numbers of people diagnosed with diabetes were India
(40.9 million), China (38.9 million), US (19.2 million),
Russia (9.6 million), and Germany (7.4 million) [3]. Data
Mining [4] refers to extracting or mining knowledge from
large amounts of data. The aim of data mining is to make
sense of large amounts of mostly unsupervised data, in
some domain. Classification [5] maps data into
predefined groups. It is often referred to as supervised
learning as the classes are determined prior to examining
the data. Classification Algorithms usually require that
the classes be defined based on the data attribute values.
They often describe these classes by looking at the
characteristics of data already known to belong to a class.
Pattern Recognition is a type of classification where an
input pattern is classified into one of the several classes
based on its similarity to these predefined classes.
Knowledge Discovery in Databases (KDD) is the process
of finding useful information and patterns in data which
involves Selection, Pre-processing, Transformation, Data
Mining and Evaluation.
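To make these KDD stages concrete for the diabetes task, a minimal sketch in Python with scikit-learn is given below. It is an illustration only, not the tool chain used in this work; the file name pima_diabetes.csv and its column layout (eight attributes followed by the class label) are assumptions.

```python
# Minimal KDD-style pipeline sketch: selection, pre-processing,
# transformation, data mining (classification) and evaluation.
# Assumes a CSV of the Pima Indians Diabetes data whose last
# column is the 0/1 class label; the file name is illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Selection: load the raw records.
data = pd.read_csv("pima_diabetes.csv")
X, y = data.iloc[:, :-1], data.iloc[:, -1]

# Pre-processing: hold out a test set for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Transformation: scale the continuous attributes.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Data mining: induce a decision tree classifier.
model = DecisionTreeClassifier(criterion="entropy", random_state=42)
model.fit(X_train, y_train)

# Evaluation: report accuracy on unseen records.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```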
II. RELATED WORK
Santi Wulan Purnami et al. [6], in their research work,
used support vector machines for feature selection and
classification of breast cancer, and also emphasized how the
1-norm SVM can be used for feature selection and the smooth
SVM (SSVM) for classification. Two problems are addressed:
the first is to identify the importance of the parameters for
breast cancer; the second is to diagnose breast cancer based
on nine attributes of the Wisconsin breast cancer dataset. To
identify the importance of the parameters, the 1-norm SVM
was applied to the original data. The stronger parameters are as
follows: parameter 1 (Clump Thickness), parameter
3 (Uniformity of Cell Shape), parameter 6 (Bare Nuclei),
parameter 7 (Bland Chromatin), and parameter
9 (Mitoses), while parameter 2 (Uniformity of Cell Size),
parameter 4 (Marginal Adhesion), parameter 5 (Single
Epithelial Cell Size) and parameter 8 (Normal Nucleoli)
are weaker. The obtained training and testing
classification accuracy using 10 fold cross validation
were 97.52% and 97.01% respectively. When one of the
weak parameters was removed, both training and testing
accuracy showed a slight decrease. Pardha Repalli [7]
predicted how likely people in different age groups are to
be affected by diabetes based on their lifestyle activities,
and also identified the factors responsible for an individual
being diabetic. Statistics given by the Centers for Disease
Control state that
26.9% of the population affected by diabetes are people
whose age is greater than 65, 11.8% of all men aged 20
years or older are affected by diabetes and 10.8% of all
women aged 20 years or older are affected by diabetes.
The dataset used for analysis and modeling has 50784
records with 37 variables. They computed a new nominal
variable, age_new, dividing age into three groups (young,
middle and old age); the target variable
diabetes_diag_binary is binary. They found that
34% of the population whose age was below 20 years was
not affected by diabetes. 33.9% of the population whose
age was above 20 and below 45 years was not affected by
diabetes. 26.8% of the population whose age was above
45 years was not diabetic. Joseph L. Breault [8] used the
publicly available Pima Indian diabetic database (PIDD)
from the UC Irvine Machine Learning Repository. He
tested data mining algorithms for their accuracy in
predicting diabetic status from the 8 variables given. Out
of 392 complete cases, guessing that all are non-diabetic
gives an accuracy of 65.1%. Rough sets were applied to
the PIDD as a data mining predictive tool using the
ROSETTA software. The test sets were classified
according to the defaults of the naïve Bayes classifier, and
the 10 accuracies ranged from 69.6% to 85.5% with a mean
of 73.8% and a 95% CI. The accuracy of predicting
diabetic status on the PIDD was 82.6% on the initial
random sample, which exceeds the previously used
machine learning algorithms that ranged from 66-81%.
Using a group of 10 random samples, the mean accuracy
was 73.2%. For G. Parthiban et al. [9], the main
objective of their research paper is to predict the chances
of a diabetic patient getting heart disease. In their study,
they apply the Naïve Bayes data mining classifier
technique, which produces an optimal prediction model
using a minimum training set. They proposed a system
which uses attributes such as age, sex, blood pressure and
blood sugar to predict the chances of a diabetic patient
getting heart disease. They used the Naïve Bayes classifier,
a simple probabilistic classifier based on applying Bayes'
theorem with strong (naïve) independence assumptions.
The data set used in their work was a clinical data set
collected from one of the leading diabetic research
institutes in Chennai and contains
records of about 500 patients. The clinical data set
specification provides concise, unambiguous definition
for items related to diabetes. The WEKA tool was used
for Data mining. They used 10 fold cross validation. They
found most of the diabetic patients with high cholesterol
values are in the age group of 45-55, have a body
weight in the range of 60-71, have a BP value of 148 or
230, have a fasting value in the range of 102-135, have
a PP value in the range of 88-107, and have an A1C
value in the range of 7.7-9.6. Padmaja et al. [10], in their
research, aimed to find out the characteristics that
determine the presence of diabetes and to track the
maximum number of women suffering from diabetes.
They used Data mining functionalities like clustering and
attribute oriented induction techniques to track the
characteristics of the women suffering from diabetes.
Information related to the study was obtained from
National Institute of Diabetes, Digestive and Kidney
Diseases. The results were presented in the form of
clusters. Those clusters denote the concentrations of the
various attributes and the percentage of women suffering
from diabetes. The results were evaluated in five different
clusters and they show that 23% of the women suffering
from diabetes fall in cluster-0, 5% fall in cluster-1, 23%
fall in cluster-2, 8% in cluster-3 and 25% in cluster-4.
The study predicts the state of diabetes i.e., whether it is
in an initial stage or in an advanced stage based on the
characteristic results and also helps in estimating the
maximum number of women suffering from diabetes with
specific characteristics. This can be used effectively in
diagnosis and treatment.
III. PROPOSED SYSTEM
We have applied data mining techniques to classify
Diabetes Clinical data and predict the likelihood of a
patient being affected with Diabetes or not. The training
dataset used for data mining classification was the Pima
Indians Diabetes Database of National Institute of
Diabetes and Digestive and Kidney Diseases from UCI
Machine Learning Repository [11]. The dataset contains
768 record samples, each having 8 attributes. We used
this dataset for our classification exercise, as the data is
complete with no missing values. We applied different
classification techniques to the Pima Indians Diabetes
Database, and the error results obtained are tabulated in
Table III.
IV. PROPOSED SYSTEM DESIGN
A. Dataset Used
The training dataset used for data mining classification
was the Pima Indians Diabetes Database of National
Institute of Diabetes and Digestive and Kidney Diseases.
The dataset contains 768 record samples, each having 8
attributes. We used this dataset for our classification
exercise, as the data is complete. The diagrammatic
representation of the proposed system design is given in
Fig. 1.
Fig 1. Proposed Architecture
The attributes in the dataset are given in Table I.
Feature selection [12] is the technique that is applied to
the dataset to obtain a reduced subset of key attributes to
be used in the classification exercise. Feature Relevance
Analysis was performed on the given dataset to rank the
features in accordance with the relevance to the class
label. There are many different feature selection techniques
available for use. As the dataset consists of continuous
attributes, filtering techniques that are effective
for this type of data have been selected and applied.
filtering techniques and the results obtained are given in
Table II.
Table I: Diabetes Dataset Attributes
S.No  Attribute                                                                   Type
1     Number of times pregnant                                                    Continuous
2     Plasma glucose concentration at 2 hours in an oral glucose tolerance test   Continuous
3     Diastolic blood pressure (mm Hg)                                            Continuous
4     Triceps skin fold thickness (mm)                                            Continuous
5     2-Hour serum insulin (mu U/ml)                                              Continuous
6     Body mass index (kg/m^2)                                                    Continuous
7     Diabetes pedigree function                                                  Continuous
8     Age (years)                                                                 Continuous
9     Class variable (0 or 1)                                                     Discrete
Table II: Filtering Results
Filtering Technique    No. of Attributes Before Filtering
Fisher                 8
Runs                   8
ReliefF                8
Step Disc              8
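The Fisher, Runs, ReliefF and Step Disc filters above are components of the data mining tool used in this work. The following is only a sketch of the same filter-style idea in Python, with an ANOVA F-score (a Fisher-style criterion) and mutual information standing in for those filters; it reuses X, y and data from the earlier sketch.

```python
# Sketch of filter-style feature relevance analysis: score each
# attribute against the class label and rank the attributes.
# These scores are stand-ins for the Fisher / Runs / ReliefF /
# Step Disc filters listed in Table II, not reimplementations.
from sklearn.feature_selection import f_classif, mutual_info_classif

# X, y and data are the Pima attributes, labels and frame loaded earlier.
f_scores, _ = f_classif(X, y)
mi_scores = mutual_info_classif(X, y, random_state=42)

ranked = sorted(zip(data.columns[:-1], f_scores, mi_scores),
                key=lambda row: row[1], reverse=True)
for name, f_val, mi_val in ranked:
    print(f"{name:<55} F={f_val:8.2f}  MI={mi_val:.3f}")
```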
B. Comparison of Classification Algorithms
We applied different classification techniques to the
diabetes dataset, and the error results obtained are
tabulated in Table III below.
Table III: Comparison of Classification Algorithms
S.No   Technique      Error Rate
1      C-RT           0.2148
2      CS-RT          0.2148
3      C4.5           0.0938
4      ID3            0.2279
5      K-NN           0.1966
6      LDA            0.2161
7      Naïve Bayes    0.2461
8      PLS-DA         0.2253
9      SVM            0.2253
10     RND Tree       0.0
Of the above ten classification algorithms, RND Tree gives 100% accuracy, but its rule set is huge and the algorithm suffers from overfitting of the data. C4.5 gives a classification rate of about 91%. Since the C4.5 algorithm is widely used in medical applications, we use C4.5 for the classification.
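The error rates in Table III were obtained with a dedicated data mining tool. As a rough, non-equivalent illustration of such a comparison, the following Python sketch cross-validates several scikit-learn classifiers; a random forest stands in for RND Tree, and components such as C-RT, CS-RT and PLS-DA have no direct counterpart here, so its numbers will not reproduce Table III.

```python
# Sketch: compare several classifiers by 10-fold cross-validated
# error rate, in the spirit of Table III. X, y are the Pima
# attributes and labels loaded in the first sketch.
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

models = {
    "Decision tree (entropy)": DecisionTreeClassifier(criterion="entropy"),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "LDA": LinearDiscriminantAnalysis(),
    "Naive Bayes": GaussianNB(),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Random forest": RandomForestClassifier(n_estimators=100),
}

for name, clf in models.items():
    acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name}: error rate = {1 - acc:.4f}")
```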
C. C4.5 Classification Algorithm
C4.5 is a well-known decision tree induction learning technique that has been successfully and extensively applied to medical data. C4.5 [13][14] is a software extension of the basic ID3 algorithm designed by Quinlan. The numbers of attributes and the error rates obtained in classification using C4.5 are given in Table IV.
Table IV: Feature Relevance Analysis Results
Filtering    No. of Attributes           Error Rate in Classification
Technique    Before       After          Before         After
             Filtering    Filtering      Filtering      Filtering
Fisher       8            6              0.0938         0.1224
Runs         8            2              0.0938         0.1875
ReliefF      8            3              0.0938         0.1576
Step Disc    8            5              0.0938         0.1237
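A minimal sketch of the kind of before/after-filtering comparison summarized in Table IV is given below, again with a scikit-learn filter standing in for the original ones; the subset size of six mirrors the Fisher row and is illustrative only.

```python
# Sketch: cross-validated error of the entropy tree on all eight
# attributes versus a filtered subset, mirroring Table IV's layout.
# X, y are the Pima attributes and labels loaded in the first sketch.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion="entropy", random_state=42)

acc_full = cross_val_score(tree, X, y, cv=10).mean()        # all 8 attributes
reduced = make_pipeline(SelectKBest(f_classif, k=6), tree)  # filtered subset
acc_reduced = cross_val_score(reduced, X, y, cv=10).mean()

print(f"Error before filtering: {1 - acc_full:.4f}")
print(f"Error after filtering:  {1 - acc_reduced:.4f}")
```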
It can be observed that the C4.5 algorithm gives a classification rate of about 91% without feature relevance analysis. However, when the feature relevance techniques are applied, the classification rate decreases to less than 88%. The classification rules obtained by applying the C4.5 algorithm are given below.
Plasma glucose concentration < 127.5000
Body mass index < 26.4500
then Class variable = Tested Negative
Body mass index >= 26.4500
Age < 28.5000
Body mass index < 30.9500
Then Class variable = Tested Negative
Body mass index >= 30.9500
2-Hour serum insulin < 168.5000
Triceps skin fold thickness < 44.5000
Triceps skin fold thickness < 40.5000
Diastolic blood pressure < 53.0000
then Class variable = Tested Positive
Diastolic blood pressure >= 53.0000
Diastolic blood pressure < 79.0000
Plasma glucose concentration < 92.5000
then Class variable = Tested Negative
Plasma glucose concentration >= 92.5000
Body mass index < 33.7000
2-Hour serum insulin < 65.0000 then Class variable =
Tested Positive
2-Hour serum insulin >= 65.0000 then Class variable =
Tested Negative
Body mass index >= 33.7000 then Class variable =
Tested Negative
Diastolic blood pressure >= 79.0000
Plasma glucose concentration < 93.0000 then Class
variable = Tested Negative
Plasma glucose concentration >= 93.0000
Body mass index < 36.5500 then Class variable = Tested
Negative
Body mass index >= 36.5500 then Class variable =
Tested Positive
Triceps skin fold thickness >= 40.5000 then Class
variable = Tested Positive
Triceps skin fold thickness >= 44.5000 then Class
variable = Tested Negative
2-Hour serum insulin >= 168.5000 then Class variable =
Tested Negative
Age >= 28.5000
Plasma glucose concentration < 99.5000
2-Hour serum insulin < 88.0000
2-Hour serum insulin < 21.0000
Number of times pregnant < 3.5000 then Class variable =
Tested Negative
Number of times pregnant >= 3.5000
Triceps skin fold thickness < 20.5000 then Class variable
= Tested Negative
Triceps skin fold thickness >= 20.5000
Diabetes pedigree function < 0.2885 then Class variable =
Tested Negative
Diabetes pedigree function >= 0.2885 then Class variable
= Tested Positive
2-Hour serum insulin >= 21.0000 then Class variable =
Tested Negative
2-Hour serum insulin >= 88.0000 then Class variable =
Tested Positive
Plasma glucose concentration >= 99.5000
Diastolic blood pressure < 91.0000
Diabetes pedigree function < 0.5610
Age < 54.5000
Triceps skin fold thickness < 28.0000
Body mass index < 27.9500 then Class variable = Tested
Positive
Body mass index >= 27.9500
Age < 29.5000 then Class variable = Tested Negative
Age >= 29.5000
Body mass index < 29.6500 then Class variable = Tested
Negative
Body mass index >= 29.6500 then Class variable =
Tested Positive
Triceps skin fold thickness >= 28.0000
Age < 41.0000
Plasma glucose concentration < 122.5000
Plasma glucose concentration < 111.5000 then Class
variable = Tested Negative
Plasma glucose concentration >= 111.5000
Body mass index < 37.0000 then Class variable = Tested
Positive
Body mass index >= 37.0000 then Class variable =
Tested Negative
Plasma glucose concentration >= 122.5000 then Class
variable = Tested Negative
Age >= 41.0000 then Class variable = Tested Negative
Age >= 54.5000 then Class variable = Tested Negative
Diabetes pedigree function >= 0.5610
Number of times pregnant < 6.5000
2-Hour serum insulin < 120.5000
Age < 34.5000 then Class variable = Tested Negative
Age >= 34.5000 then Class variable = Tested Positive
2-Hour serum insulin >= 120.5000 then Class variable =
Tested Positive
Number of times pregnant >= 6.5000 then Class variable
= Tested Positive
Diastolic blood pressure >= 91.0000 then Class variable =
Tested Negative
Plasma glucose concentration >= 127.5000
Body mass index < 29.9500
Body mass index < 23.2000 then Class variable = Tested
Negative
Body mass index >= 23.2000
Age < 60.5000
Plasma glucose concentration < 160.0000
Age < 21.5000 then Class variable = Tested Negative
Age >= 21.5000
2-Hour serum insulin < 132.5000
Triceps skin fold thickness < 28.0000
Number of times pregnant < 1.5000 then Class variable =
Tested Negative
Number of times pregnant >= 1.5000
Number of times pregnant < 3.5000 then Class variable =
Tested Positive
Number of times pregnant >= 3.5000 then Class variable
= Tested Negative
Triceps skin fold thickness >= 28.0000 then Class
variable = Tested Positive
2-Hour serum insulin >= 132.5000 then Class variable =
Tested Negative
Plasma glucose concentration >= 160.0000 then Class
variable = Tested Positive
Age >= 60.5000 then Class variable = Tested Negative
Body mass index >= 29.9500
Diastolic blood pressure < 61.0000 then Class variable =
Tested Positive
Diastolic blood pressure >= 61.0000
Diastolic blood pressure < 96.5000
Plasma glucose concentration < 157.5000
Age < 30.5000
2-Hour serum insulin < 260.0000
Diabetes pedigree function < 0.3315 then Class variable =
Tested Negative
Diabetes pedigree function >= 0.3315
Diabetes pedigree function < 0.3730 then Class variable =
Tested Positive
Diabetes pedigree function >= 0.3730
Triceps skin fold thickness < 28.5000
Diastolic blood pressure < 73.0000 then Class variable =
Tested Positive
Diastolic blood pressure >= 73.0000 then Class variable =
Tested Negative
Triceps skin fold thickness >= 28.5000 then Class
variable = Tested Negative
2-Hour serum insulin >= 260.0000 then Class variable =
Tested Negative
Age >= 30.5000
Triceps skin fold thickness < 45.0000
Diabetes pedigree function < 0.4305
Triceps skin fold thickness < 31.0000
2-Hour serum insulin < 50.0000
Diabetes pedigree function < 0.2265 then Class variable =
Tested Negative
Diabetes pedigree function >= 0.2265 then Class variable
= Tested Positive
2-Hour serum insulin >= 50.0000 then Class variable =
Tested Negative
Triceps skin fold thickness >= 31.0000 then Class
variable = Tested Positive
Diabetes pedigree function >= 0.4305
Age < 44.5000
Plasma glucose concentration < 132.0000 then Class
variable = Tested Positive
Plasma glucose concentration >= 132.0000
Number of times pregnant < 7.5000 then Class variable =
Tested Negative
Number of times pregnant >= 7.5000 then Class variable
= Tested Positive
Age >= 44.5000 then Class variable = Tested Positive
Triceps skin fold thickness >= 45.0000 then Class
variable = Tested Positive
Plasma glucose concentration >= 157.5000
Body mass index < 46.1000
Body mass index < 40.8500
Triceps skin fold thickness < 26.5000
Diastolic blood pressure < 69.0000 then Class variable =
Tested Negative
Diastolic blood pressure >= 69.0000 then Class variable =
Tested Positive
Triceps skin fold thickness >= 26.5000 then Class
variable = Tested Positive
Body mass index >= 40.8500 then Class variable =
Tested Positive
Body mass index >= 46.1000 then Class variable =
Tested Positive
Diastolic blood pressure >= 96.5000 then Class variable =
Tested Positive
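The nested rule listing above was produced by the C4.5 implementation in the tool used for this work. A comparable, though not identical, tree and rule dump can be sketched with scikit-learn's entropy-criterion decision tree, which approximates rather than reimplements C4.5.

```python
# Sketch: induce an entropy-criterion decision tree (a C4.5-like
# approximation; scikit-learn implements CART with an entropy
# criterion, not C4.5 itself) and print its rules as indented text.
from sklearn.tree import DecisionTreeClassifier, export_text

# Attribute names follow Table I; they are assumed to match the
# column order of the CSV loaded in the first sketch.
feature_names = [
    "Number of times pregnant", "Plasma glucose concentration",
    "Diastolic blood pressure", "Triceps skin fold thickness",
    "2-Hour serum insulin", "Body mass index",
    "Diabetes pedigree function", "Age",
]

tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=5,
                              random_state=42)
tree.fit(X, y)  # X, y: the Pima attributes and class labels loaded earlier
print(export_text(tree, feature_names=feature_names))
```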
V. EVALUATION
The classification algorithm predicts the class label. The final output will be patterns which are used to find out whether a person is affected by diabetes or not. The accuracy [4] of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier. Some of the performance measures are given below. A confusion matrix is a useful tool for analyzing classifier accuracy; its structure is given in Table V.
Table V: Confusion Matrix
               Predicted C1       Predicted C2
Actual C1      True positives     False negatives
Actual C2      False positives    True negatives
True Positive (TP) refers to positive tuples that were
correctly labeled by the classifier. True Negative (TN)
refers to negative tuples that were correctly labeled by
the classifier. False Positive (FP) refers to negative
tuples that were incorrectly labeled by the classifier. False
Negative (FN) refers to positive tuples that were
incorrectly labeled by the classifier.
Accuracy: Accuracy is the percentage of tuples that
are correctly classified by the classifier
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall: Recall is the proportion of examples which
were classified as class x, among all examples which
truly have class x, i.e. how much part of the class was
captured.
Recall = TP / (TP + FN)
Precision: Precision is the proportion of the examples
which truly have class x among all those which
were classified as class x.
Precision = TP / (TP + FP)
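A minimal sketch of computing these measures from the confusion matrix of Table V, reusing the model and held-out test split from the first sketch, is given below.

```python
# Sketch: derive accuracy, recall and precision from the 2x2
# confusion matrix, using the model and test split fitted earlier.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)      # share of true positives that were captured
precision = tp / (tp + fp)   # share of predicted positives that are correct
print(f"Accuracy={accuracy:.3f}  Recall={recall:.3f}  Precision={precision:.3f}")
```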


References
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic Publishers, 1998.
G. Parthiban et al., journal article applying a Naïve Bayes classifier to predict the chances of a diabetic patient getting heart disease (see Section II).