scispace - formally typeset
Search or ask a question
Journal ArticleDOI

Improved K-mean Clustering Algorithm for Prediction Analysis using Classification Technique in Data Mining

17 Jan 2017-International Journal of Computer Applications (Foundation of Computer Science (FCS), NY, USA)-Vol. 157, Iss: 6, pp 35-40
TL;DR: Improvement in the kmean clustering algorithm will be proposed which can define number of clusters automatically and assign required cluster to un-clustered points and will leads to improvement in accuracy and reduce clustering time by the member assigned to the cluster to predict cancer.
Abstract: Clustering is technique which is used to analyze the data in efficient manner and generate required information. To cluster the dataset, there is a technique named k-mean, is applied which is based on central point selection and calculation of Euclidian Distance. Here in k-mean, dataset will be loaded and from the dataset. Central points are selected using the formulae Euclidian distance and on the basis of Euclidian distance points are assigned to the clusters. The main disadvantage of k-mean is of accuracy, as in k-mean clustering user needs to define number of clusters. Because of user defined number of clusters, some points of the dataset are remained un-clustered. In this work, improvement in the kmean clustering algorithm will be proposed which can define number of clusters automatically and assign required cluster to un-clustered points. The proposed improvement will leads to improvement in accuracy and reduce clustering time by the member assigned to the cluster to predict cancer.

Content maybe subject to copyright    Report

International Journal of Computer Applications (0975 8887)
Volume 157 No 6, January 2017
35
Improved K-mean Clustering Algorithm for Prediction
Analysis using Classification Technique in Data Mining
Arpit Bansal
Department of Computer
Science & Engineering
MIET Meerut, India
Mayur Sharma
Software Engineer
Meerut, Uttar Pradesh
India
Shalini Goel
Department of Computer
Science & Engineering
MIET Meerut, India
ABSTRACT
Clustering is technique which is used to analyze the data in
efficient manner and generate required information. To cluster
the dataset, there is a technique named k-mean, is applied
which is based on central point selection and calculation of
Euclidian Distance. Here in k-mean, dataset will be loaded
and from the dataset. Central points are selected using the
formulae Euclidian distance and on the basis of Euclidian
distance points are assigned to the clusters. The main
disadvantage of k-mean is of accuracy, as in k-mean
clustering user needs to define number of clusters. Because of
user defined number of clusters, some points of the dataset are
remained un-clustered. In this work, improvement in the k-
mean clustering algorithm will be proposed which can define
number of clusters automatically and assign required cluster
to un-clustered points. The proposed improvement will leads
to improvement in accuracy and reduce clustering time by the
member assigned to the cluster to predict cancer.
General Terms
Clustering Algorithm, Predictive Analysis, Data Mining
Keywords
K-mean clustering, Prediction, clustering, Classification,
Hierarchal clustering
1. INTRODUCTION
Data Mining is known as the process of analyzing data to
extract interesting patterns and knowledge. Data mining is
used for analysis purpose to analyze different type of data by
using available data mining
tools. This information is currently used for wide range of
applications like customer retention, education system,
production control, healthcare, market basket analysis,
manufacturing engineering, scientific discovery and decision
making etc [1]. Data mining is studied for different databases
like object-relational databases, relational database, data ware
houses and multimedia databases etc.
Data mining is playing a vital role in many applications like
market-basket analysis, etc. Frequent item sets have
significant role in data mining which is used to find out the
correlations between the fields of database [2]. Association
Figure 1: Data Mining Process
rule is based on discovering frequent item sets and frequently
used by retail stores. Mining data in other words, named as
Discovery of new knowledge in Databases which further
moves to the nontrivial extraction of indirect, new and much
required information from data in databases [3].
1.1 Clustering in Data Mining
Cluster analysis has been widely used in numerous
applications, including market research, pattern recognition,
data analysis, and image processing. In business, clustering
can help marketers discover interests of their customers based
on purchasing patterns and characterize groups of the
customers. In biology, it can be used to derive plant and
animal taxonomies, categorize genes with similar
functionality, and gain insight into structures inherent in
populations. In geology, specialist can employ clustering to
identify areas of similar lands, similar houses in a city and etc.
data clustering can also be helpful in classifying documents
on the Web for information discovery.
Data clustering [4] is an unsupervised classification method
aims at creating groups of objects, or clusters, in such a way
that objects in the same cluster are very similar and objects in
different clusters are quite distinct. Cluster analysis is one of
the traditional topics in the data mining field. It is the first step
in the direction of exciting knowledge discovery. Clustering is
the procedure of grouping data objects into a set of disjoint
classes, called clusters. Now objects within a class have high
resemblance to each other in the meantime objects in separate
classes are more unlike.

International Journal of Computer Applications (0975 8887)
Volume 157 No 6, January 2017
36
Figure 2: Clustering
1.2 Types of Clustering
There are various types of clustering in data mining.
1.2.1 Partitioning Clustering
The general criterion for partitioning is a combination of high
similarity of the samples inside of clusters with high
dissimilarity between distinct clusters. Most partitioning
methods are distance-based. These clustering methods are
work well for finding spherical shaped clusters in small to
medium size databases [5].
Figure 3: Partitioning Clustering
1.2.2 Density Based Clustering
Most partitioning methods cluster objects based on distance
between objects. In these methods the cluster is continue to
grow as long as the density in the neighborhood exceeds some
threshold [6].
Figure 4: Density based Cluster
Grid Based Clustering
Grid based methods quantize the object space into a finite
number of cells that form a grid structure. It is a fast method
and is independent of the number of data objects and depends
only on the number of cells in each dimension in the
quantized space [7].
1.2.3 Hierarchical Methods
In this method hierarchical decomposition of the given set of
data objects is created. It can be classified as being either
agglomerative or divisive based on how hierarchical
decomposition is formed. Agglomerative approach is the
bottom up approach starts with each object forming a separate
group. Hierarchical algorithms create a hierarchical
decomposition of the given data set of data objects. The
hierarchical decomposition is represented by a tree structure,
called dendrogram. It does not need clusters as inputs. In this
type of clustering it is possible to view partitions at different
level of granularities using different types of K. E.g. Flat
Clustering [8].
It then merges groups close to one another until all the groups
are merged into one. Divisive approach is top down approach
starts with all the clusters in the same cluster and then in each
iteration step a cluster is split into smaller clusters until each
object is in one cluster.
Figure 5: Hierarchal Clustering
2. REVIEW OF LITERATURE
In this paper [9] they explained that huge data is available in
medical field to extract information from large data sets using
analytic tool. In this paper a real data set has been taken from
SGPGI. Real time data sets are always interlinked with some
challenges like missing values, high dimensional values and
noise etc which is not efficient for all the classification.
Therefore clustering is the alternate solution for data
analytics. The main focus of this paper is to develop a novel
technique based upon foggy k-mean clustering. The result of
the experiment depicts that foggy k-means clustering
algorithm has excellent result on datasets which are real as
compared to simple k-means clustering algorithm and
provides a enhanced result to the real world problem.
In this paper [10] they explained that clustering is the
powerful tool which used in various forecasting tools. In this
paper generic methodology of incremental K-mean clustering
is proposed for weather forecasting. This research has been
done on air pollution of west Bengal dataset. This paper
generally uses typical K-means clustering on the main air
pollution database and a list of weather category will be
developed based on the peak mean values of the clusters.
Whenever new data are coming, the incremental K-means is
used to group data into those clusters where weather category
has been already defined. Thus it is able to predict weather
information of future. This forecasting database is totally
based on the weather of west Bengal and this forecasting
methodology is prepared to mitigate the consequences of air
pollutions and launch focused modeling computations for
prediction and forecasts of weather events. Here correctness
of this approach is also measured.
In this paper [11] they proposed a system named Student
Performance Analysis System (SPAS) to keep track of
student’s result in a particular university. The proposed
project offers a system which predicts performance of the
students on the basis of their result on the basis of analysis
and design. The proposed system offers student performance
prediction through the rules generated via data mining
technique. The data mining technique used in this project is

International Journal of Computer Applications (0975 8887)
Volume 157 No 6, January 2017
37
classification, which classifies the students based on students'
grade.
In this paper [12] they presented an analysis of the prediction
of survivability rate of breast cancer patients using data
mining techniques. SEER public datasets has been used in this
project. This preprocessed dataset consists of 151, 886 records
which have 16 fields from the SEER. After that they have
investigated three data mining techniques: Naïve Bayes, back
propagated neural network and C4.5 decision tree algorithm.
Several experiments have been implemented using above
mentioned experimental. At the end existing techniques have
been compared with the achieved prediction performance.
Later on it is concluded that C4.5 algorithm has a much better
performance than other two techniques.
In this paper [13] they explained that forecasting stock return
is one the important subject to be learn for prediction for data
analysis. It is analysis that past investigations help to predict
future in data analysis. In this paper they try to help investors
in stock market better timing for the buying and selling stocks
on the basis of knowledge of past historical experiments. In
this paper they define decision tree classifier which is one of
the best data mining techniques.
3. BASIC THEORY
The k-means clustering algorithm [14] is the basic algorithm
which is based on partitioning method which is used for many
clustering tasks especially with low dimension datasets. It
uses k as a parameter, divide n objects into k clusters so that
the objects in the same cluster are similar to each other but
dissimilar to other objects in other clusters. The algorithm
attempts to find the cluster centers, (C1 …… Ck), such that
the sum of the squared distances of each data point, xi ,1 ≤ i ≤
n, to its nearest cluster centre Cj, 1 ≤ j ≤ k, is minimized. First,
the algorithm randomly selects the k objects, each of which
initially represents a cluster mean or centre. Then, each object
xi in the data set is assigned to the nearest cluster centre i.e. to
the most similar centre [4]. The algorithm then computes the
new mean for each cluster and reassigns each object to the
nearest new centre. This process iterates until no changes
occur to the assignment of objects.
3.1 Data Analytics
Data Analytics [15] is the science of examining raw data with
the purpose of extract useful information to draw conclusions.
Data Analytics can be used by many industries and
organizations to get better business decision. Data analytics
focuses on inference, the process of deriving a conclusion
based solely on what is already known by the researcher.
There are two types of data analytics: These are:
1. Classification
2. Prediction
1. Classification: Classification models predict categorical
class labels; and prediction models predict continuous valued
functions.
2. Prediction: Prediction model to predict the expenditures in
dollars of potential customers on computer equipment given
their income and occupation.
Prediction Analytics brings management, skills, information
technology and modeling. Predictive Analytics is a data
science, multidisciplinary skills set essential for non-profit
organizations, success in business and government.
Marketing forecasting sales or market share, good retail site
opportunity has been found. It also identify consumer
segments and target marketing and risks associated with
existing products, predictive analytics provides the key for it.
Classification is a data mining technique which comes under
machine learning technique to predict group membership for
data instances. Classification technique is capable of
processing a wider variety of data than regression and is
growing in popularity.
4. PROPOSED METHODOLOGY
The k-mean clustering algorithm is used to cluster the similar
type of data for prediction analysis. In k-mean clustering
algorithm, probability of the most relevant function is
calculated and using Euclidian distance formula the functions
are clustered. In this work, we will enhance the Euclidian
distance formula to increase the cluster quality. The
enhancement will be based on normalization. In the
enhancement two new features will be added. The first point
is to calculate normal distance metrics on the basis of
normalization. In second point the functions will be clustered
on the basis of majority voting. The proposed technique will
be implemented in MATLAB.
Figure 6 : Flowchart of Methodology
The steps of proposed work -
First of all, we have started process in which at initial
stage we generated data from user end in which we give
number of data inputs, which are generated by sigma and
random functions.
When all data has been generated then Simple k-means
applied and got result in subplot.
After applying normalization on that data, we gave
scatter data in second subplot.
Now we applied normalization, in precede in which we
read text file data of that generate data after that applied
hierarchy k-means before normalization in which we got
result in different form rather than first subplot.

International Journal of Computer Applications (0975 8887)
Volume 157 No 6, January 2017
38
After this process normalization on that process is done
in which iterations process started.
This process is continued until we don’t get a nearest
point to accurate position with generating data.
At last calculated their total time in which we got results
which shows betterment in accuracy of cluster.
5. EXPERIMENTAL RESULTS
The Cancer Dataset is used for the research process and to
find the results. The original data were highly dimensional,
but only 5 attributes has been finally considered on the basis
of requirements. The dataset is loaded using MATLAB.
Figure 7: Dataset is Plotted
As shown in figure 7, The dataset is plotted on a 2D plane
with according to existing K-means algorithm. Now the entire
dataset can be analyzed in one single view. We can check how
outliers does this dataset have . Thus we can put some
controls on the data to minimize or reduce their effect for a
better analysis and prediction.
Figure 8 : Cluster using k-mean algorithm
As shown in figure 8, The regions have been divided with the
existing k-mean clustering algorithm. Here the dataset is
divided in three regions. All these regions are different from
each other as they all depict something different from the
other one.
Figure 9: SVM is applied on the Dataset
As shown in figure 9, here in this the SVM is applied on the
data for classifying the dataset.
Figure 10 : Calculating Best cost using Normalization
As shown in figure 10, the k-mean clustering is improvement
to improve cluster quality using the technique of
normalization. The dataset is loaded is and it is shown on the
command window. The dataset which is loaded is plotted on
the 2-D plane for analysis. The plotted data will be clustered
using the algorithm of k-mean clustering. The central points
are marked in each cluster. The normalization technique is
applied to calculate best distance to make high quality
clusters.

International Journal of Computer Applications (0975 8887)
Volume 157 No 6, January 2017
39
Figure 11: Central Point of Cluster
As shown in figure 11, The best calculated clusters have been
formed along with their central points. All these 3 clusters are
different from each other somehow. It depends on us that how
many clusters we want to have in our output data. In this case
we have define to divide the dataset in 3 separate clusters.
Figure 12 : Final output of k-mean algorithm
As shown in figure 12, The dataset has been divided into two
separate regions. This graph plotted on a 2D plane, can be
described easily as our technique has made it very easy to
understand and predict after seeing the graph that the red dots
that are below the line are indicting the cancerous and the
green dots that above the line are non cancerous data.
5.1 Result
As it can be seen in the Table 1 shown below that there is a
big variation between execution time of existing algorithm
and proposed algorithm. And with this proposed algorithm the
accuracy level gets almost double. So it can be said that the
proposed modification in the existing k-means algorithm will
give a huge improvement to the clustering techniques. The
problem of accuracy level and execution time do not matter
much, but when it comes to large dataset when there are
millions of record, then it becomes enormously big problem.
Because then the entire study of the dataset may move to a
wrong direction as there was less accuracy level.
Table 1: Comparison with Existing Algorithm
Parameters
Existing
Algorithm
Proposed
Algorithm
Execution Time
6.90047 e-02
8.59160 e-02
Accuracy
57.14%
92.86%
6. CONCLUSION
In this paper, it is concluded that clustering is technique by
which large datasets are divide in to small data collections that
are called clusters. The cluster is a collection of the data that
are turned into information. The data clusters are different
from each others as they are possessing some different values
from each other. There are number of algorithms that work
well for clustering the data that can divide a dataset into
clusters. In this paper we have proposed technique for a
modification in K-Means Clustering Algorithm. Here in this
proposed modification, the K-Means clustering will vanish off
the two major drawbacks of K-Means clustering, that are
accuracy level and calculation time consumed in clustering
the dataset. Although when we use small datasets these two
factors accuracy level and calculation time may not matter
much but when we use large datasets that have trillions of
records, then little dispersion in accuracy level will matter a
lot and can lead to a disastrous situation, if not handled
properly. So in last considering the whole of the situation, it
can be said that this proposed modification can be more
extended to achieve the full accuracy level upto 100%, with
very little time and with more quality clusters.
7. ACKNOWLDEMENT
I am extremely grateful and indebted to my parents and my
colleague for being pillars of strength , for their unfailing
moral support, and encouragement. I treasure their blessings
and good wishes and dedicate this study to them.
I thank one and all who have been instrumental in helping me
to complete this dissertation work.
8. REFERENCES
[1] K.Rajalakshmi,, Dr.S.S.Dhenakaran,N.Roobin
“Comparative Analysis of K-Means Algorithm in
Disease Prediction”, International Journal of Science,
Engineering and Technology Research (IJSETR),
Volume 4, Issue 7, July 2015
[2] Oyelade, O. J, Oladipupo, O. O and Obagbuwa, I. C,
“Application of k-Means Clustering algorithm for
prediction of Students’ Academic Performance”,
International Journal of Computer Science and
Information Security, Vol. 7, o. 1, 2010
[3] Shital A. Raut and S. R. Sathe, “A Modified Fastmap K-
Means Clustering Algorithm for Large Scale Gene
Expression Datasets”, International Journal of
Bioscience, Biochemistry and Bioinformatics, Vol. 1, No.
4, page 120-124, November 2011.
[4] Daljit Kaur and Kiran Jyot, “Enhancement in the
Performance of K-means Algorithm”, International
Journal of Computer Science and Communication
Engineering, Volume 2 Issue 1, 2013

Citations
More filters
Journal ArticleDOI
TL;DR: Pregnant women make quick decisions in case of miscarriage or probable miscarriage is predicted by creating a real time system prediction of miscarriage using wearable healthcare sensors, mobile tools, data mining algorithms and big data technologies.
Abstract: Mobile phone and sensors have become very useful to understand and analyze human lifestyle because of the huge amount of data they can collect every second. This triggered the idea of combining benefits and advantages of reality mining, machine learning and big data predictive analytics tools, applied to smartphones/sensors real time. The main goal of our study is to build a system that interacts with mobile phones and wearable healthcare sensors to predict patterns. Wearable healthcare sensors (heart rate sensor, temperature sensor and activity sensor) and mobile phone are used for gathering real time data. All sensors are managed using IoT systems; we used Arduino for collecting data from health sensors and Raspberry Pi 3 for programming and processing. Kmeans clustering algorithm is used for patterns prediction and predicted clusters (partitions) are transmitted to the user in his front-end interface in the mobile application. Real world data and clustering validation statistics (Elbow method and Silhouette method) are used to validate the proposed system and assess its performance and effectiveness. All data management and processing tasks are conducted over Apache Spark Databricks. This system relies on real time gathered data and can be applied to any prediction case making use of sensors and mobile generated data. As a proof of concept, we worked on predicting miscarriages to help pregnant women make quick decisions in case of miscarriage or probable miscarriage by creating a real time system prediction of miscarriage using wearable healthcare sensors, mobile tools, data mining algorithms and big data technologies. 9 risk factors contribute vastly in prediction, the Elbow method asserts that the optimal number of cluster is 2 and we achieve a higher value (0, 95) of Silhouette width that validates the good matching between clusters and observations. K-means algorithm gives good results in clustering the data.

36 citations

Journal ArticleDOI
TL;DR: The genetic algorithm model has the best performance, and effective regional segmentation based on the auction appraisal price improves the predictive accuracy.
Abstract: The real estate auction market has become increasingly important in the financial, economic and investment fields, but few artificial intelligence-based studies have attempted to forecast the auction prices of real estate. The purpose of this study is to develop forecasting models of real estate auction prices using artificial intelligence and statistical methodologies. The forecasting models are developed through a regression model, an artificial neural network and a genetic algorithm. For empirical analysis, we use Seoul apartment auction data from 2013 to 2017 to predict the auction prices and compare the forecasting accuracy of the models. The genetic algorithm model has the best performance, and effective regional segmentation based on the auction appraisal price improves the predictive accuracy.

24 citations

Journal ArticleDOI
01 Oct 2018
TL;DR: In this article, the authors compared Naive Bayes and C.45 algorithms for credit card submission cases at a bank and showed that Naive-Bayes algorithm is better than C45 algorithm.
Abstract: Pada paper ini, telah diterapkan metode Naive Bayes serta C.45 ke dalam 4 buah studi kasus, yaitu kasus penerimaan “Kartu Indonesia Sehat”, penentuan pengajuan kartu kredit di sebuah bank, penentuan usia kelahiran, serta penentuan kelayakan calon anggota kredit pada koperasi untuk mengetahui algoritma terbaik di setiap kasus . Setelah itu, dilakukan perbandingan dalam hal Precision , Recall serta Accuracy untuk setiap data training dan data testing yang telah diberikan. Dari hasil implementasi yang dilakukan, telah dibangun sebuah aplikasi yang dapat menerapkan algoritma Naive Bayes dan C.45 di 4 buah kasus tersebut. Aplikasi telah diuji dengan blackbox dan algoritma dengan hasil valid dan dapat mengimplementasikan kedua buah algoritma dengan benar. Berdasarkan hasil pengujian, semakin banyaknya data training yang digunakan, maka nilai precision, recall dan accuracy akan semakin meningkat. Selain itu, hasil klasifikasi pada algoritma Naive Bayes dan C.45 tidak dapat memberikan nilai yang absolut atau mutlak di setiap kasus. Pada kasus penentuan penerimaan Kartu Indonesia Sehat, kedua buah algoritma tersebut sama-sama efektif untuk digunakan. Untuk kasus pengajuan kartu kredit di sebuah bank, C.45 lebih baik daripada Naive Bayes. Pada kasus penentuan usia kelahiran, Naive Bayes lebih baik daripada C.45. Sedangkan pada kasus penentuan kelayakan calon anggota kredit di koperasi, Naive Bayes memberikan nilai yang lebih baik pada precision, tapi untuk recall dan accuracy, C.45 memberikan hasil yang lebih baik. Sehingga untuk menentukan algoritma terbaik yang akan dipakai di sebuah kasus, harus melihat kriteria, variable maupun jumlah data di kasus tersebut. Abstract In this paper, applied Naive Bayes and C.45 into 4 case studies, namely the case of acceptance of “Kartu Indonesia Sehat”, determination of credit card application in a bank, determination of birth age, and determination of eligibility of prospective members of credit to Koperasi to find out the best algorithm in each case. After that, the comparison in Precision, Recall and Accuracy for each training data and data testing has been given. From the results of the implementation, has built an application that can apply the Naive Bayes and C.45 algorithm in 4 cases. Applications have been tested in blackbox and algorithms with valid results and can implement both algorithms correctly. Based on the test results, the more training data used, the value of precision, recall and accuracy will increase. The classification results of Naive Bayes and C.45 algorithms can not provide absolute value in each case. In the case of determining the acceptance of the Kartu Indonesia Indonesia, the two algorithms are equally effective to use. For credit card submission cases at a bank, C.45 is better than Naive Bayes. In the case of determining the age of birth, Naive Bayes is better than C.45. Whereas in the case of determining the eligibility of prospective credit members in the cooperative, Naive Bayes provides better value in precision, but for recall and accuracy, C.45 gives better results. So, to determine the best algorithm to be used in a case, it must look at the criteria, variables and amount of data in the case

18 citations


Cites background from "Improved K-mean Clustering Algorith..."

  • ...Bansar, Sharma & Goel (2017) menyatakan klasifikasi adalah sebuah teknik untuk menentukan keanggotaan kelompok berdasarkan data-data yang sudah ada....

    [...]

Journal ArticleDOI
TL;DR: In this paper, a new technology developed in recent years, data mining is used to discover the valuable and potential knowledge hidden behind the data and provide strong support for scientistic research.
Abstract: Data mining is a new technology developed in recent years. Through data mining, people can discover the valuable and potential knowledge hidden behind the data and provide strong support for scient...

17 citations

Journal ArticleDOI
TL;DR: This study has used a large-scale real-world data set to identify the efficiency of clustering technique to improve the classification model and found that applying K-means clustering prior to KNN model helps in reducing the computation time.
Abstract: Product classification is the key issue in e-commerce domains. Many products are released to the market rapidly and to select the correct category in taxonomy for each product has become a challenging task. The application of classification model is useful to precisely classify the products. The study proposed a method to apply clustering prior to classification. This study has used a large-scale real-world data set to identify the efficiency of clustering technique to improve the classification model. The conventional text classification procedures are used in the study such as preprocessing, feature extraction and feature selection before applying the clustering technique. Results show that clustering technique improves the accuracy of the classification model. The best classification model for all three approaches which are classification model only, classification with hierarchical clustering and classification with K-means clustering is K-Nearest Neighbor (KNN) model. Even though the accuracy of the KNN models are the same across different approaches but the KNN model with K-means clustering had the shortest time of execution. Hence, applying K-means clustering prior to KNN model helps in reducing the computation time.

17 citations


Cites background or methods from "Improved K-mean Clustering Algorith..."

  • ...Previous studies often utilized these techniques to enhance the performance of their classification models [13], [14],[20]-[22]....

    [...]

  • ...Recent studies have shown that the combination of classification and clustering models can provide better classification result [13],[14]....

    [...]

References
More filters
Proceedings Article
Sid Ray1, Rose H Turi1
01 Jan 2000
TL;DR: This paper presents a simple validity measure based on the intra-clusters and inter-cluster distance measures which allows the number of clusters to be determined automatically and is tested for synthetic images for which theNumber of clusters in known, and is also implemented for natural images.
Abstract: The main disadvantage of the k-means algorithm is that the number of clusters, K, must be supplied as a parameter. In this paper we present a simple validity measure based on the intra-cluster and inter-cluster distance measures which allows the number of clusters to be determined automatically. The basic procedure involves producing all the segmented images for 2 clusters up to Kmax clusters, where Kmax represents an upper limit on the number of clusters. Then our validity measure is calculated to determine which is the best clustering by finding the minimum value for our measure. The validity measure is tested for synthetic images for which the number of clusters in known, and is also implemented for natural images.

649 citations


"Improved K-mean Clustering Algorith..." refers methods in this paper

  • ...These clustering methods are work well for finding spherical –shaped clusters in small to medium size databases [5]....

    [...]

01 Jan 2009
TL;DR: A method for making the k-means clustering algorithm more effective and efficient, so as to get better clustering with reduced complexity is proposed.
Abstract: Emergence of modern techniques for scientific data collection has resulted in large scale accumulation of data per- taining to diverse fields. Conventional database querying methods are inadequate to extract useful information from huge data banks. Cluster analysis is one of the major data analysis methods and the k-means clustering algorithm is widely used for many practical applications. But the original k-means algorithm is computationally expensive and the quality of the resulting clusters heavily depends on the selection of initial centroids. Several methods have been proposed in the literature for improving the performance of the k-means clustering algorithm. This paper proposes a method for making the algorithm more effective and efficient, so as to get better clustering with reduced complexity.

324 citations

Proceedings ArticleDOI
25 Oct 2010
TL;DR: In this paper, appropriate and efficient networks for breast cancer knowledge discovery from clinically collected data sets are investigated and principal component techniques are used to reduce the dimension of data and find appropriate networks.
Abstract: In this paper, appropriate and efficient networks for breast cancer knowledge discovery from clinically collected data sets are investigated. Invoking various data mining techniques, it is desired to find out the percentage of disease development, using the developed network. The results, help in choosing a reasonable treatment of the patient. Several neural network structures are evaluated for this investigation. The performance of the statistical neural network structures, self organizing map(SOM), radial basis function network (RBF), general regression neural network (GRNN) and probabilistic neural network (PNN) are tested both on the Wisconsin breast cancer data (WBCD) and on the Shiraz Namazi Hospital breast cancer data (NHBCD). To overcome the problem of high dimension of the data set and realizing the correlated nature of the data, principal component techniques are used to reduce the dimension of data and find appropriate networks. The results are quite satisfactory while presenting a comparison of effectiveness each proposed network for such problems.

234 citations


"Improved K-mean Clustering Algorith..." refers methods in this paper

  • ...In this paper [12] they presented an analysis of the prediction of survivability rate of breast cancer patients using data mining techniques....

    [...]

01 Jan 2010
TL;DR: A new method is proposed for finding the better initial centroids and to provide an efficient way of assigning the data points to suitable clusters with reduced time complexity.
Abstract: Cluster analysis is one of the primary data analysis methods and k-means is one of the most well known popular clustering algorithms. The k-means algorithm is one of the frequently used clustering method in data mining, due to its performance in clustering massive data sets. The final clustering result of the k- means clustering algorithm greatly depends upon the correctness of the initial centroids, which are selected randomly. The original k-means algorithm converges to local minimum, not the global optimum. Many improvements were already proposed to improve the performance of the k-means, but most of these require additional inputs like threshold values for the number of data points in a set. In this paper a new method is proposed for finding the better initial centroids and to provide an efficient way of assigning the data points to suitable clusters with reduced time complexity. According to our experimental results, the proposed algorithm has the more accuracy with less computational time comparatively original k-means clustering algorithm.

167 citations


"Improved K-mean Clustering Algorith..." refers methods in this paper

  • ...It is a fast method and is independent of the number of data objects and depends only on the number of cells in each dimension in the quantized space [7]....

    [...]

01 Jan 2010
TL;DR: K-mean clustering algorithm was combined with the deterministic model to analyze the students’ results of a private Institution in Nigeria which is a good benchmark to monitor the progression of academic performance of students in higher Institution for the purpose of making an effective decision by the academic planners.
Abstract: The ability to monitor the progress of students academic performance is a critical issue to the academic community of higher learning. A system for analyzing students results based on cluster analysis and uses standard statistical algorithms to arrange their scores data according to the level of their performance is described. In this paper, we also implemented k mean clustering algorithm for analyzing students result data. The model was combined with the deterministic model to analyze the students results of a private Institution in Nigeria which is a good benchmark to monitor the progression of academic performance of students in higher Institution for the purpose of making an effective decision by the academic planners.

118 citations

Trending Questions (1)
How can k-means be used for clustering functional data?

In k-means clustering for functional data, the most relevant function is determined using probability and Euclidean distance formula, and functions are clustered based on majority voting.