
Showing papers on "Dunn index published in 2016"


Journal ArticleDOI
01 Sep 2016
TL;DR: The analysis shows that the hyper-tangent kernel with Hadoop based possibilistic kernelized rough intuitionistic fuzzy c-means is the best one for image segmentation among all these clustering algorithms.
Abstract: Over the years, data clustering algorithms have been used for image segmentation. Due to the presence of uncertainty in real life datasets, several uncertainty based data clustering algorithms have been developed. The c-means clustering algorithms form one such family of algorithms. Starting with the fuzzy c-means (FCM), a subfamily of this family comprises rough c-means (RCM), intuitionistic fuzzy c-means (IFCM) and their hybrids like rough fuzzy c-means (RFCM) and rough intuitionistic fuzzy c-means (RIFCM). In the basic subfamily of this family of algorithms, the Euclidean distance was used to measure the similarity of data. However, the subfamily of algorithms obtained by replacing the Euclidean distance with kernel based similarities produced better results. In particular, these algorithms were useful in viably handling data points that are linearly inseparable in the original input space. During this period, Krishnapuram and Keller inferred that the membership constraints in some rudimentary uncertainty based clustering techniques like fuzzy c-means impart a probabilistic nature to them, and hence they suggested a possibilistic version. In fact, all the other member algorithms from the basic subfamily have been extended to incorporate this new notion. Currently, the use of image data is growing vigorously and constantly, amounting to huge volumes and leading to big data. Moreover, since image segmentation happens to be one of the most time consuming processes, industries are in need of algorithms which can solve this problem at a rapid pace and with high accuracy. In this paper, we propose to combine the notions of kernel and possibilistic approach together in a distributed environment provided by Apache Hadoop. We integrate this combined notion with the map-reduce paradigm of Hadoop and put forth three novel algorithms: Hadoop based possibilistic kernelized rough c-means (HPKRCM), Hadoop based possibilistic kernelized rough fuzzy c-means (HPKRFCM) and Hadoop based possibilistic kernelized rough intuitionistic fuzzy c-means (HPKRIFCM), and study their efficiency in image segmentation. We compare their running times and analyze their efficiencies against the corresponding algorithms from the other three subfamilies on four different types of images, three different kernels and six different efficiency measures: the Davies-Bouldin index (DB), Dunn index (D), alpha index (α), rho index (ρ), alpha star index (α*) and gamma index (γ). Our analysis shows that the hyper-tangent kernel with Hadoop based possibilistic kernelized rough intuitionistic fuzzy c-means is the best one for image segmentation among all these clustering algorithms. Also, the times taken to render segmented images by the proposed algorithms are drastically lower in comparison to the other algorithms. The implementations of the algorithms have been carried out in Java, and for the proposed algorithms we have used the Hadoop framework installed on CentOS. For statistical plotting we have used matplotlib (a Python library).
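The key idea of the kernelized subfamily is to replace the Euclidean distance inside the c-means updates with a distance induced by a kernel. As a rough Python sketch (not the paper's Java/Hadoop implementation), one plausible form of the hyper-tangent kernel and its induced distance is shown below; the exact kernel form and the value of sigma are assumptions.

```python
import numpy as np

def hyper_tangent_kernel(x, y, sigma=1.0):
    """One plausible form of a hyper-tangent kernel: K(x, y) = 1 - tanh(||x - y||^2 / sigma^2).
    Some papers write the argument with a minus sign; the exact form and the value of
    sigma used by the authors are assumptions here."""
    sq_dist = float(np.sum((np.asarray(x, float) - np.asarray(y, float)) ** 2))
    return 1.0 - np.tanh(sq_dist / sigma ** 2)

def kernel_distance(x, y, sigma=1.0):
    """Distance induced in the kernel feature space, which would replace the Euclidean
    distance inside a c-means style update:
    ||phi(x) - phi(y)||^2 = K(x, x) + K(y, y) - 2 K(x, y)."""
    return (hyper_tangent_kernel(x, x, sigma) + hyper_tangent_kernel(y, y, sigma)
            - 2.0 * hyper_tangent_kernel(x, y, sigma))

# Nearby pixel values yield a small kernel distance, distant ones a larger one.
print(kernel_distance([0.1, 0.2, 0.3], [0.12, 0.18, 0.31], sigma=0.5))
print(kernel_distance([0.1, 0.2, 0.3], [0.9, 0.8, 0.2], sigma=0.5))
```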

28 citations


Journal ArticleDOI
TL;DR: A detailed study of various evaluation measures to work with the new incremental clustering algorithm ICNBCF is presented.
Abstract: Cluster members are decided based on how close they are to each other. Compactness of a cluster plays an important role in forming better quality clusters. The ICNBCF incremental clustering algorithm computes a closeness factor between every two data series. To decide the members of a cluster, one more decisive factor is needed for comparison: a threshold. Internal cluster evaluation measures like variance and the Dunn index provide the required decisive factor. In the initial phase of ICNBCF, this decisive factor was supplied manually by inspecting the computed closeness factors. With values generated by the formulae of internal evaluation measures, this process can be automated. This paper presents a detailed study of various evaluation measures to work with the new incremental clustering algorithm ICNBCF.
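The Dunn index recurs throughout the papers on this page as the internal measure of choice. As a minimal, generic sketch (not the ICNBCF authors' code), it is the ratio of the smallest inter-cluster distance to the largest intra-cluster diameter; the single-linkage separation and complete-diameter choices below are assumptions, since several variants exist.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    """Dunn index: smallest inter-cluster distance / largest intra-cluster diameter.
    Uses single-linkage separation and the complete diameter; several variants exist."""
    X, labels = np.asarray(X), np.asarray(labels)
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Largest intra-cluster diameter (max pairwise distance within any cluster).
    max_diameter = max(cdist(c, c).max() if len(c) > 1 else 0.0 for c in clusters)
    if max_diameter == 0.0:
        return float("inf")  # degenerate case: every cluster collapses to a single point
    # Smallest distance between points belonging to different clusters.
    min_separation = min(cdist(ci, cj).min()
                         for i, ci in enumerate(clusters) for cj in clusters[i + 1:])
    return min_separation / max_diameter

# Toy example with two well-separated groups: larger values mean better clustering.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
print(dunn_index(X, [0] * 20 + [1] * 20))
```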

9 citations


Journal ArticleDOI
TL;DR: The optimal number of clusters for the experimental dataset has been concluded to be K=2, and the optimal method for clustering the given dataset is hierarchical.
Abstract: Objective: This paper discusses and compares various clustering methods over ill-structured datasets, and the primary objective is to find the best clustering method and to fix the optimal number of clusters. Methods: The dataset used in this experiment is derived from the measurements of sensors used in an urban waste water treatment plant. In this paper, clustering methods like hierarchical, K-means and PAM have been compared, and internal cluster validity indices like connectivity, the Dunn index, and the silhouette index have been used to validate the clusters; the optimization of clustering is expressed in terms of the number of clusters. At the end, the experiment is carried out by varying the number of clusters, and optimal scores are calculated. Findings: An optimal score and an optimal rank list are generated, which reveal that hierarchical clustering is the optimal clustering method. The connectivity index should be minimal, while the silhouette and Dunn indices should be maximal. So, by interpreting the results, the optimal number of clusters for the experimental dataset has been concluded to be K=2, and the optimal method for clustering the given dataset is hierarchical. Applications: The experiment has been done over the dataset derived from the measurements of sensors used in an urban waste water treatment plant.
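As an illustration of the kind of comparison described above (hypothetical Python code, not the authors' implementation), candidate values of k can be scanned for hierarchical and K-means clustering and judged by an internal index such as the silhouette; the synthetic blobs stand in for the water treatment sensor measurements, and the Dunn and connectivity indices would be computed with separate helpers such as the one sketched earlier in this section.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the sensor dataset.
X, _ = make_blobs(n_samples=200, n_features=5, centers=3, random_state=0)

for k in range(2, 7):
    hier = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k,
          "hierarchical silhouette:", round(silhouette_score(X, hier), 3),  # maximize
          "k-means silhouette:", round(silhouette_score(X, km), 3))         # maximize
```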

5 citations


DOI
28 Oct 2016
TL;DR: The results of this study indicate that, at the maximum Dunn index, the best K-means grouping yields three dominant topic clusters, namely government, Jakarta, and politics.
Abstract: The internet is an extraordinary phenomenon. Starting from a military experiment in the United States, the internet has evolved into a 'need' for more than tens of millions of people worldwide. The large and growing number of internet users has created an internet culture. One of the fast growing social media platforms is Twitter, a microblogging service that stores a database of texts called tweets. To make it easier to obtain the information that is dominantly discussed, the topics of tweets are sought using clustering. In this research, 500 tweets from the Twitter account @detikcom are grouped using k-means clustering. The results of this study indicate that, at the maximum Dunn index, the best K-means grouping yields three dominant topic clusters, namely government, Jakarta, and politics. Keywords: text mining, clustering, k-means, Dunn index, Twitter.
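A rough sketch of this kind of tweet clustering pipeline (hypothetical code, not the study's): vectorize tweets with TF-IDF, run k-means over a range of k, and keep the k with the highest Dunn index. The placeholder corpus and parameter values below are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def dunn(X, labels):
    # Compact Dunn index: min inter-cluster distance / max intra-cluster diameter.
    groups = [X[labels == c] for c in np.unique(labels)]
    diam = max(cdist(g, g).max() for g in groups if len(g) > 1)
    sep = min(cdist(a, b).min() for i, a in enumerate(groups) for b in groups[i + 1:])
    return sep / diam if diam > 0 else float("inf")

# Placeholder corpus; in the study this would be the 500 tweets scraped from @detikcom.
tweets = [
    "pemerintah umumkan kebijakan baru",
    "banjir melanda jakarta hari ini",
    "debat politik jelang pemilu",
    "pemerintah revisi anggaran",
    "kemacetan jakarta makin parah",
    "partai politik bahas koalisi",
]

X = TfidfVectorizer(max_features=1000).fit_transform(tweets).toarray()

best_k, best_dunn = None, -1.0
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    d = dunn(X, labels)
    if d > best_dunn:
        best_k, best_dunn = k, d
print(best_k, best_dunn)
```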

4 citations


Book ChapterDOI
01 Jan 2016
TL;DR: This paper focuses on the multi-SOM clustering approach, which overcomes the problem of extracting the number of clusters from the SOM map through the use of a clustering validity index, and shows that it is more efficient than classical clustering methods.
Abstract: The interpretation of the quality of clusters and the determination of the optimal number of clusters is still a crucial problem in cluster analysis. In this paper, we focus on the multi-SOM clustering approach, which overcomes the problem of extracting the number of clusters from the SOM map through the use of a clustering validity index. We test the multi-SOM algorithm on real and artificial data sets with evaluation criteria not used previously, such as the Davies-Bouldin index and the Silhouette index. The multi-SOM algorithm is compared to the k-means and Birch methods. Results obtained with the R language show that it is more efficient than classical clustering methods.
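The baseline comparison mentioned above (k-means and Birch judged by the Davies-Bouldin and Silhouette indexes) can be sketched with scikit-learn as follows; this is illustrative Python, not the authors' R code, and the synthetic blobs stand in for their test datasets.

```python
from sklearn.cluster import Birch, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)  # artificial dataset

for name, model in [("k-means", KMeans(n_clusters=4, n_init=10, random_state=0)),
                    ("Birch", Birch(n_clusters=4))]:
    labels = model.fit_predict(X)
    print(name,
          "Davies-Bouldin:", round(davies_bouldin_score(X, labels), 3),  # lower is better
          "Silhouette:", round(silhouette_score(X, labels), 3))          # higher is better
```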

3 citations


Proceedings ArticleDOI
01 Jan 2016
TL;DR: This study aims to create a web application called BencanaVis which provides innovative visualization of government open data on disasters using Shiny, a web framework from the R programming language.
Abstract: The open data movement has led us to immensely useful applications and innovations for decision making, both for individual citizens and for government. This study aims to create a web application called BencanaVis which provides innovative visualization of government open data on disasters using Shiny, a web framework from the R programming language. The datasets used are available from the Indonesian National Disaster Management Authority (BNPB), the official Indonesian Open Data government portal and the Indonesian National Statistical Bureau (BPS) website. We create three types of scenarios or experiments for the dataset. After that, we normalize the data using min-max normalization. Then, we employ PCA (principal component analysis) to reduce feature dimensionality. Furthermore, we apply K-means clustering and calculate cluster validity using the Sum of Squared Errors (SSE), Davies-Bouldin Index (DBI), Dunn Index, Connectivity Index and Silhouette Index. The cluster members from the optimal number of clusters k are then analyzed to create a score for disaster readiness. We analyze this disaster readiness using the scoring produced by weighting the attribute values with weights from the AHP method. Furthermore, we provide two visualizations: a 3D scatter plot and a cluster distribution map using the leaflet library from R. Two other visualizations provided in the web application use the heatmap and streamgraph libraries. The heatmap visualization shows the distribution pattern of all attributes, and the streamgraph visualization, which is a stacked area chart, shows the number of the 21 types of disasters recorded in the BNPB data over 16 years, from 2000 to 2016.
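A rough Python sketch of the preprocessing and validation pipeline described above (min-max scaling, PCA, K-means, internal validity indices); it is hypothetical, not the BencanaVis code, and the random matrix stands in for the disaster and statistics indicators. The Dunn and connectivity indexes, which scikit-learn does not provide, would be computed with helpers such as the one sketched earlier in this section.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(7)
data = rng.random((100, 12))                    # stand-in for the disaster/BPS indicators

X = MinMaxScaler().fit_transform(data)          # min-max normalization
X = PCA(n_components=3).fit_transform(X)        # dimensionality reduction

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    labels = km.labels_
    print(k,
          "SSE:", round(km.inertia_, 2),        # within-cluster sum of squares
          "DBI:", round(davies_bouldin_score(X, labels), 3),
          "Silhouette:", round(silhouette_score(X, labels), 3))
```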

2 citations


Journal Article
TL;DR: In this paper, a rule-based expert system is used to classify objects according to changes in their properties over time, such as how to classify companies' share price stability over a period of time or how to classify students' preferences for subjects while they are progressing through school.
Abstract: This study optimises manually derived rule-based expert system classification of objects according to changes in their properties over time. One of the key challenges that this study tries to address is how to classify objects that exhibit changes in their behaviour over time, for example how to classify companies’ share price stability over a period of time or how to classify students’ preferences for subjects while they are progressing through school. A specific case the paper considers is the strategy of players in public goods games (as common in economics) across multiple consecutive games. Initial classification starts from expert definitions specifying class allocation for players based on aggregated attributes of the temporal data. Based on these initial classifications, the optimisation process tries to find an improved classifier which produces the best possible compact classes of objects (players) for every time point in the temporal data. The compactness of the classes is measured by a cost function based on internal cluster indices like the Dunn Index, distance measures like Euclidean distance or statistically derived measures like standard deviation. The paper discusses the approach in the context of incorporating changing player strategies in the aforementioned public goods games, where common classification approaches so far do not consider such changes in behaviour resulting from learning or in-game experience. By using the proposed process for classifying temporal data and the actual players’ contribution during the games, we aim to produce a more refined classification which in turn may inform the interpretation of public goods game data.
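A minimal sketch of the kind of compactness cost function described above, assuming within-class standard deviation averaged over time points as the compactness measure (the Dunn index or Euclidean distances could be substituted); the data layout and the aggregation over time points are assumptions, not the authors' implementation.

```python
import numpy as np

def compactness_cost(series, labels):
    """series: array of shape (n_players, n_time_points);
    labels: class assigned to each player by the candidate classifier.
    Lower cost = more compact classes at every time point. Compactness here is the
    mean within-class standard deviation, averaged over time points."""
    series, labels = np.asarray(series, float), np.asarray(labels)
    cost = 0.0
    for t in range(series.shape[1]):
        within = [series[labels == c, t].std() for c in np.unique(labels)]
        cost += float(np.mean(within))
    return cost / series.shape[1]

# Example: contributions of 6 players over 4 rounds, split into two candidate classes.
contributions = np.array([[10, 9, 8, 8], [9, 9, 7, 8], [10, 8, 8, 7],
                          [2, 1, 0, 0], [1, 2, 1, 0], [0, 1, 0, 1]])
print(compactness_cost(contributions, [0, 0, 0, 1, 1, 1]))
```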

1 citation


Posted Content
TL;DR: A dissimilarity measure based on the data cloud, called MADD, is used, which takes care of the distance concentration problem; the problem of estimating the number of clusters is also addressed, and many existing algorithms are shown to have superior performance in high dimensions when MADD is used instead of the Euclidean distance.
Abstract: Popular clustering algorithms based on usual distance functions (e.g., Euclidean distance) often suffer in high dimension, low sample size (HDLSS) situations, where concentration of pairwise distances has adverse effects on their performance. In this article, we use a dissimilarity measure based on the data cloud, called MADD, which takes care of this problem. MADD uses the distance concentration property to its advantage and, as a result, clustering algorithms based on MADD usually perform better for high dimensional data. We also address the problem of estimating the number of clusters. This is a very challenging problem in cluster analysis, and several algorithms have been proposed for it. We show that many of these existing algorithms have superior performance in high dimensions when MADD is used instead of the Euclidean distance. We also construct a new estimator based on a penalized Dunn index and prove its consistency in the HDLSS asymptotic regime, where the sample size remains fixed and the dimension grows to infinity. Several simulated and real data sets are analyzed to demonstrate the importance of MADD for cluster analysis of high dimensional data.
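As a rough illustration of the idea behind MADD (the precise definition and normalization should be taken from the paper), the dissimilarity between two observations is built from how differently they relate to the remaining points of the data cloud, rather than from their direct distance; the averaging below is an assumption.

```python
import numpy as np
from scipy.spatial.distance import cdist

def madd_matrix(X):
    """Mean Absolute Difference of Distances (sketch).
    Dissimilarity between x_i and x_j = average over the remaining points z of
    |d(x_i, z) - d(x_j, z)|; the exact normalization in the paper may differ."""
    D = cdist(X, X)               # pairwise Euclidean distances
    n = len(X)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mask = np.ones(n, dtype=bool)
            mask[[i, j]] = False
            M[i, j] = M[j, i] = np.abs(D[i, mask] - D[j, mask]).mean()
    return M

# In HDLSS settings, M can replace the Euclidean distance matrix fed to a clustering
# algorithm that accepts precomputed dissimilarities (e.g., average-linkage clustering).
```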

1 citation


Posted Content
TL;DR: This study optimises manually derived rule-based expert system classification of objects according to changes in their properties over time to produce a more refined classification which in turn may inform the interpretation of public goods game data.
Abstract: This study optimises manually derived rule-based expert system classification of objects according to changes in their properties over time. One of the key challenges that this study tries to address is how to classify objects that exhibit changes in their behaviour over time, for example how to classify companies' share price stability over a period of time or how to classify students' preferences for subjects while they are progressing through school. A specific case the paper considers is the strategy of players in public goods games (as common in economics) across multiple consecutive games. Initial classification starts from expert definitions specifying class allocation for players based on aggregated attributes of the temporal data. Based on these initial classifications, the optimisation process tries to find an improved classifier which produces the best possible compact classes of objects (players) for every time point in the temporal data. The compactness of the classes is measured by a cost function based on internal cluster indices like the Dunn Index, distance measures like Euclidean distance or statistically derived measures like standard deviation. The paper discusses the approach in the context of incorporating changing player strategies in the aforementioned public goods games, where common classification approaches so far do not consider such changes in behaviour resulting from learning or in-game experience. By using the proposed process for classifying temporal data and the actual players' contribution during the games, we aim to produce a more refined classification which in turn may inform the interpretation of public goods game data.

1 citation


Book ChapterDOI
07 Sep 2016
TL;DR: A pattern detection mechanism is introduced within the authors' data analytics tool, based on k-means clustering and on the SSE, silhouette, Dunn index and Xie-Beni index quality metrics, and automatic triggers are introduced that highlight learners who will potentially fail the course, enabling tutors to take timely actions.
Abstract: Clustering algorithms, pattern mining techniques and associated quality metrics have emerged as reliable methods for modeling learners' performance, comprehension and interaction in given educational scenarios. The specificity of the available data, such as missing values, extreme values or outliers, creates a challenge in extracting significant user models from an educational perspective. In this paper we introduce a pattern detection mechanism within our data analytics tool based on k-means clustering and on the SSE, silhouette, Dunn index and Xie-Beni index quality metrics. Experiments performed on a dataset obtained from our online e-learning platform show that the extracted interaction patterns were representative in classifying learners. Furthermore, the performed monitoring activities created a strong basis for generating automatic feedback to learners in terms of their course participation, while relying on their previous performance. In addition, our analysis introduces automatic triggers that highlight learners who will potentially fail the course, enabling tutors to take timely actions.
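For hard k-means partitions, the Xie-Beni index can be approximated as the within-cluster sum of squares divided by n times the squared minimum inter-centroid distance; the crisp simplification below is an assumption (Xie-Beni is usually defined over fuzzy memberships), and the synthetic blobs stand in for the e-learning interaction data.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def xie_beni_crisp(X, labels, centers):
    """Crisp simplification of the Xie-Beni index: lower is better."""
    sse = sum(np.sum((X[labels == c] - centers[c]) ** 2) for c in range(len(centers)))
    d = cdist(centers, centers)
    d[d == 0] = np.inf                      # ignore zero self-distances on the diagonal
    return sse / (len(X) * d.min() ** 2)

X, _ = make_blobs(n_samples=200, centers=3, random_state=2)   # stand-in for learner logs
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(xie_beni_crisp(X, km.labels_, km.cluster_centers_))
```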

Book ChapterDOI
01 Jan 2016
TL;DR: Experiments compare the behavior of these new indexes with usual cluster quality indexes based on the Euclidean distance on different kinds of test datasets for which ground truth is available, and the comparison clearly highlights the superior accuracy and stability of the new method.
Abstract: This paper presents new cluster quality indexes which can be efficiently applied to a low-to-high dimensional range of data and which are tolerant to noise. These indexes rely on feature maximization, which is an alternative to usual distributional measures relying on entropy or on the Chi-square metric, or to vector-based measures such as the Euclidean distance or correlation distance. Experiments compare the behavior of these new indexes with usual cluster quality indexes based on the Euclidean distance on different kinds of test datasets for which ground truth is available. This comparison clearly highlights the superior accuracy and stability of the new method.
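Feature maximization indexes are typically built from feature recall and feature precision computed over the feature weights of cluster members; the sketch below is a loose, hypothetical reading of that idea and does not reproduce the authors' exact indexes.

```python
import numpy as np

def feature_f_measure(W, labels):
    """Loose sketch of a feature maximization quantity.
    W: nonnegative feature-weight matrix of shape (n_samples, n_features).
    For each cluster c and feature f:
      recall(f, c)    = sum of f over members of c / sum of f over all samples
      precision(f, c) = sum of f over members of c / sum of all features over members of c
    The feature F-measure is their harmonic mean; cluster quality indexes can then be
    derived from these per-cluster values (details are in the paper)."""
    W, labels = np.asarray(W, float), np.asarray(labels)
    scores = {}
    for c in np.unique(labels):
        Wc = W[labels == c]
        recall = Wc.sum(axis=0) / np.maximum(W.sum(axis=0), 1e-12)
        precision = Wc.sum(axis=0) / np.maximum(Wc.sum(), 1e-12)
        scores[c] = 2 * recall * precision / np.maximum(recall + precision, 1e-12)
    return scores

# Example: term-weight matrix of 4 documents in 3 features, split into two clusters.
W = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 3, 3]], dtype=float)
print(feature_f_measure(W, [0, 0, 1, 1]))
```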