
Showing papers on "Dunn index published in 2019"


Journal ArticleDOI
TL;DR: A novel multiobjective framework, a multiobjective clustering algorithm by fast search and find of density peaks, is proposed to address the limitations of existing clustering algorithms altogether; it achieves better or competitive solutions compared with the others.
Abstract: Patient stratification has a major role in enabling efficient and personalized medicine. An important task in patient stratification is to discover disease subtypes for effective treatment. To achieve this goal, research on clustering algorithms for patient stratification has attracted attention from both academia and the medical community over the past decades. However, existing clustering algorithms suffer from realistic restrictions such as experimental noise, high dimensionality, and poor interpretability. In particular, existing clustering algorithms usually assess clustering quality using only one internal evaluation function; a single internal evaluation function can hardly fit, and remain robust across, all datasets. Therefore, in this paper, a novel multiobjective framework called multiobjective clustering algorithm by fast search and find of density peaks is proposed to address those limitations altogether. In the proposed framework, a population of candidate parameters is evolved under multiple objectives to select features and evaluate clustering densities automatically. To guide the multiobjective evolution, five cluster validity indices, namely compactness, separation, the Calinski–Harabasz index, the Davies–Bouldin index, and the Dunn index, are chosen as the objective functions, capturing multiple characteristics of the evolving clusters. A multiobjective differential evolution algorithm based on decomposition is adopted to optimize those five objective functions simultaneously. To demonstrate its effectiveness, extensive experiments have been conducted comparing the proposed algorithm with 45 algorithms, including nine state-of-the-art clustering algorithms, five multiobjective evolutionary algorithms, and 31 baseline algorithms under different objective subsets, on 94 datasets: 35 real patient stratification datasets, 55 synthetic datasets based on a real human transcription regulation network model, and four other medical datasets. The numerical results reveal that the proposed algorithm achieves better or competitive solutions compared with the others. In addition, time complexity analysis, convergence analysis, and parameter analysis are conducted to demonstrate the robustness of the proposed algorithm from different perspectives.
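The five objectives above are all standard internal validity indices. As a hedged illustration (not the authors' framework, which evolves them with decomposition-based multiobjective differential evolution), the R sketch below simply evaluates all five on one candidate clustering; the choice of the fpc and clusterSim packages and the iris stand-in data are assumptions:

```r
library(fpc)        # cluster.stats(): compactness, separation, CH, Dunn
library(clusterSim) # index.DB(): Davies-Bouldin index

x  <- scale(iris[, 1:4])                          # stand-in dataset
cl <- kmeans(x, centers = 3, nstart = 25)$cluster
st <- cluster.stats(dist(x), cl)

st$average.within   # compactness (lower is better)
st$average.between  # separation (higher is better)
st$ch               # Calinski-Harabasz index
index.DB(x, cl)$DB  # Davies-Bouldin index (lower is better)
st$dunn             # Dunn index (higher is better)
```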

55 citations


Journal ArticleDOI
TL;DR: The effectiveness of the proposed approach, namely the self-organizing-map-based multi-objective document clustering technique (SMODoc_clust), is demonstrated in the automatic classification of scientific articles and web documents.
Abstract: Document clustering is the partitioning of a given collection of documents into K groups based on some similarity/dissimilarity criterion. This task has applications in scope detection for journals/conferences, development of automated peer-review support systems, topic modeling, recent cognitive-inspired work on text summarization, and classification of documents based on semantics. In the current paper, a cognitive-inspired multi-objective automatic document clustering technique is proposed which is a fusion of self-organizing maps (SOM) and a multi-objective differential evolution approach. A variable number of cluster centers is encoded in different solutions of the population to determine the number of clusters in a data set in an automated way. These solutions undergo various genetic operations during evolution. The concept of SOM is utilized in designing new genetic operators for the proposed clustering technique. In order to measure the goodness of a clustering solution, two cluster validity indices, the Pakhira-Bandyopadhyay-Maulik index and the Silhouette index, are optimized simultaneously. The effectiveness of the proposed approach, namely the self-organizing-map-based multi-objective document clustering technique (SMODoc_clust), is shown in the automatic classification of scientific articles and web documents. Different representation schemes including tf, tf-idf, and word embeddings are employed to convert articles to vector form. Comparative results with respect to internal cluster validity indices, namely the Dunn index and the Davies-Bouldin index, are reported against several state-of-the-art clustering techniques, including three multi-objective clustering techniques (MOCK, VAMOSA, NSGA-II-Clust), a single-objective genetic algorithm (SOGA) based clustering technique, K-means, and single-linkage clustering. The results clearly show that our approach outperforms the existing approaches. The validity of the obtained results is also confirmed using statistically significant t-tests.

44 citations


Proceedings ArticleDOI
01 Feb 2019
TL;DR: Cluster plots, silhouette plots, and the Dunn index on the Iris dataset are shown for both techniques, and the final outcome is that CLARA clustering performs better than K-means clustering.
Abstract: This paper compares two techniques, Clustering Large Applications (CLARA) clustering and K-means clustering, using the popular Iris dataset. CLARA clustering and K-means clustering are both "partitioning based" clustering techniques: one forms clusters around medoids chosen from a random sample of the data, whereas the other forms clusters around centroids (means) of the dataset. In this paper, cluster plots, silhouette plots, and the Dunn index on the Iris dataset are shown for both techniques; all are used for cluster validation. Silhouette analysis measures the approximate average distance between clusters, and the silhouette plot measures how close the points in one cluster are to the neighboring clusters. The other internal clustering validation measure is the Dunn index; the higher the Dunn index, the better the clustering. All of this statistical analysis is done in R. The final outcome is that CLARA clustering performs better than K-means clustering.
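A hedged R sketch of the comparison described above (an assumed reconstruction, not the authors' code; the cluster and clValid packages are assumed choices):

```r
library(cluster)   # clara(), silhouette()
library(clValid)   # dunn()

x <- scale(iris[, 1:4])   # numeric features of the Iris dataset
d <- dist(x)

km <- kmeans(x, centers = 3, nstart = 25)
ca <- clara(x, k = 3)

# Average silhouette width: how close points are to their own cluster
# relative to the nearest neighboring cluster.
mean(silhouette(km$cluster, d)[, "sil_width"])
mean(silhouette(ca$clustering, d)[, "sil_width"])

# Dunn index: minimum inter-cluster separation divided by maximum
# intra-cluster diameter; higher is better.
dunn(d, km$cluster)
dunn(d, ca$clustering)
```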

25 citations


Journal ArticleDOI
TL;DR: The proposed approach deals with clustering of large probabilistic graphs using the graph’s density, where the clustering process is guided by the nodes’ degree and the neighborhood information.
Abstract: Clustering is a machine learning task that groups similar objects into coherent sets; the objects within each cluster exhibit similar behavior. With the exponential increase in data volume, robust approaches are required to process the data and extract clusters. In addition to large volumes, datasets may carry uncertainties due to the heterogeneity of the data sources, resulting in Big Data. Modern machine learning approaches and algorithms widely use probability theory to model such data uncertainty, and huge uncertain datasets can be transformed into a probabilistic graph-based representation. This work presents an approach for density-based clustering of big probabilistic graphs. The proposed approach clusters large probabilistic graphs using the graph's density, where the clustering process is guided by node degrees and neighborhood information. The proposed approach is evaluated using seven real-world benchmark datasets, namely protein-to-protein interaction, yahoo, movie-lens, core, last.fm, the delicious social bookmarking system, and epinions. These datasets are first transformed into a graph-based representation before the proposed clustering algorithm is applied. The obtained results are evaluated using three cluster validation indices, namely the Davies–Bouldin index, the Dunn index, and the Silhouette coefficient. The proposal is also compared with four state-of-the-art approaches for clustering large probabilistic graphs. The results obtained using the seven datasets and three cluster validity indices suggest better performance of the proposed approach.

21 citations


Journal ArticleDOI
TL;DR: This work focuses on the use of cosine similarity in the clustering process and proposes a new measure based on the same criterion, which is shown to be effective through an extensive comparative study.
Abstract: Document Clustering aims at organizing a large quantity of unlabeled documents into a smaller number of meaningful and coherent clusters. One of the main unsolved problems in the literature...

16 citations


Proceedings ArticleDOI
08 Jul 2019
TL;DR: Clustering is applied to candidate web services to determine similar services on the basis of QoS information, and the results of experimentation show that the proposed approach is better than existing approaches for web service selection.
Abstract: Web services are useful for automating tasks. Along with task automation, efficiency improvement is another important challenge for researchers in the web service community. To improve the overall execution efficiency of a web-service-based system, the input to the selection process needs to be preprocessed. In this work, clustering is applied to candidate web services to determine similar services on the basis of QoS information. A systematic analysis is done to evaluate the performance of three clustering techniques using the Dunn index and the average distance measure. The best-performing clustering technique is applied to the candidate web services, and the most prominent set of web services is considered for skyline-based selection. The experiments use a QoS dataset based on real-world web services. It is evident from the results of experimentation that the proposed approach is better than existing approaches for web service selection.

15 citations


Book ChapterDOI
01 Jan 2019
TL;DR: The current inquiry focuses on the use of internal validation criteria as cost functions of the swarm optimizer metaheuristic, as they achieve the dual goals of clustering: compactness and separation.
Abstract: Clustering is an NP-hard grouping problem, and thus there are advantages to using a metaheuristic (swarm intelligence) strategy to find a near-globally-optimal solution to it. To effectively guide the agents of the swarm in the metaheuristic strategy, a suitable cost function is needed for a successful outcome. The current inquiry focuses on the use of internal validation criteria as cost functions, as they achieve the dual goals of clustering: compactness and separation. Out of the multiple internal validation criteria in the literature, two are identified for this purpose, viz. BetaCV and the Dunn index. These were used as cost functions of the swarm optimizer metaheuristic (PSO-BCV and PSO-Dunn). To demonstrate the validity of the proposed technique, it was compared with another metaheuristic, differential evolution, as well as the traditional swarm optimizer based on distance-based criteria (PSO). The analysis of the results obtained on clustering benchmark datasets highlighted the suitability of this approach.

14 citations


Journal ArticleDOI
TL;DR: This research shows that the outliers found by DBSCAN and K-means in cluster 1 are 100% similar.
Abstract: The aim of this study is to discover outliers in customer data in order to characterize customer behaviour. Customer behaviour is determined with the RFM (recency, frequency, and monetary) model, using the K-means and DBSCAN algorithms to cluster the customer data. There are six steps in this study. The first step is determining the best number of clusters with the Dunn index (DN) validation method for each algorithm. Based on the Dunn index, the best number of clusters was 2, with a DN of 1.19 for DBSCAN (eps = 0.2 and minPts = 3) and a DN of 1.31 for K-means. The next step was to cluster the dataset with the DBSCAN and K-means algorithms using this best number of clusters. The DBSCAN algorithm yielded 37 outliers and the K-means algorithm 63 outliers (26 in cluster 1 and 37 in cluster 2). This research shows that the outliers found by DBSCAN and K-means in cluster 1 are 100% similar, although the overall outlier similarity is 67%. The outliers indicate customers whose behaviour combines a low spending frequency with high recency and monetary values.
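A minimal sketch of this workflow in R (assumed, since the study's customer data and code are not public; the synthetic RFM table and the dbscan/clValid packages are illustrative choices):

```r
library(dbscan)   # dbscan()
library(clValid)  # dunn()

set.seed(42)
# Hypothetical RFM table standing in for the customer data.
rfm <- scale(data.frame(recency   = rexp(300, 1 / 30),
                        frequency = rpois(300, 4),
                        monetary  = rexp(300, 1 / 500)))

km <- kmeans(rfm, centers = 2, nstart = 25)
db <- dbscan(rfm, eps = 0.2, minPts = 3)

sum(db$cluster == 0)         # DBSCAN labels noise points (outliers) as 0
dunn(dist(rfm), km$cluster)  # Dunn index of the k-means partition
```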

8 citations


Journal ArticleDOI
TL;DR: Meta-heuristics are applied to the segmentation of mammographic images, using the Dunn index as the optimization function and grey levels to represent each individual; updating the grey levels during the search maximizes the Dunn index.
Abstract: Breast cancer is a current problem that causes the death of many women. In this work, we test meta-heuristics applied to the segmentation of mammographic images. Traditionally, the application of these algorithms has a direct relationship with optimization problems; in this study, however, their implementation is oriented to the segmentation of mammograms, using the Dunn index as the optimization function and grey levels to represent each individual. Updating the grey levels during the process maximizes the Dunn index; the higher the index, the better the segmentation. The results showed a lower error rate using these meta-heuristics for segmentation compared to a well-adopted classical approach known as the Otsu method.
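To make the fitness function concrete, here is a hedged R sketch in which the Dunn index scores a candidate two-region split of synthetic grey levels; a brute-force threshold search stands in for the paper's meta-heuristics, and all data and names are illustrative:

```r
library(clValid)  # dunn()

set.seed(1)
# Synthetic grey levels standing in for a mammogram's pixel intensities.
grey <- c(rnorm(200, mean = 60, sd = 10), rnorm(200, mean = 170, sd = 12))
d <- dist(grey)

# Fitness: Dunn index of the two regions induced by threshold t.
fitness <- function(t) {
  labels <- ifelse(grey < t, 1L, 2L)
  if (length(unique(labels)) < 2) return(-Inf)
  dunn(d, labels)
}

ts <- seq(min(grey) + 1, max(grey) - 1, length.out = 50)
ts[which.max(sapply(ts, fitness))]  # threshold maximizing the Dunn index
```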

7 citations


Proceedings ArticleDOI
01 Jan 2019
TL;DR: The authors present the results of an investigation into clustering the temporal wind speed profiles associated with the South African renewable energy development zones; the resulting archetypal profiles greatly reduce the computational cost of high-level capacity allocation optimization studies.
Abstract: This paper presents the results of an investigation to cluster the temporal wind speed profiles associated with the South African renewable energy development zones. The study makes use of a renewable energy resource dataset produced by the Council for Scientific and Industrial Research. The clustering large applications (CLARA) algorithm, which is based on the partitioning around medoids (PAM) algorithm, is used in the clustering exercise. Results are presented for each of the eight South African renewable energy zones. These results include clustered mean daily temporal profiles of the wind speed obtained for the high-demand and low-demand seasons, as well as the corresponding geographical cluster maps. Clustering performance metrics, including the average within-cluster distance, the Dunn index, and the average silhouette width, are presented. The clustering results yield an optimal output of three to five clusters for each of the individual renewable energy development zones. This implies that the wind speed profiles associated with each of these zones can be reduced to three to five archetypal mean daily profiles, which greatly reduces the computational cost of high-level capacity allocation optimization studies.

5 citations


Book ChapterDOI
01 Jan 2019
TL;DR: This paper compares the performance of k-means and k-medoids in clustering objects with mixed variables, using a mixed-variable data set based on modified cancer data, and indicates that k-medoids is a good clustering option when the measured variables are of mixed types.
Abstract: This paper compares the performance of k-means and k-medoids in clustering objects with mixed variables. k-means was originally meant for clustering objects with continuous variables, as it uses Euclidean distance to compute distances between objects, while k-medoids is suitable for mixed-type variables, especially with PAM (partitioning around medoids). Using a mixed-variable data set based on modified cancer data, we compared k-means and k-medoids with the internal validity setup in an R package. The result indicates that k-medoids is a good clustering option when the measured variables are of mixed types.
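The pairing the chapter credits for mixed variables is k-medoids (PAM) over a dissimilarity that handles mixed types; a minimal R sketch using a Gower dissimilarity (the hypothetical records below stand in for the modified cancer data, which is not reproduced here):

```r
library(cluster)  # daisy(), pam()

# Hypothetical mixed-type records.
df <- data.frame(age    = c(34, 51, 47, 29, 60, 44),
                 stage  = factor(c("I", "III", "II", "I", "IV", "II")),
                 smoker = factor(c("yes", "yes", "no", "no", "yes", "no")))

d  <- daisy(df, metric = "gower")  # dissimilarity for mixed variable types
pm <- pam(d, k = 2)                # k-medoids on the dissimilarity matrix
pm$clustering
```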

Proceedings ArticleDOI
26 Mar 2019
TL;DR: The clustering performance metrics achieved indicate that the k-means algorithm performs best when clustering on the Weibull distribution characteristics, together with the mean wind speed.
Abstract: This paper presents the results of an investigation done to cluster the wind speed profiles associated with the South African Renewable Energy Development Zones using the associated Weibull distribution characteristics, together with the mean wind speed. The study uses a meso-scale wind resource dataset produced by the Council for Scientific and Industrial Research. Various clustering methods are explored, namely k-means clustering, the clustering large applications algorithm, the hierarchical agglomerative algorithm and a model-based clustering algorithm. Results are presented for each of the clustering algorithms for the Springbok renewable energy development zone for the high demand season wind speed profiles. These results include the non-overlapping clusters obtained, the Weibull distribution of the average profile associated with each cluster, the mean daily wind speed associated with each cluster and final analysis with an associated geographical cluster map. Clustering performance metrics, including the average silhouette width, Dunn index, the average intra-cluster distance, connectivity and the Calinski-Harabasz index are presented and interpreted. The clustering performance metrics achieved indicate that the k-means algorithm performs best when clustering on the Weibull distribution characteristics.

Journal ArticleDOI
Srujan Chinta
01 Oct 2019
TL;DR: This paper describes the design and construction of the proposed firefly rough-tangent-kernel algorithm.
Abstract: Data clustering methods have been used extensively for image segmentation in the past decade. In one of the author's previous works, it was established that combining the traditional cluste...

Book ChapterDOI
01 Jan 2019
TL;DR: Observations on the effect of changing the coordinate system of objects from Euclidean to polar on clustering are presented, and the possibilities of clustering with different distance techniques for partitioning objects represented in the polar coordinate system are explored.
Abstract: Clustering is an unsupervised learning technique for grouping similar objects. The quality of clustering is assessed by several internal as well as external measures, such as the Dunn index, the Davies–Bouldin index (DB), the Calinski-Harabasz index (CH), the Silhouette index, R-Squared, Rand, Jaccard, Purity and Entropy, F-measures, and many more. Researchers are exploring different approaches to improve the quality of clustering by experimenting with different partitioning strategies (similarity/distance formulas), by changing the representation of data points, or by applying different algorithms. In our earlier research paper (Joshi and Patil in 2016 IEEE Conference on Current Trends in Advanced Computing (ICCTAC), pp 1–7, 2016 [1]), we put forth our observations on the effect of changing the coordinate system of objects from Euclidean to polar on clustering. In continuation, we further experimented to explore the possibilities of clustering with different distance techniques for partitioning objects represented in the polar coordinate system. We experimented with a standard as well as a real data set. The quality of clustering is evaluated using the Silhouette internal evaluation measure.
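A minimal sketch of the transformation being explored (an assumption for illustration, not the authors' code): convert 2-D points from Euclidean to polar coordinates, cluster them, and score the result with the average silhouette width as in the chapter:

```r
library(cluster)  # silhouette()

xy <- scale(iris[, 1:2])                         # any 2-D numeric data
polar <- cbind(r     = sqrt(rowSums(xy^2)),      # radius
               theta = atan2(xy[, 2], xy[, 1]))  # angle

km <- kmeans(polar, centers = 3, nstart = 25)
mean(silhouette(km$cluster, dist(polar))[, "sil_width"])
```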

Book ChapterDOI
01 Jan 2019
TL;DR: The experimental results prove that the proposed clustering algorithms outperform the existing contemporary clustering algorithms.
Abstract: In this paper, we combine two famous fuzzy data clustering algorithms called fuzzy C-means and intuitionistic fuzzy C-means with a metaheuristic called fuzzy firefly algorithm. The resultant hybrid clustering algorithms (FCMFFA and IFCMFFA) are used for image segmentation. We compare the performance of the proposed algorithms with FCM, IFCM, FCMFA (fuzzy C-means fused with firefly algorithm), and IFCMFA (intuitionistic fuzzy C-means fused with firefly algorithm). The centroid values returned by firefly algorithm and fuzzy firefly algorithm are compared. Two performance indices, namely Davies–Bouldin (DB) index and Dunn index, have also been used to judge the quality of the clustering output. Different types of images have been used for the empirical analysis. Our experimental results prove that the proposed clustering algorithms outperform the existing contemporary clustering algorithms.

Journal ArticleDOI
TL;DR: This study generates potential anomaly areas in the ADS-B data within the segment region in order to automatically resolve flight conflicts along the flight route.

Journal ArticleDOI
TL;DR: By providing a better treatment of the noise inherent in repeated measurements and taking into account multiple layers of poly(A) site data, PASCCA could be a general tool for clustering and analyzing APA-specific gene expression data.
Abstract: Alternative polyadenylation (APA) has emerged as a pervasive mechanism that contributes to transcriptome complexity and the dynamics of gene regulation. The current tsunami of whole-genome poly(A) site data from various conditions generated by 3′ end sequencing provides a valuable data source for the study of APA-related gene expression. Cluster analysis is a powerful technique for investigating the association structure among genes; however, conventional gene clustering methods are not suitable for APA-related data, as they fail to consider the information of poly(A) sites (e.g., location, abundance, number, etc.) within each gene or to measure the association among poly(A) sites between two genes. Here we propose a computational framework, named PASCCA, for clustering genes from replicated or unreplicated poly(A) site data using canonical correlation analysis (CCA). PASCCA incorporates multiple layers of gene expression data from both the poly(A) site level and the gene level, and takes into account the number of replicates and the variability within each experimental group. Moreover, PASCCA characterizes poly(A) sites in various ways, including abundance and relative usage, which exploits the advantages of 3′ end deep sequencing in quantifying APA sites. Using both real and synthetic poly(A) site data sets, the cluster analysis demonstrates that PASCCA outperforms other widely used distance measures under five performance metrics: connectivity, the Dunn index, average distance, average distance between means, and the biological homogeneity index. We also used PASCCA to infer APA-specific gene modules from recently published poly(A) site data of rice and discovered some distinct functional gene modules. We have made PASCCA an easy-to-use R package for APA-related gene expression analyses, including the characterization of poly(A) sites, quantification of association between genes, and clustering of genes. By providing a better treatment of the noise inherent in repeated measurements and taking into account multiple layers of poly(A) site data, PASCCA could be a general tool for clustering and analyzing APA-specific gene expression data. PASCCA could be used to elucidate the dynamic interplay of genes and their APA sites among various biological conditions from emerging 3′ end sequencing data, addressing complex biological phenomena.