
Showing papers on "Dunn index published in 2019"


Journal ArticleDOI
TL;DR: A novel multiobjective framework, a multiobjective clustering algorithm by fast search and find of density peaks, is proposed to address the limitations of existing clustering algorithms altogether; it achieves better or competitive solutions compared with the others.
Abstract: Patient stratification has a major role in enabling efficient and personalized medicine. An important task in patient stratification is to discover disease subtypes for effective treatment. To achieve this goal, research on clustering algorithms for patient stratification has attracted attention from both academia and the medical community over the past decades. However, existing clustering algorithms suffer from realistic restrictions such as experimental noise, high dimensionality, and poor interpretability. In particular, existing clustering algorithms usually assess clustering quality using only one internal evaluation function; a single internal evaluation function can hardly fit, and remain robust across, all datasets. Therefore, in this paper, a novel multiobjective framework called multiobjective clustering algorithm by fast search and find of density peaks is proposed to address those limitations altogether. In the proposed framework, a population of candidate parameters is evolved under multiple objectives to select features and evaluate clustering densities automatically. To guide the multiobjective evolution, five cluster validity indices, namely compactness, separation, the Calinski–Harabasz index, the Davies–Bouldin index, and the Dunn index, are chosen as the objective functions, capturing multiple characteristics of the evolving clusters. A multiobjective differential evolution algorithm based on decomposition is adopted to optimize those five objective functions simultaneously. To demonstrate its effectiveness, extensive experiments have been conducted comparing the proposed algorithm with 45 algorithms, including nine state-of-the-art clustering algorithms, five multiobjective evolutionary algorithms, and 31 baseline algorithms under different objective subsets, on 94 datasets: 35 real patient stratification datasets, 55 synthetic datasets based on a real human transcription regulation network model, and four other medical datasets. The numerical results reveal that the proposed algorithm achieves better or competitive solutions compared with the others. In addition, time complexity analysis, convergence analysis, and parameter analysis are conducted to demonstrate the robustness of the proposed algorithm from different perspectives.
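The five objectives above are all standard internal validity indices. As a hedged illustration (not the authors' framework, which evolves them with decomposition-based multiobjective differential evolution), the R sketch below simply evaluates all five on one candidate clustering; the choice of the fpc and clusterSim packages and the iris stand-in data are assumptions:

```r
library(fpc)        # cluster.stats(): compactness, separation, CH, Dunn
library(clusterSim) # index.DB(): Davies-Bouldin index

x  <- scale(iris[, 1:4])                          # stand-in dataset
cl <- kmeans(x, centers = 3, nstart = 25)$cluster
st <- cluster.stats(dist(x), cl)

st$average.within   # compactness (lower is better)
st$average.between  # separation (higher is better)
st$ch               # Calinski-Harabasz index
index.DB(x, cl)$DB  # Davies-Bouldin index (lower is better)
st$dunn             # Dunn index (higher is better)
```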

55 citations


Journal ArticleDOI
TL;DR: The effectiveness of the proposed approach, namely the self-organizing-map-based multi-objective document clustering technique (SMODoc_clust), is demonstrated in the automatic classification of scientific articles and web documents.
Abstract: Document clustering is the partitioning of a given collection of documents into K groups based on some similarity/dissimilarity criterion. This task has applications in scope detection for journals/conferences, development of automated peer-review support systems, topic modeling, recent cognitive-inspired work on text summarization, and classification of documents based on semantics. In the current paper, a cognitive-inspired multi-objective automatic document clustering technique is proposed which is a fusion of self-organizing maps (SOM) and a multi-objective differential evolution approach. A variable number of cluster centers is encoded in different solutions of the population to determine the number of clusters in a data set in an automated way. These solutions undergo various genetic operations during evolution. The concept of SOM is utilized in designing new genetic operators for the proposed clustering technique. In order to measure the goodness of a clustering solution, two cluster validity indices, the Pakhira-Bandyopadhyay-Maulik index and the Silhouette index, are optimized simultaneously. The effectiveness of the proposed approach, namely the self-organizing-map-based multi-objective document clustering technique (SMODoc_clust), is shown in the automatic classification of scientific articles and web documents. Different representation schemes including tf, tf-idf, and word embeddings are employed to convert articles to vector form. Comparative results with respect to internal cluster validity indices, namely the Dunn index and the Davies-Bouldin index, are reported against several state-of-the-art clustering techniques, including three multi-objective clustering techniques (MOCK, VAMOSA, NSGA-II-Clust), a single-objective genetic algorithm (SOGA) based clustering technique, K-means, and single-linkage clustering. The results clearly show that our approach outperforms the existing approaches. The validity of the obtained results is also confirmed using statistically significant t-tests.

44 citations


Proceedings ArticleDOI
01 Feb 2019
TL;DR: Cluster plots, silhouette plots, and the Dunn index on the Iris dataset are shown for both techniques, and the final outcome is that CLARA clustering performs better than K-means clustering.
Abstract: This paper compares two techniques, Clustering Large Applications (CLARA) clustering and K-means clustering, using the popular Iris dataset. CLARA clustering and K-means clustering are both "partitioning based" clustering techniques: one forms clusters around medoids chosen from a random sample of the data, whereas the other forms clusters around centroids (means) of the dataset. In this paper, cluster plots, silhouette plots, and the Dunn index on the Iris dataset are shown for both techniques; all are used for cluster validation. Silhouette analysis measures the approximate average distance between clusters, and the silhouette plot measures how close the points in one cluster are to the neighboring clusters. The other internal clustering validation measure is the Dunn index; the higher the Dunn index, the better the clustering. All of this statistical analysis is done in R. The final outcome is that CLARA clustering performs better than K-means clustering.
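A hedged R sketch of the comparison described above (an assumed reconstruction, not the authors' code; the cluster and clValid packages are assumed choices):

```r
library(cluster)   # clara(), silhouette()
library(clValid)   # dunn()

x <- scale(iris[, 1:4])   # numeric features of the Iris dataset
d <- dist(x)

km <- kmeans(x, centers = 3, nstart = 25)
ca <- clara(x, k = 3)

# Average silhouette width: how close points are to their own cluster
# relative to the nearest neighboring cluster.
mean(silhouette(km$cluster, d)[, "sil_width"])
mean(silhouette(ca$clustering, d)[, "sil_width"])

# Dunn index: minimum inter-cluster separation divided by maximum
# intra-cluster diameter; higher is better.
dunn(d, km$cluster)
dunn(d, ca$clustering)
```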

25 citations


Journal ArticleDOI
TL;DR: The proposed approach deals with clustering of large probabilistic graphs using the graph’s density, where the clustering process is guided by the nodes’ degree and the neighborhood information.
Abstract: Clustering is a machine learning task that groups similar objects into coherent sets; the objects within each cluster exhibit similar behavior. With the exponential increase in data volume, robust approaches are required to process the data and extract clusters. In addition to large volumes, datasets may carry uncertainties due to the heterogeneity of the data sources, resulting in Big Data. Modern machine learning approaches and algorithms widely use probability theory to model such data uncertainty, and huge uncertain datasets can be transformed into a probabilistic graph-based representation. This work presents an approach for density-based clustering of big probabilistic graphs. The proposed approach clusters large probabilistic graphs using the graph's density, where the clustering process is guided by node degrees and neighborhood information. The proposed approach is evaluated using seven real-world benchmark datasets, namely protein-to-protein interaction, yahoo, movie-lens, core, last.fm, the delicious social bookmarking system, and epinions. These datasets are first transformed into a graph-based representation before the proposed clustering algorithm is applied. The obtained results are evaluated using three cluster validation indices, namely the Davies–Bouldin index, the Dunn index, and the Silhouette coefficient. The proposal is also compared with four state-of-the-art approaches for clustering large probabilistic graphs. The results obtained using the seven datasets and three cluster validity indices suggest better performance of the proposed approach.

21 citations


Journal ArticleDOI
TL;DR: This work focuses on the use of cosine similarity in the clustering process and proposes a new measure based on the same criterion, which is shown to be effective through an extensive comparative study.
Abstract: Document Clustering aims at organizing a large quantity of unlabeled documents into a smaller number of meaningful and coherent clusters. One of the main unsolved problems in the literature...

16 citations


Proceedings ArticleDOI
08 Jul 2019
TL;DR: Clustering is applied to candidate web services to determine similar services on the basis of QoS information, and the results of experimentation show that the proposed approach is better than existing approaches for web service selection.
Abstract: Web services are useful for automating tasks. Along with task automation, efficiency improvement is another important challenge for researchers in the web service community. To improve the overall execution efficiency of a web-service-based system, the input to the selection process needs to be preprocessed. In this work, clustering is applied to candidate web services to determine similar services on the basis of QoS information. A systematic analysis is done to evaluate the performance of three clustering techniques using the Dunn index and the average distance measure. The best-performing clustering technique is applied to the candidate web services, and the most prominent set of web services is considered for skyline-based selection. The experiments use a QoS dataset based on real-world web services. It is evident from the results of experimentation that the proposed approach is better than existing approaches for web service selection.

15 citations


Book ChapterDOI
01 Jan 2019
TL;DR: The current inquiry focuses on the use of internal validation criteria as cost functions of the swarm optimizer metaheuristic, as they achieve the dual goals of clustering: compactness and separation.
Abstract: Clustering is an NP-hard grouping problem, and thus there are advantages to using a metaheuristic (swarm intelligence) strategy to find a near-globally-optimal solution to it. To effectively guide the agents of the swarm in the metaheuristic strategy, a suitable cost function is needed for a successful outcome. The current inquiry focuses on the use of internal validation criteria as cost functions, as they achieve the dual goals of clustering: compactness and separation. Out of the multiple internal validation criteria in the literature, two are identified for this purpose, viz. BetaCV and the Dunn index. These were used as cost functions of the swarm optimizer metaheuristic (PSO-BCV and PSO-Dunn). To demonstrate the validity of the proposed technique, it was compared with another metaheuristic, differential evolution, as well as the traditional swarm optimizer based on distance-based criteria (PSO). The analysis of the results obtained on clustering benchmark datasets highlighted the suitability of this approach.

14 citations


Journal ArticleDOI
TL;DR: This research shows that the outliers found by DBSCAN and K-means in cluster 1 are 100% similar.
Abstract: The aim of this study is to discover outliers in customer data in order to characterize customer behaviour. Customer behaviour is determined with the RFM (recency, frequency, and monetary) model, using the K-means and DBSCAN algorithms to cluster the customer data. There are six steps in this study. The first step is determining the best number of clusters with the Dunn index (DN) validation method for each algorithm. Based on the Dunn index, the best number of clusters was 2, with a DN of 1.19 for DBSCAN (eps = 0.2 and minPts = 3) and a DN of 1.31 for K-means. The next step was to cluster the dataset with the DBSCAN and K-means algorithms using this best number of clusters. The DBSCAN algorithm yielded 37 outliers and the K-means algorithm 63 outliers (26 in cluster 1 and 37 in cluster 2). This research shows that the outliers found by DBSCAN and K-means in cluster 1 are 100% similar, although the overall outlier similarity is 67%. The outliers indicate customers whose behaviour combines a low spending frequency with high recency and monetary values.
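A minimal sketch of this workflow in R (assumed, since the study's customer data and code are not public; the synthetic RFM table and the dbscan/clValid packages are illustrative choices):

```r
library(dbscan)   # dbscan()
library(clValid)  # dunn()

set.seed(42)
# Hypothetical RFM table standing in for the customer data.
rfm <- scale(data.frame(recency   = rexp(300, 1 / 30),
                        frequency = rpois(300, 4),
                        monetary  = rexp(300, 1 / 500)))

km <- kmeans(rfm, centers = 2, nstart = 25)
db <- dbscan(rfm, eps = 0.2, minPts = 3)

sum(db$cluster == 0)         # DBSCAN labels noise points (outliers) as 0
dunn(dist(rfm), km$cluster)  # Dunn index of the k-means partition
```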

8 citations


Journal ArticleDOI
TL;DR: Meta-heuristics are applied to the segmentation of mammographic images, using the Dunn index as the optimization function and grey levels to represent each individual; updating the grey levels during the search maximizes the Dunn index.
Abstract: Breast cancer is a current problem that causes the death of many women. In this work, we test meta-heuristics applied to the segmentation of mammographic images. Traditionally, the application of these algorithms has a direct relationship with optimization problems; in this study, however, their implementation is oriented to the segmentation of mammograms, using the Dunn index as the optimization function and grey levels to represent each individual. Updating the grey levels during the process maximizes the Dunn index; the higher the index, the better the segmentation. The results showed a lower error rate using these meta-heuristics for segmentation compared to a well-adopted classical approach known as the Otsu method.
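To make the fitness function concrete, here is a hedged R sketch in which the Dunn index scores a candidate two-region split of synthetic grey levels; a brute-force threshold search stands in for the paper's meta-heuristics, and all data and names are illustrative:

```r
library(clValid)  # dunn()

set.seed(1)
# Synthetic grey levels standing in for a mammogram's pixel intensities.
grey <- c(rnorm(200, mean = 60, sd = 10), rnorm(200, mean = 170, sd = 12))
d <- dist(grey)

# Fitness: Dunn index of the two regions induced by threshold t.
fitness <- function(t) {
  labels <- ifelse(grey < t, 1L, 2L)
  if (length(unique(labels)) < 2) return(-Inf)
  dunn(d, labels)
}

ts <- seq(min(grey) + 1, max(grey) - 1, length.out = 50)
ts[which.max(sapply(ts, fitness))]  # threshold maximizing the Dunn index
```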

7 citations


Proceedings ArticleDOI
01 Jan 2019
TL;DR: The authors present the results of an investigation into clustering the temporal wind speed profiles associated with the South African renewable energy development zones; the resulting archetypal profiles greatly reduce the computational cost of high-level capacity allocation optimization studies.
Abstract: This paper presents the results of an investigation to cluster the temporal wind speed profiles associated with the South African renewable energy development zones. The study makes use of a renewable energy resource dataset produced by the Council for Scientific and Industrial Research. The clustering large applications (CLARA) algorithm, which is based on the partitioning around medoids (PAM) algorithm, is used in the clustering exercise. Results are presented for each of the eight South African renewable energy zones. These results include clustered mean daily temporal profiles of the wind speed obtained for the high-demand and low-demand seasons, as well as the corresponding geographical cluster maps. Clustering performance metrics, including the average within-cluster distance, the Dunn index, and the average silhouette width, are presented. The clustering results yield an optimal output of three to five clusters for each of the individual renewable energy development zones. This implies that the wind speed profiles associated with each of these zones can be reduced to three to five archetypal mean daily profiles, which greatly reduces the computational cost of high-level capacity allocation optimization studies.

5 citations


Book ChapterDOI
01 Jan 2019
TL;DR: This paper compares the performance of k-means and k-medoids in clustering objects with mixed variables, using a mixed-variable data set based on modified cancer data, and indicates that k-medoids is a good clustering option when the measured variables are of mixed types.
Abstract: This paper compares the performance of k-means and k-medoids in clustering objects with mixed variables. k-means was originally meant for clustering objects with continuous variables, as it uses Euclidean distance to compute distances between objects, while k-medoids is suitable for mixed-type variables, especially with PAM (partitioning around medoids). Using a mixed-variable data set based on modified cancer data, we compared k-means and k-medoids with the internal validity setup in an R package. The result indicates that k-medoids is a good clustering option when the measured variables are of mixed types.
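The pairing the chapter credits for mixed variables is k-medoids (PAM) over a dissimilarity that handles mixed types; a minimal R sketch using a Gower dissimilarity (the hypothetical records below stand in for the modified cancer data, which is not reproduced here):

```r
library(cluster)  # daisy(), pam()

# Hypothetical mixed-type records.
df <- data.frame(age    = c(34, 51, 47, 29, 60, 44),
                 stage  = factor(c("I", "III", "II", "I", "IV", "II")),
                 smoker = factor(c("yes", "yes", "no", "no", "yes", "no")))

d  <- daisy(df, metric = "gower")  # dissimilarity for mixed variable types
pm <- pam(d, k = 2)                # k-medoids on the dissimilarity matrix
pm$clustering
```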

Proceedings ArticleDOI
26 Mar 2019
TL;DR: The clustering performance metrics achieved indicate that the k-means algorithm performs best when clustering on the Weibull distribution characteristics, together with the mean wind speed.
Abstract: This paper presents the results of an investigation done to cluster the wind speed profiles associated with the South African Renewable Energy Development Zones using the associated Weibull distribution characteristics, together with the mean wind speed. The study uses a meso-scale wind resource dataset produced by the Council for Scientific and Industrial Research. Various clustering methods are explored, namely k-means clustering, the clustering large applications algorithm, the hierarchical agglomerative algorithm and a model-based clustering algorithm. Results are presented for each of the clustering algorithms for the Springbok renewable energy development zone for the high demand season wind speed profiles. These results include the non-overlapping clusters obtained, the Weibull distribution of the average profile associated with each cluster, the mean daily wind speed associated with each cluster and final analysis with an associated geographical cluster map. Clustering performance metrics, including the average silhouette width, Dunn index, the average intra-cluster distance, connectivity and the Calinski-Harabasz index are presented and interpreted. The clustering performance metrics achieved indicate that the k-means algorithm performs best when clustering on the Weibull distribution characteristics.

Journal ArticleDOI
Srujan Chinta
01 Oct 2019
TL;DR: This paper describes the design and construction of the proposed firefly rough-tangent-kernel algorithm.
Abstract: Data clustering methods have been used extensively for image segmentation in the past decade. In one of the author's previous works, it was established that combining the traditional cluste...

Book ChapterDOI
01 Jan 2019
TL;DR: Observations on the effect of changing the coordinate system of objects from Euclidean to polar on clustering are presented, and the possibilities of clustering with different distance techniques for partitioning objects represented in the polar coordinate system are explored.
Abstract: Clustering is an unsupervised learning technique for grouping similar objects. The quality of clustering is assessed by several internal as well as external measures, such as the Dunn index, the Davies–Bouldin index (DB), the Calinski-Harabasz index (CH), the Silhouette index, R-Squared, Rand, Jaccard, Purity and Entropy, F-measures, and many more. Researchers are exploring different approaches to improve the quality of clustering by experimenting with different partitioning strategies (similarity/distance formulas), by changing the representation of data points, or by applying different algorithms. In our earlier research paper (Joshi and Patil in 2016 IEEE Conference on Current Trends in Advanced Computing (ICCTAC), pp 1–7, 2016 [1]), we put forth our observations on the effect of changing the coordinate system of objects from Euclidean to polar on clustering. In continuation, we further experimented to explore the possibilities of clustering with different distance techniques for partitioning objects represented in the polar coordinate system. We experimented with a standard as well as a real data set. The quality of clustering is evaluated using the Silhouette internal evaluation measure.
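A minimal sketch of the transformation being explored (an assumption for illustration, not the authors' code): convert 2-D points from Euclidean to polar coordinates, cluster them, and score the result with the average silhouette width as in the chapter:

```r
library(cluster)  # silhouette()

xy <- scale(iris[, 1:2])                         # any 2-D numeric data
polar <- cbind(r     = sqrt(rowSums(xy^2)),      # radius
               theta = atan2(xy[, 2], xy[, 1]))  # angle

km <- kmeans(polar, centers = 3, nstart = 25)
mean(silhouette(km$cluster, dist(polar))[, "sil_width"])
```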

Book ChapterDOI
01 Jan 2019
TL;DR: The experimental results prove that the proposed clustering algorithms outperform the existing contemporary clustering algorithms.
Abstract: In this paper, we combine two famous fuzzy data clustering algorithms called fuzzy C-means and intuitionistic fuzzy C-means with a metaheuristic called fuzzy firefly algorithm. The resultant hybrid clustering algorithms (FCMFFA and IFCMFFA) are used for image segmentation. We compare the performance of the proposed algorithms with FCM, IFCM, FCMFA (fuzzy C-means fused with firefly algorithm), and IFCMFA (intuitionistic fuzzy C-means fused with firefly algorithm). The centroid values returned by firefly algorithm and fuzzy firefly algorithm are compared. Two performance indices, namely Davies–Bouldin (DB) index and Dunn index, have also been used to judge the quality of the clustering output. Different types of images have been used for the empirical analysis. Our experimental results prove that the proposed clustering algorithms outperform the existing contemporary clustering algorithms.

Journal ArticleDOI
TL;DR: This study generates potential anomaly areas in the ADS-B data within the segment region in order to automatically resolve flight conflicts along the flight route.

Journal ArticleDOI
TL;DR: By providing a better treatment of the noise inherent in repeated measurements and taking into account multiple layers of poly(A) site data, PASCCA could be a general tool for clustering and analyzing APA-specific gene expression data.
Abstract: Alternative polyadenylation (APA) has emerged as a pervasive mechanism that contributes to transcriptome complexity and the dynamics of gene regulation. The current tsunami of whole-genome poly(A) site data from various conditions generated by 3′ end sequencing provides a valuable data source for the study of APA-related gene expression. Cluster analysis is a powerful technique for investigating the association structure among genes; however, conventional gene clustering methods are not suitable for APA-related data, as they fail to consider the information of poly(A) sites (e.g., location, abundance, number, etc.) within each gene or to measure the association among poly(A) sites between two genes. Here we propose a computational framework, named PASCCA, for clustering genes from replicated or unreplicated poly(A) site data using canonical correlation analysis (CCA). PASCCA incorporates multiple layers of gene expression data from both the poly(A) site level and the gene level, and takes into account the number of replicates and the variability within each experimental group. Moreover, PASCCA characterizes poly(A) sites in various ways, including abundance and relative usage, which exploits the advantages of 3′ end deep sequencing in quantifying APA sites. Using both real and synthetic poly(A) site data sets, the cluster analysis demonstrates that PASCCA outperforms other widely used distance measures under five performance metrics: connectivity, the Dunn index, average distance, average distance between means, and the biological homogeneity index. We also used PASCCA to infer APA-specific gene modules from recently published poly(A) site data of rice and discovered some distinct functional gene modules. We have made PASCCA an easy-to-use R package for APA-related gene expression analyses, including the characterization of poly(A) sites, quantification of association between genes, and clustering of genes. By providing a better treatment of the noise inherent in repeated measurements and taking into account multiple layers of poly(A) site data, PASCCA could be a general tool for clustering and analyzing APA-specific gene expression data. PASCCA could be used to elucidate the dynamic interplay of genes and their APA sites among various biological conditions from emerging 3′ end sequencing data, addressing complex biological phenomena.