Topic

Rand index

About: Rand index is a research topic. Over its lifetime, 630 publications have been published within this topic, receiving 20,373 citations.


Papers
Journal Article
TL;DR: An investigation of cluster validation indices that relates 4 of the indices to the L. Hubert and P. Arabie (1985) adjusted Rand index, together with a method for testing the significance of observed adjusted Rand indices.
Abstract: This article provides an investigation of cluster validation indices that relates 4 of the indices to the L. Hubert and P. Arabie (1985) adjusted Rand index--the cluster validation measure of choice (G. W. Milligan & M. C. Cooper, 1986). It is shown how these other indices can be "roughly" transformed into the same scale as the adjusted Rand index. Furthermore, in-depth explanations are given of why classification rates should not be used in cluster validation research. The article concludes by summarizing several properties of the adjusted Rand index across many conditions and provides a method for testing the significance of observed adjusted Rand indices.

520 citations
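
For reference, the Hubert and Arabie (1985) adjusted Rand index named in this abstract is usually written in terms of the contingency table of two partitions. The notation below is introduced here for illustration and does not appear in the listing: $n_{ij}$ is the number of objects in cluster $i$ of the first partition and cluster $j$ of the second, $a_i = \sum_j n_{ij}$ and $b_j = \sum_i n_{ij}$ are the row and column totals, and $n$ is the number of objects.

```latex
\mathrm{ARI} =
\frac{\displaystyle \sum_{i,j} \binom{n_{ij}}{2}
      - \frac{\sum_{i} \binom{a_i}{2} \, \sum_{j} \binom{b_j}{2}}{\binom{n}{2}}}
     {\displaystyle \frac{1}{2}\left[ \sum_{i} \binom{a_i}{2} + \sum_{j} \binom{b_j}{2} \right]
      - \frac{\sum_{i} \binom{a_i}{2} \, \sum_{j} \binom{b_j}{2}}{\binom{n}{2}}}
```

The subtracted term is the expected value of $\sum_{i,j} \binom{n_{ij}}{2}$ when cluster sizes are held fixed, so the index is 1 for identical partitions and close to 0 for agreement at chance level.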

Journal Article
TL;DR: The results of the study indicated that the Hubert and Arabie adjusted Rand index was best suited to the task of comparison across hierarchy levels.
Abstract: Five external criteria were used to evaluate the extent of recovery of the true structure in a hierarchical clustering solution. This was accomplished by comparing the partitions produced by the clustering algorithm with the partition that indicates the true cluster structure known to exist in the data. The five criteria examined were the Rand, the Morey and Agresti adjusted Rand, the Hubert and Arabie adjusted Rand, the Jaccard, and the Fowlkes and Mallows measures. The results of the study indicated that the Hubert and Arabie adjusted Rand index was best suited to the task of comparison across hierarchy levels. Deficiencies with the other measures are noted.

499 citations
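
Three of the criteria named in this abstract (Rand, Jaccard, and Fowlkes and Mallows) can be computed from counts of object pairs. The following is a minimal sketch using their standard pair-counting definitions, not code from the paper. The function names and the example labelings are hypothetical, and degenerate inputs (fewer than two objects, or no pair grouped together by either partition) are not handled.

```python
from itertools import combinations
from math import sqrt


def pair_counts(labels_a, labels_b):
    """Classify every pair of objects by whether each partition groups it together.

    Returns (both, only_a, only_b, neither).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same objects")
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two partitions agree."""
    both, only_a, only_b, neither = pair_counts(labels_a, labels_b)
    return (both + neither) / (both + only_a + only_b + neither)


def jaccard_index(labels_a, labels_b):
    """Like the Rand index, but ignoring pairs separated in both partitions."""
    both, only_a, only_b, _ = pair_counts(labels_a, labels_b)
    return both / (both + only_a + only_b)


def fowlkes_mallows(labels_a, labels_b):
    """Geometric mean of the two pairwise 'precision' ratios."""
    both, only_a, only_b, _ = pair_counts(labels_a, labels_b)
    return both / sqrt((both + only_a) * (both + only_b))


# Hypothetical example: a known structure and a clustering that misplaces one object.
true_labels = [0, 0, 0, 1, 1, 1]
found_labels = [0, 0, 1, 1, 1, 1]
print(pair_counts(true_labels, found_labels))
print(rand_index(true_labels, found_labels))
print(jaccard_index(true_labels, found_labels))
print(fowlkes_mallows(true_labels, found_labels))
```

None of these three measures is corrected for chance, which is the property that distinguishes the adjusted variants compared in the study.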

Book Chapter
02 Oct 2009
TL;DR: This paper investigates the usability of this clustering validation measure in supervised classification problems by two different approaches: as a performance measure and in feature selection.
Abstract: The Adjusted Rand Index (ARI) is frequently used in cluster validation since it is a measure of agreement between two partitions: one given by the clustering process and the other defined by external criteria. In this paper we investigate the usability of this clustering validation measure in supervised classification problems by two different approaches: as a performance measure and in feature selection. Since ARI measures the relation between pairs of dataset elements without using information from classes (labels), it can be used to detect problems with the classification algorithm, especially when combined with conventional performance measures. If we instead use the class information, we can also apply ARI to perform feature selection. We present the results of several experiments where we have applied ARI both as a performance measure and for feature selection, showing the validity of this index for the given tasks.

339 citations
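
The remark that ARI depends on pairs of elements rather than on the label values themselves can be illustrated with a small sketch. It is not the paper's code: the function names, the example labels, and the convention of returning 1.0 for degenerate partitions are assumptions made here. The ARI is computed from the contingency table of the two labelings.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie adjusted Rand index from contingency-table counts."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two labelings of the same length, at least 2 objects")
    # Pairs grouped together in both labelings, in the first, and in the second.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Assumed convention: both partitions are trivial, treat as full agreement.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def accuracy(y_true, y_pred):
    """Conventional classification rate: fraction of exactly matching labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical predictions: same grouping as the truth, but with label names swapped.
y_true = [0, 0, 0, 1, 1, 1]
y_swapped = [1, 1, 1, 0, 0, 0]
print(accuracy(y_true, y_swapped))             # every label is "wrong"
print(adjusted_rand_index(y_true, y_swapped))  # yet the grouping is identical
```

A large gap between the two numbers, as in this example, is the kind of signal the abstract suggests looking for when ARI is combined with a conventional performance measure.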

Journal Article
TL;DR: A new combined stability index is proposed to be the sum of the pairwise individual and ensemble stabilities to correlate better with the ensemble accuracy, following the hypothesis that a point of stability of a clustering algorithm corresponds to a structure found in the data.
Abstract: Many clustering algorithms, including cluster ensembles, rely on a random component. Stability of the results across different runs is considered to be an asset of the algorithm. The cluster ensembles considered here are based on k-means clusterers. Each clusterer is assigned a random target number of clusters, k and is started from a random initialization. Here, we use 10 artificial and 10 real data sets to study ensemble stability with respect to random k, and random initialization. The data sets were chosen to have a small number of clusters (two to seven) and a moderate number of data points (up to a few hundred). Pairwise stability is defined as the adjusted Rand index between pairs of clusterers in the ensemble, averaged across all pairs. Nonpairwise stability is defined as the entropy of the consensus matrix of the ensemble. An experimental comparison with the stability of the standard k-means algorithm was carried out for k from 2 to 20. The results revealed that ensembles are generally more stable, markedly so for larger k. To establish whether stability can serve as a cluster validity index, we first looked at the relationship between stability and accuracy with respect to the number of clusters, k. We found that such a relationship strongly depends on the data set, varying from almost perfect positive correlation (0.97, for the glass data) to almost perfect negative correlation (-0.93, for the crabs data). We propose a new combined stability index to be the sum of the pairwise individual and ensemble stabilities. This index was found to correlate better with the ensemble accuracy. Following the hypothesis that a point of stability of a clustering algorithm corresponds to a structure found in the data, we used the stability measures to pick the number of clusters. The combined stability index gave best results

306 citations
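
The pairwise stability defined in this abstract (the adjusted Rand index between pairs of clusterers, averaged across all pairs) can be sketched as follows. This is an illustrative reading of that one sentence, not the authors' implementation. The function names and the three example labelings are hypothetical, and the k-means ensemble itself is replaced here by hard-coded label lists.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie adjusted Rand index between two labelings of the same objects."""
    n = len(labels_a)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # assumed convention for degenerate partitions
    return (sum_cells - expected) / (max_index - expected)


def pairwise_stability(labelings):
    """Mean adjusted Rand index over all unordered pairs of labelings."""
    pairs = list(combinations(labelings, 2))
    if not pairs:
        raise ValueError("need at least two labelings")
    return sum(adjusted_rand_index(a, b) for a, b in pairs) / len(pairs)


# Hypothetical output of three clustering runs on the same six objects.
runs = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same grouping as the first run, labels swapped
    [0, 0, 1, 1, 1, 1],  # one object assigned differently
]
print(pairwise_stability(runs))
```

Because the adjusted Rand index ignores label names, runs that find the same grouping under different cluster numbers count as fully stable.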

Journal Article
TL;DR: The results show that tight clustering and model-based clustering consistently outperform other clustering methods both in simulated and real data while hierarchical clustering and SOM perform among the worst.
Abstract: Motivation: Microarray technology has been widely applied in biological and clinical studies for simultaneous monitoring of gene expression in thousands of genes. Gene clustering analysis is found useful for discovering groups of correlated genes potentially co-regulated or associated to the disease or conditions under investigation. Many clustering methods including hierarchical clustering, K-means, PAM, SOM, mixture model-based clustering and tight clustering have been widely used in the literature. Yet no comprehensive comparative study has been performed to evaluate the effectiveness of these methods. Results: In this paper, six gene clustering methods are evaluated by simulated data from a hierarchical log-normal model with various degrees of perturbation as well as four real datasets. A weighted Rand index is proposed for measuring similarity of two clustering results with possible scattered genes (i.e. a set of noise genes not being clustered). Performance of the methods in the real data is assessed by a predictive accuracy analysis through verified gene annotations. Our results show that tight clustering and model-based clustering consistently outperform other clustering methods both in simulated and real data while hierarchical clustering and SOM perform among the worst. Our analysis provides deep insight to the complicated gene clustering problem of expression profile and serves as a practical guideline for routine microarray cluster analysis. Supplementary information: Supplementary data are available at Bioinformatics online.

294 citations


Network Information
Related Topics (5)
Cluster analysis: 146.5K papers, 2.9M citations, 83% related
Support vector machine: 73.6K papers, 1.7M citations, 80% related
Feature (computer vision): 128.2K papers, 1.7M citations, 78% related
Deep learning: 79.8K papers, 2.1M citations, 78% related
Feature extraction: 111.8K papers, 2.1M citations, 78% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  8
2022  22
2021  70
2020  64
2019  45
2018  42