
Showing papers on "Rand index published in 2007"


Journal ArticleDOI
TL;DR: A fuzzy extension of the Rand index is introduced, able to evaluate a fuzzy partition of a data set - provided by a fuzzy clustering algorithm or a classifier with fuzzy-like outputs - against a reference hard partition that encodes the actual (known) data classes.

198 citations
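The fuzzy extension described above generalizes the classic Rand index for hard partitions, which simply counts the point pairs on which two partitions agree. A minimal sketch of the hard version (function name mine):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Classic Rand index: the fraction of point pairs that two hard
    partitions treat the same way (both same-cluster or both split)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

Note the index is invariant to relabeling: `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` is 1.0.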


Journal ArticleDOI
TL;DR: Experiments on gene expression data indicate that GCC, applied here for the first time to class discovery in microarray data, can outperform most existing algorithms, identify the number of classes correctly in real cancer datasets, and discover classes of samples with biological meaning.
Abstract: Motivation: Consensus clustering, also known as cluster ensemble, is one of the important techniques for microarray data analysis, and is particularly useful for class discovery from microarray data. Compared with traditional clustering algorithms, consensus clustering approaches have the ability to integrate multiple partitions from different cluster solutions to improve the robustness, stability, scalability and parallelization of the clustering algorithms. By consensus clustering, one can discover the underlying classes of the samples in gene expression data. Results: In addition to exploring a graph-based consensus clustering (GCC) algorithm to estimate the underlying classes of the samples in microarray data, we also design a new validation index to determine the number of classes in microarray data. To our knowledge, this is the first time GCC has been applied to class discovery for microarray data. Given a pre-specified maximum number of classes (denoted as Kmax in this article), our algorithm can discover the true number of classes for the samples in microarray data according to a new cluster validation index called the Modified Rand Index. Experiments on gene expression data indicate that our new algorithm can (i) outperform most of the existing algorithms, (ii) identify the number of classes correctly in real cancer datasets, and (iii) discover the classes of samples with biological meaning. Availability: Matlab source code for the GCC algorithm is available upon request from Zhiwen Yu. Contact: yuzhiwen@cs.cityu.edu.hk and cshswong@cityu.edu.hk Supplementary information: Supplementary data are available at Bioinformatics online.

167 citations
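Consensus clustering methods such as GCC typically aggregate the base partitions into some pairwise-agreement structure before extracting the final clustering. A hedged sketch of the standard co-association matrix, which may differ from the paper's actual graph construction:

```python
import numpy as np

def coassociation(partitions):
    """Co-association matrix: entry (i, j) is the fraction of base
    clusterings that put samples i and j in the same cluster."""
    n = len(partitions[0])
    M = np.zeros((n, n))
    for labels in partitions:
        lab = np.asarray(labels)
        M += (lab[:, None] == lab[None, :])  # boolean agreement, summed
    return M / len(partitions)
```

The resulting matrix can be thresholded or fed to a graph-partitioning step to obtain the consensus clustering.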


Proceedings ArticleDOI
03 Dec 2007
TL;DR: An adjusted iK-Means method is proposed that performs well in the current experimental setting; it is compared with the least-squares and least-modules versions of Mirkin's intelligent K-Means (iK-Means) method.
Abstract: K-means is one of the most popular data mining and unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a pre-specified number of clusters K, therefore the problem of determining "the right number of clusters" has attracted considerable interest. However, to the authors' knowledge, no experimental comparison of the proposed selection options has been reported so far. This paper presents results of such a comparison involving eight selection options representing four approaches. We generate data according to a Gaussian-mixture distribution with varying cluster spread and spatial size. The most consistent results are shown by the least-squares and least-modules versions of an intelligent version of the method, iK-Means by Mirkin [14]. However, the right K is reproduced best by Hartigan's [5] method. This leads us to propose an adjusted iK-Means method, which performs well in the current experimental setting.

39 citations
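Hartigan's method, reported here as the best at reproducing the right K, compares successive within-cluster sums of squares. A sketch using the conventional formulation and cutoff of 10 (the paper's exact variant may differ):

```python
def hartigan_k(wss, n, threshold=10.0):
    """Pick K by Hartigan's rule. wss[k-1] is the within-cluster sum
    of squares obtained with k clusters on n points. Return the
    smallest k whose statistic
        H(k) = (W_k / W_{k+1} - 1) * (n - k - 1)
    falls to the threshold (conventionally 10)."""
    for k in range(1, len(wss)):
        h = (wss[k - 1] / wss[k] - 1.0) * (n - k - 1)
        if h <= threshold:
            return k
    return len(wss)
```

For example, with `wss = [1000, 400, 150, 140, 135]` on 100 points, the sharp drop from 2 to 3 clusters and the flat tail afterwards yield K = 3.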


Book ChapterDOI
18 Jul 2007
TL;DR: The proposed metric is highly negatively correlated with an alienation coefficient K that is designed to test the recovery of relative distances, and the monotonic regression used by Nonmetric MDS produces solutions with good values of the metric.
Abstract: We develop a metric, based upon the RAND index, for the comparison and evaluation of dimensionality reduction techniques. This metric is designed to test the preservation of neighborhood structure in derived lower-dimensional configurations. We use a customer information data set to show how the metric can be used to compare dimensionality reduction methods, tune method parameters, and choose solutions when methods have a local optimum problem. We show that the metric is highly negatively correlated with an alienation coefficient K that is designed to test the recovery of relative distances. In general, a method with a good value of the metric also has a good value of K. However, the monotonic regression used by Nonmetric MDS produces solutions with good values of the metric but poor values of K.

28 citations
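A rough stand-in for a neighborhood-preservation score (not the paper's RAND-based construction, whose details are not given here) compares each point's nearest-neighbour set before and after reduction:

```python
import numpy as np

def knn_sets(X, k):
    """Index sets of each point's k nearest neighbours (Euclidean)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)  # a point is not its own neighbour
    return [set(np.argsort(row)[:k]) for row in D]

def neighborhood_agreement(X_high, X_low, k=2):
    """Mean fraction of each point's k nearest neighbours preserved
    in the low-dimensional configuration."""
    hi = knn_sets(np.asarray(X_high, dtype=float), k)
    lo = knn_sets(np.asarray(X_low, dtype=float), k)
    return float(np.mean([len(a & b) / k for a, b in zip(hi, lo)]))
```

A perfect embedding scores 1.0; random placement drives the score toward k/(n-1).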


Journal ArticleDOI
TL;DR: The performance of model-based clustering of gene expression data is improved by including probe-level measurement error and more biologically meaningful clustering results are obtained.
Abstract: Clustering is an important analysis performed on microarray gene expression data since it groups genes which have similar expression patterns and enables the exploration of unknown gene functions. Microarray experiments are associated with many sources of experimental and biological variation and the resulting gene expression data are therefore very noisy. Many heuristic and model-based clustering approaches have been developed to cluster this noisy data. However, few of them include consideration of probe-level measurement error which provides rich information about technical variability. We augment a standard model-based clustering method to incorporate probe-level measurement error. Using probe-level measurements from a recently developed Affymetrix probe-level model, multi-mgMOS, we include the probe-level measurement error directly into the standard Gaussian mixture model. Our augmented model is shown to provide improved clustering performance on simulated datasets and a real mouse time-course dataset. The performance of model-based clustering of gene expression data is improved by including probe-level measurement error and more biologically meaningful clustering results are obtained.

25 citations
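The key idea of the augmented model is that each observation's probe-level error variance is added to the component variance when computing mixture responsibilities. A one-dimensional sketch of that E-step (names and simplifications mine; the paper works with multi-mgMOS output and a full Gaussian mixture):

```python
import math

def responsibilities(x, se2, means, variances, weights):
    """Posterior component probabilities for one observation x whose
    measurement-error variance se2 is added to each component's
    variance before evaluating the Gaussian density."""
    dens = []
    for mu, var, w in zip(means, variances, weights):
        v = var + se2  # component variance inflated by measurement error
        dens.append(w * math.exp(-(x - mu) ** 2 / (2 * v))
                    / math.sqrt(2 * math.pi * v))
    total = sum(dens)
    return [d / total for d in dens]
```

Noisy probes (large `se2`) flatten the densities, so such observations pull cluster assignments less strongly than precise ones.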


Book ChapterDOI
01 Jan 2007
TL;DR: An automatic validation of hierarchical clustering based on resampling techniques is recommended; it can be considered a three-level assessment of stability.
Abstract: An automatic validation of hierarchical clustering based on resampling techniques is recommended; it can be considered a three-level assessment of stability. The first and most general level is decision making about the appropriate number of clusters. The decision is based on measures of correspondence between partitions, such as the adjusted Rand index. Second, the stability of each individual cluster is assessed based on measures of similarity between sets, such as the Jaccard coefficient. In the third and most detailed level of validation, the reliability of the cluster membership of each individual observation can be assessed. The built-in validation is demonstrated on the wine data set from the UCI repository, where both the number of clusters and the class membership are known beforehand.

11 citations
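The second level rests on set similarity between a cluster and its best match under resampling. A minimal sketch of the Jaccard coefficient and the matching step (helper names mine):

```python
def jaccard(a, b):
    """Jaccard coefficient between two clusters viewed as index sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster_stability(cluster, resampled_clusters):
    """Stability of one cluster: Jaccard similarity to its closest
    counterpart in a clustering of resampled data."""
    return max(jaccard(cluster, c) for c in resampled_clusters)
```

Averaging `cluster_stability` over many resamples gives the per-cluster stability score; values near 1 indicate a cluster that survives perturbation of the data.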


Proceedings ArticleDOI
15 Apr 2007
TL;DR: In this paper, a method for clustering unknown speech utterances based on their associated speakers is presented, which jointly optimizes the generated clusters and the number of clusters by estimating and minimizing the Rand index of the clustering.
Abstract: This paper presents an effective method for clustering unknown speech utterances based on their associated speakers. The proposed method jointly optimizes the generated clusters and the number of clusters by estimating and minimizing the Rand index of the clustering. The Rand index, which reflects clustering errors that utterances from the same speaker are placed in different clusters, or utterances from different speakers are placed in the same cluster, reaches its minimal value only when the number of clusters is equal to the true speaker population size. We approximate the Rand index by a function of the similarity measures between utterances and employ the genetic algorithm to determine the cluster where each utterance should be located, such that the overall clustering errors are minimized. The experimental results show that the proposed speaker-clustering method outperforms the conventional method based on hierarchical agglomerative clustering in conjunction with the Bayesian information criterion to determine the number of clusters.

5 citations
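The objective being minimized can be written at the level of utterance pairs: splitting similar utterances and merging dissimilar ones both incur cost. A hedged sketch of such a pair-level objective (the paper's exact approximation and its genetic-algorithm search are not reproduced):

```python
def pairwise_cost(sim, labels):
    """sim[i][j] in [0, 1] estimates how likely utterances i and j
    share a speaker; cost grows when similar pairs are split across
    clusters or dissimilar pairs are merged into one."""
    n = len(labels)
    cost = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                cost += 1.0 - sim[i][j]  # merged pair should be similar
            else:
                cost += sim[i][j]        # split pair should be dissimilar
    return cost
```

A search procedure (the paper uses a genetic algorithm) then looks for the labeling, including its number of clusters, that minimizes this cost.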


Book ChapterDOI
01 Jan 2007
TL;DR: An adaptive dissimilarity index is proposed that covers both value and behavior proximity; it is illustrated through a classification process for identifying genes' cell cycle phases.
Abstract: DNA microarray technology makes it possible to monitor simultaneously the expression levels of thousands of genes during important biological processes and across collections of related experiments. Clustering and classification techniques have proved helpful for understanding gene function, gene regulation, and cellular processes. However, the conventional proximity measures between gene expression data, used for clustering or classification purposes, do not fit gene expression specifics, as they are based on the closeness of the expression magnitudes regardless of the overall gene expression profile (shape). We propose in this paper an adaptive dissimilarity index that covers both value and behavior proximity. The effectiveness of the adaptive dissimilarity index is illustrated through a classification process for identifying genes' cell cycle phases.

2 citations
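In the spirit of, though not identical to, the proposed index, value and behavior proximity can be blended by combining a distance on raw values with a distance on successive differences (the profile shape):

```python
import math

def blended_dissimilarity(x, y, alpha=0.5):
    """Blend value proximity (Euclidean distance on the profiles) with
    behavior proximity (Euclidean distance on successive differences).
    alpha = 1 ignores shape; alpha = 0 ignores magnitude."""
    value = math.dist(x, y)
    dx = [x[i + 1] - x[i] for i in range(len(x) - 1)]
    dy = [y[i + 1] - y[i] for i in range(len(y) - 1)]
    shape = math.dist(dx, dy)
    return alpha * value + (1 - alpha) * shape
```

Two profiles that rise and fall together but at different magnitudes score 0 on the shape term, which a pure Euclidean distance would miss.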


Proceedings ArticleDOI
21 May 2007
TL;DR: The research is conducted with Markovian models, so that there is a solid ground for comparison, although the benefits of applying clustering techniques lie in the domain of non-Markovian processes.
Abstract: In contemporary telecommunication systems Markov processes are seldom observed, and the widely used Markovian models do not represent the real system precisely. To avoid the need to model the system with a Markov chain, we apply different clustering approaches to obtain the steady-state probabilities, which are represented by the data clusters. Some widely used data clustering methods are applied for performance evaluation of different telecommunication networks. However, in order to validate our investigation, we conduct the research with Markovian models, so that we have a solid ground for comparison, although the benefits of applying clustering techniques lie in the domain of non-Markovian processes.

1 citation
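The Markovian ground truth against which the cluster-based estimates are compared is the chain's stationary distribution. A sketch of computing it directly as the leading left eigenvector of the transition matrix:

```python
import numpy as np

def steady_state(P):
    """Stationary distribution of an ergodic Markov chain: the left
    eigenvector of P for eigenvalue 1, normalized to sum to one."""
    vals, vecs = np.linalg.eig(np.asarray(P, dtype=float).T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()
```

Cluster occupancy frequencies from the data can then be compared against this vector to judge how well the clustering recovers the steady-state probabilities.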


Journal ArticleDOI
01 Jan 2007
TL;DR: This paper compared various cluster validity indices for low-dimensional simulation data and real gene expression data and found that Dunn's index is the most effective and robust, the Silhouette index is next, and the Davies-Bouldin index ranks lowest among the internal measures.
Abstract: Many clustering algorithms and cluster validation techniques for high-dimensional gene expression data have been suggested. The evaluations of these cluster validation techniques have, however, seldom been implemented. In this paper we compared various cluster validity indices for low-dimensional simulation data and real gene expression data, and found that Dunn's index is the most effective and robust, the Silhouette index is next, and the Davies-Bouldin index ranks lowest among the internal measures. The Jaccard index is much more effective than the Goodman-Kruskal index and the adjusted Rand index among the external measures.
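Among the internal measures compared, Dunn's index is the simplest to state: the smallest between-cluster distance divided by the largest cluster diameter, with higher values indicating compact, well-separated clusters. A minimal sketch using single-linkage separation and complete diameter:

```python
import math

def dunn_index(clusters):
    """clusters: list of lists of points (tuples). Minimum inter-cluster
    distance divided by maximum intra-cluster diameter."""
    def dmin(a, b):
        return min(math.dist(p, q) for p in a for q in b)
    def diameter(c):
        return max(math.dist(p, q) for p in c for q in c)
    inter = min(dmin(a, b) for i, a in enumerate(clusters)
                for b in clusters[i + 1:])
    diam = max(diameter(c) for c in clusters)
    return inter / diam
```

Note the O(n^2) pair scans make this impractical for very large data without approximation.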

Proceedings ArticleDOI
29 Oct 2007
TL;DR: The Sentence Cluster Model is developed as a multidimensional SMM, its parameters are estimated with the EM algorithm, and results are discussed with respect to word sense distinction, part-of-speech distinction and window size selection.
Abstract: The purpose of this article is to research clustering methods based on statistical models and then address the Chinese sentence clustering problem on a bilingual lexicographical platform. Viewing the data as co-occurrence data, we develop the Sentence Cluster Model as a multidimensional SMM and derive the parameter estimates with the EM algorithm. Based on this model, we present three methods for sentence clustering, and use the Rand index to evaluate our methods through experiments on a corpus, with comparison to the k-means algorithm. We mainly discuss the results with respect to word sense distinction, part-of-speech distinction and window size selection.