
Showing papers on "Rand index published in 2007"


Journal ArticleDOI
TL;DR: A fuzzy extension of the Rand index is introduced, able to evaluate a fuzzy partition of a data set - provided by a fuzzy clustering algorithm or a classifier with fuzzy-like outputs - against a reference hard partition that encodes the actual (known) data classes.

198 citations
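The fuzzy extension described above generalizes the classic Rand index for hard partitions, which simply counts the point pairs on which two partitions agree. A minimal sketch of the hard version (function name mine):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Classic Rand index: the fraction of point pairs that two hard
    partitions treat the same way (both same-cluster or both split)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

Note the index is invariant to relabeling: `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` is 1.0.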


Journal ArticleDOI
TL;DR: Experiments on gene expression data indicate that GCC, applied here for the first time to class discovery in microarray data, can outperform most existing algorithms, identify the number of classes correctly in real cancer datasets, and discover classes of samples with biological meaning.
Abstract: Motivation: Consensus clustering, also known as cluster ensemble, is one of the important techniques for microarray data analysis, and is particularly useful for class discovery from microarray data. Compared with traditional clustering algorithms, consensus clustering approaches have the ability to integrate multiple partitions from different cluster solutions to improve the robustness, stability, scalability and parallelization of the clustering algorithms. By consensus clustering, one can discover the underlying classes of the samples in gene expression data. Results: In addition to exploring a graph-based consensus clustering (GCC) algorithm to estimate the underlying classes of the samples in microarray data, we also design a new validation index to determine the number of classes in microarray data. To our knowledge, this is the first time GCC has been applied to class discovery for microarray data. Given a pre-specified maximum number of classes (denoted as Kmax in this article), our algorithm can discover the true number of classes for the samples in microarray data according to a new cluster validation index called the Modified Rand Index. Experiments on gene expression data indicate that our new algorithm can (i) outperform most of the existing algorithms, (ii) identify the number of classes correctly in real cancer datasets, and (iii) discover the classes of samples with biological meaning. Availability: Matlab source code for the GCC algorithm is available upon request from Zhiwen Yu. Contact: yuzhiwen@cs.cityu.edu.hk and cshswong@cityu.edu.hk Supplementary information: Supplementary data are available at Bioinformatics online.

167 citations
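Consensus clustering methods such as GCC typically aggregate the base partitions into some pairwise-agreement structure before extracting the final clustering. A hedged sketch of the standard co-association matrix, which may differ from the paper's actual graph construction:

```python
import numpy as np

def coassociation(partitions):
    """Co-association matrix: entry (i, j) is the fraction of base
    clusterings that put samples i and j in the same cluster."""
    n = len(partitions[0])
    M = np.zeros((n, n))
    for labels in partitions:
        lab = np.asarray(labels)
        M += (lab[:, None] == lab[None, :])  # boolean agreement, summed
    return M / len(partitions)
```

The resulting matrix can be thresholded or fed to a graph-partitioning step to obtain the consensus clustering.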


Proceedings ArticleDOI
03 Dec 2007
TL;DR: An adjusted iK-Means method is proposed that performs well in the current experimental setting; it is compared with the least-squares and least-modules versions of Mirkin's intelligent K-Means (iK-Means) method.
Abstract: K-means is one of the most popular data mining and unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a pre-specified number of clusters K, therefore the problem of determining "the right number of clusters" has attracted considerable interest. However, to the authors' knowledge, no experimental comparison of the proposed selection options has been reported so far. This paper presents results of such a comparison involving eight selection options representing four approaches. We generate data according to a Gaussian-mixture distribution with varying cluster spread and spatial size. The most consistent results are shown by the least-squares and least-modules versions of an intelligent version of the method, iK-Means by Mirkin [14]. However, the right K is reproduced best by Hartigan's [5] method. This leads us to propose an adjusted iK-Means method, which performs well in the current experimental setting.

39 citations
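Hartigan's method, reported here as the best at reproducing the right K, compares successive within-cluster sums of squares. A sketch using the conventional formulation and cutoff of 10 (the paper's exact variant may differ):

```python
def hartigan_k(wss, n, threshold=10.0):
    """Pick K by Hartigan's rule. wss[k-1] is the within-cluster sum
    of squares obtained with k clusters on n points. Return the
    smallest k whose statistic
        H(k) = (W_k / W_{k+1} - 1) * (n - k - 1)
    falls to the threshold (conventionally 10)."""
    for k in range(1, len(wss)):
        h = (wss[k - 1] / wss[k] - 1.0) * (n - k - 1)
        if h <= threshold:
            return k
    return len(wss)
```

For example, with `wss = [1000, 400, 150, 140, 135]` on 100 points, the sharp drop from 2 to 3 clusters and the flat tail afterwards yield K = 3.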


Book ChapterDOI
18 Jul 2007
TL;DR: The proposed metric is highly negatively correlated with an alienation coefficient K that is designed to test the recovery of relative distances, and the monotonic regression used by Nonmetric MDS produces solutions with good values of the metric.
Abstract: We develop a metric, based upon the RAND index, for the comparison and evaluation of dimensionality reduction techniques. This metric is designed to test the preservation of neighborhood structure in derived lower-dimensional configurations. We use a customer information data set to show how the metric can be used to compare dimensionality reduction methods, tune method parameters, and choose solutions when methods have a local optimum problem. We show that the metric is highly negatively correlated with an alienation coefficient K that is designed to test the recovery of relative distances. In general, a method with a good value of the metric also has a good value of K. However, the monotonic regression used by Nonmetric MDS produces solutions with good values of the metric but poor values of K.

28 citations
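A rough stand-in for a neighborhood-preservation score (not the paper's RAND-based construction, whose details are not given here) compares each point's nearest-neighbour set before and after reduction:

```python
import numpy as np

def knn_sets(X, k):
    """Index sets of each point's k nearest neighbours (Euclidean)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)  # a point is not its own neighbour
    return [set(np.argsort(row)[:k]) for row in D]

def neighborhood_agreement(X_high, X_low, k=2):
    """Mean fraction of each point's k nearest neighbours preserved
    in the low-dimensional configuration."""
    hi = knn_sets(np.asarray(X_high, dtype=float), k)
    lo = knn_sets(np.asarray(X_low, dtype=float), k)
    return float(np.mean([len(a & b) / k for a, b in zip(hi, lo)]))
```

A perfect embedding scores 1.0; random placement drives the score toward k/(n-1).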


Journal ArticleDOI
TL;DR: The performance of model-based clustering of gene expression data is improved by including probe-level measurement error and more biologically meaningful clustering results are obtained.
Abstract: Clustering is an important analysis performed on microarray gene expression data since it groups genes which have similar expression patterns and enables the exploration of unknown gene functions. Microarray experiments are associated with many sources of experimental and biological variation and the resulting gene expression data are therefore very noisy. Many heuristic and model-based clustering approaches have been developed to cluster this noisy data. However, few of them include consideration of probe-level measurement error which provides rich information about technical variability. We augment a standard model-based clustering method to incorporate probe-level measurement error. Using probe-level measurements from a recently developed Affymetrix probe-level model, multi-mgMOS, we include the probe-level measurement error directly into the standard Gaussian mixture model. Our augmented model is shown to provide improved clustering performance on simulated datasets and a real mouse time-course dataset. The performance of model-based clustering of gene expression data is improved by including probe-level measurement error and more biologically meaningful clustering results are obtained.

25 citations
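The key idea of the augmented model is that each observation's probe-level error variance is added to the component variance when computing mixture responsibilities. A one-dimensional sketch of that E-step (names and simplifications mine; the paper works with multi-mgMOS output and a full Gaussian mixture):

```python
import math

def responsibilities(x, se2, means, variances, weights):
    """Posterior component probabilities for one observation x whose
    measurement-error variance se2 is added to each component's
    variance before evaluating the Gaussian density."""
    dens = []
    for mu, var, w in zip(means, variances, weights):
        v = var + se2  # component variance inflated by measurement error
        dens.append(w * math.exp(-(x - mu) ** 2 / (2 * v))
                    / math.sqrt(2 * math.pi * v))
    total = sum(dens)
    return [d / total for d in dens]
```

Noisy probes (large `se2`) flatten the densities, so such observations pull cluster assignments less strongly than precise ones.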


Book ChapterDOI
01 Jan 2007
TL;DR: An automatic validation of hierarchical clustering based on resampling techniques is recommended; it can be considered a three-level assessment of stability.
Abstract: An automatic validation of hierarchical clustering based on resampling techniques is recommended; it can be considered a three-level assessment of stability. The first and most general level is decision making about the appropriate number of clusters. The decision is based on measures of correspondence between partitions, such as the adjusted Rand index. Second, the stability of each individual cluster is assessed based on measures of similarity between sets, such as the Jaccard coefficient. In the third and most detailed level of validation, the reliability of the cluster membership of each individual observation can be assessed. The built-in validation is demonstrated on the wine data set from the UCI repository, where both the number of clusters and the class membership are known beforehand.

11 citations
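The second level rests on set similarity between a cluster and its best match under resampling. A minimal sketch of the Jaccard coefficient and the matching step (helper names mine):

```python
def jaccard(a, b):
    """Jaccard coefficient between two clusters viewed as index sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster_stability(cluster, resampled_clusters):
    """Stability of one cluster: Jaccard similarity to its closest
    counterpart in a clustering of resampled data."""
    return max(jaccard(cluster, c) for c in resampled_clusters)
```

Averaging `cluster_stability` over many resamples gives the per-cluster stability score; values near 1 indicate a cluster that survives perturbation of the data.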


Proceedings ArticleDOI
15 Apr 2007
TL;DR: In this paper, a method for clustering unknown speech utterances based on their associated speakers is presented, which jointly optimizes the generated clusters and the number of clusters by estimating and minimizing the Rand index of the clustering.
Abstract: This paper presents an effective method for clustering unknown speech utterances based on their associated speakers. The proposed method jointly optimizes the generated clusters and the number of clusters by estimating and minimizing the Rand index of the clustering. The Rand index, which reflects clustering errors that utterances from the same speaker are placed in different clusters, or utterances from different speakers are placed in the same cluster, reaches its minimal value only when the number of clusters is equal to the true speaker population size. We approximate the Rand index by a function of the similarity measures between utterances and employ the genetic algorithm to determine the cluster where each utterance should be located, such that the overall clustering errors are minimized. The experimental results show that the proposed speaker-clustering method outperforms the conventional method based on hierarchical agglomerative clustering in conjunction with the Bayesian information criterion to determine the number of clusters.

5 citations
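The objective being minimized can be written at the level of utterance pairs: splitting similar utterances and merging dissimilar ones both incur cost. A hedged sketch of such a pair-level objective (the paper's exact approximation and its genetic-algorithm search are not reproduced):

```python
def pairwise_cost(sim, labels):
    """sim[i][j] in [0, 1] estimates how likely utterances i and j
    share a speaker; cost grows when similar pairs are split across
    clusters or dissimilar pairs are merged into one."""
    n = len(labels)
    cost = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                cost += 1.0 - sim[i][j]  # merged pair should be similar
            else:
                cost += sim[i][j]        # split pair should be dissimilar
    return cost
```

A search procedure (the paper uses a genetic algorithm) then looks for the labeling, including its number of clusters, that minimizes this cost.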


Book ChapterDOI
01 Jan 2007
TL;DR: An adaptive dissimilarity index is proposed that covers both value and behavior proximity; it is illustrated through a classification process for identifying genes' cell cycle phases.
Abstract: DNA microarray technology makes it possible to monitor simultaneously the expression levels of thousands of genes during important biological processes and across collections of related experiments. Clustering and classification techniques have proved helpful for understanding gene function, gene regulation, and cellular processes. However, the conventional proximity measures between gene expression data, used for clustering or classification purposes, do not fit gene expression specifics, as they are based on the closeness of the expression magnitudes regardless of the overall gene expression profile (shape). We propose in this paper an adaptive dissimilarity index that covers both value and behavior proximity. The effectiveness of the adaptive dissimilarity index is illustrated through a classification process for identifying genes' cell cycle phases.

2 citations
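In the spirit of, though not identical to, the proposed index, value and behavior proximity can be blended by combining a distance on raw values with a distance on successive differences (the profile shape):

```python
import math

def blended_dissimilarity(x, y, alpha=0.5):
    """Blend value proximity (Euclidean distance on the profiles) with
    behavior proximity (Euclidean distance on successive differences).
    alpha = 1 ignores shape; alpha = 0 ignores magnitude."""
    value = math.dist(x, y)
    dx = [x[i + 1] - x[i] for i in range(len(x) - 1)]
    dy = [y[i + 1] - y[i] for i in range(len(y) - 1)]
    shape = math.dist(dx, dy)
    return alpha * value + (1 - alpha) * shape
```

Two profiles that rise and fall together but at different magnitudes score 0 on the shape term, which a pure Euclidean distance would miss.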


Proceedings ArticleDOI
21 May 2007
TL;DR: The research is conducted with Markovian models, so that there is a solid ground for comparison, although the benefits of applying clustering techniques lie in the domain of non-Markovian processes.
Abstract: In contemporary telecommunication systems Markov processes are seldom observed, and the widely used Markovian models do not represent the real system precisely. To avoid the need to model the system with a Markov chain, we apply different clustering approaches to obtain the steady-state probabilities, which are represented by the data clusters. Some widely used data clustering methods are applied for performance evaluation of different telecommunication networks. However, in order to validate our investigation, we conduct the research with Markovian models, so that we have a solid ground for comparison, although the benefits of applying clustering techniques lie in the domain of non-Markovian processes.

1 citation
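The Markovian ground truth against which the cluster-based estimates are compared is the chain's stationary distribution. A sketch of computing it directly as the leading left eigenvector of the transition matrix:

```python
import numpy as np

def steady_state(P):
    """Stationary distribution of an ergodic Markov chain: the left
    eigenvector of P for eigenvalue 1, normalized to sum to one."""
    vals, vecs = np.linalg.eig(np.asarray(P, dtype=float).T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()
```

Cluster occupancy frequencies from the data can then be compared against this vector to judge how well the clustering recovers the steady-state probabilities.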


Journal ArticleDOI
01 Jan 2007
TL;DR: This paper compared various cluster validity indices for low-dimensional simulation data and real gene expression data and found that Dunn's index is the most effective and robust, the Silhouette index is next, and the Davies-Bouldin index ranks lowest among the internal measures.
Abstract: Many clustering algorithms and cluster validation techniques for high-dimensional gene expression data have been suggested. The evaluations of these cluster validation techniques have, however, seldom been implemented. In this paper we compared various cluster validity indices for low-dimensional simulation data and real gene expression data, and found that Dunn's index is the most effective and robust, the Silhouette index is next, and the Davies-Bouldin index ranks lowest among the internal measures. The Jaccard index is much more effective than the Goodman-Kruskal index and the adjusted Rand index among the external measures.
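Among the internal measures compared, Dunn's index is the simplest to state: the smallest between-cluster distance divided by the largest cluster diameter, with higher values indicating compact, well-separated clusters. A minimal sketch using single-linkage separation and complete diameter:

```python
import math

def dunn_index(clusters):
    """clusters: list of lists of points (tuples). Minimum inter-cluster
    distance divided by maximum intra-cluster diameter."""
    def dmin(a, b):
        return min(math.dist(p, q) for p in a for q in b)
    def diameter(c):
        return max(math.dist(p, q) for p in c for q in c)
    inter = min(dmin(a, b) for i, a in enumerate(clusters)
                for b in clusters[i + 1:])
    diam = max(diameter(c) for c in clusters)
    return inter / diam
```

Note the O(n^2) pair scans make this impractical for very large data without approximation.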

Proceedings ArticleDOI
29 Oct 2007
TL;DR: The Sentence Cluster Model is developed as a multidimensional SMM, its parameters are estimated with the EM algorithm, and results are discussed with respect to word sense distinction, part-of-speech distinction and window size selection.
Abstract: The purpose of this article is to research clustering methods based on statistical models and then address the Chinese sentence clustering problem on a bilingual lexicographical platform. Viewing the data as co-occurrence data, we develop the Sentence Cluster Model as a multidimensional SMM and derive the parameter estimates with the EM algorithm. Based on this model, we present three methods for sentence clustering, and use the Rand index to evaluate our methods through experiments on a corpus, with comparison to the k-means algorithm. We mainly discuss the results with respect to word sense distinction, part-of-speech distinction and window size selection.