Open Access, Journal Article

Comparisons and validation of statistical clustering techniques for microarray gene expression data

Susmita Datta, +1 more
01 Mar 2003, Vol. 19, Iss. 4, pp. 459-466
TLDR
Six clustering algorithms are considered and it is shown that the group means produced by Diana are the closest and those produced by UPGMA are the farthest from a model profile based on a set of hand-picked genes.
Abstract
Motivation: With the advent of microarray chip technology, large data sets are emerging containing the simultaneous expression levels of thousands of genes at various time points during a biological process. Biologists are attempting to group genes based on the temporal pattern of their expression levels. While the use of hierarchical clustering (UPGMA) with correlation ‘distance’ has been the most common in the microarray studies, there are many more choices of clustering algorithms in pattern recognition and statistics literature. At the moment there do not seem to be any clear-cut guidelines regarding the choice of a clustering algorithm to be used for grouping genes based on their expression profiles.

Results: In this paper, we consider six clustering algorithms (of various flavors!) and evaluate their performances on a well-known publicly available microarray data set on sporulation of budding yeast and on two simulated data sets. Among other things, we formulate three reasonable validation strategies that can be used with any clustering algorithm when temporal observations or replications are present. We evaluate each of these six clustering methods with these validation measures. While the ‘best’ method is dependent on the exact validation strategy and the number of clusters to be used, overall Diana appears to be a solid performer. Interestingly, the performances of correlation-based hierarchical clustering and model-based clustering (another method that has been advocated by a number of researchers) appear to be on opposite extremes, depending on what validation measure one employs. Next it is shown that the group means produced by Diana are the closest and those produced by UPGMA are the farthest from a model profile based on a set of hand-picked genes.

Availability: S+ codes for the partial least squares based clustering are available from the authors upon request. All other clustering methods considered have S+ implementation in the library MASS. S+ codes for calculating the validation measures are available from the authors upon request. The sporulation data set is publicly available at



Citations
Journal Article

Microarray data analysis: from disarray to consolidation and consensus.

TL;DR: In just a few years, microarrays have gone from obscurity to being almost ubiquitous in biological research, and points of consensus are emerging about the general approaches that warrant use and elaboration.
Journal Article

Increased Expression of Genes Converting Adrenal Androgens to Testosterone in Androgen-Independent Prostate Cancer

TL;DR: Enhanced intracellular conversion of adrenal androgens to testosterone and dihydrotestosterone is a mechanism by which prostate cancer cells adapt to androgen deprivation and suggest new therapeutic targets.
Journal Article

A systematic comparison and evaluation of biclustering methods for gene expression data

TL;DR: A methodology for comparing and validating biclustering methods that includes a simple binary reference model that captures the essential features of most biclustering approaches and proposes a fast divide-and-conquer algorithm (Bimax).
Journal Article

Computational cluster validation in post-genomic data analysis

TL;DR: In this article, the authors present a review of clustering validation techniques for post-genomic data analysis, with a particular focus on their application to postgenomic analysis of biological data.