scispace - formally typeset
Open AccessJournal ArticleDOI

Model-based clustering and data transformations for gene expression data.

Reads0
Chats0
TLDR
The model-based approach has superior performance on synthetic data sets, consistently selecting the correct model and the number of clusters, and the validity of the Gaussian mixture assumption on different transformations of real data is explored.
Abstract
Motivation: Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, model-based clustering assumes that the data is generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The issues of selecting a ‘good’ clustering method and determining the ‘correct’ number of clusters are reduced to model selection problems in the probability framework. Gaussian mixture models have been shown to be a powerful tool for clustering in many applications. Results: We benchmarked the performance of modelbased clustering on several synthetic and real gene expression data sets for which external evaluation criteria were available. The model-based approach has superior performance on our synthetic data sets, consistently selecting the correct model and the number of clusters. On real expression data, the model-based approach produced clusters of quality comparable to a leading heuristic clustering algorithm, but with the key advantage of suggesting the number of clusters and an appropriate model. We also explored the validity of the Gaussian mixture assumption on different transformations of real data. We also assessed the degree to which these real gene expression data sets fit multivariate Gaussian distributions both before and after subjecting them to commonly used data transformations. Suitably chosen transformations seem to result in reasonable fits. Availability: MCLUST is available at http://www.stat. washington.edu/fraley/mclust. The software for the diagonal model is under development.

read more

Content maybe subject to copyright    Report

Citations
More filters
Journal ArticleDOI

Model-Based Clustering, Discriminant Analysis, and Density Estimation

TL;DR: This work reviews a general methodology for model-based clustering that provides a principled statistical approach to important practical questions that arise in cluster analysis, such as how many clusters are there, which clustering method should be used, and how should outliers be handled.
Journal ArticleDOI

mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models.

TL;DR: This updated version of mclust adds new covariance structures, dimension reduction capabilities for visualisation, model selection criteria, initialisation strategies for the EM algorithm, and bootstrap-based inference, making it a full-featured R package for data analysis via finite mixture modelling.
Journal ArticleDOI

Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data

TL;DR: A new methodology of class discovery and clustering validation tailored to the task of analyzing gene expression data is presented, and in conjunction with resampling techniques, it provides for a method to represent the consensus across multiple runs of a clustering algorithm and to assess the stability of the discovered clusters.
Journal ArticleDOI

Cluster analysis for gene expression data: a survey

TL;DR: This paper divides cluster analysis for gene expression data into three categories, presents specific challenges pertinent to each clustering category and introduces several representative approaches, and suggests the promising trends in this field.
References
More filters
Journal ArticleDOI

Estimating the Dimension of a Model

TL;DR: In this paper, the problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion.

Estimating the dimension of a model

TL;DR: In this paper, the problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion.
Journal ArticleDOI

Cluster analysis and display of genome-wide expression patterns

TL;DR: A system of cluster analysis for genome-wide expression data from DNA microarray hybridization is described that uses standard statistical algorithms to arrange genes according to similarity in pattern of gene expression, finding in the budding yeast Saccharomyces cerevisiae that clustering gene expression data groups together efficiently genes of known similar function.
Journal ArticleDOI

An Analysis of Transformations

TL;DR: In this article, Lindley et al. make the less restrictive assumption that such a normal, homoscedastic, linear model is appropriate after some suitable transformation has been applied to the y's.
Journal ArticleDOI

Objective Criteria for the Evaluation of Clustering Methods

TL;DR: This article proposes several criteria which isolate specific aspects of the performance of a method, such as its retrieval of inherent structure, its sensitivity to resampling and the stability of its results in the light of new data.
Related Papers (5)