Computational cluster validation in post-genomic data analysis
TLDR
This article reviews cluster validation techniques, with a particular focus on their application to post-genomic data analysis.
Abstract:
Motivation: The discovery of novel biological knowledge from the ab initio analysis of post-genomic data relies upon the use of unsupervised processing methods, in particular clustering techniques. Much recent research in bioinformatics has therefore been focused on the transfer of clustering methods introduced in other scientific fields and on the development of novel algorithms specifically designed to tackle the challenges posed by post-genomic data. The partitions returned by a clustering algorithm are commonly validated using visual inspection and concordance with prior biological knowledge; whether the clusters actually correspond to the real structure in the data is somewhat less frequently considered. Suitable computational cluster validation techniques are available in the general data-mining literature, but have been given only a fraction of the same attention in bioinformatics.
Results: This review paper aims to familiarize the reader with the battery of techniques available for the validation of clustering results, with a particular focus on their application to post-genomic data analysis. Synthetic and real biological datasets are used to demonstrate the benefits, and also some of the perils, of analytical cluster validation.
Availability: The software used in the experiments is available at http://dbkweb.ch.umist.ac.uk/handl/clustervalidation/
Contact: J.Handl@postgrad.manchester.ac.uk
Supplementary information: Enlarged colour plots are provided in the Supplementary Material, which is available at http://dbkweb.ch.umist.ac.uk/handl/clustervalidation/
Citations
Journal Article
The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups
Christina Curtis, Sohrab P. Shah, Suet-Feung Chin, Gulisa Turashvili, Oscar M. Rueda, Mark J Dunning, Doug Speed, Andy G. Lynch, Shamith A. Samarajiwa, Yinyin Yuan, Stefan Gräf, Gavin Ha, Gholamreza Haffari, Ali Bashashati, Roslin Russell, Steven McKinney, Anita Langerød, Andrew R. Green, Elena Provenzano, Gordon C. Wishart, Sarah E Pinder, Peter H. Watson, Florian Markowetz, Leigh C. Murphy, Ian O. Ellis, Arnie Purushotham, Anne Lise Børresen-Dale, James D. Brenton, Simon Tavaré, Carlos Caldas, Samuel Aparicio
TL;DR: The results provide a novel molecular stratification of the breast cancer population, derived from the impact of somatic CNAs on the transcriptome, and identify novel subgroups with distinct clinical outcomes, which reproduced in the validation cohort.
Journal Article
A systematic comparison and evaluation of biclustering methods for gene expression data
Amela Prelić, Stefan Bleuler, Philip Zimmermann, Anja Wille, Peter Bühlmann, Wilhelm Gruissem, Lars Hennig, Lothar Thiele, Eckart Zitzler
TL;DR: A methodology for comparing and validating biclustering methods that includes a simple binary reference model capturing the essential features of most biclustering approaches, together with a fast divide-and-conquer algorithm (Bimax).
Journal Article
Is my network module preserved and reproducible?
TL;DR: This work studies several types of network preservation statistics that do not require a module assignment in the test network, and finds that several human cortical modules are less preserved in chimpanzees.
Journal Article
Statistical strategies for avoiding false discoveries in metabolomics and related experiments
David Broadhurst, Douglas B. Kell
TL;DR: A list of some of the simpler checks that might improve one’s confidence that a candidate biomarker is not simply a statistical artefact is provided, and a series of preferred tests and visualisation tools that can assist readers and authors in assessing papers are suggested.