Bagging to improve the accuracy of a clustering procedure
Sandrine Dudoit, Jane Fridlyand
TL;DR: Two new resampling methods, inspired by bagging in prediction, are proposed to improve and assess the accuracy of a given clustering procedure, with the aim of partitioning tumor samples into clusters accurately.
Abstract
MOTIVATION: Microarray technology is increasingly being applied in biological and medical research to address a wide range of problems, such as the classification of tumors. An important statistical question associated with tumor classification is the identification of new tumor classes using gene expression profiles. Essential aspects of this clustering problem include identifying accurate partitions of the tumor samples into clusters and assessing the confidence of cluster assignments for individual samples.
RESULTS: Two new resampling methods, inspired by bagging in prediction, are proposed to improve and assess the accuracy of a given clustering procedure. In these ensemble methods, a partitioning clustering procedure is applied to bootstrap learning sets, and the resulting multiple partitions are combined either by voting or by creating a new dissimilarity matrix. As in prediction, the motivation behind bagging is to reduce variability in the partitioning results via averaging. The performance of the new and existing methods was compared using simulated data and gene expression data from two recently published cancer microarray studies. The bagged clustering procedures were in general at least as accurate as, and often substantially more accurate than, a single application of the partitioning clustering procedure. A valuable by-product of bagged clustering is the set of cluster votes, which can be used to assess the confidence of cluster assignments for individual observations.
SUPPLEMENTARY INFORMATION: For supplementary information on datasets, analyses, and software, consult http://www.stat.berkeley.edu/~sandrine and http://www.bioconductor.org.
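The dissimilarity-based combination described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the base procedure is a tiny k-means stand-in (the abstract only says "a partitioning clustering procedure"), and the function names `kmeans_labels` and `bagged_dissimilarity`, the toy one-dimensional data, and the convention of assigning dissimilarity 1.0 to pairs that are never drawn together are all hypothetical choices made for this example.

```python
import random
from itertools import combinations


def kmeans_labels(points, k, rng, n_iter=10):
    """Tiny Lloyd's k-means (assumed stand-in base procedure); one label per point."""
    # Start from k distinct observed points so no two centers coincide.
    centers = rng.sample(sorted(set(points)), k)
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(pt, centers[c])))
            for pt in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def bagged_dissimilarity(data, k, n_boot=50, seed=0):
    """Dissimilarity = 1 - (times co-clustered / times drawn in the same bootstrap set)."""
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]  # pair appeared in the same bootstrap set
    same = [[0] * n for _ in range(n)]      # pair also received the same cluster label
    for _ in range(n_boot):
        sample = rng.choices(range(n), k=n)  # bootstrap: n draws with replacement
        sample_points = [data[i] for i in sample]
        if len(set(sample_points)) < k:
            continue  # too few distinct points to form k clusters
        labels = kmeans_labels(sample_points, k, rng)
        label_of = dict(zip(sample, labels))  # duplicated indices share one label
        for i, j in combinations(sorted(label_of), 2):
            together[i][j] += 1
            if label_of[i] == label_of[j]:
                same[i][j] += 1
    dissim = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if together[i][j]:
            value = 1.0 - same[i][j] / together[i][j]
        else:
            value = 1.0  # assumption: pairs never drawn together count as maximally dissimilar
        dissim[i][j] = dissim[j][i] = value
    return dissim


# Toy data (hypothetical): two well-separated one-dimensional groups of six points each.
data = [(float(x),) for x in range(6)] + [(1000.0 + x,) for x in range(6)]
d = bagged_dissimilarity(data, k=2)
```

A matrix like `d` could then be handed to any clustering procedure that accepts dissimilarities to obtain a final partition. The co-clustering proportions behind it also give a rough indication of how stable each pairing is across bootstrap sets.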
Citations
Journal Article
Novel mutations target distinct subgroups of medulloblastoma
Giles W. Robinson, Matthew Parker, Tanya A. Kranenburg, Charles Lu, Xiang Chen, Li Ding, Timothy N. Phoenix, Erin Hedlund, Lei Wei, Xiaoyan Zhu, Nader Chalhoub, Suzanne J. Baker, Robert Huether, Richard W. Kriwacki, Natasha Curley, Radhika Thiruvenkatam, Jianmin Wang, Gang Wu, Michael Rusch, Xin Hong, Jared Becksfort, Pankaj Gupta, Jing Ma, John Easton, Bhavin Vadodaria, Arzu Onar-Thomas, Tong Lin, Shaoyi Li, Stanley Pounds, Steven W. Paugh, David Zhao, Daisuke Kawauchi, Martine F. Roussel, David Finkelstein, David W. Ellison, Ching C. Lau, Eric Bouffet, Tim Hassall, Sridharan Gururangan, Richard J. Cohn, Robert S. Fulton, Lucinda L. Fulton, David J. Dooling, Kerri Ochoa, Amar Gajjar, Elaine R. Mardis, Richard K. Wilson, James R. Downing, Jinghui Zhang, Richard J. Gilbertson
TL;DR: Modelling of mutations in mouse lower rhombic lip progenitors that generate WNT-subgroup tumours identified genes that maintain this cell lineage (DDX3X), as well as mutated genes that initiate (CDH1) or cooperate (PIK3CA) in tumorigenesis.
Journal Article
A prediction-based resampling method for estimating the number of clusters in a dataset
Sandrine Dudoit, Jane Fridlyand
TL;DR: A new prediction-based resampling method, Clest, is developed to estimate the number of clusters in a dataset; it was generally found to be more accurate and robust than the six existing methods considered in the study.
Journal Article
Clustering ensembles: models of consensus and weak partitions
TL;DR: A unified representation for multiple clusterings is introduced and a probabilistic model of consensus is proposed using a finite mixture of multinomial distributions in a space of clusterings in order to define a new consensus function related to the classical intraclass variance criterion.
Journal Article
Gene-Expression Patterns in Drug-Resistant Acute Lymphoblastic Leukemia Cells and Response to Treatment
Amy Holleman, Meyling Cheok, Monique L. den Boer, Wenjian Yang, Anjo J.P. Veerman, Karin M. Kazemier, Deqing Pei, Cheng Cheng, Ching-Hon Pui, Mary V. Relling, Gritta Janka-Schaub, Rob Pieters, William E. Evans
TL;DR: Differential expression of a relatively small number of genes is associated with drug resistance and treatment outcome in childhood ALL.
Journal Article
A survey of clustering ensemble algorithms
TL;DR: An overview of clustering ensemble methods is presented that can be useful to clustering practitioners, together with a taxonomy of these techniques and illustrations of some important applications.
References
Journal Article
Bagging predictors
TL;DR: Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy.
Journal Article
A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting
Yoav Freund, Robert E. Schapire
TL;DR: The model studied can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting, and it is shown that the multiplicative weight-update Littlestone-Warmuth rule can be adapted to this model, yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems.
Journal Article
Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.
Todd R. Golub, Donna K. Slonim, Pablo Tamayo, Christine Huard, Michelle Gaasenbeek, Jill P. Mesirov, Hilary A. Coller, Mignon L. Loh, James R. Downing, Michael A. Caligiuri, Clara D. Bloomfield, Eric S. Lander
TL;DR: A generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case and suggests a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.
Book
Finding Groups in Data: An Introduction to Cluster Analysis
Journal Article
Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling
Ash A. Alizadeh, Michael B. Eisen, R. Eric Davis, Izidore S. Lossos, Andreas Rosenwald, Jennifer C. Boldrick, Hajeer Sabet, Truc Tran, Xin Yu, John Powell, Liming Yang, Gerald E. Marti, Troy Moore, James I. Hudson, Li-Sheng Lu, David B. Lewis, Robert Tibshirani, Gavin Sherlock, Wing C. Chan, Timothy C. Greiner, Dennis D. Weisenburger, James O. Armitage, Roger A. Warnke, Ronald Levy, Wyndham H. Wilson, M. R. Grever, John C. Byrd, David Botstein, Patrick O. Brown, Louis M. Staudt
TL;DR: It is shown that there is diversity in gene expression among the tumours of DLBCL patients, apparently reflecting the variation in tumour proliferation rate, host response and differentiation state of the tumour.
Related Papers
Cluster ensembles --- a knowledge reuse framework for combining multiple partitions
Alexander Strehl, Joydeep Ghosh