Waste not, want not: why rarefying microbiome data is inadmissible.
Paul J. McMurdie, Susan Holmes
TLDR
The authors advocate that investigators avoid rarefying altogether, and they summarize supporting statistical theory that simultaneously accounts for library size differences and biological variability using an appropriate mixture model.
Abstract:
Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
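The two ideas contrasted in the abstract can be illustrated with a minimal sketch. It is not the authors' code and does not use the phyloseq, edgeR, or DESeq APIs; the function names `rarefy` and `gamma_poisson_counts`, the example counts, and the parameter values are all assumptions chosen for illustration. The first function subsamples a library to a fixed depth without replacement and drops any library smaller than that depth, which is where rarefying throws data away. The second draws counts from a Gamma-Poisson (Negative Binomial) mixture, in which the Gamma term stands for biological variability between replicates and the Poisson term for sequencing noise at a given library size.

```python
import numpy as np

def rarefy(counts, depth, rng):
    """Subsample one sample's taxon counts to `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads, mirroring how
    rarefying discards small libraries entirely.
    """
    counts = np.asarray(counts, dtype=np.int64)
    if counts.sum() < depth:
        return None
    # Drawing reads without replacement is a multivariate hypergeometric draw.
    return rng.multivariate_hypergeometric(counts, depth)

def gamma_poisson_counts(mean_proportion, dispersion, library_size, n, rng):
    """Draw n counts for one taxon from a Gamma-Poisson (Negative Binomial) mixture.

    With s = library_size and mu = mean_proportion, the counts have
    mean s*mu and variance s*mu + dispersion * (s*mu)**2.
    """
    # Gamma with shape 1/dispersion and scale mu*dispersion has mean mu.
    proportions = rng.gamma(shape=1.0 / dispersion,
                            scale=mean_proportion * dispersion, size=n)
    return rng.poisson(library_size * proportions)

if __name__ == "__main__":
    rng = np.random.default_rng(0)            # seed is an arbitrary choice
    small = np.array([30, 15, 5])             # hypothetical 50-read library
    large = np.array([6000, 3000, 1000])      # hypothetical 10,000-read library

    # Rarefying the large library to 50 reads keeps 0.5% of its data.
    print(rarefy(large, 50, rng))
    # A library below the chosen depth is dropped altogether.
    print(rarefy(small, 100, rng))

    # Overdispersion: the variance is far larger than the mean.
    counts = gamma_poisson_counts(0.01, 0.5, 10_000, 5000, rng)
    print(counts.mean(), counts.var())
```

In this sketch the library size enters the mixture model as a scaling factor on the mean, so samples of different depth can be analyzed together without discarding any reads.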
Citations
Journal Article
Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2
TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.
Posted Content
Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2
TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.
Journal Article
Orchestrating high-throughput genomic analysis with Bioconductor
Wolfgang Huber, Vincent J. Carey, Robert Gentleman, Simon Anders, Marc R. J. Carlson, Benilton S. Carvalho, Héctor Corrada Bravo, Sean Davis, Laurent Gatto, Thomas Girke, Raphael Gottardo, Florian Hahne, Kasper D. Hansen, Rafael A. Irizarry, Michael S. Lawrence, Michael I. Love, James W. MacDonald, Valerie Obenchain, Andrzej K. Oleś, Hervé Pagès, Alejandro Reyes, Paul Shannon, Gordon K. Smyth, Dan Tenenbaum, Levi Waldron, Martin Morgan
TL;DR: An overview of Bioconductor, an open-source, open-development software project for the analysis and comprehension of high-throughput data in genomics and molecular biology, which comprises 934 interoperable packages contributed by a large, diverse community of scientists.
Journal Article
Structure, variation, and assembly of the root-associated microbiomes of rice.
Joseph Edwards, Cameron Johnson, Christian Santos-Medellín, Eugene Lurie, Natraj Kumar Podishetty, Srijak Bhatnagar, Jonathan A. Eisen, Venkatesan Sundaresan
TL;DR: Dynamic changes observed during microbiome acquisition, as well as steady-state compositions of spatial compartments, support a multistep model for root microbiome assembly from soil wherein the rhizoplane plays a selective gating role.
Journal Article
Microbiome Datasets Are Compositional: And This Is Not Optional.
TL;DR: The purpose of this review is to alert investigators to the dangers inherent in ignoring the compositional nature of the data, and point out that HTS datasets derived from microbiome studies can and should be treated as compositions at all stages of analysis.
References
Journal Article
Linking Long-Term Dietary Patterns with Gut Microbial Enterotypes
Gary D. Wu, Jun Chen, Christian Hoffmann, Kyle Bittinger, Ying-Yu Chen, Sue A. Keilbaugh, Meenakshi Bewtra, Dan Knights, William A. Walters, Rob Knight, Rohini Sinha, Erin Gilroy, Kernika Gupta, Robert N. Baldassano, Lisa Nessel, Hongzhe Li, Frederic D. Bushman, James D. Lewis
TL;DR: Enterotypes, distinguished primarily by levels of Bacteroides and Prevotella, are associated with long-term diet, particularly protein and animal fat (Bacteroides) versus carbohydrates (Prevotella).
Book
Regression Analysis of Count Data
TL;DR: The authors combine theory and practice to make sophisticated methods of analysis accessible to researchers and practitioners working with widely different types of data and software in areas such as applied statistics, econometrics, marketing, operations research, actuarial studies, demography, biostatistics and quantitative social sciences.
Journal Article
Next-generation DNA sequencing.
Jay Shendure, Hanlee P. Ji
TL;DR: Next-generation DNA sequencing has the potential to dramatically accelerate biological and biomedical research, by enabling the comprehensive analysis of genomes, transcriptomes and interactomes to become inexpensive, routine and widespread, rather than requiring significant production-scale efforts.
Journal Article
Some distance properties of latent root and vector methods used in multivariate analysis
TL;DR: This paper considers the representation of a multivariate sample of size n as points P1, P2, ..., Pn in a Euclidean space, discusses the interpretation of the distance Δ(Pi, Pj) between the ith and jth members of the sample, and derives necessary and sufficient conditions for a solution to exist in real Euclidean space.
Book
Mathematical Statistics and Data Analysis
TL;DR: In this article, the authors present a model for estimating parameters and fitting of probability distributions from the normal distribution. But the model is not suitable for the analysis of categorical data.