Journal ArticleDOI

Waste not, want not: why rarefying microbiome data is inadmissible.

03 Apr 2014-PLOS Computational Biology (Public Library of Science)-Vol. 10, Iss: 4
TL;DR: The authors advocate that investigators avoid rarefying altogether, and they summarize well-established statistical theory that simultaneously accounts for library size differences and biological variability using an appropriate mixture model.
Abstract: Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
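
To make concrete what the abstract means by rarefying, the following is a minimal sketch, assuming the common definition of rarefying as random subsampling without replacement down to the smallest library size. The sample names, counts, and the `rarefy` helper are hypothetical and are not taken from the paper or from phyloseq.

```python
import random

def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from one sample's counts."""
    if depth > sum(counts):
        raise ValueError("depth exceeds the sample's library size")
    rng = random.Random(seed)
    # Expand the counts into one taxon label per read, then draw without replacement.
    reads = [taxon for taxon, n in enumerate(counts) for _ in range(n)]
    kept = rng.sample(reads, depth)
    out = [0] * len(counts)
    for taxon in kept:
        out[taxon] += 1
    return out

# Two hypothetical samples with very different library sizes.
samples = {"A": [50, 30, 20], "B": [5000, 3000, 2000]}

# Rarefy every sample to the smallest library size.
depth = min(sum(c) for c in samples.values())
rarefied = {name: rarefy(c, depth) for name, c in samples.items()}

# Reads thrown away by the procedure.
discarded = sum(sum(c) for c in samples.values()) - depth * len(samples)
print(depth, discarded, rarefied)
```

In this toy case sample B keeps 100 of its 10,000 reads, which illustrates the loss of information the abstract objects to.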


Citations
Journal ArticleDOI
TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html .

47,038 citations


Cites background from "Waste not, want not: why rarefying ..."

  • ..., [43]), ribosome profiling [44] and CRISPR/Cas-library assays [45]....

Posted ContentDOI
17 Nov 2014-bioRxiv
TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-Seq data, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data. DESeq2 uses shrinkage estimation for dispersions and fold changes to improve stability and interpretability of the estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression and facilitates downstream tasks such as gene ranking and visualization. DESeq2 is available as an R/Bioconductor package.

17,014 citations


Cites methods from "Waste not, want not: why rarefying ..."

  • ..., [44]), ribosome profiling [45] and CRISPR/Cas-library assays [46]....

Journal ArticleDOI
TL;DR: An overview of Bioconductor, an open-source, open-development software project for the analysis and comprehension of high-throughput data in genomics and molecular biology, which comprises 934 interoperable packages contributed by a large, diverse community of scientists.
Abstract: Bioconductor is an open-source, open-development software project for the analysis and comprehension of high-throughput data in genomics and molecular biology. The project aims to enable interdisciplinary research, collaboration and rapid development of scientific software. Based on the statistical programming language R, Bioconductor comprises 934 interoperable packages contributed by a large, diverse community of scientists. Packages cover a range of bioinformatic and statistical applications. They undergo formal initial review and continuous automated testing. We present an overview for prospective users and contributors.

2,818 citations

Journal ArticleDOI
TL;DR: Dynamic changes observed during microbiome acquisition, as well as steady-state compositions of spatial compartments, support a multistep model for root microbiome assembly from soil wherein the rhizoplane plays a selective gating role.
Abstract: Plants depend upon beneficial interactions between roots and microbes for nutrient availability, growth promotion, and disease suppression. High-throughput sequencing approaches have provided recent insights into root microbiomes, but our current understanding is still limited relative to animal microbiomes. Here we present a detailed characterization of the root-associated microbiomes of the crop plant rice by deep sequencing, using plants grown under controlled conditions as well as field cultivation at multiple sites. The spatial resolution of the study distinguished three root-associated compartments, the endosphere (root interior), rhizoplane (root surface), and rhizosphere (soil close to the root surface), each of which was found to harbor a distinct microbiome. Under controlled greenhouse conditions, microbiome composition varied with soil source and genotype. In field conditions, geographical location and cultivation practice, namely organic vs. conventional, were factors contributing to microbiome variation. Rice cultivation is a major source of global methane emissions, and methanogenic archaea could be detected in all spatial compartments of field-grown rice. The depth and scale of this study were used to build coabundance networks that revealed potential microbial consortia, some of which were involved in methane cycling. Dynamic changes observed during microbiome acquisition, as well as steady-state compositions of spatial compartments, support a multistep model for root microbiome assembly from soil wherein the rhizoplane plays a selective gating role. Similarities in the distribution of phyla in the root microbiomes of rice and other plants suggest that conclusions derived from this study might be generally applicable to land plants.

1,673 citations


Cites methods from "Waste not, want not: why rarefying ..."

  • ...This method was chosen due to its sensitivity for detecting differentially abundant taxa compared with traditional microbiome normalization techniques such as rarefaction and relative abundance (20)....

Journal ArticleDOI
TL;DR: The purpose of this review is to alert investigators to the dangers inherent in ignoring the compositional nature of the data, and point out that HTS datasets derived from microbiome studies can and should be treated as compositions at all stages of analysis.
Abstract: Datasets collected by high-throughput sequencing (HTS) of 16S rRNA gene amplimers, metagenomes or metatranscriptomes are commonplace and being used to study human disease states, ecological differences between sites, and the built environment. There is increasing awareness that microbiome datasets generated by HTS are compositional because they have an arbitrary total imposed by the instrument. However, many investigators are either unaware of this or assume specific properties of the compositional data. The purpose of this review is to alert investigators to the dangers inherent in ignoring the compositional nature of the data, and point out that HTS datasets derived from microbiome studies can and should be treated as compositions at all stages of analysis. We briefly introduce compositional data, illustrate the pathologies that occur when compositional data are analyzed inappropriately, and finally give guidance and point to resources and examples for the analysis of microbiome datasets using compositional data analysis.
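
One pathology of the kind the abstract alludes to can be sketched in a few lines, assuming hypothetical absolute abundances that are not taken from the review. When only one taxon changes in absolute terms, converting to proportions makes the unchanged taxa appear to change as well.

```python
def proportions(counts):
    """Convert absolute counts to relative abundances that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical absolute abundances for three taxa in two conditions.
before = [100, 100, 100]
after = [400, 100, 100]   # only the first taxon actually changed

p_before = proportions(before)
p_after = proportions(after)

for taxon, (b, a) in enumerate(zip(p_before, p_after)):
    print(f"taxon {taxon}: {b:.3f} -> {a:.3f}")
```

Taxa 1 and 2 have identical absolute abundances in both conditions, yet their proportions fall because the total is constrained.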

1,511 citations


Cites background or methods from "Waste not, want not: why rarefying ..."

  • ...This is implicitly acknowledged when microbiome datasets are converted to relative abundance values, or normalized counts, or are rarefied (McMurdie and Holmes, 2014; Weiss et al., 2017) prior to analysis....

  • ...calculations for multivariate ordinations derived from these distances (McMurdie and Holmes, 2014)....

  • ...The use of subsampling has been questioned since it results in a loss of information and precision (McMurdie and Holmes, 2014), and the practice of count normalization from the RNAseq field has been advocated instead....

  • ...Methods applied include count-based strategies such as Bray-Curtis dissimilarity, zero-inflated Gaussian models and negative binomial models (McMurdie and Holmes, 2014; Weiss et al., 2017)....

References
Journal Article
TL;DR: Copyright (©) 1999–2012 R Foundation for Statistical Computing; permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and permission notice are preserved on all copies.
Abstract: Copyright (©) 1999–2012 R Foundation for Statistical Computing. Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies. Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one. Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the R Core Team.

272,030 citations

Journal ArticleDOI
TL;DR: This paper presents a different approach to problems of multiple significance testing: controlling the expected proportion of falsely rejected hypotheses (the false discovery rate), which is equivalent to the FWER when all hypotheses are true but is smaller otherwise.
Abstract: Summary: The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented. It calls for controlling the expected proportion of falsely rejected hypotheses: the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise. Therefore, in problems where the control of the false discovery rate rather than that of the FWER is desired, there is potential for a gain in power. A simple sequential Bonferroni-type procedure is proved to control the false discovery rate for independent test statistics, and a simulation study shows that the gain in power is substantial. The use of the new procedure and the appropriateness of the criterion are illustrated with examples.
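
The abstract does not spell out the sequential procedure, so the following is a minimal sketch of the step-up rule as it is commonly described: sort the m p-values, find the largest rank k with p_(k) <= (k / m) * alpha, and reject the k hypotheses with the smallest p-values. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(pvalues)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Step-up rule: keep the largest rank whose p-value is under its threshold.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values from ten tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(pvals, alpha=0.05)
print(decisions)
```

Because the rule is step-up, a p-value that misses its own threshold is still rejected when a larger p-value meets a later threshold; for example, `benjamini_hochberg([0.03, 0.04])` rejects both hypotheses at alpha = 0.05.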

83,420 citations


"Waste not, want not: why rarefying ..." refers methods in this paper

  • ...All tests were corrected for multiple inferences using the Benjamini-Hochberg method to control the False Discovery Rate [53]....

  • ...All tests were corrected for multiple inferences using the Benjamini-Hochberg method to control the False Discovery Rate [52]....

  • ...For all methods, detection among multiple tests was defined using a False Discovery Rate (Benjamini-Hochberg [52]) significance threshold of 0....

Book
13 Aug 2009
TL;DR: This book describes ggplot2, a new data visualization package for R that uses the insights from Leland Wilkinson's Grammar of Graphics to create a powerful and flexible system for creating data graphics.
Abstract: This book describes ggplot2, a new data visualization package for R that uses the insights from Leland Wilkinson's Grammar of Graphics to create a powerful and flexible system for creating data graphics. With ggplot2, it's easy to: produce handsome, publication-quality plots, with automatic legends created from the plot specification; superpose multiple layers (points, lines, maps, tiles, box plots to name a few) from different data sources, with automatically adjusted common scales; add customisable smoothers that use the powerful modelling capabilities of R, such as loess, linear models, generalised additive models and robust regression; save any ggplot2 plot (or part thereof) for later modification or reuse; create custom themes that capture in-house or journal style requirements, and that can easily be applied to multiple plots; and approach your graph from a visual perspective, thinking about how each component of the data is represented on the final plot. This book will be useful to everyone who has struggled with displaying their data in an informative and attractive way. You will need some basic knowledge of R (i.e. you should be able to get your data into R), but ggplot2 is a mini-language specifically tailored for producing graphics, and you'll learn everything you need in the book. After reading this book you'll be able to produce graphics customized precisely for your problems, and you'll find it easy to get graphics out of your head and on to the screen or page.

29,504 citations


"Waste not, want not: why rarefying ..." refers background or methods in this paper

  • ...These simulations, analyses, and graphics rely upon the cluster [58], foreach [59], ggplot2 [60], phyloseq [53], plyr [61], reshape2 [62], and ROCR [39] R packages; in addition to the DESeq [3], edgeR [2], and PoiClaClu [63] R packages for RNA-Seq data, and tools available in the standard R distribution [64]....

  • ...Hadley Wickham created and continues to support the ggplot2 [60] and reshape [62]/plyr [61] packages that have proven useful for graphical representation and manipulation of data, respectively....

Journal ArticleDOI
TL;DR: edgeR is a Bioconductor software package for examining differential expression of replicated count data. It uses an overdispersed Poisson model to account for both biological and technical variability, and empirical Bayes methods to moderate the degree of overdispersion across transcripts, improving the reliability of inference.
Abstract: Summary: It is expected that emerging digital gene expression (DGE) technologies will overtake microarray technologies in the near future for many functional genomics applications. One of the fundamental data analysis tasks, especially for gene expression studies, involves determining whether there is evidence that counts for a transcript or exon are significantly different across experimental conditions. edgeR is a Bioconductor software package for examining differential expression of replicated count data. An overdispersed Poisson model is used to account for both biological and technical variability. Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference. The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated. The software may have other applications beyond sequencing data, such as proteome peptide count data. Availability: The package is freely available under the LGPL licence from the Bioconductor web site (http://bioconductor.org).
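
As a rough illustration of what "overdispersed Poisson" means here, the sketch below compares the Poisson variance with the variance under the usual Negative Binomial parameterization, variance = mu + dispersion * mu^2. The function names and the numeric values are hypothetical, and this is not edgeR's estimation code.

```python
def poisson_variance(mu):
    """Under a Poisson model the variance equals the mean."""
    return mu

def nb_variance(mu, dispersion):
    """Negative Binomial variance: Poisson noise plus a term quadratic in the mean."""
    return mu + dispersion * mu ** 2

mu = 100.0          # hypothetical mean count
dispersion = 0.25   # hypothetical dispersion (squared biological coefficient of variation)

print(poisson_variance(mu))          # technical (sampling) variability only
print(nb_variance(mu, dispersion))   # technical plus biological variability
```

With a dispersion of zero the two models coincide, and the gap between them grows with the mean count.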

29,413 citations


"Waste not, want not: why rarefying ..." refers methods in this paper

  • ...This approach is already well-characterized and implemented for RNA-Seq data in R packages such as edgeR [2] and DESeq [3]....

  • ...We would like to thank the developers of the open source packages leveraged here for improved insights into microbiome data, in particular Gordon Smyth and his group for edgeR [2], and Wolfgang Huber and his team for DESeq [3]; whose useful documentation and continued support have been invaluable....

  • ...We utilize the most popular implementations of this approach currently used in RNA-Seq analysis, namely edgeR [2] and DESeq [3], adapted here for microbiome data....

  • ...These simulations, analyses, and graphics rely upon the cluster [58], foreach [59], ggplot2 [60], phyloseq [53], plyr [61], reshape2 [62], and ROCR [39] R packages; in addition to the DESeq [3], edgeR [2], and PoiClaClu [63] R packages for RNA-Seq data, and tools available in the standard R distribution [64]....

Journal ArticleDOI
TL;DR: An overview of the analysis pipeline and links to raw data and processed output from the runs with and without denoising are provided.
Abstract: Supplementary Figure 1 Overview of the analysis pipeline. Supplementary Table 1 Details of conventionally raised and conventionalized mouse samples. Supplementary Discussion Expanded discussion of QIIME analyses presented in the main text; Sequencing of 16S rRNA gene amplicons; QIIME analysis notes; Expanded Figure 1 legend; Links to raw data and processed output from the runs with and without denoising.

28,911 citations


"Waste not, want not: why rarefying ..." refers background in this paper

  • ...We would also like to thank Rob Knight and his lab for QIIME [26], which has drastically decreased the time required to get from raw phylogenetic sequence data to OTU counts....

  • ...Rarefying is now an exceedingly common precursor to microbiome multivariate workflows that seek to relate sample covariates to sample-wise distance matrices [19, 27, 28]; for example, integrated as a recommended option in QIIME’s [29] beta_diversity_through_plots.py workflow, in Sub.sample in the mothur software library [30], in daisychopper.pl [31], and is even supported in phyloseq’s rarefy_even_depth function [32] (though not recommended in its documentation)....

  • ...We would also like to thank Rob Knight and his lab for QIIME [29], which has drastically decreased the time required to get from raw phylogenetic sequence data to OTU counts....

  • ...Rarefying is now an exceedingly common precursor to microbiome multivariate workflows that seek to relate sample covariates to sample-wise distance matrices [6, 24, 25]; for example, integrated as a recommended option in QIIME’s [26] beta-diversity-through-plots....

  • ...Some heuristics for filtering low-abundance OTUs are already described in the documentation of various microbiome analysis workflows [26, 27], and in many cases these are a form of Independent Filtering....
