Waste not, want not: why rarefying microbiome data is inadmissible.

doi:10.1371/JOURNAL.PCBI.1003531

Open Access Journal Article

Paul J. McMurdie, +1 more

- 03 Apr 2014 -

PLOS Computational Biology

- Vol. 10, Iss: 4

TLDR

The authors advocate that investigators avoid rarefying altogether, and they summarize well-established statistical theory that simultaneously accounts for library size differences and biological variability using an appropriate mixture model.

Abstract:

Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
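The two criticisms in the abstract can be made concrete with a small sketch. It is not the authors' simulation and does not use phyloseq, edgeR, or DESeq; the toy libraries, the rarefying depth of 1000, and the helper names `rarefy` and `proportion_variance` are all hypothetical, and the variance formula assumes plain binomial sampling of reads. The first part shows how rarefying to a common depth throws away reads and whole samples; the second shows why simple proportions are heteroscedastic when library sizes differ.

```python
import random
from collections import Counter


def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement; None if the library is too small."""
    total = sum(counts.values())
    if total < depth:
        return None  # the whole sample is discarded
    # Expand counts into one entry per read, then draw without replacement.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    return dict(Counter(rng.sample(pool, depth)))


def proportion_variance(p, library_size):
    """Variance of an observed proportion, assuming simple binomial sampling."""
    return p * (1 - p) / library_size


# Hypothetical toy libraries with very different sequencing depths.
samples = {
    "A": {"otu1": 900, "otu2": 90, "otu3": 10},
    "B": {"otu1": 9000, "otu2": 900, "otu3": 100},
    "C": {"otu1": 40, "otu2": 5, "otu3": 5},
}

rng = random.Random(0)
depth = 1000  # hypothetical rarefying depth
rarefied = {name: rarefy(c, depth, rng) for name, c in samples.items()}

total_reads = sum(sum(c.values()) for c in samples.values())
kept_reads = sum(sum(c.values()) for c in rarefied.values() if c is not None)
discarded_fraction = 1 - kept_reads / total_reads

# The same true proportion is estimated far less precisely in the small library.
var_small = proportion_variance(0.01, 1000)
var_large = proportion_variance(0.01, 10000)

print(f"samples dropped: {[n for n, c in rarefied.items() if c is None]}")
print(f"fraction of reads discarded: {discarded_fraction:.3f}")
print(f"variance ratio (small/large library): {var_small / var_large:.1f}")
```

In this toy setting sample C is dropped outright and roughly 82% of all reads are discarded, while a taxon at 1% abundance has about ten times the sampling variance in the 1000-read library as in the 10000-read one. The Negative Binomial approaches named in the abstract are designed to model this dependence on library size rather than remove it by subsampling.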

Citations

Posted Content

Effects of library size variance, sparsity, and compositionality on the analysis of microbiome data

Sophie Weiss, +10 more

TL;DR: An evaluation of methods developed in the literature to address challenges in the ecological and statistical interpretation of 16S amplicon sequencing finds that rarefying paired with a non-parametric test, such as the Mann-Whitney test, can also yield equally high sensitivity.

Book Chapter

Correlation and association analyses in microbiome study integrating multiomics in health and disease

Yinglin Xia

- 01 Jan 2020 -

Progress in Molecular Biology and Transl...

TL;DR: An overall view is provided of longitudinal methods for the analysis of microbiome and omics data, covering standard, static, regression-based time series methods, principal trend analysis, and newly developed univariate overdispersed and zero-inflated as well as multivariate distance/kernel-based longitudinal models.

Journal Article

Effects of host species and environment on the skin microbiome of Plethodontid salamanders

Carly R. Muletz Wolz, +5 more

- 01 Mar 2018 -

Journal of Animal Ecology

TL;DR: It is concluded that environment is more influential in shaping skin microbiome structure than host differences in these congeneric species, and the results suggest that environmental characteristics that covary with elevation influence microbiome structure.

Journal Article

Asymptomatic Intestinal Colonization with Protist Blastocystis Is Strongly Associated with Distinct Microbiome Ecological Patterns.

M. E. Nieves-Ramírez, +20 more

TL;DR: This work is the first to show a direct association between the presence of Blastocystis and shifts in the gut bacterial and eukaryotic microbiome in the absence of gastrointestinal disease or inflammation.

Journal Article

Impact of soil salinity on the structure of the bacterial endophytic community identified from the roots of caliph medic (Medicago truncatula)

Mahmoud W. Yaish, +4 more

- 08 Jul 2016 -

PLOS ONE

TL;DR: Determining the changes in the bacterial community caused by salinity stress in caliph medic provides a crucial step toward understanding the association of these endophytes, under salt stress conditions, with this model plant.

References

Journal Article

R: A language and environment for statistical computing.

R Core Team

- 01 Jan 2014 -

MSOR connections

TL;DR: Copyright (©) 1999–2012 R Foundation for Statistical Computing; permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and permission notice are preserved on all copies.

Journal Article

Controlling the false discovery rate: a practical and powerful approach to multiple testing

Yoav Benjamini, +1 more

- 01 Jan 1995 -

Journal of the Royal Statistical Society...

TL;DR: In this paper, a different approach to problems of multiple significance testing is presented, which calls for controlling the expected proportion of falsely rejected hypotheses (the false discovery rate), which is equivalent to the FWER when all hypotheses are true but is smaller otherwise.

Book

ggplot2: Elegant Graphics for Data Analysis

Hadley Wickham

TL;DR: This book describes ggplot2, a new data visualization package for R that uses the insights from Leland Wilkinson's Grammar of Graphics to create a powerful and flexible system for creating data graphics.

Journal Article

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Mark D. Robinson, +2 more

- 01 Jan 2010 -

Bioinformatics

TL;DR: edgeR is a Bioconductor software package for examining differential expression of replicated count data. It uses an overdispersed Poisson model to account for both biological and technical variability, and empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference.

Journal Article

QIIME allows analysis of high-throughput community sequencing data.

J. Gregory Caporaso, +27 more

- 11 Apr 2010 -

Nature Methods

TL;DR: An overview of the analysis pipeline and links to raw data and processed output from the runs with and without denoising are provided.

Related Papers (5)

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

Michael I. Love, +3 more

- 05 Dec 2014 -

Genome Biology

DADA2: High-resolution sample inference from Illumina amplicon data

Benjamin J. Callahan, +5 more

- 01 Jul 2016 -

Nature Methods

The SILVA ribosomal RNA gene database project: improved data processing and web-based tools

Christian Quast, +7 more

- 28 Nov 2012 -

Nucleic Acids Research

QIIME allows analysis of high-throughput community sequencing data.

Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities