Open Access | Journal Article

Waste not, want not: why rarefying microbiome data is inadmissible.

TLDR
The authors advocate that investigators avoid rarefying altogether, and they summarize supporting statistical theory that simultaneously accounts for library size differences and biological variability using an appropriate mixture model.
Abstract
Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
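To make the two criticized normalizations concrete, the following is a minimal sketch in Python (standard library only), not the authors' implementation. It assumes rarefying means random subsampling without replacement down to the smallest library size in the dataset. The toy counts and the helper names `rarefy` and `proportions` are hypothetical and chosen only for illustration.

```python
import random


def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from one sample's counts."""
    total = sum(counts)
    if depth > total:
        raise ValueError("depth exceeds the sample's library size")
    # Expand the count vector into one label per read, then draw reads.
    pool = [taxon for taxon, n in enumerate(counts) for _ in range(n)]
    rarefied = [0] * len(counts)
    for taxon in rng.sample(pool, depth):
        rarefied[taxon] += 1
    return rarefied


def proportions(counts):
    """Convert counts to simple proportions of the library size."""
    total = sum(counts)
    return [n / total for n in counts]


# Hypothetical toy data: two samples, four taxa, unequal library sizes.
samples = {
    "A": [500, 300, 150, 50],       # library size 1,000
    "B": [5000, 3000, 1500, 500],   # library size 10,000
}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
depth = min(sum(c) for c in samples.values())
rarefied = {name: rarefy(c, depth, rng) for name, c in samples.items()}

# Reads thrown away by rarefying to the minimum depth.
total_reads = sum(sum(c) for c in samples.values())
discarded = total_reads - depth * len(samples)

# Proportions erase the library size: A and B look identical even though
# B's estimates rest on ten times as many reads.
prop_a = proportions(samples["A"])
prop_b = proportions(samples["B"])

print(f"rarefying depth: {depth}")
print(f"reads discarded: {discarded} of {total_reads}")
print(f"proportions A: {prop_a}")
print(f"proportions B: {prop_b}")
```

In this toy case rarefying keeps only 2,000 of the 11,000 reads, while proportions report the same values for both samples and so carry no information about how much less certain the smaller library is. These are the two losses that the abstract attributes to rarefying and to simple proportions, respectively.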


Citations
Journal Article

Microbial Community, Newly Sequestered Soil Organic Carbon, and δ15N Variations Driven by Tree Roots.

TL;DR: Results showed that microbial communities and newly sequestered soil organic carbon (SOC) contents changed with different tree species, environments, and successive stages, and the δ15N of soil organic matter could be an important indicator to estimate root-driven carbon sequestration.
Journal Article

Can Gut Microbiota Be a Good Predictor for Parkinson’s Disease? A Machine Learning Approach

TL;DR: Several studies investigating the involvement of the gut microbiota in Parkinson's disease (PD) have identified some common alterations of the microbial community, such as a decrease in Lachnospiraceae and an increase in Verrucomicrobiaceae families in PD patients.
Journal Article

Rarity of microbial species: In search of reliable associations.

TL;DR: It is found that a large proportion of pairwise associations, especially negative associations, cannot be reliably tested, which could hamper the identification of candidate biological agents that could be used to control rare pathogens.
Journal Article

Isotope Fractionation in Biogas Allows Direct Microbial Community Stability Monitoring in Anaerobic Digestion

TL;DR: 13C isotope fractionation in the biogas can predict process stability in anaerobic digestion, as it directly reflects shifts in the total and active microbial community, yet, due to its temporal character, further validation is needed.
Journal Article

Community structure of soil fungi in a novel perennial crop monoculture, annual agriculture, and native prairie reconstruction.

TL;DR: Similarities in the overall and functional fungal communities between the perennial monoculture and native vegetation suggest Kernza® cropping systems have the potential to mimic reconstructed natural systems.
References
Journal Article

R: A language and environment for statistical computing.

R Core Team
- 01 Jan 2014 - 
TL;DR: Copyright (©) 1999–2012 R Foundation for Statistical Computing; permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and permission notice are preserved on all copies.
Journal Article

Controlling the false discovery rate: a practical and powerful approach to multiple testing

TL;DR: This paper presents a different approach to problems of multiple significance testing: controlling the expected proportion of falsely rejected hypotheses, the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise.
Book

ggplot2: Elegant Graphics for Data Analysis

TL;DR: This book describes ggplot2, a data visualization package for R that uses the insights from Leland Wilkinson's Grammar of Graphics to create a powerful and flexible system for creating data graphics.
Journal Article

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

TL;DR: edgeR is a Bioconductor software package for examining differential expression of replicated count data. It uses an overdispersed Poisson model to account for both biological and technical variability, and empirical Bayes methods to moderate the degree of overdispersion across transcripts, improving the reliability of inference.