Journal ArticleDOI

Differential abundance analysis for microbial marker-gene surveys

TL;DR: metagenomeSeq, which relies on a novel normalization technique and a statistical model that accounts for undersampling in large-scale marker-gene studies, is shown to outperform the tools currently used in this field.
Abstract: We introduce a methodology to assess differential abundance in sparse high-throughput microbial marker-gene survey data. Our approach, implemented in the metagenomeSeq Bioconductor package, relies on a novel normalization technique and a statistical model that accounts for undersampling, a common feature of large-scale marker-gene studies. Using simulated data and several published microbiota data sets, we show that metagenomeSeq outperforms the tools currently used in this field.
Citations
Journal ArticleDOI
TL;DR: This protocol describes all the steps necessary to process a large set of raw sequencing reads and create lists of gene transcripts, expression levels, and differentially expressed genes and transcripts.
Abstract: High-throughput sequencing of mRNA (RNA-seq) has become the standard method for measuring and comparing the levels of gene expression in a wide variety of species and conditions. RNA-seq experiments generate very large, complex data sets that demand fast, accurate and flexible software to reduce the raw read data to comprehensible results. HISAT (hierarchical indexing for spliced alignment of transcripts), StringTie and Ballgown are free, open-source software tools for comprehensive analysis of RNA-seq experiments. Together, they allow scientists to align reads to a genome, assemble transcripts including novel splice variants, compute the abundance of these transcripts in each sample and compare experiments to identify differentially expressed genes and transcripts. This protocol describes all the steps necessary to process a large set of raw sequencing reads and create lists of gene transcripts, expression levels, and differentially expressed genes and transcripts. The protocol's execution time depends on the computing resources, but it typically takes under 45 min of computer time. HISAT, StringTie and Ballgown are available from http://ccb.jhu.edu/software.shtml.

3,755 citations

Journal ArticleDOI
TL;DR: The authors advocate that investigators avoid rarefying altogether and summarize the supporting statistical theory, which simultaneously accounts for library size differences and biological variability using an appropriate mixture model.
Abstract: Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
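The two normalization practices the abstract argues against can be illustrated with a minimal sketch. This is not code from the paper or from phyloseq; the function names `proportions` and `rarefy`, the fixed seed, and the toy counts are assumptions made for illustration only.

```python
import random
from typing import List, Optional


def proportions(counts: List[int]) -> List[float]:
    """Convert one sample's counts to simple proportions."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def rarefy(counts: List[int], depth: int, seed: int = 0) -> Optional[List[int]]:
    """Subsample reads without replacement down to a common depth.

    Returns None when the library is smaller than `depth`, which is how
    rarefying ends up discarding whole samples.
    """
    if sum(counts) < depth:
        return None
    rng = random.Random(seed)  # fixed seed is an illustrative assumption
    # One entry per read, labelled by the index of the taxon it came from.
    reads = [taxon for taxon, c in enumerate(counts) for _ in range(c)]
    kept = rng.sample(reads, depth)
    rarefied = [0] * len(counts)
    for taxon in kept:
        rarefied[taxon] += 1
    return rarefied


deep_sample = [500, 300, 200]    # hypothetical library of 1,000 reads
shallow_sample = [5, 3, 1]       # hypothetical library of 9 reads
rarefied_deep = rarefy(deep_sample, depth=100)        # keeps 100 of 1,000 reads
rarefied_shallow = rarefy(shallow_sample, depth=100)  # None: sample is dropped
```

In this toy case rarefying throws away 900 reads from the deep library and the whole shallow library, while proportions keep every sample but ignore that the shallow library's estimates are far noisier.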

2,184 citations


Cites background or methods from "Differential abundance analysis for..."

  • ...An Expectation-Maximization estimate of the posterior probabilities of differential abundance based on a Zero Inflated Gaussian model, implemented in the fitZig method of the metagenomeSeq package [40]....

    [...]

  • ...These simulations, analyses, and graphics rely upon the cluster [54], foreach [55], ggplot2 [56], metagenomeSeq [40], phyloseq [32], plyr [57], reshape2 [58], and ROCR [59] R packages; in addition to the DESeq(2) [13], edgeR [41], and PoiClaClu [46] R packages for RNA-Seq data, and tools available in the standard R distribution [60]....

    [...]

  • ...We also compare the performance of the Gamma-Poisson mixture model against a method that models OTU proportions using a zero-inflated Gaussian distribution, implemented in a recently-released package called metagenomeSeq [40]....

    [...]

  • ...We would like to thank the developers of the open source packages leveraged here for improved insights into microbiome data, in particular Gordon Smyth and his group for edgeR [41], to Mihai Pop and his team for metagenomeSeq [40] and Wolfgang Huber and his team for DESeq and DESeq2 [13]; whose useful documentation and continued support have been invaluable....

    [...]

  • ...It should be noted that we have adopted the recently coined term differential abundance [39,40] as a direct analogy to differential expression from RNA-Seq....

    [...]

Journal ArticleDOI
TL;DR: This review addresses the concept of endophytism, considering the latest insights into evolution, plant ecosystem functioning, and multipartite interactions.
Abstract: All plants are inhabited internally by diverse microbial communities comprising bacterial, archaeal, fungal, and protistic taxa. These microorganisms showing endophytic lifestyles play crucial roles in plant development, growth, fitness, and diversification. The increasing awareness of and information on endophytes provide insight into the complexity of the plant microbiome. The nature of plant-endophyte interactions ranges from mutualism to pathogenicity. This depends on a set of abiotic and biotic factors, including the genotypes of plants and microbes, environmental conditions, and the dynamic network of interactions within the plant biome. In this review, we address the concept of endophytism, considering the latest insights into evolution, plant ecosystem functioning, and multipartite interactions.

1,677 citations


Cites methods from "Differential abundance analysis for..."

  • ...The assigned KO tags were normalized by cumulative sum scaling (CSS) normalization, and then a mixture model that implements a zero-inflated Gaussian distribution was computed to detect differentially abundant properties by using the metagenomeSeq package (249)....

    [...]
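The cumulative sum scaling (CSS) step mentioned in the excerpt above can be sketched roughly as follows. This is an illustrative pure-Python sketch, not the metagenomeSeq implementation: the package chooses the quantile with a data-driven procedure, whereas here the fixed `quantile=0.5`, the nearest-rank quantile rule, the `scale=1000.0` constant, and both function names are assumptions.

```python
import math
from typing import List


def css_scaling_factor(sample: List[float], quantile: float = 0.5) -> float:
    """Sum of the counts at or below a quantile of the sample's nonzero counts."""
    nonzero = sorted(c for c in sample if c > 0)
    if not nonzero:
        return 0.0
    # Nearest-rank quantile of the nonzero counts (an illustrative choice).
    rank = max(1, math.ceil(quantile * len(nonzero)))
    threshold = nonzero[rank - 1]
    return float(sum(c for c in nonzero if c <= threshold))


def css_normalize(sample: List[float], quantile: float = 0.5,
                  scale: float = 1000.0) -> List[float]:
    """Divide each count by the sample's CSS scaling factor, then rescale."""
    factor = css_scaling_factor(sample, quantile)
    if factor == 0:
        return [0.0] * len(sample)
    return [c / factor * scale for c in sample]


# Hypothetical sample with one dominant feature (count 10).
normalized = css_normalize([0, 1, 2, 3, 10])
```

Because the scaling factor sums only the counts up to the chosen quantile, a few very abundant features have less influence on the normalization than they would under total-sum scaling.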

Journal ArticleDOI
TL;DR: The performance of ANCOM is illustrated using two publicly available microbial datasets from the human gut, demonstrating its general applicability to testing hypotheses about compositional differences in microbial communities. Accounting for compositionality using log-ratio analysis results in significantly improved inference in microbiota survey data.
Abstract: Background: Understanding the factors regulating our microbiota is important but requires appropriate statistical methodology. When comparing two or more populations, most existing approaches either discount the underlying compositional structure in the microbiome data or use probability models such as the multinomial and Dirichlet-multinomial distributions, which may impose a correlation structure not suitable for microbiome data. Objective: To develop a methodology that accounts for compositional constraints to reduce false discoveries in detecting differentially abundant taxa at an ecosystem level, while maintaining high statistical power. Methods: We introduced a novel statistical framework called analysis of composition of microbiomes (ANCOM). ANCOM accounts for the underlying structure in the data and can be used for comparing the composition of microbiomes in two or more populations. ANCOM makes no distributional assumptions and can be implemented in a linear model framework to adjust for covariates as well as model longitudinal data. ANCOM also scales well to compare samples involving thousands of taxa. Results: We compared the performance of ANCOM to the standard t-test and a recently published methodology called Zero Inflated Gaussian (ZIG) methodology (1) for drawing inferences on the mean taxa abundance in two or more populations. ANCOM controlled the false discovery rate (FDR) at the desired nominal level while also improving power, whereas the t-test and ZIG had inflated FDRs, in some instances as high as 68% for the t-test and 60% for ZIG. We illustrate the performance of ANCOM using two publicly available microbial datasets in the human gut, demonstrating its general applicability to testing hypotheses about compositional differences in microbial communities. Conclusion: Accounting for compositionality using log-ratio analysis results in significantly improved inference in microbiota survey data.
Keywords: constrained; relative abundance; log-ratio (Published: 29 May 2015) Citation: Microbial Ecology in Health & Disease 2015, 26: 27663 - http://dx.doi.org/10.3402/mehd.v26.27663
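The property that motivates log-ratio analysis of compositional data can be shown with a short sketch. This is not the ANCOM procedure itself, only an illustration that the log-ratio of two taxa is unchanged when a whole sample is rescaled (for example by sequencing depth). The helper names `log_ratio` and `rescale` and the toy abundances are assumptions.

```python
import math
from typing import List


def log_ratio(sample: List[float], i: int, j: int) -> float:
    """Natural log of the ratio between taxon i and taxon j in one sample."""
    if sample[i] <= 0 or sample[j] <= 0:
        raise ValueError("log-ratios need strictly positive abundances")
    return math.log(sample[i] / sample[j])


def rescale(sample: List[float], factor: float) -> List[float]:
    """Multiply every abundance by the same factor (e.g. a deeper library)."""
    return [x * factor for x in sample]


abundances = [10.0, 40.0, 50.0]        # hypothetical taxon abundances
deeper = rescale(abundances, 7.5)      # same community, larger library

same_ratio = math.isclose(log_ratio(abundances, 0, 1),
                          log_ratio(deeper, 0, 1))
```

Raw abundances differ between the two versions of the sample, but every pairwise log-ratio is identical, so comparisons built on log-ratios do not depend on the unknown total.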

1,371 citations


Cites background or methods from "Differential abundance analysis for..."

  • ...Our extensive simulation studies show that ANCOM outperforms Zero Inflated Gaussian (ZIG) methodology (1) by substantially reducing the FDR and increasing power....

    [...]

  • ...However, it is not clear how (or whether) this information is used in the distributional assumptions made in (1)....

    [...]

  • ...Furthermore, according to the statistical model given in the middle of page 2 of the online supplementary files of (1), the ZIG methodology appears to implicitly require that the sum of all observed OTUs be a constant, and not a random variable....

    [...]

  • ...Lastly, our simulation studies indicate that the ZIG methodology (1) can produce unacceptably high FDRs and hence may not be suitable for comparing the mean taxa abundance at the ecosystem level between two or more populations....

    [...]

  • ...Results: We compared the performance of ANCOM to the standard t-test and a recently published methodology called Zero Inflated Gaussian (ZIG) methodology (1) for drawing inferences on the mean taxa abundance in two or more populations....

    [...]

Journal ArticleDOI
21 Jul 2016-Nature
TL;DR: It is shown how the human gut microbiome impacts the serum metabolome and associates with insulin resistance in 277 non-diabetic Danish individuals and suggested that microbial targets may have the potential to diminish insulin resistance and reduce the incidence of common metabolic and cardiovascular disorders.
Abstract: Insulin resistance is a forerunner state of ischaemic cardiovascular disease and type 2 diabetes. Here we show how the human gut microbiome impacts the serum metabolome and associates with insulin resistance in 277 non-diabetic Danish individuals. The serum metabolome of insulin-resistant individuals is characterized by increased levels of branched-chain amino acids (BCAAs), which correlate with a gut microbiome that has an enriched biosynthetic potential for BCAAs and is deprived of genes encoding bacterial inward transporters for these amino acids. Prevotella copri and Bacteroides vulgatus are identified as the main species driving the association between biosynthesis of BCAAs and insulin resistance, and in mice we demonstrate that P. copri can induce insulin resistance, aggravate glucose intolerance and augment circulating levels of BCAAs. Our findings suggest that microbial targets may have the potential to diminish insulin resistance and reduce the incidence of common metabolic and cardiovascular disorders.

1,309 citations

References
Journal ArticleDOI
TL;DR: edgeR is a Bioconductor software package for examining differential expression of replicated count data; it uses an overdispersed Poisson model to account for both biological and technical variability, and empirical Bayes methods to moderate the degree of overdispersion across transcripts, improving the reliability of inference.
Abstract: Summary: It is expected that emerging digital gene expression (DGE) technologies will overtake microarray technologies in the near future for many functional genomics applications. One of the fundamental data analysis tasks, especially for gene expression studies, involves determining whether there is evidence that counts for a transcript or exon are significantly different across experimental conditions. edgeR is a Bioconductor software package for examining differential expression of replicated count data. An overdispersed Poisson model is used to account for both biological and technical variability. Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference. The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated. The software may have other applications beyond sequencing data, such as proteome peptide count data. Availability: The package is freely available under the LGPL licence from the Bioconductor web site (http://bioconductor.org).

29,413 citations

Journal ArticleDOI
TL;DR: An overview of the analysis pipeline and links to raw data and processed output from the runs with and without denoising are provided.
Abstract: Supplementary Figure 1 Overview of the analysis pipeline. Supplementary Table 1 Details of conventionally raised and conventionalized mouse samples. Supplementary Discussion Expanded discussion of QIIME analyses presented in the main text; Sequencing of 16S rRNA gene amplicons; QIIME analysis notes; Expanded Figure 1 legend; Links to raw data and processed output from the runs with and without denoising.

28,911 citations

Journal ArticleDOI
TL;DR: The RDP Classifier can rapidly and accurately classify bacterial 16S rRNA sequences into the new higher-order taxonomy proposed in Bergey's Taxonomic Outline of the Prokaryotes, and the majority of the classification errors appear to be due to anomalies in the current taxonomies.
Abstract: The Ribosomal Database Project (RDP) Classifier, a naive Bayesian classifier, can rapidly and accurately classify bacterial 16S rRNA sequences into the new higher-order taxonomy proposed in Bergey's Taxonomic Outline of the Prokaryotes (2nd ed., release 5.0, Springer-Verlag, New York, NY, 2004). It provides taxonomic assignments from domain to genus, with confidence estimates for each assignment. The majority of classifications (98%) were of high estimated confidence (≥95%) and high accuracy (98%). In addition to being tested with the corpus of 5,014 type strain sequences from Bergey's outline, the RDP Classifier was tested with a corpus of 23,095 rRNA sequences as assigned by the NCBI into their alternative higher-order taxonomy. The results from leave-one-out testing on both corpora show that the overall accuracies at all levels of confidence for near-full-length and 400-base segments were 89% or above down to the genus level, and the majority of the classification errors appear to be due to anomalies in the current taxonomies. For shorter rRNA segments, such as those that might be generated by pyrosequencing, the error rate varied greatly over the length of the 16S rRNA gene, with segments around the V2 and V4 variable regions giving the lowest error rates. The RDP Classifier is suitable both for the analysis of single rRNA sequences and for the analysis of libraries of thousands of sequences. Another related tool, RDP Library Compare, was developed to facilitate microbial-community comparison based on 16S rRNA gene sequence libraries. It combines the RDP Classifier with a statistical test to flag taxa differentially represented between samples. The RDP Classifier and RDP Library Compare are available online at http://rdp.cme.msu.edu/.

16,048 citations

Journal ArticleDOI
TL;DR: A method based on the negative binomial distribution, with variance and mean linked by local regression, is proposed and an implementation, DESeq, as an R/Bioconductor package is presented.
Abstract: High-throughput sequencing assays such as RNA-Seq, ChIP-Seq or barcode counting provide quantitative readouts in the form of count data. To infer differential signal in such data correctly and with good statistical power, estimation of data variability throughout the dynamic range and a suitable error model are required. We propose a method based on the negative binomial distribution, with variance and mean linked by local regression and present an implementation, DESeq, as an R/Bioconductor package.
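The link between variance and mean under a negative binomial model can be sketched as follows. The sketch uses the standard parameterization in which the variance is the mean plus a dispersion times the squared mean, and a plain method-of-moments dispersion estimate. It does not reproduce DESeq's local-regression fit; the function names and the toy replicate counts are assumptions.

```python
from statistics import mean, variance
from typing import List


def nb_variance(mu: float, dispersion: float) -> float:
    """Negative binomial variance: Poisson part (mu) plus overdispersion."""
    return mu + dispersion * mu ** 2


def moment_dispersion(counts: List[int]) -> float:
    """Method-of-moments dispersion from replicate counts (needs >= 2 values).

    Solves sample_variance = mean + dispersion * mean**2 for the dispersion,
    clamped at zero when the data are no more variable than Poisson.
    """
    m = mean(counts)
    v = variance(counts)
    if m == 0:
        return 0.0
    return max(0.0, (v - m) / m ** 2)


replicates = [2, 8, 5, 13, 2]              # hypothetical counts for one gene
alpha = moment_dispersion(replicates)
fitted_variance = nb_variance(mean(replicates), alpha)
```

A dispersion of zero recovers the Poisson case, where the variance equals the mean; larger values describe the extra replicate-to-replicate variability that count data of this kind typically show.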

13,356 citations

Journal ArticleDOI
TL;DR: A new method for metagenomic biomarker discovery is described and validated by way of class comparison, tests of biological consistency and effect size estimation, addressing the challenge of finding organisms, genes, or pathways that consistently explain the differences between two or more microbial communities.
Abstract: This study describes and validates a new method for metagenomic biomarker discovery by way of class comparison, tests of biological consistency and effect size estimation. This addresses the challenge of finding organisms, genes, or pathways that consistently explain the differences between two or more microbial communities, which is a central problem to the study of metagenomics. We extensively validate our method on several microbiomes and a convenient online interface for the method is provided at http://huttenhower.sph.harvard.edu/lefse/.

9,057 citations