Book ChapterDOI

Advancing our understanding of the human microbiome using QIIME

TL;DR: This chapter demonstrates the use of the QIIME pipeline to analyze microbial communities obtained from several sites on the bodies of transgenic and wild-type mice, as assessed using 16S rRNA gene sequences generated on the Illumina MiSeq platform.
Abstract: High-throughput DNA sequencing technologies, coupled with advanced bioinformatics tools, have enabled rapid advances in microbial ecology and our understanding of the human microbiome. QIIME (Quantitative Insights Into Microbial Ecology) is an open-source bioinformatics software package designed for microbial community analysis based on DNA sequence data, which provides a single analysis framework for analysis of raw sequence data through publication-quality statistical analyses and interactive visualizations. In this chapter, we demonstrate the use of the QIIME pipeline to analyze microbial communities obtained from several sites on the bodies of transgenic and wild-type mice, as assessed using 16S rRNA gene sequences generated on the Illumina MiSeq platform. We present our recommended pipeline for performing microbial community analysis and provide guidelines for making critical choices in the process. We present examples of some of the types of analyses that are enabled by QIIME and discuss how other tools, such as phyloseq and R, can be applied to expand upon these analyses.
Citations
Journal Article
TL;DR: FastTree uses sequence profiles of internal nodes in the tree to implement neighbor-joining, applies heuristics to quickly identify candidate joins, and then uses nearest-neighbor interchanges to reduce the length of the tree.
Abstract: Gene families are growing rapidly, but standard methods for inferring phylogenies do not scale to alignments with over 10,000 sequences. We present FastTree, a method for constructing large phylogenies and for estimating their reliability. Instead of storing a distance matrix, FastTree stores sequence profiles of internal nodes in the tree. FastTree uses these profiles to implement neighbor-joining and uses heuristics to quickly identify candidate joins. FastTree then uses nearest-neighbor interchanges to reduce the length of the tree. For an alignment with N sequences, L sites, and a different characters, a distance matrix requires O(N^2) space and O(N^2 L) time, but FastTree requires just O( NLa + N sqrt(N) ) memory and O( N sqrt(N) log(N) L a ) time. To estimate the tree's reliability, FastTree uses local bootstrapping, which gives another 100-fold speedup over a distance matrix. For example, FastTree computed a tree and support values for 158,022 distinct 16S ribosomal RNAs in 17 hours and 2.4 gigabytes of memory. Just computing pairwise Jukes-Cantor distances and storing them, without inferring a tree or bootstrapping, would require 17 hours and 50 gigabytes of memory. In simulations, FastTree was slightly more accurate than neighbor joining, BIONJ, or FastME; on genuine alignments, FastTree's topologies had higher likelihoods. FastTree is available at http://microbesonline.org/fasttree.
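The 50-gigabyte figure quoted for a stored distance matrix can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below is only an illustration: it assumes (the abstract does not say this) that one distance is kept per unordered pair of sequences and that each distance takes 4 bytes. The function name is hypothetical.

```python
def pairwise_distance_bytes(n_sequences: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store one distance per unordered pair of sequences.

    Assumption for illustration: upper-triangle storage, 4-byte values.
    """
    n_pairs = n_sequences * (n_sequences - 1) // 2
    return n_pairs * bytes_per_value


n = 158_022  # number of distinct 16S rRNA sequences quoted in the abstract
size_bytes = pairwise_distance_bytes(n)
print(f"{size_bytes / 1e9:.1f} GB for {n:,} sequences")
```

Under these assumptions the result lands close to the 50 gigabytes mentioned above, which shows why the quadratic term dominates at this scale.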

2,436 citations

Journal ArticleDOI
TL;DR: The authors advocate that investigators avoid rarefying altogether, and they summarize well-established statistical theory that simultaneously accounts for library size differences and biological variability using an appropriate mixture model.
Abstract: Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
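To make the two criticized normalizations concrete, here is a minimal sketch of what rarefying and simple proportions do to a count vector. It is not the mixture-model approach the abstract recommends, and the function names and toy counts are invented for illustration.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from a vector of taxon counts."""
    total = sum(counts)
    if depth > total:
        raise ValueError("sample has fewer reads than the rarefaction depth")
    # Expand the counts into one label per read, then draw without replacement.
    reads = [taxon for taxon, c in enumerate(counts) for _ in range(c)]
    rng = random.Random(seed)
    rarefied = [0] * len(counts)
    for taxon in rng.sample(reads, depth):
        rarefied[taxon] += 1
    return rarefied


def proportions(counts):
    """Convert counts to relative abundances (library size is discarded)."""
    total = sum(counts)
    return [c / total for c in counts]


# Toy data: two samples with the same composition but different library sizes.
sample_a = [500, 300, 200]      # 1,000 reads
sample_b = [5000, 3000, 2000]   # 10,000 reads
depth = min(sum(sample_a), sum(sample_b))

rar_a = rarefy(sample_a, depth)
rar_b = rarefy(sample_b, depth)   # 9,000 of sample_b's reads are thrown away
print(rar_a, rar_b)
print(proportions(sample_a), proportions(sample_b))
```

Rarefying equalizes depth by discarding data from the deeper sample, while proportions hide the fact that one estimate rests on ten times as many reads. Both effects are part of what the abstract objects to.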

2,184 citations


Cites background or methods from "Advancing our understanding of the ..."

  • ...This experimental design appears in many clinical settings (health/disease, target/control, etc.), and other settings for which there is sufficient a priori knowledge about the microbiological conditions, and we want to enumerate the OTUs that are different between these microbiomes, along with a…...

    [...]

  • ...In both simulation types we varied the library and effect sizes across a range of levels that are relevant for recently-published microbiome investigations, and followed with commonly used statistical analyses from the microbiome and/or RNA-Seq literature (Figure 2)....

    [...]

Journal ArticleDOI
TL;DR: It is found that a distinct and abundant microbiome drives suppressive monocytic cellular differentiation in pancreatic cancer via selective Toll-like receptor ligation leading to T-cell anergy, and that the microbiome has potential as a therapeutic target in the modulation of disease progression.
Abstract: We found that the cancerous pancreas harbors a markedly more abundant microbiome compared with normal pancreas in both mice and humans, and select bacteria are differentially increased in the tumorous pancreas compared with gut. Ablation of the microbiome protects against preinvasive and invasive pancreatic ductal adenocarcinoma (PDA), whereas transfer of bacteria from PDA-bearing hosts, but not controls, reverses tumor protection. Bacterial ablation was associated with immunogenic reprogramming of the PDA tumor microenvironment, including a reduction in myeloid-derived suppressor cells and an increase in M1 macrophage differentiation, promoting TH1 differentiation of CD4+ T cells and CD8+ T-cell activation. Bacterial ablation also enabled efficacy for checkpoint-targeted immunotherapy by upregulating PD-1 expression. Mechanistically, the PDA microbiome generated a tolerogenic immune program by differentially activating select Toll-like receptors in monocytic cells. These data suggest that endogenous microbiota promote the crippling immune-suppression characteristic of PDA and that the microbiome has potential as a therapeutic target in the modulation of disease progression. Significance: We found that a distinct and abundant microbiome drives suppressive monocytic cellular differentiation in pancreatic cancer via selective Toll-like receptor ligation leading to T-cell anergy. Targeting the microbiome protects against oncogenesis, reverses intratumoral immune tolerance, and enables efficacy for checkpoint-based immunotherapy. These data have implications for understanding immune suppression in pancreatic cancer and its reversal in the clinic. Cancer Discov; 8(4); 403-16. ©2018 AACR. See related commentary by Riquelme et al., p. 386. This article is highlighted in the In This Issue feature, p. 371.

715 citations

Journal ArticleDOI
TL;DR: Ongoing large-scale population-based studies of the gut microbiome, together with brain imaging studies of how gut microbiome modulation affects brain responses to emotion-related stimuli, are seeking to validate speculation that alterations in the gut microbiome play a pathophysiological role in human brain diseases.
Abstract: The discovery of the size and complexity of the human microbiome has resulted in an ongoing reevaluation of many concepts of health and disease, including diseases affecting the CNS. A growing body of preclinical literature has demonstrated bidirectional signaling between the brain and the gut microbiome, involving multiple neurocrine and endocrine signaling mechanisms. While psychological and physical stressors can affect the composition and metabolic activity of the gut microbiota, experimental changes to the gut microbiome can affect emotional behavior and related brain systems. These findings have resulted in speculation that alterations in the gut microbiome may play a pathophysiological role in human brain diseases, including autism spectrum disorder, anxiety, depression, and chronic pain. Ongoing large-scale population-based studies of the gut microbiome and brain imaging studies looking at the effect of gut microbiome modulation on brain responses to emotion-related stimuli are seeking to validate these speculations. This article is a summary of emerging topics covered in a symposium and is not meant to be a comprehensive review of the subject.

666 citations


Cites background from "Advancing our understanding of the ..."

  • ...Of particular importance are tools to integrate large, highly multivariate datasets (Gonzalez and Knight, 2012; Navas-Molina et al., 2013)....

    [...]

Journal ArticleDOI
TL;DR: The two main approaches for analyzing the microbiome, 16S ribosomal RNA gene amplicons and shotgun metagenomics, are illustrated with analyses of libraries designed to highlight their strengths and weaknesses and several methods for taxonomic classification of bacterial sequences are discussed.
Abstract: The advent of next generation sequencing (NGS) has enabled investigations of the gut microbiome with unprecedented resolution and throughput. This has stimulated the development of sophisticated bioinformatics tools to analyze the massive amounts of data generated. Researchers therefore need a clear understanding of the key concepts required for the design, execution and interpretation of NGS experiments on microbiomes. We conducted a literature review and used our own data to determine which approaches work best. The two main approaches for analyzing the microbiome, 16S ribosomal RNA (rRNA) gene amplicons and shotgun metagenomics, are illustrated with analyses of libraries designed to highlight their strengths and weaknesses. Several methods for taxonomic classification of bacterial sequences are discussed. We present simulations to assess the number of sequences that are required to perform reliable appraisals of bacterial community structure. To the extent that fluctuations in the diversity of gut bacterial populations correlate with health and disease, we emphasize various techniques for the analysis of bacterial communities within samples (α-diversity) and between samples (β-diversity). Finally, we demonstrate techniques to infer the metabolic capabilities of a bacteria community from these 16S and shotgun data.
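As a small illustration of the within-sample (α) versus between-sample (β) distinction, the sketch below computes one common metric of each kind, the Shannon index and the Bray-Curtis dissimilarity, on invented counts. The abstract does not say which metrics the authors used, so these two are chosen here only as familiar examples.

```python
import math


def shannon(counts):
    """Shannon diversity (natural log) of one sample: an alpha-diversity metric."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return sum(-p * math.log(p) for p in props)


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two samples: a beta-diversity metric."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(u) + sum(v)
    return numerator / denominator


# Toy count vectors over the same four taxa (illustrative only).
even_sample = [10, 10, 10, 10]
dominated_sample = [40, 0, 0, 0]

print(shannon(even_sample))        # higher: reads spread evenly over four taxa
print(shannon(dominated_sample))   # zero: a single taxon holds every read
print(bray_curtis(even_sample, dominated_sample))
```

The first two numbers describe each community on its own, whereas the third only has meaning for a pair of communities.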

647 citations


Cites background or methods from "Advancing our understanding of the ..."

  • ...The selection of each metric will depend on the hypothesis being evaluated as some phenotypes are more strongly influenced by relative abundance of taxa rather than presence or absence of specific taxa (Navas-Molina et al., 2013)....

    [...]

  • ...One example is UniFrac (unique fraction), which has been reported to correlate well with the biological properties of samples (Navas-Molina et al., 2013) and measures the amount of “unique evolution” of a community in comparison to others (Lozupone and Knight, 2005; Lozupone et al., 2006)....

    [...]

  • ...For this procedure, calculations are reiterated after omitting one observation (taxa, OTU, etc.) and then the average is represented in a PCoA plot while the variance is depicted as confidence ellipsoids (Efron and Stein, 1981; Navas-Molina et al., 2013)....

    [...]

  • ...For the analysis of 16S amplicon libraries, we evaluated QIIME (Caporaso et al., 2010; Navas-Molina et al., 2013) and mothur (Schloss et al., 2009), the most widely adopted pipelines, and the MiSeq Reporter v2....

    [...]

  • ...QIIME and mothur offer the possibility to readily calculate many β-diversity metrics (Schloss et al., 2009; Navas-Molina et al., 2013) and so does the R package vegan (Oksanen et al., 2015)....

    [...]
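The idea of "unique evolution" in the UniFrac excerpt above can be sketched in a few lines. This is a simplified, unweighted version on a hand-written toy tree, not the QIIME implementation: each branch is given as a length plus the set of tips below it, and the distance is the fraction of observed branch length that leads to tips from only one of the two communities. The tree, tip names, and function name are invented for illustration.

```python
def unweighted_unifrac(branches, community_a, community_b):
    """Fraction of observed branch length unique to one community.

    branches: list of (length, set_of_descendant_tip_names), one per branch.
    community_a, community_b: sets of tip names present in each sample.
    """
    unique = 0.0
    observed = 0.0
    for length, tips in branches:
        in_a = bool(tips & community_a)
        in_b = bool(tips & community_b)
        if in_a or in_b:
            observed += length
            if in_a != in_b:  # branch leads to only one of the communities
                unique += length
    return unique / observed


# Toy tree ((t1:1, t2:1):1, (t3:1, t4:1):1), written as one entry per branch.
branches = [
    (1.0, {"t1"}),
    (1.0, {"t2"}),
    (1.0, {"t1", "t2"}),
    (1.0, {"t3"}),
    (1.0, {"t4"}),
    (1.0, {"t3", "t4"}),
]

disjoint = unweighted_unifrac(branches, {"t1", "t2"}, {"t3", "t4"})
overlapping = unweighted_unifrac(branches, {"t1", "t3"}, {"t2", "t3"})
identical = unweighted_unifrac(branches, {"t1", "t4"}, {"t1", "t4"})
print(disjoint, overlapping, identical)
```

Communities drawn from separate clades share no branch length and sit at the maximum distance, while identical communities sit at zero.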

References
Journal ArticleDOI
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations


"Advancing our understanding of the ..." refers methods in this paper

  • ...QIIME can assign taxonomy against any of the given databases, or against a custom database, using several methods: BLAST (Altschul et al., 1990), RDP classifier (Wang et al., 2007), rtax (Soergel, Dey, Knight, & Brenner, 2012), mothur (Schloss et al., 2009), and tax2tree (McDonald, Price, et al., 2012)....

    [...]

  • ...QIIME currently supports three different methods for detecting chimeras: blast fragments, a taxonomy-assignment-based approach using BLAST (Altschul, Gish, Miller, Myers, & Lipman, 1990); ChimeraSlayer (Haas et al., 2011), which uses BLAST to identify potential chimera parents; and usearch 6.1 (Edgar, 2010), which can perform de novo chimera detection based on abundances as well as reference-based chimera detection....

    [...]

Journal ArticleDOI
01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation; these are used to show the response to increasing the number of features used in the splitting, and the ideas are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, aaa, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.

79,257 citations


"Advancing our understanding of the ..." refers methods in this paper

  • ...QIIME uses the random forest (Breiman, 2001) supervised classifier implemented in R (Liaw and Wiener, 2002) to recover the mislabeled samples by training the classifier with the relative abundance taxa (Knights, Costello, & Knight, 2011)....

    [...]

Journal ArticleDOI
TL;DR: MUSCLE is a new computer program for creating multiple alignments of protein sequences that includes fast distance estimation using kmer counting, progressive alignment using a new profile function the authors call the log-expectation score, and refinement using tree-dependent restricted partitioning.
Abstract: We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.

37,524 citations


"Advancing our understanding of the ..." refers methods in this paper

  • ...Currently, QIIME supports the following methods for performing sequence alignment: PyNAST (Caporaso, Bittinger, et al., 2010), Infernal (Nawrocki, Kolbe, & Eddy, 2009), clustalw (Larkin et al., 2007), muscle (Edgar, 2004), and mafft (Katoh, Misawa, Kuma, & Miyata, 2002)....

    [...]

  • ...The current methods supported for inferring the phylogenetic tree in QIIME are FastTree (Price, Dehal, & Arkin, 2009), clearcut (Evans, Sheneman, & Foster, 2006), clustalw (Larkin et al., 2007), raxml (Stamatakis, Ludwig, & Meier, 2005), and muscle (Edgar, 2004)....

    [...]

Journal ArticleDOI
TL;DR: Several case studies of Cytoscape plug-ins are surveyed, including a search for interaction pathways correlating with changes in gene expression, a study of protein complexes involved in cellular recovery to DNA damage, inference of a combined physical/functional interaction network for Halobacterium, and an interface to detailed stochastic/kinetic gene regulatory models.
Abstract: Cytoscape is an open source software project for integrating biomolecular interaction networks with high-throughput expression data and other molecular states into a unified conceptual framework. Although applicable to any system of molecular components and interactions, Cytoscape is most powerful when used in conjunction with large databases of protein-protein, protein-DNA, and genetic interactions that are increasingly available for humans and model organisms. Cytoscape's software Core provides basic functionality to layout and query the network; to visually integrate the network with expression profiles, phenotypes, and other molecular states; and to link the network to databases of functional annotations. The Core is extensible through a straightforward plug-in architecture, allowing rapid development of additional computational analyses and features. Several case studies of Cytoscape plug-ins are surveyed, including a search for interaction pathways correlating with changes in gene expression, a study of protein complexes involved in cellular recovery to DNA damage, inference of a combined physical/functional interaction network for Halobacterium, and an interface to detailed stochastic/kinetic gene regulatory models.

32,980 citations


"Advancing our understanding of the ..." refers methods in this paper

  • ...Cytoscape is not wrapped in the QIIME pipeline, and it is run as a separate program....

    [...]

  • ...This script generates the OTU-network files to be passed into Cytoscape (Shannon et al., 2003) and statistics for those networks (specifically, a bipartite graph in which nodes represent either OTUs or samples, and edges represent a connection between an OTU and a sample; Ley et al., 2008)....

    [...]

  • ...The files used by Cytoscape 2.8.2 are the real edge table (real_edge_table.txt) which contains the columns “from,” “to,” “eweight,” and “consensus_lin,” among others dictated by the headers in the mapping file; and the real node file (real_node_table.txt) which contains a node for each OTU and each sample in the study....

    [...]

  • ...In the network diagram, both types of nodes, OTU nodes and sample nodes, can be easily modified using Cytoscape’s graphical user interface, with symbols such as filled circles for OTUs and filled squares for samples....

    [...]
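The bipartite OTU network described in the excerpts above can be illustrated with a short sketch that turns a small OTU table into an edge list and a node list. This is not the QIIME script itself. The column order (sample as "from", OTU as "to", read count as the edge weight) is an assumption made for illustration, and the table contents and function name are invented.

```python
def otu_network_edges(otu_table):
    """Build bipartite edges from {otu_id: {sample_id: count}}.

    Returns (from, to, eweight) tuples. Assumption for illustration:
    "from" is the sample, "to" is the OTU, and the weight is the read count.
    """
    edges = []
    for otu_id, sample_counts in sorted(otu_table.items()):
        for sample_id, count in sorted(sample_counts.items()):
            if count > 0:  # an edge exists only if the OTU was seen in the sample
                edges.append((sample_id, otu_id, count))
    return edges


# Toy OTU table (hypothetical identifiers and counts).
otu_table = {
    "OTU_1": {"S1": 5, "S2": 0},
    "OTU_2": {"S1": 2, "S2": 7},
}

edges = otu_network_edges(otu_table)
# One node per sample and one per OTU, as in the node table described above.
nodes = sorted({e[0] for e in edges} | {e[1] for e in edges})

print("from\tto\teweight")
for src, dst, weight in edges:
    print(f"{src}\t{dst}\t{weight}")
print(nodes)
```

A tab-separated table of this shape is the kind of input a network viewer such as Cytoscape can import, with sample and OTU nodes then styled separately.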

Book
13 Aug 2009
TL;DR: This book describes ggplot2, a new data visualization package for R that uses the insights from Leland Wilkinson's Grammar of Graphics to create a powerful and flexible system for creating data graphics.
Abstract: This book describes ggplot2, a new data visualization package for R that uses the insights from Leland Wilkinson's Grammar of Graphics to create a powerful and flexible system for creating data graphics. With ggplot2, it's easy to:

  • produce handsome, publication-quality plots, with automatic legends created from the plot specification
  • superpose multiple layers (points, lines, maps, tiles, box plots to name a few) from different data sources, with automatically adjusted common scales
  • add customisable smoothers that use the powerful modelling capabilities of R, such as loess, linear models, generalised additive models and robust regression
  • save any ggplot2 plot (or part thereof) for later modification or reuse
  • create custom themes that capture in-house or journal style requirements, and that can easily be applied to multiple plots
  • approach your graph from a visual perspective, thinking about how each component of the data is represented on the final plot.

This book will be useful to everyone who has struggled with displaying their data in an informative and attractive way. You will need some basic knowledge of R (i.e. you should be able to get your data into R), but ggplot2 is a mini-language specifically tailored for producing graphics, and you'll learn everything you need in the book. After reading this book you'll be able to produce graphics customized precisely for your problems, and you'll find it easy to get graphics out of your head and on to the screen or page.

29,504 citations
