scispace - formally typeset
Search or ask a question
Journal ArticleDOI

phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data.

22 Apr 2013-PLOS ONE (Public Library of Science)-Vol. 8, Iss: 4
TL;DR: The phyloseq project for R is a new open-source software package dedicated to the object-oriented representation and analysis of microbiome census data in R, which supports importing data from a variety of common formats, as well as many analysis techniques.
Abstract: Background The analysis of microbial communities through DNA sequencing brings many challenges: the integration of different types of data with methods from ecology, genetics, phylogenetics, multivariate statistics, visualization and testing. With the increased breadth of experimental designs now being pursued, project-specific statistical analyses are often needed, and these analyses are often difficult (or impossible) for peer researchers to independently reproduce. The vast majority of the requisite tools for performing these analyses reproducibly are already implemented in R and its extensions (packages), but with limited support for high throughput microbiome census data. Results Here we describe a software project, phyloseq, dedicated to the object-oriented representation and analysis of microbiome census data in R. It supports importing data from a variety of common formats, as well as many analysis techniques. These include calibration, filtering, subsetting, agglomeration, multi-table comparisons, diversity analysis, parallelized Fast UniFrac, ordination methods, and production of publication-quality graphics; all in a manner that is easy to document, share, and modify. We show how to apply functions from other R packages to phyloseq-represented data, illustrating the availability of a large number of open source analysis techniques. We discuss the use of phyloseq with tools for reproducible research, a practice common in other fields but still rare in the analysis of highly parallel microbiome census data. We have made available all of the materials necessary to completely reproduce the analysis and figures included in this article, an example of best practices for reproducible research. Conclusions The phyloseq project for R is a new open-source software package, freely available on the web from both GitHub and Bioconductor.

Content maybe subject to copyright    Report

Citations
More filters
Journal ArticleDOI
Evan Bolyen1, Jai Ram Rideout1, Matthew R. Dillon1, Nicholas A. Bokulich1, Christian C. Abnet2, Gabriel A. Al-Ghalith3, Harriet Alexander4, Harriet Alexander5, Eric J. Alm6, Manimozhiyan Arumugam7, Francesco Asnicar8, Yang Bai9, Jordan E. Bisanz10, Kyle Bittinger11, Asker Daniel Brejnrod7, Colin J. Brislawn12, C. Titus Brown5, Benjamin J. Callahan13, Andrés Mauricio Caraballo-Rodríguez14, John Chase1, Emily K. Cope1, Ricardo Silva14, Christian Diener15, Pieter C. Dorrestein14, Gavin M. Douglas16, Daniel M. Durall17, Claire Duvallet6, Christian F. Edwardson, Madeleine Ernst18, Madeleine Ernst14, Mehrbod Estaki17, Jennifer Fouquier19, Julia M. Gauglitz14, Sean M. Gibbons20, Sean M. Gibbons15, Deanna L. Gibson17, Antonio Gonzalez14, Kestrel Gorlick1, Jiarong Guo21, Benjamin Hillmann3, Susan Holmes22, Hannes Holste14, Curtis Huttenhower23, Curtis Huttenhower24, Gavin A. Huttley25, Stefan Janssen26, Alan K. Jarmusch14, Lingjing Jiang14, Benjamin D. Kaehler27, Benjamin D. Kaehler25, Kyo Bin Kang14, Kyo Bin Kang28, Christopher R. Keefe1, Paul Keim1, Scott T. Kelley29, Dan Knights3, Irina Koester14, Tomasz Kosciolek14, Jorden Kreps1, Morgan G. I. Langille16, Joslynn S. Lee30, Ruth E. Ley31, Ruth E. Ley32, Yong-Xin Liu, Erikka Loftfield2, Catherine A. Lozupone19, Massoud Maher14, Clarisse Marotz14, Bryan D Martin20, Daniel McDonald14, Lauren J. McIver24, Lauren J. McIver23, Alexey V. Melnik14, Jessica L. Metcalf33, Sydney C. Morgan17, Jamie Morton14, Ahmad Turan Naimey1, Jose A. Navas-Molina34, Jose A. Navas-Molina14, Louis-Félix Nothias14, Stephanie B. Orchanian, Talima Pearson1, Samuel L. Peoples35, Samuel L. Peoples20, Daniel Petras14, Mary L. Preuss36, Elmar Pruesse19, Lasse Buur Rasmussen7, Adam R. Rivers37, Michael S. Robeson38, Patrick Rosenthal36, Nicola Segata8, Michael Shaffer19, Arron Shiffer1, Rashmi Sinha2, Se Jin Song14, John R. Spear39, Austin D. Swafford, Luke R. Thompson40, Luke R. Thompson41, Pedro J. Torres29, Pauline Trinh20, Anupriya Tripathi14, Peter J. Turnbaugh10, Sabah Ul-Hasan42, Justin J. J. van der Hooft43, Fernando Vargas, Yoshiki Vázquez-Baeza14, Emily Vogtmann2, Max von Hippel44, William A. Walters32, Yunhu Wan2, Mingxun Wang14, Jonathan Warren45, Kyle C. Weber46, Kyle C. Weber37, Charles H. D. Williamson1, Amy D. Willis20, Zhenjiang Zech Xu14, Jesse R. Zaneveld20, Yilong Zhang47, Qiyun Zhu14, Rob Knight14, J. Gregory Caporaso1 
TL;DR: QIIME 2 development was primarily funded by NSF Awards 1565100 to J.G.C. and R.K.P. and partial support was also provided by the following: grants NIH U54CA143925 and U54MD012388.
Abstract: QIIME 2 development was primarily funded by NSF Awards 1565100 to J.G.C. and 1565057 to R.K. Partial support was also provided by the following: grants NIH U54CA143925 (J.G.C. and T.P.) and U54MD012388 (J.G.C. and T.P.); grants from the Alfred P. Sloan Foundation (J.G.C. and R.K.); ERCSTG project MetaPG (N.S.); the Strategic Priority Research Program of the Chinese Academy of Sciences QYZDB-SSW-SMC021 (Y.B.); the Australian National Health and Medical Research Council APP1085372 (G.A.H., J.G.C., Von Bing Yap and R.K.); the Natural Sciences and Engineering Research Council (NSERC) to D.L.G.; and the State of Arizona Technology and Research Initiative Fund (TRIF), administered by the Arizona Board of Regents, through Northern Arizona University. All NCI coauthors were supported by the Intramural Research Program of the National Cancer Institute. S.M.G. and C. Diener were supported by the Washington Research Foundation Distinguished Investigator Award.

8,821 citations

Journal ArticleDOI
TL;DR: An r package, ggtree, which provides programmable visualization and annotation of phylogenetic trees, which can read more tree file formats than other softwares, and support visualization of phylo, multiphylo, phylo4, phyla4d, obkdata and phyloseq tree objects defined in other r packages.
Abstract: Summary We present an r package, ggtree, which provides programmable visualization and annotation of phylogenetic trees. ggtree can read more tree file formats than other softwares, including newick, nexus, NHX, phylip and jplace formats, and support visualization of phylo, multiphylo, phylo4, phylo4d, obkdata and phyloseq tree objects defined in other r packages. It can also extract the tree/branch/node-specific and other data from the analysis outputs of beast, epa, hyphy, paml, phylodog, pplacer, r8s, raxml and revbayes software, and allows using these data to annotate the tree. The package allows colouring and annotation of a tree by numerical/categorical node attributes, manipulating a tree by rotating, collapsing and zooming out clades, highlighting user selected clades or operational taxonomic units and exploration of a large tree by zooming into a selected portion. A two-dimensional tree can be drawn by scaling the tree width based on an attribute of the nodes. A tree can be annotated with an associated numerical matrix (as a heat map), multiple sequence alignment, subplots or silhouette images. The package ggtree is released under the artistic-2.0 license. The source code and documents are freely available through bioconductor (http://www.bioconductor.org/packages/ggtree).

2,692 citations


Cites methods from "phyloseq: an R package for reproduc..."

  • ...Some packages, including APE (Paradis, Claude & Strimmer 2004) and PHYTOOLS (Revell 2012), which are capable of displaying and annotating trees, are developed using the base graphics system of R. OUTBREAKTOOLS (Jombart et al. 2014) and PHYLOSEQ (McMurdie & Holmes 2013) extended GGPLOT2 to draw phylogenetic trees....

    [...]

  • ...For example, if we have plotted a tree without taxa labels, OUTBREAKTOOLS and PHYLOSEQ provide no easy way for general R users, who have little knowledge about the infrastructures of these packages, to add a layer of taxa labels....

    [...]

  • ...Even though OUTBREAKTOOLS and PHYLOSEQ are developed based on GGPLOT2, the most valuable part of GGPLOT2 syntax – adding layers of annotations – is not supported in these packages....

    [...]

  • ...…including APE (Paradis, Claude & Strimmer 2004) and PHYTOOLS (Revell 2012), which are capable of displaying and annotating trees, are developed using the base graphics system of R. OUTBREAKTOOLS (Jombart et al. 2014) and PHYLOSEQ (McMurdie & Holmes 2013) extended GGPLOT2 to draw phylogenetic trees....

    [...]

Journal ArticleDOI
TL;DR: It is advocated that investigators avoid rarefying altogether and supported statistical theory is provided that simultaneously accounts for library size differences and biological variability using an appropriate mixture model.
Abstract: Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.

2,184 citations


Cites methods from "phyloseq: an R package for reproduc..."

  • ...We have provided convenient wrappers for edgeR and DESeq that are tailored for microbiome count data, and these wrappers are included in the most recent release of the phyloseq package [53] with corresponding tutorials....

    [...]

  • ...These simulations, analyses, and graphics rely upon the cluster [58], foreach [59], ggplot2 [60], phyloseq [53], plyr [61], reshape2 [62], and ROCR [39] R packages; in addition to the DESeq [3], edgeR [2], and PoiClaClu [63] R packages for RNASeq data, and tools available in the standard R distribution [64]....

    [...]

Journal ArticleDOI
TL;DR: The purpose of this review is to alert investigators to the dangers inherent in ignoring the compositional nature of the data, and point out that HTS datasets derived from microbiome studies can and should be treated as compositions at all stages of analysis.
Abstract: Datasets collected by high-throughput sequencing (HTS) of 16S rRNA gene amplimers, metagenomes or metatranscriptomes are commonplace and being used to study human disease states, ecological differences between sites, and the built environment. There is increasing awareness that microbiome datasets generated by HTS are compositional because they have an arbitrary total imposed by the instrument. However, many investigators are either unaware of this or assume specific properties of the compositional data. The purpose of this review is to alert investigators to the dangers inherent in ignoring the compositional nature of the data, and point out that HTS datasets derived from microbiome studies can and should be treated as compositions at all stages of analysis. We briefly introduce compositional data, illustrate the pathologies that occur when compositional data are analyzed inappropriately, and finally give guidance and point to resources and examples for the analysis of microbiome datasets using compositional data analysis.

1,511 citations


Cites methods from "phyloseq: an R package for reproduc..."

  • ..., 2017), and the similar correspondence analysis implemented in the phyloseq package (McMurdie and Holmes, 2013)....

    [...]

  • ...…et al., 2014; Lovell et al., 2015; Mandal et al., 2015; McMillan et al., 2015; Gloor and Reid, 2016; Gloor et al., 2016b; Bian et al., 2017; Silverman et al., 2017; Quinn et al., 2017), and the similar correspondence analysis implemented in the phyloseq package (McMurdie and Holmes, 2013)....

    [...]

Journal ArticleDOI
21 Jul 2016-Nature
TL;DR: It is shown how the human gut microbiome impacts the serum metabolome and associates with insulin resistance in 277 non-diabetic Danish individuals and suggested that microbial targets may have the potential to diminish insulin resistance and reduce the incidence of common metabolic and cardiovascular disorders.
Abstract: Insulin resistance is a forerunner state of ischaemic cardiovascular disease and type 2 diabetes. Here we show how the human gut microbiome impacts the serum metabolome and associates with insulin resistance in 277 non-diabetic Danish individuals. The serum metabolome of insulin-resistant individuals is characterized by increased levels of branched-chain amino acids (BCAAs), which correlate with a gut microbiome that has an enriched biosynthetic potential for BCAAs and is deprived of genes encoding bacterial inward transporters for these amino acids. Prevotella copri and Bacteroides vulgatus are identified as the main species driving the association between biosynthesis of BCAAs and insulin resistance, and in mice we demonstrate that P. copri can induce insulin resistance, aggravate glucose intolerance and augment circulating levels of BCAAs. Our findings suggest that microbial targets may have the potential to diminish insulin resistance and reduce the incidence of common metabolic and cardiovascular disorders.

1,309 citations

References
More filters
Journal ArticleDOI
TL;DR: The results illustrate that UniFrac provides a new way of characterizing microbial communities, using the wealth of environmental rRNA sequences, and allows quantitative insight into the factors that underlie the distribution of lineages among environments.
Abstract: We introduce here a new method for computing differences between microbial communities based on phylogenetic information. This method, UniFrac, measures the phylogenetic distance between sets of taxa in a phylogenetic tree as the fraction of the branch length of the tree that leads to descendants from either one environment or the other, but not both. UniFrac can be used to determine whether communities are significantly different, to compare many communities simultaneously using clustering and ordination techniques, and to measure the relative contributions of different factors, such as chemistry and geography, to similarities between samples. We demonstrate the utility of UniFrac by applying it to published 16S rRNA gene libraries from cultured isolates and environmental clones of bacteria in marine sediment, water, and ice. Our results reveal that (i) cultured isolates from ice, water, and sediment resemble each other and environmental clone sequences from sea ice, but not environmental clone sequences from sediment and water; (ii) the geographical location does not correlate strongly with bacterial community differences in ice and sediment from the Arctic and Antarctic; and (iii) bacterial communities differ between terrestrially impacted seawater (whether polar or temperate) and warm oligotrophic seawater, whereas those in individual seawater samples are not more similar to each other than to those in sediment or ice samples. These results illustrate that UniFrac provides a new way of characterizing microbial communities, using the wealth of environmental rRNA sequences, and allows quantitative insight into the factors that underlie the distribution of lineages among environments.

6,679 citations

Journal Article
TL;DR: The Human Microbiome Project has analysed the largest cohort and set of distinct, clinically relevant body habitats so far, finding the diversity and abundance of each habitat’s signature microbes to vary widely even among healthy subjects, with strong niche specialization both within and among individuals.
Abstract: Studies of the human microbiome have revealed that even healthy individuals differ remarkably in the microbes that occupy habitats such as the gut, skin and vagina. Much of this diversity remains unexplained, although diet, environment, host genetics and early microbial exposure have all been implicated. Accordingly, to characterize the ecology of human-associated microbial communities, the Human Microbiome Project has analysed the largest cohort and set of distinct, clinically relevant body habitats so far. We found the diversity and abundance of each habitat’s signature microbes to vary widely even among healthy subjects, with strong niche specialization both within and among individuals. The project encountered an estimated 81–99% of the genera, enzyme families and community configurations occupied by the healthy Western microbiome. Metagenomic carriage of metabolic pathways was stable among individuals despite variation in community structure, and ethnic/racial background proved to be one of the strongest associations of both pathways and microbes with clinical metadata. These results thus delineate the range of structural and functional configurations normal in the microbial communities of a healthy population, enabling future characterization of the epidemiology, ecology and translational applications of the human microbiome.

6,350 citations

Journal ArticleDOI
28 Jul 1995-Science
TL;DR: An approach for genome analysis based on sequencing and assembly of unselected pieces of DNA from the whole chromosome has been applied to obtain the complete nucleotide sequence of the genome from the bacterium Haemophilus influenzae Rd.
Abstract: An approach for genome analysis based on sequencing and assembly of unselected pieces of DNA from the whole chromosome has been applied to obtain the complete nucleotide sequence (1,830,137 base pairs) of the genome from the bacterium Haemophilus influenzae Rd. This approach eliminates the need for initial mapping efforts and is therefore applicable to the vast array of microbial species for which genome maps are unavailable. The H. influenzae Rd genome sequence (Genome Sequence DataBase accession number L42023) represents the only complete genome sequence from a free-living organism.

5,944 citations


"phyloseq: an R package for reproduc..." refers background in this paper

  • ...sequencing [20,21] of un-amplified metagenomic DNA [22], in...

    [...]

Journal ArticleDOI
TL;DR: SILVA (from Latin silva, forest), was implemented to provide a central comprehensive web resource for up to date, quality controlled databases of aligned rRNA sequences from the Bacteria, Archaea and Eukarya domains.
Abstract: Sequencing ribosomal RNA (rRNA) genes is currently the method of choice for phylogenetic reconstruction, nucleic acid based detection and quantification of microbial diversity. The ARB software suite with its corresponding rRNA datasets has been accepted by researchers worldwide as a standard tool for large scale rRNA analysis. However, the rapid increase of publicly available rRNA sequence data has recently hampered the maintenance of comprehensive and curated rRNA knowledge databases. A new system, SILVA (from Latin silva, forest), was implemented to provide a central comprehensive web resource for up to date, quality controlled databases of aligned rRNA sequences from the Bacteria, Archaea and Eukarya domains. All sequences are checked for anomalies, carry a rich set of sequence associated contextual information, have multiple taxonomic classifications, and the latest validly described nomenclature. Furthermore, two precompiled sequence datasets compatible with ARB are offered for download on the SILVA website: (i) the reference (Ref) datasets, comprising only high quality, nearly full length sequences suitable for in-depth phylogenetic analysis and probe design and (ii) the comprehensive Parc datasets with all publicly available rRNA sequences longer than 300 nucleotides suitable for biodiversity analyses. The latest publicly available database release 91 (August 2007) hosts 547 521 sequences split into 461 823 small subunit and 85 689 large subunit rRNAs.

5,733 citations


Additional excerpts

  • ...large reference databases [6–8]....

    [...]

Journal ArticleDOI
01 Oct 1986-Ecology
TL;DR: In this article, a new multivariate analysis technique, called canonical correspondence analysis (CCA), was developed to relate community composition to known variation in the environment, where ordination axes are chosen in the light of known environmental variables by imposing the extra restriction that the axes be linear combinations of environmental variables.
Abstract: A new multivariate analysis technique, developed to relate community composition to known variation in the environment, is described. The technique is an extension of correspondence analysis (reciprocal averaging), a popular ordination technique that extracts continuous axes of variation from species occurrence or abundance data. Such ordination axes are typically interpreted with the help of external knowledge and data on environmental variables; this two—step approach (ordination followed by environmental gradient identification) is termed indirect gradient analysis. In the new technique, called canonical correspondence analysis, ordination axes are chosen in the light of known environmental variables by imposing the extra restriction that the axes be linear combinations of environmental variables. In this way community variation can be directly related to environmental variation. The environmental variables may be quantitative or nominal. As many axes can be extracted as there are environmental variables. The method of detrending can be incorporated in the technique to remove arch effects. (Detrended) canonical correspondence analysis is an efficient ordination technique when species have bell—shaped response curves or surfaces with respect to environmental gradients, and is therefore more appropriate for analyzing data on community composition and environmental variables than canonical correlation analysis. The new technique leads to an ordination diagram in which points represent species and sites, and vectors represent environmental variables. Such a diagram shows the patterns of variation in community composition that can be explained best by the environmental variables and also visualizes approximately the "centers" of the species distributions along each of the environmental variables. Such diagrams effectively summarized relationships between community and environment for data sets on hunting spiders, dyke vegetation, and algae along a pollution gradient.

5,689 citations

Related Papers (5)