
Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3

Francesco Beghini [*,1], Lauren J. McIver [*,2], Aitor Blanco-Míguez [1], Leonard Dubois [1], Francesco Asnicar [1], Sagun Maharjan [2,3], Ana Mailyan [2,3], Andrew Maltez Thomas [1], Paolo Manghi [1], Mireia Valles-Colomer [1], George Weingart [2,3], Yancong Zhang [2,3], Moreno Zolfo [1], Curtis Huttenhower [^,2,3], Eric A. Franzosa [^,2,3], Nicola Segata [^,1,4]
1. Department CIBIO, University of Trento, Italy
2. Harvard T.H. Chan School of Public Health, Boston, MA, USA
3. The Broad Institute of MIT and Harvard, Cambridge, MA, USA
4. IEO, European Institute of Oncology IRCCS, Milan, Italy
* Joint first authors
^ Joint senior authors
Correspondence to: chuttenh@hsph.harvard.edu, franzosa@hsph.harvard.edu, nicola.segata@unitn.it
Abstract
Culture-independent analyses of microbial communities have advanced dramatically in the last
decade, particularly due to advances in methods for biological profiling via shotgun metagenomics.
Opportunities for improvement continue to accelerate, with greater access to multi-omics, microbial
reference genomes, and strain-level diversity. To leverage these, we present bioBakery 3, a set of
integrated, improved methods for taxonomic, strain-level, functional, and phylogenetic profiling of
metagenomes newly developed to build on the largest set of reference sequences now available.
Compared to current alternatives, MetaPhlAn 3 increases the accuracy of taxonomic profiling, and
HUMAnN 3 improves that of functional potential and activity. These methods detected novel
disease-microbiome links in applications to CRC (1,262 metagenomes) and IBD (1,635
metagenomes and 817 metatranscriptomes). Strain-level profiling of an additional 4,077
metagenomes with StrainPhlAn 3 and PanPhlAn 3 unraveled the phylogenetic and functional
structure of the common gut microbe Ruminococcus bromii, previously described by only 15 isolate
genomes. With open-source implementations and cloud-deployable reproducible workflows, the
bioBakery 3 platform can help researchers deepen the resolution, scale, and accuracy of
multi-omic profiling for microbial community studies.
bioRxiv preprint doi: https://doi.org/10.1101/2020.11.19.388223; this version posted November 21, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Introduction
Studies of microbial community biology continue to be enriched by the growth of
culture-independent sequencing and high-throughput isolate genomics (Almeida et al., 2020, 2019;
Forster et al., 2019; Parks et al., 2017; Pasolli et al., 2019; Poyet et al., 2019; Zou et al., 2019).
Shotgun metagenomic and metatranscriptomic (i.e. “meta-omic”) measurements can be used to
address an increasing range of questions as diverse as the transmission and evolution of strains in
situ (Asnicar et al., 2017; Ferretti et al., 2018; Truong et al., 2017; Yassour et al., 2018), the
mechanisms of multi-organism biochemical responses in the environment (Alivisatos et al., 2015;
Blaser et al., 2016), or the epidemiology of the human microbiome for biomarkers and therapy
(Gopalakrishnan et al., 2018; Le Chatelier et al., 2013; Thomas et al., 2019; Zeller et al., 2014).
Using such analyses for accurate discovery, however, requires efficient ways to integrate hundreds
of thousands of (potentially fragmentary) isolate genomes with community profiles to detect novel
species and strains, non-bacterial community members, microbial phylogeny and evolution, and
biochemical and molecular signaling mechanisms. Correspondingly, this computational challenge
has necessitated the continued development of platforms for the detailed functional interpretation
of microbial communities.
The past decade of metagenomics has seen remarkable growth both in the biology accessible via
high-throughput sequencing and in the methods for doing so. Beginning with the now-classic
questions of “who’s there?” and “what are they doing?” in microbial ecology (Human Microbiome
Project Consortium, 2012), shotgun metagenomics provide a complementary means of taxonomic
profiling to amplicon-based (e.g. 16S rRNA gene) sequencing, as well as functional profiling of
genes or biochemical pathways (Morgan et al., 2013; Quince et al., 2017; Segata et al., 2013).
More recently, metagenomic functional profiles have been joined by metatranscriptomics to also
capture community regulation of gene expression (Lloyd-Price et al., 2019). Methods have been
developed to focus on all variants of particular taxa of interest within a set of communities (Pasolli
et al., 2019), to discover new variants of gene families or biochemical activities (Franzosa et al.,
2018; Kaminski et al., 2015), or to link the presence and evolution of closely related strains within
or between communities over time, space, and around the globe (Beghini et al., 2017; Karcher et
al., 2020; Tett et al., 2019). Critically, all of these analyses (and the use of the word “microbiome”
throughout this manuscript) are equally applicable to both bacterial and non-bacterial community
members (e.g. viruses and eukaryotes) (Beghini et al., 2017; Olm et al., 2019; Yutin et al., 2018).
Finally, although not addressed in depth by this study, shotgun meta-omics have increasingly also
been combined with other community profiling techniques such as metabolomics (Heinken et al.,
2019; Lloyd-Price et al., 2017; Sun et al., 2018) and proteomics (Xiong et al., 2015) to provide
richer pictures of microbial community membership, function, and ecology.
Methods enabling such analyses of meta-omic sequencing have developed in roughly two
complementary types, either relying on metagenomic assembly or using largely
assembly-independent, reference-based approaches (Quince et al., 2017). The latter is especially
supported by the corresponding growth of fragmentary, draft, and finished microbial isolate
genomes, and their consistent annotation and clustering into genome groups and pan-genomes
(Almeida et al., 2020, 2019; Pasolli et al., 2019). Most such methods focus on addressing a single
profiling task within (most often) metagenomes, such as taxonomic profiling (Lu et al., 2017;
Milanese et al., 2019; Truong et al., 2015; Wood et al., 2019), strain identification (Luo et al., 2015;
Nayfach et al., 2016; Scholz et al., 2016; Truong et al., 2017), or functional profiling (Franzosa et
al., 2018; Kaminski et al., 2015; Nayfach et al., 2015; Nazeen et al., 2020). In a few cases,
platforms such as the bioBakery (McIver et al., 2018), QIIME 2 (Bolyen et al., 2019), or MEGAN
(Mitra et al., 2011) integrate several such methods within an overarching environment. While not a
primary focus of this study, metagenomic assembly methods enabling the former types of analyses
(e.g. novel organism discovery or gene cataloging (Lesker et al., 2020; Stewart et al., 2019)) have
also advanced tremendously (Li et al., 2015; Nurk et al., 2017) and are now reaching a point of
integrating microbial community and isolate genomics, particularly for phylogeny (Asnicar et al.,
2020; Zhu et al., 2019). These efforts have also led to increased consistency in microbial
systematics and phylogeny, facilitating the types of automated, high-throughput analyses
necessary when manual curation cannot keep up with such rapid growth (Asnicar et al., 2020;
Chaumeil et al., 2019).
Here, to further increase the scope of feasible microbial community studies, we introduce a suite of
updated and expanded computational methods in a new version of the bioBakery platform. The
bioBakery 3 includes updated sequence-level quality control and contaminant depletion guidelines
(KneadData), MetaPhlAn 3 for taxonomic profiling, HUMAnN 3 for functional profiling, StrainPhlAn
3 and PanPhlAn 3 for nucleotide- and gene-variant-based strain profiling, and PhyloPhlAn 3 for
phylogenetic placement and putative taxonomic assignment of new assemblies (metagenomic or
isolate). Most of these tools leverage an updated ChocoPhlAn 3 database of systematically
organized and annotated microbial genomes and gene family clusters, newly derived from
UniProt/UniRef (Suzek et al., 2007) and NCBI (NCBI Resource Coordinators, 2014). Our
quantitative evaluations show each individual tool to be more accurate and, typically, more efficient
than its previous version and other comparable methods, increasing sensitivity and specificity by
sometimes more than 2-fold (e.g. in non-human-associated microbial communities). Biomarker
identifications in 1,262 colorectal cancer (CRC) metagenomes, 1,635 inflammatory bowel disease
(IBD) metagenomes, and 817 metatranscriptomes show both the platform’s efficiency and its ability
to detect hundreds of species and thousands of gene families not previously profiled. Finally, in
4,077 human gut metagenomes containing Ruminococcus bromii, the bioBakery 3 platform permits
an initial integration of assembly- and reference-based metagenomics, discovering a novel
biogeographical and functional structure within the clade’s evolution and global distribution. All
components are available as open-source implementations with documentation, source code, and
workflows enabling provenance, reproducibility, and local or cloud deployment at
http://segatalab.cibio.unitn.it/tools/biobakery and http://huttenhower.sph.harvard.edu/biobakery.

Results
The bioBakery provides a complete meta-omic tool suite and analysis environment, including
methods for individual meta-omic (and other microbial community) processing steps, downstream
statistics, integrated reproducible workflows, standardized packaging and documentation via
open-source repositories (GitHub, Conda, PyPI, and R/Bioconductor), grid- and cloud-deployable
images (AWS, GCP, and Docker), online training material and demonstration data, and a public
community support forum. For any sample set, quality control, taxonomic profiling, functional
profiling, strain profiling, and resulting data products and reports can all be generated with a single
workflow, while maintaining version control and provenance logging. All of the methods
themselves, the associated training material, quality control using KneadData, and packaging for
distribution and use have been updated in this version. For example, Docker images have been
scaled down in size to optimize use in cloud environments, and workflows have been ported to
AWS (Amazon Web Services) Batch and Terra/Cromwell (Google Compute Engine) to reduce
costs through the use of spot and preemptible instances, respectively. All base images and
dependencies have been updated as well, including the most recent Python (v3.7+) and R (v4.0+,
see Methods). New and updated documentation of all tools, including detailed instructions on
installation in different environments and package managers, is available at
http://huttenhower.sph.harvard.edu/biobakery.
High-quality reference sequences for improved meta-omic profiling
The majority of methods within the bioBakery 3 suite leverage a newly-updated reference genome
and gene cataloging procedure, the results of which are packaged as ChocoPhlAn 3 (Fig. 1A)
(McIver et al., 2018). ChocoPhlAn uses publicly available genomes and standardized gene calls
and gene families to generate markers for taxonomic and strain-level profiling of metagenomes
with MetaPhlAn 3, StrainPhlAn 3, and PanPhlAn 3, phylogenetic profiling of genomes and MAGs
with PhyloPhlAn 3, and functional profiling of metagenomes with HUMAnN 3.
ChocoPhlAn 3 is based on a genomic repository of 99.2k high-quality, fully annotated reference
microbial genomes from 16.8k species available in the UniProt Proteomes portal as of January
2019 (UniProt Consortium, 2019) and the corresponding functionally-annotated 87.3M UniRef90
gene families (Suzek et al., 2015). From this resource, ChocoPhlAn initially generates annotated
species-level pangenomes associating each microbial species with its sequenced genomes and
repertoire of UniRef-based gene (nucleotide) and protein (amino acid sequence) families. These
pangenomes provide a uniform shared resource for subsequent profiling across bioBakery 3.
HUMAnN 3 and PanPhlAn 3 are directly based on complete pangenomes for overall functional and
strain profiling, whereas other tools use additional information annotated onto the catalog.
PhyloPhlAn 3 focuses on the subset of conserved core gene families (i.e. present in almost all
strains of a species) for inferring accurate phylogenies, and both MetaPhlAn 3 and StrainPhlAn 3
further refine core gene families into species-specific unique gene families to generate
unambiguous markers for metagenomic species identification and strain-level genetic
characterization.
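
The distinction between pangenome, core, and species-specific unique gene families can be illustrated with a minimal Python sketch. It is not the ChocoPhlAn 3 implementation: the function names, the toy data, and the 0.9 prevalence cutoff standing in for "present in almost all strains" are assumptions made only for illustration.

```python
from collections import defaultdict

def core_families(genomes, min_fraction=0.9):
    """Families found in at least `min_fraction` of one species' genomes.

    `genomes` maps a genome ID to the set of gene-family IDs it carries.
    The 0.9 cutoff is an illustrative assumption, not a documented value.
    """
    counts = defaultdict(int)
    for families in genomes.values():
        for fam in set(families):
            counts[fam] += 1
    needed = min_fraction * len(genomes)
    return {fam for fam, n in counts.items() if n >= needed}

def unique_markers(pangenomes, min_fraction=0.9):
    """Core families of each species that occur in no other species' pangenome."""
    cores = {sp: core_families(g, min_fraction) for sp, g in pangenomes.items()}
    all_families = {sp: set().union(*g.values()) for sp, g in pangenomes.items()}
    markers = {}
    for sp, core in cores.items():
        others = set()
        for other_sp, fams in all_families.items():
            if other_sp != sp:
                others |= fams
        markers[sp] = core - others
    return markers

# Toy pangenomes: species -> genome -> gene families (hypothetical IDs).
pangenomes = {
    "species_A": {"g1": {"F1", "F2", "F3"}, "g2": {"F1", "F2"}, "g3": {"F1", "F2", "F4"}},
    "species_B": {"g4": {"F2", "F5"}, "g5": {"F2", "F5", "F6"}},
}
print(unique_markers(pangenomes))
```

In this toy example F2 is core to both species and therefore cannot serve as an unambiguous marker for either, which is the reason for the additional uniqueness filter.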

Figure 1: bioBakery 3 includes new microbial community profiling approaches that outperform previous
versions and current methods. (A) The newly developed ChocoPhlAn 3 consolidates, quality controls, and annotates
isolate-derived reference sequences to enable metagenomic profiling in subsequent bioBakery methods. (*The 1.1M
MetaPhlAn 3 markers also comprise 61.8k viral markers from MetaPhlAn 2 (Truong et al., 2015); other version
descriptions in (Asnicar et al., 2020; Scholz et al., 2016; Truong et al., 2017)) (B) MetaPhlAn 3 was applied to a set of
113 total evaluation datasets provided by CAMI (Fritz et al., 2019) representing diverse human-associated microbiomes
and 5 datasets of non-human-associated microbiomes (Table S1). MetaPhlAn 3 showed increased performance
compared with the previous version MetaPhlAn 2 (Truong et al., 2015), mOTUs2 (Milanese et al., 2019), and Bracken
2.5 (Lu et al., 2017). We report here the F1 scores (harmonic mean of the species-level precision and recall, see Fig. S1
for other evaluation scores). (C) MetaPhlAn 3 better recapitulates relative abundance profiles both from human and
murine gastrointestinal metagenomes as well as from non-human-associated communities compared to the other currently
available tools (full results in Fig. S1). Bracken is reported both using its original estimates based on the fraction of reads
assigned to each taxon and after re-normalizing them using the genome lengths of the taxa in the gold standard to match
the taxa abundance estimate of the other tools. (D) Compared with HUMAnN 2 (Franzosa et al., 2018) and Carnelian
(Nazeen et al., 2020), HUMAnN 3 produces more accurate estimates of EC abundances and displays a higher species
true positive rate compared to HUMAnN 2.
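
The two evaluation quantities named in the caption, the species-level F1 score and the re-normalization of read fractions by genome length, can be sketched as follows. This is an illustrative sketch rather than the evaluation code used in the study; the function names and example values are assumptions.

```python
def precision_recall_f1(predicted, gold):
    """Species-level precision, recall, and F1 (their harmonic mean)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def length_normalize(read_fractions, genome_lengths):
    """Convert read fractions to taxon abundances.

    Each read fraction is divided by the taxon's genome length, so that a
    large genome does not look more abundant merely because it yields more
    reads; the result is rescaled to sum to 1.
    """
    per_base = {t: f / genome_lengths[t] for t, f in read_fractions.items()}
    total = sum(per_base.values())
    return {t: v / total for t, v in per_base.items()}

# Hypothetical profile: 2 of 3 predicted species are in a 4-species gold standard.
p, r, f1 = precision_recall_f1({"A", "B", "C"}, {"A", "B", "D", "E"})

# Two taxa with equal read fractions but a 2-fold difference in genome length.
abund = length_normalize({"X": 0.5, "Y": 0.5}, {"X": 2_000_000, "Y": 4_000_000})
```

With equal read fractions, the taxon with the smaller genome ends up with twice the estimated abundance, which is why read-fraction and abundance estimates are not directly comparable across tools.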
Frequently Asked Questions (13)
Q1. What contributions have the authors mentioned in the paper "Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3"?

To leverage these, the authors present bioBakery 3, a set of integrated, improved methods for taxonomic, strain-level, functional, and phylogenetic profiling of metagenomes newly developed to build on the largest set of reference sequences now available. Compared to current alternatives, MetaPhlAn 3 increases the accuracy of taxonomic profiling, and HUMAnN 3 improves that of functional potential and activity.

The bioBakery 3 begins to overcome this challenge by combining a greatly expanded set of reference sequences with ways of “falling back” gracefully when encountering new sequences, while also paving the way for further integration of assembly-based discovery in the future (discussed below). The authors thus anticipate improved integration of reference- and assembly-based meta-omic analyses to be one of the main areas of future development for the bioBakery, along with expanded methods for other types of multi-omics in addition to transcription. In addition to making a novel sub-species phylogenetic and biogeographic structure apparent, the combination of MetaPhlAn, HUMAnN, PanPhlAn, StrainPhlAn, and PhyloPhlAn together confirmed that most R. bromii strains are “personal” (i.e. specific to and retained within individuals, like most microbiome members), rarely transmissible across hosts, and that genomic differences characterize each subspecies (suggesting a degree of functional adaptation and specialization). These components of metagenomes (and, for RNA viruses, metatranscriptomes) are often measured with surprising heterogeneity during the initial generation of sequencing data themselves (Zolfo et al., 2019), suggesting necessary improvements in analytical quality control and normalization as well.

To further improve the quality of the read mapping, the authors adopted quality controls before and after mapping by discarding low-quality sequences and alignments (reads shorter than 70 bp and alignments with a MAPQ value less than 5).
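
As a rough illustration of these two thresholds, the sketch below applies them to tab-separated SAM records using only the standard library. It is not the tool's own code: the function names and the demo records are hypothetical, and applying both filters in a single pass over the alignments (rather than one before and one after mapping) is a simplification.

```python
def passes_filters(sam_line, min_read_length=70, min_mapq=5):
    """True if a SAM alignment record meets both thresholds.

    In the SAM format, column 5 (index 4) is MAPQ and column 10 (index 9)
    is the read sequence. Records whose sequence is stored as '*' would be
    treated as length 1 here and dropped.
    """
    fields = sam_line.rstrip("\n").split("\t")
    mapq = int(fields[4])
    read_length = len(fields[9])
    return read_length >= min_read_length and mapq >= min_mapq

def filter_sam(lines, min_read_length=70, min_mapq=5):
    """Yield header lines unchanged and only the alignments that pass."""
    for line in lines:
        if line.startswith("@") or passes_filters(line, min_read_length, min_mapq):
            yield line

def make_record(name, mapq, length):
    """Build a minimal SAM record for the demo (hypothetical helper)."""
    return "\t".join(
        [name, "0", "marker_1", "1", str(mapq), f"{length}M", "*", "0", "0",
         "A" * length, "I" * length]
    )

records = [
    "@HD\tVN:1.6",
    make_record("good", 42, 100),      # kept
    make_record("short", 42, 50),      # dropped: shorter than 70 bp
    make_record("ambiguous", 3, 100),  # dropped: MAPQ below 5
]
kept = list(filter_sam(records))
```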

By default, consensus markers reconstructed with fewer than 8 reads or with a breadth of coverage (i.e. fraction of the marker covered by reads) lower than 80% are discarded (“--breadth_threshold” parameter).
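
The default rule described here can be written as a small predicate. The sketch below is an assumption-laden illustration, not the StrainPhlAn code: the function names, the per-position depth representation, the `min_reads` parameter name, and the expression of the threshold as a fraction (0.80) are all choices made for this example.

```python
def breadth_of_coverage(depths):
    """Fraction of marker positions covered by at least one read."""
    if not depths:
        return 0.0
    return sum(1 for d in depths if d > 0) / len(depths)

def keep_consensus_marker(n_reads, depths, min_reads=8, breadth_threshold=0.80):
    """Keep a marker only if it has enough reads and enough breadth.

    Mirrors the stated defaults: markers with fewer than 8 reads or with
    breadth below 80% are discarded.
    """
    return n_reads >= min_reads and breadth_of_coverage(depths) >= breadth_threshold

# A 100-position marker with 85 covered positions and 10 supporting reads.
well_covered = [1] * 85 + [0] * 15
# The same marker with only 70 covered positions.
patchy = [1] * 70 + [0] * 30

print(keep_consensus_marker(10, well_covered))  # kept
print(keep_consensus_marker(10, patchy))        # discarded: breadth below 80%
print(keep_consensus_marker(7, well_covered))   # discarded: fewer than 8 reads
```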

A final area of improvement for the bioBakery, relatedly, is the increased integration between reference-based and assembly-based approaches - begun here via PhyloPhlAn 3 - in order to better leverage MAGs (Almeida et al., 2020), SGBs (Pasolli et al., 2019), and novel gene families. 

The inclusion of DNA abundance as a covariate in the above model accounts for the strong dependence between a function’s gene (metagenomic) copy number and its metatranscriptomic abundance. 

The authors constructed additional synthetic metagenomes by sampling sequencing reads from curated microbial genome sets using ART (Huang et al., 2012) with an Illumina HiSeq 2500 error model. 

Feedback on any aspect of the methods or their applications in diverse host-associated or environmental microbiome settings can be submitted at https://forum.biobakery.org, and the authors hope the bioBakery will continue to provide a flexible, convenient, reproducible, and accurate discovery platform for microbial community biology. 

To obtain a nucleotide representation of each pan-proteome, the authors identified a representative of the cluster for each pan-protein by selecting a UniProtKB or UniParc entry taxonomically assigned to the desired species. 

As in previous studies (Pasolli et al., 2016; Thomas et al., 2019), RFs using functional features performed similarly (0.69 Cross Validation and 0.71 LODO ROC AUC on pathways relative abundance), indicating a tight link between strain-specific taxonomy and gene carriage in this setting. 
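
The two evaluation settings mentioned here differ in how samples are held out. The sketch below illustrates leave-one-dataset-out (LODO) splitting and a rank-based ROC AUC in plain Python. The random forest classifier itself is omitted, and the function names and toy data are assumptions for illustration only.

```python
def lodo_splits(sample_to_dataset):
    """Yield (held_out_dataset, train_samples, test_samples).

    Each dataset in turn is held out for testing while all remaining
    datasets are used for training.
    """
    for held_out in sorted(set(sample_to_dataset.values())):
        train = sorted(s for s, d in sample_to_dataset.items() if d != held_out)
        test = sorted(s for s, d in sample_to_dataset.items() if d == held_out)
        yield held_out, train, test

def roc_auc(scores, labels):
    """ROC AUC as the probability that a random positive outscores a random negative.

    Ties count as one half. Labels are 1 (case) or 0 (control).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute an AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical cohort assignment of six samples across three datasets.
cohorts = {"s1": "study_A", "s2": "study_A", "s3": "study_B",
           "s4": "study_B", "s5": "study_C", "s6": "study_C"}
splits = list(lodo_splits(cohorts))

# Hypothetical classifier scores for four held-out samples.
auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

Because no sample from the held-out dataset is seen during training, a LODO AUC close to the cross-validation AUC suggests that the signal generalizes across cohorts rather than reflecting study-specific effects.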

The relative absence of biomarkers for active UC may result both from its generally more benign phenotype (Lloyd-Price et al., 2019) and from the smaller number of active UC samples (n=23) compared with active CD samples (n=76); as a result, the authors focused their subsequent analyses on expression differences within the CD subcohort. 

The regular expressions used to filter low-quality taxonomic annotations are: “(C|c)andidat(e|us) | _sp(_.*|$) | (.*_|^)(b|B)acterium(_.*|) | .*(eury|)archaeo(n_|te|n$).* | .*(endo|)symbiont.* | .*genomosp_.* | .*unidentified.* | .*_bacteria_.* | .*_taxon_.* | .*_et_al_.* | .*_and_.* | .*(cyano|proteo|actino)bacterium_.*”
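
To show how such patterns behave, the sketch below compiles them with Python's `re` module and applies them to a few species labels. Several points are assumptions rather than documented behavior: that the spaces around each `|` separator are typographic, that the patterns are applied with search semantics, and that species names use underscores in place of spaces. The function name and the example labels are hypothetical.

```python
import re

# The alternatives listed above, one per entry, with separator spaces removed.
PATTERNS = [
    r"(C|c)andidat(e|us)",
    r"_sp(_.*|$)",
    r"(.*_|^)(b|B)acterium(_.*|)",
    r".*(eury|)archaeo(n_|te|n$).*",
    r".*(endo|)symbiont.*",
    r".*genomosp_.*",
    r".*unidentified.*",
    r".*_bacteria_.*",
    r".*_taxon_.*",
    r".*_et_al_.*",
    r".*_and_.*",
    r".*(cyano|proteo|actino)bacterium_.*",
]
LOW_QUALITY = re.compile("|".join(f"(?:{p})" for p in PATTERNS))

def is_low_quality(name):
    """True if a taxonomic label matches any of the low-quality patterns."""
    return LOW_QUALITY.search(name) is not None

labels = [
    "Candidatus_Pelagibacter_ubique",
    "Clostridium_sp_7_2_43FAA",
    "Lachnospiraceae_bacterium_3_1_57FAA",
    "Escherichia_coli",
    "Bifidobacterium_longum",
]
retained = [name for name in labels if not is_low_quality(name)]
```

Note that `_sp(_.*|$)` requires the `_sp` token to be followed by an underscore or the end of the label, so a fully named species whose epithet merely begins with "sp" is not flagged.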

Human and murine synthetic metagenomes and gold standards provided by the CAMI Challenge are available at https://data.cami-challenge.org/participate. Non-human synthetic metagenomes and gold standards are available at http://segatalab.cibio.unitn.it/tools/biobakery/.