
Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3

- 21 Nov 2020 -

Abstract:

Culture-independent analyses of microbial communities have advanced dramatically in the last decade, particularly due to advances in methods for biological profiling via shotgun metagenomics. Opportunities for improvement continue to accelerate, with greater access to multi-omics, microbial reference genomes, and strain-level diversity. To leverage these, we present bioBakery 3, a set of integrated, improved methods for taxonomic, strain-level, functional, and phylogenetic profiling of metagenomes newly developed to build on the largest set of reference sequences now available. Compared to current alternatives, MetaPhlAn 3 increases the accuracy of taxonomic profiling, and HUMAnN 3 improves that of functional potential and activity. These methods detected novel disease-microbiome links in applications to CRC (1,262 metagenomes) and IBD (1,635 metagenomes and 817 metatranscriptomes). Strain-level profiling of an additional 4,077 metagenomes with StrainPhlAn 3 and PanPhlAn 3 unraveled the phylogenetic and functional structure of the common gut microbe Ruminococcus bromii, previously described by only 15 isolate genomes. With open-source implementations and cloud-deployable reproducible workflows, the bioBakery 3 platform can help researchers deepen the resolution, scale, and accuracy of multi-omic profiling for microbial community studies.

Francesco Beghini (*,1), Lauren J. McIver (*,2), Aitor Blanco-Míguez (1), Leonard Dubois (1), Francesco Asnicar (1), Sagun Maharjan (2,3), Ana Mailyan (2,3), Andrew Maltez Thomas (1), Paolo Manghi (1), Mireia Valles-Colomer (1), George Weingart (2,3), Yancong Zhang (2,3), Moreno Zolfo (1), Curtis Huttenhower (^,2,3), Eric A. Franzosa (^,2,3), Nicola Segata (^,1,4)

1. Department CIBIO, University of Trento, Italy

2. Harvard T.H. Chan School of Public Health, Boston, MA, USA

3. The Broad Institute of MIT and Harvard, Cambridge, MA, USA

4. IEO, European Institute of Oncology IRCCS, Milan, Italy

* Joint first authors

^ Joint senior authors

Correspondence to: chuttenh@hsph.harvard.edu, franzosa@hsph.harvard.edu, nicola.segata@unitn.it

bioRxiv preprint doi: https://doi.org/10.1101/2020.11.19.388223; this version posted November 21, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

 
Introduction 
Studies of microbial community biology continue to be enriched by the growth of                         
culture-independent sequencing and high-throughput isolate genomics (Almeida et al., 2020, 2019;                     
Forster et al., 2019; Parks et al., 2017; Pasolli et al., 2019; Poyet et al., 2019; Zou et al., 2019).                                       
Shotgun metagenomic and metatranscriptomic (i.e. “meta-omic”) measurements can be used to                     
address an increasing range of questions as diverse as the transmission and evolution of strains in                               
situ (Asnicar et al., 2017; Ferretti et al., 2018; Truong et al., 2017; Yassour et al., 2018), the                                   
mechanisms of multi-organism biochemical responses in the environment (Alivisatos et al., 2015;                       
Blaser et al., 2016), or the epidemiology of the human microbiome for biomarkers and therapy                             
(Gopalakrishnan et al., 2018; Le Chatelier et al., 2013; Thomas et al., 2019; Zeller et al., 2014).                                 
Using such analyses for accurate discovery, however, requires efficient ways to integrate hundreds                         
of thousands of (potentially fragmentary) isolate genomes with community profiles to detect novel                         
species and strains, non-bacterial community members, microbial phylogeny and evolution, and                     
biochemical and molecular signaling mechanisms. Correspondingly, this computational challenge                 
has necessitated the continued development of platforms for the detailed functional interpretation                       
of microbial communities. 
The past decade of metagenomics has seen remarkable growth both in the biology accessible via                             
high-throughput sequencing and in the methods for doing so. Beginning with the now-classic                         
questions of “who’s there?” and “what are they doing?” in microbial ecology (Human Microbiome                           
Project Consortium, 2012), shotgun metagenomics provides a complementary means of taxonomic
profiling to amplicon-based (e.g. 16S rRNA gene) sequencing, as well as functional profiling of                           
genes or biochemical pathways (Morgan et al., 2013; Quince et al., 2017; Segata et al., 2013).                               
More recently, metagenomic functional profiles have been joined by metatranscriptomics to also                       
capture community regulation of gene expression (Lloyd-Price et al., 2019). Methods have been                         
developed to focus on all variants of particular taxa of interest within a set of communities (Pasolli                                 
et al., 2019), to discover new variants of gene families or biochemical activities (Franzosa et al.,                               
2018; Kaminski et al., 2015), or to link the presence and evolution of closely related strains within                                 
or between communities over time, space, and around the globe (Beghini et al., 2017; Karcher et                               
al., 2020; Tett et al., 2019). Critically, all of these analyses (and the use of the word “microbiome”                                   
throughout this manuscript) are equally applicable to both bacterial and non-bacterial community                       
members (e.g. viruses and eukaryotes) (Beghini et al., 2017; Olm et al., 2019; Yutin et al., 2018).                                 
Finally, although not addressed in depth by this study, shotgun meta-omics have increasingly also                           
been combined with other community profiling techniques such as metabolomics (Heinken et al.,                         
2019; Lloyd-Price et al., 2017; Sun et al., 2018) and proteomics (Xiong et al., 2015) to provide                                 
richer pictures of microbial community membership, function, and ecology. 
Methods enabling such analyses of meta-omic sequencing have developed in roughly two                       
complementary types, either relying on metagenomic assembly or using largely                   
assembly-independent, reference-based approaches (Quince et al., 2017). The latter is especially                     
supported by the corresponding growth of fragmentary, draft, and finished microbial isolate                       
genomes, and their consistent annotation and clustering into genome groups and pan-genomes                       
(Almeida et al., 2020, 2019; Pasolli et al., 2019). Most such methods focus on addressing a single                                 
profiling task within (most often) metagenomes, such as taxonomic profiling (Lu et al., 2017;                           
Milanese et al., 2019; Truong et al., 2015; Wood et al., 2019), strain identification (Luo et al., 2015;                                   
Nayfach et al., 2016; Scholz et al., 2016; Truong et al., 2017), or functional profiling (Franzosa et                                 
al., 2018; Kaminski et al., 2015; Nayfach et al., 2015; Nazeen et al., 2020). In a few cases,                                   
platforms such as the bioBakery (McIver et al., 2018), QIIME 2 (Bolyen et al., 2019), or MEGAN
(Mitra et al., 2011) integrate several such methods within an overarching environment. While not a
primary focus of this study, metagenomic assembly methods enabling the former types of analyses
(e.g. novel organism discovery or gene cataloging (Lesker et al., 2020; Stewart et al., 2019)) have
also advanced tremendously (Li et al., 2015; Nurk et al., 2017) and are now reaching a point of
integrating microbial community and isolate genomics, particularly for phylogeny (Asnicar et al.,
2020; Zhu et al., 2019). These efforts have also led to increased consistency in microbial
systematics and phylogeny, facilitating the types of automated, high-throughput analyses
necessary when manual curation cannot keep up with such rapid growth (Asnicar et al., 2020;
Chaumeil et al., 2019).

Here, to further increase the scope of feasible microbial community studies, we introduce a suite of
updated and expanded computational methods in a new version of the bioBakery platform. The
bioBakery 3 includes updated sequence-level quality control and contaminant depletion guidelines
(KneadData), MetaPhlAn 3 for taxonomic profiling, HUMAnN 3 for functional profiling, StrainPhlAn
3 and PanPhlAn 3 for nucleotide- and gene-variant-based strain profiling, and PhyloPhlAn 3 for
phylogenetic placement and putative taxonomic assignment of new assemblies (metagenomic or
isolate). Most of these tools leverage an updated ChocoPhlAn 3 database of systematically
organized and annotated microbial genomes and gene family clusters, newly derived from
UniProt/UniRef (Suzek et al., 2007) and NCBI (NCBI Resource Coordinators, 2014). Our
quantitative evaluations show each individual tool to be more accurate and, typically, more efficient
than its previous version and other comparable methods, increasing sensitivity and specificity by
sometimes more than 2-fold (e.g. in non-human-associated microbial communities). Biomarker
identifications in 1,262 colorectal cancer (CRC) metagenomes, 1,635 inflammatory bowel disease
(IBD) metagenomes, and 817 metatranscriptomes show both the platform’s efficiency and its ability
to detect hundreds of species and thousands of gene families not previously profiled. Finally, in
4,077 human gut metagenomes containing Ruminococcus bromii, the bioBakery 3 platform permits
an initial integration of assembly- and reference-based metagenomics, discovering a novel
biogeographical and functional structure within the clade’s evolution and global distribution. All
components are available as open-source implementations with documentation, source code, and
workflows enabling provenance, reproducibility, and local or cloud deployment at
http://segatalab.cibio.unitn.it/tools/biobakery and http://huttenhower.sph.harvard.edu/biobakery.

Results

The bioBakery provides a complete meta-omic tool suite and analysis environment, including
methods for individual meta-omic (and other microbial community) processing steps, downstream
statistics, integrated reproducible workflows, standardized packaging and documentation via
open-source repositories (GitHub, Conda, PyPI, and R/Bioconductor), grid- and cloud-deployable
images (AWS, GCP, and Docker), online training material and demonstration data, and a public
community support forum. For any sample set, quality control, taxonomic profiling, functional
profiling, strain profiling, and resulting data products and reports can all be generated with a single
workflow, while maintaining version control and provenance logging. All of the methods
themselves, the associated training material, quality control using KneadData, and packaging for
distribution and use have been updated in this version. For example, Docker images have been
scaled down in size to optimize use in cloud environments, and workflows have been ported to
AWS (Amazon Web Services) Batch and Terra/Cromwell (Google Compute Engine) to reduce
costs through the use of spot and preemptible instances, respectively. All base images and
dependencies have been updated as well, including the most recent Python (v3.7+) and R (v4.0+,
see Methods). New and updated documentation of all tools, including detailed instructions on
installation in different environments and package managers, is available at
http://huttenhower.sph.harvard.edu/biobakery.

High-quality reference sequences for improved meta-omic profiling

The majority of methods within the bioBakery 3 suite leverage a newly-updated reference genome
and gene cataloging procedure, the results of which are packaged as ChocoPhlAn 3 (Fig. 1A)
(McIver et al., 2018). ChocoPhlAn uses publicly available genomes and standardized gene calls
and gene families to generate markers for taxonomic and strain-level profiling of metagenomes
with MetaPhlAn 3, StrainPhlAn 3, and PanPhlAn 3, phylogenetic profiling of genomes and MAGs
with PhyloPhlAn 3, and functional profiling of metagenomes with HUMAnN 3.

ChocoPhlAn 3 is based on a genomic repository of 99.2k high-quality, fully annotated reference
microbial genomes from 16.8k species available in the UniProt Proteomes portal as of January
2019 (UniProt Consortium, 2019) and the corresponding functionally-annotated 87.3M UniRef90
gene families (Suzek et al., 2015). From this resource, ChocoPhlAn initially generates annotated
species-level pangenomes associating each microbial species with its sequenced genomes and
repertoire of UniRef-based gene (nucleotide) and protein (amino acid sequence) families. These
pangenomes provide a uniform shared resource for subsequent profiling across bioBakery 3.
HUMAnN 3 and PanPhlAn 3 are directly based on complete pangenomes for overall functional and
strain profiling, whereas other tools use additional information annotated onto the catalog.
PhyloPhlAn 3 focuses on the subset of conserved core gene families (i.e. present in almost all
strains of a species) for inferring accurate phylogenies, and both MetaPhlAn 3 and StrainPhlAn 3
further refine core gene families into species-specific unique gene families to generate
unambiguous markers for metagenomic species identification and strain-level genetic
characterization.
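
The relationship between pangenome, core, and species-specific marker families can be illustrated with a small sketch. This is not the ChocoPhlAn 3 implementation: the function names, the toy family identifiers, and the 90% prevalence cutoff standing in for "almost all strains" are assumptions made purely for illustration.

```python
from collections import defaultdict


def core_families(genome_families, min_fraction=0.9):
    """Return families found in at least `min_fraction` of a species' genomes.

    `genome_families` maps a genome ID to the set of gene-family IDs it carries.
    The 0.9 default is an illustrative assumption, not a value from the paper.
    """
    counts = defaultdict(int)
    for families in genome_families.values():
        for family in set(families):
            counts[family] += 1
    n_genomes = len(genome_families)
    return {f for f, c in counts.items() if c / n_genomes >= min_fraction}


def unique_markers(species_pangenomes, min_fraction=0.9):
    """Return, per species, core families absent from every other species."""
    # Pangenome of a species = union of the families of all its genomes.
    pangenome = {
        species: set().union(*genomes.values())
        for species, genomes in species_pangenomes.items()
    }
    markers = {}
    for species, genomes in species_pangenomes.items():
        core = core_families(genomes, min_fraction)
        others = set().union(
            *(pangenome[other] for other in pangenome if other != species)
        )
        markers[species] = core - others
    return markers


# Toy input (hypothetical species, genomes, and family IDs).
toy = {
    "species_A": {"gA1": {"f1", "f2", "f3"}, "gA2": {"f1", "f2"}},
    "species_B": {"gB1": {"f2", "f4"}, "gB2": {"f2", "f4", "f5"}},
}
toy_markers = unique_markers(toy)
```

In this toy input, `f2` is core to both species and is therefore excluded as a marker, while `f3` and `f5` are accessory families that a pangenome-based tool would still retain.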

Figure 1: bioBakery 3 includes new microbial community profiling approaches that outperform previous
versions and current methods. (A) The newly developed ChocoPhlAn 3 consolidates, quality controls, and annotates
isolate-derived reference sequences to enable metagenomic profiling in subsequent bioBakery methods. (*The 1.1M
MetaPhlAn 3 markers also include 61.8k viral markers from MetaPhlAn 2 (Truong et al., 2015); other version
descriptions in (Asnicar et al., 2020; Scholz et al., 2016; Truong et al., 2017)) (B) MetaPhlAn 3 was applied to a set of
113 total evaluation datasets provided by CAMI (Fritz et al., 2019) representing diverse human-associated microbiomes
and 5 datasets of non-human-associated microbiomes (Table S1). MetaPhlAn 3 showed increased performance
compared with the previous version MetaPhlAn 2 (Truong et al., 2015), mOTUs2 (Milanese et al., 2019), and Bracken
2.5 (Lu et al., 2017). We report here the F1 scores (harmonic mean of the species-level precision and recall, see Fig. S1
for other evaluation scores). (C) MetaPhlAn 3 better recapitulates relative abundance profiles both from human and
murine gastrointestinal metagenomes as well as from non-human-associated communities compared to the other currently
available tools (full results in Fig. S1). Bracken is reported both using its original estimates based on the fraction of reads
assigned to each taxon and after re-normalizing them using the genome lengths of the taxa in the gold standard to match
the taxa abundance estimate of the other tools. (D) Compared with HUMAnN 2 (Franzosa et al., 2018) and Carnelian
(Nazeen et al., 2020), HUMAnN 3 produces more accurate estimates of EC abundances and displays a higher species
true positive rate compared to HUMAnN 2.
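
The two evaluation quantities named in this caption, the species-level F1 score and the genome-length re-normalization of read-fraction estimates, can be sketched as follows. This is a minimal illustration rather than the benchmarking code used for the figure; the function names and toy inputs are assumptions.

```python
def precision_recall_f1(predicted, gold):
    """Species-level precision, recall, and F1 from two collections of species names."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def length_normalized_abundance(read_fractions, genome_lengths):
    """Convert per-taxon read fractions into taxon abundances.

    Each read fraction is divided by the taxon's genome length, and the
    results are rescaled to sum to 1 (assumes at least one nonzero fraction).
    """
    scaled = {t: f / genome_lengths[t] for t, f in read_fractions.items()}
    total = sum(scaled.values())
    return {t: v / total for t, v in scaled.items()}


# Toy example: 2 of 3 predicted species are correct; 2 of 4 gold species are found.
p, r, f1 = precision_recall_f1({"A", "B", "C"}, {"A", "B", "D", "E"})

# Two taxa with equal read shares; the taxon with the smaller genome is more abundant.
abund = length_normalized_abundance(
    {"X": 0.5, "Y": 0.5}, {"X": 2_000_000, "Y": 4_000_000}
)
```

Here the precision is 2/3, the recall is 1/2, and the F1 score is 4/7; taxon X, whose genome is half the length of Y's, receives two thirds of the normalized abundance.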


Figures

Figure 3: Longitudinal taxonomic and functional meta-omics of IBD. (A) Comparison of MetaPhlAn and HUMAnN profiles of IBDMDB metagenomes and metatranscriptomes using v2 and v3 software (sequencing data and v2 profiles downloaded from http://ibdmdb.org ). (B) >500 Enzyme Commission (EC) families were significantly [linear mixed-effects

Figure 4: Population-scale strain-level phylogenetic and pangenomic analyses of Ruminococcus bromii from over 4,000 human gut metagenomes. (A) StrainPhlAn 3 profiling revealed stratification of Ruminococcus bromii clades with genetic content and variants frequently structured with respect to geographic origin and lifestyle. Genetically divergent subclades were identified, labeled as “Cluster 1” (mainly composed of strains retrieved from Chinese subjects) and a subspecies-like Cluster 2. (B) Strain tracking of R. bromii. While unrelated individuals from diverse populations very rarely share highly genetically similar strains, pairs of related strains are readily detected by StrainPhlAn from longitudinal samples from the same individuals (quantifying short- and medium-term strain retention at about 75%) and in mother-infant pairs (confirming this species is at least partially vertically transmitted). Normalized phylogenetic distances (nPD) were computed on the StrainPhlAn tree. (C) PanPhlAn 3 gene profiles of R. bromii strains from metagenomes highlight the variability and the structure of the accessory genes across datasets (core genes were removed for clarity). A total of 6,151 UniRef90 gene families from the R. bromii pangenome were detected across the 2,679 samples (of 4,077) in which a strain of this species was present at a sufficient abundance to be profiled by PanPhlAn. The 13 highest-rooted gene clusters are shown, highlighting co-occurrence of blocks likely to be functionally related. The most common GO annotations are also reported together with two operons containing genes verified to be on the same locus by analysis of the reference genomes in the PanPhlAn 3 database. (D) Genetic (SNV on marker genes from StrainPhlAn 3) and genomic (gene presence/absence from PanPhlAn 3) distances between R. bromii strains are correlated (Pearson's r=0.632, p-value<2.2e-16), pointing at generally consistent functional divergence in this species.
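
The correlation reported in panel (D) compares two distances computed for the same pairs of strains. A minimal sketch of Pearson's r over such paired values is given below; the function name and the toy distance values are assumptions, and the sketch does not reproduce the p-value calculation or the distance definitions used in the paper.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Raises ZeroDivisionError if either input is constant.
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-strain-pair distances (one entry per pair of strains).
snv_distance = [0.01, 0.02, 0.05, 0.08]
gene_content_distance = [0.10, 0.12, 0.30, 0.33]
r = pearson_r(snv_distance, gene_content_distance)
```

Because every strain contributes to many pairs, the pairwise distances are not independent observations; a Mantel-style permutation test is a common alternative when a significance estimate is needed.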

Figure 2: Meta-analysis with MetaPhlAn 3 and HUMAnN 3 expands taxonomic and functional associations with the CRC microbiome. (A) We considered a total of nine independent datasets (1,262 total samples) that highly but not completely overlap in composition based on ordination (multidimensional scaling) of weighted UniFrac distances (Lozupone and Knight, 2005) computed from the MetaPhlAn 3 species relative abundances. (B) Meta-analysis based on standardized mean differences and a random effects model yielded 11 MetaPhlAn 3 species significantly (Wilcoxon rank-sum test FDR P<0.05) associated with colorectal cancer at effect size>0.35 (see Methods). (C) Species richness is significantly higher in CRC samples compared to control (Wilcoxon rank-sum test P<0.05 in 7/9 datasets), and the expanded MetaPhlAn 3 species catalog detects more species compared to MetaPhlAn 2 (CRC mean median increase 37.1%, controls mean median increase 36.3%). (D) Distribution of cutC gene relative abundance (log10 count-per-million normalized) from HUMAnN 3 gene family profiles supporting the potential link between choline metabolism and CRC (Thomas et al., 2019). (E) Random forest (RF) classification using MetaPhlAn 3 features and HUMAnN 3 features (F) confirms that CRC patients can be predicted at (treatment-naive) baseline from the composition of their gut microbiome with performances reaching ~0.85 cross-validated or leave-one-dataset-out (LODO) ROC AUC (see Methods).
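
The leave-one-dataset-out (LODO) evaluation mentioned in panels (E) and (F) trains on all cohorts but one and tests on the held-out cohort. The splitting step can be sketched as below; the function name and cohort labels are hypothetical, and the classifier itself (a random forest in the paper) is deliberately left out.

```python
def lodo_splits(dataset_labels):
    """Yield (held_out, train_indices, test_indices) for each distinct dataset.

    `dataset_labels[i]` is the cohort that sample i belongs to.
    """
    for held_out in sorted(set(dataset_labels)):
        train = [i for i, d in enumerate(dataset_labels) if d != held_out]
        test = [i for i, d in enumerate(dataset_labels) if d == held_out]
        yield held_out, train, test


# Five samples drawn from three hypothetical cohorts.
labels = ["cohort_A", "cohort_A", "cohort_B", "cohort_C", "cohort_B"]
splits = list(lodo_splits(labels))
```

Each sample appears in exactly one test fold, so a LODO score measures how well a model transfers to a cohort it has never seen, which within-dataset cross-validation does not. In a scikit-learn workflow, `LeaveOneGroupOut` provides the same kind of split.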


Frequently Asked Questions (13)

Q1. What contributions have the authors mentioned in the paper "Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3"?

To leverage these, the authors present bioBakery 3, a set of integrated, improved methods for taxonomic, strain-level, functional, and phylogenetic profiling of metagenomes newly developed to build on the largest set of reference sequences now available. Compared to current alternatives, MetaPhlAn 3 increases the accuracy of taxonomic profiling, and HUMAnN 3 improves that of functional potential and activity.

Q2. What have the authors stated for future works in "Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3"?

The bioBakery 3 begins to overcome this challenge by combining a greatly expanded set of reference sequences with ways of “falling back” gracefully when encountering new sequences, while also paving the way for further integration of assembly-based discovery in the future (discussed below). The authors thus anticipate improved integration of reference- and assembly-based meta-omic analyses to be one of the main areas of future development for the bioBakery, along with expanded methods for other types of multi-omics in addition to transcription. In addition to making a novel sub-species phylogenetic and biogeographic structure apparent, the combination of MetaPhlAn, HUMAnN, PanPhlAn, StrainPhlAn, and PhyloPhlAn together confirmed that most R. bromii strains are “personal” (i.e. specific to and retained within individuals, like most microbiome members), rarely transmissible across hosts, and that genomic differences characterize each subspecies (suggesting a degree of functional adaptation and specialization). These components of metagenomes - and, for RNA viruses, metatranscriptomes - are often measured with surprising heterogeneity during the initial generation of sequencing data themselves (Zolfo et al., 2019), suggesting necessary improvements in analytical quality control and normalization as well.

Q3. What is the way to improve the quality of the read mapping?

To further improve the quality of the read mapping, the authors adopted quality controls before and after mapping by discarding low-quality sequences and alignments (reads shorter than 70bp and alignment with a MAPQ value less than 5).
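
The two thresholds in this answer can be expressed as a simple record filter. The sketch below is not the tool's own code: the `Alignment` record and function names are hypothetical, and a real pipeline would read these fields from SAM/BAM records.

```python
from typing import NamedTuple

MIN_READ_LENGTH = 70  # reads shorter than 70 bp are discarded
MIN_MAPQ = 5          # alignments with MAPQ below 5 are discarded


class Alignment(NamedTuple):
    """Hypothetical minimal view of one read alignment."""
    read_id: str
    read_length: int
    mapq: int


def passes_qc(aln):
    """True if the alignment survives both the length and MAPQ filters."""
    return aln.read_length >= MIN_READ_LENGTH and aln.mapq >= MIN_MAPQ


alignments = [
    Alignment("r1", 100, 42),  # kept
    Alignment("r2", 69, 42),   # too short
    Alignment("r3", 100, 4),   # MAPQ too low
    Alignment("r4", 70, 5),    # exactly at both thresholds: kept
]
kept = [aln.read_id for aln in alignments if passes_qc(aln)]
```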

Q4. What is the default for consensus markers?

By default, consensus markers reconstructed with less than 8 reads or with a breadth of coverage (i.e. fraction of the marker covered by reads) lower than 80% are discarded (“--breadth_threshold” parameter).
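
The default rule described in this answer can be sketched as a predicate over per-marker statistics. The function and argument names are illustrative assumptions, and the breadth cutoff is written here as a fraction (0.80) rather than in whatever units the “--breadth_threshold” option expects.

```python
def keep_consensus_marker(n_reads, covered_positions, marker_length,
                          min_reads=8, breadth_threshold=0.80):
    """Return True if a reconstructed consensus marker should be retained.

    A marker is discarded when it is supported by fewer than `min_reads`
    reads or when its breadth of coverage (fraction of marker positions
    covered by at least one read) is below `breadth_threshold`.
    """
    if marker_length <= 0:
        raise ValueError("marker_length must be positive")
    breadth = covered_positions / marker_length
    return n_reads >= min_reads and breadth >= breadth_threshold


decisions = [
    keep_consensus_marker(8, 800, 1000),    # at both thresholds: kept
    keep_consensus_marker(7, 1000, 1000),   # too few reads: discarded
    keep_consensus_marker(20, 799, 1000),   # breadth 79.9%: discarded
]
```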

Q5. What is the main area of improvement for the bioBakery?

A final area of improvement for the bioBakery, relatedly, is the increased integration between reference-based and assembly-based approaches - begun here via PhyloPhlAn 3 - in order to better leverage MAGs (Almeida et al., 2020), SGBs (Pasolli et al., 2019), and novel gene families.

Q6. What is the significance of the inclusion of DNA abundance in the above model?

The inclusion of DNA abundance as a covariate in the above model accounts for the strong dependence between a function’s gene (metagenomic) copy number and its metatranscriptomic abundance.

Q7. How did the authors construct additional synthetic metagenomes?

The authors constructed additional synthetic metagenomes by sampling sequencing reads from curated microbial genome sets using ART (Huang et al., 2012) with an Illumina HiSeq 2500 error model.

Q8. What is the main goal of the bioBakery?

Feedback on any aspect of the methods or their applications in diverse host-associated or environmental microbiome settings can be submitted at https://forum.biobakery.org, and the authors hope the bioBakery will continue to provide a flexible, convenient, reproducible, and accurate discovery platform for microbial community biology.

Q9. How did the authors identify the representative of each pan-proteome?

To obtain a nucleotide representation of each pan-proteome, the authors identified a representative of the cluster for each pan-protein by selecting a UniProtKB or UniParc entry taxonomically assigned to the desired species.

Q10. How did the RFs perform in the IBD dataset?

As in previous studies (Pasolli et al., 2016; Thomas et al., 2019), RFs using functional features performed similarly (0.69 Cross Validation and 0.71 LODO ROC AUC on pathways relative abundance), indicating a tight link between strain-specific taxonomy and gene carriage in this setting.

Q11. What is the reason for the absence of biomarkers for active UC?

The relative absence of biomarkers for active UC may result both from its generally more benign phenotype (Lloyd-Price et al., 2019) and from the smaller number of active UC samples (n=23) compared with active CD samples (n=76); as a result, the authors focused their subsequent analyses on expression differences within the CD subcohort.

Q12. What is the common way to filter low-quality taxonomic annotations?

Q13. What are the available synthetic metagenomes and gold standards?

Human and murine synthetic metagenomes and gold standards provided by the CAMI Challenge are available at https://data.cami-challenge.org/participate. Non-human synthetic metagenomes and gold standards are available at http://segatalab.cibio.unitn.it/tools/biobakery/.
