
Shotgun metagenomics, from sampling to sequencing and analysis
Christopher Quince1,^, Alan W. Walker2,^, Jared T. Simpson3,4, Nicholas J. Loman5, Nicola Segata6,*

1 Warwick Medical School, University of Warwick, Warwick, UK.
2 Microbiology Group, The Rowett Institute, University of Aberdeen, Aberdeen, UK.
3 Ontario Institute for Cancer Research, Toronto, Canada.
4 Department of Computer Science, University of Toronto, Toronto, Canada.
5 Institute for Microbiology and Infection, University of Birmingham, Birmingham, UK.
6 Centre for Integrative Biology, University of Trento, Trento, Italy.
^ These authors contributed equally.
* Corresponding author: Nicola Segata (nicola.segata@unitn.it)
Diverse microbial communities of bacteria, archaea, viruses and single-celled eukaryotes have crucial
roles in the environment and human health. However, microbes are frequently difficult to culture in the
laboratory, which can confound cataloging members and understanding how communities function.
Cheap, high-throughput sequencing technologies and a suite of computational pipelines have been
combined into shotgun metagenomics methods that have transformed microbiology. Still, computational
approaches to overcome challenges that affect both assembly-based and mapping-based metagenomic
profiling, particularly of high-complexity samples, or environments containing organisms with limited
similarity to sequenced genomes, are needed. Understanding the functions and characterizing specific strains of these communities offer biotechnological promise for therapeutic discovery and for innovative ways to synthesize products using microbial factories, and can also pinpoint the contributions of microorganisms to planetary, animal and human health.
Introduction
High throughput sequencing approaches enable genomic analyses of ideally all microbes in a sample, not
just those that are more amenable to cultivation. One such method, shotgun metagenomics, is the
untargeted (“shotgun”) sequencing of all (“meta”) of the microbial genomes (“genomics”) present in a
sample. Shotgun sequencing can be used to profile taxonomic composition and functional potential of
microbial communities, and to recover whole genome sequences. Approaches such as high-throughput 16S
rRNA gene sequencing1, which profile selected organisms or single marker genes, are sometimes mistakenly referred to as metagenomics, but they are not metagenomic methods because they do not target the entire genomic content of a sample.
In the 15 years since it was first used, metagenomics has enabled large-scale investigations of complex microbiomes2-7. Discoveries enabled by this technology include the identification of previously unknown environmental bacterial phyla with endosymbiotic behavior8, and of species that can carry out complete nitrification of ammonia9,10. Other striking findings include the widespread presence of antibiotic resistance genes in commensal gut bacteria11, the tracking of human outbreak pathogens4, the strong association of both the viral12 and bacterial13 fractions of the microbiome with inflammatory bowel diseases, and the ability to monitor strain-level changes in the gut microbiota after perturbations such as those induced by faecal microbiota transplantation14.
In this Review we discuss best practice for shotgun metagenomics studies, including how to identify and tackle limitations, and provide an outlook for the future of metagenomics.

Figure 1. Summary of a metagenomics workflow. Step 1: Study design and experimental protocol; the importance of this step is often underestimated in metagenomics. Step 2: Computational pre-processing. Computational quality-control steps minimize fundamental sequence biases or artefacts, e.g. removal of sequencing adaptors, quality trimming and removal of sequencing duplicates (using, for example, FastQC, Trimmomatic122 and Picard tools). Foreign or non-target DNA sequences are also filtered, and samples are sub-sampled to normalize read numbers if the diversity of taxa or functions is to be compared. Step 3: Sequence analysis. This should comprise a combination of 'read-based' and 'assembly-based' approaches, depending on the experimental objectives. Both approaches have advantages and limitations (see Table 4 for a detailed discussion). Step 4: Post-processing. Various multivariate statistical techniques can be used to interpret the data. Step 5: Validation. Conclusions from high-dimensional biological data are susceptible to study-driven biases, so follow-up analyses are vital.
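The pre-processing step can be scripted around standard tools. The sketch below is a minimal, hypothetical example: it assumes Trimmomatic is installed as a "trimmomatic" command (e.g. via conda), that paired-end gzipped FASTQ inputs follow a prefix_R1/prefix_R2 naming scheme, and that an adapter FASTA file is available; the trimming parameters and subsampling depth are illustrative, not recommendations.

```python
# Minimal pre-processing sketch (hypothetical example, not a recommended protocol).
# Assumptions: Trimmomatic is available as a "trimmomatic" command (e.g. via conda),
# inputs are paired-end gzipped FASTQ files named <prefix>_R1.fastq.gz / <prefix>_R2.fastq.gz,
# and an adapter FASTA file "adapters.fa" is present. Parameter values are illustrative.
import gzip
import random
import subprocess


def quality_trim(prefix):
    """Adapter removal and quality trimming of one paired-end sample with Trimmomatic."""
    subprocess.run(
        ["trimmomatic", "PE",
         f"{prefix}_R1.fastq.gz", f"{prefix}_R2.fastq.gz",                 # inputs
         f"{prefix}_R1.trim.fastq.gz", f"{prefix}_R1.unpaired.fastq.gz",   # forward outputs
         f"{prefix}_R2.trim.fastq.gz", f"{prefix}_R2.unpaired.fastq.gz",   # reverse outputs
         "ILLUMINACLIP:adapters.fa:2:30:10",   # remove sequencing adaptors
         "SLIDINGWINDOW:4:20",                 # trim low-quality windows
         "MINLEN:50"],                         # drop reads that become too short
        check=True)


def subsample_fastq(path_in, path_out, n_reads, seed=1):
    """Reservoir-sample n_reads records, e.g. to normalize read numbers across samples.

    With the same seed and the same number of input records, the same record indices
    are kept, so applying this to R1 and R2 files separately preserves read pairing.
    """
    rng = random.Random(seed)
    kept = []
    with gzip.open(path_in, "rt") as handle:
        i = 0
        while True:
            record = [handle.readline() for _ in range(4)]  # one FASTQ record = 4 lines
            if not record[0]:
                break
            if i < n_reads:
                kept.append(record)
            else:
                j = rng.randint(0, i)
                if j < n_reads:
                    kept[j] = record
            i += 1
    with gzip.open(path_out, "wt") as out:
        for record in kept:
            out.writelines(record)


if __name__ == "__main__":
    quality_trim("sample1")
    subsample_fastq("sample1_R1.trim.fastq.gz", "sample1_R1.sub.fastq.gz", 5_000_000)
    subsample_fastq("sample1_R2.trim.fastq.gz", "sample1_R2.sub.fastq.gz", 5_000_000)
```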
Shotgun metagenomics study design
A typical shotgun metagenomics study comprises five steps following the initial study design: (i) the collection, processing and sequencing of the samples; (ii) the preprocessing of the sequencing reads; (iii) the sequence analysis to profile taxonomic, functional and genomic features of the microbiome; (iv) the postprocessing statistical and biological analysis; and (v) the validation (Figure 1). Numerous experimental and computational approaches are available to carry out each step, which means that researchers are faced with a daunting choice. Despite its apparent simplicity, shotgun metagenomics has limitations owing to potential experimental biases and the complexity of computational analyses and their interpretation. We assess the choices that need to be made at each step and how to overcome common problems.

The steps involved in the design of hypothesis-based studies are outlined in Supplementary Figure 1, with specific recommendations summarized in Supplementary Box 1. Individual samples from the same environment can be variable in microbial content, which makes it challenging to detect statistically significant, and biologically meaningful, differences among small sets of samples. It is therefore important to establish that studies are sufficiently powered to detect differences, especially if the effect size is small15. One useful strategy may be to generate pilot data to inform power calculations16,17. Alternatively, a two-tier approach may be adopted, in which shotgun metagenomics is carried out on a subset of samples that have been pre-screened with less expensive microbial surveys such as 16S rRNA gene sequencing18.
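As a concrete illustration of pilot-data-driven power calculations, the sketch below simulates a two-group comparison of a single taxon's relative abundance under a simple Beta model parameterized from hypothetical pilot estimates; the model, the effect size and the choice of test are assumptions made for illustration and are not part of the original article.

```python
# Simulation-based power estimate for detecting a difference in one taxon's
# relative abundance between two groups. All numbers are hypothetical pilot
# estimates; the Beta model and the Mann-Whitney test are illustrative choices.
import numpy as np
from scipy.stats import mannwhitneyu


def estimated_power(n_per_group, mean_a=0.05, mean_b=0.08, dispersion=40.0,
                    alpha=0.05, n_sim=2000, seed=0):
    """Fraction of simulated studies in which the group difference is detected."""
    rng = np.random.default_rng(seed)
    detected = 0
    for _ in range(n_sim):
        # Relative abundances drawn from Beta distributions with the given means;
        # a larger "dispersion" means less between-sample variability.
        group_a = rng.beta(mean_a * dispersion, (1 - mean_a) * dispersion, n_per_group)
        group_b = rng.beta(mean_b * dispersion, (1 - mean_b) * dispersion, n_per_group)
        if mannwhitneyu(group_a, group_b, alternative="two-sided").pvalue < alpha:
            detected += 1
    return detected / n_sim


for n in (10, 25, 50, 100):
    print(f"n = {n:>3} per group -> estimated power {estimated_power(n):.2f}")
```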
Controls are also important, but it can be difficult to obtain representative samples from a suitable control group, particularly when studying environments such as humans, in which the resident microbial communities are influenced, to varying extents, by factors such as host genotype19, age, diet and environmental surroundings20. Where feasible, we recommend longitudinal studies that incorporate samples from the same habitat over time rather than simple cross-sectional studies that compare "snapshots" of two sample sets21. Importantly, longitudinal studies do not rely on results from a single sample that might be a non-representative outlier. Exclusion of samples that may be confounded by an unwanted variable is also prudent. For example, in studies of human subjects, exclusion criteria might include exposure to drugs that are known to affect the microbiome, e.g. antibiotics. If this is not feasible, then potential confounders should be factored into comparative analyses (see Supplementary Box 1).
If samples originate in animal models, particularly those involving co-housed rodents, the effects that animal age and housing environment22,23, and the sex of the person handling the animals24, may have on microbial community profiles should be taken into account. It is usually possible to mitigate against potential confounders in the study design by housing animals individually to prevent the spread of microbes between cage mates (although this may introduce behavioural changes, potentially resulting in different biases), by mixing animals derived from different experimental cohorts together within the same cage, or by repeating experiments with mouse lines obtained from different vendors or with different genetic backgrounds25.
Finally, regardless of the type of sample being studied, it is crucial to collect detailed and accurate metadata. The MIMARKS and MIxS standards were set out to provide guidance on required metadata26, but metagenomics is now applied to such disparate environments that it is difficult to choose parameters that are suitable and feasible to obtain for every sample type. We recommend associating as much descriptive and detailed metadata as possible with each sample, to maximize the chances that differences between study cohorts or sample types can be correlated with a particular environmental variable21.
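As a minimal illustration of the kind of record we have in mind, the sketch below writes a per-sample metadata table; the field names loosely follow MIxS-style descriptors but are hypothetical examples, and the checklist appropriate to the environment being studied should be consulted for the fields that are actually required.

```python
# Hypothetical per-sample metadata record written to a CSV table. Field names
# loosely follow MIxS-style descriptors but are illustrative only; consult the
# relevant MIxS/MIMARKS checklist for the fields required for your environment.
import csv

samples = [
    {
        "sample_id": "S001",
        "collection_date": "2017-03-14",
        "geo_loc_name": "UK: Birmingham",
        "env_material": "faeces",
        "host_age_years": 34,
        "host_antibiotics_past_6_months": "no",     # potential confounder (see text)
        "time_to_freezing_hours": 2,                # storage conditions matter (see text)
        "storage_temp_celsius": -80,
        "freeze_thaw_cycles": 0,
        "dna_extraction_protocol": "bead-beating kit X (hypothetical)",
    },
]

with open("sample_metadata.csv", "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=list(samples[0]))
    writer.writeheader()
    writer.writerows(samples)
```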
Sample collection and DNA extraction
Sample collection and preservation protocols can affect both the quality and the accuracy of metagenomics data. Importantly, the effect size of these steps can, in some circumstances, be greater than the effect size of the biological variables of interest27. Indeed, variations in sample processing protocols can also be important confounders in meta-analyses of datasets from different studies (Supplementary Box 1). Collection and storage methods that have been validated for one sample type cannot be assumed to be optimal for different sample types. As such, careful preliminary work to optimize processing conditions for each sample type is often necessary (Supplementary Figure 1).
Whole genome amplification123
  Advantages: Highly sensitive - can generate sufficient DNA for sequencing from even tiny amounts of starting material. Cost effective - can be applied directly to extracted environmental DNA, with no need to isolate cells. Non-specific and untargeted - can amplify DNA from the whole range of species present within a given sample.
  Limitations: The amplification step can introduce significant biases, which skew the resulting metagenomics profiles. Chimeric molecules can be formed during amplification, which can confound the assembly step. Non-specific - unlikely to improve the proportional abundance of DNA from a species of interest.

Single-cell genomics72
  Advantages: Can generate genomes from uncultured organisms. Can be combined with targeting approaches such as fluorescence in situ hybridization to select specific taxa, including those that might be rare members of the microbial community. Places genomic data within its correct phylogenetic context. Reference genomes can aid metagenomics assemblies.
  Limitations: Can be expensive to isolate single cells, and requires specialist equipment. Requires a whole genome amplification step (see limitations above). Biases introduced during genome amplification mean that it is usually only possible to recover partial genomes. Prone to contamination.

Flow-sorting124
  Advantages: High-throughput means to sort cells of interest. Targeted approach - can select specific taxa, including those that might be rare members of the microbial community.
  Limitations: Expensive equipment, requiring specialist operators. Requires intact cells. Any cells in the sample that are attached to surfaces or fixed in structures (e.g. biofilms) may not be recovered. Flow rates and sort volumes limit the number of cells that can be collected.

In situ enrichment125
  Advantages: Simplifies microbial community structure - can make it easier to assemble genomes from metagenomics data. The presence of particular taxa within enriched samples can give clues as to their functional roles within the microbial community.
  Limitations: Requires that cells of interest can be maintained stably in a microcosm over the entire enrichment period. Simplifies microbial community structure - biases results in favour of organisms that were able to thrive within the microcosm.

Culture/microculture71
  Advantages: Cultured isolates can be extensively tested for phenotypic features. Reference genomes can aid metagenomics assemblies. Functional data can improve metagenomics annotations. Places genomic data within its correct phylogenetic context.
  Limitations: Low throughput and can be highly labor intensive. Extremely biased - many microbes are inherently difficult to culture in the laboratory. Unlikely to recover rarer members of a microbial community, as cultured isolate collections will be dominated by the most abundant organisms.

Sequence capture technologies126
  Advantages: Oligonucleotide probes can be used to identify species of interest, as recently demonstrated for culture-independent viral diagnostics. By focusing only on species of interest, higher sensitivity can be achieved, particularly when large amounts of host contamination are present.
  Limitations: Capture kits can be expensive. Like PCR, capture fails when target organisms vary compared with the reference sequences used to design the probes. Genome coverage of targeted organisms can be uneven, affecting assemblies.

Immunomagnetic separation127
  Advantages: Targeted approach - can enrich specific taxa, including those that might be comparatively rare members of the microbial community. Far less expensive than many other targeted enrichment techniques, such as single-cell genomics or flow sorting. Less technically challenging and time consuming than other targeted enrichment techniques.
  Limitations: Requires intact cells. Requires a specific antibody for the target cells of interest. If target cell numbers are low, whole genome amplification may be needed following cell separation (see limitations above).

Background (e.g. human/eukaryotic) depletion techniques128
  Advantages: Particularly useful for samples where microbial cell numbers are much lower than eukaryotic cell numbers (e.g. biopsies). Improves sensitivity - enhanced detection of microbial genomic data. Lower sequencing depth is required to obtain good coverage of microbial genomes, reducing sequencing costs. Relatively inexpensive and not technically challenging.
  Limitations: Concomitant loss of bacterial DNA of interest can occur during processing steps, which can bias subsequent microbiome profiling. May introduce contamination.

Table 1: Summary of the advantages and limitations of methods to enrich for microbial cells/DNA before sequencing.
Key objectives are to collect sufficient microbial biomass for sequencing and to minimize contamination of samples. Enrichment methods can be used for those environments in which microbes are scarce (see Table 1). However, enrichment procedures can introduce bias into sequencing data28. Several studies have shown that factors such as the length of time between sample collection and freezing29, or the number of freeze-thaw cycles that samples go through, can affect the microbial community profiles that are detected; both collection and storage protocols and conditions should therefore be recorded (Supplementary Box 1).
The choice of DNA extraction method can affect the composition of downstream sequence data30. The extraction method must be able to lyse diverse microbial taxa; otherwise, sequencing results may be dominated by DNA derived from easy-to-lyse microbes. DNA extraction methods that include mechanical lysis (such as bead-beating) are often considered superior to those that rely on chemical lysis31. However, bead-beating-based approaches do vary in their efficiency32. Vigorous extraction techniques such as bead-beating can also result in shortened DNA fragments, which can contribute to DNA loss during library preparation methods that use fragment size selection techniques.
Contamination can be introduced during sample processing stages. Kit and laboratory reagents may contain variable amounts of microbial contaminants33. Metagenomics datasets from low-biomass samples (e.g. skin swabs) are particularly vulnerable to this problem, because there is less "real" signal to compete with low levels of contamination34. We advise those working with low-biomass samples to use ultraclean reagents35 and to incorporate "blank" sequencing controls, in which reagents are sequenced without adding sample template34. Other types of contamination are carry-over from previous sequencing runs, the presence of PhiX control DNA that is typically used as part of Illumina-based sequencing protocols, and human or host DNA.
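One simple way to make use of such blank controls computationally is sketched below: taxa whose relative abundance in negative controls rivals their abundance in real samples are flagged as likely reagent contaminants. The threshold, input format and example values are assumptions made for illustration; dedicated contaminant-identification methods are preferable in practice.

```python
# Minimal blank-control contaminant check: flag taxa whose mean relative
# abundance in negative ("blank") controls is close to, or higher than, their
# mean abundance in real samples. Threshold and values are illustrative only.
from statistics import mean


def flag_contaminants(profiles, sample_ids, blank_ids, ratio=0.5):
    """profiles: dict mapping taxon -> dict of sample id -> relative abundance."""
    flagged = []
    for taxon, row in profiles.items():
        mean_samples = mean(row.get(s, 0.0) for s in sample_ids)
        mean_blanks = mean(row.get(b, 0.0) for b in blank_ids)
        # If blanks carry at least `ratio` of the signal seen in real samples,
        # treat the taxon as a likely reagent or laboratory contaminant.
        if mean_blanks > 0 and mean_blanks >= ratio * mean_samples:
            flagged.append(taxon)
    return flagged


profiles = {                                    # hypothetical relative abundances
    "Ralstonia": {"S1": 0.02, "S2": 0.03, "B1": 0.04},
    "Bacteroides": {"S1": 0.30, "S2": 0.25, "B1": 0.001},
}
print(flag_contaminants(profiles, ["S1", "S2"], ["B1"]))   # -> ['Ralstonia']
```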
Library preparation and sequencing
Choosing a library preparation and sequencing method hinges on the availability of materials and services, cost, ease of automation, and DNA sample quantification. The Illumina platform has become the dominant choice for shotgun metagenomics owing to its wide availability, very high outputs (up to 1.5 Tb per run) and high accuracy (with a typical error rate of 0.1-1%), although the competing Ion Torrent S5/S5 XL instrument is an alternative choice. Recently, long-read sequencing technologies such as the Oxford Nanopore MinION and Pacific Biosciences Sequel have scaled up output and can reliably generate up to 10 gigabases per run, and may therefore soon start to see adoption for metagenomics studies.
Given the very high outputs achievable on a single instrument run, multiple metagenomic samples are usually sequenced on the same run by multiplexing up to 96 or 384 samples, typically using the dual-indexing barcode sets available for all library preparation protocols. The Illumina platforms are known to suffer from issues of carry-over (between runs) and carry-between (within runs)36. Recently, concern has been raised that newer Illumina instruments using isothermal cluster generation (ExAmp) suffer from high rates of 'index hopping', in which incorrect barcode identifiers are incorporated into growing clusters37, although the extent of this problem in typical metagenomics projects has not been evaluated, and approaches to mitigate it have been suggested. To help evaluate the extent of such issues, randomly chosen control wells containing known spiked-in organisms as positive controls, and template-negative controls, should be used. Such controls are particularly critical for diagnostic metagenomics projects, in which small numbers of pathogen reads may be a signal of infection against a background of high host contamination. Although still uncommon in the field, performing technical replicates would be useful to assess variability, and even subjecting a subset of samples to replication may give enough information to disentangle technical from true biological variability.
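A rough way to gauge the scale of barcode mis-assignment on a run is sketched below: with dual indexing, reads demultiplexed to index combinations that were never used in the library pool give a lower-bound estimate of hopping or other mis-assignment. The count format and index names are hypothetical; real counts would come from the run's demultiplexing summary.

```python
# Rough index-hopping / mis-assignment check. With dual-indexed libraries, reads
# assigned to (i7, i5) combinations that were never loaded on the run give a
# lower-bound estimate of barcode mis-assignment. All names/counts are hypothetical.
expected_pairs = {("D701", "D501"), ("D702", "D502"), ("D703", "D503")}

index_pair_counts = {                    # taken from the run's demultiplexing summary
    ("D701", "D501"): 9_800_000,
    ("D702", "D502"): 10_100_000,
    ("D703", "D503"): 9_950_000,
    ("D701", "D502"): 12_000,            # combination never loaded on this run
    ("D702", "D501"): 9_500,             # combination never loaded on this run
}

total_reads = sum(index_pair_counts.values())
misassigned = sum(count for pair, count in index_pair_counts.items()
                  if pair not in expected_pairs)
print(f"Reads in unexpected index combinations: {misassigned / total_reads:.3%}")
```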
Multiple methods are available for the generation of Illumina sequencing libraries; these are usually distinguished by the method of fragmentation used. Transposase-based "tagmentation" approaches, for example the Illumina Nextera and Nextera XT products, are popular owing to their low cost (list prices of $25-40 per sample, with dilution methods potentially able to reduce these costs even further38). Tagmentation approaches require only small DNA inputs (1 ng of DNA is recommended, but lower amounts can be used). Such low inputs are achieved owing to a subsequent PCR amplification step. However, as tagmentation targets
