
Shotgun metagenomics, from sampling to sequencing and analysis
Christopher Quince1,^, Alan W. Walker2,^, Jared T. Simpson3,4, Nicholas J. Loman5, Nicola Segata6,*

1 Warwick Medical School, University of Warwick, Warwick, UK.
2 Microbiology Group, The Rowett Institute, University of Aberdeen, Aberdeen, UK.
3 Ontario Institute for Cancer Research, Toronto, Canada.
4 Department of Computer Science, University of Toronto, Toronto, Canada.
5 Institute for Microbiology and Infection, University of Birmingham, Birmingham, UK.
6 Centre for Integrative Biology, University of Trento, Trento, Italy.
^ These authors contributed equally.
* Corresponding author: Nicola Segata (nicola.segata@unitn.it)
Diverse microbial communities of bacteria, archaea, viruses and single-celled eukaryotes have crucial
roles in the environment and human health. However, microbes are frequently difficult to culture in the
laboratory, which can confound cataloging members and understanding how communities function.
Cheap, high-throughput sequencing technologies and a suite of computational pipelines have been
combined into shotgun metagenomics methods that have transformed microbiology. Still, computational
approaches to overcome challenges that affect both assembly-based and mapping-based metagenomic
profiling, particularly of high-complexity samples, or environments containing organisms with limited
similarity to sequenced genomes, are needed. Understanding the functions and characterizing specific strains of these communities offer biotechnological promise for therapeutic discovery and for innovative ways to synthesize products using microbial factories, and can also pinpoint the contributions of microorganisms to planetary, animal and human health.
Introduction
High throughput sequencing approaches enable genomic analyses of ideally all microbes in a sample, not
just those that are more amenable to cultivation. One such method, shotgun metagenomics, is the
untargeted (“shotgun”) sequencing of all (“meta”) of the microbial genomes (“genomics”) present in a
sample. Shotgun sequencing can be used to profile taxonomic composition and functional potential of
microbial communities, and to recover whole genome sequences. Approaches such as high-throughput 16S
rRNA gene sequencing1, which profile selected organisms or single marker genes, are sometimes mistakenly referred to as metagenomics, but they are not metagenomic methods because they do not target the entire genomic content of a sample.
In the 15 years since it was first used, metagenomics has enabled large-scale investigations of complex microbiomes2-7. Discoveries enabled by this technology include the identification of previously unknown environmental bacterial phyla with endosymbiotic behavior8, and of species that can carry out complete nitrification of ammonia9,10. Other striking findings include the widespread presence of antibiotic resistance genes in commensal gut bacteria11, the tracking of human outbreak pathogens4, the strong association of both the viral12 and bacterial13 fractions of the microbiome with inflammatory bowel diseases, and the ability to monitor strain-level changes in the gut microbiota after perturbations such as those induced by faecal microbiota transplantation14.
In this Review we discuss best practice for shotgun metagenomics studies, including how to identify and tackle limitations, and provide an outlook for the future of metagenomics.

Figure 1. Summary of a metagenomics workflow. Step 1: Study design and experimental protocol; the importance of this step is often underestimated in metagenomics. Step 2: Computational pre-processing. Computational quality-control steps minimize fundamental sequence biases or artefacts, e.g. removal of sequencing adaptors, quality trimming and removal of sequencing duplicates (using, for example, FastQC, Trimmomatic122 and Picard tools). Foreign or non-target DNA sequences are also filtered, and samples are sub-sampled to normalize read numbers if the diversity of taxa or functions is to be compared. Step 3: Sequence analysis. This should comprise a combination of 'read-based' and 'assembly-based' approaches, depending on the experimental objectives. Both approaches have advantages and limitations (see Table 4 for a detailed discussion). Step 4: Post-processing. Various multivariate statistical techniques can be used to interpret the data. Step 5: Validation. Conclusions from high-dimensional biological data are susceptible to study-driven biases, so follow-up analyses are vital.
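The pre-processing step can be scripted around standard tools. The sketch below is a minimal, hypothetical example: it assumes Trimmomatic is installed as a "trimmomatic" command (e.g. via conda), that paired-end gzipped FASTQ inputs follow a prefix_R1/prefix_R2 naming scheme, and that an adapter FASTA file is available; the trimming parameters and subsampling depth are illustrative, not recommendations.

```python
# Minimal pre-processing sketch (hypothetical example, not a recommended protocol).
# Assumptions: Trimmomatic is available as a "trimmomatic" command (e.g. via conda),
# inputs are paired-end gzipped FASTQ files named <prefix>_R1.fastq.gz / <prefix>_R2.fastq.gz,
# and an adapter FASTA file "adapters.fa" is present. Parameter values are illustrative.
import gzip
import random
import subprocess


def quality_trim(prefix):
    """Adapter removal and quality trimming of one paired-end sample with Trimmomatic."""
    subprocess.run(
        ["trimmomatic", "PE",
         f"{prefix}_R1.fastq.gz", f"{prefix}_R2.fastq.gz",                 # inputs
         f"{prefix}_R1.trim.fastq.gz", f"{prefix}_R1.unpaired.fastq.gz",   # forward outputs
         f"{prefix}_R2.trim.fastq.gz", f"{prefix}_R2.unpaired.fastq.gz",   # reverse outputs
         "ILLUMINACLIP:adapters.fa:2:30:10",   # remove sequencing adaptors
         "SLIDINGWINDOW:4:20",                 # trim low-quality windows
         "MINLEN:50"],                         # drop reads that become too short
        check=True)


def subsample_fastq(path_in, path_out, n_reads, seed=1):
    """Reservoir-sample n_reads records, e.g. to normalize read numbers across samples.

    With the same seed and the same number of input records, the same record indices
    are kept, so applying this to R1 and R2 files separately preserves read pairing.
    """
    rng = random.Random(seed)
    kept = []
    with gzip.open(path_in, "rt") as handle:
        i = 0
        while True:
            record = [handle.readline() for _ in range(4)]  # one FASTQ record = 4 lines
            if not record[0]:
                break
            if i < n_reads:
                kept.append(record)
            else:
                j = rng.randint(0, i)
                if j < n_reads:
                    kept[j] = record
            i += 1
    with gzip.open(path_out, "wt") as out:
        for record in kept:
            out.writelines(record)


if __name__ == "__main__":
    quality_trim("sample1")
    subsample_fastq("sample1_R1.trim.fastq.gz", "sample1_R1.sub.fastq.gz", 5_000_000)
    subsample_fastq("sample1_R2.trim.fastq.gz", "sample1_R2.sub.fastq.gz", 5_000_000)
```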
Shotgun metagenomics study design
A typical shotgun metagenomics study comprises five steps following the initial study design: (i) the collection, processing and sequencing of the samples; (ii) the preprocessing of the sequencing reads; (iii) the sequence analysis to profile taxonomic, functional and genomic features of the microbiome; (iv) the postprocessing statistical and biological analysis; and (v) the validation (Figure 1). Numerous experimental and computational approaches are available to carry out each step, which means that researchers are faced with a daunting choice. Despite its apparent simplicity, shotgun metagenomics has limitations owing to potential experimental biases and the complexity of computational analyses and their interpretation. We assess the choices that need to be made at each step and how to overcome common problems.

The steps involved in the design of hypothesis-based studies are outlined in Supplementary Figure 1, with specific recommendations summarized in Supplementary Box 1. Individual samples from the same environment can be variable in microbial content, which makes it challenging to detect statistically significant, and biologically meaningful, differences among small sets of samples. It is therefore important to establish that studies are sufficiently powered to detect differences, especially if the effect size is small15. One useful strategy may be to generate pilot data to inform power calculations16,17. Alternatively, a two-tier approach may be adopted, in which shotgun metagenomics is carried out on a subset of samples that have been pre-screened with less expensive microbial surveys such as 16S rRNA gene sequencing18.
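As a concrete illustration of pilot-data-driven power calculations, the sketch below simulates a two-group comparison of a single taxon's relative abundance under a simple Beta model parameterized from hypothetical pilot estimates; the model, the effect size and the choice of test are assumptions made for illustration and are not part of the original article.

```python
# Simulation-based power estimate for detecting a difference in one taxon's
# relative abundance between two groups. All numbers are hypothetical pilot
# estimates; the Beta model and the Mann-Whitney test are illustrative choices.
import numpy as np
from scipy.stats import mannwhitneyu


def estimated_power(n_per_group, mean_a=0.05, mean_b=0.08, dispersion=40.0,
                    alpha=0.05, n_sim=2000, seed=0):
    """Fraction of simulated studies in which the group difference is detected."""
    rng = np.random.default_rng(seed)
    detected = 0
    for _ in range(n_sim):
        # Relative abundances drawn from Beta distributions with the given means;
        # a larger "dispersion" means less between-sample variability.
        group_a = rng.beta(mean_a * dispersion, (1 - mean_a) * dispersion, n_per_group)
        group_b = rng.beta(mean_b * dispersion, (1 - mean_b) * dispersion, n_per_group)
        if mannwhitneyu(group_a, group_b, alternative="two-sided").pvalue < alpha:
            detected += 1
    return detected / n_sim


for n in (10, 25, 50, 100):
    print(f"n = {n:>3} per group -> estimated power {estimated_power(n):.2f}")
```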
Controls are also important, but it can be difficult to obtain representative samples from a suitable control group, particularly when studying environments such as humans, in which the resident microbial communities are influenced, to varying extents, by factors such as host genotype19, age, diet and environmental surroundings20. Where feasible, we recommend longitudinal studies that incorporate samples from the same habitat over time rather than simple cross-sectional studies that compare "snapshots" of two sample sets21. Importantly, longitudinal studies do not rely on results from a single sample that might be a non-representative outlier. Exclusion of samples that may be confounded by an unwanted variable is also prudent. For example, in studies of human subjects, exclusion criteria might include exposure to drugs that are known to affect the microbiome, e.g. antibiotics. If this is not feasible, then potential confounders should be factored into comparative analyses (see Supplementary Box 1).
If samples originate in animal models, particularly those involving co-housed rodents, the effects that animal age and housing environment22,23, and the sex of the person handling the animals24, may have on microbial community profiles should be taken into account. It is usually possible to mitigate against potential confounders in the study design by housing animals individually to prevent the spread of microbes between cage mates (although this may introduce behavioural changes, potentially resulting in different biases), by mixing animals derived from different experimental cohorts together within the same cage, or by repeating experiments with mouse lines obtained from different vendors or with different genetic backgrounds25.
Finally, regardless of the type of sample being studied, it is crucial to collect detailed and accurate metadata. The MIMARKS and MIxS standards were set out to provide guidance on required metadata26, but metagenomics is now applied to such disparate environments that it is difficult to choose parameters that are suitable and feasible to obtain for every sample type. We recommend associating as much descriptive and detailed metadata as possible with each sample, to maximize the chances that differences between study cohorts or sample types can be correlated with a particular environmental variable21.
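As a minimal illustration of the kind of record we have in mind, the sketch below writes a per-sample metadata table; the field names loosely follow MIxS-style descriptors but are hypothetical examples, and the checklist appropriate to the environment being studied should be consulted for the fields that are actually required.

```python
# Hypothetical per-sample metadata record written to a CSV table. Field names
# loosely follow MIxS-style descriptors but are illustrative only; consult the
# relevant MIxS/MIMARKS checklist for the fields required for your environment.
import csv

samples = [
    {
        "sample_id": "S001",
        "collection_date": "2017-03-14",
        "geo_loc_name": "UK: Birmingham",
        "env_material": "faeces",
        "host_age_years": 34,
        "host_antibiotics_past_6_months": "no",     # potential confounder (see text)
        "time_to_freezing_hours": 2,                # storage conditions matter (see text)
        "storage_temp_celsius": -80,
        "freeze_thaw_cycles": 0,
        "dna_extraction_protocol": "bead-beating kit X (hypothetical)",
    },
]

with open("sample_metadata.csv", "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=list(samples[0]))
    writer.writeheader()
    writer.writerows(samples)
```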
Sample collection and DNA extraction
Sample collection and preservation protocols can affect both the quality and the accuracy of metagenomics data. Importantly, the effect size of these steps can, in some circumstances, be greater than the effect size of the biological variables of interest27. Indeed, variations in sample processing protocols can also be important confounders in meta-analyses of datasets from different studies (Supplementary Box 1). Collection and storage methods that have been validated for one sample type cannot be assumed to be optimal for different sample types. As such, careful preliminary work to optimize processing conditions for each sample type is often necessary (Supplementary Figure 1).
Whole genome amplification123
  Advantages: Highly sensitive - can generate sufficient DNA for sequencing from even tiny amounts of starting material. Cost effective - can be applied directly to extracted environmental DNA, with no need to isolate cells. Non-specific and untargeted - can amplify DNA from the whole range of species present within a given sample.
  Limitations: The amplification step can introduce significant biases, which skew the resulting metagenomics profiles. Chimeric molecules can be formed during amplification, which can confound the assembly step. Non-specific - unlikely to improve the proportional abundance of DNA from a species of interest.

Single-cell genomics72
  Advantages: Can generate genomes from uncultured organisms. Can be combined with targeting approaches such as fluorescence in situ hybridization to select specific taxa, including those that might be rare members of the microbial community. Places genomic data within its correct phylogenetic context. Reference genomes can aid metagenomics assemblies.
  Limitations: Can be expensive to isolate single cells, and requires specialist equipment. Requires a whole genome amplification step (see limitations above). Biases introduced during genome amplification mean that it is usually only possible to recover partial genomes. Prone to contamination.

Flow-sorting124
  Advantages: High-throughput means to sort cells of interest. Targeted approach - can select specific taxa, including those that might be rare members of the microbial community.
  Limitations: Expensive equipment, requiring specialist operators. Requires intact cells. Any cells in the sample that are attached to surfaces or fixed in structures (e.g. biofilms) may not be recovered. Flow rates and sort volumes limit the number of cells that can be collected.

In situ enrichment125
  Advantages: Simplifies microbial community structure - can make it easier to assemble genomes from metagenomics data. The presence of particular taxa within enriched samples can give clues as to their functional roles within the microbial community.
  Limitations: Requires that cells of interest can be maintained stably in a microcosm over the entire enrichment period. Simplifies microbial community structure - biases results in favour of organisms that were able to thrive within the microcosm.

Culture/microculture71
  Advantages: Cultured isolates can be extensively tested for phenotypic features. Reference genomes can aid metagenomics assemblies. Functional data can improve metagenomics annotations. Places genomic data within its correct phylogenetic context.
  Limitations: Low throughput and can be highly labor intensive. Extremely biased - many microbes are inherently difficult to culture in the laboratory. Unlikely to recover rarer members of a microbial community, as cultured isolate collections will be dominated by the most abundant organisms.

Sequence capture technologies126
  Advantages: Oligonucleotide probes can be used to identify species of interest, as recently demonstrated for culture-independent viral diagnostics. By focusing only on species of interest, higher sensitivity can be achieved, particularly when large amounts of host contamination are present.
  Limitations: Capture kits can be expensive. Like PCR, capture fails when target organisms vary compared with the reference sequences used to design the probes. Genome coverage of targeted organisms can be uneven, affecting assemblies.

Immunomagnetic separation127
  Advantages: Targeted approach - can enrich specific taxa, including those that might be comparatively rare members of the microbial community. Far less expensive than many other targeted enrichment techniques, such as single-cell genomics or flow sorting. Less technically challenging and time consuming than other targeted enrichment techniques.
  Limitations: Requires intact cells. Requires a specific antibody for the target cells of interest. If target cell numbers are low, whole genome amplification may be needed following cell separation (see limitations above).

Background (e.g. human/eukaryotic) depletion techniques128
  Advantages: Particularly useful for samples where microbial cell numbers are much lower than eukaryotic cell numbers (e.g. biopsies). Improves sensitivity - enhanced detection of microbial genomic data. Lower sequencing depth is required to obtain good coverage of microbial genomes, reducing sequencing costs. Relatively inexpensive and not technically challenging.
  Limitations: Concomitant loss of bacterial DNA of interest can occur during processing steps, which can bias subsequent microbiome profiling. May introduce contamination.

Table 1: Summary of the advantages and limitations of methods to enrich for microbial cells/DNA before sequencing.
Key objectives are to collect sufficient microbial biomass for sequencing and to minimize contamination of samples. Enrichment methods can be used for those environments in which microbes are scarce (see Table 1). However, enrichment procedures can introduce bias into sequencing data28. Several studies have shown that factors such as the length of time between sample collection and freezing29, or the number of freeze-thaw cycles that samples go through, can affect the microbial community profiles that are detected; both collection and storage protocols and conditions should therefore be recorded (Supplementary Box 1).
The choice of DNA extraction method can affect the composition of downstream sequence data30. The extraction method must be able to lyse diverse microbial taxa; otherwise, sequencing results may be dominated by DNA derived from easy-to-lyse microbes. DNA extraction methods that include mechanical lysis (such as bead-beating) are often considered superior to those that rely on chemical lysis31. However, bead-beating-based approaches do vary in their efficiency32. Vigorous extraction techniques such as bead-beating can also result in shortened DNA fragments, which can contribute to DNA loss during library preparation methods that use fragment size selection techniques.
Contamination can be introduced during sample processing stages. Kit and laboratory reagents may contain variable amounts of microbial contaminants33. Metagenomics datasets from low-biomass samples (e.g. skin swabs) are particularly vulnerable to this problem, because there is less "real" signal to compete with low levels of contamination34. We advise those working with low-biomass samples to use ultraclean reagents35 and to incorporate "blank" sequencing controls, in which reagents are sequenced without adding sample template34. Other types of contamination are carry-over from previous sequencing runs, the presence of PhiX control DNA that is typically used as part of Illumina-based sequencing protocols, and human or host DNA.
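One simple way to make use of such blank controls computationally is sketched below: taxa whose relative abundance in negative controls rivals their abundance in real samples are flagged as likely reagent contaminants. The threshold, input format and example values are assumptions made for illustration; dedicated contaminant-identification methods are preferable in practice.

```python
# Minimal blank-control contaminant check: flag taxa whose mean relative
# abundance in negative ("blank") controls is close to, or higher than, their
# mean abundance in real samples. Threshold and values are illustrative only.
from statistics import mean


def flag_contaminants(profiles, sample_ids, blank_ids, ratio=0.5):
    """profiles: dict mapping taxon -> dict of sample id -> relative abundance."""
    flagged = []
    for taxon, row in profiles.items():
        mean_samples = mean(row.get(s, 0.0) for s in sample_ids)
        mean_blanks = mean(row.get(b, 0.0) for b in blank_ids)
        # If blanks carry at least `ratio` of the signal seen in real samples,
        # treat the taxon as a likely reagent or laboratory contaminant.
        if mean_blanks > 0 and mean_blanks >= ratio * mean_samples:
            flagged.append(taxon)
    return flagged


profiles = {                                    # hypothetical relative abundances
    "Ralstonia": {"S1": 0.02, "S2": 0.03, "B1": 0.04},
    "Bacteroides": {"S1": 0.30, "S2": 0.25, "B1": 0.001},
}
print(flag_contaminants(profiles, ["S1", "S2"], ["B1"]))   # -> ['Ralstonia']
```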
Library preparation and sequencing
Choosing a library preparation and sequencing method hinges on the availability of materials and services, cost, ease of automation, and DNA sample quantification. The Illumina platform has become the dominant choice for shotgun metagenomics owing to its wide availability, very high outputs (up to 1.5 Tb per run) and high accuracy (with a typical error rate of 0.1-1%), although the competing Ion Torrent S5/S5 XL instrument is an alternative choice. Recently, long-read sequencing technologies such as the Oxford Nanopore MinION and Pacific Biosciences Sequel have scaled up output and can reliably generate up to 10 gigabases per run, and may therefore soon start to see adoption for metagenomics studies.
Given the very high outputs achievable on a single instrument run, multiple metagenomic samples are usually sequenced on the same run by multiplexing up to 96 or 384 samples, typically using the dual-indexing barcode sets available for all library preparation protocols. The Illumina platforms are known to suffer from issues of carry-over (between runs) and carry-between (within runs)36. Recently, concern has been raised that newer Illumina instruments using isothermal cluster generation (ExAmp) suffer from high rates of 'index hopping', in which incorrect barcode identifiers are incorporated into growing clusters37, although the extent of this problem in typical metagenomics projects has not been evaluated, and approaches to mitigate it have been suggested. To help evaluate the extent of such issues, randomly chosen control wells containing known spiked-in organisms as positive controls, and template-negative controls, should be used. Such controls are particularly critical for diagnostic metagenomics projects, in which small numbers of pathogen reads may be a signal of infection against a background of high host contamination. Although still uncommon in the field, performing technical replicates would be useful to assess variability, and even subjecting a subset of samples to replication may give enough information to disentangle technical from true biological variability.
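A rough way to gauge the scale of barcode mis-assignment on a run is sketched below: with dual indexing, reads demultiplexed to index combinations that were never used in the library pool give a lower-bound estimate of hopping or other mis-assignment. The count format and index names are hypothetical; real counts would come from the run's demultiplexing summary.

```python
# Rough index-hopping / mis-assignment check. With dual-indexed libraries, reads
# assigned to (i7, i5) combinations that were never loaded on the run give a
# lower-bound estimate of barcode mis-assignment. All names/counts are hypothetical.
expected_pairs = {("D701", "D501"), ("D702", "D502"), ("D703", "D503")}

index_pair_counts = {                    # taken from the run's demultiplexing summary
    ("D701", "D501"): 9_800_000,
    ("D702", "D502"): 10_100_000,
    ("D703", "D503"): 9_950_000,
    ("D701", "D502"): 12_000,            # combination never loaded on this run
    ("D702", "D501"): 9_500,             # combination never loaded on this run
}

total_reads = sum(index_pair_counts.values())
misassigned = sum(count for pair, count in index_pair_counts.items()
                  if pair not in expected_pairs)
print(f"Reads in unexpected index combinations: {misassigned / total_reads:.3%}")
```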
Multiple methods are available for the generation of Illumina sequencing libraries; these are usually distinguished by the method of fragmentation used. Transposase-based "tagmentation" approaches, for example the Illumina Nextera and Nextera XT products, are popular owing to their low cost (list prices of $25-40 per sample, with dilution methods potentially able to reduce these costs even further38). Tagmentation approaches require only small DNA inputs (1 ng of DNA is recommended, but lower amounts can be used). Such low inputs are achieved owing to a subsequent PCR amplification step. However, as tagmentation targets
