
UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing

Robert C. Edgar
- 15 Oct 2016 - 
- pp 081257
Abstract
Amplicon sequencing of tags such as 16S and ITS ribosomal RNA is a popular method for investigating microbial populations. In such experiments, sequence errors caused by PCR and sequencing are difficult to distinguish from true biological variation. I describe UNOISE2, an updated version of the UNOISE algorithm for denoising (error-correcting) Illumina amplicon reads and show that it has comparable or better accuracy than DADA2.

Independent Investigator
Tiburon, California, USA.
robert@drive5.com
Introduction
Recent examples of microbial tag sequencing experiments include the Human Microbiome
Project (HMP Consortium, 2012) and a survey of the Arabidopsis root
microbiome (Lundberg et al., 2012). The experimental protocol in such studies includes
amplification by PCR followed by sequencing, which introduces errors in several ways.
Amplification introduces substitution and gap errors (point errors) due to incorrect base
pairing and polymerase slippage respectively (Turnbaugh et al., 2010). PCR chimeras form
when an incomplete amplicon primes extension into a different biological template (Haas et
al., 2011). Sequencing also introduces point errors due to substitutions (incorrect base
calls) and gaps (omitted or spurious base calls). Contaminants from reagents and other
sources can introduce spurious species (Edgar, 2013). Spurious species can also be
introduced when reads are assigned to incorrect samples due to cross-talk, also known as
tag switching or barcode switching (Carlsen et al., 2012).
bioRxiv preprint doi: https://doi.org/10.1101/081257; this version posted October 15, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

The first amplicon sequencing error-correction methods were designed for
pyrosequencing flowgrams (Quince et al., 2011, 2009; Reeder and Knight, 2010; Rosen et
al., 2013). More recently, Illumina denoisers have been described including UNOISE (Edgar
and Flyvbjerg, 2014), MED (Eren et al., 2015) and DADA2 (Callahan et al., 2016). The goal of
these methods is to infer accurate biological template sequences from noisy reads. This
task is generally divided into two phases: (1) correcting point errors to obtain an accurate
set of amplicon sequences (denoising) and (2) filtering of chimeric amplicons. The result is a
set of predicted biological sequences that I call ZOTUs (zero-radius OTUs). ZOTUs are valid
operational taxonomic units that are superior to conventional 97% OTUs for most
purposes because they provide the maximum possible biological resolution given the data,
whereas clustering at 97% identity may merge phenotypically different strains with distinct
sequences into a single cluster (Tikhonov et al., 2015; Callahan et al., 2016).
The high-level strategy used by UNOISE and UNOISE2 is to cluster the unique sequences in
the reads. A cluster has a centroid sequence with higher abundance plus similar sequences
(members) having lower abundances (Fig. 1). The centroid is inferred to be correct and its
members are inferred to be reads of the same template sequence containing one or more
point errors. The clustering criteria in UNOISE2 have been redesigned as described below.
UNOISE2 uses a one-pass clustering strategy that does not use quality (Q) scores and has
only two parameters with pre-set values that work well on different datasets. By contrast,
DADA2 uses quality scores in an iterative divisive partitioning clustering strategy based on
a Poisson model with hundreds of parameters (a 4×4 transition matrix for each Q score)
that is re-trained on each dataset. UNOISE2 and DADA2 are thus quite different, suggesting
that their approaches may have complementary strengths and weaknesses. With this in
mind, I will show that taking the subset of ZOTUs predicted by both algorithms reduces the
number of incorrect sequences compared with using either one alone.

UNOISE2 algorithm
Let C be a cluster centroid sequence with abundance $a_C$ and M be a member sequence of
that cluster with abundance $a_M$. Let d be the Levenshtein distance (number of differences
including both substitutions and gaps) between M and C. The abundance skew of M with
respect to C is defined to be skew(M, C) = $a_M / a_C$ (Edgar et al., 2011). If M has small enough d
and small enough skew with respect to C, then it is probably an incorrect read of C with d
point errors (Fig. 1). This intuition is made concrete by introducing the following function:
$$\beta(d) = \frac{1}{2^{\alpha d + 1}} \qquad \text{(Eq. 1)}$$
The user-settable parameter α is set to 2 by default, giving β(1)=1/8, β(2)=1/32,
β(3)=1/128.... If skew(M, C) ≤ β(d) then M is a valid member of a cluster defined by C; i.e., β
is the maximum skew allowed for a member with d differences. As d increases, β decreases
exponentially, reflecting that more errors are less probable and the abundance skew
should therefore be lower. The β function was designed by hand as a model of error
abundance distributions obtained for several mock and in vivo Illumina datasets using the
FASTX-LEARN algorithm (http://drive5.com/usearch/manual/cmd_fastx_learn.html). This
is what physicists call a phenomenological model: a simple mathematical function that fits
the data (and only to a very rough approximation in this case; see Discussion) without
using an underlying theory.
Changing the α parameter trades sensitivity to small differences against an increase in the
number of bad sequences that are wrongly predicted to be good. For example, setting
α=3 gives β(1)=1/16, β(2)=1/128, β(3)=1/1024..., which are lower skew thresholds
than those obtained with the default α=2. Thus, with α=3 a variant with d=1 and skew between
1/16 and 1/8 is predicted to be a correct sequence, while with α=2 it is predicted to have one error.
Conversely, setting α=1 gives β(1)=1/4, β(2)=1/8, β(3)=1/16..., so that a variant with d=1
must have a skew greater than 1/4 to be predicted as correct.
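
The threshold values quoted above can be checked with a few lines of Python. This is only an illustrative sketch, not code from the UNOISE2 implementation: the function names `beta` and `is_error_of` are hypothetical, and α is restricted to integers here so that exact fractions can be used.

```python
from fractions import Fraction


def beta(d: int, alpha: int = 2) -> Fraction:
    """Maximum skew allowed for a cluster member with d differences (Eq. 1)."""
    return Fraction(1, 2 ** (alpha * d + 1))


def is_error_of(skew: Fraction, d: int, alpha: int = 2) -> bool:
    """True if a sequence with this skew and distance d from a centroid is
    treated as a noisy read of that centroid, i.e. skew <= beta(d)."""
    return skew <= beta(d, alpha)


if __name__ == "__main__":
    # Tabulate beta(1), beta(2), beta(3) for alpha = 1, 2, 3.
    for alpha in (1, 2, 3):
        print(alpha, [str(beta(d, alpha)) for d in (1, 2, 3)])

    # A variant with d=1 and skew 1/10 lies between 1/16 and 1/8.
    skew = Fraction(1, 10)
    print(is_error_of(skew, 1, alpha=2))  # True: treated as one error
    print(is_error_of(skew, 1, alpha=3))  # False: treated as a correct sequence
```

With skew 1/10 and d=1, the same variant is classified as an error under α=2 and as a correct sequence under α=3, which is the trade-off described in the text.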

Input to the UNOISE2 algorithm is the set of unique read sequences with abundance ≥γ,
where γ=4 by default. Low-abundance uniques are discarded because they are more prone
to contain errors that are reproduced by chance or bias. A database of cluster centroids is
initially empty. Sequences are considered in order of decreasing abundance. A sequence
(Q) is assigned to cluster C if skew(Q, C) ≤ β(d), where d is the Levenshtein distance
between Q and C. If no such C exists, Q becomes a new
centroid. The final set of centroids is reported as the predicted amplicons. These
amplicons are filtered by the UCHIME2 algorithm using denoised de novo mode (Edgar,
2016).
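
The one-pass clustering described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the UNOISE2 implementation: the function names are hypothetical, centroids are searched by brute force with a textbook Levenshtein routine, ties in abundance are broken alphabetically, and the UCHIME2 chimera-filtering step is omitted.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def beta(d: int, alpha: float = 2.0) -> float:
    """Maximum skew allowed for a member with d differences (Eq. 1)."""
    return 1.0 / 2 ** (alpha * d + 1)


def denoise_sketch(reads, alpha: float = 2.0, gamma: int = 4):
    """Return predicted amplicons as (sequence, abundance) pairs."""
    uniques = Counter(reads)
    # Discard uniques with abundance < gamma, then sort by decreasing abundance.
    candidates = sorted(
        ((seq, n) for seq, n in uniques.items() if n >= gamma),
        key=lambda item: (-item[1], item[0]),
    )
    centroids = []
    for seq, abundance in candidates:
        for cseq, cabund in centroids:
            d = levenshtein(seq, cseq)
            if abundance / cabund <= beta(d, alpha):
                break  # seq is treated as a noisy read of this centroid
        else:
            centroids.append((seq, abundance))  # no centroid matched
    return centroids


if __name__ == "__main__":
    reads = (["ACGTACGT"] * 80 + ["ACGTTCGT"] * 20
             + ["ACGTACGA"] * 10 + ["TTTTACGT"] * 3)
    print(denoise_sketch(reads))
```

In this toy input, the variant with abundance 20 has skew 1/4 at d=1 and becomes its own centroid, the variant with abundance 10 has skew 1/8 and is absorbed by the dominant sequence, and the sequence with abundance 3 is discarded before clustering because it falls below γ=4.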
ZOTU table construction
A table with the number of reads for each ZOTU in each sample is constructed by
considering all reads before any quality filtering, including those with abundance <γ. The
same matching criteria are used, but no new centroids are created. Thus, if a read R is
identical to ZOTU C, or if skew(R, C) ≤ β(d), then R is assigned to C. In practice, a large
majority of reads with low quality or low unique sequence abundance are due to errors in
high-abundance ZOTU sequences, and this procedure thus improves sensitivity by
recovering most of the reads that were discarded for the denoising step. Most of the reads
not assigned to ZOTUs are usually accounted for by chimeras, which the user can verify by
making a ZOTU table using predicted amplicons prior to chimera filtering.
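
A possible reading of the table-construction step is sketched below in Python. Several details are assumptions made for illustration and are not stated in the text: the abundance of a read R is taken to be the pooled abundance of its unique sequence, the abundance of a ZOTU is the pooled abundance of its exact sequence, the first ZOTU in the supplied order that satisfies the criterion wins, and all function names are hypothetical.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def beta(d: int, alpha: float = 2.0) -> float:
    """Maximum skew allowed for a member with d differences (Eq. 1)."""
    return 1.0 / 2 ** (alpha * d + 1)


def zotu_table(sample_reads, zotus, alpha: float = 2.0):
    """sample_reads maps sample name -> list of reads (no abundance filter).
    Returns a dict mapping each ZOTU to a Counter of reads per sample."""
    pooled = Counter(r for reads in sample_reads.values() for r in reads)
    table = {z: Counter() for z in zotus}
    assignment = {}  # unique sequence -> ZOTU, or None if unassigned
    for seq, abundance in pooled.items():
        assignment[seq] = None
        if seq in table:  # identical to a ZOTU
            assignment[seq] = seq
            continue
        for z in zotus:  # same matching criterion, but no new centroids
            a_z = pooled[z]
            if a_z and abundance / a_z <= beta(levenshtein(seq, z), alpha):
                assignment[seq] = z
                break
    for sample, reads in sample_reads.items():
        for r in reads:
            if assignment[r] is not None:
                table[assignment[r]][sample] += 1
    return table


if __name__ == "__main__":
    reads = {
        "s1": ["ACGT"] * 16 + ["ACGA"] * 2 + ["TTTT"],
        "s2": ["ACGT"] * 16 + ["ACGA"],
    }
    print(zotu_table(reads, ["ACGT"]))
```

Here the low-abundance variant ACGA (pooled abundance 3, below γ) is recovered into the ZOTU ACGT, while TTTT remains unassigned because its skew exceeds β(3).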
Sample pooling and sensitivity to rare sequences
Correct biological sequences with abundance <γ are lost in the denoising step and thus do
not appear in the ZOTU table. I therefore recommend pooling reads from all samples in the
denoising step rather than denoising each sample individually. In a typical dual-indexed
sequencing run, there are ~100 samples and pooling thus increases the abundance of most
correct sequences by one or two orders of magnitude, depending on how many samples
contain a given strain. A sequence with abundance <4 over ~100 samples is very rare in
the reads: it appears in at most three samples, in which case it would be a singleton in
each, and has a maximum abundance of three in 1/100 of the samples. The data can say
little about the ecological significance of this sequence. For example, it has (or should have)
no significant effect on well-chosen alpha or beta diversity measures. I would therefore
argue that in most cases, the loss in sensitivity due to setting γ=4 is inconsequential. If the
user prefers to increase sensitivity at the cost of a possibly large increase in spurious
ZOTUs, a smaller value of γ can be used.
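
The effect of pooling on the abundance filter can be illustrated with a toy calculation. The data below are invented for illustration only (100 hypothetical samples, with a rare variant present as a singleton in ten of them), and the helper name `passing_uniques` is an assumption.

```python
from collections import Counter

GAMMA = 4  # default minimum abundance for denoising


def passing_uniques(reads, gamma: int = GAMMA):
    """Return the set of unique sequences with abundance >= gamma."""
    return {seq for seq, n in Counter(reads).items() if n >= gamma}


# Toy data: every sample has 50 reads of a dominant sequence; the rare
# variant "AAAT" occurs once in each of the first ten samples.
samples = {
    f"sample{i}": ["AAAA"] * 50 + (["AAAT"] if i < 10 else [])
    for i in range(100)
}

per_sample = {name: passing_uniques(reads) for name, reads in samples.items()}
pooled = passing_uniques([r for reads in samples.values() for r in reads])

if __name__ == "__main__":
    kept_somewhere = any("AAAT" in kept for kept in per_sample.values())
    print("rare variant kept in any single sample:", kept_somewhere)
    print("rare variant kept after pooling:", "AAAT" in pooled)
```

Denoised per sample, the rare variant never reaches γ=4 and is always discarded; pooled, its abundance is 10 and it enters the denoising step.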
The authors of DADA2 suggest denoising samples individually to enable detection of
variants that would be lost by pooling. This can happen when a close variant (V)
of a more dominant strain (D) has high abundance in one or a few samples but low
abundance overall, causing V to be misidentified as D with errors. This is a valid point, and
applies equally to UNOISE2. However, there are also disadvantages to this strategy. With
~100 samples, abundances are ~100× smaller in one sample and are therefore subject to
much larger fluctuations which may degrade discrimination of errors from correct
sequences. Some low-abundance variants may be lost that would be correctly identified by
pooling, e.g., because they are singletons in some of the samples where they occur. Also, the
denoiser may make different mistakes in different samples, causing a given ZOTU to
contain different combinations of phenotypes. If that happens, ZOTUs are not directly
comparable between samples. For example, V might be correctly identified as a biological
variant in a few samples but misidentified as an error in others (this seems likely to occur
in the motivating scenario where V has low overall abundance). Then, in some samples the
ZOTU for D would contain V while in others D and V would be assigned to separate ZOTUs.
When samples are pooled, a ZOTU will always contain the same phenotypes (hopefully, but
not necessarily, just one) and this problem is avoided. With these caveats in mind, it is
reasonable to try both strategies and compare the results.
Global trimming and defining abundance
Calculating unique sequence abundance is problematic when reads of the same template
sequence vary in length, e.g. because reads are truncated when the quality score drops
below a threshold. Consider a case with two reads A and B where B is shorter but otherwise
identical to A. Here, abundance could be defined in three different ways. (1) There are two
unique sequences A and B, each with abundance one. (2) There is one unique sequence A
with abundance two. (3) There is one unique sequence B with abundance two. All of these
References
Edgar, R.C. (2010). Search and clustering orders of magnitude faster than BLAST.
TL;DR: UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters and offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets.

Callahan et al. (2016). DADA2: High-resolution sample inference from Illumina amplicon data.
TL;DR: The open-source software package DADA2 for modeling and correcting Illumina-sequenced amplicon errors is presented, revealing a diversity of previously undetected Lactobacillus crispatus variants.

Edgar et al. (2011). UCHIME improves sensitivity and speed of chimera detection.
TL;DR: UCHIME has better sensitivity than ChimeraSlayer (previously the most sensitive database method), especially with short, noisy sequences, and in testing on artificial bacterial communities with known composition, UCHIME de novo sensitivity is shown to be comparable to Perseus.

Edgar, R.C. (2013). UPARSE: highly accurate OTU sequences from microbial amplicon reads.
TL;DR: The UPARSE pipeline reports operational taxonomic unit (OTU) sequences with ≤1% incorrect bases in artificial microbial community tests, compared with >3% incorrect bases commonly reported by other methods.