
UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing


Robert C. Edgar

- 15 Oct 2016 -

bioRxiv

- pp 081257

TLDR

Describes UNOISE2, an updated version of the UNOISE algorithm for denoising (error-correcting) Illumina amplicon reads, and shows that it has comparable or better accuracy than DADA2.

Abstract:

Amplicon sequencing of tags such as 16S and ITS ribosomal RNA is a popular method for investigating microbial populations. In such experiments, sequence errors caused by PCR and sequencing are difficult to distinguish from true biological variation. I describe UNOISE2, an updated version of the UNOISE algorithm for denoising (error-correcting) Illumina amplicon reads and show that it has comparable or better accuracy than DADA2.



Robert C. Edgar

Independent Investigator

Tiburon, California, USA.

robert@drive5.com


Introduction

Recent examples of microbial tag sequencing experiments include the Human Microbiome

Project (HMP Consortium, 2012) and a survey of the Arabidopsis root

microbiome (Lundberg et al., 2012). The experimental protocol in such studies includes

amplification by PCR followed by sequencing, which introduces errors in several ways.

Amplification introduces substitution and gap errors (point errors) due to incorrect base

pairing and polymerase slippage respectively (Turnbaugh et al., 2010). PCR chimeras form

when an incomplete amplicon primes extension into a different biological template (Haas et

al., 2011). Sequencing also introduces point errors due to substitutions (incorrect base

calls) and gaps (omitted or spurious base calls). Contaminants from reagents and other

sources can introduce spurious species (Edgar, 2013). Spurious species can also be

introduced when reads are assigned to incorrect samples due to cross-talk, also known as

tag switching or barcode switching (Carlsen et al., 2012).

bioRxiv preprint, this version posted October 15, 2016; doi: https://doi.org/10.1101/081257. The copyright holder for this preprint (which was not certified by peer review) is the author/funder.

The first amplicon sequencing error-correction methods were designed for

pyrosequencing flowgrams (Quince et al., 2011, 2009; Reeder and Knight, 2010; Rosen et

al., 2013). More recently, Illumina denoisers have been described including UNOISE (Edgar

and Flyvbjerg, 2014), MED (Eren et al., 2015) and DADA2 (Callahan et al., 2016). The goal of

these methods is to infer accurate biological template sequences from noisy reads. This

task is generally divided into two phases: 1. correcting point errors to obtain an accurate

set of amplicon sequences (denoising) and 2. filtering of chimeric amplicons. The result is a

set of predicted biological sequences that I call ZOTUs (zero-radius OTUs). ZOTUs are valid

operational taxonomic units that are superior to conventional 97% OTUs for most

purposes because they provide the maximum possible biological resolution given the data

while using 97% identity may merge phenotypically different strains with distinct

sequences into a single cluster (Tikhonov et al., 2015; Callahan et al., 2016).

The high-level strategy used by UNOISE and UNOISE2 is to cluster the unique sequences in

the reads. A cluster has a centroid sequence with higher abundance plus similar sequences

(members) having lower abundances (Fig. 1). The centroid is inferred to be correct and its

members are inferred to be reads of the same template sequence containing one or more

point errors. The clustering criteria in UNOISE2 have been redesigned as described below.

UNOISE2 uses a one-pass clustering strategy that does not use quality (Q) scores and has

only two parameters with pre-set values that work well on different datasets. By contrast,

DADA2 uses quality scores in an iterative divisive partitioning clustering strategy based on

a Poisson model with hundreds of parameters (a 4×4 transition matrix for each Q score)

that is re-trained on each dataset. UNOISE2 and DADA2 are thus quite different, suggesting

that their approaches may have complementary strengths and weaknesses. With this in

mind, I will show that taking the subset of ZOTUs predicted by both algorithms reduces the

number of incorrect sequences compared with using either one alone.


UNOISE2 algorithm

Let C be a cluster centroid sequence with abundance $a_C$ and M be a member sequence of

that cluster with abundance $a_M$. Let d be the Levenshtein distance (number of differences

including both substitutions and gaps) between M and C. The abundance skew of M with

respect to C is defined to be skew(M, C) = $a_M / a_C$ (Edgar et al., 2011). If M has small enough d

and small enough skew with respect to C, then it is probably an incorrect read of C with d

point errors (Fig. 1). This intuition is made concrete by introducing the following function:

$$\beta(d) = \frac{1}{2^{\alpha d + 1}}. \qquad \text{(Eq. 1)}$$

The user-settable parameter α is set to 2 by default, giving β(1)=1/8, β(2)=1/32,

β(3)=1/128.... If skew(M, C) ≤ β(d) then M is a valid member of a cluster defined by C; i.e., β

is the maximum skew allowed for a member with d differences. As d increases, β decreases

exponentially, reflecting that more errors are less probable and the abundance skew

should therefore be lower. The β function was designed by hand as a model of error

abundance distributions obtained for several mock and in vivo Illumina datasets using the

FASTX-LEARN algorithm (http://drive5.com/usearch/manual/cmd_fastx_learn.html). This

is what physicists call a phenomenological model—a simple mathematical function that fits

the data (and only to a very rough approximation in this case; see Discussion) without

using an underlying theory.

Changing the α parameter trades sensitivity to small differences against an increase in the

number of bad sequences which are wrongly predicted to be good. For example, setting

α=3 gives β(1)=1/16, β(2)=1/128, β(3)=1/1024... which are smaller minimum skews

compared to the default α=2. Thus, with α=3 a variant with d=1 and skew between 1/16 and

1/8 is predicted as a correct sequence while with α=2 it is predicted to have one error.

Conversely, setting α=1 gives β(1)=1/4, β(2)=1/8, β(3)=1/16... so that a variant with d=1

must have a skew of at least 1/4 to be predicted as correct.
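As a rough illustration of Eq. 1 and the effect of α, the following Python sketch evaluates β(d) with exact fractions and applies the membership test skew(M, C) ≤ β(d) to a single d=1 variant. The function names and the example abundances (10 and 100 reads) are hypothetical and chosen only so that the skew falls between 1/16 and 1/8; α is assumed to be an integer here.

```python
from fractions import Fraction


def beta(d: int, alpha: int = 2) -> Fraction:
    """Maximum skew allowed for a member with d differences (Eq. 1)."""
    return Fraction(1, 2 ** (alpha * d + 1))


def skew(a_member: int, a_centroid: int) -> Fraction:
    """Abundance skew of a member with respect to its centroid."""
    return Fraction(a_member, a_centroid)


def is_error_of(a_member: int, a_centroid: int, d: int, alpha: int = 2) -> bool:
    """True if the variant is predicted to be a noisy read of the centroid."""
    return skew(a_member, a_centroid) <= beta(d, alpha)


if __name__ == "__main__":
    # Hypothetical variant: 10 reads, one difference from a 100-read centroid.
    for alpha in (1, 2, 3):
        thresholds = [str(beta(d, alpha)) for d in (1, 2, 3)]
        verdict = "error" if is_error_of(10, 100, 1, alpha) else "correct"
        print(f"alpha={alpha} beta(1..3)={thresholds} skew=1/10 -> {verdict}")
```

With these toy numbers the variant is classified as an error for α=1 and α=2 and as a correct sequence for α=3, matching the trade-off described in the text.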


Input to the UNOISE2 algorithm is the set of unique read sequences with abundance ≥γ,

where γ=4 by default. Low-abundance uniques are discarded because they are more prone

to contain errors that are reproduced by chance or bias. A database of cluster centroids is

initially empty. Sequences are considered in order of decreasing abundance. A sequence

(Q) is assigned to cluster C if skew(Q, C) ≤ β(d). If no such C exists, Q becomes a new

centroid. The final set of centroids are reported as the predicted amplicons. These

amplicons are filtered by the UCHIME2 algorithm using denoised de novo mode (Edgar,

2016).
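The one-pass clustering described above can be sketched in Python as follows. This is a minimal illustration, not the USEARCH implementation: the function names are hypothetical, the Levenshtein distance is a plain dynamic-programming version, a sequence is assumed to join the first existing centroid that satisfies the skew criterion (the text does not say how ties are resolved), and the UCHIME2 chimera filtering step is omitted.

```python
from fractions import Fraction


def levenshtein(s: str, t: str) -> int:
    """Number of differences (substitutions and gaps) between s and t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                 # gap
                           cur[j - 1] + 1,              # gap
                           prev[j - 1] + (cs != ct)))   # match or substitution
        prev = cur
    return prev[-1]


def beta(d: int, alpha: int = 2) -> Fraction:
    """Maximum skew allowed for a member with d differences (Eq. 1)."""
    return Fraction(1, 2 ** (alpha * d + 1))


def denoise(uniques: dict, alpha: int = 2, gamma: int = 4) -> list:
    """uniques maps each unique sequence to its abundance.

    Returns the centroid sequences in order of decreasing abundance.
    """
    centroids = []  # (sequence, abundance) pairs, initially empty
    for seq, a in sorted(uniques.items(), key=lambda kv: -kv[1]):
        if a < gamma:
            continue  # low-abundance uniques are discarded
        for cseq, ca in centroids:
            d = levenshtein(seq, cseq)
            if Fraction(a, ca) <= beta(d, alpha):
                break  # seq is predicted to be a noisy read of cseq
        else:
            centroids.append((seq, a))  # no such centroid: seq becomes one
    return [cseq for cseq, _ in centroids]


if __name__ == "__main__":
    # Toy abundances, invented for illustration only.
    toy = {
        "ACGTACGTAC": 1000,
        "ACGTTCGTAC": 400,  # d=1, skew 0.4 > 1/8
        "ACGTACGTAT": 50,   # d=1, skew 0.05 <= 1/8
        "TTGTACGTAC": 20,   # d=2, skew 0.02 <= 1/32
        "ACGAACGAAC": 3,    # below gamma
    }
    print(denoise(toy))
```

Comparing every sequence against every centroid is quadratic in the worst case; a practical implementation would need a faster search, which this sketch does not attempt.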

ZOTU table construction

A table with the number of reads for each ZOTU in each sample is constructed by

considering all reads before any quality filtering, including those with abundance <γ. The

same matching criteria are used, but no new centroids are created. Thus, if a read R is

identical to ZOTU C, or if skew(R, C) ≤ β(d), then R is assigned to C. In practice, a large

majority of reads with low quality or low unique sequence abundance are due to errors in

high-abundance ZOTU sequences, and this procedure thus improves sensitivity by

recovering most of the reads that were discarded for the denoising step. Most of the reads

not assigned to ZOTUs are usually accounted for by chimeras, which the user can verify by

making a ZOTU table using predicted amplicons prior to chimera filtering.
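A possible reading of this table-building step is sketched below in Python. Several details are assumptions made for the example rather than statements from the text: the abundance of a read is taken to be the pooled count of its unique sequence, the abundance of a ZOTU is its own pooled count, and when several ZOTUs satisfy the criterion the one with the fewest differences is chosen. All names are hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def levenshtein(s: str, t: str) -> int:
    """Number of differences (substitutions and gaps) between s and t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def beta(d: int, alpha: int = 2) -> Fraction:
    """Maximum skew allowed for a member with d differences (Eq. 1)."""
    return Fraction(1, 2 ** (alpha * d + 1))


def zotu_table(reads, zotus, alpha: int = 2):
    """reads: iterable of (sample, sequence) pairs; zotus: list of sequences.

    Returns {sample: Counter({zotu: read count})}. No new centroids are made;
    reads matching no ZOTU are left out of the table.
    """
    reads = list(reads)
    abundance = Counter(seq for _, seq in reads)  # pooled counts (assumption)
    assignment = {}
    for seq, a in abundance.items():
        best = None  # (d, zotu) with the smallest d seen so far
        for z in zotus:
            d = levenshtein(seq, z)
            if d == 0:
                best = (0, z)  # identical to the ZOTU
                break
            az = abundance.get(z, 0)
            if az and Fraction(a, az) <= beta(d, alpha):
                if best is None or d < best[0]:
                    best = (d, z)
        assignment[seq] = best[1] if best else None
    table = defaultdict(Counter)
    for sample, seq in reads:
        z = assignment[seq]
        if z is not None:
            table[sample][z] += 1
    return table


if __name__ == "__main__":
    # Toy reads, invented for illustration only.
    toy_reads = (
        [("S1", "ACGTACGT")] * 16 + [("S1", "ACGTACGA")]
        + [("S2", "ACGTACGT")] * 8 + [("S2", "ACGTACGA")]
        + [("S2", "TTTTCCCC")] * 5 + [("S2", "GGGGGGGG")]
    )
    for sample, counts in sorted(zotu_table(toy_reads, ["ACGTACGT", "TTTTCCCC"]).items()):
        print(sample, dict(counts))
```

In the toy data the two reads of ACGTACGA are recovered into the ACGTACGT ZOTU even though they would have been discarded before denoising, while the unrelated GGGGGGGG read stays unassigned.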

Sample pooling and sensitivity to rare sequences

Correct biological sequences with abundance <γ are lost in the denoising step and thus do

not appear in the ZOTU table. I therefore recommend pooling reads from all samples in the

denoising step rather than denoising each sample individually. In a typical dual-indexed

sequencing run, there are ~100 samples and pooling thus increases the abundance of most

correct sequences by one or two orders of magnitude, depending on how many samples

contain a given strain. A sequence with abundance <4 over ~100 samples is very rare in

the reads—it appears in at most three samples, in which case it would be a singleton in

each, and has a maximum abundance of three in 1/100 of the samples. The data can say

little about the ecological significance of this sequence. For example, it has (or should have)


no significant effect on well-chosen alpha or beta diversity measures. I would therefore

argue that in most cases, the loss in sensitivity due to setting γ=4 is inconsequential. If the

user prefers to increase sensitivity at the cost of a possibly large increase in spurious

ZOTUs, a smaller value of γ can be used.
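The effect of pooling on the abundance threshold γ can be illustrated with a short Python sketch. The sample names and sequences are invented for the example, and the helper `uniques_passing` is hypothetical; it only applies the abundance ≥γ filter and does no denoising.

```python
from collections import Counter


def uniques_passing(samples: dict, gamma: int = 4, pooled: bool = True) -> set:
    """Return the unique sequences with abundance >= gamma.

    samples maps a sample name to its list of read sequences. With
    pooled=True abundances are summed over all samples first; otherwise
    the threshold is applied to each sample on its own.
    """
    if pooled:
        counts = Counter(seq for reads in samples.values() for seq in reads)
        return {seq for seq, n in counts.items() if n >= gamma}
    kept = set()
    for reads in samples.values():
        counts = Counter(reads)
        kept |= {seq for seq, n in counts.items() if n >= gamma}
    return kept


if __name__ == "__main__":
    toy = {
        "S1": ["AAAA"] * 10 + ["CCCC"] * 2,
        "S2": ["AAAA"] * 7 + ["CCCC"] * 2,
        "S3": ["AAAA"] * 5 + ["CCCC"] * 2 + ["GGGG"],
    }
    print("pooled:    ", sorted(uniques_passing(toy, pooled=True)))
    print("per sample:", sorted(uniques_passing(toy, pooled=False)))
```

Here CCCC has only two reads in each sample but six reads in the pool, so it is available to the denoiser only when samples are pooled; the singleton GGGG is dropped either way.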

The authors of DADA2 suggest denoising samples individually to enable detection of

variants that would be lost by pooling. This happens in a scenario when a close variant (V)

of a more dominant strain (D) has high abundance in one or a few samples but low

abundance overall, causing V to be misidentified as D with errors. This is a valid point, and

applies equally to UNOISE2. However, there are also disadvantages to this strategy. With

~100 samples, abundances are ~100× smaller in one sample and are therefore subject to

much larger fluctuations which may degrade discrimination of errors from correct

sequences. Some low-abundance variants may be lost that would be correctly identified by

pooling, e.g., because they are singletons in some of the samples where they occur. Also, the

denoiser may make different mistakes in different samples, causing a given ZOTU to

contain different combinations of phenotypes. If that happens, ZOTUs are not directly

comparable between samples. For example, V might be correctly identified as a biological

variant in a few samples but misidentified as an error in others (this seems likely to occur

in the motivating scenario where V has low overall abundance). Then, in some samples the

ZOTU for D would contain V while in others D and V would be assigned to separate ZOTUs.

When samples are pooled, a ZOTU will always contain the same phenotypes (hopefully, but

not necessarily, just one) and this problem is avoided. With these caveats in mind, it is

reasonable to try both strategies and compare the results.

Global trimming and defining abundance

Calculating unique sequence abundance is problematic when reads of the same template

sequence vary in length, e.g. because reads are truncated when the quality score drops

below a threshold. Consider a case with two reads A and B where B is shorter but otherwise

identical to A. Here, abundance could be defined in three different ways. (1) There are two

unique sequences A and B, each with abundance one. (2) There is one unique sequence A

with abundance two. (3) There is one unique sequence B with abundance two. All of these


Figures

Table 3. Results on the Extreme mock community. The table shows strains identified by UNOISE2 or DADA2. Shaded rows are ZOTUs which are not found in the mock reference database but are exact matches to SILVA. U&D is ZOTUs predicted by both UNOISE2 and DADA2.

Figure 2. Chimeras predicted by DADA2 and UNOISE2 on the Soil1 dataset. Each histogram bar gives the number of predicted chimeras in an abundance ratio (AR) range labeled by its upper value, so the first bin contains chimeras with 1.0≤AR<1.2, the second 1.2≤AR<1.4 and so on. The last bin has all chimeras with AR>10. Notice that DADA2 predicts more than twice as many chimeras as UCHIME2, many of which have AR<2, while most chimeras would be expected to have AR≥2 because the parents undergo at least one more round of PCR amplification. In the bins with AR≥2 the two programs agree on most predictions. DADA2 predicts a few more in each bin because it sometimes allows one difference in the chimeric model built from the putative parent sequences while UCHIME2 always requires an exact match.

Table 1. Datasets used for testing. All datasets contain MiSeq paired-end reads. A random subset of 1M read pairs was extracted from the Mock2 and Soil1 datasets because DADA2 failed to converge in the model estimation step when all reads were input. Mock1 and Mock2 are the mock communities with 21 strains (Haas et al., 2011) used to validate sequencing protocols in the Human Microbiome Project. The Extreme mock community has 27 strains (Callahan et al., 2016).

Figure 1. Schematic of the UNOISE2 denoising strategy. The left panel shows the neighborhood close to a high-abundance unique read sequence X, grouped by the number of sequence differences (d). Dots are unique sequences, the size of a dot indicates its abundance. Green dots are correct biological sequences; red dots have one or more errors. Neighbors with small numbers of differences and small abundance compared to X are predicted to be bad reads of X. The right panel shows the denoised amplicons. Here, X and b were correctly predicted, e is an error with anomalously high abundance that was wrongly predicted to be correct, f is an error that was correctly discarded but has an abundance almost high enough to be a false positive, and g is a low-abundance correct amplicon that was wrongly discarded. The abundances of b, e, and f are similar, illustrating the fundamental challenge in denoising: how to set an abundance threshold that distinguishes correct sequences from errors.

Table 2. Results on test datasets. See main text for explanation of column headings and discussion of the results. U&D is the consensus of UNOISE2 and DADA2, i.e. ZOTUs predicted by both algorithms.

Citations


Exact sequence variants should replace operational taxonomic units in marker-gene data analysis.

Benjamin J. Callahan, +2 more

- 21 Jul 2017 -

The ISME Journal

TL;DR: It is argued that the improvements in reusability, reproducibility and comprehensiveness are sufficiently great that ASVs should replace OTUs as the standard unit of marker-gene analysis and reporting.

UPARSE: highly accurate OTU sequences from microbial amplicon reads

Robert C. Edgar

- 01 Oct 2013 -

Nature Methods

TL;DR: The UPARSE pipeline reports operational taxonomic unit (OTU) sequences with ≤1% incorrect bases in artificial microbial community tests, compared with >3% incorrect bases commonly reported by other methods.

Search and clustering orders of magnitude faster than BLAST

Robert C. Edgar

- 01 Oct 2010 -

Bioinformatics

DADA2: High-resolution sample inference from Illumina amplicon data

Benjamin J. Callahan, +5 more

- 01 Jul 2016 -

Nature Methods

The SILVA ribosomal RNA gene database project: improved data processing and web-based tools

Christian Quast, +7 more

- 28 Nov 2012 -

Nucleic Acids Research

Cutadapt removes adapter sequences from high-throughput sequencing reads

Marcel Martin

- 02 May 2011 -

EMBnet.journal

Frequently Asked Questions (16)

Q1. What have the authors contributed in "UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing"?

Recent examples of microbial tag sequencing experiments include the Human Microbiome Project (HMP Consortium, 2012) and a survey of the Arabidopsis root microbiome (Lundberg et al., 2012). The experimental protocol in such studies includes amplification by PCR followed by sequencing, which introduces errors in several ways. Spurious species can also be introduced when reads are assigned to incorrect samples due to cross-talk, also known as tag switching or barcode switching (Carlsen et al., 2012).

Q2. What is the way to ensure that reads of the same template have the same length?

If multiple primers were used which do not bind to the same locus, then trimming is required to ensure that reads of the same template amplified by different primers start and end at the same position in the biological sequence.

Q3. Why is denoising more effective with quality-filtered reads?

Denoising is more effective with quality-filtered reads because sequencing error bias can cause some errors to have sufficiently high abundances that they could be mistaken for biological variants, and these often have lower quality scores.

Q4. How many different 16S sequences are in the reference database?

The reference database for the HMP mock community (21 strains) has 115 different 16S sequences, an average of 5.5 distinct 16S sequences per strain.

Q5. Why is it problematic to cut the sequences?

Global trimming and defining abundance Calculating unique sequence abundance is problematic when reads of the same template sequence vary in length, e.g. because reads are truncated when the quality score drops below a threshold.

Q6. What is the way to classify a sequence as non-chimeric?

A sequence cannot be reliably classified as nonchimeric unless it is identical to a reference sequence (Edgar, 2016), and amplicons with uncorrected point errors therefore cannot be reliably classified.

Q7. How many chimeras will have abundance ratios <2?

If fluctuations in the abundance ratio are equally likely to give values <2 and >2, then approximately half of the chimeras formed in the first round will have abundance ratio <2, i.e. 1/(2N).

Q8. What are some examples of microbial tag sequencing experiments?

Recent examples of microbial tag sequencing experiments include the Human Microbiome Project (HMP Consortium, 2012) and a survey of the Arabidopsis root microbiome (Lundberg et al., 2012).

Q9. What is the probability that many of the UCHIME2 predictions are also false positives?

Given that UCHIME2 agrees with 400/657 of the DADA2 chimera predictions with ratios >2, it seems likely that many of the UCHIME2 predictions are also false positives, despite using more stringent parameters (no differences allowed in the model, abundance ratio ≥2).

Q10. Why is the abundance of a given sequence lost?

Some low-abundance variants may be lost that would be correctly identified by pooling, e.g., because they are singletons in some of the samples where they occur.

Q11. What is the protocol for amplification followed by sequencing?

The experimental protocol in such studies includes amplification by PCR followed by sequencing, which introduces errors in several ways.

Q12. What is the probability of false positive chimeras?

I believe that false positive chimeras will have a much higher frequency than uncorrected point errors, given the high accuracy of DADA2 on most of the mock datasets and the observation that fake chimeric models are very common, especially when differences are allowed (Edgar, 2016).

Q13. What is the maximum skew allowed for a member with d differences?

If skew(M, C) ≤ β(d) then M is a valid member of a cluster defined by C; i.e., β is the maximum skew allowed for a member with d differences.

Q14. What is the way to avoid the problems of global trimming?

These problems are avoided by ensuring that reads of the same template sequence have the same length (global trimming, implying that reads of the same template should be globally alignable, though more distantly related sequences need not be).

Q15. What is the way to define a sequence with high abundance?

With (1), a given template sequence with high abundance in the amplicons will typically have many different unique sequences with low abundances because its reads are truncated to many different lengths.

Q16. What is the name of the first amplicon sequencing error correction method?

The first amplicon sequencing error-correction methods were designed for pyrosequencing flowgrams (Quince et al., 2011, 2009; Reeder and Knight, 2010; Rosen et al., 2013).


References

Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities

Search and clustering orders of magnitude faster than BLAST

DADA2: High-resolution sample inference from Illumina amplicon data

UCHIME improves sensitivity and speed of chimera detection

UPARSE: highly accurate OTU sequences from microbial amplicon reads
