
SINTAX: a simple non-Bayesian
taxonomy classifier for 16S and ITS
sequences
Robert C. Edgar
Independent Investigator
Tiburon, California, USA.
robert@drive5.com
Abstract
Metagenomics experiments often characterize microbial communities by sequencing the
ribosomal 16S and ITS regions. Taxonomy prediction is a fundamental step in such studies.
The SINTAX algorithm predicts taxonomy by using k-mer similarity to identify the top hit in
a reference database and provides bootstrap confidence for all ranks in the prediction.
SINTAX achieves accuracy comparable to or better than the RDP Naive Bayesian Classifier with a
simpler algorithm that does not require training. Most tested methods are shown to have
high rates of over-classification errors where novel taxa are incorrectly predicted to have
known names.
bioRxiv preprint doi: https://doi.org/10.1101/074161; this version posted September 9, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
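
To make the approach described in the abstract concrete, here is a minimal Python sketch of a k-mer top-hit classifier with bootstrap confidence. It illustrates the general idea only and is not the SINTAX implementation: the k-mer length, subsample size, iteration count and data structures are assumptions chosen for the example.

# Minimal sketch of k-mer top-hit classification with bootstrap confidence.
# The parameters below (k=8, 32 k-mers per iteration, 100 iterations) are
# illustrative assumptions, not values taken from the text.
import random
from collections import Counter

K = 8
SUBSET_SIZE = 32
ITERATIONS = 100

def kmer_set(seq, k=K):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify(query, references, seed=1):
    """references: list of (taxonomy_tuple, sequence) pairs.
    Returns the taxonomy of the top hit over all k-mers, plus a per-rank
    bootstrap confidence: the fraction of k-mer subsamples whose top hit
    agrees with that prediction at each rank."""
    rng = random.Random(seed)
    ref_kmers = [(tax, kmer_set(seq)) for tax, seq in references]
    query_kmers = list(kmer_set(query))

    # Top hit using all k-mers (largest shared-k-mer count) gives the prediction.
    best_tax, _ = max(ref_kmers, key=lambda tr: len(tr[1].intersection(query_kmers)))

    # Bootstrap: repeat the search on random k-mer subsets and count how often
    # each rank of the prediction is reproduced.
    agree = Counter()
    for _ in range(ITERATIONS):
        subset = rng.sample(query_kmers, min(SUBSET_SIZE, len(query_kmers)))
        hit_tax, _ = max(ref_kmers, key=lambda tr: len(tr[1].intersection(subset)))
        for rank, name in enumerate(hit_tax):
            if rank < len(best_tax) and name == best_tax[rank]:
                agree[rank] += 1

    confidence = [agree[rank] / ITERATIONS for rank in range(len(best_tax))]
    return best_tax, confidence

With a scheme of this kind, if the top hit shares many more k-mers with the query than any other reference sequence, most subsamples will recover the same hit and its ranks will receive high bootstrap confidence.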

Introduction
Sequencing of tags such as the ribosomal 16S gene and fungal internal transcribed spacer
(ITS) region is a popular method for surveying microbial communities. Recent examples
include the Human Microbiome Project (HMP Consortium, 2012) and a survey of the
Arabidopsis root microbiome (Lundberg et al., 2012). A fundamental step in such studies is
to predict the taxonomy of sequences found in the reads. The most popular method is
currently the RDP Naive Bayesian Classifier (Wang et al., 2007) (hereafter RDP). Additional
taxonomy prediction methods are supported by QIIME (Caporaso et al., 2010) and mothur
(Schloss et al., 2009).
Reference databases
Taxonomy prediction requires a reference database containing sequences with taxonomy
annotations. Authoritative prokaryotic sequence classifications exist for at most the
~12,000 named species belonging to ~2,300 genera which represent only a tiny fraction of
extant species (Yarza et al., 2014). Available databases include the RDP training sets, the
full RDP database (RDPDB) (Maidak et al., 2001), SILVA (Pruesse et al., 2007), Greengenes
(DeSantis et al., 2006) and UNITE (Kõljalg et al., 2013). The RDP 16S training set v16 (RTS)
has 13,212 sequences belonging to 2,126 genera while the RDP Warcup ITS training set
(Deshpande et al., 2015) v2 has 18,878 sequences belonging to 8,551 species. The RDP
training sets contain only sequences with authoritative names and are therefore much
smaller than SILVA, Greengenes and UNITE which include environmental sequences. SILVA
v123 has 1.8M small subunit ribosomal RNA sequences; v114 was estimated to contain
~94,000 genera (Yarza et al., 2014). Greengenes v13.5 has 1.8M 16S sequences. UNITE
release 01.08.2015 has 476k ITS sequences representing ~71,000 species. Most taxonomy
annotations in SILVA and Greengenes are predictions obtained by computational and
manual analyses which are primarily based on trees predicted from multiple alignments
(McDonald et al., 2012; Yilmaz et al., 2014); in RDPDB most annotations are predicted by
RDP. In the 16S databases (RDPDB, SILVA and Greengenes), no attempt is made to classify
unnamed groups, while UNITE assigns numerical “species hypothesis” identifiers to
unnamed clusters.

By default, QIIME uses a subset of Greengenes clustered at 97% identity (GGQ, containing
99k sequences in v13.8), and mothur recommends a subset of SILVA (SILVAM, containing
172k sequences in v123). The RDP web site and stand-alone software use the RDP training
sets.
Database coverage and novel taxa
If a query sequence is found in the database, its taxonomy is naively given by the reference
annotation. This prediction may be wrong if the database has annotation errors or multiple
species are identical over the sequenced region, which often happens with short tags such
as the popular V4 hypervariable region of 16S. The latter scenario cannot be reliably
identified by checking the database for other identical sequences because the reference
data may be incomplete. If the query sequence is not found in the database then prediction
is more difficult. For example, using a 95% identity threshold for clustering full-length 16S
sequences was found to give groups that best approximate genera (Yarza et al., 2014).
Thus, if a 16S sequence has 95% identity with a database hit, it might be in the same genus
but, since identity correlates only approximately with taxonomic rank, it could belong only to the same family or the same class. Or, it could belong to the same species if there is atypically large variation between paralogs or strains. From this perspective, the task of
taxonomy prediction is to estimate the lowest common rank (LCR) between the query and
the database. A query rank r is known if r ≥ LCR, i.e. at least one member of its clade is
present in the reference database (regardless of whether it is named) and novel if r < LCR.
The coverage of a reference database at a given rank with respect to a set of query
sequences is the fraction of queries that are known, and novelty = (1 − coverage) is the
fraction of queries that are novel. The mean top-hit identity (MTI) between query
sequences and their top hits can be used as an approximate indication of coverage. To
obtain typical query sets, I constructed OTUs at 97% identity using UPARSE (Edgar, 2013)
from V4 reads of human gut, mouse gut and soil communities respectively (Kozich et al.,
2013) and ITS reads of a soil fungal community (Schmidt et al., 2013). MTIs of these
samples vs. commonly-used reference databases are shown in Table 1. All V4 samples have
MTI<95% with RTS, suggesting that many, perhaps most, OTUs belong to novel genera,
especially in soil (MTI=88%).
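
As a concrete illustration of these definitions, the following sketch computes MTI and an approximate novelty estimate from per-query top-hit identities. The identity values are hypothetical, and the 95% threshold is the approximate genus-level value from Yarza et al. discussed above.

# Sketch: mean top-hit identity (MTI) and an approximate novelty estimate.
# The identities would come from searching query OTUs against a reference
# database; the example values below are made up.
def mean_top_hit_identity(identities):
    """identities: top-hit %identity of each query OTU vs. the reference."""
    return sum(identities) / len(identities)

def approximate_novelty(identities, rank_threshold=95.0):
    """Fraction of queries whose top-hit identity falls below a rank-level
    threshold, i.e. an approximate novelty = (1 - coverage) at that rank."""
    novel = sum(1 for pid in identities if pid < rank_threshold)
    return novel / len(identities)

identities = [99.2, 96.4, 93.1, 88.7, 100.0]
print(mean_top_hit_identity(identities))  # MTI = 95.48
print(approximate_novelty(identities))    # 0.4, rough fraction of novel genera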

RDP leave-one-out validation
RDP was tested on 16S and ITS sequences using leave-one-out validation (Wang et al.,
2007; Deshpande et al., 2015) where one query sequence is extracted from the training set
(RTS and Warcup, respectively) and classified using the remaining sequences as a
reference. Accuracy (Acc
RDP
) is calculated as the fraction of sequences that are correctly
classified. Roughly half (1,119 / 2,472) of the genera in RTS are singletons, i.e. have exactly
one training sequence, while about a quarter (2,258 / 8,548) of the species in Warcup are
singletons, comprising 8% (16S) and 13% (ITS) of the training sequences. A singleton
cannot be classified correctly in a leave-one-out test because no training sequences are left
for its clade so that the maximum achievable Acc_RDP by an ideal algorithm is the fraction of
non-singleton taxa, i.e. 92% for 16S genus and 87% for ITS species, rather than 100% as
would usually be expected for an accuracy measure. The average number of non-singleton
training sequences is 9 per genus in RTS and 14 per species in Warcup which suggests that
correct classification should be relatively easy for most queries, while in practice many
genera will be novel, and taxa that are rare in the database may be common in the query set
and vice versa. Also, all predictions are included in Acc_RDP regardless of their bootstrap confidence values rather than using the authors' recommended parameters (here, 80% cutoff) as would usually be expected for a benchmark test. In summary, the RDP leave-one-out test does not model typical query datasets and Acc_RDP does not give a realistic estimate
of accuracy by any conventional definition.
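
The singleton argument can be made concrete with a short sketch that computes this accuracy ceiling; the input format and example labels are hypothetical.

# Sketch: the ceiling that singleton taxa impose on leave-one-out accuracy.
# taxon_labels gives the taxon of each training sequence at the tested rank
# (genus for RTS, species for Warcup); the example labels are made up.
from collections import Counter

def max_leave_one_out_accuracy(taxon_labels):
    """A sequence whose taxon has only one training sequence cannot be
    classified correctly when it is held out, so the best possible accuracy
    is the fraction of sequences belonging to non-singleton taxa."""
    counts = Counter(taxon_labels)
    non_singleton_seqs = sum(n for n in counts.values() if n > 1)
    return non_singleton_seqs / len(taxon_labels)

labels = ["Bacillus"] * 5 + ["Clostridium"] * 3 + ["GenusX"] + ["GenusY"]
print(max_leave_one_out_accuracy(labels))  # 0.8: two singleton genera out of four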
Methods
Performance metrics
Sensitivity should be measured as the fraction of known queries that are correctly
identified so that the highest achievable sensitivity by an ideal algorithm is 100%. If novel
queries were also counted then sensitivity <100% would reflect an opaque combination of
low database coverage and failures to correctly predict known taxa, as with Acc_RDP. It is
useful to distinguish two types of false positive error: misclassifications, where an incorrect
name is predicted for a known rank, and over-classifications, where a name is predicted for
a novel rank. For a given query set, reference database and taxonomic rank, let N_known and N_novel be the number of queries with known and novel taxa respectively. Let TP be the number of correct predictions, FP_mis be the number of misclassification errors and FP_over be the number of over-classification errors. The total number of queries is N = N_known + N_novel.
The following accuracy metrics can now be defined:
Sensitivity = TP / N_known,
Misclassification rate = MC = FP_mis / N_known,
Over-classification rate = OC = FP_over / N_novel,
Errors per query = EPQ = (FP_mis + FP_over) / N.
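
A short sketch of how these metrics could be computed from per-query outcomes follows; the record format (a 'known' flag plus an outcome label) is an assumption for illustration.

# Sketch: computing the metrics defined above from per-query outcomes.
# Assumes the query set contains both known and novel queries.
def accuracy_metrics(queries):
    """queries: list of dicts with keys
       'known'   : True if the query's taxon at the tested rank is present
                   in the reference database,
       'outcome' : 'correct', 'misclassified', 'overclassified' or 'none'.
    Returns Sensitivity, MC, OC and EPQ as defined in the text."""
    n_known = sum(q["known"] for q in queries)
    n_novel = len(queries) - n_known
    tp = sum(q["known"] and q["outcome"] == "correct" for q in queries)
    fp_mis = sum(q["known"] and q["outcome"] == "misclassified" for q in queries)
    fp_over = sum((not q["known"]) and q["outcome"] == "overclassified" for q in queries)
    return {
        "Sensitivity": tp / n_known,
        "MC": fp_mis / n_known,
        "OC": fp_over / n_novel,
        "EPQ": (fp_mis + fp_over) / len(queries),
    }

# Example: 3 known queries (2 correct, 1 misclassified) and 2 novel queries
# (1 over-classified, 1 correctly left unclassified).
example = [
    {"known": True,  "outcome": "correct"},
    {"known": True,  "outcome": "correct"},
    {"known": True,  "outcome": "misclassified"},
    {"known": False, "outcome": "overclassified"},
    {"known": False, "outcome": "none"},
]
print(accuracy_metrics(example))  # Sensitivity~0.67, MC~0.33, OC=0.5, EPQ=0.4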
To a first approximation, we might expect misclassification and over-classification rates to
be similar on different datasets because these measures reflect intrinsic characteristics of
an algorithm independent of the data while EPQ, the measure that is typically of most
interest in practice, will strongly depend on database coverage (equivalently, on query
novelty). For example, if a query set contains mostly known sequences, we would expect
errors to be rare and dominated by misclassifications, while if a query set is highly novel
then there may be many over-classifications. If these expectations are correct, then values of
MC and OC measured on a benchmark test will be similar to those obtained on biological
data in practice while EPQ will be similar only if the benchmark has similar rates of novel
taxa.
Clade partition cross-validation (CPX)
If high ranks are usually known but low ranks are often novel, then a benchmark test
should contain a mix of known and novel taxa at low ranks so that both MC and OC can be
measured. This can be achieved by clade partition cross-validation (CPX), as follows. Clades
at a given rank r_part from a reference database are partitioned so that a randomly-chosen half of the daughter groups in a given clade are assigned to the query set and the other half to the reference set, so that ranks below r_part are always novel. For example, if r_part = family, then half of the genera for a given family are assigned to the query and half to the reference set. Singletons are always assigned to the query set, so are always novel while non-

References
Wang, Q., Garrity, G.M., Tiedje, J.M. and Cole, J.R. (2007) Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Applied and Environmental Microbiology, 73, 5261–5267.
Edgar, R.C. (2013) UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nature Methods, 10, 996–998.