
SINTAX: a simple non-Bayesian
taxonomy classifier for 16S and ITS
sequences
Robert C. Edgar
Independent Investigator
Tiburon, California, USA.
robert@drive5.com
Abstract
Metagenomics experiments often characterize microbial communities by sequencing the
ribosomal 16S and ITS regions. Taxonomy prediction is a fundamental step in such studies.
The SINTAX algorithm predicts taxonomy by using k-mer similarity to identify the top hit in
a reference database and provides bootstrap confidence for all ranks in the prediction.
SINTAX achieves accuracy comparable to or better than the RDP Naive Bayesian Classifier with a
simpler algorithm that does not require training. Most tested methods are shown to have
high rates of over-classification errors where novel taxa are incorrectly predicted to have
known names.
bioRxiv preprint doi: https://doi.org/10.1101/074161; this version posted September 9, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
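
To make the approach described in the abstract concrete, here is a minimal Python sketch of a k-mer top-hit classifier with bootstrap confidence. It illustrates the general idea only and is not the SINTAX implementation: the k-mer length, subsample size, iteration count and data structures are assumptions chosen for the example.

# Minimal sketch of k-mer top-hit classification with bootstrap confidence.
# The parameters below (k=8, 32 k-mers per iteration, 100 iterations) are
# illustrative assumptions, not values taken from the text.
import random
from collections import Counter

K = 8
SUBSET_SIZE = 32
ITERATIONS = 100

def kmer_set(seq, k=K):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify(query, references, seed=1):
    """references: list of (taxonomy_tuple, sequence) pairs.
    Returns the taxonomy of the top hit over all k-mers, plus a per-rank
    bootstrap confidence: the fraction of k-mer subsamples whose top hit
    agrees with that prediction at each rank."""
    rng = random.Random(seed)
    ref_kmers = [(tax, kmer_set(seq)) for tax, seq in references]
    query_kmers = list(kmer_set(query))

    # Top hit using all k-mers (largest shared-k-mer count) gives the prediction.
    best_tax, _ = max(ref_kmers, key=lambda tr: len(tr[1].intersection(query_kmers)))

    # Bootstrap: repeat the search on random k-mer subsets and count how often
    # each rank of the prediction is reproduced.
    agree = Counter()
    for _ in range(ITERATIONS):
        subset = rng.sample(query_kmers, min(SUBSET_SIZE, len(query_kmers)))
        hit_tax, _ = max(ref_kmers, key=lambda tr: len(tr[1].intersection(subset)))
        for rank, name in enumerate(hit_tax):
            if rank < len(best_tax) and name == best_tax[rank]:
                agree[rank] += 1

    confidence = [agree[rank] / ITERATIONS for rank in range(len(best_tax))]
    return best_tax, confidence

With a scheme of this kind, if the top hit shares many more k-mers with the query than any other reference sequence, most subsamples will recover the same hit and its ranks will receive high bootstrap confidence.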

Introduction
Sequencing of tags such as the ribosomal 16S gene and fungal internal transcribed spacer
(ITS) region is a popular method for surveying microbial communities. Recent examples
include the Human Microbiome Project (HMP Consortium, 2012) and a survey of the
Arabidopsis root microbiome (Lundberg et al., 2012). A fundamental step in such studies is
to predict the taxonomy of sequences found in the reads. The most popular method is
currently the RDP Naive Bayesian Classifier (Wang et al., 2007) (hereafter RDP). Additional
taxonomy prediction methods are supported by QIIME (Caporaso et al., 2010) and mothur
(Schloss et al., 2009).
Reference databases
Taxonomy prediction requires a reference database containing sequences with taxonomy
annotations. Authoritative prokaryotic sequence classifications exist for at most the
~12,000 named species belonging to ~2,300 genera which represent only a tiny fraction of
extant species (Yarza et al., 2014). Available databases include the RDP training sets, the
full RDP database (RDPDB) (Maidak et al., 2001), SILVA (Pruesse et al., 2007), Greengenes
(DeSantis et al., 2006) and UNITE (Kõljalg et al., 2013). The RDP 16S training set v16 (RTS)
has 13,212 sequences belonging to 2,126 genera while the RDP Warcup ITS training set
(Deshpande et al., 2015) v2 has 18,878 sequences belonging to 8,551 species. The RDP
training sets contain only sequences with authoritative names and are therefore much
smaller than SILVA, Greengenes and UNITE which include environmental sequences. SILVA
v123 has 1.8M small subunit ribosomal RNA sequences; v114 was estimated to contain
~94,000 genera (Yarza et al., 2014). Greengenes v13.5 has 1.8M 16S sequences. UNITE
release 01.08.2015 has 476k ITS sequences representing ~71,000 species. Most taxonomy
annotations in SILVA and Greengenes are predictions obtained by computational and
manual analyses which are primarily based on trees predicted from multiple alignments
(McDonald et al., 2012; Yilmaz et al., 2014); in RDPDB most annotations are predicted by
RDP. In the 16S databases (RDPDB, SILVA and Greengenes), no attempt is made to classify
unnamed groups, while UNITE assigns numerical “species hypothesis” identifiers to
unnamed clusters.

By default, QIIME uses a subset of Greengenes clustered at 97% identity (GGQ, containing
99k sequences in v13.8), and mothur recommends a subset of SILVA (SILVAM, containing
172k sequences in v123). The RDP web site and stand-alone software use the RDP training
sets.
Database coverage and novel taxa
If a query sequence is found in the database, its taxonomy is naively given by the reference
annotation. This prediction may be wrong if the database has annotation errors or multiple
species are identical over the sequenced region, which often happens with short tags such
as the popular V4 hypervariable region of 16S. The latter scenario cannot be reliably
identified by checking the database for other identical sequences because the reference
data may be incomplete. If the query sequence is not found in the database then prediction
is more difficult. For example, using a 95% identity threshold for clustering full-length 16S
sequences was found to give groups that best approximate genera (Yarza et al., 2014).
Thus, if a 16S sequence has 95% identity with a database hit, it might be in the same genus
but, since identity correlates only approximately with taxonomic rank, it could belong only to the same family or the same class. Or, it could belong to the same species if there is atypically large variation between paralogs or strains. From this perspective, the task of
taxonomy prediction is to estimate the lowest common rank (LCR) between the query and
the database. A query rank r is known if r ≥ LCR, i.e. at least one member of its clade is
present in the reference database (regardless of whether it is named) and novel if r < LCR.
The coverage of a reference database at a given rank with respect to a set of query
sequences is the fraction of queries that are known, and novelty = (1 − coverage) is the
fraction of queries that are novel. The mean top-hit identity (MTI) between query
sequences and their top hits can be used as an approximate indication of coverage. To
obtain typical query sets, I constructed OTUs at 97% identity using UPARSE (Edgar, 2013)
from V4 reads of human gut, mouse gut and soil communities respectively (Kozich et al.,
2013) and ITS reads of a soil fungal community (Schmidt et al., 2013). MTIs of these
samples vs. commonly-used reference databases are shown in Table 1. All V4 samples have
MTI<95% with RTS, suggesting that many, perhaps most, OTUs belong to novel genera,
especially in soil (MTI=88%).
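
As a concrete illustration of these definitions, the following sketch computes MTI and an approximate novelty estimate from per-query top-hit identities. The identity values are hypothetical, and the 95% threshold is the approximate genus-level value from Yarza et al. discussed above.

# Sketch: mean top-hit identity (MTI) and an approximate novelty estimate.
# The identities would come from searching query OTUs against a reference
# database; the example values below are made up.
def mean_top_hit_identity(identities):
    """identities: top-hit %identity of each query OTU vs. the reference."""
    return sum(identities) / len(identities)

def approximate_novelty(identities, rank_threshold=95.0):
    """Fraction of queries whose top-hit identity falls below a rank-level
    threshold, i.e. an approximate novelty = (1 - coverage) at that rank."""
    novel = sum(1 for pid in identities if pid < rank_threshold)
    return novel / len(identities)

identities = [99.2, 96.4, 93.1, 88.7, 100.0]
print(mean_top_hit_identity(identities))  # MTI = 95.48
print(approximate_novelty(identities))    # 0.4, rough fraction of novel genera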

RDP leave-one-out validation
RDP was tested on 16S and ITS sequences using leave-one-out validation (Wang et al.,
2007; Deshpande et al., 2015) where one query sequence is extracted from the training set
(RTS and Warcup, respectively) and classified using the remaining sequences as a
reference. Accuracy (Acc
RDP
) is calculated as the fraction of sequences that are correctly
classified. Roughly half (1,119 / 2,472) of the genera in RTS are singletons, i.e. have exactly
one training sequence, while about a quarter (2,258 / 8,548) of the species in Warcup are
singletons, comprising 8% (16S) and 13% (ITS) of the training sequences. A singleton
cannot be classified correctly in a leave-one-out test because no training sequences are left
for its clade so that the maximum achievable Acc_RDP by an ideal algorithm is the fraction of
non-singleton taxa, i.e. 92% for 16S genus and 87% for ITS species, rather than 100% as
would usually be expected for an accuracy measure. The average number of non-singleton
training sequences is 9 per genus in RTS and 14 per species in Warcup which suggests that
correct classification should be relatively easy for most queries, while in practice many
genera will be novel, and taxa that are rare in the database may be common in the query set
and vice versa. Also, all predictions are included in Acc_RDP regardless of their bootstrap confidence values rather than using the authors' recommended parameters (here, 80% cutoff) as would usually be expected for a benchmark test. In summary, the RDP leave-one-out test does not model typical query datasets and Acc_RDP does not give a realistic estimate
of accuracy by any conventional definition.
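
The singleton argument can be made concrete with a short sketch that computes this accuracy ceiling; the input format and example labels are hypothetical.

# Sketch: the ceiling that singleton taxa impose on leave-one-out accuracy.
# taxon_labels gives the taxon of each training sequence at the tested rank
# (genus for RTS, species for Warcup); the example labels are made up.
from collections import Counter

def max_leave_one_out_accuracy(taxon_labels):
    """A sequence whose taxon has only one training sequence cannot be
    classified correctly when it is held out, so the best possible accuracy
    is the fraction of sequences belonging to non-singleton taxa."""
    counts = Counter(taxon_labels)
    non_singleton_seqs = sum(n for n in counts.values() if n > 1)
    return non_singleton_seqs / len(taxon_labels)

labels = ["Bacillus"] * 5 + ["Clostridium"] * 3 + ["GenusX"] + ["GenusY"]
print(max_leave_one_out_accuracy(labels))  # 0.8: two singleton genera out of four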
Methods
Performance metrics
Sensitivity should be measured as the fraction of known queries that are correctly
identified so that the highest achievable sensitivity by an ideal algorithm is 100%. If novel
queries were also counted then sensitivity <100% would reflect an opaque combination of
low database coverage and failures to correctly predict known taxa, as with Acc_RDP. It is
useful to distinguish two types of false positive error: misclassifications, where an incorrect
name is predicted for a known rank, and over-classifications, where a name is predicted for
a novel rank. For a given query set, reference database and taxonomic rank, let N_known and N_novel be the number of queries with known and novel taxa respectively. Let TP be the number of correct predictions, FP_mis be the number of misclassification errors and FP_over be the number of over-classification errors. The total number of queries is N = N_known + N_novel.
The following accuracy metrics can now be defined:
Sensitivity = TP / N_known,
Misclassification rate = MC = FP_mis / N_known,
Over-classification rate = OC = FP_over / N_novel,
Errors per query = EPQ = (FP_mis + FP_over) / N.
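
A short sketch of how these metrics could be computed from per-query outcomes follows; the record format (a 'known' flag plus an outcome label) is an assumption for illustration.

# Sketch: computing the metrics defined above from per-query outcomes.
# Assumes the query set contains both known and novel queries.
def accuracy_metrics(queries):
    """queries: list of dicts with keys
       'known'   : True if the query's taxon at the tested rank is present
                   in the reference database,
       'outcome' : 'correct', 'misclassified', 'overclassified' or 'none'.
    Returns Sensitivity, MC, OC and EPQ as defined in the text."""
    n_known = sum(q["known"] for q in queries)
    n_novel = len(queries) - n_known
    tp = sum(q["known"] and q["outcome"] == "correct" for q in queries)
    fp_mis = sum(q["known"] and q["outcome"] == "misclassified" for q in queries)
    fp_over = sum((not q["known"]) and q["outcome"] == "overclassified" for q in queries)
    return {
        "Sensitivity": tp / n_known,
        "MC": fp_mis / n_known,
        "OC": fp_over / n_novel,
        "EPQ": (fp_mis + fp_over) / len(queries),
    }

# Example: 3 known queries (2 correct, 1 misclassified) and 2 novel queries
# (1 over-classified, 1 correctly left unclassified).
example = [
    {"known": True,  "outcome": "correct"},
    {"known": True,  "outcome": "correct"},
    {"known": True,  "outcome": "misclassified"},
    {"known": False, "outcome": "overclassified"},
    {"known": False, "outcome": "none"},
]
print(accuracy_metrics(example))  # Sensitivity~0.67, MC~0.33, OC=0.5, EPQ=0.4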
To a first approximation, we might expect misclassification and over-classification rates to
be similar on different datasets because these measures reflect intrinsic characteristics of
an algorithm independent of the data while EPQ, the measure that is typically of most
interest in practice, will strongly depend on database coverage (equivalently, on query
novelty). For example, if a query set contains mostly known sequences, we would expect
errors to be rare and dominated by misclassifications, while if a query set is highly novel
then there may be many over-classifications. If these expectations are correct, then values of
MC and OC measured on a benchmark test will be similar to those obtained on biological
data in practice while EPQ will be similar only if the benchmark has similar rates of novel
taxa.
Clade partition cross-validation (CPX)
If high ranks are usually known but low ranks are often novel, then a benchmark test
should contain a mix of known and novel taxa at low ranks so that both MC and OC can be
measured. This can be achieved by clade partition cross-validation (CPX), as follows. Clades
at a given rank r_part from a reference database are partitioned so that a randomly-chosen half of the daughter groups in a given clade are assigned to the query set and the other half to the reference set, so that ranks below r_part are always novel. For example, if r_part = family, then half of the genera for a given family are assigned to the query and half to the reference set. Singletons are always assigned to the query set, so are always novel while non-

References
Wang, Q., Garrity, G.M., Tiedje, J.M. and Cole, J.R. (2007) Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Applied and Environmental Microbiology, 73, 5261–5267.
Edgar, R.C. (2013) UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nature Methods, 10, 996–998.