Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing.

doi:10.1038/NMETH.2276

Journal Article•DOI•

Nicholas A. Bokulich¹, Sathish Subramanian², Jeremiah J. Faith², Dirk Gevers³, Jeffrey I. Gordon², Rob Knight⁴,⁵, David A. Mills¹, J. Gregory Caporaso⁶,⁷

University of California, Davis¹, Washington University in St. Louis², Broad Institute³, University of Colorado Boulder⁴, Howard Hughes Medical Institute⁵, Argonne National Laboratory⁶, Northern Arizona University⁷

01 Jan 2013-Nature Methods (Nature Research)-Vol. 10, Iss: 1, pp 57-59

TL;DR: It is demonstrated that high-quality read length and abundance are the primary factors differentiating correct from erroneous reads produced by Illumina GAIIx, HiSeq and MiSeq instruments.

Abstract: High-throughput sequencing has revolutionized microbial ecology, but read quality remains a considerable barrier to accurate taxonomy assignment and α-diversity assessment for microbial communities. We demonstrate that high-quality read length and abundance are the primary factors differentiating correct from erroneous reads produced by Illumina GAIIx, HiSeq and MiSeq instruments. We present guidelines for user-defined quality-filtering strategies, enabling efficient extraction of high-quality data and facilitating interpretation of Illumina sequencing results.
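
As a rough illustration of the two factors named in the abstract (high-quality read length and abundance), the sketch below truncates each read at its first low-quality base, keeps the read only if a minimum fraction of its length survives, and then drops sequences with a very low share of the total. The function names and every threshold value (`min_q`, `min_fraction`, `min_rel_abundance`) are assumptions chosen for illustration; they are not the guidelines published in the paper.

```python
from collections import Counter

def truncate_read(seq, quals, min_q=20):
    """Cut a read at the first base whose Phred score is below min_q."""
    for i, q in enumerate(quals):
        if q < min_q:
            return seq[:i]
    return seq

def quality_filter(reads, min_q=20, min_fraction=0.75):
    """Keep truncated reads that retain at least min_fraction of their length."""
    kept = []
    for seq, quals in reads:
        trimmed = truncate_read(seq, quals, min_q)
        if seq and len(trimmed) / len(seq) >= min_fraction:
            kept.append(trimmed)
    return kept

def abundance_filter(seqs, min_rel_abundance=0.01):
    """Drop sequences whose share of all reads is below min_rel_abundance."""
    counts = Counter(seqs)
    total = sum(counts.values())
    return {s: n for s, n in counts.items() if n / total >= min_rel_abundance}
```

Here `reads` is assumed to be a list of `(sequence, list_of_Phred_scores)` pairs, and `abundance_filter` would normally be applied to clustered sequences rather than raw reads.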

Citations

Journal Article•DOI•

UPARSE: highly accurate OTU sequences from microbial amplicon reads

Robert C. Edgar

01 Oct 2013-Nature Methods

TL;DR: The UPARSE pipeline reports operational taxonomic unit (OTU) sequences with ≤1% incorrect bases in artificial microbial community tests, compared with >3% incorrect bases commonly reported by other methods.

Abstract: Amplified marker-gene sequences can be used to understand microbial community structure, but they suffer from a high level of sequencing and amplification artifacts. The UPARSE pipeline reports operational taxonomic unit (OTU) sequences with ≤1% incorrect bases in artificial microbial community tests, compared with >3% incorrect bases commonly reported by other methods. The improved accuracy results in far fewer OTUs, consistently closer to the expected number of species in a community.

11,329 citations

Journal Article•DOI•

Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform.

James J. Kozich¹, Sarah L. Westcott¹, Nancy N. Baxter¹, Sarah K. Highlander², Patrick D. Schloss¹

University of Michigan¹, Baylor College of Medicine²

01 Sep 2013-Applied and Environmental Microbiology

TL;DR: This work presents an improved method for sequencing variable regions within the 16S rRNA gene using Illumina's MiSeq platform, which is currently capable of producing paired 250-nucleotide reads and demonstrates that it can provide data that are at least as good as that generated by the 454 platform while providing considerably higher sequencing coverage for a fraction of the cost.

Abstract: Rapid advances in sequencing technology have changed the experimental landscape of microbial ecology. In the last 10 years, the field has moved from sequencing hundreds of 16S rRNA gene fragments per study using clone libraries to the sequencing of millions of fragments per study using next-generation sequencing technologies from 454 and Illumina. As these technologies advance, it is critical to assess the strengths, weaknesses, and overall suitability of these platforms for the interrogation of microbial communities. Here, we present an improved method for sequencing variable regions within the 16S rRNA gene using Illumina's MiSeq platform, which is currently capable of producing paired 250-nucleotide reads. We evaluated three overlapping regions of the 16S rRNA gene that vary in length (i.e., V34, V4, and V45) by resequencing a mock community and natural samples from human feces, mouse feces, and soil. By titrating the concentration of 16S rRNA gene amplicons applied to the flow cell and using a quality score-based approach to correct discrepancies between reads used to construct contigs, we were able to reduce error rates by as much as two orders of magnitude. Finally, we reprocessed samples from a previous study to demonstrate that large numbers of samples could be multiplexed and sequenced in parallel with shotgun metagenomes. These analyses demonstrate that our approach can provide data that are at least as good as that generated by the 454 platform while providing considerably higher sequencing coverage for a fraction of the cost.
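
The quality-score-based correction of discrepancies between paired reads mentioned in the abstract can be pictured with the minimal sketch below. It assumes the two reads have already been aligned so that position i in one corresponds to position i in the other (with the reverse read already reverse-complemented). The function name, the `min_delta` threshold and the choice to emit `N` for unresolved positions are illustrative assumptions, not the published pipeline.

```python
def merge_overlap(fwd, fwd_q, rev, rev_q, min_delta=6):
    """Build a consensus from two aligned reads covering the same positions."""
    consensus = []
    for b1, q1, b2, q2 in zip(fwd, fwd_q, rev, rev_q):
        if b1 == b2:
            # Both reads agree, so keep the shared base.
            consensus.append(b1)
        elif abs(q1 - q2) >= min_delta:
            # Reads disagree: trust the base with the clearly higher quality.
            consensus.append(b1 if q1 > q2 else b2)
        else:
            # Qualities are too close to decide, so mark the base as ambiguous.
            consensus.append("N")
    return "".join(consensus)
```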

5,417 citations

Cites methods from "Quality-filtering vastly improves d..."

...Others have attempted to use the Phred/Phrap quality scores associated with each base to trim sequence reads in combination with removing rare taxa (16)....

Introducing the special issue on “Bioinformatics”

장병탁, 김삼묘, 허철구

01 Aug 2000

TL;DR: Assessment of medical technology in the context of commercialization with Bioentrepreneur course, which addresses many issues unique to biomedical products.

Abstract: BIOE 402. Medical Technology Assessment. 2 or 3 hours. Bioentrepreneur course. Assessment of medical technology in the context of commercialization. Objectives, competition, market share, funding, pricing, manufacturing, growth, and intellectual property; many issues unique to biomedical products. Course Information: 2 undergraduate hours. 3 graduate hours. Prerequisite(s): Junior standing or above and consent of the instructor.

4,833 citations

Journal Article•DOI•

Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2’s q2-feature-classifier plugin

Nicholas A. Bokulich¹, Benjamin D. Kaehler², Jai Ram Rideout¹, Matthew R. Dillon¹, Evan Bolyen¹, Rob Knight³, Gavin A. Huttley², J. Gregory Caporaso¹

Northern Arizona University¹, Australian National University², University of California, San Diego³

17 May 2018-Microbiome

TL;DR: The results illustrate the importance of parameter tuning for optimizing classifier performance, and recommendations are made regarding parameter choices for these classifiers under a range of standard operating conditions.

Abstract: Taxonomic classification of marker-gene sequences is an important step in microbiome analysis. We present q2-feature-classifier ( https://github.com/qiime2/q2-feature-classifier ), a QIIME 2 plugin containing several novel machine-learning and alignment-based methods for taxonomy classification. We evaluated and optimized several commonly used classification methods implemented in QIIME 1 (RDP, BLAST, UCLUST, and SortMeRNA) and several new methods implemented in QIIME 2 (a scikit-learn naive Bayes machine-learning classifier, and alignment-based taxonomy consensus methods based on VSEARCH, and BLAST+) for classification of bacterial 16S rRNA and fungal ITS marker-gene amplicon sequence data. The naive-Bayes, BLAST+-based, and VSEARCH-based classifiers implemented in QIIME 2 meet or exceed the species-level accuracy of other commonly used methods designed for classification of marker gene sequences that were evaluated in this work. These evaluations, based on 19 mock communities and error-free sequence simulations, including classification of simulated “novel” marker-gene sequences, are available in our extensible benchmarking framework, tax-credit ( https://github.com/caporaso-lab/tax-credit-data ). Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make recommendations regarding parameter choices for these classifiers under a range of standard operating conditions. q2-feature-classifier and tax-credit are both free, open-source, BSD-licensed packages available on GitHub.

2,475 citations

Journal Article•DOI•

Exact sequence variants should replace operational taxonomic units in marker-gene data analysis.

Benjamin J. Callahan¹, Paul J. McMurdie, Susan Holmes²

North Carolina State University¹, Stanford University²

21 Jul 2017-The ISME Journal

TL;DR: It is argued that the improvements in reusability, reproducibility and comprehensiveness are sufficiently great that ASVs should replace OTUs as the standard unit of marker-gene analysis and reporting.

Abstract: Recent advances have made it possible to analyze high-throughput marker-gene sequencing data without resorting to the customary construction of molecular operational taxonomic units (OTUs): clusters of sequencing reads that differ by less than a fixed dissimilarity threshold. New methods control errors sufficiently such that amplicon sequence variants (ASVs) can be resolved exactly, down to the level of single-nucleotide differences over the sequenced gene region. The benefits of finer resolution are immediately apparent, and arguments for ASV methods have focused on their improved resolution. Less obvious, but we believe more important, are the broad benefits that derive from the status of ASVs as consistent labels with intrinsic biological meaning identified independently from a reference database. Here we discuss how these features grant ASVs the combined advantages of closed-reference OTUs—including computational costs that scale linearly with study size, simple merging between independently processed data sets, and forward prediction—and of de novo OTUs—including accurate measurement of diversity and applicability to communities lacking deep coverage in reference databases. We argue that the improvements in reusability, reproducibility and comprehensiveness are sufficiently great that ASVs should replace OTUs as the standard unit of marker-gene analysis and reporting.
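
The contrast the abstract draws between threshold-based OTUs and exactly resolved sequence variants can be sketched as follows. This is a toy under stated assumptions: identity is computed position by position on equal-length sequences rather than by alignment, the 0.97 threshold is only a commonly used example value, and `exact_variants` simply counts distinct sequences without the error control that the ASV methods described above depend on. All function names are hypothetical.

```python
from collections import Counter

def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_otus(seqs, threshold=0.97):
    """Assign each read to the first centroid it matches at >= threshold identity."""
    centroids = []
    counts = Counter()
    # Visit unique sequences from most to least abundant so that abundant
    # sequences become centroids first.
    for seq, n in Counter(seqs).most_common():
        for c in centroids:
            if len(c) == len(seq) and identity(c, seq) >= threshold:
                counts[c] += n
                break
        else:
            # No existing centroid is similar enough: start a new cluster.
            centroids.append(seq)
            counts[seq] += n
    return dict(counts)

def exact_variants(seqs):
    """Count each distinct sequence separately (no error correction here)."""
    return dict(Counter(seqs))
```

With these definitions, a read that differs from an abundant sequence by a single base is absorbed into that sequence's OTU, whereas `exact_variants` reports it as its own entry.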

1,977 citations

Cites background from "Quality-filtering vastly improves d..."

...Aggressive filtering and complete overlap between paired-end reads can reduce the rate at which OTU methods misinterpret sequencing artifacts as biological variation (Bokulich et al., 2013; Kozich et al., 2013; Edgar and Flyvbjerg, 2015)....

References

Journal Article•DOI•

QIIME allows analysis of high-throughput community sequencing data.

J. Gregory Caporaso¹, Justin Kuczynski¹, Jesse Stombaugh¹, Kyle Bittinger², Frederic D. Bushman², Elizabeth K. Costello¹, Noah Fierer³, Antonio Gonzalez Peña¹, Julia K. Goodrich¹, Jeffrey I. Gordon⁴, Gavin A. Huttley⁵, Scott T. Kelley⁶, Dan Knights¹, Jeremy E. Koenig⁷, Ruth E. Ley⁷, Catherine A. Lozupone¹, Daniel McDonald¹, Brian D. Muegge⁴, Meg Pirrung¹, Jens Reeder¹, Joel Sevinsky, Peter J. Turnbaugh⁴, William A. Walters¹, Jeremy Widmann¹, Tanya Yatsunenko⁴, Jesse R. Zaneveld¹, Rob Knight¹,⁸

University of Colorado Boulder¹, University of Pennsylvania², Cooperative Institute for Research in Environmental Sciences³, Washington University in St. Louis⁴, Australian National University⁵, San Diego State University⁶, Cornell University⁷, Howard Hughes Medical Institute⁸

11 Apr 2010-Nature Methods

TL;DR: An overview of the analysis pipeline and links to raw data and processed output from the runs with and without denoising are provided.

Abstract: Supplementary Figure 1 Overview of the analysis pipeline. Supplementary Table 1 Details of conventionally raised and conventionalized mouse samples. Supplementary Discussion Expanded discussion of QIIME analyses presented in the main text; Sequencing of 16S rRNA gene amplicons; QIIME analysis notes; Expanded Figure 1 legend; Links to raw data and processed output from the runs with and without denoising.

28,911 citations

Journal Article•DOI•

Search and clustering orders of magnitude faster than BLAST

Robert C. Edgar

01 Oct 2010-Bioinformatics

TL;DR: UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters and offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets.

Abstract: Motivation: Biological sequence data is accumulating rapidly, motivating the development of improved high-throughput methods for sequence classification. Results: UBLAST and USEARCH are new algorithms enabling sensitive local and global search of large sequence databases at exceptionally high speeds. They are often orders of magnitude faster than BLAST in practical applications, though sensitivity to distant protein relationships is lower. UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters. UCLUST offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets. Availability: Binaries are available at no charge for non-commercial use at http://www.drive5.com/usearch Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

17,301 citations

Journal Article•DOI•

Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy

Qiong Wang, George M. Garrity¹, James M. Tiedje¹, James R. Cole

Michigan State University¹

15 Aug 2007-Applied and Environmental Microbiology

TL;DR: The RDP Classifier can rapidly and accurately classify bacterial 16S rRNA sequences into the new higher-order taxonomy proposed in Bergey's Taxonomic Outline of the Prokaryotes, and the majority of the classification errors appear to be due to anomalies in the current taxonomies.

Abstract: The Ribosomal Database Project (RDP) Classifier, a naive Bayesian classifier, can rapidly and accurately classify bacterial 16S rRNA sequences into the new higher-order taxonomy proposed in Bergey's Taxonomic Outline of the Prokaryotes (2nd ed., release 5.0, Springer-Verlag, New York, NY, 2004). It provides taxonomic assignments from domain to genus, with confidence estimates for each assignment. The majority of classifications (98%) were of high estimated confidence (≥95%) and high accuracy (98%). In addition to being tested with the corpus of 5,014 type strain sequences from Bergey's outline, the RDP Classifier was tested with a corpus of 23,095 rRNA sequences as assigned by the NCBI into their alternative higher-order taxonomy. The results from leave-one-out testing on both corpora show that the overall accuracies at all levels of confidence for near-full-length and 400-base segments were 89% or above down to the genus level, and the majority of the classification errors appear to be due to anomalies in the current taxonomies. For shorter rRNA segments, such as those that might be generated by pyrosequencing, the error rate varied greatly over the length of the 16S rRNA gene, with segments around the V2 and V4 variable regions giving the lowest error rates. The RDP Classifier is suitable both for the analysis of single rRNA sequences and for the analysis of libraries of thousands of sequences. Another related tool, RDP Library Compare, was developed to facilitate microbial-community comparison based on 16S rRNA gene sequence libraries. It combines the RDP Classifier with a statistical test to flag taxa differentially represented between samples. The RDP Classifier and RDP Library Compare are available online at http://rdp.cme.msu.edu/.
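
To make the idea of a naive Bayesian sequence classifier concrete, here is a minimal sketch that scores a query by the short words (k-mers) it contains. It is not the RDP Classifier's implementation: the word length `k=3`, the add-half smoothing, the function names and the toy taxon labels are all assumptions for illustration, and the confidence estimates mentioned in the abstract are left out.

```python
import math
from collections import defaultdict

def kmers(seq, k):
    """Return the set of overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def train(labelled_seqs, k=3):
    """Count, per taxon, how many training sequences contain each word."""
    word_counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for taxon, seq in labelled_seqs:
        totals[taxon] += 1
        for w in kmers(seq, k):
            word_counts[taxon][w] += 1
    return word_counts, totals, k

def classify(model, query):
    """Return the taxon with the highest summed log-probability of the query's words."""
    word_counts, totals, k = model
    best_taxon, best_score = None, -math.inf
    for taxon, n_seqs in totals.items():
        score = 0.0
        for w in kmers(query, k):
            # Add-half smoothing (an assumption) so unseen words do not give log(0).
            score += math.log((word_counts[taxon].get(w, 0) + 0.5) / (n_seqs + 1))
        if score > best_score:
            best_taxon, best_score = taxon, score
    return best_taxon
```

A call such as `classify(train(training_pairs), query)` returns the label whose training sequences share the most words with the query; `training_pairs` is assumed to be a list of `(taxon_label, sequence)` tuples.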

16,048 citations

Journal Article•DOI•

Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB

Todd Z. DeSantis¹, Philip Hugenholtz², Neils Larsen, Mark Rojas³, Eoin L. Brodie¹, Keith Keller⁴, Thomas Huber⁵, Daniel Dalevi⁶, Ping Hu¹, Gary L. Andersen¹

Lawrence Berkeley National Laboratory¹, Joint Genome Institute², Baylor University³, University of California, Berkeley⁴, University of Queensland⁵, Chalmers University of Technology⁶

01 Jul 2006-Applied and Environmental Microbiology

TL;DR: A 16S rRNA gene database (http://greengenes.lbl.gov) provides chimera screening, standard alignment, and taxonomic classification using multiple published taxonomies.

Abstract: A 16S rRNA gene database (http://greengenes.lbl.gov) addresses limitations of public repositories by providing chimera screening, standard alignment, and taxonomic classification using multiple published taxonomies. It was found that there is incongruent taxonomic nomenclature among curators even at the phylum level. Putative chimeras were identified in 3% of environmental sequences and in 0.2% of records derived from isolates. Environmental sequences were classified into 100 phylum-level lineages in the Archaea and Bacteria.

9,593 citations

Journal Article•DOI•

Ultra-high-throughput microbial community analysis on the Illumina HiSeq and MiSeq platforms

J. Gregory Caporaso¹, Christian L. Lauber², William A. Walters³, Donna Berg-Lyons², James Huntley³, Noah Fierer²,³, Sarah M. Owens⁴, Jason Betley⁵, Louise Fraser⁵, Markus J. Bauer⁵, Niall Anthony Gormley⁵, Jack A. Gilbert⁴,⁶, Geoff Smith⁵, Rob Knight

Northern Arizona University¹, Cooperative Institute for Research in Environmental Sciences², University of Colorado Boulder³, Argonne National Laboratory⁴, Illumina⁵, University of Chicago⁶

01 Aug 2012-The ISME Journal

TL;DR: It is shown that the protocol developed for these instruments successfully recaptures known biological results, and additionally that biological conclusions are consistent across sequencing platforms (the HiSeq2000 versus the MiSeq) and across the sequenced regions of amplicons.

Abstract: DNA sequencing continues to decrease in cost with the Illumina HiSeq2000 generating up to 600 Gb of paired-end 100 base reads in a ten-day run. Here we present a protocol for community amplicon sequencing on the HiSeq2000 and MiSeq Illumina platforms, and apply that protocol to sequence 24 microbial communities from host-associated and free-living environments. A critical question as more sequencing platforms become available is whether biological conclusions derived on one platform are consistent with what would be derived on a different platform. We show that the protocol developed for these instruments successfully recaptures known biological results, and additionally that biological conclusions are consistent across sequencing platforms (the HiSeq2000 versus the MiSeq) and across the sequenced regions of amplicons.

6,840 citations