VSEARCH: a versatile open source tool for metagenomics


Submitted 5 September 2016
Accepted 17 September 2016
Published 18 October 2016
Corresponding author: Torbjørn Rognes, torognes@ifi.uio.no
Academic editor: Tomas Hrbek
Additional Information and Declarations can be found on page 18
DOI 10.7717/peerj.2584
Copyright 2016 Rognes et al.
Distributed under Creative Commons CC-BY 4.0
OPEN ACCESS
VSEARCH: a versatile open source tool for metagenomics
Torbjørn Rognes (1,2), Tomáš Flouri (3,4), Ben Nichols (5), Christopher Quince (5,6) and Frédéric Mahé (7,8)
(1) Department of Informatics, University of Oslo, Oslo, Norway
(2) Department of Microbiology, Oslo University Hospital, Oslo, Norway
(3) Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
(4) Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
(5) School of Engineering, University of Glasgow, Glasgow, United Kingdom
(6) Warwick Medical School, University of Warwick, Coventry, United Kingdom
(7) Department of Ecology, University of Kaiserslautern, Kaiserslautern, Germany
(8) UMR LSTM, CIRAD, Montpellier, France
ABSTRACT
Background. VSEARCH is an open source and free of charge multithreaded 64-bit
tool for processing and preparing metagenomics, genomics and population genomics
nucleotide sequence data. It is designed as an alternative to the widely used USEARCH
tool (Edgar, 2010) for which the source code is not publicly available, algorithm details
are only rudimentarily described, and only a memory-confined 32-bit version is freely
available for academic use.
Methods. When searching nucleotide sequences, VSEARCH uses a fast heuristic based
on words shared by the query and target sequences in order to quickly identify similar
sequences; a similar strategy is probably used in USEARCH. VSEARCH then performs
optimal global sequence alignment of the query against potential target sequences, using
full dynamic programming instead of the seed-and-extend heuristic used by USEARCH.
Pairwise alignments are computed in parallel using vectorisation and multiple threads.
Results. VSEARCH includes most commands for analysing nucleotide sequences
available in USEARCH version 7 and several of those available in USEARCH version 8,
including searching (exact or based on global alignment), clustering by similarity (using
length pre-sorting, abundance pre-sorting or a user-defined order), chimera detection
(reference-based or de novo), dereplication (full length or prefix), pairwise alignment,
reverse complementation, sorting, and subsampling. VSEARCH also includes commands
for FASTQ file processing, i.e., format detection, filtering, read quality statistics,
and merging of paired reads. Furthermore, VSEARCH extends functionality with
several new commands and improvements, including shuffling, rereplication, masking
of low-complexity sequences with the well-known DUST algorithm, a choice among
different similarity definitions, and FASTQ file format conversion. VSEARCH is here
shown to be more accurate than USEARCH when performing searching, clustering,
chimera detection and subsampling, while on a par with USEARCH for paired-end
read merging. VSEARCH is slower than USEARCH when performing clustering and
chimera detection, but significantly faster when performing paired-end read merging
and dereplication. VSEARCH is available at https://github.com/torognes/vsearch under
either the BSD 2-clause license or the GNU General Public License version 3.0.
Discussion. VSEARCH has been shown to be a fast, accurate and full-fledged alternative
to USEARCH. A free and open-source versatile tool for sequence analysis is now
available to the metagenomics community.
How to cite this article: Rognes et al. (2016), VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584; DOI 10.7717/peerj.2584
Subjects: Biodiversity, Bioinformatics, Computational Biology, Genomics, Microbiology
Keywords: Clustering, Chimera detection, Searching, Masking, Shuffling, Parallelization,
Metagenomics, Alignment, Sequences, Dereplication
INTRODUCTION
Rockström et al. (2009) and Steffen et al. (2015) presented biodiversity loss as a major threat
to the short-term survival of humanity. Recent progress in sequencing technologies
has made possible large-scale studies of environmental genetic diversity, from deep sea
hydrothermal vents to Antarctic lakes (Karsenti et al., 2011), and from tropical forests to
Siberian steppes (Gilbert, Jansson & Knight, 2014). Recent clinical studies have shown the
importance of the microbiomes of our bodies and daily environments for human health
(Human Microbiome Project Consortium, 2012). Usually focusing on universal markers
(e.g., 16S rRNA, ITS, COI), these targeted metagenomics studies produce many millions
of sequences, and require open-source, fast and memory-efficient tools to facilitate their
ecological interpretation.
Several pipelines have been developed for microbiome analysis, among which
mothur (Schloss et al., 2009), QIIME (Caporaso et al., 2010), and UPARSE (Edgar, 2013)
are the most popular. QIIME and UPARSE are both based on USEARCH (Edgar,
2010), a set of tools designed and implemented by Robert C. Edgar, and available at
http://drive5.com/usearch/. USEARCH offers a great number of commands and options to
manipulate and analyse FASTQ and FASTA files. However, the source code of USEARCH
is not publicly available, algorithm details are only rudimentarily described, and only a
memory-confined 32-bit version is freely available for academic use.
We believe that the existence of open-source solutions is beneficial for end-users and can
invigorate research activities. For this reason, we have undertaken to offer a high quality
open-source alternative to USEARCH, freely available to users without any memory
limitation. VSEARCH includes most of the USEARCH functions in common use, and
further development may add additional features. Here we describe the details of the
VSEARCH implementation. To assess its performance in terms of speed and quality of
results, we have evaluated some of the most important functions (searching, clustering,
chimera detection and subsampling) and compared them to USEARCH. We find that
VSEARCH delivers results that are better than or on a par with USEARCH results.
MATERIALS AND METHODS
Algorithms and implementation
Below is a brief description of the most important functions of VSEARCH and details of
their implementation. VSEARCH command line options are shown in italics, and should
be preceded by a single (-) or double dash (--) when used.
Rognes et al. (2016), PeerJ, DOI 10.7717/peerj.2584 2/22

Reading FASTA and FASTQ files
Most VSEARCH commands read files in FASTA or FASTQ format. The parser for FASTQ
files in VSEARCH is compliant with the standard as described by Cock et al. (2010) and
correctly parses all their test files. FASTA and FASTQ files are automatically detected
and many commands accept both as input. Files compressed with gzip or bzip2 are
automatically detected and decompressed using the zlib library by Gailly & Adler (2016)
or the bzip2 library by Seward (2016), respectively. Data may also be piped into or out of
VSEARCH, allowing for instance many separate FASTA files to be piped into VSEARCH
for simultaneous dereplication, or allowing the creation of complex pipelines without ever
having to write on slow disks.
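
As an illustration of the automatic detection described above, the following Python sketch inspects the leading bytes of an input to decide whether it is gzip- or bzip2-compressed, and then uses the first character to distinguish FASTA from FASTQ. It is not taken from the VSEARCH source: the function names `open_sequence_stream` and `sniff_format` are hypothetical, and Python's `gzip` and `bz2` modules stand in for the zlib and bzip2 libraries.

```python
import bz2
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip stream
BZIP2_MAGIC = b"BZh"      # first three bytes of a bzip2 stream


def open_sequence_stream(raw: bytes) -> io.StringIO:
    """Return a text stream over raw bytes, decompressing when needed."""
    if raw.startswith(GZIP_MAGIC):
        raw = gzip.decompress(raw)
    elif raw.startswith(BZIP2_MAGIC):
        raw = bz2.decompress(raw)
    return io.StringIO(raw.decode("ascii"))


def sniff_format(stream: io.StringIO) -> str:
    """Guess the format from the first character, then rewind the stream."""
    first = stream.read(1)
    stream.seek(0)
    if first == ">":
        return "fasta"
    if first == "@":
        return "fastq"
    raise ValueError("unrecognised sequence format")
```

Because the decision relies only on the first bytes, the same logic could in principle be applied to piped data, although a streaming implementation would buffer those bytes rather than load the whole input as this sketch does.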
VSEARCH is a 64-bit program and allows very large datasets to be processed, essentially
limited only by the amount of memory available. The free USEARCH versions are 32-bit
programs that limit the available memory to somewhat less than 4 GB, often seriously
hampering the analysis of realistic datasets.
Writing result files
VSEARCH can output results in a variety of formats (FASTA, FASTQ, tables, alignments,
SAM) depending on the input format and command used. When outputting FASTA
files, the line width may be specified using the fasta_width option, where 0 means that
line wrapping should be turned off. Similar controls are offered for pairwise or multiple
sequence alignments.
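
The effect of the fasta_width option can be sketched in a few lines of Python. This is only an illustration of the wrapping rule stated above (0 turns wrapping off); the function name `wrap_fasta` and the default width of 80 used here are assumptions of the sketch.

```python
def wrap_fasta(header: str, sequence: str, width: int = 80) -> str:
    """Format one FASTA record; a width of 0 disables line wrapping."""
    if width < 0:
        raise ValueError("width must be zero or positive")
    if width == 0:
        lines = [sequence]
    else:
        # Cut the sequence into consecutive chunks of at most `width` symbols.
        lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + header, *lines]) + "\n"
```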
Searching
Global pairwise sequence comparison is a core functionality of VSEARCH. Several
commands compare a query sequence against a database of sequences: all-vs-all alignment
(allpairs_global), clustering (cluster_fast, cluster_size, cluster_smallmem), chimera detection
(uchime_denovo and uchime_ref ) and searching (usearch_global). This comparison
function proceeds in two phases: an initial heuristic filtering based on shared words,
followed by optimal alignment of the query with the most promising candidates.
The first phase is presumably quite similar to USEARCH (Edgar, 2010). Heuristics
are used to identify a small set of database sequences that have many words in common
with the query sequence. Words (or k-mers) consist of a certain number k of consecutive
nucleotides of a sequence (8 by default, adjustable with the wordlength option). All
overlapping words are included. A sequence of length $n$ then contains at most $n - k + 1$
unique words. VSEARCH counts the number of shared words between the query and
each database sequence. Words that appear multiple times are counted only once. To
count the words in the database sequences quickly, VSEARCH creates an index of all the
$4^k$ possible distinct words and stores information about which database sequences they
appear in. For extremely frequent words, the set of database sequences is represented by a
bitmap; otherwise the set is stored as a list. A finer control of k-mer indexing is described
for USEARCH by the pattern (binary string indicating which positions must match) and
slots options. USEARCH has such options but seems to ignore them. Currently, VSEARCH
ignores these two options too. The minimum number of shared words required may be
specified with the minwordmatches option (10 by default), but a lower value is automatically
used for short or simple query sequences with less than 10 unique words.
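
The word-counting phase described above can be sketched as follows. This is a minimal Python illustration, not the VSEARCH implementation: it uses an ordinary dictionary of lists instead of the bitmap and list representation, ignores ambiguous nucleotides and masking, and the function names are hypothetical.

```python
from collections import defaultdict


def unique_words(seq: str, k: int = 8) -> set[str]:
    """Distinct overlapping k-mers of seq (at most len(seq) - k + 1 of them)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(database: list[str], k: int = 8) -> dict[str, list[int]]:
    """Map every word to the database sequences in which it occurs."""
    index: dict[str, list[int]] = defaultdict(list)
    for target_id, seq in enumerate(database):
        for word in unique_words(seq, k):  # each word counted once per target
            index[word].append(target_id)
    return index


def shared_word_counts(query: str, index: dict[str, list[int]],
                       n_targets: int, k: int = 8) -> list[int]:
    """Number of distinct words each database sequence shares with the query."""
    counts = [0] * n_targets
    for word in unique_words(query, k):
        for target_id in index.get(word, ()):
            counts[target_id] += 1
    return counts
```

Database sequences whose count falls below the minwordmatches threshold would then be discarded before any alignment is attempted.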
Comparing sequences based on statistics of shared words is a common method to
quickly assess the similarity between two sequences without aligning them, which is
often time-consuming. The $D_2$ statistic and related metrics for alignment-free sequence
comparison have often been used for rapid and approximate sequence matching and their
statistical properties have been well studied (Song et al., 2014). The approach used here has
similarities to the $D_2$ statistic, but multiple matches of the same word are ignored.
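
For comparison, the two quantities can be written side by side. The notation is introduced here for illustration only and is not taken from the article: $c_q(w)$ and $c_t(w)$ denote the number of occurrences of word $w$ in the query and in the target, and $W_q$ and $W_t$ the sets of distinct words of each sequence. In its usual formulation, $D_2$ sums the products of the word counts, whereas the count described above records only whether a word is present in both sequences.

```latex
\[
D_2(q,t) = \sum_{w \in \{A,C,G,T\}^k} c_q(w)\, c_t(w),
\qquad
S(q,t) = \sum_{w \in \{A,C,G,T\}^k} \mathbf{1}[c_q(w) > 0]\;\mathbf{1}[c_t(w) > 0]
       = |W_q \cap W_t| .
\]
```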
In the second phase, searching proceeds by considering the database sequences in a
specific order, starting with the sequence having the largest number of words in common
with the query, and proceeding with a decreasing number of shared words. If two database
sequences have the same number of words in common with the query, the shortest
sequence is considered first. The query sequence is compared with each database sequence
by computing the optimal global alignment. The alignment is performed using a
multi-threaded and vectorised full dynamic programming algorithm (Needleman & Wunsch,
1970) adapted from SWIPE (Rognes, 2011). Due to the extreme memory requirements
of this method when aligning two long sequences, an alternative algorithm described by
Hirschberg (1975) and Myers & Miller (1988) is used when the product of the lengths of the
two sequences is greater than 25,000,000, corresponding to aligning two 5,000 bp sequences.
This alternative algorithm uses only a linear amount of memory but is considerably
slower. This second phase is probably where USEARCH and VSEARCH differ the most, as
USEARCH by default presumably performs heuristic seed-and-extend alignment similar to
BLAST (Altschul et al., 1997), and only performs optimal alignment when the option fulldp
(full dynamic programming) is used. Computing the optimal pairwise alignment in each
case gives more accurate results but is also computationally more demanding. The efficient
and vectorised full dynamic programming implementation in VSEARCH compensates
for that extra cost, at least for sequences that are not too long.
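
The second phase can be illustrated with a short Python sketch covering the order in which candidates are considered, the switch to the linear-memory algorithm, and a plain global alignment score. The sketch rests on several assumptions: the scoring values (match 2, mismatch -4, linear gap penalty -5) are placeholders rather than VSEARCH defaults, only the score is computed (no traceback, no vectorisation or threading), and all function names are hypothetical.

```python
LINEAR_MEMORY_THRESHOLD = 25_000_000  # product of the two lengths, from the text


def candidate_order(counts: list[int], lengths: list[int]) -> list[int]:
    """Targets by decreasing shared-word count; ties go to the shortest."""
    return sorted(range(len(counts)), key=lambda t: (-counts[t], lengths[t]))


def needs_linear_memory(query_len: int, target_len: int) -> bool:
    """True when the linear-memory alternative would be selected."""
    return query_len * target_len > LINEAR_MEMORY_THRESHOLD


def global_alignment_score(a: str, b: str, match: int = 2,
                           mismatch: int = -4, gap: int = -5) -> int:
    """Optimal global (Needleman-Wunsch) score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

Every cell of the dynamic programming matrix is evaluated, which is what distinguishes this approach from a seed-and-extend heuristic that explores only part of the matrix.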
If the resulting alignment indicates a similarity equal to or greater than the value
specified with the id option, the database sequence is accepted. If the similarity is too low, it
is rejected. Several other options may also be used to determine how similarity is computed
(iddef, as USEARCH used to offer up to version 6), and which sequences should be accepted
and rejected, either before (e.g., self, minqsize) or after alignment (e.g., maxgaps, maxsubs).
The search is terminated when either a certain number of sequences have been accepted
(1 by default, adjustable with the maxaccepts option), or a certain number of sequences
have been rejected (32 by default, adjustable with the maxrejects option). The accepted
sequences are sorted by sequence similarity and presented as the search results.
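
The termination rule can be sketched as a simple loop over the ordered candidates. This is an illustrative Python sketch only: `identity` is a caller-supplied function standing in for the alignment step, the additional filters (self, minqsize, maxgaps, maxsubs) are omitted, and the function name `search` is hypothetical.

```python
from typing import Callable, Iterable


def search(candidates: Iterable[int], identity: Callable[[int], float],
           min_id: float, maxaccepts: int = 1,
           maxrejects: int = 32) -> list[tuple[int, float]]:
    """Examine candidates in order until enough are accepted or rejected."""
    accepted: list[tuple[int, float]] = []
    rejects = 0
    for target in candidates:
        similarity = identity(target)
        if similarity >= min_id:
            accepted.append((target, similarity))
            if len(accepted) >= maxaccepts:
                break
        else:
            rejects += 1
            if rejects >= maxrejects:
                break
    # Hits are reported by decreasing similarity.
    accepted.sort(key=lambda hit: hit[1], reverse=True)
    return accepted
```

Raising maxaccepts and maxrejects makes the search examine more candidates, which may improve sensitivity at the cost of additional alignments.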
VSEARCH also includes a search_exact command that only identifies exact matches to
the query. It uses a hash table in a way similar to the full-length dereplication command
described below.
Clustering
VSEARCH includes commands to perform de novo clustering using a greedy and heuristic
centroid-based algorithm with an adjustable sequence similarity threshold specified with
the id option (e.g., 0.97). The input sequences are either processed in the user-supplied
order (cluster_smallmem) or pre-sorted based on length (cluster_fast) or abundance (the
new cluster_size option). Each input sequence is then used as a query in a search against
an initially empty database of centroid sequences. The query sequence is clustered with the
first centroid sequence found with similarity equal to or above the threshold. The search
is performed using the heuristic approach described above which generally finds the most
similar sequences first. If no matches are found, the query sequence becomes the centroid of
a new cluster and is added to the database. If maxaccepts is higher than 1, several centroids
with sufficient sequence similarity may be found and considered. By default, the query
is clustered with the centroid presenting the highest sequence similarity (distance-based
greedy clustering, DGC), or, if the sizeorder option is turned on, the centroid with the
highest abundance (abundance-based greedy clustering, AGC) (He et al., 2015; Westcott &
Schloss, 2015; Schloss, 2016). VSEARCH performs multi-threaded clustering by searching
the database of centroid sequences with several query sequences in parallel. If there are
any non-matching query sequences giving rise to new centroids, the required internal
comparisons between the query sequences are subsequently performed to achieve correct
results. For each cluster, VSEARCH can create a simple multiple sequence alignment using
the center star method (Gusfield, 1993) with the centroid as the center sequence, and then
compute a consensus sequence and a sequence profile.
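
The greedy centroid-based procedure can be sketched as follows for the simplest case, in which a query joins the first centroid that meets the threshold. The sketch makes several simplifying assumptions: centroids are scanned linearly rather than ranked by shared words, the similarity function is a toy position-wise identity for equal-length sequences instead of a global alignment, and processing is single-threaded. All names are hypothetical.

```python
from typing import Callable


def hamming_identity(a: str, b: str) -> float:
    """Toy similarity: fraction of identical positions (equal lengths only)."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(sequences: list[str],
                   similarity: Callable[[str, str], float] = hamming_identity,
                   threshold: float = 0.97) -> dict[int, list[int]]:
    """Map each centroid index to the indices of its cluster members."""
    clusters: dict[int, list[int]] = {}  # insertion order = centroid order
    for idx, seq in enumerate(sequences):
        for centroid in clusters:
            if similarity(seq, sequences[centroid]) >= threshold:
                clusters[centroid].append(idx)
                break
        else:
            # No centroid is similar enough: the query founds a new cluster.
            clusters[idx] = [idx]
    return clusters
```

The outcome depends on the input order, which is why the commands differ in whether sequences are pre-sorted by length, by abundance, or left in the user-supplied order.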
Dereplication and rereplication
Full-length dereplication (derep_fulllength) is performed using a hash table with an open
addressing and linear probing strategy based on the Google CityHash hash functions
(written by Geoff Pike and Jyrki Alakuijala, and available at https://github.com/google/cityhash).
The hash table is initially empty. For each input sequence, the hash is computed
and a lookup in the hash table is performed. If an identical sequence is found, the input
sequence is clustered with the matching sequence; otherwise the input sequence is inserted
into the hash table.
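
A minimal Python sketch of full-length dereplication with open addressing and linear probing is given below. It is an illustration under stated assumptions rather than the VSEARCH code: Python's built-in `hash` replaces CityHash, the table is sized once at twice the number of inputs (rounded up to a power of two) so that it can never fill up, and the function name `dereplicate` is hypothetical.

```python
from typing import Optional


def dereplicate(sequences: list[str]) -> list[list[int]]:
    """Group identical sequences; each group lists input indices in order."""
    size = 1
    while size < 2 * max(len(sequences), 1):
        size *= 2
    slots: list[Optional[int]] = [None] * size  # slot -> cluster number
    clusters: list[list[int]] = []              # first member is the representative
    for idx, seq in enumerate(sequences):
        pos = hash(seq) & (size - 1)
        # Linear probing: step forward until an empty slot or a match is found.
        while (slots[pos] is not None
               and sequences[clusters[slots[pos]][0]] != seq):
            pos = (pos + 1) & (size - 1)
        if slots[pos] is None:
            slots[pos] = len(clusters)
            clusters.append([idx])
        else:
            clusters[slots[pos]].append(idx)
    return clusters
```

The length of each group gives the abundance of the corresponding unique sequence.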
Prefix dereplication (derep_prefix) is also implemented. As with full-length dereplication,
identical sequences are clustered. In addition, sequences that are identical to prefixes of
other sequences will also be clustered together. If a sequence is identical to the prefix
of multiple sequences, it is generally not defined how prefix clustering should behave.
VSEARCH resolves this ambiguity by clustering the sequence with the shortest of the
candidate sequences. If they are equally long, priority will be given to the most abundant,
the one with the lexicographically smaller identifier or the one with the earliest original
position, in that order.
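
The tie-breaking rule can be expressed as a single sort key. The following Python sketch is illustrative only: the `Candidate` record and the function `choose_prefix_target` are hypothetical names, and the candidates are assumed to be the other sequences of the dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    identifier: str
    sequence: str
    abundance: int
    position: int  # original position in the input


def choose_prefix_target(query: str,
                         candidates: list[Candidate]) -> Optional[Candidate]:
    """Pick the sequence with which a prefix-matching query is clustered."""
    matching = [c for c in candidates if c.sequence.startswith(query)]
    if not matching:
        return None
    # Shortest first, then most abundant, then smallest identifier,
    # then earliest original position.
    return min(matching, key=lambda c: (len(c.sequence), -c.abundance,
                                        c.identifier, c.position))
```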
To perform prefix dereplication, VSEARCH first creates an initially empty hash table. It
then sorts the input sequences by length and identifies the length s of the shortest sequence
in the dataset. Each input sequence is then processed as follows, starting with the shortest:
If an exact match to the full input sequence is found in the hash table, the input sequence
is clustered with the matching hash table sequence. If no match to the full input sequence
is found, the prefixes of the input sequence are considered, starting with the longest prefix
and proceeding with shorter prefixes in order, down to prefixes of length s. If a match is
References
More filters
Journal ArticleDOI
TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.
Abstract: The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSIBLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.

70,111 citations


"VSEARCH: a versatile open source to..." refers methods in this paper

  • ...This second phase is probably where USEARCH and VSEARCH differ the most, as USEARCH by default presumably performs heuristic seed-and-extend alignment similar to BLAST (Altschul et al., 1997), and only performs optimal alignment when the option fulldp (full dynamic programming) is used....

    [...]
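The contrast drawn in the excerpt above, between heuristic seed-and-extend alignment and optimal alignment by full dynamic programming, can be illustrated with a minimal Needleman-Wunsch scoring sketch. This is not VSEARCH's or USEARCH's implementation: the function name, the scoring values and the linear gap penalty are assumptions chosen for brevity, and the vectorisation and multithreading mentioned in the VSEARCH abstract are omitted.

```python
def global_alignment_score(query, target, match=2, mismatch=-4, gap=-6):
    """Optimal global alignment score (Needleman-Wunsch).

    Hypothetical scoring values and a linear gap penalty are used here
    purely for illustration; they are not taken from the paper.
    """
    # prev[j] holds the best score for aligning the query prefix processed
    # so far against target[:j]. Row 0 means "target prefix against nothing".
    prev = [j * gap for j in range(len(target) + 1)]
    for i, q in enumerate(query, start=1):
        curr = [i * gap]  # query[:i] aligned against an empty target prefix
        for j, t in enumerate(target, start=1):
            diagonal = prev[j - 1] + (match if q == t else mismatch)
            gap_in_target = prev[j] + gap      # q aligned to a gap
            gap_in_query = curr[j - 1] + gap   # t aligned to a gap
            curr.append(max(diagonal, gap_in_target, gap_in_query))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "ACGT"))
    print(global_alignment_score("ACGT", "AGT"))
```

Every cell of the score matrix is filled, so the result is guaranteed optimal under the chosen scoring scheme, at a cost proportional to the product of the two sequence lengths. A seed-and-extend heuristic explores only part of that matrix, trading the guarantee for speed.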

Journal ArticleDOI
TL;DR: Burrows-Wheeler Alignment tool (BWA) is implemented, a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.
Abstract: Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ~10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net Contact: [email protected]

43,862 citations


"VSEARCH: a versatile open source to..." refers methods in this paper

  • ...Rognes et al. (2016), PeerJ, DOI 10.7717/peerj.2584...

    [...]

  • ...Merged sequences that could be perfectly aligned to their respective reference sequences (either the entire genome or the specific rRNA region) using BWA-MEM (Li & Durbin, 2009) were considered correctly merged....

    [...]

Journal ArticleDOI
TL;DR: An overview of the analysis pipeline and links to raw data and processed output from the runs with and without denoising are provided.
Abstract: Supplementary Figure 1 Overview of the analysis pipeline. Supplementary Table 1 Details of conventionally raised and conventionalized mouse samples. Supplementary Discussion Expanded discussion of QIIME analyses presented in the main text; Sequencing of 16S rRNA gene amplicons; QIIME analysis notes; Expanded Figure 1 legend; Links to raw data and processed output from the runs with and without denoising.

28,911 citations


"VSEARCH: a versatile open source to..." refers methods in this paper

  • ...QIIME and UPARSE are both based on USEARCH (Edgar, 2010), a set of tools designed and implemented by Robert C. Edgar, and available at http://drive5.com/usearch/....

    [...]

  • ...In fact, in QIIME many commands will run fine if an alias or link from usearch to vsearch is made....

    [...]

  • ...Several pipelines have been developed for microbiome analysis, among which mothur (Schloss et al., 2009), QIIME (Caporaso et al., 2010), and UPARSE (Edgar, 2013) are the most popular....

    [...]

Journal ArticleDOI
TL;DR: The extensively curated SILVA taxonomy and the new non-redundant SILVA datasets provide an ideal reference for high-throughput classification of data from next-generation sequencing approaches.
Abstract: SILVA (from Latin silva, forest, http://www.arb-silva.de) is a comprehensive web resource for up to date, quality-controlled databases of aligned ribosomal RNA (rRNA) gene sequences from the Bacteria, Archaea and Eukaryota domains and supplementary online services. The referred database release 111 (July 2012) contains 3 194 778 small subunit and 288 717 large subunit rRNA gene sequences. Since the initial description of the project, substantial new features have been introduced, including advanced quality control procedures, an improved rRNA gene aligner, online tools for probe and primer evaluation and optimized browsing, searching and downloading on the website. Furthermore, the extensively curated SILVA taxonomy and the new non-redundant SILVA datasets provide an ideal reference for high-throughput classification of data from next-generation sequencing approaches.

18,256 citations


"VSEARCH: a versatile open source to..." refers methods in this paper

  • ...
                               2 segments            3 segments            4 segments
    Divergence  Noise    UC   U7   U8    V     UC   U7   U8    V     UC   U7   U8    V
    97–99%      –        89   88   88   89     56   52   52   55     38   33   34   35
                i1       79   79   77   85     46   44   43   53     32   27   24   34
                i2       64   57   56   77     33   32   31   56     24   20   18   33
                i3       48   45   36   72     37   35   29   45     16   17   16   21
                i4       29   24   23   65     18   11   13   40      9    9    8   25
                i5       27   22   16   53     15   12   12   39      7    8    6   17
                m1       83   83   83   81     53   48   48   53     33   29   29   30
                m2       73   71   71   72     49   44   44   50     28   22   22   27
                m3       66   66   66   68     40   40   39   44     21   20   21   21
                m4       55   54   53   57     28   24   23   28     21   18   18   19
                m5       44   44   42   48     20   19   18   28     16   14   12   12
    95–97%      –       100  100  100  100     80   77   76   79     64   60   59   63
                i1      100   98   98  100     77   75   72   75     54   55   53   61
                i2       96   94   93   99     60   55   55   71     48   44   44   60
                i3       86   82   82   95     61   50   52   70     38   36   31   53
                i4       75   66   64   95     48   41   39   64     29   29   22   47
                i5       64   58   53   86     37   32   25   60     24   19   19   46
                m1       99   99   99   99     76   73   73   76     60   57   57   60
                m2       98   97   97   97     71   69   69   71     50   48   46   48
                m3       93   94   94   96     63   61   61   64     41   41   41   42
                m4       92   92   90   93     56   55   54   57     39   39   37   41
                m5       86   86   85   86     53   51   51   56     35   35   34   34
    90–95%      –       100  100  100  100     93   93   93   93     88   88   88   86
                i1      100  100  100  100     88   88   87   91     86   86   87   88
                i2       99   97   99   99     83   79   78   88     74   72   72   84
                i3      100  100  100  100     79   76   75   88     74   69   70   82
                i4       99   94   96   99     80   71   72   84     66   62   61   79
                i5       95   84   86   99     74   65   65   88     55   48   48   71
                m1      100  100  100  100     89   89   89   92     87   87   86   85
                m2      100  100  100  100     87   87   87   89     78   78   78   79
                m3      100   99   99  100     86   86   86   89     76   76   78   80
                m4      100  100  100  100     82   82   84   83     73   73   72   78
                m5       99   98   98   99     82   81   82   84     75   73   75   79

    dataset, while none of the programs work well with the SILVA dataset....

    [...]

  • ...All datasets used were small enough to fit comfortably in the memory allocated to a 32-bit process....

    [...]

  • ...We evaluated the chimera detection accuracy of VSEARCH and USEARCH in two ways, first using a method similar to that performed for UCHIME, and then using a new chimera simulation procedure based on sequences from Greengenes (DeSantis et al., 2006) and SILVA (Quast et al., 2013) sequences....

    [...]

  • ...Next, we tested reference-based (uchime_ref) and de novo (uchime_denovo) chimera detection using sequences from the 2011 version of Greengenes downloaded from http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/ and from version 106 (May 2011) of the SILVA database downloaded from https://www.arb-silva.de/no_cache/download/archive/release_106/Exports/....

    [...]

  • ...5 and 6 show that de novo chimera detection performs better than reference-based detection, with the SILVA dataset in particular, but it does of course depend on the reference database used....

    [...]

Journal ArticleDOI
TL;DR: mothur is used as a case study to trim, screen, and align sequences; calculate distances; assign sequences to operational taxonomic units; and describe the α and β diversity of eight marine samples previously characterized by pyrosequencing of 16S rRNA gene fragments.
Abstract: mothur aims to be a comprehensive software package that allows users to use a single piece of software to analyze community sequence data. It builds upon previous tools to provide a flexible and powerful software package for analyzing sequencing data. As a case study, we used mothur to trim, screen, and align sequences; calculate distances; assign sequences to operational taxonomic units; and describe the alpha and beta diversity of eight marine samples previously characterized by pyrosequencing of 16S rRNA gene fragments. This analysis of more than 222,000 sequences was completed in less than 2 h with a laptop computer.

17,350 citations


"VSEARCH: a versatile open source to..." refers methods in this paper

  • ...Several pipelines have been developed for microbiome analysis, among which mothur (Schloss et al., 2009), QIIME (Caporaso et al., 2010), and UPARSE (Edgar, 2013) are the most popular....

    [...]

Related Papers (5)