RESEARCH ARTICLE
Unicycler: Resolving bacterial genome
assemblies from short and long sequencing
reads
Ryan R. Wick*, Louise M. Judd, Claire L. Gorrie, Kathryn E. Holt
Department of Biochemistry and Molecular Biology, Bio21 Molecular Science and Biotechnology Institute,
The University of Melbourne, Victoria, Australia
* rrwick@gmail.com
Abstract
The Illumina DNA sequencing platform generates accurate but short reads, which can be
used to produce accurate but fragmented genome assemblies. Pacific Biosciences and
Oxford Nanopore Technologies DNA sequencing platforms generate long reads that can
produce complete genome assemblies, but the sequencing is more expensive and error-
prone. There is significant interest in combining data from these complementary sequencing
technologies to generate more accurate “hybrid” assemblies. However, few tools exist that
truly leverage the benefits of both types of data, namely the accuracy of short reads and
the structural resolving power of long reads. Here we present Unicycler, a new tool for
assembling bacterial genomes from a combination of short and long reads, which produces
assemblies that are accurate, complete and cost-effective. Unicycler builds an initial
assembly graph from short reads using the de novo assembler SPAdes and then simplifies the
graph using information from short and long reads. Unicycler uses a novel semi-global
aligner to align long reads to the assembly graph. Tests on both synthetic and real reads
show Unicycler can assemble larger contigs with fewer misassemblies than other hybrid
assemblers, even when long-read depth and accuracy are low. Unicycler is open source
(GPLv3) and available at github.com/rrwick/Unicycler.
This is a PLOS Computational Biology Software paper.
Unicycler: Bacterial genome assemblies from short and long reads
PLOS Computational Biology | https://doi.org/10.1371/journal.pcbi.1005595 June 8, 2017 1 / 22
OPEN ACCESS
Citation: Wick RR, Judd LM, Gorrie CL, Holt KE (2017) Unicycler: Resolving bacterial genome
assemblies from short and long sequencing reads. PLoS Comput Biol 13(6): e1005595.
https://doi.org/10.1371/journal.pcbi.1005595
Editor: Adam M. Phillippy, National Human Genome Research Institute, UNITED STATES
Received: January 13, 2017
Accepted: May 22, 2017
Published: June 8, 2017
Copyright: © 2017 Wick et al. This is an open access article distributed under the terms of the
Creative Commons Attribution License, which permits unrestricted use, distribution, and
reproduction in any medium, provided the original author and source are credited.
Data Availability Statement: All reference genomes used for simulation data are available from
the NCBI assembly database (accession numbers in Table 1). E. coli sequence files are publicly
available (links in Table 2). Klebsiella sequence files are available from the NCBI Sequence Read
Archive database (accession numbers ERX1087708, ERX1087759, SRX2874872 and SRX2874871).
Funding: This work was funded by the NHMRC of Australia (project #1043822 and Fellowship
#1061409 to KEH). The funders had no role in study design, data collection and analysis, decision
to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Bacterial genomics is currently dominated by Illumina sequencing platforms. Illumina reads
are accurate, have a low cost per base and have enabled widespread use of whole genome
sequencing. However, much Illumina sequencing uses short fragments (500 bp or less) that
are smaller than many repetitive elements in bacterial genomes[1]. This prevents short-read
assembly tools (assemblers) from resolving the full genome, and their assemblies are instead
fragmented into dozens of contiguous sequences (contigs). Consequently, most available
bacterial genomes are incomplete, which hinders large-scale comparative genomic studies.
Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) sequencing
platforms can sequence DNA fragments of 10 kbp or longer, but at a higher cost per base than
Illumina platforms. PacBio and ONT long reads also have much higher per-base error rates
than Illumina reads (5–15% vs <1%), although they are often sufficient to complete bacterial
genome assemblies with reasonable consensus accuracy[2,3]. Hence most researchers must
choose between generating fragmented draft assemblies for many isolates with inexpensive
Illumina sequencing, or generating complete assemblies for fewer isolates with expensive
long-read technologies. Hybrid assembly, which uses a combination of short and long reads, offers
an alternative. In this approach, short reads are used to produce accurate contigs and long
reads provide the information to scaffold them together. This requires relatively few long reads
and can thus be the most cost-effective route to a complete bacterial genome.
Despite recent developments in long-read technologies, Illumina reads are widely used in
public health and research laboratories[4], and are likely to remain so for some time due to
their high accuracy and low cost. Moreover, Illumina data is already available for hundreds of
thousands of bacterial isolates, and most of these are unlikely to be replaced with
long-read-only sequencing data. It is therefore probable that research and clinical labs will continue to
use low cost Illumina reads for most samples and generate long reads as necessary to complete
genomes of interest. Hybrid assembly, which requires fewer long reads than long-read-only
assembly, is the most cost-effective means of achieving this goal.
Hybrid assembly can be accomplished with either a short-read-first or long-read-first
approach. In the short-read-first method, a scaffolding tool uses long reads to join Illumina
contigs together. However, scaffolding mistakes are common and lead to structural errors
(misassemblies) in the sequence[5]. Long-read-first approaches may involve assembly of
uncorrected long reads, followed by error-correction of the assembly using short reads[3].
Alternatively, they may first use short reads to correct errors in long reads, followed by
assembly of the corrected long reads[6,7]. Whether error correction is performed before or after
assembly, long-read-first approaches require higher long-read depth than short-read-first
approaches.
Here we present Unicycler, a new hybrid assembly pipeline for bacterial isolate genomes.
Unicycler first assembles short reads into an accurate and connected assembly graph, a data
structure containing both contigs and their interconnections[8]. It then uses long reads to find
the best paths through the graph. By following a short-read-first approach, Unicycler makes
effective use of low quantities of long reads, but it can produce a completed assembly (one
contig per replicon) if the long-read depth is sufficient. By using the assembly graph connections
to constrain the possible scaffolding arrangements, Unicycler achieves lower misassembly
rates than alternative short-read-first assemblers.
Design and implementation
Unicycler encapsulates its entire pipeline (Fig 1) in a single command and automatically
determines low-level parameters so users can expect optimal results with default settings[9].
Fig 1. Key steps in the Unicycler pipeline.
https://doi.org/10.1371/journal.pcbi.1005595.g001
Short-read assembly
Unicycler uses SPAdes (v3.6.2 or later) to construct a De Bruijn graph assembly using a wide
range of k-mer sizes: 10 values spanning 20–95% of the Illumina read length (not exceeding
127, the largest k-mer possible in SPAdes)[10]. In SPAdes, large k-mers often result in larger
contigs, but excessively large k-mers can cause a fragmented graph with dead ends. Unicycler
assigns a score ($\frac{1}{c(d+1)^2}$) to each k-mer graph based on the number of contigs (c) and
the number of dead ends (d). This score function penalises both large numbers of contigs and large
numbers of dead ends. Since dead ends are particularly problematic in later Unicycler steps (see
Multiplicity and Graph bridging using long-read alignments), the score function scales with
the inverse of $d^2$. The highest scoring graph is selected as a balance between minimising both
contig count and dead ends (Figs 1A and S1).
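The k-mer selection and graph scoring described above can be sketched roughly as follows. The score formula comes from the text; the even spacing of the k values, the rounding down to odd numbers (SPAdes only accepts odd k) and all function names are assumptions made for illustration, not details taken from Unicycler's source.

```python
def candidate_kmers(read_length, count=10, k_max=127):
    """Pick up to `count` odd k values spanning roughly 20-95% of the read length.

    Even spacing and rounding down to odd values are assumptions of this sketch.
    """
    low = 0.20 * read_length
    high = min(0.95 * read_length, k_max)
    kmers = []
    for i in range(count):
        k = int(round(low + (high - low) * i / (count - 1)))
        if k % 2 == 0:
            k -= 1  # SPAdes requires odd k-mer sizes
        if k not in kmers:
            kmers.append(k)
    return kmers


def graph_score(contig_count, dead_end_count):
    """Score from the text: 1 / (c * (d + 1)^2)."""
    return 1.0 / (contig_count * (dead_end_count + 1) ** 2)


def best_kmer(graph_stats):
    """graph_stats maps k -> (contig count, dead end count); return the top-scoring k."""
    return max(graph_stats, key=lambda k: graph_score(*graph_stats[k]))
```

Because d enters the score squared, a graph with 60 contigs and 2 dead ends (score 1/540) loses to a graph with 100 contigs and no dead ends (score 1/100), which matches the stated preference for connected graphs.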
As some contamination is possible in sequencing read sets (particularly when multiplexing,
which is a common strategy for bacterial isolate sequencing on Illumina[11]), Unicycler then
removes contigs with a depth of less than half the median graph depth, unless doing so would
create a dead end. This removes most contamination while leaving important graph structures
intact.
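A minimal sketch of this depth filter is given below, under simplifying assumptions that are not from the paper: the graph is a set of directed (source, target) edges without strand information, a dead end is a contig left with no incoming or no outgoing edge, and the median is taken over contigs rather than per base. The function names are hypothetical.

```python
from statistics import median


def creates_dead_end(contig, edges):
    """True if deleting `contig` would remove a neighbour's last link on one side."""
    for src, dst in edges:
        if src == contig and dst != contig:
            # dst would lose this incoming edge; does it have another one?
            if not any(d == dst and s != contig for s, d in edges):
                return True
        if dst == contig and src != contig:
            # src would lose this outgoing edge; does it have another one?
            if not any(s == src and d != contig for s, d in edges):
                return True
    return False


def filter_low_depth(depths, edges):
    """Drop contigs below half the median depth unless that would create a dead end."""
    cutoff = 0.5 * median(depths.values())  # per-contig median: an assumption
    depths, edges = dict(depths), set(edges)
    for contig in sorted(depths, key=depths.get):  # lowest depth first
        if depths[contig] >= cutoff:
            break
        if creates_dead_end(contig, edges):
            continue
        del depths[contig]
        edges = {(s, d) for s, d in edges if contig not in (s, d)}
    return depths, edges
```

In this sketch an isolated low-depth contig (typical of contamination) is removed, while a low-depth contig that is the only link between two chromosomal contigs is kept.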
Multiplicity
To resolve the graph as accurately as possible, Unicycler must first determine the multiplicity
of contigs in the assembly graph. The most important distinction is between single-copy
contigs (sequences that occur once in the genome, multiplicity k = 1) and repeat contigs
(sequences that occur multiple times in the genome, multiplicity k > 1). However,
determining the correct multiplicity for repeat contigs is also important, as this information can be used
when finalising the assembly graph (see Conservative, normal and bold).
When a bacterial genome consists of a single chromosome with no additional replicons,
then for each contig x, its median read depth $d_x$ is a good indicator of its multiplicity $k_x$.
Single-copy contigs will have a median depth $d_x$ close to $D$, the median depth per base across
the entire assembly, while repeat contigs will have a median depth near a multiple of that value
(i.e. $d_x \approx k_x D$). The relationship between median read depth and multiplicity is more
complicated when the genome contains multiple replicons present at different copy numbers per cell.
For example, small plasmids are often present in multiple copies, while large conjugative
plasmids are often present once per cell. The relationship between read depth and multiplicity
only holds for replicons which exist in one copy per cell (the same as the chromosome). For
example, contigs with depth $2D$ may be chromosomal and have a multiplicity of two, or they
may be in a two-copy-per-cell plasmid and have a multiplicity of one.
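As a small worked illustration of the depth relationship, the function below rounds a contig's depth to the nearest multiple of the assembly-wide median depth. The function name and the floor of one are assumptions of this sketch; it only reflects the single-chromosome case and cannot by itself separate the two interpretations of a depth near twice the median.

```python
def depth_multiplicity(contig_depth, median_depth):
    """Nearest whole-number multiple of the median depth, never below one."""
    return max(1, round(contig_depth / median_depth))


# With a median depth of 50x, a contig at 148x looks like a three-copy repeat.
# A contig at 101x looks like a two-copy chromosomal repeat, but it could
# equally be a single-copy contig on a two-copy-per-cell plasmid.
print(depth_multiplicity(148, 50), depth_multiplicity(101, 50))
```

This ambiguity is why depth alone is not enough and graph connectivity has to be considered as well.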
In addition to read depth, a contig’s graph connections also provide useful information
about its multiplicity. Repeat contigs typically have multiple graph connections at their start
and end, while single-copy contigs usually have only a single connection at each end. These
trends break down when the assembly graph is fragmented, which is one reason why Unicycler
aims to minimise the number of dead ends when determining the optimal short-read assembly
graph.
To determine multiplicity values, Unicycler therefore uses both depth and connectivity
information. Initially, a multiplicity of one is assigned to all contigs that are near the graph’s
median depth and have no more than one connection at either end. A greedy algorithm then
propagates multiplicity where graph connections and depth are in close agreement (Figs 1B
and S2). When no more propagation is possible, the largest suitable contig is given a
multiplicity of one and the process is repeated. This algorithm can correctly assign multiplicity to
high-copy-number plasmid contigs in addition to chromosomal contigs.
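The seeding step and one possible propagation rule can be sketched as below. The paper does not spell out the propagation rules, so the "merge" rule here (a contig fed exclusively by contigs of known multiplicity receives their sum when its depth agrees) is an assumption, as are the 20% tolerance, the directed-edge graph model and all names. The final step, where the largest suitable contig is given a multiplicity of one and the process repeats, is omitted.

```python
def seed_single_copy(depths, inputs, outputs, median_depth, tolerance=0.2):
    """Give multiplicity 1 to contigs near the median depth with at most one link per end."""
    seeds = {}
    for contig, depth in depths.items():
        near_median = abs(depth - median_depth) <= tolerance * median_depth
        simple_ends = (len(inputs.get(contig, ())) <= 1
                       and len(outputs.get(contig, ())) <= 1)
        if near_median and simple_ends:
            seeds[contig] = 1
    return seeds


def propagate_by_merging(multiplicity, depths, inputs, outputs, median_depth, tolerance=0.2):
    """Assumed rule: sum the multiplicities of contigs that lead only into one contig."""
    changed = True
    while changed:
        changed = False
        for contig, sources in inputs.items():
            if contig in multiplicity or not sources:
                continue
            exclusive = all(s in multiplicity and set(outputs.get(s, ())) == {contig}
                            for s in sources)
            if not exclusive:
                continue
            total = sum(multiplicity[s] for s in sources)
            expected = total * median_depth
            if abs(depths[contig] - expected) <= tolerance * expected:  # depth must agree
                multiplicity[contig] = total
                changed = True
    return multiplicity
```

For a repeat contig R at about twice the median depth, entered from single-copy contigs A and B and exited into C and D, seeding marks A, B, C and D as single-copy and the merge rule then assigns R a multiplicity of two.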
Bridges
Unicycler scaffolds assembly graphs by constructing bridge contigs to connect pairs of
single-copy contigs. Before bridging, single-copy contigs connect via multiple alternative paths
containing one or more repeat contigs. After bridging, they connect via a simple, unambiguous
path. Bridges thus simplify the graph by resolving repeats. There are two primary sources of
information available for creating bridges: paired-end short reads, which can resolve small
repeats, and long reads, which can resolve much larger repeats.
Graph bridging using read pair information
When SPAdes assembles paired-end reads, it uses its ExSPAnder algorithm to find paths
through the assembly graph using read-pair orientation[12]. This process is known as repeat
resolution (RR). SPAdes does not save its post-RR assembly (contigs.fasta) in graph form, but
it does save the graph paths used to make post-RR contigs (contigs.paths). Unicycler finds
cases where two single-copy contigs are connected in a SPAdes contig path and uses them to
build bridges. In Fig 1C, the SPAdes contig path connects contigs 1 and 5 via contig 3.
Unicycler’s resulting bridge connects contigs 1 and 5 directly with a copy of the contig 3 sequence.
When this bridge is applied, contigs 2 and 4 also become connected via an unbranching path,
in essence becoming bridged by process of elimination. These indirect graph simplifications
may be merged together later in Unicycler’s pipeline, depending on the mode (see
Conservative, normal and bold). While Unicycler creates bridges at this stage, they are not immediately
applied to the graph. This is deferred to a later step where bridges are applied in decreasing
order of quality (see Bridge application).
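The idea of turning a SPAdes contig path into bridges can be illustrated with a short sketch. It works on an in-memory list of contig identifiers rather than parsing contigs.paths, ignores strand, and uses a hypothetical function name; it is not Unicycler's code.

```python
def bridges_from_path(path, single_copy):
    """Return (start, contigs_between, end) for consecutive single-copy contigs in a path."""
    bridges = []
    last_index = None
    for i, contig in enumerate(path):
        if contig not in single_copy:
            continue  # repeat contigs become part of a bridge's middle section
        if last_index is not None:
            between = tuple(path[last_index + 1:i])
            bridges.append((path[last_index], between, contig))
        last_index = i
    return bridges


# The Fig 1C example: a path through contigs 1, 3 and 5, where 3 is a repeat.
print(bridges_from_path([1, 3, 5], {1, 2, 4, 5}))  # [(1, (3,), 5)]
```

Each returned tuple names the two single-copy contigs to join and the repeat contigs whose sequence would be copied into the bridge.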
Semi-global long-read alignment
While short reads can resolve repeats up to the insert size of the library (typically <1000 bp),
long reads provide a much more powerful source of scaffolding information. As a first step in
long-read bridging, Unicycler aligns all available long reads to the single-copy contigs. Since
the long and short reads must be from the same biological sample, there should be no genuine
structural discrepancies between the long reads and contigs. Semi-global alignment (i.e. end-
gap free alignment) is therefore appropriate, where alignments can only terminate when the
end of a sequence is reached. Most available long-read alignment tools such as BLASR[13],
BWA-MEM[14], BLAST[15] and LAST[16] perform local alignment, so Unicycler implements
semi-global alignment directly using the SeqAn C++ library (S3 Fig)[17].
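To make the notion of end-gap-free alignment concrete, here is a textbook dynamic-programming sketch that returns only the alignment score. It is not Unicycler's SeqAn-based aligner; the unit scoring scheme and the function name are assumptions chosen for illustration.

```python
def semi_global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best alignment score where gaps at either end of either sequence are free."""
    cols = len(b) + 1
    prev = [0] * cols              # row 0: skipping a prefix of b costs nothing
    best = 0                       # the empty alignment
    for i in range(1, len(a) + 1):
        curr = [0] * cols          # column 0: skipping a prefix of a costs nothing
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        best = max(best, curr[-1])  # alignment may stop once b is exhausted
        prev = curr
    return max(best, max(prev))     # or once a is exhausted


# A read contained within a contig, and a read overlapping a contig end.
print(semi_global_score("ACGT", "TTACGTTT"))   # 4
print(semi_global_score("AAACGT", "CGTTTT"))   # 3
```

Unlike local alignment, the alignment cannot stop in the middle of both sequences, so internal disagreements are always scored rather than clipped away.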
Graph bridging using long-read alignments
Long reads that align to multiple single-copy contigs can be used for bridging. Such reads
contain a sample of the gap sequence between those contigs, and if multiple long reads connect a
pair of contigs, Unicycler uses SeqAn to produce a consensus gap sequence[18,19]. Unicycler
does not directly use this gap sequence in the bridge but instead uses it to find the best graph
path connecting the contigs, via a branch and bound algorithm. Thus, the bridge sequence
comes from the graph and reflects base calling accuracy of the short reads rather than the long
reads that may have much lower accuracy (Fig 1D). Sometimes Unicycler cannot find a graph
path connecting two single-copy contigs that are connected via long reads, such as when the
short-read graph is incomplete and contains dead ends. In these cases, the long-read consensus
sequence is directly used as the bridging sequence. Such bridges are more likely to contain
errors—another reason why Unicycler strives to minimise dead ends in the assembly graph.
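The path search can be sketched as a depth-first enumeration that abandons paths once they grow much longer than the long-read consensus, then keeps the path whose sequence is most similar to it. This is a deliberately simplified stand-in for the branch and bound search: the length-only bound, the `slack` parameter, the use of Python's difflib for similarity and the assumption that graph contigs do not overlap are all choices of this sketch, not the paper's.

```python
from difflib import SequenceMatcher


def best_bridge_path(graph, seqs, start, end, consensus, slack=1.2):
    """Find the contigs between start and end that best match the consensus.

    graph maps contig -> list of downstream contigs; seqs maps contig -> sequence
    (assumed non-empty). Returns (contigs_between, similarity) or (None, -1.0).
    """
    max_len = slack * len(consensus)
    best_path, best_ratio = None, -1.0
    stack = [(start, [], "")]  # (current contig, contigs so far, sequence so far)
    while stack:
        contig, between, seq = stack.pop()
        for nxt in graph.get(contig, []):
            if nxt == end:
                ratio = SequenceMatcher(None, seq, consensus, autojunk=False).ratio()
                if ratio > best_ratio:
                    best_path, best_ratio = between, ratio
            else:
                new_seq = seq + seqs[nxt]
                if len(new_seq) <= max_len:  # bound: prune over-long paths
                    stack.append((nxt, between + [nxt], new_seq))
    return best_path, best_ratio
```

The returned path is spelled out in graph contigs, so a bridge built from it carries short-read sequence even though a noisy long-read consensus guided the choice.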
Bridge application
Having produced bridges from both short reads (SPAdes RR) and long reads, Unicycler can
now apply them to simplify the graph structure (Fig 1E). Since some bridges may be
erroneous, Unicycler assigns a quality score to each bridge and applies them in order of decreasing

References
Fast gapped-read alignment with Bowtie 2.
Tesler G. SPAdes, a new genome assembly algorithm and its applications to single-cell sequencing (7th Annual SFAF Meeting, 2012).
Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 16 Mar 2013.
T-Coffee: A novel method for fast and accurate multiple sequence alignment.