What are the requirements for Unicycler to be appropriate in such cases?

Algorithmic improvements to long-read alignment, path finding and graph manipulations will all be required for Unicycler to be appropriate in such cases.

How many assemblies were performed for the short-read sets?

For the short-read sets, the authors performed five assemblies: Unicycler in each of its modes (conservative, normal and bold), SPAdes and ABySS.

How many reads did npScarf take to complete the assembly?

npScarf required 76 minutes of reads (9.0x) to complete the assembly, SPAdes took 102 minutes of reads (12.1x) and miniasm took 213 minutes of reads (25.3x).

Why was it included in the tests?

It was included in these tests because of its speed—it only takes a few minutes to run—making it potentially suitable for real-time analysis.

How many assemblies were performed using PBSIM?

All tests were performed in five replicates using separately generated synthetic reads, resulting in 16920 total assemblies.

How many RNA operons were found in the reference genome?

The authors produced seven query sequences from the reference genome: each RNA operon along with 2 kbp of neighbouring sequence on each end.

What is the way to polish the genome?

By iteratively polishing the genome with both short and long reads, this process can correct many remaining errors in a completed assembly, including those in repeat regions.

What assemblers performed the fastest on the simulated data?

SPAdes and npScarf performed the fastest, both having a median time of eight minutes and maximum time of less than 25 minutes on the same data.

What is the way to polish the assembly graph?

As a final step, Unicycler uses Bowtie2 and Pilon to polish the assembly using short-read alignments, reducing the rate of small errors (Fig 1G)[21,22].

How many reads were generated in a four hour period?

To investigate each assembler’s suitability for such real-time analysis, the authors generated 240 sub-sets of reads, one set per minute of sequencing, each containing all reads generated up to that minute (e.g. set 60 contained all reads generated in the first hour of sequencing).

(Open Access) Unicycler: resolving bacterial genome assemblies from short and long sequencing reads (2016) | Ryan R. Wick

Q: What have the authors contributed in "Unicycler: resolving bacterial genome assemblies from short and long sequencing reads"?

Here the authors present Unicycler, a new tool for assembling bacterial genomes from a combination of short and long reads, which produces assemblies that are accurate, complete and cost-effective. This is a PLOS Computational Biology Software paper.

Q: What is the cost-effective method of achieving a complete bacterial genome?

Hybrid assembly, which requires fewer long reads than long-read-only assembly, is the most cost-effective means of achieving this goal.

RESEARCH ARTICLE

Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads

Ryan R. Wick*, Louise M. Judd, Claire L. Gorrie, Kathryn E. Holt

Department of Biochemistry and Molecular Biology, Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Victoria, Australia

* rrwick@gmail.com

Abstract

The Illumina DNA sequencing platform generates accurate but short reads, which can be used to produce accurate but fragmented genome assemblies. Pacific Biosciences and Oxford Nanopore Technologies DNA sequencing platforms generate long reads that can produce complete genome assemblies, but the sequencing is more expensive and error-prone. There is significant interest in combining data from these complementary sequencing technologies to generate more accurate “hybrid” assemblies. However, few tools exist that truly leverage the benefits of both types of data, namely the accuracy of short reads and the structural resolving power of long reads. Here we present Unicycler, a new tool for assembling bacterial genomes from a combination of short and long reads, which produces assemblies that are accurate, complete and cost-effective. Unicycler builds an initial assembly graph from short reads using the de novo assembler SPAdes and then simplifies the graph using information from short and long reads. Unicycler uses a novel semi-global aligner to align long reads to the assembly graph. Tests on both synthetic and real reads show Unicycler can assemble larger contigs with fewer misassemblies than other hybrid assemblers, even when long-read depth and accuracy are low. Unicycler is open source (GPLv3) and available at github.com/rrwick/Unicycler.

This is a PLOS Computational Biology Software paper.

Introduction

Bacterial genomics is currently dominated by Illumina sequencing platforms. Illumina reads are accurate, have a low cost per base and have enabled widespread use of whole genome sequencing. However, much Illumina sequencing uses short fragments (500 bp or less) that are smaller than many repetitive elements in bacterial genomes[1]. This prevents short-read assembly tools (assemblers) from resolving the full genome, and their assemblies are instead fragmented into dozens of contiguous sequences (contigs). Consequently, most available bacterial genomes are incomplete, which hinders large-scale comparative genomic studies.

PLOS Computational Biology | https://doi.org/10.1371/journal.pcbi.1005595 June 8, 2017

OPEN ACCESS

Citation: Wick RR, Judd LM, Gorrie CL, Holt KE (2017) Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol 13(6): e1005595. https://doi.org/10.1371/journal.pcbi.1005595

Editor: Adam M. Phillippy, National Human Genome Research Institute, UNITED STATES

Received: January 13, 2017

Accepted: May 22, 2017

Published: June 8, 2017

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: All reference genomes used for simulation data are available from the NCBI assembly database (accession numbers in Table 1). E. coli sequence files are publicly available (links in Table 2). Klebsiella sequence files are available from the NCBI Sequence Read Archive database (accession numbers ERX1087708, ERX1087759, SRX2874872 and SRX2874871).

Funding: This work was funded by the NHMRC of Australia (project #1043822 and Fellowship #1061409 to KEH). The funders had no role in

Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) sequencing platforms can sequence DNA fragments of 10 kbp or longer, but at a higher cost per base than Illumina platforms. PacBio and ONT long reads also have much higher per-base error rates than Illumina reads (5–15% vs <1%), although they are often sufficient to complete bacterial genome assemblies with reasonable consensus accuracy[2,3]. Hence most researchers must choose between generating fragmented draft assemblies for many isolates with inexpensive Illumina sequencing, or generating complete assemblies for fewer isolates with expensive long-read technologies. Hybrid assembly, which uses a combination of short and long reads, offers an alternative. In this approach, short reads are used to produce accurate contigs and long reads provide the information to scaffold them together. This requires relatively few long reads and can thus be the most cost-effective route to a complete bacterial genome.

Despite recent developments in long-read technologies, Illumina reads are widely used in public health and research laboratories[4], and are likely to remain so for some time due to their high accuracy and low cost. Moreover, Illumina data is already available for hundreds of thousands of bacterial isolates, and most of these are unlikely to be replaced with long-read-only sequencing data. It is therefore probable that research and clinical labs will continue to use low cost Illumina reads for most samples and generate long reads as necessary to complete genomes of interest. Hybrid assembly, which requires fewer long reads than long-read-only assembly, is the most cost-effective means of achieving this goal.

Hybrid assembly can be accomplished with either a short-read-first or long-read-first approach. In the short-read-first method, a scaffolding tool uses long reads to join Illumina contigs together. However, scaffolding mistakes are common and lead to structural errors (misassemblies) in the sequence[5]. Long-read-first approaches may involve assembly of uncorrected long reads, followed by error-correction of the assembly using short reads[3]. Alternatively, they may first use short reads to correct errors in long reads, followed by assembly of the corrected long reads[6,7]. Whether error correction is performed before or after assembly, long-read-first approaches require higher long-read depth than short-read-first approaches.

Here we present Unicycler, a new hybrid assembly pipeline for bacterial isolate genomes. Unicycler first assembles short reads into an accurate and connected assembly graph, a data structure containing both contigs and their interconnections[8]. It then uses long reads to find the best paths through the graph. By following a short-read-first approach, Unicycler makes effective use of low quantities of long reads, but it can produce a completed assembly (one contig per replicon) if the long-read depth is sufficient. By using the assembly graph connections to constrain the possible scaffolding arrangements, Unicycler achieves lower misassembly rates than alternative short-read-first assemblers.

Design and implementation

Unicycler encapsulates its entire pipeline (Fig 1) in a single command and automatically determines low-level parameters so users can expect optimal results with default settings[9].

Short-read assembly

Unicycler uses SPAdes (v3.6.2 or later) to construct a De Bruijn graph assembly using a wide range of k-mer sizes: 10 values spanning 20–95% of the Illumina read length (not exceeding 127, the largest k-mer possible in SPAdes)[10]. In SPAdes, large k-mers often result in larger contigs, but excessively large k-mers can cause a fragmented graph with dead ends. Unicycler assigns a score ($\frac{1}{c(d+1)^2}$) to each k-mer graph based on the number of contigs ($c$) and the number of dead ends ($d$). This score function penalises both large numbers of contigs and large numbers of dead ends. Since dead ends are particularly problematic in later Unicycler steps (see Multiplicity and Graph bridging using long-read alignments), the score function scales with the inverse of $d^2$. The highest scoring graph is selected as a balance between minimising both contig count and dead ends (Figs 1A and S1).

Fig 1. Key steps in the Unicycler pipeline.

https://doi.org/10.1371/journal.pcbi.1005595.g001

study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.
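The graph selection step can be illustrated with a minimal Python sketch. It assumes the score takes the form 1 / (c (d + 1)^2), with c the contig count and d the dead-end count. The function names and the example counts are hypothetical and are not taken from Unicycler's code.

```python
def kmer_graph_score(contig_count: int, dead_end_count: int) -> float:
    """Score a k-mer graph as 1 / (c * (d + 1)^2); higher is better."""
    if contig_count <= 0:
        raise ValueError("a graph needs at least one contig")
    return 1.0 / (contig_count * (dead_end_count + 1) ** 2)


def best_kmer(graph_stats: dict[int, tuple[int, int]]) -> int:
    """Return the k whose graph scores highest.

    graph_stats maps k -> (contig count, dead-end count).
    """
    return max(graph_stats, key=lambda k: kmer_graph_score(*graph_stats[k]))


# Hypothetical contig and dead-end counts for three k-mer sizes.
stats = {27: (410, 0), 77: (150, 0), 127: (95, 2)}
chosen = best_kmer(stats)
```

With these made-up numbers the largest k-mer gives the fewest contigs, but its two dead ends divide the score by nine, so the intermediate k-mer is preferred.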

As some contamination is possible in sequencing read sets (particularly when multiplexing, which is a common strategy for bacterial isolate sequencing on Illumina[11]), Unicycler then removes contigs with a depth of less than half the median graph depth, unless doing so would create a dead end. This removes most contamination while leaving important graph structures intact.
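A simplified sketch of this depth filter is given below. It is an illustration under several assumptions rather than Unicycler's implementation: strands are ignored, links are plain directed pairs, the median depth is supplied by the caller, and low-depth contigs are visited in increasing depth order. All names are hypothetical.

```python
def remove_low_depth_contigs(depths, edges, median_depth):
    """Drop contigs below half the median depth unless that makes a dead end.

    depths: contig name -> read depth
    edges:  set of (u, v) pairs, meaning the end of u joins the start of v
    Returns the surviving (depths, edges).
    """
    depths, edges = dict(depths), set(edges)
    # Visit the lowest-depth contigs first (an ordering chosen for this sketch).
    for name in sorted(depths, key=depths.get):
        if depths[name] >= median_depth / 2:
            break
        preds = {u for u, v in edges if v == name and u != name}
        succs = {v for u, v in edges if u == name and v != name}
        # A neighbour whose only link on that side goes to `name` would be
        # left with a dead end if `name` were removed.
        creates_dead_end = any(
            all(v == name for u2, v in edges if u2 == u) for u in preds
        ) or any(
            all(u == name for u, v2 in edges if v2 == v) for v in succs
        )
        if creates_dead_end:
            continue
        del depths[name]
        edges = {(u, v) for u, v in edges if name not in (u, v)}
    return depths, edges


# Hypothetical graph: X is a low-depth side path, Y is low-depth but essential.
depths = {"A": 50, "B": 48, "C": 52, "X": 3, "Y": 4}
edges = {("A", "B"), ("B", "C"), ("A", "X"), ("X", "C"), ("C", "Y"), ("Y", "A")}
kept, links = remove_low_depth_contigs(depths, edges, median_depth=50)
```

In this example X is removed because A and C remain connected through B, while Y is kept because removing it would leave the end of C and the start of A unconnected.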

Multiplicity

To resolve the graph as accurately as possible, Unicycler must first determine the multiplicity of contigs in the assembly graph. The most important distinction is between single-copy contigs (sequences that occur once in the genome, multiplicity k = 1) and repeat contigs (sequences that occur multiple times in the genome, multiplicity k > 1). However, determining the correct multiplicity for repeat contigs is also important, as this information can be used when finalising the assembly graph (see Conservative, normal and bold).

When a bacterial genome consists of a single chromosome with no additional replicons, then for each contig x, its median read depth $d_x$ is a good indicator of its multiplicity $k_x$. Single-copy contigs will have a median depth $d_x$ close to D, the median depth per base across the entire assembly, while repeat contigs will have a median depth near a multiple of that value (i.e. $d_x \approx k_x D$). The relationship between median read depth and multiplicity is more complicated when the genome contains multiple replicons present at different copy numbers per cell. For example, small plasmids are often present in multiple copies, while large conjugative plasmids are often present once per cell. The relationship between read depth and multiplicity only holds for replicons which exist in one copy per cell (the same as the chromosome). For example, contigs with depth 2D may be chromosomal and have a multiplicity of two, or they may be in a two-copy-per-cell plasmid and have a multiplicity of one.
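The depth-only estimate can be written as a one-line calculation. The sketch below is illustrative: the function name and the example depths are hypothetical, and it deliberately ignores graph connectivity, which is why it cannot separate the two cases described for contigs at depth 2D.

```python
def depth_multiplicity(contig_depth: float, median_depth: float) -> int:
    """Nearest whole-number k such that contig_depth is about k * median_depth."""
    return max(1, round(contig_depth / median_depth))


# Hypothetical depths for an assembly with median depth D = 40.
D = 40.0
estimates = {depth: depth_multiplicity(depth, D) for depth in (41.0, 79.0, 118.0)}
```

A contig at depth 79 is estimated as k = 2 here whether it is a chromosomal repeat or a single-copy contig on a two-copy-per-cell plasmid, so depth alone is not enough.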

In addition to read depth, a contig’s graph connections also provide useful information about its multiplicity. Repeat contigs typically have multiple graph connections at their start and end, while single-copy contigs usually have only a single connection at each end. These trends break down when the assembly graph is fragmented, which is one reason why Unicycler aims to minimise the number of dead ends when determining the optimal short-read assembly graph.

To determine multiplicity values, Unicycler therefore uses both depth and connectivity information. Initially, a multiplicity of one is assigned to all contigs that are near the graph’s median depth and have no more than one connection at either end. A greedy algorithm then propagates multiplicity where graph connections and depth are in close agreement (Figs 1B and S2). When no more propagation is possible, the largest suitable contig is given a multiplicity of one and the process is repeated. This algorithm can correctly assign multiplicity to high-copy-number plasmid contigs in addition to chromosomal contigs.

Bridges

Unicycler scaffolds assembly graphs by constructing bridge contigs to connect pairs of single-copy contigs. Before bridging, single-copy contigs connect via multiple alternative paths containing one or more repeat contigs. After bridging, they connect via a simple, unambiguous path. Bridges thus simplify the graph by resolving repeats. There are two primary sources of information available for creating bridges: paired-end short reads, which can resolve small repeats, and long reads, which can resolve much larger repeats.

Graph bridging using read pair information

When SPAdes assembles paired-end reads, it uses its ExSPAnder algorithm to find paths through the assembly graph using read-pair orientation[12]. This process is known as repeat resolution (RR). SPAdes does not save its post-RR assembly (contigs.fasta) in graph form, but it does save the graph paths used to make post-RR contigs (contigs.paths). Unicycler finds cases where two single-copy contigs are connected in a SPAdes contig path and uses them to build bridges. In Fig 1C, the SPAdes contig path connects contigs 1 and 5 via contig 3. Unicycler’s resulting bridge connects contigs 1 and 5 directly with a copy of the contig 3 sequence. When this bridge is applied, contigs 2 and 4 also become connected via an unbranching path, in essence becoming bridged by process of elimination. These indirect graph simplifications may be merged together later in Unicycler’s pipeline, depending on the mode (see Conservative, normal and bold). While Unicycler creates bridges at this stage, they are not immediately applied to the graph. This is deferred to a later step where bridges are applied in decreasing order of quality (see Bridge application).
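The Fig 1C example can be sketched as a scan over a contig path. This is a minimal illustration that assumes the path is already available as a list of plain contig IDs (strand information and the parsing of contigs.paths are omitted) and that directly adjacent single-copy contigs need no bridge. The function name is hypothetical.

```python
def bridges_from_path(path, single_copy):
    """Yield (start, end, middle) for consecutive single-copy contigs in a path.

    path:        contig IDs in the order a SPAdes contig traverses them
    single_copy: set of contig IDs with multiplicity one
    middle:      the contigs between the pair, which the bridge would copy
    """
    anchors = [i for i, contig in enumerate(path) if contig in single_copy]
    for i, j in zip(anchors, anchors[1:]):
        middle = path[i + 1:j]
        if middle:  # assumption: adjacent single-copy contigs are skipped
            yield path[i], path[j], middle


# Fig 1C: the SPAdes contig path joins contigs 1 and 5 via repeat contig 3.
bridges = list(bridges_from_path([1, 3, 5], single_copy={1, 2, 4, 5}))
```

For the Fig 1C path this produces a single bridge from contig 1 to contig 5 that carries a copy of contig 3.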

Semi-global long-read alignment

While short reads can resolve repeats up to the insert size of the library (typically <1000 bp), long reads provide a much more powerful source of scaffolding information. As a first step in long-read bridging, Unicycler aligns all available long reads to the single-copy contigs. Since the long and short reads must be from the same biological sample, there should be no genuine structural discrepancies between the long reads and contigs. Semi-global alignment (i.e. end-gap free alignment) is therefore appropriate, where alignments can only terminate when the end of a sequence is reached. Most available long-read alignment tools such as BLASR[13], BWA-MEM[14], BLAST[15] and LAST[16] perform local alignment, so Unicycler implements semi-global alignment directly using the SeqAn C++ library (S3 Fig)[17].
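To make the end-gap-free idea concrete, the following Python sketch scores a semi-global alignment with plain dynamic programming. It is a textbook formulation with linear gap costs and arbitrary scoring values chosen for illustration. It is not Unicycler's SeqAn-based aligner, and it returns only the score rather than the alignment itself.

```python
def semi_global_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Best end-gap-free alignment score between a and b.

    Leading gaps are free (first row and column start at zero) and trailing
    gaps are free (the best score is taken from the last row or column), so
    an alignment only stops where one of the sequences runs out.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        best = max(best, curr[-1])  # alignment reached the end of b
        prev = curr
    return max(best, max(prev))  # last row: alignment reached the end of a


# A contig end overlapping a read start, and a short read contained in a contig.
overlap = semi_global_score("GGGACGT", "ACGTTTT")
contained = semi_global_score("TTACGTGG", "ACGT")
```

Both toy cases score four matches: the unaligned flanks cost nothing, which is the property that distinguishes semi-global from global alignment.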

Graph bridging using long-read alignments

Long reads that align to multiple single-copy contigs can be used for bridging. Such reads contain a sample of the gap sequence between those contigs, and if multiple long reads connect a pair of contigs, Unicycler uses SeqAn to produce a consensus gap sequence[18,19]. Unicycler does not directly use this gap sequence in the bridge but instead uses it to find the best graph path connecting the contigs, via a branch and bound algorithm. Thus, the bridge sequence comes from the graph and reflects base calling accuracy of the short reads rather than the long reads that may have much lower accuracy (Fig 1D). Sometimes Unicycler cannot find a graph path connecting two single-copy contigs that are connected via long reads, such as when the short-read graph is incomplete and contains dead ends. In these cases, the long-read consensus sequence is directly used as the bridging sequence. Such bridges are more likely to contain errors, another reason why Unicycler strives to minimise dead ends in the assembly graph.
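The path search can be sketched as a small branch and bound over graph paths. This is an illustrative simplification and not Unicycler's algorithm: it assumes contigs join without overlaps, every contig sequence is non-empty, the cost is plain edit distance to the consensus, and the bound is the length difference between the partial path and the consensus. The names and the `max_excess` cap are hypothetical.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]


def best_bridge_path(successors, seqs, start, end, consensus, max_excess=50):
    """Find the contigs between start and end that best match the consensus.

    Returns (path, cost): the intermediate contigs and the edit distance
    between their joined sequence and the long-read consensus.
    """
    best_path, best_cost = None, float("inf")

    def extend(node, path, joined):
        nonlocal best_path, best_cost
        # Bound: the joined sequence only grows and edit distance is at least
        # the length difference, so an over-long partial path cannot win.
        excess = len(joined) - len(consensus)
        if excess > max_excess or excess >= best_cost:
            return
        for nxt in successors.get(node, ()):
            if nxt == end:
                cost = edit_distance(joined, consensus)
                if cost < best_cost:
                    best_path, best_cost = list(path), cost
            else:
                extend(nxt, path + [nxt], joined + seqs[nxt])

    extend(start, [], "")
    return best_path, best_cost


# Hypothetical graph: A reaches B through tandem repeat r1 or through r2.
successors = {"A": ["r1", "r2"], "r1": ["B", "r1"], "r2": ["B"]}
seqs = {"r1": "ACGT", "r2": "ACCTTG"}
result = best_bridge_path(successors, seqs, "A", "B", consensus="ACGTACGA")
```

In this toy graph the noisy consensus is closest to two copies of r1, so the search returns that path with a cost of one, and the resulting bridge sequence would come from the graph contigs rather than from the long-read consensus.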

Bridge application

Having produced bridges from both short reads (SPAdes RR) and long reads, Unicycler can now apply them to simplify the graph structure (Fig 1E). Since some bridges may be erroneous, Unicycler assigns a quality score to each bridge and applies them in order of decreasing

