
HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies

Peter Edge,1 Vineet Bafna,1 and Vikas Bansal2

1Department of Computer Science & Engineering, University of California, San Diego, La Jolla, California 92093, USA; 2Department of Pediatrics, School of Medicine, University of California, San Diego, La Jolla, California 92093, USA

Corresponding author: vibansal@ucsd.edu

Genome Research 27: 801-812 (May 2017); doi: 10.1101/gr.213462.116. © 2017 Edge et al.; published by Cold Spring Harbor Laboratory Press and, six months after the full-issue publication date, available under a Creative Commons Attribution-NonCommercial 4.0 International license (http://creativecommons.org/licenses/by-nc/4.0/).
Many tools have been developed for haplotype assembly, the reconstruction of individual haplotypes using reads mapped to a reference genome sequence. Due to increasing interest in obtaining haplotype-resolved human genomes, a range of new sequencing protocols and technologies have been developed to enable the reconstruction of whole-genome haplotypes. However, existing computational methods designed to handle specific technologies do not scale well on data from different protocols. We describe a new algorithm, HapCUT2, that extends our previous method (HapCUT) to handle multiple sequencing technologies. Using simulations and whole-genome sequencing (WGS) data from multiple different data types (dilution pool sequencing, linked-read sequencing, single molecule real-time (SMRT) sequencing, and proximity ligation (Hi-C) sequencing), we show that HapCUT2 rapidly assembles haplotypes with best-in-class accuracy for all data types. In particular, HapCUT2 scales well for high sequencing coverage and rapidly assembled haplotypes for two long-read WGS data sets on which other methods struggled. Further, HapCUT2 directly models Hi-C-specific error modalities, resulting in significant improvements in error rates compared to HapCUT, the only other method that could assemble haplotypes from Hi-C data. Using HapCUT2, haplotype assembly from a 90× coverage whole-genome Hi-C data set yielded high-resolution haplotypes (78.6% of variants phased in a single block) with high pairwise phasing accuracy (∼98% across chromosomes). Our results demonstrate that HapCUT2 is a robust tool for haplotype assembly applicable to data from diverse sequencing technologies.

[Supplemental material is available for this article.]
Humans are diploid organisms with two copies of each chromosome (except the sex chromosomes). The two haplotypes (described by the combination of alleles at variant sites on a single chromosome) represent the complete information on DNA variation in an individual. Reconstructing individual haplotypes has important implications for understanding human genetic variation, interpretation of variants in disease, and reconstructing human population history (Tewhey et al. 2011; Glusman et al. 2014; Schiffels and Durbin 2014; Snyder et al. 2015). A number of methods, computational and experimental, have been developed for haplotyping human genomes. Statistical methods for haplotype phasing using population genotype data have proven successful for phasing common variants and for genotype imputation but are limited in their ability to phase rare variants and to phase long stretches of the genome that cross recombination hotspots (Browning and Browning 2011; Tewhey et al. 2011).
Haplotypes for an individual genome at known heterozygous variants can be directly reconstructed from reference-aligned sequence reads derived from whole-genome sequencing (WGS). Sequence reads that are long enough to cover multiple heterozygous variants provide partial haplotype information. Using overlaps between such haplotype-informative reads, long haplotypes can be assembled. This haplotype assembly approach does not rely on information from other individuals (such as parents) and can phase even individual-specific variants. Levy et al. (2007) demonstrated the feasibility of this approach using sequence data derived from paired Sanger sequencing of long-insert DNA fragment libraries to computationally assemble long haplotype blocks (N50 of 350 kb) for the first individual human genome.
Since then, advances in massively parallel sequencing technologies have drastically reduced the cost of human WGS, leading to the sequencing of thousands of human genomes. However, the short read lengths generated by technologies such as Illumina (100-250 bases) and the use of short fragment lengths in WGS protocols make it infeasible to link distant variants into haplotypes. To overcome this limitation, a number of innovative methods have been developed that attempt to preserve haplotype information from long DNA fragments (tens to hundreds of kilobases) in short sequence reads.
The underlying principle for these methods involves generating multiple pools of high-molecular-weight DNA fragments such that each pool contains only a small fraction of the DNA from a single genome. As a result, there are very few overlapping DNA fragments in each pool, and high-throughput sequencing of the DNA in each pool can be used to reconstruct the fragments by alignment to a reference genome (Kitzman et al. 2011; Suk et al. 2011). Therefore, each pool provides haplotype information from long DNA fragments, and long haplotypes can be assembled using information from a sufficiently large number of independent pools (Snyder et al. 2015). A number of methods based on this approach have been developed to phase human genomes (Kitzman et al. 2011; Suk et al. 2011; Peters et al. 2012; Kaper et al. 2013; Amini et al. 2014).
Recently, 10X Genomics described a novel microfluidics-based library preparation approach that generates long linked reads that can be assembled into long haplotypes (Zheng et al. 2016). Third-generation sequencing technologies such as Pacific Biosciences (PacBio) generate long sequence reads (2-20 kb in length) that can directly enable genome-wide haplotyping. Pendleton and colleagues demonstrated the feasibility of assembling haplotypes from SMRT reads using variants identified from short-read Illumina sequencing (Pendleton et al. 2015).
Haplotype assembly is also feasible with paired-end sequencing, i.e., pairs of short reads derived from the ends of long DNA fragments, but requires long and variable insert lengths to assemble long haplotypes (Tewhey et al. 2011). Selvaraj et al. (2013) used sequence data from a proximity ligation method (Hi-C) to assemble accurate haplotypes for mouse and human genomes. Using mouse data, they demonstrated that the vast majority of intrachromosomal Hi-C read pairs correspond to cis interactions (between fragments on the same chromosome) and therefore contain haplotype information equivalent to paired-end reads with long and variable insert lengths. Subsequently, 17× whole-genome Hi-C data was used to assemble chromosome-spanning haplotypes for a human genome, albeit with low resolution (<22% of variants phased).
In summary, multiple different sequencing technologies and protocols have the capability to generate sequence reads with haplotype information but require computational tools to assemble the reads into long haplotypes. A number of combinatorial algorithms have been developed for haplotype assembly (Bansal and Bafna 2008; Duitama et al. 2010; He et al. 2010; Aguiar and Istrail 2012). Among these, HapCUT (Bansal and Bafna 2008) was developed for phasing Sanger WGS data for the first individual genome (Levy et al. 2007). HapCUT utilizes max-cuts in read-haplotype graphs, an approach that is equally adept at handling data with local haplotype information and data with long-range haplotype information, such as that from long-insert paired-end reads. As a result, it has been successfully utilized, with appropriate modifications, to assemble haplotypes from different types of high-throughput sequence data sets, including fosmid pool sequencing (Kitzman et al. 2011), Hi-C data (Selvaraj et al. 2013), and single-molecule long reads (Pendleton et al. 2015). However, HapCUT only models simple sequencing errors and does not scale well for long-read data. More recently, several algorithms have been designed specifically to enable accurate haplotype assembly from long reads (Duitama et al. 2010; Kuleshov 2014).
The diverse characteristics and specific error modalities of data generated by different haplotype-enabling protocols and technologies continue to pose challenges for haplotype assembly algorithms. Some protocols, such as clone-based sequencing, can generate very long fragments (BAC clones of length 140 kb have been used to assemble haplotypes [Lo et al. 2013]) but may have low fragment coverage. Other protocols, such as PacBio SMRT, generate fragments with shorter mean lengths than clone-based approaches but can be scaled to higher read coverage more easily. 10X Genomics linked reads are long (longest molecules >100 kb) but have gaps, resulting in high clone coverage for each variant. Proximity ligation approaches, such as Hi-C, generate paired-end read data with very short read lengths but with a larger genomic span: Hi-C reads can span from a few kilobases to tens of megabases in physical distance. While an algorithm that leverages the characteristics of a specific type of data is likely to perform well on that particular type of data, it may not perform well, or may not work at all, on other types of data. For example, dynamic programming algorithms such as ProbHap (Kuleshov 2014) that were developed for low-depth long-read sequence data are unlikely to scale well for data sets with high sequence coverage or for other types of data such as Hi-C. Even if a haplotype assembly algorithm has broad support for data qualities, there remains the challenge that different sequencing protocols each have systematic error modalities. For instance, sequencing of multiple haploid subsets of a human genome (Kitzman et al. 2011; Suk et al. 2011) generates long haplotype fragments, but some of these fragments are chimeric due to overlapping DNA molecules that originate from different chromosomes. Similarly, noise in Hi-C data due to ligated fragments from opposite homologous chromosomes increases with increasing distance between the variants. The accuracy of haplotypes assembled from each sequencing protocol depends on both the haplotype assembly algorithm's ability to effectively utilize the sequence data and its ability to model protocol-specific errors.
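The h-trans error process just described admits a compact likelihood. As a hedged sketch in our own notation (not necessarily the paper's exact formulation): a read pair with insert size I is h-trans with some probability tau(I) that grows with I, in which case its two ends report alleles from opposite homologs, giving a two-component mixture for each Hi-C read pair:

    % Likelihood of a single Hi-C read pair r with insert size I_r, given a
    % candidate haplotype h (the other homolog is the complement of h).
    % \tau(I) is the span-dependent h-trans rate, estimated from the data.
    P(r \mid h) \;=\; \bigl(1 - \tau(I_r)\bigr)\, P_{\mathrm{cis}}(r \mid h)
                \;+\; \tau(I_r)\, P_{\mathrm{trans}}(r \mid h)

Here P_cis is the usual sequencing-error likelihood of the pair under h, and P_trans evaluates the same alleles with one end assigned to the complementary haplotype; with tau = 0 the expression reduces to the standard model used for the other data types.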
Results
To address the challenge of haplotype assembly for diverse types of sequence data sets, we developed HapCUT2, an algorithm that generalizes the HapCUT approach in several ways. In place of the discrete score optimized by HapCUT, HapCUT2 uses a likelihood-based model, which allows for the modeling and estimation of technology-specific errors such as h-trans errors in Hi-C data. To improve memory performance for long-read data, HapCUT2 does not explicitly construct the complete read-haplotype graph. Further, it implements a number of optimizations to enable fast runtimes on diverse types of sequence data sets. To demonstrate the accuracy and robustness of HapCUT2, we compared its performance with existing methods for haplotype assembly using simulated and real WGS data sets. Previous publications (Duitama et al. 2012; Kuleshov 2014) have compared different methods for haplotype assembly and concluded that RefHap (Duitama et al. 2010), ProbHap (Kuleshov 2014), FastHare (Panconesi and Sozio 2004), and HapCUT (Bansal and Bafna 2008) are among the best performing methods. Other methods, such as DGS (Levy et al. 2007), MixSIH (Matsumoto and Kiryu 2013), and HapTree (Berger et al. 2014), did not perform as well on the data sets evaluated in this study. Therefore, we compared the performance of HapCUT2 with four other methods: RefHap, ProbHap, FastHare, and HapCUT (Table 1).
Overview of HapCUT2 algorithm
The input to HapCUT2 consists of haplotype fragments (the sequence of alleles at heterozygous variant sites identified from aligned sequence reads) and a list of heterozygous variants (identified from WGS data). HapCUT2 aims to assemble a pair of haplotypes that is maximally consistent with the input set of haplotype fragments. This consistency is measured using a likelihood function that captures sequencing errors and technology-specific errors such as h-trans errors in proximity ligation data. HapCUT2 is an iterative procedure that starts with a candidate haplotype pair. Given the current pair of haplotypes, HapCUT2 searches for a subset of variants (using max-cut computations in the read-haplotype graph) such that changing the phase of these variants relative to the remaining variants results in a new pair of haplotypes with greater likelihood. This procedure is repeated iteratively until no further improvements can be made to the likelihood (see Methods for details).
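As an illustration of this loop, the sketch below phases a toy instance. Two simplifications are worth flagging loudly: it evaluates single-variant phase flips instead of the max-cut search over variant subsets that HapCUT2 performs, and it uses a flat allele error rate in place of per-base quality scores; the encoding of fragments as (site, allele) pairs is also our own illustrative choice, not a HapCUT2 input format.

    import math

    ERR = 0.02  # flat allele miscall rate (illustrative; HapCUT2 uses base qualities)

    def fragment_loglik(frag, hap):
        """Log-likelihood of one fragment under the haplotype pair (hap, complement)."""
        ll = [0.0, 0.0]  # fragment drawn from hap, or from its complement
        for site, allele in frag:
            match = (allele == hap[site])
            ll[0] += math.log(1 - ERR) if match else math.log(ERR)
            ll[1] += math.log(ERR) if match else math.log(1 - ERR)
        return math.log(0.5 * (math.exp(ll[0]) + math.exp(ll[1])))

    def total_loglik(frags, hap):
        return sum(fragment_loglik(f, hap) for f in frags)

    def phase(frags, n_sites):
        hap = [0] * n_sites  # one homolog; the other is its complement
        cur = total_loglik(frags, hap)
        improved = True
        while improved:  # iterate until the likelihood stops increasing
            improved = False
            for s in range(n_sites):
                hap[s] ^= 1  # flip the phase of one variant relative to the rest
                new = total_loglik(frags, hap)
                if new > cur:
                    cur, improved = new, True
                else:
                    hap[s] ^= 1  # revert: this flip did not raise the likelihood
        return hap

    # three fragments consistent with the phasing 0-1-1-0 (or its complement)
    frags = [[(0, 0), (1, 1)], [(1, 1), (2, 1), (3, 0)], [(0, 1), (2, 0)]]
    print(phase(frags, 4))  # -> [0, 1, 1, 0] or its complement [1, 0, 0, 1]

Because each fragment's likelihood sums over both homologs of origin, the objective is invariant to complementing the whole haplotype, which is why either of the two equivalent answers can be returned.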

Comparison of runtimes on simulated data
We used simulations to compare the runtime of HapCUT2 with existing methods for haplotype assembly across different types of sequence data sets. A fair comparison of the performance of different methods is not completely straightforward: different methods chose to optimize different technology parameters and highlighted performance using those parameters. We considered the following parameters: the number of variants per read (V), the coverage per variant (d), and the number of paired-end reads spanning a variant (d′). The parameter V is a natural outcome of read length; for example, PacBio provides higher values of V compared to Illumina sequencing. The parameter d is similar to read coverage but only considers haplotype-informative reads; higher values result in better accuracy but also increased running time. Finally, many sequencing technologies (such as Hi-C) generate paired-end sequencing with long inserts, and d′ can potentially be much greater than d. Some haplotype assembly methods implicitly analyze all paired-end reads spanning a specific position, and their runtime depends upon d′ rather than d.
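To make the distinction between these parameters concrete, the snippet below computes mean V together with d and d′ at a single site for a toy set of reads; the read representation (covered variant indices plus the variant span of the whole insert) is our own minimal stand-in, not a HapCUT2 input format.

    from collections import Counter

    # each read: (variant indices with an observed allele, span of the whole
    # insert as (first, last) variant index). For contiguous single-end reads
    # the two coincide and d' == d; pairs with long inserts push d' above d.
    reads = [
        ([10, 11], (10, 11)),     # ordinary short read
        ([10, 200], (10, 200)),   # Hi-C-like read pair with a long insert
        ([12, 13], (12, 13)),     # ordinary short read
    ]

    V = sum(len(vs) for vs, _ in reads) / len(reads)   # mean variants per read

    depth = Counter(v for vs, _ in reads for v in vs)  # d: reads covering a variant
    crossing = Counter()                               # d': reads whose insert crosses it
    for _, (lo, hi) in reads:
        for v in range(lo, hi + 1):
            crossing[v] += 1

    print(f"mean V = {V:.2f}")                                        # 2.00
    print(f"d at site 12 = {depth[12]}, d' at site 12 = {crossing[12]}")  # 1 vs 2

Here the Hi-C-like pair never reports an allele at site 12 (so it does not add to d) but its insert crosses the site (so it adds to d′), which is exactly why methods whose work scales with d′ struggle on proximity ligation data.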
In order to make a fair comparison of runtimes and allow users to determine the most efficient method for any technology, we summarized the computational complexity of each method as a function of these parameters (Table 1) and used simulations to verify the dependence of runtime and accuracy on each parameter (Fig. 1). We simulated reads using a single chromosome of length 250 Mb (approximately equal to the length of human Chromosome 1) with a heterozygous variant density of 0.08% and a uniform rate of sequencing errors (2%), performing 10 replicates for each simulation. Standard deviations of runtimes and error rates between replicates were small (Supplemental Fig. S1). A method was cut off if it exceeded 10 CPU-h of runtime or 8 GB of memory on a single CPU, since most methods required significantly fewer resources than these limits. We note that the runtimes in Table 1 refer to complexity as implemented, with parameters referring to maximum values (e.g., maximum coverage per variant), while in the simulations the parameters refer to mean values (e.g., mean coverage per variant).
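The simulation setup just described is easy to reproduce in miniature. Below is a scaled-down generator in the same toy (site, allele) fragment encoding as the phasing sketch above; the fixed read length in variants (rather than a length distribution) and the reduced site count are our simplifications.

    import random

    random.seed(1)
    N_SITES = 2000   # het variants; 0.08% density over 250 Mb would give
                     # ~200,000, scaled down here so the toy runs instantly
    ERR = 0.02       # uniform sequencing error rate, as in the text
    V_MEAN = 4       # variants per read (fixed here; mean of four in the text)
    D_MEAN = 10      # target coverage per variant

    truth = [random.randint(0, 1) for _ in range(N_SITES)]
    n_reads = N_SITES * D_MEAN // V_MEAN

    def simulate_read():
        start = random.randrange(N_SITES - V_MEAN)
        from_h0 = random.random() < 0.5       # which homolog the read samples
        frag = []
        for s in range(start, start + V_MEAN):
            allele = truth[s] if from_h0 else 1 - truth[s]
            if random.random() < ERR:         # uniform miscall
                allele = 1 - allele
            frag.append((s, allele))
        return frag

    frags = [simulate_read() for _ in range(n_reads)]
    print(len(frags), "fragments over", N_SITES, "sites")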
Table 1. Comparison of the approach, time complexity, and applicability of five algorithms for haplotype assembly: HapCUT2, HapCUT, RefHap, ProbHap, and FastHare

Method   | Approach                                                 | Complexity                     | Long reads              | Hi-C support | Variant pruning
HapCUT2  | Likelihood optimization using graph-cuts                 | O(c1 c2 (N log N + N d V^2))   | Scalable                | Yes          | Likelihood
HapCUT   | MEC optimization using graph-cuts                        | O(c1 c2 (N log N + N d V^2))   | High memory requirement | Yes          | No
RefHap   | Max-cut on read graph                                    | O(c3 (R^2 V d′ + R V^2 d′^2))  | Low-to-medium coverage  | No           | Discrete
ProbHap  | Exact likelihood using dynamic prog. + merging heuristic | O(N d′ 2^d′)                   | Low-coverage            | No           | Confidence scores
FastHare | Read partitioning optimization                           | O(R V d′)                      | Yes                     | No           | Discrete

(R) Number of reads (all algorithms process reads for each haplotype block separately); (N) total number of variants; (V) maximum number of variants in a read; (d) maximum read depth per site; (d′) maximum number of reads crossing a site (equivalent to d except with paired-end inserts being included as part of the read); (c1), (c2), (c3) method-specific variables that are either fixed in advance or selected by the user. Reads are assumed to be sorted by starting position.
Figure 1. Comparison of runtime (top panel) and switch + mismatch error rate (bottom panel) for HapCUT2 with four methods for haplotype assembly (HapCUT, RefHap, ProbHap, and FastHare) on simulated read data as a function of (A) mean coverage per variant (variants per read fixed at four); (B) mean variants per read (mean coverage per variant fixed at five); and (C) mean number of paired-end reads crossing a variant (mean coverage per variant fixed at five, read length 150 bp, random insert size up to a variable maximum value). Lines represent the mean of 10 replicate simulations. FastHare is not visible in C (bottom) due to significantly higher error rates.
To assess the dependence of runtime on d, we generated reads with a mean of four variants per read (V) and varied the mean read coverage per variant (d) from five to 100. The error rates of HapCUT2, HapCUT, ProbHap, and RefHap were similar and decreased with increasing coverage before reaching saturation. FastHare was significantly faster than the other methods but had error rates that were several times greater. As predicted by the computational complexity of the different methods (Table 1), HapCUT2 is significantly faster than HapCUT, RefHap, and ProbHap once the coverage exceeds 10× (Fig. 1A). For example, RefHap required 10 CPU-h to phase reads at a coverage of 38×, while HapCUT2 took only 34 CPU-min to phase reads with 100× coverage per variant. ProbHap reached the 10-CPU-h limit at a coverage of only 8×. HapCUT shows a similar trend to HapCUT2 but is significantly slower and requires more than 8 GB of memory at coverages of 40× or greater. RefHap constructs a graph with the sequence reads as nodes and performs a max-cut operation that scales quadratically with the number of reads; therefore, its runtime is expected to increase as the square of read coverage. ProbHap's runtime is exponential in the maximum read depth (Kuleshov 2014) and exceeds the maximum allotted time for modest values of d. FastHare greedily builds a maximally consistent haplotype from left to right in a single pass, resulting in a low runtime but also lower accuracy. While HapCUT2 has the same asymptotic behavior as HapCUT, it improves upon the memory usage and runtime significantly in practice. It does this by adding only edges that link adjacent variants on each read to the read-haplotype graph, as well as by using convergence heuristics that reduce the number of iterations performed (see Methods for details).
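The edge-construction trick mentioned above is simple to state in code: linking only consecutive variants on each read contributes V - 1 edges per read, versus the V(V - 1)/2 edges of a full per-read clique. A minimal sketch follows (the fragment encoding is again our own illustrative choice):

    def graph_edges(frags, dense=False):
        """Edges of the read-haplotype graph over variant sites.
        dense=True links every pair of variants on a read (quadratic in V);
        dense=False links only adjacent variants (linear in V, as HapCUT2 does)."""
        edges = set()
        for frag in frags:
            sites = sorted(site for site, _allele in frag)
            if dense:
                edges.update((a, b) for i, a in enumerate(sites) for b in sites[i + 1:])
            else:
                edges.update(zip(sites, sites[1:]))
        return edges

    one_read = [[(s, 0) for s in range(8)]]         # a single read covering 8 variants
    print(len(graph_edges(one_read, dense=True)))   # 28 edges (clique)
    print(len(graph_edges(one_read, dense=False)))  # 7 edges (path)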
Next, we varied the number of variants per read (V) and kept the coverage per variant (d) fixed at 5×. The error rates for each method decrease monotonically (Fig. 1B). HapCUT2, RefHap, and ProbHap have similarly low error rates, while FastHare and HapCUT have error rates higher than the other methods. The runtimes of RefHap and FastHare are consistently very low, although the runtime of RefHap peaks very slightly around V = 15. The runtime of ProbHap decreases monotonically as V increases. This is consistent with the fact that the runtime of these methods has a linear dependence on the read length, because for a fixed sequence coverage, the number of reads decreases as the read length increases. In comparison, HapCUT2's runtime is observed to increase linearly with V. This is consistent with the complexity of HapCUT2 being proportional to the square of the number of variants per read (see Table 1). Although HapCUT2's runtime increases, it remains practical across all tested values and is <50 CPU-min for mean read lengths consistent with very long sequences (160 variants per read, or 200 kb). The space requirements for HapCUT have a quadratic dependence on the number of variants per read and therefore exceeded the memory limit after only eight variants per read.
Finally, we compared runtimes as a function of the average number of paired-end reads crossing a variant (d′). For single-end reads, this parameter is identical to d. Proximity ligation data, on the other hand, consist of pairs of short reads, each pair with a single large gap (insert) between the two ends. The large and highly variable insert sizes result in a large number of reads crossing each variant position. This property is important for linking distant variants, because the extremely long insert sizes of proximity ligation methods are capable of spanning long variant-free regions. For this reason, we simulated paired-end short-read data with random insert sizes up to a parametrized maximum value, to represent a generalized proximity ligation experiment. We varied d′ by increasing the maximum insert size from 6.25 kb (5 single-nucleotide variants [SNVs]) to 125 kb (100 SNVs) while keeping d and V constant at 5× and 0.1195 SNVs (150 base pairs [bp]), respectively. ProbHap and RefHap exceeded the time limit at d′ = 10 and d′ = 17, respectively. FastHare exceeded the time limit at d′ = 36 but had extremely high error rates (10-18× higher than HapCUT2). ProbHap's dynamic programming algorithm needs to consider the haplotype of origin for each read crossing a variant; therefore, its complexity scales exponentially in d′. In the case of RefHap and FastHare, the failure to scale with increasing d′ appears to be a result of representing fragments as contiguous arrays with length equal to the number of variants spanned by each read. Thus, as implemented, the runtimes for RefHap and FastHare scale with d′ rather than d. In contrast, both HapCUT and HapCUT2 were able to phase data with arbitrarily long insert lengths, reaching d′ = 100 (Fig. 1C). The runtime of HapCUT2 was independent of d′ and 10× faster than that of HapCUT.
Overall, the results on simulated data demonstrate that the complexity of HapCUT2 is linear in the number of reads and quadratic in the number of variants per read. HapCUT2 is fast in practice and effective for both long reads and paired-end reads with long insert lengths, with scalability unmatched by the four other tools we evaluated. Additionally, HapCUT2 and HapCUT were the only tools tested that can reasonably phase paired-end data with the long insert lengths that result from proximity ligation (Hi-C) sequencing.
Comparison of methods on diverse WGS data sets for a single individual
We next assessed the accuracy of HapCUT2 using data from four different sequencing data types for a single individual (NA12878): fosmid-based dilution pool sequencing, 10X Genomics linked-read sequencing, single molecule real-time (SMRT) sequencing, and proximity ligation sequencing. Haplotype assembly methods require a set of heterozygous variants as input; therefore, a set of heterozygous variants for NA12878 identified from Illumina WGS data was used as input to assemble haplotypes for each data type (see Methods for description). The accuracy of the haplotypes was assessed by comparing the assembled haplotypes to gold-standard trio-phased haplotypes, using the switch error rate and mismatch error rate metrics (see Methods).
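For readers who want the two metrics concretely, the function below counts switch and mismatch errors over one phase block under a common formulation (an isolated flipped variant counts as a mismatch; a phase change that persists counts as a switch); the paper's Methods section defines the exact variants of these metrics used here.

    def phase_errors(assembled, truth):
        """Switch and mismatch (point flip) errors between two 0/1 haplotype
        vectors over the same het sites within one phase block. Illustrative,
        not the paper's exact definition."""
        rel = [a ^ t for a, t in zip(assembled, truth)]  # relative phase per site
        points = [i for i in range(1, len(rel)) if rel[i] != rel[i - 1]]
        switches = mismatches = 0
        i = 0
        while i < len(points):
            if i + 1 < len(points) and points[i + 1] == points[i] + 1:
                mismatches += 1   # phase flips and immediately flips back
                i += 2
            else:
                switches += 1     # phase flips and stays flipped
                i += 1
        return switches, mismatches

    # one flipped site in the middle -> one mismatch, no switches
    print(phase_errors([0, 0, 1, 0, 0], [0, 0, 0, 0, 0]))  # (0, 1)
    # phase flips once and persists -> one switch, no mismatches
    print(phase_errors([0, 0, 1, 1, 1], [0, 0, 0, 0, 0]))  # (1, 0)

Because only the relative phase vector is inspected, the counts are unchanged if the assembled haplotype is globally complemented, matching the fact that a haplotype pair has no preferred labeling.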
Fosmid-based dilution pool data

To assess HapCUT2 on long-read sequencing data, we used whole-genome fosmid-based dilution pool sequence data for a human individual, NA12878 (Duitama et al. 2012). This data was generated from 1.44 million fosmids (33-38 kb and 38-45 kb in length) that were partitioned into 32 pools such that each pool contains DNA from a small fraction of the genome (∼5%). Subsequently, each pool was sequenced using the ABI SOLiD sequencer, and haplotype fragments were identified using read-depth analysis (Duitama et al. 2012). Although this data set has low sequence coverage (d ∼ 3×), the processed fragment data (needed as input for haplotype assembly) is publicly available and has been used to assess the performance of haplotype assembly methods in several papers (Duitama et al. 2012; Kuleshov 2014). On this data, the switch error and mismatch error rates for HapCUT2 were virtually identical to or slightly better than those of ProbHap, the second best performing method, across all chromosomes (Supplemental Fig. S2). However, ProbHap pruned 1.2% of the variants from the assembled haplotypes, in comparison to HapCUT2, which only pruned 0.6% of the variants. The switch error rates for RefHap and FastHare were also similar to those of HapCUT2 and ProbHap (Supplemental Fig. S2). To enable a head-to-head comparison of the switch error rate across different methods, we also calculated the switch and mismatch error rates on the subset of variants that were phased by all tools (not pruned). On this subset of variants, the switch and mismatch error rates for HapCUT2 were similar to but slightly lower than those of ProbHap (Fig. 2A). In terms of running time, RefHap and FastHare were the fastest methods on this data set, while HapCUT2 took a total of 1:09 CPU-h to phase all chromosomes (Table 2). In summary, HapCUT2 had accuracy similar to (but slightly better than) ProbHap, RefHap, and FastHare on this data set and was more accurate than HapCUT.
10X Genomics linked-read data

We also used HapCUT2 to assemble haplotypes from 10X Genomics linked-read data (Zheng et al. 2016), which is based on an idea similar to the fosmid-based dilution pool approach. The 10X Genomics technology labels short DNA fragments originating from a single long DNA fragment with barcodes inside hundreds of thousands of separate nano-scale droplets (Zheng et al. 2016). The linked reads produced can be extremely long (>100 kb). This data set has a short-read coverage of 34×, with a linked-read coverage per variant of 12× (Zook et al. 2016). For haplotype assembly, we used the same set of variant calls as for the fosmid data set and extracted haplotype fragments from the 10X aligned reads (see Methods, "Long read data sets"). On this data set, neither RefHap nor ProbHap finished haplotype assembly within the time limit. HapCUT2 was the fastest method and analyzed all chromosomes in 1:55 CPU-h (Table 2). When compared on the subset of variants that were phased by all tools, HapCUT2 had an accuracy slightly better than the next best approach (HapCUT), which took 16:50 CPU-h (Fig. 2C).
PacBio SMRT data

SMRT sequencing on the Pacific Biosciences platform generates long (2-20 kb) but error-prone (>10% indel error rate) reads. We used HapCUT2 to assemble haplotypes from 44× coverage PacBio reads (Pendleton et al. 2015). We extracted haplotype fragments from the PacBio reads that were aligned to the human reference genome (hg19), using the same set of variant calls as for the previous two data sets. On the full data set, HapCUT2 was not only the most accurate but was also significantly faster than RefHap and HapCUT (see Supplemental Fig. S3 for detailed comparisons of error rates and runtimes). We calculated the switch error and mismatch error rates on the subset of variants that were phased by all methods. HapCUT2 had a 12.4% lower switch error rate and a 2% lower mismatch rate than RefHap. RefHap took 215:53 CPU-h to phase the data set; by comparison, HapCUT2 took only 4:05 CPU-h in total. Because ProbHap was unable to complete within the time limit on the full data set, we also compared the performance of the haplotype assembly tools on a lower, 11× coverage subsample of this data set. On the subsample, HapCUT2 had the lowest switch error and mismatch error rates of the five methods (Fig. 2B). FastHare was the fastest method on this data set, and ProbHap was the slowest, taking 52:32 CPU-h (Table 2).

HapCUT2 implements likelihood-based strategies for pruning low-confidence variants to reduce mismatch errors and for splitting blocks at poor linkages to reduce switch errors (see Methods). These post-processing steps allow a user to improve the accuracy of the haplotypes at the cost of reduced completeness and contiguity. ProbHap's transition, posterior, and emission confidence scores are designed for the same purpose (Kuleshov 2014). Post-processing strategies are of particular interest for haplotype assembly with PacBio SMRT reads because the individual reads
Figure 2. Accuracy of HapCUT2 compared to four other methods for haplotype assembly on diverse whole-genome sequence data sets for NA12878. (A) Fosmid dilution pool data (Duitama et al. 2012). (B) PacBio SMRT data (11× and 44× coverage). (C) 10X Genomics linked reads. (D) Whole-genome Hi-C data (40× and 90× coverage, created with MboI enzyme). Switch and mismatch error rates were calculated across all chromosomes using the subset of variants that were phased by all methods. For each data set, only methods that produced results within 20 CPU-h per chromosome are shown.


References

Li H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997.

Rao SSP, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, et al. 2014. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159: 1665-1680.

Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, et al. 2007. The diploid genome sequence of an individual human. PLoS Biol 5: e254.

Delaneau O, Zagury J-F, Marchini J. 2013. Improved whole-chromosome phasing for disease and population genetic studies. Nat Methods 10: 5-6.

Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP. 2014. Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet 15: 121-132.