HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies

Peter Edge,1 Vineet Bafna,1 and Vikas Bansal2

1Department of Computer Science & Engineering, University of California, San Diego, La Jolla, California 92053, USA; 2Department of Pediatrics, School of Medicine, University of California, San Diego, La Jolla, California 92053, USA
Many tools have been developed for haplotype assembly—the reconstruction of individual haplotypes using reads mapped
to a reference genome sequence. Due to increasing interest in obtaining haplotype-resolved human genomes, a range of new
sequencing protocols and technologies have been developed to enable the reconstruction of whole-genome haplotypes.
However, existing computational methods designed to handle specific technologies do not scale well on data from different
protocols. We describe a new algorithm, HapCUT2, that extends our previous method (HapCUT) to handle multiple se-
quencing technologies. Using simulations and whole-genome sequencing (WGS) data from multiple different data types
—dilution pool sequencing, linked-read sequencing, single molecule real-time (SMRT) sequencing, and proximity ligation
(Hi-C) sequencing—we show that HapCUT2 rapidly assembles haplotypes with best-in-class accuracy for all data types. In
particular, HapCUT2 scales well for high sequencing coverage and rapidly assembled haplotypes for two long-read WGS
data sets on which other methods struggled. Further, HapCUT2 directly models Hi-C specific error modalities, resulting
in significant improvements in error rates compared to HapCUT, the only other method that could assemble haplotypes
from Hi-C data. Using HapCUT2, haplotype assembly from a 90× coverage whole-genome Hi-C data set yielded high-res-
olution haplotypes (78.6% of variants phased in a single block) with high pairwise phasing accuracy (∼98% across chromo-
somes). Our results demonstrate that HapCUT2 is a robust tool for haplotype assembly applicable to data from diverse
sequencing technologies.
[Supplemental material is available for this article.]
Humans are diploid organisms with two copies of each chromo-
some (except the sex chromosomes). The two haplotypes (described
by the combination of alleles at variant sites on a single chromo-
some) represent the complete information on DNA variation in
an individual. Reconstructing individual haplotypes has impor-
tant implications for understanding human genetic variation, in-
terpretation of variants in disease, and reconstructing human
population history (Tewhey et al. 2011; Glusman et al. 2014;
Schiffels and Durbin 2014; Snyder et al. 2015). A number of meth-
ods, computational and experimental, have been developed for
haplotyping human genomes. Statistical methods for haplotype
phasing using population genotype data have proven successful
for phasing common variants and for genotype imputation but
are limited in their ability to phase rare variants and phase long
stretches of the genome that cross recombination hot-spots
(Browning and Browning 2011; Tewhey et al. 2011).
Haplotypes for an individual genome at known heterozygous
variants can be directly reconstructed from reference-aligned se-
quence reads derived from whole-genome sequencing (WGS).
Sequence reads that are long enough to cover multiple heterozy-
gous variants provide partial haplotype information. Using over-
laps between such haplotype-informative reads, long haplotypes
can be assembled. This haplotype assembly approach does not rely
on information from other individuals (such as parents) and can
phase even individual-specific variants. Levy et al. (2007) demon-
strated the feasibility of this approach using sequence data derived
from paired Sanger sequencing of long insert DNA fragment librar-
ies to computationally assemble long haplotype blocks (N50 of
350 kb) for the first individual human genome.
Since then, advancements in massively parallel sequencing
technologies have reduced the cost of human WGS drastically,
leading to the sequencing of thousands of human genomes.
However, the short read lengths generated by technologies such
as Illumina (100–250 bases) and the use of short fragment lengths
in WGS protocols make it infeasible to link distant variants into
haplotypes. To overcome this limitation, a number of innovative
methods that attempt to preserve haplotype information from
long DNA fragments (tens to hundreds of kilobases) in short se-
quence reads have been developed.
The underlying principle for these methods involves generat-
ing multiple pools of high-molecular-weight DNA fragments such
that each pool contains only a small fraction of the DNA from a
single genome. As a result, there are very few overlapping DNA
fragments in each pool, and high-throughput sequencing of the
DNA in each pool can be used to reconstruct the fragments by
alignment to a reference genome (Kitzman et al. 2011; Suk et al.
2011). Therefore, each pool provides haplotype information
from long DNA fragments, and long haplotypes can be assembled
using information from a sufficiently large number of indepen-
dent pools (Snyder et al. 2015). A number of methods based on
this approach have been developed to phase human genomes
(Kitzman et al. 2011; Suk et al. 2011; Peters et al. 2012; Kaper
et al. 2013; Amini et al. 2014). Recently, 10X Genomics described
Corresponding author: vibansal@ucsd.edu
Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.213462.116.
© 2017 Edge et al. This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
Genome Research 27:801–812. Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/17; www.genome.org
a novel microfluidics-based library preparation approach that
generates long linked reads that can be assembled into long haplo-
types (Zheng et al. 2016). Third-generation sequencing technolo-
gies such as Pacific Biosciences (PacBio) generate long sequence
reads (2–20 kb in length) that can directly enable genome-wide
haplotyping. Pendleton and colleagues demonstrated the feasibil-
ity of assembling haplotypes from SMRT reads using variants iden-
tified from short read Illumina sequencing (Pendleton et al. 2015).
Haplotype assembly is also feasible with paired-end sequenc-
ing, i.e., pairs of short reads derived from the ends of long DNA
fragments, but requires long and variable insert lengths to assem-
ble long haplotypes (Tewhey et al. 2011). Selvaraj et al. (2013) used
sequence data from a proximity ligation method (Hi-C) to assem-
ble accurate haplotypes for mouse and human genomes. Using
mouse data, they demonstrated that the vast majority of intrachro-
mosomal Hi-C read pairs correspond to ‘cis’ interactions (between
fragments on the same chromosome) and therefore contain haplo-
type information equivalent to paired-end reads with long and
variable insert lengths. Subsequently, 17× whole-genome Hi-C
data was used to assemble chromosome-spanning haplotypes for
a human genome, albeit with low resolution (<22% of variants
phased).
In summary, multiple different sequencing technologies and
protocols have the capability to generate sequence reads with hap-
lotype information but require computational tools to assemble
the reads into long haplotypes. A number of combinatorial algo-
rithms have been developed for haplotype assembly (Bansal and
Bafna 2008; Duitama et al. 2010; He et al. 2010; Aguiar and
Istrail 2012). Among these, HapCUT (Bansal and Bafna 2008)
was developed for phasing Sanger WGS data for the first individual
genome (Levy et al. 2007). HapCUT utilizes max-cuts in read-hap-
lotype graphs, an approach that is equally adept at handling data
with local haplotype information and data with long-range haplo-
type information such as that from long insert paired-end reads. As
a result, it has been successfully utilized to assemble haplotypes
from different types of high-throughput sequence data sets, in-
cluding fosmid pool sequencing (Kitzman et al. 2011), Hi-C data
(Selvaraj et al. 2013), and single molecule long reads (Pendleton
et al. 2015) with appropriate modifications. However, HapCUT
only models simple sequencing errors and does not scale well for
long read data. More recently, several algorithms have been de-
signed specifically to enable accurate haplotype assembly from
long reads (Duitama et al. 2010; Kuleshov 2014).
The diverse characteristics and specific error modalities of
data generated by different haplotype-enabling protocols and
technologies continue to pose challenges for haplotype assembly
algorithms. Some protocols, such as clone-based sequencing, can
generate very long fragments (BAC clones of length 140 kb have
been used to assemble haplotypes [Lo et al. 2013]) but may have
low fragment coverage. Other protocols, such as PacBio SMRT,
generate fragments with shorter mean lengths than clone-based
approaches but can be scaled to higher read coverage more easily.
10X Genomics linked reads are long (longest molecules > 100 kb)
but have gaps resulting in high clone coverage for each variant.
Proximity ligation approaches, such as Hi-C, generate paired-end
read data with very short read lengths but with a larger genomic
span. Hi-C reads can span from a few kilobases to tens of mega-
bases in physical distance. While an algorithm that leverages char-
acteristics of a specific type of data is likely to perform well on that
particular type of data, it may not perform well or not work at all
on other types of data. For example, dynamic programming algo-
rithms such as ProbHap (Kuleshov 2014) that were developed for
low-depth long read sequence data are unlikely to scale well for
data sets with high sequence coverage or for other types of data
such as Hi-C. Even if a haplotype assembly algorithm has broad
support for data qualities, there remains the challenge that differ-
ent sequencing protocols each have systematic error modalities.
For instance, fragment data from the sequencing of multiple hap-
loid subsets of a human genome (Kitzman et al. 2011; Suk et al.
2011) generate long haplotype fragments, but some of these frag-
ments are chimeric due to overlapping DNA molecules that origi-
nate from different chromosomes. Similarly, noise in Hi-C data
due to ligated fragments from opposite homologous chromosomes
increases with increasing distance between the variants. The accu-
racy of haplotypes assembled from each sequencing protocol de-
pends on both the haplotype assembly algorithm’s ability to
effectively utilize the sequence data and its ability to model proto-
col-specific errors.
Results
To address the challenge of haplotype assembly for diverse types of
sequence data sets, we developed HapCUT2, an algorithm that
generalizes the HapCUT approach in several ways. Compared to
a discrete score optimized by HapCUT, HapCUT2 uses a likeli-
hood-based model, which allows for the modeling and estimation
of technology-specific errors such as ‘h-trans errors’ in Hi-C data.
To improve memory performance for long read data, HapCUT2
does not explicitly construct the complete read-haplotype graph.
Further, it implements a number of optimizations to enable fast
runtimes on diverse types of sequence data sets. To demonstrate
the accuracy and robustness of HapCUT2, we compared its perfor-
mance with existing methods for haplotype assembly using simu-
lated and real WGS data sets. Previous publications (Duitama et al.
2012; Kuleshov 2014) have compared different methods for haplo-
type assembly and concluded that RefHap (Duitama et al. 2010),
ProbHap (Kuleshov 2014), FastHare (Panconesi and Sozio 2004),
and HapCUT (Bansal and Bafna 2008) are among the best perform-
ing methods. Other methods such as DGS (Levy et al. 2007),
MixSIH (Matsumoto and Kiryu 2013), and HapTree (Berger et al.
2014) did not perform as well on the data sets evaluated in this
study. Therefore, we compared the performance of HapCUT2
with four other methods: RefHap, ProbHap, FastHare, and
HapCUT (Table 1).
Overview of HapCUT2 algorithm
The input to HapCUT2 consists of haplotype fragments (sequence
of alleles at heterozygous variant sites identified from aligned se-
quence reads) and a list of heterozygous variants (identified from
WGS data). HapCUT2 aims to assemble a pair of haplotypes that
are maximally consistent with the input set of haplotype frag-
ments. This consistency is measured using a likelihood function
that captures sequencing errors and technology-specific errors
such as h-trans errors in proximity ligation data. HapCUT2 is an it-
erative procedure that starts with a candidate haplotype pair.
Given the current pair of haplotypes, HapCUT2 searches for a sub-
set of variants (using max-cut computations in the read-haplotype
graph) such that changing the phase of these variants relative to
the remaining set of variants results in a new pair of haplotypes
with greater likelihood. This procedure is repeated iteratively until
no further improvements can be made to the likelihood (see
Methods for details).
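The iterative procedure can be illustrated with a simplified sketch. This is not the HapCUT2 implementation: it scores candidate haplotypes with a basic two-haplotype mixture likelihood under a uniform error rate `eps`, and it replaces the max-cut search with single-variant phase flips. All names and the fragment format (lists of (variant index, allele) pairs) are illustrative assumptions.

```python
import math

def fragment_loglik(frag, hap, eps=0.02):
    """Log-likelihood of one fragment given a haplotype pair.
    A read may originate from either haplotype, so we sum over both."""
    ll0 = ll1 = 0.0
    for (i, allele) in frag:
        # P(observed allele | true allele) with uniform error rate eps
        ll0 += math.log(1 - eps) if allele == hap[i] else math.log(eps)
        ll1 += math.log(1 - eps) if allele != hap[i] else math.log(eps)
    # equal-prior mixture over the two haplotype copies (log-sum-exp)
    m = max(ll0, ll1)
    return m + math.log(0.5 * math.exp(ll0 - m) + 0.5 * math.exp(ll1 - m))

def total_loglik(frags, hap):
    return sum(fragment_loglik(f, hap) for f in frags)

def assemble(frags, n_vars, max_iter=100):
    hap = [0] * n_vars                 # candidate haplotype; other copy is its complement
    best = total_loglik(frags, hap)
    for _ in range(max_iter):
        improved = False
        for i in range(n_vars):        # simplified: flip one variant's phase at a time
            hap[i] ^= 1
            ll = total_loglik(frags, hap)
            if ll > best:
                best, improved = ll, True
            else:
                hap[i] ^= 1            # revert if the likelihood did not improve
        if not improved:               # stop when no flip increases the likelihood
            break
    return hap
```

In the actual algorithm, the subset of variants to flip is found jointly via max-cut computations on the read-haplotype graph rather than one variant at a time, which helps avoid the local optima a purely greedy search can fall into.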
Comparison of runtimes on simulated data
We used simulations to compare the runtime of HapCUT2 with ex-
isting methods for haplotype assembly across different types of se-
quence data sets. A fair comparison of the performance of different
methods is not completely straightforward. Different methods
chose to optimize different technology parameters and highlight-
ed performance using those parameters. We considered the follow-
ing parameters: number of variants per read (V), coverage per
variant (d), and the number of paired-end reads spanning a variant
(d′). The parameter V is a natural outcome of read length; for example,
PacBio provides higher values of V compared to Illumina sequencing.
The parameter d is similar to read coverage but only
considers haplotype informative reads—higher values result in
better accuracy but also increased running time. Finally, many
sequencing technologies (such as Hi-C) generate paired-end
sequencing with long inserts, and d′ can potentially be much
greater than d. Some haplotype assembly methods implicitly analyze
all paired-end reads spanning a specific position, and their
runtime depends upon d′ rather than d.
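The distinction between d and d′ can be made concrete with a short sketch. The read representation here (each read as a list of aligned segment intervals, with the gap between paired-end segments acting as the insert) is a hypothetical simplification, not a format used by any of the tools discussed.

```python
def coverage_stats(reads, variant_positions):
    """Count, for each variant, reads that cover it with an aligned base (d)
    versus reads whose full span, including any paired-end insert, crosses it (d')."""
    d = {v: 0 for v in variant_positions}
    dprime = {v: 0 for v in variant_positions}
    for segments in reads:  # each read: list of (start, end) aligned segments
        span = (min(s for s, _ in segments), max(e for _, e in segments))
        for v in variant_positions:
            if span[0] <= v < span[1]:
                dprime[v] += 1                       # read's span crosses the variant
                if any(s <= v < e for s, e in segments):
                    d[v] += 1                        # an aligned segment covers it
    return d, dprime
```

For single-end reads each read has one segment, so the span equals the segment and d′ = d; for Hi-C-like read pairs with large inserts, d′ can be far larger than d.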
In order to make a fair comparison of runtimes and allow us-
ers to determine the most efficient method for any technology, we
summarized the computational complexity of each method as a
function of these parameters (Table 1) and used simulations to
verify the dependence of runtime and accuracy on each parameter
(Fig. 1). We simulated reads using a single chromosome of
length ∼250 Mb (approximately equal to the length of human
Chromosome 1) with a heterozygous variant density of 0.08%
and a uniform rate of sequencing errors (2%), performing 10 repli-
cates for each simulation. Standard deviations of runtimes and er-
ror rates between replicates were small (Supplemental Fig. S1). A
method was cut off if it exceeded 10 CPU-h of runtime or 8 GB
of memory on a single CPU, since most methods required significantly
fewer resources than these limits. We note that the runtimes
in Table 1 refer to complexity as implemented, with parameters re-
ferring to maximum values (e.g., maximum coverage per variant),
while in simulations, the parameters refer to mean values (e.g.,
mean coverage per variant).
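A simulation in this spirit (a uniform error rate, fragments drawn from one of two haplotypes) can be sketched as follows; the parameter defaults and fragment format are assumptions for illustration, not the simulation code used in this study.

```python
import random

def simulate_fragments(n_vars=1000, mean_v=4, mean_d=5, err=0.02, seed=1):
    """Simulate haplotype fragments: each read samples `mean_v` consecutive
    variants from one of the two haplotypes, flipping alleles at rate `err`."""
    rng = random.Random(seed)
    truth = [rng.randint(0, 1) for _ in range(n_vars)]  # one haplotype copy
    n_reads = (n_vars * mean_d) // mean_v               # reads needed for target coverage
    frags = []
    for _ in range(n_reads):
        start = rng.randrange(n_vars - mean_v + 1)
        side = rng.randint(0, 1)                        # which haplotype the read comes from
        frag = []
        for i in range(start, start + mean_v):
            allele = truth[i] ^ side                    # allele on the sampled haplotype
            if rng.random() < err:
                allele ^= 1                             # sequencing error
            frag.append((i, allele))
        frags.append(frag)
    return truth, frags
```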
To assess the dependence of runtime on d, we generated reads
with a mean of four variants per read (V) and varied the mean read
Table 1. Comparison of the approach, time complexity, and applicability of five algorithms for haplotype assembly: HapCUT2, HapCUT, RefHap, ProbHap, and FastHare

Method    Approach                                                  Complexity                   Long reads               Hi-C support   Variant pruning
HapCUT2   Likelihood optimization using graph-cuts                  O(c1c2(N log(N) + NdV^2))    Scalable                 Yes            Likelihood
HapCUT    MEC optimization using graph-cuts                         O(c1c2(N log(N) + NdV^2))    High memory requirement  Yes            No
RefHap    Max-cut on read graph                                     O(c3(R^2Vd′ + RV^2d′^2))     Low-to-medium coverage   No             Discrete
ProbHap   Exact likelihood using dynamic prog. + merging heuristic  O(Nd′2^d′)                   Low-coverage             No             Confidence scores
FastHare  Read partitioning optimization                            O(RVd′)                      Yes                      No             Discrete

(R) Number of reads (all algorithms process reads for each haplotype block separately); (N) total number of variants; (V) maximum number of variants in a read; (d) maximum read depth per site; (d′) maximum number of reads crossing a site (equivalent to d except with paired-end inserts being included as part of the read); (c1), (c2), (c3) method-specific variables that are either fixed in advance or selected by the user. Reads are assumed to be sorted by starting position.
Figure 1. Comparison of runtime (top panel) and switch + mismatch error rate (bottom panel) for HapCUT2 with four methods for haplotype assembly
(HapCUT, RefHap, ProbHap, and FastHare) on simulated read data as a function of (A) mean coverage per variant (variants per read fixed at four); (B) mean
variants per read (mean coverage per variant fixed at five); and (C) mean number of paired-end reads crossing a variant (mean coverage per variant fixed at
five, read length 150 bp, random insert size up to a variable maximum value). Lines represent the mean of 10 replicate simulations. FastHare is not visible on
C (bottom) due to significantly higher error rates.
coverage per variant (d) from five to 100. The error rates of
HapCUT2, HapCUT, ProbHap, and RefHap were similar and de-
creased with increasing coverage before reaching saturation.
FastHare was significantly faster than other methods but had error
rates that were several times greater. As predicted by the computa-
tional complexity of the different methods (Table 1), HapCUT2 is
significantly faster than HapCUT, RefHap, and ProbHap, once the
coverage exceeds 10× (Fig. 1A). For example, RefHap required 10
CPU-h to phase reads at a coverage of 38×, while HapCUT2 took
only 34 CPU-min to phase reads with 100× coverage per variant.
ProbHap reached the 10-CPU-h limit at a coverage of only 8×.
HapCUT shows a similar trend to HapCUT2 but is significantly
slower and requires more than 8 GB of memory at coverages of
40× or greater. RefHap constructs a graph with the sequence reads
as nodes and performs a max-cut operation that scales quadratical-
ly with number of reads. Therefore, its runtime is expected to in-
crease as the square of read-coverage. ProbHap’s runtime is
exponential in the maximum read-depth (Kuleshov 2014) and ex-
ceeds the maximum allotted time for modest values of d. FastHare
greedily builds a maximally consistent haplotype from left to right
in a single pass, resulting in a low runtime but also lower accuracy.
While HapCUT2 has the same asymptotic behavior as HapCUT, it
improves upon the memory usage and runtime significantly in
practice. It does this by only adding edges that link adjacent vari-
ants on each read to the read-haplotype graph, as well as using con-
vergence heuristics that reduce the number of iterations performed
(see Methods for details).
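The edge-count savings from linking only adjacent variants on each read can be illustrated directly: a read covering k variants contributes k(k − 1)/2 edges in an all-pairs construction but only k − 1 in the adjacent-only construction, while still keeping the variants on that read connected. A minimal sketch (the fragment format is an assumption):

```python
def graph_edges(frags):
    """Edges of the read-haplotype graph under two constructions:
    linking every pair of variants on a read (quadratic per read)
    versus linking only consecutive variants (linear per read)."""
    full, adjacent = set(), set()
    for frag in frags:                  # frag: list of (variant index, allele)
        idx = sorted(i for i, _ in frag)
        full.update((a, b) for k, a in enumerate(idx) for b in idx[k + 1:])
        adjacent.update(zip(idx, idx[1:]))
    return full, adjacent
```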
Next, we varied the number of variants per read (V) and kept
the coverage per variant (d) fixed at 5×. The error rates for each
method decrease monotonically (Fig. 1B). HapCUT2, RefHap,
and ProbHap have similarly low error rates, while FastHare and
HapCUT have error rates higher than the other methods. The run-
times of RefHap and FastHare are consistently very low, although
the runtime of RefHap peaks very slightly around V = 15. The run-
time of ProbHap decreases monotonically as V increases. This is
consistent with the fact that the runtime of these methods has a
linear dependence on the read length because for a fixed sequence
coverage, the number of reads decreases as the read length increas-
es. In comparison, HapCUT2’s runtime is observed to increase lin-
early with V. This is consistent with the complexity of HapCUT2
being proportional to the square of the number of variants per
read (see Table 1). Although HapCUT2’s runtime increases, it re-
mains practical across all tested values and is <50 CPU-min for
mean read lengths consistent with very long sequences (160 vari-
ants per read or 200 kb). The space requirements for HapCUT have
a quadratic dependence on the number of variants per read, and
therefore, exceeded the memory limit after only eight variants
per read.
Finally, we compared runtimes as a function of the average
number of paired-end reads crossing a variant (d′). For single-end
reads, this parameter is identical to d. Proximity ligation data, on
the other hand, consists of pairs of short reads each with a single
large gap (insert) between them. The large and highly variable in-
sert sizes result in a large number of reads crossing each variant po-
sition. This property is important for linking distant variants,
because the extremely long insert size spans of proximity ligation
methods are capable of spanning long variant-free regions. For this
reason, we simulated paired-end short read data with random in-
sert sizes up to a parametrized maximum value, to represent a gen-
eralized proximity ligation experiment. We varied d′ by increasing
the maximum insert size value from 6.25 kb (∼5 single-nucleotide
variants [SNVs]) to 125 kb (∼100 SNVs) while keeping d and V
constant at 5× and 150 base pairs (bp) (0.1195 SNVs), respectively.
ProbHap and RefHap exceeded the time limit at d′ = 10 and d′ = 17,
respectively. FastHare exceeded the time limit at d′ = 36 but had
extremely high error rates (10×–18× higher than HapCUT2).
ProbHap’s dynamic programming algorithm needs to consider
the haplotype of origin for each read crossing a variant; therefore,
the complexity scales exponentially in d′. In the case of RefHap
and FastHare, the failure to scale with increasing d′ appears to be
a result of representing fragments as continuous arrays with length
equal to the number of variants spanned by each read. Thus, as
implemented, the runtimes for RefHap and FastHare scale with d′
rather than d. In contrast, both HapCUT and HapCUT2 were
able to phase data with arbitrarily long insert lengths, reaching
d′ = 100 (Fig. 1C). The runtime of HapCUT2 was independent of
d′ and 8×–10× faster than that for HapCUT.
Overall, the results on simulated data demonstrate that the
complexity of HapCUT2 is linear in the number of reads and qua-
dratic in the number of variants per read. HapCUT2 is fast in prac-
tice and effective for both long reads and paired-end reads with
long insert lengths, with scalability unmatched by the four other
tools we evaluated. Additionally, HapCUT2 and HapCUT were
the only tools tested that can reasonably phase paired-end data
with long insert lengths that result from proximity ligation
(Hi-C) sequencing.
Comparison of methods on diverse WGS data sets
for a single individual
We next assessed the accuracy of HapCUT2 using data from four dif-
ferent sequencing data types for a single individual (NA12878): fos-
mid-based dilution pool sequencing, 10X Genomics linked-read
sequencing, single molecule real-time (SMRT) sequencing, and
proximity ligation sequencing. Haplotype assembly methods re-
quire a set of heterozygous variants as input. Therefore, a set of het-
erozygous variants for NA12878 identified from WGS Illumina data
were used as input to assemble haplotypes for each data type (see
Methods for description). The accuracy of the haplotypes was as-
sessed by comparing the assembled haplotypes to gold-standard
trio-phased haplotypes and using the switch error rate and mis-
match error rate metrics (see Methods).
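Under a common definition, a switch error occurs wherever the relative phase of two consecutive variants differs between the assembled and true haplotypes, and the switch error rate divides this count by the number of adjacent variant pairs. The sketch below uses that definition for illustration; the precise metrics used in this study are defined in Methods.

```python
def switch_errors(assembled, truth):
    """Count switch errors: positions where the relative phase of
    consecutive variants differs between the assembled and true
    haplotypes. Haplotypes are encoded as 0/1 alleles on one copy."""
    # relative phase between consecutive variants: 0 = same allele, 1 = flip
    rel_a = [a ^ b for a, b in zip(assembled, assembled[1:])]
    rel_t = [a ^ b for a, b in zip(truth, truth[1:])]
    return sum(x != y for x, y in zip(rel_a, rel_t))

def switch_error_rate(assembled, truth):
    n = len(assembled) - 1              # number of adjacent variant pairs
    return switch_errors(assembled, truth) / n if n else 0.0
```

Because a single mis-phased variant produces two consecutive switches, many evaluations report a separate mismatch error rate for such point flips alongside the switch error rate.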
Fosmid-based dilution pool data
To assess HapCUT2 on long read sequencing data, we used whole-
genome fosmid-based dilution pool sequence data for a human in-
dividual, NA12878 (Duitama et al. 2012). This data was generated
from 1.44 million fosmids (33–38 kb and 38–45 kb in length)
that were partitioned into 32 pools such that each pool contains
DNA from a small fraction of the genome (∼5%). Subsequently,
each pool was sequenced using the ABI SOLiD sequencer and hap-
lotype fragments were identified using read depth analysis
(Duitama et al. 2012). Although this data set has low sequence cov-
erage (d ≈ 3×), the processed fragment data (needed as input for
haplotype assembly) is publicly available and has been used to
assess the performance of haplotype assembly methods in several
papers (Duitama et al. 2012; Kuleshov 2014). On this data, the
switch error and the mismatch error rates for HapCUT2 were
virtually identical or slightly better than ProbHap, the second
best performing method, across all chromosomes (Supplemental
Fig. S2). However, ProbHap pruned ∼1.2% of the variants from
the assembled haplotypes, in comparison to HapCUT2, which
only pruned 0.6% of the variants. The switch error rates for
RefHap and FastHare were also similar to HapCUT2 and ProbHap
(Supplemental Fig. S2). To enable a head-to-head comparison of
the switch error rate across different methods, we also calculated
the switch and mismatch error rates on a subset of variants that
were phased by all tools (not pruned). On this subset of variants,
the switch and mismatch error rates for HapCUT2 were similar to
but slightly lower than ProbHap (Fig. 2A). In terms of running
time, RefHap and FastHare were the fastest methods on this data
set, while HapCUT2 took a total of 1:09 CPU-h to phase all chromo-
somes (Table 2). In summary, HapCUT2 had similar (but slightly
better) accuracy to ProbHap, RefHap, and FastHare on this data
set and was more accurate than HapCUT.
10X Genomics linked-read data
We also used HapCUT2 to assemble haplotypes from 10X
Genomics linked-read data (Zheng et al. 2016), which is based
on a similar idea as the fosmid-based dilution pool approach.
10X Genomics technology labels short DNA fragments originating
from a single long DNA fragment with barcodes inside hundreds of
thousands of separate nano-scale droplets (Zheng et al. 2016). The
linked reads produced can be extremely long (>100 kb). This data
set has a short read coverage of 34×, with a linked-read coverage per
variant of 12× (Zook et al. 2016). For haplotype assembly, we used
the same set of variant calls as for the fosmid data set and extracted
haplotype fragments from the 10X aligned reads (see Methods,
“Long read data sets”). On this data set, neither RefHap nor
ProbHap finished haplotype assembly within the time limit.
HapCUT2 was the fastest method and analyzed all chromosomes
in 1:55 CPU-h (Table 2). When compared on the subset of variants
that were phased by all tools, HapCUT2 had an accuracy slightly
better than the next best approach (HapCUT), which took 16:50
CPU-h (Fig. 2C).
PacBio SMRT data
SMRT sequencing on the Pacific Biosciences platform generates
long (2–20 kb) but error-prone (>10% indel error rate) reads. We
used HapCUT2 to assemble haplotypes from 44× coverage
PacBio reads (Pendleton et al. 2015). We extracted haplotype frag-
ments from the PacBio reads that were aligned to the human refer-
ence genome (hg19), using the same set of variant calls as for the
previous two data sets. On the full data set, HapCUT2 was not
only the most accurate but was also significantly faster than
RefHap and HapCUT (see Supplemental Fig. S3 for detailed com-
parisons of error rates and runtimes). We calculated the switch er-
ror and mismatch error rates on the subset of variants that were
phased by all methods. HapCUT2 had a 12.4% lower switch error
and a 2% lower mismatch rate than RefHap. RefHap took 215:53
CPU-h to phase the data set. By comparison, HapCUT2 took
only 4:05 CPU-h in total. Because ProbHap was unable to complete
within the time limit on the full data set, we also compared the per-
formance of the haplotype assembly tools on a lower, 11× coverage
subsample of this data set. On the subsample, HapCUT2 had the
lowest switch error and mismatch error rates of the five methods
(Fig. 2B). FastHare was the fastest method on this data set and
ProbHap was the slowest method, taking 52:32 CPU-h (Table 2).
HapCUT2 implements likelihood-based strategies for prun-
ing low-confidence variants to reduce mismatch errors and split-
ting blocks at poor linkages to reduce switch errors (see
Methods). These post-processing steps allow a user to improve ac-
curacy of the haplotypes at the cost of reducing completeness and
contiguity. ProbHap’s “transition, posterior, and emission” confi-
dence scores are designed for the same purpose (Kuleshov 2014).
Post-processing strategies are of particular interest for haplotype
assembly with PacBio SMRT reads because the individual reads
Figure 2. Accuracy of HapCUT2 compared to four other methods for haplotype assembly on diverse whole-genome sequence data sets for NA12878. (A)
Fosmid dilution pool data (Duitama et al. 2012). (B) PacBio SMRT data (11× and 44× coverage). (C) 10X Genomics linked reads. (D) Whole-genome Hi-C
data (40× and 90× coverage, created with MboI enzyme). Switch and mismatch error rates were calculated across all chromosomes using the subset of
variants that were phased by all methods. For each data set, only methods that produced results within 20 CPU-h per chromosome are shown.