
HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies

Peter Edge,1 Vineet Bafna,1 and Vikas Bansal2

1Department of Computer Science & Engineering, University of California, San Diego, La Jolla, California 92093, USA; 2Department of Pediatrics, School of Medicine, University of California, San Diego, La Jolla, California 92093, USA

Corresponding author: vibansal@ucsd.edu

Genome Research 27: 801-812 (May 2017); doi: 10.1101/gr.213462.116. © 2017 Edge et al.; published by Cold Spring Harbor Laboratory Press and, six months after the full-issue publication date, available under a Creative Commons Attribution-NonCommercial 4.0 International license (http://creativecommons.org/licenses/by-nc/4.0/).
Many tools have been developed for haplotype assembly, the reconstruction of individual haplotypes using reads mapped to a reference genome sequence. Due to increasing interest in obtaining haplotype-resolved human genomes, a range of new sequencing protocols and technologies have been developed to enable the reconstruction of whole-genome haplotypes. However, existing computational methods designed to handle specific technologies do not scale well on data from different protocols. We describe a new algorithm, HapCUT2, that extends our previous method (HapCUT) to handle multiple sequencing technologies. Using simulations and whole-genome sequencing (WGS) data from multiple different data types (dilution pool sequencing, linked-read sequencing, single molecule real-time (SMRT) sequencing, and proximity ligation (Hi-C) sequencing), we show that HapCUT2 rapidly assembles haplotypes with best-in-class accuracy for all data types. In particular, HapCUT2 scales well for high sequencing coverage and rapidly assembled haplotypes for two long-read WGS data sets on which other methods struggled. Further, HapCUT2 directly models Hi-C-specific error modalities, resulting in significant improvements in error rates compared to HapCUT, the only other method that could assemble haplotypes from Hi-C data. Using HapCUT2, haplotype assembly from a 90× coverage whole-genome Hi-C data set yielded high-resolution haplotypes (78.6% of variants phased in a single block) with high pairwise phasing accuracy (∼98% across chromosomes). Our results demonstrate that HapCUT2 is a robust tool for haplotype assembly applicable to data from diverse sequencing technologies.

[Supplemental material is available for this article.]
Humans are diploid organisms with two copies of each chromosome (except the sex chromosomes). The two haplotypes (described by the combination of alleles at variant sites on a single chromosome) represent the complete information on DNA variation in an individual. Reconstructing individual haplotypes has important implications for understanding human genetic variation, interpretation of variants in disease, and reconstructing human population history (Tewhey et al. 2011; Glusman et al. 2014; Schiffels and Durbin 2014; Snyder et al. 2015). A number of methods, computational and experimental, have been developed for haplotyping human genomes. Statistical methods for haplotype phasing using population genotype data have proven successful for phasing common variants and for genotype imputation but are limited in their ability to phase rare variants and to phase long stretches of the genome that cross recombination hotspots (Browning and Browning 2011; Tewhey et al. 2011).
Haplotypes for an individual genome at known heterozygous variants can be directly reconstructed from reference-aligned sequence reads derived from whole-genome sequencing (WGS). Sequence reads that are long enough to cover multiple heterozygous variants provide partial haplotype information. Using overlaps between such haplotype-informative reads, long haplotypes can be assembled. This haplotype assembly approach does not rely on information from other individuals (such as parents) and can phase even individual-specific variants. Levy et al. (2007) demonstrated the feasibility of this approach using sequence data derived from paired Sanger sequencing of long-insert DNA fragment libraries to computationally assemble long haplotype blocks (N50 of 350 kb) for the first individual human genome.
Since then, advances in massively parallel sequencing technologies have drastically reduced the cost of human WGS, leading to the sequencing of thousands of human genomes. However, the short read lengths generated by technologies such as Illumina (100-250 bases) and the use of short fragment lengths in WGS protocols make it infeasible to link distant variants into haplotypes. To overcome this limitation, a number of innovative methods have been developed that attempt to preserve haplotype information from long DNA fragments (tens to hundreds of kilobases) in short sequence reads.
The underlying principle for these methods involves generating multiple pools of high-molecular-weight DNA fragments such that each pool contains only a small fraction of the DNA from a single genome. As a result, there are very few overlapping DNA fragments in each pool, and high-throughput sequencing of the DNA in each pool can be used to reconstruct the fragments by alignment to a reference genome (Kitzman et al. 2011; Suk et al. 2011). Therefore, each pool provides haplotype information from long DNA fragments, and long haplotypes can be assembled using information from a sufficiently large number of independent pools (Snyder et al. 2015). A number of methods based on this approach have been developed to phase human genomes (Kitzman et al. 2011; Suk et al. 2011; Peters et al. 2012; Kaper et al. 2013; Amini et al. 2014).
Recently, 10X Genomics described a novel microfluidics-based library preparation approach that generates long linked reads that can be assembled into long haplotypes (Zheng et al. 2016). Third-generation sequencing technologies such as Pacific Biosciences (PacBio) generate long sequence reads (2-20 kb in length) that can directly enable genome-wide haplotyping. Pendleton and colleagues demonstrated the feasibility of assembling haplotypes from SMRT reads using variants identified from short-read Illumina sequencing (Pendleton et al. 2015).
Haplotype assembly is also feasible with paired-end sequencing, i.e., pairs of short reads derived from the ends of long DNA fragments, but requires long and variable insert lengths to assemble long haplotypes (Tewhey et al. 2011). Selvaraj et al. (2013) used sequence data from a proximity ligation method (Hi-C) to assemble accurate haplotypes for mouse and human genomes. Using mouse data, they demonstrated that the vast majority of intrachromosomal Hi-C read pairs correspond to cis interactions (between fragments on the same chromosome) and therefore contain haplotype information equivalent to paired-end reads with long and variable insert lengths. Subsequently, 17× whole-genome Hi-C data was used to assemble chromosome-spanning haplotypes for a human genome, albeit with low resolution (<22% of variants phased).
In summary, multiple different sequencing technologies and protocols have the capability to generate sequence reads with haplotype information but require computational tools to assemble the reads into long haplotypes. A number of combinatorial algorithms have been developed for haplotype assembly (Bansal and Bafna 2008; Duitama et al. 2010; He et al. 2010; Aguiar and Istrail 2012). Among these, HapCUT (Bansal and Bafna 2008) was developed for phasing Sanger WGS data for the first individual genome (Levy et al. 2007). HapCUT utilizes max-cuts in read-haplotype graphs, an approach that is equally adept at handling data with local haplotype information and data with long-range haplotype information, such as that from long-insert paired-end reads. As a result, it has been successfully utilized, with appropriate modifications, to assemble haplotypes from different types of high-throughput sequence data sets, including fosmid pool sequencing (Kitzman et al. 2011), Hi-C data (Selvaraj et al. 2013), and single-molecule long reads (Pendleton et al. 2015). However, HapCUT only models simple sequencing errors and does not scale well for long-read data. More recently, several algorithms have been designed specifically to enable accurate haplotype assembly from long reads (Duitama et al. 2010; Kuleshov 2014).
The diverse characteristics and specific error modalities of data generated by different haplotype-enabling protocols and technologies continue to pose challenges for haplotype assembly algorithms. Some protocols, such as clone-based sequencing, can generate very long fragments (BAC clones of length 140 kb have been used to assemble haplotypes [Lo et al. 2013]) but may have low fragment coverage. Other protocols, such as PacBio SMRT, generate fragments with shorter mean lengths than clone-based approaches but can be scaled to higher read coverage more easily. 10X Genomics linked reads are long (longest molecules >100 kb) but have gaps, resulting in high clone coverage for each variant. Proximity ligation approaches, such as Hi-C, generate paired-end read data with very short read lengths but with a larger genomic span: Hi-C reads can span from a few kilobases to tens of megabases in physical distance. While an algorithm that leverages the characteristics of a specific type of data is likely to perform well on that particular type of data, it may not perform well, or may not work at all, on other types of data. For example, dynamic programming algorithms such as ProbHap (Kuleshov 2014) that were developed for low-depth long-read sequence data are unlikely to scale well for data sets with high sequence coverage or for other types of data such as Hi-C. Even if a haplotype assembly algorithm has broad support for data qualities, there remains the challenge that different sequencing protocols each have systematic error modalities. For instance, sequencing of multiple haploid subsets of a human genome (Kitzman et al. 2011; Suk et al. 2011) generates long haplotype fragments, but some of these fragments are chimeric due to overlapping DNA molecules that originate from different chromosomes. Similarly, noise in Hi-C data due to ligated fragments from opposite homologous chromosomes increases with increasing distance between the variants. The accuracy of haplotypes assembled from each sequencing protocol depends on both the haplotype assembly algorithm's ability to effectively utilize the sequence data and its ability to model protocol-specific errors.
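The h-trans error process just described admits a compact likelihood. As a hedged sketch in our own notation (not necessarily the paper's exact formulation): a read pair with insert size I is h-trans with some probability tau(I) that grows with I, in which case its two ends report alleles from opposite homologs, giving a two-component mixture for each Hi-C read pair:

    % Likelihood of a single Hi-C read pair r with insert size I_r, given a
    % candidate haplotype h (the other homolog is the complement of h).
    % \tau(I) is the span-dependent h-trans rate, estimated from the data.
    P(r \mid h) \;=\; \bigl(1 - \tau(I_r)\bigr)\, P_{\mathrm{cis}}(r \mid h)
                \;+\; \tau(I_r)\, P_{\mathrm{trans}}(r \mid h)

Here P_cis is the usual sequencing-error likelihood of the pair under h, and P_trans evaluates the same alleles with one end assigned to the complementary haplotype; with tau = 0 the expression reduces to the standard model used for the other data types.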
Results
To address the challenge of haplotype assembly for diverse types of sequence data sets, we developed HapCUT2, an algorithm that generalizes the HapCUT approach in several ways. In place of the discrete score optimized by HapCUT, HapCUT2 uses a likelihood-based model, which allows for the modeling and estimation of technology-specific errors such as h-trans errors in Hi-C data. To improve memory performance for long-read data, HapCUT2 does not explicitly construct the complete read-haplotype graph. Further, it implements a number of optimizations to enable fast runtimes on diverse types of sequence data sets. To demonstrate the accuracy and robustness of HapCUT2, we compared its performance with existing methods for haplotype assembly using simulated and real WGS data sets. Previous publications (Duitama et al. 2012; Kuleshov 2014) have compared different methods for haplotype assembly and concluded that RefHap (Duitama et al. 2010), ProbHap (Kuleshov 2014), FastHare (Panconesi and Sozio 2004), and HapCUT (Bansal and Bafna 2008) are among the best performing methods. Other methods, such as DGS (Levy et al. 2007), MixSIH (Matsumoto and Kiryu 2013), and HapTree (Berger et al. 2014), did not perform as well on the data sets evaluated in this study. Therefore, we compared the performance of HapCUT2 with four other methods: RefHap, ProbHap, FastHare, and HapCUT (Table 1).
Overview of HapCUT2 algorithm
The input to HapCUT2 consists of haplotype fragments (the sequence of alleles at heterozygous variant sites identified from aligned sequence reads) and a list of heterozygous variants (identified from WGS data). HapCUT2 aims to assemble a pair of haplotypes that is maximally consistent with the input set of haplotype fragments. This consistency is measured using a likelihood function that captures sequencing errors and technology-specific errors such as h-trans errors in proximity ligation data. HapCUT2 is an iterative procedure that starts with a candidate haplotype pair. Given the current pair of haplotypes, HapCUT2 searches for a subset of variants (using max-cut computations in the read-haplotype graph) such that changing the phase of these variants relative to the remaining variants results in a new pair of haplotypes with greater likelihood. This procedure is repeated iteratively until no further improvements can be made to the likelihood (see Methods for details).
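As an illustration of this loop, the sketch below phases a toy instance. Two simplifications are worth flagging loudly: it evaluates single-variant phase flips instead of the max-cut search over variant subsets that HapCUT2 performs, and it uses a flat allele error rate in place of per-base quality scores; the encoding of fragments as (site, allele) pairs is also our own illustrative choice, not a HapCUT2 input format.

    import math

    ERR = 0.02  # flat allele miscall rate (illustrative; HapCUT2 uses base qualities)

    def fragment_loglik(frag, hap):
        """Log-likelihood of one fragment under the haplotype pair (hap, complement)."""
        ll = [0.0, 0.0]  # fragment drawn from hap, or from its complement
        for site, allele in frag:
            match = (allele == hap[site])
            ll[0] += math.log(1 - ERR) if match else math.log(ERR)
            ll[1] += math.log(ERR) if match else math.log(1 - ERR)
        return math.log(0.5 * (math.exp(ll[0]) + math.exp(ll[1])))

    def total_loglik(frags, hap):
        return sum(fragment_loglik(f, hap) for f in frags)

    def phase(frags, n_sites):
        hap = [0] * n_sites  # one homolog; the other is its complement
        cur = total_loglik(frags, hap)
        improved = True
        while improved:  # iterate until the likelihood stops increasing
            improved = False
            for s in range(n_sites):
                hap[s] ^= 1  # flip the phase of one variant relative to the rest
                new = total_loglik(frags, hap)
                if new > cur:
                    cur, improved = new, True
                else:
                    hap[s] ^= 1  # revert: this flip did not raise the likelihood
        return hap

    # three fragments consistent with the phasing 0-1-1-0 (or its complement)
    frags = [[(0, 0), (1, 1)], [(1, 1), (2, 1), (3, 0)], [(0, 1), (2, 0)]]
    print(phase(frags, 4))  # -> [0, 1, 1, 0] or its complement [1, 0, 0, 1]

Because each fragment's likelihood sums over both homologs of origin, the objective is invariant to complementing the whole haplotype, which is why either of the two equivalent answers can be returned.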

Comparison of runtimes on simulated data
We used simulations to compare the runtime of HapCUT2 with existing methods for haplotype assembly across different types of sequence data sets. A fair comparison of the performance of different methods is not completely straightforward: different methods chose to optimize different technology parameters and highlighted performance using those parameters. We considered the following parameters: the number of variants per read (V), the coverage per variant (d), and the number of paired-end reads spanning a variant (d′). The parameter V is a natural outcome of read length; for example, PacBio provides higher values of V compared to Illumina sequencing. The parameter d is similar to read coverage but only considers haplotype-informative reads; higher values result in better accuracy but also increased running time. Finally, many sequencing technologies (such as Hi-C) generate paired-end sequencing with long inserts, and d′ can potentially be much greater than d. Some haplotype assembly methods implicitly analyze all paired-end reads spanning a specific position, and their runtime depends upon d′ rather than d.
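To make the distinction between these parameters concrete, the snippet below computes mean V together with d and d′ at a single site for a toy set of reads; the read representation (covered variant indices plus the variant span of the whole insert) is our own minimal stand-in, not a HapCUT2 input format.

    from collections import Counter

    # each read: (variant indices with an observed allele, span of the whole
    # insert as (first, last) variant index). For contiguous single-end reads
    # the two coincide and d' == d; pairs with long inserts push d' above d.
    reads = [
        ([10, 11], (10, 11)),     # ordinary short read
        ([10, 200], (10, 200)),   # Hi-C-like read pair with a long insert
        ([12, 13], (12, 13)),     # ordinary short read
    ]

    V = sum(len(vs) for vs, _ in reads) / len(reads)   # mean variants per read

    depth = Counter(v for vs, _ in reads for v in vs)  # d: reads covering a variant
    crossing = Counter()                               # d': reads whose insert crosses it
    for _, (lo, hi) in reads:
        for v in range(lo, hi + 1):
            crossing[v] += 1

    print(f"mean V = {V:.2f}")                                        # 2.00
    print(f"d at site 12 = {depth[12]}, d' at site 12 = {crossing[12]}")  # 1 vs 2

Here the Hi-C-like pair never reports an allele at site 12 (so it does not add to d) but its insert crosses the site (so it adds to d′), which is exactly why methods whose work scales with d′ struggle on proximity ligation data.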
In order to make a fair comparison of runtimes and allow users to determine the most efficient method for any technology, we summarized the computational complexity of each method as a function of these parameters (Table 1) and used simulations to verify the dependence of runtime and accuracy on each parameter (Fig. 1). We simulated reads using a single chromosome of length 250 Mb (approximately equal to the length of human Chromosome 1) with a heterozygous variant density of 0.08% and a uniform rate of sequencing errors (2%), performing 10 replicates for each simulation. Standard deviations of runtimes and error rates between replicates were small (Supplemental Fig. S1). A method was cut off if it exceeded 10 CPU-h of runtime or 8 GB of memory on a single CPU, since most methods required significantly fewer resources than these limits. We note that the runtimes in Table 1 refer to complexity as implemented, with parameters referring to maximum values (e.g., maximum coverage per variant), while in the simulations the parameters refer to mean values (e.g., mean coverage per variant).
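The simulation setup just described is easy to reproduce in miniature. Below is a scaled-down generator in the same toy (site, allele) fragment encoding as the phasing sketch above; the fixed read length in variants (rather than a length distribution) and the reduced site count are our simplifications.

    import random

    random.seed(1)
    N_SITES = 2000   # het variants; 0.08% density over 250 Mb would give
                     # ~200,000, scaled down here so the toy runs instantly
    ERR = 0.02       # uniform sequencing error rate, as in the text
    V_MEAN = 4       # variants per read (fixed here; mean of four in the text)
    D_MEAN = 10      # target coverage per variant

    truth = [random.randint(0, 1) for _ in range(N_SITES)]
    n_reads = N_SITES * D_MEAN // V_MEAN

    def simulate_read():
        start = random.randrange(N_SITES - V_MEAN)
        from_h0 = random.random() < 0.5       # which homolog the read samples
        frag = []
        for s in range(start, start + V_MEAN):
            allele = truth[s] if from_h0 else 1 - truth[s]
            if random.random() < ERR:         # uniform miscall
                allele = 1 - allele
            frag.append((s, allele))
        return frag

    frags = [simulate_read() for _ in range(n_reads)]
    print(len(frags), "fragments over", N_SITES, "sites")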
Table 1. Comparison of the approach, time complexity, and applicability of five algorithms for haplotype assembly: HapCUT2, HapCUT, RefHap, ProbHap, and FastHare

Method   | Approach                                                 | Complexity                     | Long reads              | Hi-C support | Variant pruning
HapCUT2  | Likelihood optimization using graph-cuts                 | O(c1 c2 (N log N + N d V^2))   | Scalable                | Yes          | Likelihood
HapCUT   | MEC optimization using graph-cuts                        | O(c1 c2 (N log N + N d V^2))   | High memory requirement | Yes          | No
RefHap   | Max-cut on read graph                                    | O(c3 (R^2 V d′ + R V^2 d′^2))  | Low-to-medium coverage  | No           | Discrete
ProbHap  | Exact likelihood using dynamic prog. + merging heuristic | O(N d′ 2^d′)                   | Low-coverage            | No           | Confidence scores
FastHare | Read partitioning optimization                           | O(R V d′)                      | Yes                     | No           | Discrete

(R) Number of reads (all algorithms process reads for each haplotype block separately); (N) total number of variants; (V) maximum number of variants in a read; (d) maximum read depth per site; (d′) maximum number of reads crossing a site (equivalent to d except with paired-end inserts being included as part of the read); (c1), (c2), (c3) method-specific variables that are either fixed in advance or selected by the user. Reads are assumed to be sorted by starting position.
Figure 1. Comparison of runtime (top panel) and switch + mismatch error rate (bottom panel) for HapCUT2 with four methods for haplotype assembly (HapCUT, RefHap, ProbHap, and FastHare) on simulated read data as a function of (A) mean coverage per variant (variants per read fixed at four); (B) mean variants per read (mean coverage per variant fixed at five); and (C) mean number of paired-end reads crossing a variant (mean coverage per variant fixed at five, read length 150 bp, random insert size up to a variable maximum value). Lines represent the mean of 10 replicate simulations. FastHare is not visible in C (bottom) due to significantly higher error rates.
To assess the dependence of runtime on d, we generated reads with a mean of four variants per read (V) and varied the mean read coverage per variant (d) from five to 100. The error rates of HapCUT2, HapCUT, ProbHap, and RefHap were similar and decreased with increasing coverage before reaching saturation. FastHare was significantly faster than the other methods but had error rates that were several times greater. As predicted by the computational complexity of the different methods (Table 1), HapCUT2 is significantly faster than HapCUT, RefHap, and ProbHap once the coverage exceeds 10× (Fig. 1A). For example, RefHap required 10 CPU-h to phase reads at a coverage of 38×, while HapCUT2 took only 34 CPU-min to phase reads with 100× coverage per variant. ProbHap reached the 10-CPU-h limit at a coverage of only 8×. HapCUT shows a similar trend to HapCUT2 but is significantly slower and requires more than 8 GB of memory at coverages of 40× or greater. RefHap constructs a graph with the sequence reads as nodes and performs a max-cut operation that scales quadratically with the number of reads; therefore, its runtime is expected to increase as the square of read coverage. ProbHap's runtime is exponential in the maximum read depth (Kuleshov 2014) and exceeds the maximum allotted time for modest values of d. FastHare greedily builds a maximally consistent haplotype from left to right in a single pass, resulting in a low runtime but also lower accuracy. While HapCUT2 has the same asymptotic behavior as HapCUT, it improves upon the memory usage and runtime significantly in practice. It does this by adding only edges that link adjacent variants on each read to the read-haplotype graph, as well as by using convergence heuristics that reduce the number of iterations performed (see Methods for details).
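The edge-construction trick mentioned above is simple to state in code: linking only consecutive variants on each read contributes V - 1 edges per read, versus the V(V - 1)/2 edges of a full per-read clique. A minimal sketch follows (the fragment encoding is again our own illustrative choice):

    def graph_edges(frags, dense=False):
        """Edges of the read-haplotype graph over variant sites.
        dense=True links every pair of variants on a read (quadratic in V);
        dense=False links only adjacent variants (linear in V, as HapCUT2 does)."""
        edges = set()
        for frag in frags:
            sites = sorted(site for site, _allele in frag)
            if dense:
                edges.update((a, b) for i, a in enumerate(sites) for b in sites[i + 1:])
            else:
                edges.update(zip(sites, sites[1:]))
        return edges

    one_read = [[(s, 0) for s in range(8)]]         # a single read covering 8 variants
    print(len(graph_edges(one_read, dense=True)))   # 28 edges (clique)
    print(len(graph_edges(one_read, dense=False)))  # 7 edges (path)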
Next, we varied the number of variants per read (V) and kept the coverage per variant (d) fixed at 5×. The error rates for each method decrease monotonically (Fig. 1B). HapCUT2, RefHap, and ProbHap have similarly low error rates, while FastHare and HapCUT have error rates higher than the other methods. The runtimes of RefHap and FastHare are consistently very low, although the runtime of RefHap peaks very slightly around V = 15. The runtime of ProbHap decreases monotonically as V increases. This is consistent with the fact that the runtime of these methods has a linear dependence on the read length, because for a fixed sequence coverage, the number of reads decreases as the read length increases. In comparison, HapCUT2's runtime is observed to increase linearly with V. This is consistent with the complexity of HapCUT2 being proportional to the square of the number of variants per read (see Table 1). Although HapCUT2's runtime increases, it remains practical across all tested values and is <50 CPU-min for mean read lengths consistent with very long sequences (160 variants per read, or 200 kb). The space requirements for HapCUT have a quadratic dependence on the number of variants per read and therefore exceeded the memory limit after only eight variants per read.
Finally, we compared runtimes as a function of the average number of paired-end reads crossing a variant (d′). For single-end reads, this parameter is identical to d. Proximity ligation data, on the other hand, consist of pairs of short reads, each pair with a single large gap (insert) between the two ends. The large and highly variable insert sizes result in a large number of reads crossing each variant position. This property is important for linking distant variants, because the extremely long insert sizes of proximity ligation methods are capable of spanning long variant-free regions. For this reason, we simulated paired-end short-read data with random insert sizes up to a parametrized maximum value, to represent a generalized proximity ligation experiment. We varied d′ by increasing the maximum insert size from 6.25 kb (5 single-nucleotide variants [SNVs]) to 125 kb (100 SNVs) while keeping d and V constant at 5× and 0.1195 SNVs (150 base pairs [bp]), respectively. ProbHap and RefHap exceeded the time limit at d′ = 10 and d′ = 17, respectively. FastHare exceeded the time limit at d′ = 36 but had extremely high error rates (10-18× higher than HapCUT2). ProbHap's dynamic programming algorithm needs to consider the haplotype of origin for each read crossing a variant; therefore, its complexity scales exponentially in d′. In the case of RefHap and FastHare, the failure to scale with increasing d′ appears to be a result of representing fragments as contiguous arrays with length equal to the number of variants spanned by each read. Thus, as implemented, the runtimes for RefHap and FastHare scale with d′ rather than d. In contrast, both HapCUT and HapCUT2 were able to phase data with arbitrarily long insert lengths, reaching d′ = 100 (Fig. 1C). The runtime of HapCUT2 was independent of d′ and 10× faster than that of HapCUT.
Overall, the results on simulated data demonstrate that the complexity of HapCUT2 is linear in the number of reads and quadratic in the number of variants per read. HapCUT2 is fast in practice and effective for both long reads and paired-end reads with long insert lengths, with scalability unmatched by the four other tools we evaluated. Additionally, HapCUT2 and HapCUT were the only tools tested that can reasonably phase paired-end data with the long insert lengths that result from proximity ligation (Hi-C) sequencing.
Comparison of methods on diverse WGS data sets for a single individual
We next assessed the accuracy of HapCUT2 using data from four different sequencing data types for a single individual (NA12878): fosmid-based dilution pool sequencing, 10X Genomics linked-read sequencing, single molecule real-time (SMRT) sequencing, and proximity ligation sequencing. Haplotype assembly methods require a set of heterozygous variants as input; therefore, a set of heterozygous variants for NA12878 identified from Illumina WGS data was used as input to assemble haplotypes for each data type (see Methods for description). The accuracy of the haplotypes was assessed by comparing the assembled haplotypes to gold-standard trio-phased haplotypes, using the switch error rate and mismatch error rate metrics (see Methods).
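For readers who want the two metrics concretely, the function below counts switch and mismatch errors over one phase block under a common formulation (an isolated flipped variant counts as a mismatch; a phase change that persists counts as a switch); the paper's Methods section defines the exact variants of these metrics used here.

    def phase_errors(assembled, truth):
        """Switch and mismatch (point flip) errors between two 0/1 haplotype
        vectors over the same het sites within one phase block. Illustrative,
        not the paper's exact definition."""
        rel = [a ^ t for a, t in zip(assembled, truth)]  # relative phase per site
        points = [i for i in range(1, len(rel)) if rel[i] != rel[i - 1]]
        switches = mismatches = 0
        i = 0
        while i < len(points):
            if i + 1 < len(points) and points[i + 1] == points[i] + 1:
                mismatches += 1   # phase flips and immediately flips back
                i += 2
            else:
                switches += 1     # phase flips and stays flipped
                i += 1
        return switches, mismatches

    # one flipped site in the middle -> one mismatch, no switches
    print(phase_errors([0, 0, 1, 0, 0], [0, 0, 0, 0, 0]))  # (0, 1)
    # phase flips once and persists -> one switch, no mismatches
    print(phase_errors([0, 0, 1, 1, 1], [0, 0, 0, 0, 0]))  # (1, 0)

Because only the relative phase vector is inspected, the counts are unchanged if the assembled haplotype is globally complemented, matching the fact that a haplotype pair has no preferred labeling.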
Fosmid-based dilution pool data

To assess HapCUT2 on long-read sequencing data, we used whole-genome fosmid-based dilution pool sequence data for a human individual, NA12878 (Duitama et al. 2012). This data was generated from 1.44 million fosmids (33-38 kb and 38-45 kb in length) that were partitioned into 32 pools such that each pool contains DNA from a small fraction of the genome (∼5%). Subsequently, each pool was sequenced using the ABI SOLiD sequencer, and haplotype fragments were identified using read-depth analysis (Duitama et al. 2012). Although this data set has low sequence coverage (d ∼ 3×), the processed fragment data (needed as input for haplotype assembly) is publicly available and has been used to assess the performance of haplotype assembly methods in several papers (Duitama et al. 2012; Kuleshov 2014). On this data, the switch error and mismatch error rates for HapCUT2 were virtually identical to or slightly better than those of ProbHap, the second best performing method, across all chromosomes (Supplemental Fig. S2). However, ProbHap pruned 1.2% of the variants from the assembled haplotypes, in comparison to HapCUT2, which only pruned 0.6% of the variants. The switch error rates for RefHap and FastHare were also similar to those of HapCUT2 and ProbHap (Supplemental Fig. S2). To enable a head-to-head comparison of the switch error rate across different methods, we also calculated the switch and mismatch error rates on the subset of variants that were phased by all tools (not pruned). On this subset of variants, the switch and mismatch error rates for HapCUT2 were similar to but slightly lower than those of ProbHap (Fig. 2A). In terms of running time, RefHap and FastHare were the fastest methods on this data set, while HapCUT2 took a total of 1:09 CPU-h to phase all chromosomes (Table 2). In summary, HapCUT2 had accuracy similar to (but slightly better than) ProbHap, RefHap, and FastHare on this data set and was more accurate than HapCUT.
10X Genomics linked-read data

We also used HapCUT2 to assemble haplotypes from 10X Genomics linked-read data (Zheng et al. 2016), which is based on an idea similar to the fosmid-based dilution pool approach. The 10X Genomics technology labels short DNA fragments originating from a single long DNA fragment with barcodes inside hundreds of thousands of separate nano-scale droplets (Zheng et al. 2016). The linked reads produced can be extremely long (>100 kb). This data set has a short-read coverage of 34×, with a linked-read coverage per variant of 12× (Zook et al. 2016). For haplotype assembly, we used the same set of variant calls as for the fosmid data set and extracted haplotype fragments from the 10X aligned reads (see Methods, "Long read data sets"). On this data set, neither RefHap nor ProbHap finished haplotype assembly within the time limit. HapCUT2 was the fastest method and analyzed all chromosomes in 1:55 CPU-h (Table 2). When compared on the subset of variants that were phased by all tools, HapCUT2 had an accuracy slightly better than the next best approach (HapCUT), which took 16:50 CPU-h (Fig. 2C).
PacBio SMRT data

SMRT sequencing on the Pacific Biosciences platform generates long (2-20 kb) but error-prone (>10% indel error rate) reads. We used HapCUT2 to assemble haplotypes from 44× coverage PacBio reads (Pendleton et al. 2015). We extracted haplotype fragments from the PacBio reads that were aligned to the human reference genome (hg19), using the same set of variant calls as for the previous two data sets. On the full data set, HapCUT2 was not only the most accurate but was also significantly faster than RefHap and HapCUT (see Supplemental Fig. S3 for detailed comparisons of error rates and runtimes). We calculated the switch error and mismatch error rates on the subset of variants that were phased by all methods. HapCUT2 had a 12.4% lower switch error rate and a 2% lower mismatch rate than RefHap. RefHap took 215:53 CPU-h to phase the data set; by comparison, HapCUT2 took only 4:05 CPU-h in total. Because ProbHap was unable to complete within the time limit on the full data set, we also compared the performance of the haplotype assembly tools on a lower, 11× coverage subsample of this data set. On the subsample, HapCUT2 had the lowest switch error and mismatch error rates of the five methods (Fig. 2B). FastHare was the fastest method on this data set, and ProbHap was the slowest, taking 52:32 CPU-h (Table 2).

HapCUT2 implements likelihood-based strategies for pruning low-confidence variants to reduce mismatch errors and for splitting blocks at poor linkages to reduce switch errors (see Methods). These post-processing steps allow a user to improve the accuracy of the haplotypes at the cost of reduced completeness and contiguity. ProbHap's transition, posterior, and emission confidence scores are designed for the same purpose (Kuleshov 2014). Post-processing strategies are of particular interest for haplotype assembly with PacBio SMRT reads because the individual reads
Figure 2. Accuracy of HapCUT2 compared to four other methods for haplotype assembly on diverse whole-genome sequence data sets for NA12878. (A) Fosmid dilution pool data (Duitama et al. 2012). (B) PacBio SMRT data (11× and 44× coverage). (C) 10X Genomics linked reads. (D) Whole-genome Hi-C data (40× and 90× coverage, created with MboI enzyme). Switch and mismatch error rates were calculated across all chromosomes using the subset of variants that were phased by all methods. For each data set, only methods that produced results within 20 CPU-h per chromosome are shown.


References

Li H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997.

Rao SSP, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, et al. 2014. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159: 1665-1680.

Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, et al. 2007. The diploid genome sequence of an individual human. PLoS Biol 5: e254.

Delaneau O, Zagury J-F, Marchini J. 2013. Improved whole-chromosome phasing for disease and population genetic studies. Nat Methods 10: 5-6.

Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP. 2014. Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet 15: 121-132.