A haplotype-aware de novo assembly of related individuals using pedigree graph

Shilpa Garg 1,5, John Aach 1, Heng Li 3, Richard Durbin 4, George Church 1,5

1 Department of Genetics, Harvard Medical School, Boston, Massachusetts, USA, 2 Harvard College, Harvard University, Cambridge, Massachusetts, USA, 3 Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA, 4 Department of Genetics, University of Cambridge, Cambridge, United Kingdom, 5 Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, Massachusetts, USA
To whom correspondence should be addressed.
Associate Editor: XXXXXXX
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Abstract
Motivation: Reconstructing high-quality haplotype-resolved assemblies for related individuals of various
species has important applications in understanding Mendelian diseases along with evolutionary and
comparative genomics. Through major genomics sequencing efforts such as the Personal Genome
Project, the Vertebrate Genome Project (VGP), the Earth Biogenome Project (EBP) and the Genome in
a Bottle project (GIAB), a variety of sequencing datasets from mother-father-child trios of various diploid
species are becoming available.
Current trio assembly approaches are not designed to incorporate long-read sequencing data from parents
in a trio, and therefore require relatively high coverages of costly long-read data to produce high-quality
assemblies. Thus, building a trio-aware assembler capable of producing accurate and chromosomal-scale
diploid genomes in a pedigree, while being cost-effective in terms of sequencing costs, is a pressing need
of the genomics community.
Results: We present a novel pedigree-graph-based approach to diploid assembly using accurate Illumina
data and long-read Pacific Biosciences (PacBio) data from all related individuals, thereby generalizing
our previous work on single individuals. We demonstrate the effectiveness of our pedigree approach on
a simulated trio of pseudo-diploid yeast genomes with different heterozygosity rates, and real data from
Arabidopsis thaliana. We show that we require as little as 30× coverage Illumina data and 15× PacBio
data from each individual in a trio to generate chromosomal-scale phased assemblies. Additionally, we
show that we can detect and phase variants from generated phased assemblies.
Availability: https://github.com/shilpagarg/WHdenovo
Contact: shilpa_garg@hms.harvard.edu, gchurch@genetics.med.harvard.edu
1 Introduction
The ability to faithfully reconstruct genomes is a crucial step in better
understanding evolution and the nature of inherited disease (Tewhey
et al. (2011)). De novo genome assembly aims to address this problem
by generating complete genome sequences from error-prone sequencing
reads alone, without the use of a reference genome. Creating a de novo
assembler that resolves genomic repeats and is generalized to genomes
of varying heterozygosity rate (the average proportion of loci that differ
between homologous sequences) has posed a significant challenge to
the scientific community. Assembling diploid genomes adds further
difficulty; in order to accurately represent diploid genomes, assemblies
must correctly identify and phase (i.e. determine the correct haplotype
of) homologous sequences. One promising approach to diploid genome
assembly is incorporating sequencing information from a related set of
individuals, particularly from mother-father-child trios, and using the Mendelian information offered by the corresponding pedigree to infer the layout of alleles along homologous sequences.

Fig. 1. The Illumina data (middle) from the trio of genomes can be represented as a pedigree graph. The bubbles in the graph (bottom) show four different variants; from the left, there is an indel, an SNV, an SV, and an SNV.
Some assemblers (Levy et al., 2007; Pryszcz and Gabaldón, 2016;
Simpson and Durbin, 2012; Bankevich et al., 2012; Li, 2015) tackle single-
individual assembly using Next-Generation Sequencing data; however,
while accurate, the short length of NGS reads often leads assemblies to
fragment at repetitive and highly-heterozygous regions. Other assemblers
(Koren et al., 2017; Vaser et al., 2017; Berlin et al., 2015; Chin et al.,
2013) utilize longer third-generation sequencing reads to obtain more contiguous sequences, and to help resolve repeats and heterozygous regions. Yet, these require high coverage to compensate for the high error rate of long-read data, which makes them costly. Hybrid assemblers utilize both types
of reads, taking advantage of the accuracy of short reads and the scale of
long reads to generate complete, high-quality assemblies (Bashir et al.,
2012; Antipov et al., 2015; Zimin et al., 2017).
However, the methods described above collapse differences between homologous pairs into a single consensus sequence, without regard for the rich information given by the layout of different alleles along the two DNA strands (Simpson and Pop, 2015). By contrast, other assemblers (Chin et al., 2016; Garg et al., 2018; Weisenfeld et al., 2017) have been developed to generate haplotigs, i.e. haplotype-resolved assemblies for diploid genomes.
Certain assemblers have also been developed to incorporate pedigree information into the assembly process, specifically for the case of mother-father-child trios. For example, trio-sga creates haplotypes of
the child based on parental Illumina data (Malinsky et al., 2016), while
TrioCanu uses such data to partition child long-read data and subsequently
assembles the partitioned reads separately (Koren et al., 2018). Yet, these
two methods cannot phase variants which are heterozygous in all three
individuals in a trio, and by relying solely on parental Illumina data,
may not correctly haplotype long reads which cover repetitive genomic
regions. Furthermore, TrioCanu does not work properly at low coverages
of long-read data.
Another method approaches the diploid assembly problem by aligning
long-read data from all individuals in a pedigree to a reference genome,
then finding the most likely partitioning of reads as determined by the
PedMEC problem (Garg et al., 2016). Essentially, the PedMEC (Weighted
Minimum Error Correction on Pedigrees) problem finds the partitioning of
reads from related individuals that incurs the least cost, which is calculated
based on the likelihood of errors occurring at various locations along the
reads as well as recombination costs between each site of heterozygosity.
This method, however, only concerns bi-allelic variants, and suffers from reference bias, meaning that unique DNA sequences may not be detected or phased correctly.

Fig. 2. Input: the pedigree graph (top), consisting of four SNVs and two SVs, and the PacBio reads (gray) with their respective alleles (white digits). Output: the final haplotigs (crimson) for each individual in a trio. Our method can phase variants that are heterozygous in all individuals (SNV1) as well as any variant covered by at least one read from any individual (SV2 in the child), resulting in continuous and complete haplotigs.
Thus, developing a haplotype-resolved de novo assembly approach for
related individuals which is cost-effective, flexible with regard to genomic
complexity and heterozygosity rate, and which does not contain reference
bias, is a pressing need for the genomics community.
Contributions. Our graph-based method, implemented as a new tool
WHdenovo, performs phasing in the space of a pedigree graph (defined
below), and is generalized to assemble genomes of varying heterozygosity
rate and with multi-allelic variants, thus allowing for the creation of
accurate, complete haplotigs.
More precisely, our approach builds a pedigree graph, or joint sequence graph, using combined Illumina data from all individuals in a pedigree; the graph represents heterozygous locations as bubbles, as shown in Figure 1.
Given a pedigree graph containing a series of bubbles, long-read (PacBio)
data from each individual are threaded through it; in essence, the most
probable paths that the long reads trace through the bubbles in our pedigree
graph, and which obey the Mendelian constraints imposed by the pedigree,
represent our true haplotypes. We draw key concepts from the graph-based
assembly approach for single individuals described in Garg et al. (2018)
and the PedMEC formulation set forth in Garg et al. (2016), yet synthesize
and extend them to overcome their respective shortcomings.
Our graph-based approach offers several advantages. For example, we can detect and phase all types of small and large variants, and we require relatively low coverage of costly long-read data. We can also phase variants
that are heterozygous in all individuals; for example, SNV1 from Figure
2. Moreover, by incorporating hybrid data from all related individuals, we
can effectively phase reads in repetitive genomic regions, and as shown by
SV2 in Figure 2, if parental reads span a variant but child reads do not, we
can still correctly identify and phase the variant in all three individuals.
Assemblers such as TrioCanu would not be able to phase variants under these circumstances, and may be forced to break assembly contiguity.
To demonstrate the practical effectiveness of our method of haplotype-
aware de novo assembly for related individuals, we assemble two sets of
genomes. First, we haplotype the genomes of a simulated trio of pseudo-
diploid yeast, which allows us to comprehensively study assembly at
varying read coverage and heterozygosity rates. Then, we use real data to
assemble the complete diploid genome of a trio of Arabidopsis thaliana.
These results indicate that our hybrid method is adaptable to genomes of varying heterozygosity rates. We demonstrate that our method is cost-effective, requiring only 30× short-read coverage and 15× long-read coverage for every individual in a pedigree to generate near chromosomal-scale assemblies for all individuals. Moreover, at these coverages, we show that our assemblies for both real and simulated data are more accurate and contiguous than those produced by TrioCanu at 45× child long-read coverage. In a final experiment, we also show that we can detect and phase variants.

Fig. 3. Overview of the pedigree-aware phased assembly pipeline. After we generate a pedigree graph using Illumina data, we align PacBio reads from all related individuals in the pedigree to the graph. Using these alignments, we detect ordered sequences of heterozygous regions, represented as bubble chains. We then find the best partitioning of each set of reads into haplotypes based on the paths taken through the bubble chains, and assemble the partitioned reads.
2 Pedigree-aware phased assembly pipeline
In this section, we present the workflow of our pipeline, which takes as input raw Illumina and PacBio sequencing data from all related individuals in a pedigree, and outputs final, polished haplotigs. Once we create a
pedigree sequence graph, our goal is to find the walks through this graph
that correspond to the true haplotypes of all related individuals. These
haplotype paths will encode the phasing of all variants in the haplotigs
for each individual in the pedigree. Due to errors in the Illumina data and genomic characteristics such as repeats, there are inevitably multiple paths
through this graph that do not correspond to true haplotype paths. Thus,
to construct the true haplotype paths, we seek the maximally likely paths
based on confidence scores of how the nodes are connected to each other
over long distances, which we determine using PacBio reads aligned to
our graph.
This pipeline generalizes our previous single-individual approach (Garg et al., 2018) to related individuals, yielding chromosome-scale haplotigs. Figure 3 represents a conceptual workflow of our pipeline,
detailed below.
Pedigree Graph. We use short, accurate Illumina reads from all related individuals as the basis for generating a pedigree graph. We denote the bidirected pedigree graph as $G_p$, containing a set of nodes $N_p$ and a set of edges $E_p$. Conceptually, each node $n_i \in N_p$ represents a segment of DNA found in the Illumina data, and the node $n_i'$ represents its reverse complement. Note that because nodes are generated using the combined reads of all individuals in a pedigree, not every node sequence is necessarily present in the genome of every individual. Additionally, nodes can be traversed in either direction; when a node is traversed in the reverse direction, its sequence is considered reverse-complemented. Every edge $e_{ij} \in E_p$ represents an adjacency between the sequences represented by nodes $n_i$ and $n_j$.

Fig. 4. A toy example of our algorithm to find bubble chains. Shown is the joint sequence graph, containing nodes and bubbles, with many reads spanning them. In the filtration step, n3 is filtered out because it is covered by only a single read, and n7 is removed because it branches (has degree > 2). Unitigs U1-U4 are generated by applying DFS to the pairs of edges and bubbles whose connections are covered by at least 5 reads. Because there is a read that spans U1 and U2, and a read that spans U2 and U3, we can combine these smaller unitigs into the larger final bubble chain C1.
The pedigree graph $G_p$ contains a set of bubbles $L$ that represent heterozygous variants (or sequencing errors). The terminology bubble is drawn from Garg et al. (2018) and follows from the ultrabubble coined by Paten et al. (2017). Graphically speaking, bubbles are directed and acyclic, biconnected, and minimal, as discussed by Garg et al. (2018). We define each bubble $l_k$ to be the set of allele paths contained within it, where each allele path is a unique sequence of nodes spanning a common start and end node. Intuitively, each bubble is bookended by common sequences (the start and end nodes), and the unique node sequences connecting them represent the alleles gleaned from our Illumina data. Figure 1 shows bubbles representing various types of variants.
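For illustration, a minimal sketch of how such a graph and its bubbles might be represented in code is shown below. The class names, fields, and integer node identifiers are our own illustrative choices and are not taken from the WHdenovo implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# An oriented node: a node identifier plus a strand flag
# (False = forward sequence, True = reverse complement).
OrientedNode = Tuple[int, bool]

@dataclass
class PedigreeGraph:
    """Bidirected sequence graph G_p built from the pooled Illumina reads."""
    sequences: Dict[int, str] = field(default_factory=dict)                    # node id -> DNA segment
    edges: Set[Tuple[OrientedNode, OrientedNode]] = field(default_factory=set)

    def add_node(self, node_id: int, seq: str) -> None:
        self.sequences[node_id] = seq

    def add_edge(self, a: OrientedNode, b: OrientedNode) -> None:
        # An adjacency e_ij between two (possibly reverse-complemented) node sequences.
        self.edges.add((a, b))

    def node_seq(self, node: OrientedNode) -> str:
        """Sequence of a node in the requested orientation."""
        node_id, reverse = node
        seq = self.sequences[node_id]
        if not reverse:
            return seq
        comp = str.maketrans("ACGTacgt", "TGCAtgca")
        return seq.translate(comp)[::-1]

@dataclass
class Bubble:
    """A heterozygous site: alternative allele paths between a shared start and end node."""
    start: OrientedNode
    end: OrientedNode
    allele_paths: List[List[OrientedNode]]   # one entry per allele, numbered 0, 1, 2, ...
```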
PacBio Alignments. We align PacBio long reads from all individuals to produce paths through the pedigree graph $G_p$, in a similar process to Garg et al. (2018). The concept of aligning a PacBio read to $G_p$ captures the idea that the sequence of a single PacBio read can trace a path through the sequences contained in many nodes. Thus, for a given PacBio read, we define a read alignment $r_i$ to be a path through $G_p$, defined by the oriented nodes $n_1 \ldots n_k$ which map to that read. Given a set of individuals $I$, we align the PacBio reads from every individual $i \in I$ to $G_p$, resulting in a set of read alignments $R_i = \{r_1, r_2, \ldots, r_j\}$ for every individual. As an intuitive checkpoint, note that as PacBio reads from different individuals trace different paths through the nodes and bubbles in $G_p$, we gain information not only about the correct ordering of nodes to form the genome in question, but also about the alleles and phasing present in each individual.
Bubble Chains. To consolidate the information gained from the PacBio alignments, we need to formalize the process of finding ordered bubble chains, which represent the layout of heterozygous sites across the genome. Ideally, we would be able to use our PacBio alignments to obtain $C$, a set of bubble chains containing one continuous bubble chain corresponding to each chromosome. Before generating bubble chains, we perform a series of filtration steps to remove erroneous bubbles, according to the algorithm described in the following paragraph.

On input $R$, the set of all read alignments $R_i$ for individuals $i \in I$, we project all partial alignments to bubble space; that is, for every instance where a read traverses a bubble, we replace the corresponding set of nodes by the appropriate bubble ID. We then perform our filtration steps: for every pair of adjacent nodes or bubbles from every alignment path in $R$, denoted $(x, y) \in R$, we compute the coverage $c_{(x,y)}$, and if $c_{(x,y)} < 5$, we remove the pair. Next, we calculate the degree of every remaining node $x$, and if $\deg(x) \geq 3$, we remove all pairs containing $x$. Using the resulting filtered pairs, we then perform DFS to find $U$, an initial set of unambiguous bubble chains termed unitigs. Finally, if there is at least one read connecting a pair of unitigs in $U$, we can infer the ordering between these initial unitigs. We record the resulting orderings as $C$, the set of final bubble chains. Figure 4 demonstrates the algorithm with a basic example, and a simplified code sketch of the filtration and unitig construction is given below.
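The following sketch is a simplified, illustrative re-implementation of the filtration and unitig steps described above, not the WHdenovo code itself. The coverage threshold (at least 5 reads) and the branching criterion (degree >= 3) follow the text; the function names, the integer element identifiers, and the endpoint-walking used in place of a full DFS are our own assumptions, and the final merging of unitigs into chains by spanning reads is omitted.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence

def project_to_bubble_space(path: Sequence[int], node_to_bubble: Dict[int, int]) -> List[int]:
    """Replace every node that lies inside a bubble by that bubble's ID, collapsing repeats."""
    projected: List[int] = []
    for element in path:
        element = node_to_bubble.get(element, element)
        if not projected or projected[-1] != element:
            projected.append(element)
    return projected

def build_unitigs(paths: List[List[int]], min_cov: int = 5) -> List[List[int]]:
    """Filter adjacencies by coverage and branching, then walk out maximal simple paths."""
    # Coverage of adjacent (node or bubble) pairs over all projected alignments.
    pair_cov = Counter()
    for path in paths:
        for x, y in zip(path, path[1:]):
            if x != y:
                pair_cov[frozenset((x, y))] += 1
    kept = [pair for pair, cov in pair_cov.items() if cov >= min_cov]

    # Build an adjacency map, then drop every element that branches (degree >= 3).
    adj = defaultdict(set)
    for pair in kept:
        x, y = tuple(pair)
        adj[x].add(y)
        adj[y].add(x)
    branching = {v for v, nb in adj.items() if len(nb) >= 3}
    for v in branching:
        for u in adj[v]:
            adj[u].discard(v)
        adj[v].clear()

    # Every remaining element has degree <= 2, so each connected component is a
    # simple path; walking from an endpoint yields one unitig per component.
    unitigs, seen = [], set()
    for v in adj:
        if v in seen or len(adj[v]) != 1:   # start only at path endpoints
            continue
        unitig, prev = [v], None
        seen.add(v)
        while True:
            nxt = [u for u in adj[unitig[-1]] if u != prev and u not in seen]
            if not nxt:
                break
            prev = unitig[-1]
            unitig.append(nxt[0])
            seen.add(nxt[0])
        unitigs.append(unitig)
    return unitigs
```

In the final step, which is not shown, unitigs that are connected by at least one spanning read would be concatenated to form the bubble chains in $C$, as in the example of Figure 4.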
Fig. 5. For a subgraph of $G_p$, the example shows three bubbles $l_1$, $l_2$, and $l_3$, and their corresponding alleles. Reads from mother, father, and child traverse the bubbles.
Graph-based phasing on pedigrees. We introduce the graph version of the Weighted Minimum Error Correction on Pedigrees problem, or gPedMEC, as our central phasing algorithm. gPedMEC relates to the PedMEC formulation set forth by Garg et al. (2016), which concerns alignment to a reference genome and applies only to bi-allelic variants. Considering multi-allelic variants in the gPedMEC framework requires additional phasing-related observations. Specifically, bubbles which represent bi-allelic variants ensure that a child's haplotype paths must be the same as a combination of parental haplotype paths. By contrast, in bubbles containing multi-allelic variants, it is possible for child haplotype paths to differ from parental ones; this can occur when the variant encoded by the bubble is long and a new allele is created by recombination within the variant itself. Based on these insights, we create a new algorithm to determine the haplotypes of any set of related individuals in a more flexible, accurate, and representative manner.

Ultimately, the goal of solving gPedMEC is to recreate the haplotype paths of every individual through the bubble chains we calculated in $C$. The haplotype paths for a given individual are determined by deducing the two paths through the bubble chains which incur the least cost (i.e., are most likely), where cost is determined by confidence in alignment paths and by recombination costs. Doing so inherently also computes the long-read partitionings.
In order to represent the paths taken through our bubble chains for each set of alignments $R_i$, we create bubble matrices $F_i \in \{0, 1, \ldots, e, -\}^{|R_i| \times M}$ for each individual $i$. Here, $e$ is the maximum number of alleles contained in any bubble, and $M$ is the number of bubbles in a chain. Figure 5 shows a sample 3-bubble chain with PacBio reads from each individual aligned to it; the corresponding bubble matrices for the three individuals are shown below:

$$
F_1 = \begin{array}{c|ccc}
      & l_1 & l_2 & l_3 \\ \hline
r_1   & 1 & 0 & 0 \\
r_2   & 0 & 0 & 0 \\
r_3   & 1 & 0 & 1 \\
r_4   & 1 & 0 & - \\
\end{array}
\qquad
F_2 = \begin{array}{c|ccc}
      & l_1 & l_2 & l_3 \\ \hline
r_5   & 2 & 1 & 1 \\
r_6   & 2 & 1 & 1 \\
r_7   & 2 & 0 & 0 \\
r_8   & 2 & 0 & 0 \\
\end{array}
\qquad
F_3 = \begin{array}{c|ccc}
      & l_1 & l_2 & l_3 \\ \hline
r_9    & 0 & 0 & 0 \\
r_{10} & 2 & 1 & 1 \\
r_{11} & 0 & 1 & 0 \\
r_{12} & - & 1 & 1 \\
\end{array}
\tag{1}
$$
If all reads in a bubble matrix $F_i$ were to follow exactly one of two paths through a bubble chain, solving for the true haplotype paths for the corresponding individual, $h_i^0, h_i^1 \in \{0, 1, \ldots, e, -\}^M$, would be trivial. For example, $F_2$ is conflict-free, and the corresponding haplotype paths are self-evident. However, due to long-read sequencing errors, conflicts within bubble matrices are inevitable. Thus, our goal is to find the optimal set of matrix entries which can be flipped in order to create conflict-free matrices containing a bipartition of reads which follow the two haplotype paths of each individual (and which obey the Mendelian constraints of the pedigree). So, for every individual, we also need to introduce a weight matrix $W_i \in \mathbb{N}^{|R_i| \times M \times e}$, where each entry $W_i(j, k, x)$ represents the cost of flipping read $j$, at bubble $l_k$, to allele $x$, where $k \in \{0, 1, \ldots, M\}$ and $x \in \{0, 1, \ldots, |l_k|\}$. An example weight matrix, corresponding to $F_1$, is shown below:

$$
W_1 = \begin{array}{c|ccc}
      & l_1 & l_2 & l_3 \\ \hline
r_1   & [2, 0, 10]  & [0, 1]  & [0, 10] \\
r_2   & [0, 10, 10] & [0, 2]  & [0, 5]  \\
r_3   & [12, 0, 5]  & [0, 1]  & [4, 10] \\
r_4   & [7, 0, 8]   & [0, 10] & -       \\
\end{array}
\tag{2}
$$

Other weight matrices can be written similarly.
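As a concrete illustration of how $F_i$ and $W_i$ interact, the sketch below encodes $F_1$ and $W_1$ from Equations (1) and (2) and computes the cost of flipping a set of reads to a single allele at one bubble (the quantity written $w^d_{k,C}$ further below). The data layout and function name are illustrative assumptions, not the WHdenovo representation.

```python
from typing import List, Optional, Sequence

# F_1 from Equation (1): rows are reads r1..r4, columns are bubbles l1..l3.
# None marks a bubble not covered by the read ('-' in the matrix).
F1: List[List[Optional[int]]] = [
    [1, 0, 0],
    [0, 0, 0],
    [1, 0, 1],
    [1, 0, None],
]

# W_1 from Equation (2): W1[j][k][x] is the cost of flipping read j to allele x at bubble k.
W1 = [
    [[2, 0, 10], [0, 1], [0, 10]],
    [[0, 10, 10], [0, 2], [0, 5]],
    [[12, 0, 5], [0, 1], [4, 10]],
    [[7, 0, 8], [0, 10], None],
]

def flip_cost(F, W, reads: Sequence[int], k: int, allele: int) -> int:
    """Cost of flipping every read in `reads` to `allele` at bubble k; reads that already
    carry that allele, or that do not cover the bubble, contribute nothing."""
    total = 0
    for j in reads:
        observed = F[j][k]
        if observed is None or observed == allele:
            continue
        total += W[j][k][allele]
    return total

# Flipping reads r1, r2, r3 to allele 0 at bubble l1 costs 2 + 0 + 12 = 14.
print(flip_cost(F1, W1, [0, 1, 2], k=0, allele=0))
```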
We need to account for the Mendelian constraints imposed by the pedigree as well. Thus, we introduce the transmission vectors $\vec{t}_m, \vec{t}_f \in \{0, 1, na\}^M$ for each triple $(m, f, c) \in T$ to denote the alleles passed from each parent to the child at every location in a bubble chain, where $\vec{t}_m(k) = 1$ if the mother passes on an allele from $h_m^1$, and so on. Under the circumstance that recombination occurs within a bubble, in which case the value na is passed on, no haplotype of any parent is directly transmitted to the child; instead, the recombined sequences form allele paths in the bubble.

Additionally, we introduce recombination costs $X \in \mathbb{N}^M$, where $X(k)$ denotes the cost between the bubbles of indices $k-1$ and $k$. For example, if a transmission vector $\vec{t}_f$ passes on an allele from $h_f^1$ at location $k-1$ and an allele from $h_f^0$ at $k$, this incurs a recombination cost of $X(k)$. However, when a transmission vector $\vec{t}_m$ passes on na at location $k-1$ and an allele from $h_m^0$ at bubble $k$, this does not incur any recombination cost.
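The cost contribution of a pair of transmission vectors follows directly from this definition; a minimal sketch, with illustrative names and with na encoded as None, is given below.

```python
from typing import Optional, Sequence

def recombination_cost(t_mother: Sequence[Optional[int]],
                       t_father: Sequence[Optional[int]],
                       X: Sequence[int]) -> int:
    """Total recombination cost of a pair of transmission vectors over one bubble chain.

    A cost X[k] is paid whenever a parent transmits haplotype 0 at bubble k-1 and
    haplotype 1 at bubble k (or vice versa); transitions into or out of na are free.
    """
    total = 0
    for t in (t_mother, t_father):
        for k in range(1, len(t)):
            if t[k - 1] is not None and t[k] is not None and t[k - 1] != t[k]:
                total += X[k]
    return total

# The father switches from haplotype 1 to haplotype 0 between bubbles 1 and 2,
# so only X[2] = 3 is paid; the mother's na transitions are free.
t_m = [0, None, 0, 0]
t_f = [1, 1, 0, 0]
X = [0, 2, 3, 2]
print(recombination_cost(t_m, t_f, X))   # 3
```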
Having defined all necessary terms, we can now more clearly define the gPedMEC problem. Consider a pedigree graph $G_p$ containing bubble chains $C$ and recombination costs $X$, from a pedigree with individuals $I$ and relationships $T$; each individual has corresponding PacBio read alignments $R_i$, and matrices $F_i$ and $W_i$. The crux of gPedMEC is then to determine the set of all bubble matrix entries to be flipped which accrues the minimal cost, based on (a) the weight of flipping these entries coupled with (b) the recombination cost incurred by the transmission vectors implied by the resulting child haplotype paths. Once these matrix entries are determined, the haplotypes $h_i^0$ and $h_i^1$ for all individuals become self-evident.
Dynamic Programming to Solve gPedMEC. In the following section, we discuss our dynamic programming (DP) algorithm to solve gPedMEC, which generalizes that of Garg et al. (2016), to which we refer the reader for additional information about the algorithm. The motivation behind using a dynamic programming algorithm is to utilize a DP table to determine the optimal haplotype paths more efficiently than with a brute-force algorithm, which would require exponential time with respect to the number of bubbles and alignments.

Fig. 6. The process of our DP algorithm. The top row shows the three input bubble matrices from mother, father, and child, with their weights (a). The following rows show sample processes for DP column initialization and recurrence. In block (c), sample bipartitions and transmission values are shown, with the respective calculations in block (d). The cost calculation finds the assignment of allele pairs to the example set of partitions which accrues the minimum cost; for example, computing the cost of assigning allele 0 to the green partition yields a cost of 2. The other costs can be computed similarly. Block (f) shows the recurrence step, which minimizes over the values of compatible bipartitions in the previous column, adds the initialization cost, and allows various possibilities of recombination. This process is repeated for all possible transmission vectors and compatible partitions until the last column. Figure adapted from Garg (2018).
DP cell initialization. As in Garg et al. (2016), we proceed by first computing the initial DP cell cost $\Delta C(k, B, t)$ accrued by flipping entries in column $k$ of each bubble matrix, where $k$ is the index of bubble $l_k$, $B$ is a bipartition of reads, and $t \in \{0, 1, na\}^2$ is a transmission tuple indicating which haplotype from the mother and the father, respectively, is inherited, if applicable.

Intuitively, there is a relationship between a bipartition $B$, a transmission tuple $t$, and the resulting set of readsets induced at location $k$, termed $S(k, B, t)$. For example, if the transmission tuple does not contain any na values, then the child's haplotype path must be a combination of parental haplotypes; correspondingly, the reads in the child's bubble matrix can be merged with bipartitions in the parental bubble matrices due to the identity-by-descent (IBD) condition. In the case of a trio, this leads to two resulting bipartition readsets. In the presence of na values, the child's reads are not merged with parental reads, leading to three bipartition readsets in the case of a trio.

For a given bubble $l_k$, each bipartition can be assigned any pair of alleles $(x, y)$, of which there are $\binom{|l_k|}{2} + |l_k|$ (the added $|l_k|$ accounts for the possibility of assigning the same allele to both sides). To notate this, we let $W^d_{k,R}$ represent the cost of flipping the reads of a specific bipartition readset $R \in S(k, B, t)$ to an allele pair $d = (x, y) \in A$, where $A$ is the set of allele pairs $\{(x, y) \in l_k \times l_k \mid x \leq y\}$:

$$W^d_{k,R} = \min\{\, w^x_{k,X} + w^y_{k,Y},\ \ w^x_{k,Y} + w^y_{k,X} \,\}$$
where $X$ and $Y$ are the readsets of the bipartition readset $R$. The cost of flipping all reads in a readset $C$ to a given allele $d$ can be computed as:

$$w^d_{k,C} = \sum_{(i,j) \in C} \mathbf{1}[\, F_i(j, k) \neq d \,] \cdot W_i(j, k, d)$$

Thus, to initialize the cost $\Delta C(k, B, t)$, we need to find the assignment of allele pairs to all bipartition readsets which incurs the minimal flipping cost. This can be formalized in the following way:

$$\Delta C(k, B, t) = \min_{a \in A^{S(k,B,t)}} \sum_{R \in S(k,B,t)} W^{a(R)}_{k,R} \tag{3}$$
The sum accumulates the allele-pair assignment cost over all bipartition readsets in $S(k, B, t)$, and the outer minimization selects the cheapest assignment $a$. The calculations in Figure 6 provide a more concrete example of calculating the DP cell initialization cost. In this example, we assume the sample partition of reads and the child partition implied by the transmission value shown in Figure 6(c), where the mother passes on the green allele and the father passes on the blue. To calculate the DP cell initialization cost in (d), we find the assignment of allele pairs to each of the bipartition readsets which incurs the lowest cost. For example, the cost associated with assigning the allele pairs (0,2) and (0,1) to the bipartitions (green, purple) and (orange, blue), respectively, is (2) + (5+8) + (4+1) + (3+4+4) = 29, because the minimum cost of flipping all reads in the green partition to allele 0 is 2, that of flipping the purple partition to allele 2 is (5+8) = 13, and so on. In this way, we can compute the costs for other allele-pair assignments and store the minimum in the DP cell.
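A brute-force rendering of Equation 3 for a single column is sketched below: it enumerates every assignment of allele pairs to the bipartition readsets in $S(k, B, t)$ and keeps the cheapest. The data layout (each read as a pair of allele and cost rows) and the function names are illustrative assumptions; the actual implementation avoids this full enumeration.

```python
from itertools import combinations_with_replacement, product
from typing import Optional, Sequence, Tuple

# A read is represented by its row of F (alleles per bubble, None = not covered)
# and its row of W (per-bubble lists of flipping costs).
Read = Tuple[Sequence[Optional[int]], Sequence[Optional[Sequence[int]]]]
Readset = Tuple[Sequence[Read], Sequence[Read]]   # the two halves X, Y of a bipartition readset

def flip_cost(reads: Sequence[Read], k: int, allele: int) -> int:
    """w^allele_{k,reads}: cost of flipping every read in the set to `allele` at bubble k."""
    total = 0
    for alleles, costs in reads:
        if alleles[k] is None or alleles[k] == allele:
            continue
        total += costs[k][allele]
    return total

def pair_cost(readset: Readset, k: int, d: Tuple[int, int]) -> int:
    """W^d_{k,R}: cheapest way to assign the allele pair d to the two halves of R."""
    X, Y = readset
    x, y = d
    return min(flip_cost(X, k, x) + flip_cost(Y, k, y),
               flip_cost(X, k, y) + flip_cost(Y, k, x))

def cell_init_cost(readsets: Sequence[Readset], k: int, n_alleles: int) -> int:
    """Delta C(k, B, t): cheapest assignment of allele pairs to all bipartition readsets."""
    allele_pairs = list(combinations_with_replacement(range(n_alleles), 2))  # all (x, y), x <= y
    best = float("inf")
    for assignment in product(allele_pairs, repeat=len(readsets)):
        cost = sum(pair_cost(R, k, d) for R, d in zip(readsets, assignment))
        best = min(best, cost)
    return 0 if best == float("inf") else int(best)
```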
DP column initialization. Every entry $C(1, B, t)$ in the first column of the DP table is filled with the corresponding cell initialization cost $\Delta C(1, B, t)$ for all bipartitions $B$ and transmissions $t$. Thus, on input the set of reads covering $l_1$, each $C(1, B, t)$ is calculated according to Equation 3.

As described in Patterson et al. (2014) and Garg et al. (2016), we enumerate over bipartitions in Gray code order, thus ensuring a runtime of $O\!\left(\left(\binom{|l_k|}{2} + |l_k|\right)^{|I| - |T|} \cdot 2^{|F(k)| + 2 \cdot |T|}\right)$.
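Enumerating bipartitions in Gray code order means that consecutive bipartitions differ by moving exactly one read between the two halves, so the cost of each bipartition can be obtained from the previous one by a constant-size update rather than being recomputed from scratch. A minimal, illustrative sketch of the enumeration itself (not of the incremental cost update) is shown below; the function name is our own.

```python
from typing import Iterator, List, Sequence, Tuple

def gray_code_bipartitions(reads: Sequence[str]) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield all 2^n bipartitions of `reads` such that consecutive bipartitions
    differ by moving a single read from one half to the other."""
    n = len(reads)
    for i in range(2 ** n):
        code = i ^ (i >> 1)   # binary-reflected Gray code: adjacent codes differ in one bit
        left = [r for j, r in enumerate(reads) if not (code >> j) & 1]
        right = [r for j, r in enumerate(reads) if (code >> j) & 1]
        yield left, right

for left, right in gray_code_bipartitions(["r1", "r2", "r3"]):
    print(left, right)
```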
DP recurrence. In the recurrence step, we compute $C(k, B, t)$ at column $k$, which intuitively represents the optimal allele-pair assignment cost to bipartitions $B$ up to column $k$. In general, the cell $C(k, B, t)$ can be computed as the optimal accumulated cost of the bipartitions up to column $k-1$, plus $\Delta C(k, B, t)$, under various possibilities of recombination. Thus, we add $\Delta C(k, B, t)$ to values from column $k-1$, where possible recombination costs are incurred according to the various values of $t$. The additional constraint is that only those entries in column $k-1$ whose bipartitions are compatible with $B$ are considered, where two bipartitions are deemed compatible if they place the reads they share into the same readsets. By only considering compatible bipartitions, we ensure that a single PacBio read cannot come from more than one haplotype. For two bipartitions $B$ and $B'$, compatibility is denoted as $B \simeq B'$.

In more formal terms, the value of $C(k, B, t)$ can be written as follows:

$$C(k, B, t) = \Delta C(k, B, t) + \min_{\substack{B' \in \mathcal{B}(F(k-1)) \,:\, B' \simeq B \\ t' \in \{0, 1, na\}^{2|T|}}} \Big( C(k-1, B', t') + d_H(t, t') \cdot X(k) \Big) \tag{4}$$

In this equation, the term $d_H(t, t')$ refers to the distance, or number of changes, between two transmission vectors, and thereby represents the number of recombination events incurred between bubbles $k-1$ and $k$.
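For completeness, the recurrence of Equation 4 can be written as a straightforward table update. In the sketch below, the per-column cost $\Delta C$, the compatibility test, and the candidate bipartitions and transmission tuples per column are supplied as parameters; the names and data layout are illustrative assumptions, and the real implementation prunes this enumeration much more aggressively.

```python
from itertools import product
from typing import Callable, Dict, Hashable, List, Optional, Sequence, Tuple

Bipartition = Hashable                       # any hashable encoding of a bipartition
Transmission = Tuple[Optional[int], ...]     # one entry per parent; None stands for na

def transmission_distance(t: Transmission, t_prev: Transmission) -> int:
    """d_H(t, t'): number of parents whose transmitted haplotype switches (na transitions are free)."""
    return sum(1 for a, b in zip(t, t_prev)
               if a is not None and b is not None and a != b)

def solve_gpedmec(
    columns: Sequence[Tuple[List[Bipartition], List[Transmission]]],   # candidates per bubble
    delta_cost: Callable[[int, Bipartition, Transmission], int],       # Delta C(k, B, t)
    compatible: Callable[[Bipartition, Bipartition], bool],            # B ~ B'
    X: Sequence[int],                                                  # recombination costs
) -> float:
    """Minimum total gPedMEC cost over all bipartition and transmission choices."""
    # Column 0: initialization, C(0, B, t) = Delta C(0, B, t).
    biparts, transmissions = columns[0]
    C: Dict[Tuple[Bipartition, Transmission], float] = {
        (B, t): delta_cost(0, B, t) for B, t in product(biparts, transmissions)
    }
    # Recurrence over the remaining columns (Equation 4).
    for k in range(1, len(columns)):
        biparts, transmissions = columns[k]
        C_next: Dict[Tuple[Bipartition, Transmission], float] = {}
        for B, t in product(biparts, transmissions):
            best_prev = min(
                (cost + transmission_distance(t, t_prev) * X[k]
                 for (B_prev, t_prev), cost in C.items()
                 if compatible(B, B_prev)),
                default=float("inf"),
            )
            C_next[(B, t)] = delta_cost(k, B, t) + best_prev
        C = C_next
    return min(C.values())
```

The optimal haplotype paths and read partitionings would then be recovered by backtracking through the table, as is standard for such DP formulations.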

Frequently Asked Questions (15)
Q1. What have the authors contributed in "A haplotype-aware de novo assembly of related individuals using pedigree graph" ?

Through major genomics sequencing efforts such as the Personal Genome Project, the Vertebrate Genome Project ( VGP ), the Earth Biogenome Project ( EBP ) and the Genome in a Bottle project ( GIAB ), a variety of sequencing datasets from mother-father-child trios of various diploid species are becoming available. The authors present a novel pedigree-graph-based approach to diploid assembly using accurate Illumina data and long-read Pacific Biosciences ( PacBio ) data from all related individuals, thereby generalizing their previous work on single individuals. The authors demonstrate the effectiveness of their pedigree approach on a simulated trio of pseudo-diploid yeast genomes with different heterozygosity rates, and real data from Arabidopsis Thaliana. The authors show that they require as little as 30× coverage Illumina data and 15× PacBio data from each individual in a trio to generate chromosomal-scale phased assemblies. Additionally, the authors show that they can detect and phase variants from generated phased assemblies. 

One restriction in their model is the use of constant recombination rates ; the authors aim to fine tune this parameter in the future according to genomic distances, and properly incorporate recombination hotspots. Their framework, in principle, is generalized to incorporate any variety of datasets ; in the future, the authors hope to optimize their method by incorporating data such as chromatin conformation capture ( Burton et al., 2013 ) and linked read sequencing ( Weisenfeld et al., 2017 ). 

The authors evaluate the child’s predicted haplotype-resolved assemblies by aligning their predicted assemblies to the true simulated genome in their yeast-based experiments, and to TrioCanu’s published assemblies of A. thaliana when handling real data. 

The motivation behind using a dynamic programming algorithm is to utilize a DP table to determine the optimal haplotype paths more efficiently than with a brute-force algorithm, which would require exponential time with respect to the number of bubbles and alignments. 

developing a haplotype-resolved de novo assembly approach for related individuals which is cost-effective, flexible with regard to genomic complexity and heterozygosity rate, and which does not contain reference bias, is a pressing need for the genomics community. 

The authors will explore heuristic approaches to perform polyploidy phasing in an efficient manner, and will aim to use a joint phasing framework to obtain more contiguous diploid genome assemblies. 

using VG (Garrison et al., 2017), the authors converted the assembly graph to a bluntified sequence graph—that is, with redundant node sequences removed. 

to construct the true haplotype paths, the authors seek the maximally likely paths based on confidence scores of how the nodes are connected to each other over long distances, which the authors determine using PacBio reads aligned to their graph. 

Due to the lack of sufficient parental long-read data to pursue this goal (which major sequencing efforts will likely produce in the near future), the authors primarily considered simulated data for their comprehensive study of assembly behavior at varying heterozygosity rates and long-read coverage. 

The TrioCanu method (Koren et al., 2018) is a hybrid approach that takes advantage of parental Illumina data and long reads from the child in a trio; yet, it has the limitation of not phasing variants that are heterozygous in all individuals. 

From Figure 7, the authors observe that the number of predicted SNVs or short indels rises in response to increasing heterozygosity rate; for example, 58,541 and 178,743 for genomes with heterozygosity rates of 0.5% and 1.5%, respectively. 

In their approach, the authors observe that as the authors increase the long read coverage from 5× to 15× for each individual, the average identity of the haplotigs increases from 99.3% to 99.9%. 

In measuring partitioning accuracy of long reads, the authors considered reads to be classified only if covering a fixed threshold of bubbles. 

For real data from Arabidopsis thaliana, the authors can produce complete assemblies of 119 Mb at 15× coverage for each individual in a trio. 

Once the authors obtain the partitions of long-read data for each individual, the authors can perform haplotype-aware error-correction on the reads.