A haplotype-aware de novo assembly of related individuals using pedigree graph

Shilpa Garg 1,5, John Aach 1, Heng Li 3, Richard Durbin 4, George Church 1,5

1 Department of Genetics, Harvard Medical School, Boston, Massachusetts, USA, 2 Harvard College, Harvard University, Cambridge, Massachusetts, USA, 3 Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA, 4 Department of Genetics, University of Cambridge, Cambridge, United Kingdom, 5 Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, Massachusetts, USA
To whom correspondence should be addressed.
Associate Editor: XXXXXXX
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Abstract
Motivation: Reconstructing high-quality haplotype-resolved assemblies for related individuals of various
species has important applications in understanding Mendelian diseases along with evolutionary and
comparative genomics. Through major genomics sequencing efforts such as the Personal Genome
Project, the Vertebrate Genome Project (VGP), the Earth Biogenome Project (EBP) and the Genome in
a Bottle project (GIAB), a variety of sequencing datasets from mother-father-child trios of various diploid
species are becoming available.
Current trio assembly approaches are not designed to incorporate long-read sequencing data from parents
in a trio, and therefore require relatively high coverages of costly long-read data to produce high-quality
assemblies. Thus, building a trio-aware assembler capable of producing accurate and chromosomal-scale
diploid genomes in a pedigree, while being cost-effective in terms of sequencing costs, is a pressing need
of the genomics community.
Results: We present a novel pedigree-graph-based approach to diploid assembly using accurate Illumina
data and long-read Pacific Biosciences (PacBio) data from all related individuals, thereby generalizing
our previous work on single individuals. We demonstrate the effectiveness of our pedigree approach on
a simulated trio of pseudo-diploid yeast genomes with different heterozygosity rates, and real data from
Arabidopsis thaliana. We show that we require as little as 30× coverage Illumina data and 15× PacBio
data from each individual in a trio to generate chromosomal-scale phased assemblies. Additionally, we
show that we can detect and phase variants from generated phased assemblies.
Availability: https://github.com/shilpagarg/WHdenovo
Contact: shilpa_garg@hms.harvard.edu, gchurch@genetics.med.harvard.edu
1 Introduction
The ability to faithfully reconstruct genomes is a crucial step in better
understanding evolution and the nature of inherited disease (Tewhey
et al. (2011)). De novo genome assembly aims to address this problem
by generating complete genome sequences from error-prone sequencing
reads alone, without the use of a reference genome. Creating a de novo
assembler that resolves genomic repeats and is generalized to genomes
of varying heterozygosity rate (the average proportion of loci that differ
between homologous sequences) has posed a significant challenge to
the scientific community. Assembling diploid genomes adds further
difficulty; in order to accurately represent diploid genomes, assemblies
must correctly identify and phase (i.e. determine the correct haplotype
of) homologous sequences. One promising approach to diploid genome
assembly is incorporating sequencing information from a related set of
individuals, particularly from mother-father-child trios, and using the Mendelian information offered by the corresponding pedigree to infer the layout of alleles along homologous sequences.

Fig. 1. The Illumina data (middle) from the trio of genomes can be represented as a pedigree graph. The bubbles in the graph (bottom) show four different variants; from the left, there is an indel, an SNV, an SV, and an SNV.
Some assemblers (Levy et al., 2007; Pryszcz and Gabaldón, 2016;
Simpson and Durbin, 2012; Bankevich et al., 2012; Li, 2015) tackle single-
individual assembly using Next-Generation Sequencing data; however,
while accurate, the short length of NGS reads often leads assemblies to
fragment at repetitive and highly-heterozygous regions. Other assemblers
(Koren et al., 2017; Vaser et al., 2017; Berlin et al., 2015; Chin et al.,
2013) utilize longer third-generation sequencing reads to obtain more contiguous sequences, and to help resolve repeats and heterozygous regions. Yet, these require high coverage to compensate for the high error rate of long-read data, which makes them costly. Hybrid assemblers utilize both types
of reads, taking advantage of the accuracy of short reads and the scale of
long reads to generate complete, high-quality assemblies (Bashir et al.,
2012; Antipov et al., 2015; Zimin et al., 2017).
However, the methods described above collapse differences between homologous pairs into a single consensus sequence, without regard for the rich information given by the layout of different alleles along the two DNA strands (Simpson and Pop, 2015). By contrast, other assemblers (Chin et al., 2016; Garg et al., 2018; Weisenfeld et al., 2017) have been developed to generate haplotigs, i.e. haplotype-resolved assemblies for diploid genomes.
Certain assemblers have also been developed to incorporate pedigree information into the assembly process, specifically for the case of mother-father-child trios. For example, trio-sga creates haplotypes of
the child based on parental Illumina data (Malinsky et al., 2016), while
TrioCanu uses such data to partition child long-read data and subsequently
assembles the partitioned reads separately (Koren et al., 2018). Yet, these
two methods cannot phase variants which are heterozygous in all three
individuals in a trio, and by relying solely on parental Illumina data,
may not correctly haplotype long reads which cover repetitive genomic
regions. Furthermore, TrioCanu does not work properly at low coverages
of long-read data.
Another method approaches the diploid assembly problem by aligning
long-read data from all individuals in a pedigree to a reference genome,
then finding the most likely partitioning of reads as determined by the
PedMEC problem (Garg et al., 2016). Essentially, the PedMEC (Weighted
Minimum Error Correction on Pedigrees) problem finds the partitioning of
reads from related individuals that incurs the least cost, which is calculated
based on the likelihood of errors occurring at various locations along the
reads as well as recombination costs between each site of heterozygosity.
This method, however, only concerns bi-allelic variants, and suffers from reference bias, meaning that unique DNA sequences may not be detected or phased correctly.

Fig. 2. Input: the pedigree graph (top), consisting of four SNVs and two SVs, and the PacBio reads (gray) with their respective alleles (white digits). Output: the final haplotigs (crimson) for each individual in a trio. Our method can phase variants that are heterozygous in all individuals (SNV1) as well as any variant covered by at least one read from any individual (SV2 in the child), resulting in continuous and complete haplotigs.
Thus, developing a haplotype-resolved de novo assembly approach for
related individuals which is cost-effective, flexible with regard to genomic
complexity and heterozygosity rate, and which does not contain reference
bias, is a pressing need for the genomics community.
Contributions. Our graph-based method, implemented as a new tool
WHdenovo, performs phasing in the space of a pedigree graph (defined
below), and is generalized to assemble genomes of varying heterozygosity
rate and with multi-allelic variants, thus allowing for the creation of
accurate, complete haplotigs.
More precisely, our approach builds a pedigree graph, or joint sequence graph, using combined Illumina data from all individuals in a pedigree; the graph represents heterozygous locations as bubbles, as shown in Figure 1.
Given a pedigree graph containing a series of bubbles, long-read (PacBio)
data from each individual are threaded through it; in essence, the most
probable paths that the long reads trace through the bubbles in our pedigree
graph, and which obey the Mendelian constraints imposed by the pedigree,
represent our true haplotypes. We draw key concepts from the graph-based
assembly approach for single individuals described in Garg et al. (2018)
and the PedMEC formulation set forth in Garg et al. (2016), yet synthesize
and extend them to overcome their respective shortcomings.
Our graph-based approach offers several advantages. For example, we can detect and phase all types of small and large variants, and we require relatively low coverage of costly long-read data. We can also phase variants
that are heterozygous in all individuals; for example, SNV1 from Figure
2. Moreover, by incorporating hybrid data from all related individuals, we
can effectively phase reads in repetitive genomic regions, and as shown by
SV2 in Figure 2, if parental reads span a variant but child reads do not, we
can still correctly identify and phase the variant in all three individuals.
Assemblers such as TrioCanu would not be able to phase variants under these circumstances, and may be forced to break assembly contiguity.
To demonstrate the practical effectiveness of our method of haplotype-
aware de novo assembly for related individuals, we assemble two sets of
genomes. First, we haplotype the genomes of a simulated trio of pseudo-
diploid yeast, which allows us to comprehensively study assembly at
varying read coverage and heterozygosity rates. Then, we use real data to
assemble the complete diploid genome of a trio of Arabidopsis thaliana.
These results indicate that our hybrid method is adaptable to genomes of varying heterozygosity rates. We demonstrate that our method is cost-effective, requiring only 30× short-read coverage and 15× long-read coverage for every individual in a pedigree to generate near chromosomal-scale assemblies for all individuals. Moreover, at these coverages, we show that our assemblies for both real and simulated data are more accurate and contiguous than those produced by TrioCanu at 45× child long-read coverage. In a final experiment, we also show that we can detect and phase variants.

Fig. 3. Overview of the pedigree-aware phased assembly pipeline. After we generate a pedigree graph using Illumina data, we align PacBio reads from all related individuals in the pedigree to the graph. Using these alignments, we detect ordered sequences of heterozygous regions, represented as bubble chains. We then find the best partitioning of each set of reads into haplotypes based on the paths taken through the bubble chains, and assemble the partitioned reads.
2 Pedigree-aware phased assembly pipeline
In this section, we present the workflow of our pipeline, which takes as input raw Illumina and PacBio sequencing data from all related individuals in a pedigree, and outputs final, polished haplotigs. Once we create a
pedigree sequence graph, our goal is to find the walks through this graph
that correspond to the true haplotypes of all related individuals. These
haplotype paths will encode the phasing of all variants in the haplotigs
for each individual in the pedigree. Due to errors in the Illumina data and genomic characteristics such as repeats, there are inevitably multiple paths
through this graph that do not correspond to true haplotype paths. Thus,
to construct the true haplotype paths, we seek the maximally likely paths
based on confidence scores of how the nodes are connected to each other
over long distances, which we determine using PacBio reads aligned to
our graph.
This pipeline generalizes our previous single-individual approach (Garg et al., 2018) to related individuals, yielding chromosome-scale haplotigs. Figure 3 represents a conceptual workflow of our pipeline,
detailed below.
Pedigree Graph. We use short, accurate Illumina reads from all related individuals as the basis for generating a pedigree graph. We denote the bidirected pedigree graph as $G_p$, containing a set of nodes $N_p$ and a set of edges $E_p$. Conceptually, each node $n_i \in N_p$ represents a segment of DNA found in the Illumina data, and the node $n_i'$ represents its reverse complement. Note that because nodes are generated using the combined reads of all individuals in a pedigree, not every node sequence is necessarily present in the genome of every individual. Additionally, nodes can be traversed in either direction; when a node is traversed in the reverse direction, its sequence is considered reverse-complemented. Every edge $e_{ij} \in E_p$ represents an adjacency between the sequences represented by nodes $n_i$ and $n_j$.

Fig. 4. A toy example of our algorithm to find bubble chains. Shown is the joint sequence graph, containing nodes and bubbles, with many reads spanning them. In the filtration step, n3 is filtered out because it is covered by only a single read, and n7 is removed because it branches (has degree > 2). Unitigs U1-U4 are generated by applying DFS to the pairs of edges and bubbles whose connections are covered by at least 5 reads. Because there is a read that spans U1 and U2, and a read that spans U2 and U3, we can combine these smaller unitigs into the larger final bubble chain C1.
The pedigree graph $G_p$ contains a set of bubbles $L$ that represent heterozygous variants (or sequencing errors). The terminology bubble is drawn from Garg et al. (2018) and follows from the ultrabubble coined by Paten et al. (2017). Graphically speaking, bubbles are directed and acyclic, biconnected, and minimal, as discussed by Garg et al. (2018). We define each bubble $l_k$ to be the set of allele paths contained within it, where each allele path is a unique sequence of nodes spanning a common start and end node. Intuitively, each bubble is bookended by common sequences (the start and end nodes), and the unique node sequences connecting them represent the alleles gleaned from our Illumina data. Figure 1 shows bubbles representing various types of variants.
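For illustration, a minimal sketch of how such a graph and its bubbles might be represented in code is shown below. The class names, fields, and integer node identifiers are our own illustrative choices and are not taken from the WHdenovo implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# An oriented node: a node identifier plus a strand flag
# (False = forward sequence, True = reverse complement).
OrientedNode = Tuple[int, bool]

@dataclass
class PedigreeGraph:
    """Bidirected sequence graph G_p built from the pooled Illumina reads."""
    sequences: Dict[int, str] = field(default_factory=dict)                    # node id -> DNA segment
    edges: Set[Tuple[OrientedNode, OrientedNode]] = field(default_factory=set)

    def add_node(self, node_id: int, seq: str) -> None:
        self.sequences[node_id] = seq

    def add_edge(self, a: OrientedNode, b: OrientedNode) -> None:
        # An adjacency e_ij between two (possibly reverse-complemented) node sequences.
        self.edges.add((a, b))

    def node_seq(self, node: OrientedNode) -> str:
        """Sequence of a node in the requested orientation."""
        node_id, reverse = node
        seq = self.sequences[node_id]
        if not reverse:
            return seq
        comp = str.maketrans("ACGTacgt", "TGCAtgca")
        return seq.translate(comp)[::-1]

@dataclass
class Bubble:
    """A heterozygous site: alternative allele paths between a shared start and end node."""
    start: OrientedNode
    end: OrientedNode
    allele_paths: List[List[OrientedNode]]   # one entry per allele, numbered 0, 1, 2, ...
```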
PacBio Alignments. We align PacBio long reads from all individuals to produce paths through the pedigree graph $G_p$, in a similar process to Garg et al. (2018). The concept of aligning a PacBio read to $G_p$ captures the idea that the sequence of a single PacBio read can trace a path through the sequences contained in many nodes. Thus, for a given PacBio read, we define a read alignment $r_i$ to be a path through $G_p$, defined by the oriented nodes $n_1 \ldots n_k$ which map to that read. Given a set of individuals $I$, we align the PacBio reads from every individual $i \in I$ to $G_p$, resulting in a set of read alignments $R_i = \{r_1, r_2, \ldots, r_j\}$ for every individual. As an intuitive checkpoint, note that as PacBio reads from different individuals trace different paths through the nodes and bubbles in $G_p$, we gain information not only about the correct ordering of nodes to form the genome in question, but also about the alleles and phasing present in each individual.
Bubble Chains. To consolidate the information gained from the PacBio alignments, we need to formalize the process of finding ordered bubble chains, which represent the layout of heterozygous sites across the genome. Ideally, we would be able to use our PacBio alignments to obtain $C$, a set of bubble chains containing one continuous bubble chain corresponding to each chromosome. Before generating bubble chains, we perform a series of filtration steps to remove erroneous bubbles, according to the algorithm described in the following paragraph.

On input $R$, the set of all read alignments $R_i$ for individuals $i \in I$, we project all partial alignments to bubble space; that is, for every instance where a read traverses a bubble, we replace the corresponding set of nodes by the appropriate bubble ID. We then perform our filtration steps: for every pair of adjacent nodes or bubbles from every alignment path in $R$, denoted $(x, y) \in R$, we compute the coverage $c_{(x,y)}$, and if $c_{(x,y)} < 5$, we remove the pair. Next, we calculate the degree of every remaining node $x$, and if $\deg(x) \geq 3$, we remove all pairs containing $x$. Using the resulting filtered pairs, we then perform DFS to find $U$, an initial set of unambiguous bubble chains termed unitigs. Finally, if there is at least one read connecting a pair of unitigs in $U$, we can infer the ordering between these initial unitigs. We record the resulting orderings as $C$, the set of final bubble chains. Figure 4 demonstrates the algorithm with a basic example, and a simplified code sketch of the filtration and unitig construction is given below.
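The following sketch is a simplified, illustrative re-implementation of the filtration and unitig steps described above, not the WHdenovo code itself. The coverage threshold (at least 5 reads) and the branching criterion (degree >= 3) follow the text; the function names, the integer element identifiers, and the endpoint-walking used in place of a full DFS are our own assumptions, and the final merging of unitigs into chains by spanning reads is omitted.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence

def project_to_bubble_space(path: Sequence[int], node_to_bubble: Dict[int, int]) -> List[int]:
    """Replace every node that lies inside a bubble by that bubble's ID, collapsing repeats."""
    projected: List[int] = []
    for element in path:
        element = node_to_bubble.get(element, element)
        if not projected or projected[-1] != element:
            projected.append(element)
    return projected

def build_unitigs(paths: List[List[int]], min_cov: int = 5) -> List[List[int]]:
    """Filter adjacencies by coverage and branching, then walk out maximal simple paths."""
    # Coverage of adjacent (node or bubble) pairs over all projected alignments.
    pair_cov = Counter()
    for path in paths:
        for x, y in zip(path, path[1:]):
            if x != y:
                pair_cov[frozenset((x, y))] += 1
    kept = [pair for pair, cov in pair_cov.items() if cov >= min_cov]

    # Build an adjacency map, then drop every element that branches (degree >= 3).
    adj = defaultdict(set)
    for pair in kept:
        x, y = tuple(pair)
        adj[x].add(y)
        adj[y].add(x)
    branching = {v for v, nb in adj.items() if len(nb) >= 3}
    for v in branching:
        for u in adj[v]:
            adj[u].discard(v)
        adj[v].clear()

    # Every remaining element has degree <= 2, so each connected component is a
    # simple path; walking from an endpoint yields one unitig per component.
    unitigs, seen = [], set()
    for v in adj:
        if v in seen or len(adj[v]) != 1:   # start only at path endpoints
            continue
        unitig, prev = [v], None
        seen.add(v)
        while True:
            nxt = [u for u in adj[unitig[-1]] if u != prev and u not in seen]
            if not nxt:
                break
            prev = unitig[-1]
            unitig.append(nxt[0])
            seen.add(nxt[0])
        unitigs.append(unitig)
    return unitigs
```

In the final step, which is not shown, unitigs that are connected by at least one spanning read would be concatenated to form the bubble chains in $C$, as in the example of Figure 4.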
Fig. 5. For a subgraph of $G_p$, the example shows three bubbles $l_1$, $l_2$, and $l_3$, and their corresponding alleles. Reads from mother, father, and child traverse the bubbles.
Graph-based phasing on pedigrees. We introduce the graph version of the Weighted Minimum Error Correction on Pedigrees problem, or gPedMEC, as our central phasing algorithm. gPedMEC relates to the PedMEC formulation set forth by Garg et al. (2016), which concerns alignment to a reference genome and applies only to bi-allelic variants. Considering multi-allelic variants in the gPedMEC framework requires additional phasing-related observations. Specifically, bubbles which represent bi-allelic variants ensure that a child's haplotype paths must be the same as a combination of parental haplotype paths. By contrast, in bubbles containing multi-allelic variants, it is possible for child haplotype paths to differ from parental ones; this can occur when the variant encoded by the bubble is long and a new allele is created by recombination within the variant itself. Based on these insights, we create a new algorithm to determine the haplotypes of any set of related individuals in a more flexible, accurate, and representative manner.

Ultimately, the goal of solving gPedMEC is to recreate the haplotype paths of every individual through the bubble chains we calculated in $C$. The haplotype paths for a given individual are determined by deducing the two paths through the bubble chains which incur the least cost (i.e., are most likely), where cost is determined by confidence in alignment paths and by recombination costs. Doing so inherently also computes the long-read partitionings.
In order to represent the paths taken through our bubble chains for each set of alignments $R_i$, we create bubble matrices $F_i \in \{0, 1, \ldots, e, -\}^{|R_i| \times M}$ for each individual $i$. Here, $e$ is the maximum number of alleles contained in any bubble, and $M$ is the number of bubbles in a chain. Figure 5 shows a sample 3-bubble chain with PacBio reads from each individual aligned to it; the corresponding bubble matrices for the three individuals are shown below:

$$
F_1 = \begin{array}{c|ccc}
      & l_1 & l_2 & l_3 \\ \hline
r_1   & 1 & 0 & 0 \\
r_2   & 0 & 0 & 0 \\
r_3   & 1 & 0 & 1 \\
r_4   & 1 & 0 & - \\
\end{array}
\qquad
F_2 = \begin{array}{c|ccc}
      & l_1 & l_2 & l_3 \\ \hline
r_5   & 2 & 1 & 1 \\
r_6   & 2 & 1 & 1 \\
r_7   & 2 & 0 & 0 \\
r_8   & 2 & 0 & 0 \\
\end{array}
\qquad
F_3 = \begin{array}{c|ccc}
      & l_1 & l_2 & l_3 \\ \hline
r_9    & 0 & 0 & 0 \\
r_{10} & 2 & 1 & 1 \\
r_{11} & 0 & 1 & 0 \\
r_{12} & - & 1 & 1 \\
\end{array}
\tag{1}
$$
If all reads in a bubble matrix $F_i$ were to follow exactly one of two paths through a bubble chain, solving for the true haplotype paths for the corresponding individual, $h_i^0, h_i^1 \in \{0, 1, \ldots, e, -\}^M$, would be trivial. For example, $F_2$ is conflict-free, and the corresponding haplotype paths are self-evident. However, due to long-read sequencing errors, conflicts within bubble matrices are inevitable. Thus, our goal is to find the optimal set of matrix entries which can be flipped in order to create conflict-free matrices containing a bipartition of reads which follow the two haplotype paths of each individual (and which obey the Mendelian constraints of the pedigree). So, for every individual, we also need to introduce a weight matrix $W_i \in \mathbb{N}^{|R_i| \times M \times e}$, where each entry $W_i(j, k, x)$ represents the cost of flipping read $j$, at bubble $l_k$, to allele $x$, where $k \in \{0, 1, \ldots, M\}$ and $x \in \{0, 1, \ldots, |l_k|\}$. An example weight matrix, corresponding to $F_1$, is shown below:

$$
W_1 = \begin{array}{c|ccc}
      & l_1 & l_2 & l_3 \\ \hline
r_1   & [2, 0, 10]  & [0, 1]  & [0, 10] \\
r_2   & [0, 10, 10] & [0, 2]  & [0, 5]  \\
r_3   & [12, 0, 5]  & [0, 1]  & [4, 10] \\
r_4   & [7, 0, 8]   & [0, 10] & -       \\
\end{array}
\tag{2}
$$

Other weight matrices can be written similarly.
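As a concrete illustration of how $F_i$ and $W_i$ interact, the sketch below encodes $F_1$ and $W_1$ from Equations (1) and (2) and computes the cost of flipping a set of reads to a single allele at one bubble (the quantity written $w^d_{k,C}$ further below). The data layout and function name are illustrative assumptions, not the WHdenovo representation.

```python
from typing import List, Optional, Sequence

# F_1 from Equation (1): rows are reads r1..r4, columns are bubbles l1..l3.
# None marks a bubble not covered by the read ('-' in the matrix).
F1: List[List[Optional[int]]] = [
    [1, 0, 0],
    [0, 0, 0],
    [1, 0, 1],
    [1, 0, None],
]

# W_1 from Equation (2): W1[j][k][x] is the cost of flipping read j to allele x at bubble k.
W1 = [
    [[2, 0, 10], [0, 1], [0, 10]],
    [[0, 10, 10], [0, 2], [0, 5]],
    [[12, 0, 5], [0, 1], [4, 10]],
    [[7, 0, 8], [0, 10], None],
]

def flip_cost(F, W, reads: Sequence[int], k: int, allele: int) -> int:
    """Cost of flipping every read in `reads` to `allele` at bubble k; reads that already
    carry that allele, or that do not cover the bubble, contribute nothing."""
    total = 0
    for j in reads:
        observed = F[j][k]
        if observed is None or observed == allele:
            continue
        total += W[j][k][allele]
    return total

# Flipping reads r1, r2, r3 to allele 0 at bubble l1 costs 2 + 0 + 12 = 14.
print(flip_cost(F1, W1, [0, 1, 2], k=0, allele=0))
```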
We need to account for the Mendelian constraints imposed by the pedigree as well. Thus, we introduce the transmission vectors $\vec{t}_m, \vec{t}_f \in \{0, 1, na\}^M$ for each triple $(m, f, c) \in T$ to denote the alleles passed from each parent to the child at every location in a bubble chain, where $\vec{t}_m(k) = 1$ if the mother passes on an allele from $h_m^1$, and so on. Under the circumstance that recombination occurs within a bubble, in which case the value na is passed on, no haplotype of any parent is directly transmitted to the child; instead, the recombined sequences form allele paths in the bubble.

Additionally, we introduce recombination costs $X \in \mathbb{N}^M$, where $X(k)$ denotes the cost between the bubbles of indices $k-1$ and $k$. For example, if a transmission vector $\vec{t}_f$ passes on an allele from $h_f^1$ at location $k-1$ and an allele from $h_f^0$ at $k$, this incurs a recombination cost of $X(k)$. However, when a transmission vector $\vec{t}_m$ passes on na at location $k-1$ and an allele from $h_m^0$ at bubble $k$, this does not incur any recombination cost.
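The cost contribution of a pair of transmission vectors follows directly from this definition; a minimal sketch, with illustrative names and with na encoded as None, is given below.

```python
from typing import Optional, Sequence

def recombination_cost(t_mother: Sequence[Optional[int]],
                       t_father: Sequence[Optional[int]],
                       X: Sequence[int]) -> int:
    """Total recombination cost of a pair of transmission vectors over one bubble chain.

    A cost X[k] is paid whenever a parent transmits haplotype 0 at bubble k-1 and
    haplotype 1 at bubble k (or vice versa); transitions into or out of na are free.
    """
    total = 0
    for t in (t_mother, t_father):
        for k in range(1, len(t)):
            if t[k - 1] is not None and t[k] is not None and t[k - 1] != t[k]:
                total += X[k]
    return total

# The father switches from haplotype 1 to haplotype 0 between bubbles 1 and 2,
# so only X[2] = 3 is paid; the mother's na transitions are free.
t_m = [0, None, 0, 0]
t_f = [1, 1, 0, 0]
X = [0, 2, 3, 2]
print(recombination_cost(t_m, t_f, X))   # 3
```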
Having defined all necessary terms, we can now more clearly define the gPedMEC problem. Consider a pedigree graph $G_p$ containing bubble chains $C$ and recombination costs $X$, from a pedigree with individuals $I$ and relationships $T$; each individual has corresponding PacBio read alignments $R_i$, and matrices $F_i$ and $W_i$. The crux of gPedMEC is then to determine the set of all bubble matrix entries to be flipped which accrues the minimal cost, based on (a) the weight of flipping these entries coupled with (b) the recombination cost incurred by the transmission vectors implied by the resulting child haplotype paths. Once these matrix entries are determined, the haplotypes $h_i^0$ and $h_i^1$ for all individuals become self-evident.
Dynamic Programming to Solve gPedMEC. In the following section, we discuss our dynamic programming (DP) algorithm to solve gPedMEC, which generalizes that of Garg et al. (2016), to which we refer the reader for additional information about the algorithm. The motivation behind using a dynamic programming algorithm is to utilize a DP table to determine the optimal haplotype paths more efficiently than with a brute-force algorithm, which would require exponential time with respect to the number of bubbles and alignments.

Fig. 6. The process of our DP algorithm. The top row shows the three input bubble matrices from mother, father, and child, with their weights (a). The following rows show sample processes for DP column initialization and recurrence. In block (c), sample bipartitions and transmission values are shown, with the respective calculations in block (d). The cost calculation finds the assignment of allele pairs to the example set of partitions which accrues the minimum cost; for example, computing the cost of assigning allele 0 to the green partition yields a cost of 2. The other costs can be computed similarly. Block (f) shows the recurrence step, which minimizes over the values of compatible bipartitions in the previous column, adds the initialization cost, and allows various possibilities of recombination. This process is repeated for all possible transmission vectors and compatible partitions until the last column. Figure adapted from Garg (2018).
DP cell initialization. As in Garg et al. (2016), we proceed by first computing the initial DP cell cost $\Delta C(k, B, t)$ accrued by flipping entries in column $k$ of each bubble matrix, where $k$ is the index of bubble $l_k$, $B$ is a bipartition of reads, and $t \in \{0, 1, na\}^2$ is a transmission tuple indicating which haplotype from the mother and the father, respectively, is inherited, if applicable.

Intuitively, there is a relationship between a bipartition $B$, a transmission tuple $t$, and the resulting set of readsets induced at location $k$, termed $S(k, B, t)$. For example, if the transmission tuple does not contain any na values, then the child's haplotype path must be a combination of parental haplotypes; correspondingly, the reads in the child's bubble matrix can be merged with bipartitions in the parental bubble matrices due to the identity-by-descent (IBD) condition. In the case of a trio, this leads to two resulting bipartition readsets. In the presence of na values, the child's reads are not merged with parental reads, leading to three bipartition readsets in the case of a trio.

For a given bubble $l_k$, each bipartition can be assigned any pair of alleles $(x, y)$, of which there are $\binom{|l_k|}{2} + |l_k|$ (the added $|l_k|$ accounts for the possibility of assigning the same allele to both sides). To notate this, we let $W^d_{k,R}$ represent the cost of flipping the reads of a specific bipartition readset $R \in S(k, B, t)$ to an allele pair $d = (x, y) \in A$, where $A$ is the set of allele pairs $\{(x, y) \in l_k \times l_k \mid x \leq y\}$:

$$W^d_{k,R} = \min\{\, w^x_{k,X} + w^y_{k,Y},\ \ w^x_{k,Y} + w^y_{k,X} \,\}$$
where $X$ and $Y$ are the readsets of the bipartition readset $R$. The cost of flipping all reads in a readset $C$ to a given allele $d$ can be computed as:

$$w^d_{k,C} = \sum_{(i,j) \in C} \mathbf{1}[\, F_i(j, k) \neq d \,] \cdot W_i(j, k, d)$$

Thus, to initialize the cost $\Delta C(k, B, t)$, we need to find the assignment of allele pairs to all bipartition readsets which incurs the minimal flipping cost. This can be formalized in the following way:

$$\Delta C(k, B, t) = \min_{a \in A^{S(k,B,t)}} \sum_{R \in S(k,B,t)} W^{a(R)}_{k,R} \tag{3}$$
The sum accumulates the allele-pair assignment cost over all bipartition readsets in $S(k, B, t)$, and the outer minimization selects the cheapest assignment $a$. The calculations in Figure 6 provide a more concrete example of calculating the DP cell initialization cost. In this example, we assume the sample partition of reads and the child partition implied by the transmission value shown in Figure 6(c), where the mother passes on the green allele and the father passes on the blue. To calculate the DP cell initialization cost in (d), we find the assignment of allele pairs to each of the bipartition readsets which incurs the lowest cost. For example, the cost associated with assigning the allele pairs (0,2) and (0,1) to the bipartitions (green, purple) and (orange, blue), respectively, is (2) + (5+8) + (4+1) + (3+4+4) = 29, because the minimum cost of flipping all reads in the green partition to allele 0 is 2, that of flipping the purple partition to allele 2 is (5+8) = 13, and so on. In this way, we can compute the costs for other allele-pair assignments and store the minimum in the DP cell.
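A brute-force rendering of Equation 3 for a single column is sketched below: it enumerates every assignment of allele pairs to the bipartition readsets in $S(k, B, t)$ and keeps the cheapest. The data layout (each read as a pair of allele and cost rows) and the function names are illustrative assumptions; the actual implementation avoids this full enumeration.

```python
from itertools import combinations_with_replacement, product
from typing import Optional, Sequence, Tuple

# A read is represented by its row of F (alleles per bubble, None = not covered)
# and its row of W (per-bubble lists of flipping costs).
Read = Tuple[Sequence[Optional[int]], Sequence[Optional[Sequence[int]]]]
Readset = Tuple[Sequence[Read], Sequence[Read]]   # the two halves X, Y of a bipartition readset

def flip_cost(reads: Sequence[Read], k: int, allele: int) -> int:
    """w^allele_{k,reads}: cost of flipping every read in the set to `allele` at bubble k."""
    total = 0
    for alleles, costs in reads:
        if alleles[k] is None or alleles[k] == allele:
            continue
        total += costs[k][allele]
    return total

def pair_cost(readset: Readset, k: int, d: Tuple[int, int]) -> int:
    """W^d_{k,R}: cheapest way to assign the allele pair d to the two halves of R."""
    X, Y = readset
    x, y = d
    return min(flip_cost(X, k, x) + flip_cost(Y, k, y),
               flip_cost(X, k, y) + flip_cost(Y, k, x))

def cell_init_cost(readsets: Sequence[Readset], k: int, n_alleles: int) -> int:
    """Delta C(k, B, t): cheapest assignment of allele pairs to all bipartition readsets."""
    allele_pairs = list(combinations_with_replacement(range(n_alleles), 2))  # all (x, y), x <= y
    best = float("inf")
    for assignment in product(allele_pairs, repeat=len(readsets)):
        cost = sum(pair_cost(R, k, d) for R, d in zip(readsets, assignment))
        best = min(best, cost)
    return 0 if best == float("inf") else int(best)
```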
DP column initialization. Every entry $C(1, B, t)$ in the first column of the DP table is filled with the corresponding cell initialization cost $\Delta C(1, B, t)$ for all bipartitions $B$ and transmissions $t$. Thus, on input the set of reads covering $l_1$, each $C(1, B, t)$ is calculated according to Equation 3.

As described in Patterson et al. (2014) and Garg et al. (2016), we enumerate over bipartitions in Gray code order, thus ensuring a runtime of $O\!\left(\left(\binom{|l_k|}{2} + |l_k|\right)^{|I| - |T|} \cdot 2^{|F(k)| + 2 \cdot |T|}\right)$.
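Enumerating bipartitions in Gray code order means that consecutive bipartitions differ by moving exactly one read between the two halves, so the cost of each bipartition can be obtained from the previous one by a constant-size update rather than being recomputed from scratch. A minimal, illustrative sketch of the enumeration itself (not of the incremental cost update) is shown below; the function name is our own.

```python
from typing import Iterator, List, Sequence, Tuple

def gray_code_bipartitions(reads: Sequence[str]) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield all 2^n bipartitions of `reads` such that consecutive bipartitions
    differ by moving a single read from one half to the other."""
    n = len(reads)
    for i in range(2 ** n):
        code = i ^ (i >> 1)   # binary-reflected Gray code: adjacent codes differ in one bit
        left = [r for j, r in enumerate(reads) if not (code >> j) & 1]
        right = [r for j, r in enumerate(reads) if (code >> j) & 1]
        yield left, right

for left, right in gray_code_bipartitions(["r1", "r2", "r3"]):
    print(left, right)
```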
DP recurrence. In the recurrence step, we compute $C(k, B, t)$ at column $k$, which intuitively represents the optimal allele-pair assignment cost to bipartitions $B$ up to column $k$. In general, the cell $C(k, B, t)$ can be computed as the optimal accumulated cost of the bipartitions up to column $k-1$, plus $\Delta C(k, B, t)$, under various possibilities of recombination. Thus, we add $\Delta C(k, B, t)$ to values from column $k-1$, where possible recombination costs are incurred according to the various values of $t$. The additional constraint is that only those entries in column $k-1$ whose bipartitions are compatible with $B$ are considered, where two bipartitions are deemed compatible if they place the reads they share into the same readsets. By only considering compatible bipartitions, we ensure that a single PacBio read cannot come from more than one haplotype. For two bipartitions $B$ and $B'$, compatibility is denoted as $B \simeq B'$.

In more formal terms, the value of $C(k, B, t)$ can be written as follows:

$$C(k, B, t) = \Delta C(k, B, t) + \min_{\substack{B' \in \mathcal{B}(F(k-1)) \,:\, B' \simeq B \\ t' \in \{0, 1, na\}^{2|T|}}} \Big( C(k-1, B', t') + d_H(t, t') \cdot X(k) \Big) \tag{4}$$

In this equation, the term $d_H(t, t')$ refers to the distance, or number of changes, between two transmission vectors, and thereby represents the number of recombination events incurred between bubbles $k-1$ and $k$.
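For completeness, the recurrence of Equation 4 can be written as a straightforward table update. In the sketch below, the per-column cost $\Delta C$, the compatibility test, and the candidate bipartitions and transmission tuples per column are supplied as parameters; the names and data layout are illustrative assumptions, and the real implementation prunes this enumeration much more aggressively.

```python
from itertools import product
from typing import Callable, Dict, Hashable, List, Optional, Sequence, Tuple

Bipartition = Hashable                       # any hashable encoding of a bipartition
Transmission = Tuple[Optional[int], ...]     # one entry per parent; None stands for na

def transmission_distance(t: Transmission, t_prev: Transmission) -> int:
    """d_H(t, t'): number of parents whose transmitted haplotype switches (na transitions are free)."""
    return sum(1 for a, b in zip(t, t_prev)
               if a is not None and b is not None and a != b)

def solve_gpedmec(
    columns: Sequence[Tuple[List[Bipartition], List[Transmission]]],   # candidates per bubble
    delta_cost: Callable[[int, Bipartition, Transmission], int],       # Delta C(k, B, t)
    compatible: Callable[[Bipartition, Bipartition], bool],            # B ~ B'
    X: Sequence[int],                                                  # recombination costs
) -> float:
    """Minimum total gPedMEC cost over all bipartition and transmission choices."""
    # Column 0: initialization, C(0, B, t) = Delta C(0, B, t).
    biparts, transmissions = columns[0]
    C: Dict[Tuple[Bipartition, Transmission], float] = {
        (B, t): delta_cost(0, B, t) for B, t in product(biparts, transmissions)
    }
    # Recurrence over the remaining columns (Equation 4).
    for k in range(1, len(columns)):
        biparts, transmissions = columns[k]
        C_next: Dict[Tuple[Bipartition, Transmission], float] = {}
        for B, t in product(biparts, transmissions):
            best_prev = min(
                (cost + transmission_distance(t, t_prev) * X[k]
                 for (B_prev, t_prev), cost in C.items()
                 if compatible(B, B_prev)),
                default=float("inf"),
            )
            C_next[(B, t)] = delta_cost(k, B, t) + best_prev
        C = C_next
    return min(C.values())
```

The optimal haplotype paths and read partitionings would then be recovered by backtracking through the table, as is standard for such DP formulations.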

Frequently Asked Questions (15)
Q1. What have the authors contributed in "A haplotype-aware de novo assembly of related individuals using pedigree graph" ?

Through major genomics sequencing efforts such as the Personal Genome Project, the Vertebrate Genome Project ( VGP ), the Earth Biogenome Project ( EBP ) and the Genome in a Bottle project ( GIAB ), a variety of sequencing datasets from mother-father-child trios of various diploid species are becoming available. The authors present a novel pedigree-graph-based approach to diploid assembly using accurate Illumina data and long-read Pacific Biosciences ( PacBio ) data from all related individuals, thereby generalizing their previous work on single individuals. The authors demonstrate the effectiveness of their pedigree approach on a simulated trio of pseudo-diploid yeast genomes with different heterozygosity rates, and real data from Arabidopsis Thaliana. The authors show that they require as little as 30× coverage Illumina data and 15× PacBio data from each individual in a trio to generate chromosomal-scale phased assemblies. Additionally, the authors show that they can detect and phase variants from generated phased assemblies. 

One restriction in their model is the use of constant recombination rates ; the authors aim to fine tune this parameter in the future according to genomic distances, and properly incorporate recombination hotspots. Their framework, in principle, is generalized to incorporate any variety of datasets ; in the future, the authors hope to optimize their method by incorporating data such as chromatin conformation capture ( Burton et al., 2013 ) and linked read sequencing ( Weisenfeld et al., 2017 ). 

The authors evaluate the child’s predicted haplotype-resolved assemblies by aligning their predicted assemblies to the true simulated genome in their yeast-based experiments, and to TrioCanu’s published assemblies of A. thaliana when handling real data. 

The motivation behind using a dynamic programming algorithm is to utilize a DP table to determine the optimal haplotype paths more efficiently than with a brute-force algorithm, which would require exponential time with respect to the number of bubbles and alignments. 

developing a haplotype-resolved de novo assembly approach for related individuals which is cost-effective, flexible with regard to genomic complexity and heterozygosity rate, and which does not contain reference bias, is a pressing need for the genomics community. 

The authors will explore heuristic approaches to perform polyploidy phasing in an efficient manner, and will aim to use a joint phasing framework to obtain more contiguous diploid genome assemblies. 

using VG (Garrison et al., 2017), the authors converted the assembly graph to a bluntified sequence graph—that is, with redundant node sequences removed. 

to construct the true haplotype paths, the authors seek the maximally likely paths based on confidence scores of how the nodes are connected to each other over long distances, which the authors determine using PacBio reads aligned to their graph. 

Due to the lack of sufficient parental long-read data to pursue this goal (which major sequencing efforts will likely produce in the near future), the authors primarily considered simulated data for their comprehensive study of assembly behavior at varying heterozygosity rates and long-read coverage. 

The TrioCanu method (Koren et al., 2018) is a hybrid approach that takes advantage of parental Illumina data and long reads from the child in a trio; yet, it has the limitation of not phasing variants that are heterozygous in all individuals. 

From Figure 7, the authors observe that the number of predicted SNVs or short indels rises in response to increasing heterozygosity rate; for example, 58,541 and 178,743 for genomes with heterozygosity rates of 0.5% and 1.5%, respectively. 

In their approach, the authors observe that as the authors increase the long read coverage from 5× to 15× for each individual, the average identity of the haplotigs increases from 99.3% to 99.9%. 

In measuring partitioning accuracy of long reads, the authors considered reads to be classified only if covering a fixed threshold of bubbles. 

For real data from Arabidopsis thaliana, the authors can produce complete assemblies of 119 Mb at 15× coverage for each individual in a trio. 

Once the authors obtain the partitions of long-read data for each individual, the authors can perform haplotype-aware error-correction on the reads.