A haplotype-aware de novo assembly of related individuals using pedigree graph
Frequently Asked Questions (15)
Q2. What future work do the authors propose in "A haplotype-aware de novo assembly of related individuals using pedigree graph"?
One restriction of their model is the use of constant recombination rates; the authors aim to fine-tune this parameter according to genomic distance and to properly incorporate recombination hotspots. Their framework is, in principle, general enough to incorporate a variety of datasets; in the future, the authors hope to optimize their method by incorporating data such as chromatin conformation capture (Burton et al., 2013) and linked-read sequencing (Weisenfeld et al., 2017).
Q3. How did the authors evaluate the predicted assemblies?
The authors evaluate the child’s predicted haplotype-resolved assemblies by aligning them to the true simulated genome in the yeast-based experiments, and to TrioCanu’s published assemblies of A. thaliana when handling real data.
Q4. What is the motivation behind using a dynamic programming algorithm to solve gPedMEC?
The motivation is to use a DP table to determine the optimal haplotype paths more efficiently than a brute-force algorithm, which would require time exponential in the number of bubbles and alignments.
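To make the exponential cost concrete, here is a minimal sketch of a brute-force search on a simplified, single-individual minimum error correction (MEC) problem. It is not the authors' gPedMEC formulation or their DP: the read encoding (a dict from bubble index to allele) and the function names `mec_cost` and `brute_force_mec` are assumptions made for illustration only.

```python
from itertools import product

def mec_cost(reads, assignment):
    """MEC cost of one fixed bipartition of the reads.

    reads: list of dicts mapping bubble index -> observed allele (0 or 1).
    assignment: tuple of 0/1 giving the haplotype each read is placed on.
    """
    positions = sorted({p for r in reads for p in r})
    cost = 0
    for hap in (0, 1):
        group = [r for r, a in zip(reads, assignment) if a == hap]
        for p in positions:
            alleles = [r[p] for r in group if p in r]
            ones = sum(alleles)
            # Keeping the majority allele means flipping the minority entries.
            cost += min(ones, len(alleles) - ones)
    return cost

def brute_force_mec(reads):
    """Try all 2**len(reads) bipartitions and return (best cost, assignment)."""
    best = None
    for assignment in product((0, 1), repeat=len(reads)):
        c = mec_cost(reads, assignment)
        if best is None or c < best[0]:
            best = (c, assignment)
    return best

# Toy input: four mutually consistent reads plus one conflicting read.
reads = [
    {0: 0, 1: 0},
    {0: 1, 1: 1},
    {0: 0, 1: 0, 2: 1},
    {1: 1, 2: 0},
    {0: 0, 1: 1},
]
cost, assignment = brute_force_mec(reads)
```

The loop in `brute_force_mec` doubles in length with every additional read, which is the behavior a DP table is meant to avoid.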
Q5. Why is a haplotype-resolved de novo assembly approach needed?
Developing a haplotype-resolved de novo assembly approach for related individuals that is cost-effective, flexible with regard to genomic complexity and heterozygosity rate, and free of reference bias is a pressing need for the genomics community.
Q6. How will the authors use heuristic approaches to perform polyploidy phasing?
The authors will explore heuristic approaches to perform polyploidy phasing in an efficient manner, and will aim to use a joint phasing framework to obtain more contiguous diploid genome assemblies.
Q7. How did the authors convert the assembly graph to a bluntified sequence graph?
Using VG (Garrison et al., 2017), the authors converted the assembly graph to a bluntified sequence graph, that is, one with redundant node sequences removed.
Q8. How do the authors construct the true haplotype paths?
To construct the true haplotype paths, the authors seek the maximally likely paths based on confidence scores of how the nodes are connected to each other over long distances, which they determine using PacBio reads aligned to their graph.
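The idea of a maximally likely path can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: it scores only links between adjacent bubbles (the answer above refers to long-distance connections), it assumes every bubble has exactly two alleles, and the `link_scores` structure and the function name `best_path` are hypothetical.

```python
import math

def best_path(link_scores):
    """Pick the allele sequence with the highest product of link confidences.

    link_scores[i][(a, b)] is an assumed confidence in (0, 1] that allele a of
    bubble i is followed by allele b of bubble i + 1.
    Returns (path, log_score), where path has one allele per bubble.
    """
    score = {0: 0.0, 1: 0.0}  # best log-score of a path ending at each allele
    back = []                 # back[i][b] = best predecessor allele of b
    for links in link_scores:
        new_score, pointers = {}, {}
        for b in (0, 1):
            candidates = [(score[a] + math.log(links[(a, b)]), a) for a in (0, 1)]
            new_score[b], pointers[b] = max(candidates)
        score = new_score
        back.append(pointers)
    last = max(score, key=score.get)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]

# Toy confidences for three consecutive bubbles (made-up numbers).
links = [
    {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.9},
    {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.6, (1, 1): 0.4},
]
path, log_score = best_path(links)
```

Working in log space turns the product of confidences into a sum, so the best path can be built one bubble at a time instead of enumerating every allele combination.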
Q9. Why did the authors use a simulated genome?
Due to the lack of sufficient parental long-read data to pursue this goal (which major sequencing efforts will likely produce in the near future), the authors primarily considered simulated data for their comprehensive study of assembly behavior at varying heterozygosity rates and long-read coverage.
Q10. What is the limitation of the TrioCanu method?
The TrioCanu method (Koren et al., 2018) is a hybrid approach that takes advantage of parental Illumina data and long reads from the child in a trio; yet, it has the limitation of not phasing variants that are heterozygous in all individuals.
Q11. How does the number of predicted SNVs change with the heterozygosity rate?
From Figure 7, the authors observe that the number of predicted SNVs or short indels rises with increasing heterozygosity rate; for example, 58,541 and 178,743 for genomes with heterozygosity rates of 0.5% and 1.5%, respectively.
Q12. How does the average identity of the haplotigs change with the increase in coverage?
In their approach, the authors observe that as they increase the long-read coverage from 5× to 15× for each individual, the average identity of the haplotigs increases from 99.3% to 99.9%.
Q13. When did the authors consider a read to be classified?
In measuring partitioning accuracy of long reads, the authors considered reads to be classified only if covering a fixed threshold of bubbles.
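A bubble-coverage rule of this kind might look like the sketch below. The threshold value is not given in the answer above, so `min_bubbles=2` is a placeholder; the function name `classify_read`, the dict-based encoding, and the majority-match rule for choosing a haplotype are likewise assumptions for illustration.

```python
def classify_read(read_alleles, hap0, hap1, min_bubbles=2):
    """Assign a read to haplotype 0 or 1, or return None if unclassified.

    read_alleles, hap0, hap1: dicts mapping bubble index -> allele.
    min_bubbles: hypothetical threshold on the number of covered bubbles.
    """
    covered = [b for b in read_alleles if b in hap0 and b in hap1]
    if len(covered) < min_bubbles:
        return None  # read spans too few bubbles to be counted
    match0 = sum(read_alleles[b] == hap0[b] for b in covered)
    match1 = sum(read_alleles[b] == hap1[b] for b in covered)
    if match0 == match1:
        return None  # ambiguous: equal support for both haplotypes
    return 0 if match0 > match1 else 1

# Toy haplotypes over three bubbles.
hap0 = {0: 0, 1: 0, 2: 1}
hap1 = {0: 1, 1: 1, 2: 0}
long_read = {0: 1, 1: 1, 2: 0}
short_read = {2: 1}
```

Here `long_read` covers three bubbles and is assigned to a haplotype, while `short_read` covers only one and is left out of the accuracy calculation.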
Q14. How many Mb of assemblies can the authors produce for each individual in a trio?
For real data from Arabidopsis thaliana, the authors can produce complete assemblies of 119 Mb at 15× coverage for each individual in a trio.
Q15. How can the authors perform haplotype-aware error correction on the reads?
Once the authors obtain the partitions of long-read data for each individual, they can perform haplotype-aware error correction on the reads.