Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly
Frequently Asked Questions (16)
Q2. Why did the authors add the modeled centromeres to the reference assembly?
The authors added the modeled centromeres to the reference assembly to serve as catalysts for analyses of these biologically important and highly variant genomic regions, as annotation targets, and to act as read sinks for centromere-containing reads in mapping analyses (Miga et al. 2015).
Q3. Why did the authors seek to identify and correct erroneous reference bases?
Erroneous reference bases, estimated to occur at a rate of 10⁻⁵ (International Human Genome Sequencing Consortium 2004), can result in incorrect variant calls, complicate gene annotation, and, in the case of indels, complicate read alignments. The authors therefore sought to identify and correct such sites.
Q4. Why did the authors use FRC curves to evaluate compression and expansion in each assembly?
Because repetitive sequences have typically been prone to collapse in WGS assemblies, the authors also used FRC curves to evaluate compression and expansion in each of the assemblies.
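To illustrate the general idea behind a feature-response curve (FRC), the sketch below orders contigs from longest to shortest and records how much of the genome is covered as flagged features accumulate. It is a minimal sketch, not the authors' procedure: the answer above does not say how the curves were computed, and the function name, the toy contigs, and the use of per-contig compression/expansion counts as "features" are all assumptions.

```python
from typing import List, Tuple


def frc_curve(contigs: List[Tuple[int, int]], genome_size: int) -> List[Tuple[int, float]]:
    """Return (cumulative feature count, fraction of genome covered) points.

    contigs: hypothetical (length_bp, feature_count) pairs, where a feature
             could be a flagged compression or expansion.
    genome_size: assumed genome length in bp, used to normalise coverage.
    """
    # Visit contigs from longest to shortest, as in the usual FRC construction.
    ordered = sorted(contigs, key=lambda c: c[0], reverse=True)
    points = []
    features = 0
    covered = 0
    for length, n_features in ordered:
        features += n_features
        covered += length
        points.append((features, covered / genome_size))
    return points


# Toy input: three contigs on a 1,000 bp "genome" (illustrative values only).
toy_points = frc_curve([(100, 1), (300, 0), (200, 2)], genome_size=1000)
for feature_count, coverage in toy_points:
    print(feature_count, round(coverage, 3))
```

An assembly whose curve reaches high coverage while accumulating few features would be judged better under this measure than one that needs many flagged features to reach the same coverage.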
Q5. Why are new reference-quality sequence sources needed?
New reference-quality sequence sources are needed, because generation of finished sequence from clone libraries is in significant decline due to cost and some remaining assembly gaps occur in regions recalcitrant to cloning.
Q6. What role does the human reference genome assembly play?
The human reference genome assembly, initially released more than a decade ago, remains at the nexus of basic and clinical research.
Q7. Why did the authors examine other facets of the assemblies?
Because variant calling is only one use case for the reference assembly, the authors also examined other facets of these de novo assemblies.
Q8. What are the challenges and limitations of de novo assemblies?
The de novo assemblies also demonstrate the challenges and limitations in transforming data associated with repetitive or complex genomic regions from a rich graph-based assembler representation to a narrower linear assembly representation.
Q9. How many previously unmapped reads now map to the GRCh38 primary assembly?
Although the GRCh37 primary assembly is an excellent mapping target, with 99.92% of reads aligned, the authors find that 64.32% of the reads left unmapped on GRCh37 now map to the GRCh38 primary assembly.
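As a rough worked example of what these two percentages imply together, the sketch below assumes that both figures describe the same read set and that every read aligned to GRCh37 stays aligned on GRCh38. Neither assumption is stated in the answer above, the function name is illustrative, and the result is an implied value rather than a reported one.

```python
def combined_mapped_fraction(mapped_before: float, rescued_fraction: float) -> float:
    """Fraction of reads mapped once a share of previously unmapped reads is rescued.

    Assumes every read that mapped before still maps (an assumption,
    not a result reported in the text).
    """
    unmapped_before = 1.0 - mapped_before
    return mapped_before + unmapped_before * rescued_fraction


grch37_mapped = 0.9992  # 99.92% of reads aligned to GRCh37 (from the text)
rescued = 0.6432        # 64.32% of GRCh37-unmapped reads map to GRCh38 (from the text)

grch38_mapped = combined_mapped_fraction(grch37_mapped, rescued)
print(f"Implied GRCh38 mapped fraction: {grch38_mapped:.4%}")
```

Under these assumptions, the rescued reads amount to about 0.05% of all reads, which is why the gain looks small in absolute terms even though most previously unmapped reads find a placement.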
Q10. What are the expected changes in the modeled sequences?
The authors anticipate that these modeled sequences will be updated in future assembly versions as new sequencing and assembly technologies make it possible to provide longer-range representations for these regions.
Q11. Which genes do the lists of coplaced genes also include?
These lists also include haplotype-specific or copy-number variant genes, for which coplacement occurs when they are absent from the sample haplotype.
Q12. How did the authors preserve the assembly representation of genes for which the CHM1 haplotype is deleted?
Wherever possible, the authors preserved the assembly representation of genes for which the CHM1 haplotype is deleted by adding components containing these genes to alternate loci scaffolds.
Q13. How did the authors determine the impact of assembly updates on read mappings in the 2.6 Gbp of unchanged reference sequence?
Although assembly updates are expected to alter read alignments in changed regions, the authors also investigated their impact on read mappings in the 2.6 Gbp of unchanged reference sequence, using a script written for this purpose (Supplemental Code).
Q14. How many fewer transcripts are dropped from the CHM1_1.1 assembly due to coplacement?
There are 35%–40% fewer transcripts dropped from the CHM1_1.1 assembly due to coplacement than from the FALCON or Celera Assembler CHM1 assemblies, indicating that assembly method has a substantial impact on gene representation.
Q15. What is a major change in the content of the reference genome assembly?
A major change in the content of the reference genome assembly is the replacement of the 3-Mbp centromeric gaps on all GRCh37 chromosomes with modeled centromeres from the LinearCen1.1 (normalized) assembly, derived from a database of centromeric sequences from the HuRef genome (GCA_000442335.2).
Q16. What is the role of the reference assembly in the evolution of genome biology?
The reference assembly provides context for both the scale and types of variation that will be observed from one sample to the next.