
Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly

Abstract
The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009 and reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that while the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.

Valerie A. Schneider,¹ Tina Graves-Lindsay,² Kerstin Howe,³ Nathan Bouk,¹ Hsiu-Chuan Chen,¹ Paul A. Kitts,¹ Terence D. Murphy,¹ Kim D. Pruitt,¹ Françoise Thibaud-Nissen,¹ Derek Albracht,² Robert S. Fulton,² Milinn Kremitzki,² Vincent Magrini,²,¹⁰ Chris Markovic,² Sean McGrath,² Karyn Meltz Steinberg,² Kate Auger,³ William Chow,³ Joanna Collins,³ Glenn Harden,³ Timothy Hubbard,³,¹¹ Sarah Pelan,³ Jared T. Simpson,³,¹²,¹³ Glen Threadgold,³ James Torrance,³ Jonathan M. Wood,³ Laura Clarke,⁴ Sergey Koren,⁵ Matthew Boitano,⁶ Paul Peluso,⁶ Heng Li,⁷ Chen-Shan Chin,⁶ Adam M. Phillippy,⁵ Richard Durbin,³ Richard K. Wilson,² Paul Flicek,⁴ Evan E. Eichler,⁸,⁹ and Deanna M. Church¹,¹⁴

¹National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA; ²McDonnell Genome Institute at Washington University, St. Louis, Missouri 63018, USA; ³Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom; ⁴European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom; ⁵National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA; ⁶Pacific Biosciences, Menlo Park, California 94025, USA; ⁷Broad Institute, Cambridge, Massachusetts 02142, USA; ⁸Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA; ⁹Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
[Supplemental material is available for this article.]
Present addresses: ¹⁰Nationwide Children's Hospital, Columbus, OH 43205, USA; ¹¹King's College London, London WC2R 2LS, UK; ¹²Ontario Institute for Cancer Research, Toronto, Ontario, Canada M5G 0A3; ¹³Department of Computer Science, University of Toronto, Toronto, Ontario, Canada M5S 2E4; ¹⁴10X Genomics, Pleasanton, CA 94566, USA
Corresponding author: schneiva@ncbi.nlm.nih.gov
Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.213611.116. Freely available online through the Genome Research Open Access option.
© 2017 Schneider et al. This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.
Resource. Genome Research 27:849–864. Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/17; www.genome.org

The human reference genome assembly remains a critical resource for the biological and clinical research communities (International Human Genome Sequencing Consortium 2001, 2004). It is distinguished from the growing number of human genome assemblies in public databases by virtue of its long contig and scaffold N50s, high base-pair accuracy, and robust representations of repetitive and segmentally duplicated genomic regions, all of which reflect the clone-based assembly approach and Sanger sequencing methods that were the basis of its generation. In particular, it was the use of large insert BAC clones (>150 kb inserts) and the deep coverage provided by multiple end-sequenced clone libraries, coupled with extensive use of radiation hybrid, genetic linkage, and fingerprint maps, that made it possible to span large repetitive regions and achieve the as-yet unsurpassed contiguity of the reference. Assembled from the DNA of multiple donors, the reference was intended to provide representation for the pan-human genome, rather than a single individual or population group, and is a mosaic of haplotypes whose borders coincide with the underlying clone boundaries.
A revision to the assembly model, first used in the previous version of the reference, GRCh37 (GCA_000001405.1), expanded the ability of the reference assembly to represent the extent of structural variation and population genomic diversity whose discovery it facilitated (The International HapMap Consortium 2005; Kidd et al. 2008; Sudmant et al. 2010; Church et al. 2011; The 1000 Genomes Project Consortium 2015). The introduction of alternate loci scaffolds enabled GRCh37 to include additional sequence representations for the highly variant MHC region, as well as the divergent haplotypes of the MAPT and UGT2B loci, while retaining the linear chromosome representations familiar and intuitive to most users (Horton et al. 2008; Xue et al. 2008; Zody et al. 2008). A second feature of the updated model, assembly patches, permitted subsequent corrections and addition of new sequence representations to the GRCh37 assembly without changing the chromosome sequences or coordinates on which an increasing volume of data were being mapped (Zook et al. 2014; The 1000 Genomes Project Consortium 2015; Pierson et al. 2015). The assembly model remains for GRCh38, the current reference version. Together, these features of the assembly model helped ensure that the human reference assembly would continue to present the most accurate representation of the human genome possible while providing a stable substrate for large-scale analysis.
The GRCh37 assembly underwent 13 patch releases in the period from 2009 to 2013 (GCA_000001405.2–GCA_000001405.14). Despite the availability of these sequences in public databases, their use has been limited by the inability of common bioinformatics file formats and tool chains to manage the allelic duplication they introduce, as well as by their constrained representation in popular genome browsers (Church et al. 2015). In addition, the patches represented only a subset of the assembly updates made by the Genome Reference Consortium (GRC). Thus, coordinate-changing assembly updates remain essential for users to access the full suite of assembly improvements, despite the challenge of transporting data and results to the new assembly (Hickey et al. 2013; Zhao et al. 2014).
In producing GRCh38, we of the GRC placed special emphasis on addressing the following types of assembly issues found in GRCh37: (1) resolution of tiling path errors and gaps associated with complex haplotypes and segmental duplications; (2) base-pair-level updates for sequencing errors; (3) addition of missing sequences, with an emphasis on paralogous sequences and population variation; and (4) providing sequence representation for genomic features, such as centromeres and telomeres. Making these updates involved the use of bioinformatics and experimental resources and techniques not previously available. We will demonstrate how the new approaches used in this effort result in a human reference genome assembly that is more contiguous and complete than ever before and that provides better gene and variant representation than GRCh37, features critical to both basic research and clinical uses of the assembly. We will also show how assembly updates in GRCh38 impact analyses throughout the genome, even in regions that are unchanged between the two assemblies. Together, these analyses suggest adoption of the new assembly will have a positive impact on both genome-wide and regional analyses.
With long-range sequencing and assembly technologies making the generation of highly contiguous whole-genome de novo assemblies possible, the overall value of GRCh38, and of the human reference genome assembly in general, must now also be considered (Chaisson et al. 2015b). The reference assembly is not just a substrate for alignment, but is also the coordinate system on which we annotate our biological knowledge. Several recently published individual human de novo assemblies have been favorably compared to GRCh38 with respect to continuity metrics, and although they each contain sequence not present in the reference assembly, none yet surpass the global quality of GRCh38 (Li et al. 2010; Steinberg et al. 2014; Berlin et al. 2015; Cao et al. 2015; Pendleton et al. 2015; Seo et al. 2016; Shi et al. 2016). Such assemblies are often suggested as sequence sources for use in closure of reference assembly gaps, whereas other studies have called for one or more individual genomes to replace the reference (Rosenfeld et al. 2012). To address these issues, we generated and evaluated a collection of de novo assemblies representing the essentially haploid complete hydatidiform mole samples CHM1 and CHM13 (Fan et al. 2002; Steinberg et al. 2014). The assemblies were derived from the same sequence data, but assembled using different algorithms and/or parameters, and assessed with a range of assembly metrics with respect to each other and GRCh38. To our knowledge, these efforts represent the first such assessment performed specifically to explore the suitability of de novo assemblies for use in curation or replacement of the human reference assembly.
Results
Assembly updates
Upon the release of GRCh37.p13 in June 2013, the cumulative set of 204 patch scaffolds covered 3.15% of the chromosome assemblies, included >7 Mb of novel sequence, and met previously defined GRC criteria for the trigger of a major assembly release (Church et al. 2011). We submitted GRCh38, a coordinate-changing update of the human reference assembly, to the International Nucleotide Sequence Database Collaboration (INSDC) in December 2013 (GCA_000001405.15). Because the reference remains under active curation, we have subsequently provided quarterly GRCh38 patch releases, which do not affect the chromosome coordinates, the latest of which was GRCh38.p10 (GCA_000001405.25). The initial GRCh38 release represents the resolution of more than 1000 issues reported to the GRC tracking system, spanning all chromosomes and encompassing a variety of problem types, including gaps, component and tiling path errors, and variant representation (https://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/issues/) (Fig. 1). Genome-wide alignments of GRCh38 to GRCh37 reveal 11 Mb (0.37% of total length) of inverted sequence, whereas 75 Mb (2.3% of total length) of ungapped sequence in the new assembly has no alignment to GRCh37 (Supplemental Worksheet S3). In contrast, only 5 Mb (0.17%) of ungapped GRCh37 sequence has no alignment to GRCh38. As in previous assembly updates, we used finished, clone-based components for assembly updates wherever possible because of their high per-base accuracy and haploid representation of actual human sequence. With >95% of the chromosome total sequence and 98% of noncentromeric sequence derived from genomic clone components, the GRCh38 reference assembly chromosomes continue to provide a mosaic haploid representation of the human genome, rather than a consensus haploid representation. The sequence contribution from RP11, an anonymous male donor of likely African-European admixed ancestry, remains dominant (70%), but has decreased by 1.5% relative to the previous assembly version (Supplemental Fig. S1; Green et al. 2010, Supplementary Online Materials 16).
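The accession versions quoted so far follow a simple pattern: GCA_000001405.1 is GRCh37, versions .2 through .14 are its 13 patch releases, .15 is GRCh38, and .25 is GRCh38.p10. As a minimal sketch, the hypothetical helper below (release_label is our name, not part of any NCBI tool) maps an accession to a release label. It assumes that every patch release increments the version by exactly one, which is consistent with the values given in the text but is not stated there for other versions.

```python
import re

# Anchor points taken from the text: GRCh37 = GCA_000001405.1, GRCh38 = GCA_000001405.15.
# Assumption (not stated in the text): every patch release bumps the version by one.
BASE_VERSIONS = [("GRCh38", 15), ("GRCh37", 1)]  # newest first


def release_label(accession: str) -> str:
    """Map a GCA_000001405.N accession to a label such as 'GRCh37.p13'."""
    match = re.fullmatch(r"GCA_000001405\.(\d+)", accession)
    if match is None:
        raise ValueError(f"not a human reference accession: {accession!r}")
    version = int(match.group(1))
    for name, base in BASE_VERSIONS:
        if version >= base:
            patch = version - base
            return name if patch == 0 else f"{name}.p{patch}"
    raise ValueError(f"version {version} predates GRCh37")


if __name__ == "__main__":
    for acc in ("GCA_000001405.1", "GCA_000001405.14",
                "GCA_000001405.15", "GCA_000001405.25"):
        print(acc, "->", release_label(acc))
```

Under these assumptions the four accessions named in the text resolve to GRCh37, GRCh37.p13, GRCh38, and GRCh38.p10, respectively.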
Figure 1. Summary of GRCh38 updates. (A) Chart showing issues resolved for GRCh38 on each chromosome by issue type. Each issue represents a unique assembly evaluation and corresponding curation decision. (B) Changes in placed scaffold N50 length from GRCh37 to GRCh38. Changes on Chromosomes 5, 13, 19, and Y are <55 kbp each. (C) Addition of whole-genome sequencing components (orange bars) resolves a GRCh37 gap, consolidating the split annotation of INPP5D and restoring a missing exon (asterisk) in GRCh38. The default 50-kbp gap in GRCh37 greatly overestimates the actual amount of missing sequence (6 kbp). (D) Schematic of a curated collapse in GRCh38 Chr 10. Clones from two incompatible haplotypes (pink and light blue) were mixed in the GRCh37 tiling path, creating a false gap and segmental duplication involving the single copy genes TMEM236 and MRC1 (top). In GRCh38 (bottom), clones from the blue haplotype have been eliminated (200 kbp), closing the gap and providing the correct gene content.

Table 1 summarizes the GRCh38 assembly statistics of length, N50, and gaps relative to GRCh37 and several recently generated de novo assemblies. The GRCh38 assembly is longer and more contiguous than previous reference assembly versions (https://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/data/) (Fig. 1; Table 1). Although the total number of reference assembly gaps grew, increases occur when sequence added into a preexisting gap is not contiguous with either gap edge or when sequence additions are composed of scaffolded whole-genome sequencing (WGS) contigs. The increase in gap count in GRCh38 is largely attributable to the replacement of the single centromere gap in each chromosome with scaffolds of modeled sequence (described below), and WGS sequences flank more unspanned gaps and spanned gaps in GRCh38 than in GRCh37 (Supplemental Table S1). For more details of assembly gaps, see the Supplemental Notes and Supplemental Table S2.
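The statistics named above (total length, contig N50, scaffold N50, gap number, gap length) can be illustrated with a short sketch. It is not the procedure used to produce Table 1: it assumes scaffolds are given as plain strings in which every run of N is a gap, whereas the reference assembly defines gaps separately and Table 1 further distinguishes scaffold-breaking from nonbreaking gaps. The function names and the toy sequences are ours.

```python
import re
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Length L such that pieces of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def assembly_stats(scaffolds: Dict[str, str]) -> Dict[str, int]:
    """Summarize scaffolds whose gaps are written as runs of 'N' (an assumption)."""
    contig_lengths: List[int] = []
    gap_lengths: List[int] = []
    for seq in scaffolds.values():
        seq = seq.upper()
        # Contigs are the maximal gap-free stretches of a scaffold.
        contig_lengths += [len(c) for c in re.split(r"N+", seq) if c]
        gap_lengths += [len(g) for g in re.findall(r"N+", seq)]
    return {
        "total_length": sum(len(s) for s in scaffolds.values()),
        "contig_n50": n50(contig_lengths),
        "scaffold_n50": n50([len(s) for s in scaffolds.values()]),
        "gap_number": len(gap_lengths),
        "gap_length": sum(gap_lengths),
    }


if __name__ == "__main__":
    toy = {
        "scaf1": "ACGTACGTAC" + "N" * 5 + "ACGTAC",  # contigs of 10 and 6, one gap of 5
        "scaf2": "ACGTACGT",                          # a single gap-free contig of 8
    }
    print(assembly_stats(toy))
```

In the toy input, splitting scaf1 at its gap lowers the contig N50 (8) well below the scaffold N50 (21), the same qualitative relationship seen between the contig and scaffold N50 columns of Table 1. The sketch also makes the point in the paragraph above concrete: placing new sequence inside an existing N run without joining either edge turns one gap into two.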
The suite of updates provided in the GRCh38 assembly had a positive impact on assembly annotation. Comparison of the NCBI Homo sapiens annotation release 105 of GRCh37.p13 (https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/105/) and annotation release 106 of GRCh38 (https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/106/) shows an increase in the numbers of genes and protein coding transcripts, with a concomitant decrease in partially represented coding sequences and transcripts split over assembly gaps (Fig. 1; Table 2). Because the transcript content of these two annotation releases was not identical and may contribute to observed differences in the annotation statistics, we also aligned two large public annotation sets (GENCODE23 [basic] and RefSeq71) to the GRCh37 and GRCh38 full assemblies to gauge the impact of improvements on gene representation (Harrow et al. 2012; O'Leary et al. 2016). Similar to the previously described comparison, in GRCh38 we find that both annotation sets show increases in overall transcript alignments with a substantial decrease in split and low quality transcript alignments (Table 3; Supplemental Worksheet S1). We looked at the intersection of the transcripts with problematic alignments with two clinically relevant gene lists: a set of genes enriched for de novo loss of function mutations identified in Autism Spectrum Disorder (n = 1003) (Samocha et al. 2014) and a collection of genes preliminarily proposed for the development of a medical exome kit (n = 4623) (https://www.genomeweb.com/sequencing/emory-chop-harvard-develop-medical-exome-kit-complete-coverage-5k-disease-associ). Among the set of RefSeq transcripts with problematic alignments to GRCh37, we observed six gene overlaps with the former and 14 with the latter, whereas we found six and 22 for the GENCODE cohort (Supplemental Worksheet S1). The majority of these genes (RefSeq: n = 6/6 and n = 9/14 and GENCODE: n = 5/6 and n = 9/22, respectively) are no longer associated with transcript alignment issues in GRCh38, suggesting the newer assembly is a better substrate for clinical studies.
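The gene-list comparison just described amounts to set operations over genes flagged on each assembly. The sketch below is illustrative only: the record layout, the function names, and the toy gene names are hypothetical, and it covers only three of the problem categories (not aligned, split alignment, coverage <95%), leaving out the placement-based categories. It shows how counts of the form "resolved/overlapping", such as the 9/14 quoted above, can be derived.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class TranscriptAlignment:
    transcript: str
    gene: str
    n_best_alignments: int   # 0 = not aligned, >1 = split over several loci
    coverage: float          # fraction of the transcript (or CDS) that aligns


def problematic_genes(alignments: Iterable[TranscriptAlignment],
                      min_coverage: float = 0.95) -> Set[str]:
    """Genes with at least one unaligned, split, or low-coverage transcript."""
    return {
        a.gene for a in alignments
        if a.n_best_alignments != 1 or a.coverage < min_coverage
    }


def resolved_overlaps(old: Set[str], new: Set[str],
                      gene_lists: Dict[str, Set[str]]) -> Dict[str, str]:
    """For each gene list, report 'resolved/overlapping' gene counts."""
    report = {}
    for name, genes in gene_lists.items():
        overlap = old & genes      # problematic on the old assembly and on the list
        resolved = overlap - new   # no longer problematic on the new assembly
        report[name] = f"{len(resolved)}/{len(overlap)}"
    return report


if __name__ == "__main__":
    # Toy data with made-up gene names; not taken from the paper.
    grch37 = [
        TranscriptAlignment("tx1", "GENE_A", 2, 1.00),  # split alignment
        TranscriptAlignment("tx2", "GENE_B", 1, 0.80),  # coverage below 95%
        TranscriptAlignment("tx3", "GENE_C", 1, 0.99),  # clean
    ]
    grch38 = [
        TranscriptAlignment("tx1", "GENE_A", 1, 1.00),  # now clean
        TranscriptAlignment("tx2", "GENE_B", 1, 0.80),  # still low coverage
        TranscriptAlignment("tx3", "GENE_C", 1, 0.99),
    ]
    lists = {"toy_clinical_list": {"GENE_A", "GENE_B", "GENE_C"}}
    print(resolved_overlaps(problematic_genes(grch37),
                            problematic_genes(grch38), lists))
```

Working at the gene level rather than the transcript level mirrors how the overlaps are reported in the text; a gene counts as resolved only when none of its transcripts remains flagged on the newer assembly.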
Centromeres
A major change in the content of the reference genome assembly is the replacement of the 3-Mbp centromeric gaps on all GRCh37 chromosomes with modeled centromeres from the LinearCen1.1 (normalized) assembly, derived from a database of centromeric sequences from the HuRef genome (GCA_000442335.2) (Supplemental Methods; Levy et al. 2007; Miga et al. 2014). We added the modeled centromeres to the reference assembly to serve as catalysts for analyses of these biologically important and highly variant genomic regions, as annotation targets, and to act as read sinks for centromere-containing reads in mapping analyses (Miga et al. 2015). Consistent with our reasoning that such sequences may improve read alignments, 21.7% (by length) of the decoy sequence used in the 1000 Genomes Project to reduce spurious read mapping, and previously shown to improve variant calling (Li 2014), was identified by RepeatMasker as alpha-satellite centromeric repeat (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/) (The 1000 Genomes Project Consortium 2015). Each centromere model represents the variants and monomer ordering of the chromosome-specific alpha-satellite repeats in a manner proportional to that observed in the initial read database, but the long-range ordering of repeats is inferred. In contrast to the remainder of the chromosome sequence, in which each underlying clone component represents the actual haplotype of its source DNA, the modeled sequence is not an actual haplotype, but an averaged representation. The GRCh38 modeled centromeres also contain largely unordered and unoriented islands of euchromatic sequences that are taken from the same collection of HuRef sequences, as well as from genomic clones. One such island, in the modeled centromere for Chromosome 3, provides reference representation for a PRIM2 paralog (NCBI gene LOC101930420) that was missing in GRCh37 (Genovese et al. 2013a,b). Due to the modeled nature of these sequence representations, we suggest that variant and other analyses within these regions be treated independently of similar analyses made elsewhere in the genome. We anticipate that these modeled sequences will be updated in future assembly versions as new sequencing and assembly technologies make it possible to provide longer-range representations for these regions.

Table 1. Comparison of assembly statistics

Assembly short name  GenBank accession  Total length   Contig N50  Scaffold N50  Gap number        Gap length   QV
GRCh38ᵃ              GCA_000001405.15   3,209,286,105  56,413,054  67,794,873    349ᵇ; 526ᶜ; 124ᵈ  159,970,007  ND
GRCh37ᵃ              GCA_000001405.1    3,137,144,693  38,508,932  46,395,641    86ᵇ; 271ᶜ; 100ᵈ   239,850,738  ND
CHM1_1.1             GCA_000306695.2    3,037,866,619  143,936     50,362,920    225ᵇ; 40,665ᶜ     210,229,812  ND
CHM1_CA_P6           GCA_001307025.1    2,939,630,703  20,609,304  NA            0                 NA           42.29
CHM1_FC_P6           GCA_001297185.1    2,996,426,293  26,899,841  NA            0                 NA           44.64
CHM13_CA1            GCA_000983465.1    3,061,240,732  13,331,528  NA            0                 NA           41.21
CHM13_CA2            GCA_001015355.1    3,028,917,871  19,357,701  NA            0                 NA           39.86
CHM13_CA3            GCA_000983475.1    2,996,416,935  5,550,336   NA            0                 NA           42.89
CHM13_CA4            GCA_001015385.3    3,065,003,163  12,252,446  NA            0                 NA           41.27
CHM13_FC             GCA_000983455.2    2,941,135,618  10,549,591  NA            0                 NA           43.00

(QV) Quality value; (NA) not available; (ND) not determined.
ᵃ Values include alternate loci unless noted.
ᵇ Scaffold breaking gap.
ᶜ Nonbreaking gap (excludes alternate loci).
ᵈ Nonbreaking gap (alternate loci).
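The "by length" figure quoted for the decoy sequence in the Centromeres section is a covered-bases fraction: the number of bases annotated as a given repeat class divided by the sequence length. A minimal sketch of that calculation follows, assuming annotations are available as half-open (start, end) intervals on a single sequence. The function names and toy numbers are ours; this is not the analysis behind the 21.7% value, and it does not parse RepeatMasker output.

```python
from typing import Iterable, Optional, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one sequence


def merged_length(intervals: Iterable[Interval]) -> int:
    """Total bases covered by the union of possibly overlapping intervals."""
    covered = 0
    current_start: Optional[int] = None
    current_end: Optional[int] = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block before opening a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


def fraction_by_length(annotated: Iterable[Interval], sequence_length: int) -> float:
    """Fraction of a sequence covered by the annotated intervals."""
    return merged_length(annotated) / sequence_length


if __name__ == "__main__":
    # Toy annotation: two overlapping hits and one separate hit on a 1000-bp sequence.
    hits = [(0, 100), (50, 150), (300, 400)]
    print(f"{fraction_by_length(hits, 1000):.1%}")
```

Merging overlapping hits before summing avoids counting a base twice. For a multi-sequence set such as a decoy, the covered bases and the sequence lengths would each be summed across sequences before dividing.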
Retiling
Although a subset of missing sequences is associated with gaps deemed recalcitrant to cloning, segmental duplications or other complex genomic architectures are implicated in most remaining gaps or misassemblies (Bailey et al. 2001; Sharp et al. 2005; Chaisson et al. 2015a). In collaboration with various external groups, we identified and investigated reported path issues and associated assembly gaps using a combination of techniques, including optical maps (Teague et al. 2010; Howe and Wood 2015), Strand-seq (Falconer et al. 2012), admixture mapping (Genovese et al. 2013a), and reevaluation of component sequences and overlaps (Mueller et al. 2013). These analyses uncovered some substantial misassemblies in GRCh37 that spanned several megabases and many genes, including the regions at 1q21, 10q11, and a peri-centromeric inversion of Chromosome 9. Although we were able to improve or resolve some path problems through reordering of existing assembly components to match optical maps, we found that other approaches were needed at more complex regions where allelic and paralogous variation made it impossible to confidently define paths with clones representing a mosaic of diploid DNA sources. In these instances, we replaced GRCh37 components with new tiling paths composed of BAC clones representing the single haplotype of the essentially haploid CHM1 genome (Dennis et al. 2012; Steinberg et al. 2014), or on Chromosome X, with the single haplotype represented in RP11 (Mueller et al. 2013). We also retiled several genomic loci associated with immune responses (IGK, IGH, LRC-KIR, and the cytokine cluster on 17q) with CHM1 clones, replacing the unvalidated mosaic representations in GRCh37 and previous assembly versions to ensure the reference-provided representations of these clinically important regions that actually exist
Table 2. Summary of RefSeq Annotation Releases 105 and 106

                       NCBI Annotation Release 105ᵃ (GRCh37.p13)             NCBI Annotation Release 106ᵇ (GRCh38)
Feature                Full assemblyᶜ  Primary assembly  All alternate loci  Full assemblyᶜ  Primary assembly  All alternate loci
Genes and pseudogenes  40,158          39,947            428                 41,722          41,566            1981
mRNAs                  67,517          64,734            1360                69,826          67,793            3408
Other RNAs             15,063          14,151            443                 17,857          16,914            1152
CDSs                   68,035          65,099            1360                70,368          68,177            3564
Coverage <95%ᵈ         NA              65                NA                  NA              25                NA
Split alignmentsᵉ      NA              30                NA                  NA              3                 NA

ᵃ Entrez query date: August 3, 2013 (42,339 known RefSeqs [NM_/NR_]); https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/105/.
ᵇ Entrez query date: January 17, 2014 (45,911 known RefSeqs [NM_/NR_]); https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/106/.
ᶜ Features annotated on both the primary assembly and alternate loci are only counted once in the full assembly.
ᵈ Known NM_ and NR_ RefSeqs for which <95% of the CDS aligns to the genomic sequence.
ᵉ Known NM_ and NR_ RefSeqs with multiple best alignments (split genes).
Table 3. GENCODE 23 and RefSeq 71 alignments to GRCh37 and GRCh38

                    GENCODE 23ᵃ                                  RefSeq 71ᵃ
                    GRCh37 only  GRCh38 only  GRCh38 and GRCh37  GRCh37 only  GRCh38 only  GRCh38 and GRCh37
Not aligned
  Transcripts       86           0            122                15           0            1
  Genes             83           0            122                11           0            1
Split alignments
  Transcripts       61           5            21                 39           2            6
  Genes             34           5            19                 18           2            4
Coverage <95%ᵇ
  Transcripts       160          5            104                79           5            14
  Genes             103          5            100                41           4            13
Rejected placement
  Transcripts       65           2            86                 36           8            8
  Genes             56           2            84                 26           8            8
Dropped-conflictᶜ
  Transcripts       NA           NA           NA                 47           1            2
  Genes             NA           NA           NA                 45           1            2

ᵃ GENCODE: 92,193 transcripts; RefSeq: 50,337 transcripts.
ᵇ Coverage values were calculated for RefSeq CDS and GENCODE full-length transcripts.
ᶜ Dropped due to coplacement with another sequence having a different NCBI GeneID.
References
Initial sequencing and analysis of the human genome. Eric S. Lander et al. 15 Feb 2001.
The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data.
A global reference for human genetic variation. Adam Auton et al. 01 Oct 2015.
Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Heng Li. 16 Mar 2013.
An integrated map of genetic variation from 1,092 human genomes.