Why did the authors seek to identify and correct erroneous reference bases?

Because erroneous reference bases, estimated to occur at a rate of 10^-5 (International Human Genome Sequencing Consortium 2004), can result in incorrect variant calls, complicate gene annotation, and, in the case of indels, complicate read alignments, the authors sought to identify and correct such sites.

Why did the authors use FRC curves to evaluate compression and expansion in each assembly?

Because repetitive sequences have typically been prone to collapse in WGS assemblies, the authors also used FRC curves to evaluate compression and expansion in each of the assemblies.

Why are some gaps in the genome being created?

New reference-quality sequence sources are needed, because generation of finished sequence from clone libraries is in significant decline due to cost and some remaining assembly gaps occur in regions recalcitrant to cloning.

What is the current human reference genome assembly?

The human reference genome assembly, initially released more than a decade ago, remains at the nexus of basic and clinical research.

Why did the authors examine other facets of the assemblies?

Because variant calling is only one use case for the reference assembly, the authors also examined other facets of these de novo assemblies.

What are the challenges and limitations of de novo assemblies?

The de novo assemblies also demonstrate the challenges and limitations in transforming data associated with repetitive or complex genomic regions from a rich graph-based assembler representation to a narrower linear assembly representation.

How many reads are now mapped to the GRCh38 primary assembly?

Although the GRCh37 primary assembly is an excellent mapping target, with 99.92% of reads aligned, the authors find that 64.32% of the unmapped reads are now mapped to the GRCh38 primary assembly.
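The two percentages in this answer refer to different denominators, so a small worked calculation may help. The sketch below is illustrative only: the function name is hypothetical, and it uses exact fractions to avoid floating-point noise. It converts "64.32% of the unmapped reads" into a share of all reads.

```python
from fractions import Fraction


def newly_mapped_fraction(aligned_old: Fraction, rescued_of_unmapped: Fraction) -> Fraction:
    """Share of ALL reads that were unmapped on the old assembly but map to the new one."""
    unmapped_old = 1 - aligned_old
    return unmapped_old * rescued_of_unmapped


# Values quoted in the answer above.
aligned_grch37 = Fraction("0.9992")  # 99.92% of reads aligned to the GRCh37 primary assembly
rescued = Fraction("0.6432")         # 64.32% of the unmapped reads map to the GRCh38 primary assembly

gain = newly_mapped_fraction(aligned_grch37, rescued)
print(f"{float(gain):.4%} of all reads gain an alignment")
```

Under these figures the gain is roughly 0.05% of all reads: small in relative terms, because the GRCh37 mapping rate was already high.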

What are the list of haplotypes that are coplaced?

These lists also include haplotype-specific or copynumber variant genes, for which coplacement occurs when they are absent from the sample haplotype.

How did the authors preserve the assembly representation of genes for which the CHM1 haplotype is deleted?

Wherever possible, the authors preserved the assembly representation of genes for which the CHM1 haplotype is deleted by adding components containing these genes to alternate loci scaffolds.

How did the authors determine the impact of assembly updates on read mappings in the 2.6 Gbp of unchanged reference sequence?

Although assembly updates are expected to alter read alignments in changed regions, the authors also investigated their impact on read mappings in the 2.6 Gbp of unchanged reference sequence, using a script written for this purpose (Supplemental Code).

How many fewer transcripts are dropped from the CHM1_1.1 assembly due to coplacement?

There are 35%–40% fewer transcripts dropped from the CHM1_1.1 assembly due to coplacement than from the FALCON or Celera Assembler CHM1 assemblies, indicating that assembly method has a substantial impact on gene representation.

What is the role of the reference assembly in the evolution of genome biology?

The reference assembly provides context for both the scale and types of variation that will be observed from one sample to the next.

(Open Access) Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly (2016) | Valerie A. Schneider

Q: Why did the authors add the modeled centromeres to the reference assembly?

The authors added the modeled centromeres to the reference assembly to serve as catalysts for analyses of these biologically important and highly variant genomic regions, as annotation targets, and to act as read sinks for centromere-containing reads in mapping analyses (Miga et al. 2015).

Q: What are the expected changes in the modeled sequences?

The authors anticipate that these modeled sequences will be updated in future assembly versions as new sequencing and assembly technologies make it possible to provide longer-range representations for these regions.

Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly

Valerie A. Schneider, Tina Graves-Lindsay, Kerstin Howe, Nathan Bouk, Hsiu-Chuan Chen, Paul A. Kitts, Terence D. Murphy, Kim D. Pruitt, Françoise Thibaud-Nissen, Derek Albracht, Robert S. Fulton, Milinn Kremitzki, Vincent Magrini [2,10], Chris Markovic, Sean McGrath, Karyn Meltz Steinberg, Kate Auger, William Chow, Joanna Collins, Glenn Harden, Timothy Hubbard [3,11], Sarah Pelan, Jared T. Simpson [3,12,13], Glen Threadgold, James Torrance, Jonathan M. Wood, Laura Clarke, Sergey Koren, Matthew Boitano, Paul Peluso, Heng Li, Chen-Shan Chin, Adam M. Phillippy, Richard Durbin, Richard K. Wilson, Paul Flicek, Evan E. Eichler [8,9], and Deanna M. Church [1,14]

1 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA;
2 McDonnell Genome Institute at Washington University, St. Louis, Missouri 63018, USA;
3 Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom;
4 European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom;
5 National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA;
6 Pacific Biosciences, Menlo Park, California 94025, USA;
7 Broad Institute, Cambridge, Massachusetts 02142, USA;
8 Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA;
9 Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA

The human reference genome assembly plays a central role in nearly all aspects of today’s basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.

[Supplemental material is available for this article.]

Present addresses: 10 Nationwide Children’s Hospital, Columbus, OH 43205, USA; 11 King’s College London, London WC2R 2LS, UK; 12 Ontario Institute for Cancer Research, Toronto, Ontario, Canada M5G 0A3; 13 Department of Computer Science, University of Toronto, Toronto, Ontario, Canada M5S 2E4; 14 10X Genomics, Pleasanton, CA 94566, USA

Corresponding author: schneiva@ncbi.nlm.nih.gov

Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.213611.116. Freely available online through the Genome Research Open Access option. This article is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.

Resource

27:849–864 Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/17; www.genome.org Genome Research 849

Downloaded from genome.cshlp.org on August 9, 2022. Published by Cold Spring Harbor Laboratory Press.

The human reference genome assembly remains a critical resource for the biological and clinical research communities (International Human Genome Sequencing Consortium 2001, 2004). It is distinguished from the growing number of human genome assemblies in public databases by virtue of its long contig and scaffold N50s, high base-pair accuracy, and robust representations of repetitive and segmentally duplicated genomic regions, all of which reflect the clone-based assembly approach and Sanger sequencing methods that were the basis of its generation. In particular, it was the use of large insert BAC clones (>150 kb inserts) and the deep coverage provided by multiple end-sequenced clone libraries, coupled with extensive use of radiation hybrid, genetic linkage, and fingerprint maps, that made it possible to span large repetitive regions and achieve the as-yet unsurpassed contiguity of the reference. Assembled from the DNA of multiple donors, the reference was intended to provide representation for the pan-human genome, rather than a single individual or population group, and is a mosaic of haplotypes whose borders coincide with the underlying clone boundaries.

A revision to the assembly model, first used in the previous version of the reference, GRCh37 (GCA_000001405.1), expanded the ability of the reference assembly to represent the extent of structural variation and population genomic diversity whose discovery it facilitated (The International HapMap Consortium 2005; Kidd et al. 2008; Sudmant et al. 2010; Church et al. 2011; The 1000 Genomes Project Consortium 2015). The introduction of alternate loci scaffolds enabled GRCh37 to include additional sequence representations for the highly variant MHC region, as well as the divergent haplotypes of the MAPT and UGT2B loci, while retaining the linear chromosome representations familiar and intuitive to most users (Horton et al. 2008; Xue et al. 2008; Zody et al. 2008). A second feature of the updated model, assembly patches, permitted subsequent corrections and addition of new sequence representations to the GRCh37 assembly without changing the chromosome sequences or coordinates on which an increasing volume of data were being mapped (Zook et al. 2014; The 1000 Genomes Project Consortium 2015; Pierson et al. 2015). The assembly model remains for GRCh38, the current reference version. Together, these features of the assembly model helped ensure that the human reference assembly would continue to present the most accurate representation of the human genome possible while providing a stable substrate for large-scale analysis.

The GRCh37 assembly underwent 13 patch releases in the period from 2009 to 2013 (GCA_000001405.2–GCA_000001405.14). Despite the availability of these sequences in public databases, their use has been limited by the inability of common bioinformatics file formats and tool chains to manage the allelic duplication they introduce, as well as by their constrained representation in popular genome browsers (Church et al. 2015). In addition, the patches represented only a subset of the assembly updates made by the Genome Reference Consortium (GRC). Thus, coordinate-changing assembly updates remain essential for users to access the full suite of assembly improvements, despite the challenge of transporting data and results to the new assembly (Hickey et al. 2013; Zhao et al. 2014).

In producing GRCh38, we of the GRC placed special emphasis on addressing the following types of assembly issues found in GRCh37: (1) resolution of tiling path errors and gaps associated with complex haplotypes and segmental duplications; (2) base-pair–level updates for sequencing errors; (3) addition of “missing” sequences, with an emphasis on paralogous sequences and population variation; and (4) providing sequence representation for genomic features, such as centromeres and telomeres. Making these updates involved the use of bioinformatics and experimental resources and techniques not previously available. We will demonstrate how the new approaches used in this effort result in a human reference genome assembly that is more contiguous and complete than ever before and that provides better gene and variant representation than GRCh37, features critical to both basic research and clinical uses of the assembly. We will also show how assembly updates in GRCh38 impact analyses throughout the genome, even in regions that are unchanged between the two assemblies. Together, these analyses suggest adoption of the new assembly will have a positive impact on both genome-wide and regional analyses.

With long-range sequencing and assembly technologies making the generation of highly contiguous whole-genome de novo assemblies possible, the overall value of GRCh38, and the human reference genome assembly in general, must now also be considered (Chaisson et al. 2015b). The reference assembly is not just a substrate for alignment, but is also the coordinate system on which we annotate our biological knowledge. Several recently published individual human de novo assemblies have been favorably compared to GRCh38 with respect to continuity metrics, and although they each contain sequence not present in the reference assembly, none yet surpass the global quality of GRCh38 (Li et al. 2010; Steinberg et al. 2014; Berlin et al. 2015; Cao et al. 2015; Pendleton et al. 2015; Seo et al. 2016; Shi et al. 2016). Such assemblies are often suggested as sequence sources for use in closure of reference assembly gaps, whereas other studies have called for one or more individual genomes to replace the reference (Rosenfeld et al. 2012). To address these issues, we generated and evaluated a collection of de novo assemblies representing the essentially haploid complete hydatidiform mole samples CHM1 and CHM13 (Fan et al. 2002; Steinberg et al. 2014). The assemblies were derived from the same sequence data, but assembled using different algorithms and/or parameters, and assessed with a range of assembly metrics with respect to each other and GRCh38. To our knowledge, these efforts represent the first such assessment performed specifically to explore the suitability of de novo assemblies for use in curation or replacement of the human reference assembly.

Results

Assembly updates

Upon the release of GRCh37.p13 in June 2013, the cumulative set of 204 patch scaffolds covered 3.15% of the chromosome assemblies, included >7 Mb of novel sequence, and met previously defined GRC criteria for the trigger of a major assembly release (Church et al. 2011). We submitted GRCh38, a coordinate-changing update of the human reference assembly, to the International Nucleotide Sequence Database Collaboration (INSDC) in December 2013 (GCA_000001405.15). Because the reference remains under active curation, we have subsequently provided quarterly GRCh38 patch releases, which do not affect the chromosome coordinates, the latest of which was GRCh38.p10 (GCA_000001405.25). The initial GRCh38 release represents the resolution of more than 1000 issues reported to the GRC tracking system, spanning all chromosomes and encompassing a variety of problem types, including gaps, component and tiling path errors, and variant representation (https://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/issues/) (Fig. 1). Genome-wide alignments of GRCh38 to GRCh37 reveal 11 Mb (0.37% of total length) of inverted sequence, whereas 75 Mb (2.3% of total length) of ungapped sequence in the new assembly has no alignment to GRCh37 (Supplemental Worksheet S3). In contrast, only 5 Mb (0.17%) of ungapped GRCh37 sequence has no alignment to GRCh38. As in previous assembly updates, we used finished, clone-based components for assembly updates wherever possible because of their high per-base accuracy and haploid representation of actual human sequence. With >95% of the chromosome total sequence and 98% of noncentromeric sequence derived from genomic clone components, the GRCh38 reference assembly chromosomes continue to provide a mosaic haploid representation of the human genome, rather than a consensus haploid representation. The sequence contribution from RP11, an anonymous male donor of likely African-European admixed ancestry, remains dominant (∼70%), but has decreased by ∼1.5% relative to the previous assembly version (Supplemental Fig. S1; Green et al. 2010, Supplementary Online Materials 16).

Figure 1. Summary of GRCh38 updates. (A) Chart showing issues resolved for GRCh38 on each chromosome by issue type. Each issue represents a unique assembly evaluation and corresponding curation decision. (B) Changes in placed scaffold N50 length from GRCh37 to GRCh38. Changes on Chromosomes 5, 13, 19, and Y are <55 kbp each. (C) Addition of whole-genome sequencing components (orange bars) resolves a GRCh37 gap, consolidating the split annotation of INPP5D and restoring a missing exon (asterisk) in GRCh38. The default 50-kbp gap in GRCh37 greatly overestimates the actual amount of missing sequence (∼6 kbp). (D) Schematic of a curated collapse in GRCh38 Chr 10. Clones from two incompatible haplotypes (pink and light blue) were mixed in the GRCh37 tiling path, creating a false gap and segmental duplication involving the single copy genes TMEM236 and MRC1 (top). In GRCh38 (bottom), clones from the blue haplotype have been eliminated (∼200 kbp), closing the gap and providing the correct gene content.

Table 1 summarizes the GRCh38 assembly statistics of length, N50 and gaps relative to GRCh37, and several recently generated de novo assemblies. The GRCh38 assembly is longer and more contiguous than previous reference assembly versions (https://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/data/) (Fig. 1; Table 1). Although the total number of reference assembly gaps grew, increases occur when sequence added into a preexisting gap is not contiguous with either gap edge or when sequence additions are comprised of scaffolded whole-genome sequencing (WGS) contigs. The increase in gap count in GRCh38 is largely attributable to the replacement of the single centromere gap in each chromosome with scaffolds of modeled sequence (described below), and WGS sequences flank more unspanned gaps and spanned gaps in GRCh38 than in GRCh37 (Supplemental Table S1). For more details of assembly gaps, see the Supplemental Notes and Supplemental Table S2.
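The observation that adding sequence can raise the gap count is easy to see on a toy example. The sketch below is an illustration, not the GRC's own gap accounting: it assumes gaps are represented as maximal runs of `N` in a sequence string, and the sequences are invented.

```python
import re


def count_gaps(sequence: str) -> int:
    """Count gaps, modelled here as maximal runs of 'N' (an assumption of this sketch)."""
    return len(re.findall(r"N+", sequence.upper()))


# One gap between two flanking contigs.
before = "ACGT" + "N" * 10 + "TTGA"

# New sequence is placed inside the gap but joins neither gap edge,
# so the single gap is split into two.
after = "ACGT" + "N" * 3 + "GGCC" + "N" * 3 + "TTGA"

print(count_gaps(before), count_gaps(after))
```

In this toy case the assembly gains sequence and shrinks its total gap length, yet its gap count goes from one to two, which is the effect described for GRCh38.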

The suite of updates provided in the GRCh38 assembly had a positive impact on assembly annotation. Comparison of the NCBI Homo sapiens annotation release 105 of GRCh37.p13 (https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/105/) and annotation release 106 of GRCh38 (https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/106/) shows an increase in the numbers of genes and protein coding transcripts, with a concomitant decrease in partially represented coding sequences and transcripts split over assembly gaps (Fig. 1; Table 2). Because the transcript content of these two annotation releases was not identical and may contribute to observed differences in the annotation statistics, we also aligned two large public annotation sets (GENCODE23 [basic] and RefSeq71) to the GRCh37 and GRCh38 full assemblies to gauge the impact of improvements on gene representation (Harrow et al. 2012; O’Leary et al. 2016). Similar to the previously described comparison, in GRCh38 we find that both annotation sets show increases in overall transcript alignments with a substantial decrease in split and low quality transcript alignments (Table 3; Supplemental Worksheet S1). We looked at the intersection of the transcripts with problematic alignments with two clinically relevant gene lists: a set of genes enriched for de novo loss of function mutations identified in Autism Spectrum Disorder (n = 1003) (Samocha et al. 2014) and a collection of genes preliminarily proposed for the development of a medical exome kit (n = 4623) (https://www.genomeweb.com/sequencing/emory-chop-harvard-develop-medical-exome-kit-complete-coverage-5k-disease-associ). Among the set of RefSeq transcripts with problematic alignments to GRCh37, we observed six gene overlaps with the former and 14 with the latter, whereas we found six and 22 for the GENCODE cohort (Supplemental Worksheet S1). The majority of these genes (RefSeq: n = 6/6 and n = 9/14 and GENCODE: n = 5/6 and n = 9/22, respectively) are no longer associated with transcript alignment issues in GRCh38, suggesting the newer assembly is a better substrate for clinical studies.
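The gene-list comparison above amounts to two set operations: intersect the genes with problematic alignments against a clinical gene list, then subtract those still problematic on the newer assembly. The following sketch shows the idea only; the function names are hypothetical and the gene symbols are placeholders, not genes reported in the paper.

```python
def overlapping_genes(problem_genes, gene_list):
    """Genes that have a problematic alignment and also appear on the clinical list."""
    return set(problem_genes) & set(gene_list)


def resolved_genes(problem_old, problem_new, gene_list):
    """Overlapping genes flagged on the old assembly but no longer flagged on the new one."""
    return overlapping_genes(problem_old, gene_list) - set(problem_new)


# Hypothetical placeholder symbols, for illustration only.
clinical_list = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
problem_grch37 = {"GENE_A", "GENE_B", "GENE_X"}
problem_grch38 = {"GENE_B", "GENE_Y"}

old_hits = overlapping_genes(problem_grch37, clinical_list)
fixed = resolved_genes(problem_grch37, problem_grch38, clinical_list)
print(f"{len(fixed)}/{len(old_hits)} overlapping genes resolved")
```

Counts written as "n = 9/14" in the text correspond to `len(fixed)` over `len(old_hits)` in this sketch.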

Centromeres

A major change in the content of the reference genome assembly is the replacement of the 3-Mbp centromeric gaps on all GRCh37 chromosomes with modeled centromeres from the LinearCen1.1 (normalized) assembly, derived from a database of centromeric sequences from the HuRef genome (GCA_000442335.2) (Supplemental Methods; Levy et al. 2007; Miga et al. 2014). We added the modeled centromeres to the reference assembly to serve as catalysts for analyses of these biologically important and highly variant genomic regions, as annotation targets, and to act as read sinks for centromere-containing reads in mapping analyses (Miga et al. 2015). Consistent with our reasoning that such sequences may improve read alignments, 21.7% (by length) of the “decoy” sequence used in the 1000 Genomes Project to reduce spurious read mapping, and previously shown to improve variant calling (Li 2014), was identified by RepeatMasker as alpha-satellite centromeric repeat (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/) (The 1000 Genomes Project Consortium 2015). Each centromere model represents the variants and monomer ordering of the chromosome-specific alpha-satellite repeats in a manner proportional to that observed in the initial read database, but the long-range ordering of repeats is inferred. In contrast to the remainder of the chromosome sequence, in which each underlying clone component represents the actual haplotype of its source DNA, the modeled sequence is not an actual haplotype, but an averaged representation. The GRCh38 modeled centromeres also contain largely unordered and unoriented islands of euchromatic sequences that are taken from the same collection of HuRef sequences, as well as from genomic clones. One such island, in the modeled centromere for Chromosome 3, provides reference representation for a PRIM2 paralog (NCBI gene LOC101930420) that was missing in GRCh37 (Genovese et al. 2013a,b). Due to the modeled nature of these sequence representations, we suggest that variant and other analyses within these regions be treated independently of similar analyses made elsewhere in the genome. We anticipate that these modeled sequences will be updated in future assembly versions as new sequencing and assembly technologies make it possible to provide longer-range representations for these regions.

Table 1. Comparison of assembly statistics

Assembly short name | GenBank accession | Total length | Contig N50 | Scaffold N50 | Gap number | Gap length | QV
GRCh38 | GCA_000001405.15 | 3,209,286,105 | 56,413,054 | 67,794,873 | 349 / 526 / 124 | 159,970,007 | ND
GRCh37 | GCA_000001405.1 | 3,137,144,693 | 38,508,932 | 46,395,641 | 86 / 271 / 100 | 239,850,738 | ND
CHM1_1.1 | GCA_000306695.2 | 3,037,866,619 | 143,936 | 50,362,920 | 225 / 40,665 | 210,229,812 | ND
CHM1_CA_P6 | GCA_001307025.1 | 2,939,630,703 | 20,609,304 | NA | 0 | NA | 42.29
CHM1_FC_P6 | GCA_001297185.1 | 2,996,426,293 | 26,899,841 | NA | 0 | NA | 44.64
CHM13_CA1 | GCA_000983465.1 | 3,061,240,732 | 13,331,528 | NA | 0 | NA | 41.21
CHM13_CA2 | GCA_001015355.1 | 3,028,917,871 | 19,357,701 | NA | 0 | NA | 39.86
CHM13_CA3 | GCA_000983475.1 | 2,996,416,935 | 5,550,336 | NA | 0 | NA | 42.89
CHM13_CA4 | GCA_001015385.3 | 3,065,003,163 | 12,252,446 | NA | 0 | NA | 41.27
CHM13_FC | GCA_000983455.2 | 2,941,135,618 | 10,549,591 | NA | 0 | NA | 43.00

(QV) Quality value; (NA) not available; (ND) not determined.
Values include alternate loci unless noted.
Gap numbers, in order: scaffold breaking gap / nonbreaking gap (excludes alternate loci) / nonbreaking gap (alternate loci).
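Two of the Table 1 columns are derived statistics, and a short sketch may make them concrete. The code below uses the common definition of N50 (the length L such that sequences of length at least L contain half of the assembled bases) and assumes that QV is Phred-scaled, that is, QV = -10 log10(per-base error probability). That scaling is an assumption of this sketch and is not stated in the text above; the contig lengths are invented.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the total bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


def qv_to_error_rate(qv):
    """Per-base error probability for a Phred-scaled quality value (assumed scaling)."""
    return 10 ** (-qv / 10)


# Toy contig lengths, for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))

# A QV taken from Table 1 (CHM1_FC_P6), converted under the Phred assumption.
print(f"{qv_to_error_rate(44.64):.2e}")
```

Because N50 depends on where contigs or scaffolds are broken, the same sequence data can yield quite different contig and scaffold N50 values, as the CHM1_1.1 row of Table 1 shows.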

Retiling

Although a subset of missing sequences is associated with gaps deemed recalcitrant to cloning, segmental duplications or other complex genomic architectures are implicated in most remaining gaps or misassemblies (Bailey et al. 2001; Sharp et al. 2005; Chaisson et al. 2015a). In collaboration with various external groups, we identified and investigated reported path issues and associated assembly gaps using a combination of techniques, including optical maps (Teague et al. 2010; Howe and Wood 2015), Strand-seq (Falconer et al. 2012), admixture mapping (Genovese et al. 2013a) and reevaluation of component sequences and overlaps (Mueller et al. 2013). These analyses uncovered some substantial misassemblies in GRCh37 that spanned several megabases and many genes, including the regions at 1q21, 10q11, and a peri-centromeric inversion of Chromosome 9. Although we were able to improve or resolve some path problems through reordering of existing assembly components to match optical maps, we found that other approaches were needed at more complex regions where allelic and paralogous variation made it impossible to confidently define paths with clones representing a mosaic of diploid DNA sources. In these instances, we replaced GRCh37 components with new tiling paths comprised of BAC clones representing the single haplotype of the essentially haploid CHM1 genome (Dennis et al. 2012; Steinberg et al. 2014), or on Chromosome X, with the single haplotype represented in RP11 (Mueller et al. 2013). We also retiled several genomic loci associated with immune responses (IGK, IGH, LRC-KIR, and the cytokine cluster on 17q) with CHM1 clones, replacing the unvalidated mosaic representations in GRCh37 and previous assembly versions to ensure the reference-provided representations of these clinically important regions that actually exist

Table 2. Summary of RefSeq Annotation Releases 105 and 106

Columns 2–4: NCBI Annotation Release 105 (GRCh37.p13). Columns 5–7: NCBI Annotation Release 106 (GRCh38).

Feature | Full assembly | Primary assembly | All alternate loci | Full assembly | Primary assembly | All alternate loci
Genes and pseudogenes | 40,158 | 39,947 | 428 | 41,722 | 41,566 | 1981
mRNAs | 67,517 | 64,734 | 1360 | 69,826 | 67,793 | 3408
Other RNAs | 15,063 | 14,151 | 443 | 17,857 | 16,914 | 1152
CDSs | 68,035 | 65,099 | 1360 | 70,368 | 68,177 | 3564
Coverage <95% | NA | 65 | NA | NA | 25 | NA
Split alignments | NA | 30 | NA | NA | 3 | NA

Release 105: Entrez query date August 3, 2013 (42,339 known RefSeqs [NM_/NR_]); https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/105/.
Release 106: Entrez query date January 17, 2014 (45,911 known RefSeqs [NM_/NR_]); https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/106/.
Full assembly: Features annotated on both the primary assembly and alternate loci are only counted once in the full assembly.
Coverage <95%: Known NM_ and NR_ RefSeqs for which <95% of the CDS aligns to the genomic sequence.
Split alignments: Known NM_ and NR_ RefSeqs with multiple best alignments (split genes).

Table 3. GENCODE 23 and RefSeq 71 alignments to GRCh37 and GRCh38

Columns 3–5: GENCODE 23. Columns 6–8: RefSeq 71.

Category | Feature | GRCh37 only | GRCh38 only | GRCh38 and GRCh37 | GRCh37 only | GRCh38 only | GRCh38 and GRCh37
Not aligned | Transcripts | 86 | 0 | 122 | 15 | 0 | 1
Not aligned | Genes | 83 | 0 | 122 | 11 | 0 | 1
Split alignments | Transcripts | 61 | 5 | 21 | 39 | 2 | 6
Split alignments | Genes | 34 | 5 | 19 | 18 | 2 | 4
Coverage <95% | Transcripts | 160 | 5 | 104 | 79 | 5 | 14
Coverage <95% | Genes | 103 | 5 | 100 | 41 | 4 | 13
Rejected placement | Transcripts | 65 | 2 | 86 | 36 | 8 | 8
Rejected placement | Genes | 56 | 2 | 84 | 26 | 8 | 8
Dropped-conflict | Transcripts | NA | NA | NA | 47 | 1 | 2
Dropped-conflict | Genes | NA | NA | NA | 45 | 1 | 2

GENCODE: 92,193 transcripts; RefSeq: 50,337 transcripts.
Coverage <95%: Coverage values were calculated for RefSeq CDS and GENCODE full-length transcripts.
Dropped-conflict: Dropped due to coplacement with another sequence having a different NCBI GeneID.
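Table 3 reports each problem category as three disjoint counts, so per-assembly totals have to be derived. The sketch below assumes that "GRCh37 only" and "GRCh38 only" count features affected on just one assembly and that "GRCh38 and GRCh37" counts features affected on both; this reading of the column headers is an assumption, and the function name is hypothetical.

```python
def per_assembly_totals(only_grch37: int, only_grch38: int, both: int) -> dict:
    """Total affected features per assembly, assuming the three counts are disjoint."""
    return {"GRCh37": only_grch37 + both, "GRCh38": only_grch38 + both}


# Split-alignment transcript counts copied from Table 3.
gencode_split = per_assembly_totals(only_grch37=61, only_grch38=5, both=21)
refseq_split = per_assembly_totals(only_grch37=39, only_grch38=2, both=6)

for name, totals in (("GENCODE 23", gencode_split), ("RefSeq 71", refseq_split)):
    print(f"{name}: {totals['GRCh37']} split transcripts on GRCh37, "
          f"{totals['GRCh38']} on GRCh38")
```

Under this reading, split transcript alignments fall from 82 to 26 for GENCODE 23 and from 45 to 8 for RefSeq 71, consistent with the decrease described in the text.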

