
An open resource of structural variation for medical and population genetics

TLDR
A reference atlas of SVs from deep whole-genome sequencing of 14,891 individuals across diverse global populations as a component of gnomAD is constructed, finding strong correlations between constraint against predicted loss-of-function (pLoF) SNVs and rare SVs that both disrupt and duplicate protein-coding genes.
Abstract
SUMMARY Structural variants (SVs) rearrange the linear and three-dimensional organization of the genome, which can have profound consequences in evolution, diversity, and disease. As national biobanks, human disease association studies, and clinical genetic testing are increasingly reliant on whole-genome sequencing, population references for small variants (i.e., SNVs & indels) in protein-coding genes, such as the Genome Aggregation Database (gnomAD), have become integral for the evaluation and interpretation of genomic variation. However, no comparable large-scale reference maps for SVs exist to date. Here, we constructed a reference atlas of SVs from deep whole-genome sequencing (WGS) of 14,891 individuals across diverse global populations (54% non-European) as a component of gnomAD. We discovered a rich landscape of 498,257 unique SVs, including 5,729 multi-breakpoint complex SVs across 13 mutational subclasses, and examples of localized chromosome shattering, like chromothripsis, in the general population. The mutation rates and densities of SVs were non-uniform across chromosomes and SV classes. We discovered strong correlations between constraint against predicted loss-of-function (pLoF) SNVs and rare SVs that both disrupt and duplicate protein-coding genes, suggesting that existing per-gene metrics of pLoF SNV constraint do not simply reflect haploinsufficiency, but appear to capture a gene’s general sensitivity to dosage alterations. The average genome in gnomAD-SV harbored 8,202 SVs, and approximately eight genes altered by rare SVs. When incorporating these data with pLoF SNVs, we estimate that SVs comprise at least 25% of all rare pLoF events per genome. We observed large (≥1Mb), rare SVs in 3.1% of genomes (∼1:32 individuals), and a clinically reportable pathogenic incidental finding from SVs in 0.24% of genomes (∼1:417 individuals). 
We also estimated the prevalence of previously reported pathogenic recurrent CNVs associated with genomic disorders, which highlighted differences in frequencies across populations and confirmed that WGS-based analyses can readily recapitulate these clinically important variants. In total, gnomAD-SV includes at least one CNV covering 57% of the genome, while the remaining 43% is significantly enriched for CNVs found in tumors and individuals with developmental disorders. However, current sample sizes remain markedly underpowered to establish estimates of SV constraint on the level of individual genes or noncoding loci. The gnomAD-SV resources have been integrated into the gnomAD browser (https://gnomad.broadinstitute.org), where users can freely explore this dataset without restrictions on reuse, which will have broad utility in population genetics, disease association, and diagnostic screening.


444 | Nature | Vol 581 | 28 May 2020
Article
A structural variation reference for medical and population genetics
Ryan L. Collins¹,²,³,¹⁵⁸, Harrison Brand¹,²,⁴,¹⁵⁸, Konrad J. Karczewski¹,⁵, Xuefang Zhao¹,²,⁴, Jessica Alföldi¹,⁵, Laurent C. Francioli¹,⁵,⁶, Amit V. Khera¹,², Chelsea Lowther¹,²,⁴, Laura D. Gauthier¹,⁷, Harold Wang¹,², Nicholas A. Watts¹,⁵, Matthew Solomonson¹,⁵, Anne O’Donnell-Luria¹,⁵, Alexander Baumann⁷, Ruchi Munshi⁷, Mark Walker¹,⁷, Christopher W. Whelan⁷, Yongqing Huang⁷, Ted Brookings⁷, Ted Sharpe⁷, Matthew R. Stone¹,², Elise Valkanas¹,²,³, Jack Fu¹,²,⁴, Grace Tiao¹,⁵, Kristen M. Laricchia¹,⁵, Valentin Ruano-Rubio⁷, Christine Stevens¹, Namrata Gupta¹, Caroline Cusick¹, Lauren Margolin¹, Genome Aggregation Database Production Team*, Genome Aggregation Database Consortium*, Kent D. Taylor⁸, Henry J. Lin⁸, Stephen S. Rich⁹, Wendy S. Post¹⁰, Yii-Der Ida Chen⁸, Jerome I. Rotter⁸, Chad Nusbaum¹,¹⁵⁵, Anthony Philippakis⁷, Eric Lander¹,¹¹,¹², Stacey Gabriel¹, Benjamin M. Neale¹,²,⁵,¹³, Sekar Kathiresan¹,²,⁶,¹⁴, Mark J. Daly¹,²,⁵,¹³, Eric Banks⁷, Daniel G. MacArthur¹,²,⁵,⁶,¹⁵⁶,¹⁵⁷ & Michael E. Talkowski¹,²,⁴,¹³ ✉
Structural variants (SVs) rearrange large segments of DNA¹ and can have profound consequences in evolution and human disease²,³. As national biobanks, disease-association studies, and clinical genetic testing have grown increasingly reliant on genome sequencing, population references such as the Genome Aggregation Database (gnomAD)⁴ have become integral in the interpretation of single-nucleotide variants (SNVs)⁵. However, there are no reference maps of SVs from high-coverage genome sequencing comparable to those for SNVs. Here we present a reference of sequence-resolved SVs constructed from 14,891 genomes across diverse global populations (54% non-European) in gnomAD. We discovered a rich and complex landscape of 433,371 SVs, from which we estimate that SVs are responsible for 25–29% of all rare protein-truncating events per genome. We found strong correlations between natural selection against damaging SNVs and rare SVs that disrupt or duplicate protein-coding sequence, which suggests that genes that are highly intolerant to loss-of-function are also sensitive to increased dosage⁶. We also uncovered modest selection against noncoding SVs in cis-regulatory elements, although selection against protein-truncating SVs was stronger than all noncoding effects. Finally, we identified very large (over one megabase), rare SVs in 3.9% of samples, and estimate that 0.13% of individuals may carry an SV that meets the existing criteria for clinically important incidental findings⁷. This SV resource is freely distributed via the gnomAD browser⁸ and will have broad utility in population genetics, disease-association studies, and diagnostic screening.
SVs are DNA rearrangements that involve at least 50 nucleotides¹. By virtue of their size and abundance, SVs represent an important mutational force that shapes genome evolution and function²,³, and contribute to germline and somatic diseases⁹⁻¹¹. The profound effect of SVs is also attributable to the numerous mechanisms by which they can disrupt protein-coding genes and cis-regulatory architecture¹². SVs can be grouped into mutational classes that include ‘unbalanced’ gains or losses of DNA (for example, copy-number variants, CNVs), and ‘balanced’ rearrangements that occur without corresponding dosage alterations (such as inversions and translocations)¹ (Fig. 1a). Other common forms of SVs include mobile elements that insert themselves throughout the genome, and multiallelic CNVs (MCNVs) that can exist at high copy numbers¹. More recently, exotic species of complex SVs have been discovered that involve two or more distinct SV signatures in a single mutational event interleaved on the same allele, and can range from CNV-flanked inversions to rare instances of localized chromosome shattering, such as chromothripsis¹³,¹⁴. The diversity of SVs in humans is therefore far greater than has been widely appreciated, as is their influence on genome structure and function.
Although SVs alter more nucleotides per genome than SNVs and short insertion/deletion variants (indels; <50 bp)¹, surprisingly little is known about their mutational spectra on a global scale. The largest published population study of SVs using whole-genome sequencing (WGS) remains the 1000 Genomes Project (n = 2,504; 7× sequence coverage)¹, and the substantial technical challenges of SV discovery from WGS¹⁵ have led to non-uniform SV analyses across contemporary studies¹⁶⁻²⁰. Moreover, short-read WGS is unable to capture a subset of SVs accessible to more expensive niche technologies, such as long-read WGS²¹. Owing to the combination of these challenges, SV references are dwarfed by contemporary resources for short variants, such as the Exome Aggregation Consortium (ExAC) and its successor, the Genome Aggregation Database (gnomAD), which have jointly analysed more than 140,000 individuals⁴,⁶. Publicly available resources such as ExAC and gnomAD have transformed many aspects of human genetics, including defining sets of genes constrained against damaging coding mutations⁶ and providing frequency filters for variant interpretation⁵. As short-read WGS is rapidly becoming the predominant technology in large-scale human disease studies, and will probably displace conventional methods for diagnostic screening, there is a mounting need for comparable references of SVs across global populations.
https://doi.org/10.1038/s41586-020-2287-8
Received: 2 March 2019
Accepted: 31 March 2020
Published online: 27 May 2020
Open access
Lists of affiliations and consortium members appear at the end of the paper.
In this study, we developed gnomAD-SV, a sequence-resolved reference for SVs from 14,891 genomes. Our analyses revealed diverse mutational patterns among SVs, and principles of selection acting against reciprocal dosage changes in genes and noncoding cis-regulatory elements. From these analyses, we determined that SVs represent more than 25% of all rare protein-truncating events per genome, emphasizing the unrealized potential of routine SV detection in WGS studies. This SV reference has been integrated into the gnomAD browser (http://gnomad.broadinstitute.org) with no restrictions on reuse so that it can be mined for new insights into genome biology and applied as a resource to interpret SVs in diagnostic screening.
SV discovery and genotyping
We analysed WGS data for 14,891 samples (average coverage of 32×) aggregated from large-scale sequencing projects, of which 14,237 (95.6%) passed all quality thresholds, representing a general adult population depleted for severe Mendelian diseases (median age of 49 years) (Supplementary Table 1, Supplementary Figs. 1, 2). This cohort included 46.1% European, 34.9% African or African American, 9.2% East Asian, and 8.7% Latino samples, as well as 1.2% samples from admixed or other populations (Fig. 1). Following family-based analyses using 970 parent–child trios for quality assessments, we pruned all first-degree relatives from the cohort, retaining 12,653 unrelated genomes for subsequent analyses.
We discovered and genotyped SVs using a cloud-based, multi-algorithm pipeline for short-read WGS (Supplementary Fig. 3), which we prototyped in a study of 519 autism quartet families²⁰. This pipeline integrated four orthogonal evidence types to capture SVs across the size and allele frequency spectra, including six classes of canonical SVs (Fig. 1a) and 11 subclasses of complex SVs²² (Fig. 2). We augmented this pipeline with new methods to account for the technical heterogeneity of aggregated datasets (Extended Data Fig. 1, Supplementary Figs. 4, 5), and discovered 433,371 SVs (Fig. 1c). After excluding low-quality SVs, which were predominantly (61.6%) composed of incompletely resolved breakpoint junctions (that is, ‘breakends’) that lack interpretable alternative allele structures for functional annotation and produce high false-discovery rates²⁰ (Extended Data Fig. 2a), we retained 335,470 high-quality SVs for subsequent analyses (Supplementary Table 3). This final set of high-quality SVs corresponded to a median of 7,439 SVs per genome, or more than twice the number of variants per genome captured by previous WGS-based SV studies such as the 1000 Genomes Project (3,441 SVs per genome from approximately 7× coverage WGS), which underscores the benefits of high-coverage WGS and improved multi-algorithm ensemble methods for SV discovery.
Given that there are no gold-standard benchmarking procedures for SVs from WGS, we evaluated the technical qualities of gnomAD-SV using seven orthogonal approaches. These analyses are described in detail in Extended Data Figs. 2, 3, Supplementary Figs. 6–12, Supplementary Table 4 and Supplementary Note 1, but we highlight just a few here to demonstrate that gnomAD-SV conforms to many fundamental principles of population genetics, including Mendelian segregation, genotype distributions, and linkage disequilibrium. We found that the precision of gnomAD-SV was comparable to our previous study of 519 autism quartets that attained a 97% molecular validation rate for all de novo SV predictions²⁰: in gnomAD, analyses of 970 parent–child trios indicated a median Mendelian violation rate of 3.8% and a heterozygous de novo rate of 3.0%. We also observed that 86% of SVs were in Hardy–Weinberg equilibrium, and common SVs were in strong linkage disequilibrium with nearby SNVs or indels (median peak R² = 0.85). We performed extensive in silico confirmation of 19,316 SVs predicted from short-read WGS using matched long-read WGS from four samples²¹,²³, finding a 94.0% confirmation rate with breakpoint-level read evidence, and revealing that 59.8% of breakpoint coordinates were accurate within a single nucleotide of the long-read data. These and other benchmarking approaches suggested that gnomAD-SV was sufficiently sensitive and specific to be used as a reference dataset for most applications in human genomics.
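To make the Hardy–Weinberg check concrete, the following is a minimal sketch of a textbook chi-square test on genotype counts for one biallelic SV. It is an illustration only: this section does not state which test was used for the 86% figure (an exact test is also common), and the function name `hwe_chi_square` is hypothetical.

```python
import math


def hwe_chi_square(n_ref_hom, n_het, n_alt_hom):
    """Chi-square test (1 degree of freedom) for Hardy-Weinberg equilibrium.

    Takes genotype counts for a biallelic variant and returns the test
    statistic and its p-value. Assumption: a plain chi-square test, which
    may differ from the test used in the paper.
    """
    n = n_ref_hom + n_het + n_alt_hom
    if n == 0:
        raise ValueError("no genotyped samples")
    # Reference allele frequency estimated from the genotype counts.
    p = (2 * n_ref_hom + n_het) / (2 * n)
    q = 1.0 - p
    observed = (n_ref_hom, n_het, n_alt_hom)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    # Skip genotype classes with zero expectation (monomorphic sites).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Genotype counts that sit exactly at Hardy-Weinberg proportions.
chi2, p_value = hwe_chi_square(25, 50, 25)
```

A variant would then be counted as consistent with equilibrium when its p-value exceeds a chosen, multiple-testing-adjusted threshold; that threshold is a design choice not given in this section.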
[Figure 1 graphic not reproduced. Information recoverable from the panels:
a, SV classes: deletion (DEL), duplication (DUP), multiallelic CNV (MCNV), insertion (INS), inversion (INV), translocation (CTX), complex SV (CPX), and unresolved breakends (BND; discarded).
b, Samples per dataset: gnomAD-SV (this study) 14,237; 1000G 2,504; GoNL 769; GTEx 147.
c, SVs per dataset: gnomAD-SV (this study) 433,371; 1000G 68,818; GoNL 67,357; GTEx 23,602.
e, SVs per genome by population: AFR 8,775; AMR 7,376; EAS 7,338; EUR 7,132; OTH 7,692. Median genome: 3,505 DEL; 220 MCNV (loss); 723 DUP; 328 MCNV (gain); 2,612 INS; 14 INV; 37 CPX.
f, SVs discovered versus SV size. g, Fraction of SVs versus allele count. h, Singleton proportion versus SV size.]
Fig. 1 | Properties of SVs across human populations. a, SV classes catalogued in this study. We also documented unresolved non-reference ‘breakends’ (BNDs), but they were excluded from all analyses as low-quality variants. b, After quality control, we analysed 14,237 samples across continental populations, including African/African American (AFR), Latino (AMR), East Asian (EAS), and European (EUR), or other populations (OTH). Three publicly available WGS-based SV datasets are provided for comparison (1000 Genomes Project (1000G), approximately 7× coverage; Genome of the Netherlands Project (GoNL), around 13× coverage; Genotype-Tissue Expression Project (GTEx), approximately 50× coverage)¹,¹⁶,¹⁷. c, We discovered 433,371 SVs, and provide counts from previous studies for comparison¹,¹⁶,¹⁷. d, A principal component (PC) analysis of genotypes for 15,395 common SVs separated samples along axes corresponding to genetic ancestry. e, The median genome contained 7,439 SVs. f, Most SVs were small. Expected Alu, SVA and LINE1 mobile element insertion peaks are marked at approximately 300 bp, 2.1 kb and 6 kb, respectively. g, Most SVs were rare (allele frequency (AF) <1%), and 49.8% of SVs were singletons (solid bars). h, Allele frequencies were inversely correlated with SV size across all 335,470 resolved SVs in unrelated individuals. Values are mean and 95% confidence interval from 100-fold bootstrapping. Colour codes are consistent between a, c, e–h, and between b and d.

Population genetics and genome biology
The distribution of SVs across samples matched expectations based on human demographic history, with the top three components of genetic variance separating continental populations (Fig. 1d, Supplementary Fig. 13). African and African American samples exhibited the greatest genetic diversity and their common SVs were in weaker linkage disequilibrium with nearby short variants than those of Europeans, whereas East Asians featured the highest levels of homozygosity (Fig. 1e, Extended Data Fig. 4a–d, Supplementary Fig. 7). The mutational diversity of gnomAD-SV was extensive: we completely resolved 5,295 complex SVs across 11 mutational subclasses, of which 3,901 (73.7%) involved inverted segments (Fig. 2), confirming that inversion variation is predominantly composed of complex SVs rather than canonical inversions¹,²⁴. Across all SV classes, most SVs were small (median size of 331 bp) and rare (allele frequency < 1%; 92% of SVs), with half of all SVs (49.8%) appearing as ‘singletons’ (that is, only one allele observed across all samples) (Fig. 1f, g). Although the proportion of singletons varied by SV class, it was strongly dependent on SV size across all classes, which suggests that the amount of DNA rearranged is a key determinant of selection against most SVs (Fig. 1h, Extended Data Fig. 5a).
Mutation rate estimates for SVs have remained elusive owing to limited sample sizes, poor resolution of conventional technologies, technical challenges of SV discovery, and use of cell line-derived DNA in population studies¹,²⁵. Here, we used the Watterson estimator²⁶ to project a mean mutation rate of 0.29 de novo SVs (95% confidence interval 0.13–0.44) per generation in regions of the genome accessible to short-read WGS, or roughly one new SV every 2–8 live births, with mutation rates varying markedly by SV class (Fig. 3a). Although this imperfect method extrapolates from data pooled across unrelated individuals, we previously demonstrated comparable rates from molecularly validated observations in 519 quartet families²⁰. Like mutation rates, the distribution of SVs throughout the genome was non-uniform, significantly correlated with repetitive sequence contexts, and was enriched near centromeres and telomeres²³ (Supplementary Fig. 16). These trends were dependent on SV class, as biallelic deletions and duplications were predominantly enriched at telomeres, whereas MCNVs were enriched in centromeric segmental duplications (Fig. 3b–d). Given the reduced sensitivity of short-read WGS in repetitive sequences, this study certainly underestimates the true SV mutation rates; nevertheless, these analyses implicate several aspects of chromosomal context and SV class in determining SV mutation rates throughout the genome.
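As a rough illustration of how a Watterson-style estimate is obtained, the sketch below computes θ from the number of segregating variants and the number of sampled chromosomes, then converts it to a per-generation rate using the standard diploid relation θ = 4·Nₑ·μ. The function names and the effective population size are assumptions of this sketch (10,000 is a conventional value for humans); it does not reproduce the variant filtering behind the published 0.29 estimate.

```python
def harmonic_number(m):
    """Sum of 1/i for i = 1..m (the Watterson normalising constant)."""
    return sum(1.0 / i for i in range(1, m + 1))


def watterson_theta(n_segregating_variants, n_chromosomes):
    """Watterson estimator: segregating sites divided by H(n - 1)."""
    if n_chromosomes < 2:
        raise ValueError("need at least two chromosomes")
    return n_segregating_variants / harmonic_number(n_chromosomes - 1)


def mutation_rate_per_generation(theta, effective_population_size):
    """Per-generation mutation rate from theta = 4 * Ne * mu (diploid)."""
    return theta / (4.0 * effective_population_size)


def births_per_new_variant(mutation_rate):
    """Expected number of births per de novo variant at a given rate."""
    return 1.0 / mutation_rate


# Toy numbers only: 11 segregating variants seen on 4 chromosomes.
theta = watterson_theta(11, 4)
mu = mutation_rate_per_generation(theta, effective_population_size=10_000)

# The reported confidence bounds translate to the "2-8 live births" range:
upper_births = births_per_new_variant(0.13)  # about 7.7
lower_births = births_per_new_variant(0.44)  # about 2.3
```

Because the estimator pools variants across unrelated individuals, it inherits the demographic assumptions of the neutral model, which is one reason the text describes the method as imperfect.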
[Figure 2 graphic not reproduced; the mutational signature and allele structure diagrams are omitted. Tabulated values recovered from the figure:]
Complex SV subclass                        Abbreviation(s)        Resolved SVs   SV size     APS
All complex SVs                            CPX                    5,295          2.2 kb      0.02
Paired-duplication inversion               dupINVdup              258            155.5 kb    0.07
Paired-deletion inversion                  delINVdel              616            9.7 kb      0.04
Paired-deletion/duplication inversion      delINVdup, dupINVdel   551            8.6 kb      −0.02
Deletion-flanked inversion                 delINV, INVdel         623            4 kb        0.02
Insertion with insertion site deletion     dDUP-iDEL, INS-iDEL    288            3.9 kb      0.11
Duplication-flanked inversion              dupINV, INVdup         1,851          1.5 kb      0.01
Dispersed duplication                      dDUP                   1,106          0.3 kb      0.02
Fig. 2 | Complex SVs are abundant in the human genome. We resolved 5,295 complex SVs across 11 mutational subclasses, 73.7% of which involved at least one inversion. Each subclass is detailed here, including their mutational signatures, structures, abundance, density of SV sizes (vertical line indicates median size), and allele frequencies. Five pairs of subclasses have been collapsed into single rows due to mirrored or similar alternative allele structures (for example, delINV versus INVdel). Two complex SVs did not conform to any subclass (Extended Data Fig. 8).

[Figure 3 graphic not reproduced. Information recoverable from the panels:
a, Mutation rate μ (SVs per generation) from the Watterson estimator: all SVs 0.286; DEL 0.146; INS 0.095; DUP 0.040; CPX 0.004; INV 0.001.
b, c, SV fold-enrichment along the meta-chromosome (mean of autosomes). SVs per class in c: DEL 172,637; DUP 46,408; MCNV 1,055; INS 109,278; INV 788; CPX 5,295.
d, SV fold-enrichment by class at telomeric (T), interstitial (I) and centromeric (C) positions.]
Fig. 3 | Genome-wide mutational patterns of SVs. a, Mutation rates (μ) from the Watterson estimator for each SV class²⁶. Bars represent 95% confidence intervals. Rates of molecularly validated de novo SVs from 519 quartet families are provided for comparison²⁰. b, Smoothed enrichment of SVs per 100-kb window across the average of all autosomes normalized by chromosome arm length (a ‘meta-chromosome’) (Supplementary Fig. 16). c, The distribution of SVs along the meta-chromosome was dependent on variant class. d, SV enrichment by class and chromosomal position provided as mean and 95% confidence intervals (CI). C, centromeric; I, interstitial; T, telomeric. P values were computed using a two-sided t-test and were Bonferroni-adjusted for 21 comparisons. *P ≤ 2.38 × 10⁻³.

Dosage sensitivity of coding and noncoding loci
Owing to their size and mutational diversity, SVs can have varied consequences on protein-coding genes¹² (Fig. 4a, Supplementary Fig. 17). In principle, any SV can result in predicted loss-of-function (pLoF), either by deleting coding nucleotides or altering open-reading frames. Coding duplications can result in copy-gain of entire genes, or of a subset of exons within a gene (referred to here as intragenic exonic duplication, or IED). On average, a genome in gnomAD-SV contained 179.8 genes altered by biallelic SVs (144.3 pLoF, 24.3 copy-gain, and 11.2 IED), of which 11.6 were predicted to be completely inactivated by homozygous pLoF (Fig. 4b, Extended Data Fig. 4e–h). When restricted to rare (allele frequency < 1%) SVs, we observed a mean of 10.2 altered genes per genome (5.5 pLoF, 3.4 copy-gain, and 1.3 IED). By comparison, a companion gnomAD paper estimated 122.4 pLoF short variants per genome, of which 16.3 were rare⁴. These analyses suggest that 29.4% of rare heterozygous gene inactivation events per individual are contributed by SVs, or conservatively 25.2% of pLoF events if we exclude IEDs given the context-dependence of their functional impact.
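The two percentages can be reproduced from the per-genome means quoted above. The short sketch below is one plausible reconstruction, assuming that copy-gain events are excluded from "gene inactivation" and that IEDs are either counted or dropped; the helper name `sv_share` is hypothetical and the exact published calculation may differ.

```python
# Per-genome means of rare events, as quoted in the text.
RARE_PLOF_SV_GENES = 5.5
RARE_IED_SV_GENES = 1.3
RARE_PLOF_SHORT_VARIANTS = 16.3


def sv_share(sv_events, short_variant_events):
    """Fraction of all inactivating events that are contributed by SVs."""
    return sv_events / (sv_events + short_variant_events)


# Counting IEDs as gene-inactivating events.
with_ied = sv_share(RARE_PLOF_SV_GENES + RARE_IED_SV_GENES,
                    RARE_PLOF_SHORT_VARIANTS)
# Conservative estimate that excludes IEDs.
without_ied = sv_share(RARE_PLOF_SV_GENES, RARE_PLOF_SHORT_VARIANTS)

print(f"{with_ied:.1%}")     # 29.4%
print(f"{without_ied:.1%}")  # 25.2%
```

That both reported values fall out of this arithmetic suggests, but does not prove, that this is how the 25–29% range in the abstract was derived.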
A fundamental question in human genetics is the degree to which natural selection acts on coding and noncoding loci. The proportion of singleton variants has been established as a proxy for strength of selection⁶; however, this metric is confounded for SVs given the strong correlation between allele frequency and SV size, among other factors. Therefore, we developed a new metric, adjusted proportion of singletons (APS), to account for SV class, size, genomic context, and other technical covariates (Extended Data Fig. 5, Supplementary Fig. 14). Under this normalized APS metric, a value of zero corresponds to a singleton proportion comparable to intergenic SVs, whereas values greater than zero reflect purifying selection, similar to the ‘mutability-adjusted proportion of singletons’ (MAPS) metric used for SNVs⁶. Applying this APS model revealed signals of pervasive selection against nearly all classes of SVs that overlap genes, including intronic SVs, whole-gene inversions, SVs in gene promoters, and deletions as small as a single exon (Fig. 4c, Extended Data Fig. 6, Supplementary Fig. 18). The one notable exception was copy-gain duplications, which showed no clear evidence of selection beyond what could already be explained by their sizes, which were vastly larger than non-copy-gain duplications (median copy-gain duplication size = 134.8 kb; median non-copy-gain duplication size = 2.7 kb; one-tailed Wilcoxon test, W = 1.18 × 10⁸, P < 10⁻¹⁰⁰). This result could have numerous explanations, but it is consistent with the known diverse evolutionary roles of gene duplication events, including positive selection reported in humans²⁷,²⁸.
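The idea behind APS can be sketched as a residual: the observed singleton status of each SV minus the singleton proportion expected for comparable baseline SVs, averaged over a category. The version below is a deliberately simplified stand-in that stratifies only by SV class and size bin using intergenic SVs as the baseline; the published model adjusts for further covariates (genomic context and technical factors) that are not reproduced here. All field and function names are hypothetical.

```python
from collections import defaultdict


def expected_singleton_rates(baseline_svs):
    """Singleton proportion per (sv_class, size_bin) among baseline SVs.

    Assumption: baseline_svs are intergenic SVs, so that an APS of zero
    means "as many singletons as intergenic SVs of the same class and size".
    """
    totals = defaultdict(int)
    singletons = defaultdict(int)
    for sv in baseline_svs:
        key = (sv["sv_class"], sv["size_bin"])
        totals[key] += 1
        singletons[key] += int(sv["is_singleton"])
    return {key: singletons[key] / totals[key] for key in totals}


def adjusted_proportion_of_singletons(svs, expected):
    """Mean of (observed singleton indicator - expected singleton rate)."""
    if not svs:
        raise ValueError("no SVs in this category")
    residuals = [
        float(sv["is_singleton"]) - expected[(sv["sv_class"], sv["size_bin"])]
        for sv in svs
    ]
    return sum(residuals) / len(residuals)


# Toy data: half of the small intergenic deletions are singletons...
intergenic = [
    {"sv_class": "DEL", "size_bin": "<1kb", "is_singleton": flag}
    for flag in (True, True, False, False)
]
# ...whereas every small exonic deletion in this toy set is a singleton.
exonic = [
    {"sv_class": "DEL", "size_bin": "<1kb", "is_singleton": True}
    for _ in range(3)
]

expected = expected_singleton_rates(intergenic)
aps_exonic = adjusted_proportion_of_singletons(exonic, expected)
aps_baseline = adjusted_proportion_of_singletons(intergenic, expected)
```

Stratifying rather than regressing keeps the sketch short, at the cost of needing enough baseline SVs in every class-by-size stratum.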
Methods that quantify evolutionary constraint on a per-gene basis, such as the probability of intolerance to heterozygous pLoF variation (pLI)⁶ and the pLoF observed/expected upper fraction (LOEUF)⁴, have become core resources in human genetics. Nearly all existing metrics, including pLI and LOEUF, are derived from SNVs. Although previous studies have attempted to compute similar scores using large CNVs detected by microarray and exome sequencing²⁹,³⁰, or to correlate deletions with pLI¹⁸, no gene-level metrics comparable to LOEUF exist for SVs at WGS resolution. To gain insight into this problem, we built a model to estimate the depletion of rare SVs per gene compared to expectations based on gene length, genomic context, and the structure of exons and introns. This model is imperfect, as current sample sizes are too sparse to derive precise gene-level metrics of constraint from SVs. Nevertheless, we found strong concordance between the depletion of rare pLoF SVs and existing pLoF and missense SNV constraint metrics⁴ (pLoF Spearman correlation test, ρ = 0.90, P < 10⁻¹⁰⁰) (Fig. 4d, Supplementary Fig. 19). Notably, a comparable positive correlation was also observed for copy-gain SVs and SNV constraint (pLoF Spearman correlation test, ρ = 0.78, P < 10⁻¹⁰⁰), whereas a weaker yet significant correlation was detected for IEDs (pLoF Spearman correlation test, ρ = 0.58, P = 2.0 × 10⁻¹¹). As orthogonal support for these trends, we identified an inverse correlation between APS and SNV constraint across all functional categories of SVs, which was consistent with our observed depletion of rare, functional SVs in constrained genes (Extended Data Fig. 6f). These comparisons confirm that selection against most classes of gene-altering SVs mirrors patterns observed for short variants¹⁸,³⁰. They further suggest that SNV-derived constraint metrics such as LOEUF capture a general correspondence between haploinsufficiency and triplosensitivity for a large fraction of genes in the genome. It therefore appears that the most highly pLoF-constrained genes not only are sensitive to pLoF, but also are more likely to be intolerant to increased dosage and other functional alterations.
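The binned comparison described above can be sketched in two steps: rank genes by SNV constraint, pool observed and expected rare SV counts within each bin, and then compute a Spearman correlation between bin order and the observed/expected ratio. The code below is an illustrative sketch under assumptions: the expected counts are taken as given (the paper's per-gene expectation model is not reproduced), and every function name is hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def obs_exp_by_bin(genes, n_bins):
    """Pooled observed/expected SV ratio per constraint-ranked bin.

    genes: list of (constraint_score, observed_svs, expected_svs) tuples.
    Assumes len(genes) >= n_bins and a positive expectation in every bin.
    """
    ranked = sorted(genes, key=lambda gene: gene[0])
    bin_size = len(ranked) / n_bins
    ratios = []
    for b in range(n_bins):
        chunk = ranked[round(b * bin_size):round((b + 1) * bin_size)]
        observed = sum(gene[1] for gene in chunk)
        expected = sum(gene[2] for gene in chunk)
        ratios.append(observed / expected)
    return ratios


# Toy genes: (LOEUF-like score, observed rare SVs, expected rare SVs).
toy_genes = [(0.1, 1, 4), (0.2, 1, 4), (0.9, 4, 4), (1.0, 4, 4)]
ratios = obs_exp_by_bin(toy_genes, n_bins=2)
rho = spearman_rho(list(range(len(ratios))), ratios)
```

Pooling counts within bins before taking the ratio is what makes the comparison usable at current sample sizes, since per-gene ratios are too noisy to interpret individually.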
In contrast to the well-studied effects of coding variation, the effects of noncoding SVs on regulatory elements are largely unknown. There are a handful of examples of SVs with strong noncoding effects, although they are scarce in humans and model organisms³¹,³². In gnomAD-SV, we explored noncoding dosage sensitivity across 14 regulatory element classes, ranging from high-confidence experimentally validated enhancers to large databases of computationally predicted elements (Supplementary Table 5). We found that noncoding CNVs overlapping most element classes had increased proportions of singletons, although none exceeded the APS observed for pLoF SVs (Fig. 5a). In general, the effects of noncoding deletions appeared stronger than those of noncoding duplications, and CNVs predicted to delete or duplicate entire elements were under stronger selection than CNVs that only partially disrupted elements (Fig. 5b). We also observed that primary sequence conservation was correlated with selection against noncoding CNVs (Fig. 5c, d), which provides a foothold for future work on interpretation and functional effect prediction for noncoding SVs. Broadly, these results followed trends we observed for protein-coding SVs, which we interpreted as evidence for weak but widespread selection against CNVs altering most classes of annotated regulatory elements.
[Figure 4 graphic not reproduced. Information recoverable from the panels:
a, Categories of gene-overlapping SVs:]
Category                         Abbrev.   No. of SVs   Median size   SVs per gene   Effect
Gene inactivation                pLoF      9,867        7.9 kb        0.62           Loss-of-function
Intragenic exonic duplication    IED       1,951        13.3 kb       0.11           Varies by context
Copy gain                        CG        3,024        138.2 kb      0.32           Increased dosage
Whole-gene inversion             INV       363          508.3 kb      0.60           No direct coding effect
[b, Genes per genome altered by all SVs and by rare SVs (pLoF, CG, IED), with SNVs and indels for comparison. c, APS by functional category. d, Rare SV (obs/exp) versus pLoF SNV constraint (LOEUF) percentile: pLoF ρ = 0.90 (P < 10⁻¹⁰⁰); CG ρ = 0.78 (P < 10⁻¹⁰⁰); IED ρ = 0.58 (P = 2.00 × 10⁻¹¹); INV ρ = 0.24 (P = 1.58 × 10⁻²).]
Fig. 4 | Pervasive selection against SVs in genes mirrors coding short variants. a, Four categories of gene-overlapping SVs, with counts of total SVs, median SV size, and mean SVs per gene in gnomAD-SV. b, Count of genes altered by SVs per genome. Horizontal lines indicate medians. Sample sizes per category listed in Supplementary Table 9. c, APS value for SVs overlapping genes. Bars indicate 100-fold bootstrapped 95% confidence intervals. SVs per category listed in Supplementary Table 9. d, Relationships of constraint against pLoF SNVs versus gene-overlapping SVs in 100 bins of around 175 genes each, ranked by SNV constraint⁴. Correlations were assessed with a two-sided Spearman correlation test. Solid lines represent 21-point rolling means. See Supplementary Fig. 19 for comparisons to missense constraint.

Trait association and clinical genetics
Most large-scale trait association studies have only considered SNVs in genome-wide association studies (GWAS). Taking advantage of the sample size and resolution of gnomAD-SV, we evaluated whether SNVs associated with human traits might be in linkage disequilibrium with SVs not directly genotyped in GWAS. We identified 15,634 common SVs (allele frequency >1%) in strong linkage disequilibrium (R² ≥ 0.8) with at least one common short variant (Supplementary Fig. 7), 14.8% of which matched a reported association from the NHGRI-EBI GWAS catalogue or a recent analysis of 4,203 phenotypes in the UK Biobank³³,³⁴. Common SVs in linkage disequilibrium with GWAS variants were enriched for genic SVs across multiple functional categories (Supplementary Table 6), and included candidate SVs such as a deletion of a thyroid enhancer in the first intron of ATP6V0D1 at a hypothyroidism-associated locus³⁴ (Extended Data Fig. 7). We also identified matches for previously proposed causal SVs tagged by common SNVs, including pLoF deletions of CFHR3 or CFHR1 in nephropathies and of LCE3B or LCE3C in psoriasis³⁵,³⁶. These results demonstrate the value of imputing SVs into GWAS, and of eventually unifying short variants and SVs in all trait association studies. Given the potential value of this resource, we have released these linkage disequilibrium maps in Supplementary Table 7.
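The tagging criterion (R² ≥ 0.8 between an SV and a nearby short variant) can be illustrated with a genotype-correlation sketch. The code below computes R² as the squared Pearson correlation of unphased alternate-allele dosages (0, 1 or 2 per sample), a common approximation; this section does not specify the exact estimator used, and the function names and variant identifiers are hypothetical.

```python
def ld_r_squared(dosages_a, dosages_b):
    """Squared Pearson correlation between two genotype dosage vectors."""
    if len(dosages_a) != len(dosages_b) or not dosages_a:
        raise ValueError("dosage vectors must be non-empty and equal length")
    n = len(dosages_a)
    mean_a = sum(dosages_a) / n
    mean_b = sum(dosages_b) / n
    cov = sum((a - mean_a) * (b - mean_b)
              for a, b in zip(dosages_a, dosages_b))
    var_a = sum((a - mean_a) ** 2 for a in dosages_a)
    var_b = sum((b - mean_b) ** 2 for b in dosages_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic site carries no LD information
    return cov * cov / (var_a * var_b)


def tagging_variants(sv_dosages, short_variant_dosages, min_r2=0.8):
    """IDs of short variants in strong LD with the SV, sorted by name."""
    return sorted(
        variant_id
        for variant_id, dosages in short_variant_dosages.items()
        if ld_r_squared(sv_dosages, dosages) >= min_r2
    )


# Toy genotypes for six samples.
sv = [0, 1, 2, 0, 1, 0]
short_variants = {
    "snv_a": [0, 1, 2, 0, 1, 0],  # identical genotypes
    "snv_b": [0, 0, 1, 1, 2, 0],  # weakly correlated
    "snv_c": [2, 1, 0, 2, 1, 2],  # allele coding flipped
}
tags = tagging_variants(sv, short_variants)
```

Squaring the correlation makes the measure insensitive to which allele is labelled as the alternate, so a short variant coded on the opposite allele still counts as a tag.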
As genomic medicine advances towards diagnostic screening at
sequence resolution, computational methods for variant discovery
from WGS and population references for interpretation will become
indispensable. One category of disease-associated SVs, recurrent CNVs
mediated by homologous segmental duplications known as genomic
disorders, is particularly important because these CNVs collectively
represent a common cause of developmental disorders³⁷. Accurate detection
of large, repeat-mediated CNVs is thus crucial for WGS-based diagnostic
testing, as chromosomal microarray is the recommended first-tier
diagnostic screen at present for unexplained developmental disorders³⁷.
Using gnomAD-SV, we evaluated our ability to detect genomic disorders
in WGS data by calculating CNV carrier frequencies for 49 genomic
disorders across 10,047 unrelated samples with no known neuropsychiatric
disease and found that CNV carrier frequencies in gnomAD-SV
were consistent with those reported from chromosomal microarray in
the UK Biobank³⁸ (R² = 0.669; Pearson correlation test, P = 7.38 × 10⁻¹³)
(Fig. 6a, Supplementary Table 8, Supplementary Fig. 20). The frequencies
of carriers of genomic disorders did not vary significantly among
populations, with the exception of duplications of NPHP1 at 2q13, for
which carrier frequencies in East Asian samples were up to 4.6-fold
higher than in other populations, further highlighting the potential
for variant interpretation to be confounded by the limited diversity
of existing SV references (Supplementary Fig. 21).
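
The comparison of carrier frequencies between two cohorts reduces to a Pearson correlation across loci, with R² as its square. The sketch below shows that calculation under stated assumptions: the frequency lists are hypothetical placeholders rather than values from Supplementary Table 8, and `pearson_r` is an illustrative helper, not the authors' code. The P value reported in the text would need a separate significance test that is omitted here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-locus carrier frequencies (percent), one entry per
# genomic disorder; these are NOT the published values.
wgs_carrier_freq = [0.010, 0.030, 0.050, 0.020, 0.004]
array_carrier_freq = [0.012, 0.028, 0.055, 0.018, 0.006]

r = pearson_r(wgs_carrier_freq, array_carrier_freq)
r_squared = r * r
print(f"r={r:.3f} R^2={r_squared:.3f}")
```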
In the context of variant interpretation, the current gnomAD-SV
resource will permit a screening threshold of allele frequencies less
than 0.1% when matching on ancestry to the populations sampled
here, and allele frequencies less than 0.004% globally. In the current
release, we catalogued at least one pLoF or copy-gain variant for 36.9%
and 23.7% of all autosomal genes, respectively, and 490 genes with at
least one homozygous pLoF SV (Fig. 6b, Extended Data Fig. 6e,
Supplementary Fig. 22). We also benchmarked carrier rates for several
categories of clinically relevant variants in gnomAD-SV. First, 0.32%
of samples carried a very rare (allele frequency < 0.1%) SV resulting in
pLoF of a gene for which incidental findings are clinically actionable,
nearly half of which (that is, 0.13% of all samples) would meet
diagnostic criteria as pathogenic or likely pathogenic based upon the
American College of Medical Genetics (ACMG) recommendations⁷ (Fig. 6c).
Second, 7.22% of individuals were heterozygous carriers of rare pLoF
SVs in known recessive developmental disorder genes³⁹. Third, we
estimated that 3.8% of the general population (95% confidence
interval of 3.2–4.6%) carries at least one very large (≥1 Mb) rare autosomal
SV, roughly half of which (45.2%) were balanced or complex (Fig. 6d).
Among these was an example of localized chromosome shattering
involving at least 49 breakpoints, yet resulting in largely balanced
products, reminiscent of chromothripsis, in an adult with no known
severe disease or DNA repair defect¹³,¹⁴,²² (Fig. 6e, Extended Data Fig. 8).
Collectively, these analyses highlight the potential of gnomAD-SV
and WGS-based SV methods to augment disease-association studies
and clinical interpretation across a broad spectrum of variant classes
and study designs.
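
One way to read the allele-frequency screening thresholds quoted at the start of this paragraph is as resolution floors set by cohort size: a single observed allele among n diploid genomes has frequency 1/(2n) at an autosomal site. The passage does not state that the thresholds were derived this way, so the sketch below is only an illustration of that arithmetic; the cohort sizes are hypothetical round numbers and `min_resolvable_af` is an invented helper name.

```python
def min_resolvable_af(n_diploid_samples):
    """Allele frequency of one observed allele among n diploid genomes.

    Assumption: autosomal site, so each sample contributes two alleles.
    """
    if n_diploid_samples <= 0:
        raise ValueError("sample count must be positive")
    return 1.0 / (2 * n_diploid_samples)


# Hypothetical cohort sizes chosen for round numbers, not taken from the paper.
for n in (500, 12_500):
    af = min_resolvable_af(n)
    print(f"n={n:>6} genomes -> smallest nonzero AF = {af * 100:.4f}%")
```

Under this reading, a population sample of about 500 genomes resolves frequencies down to 0.1%, and a pooled sample of about 12,500 genomes resolves down to 0.004%.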
Discussion
Human genetic research and clinical diagnostics are becoming increasingly
invested in capturing the complete landscape of variation in
individual genomes. Ambitious international initiatives to generate
short-read WGS in many thousands of individuals from common disease
cohorts have underwritten this goal⁴⁰,⁴¹, and millions of genomes will
be sequenced in the coming years from national biobanks⁴²,⁴³. A central
challenge to these efforts will be the uniform analysis and interpretation
of all variation accessible to WGS, particularly SVs, which are frequently
invoked as a source of added value offered by WGS. Indeed, early WGS
studies in cardiovascular disease and autism have been largely consistent
in their analyses of short variants, but every study has differed in its
analysis of SVs¹⁸–²⁰,⁴⁰,⁴¹. Thus, while ExAC and gnomAD have prompted
remarkable advances in medical and population genetics for short
variants, the same gains have not yet been realized for SVs. Although
gnomAD-SV is not exhaustively comprehensive, it was derived from
WGS methods and a reference genome that match those currently used
in many research and clinical settings, which will help to facilitate the
eventual standardization of SV discovery, analysis, and interpretation
across studies.
Most foundational assumptions about human genetic variation were
consistent between SVs and short variants in gnomAD, most notably
that SVs segregate stably on haplotypes in the population and experience
selection commensurate with their predicted biological consequences.
This study also spotlights unique aspects of SVs, such as their
remarkable mutational diversity, their varied functional effects on
coding sequence, and the intense selection against large and complex
[Fig. 5 graphic not reproduced; recoverable labels follow. Panel a: APS (axis 0.0 to 0.3) for deletions (DEL) and duplications (DUP), with points marked as significant (Bonferroni) or non-significant, across the rows: Protein-altering (pLoF & IED) CNVs; Protein-altering (constrained genes); All intergenic CNVs; Strictly noncoding CNVs; No annotations; Any annotation; VISTA validated enhancers; ChromHMM genic enhancers; Ultraconserved elements; ChromHMM bivalent enhancers; ChromHMM enhancers; ChromHMM polycomb repressed; DNAseI hypersensitive sites; Human accelerated regions; Enhancer Atlas predictions; TF binding sites; Recombination hotspots; Predicted super enhancers; TAD boundaries; Chromatin loop boundaries. Panel b: APS (axis 0.0 to 0.4) for partial versus full overlap, shown for DEL and DUP, with P = 1.8 × 10⁻² and P = 7.5 × 10⁻³. Panels c, d: APS (axis −0.05 to 0.10) versus phastCons percentile (0th to 100th), with correlation values of 0.62 and 0.66, each with P < 10⁻¹⁰⁰.]
Fig. 5 | Dosage sensitivity in the noncoding genome. a, Strength of selection
(APS) for noncoding CNVs overlapping 14 categories of noncoding elements
(Supplementary Table 5). Bars reflect 95% confidence intervals from 100-fold
bootstrapping. Each category was compared to neutral variation (APS = 0)
using a one-tailed t-test. Categories surpassing Bonferroni-corrected
significance for 32 comparisons are indicated with dark shaded points. SVs per
category listed in Supplementary Table 9. DEL, deletion; DUP, duplication; TAD,
topologically associating domain; TF, transcription factor. b, CNVs that
completely covered elements (‘full’) had significantly higher average APS
values than CNVs that only partially covered elements (‘partial’). P values
calculated using a two-tailed paired two-sample t-test for the 14 categories
from a. c, d, Spearman correlations between sequence conservation and APS for
noncoding deletions (n = 143,353) (c) and duplications (n = 30,052) (d).
Noncoding CNVs were sorted into 100-percentile bins based on the sum of the
phastCons scores overlapped by the CNV. Correlations were assessed with a
two-sided Spearman correlation test. Solid lines represent 21-point rolling means.
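
The binning, rank correlation, and smoothing described for panels c and d can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: each bin is summarised by a plain mean as a stand-in for APS, tied values receive average ranks, and the rolling window is truncated at the ends of the series because the caption does not say how edges were handled. All function names are invented for this example.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def bin_means(scores, values, n_bins=100):
    """Sort items by score, split into near-equal bins, average `values`."""
    if len(scores) != len(values) or len(scores) < n_bins:
        raise ValueError("need matching inputs with at least n_bins items")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    means = []
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        means.append(sum(values[i] for i in members) / len(members))
    return means


def rolling_mean(series, window=21):
    """Centered rolling mean; the window shrinks near the series ends."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - half):i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

A typical use would compute `bin_means(conservation_sums, per_variant_values)`, correlate the bin index with the bin means using `spearman_rho`, and plot `rolling_mean` of the bin means as the solid line.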
