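What are the contributions of the paper "A structural variation reference for medical and population genetics"?

The paper presents gnomAD-SV, a reference of sequence-resolved SVs constructed from 14,891 genomes across diverse global populations in gnomAD. Its author list includes the Genome Aggregation Database Production Team and the Genome Aggregation Database Consortium.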

How many SVs are available to short-read WGS?

92.7% of all known autosomal protein-coding nucleotides are not localized to simple- or low-copy repeats, and therefore the authors expect that the catalogues of SVs accessible to short-read WGS across large populations like gnomAD-SV will capture a majority of the most interpretable gene-disrupting SVs in humans.

What are the main goals of gnomAD-SV?

Although these data remain insufficient to derive accurate estimates of gene-level constraint, sequence-specific mutation rates, and intolerance to noncoding SVs, they provide a step towards these goals and reinforce the value of data sharing and harmonized analyses of aggregated genomic data sets.

(Open Access) An open resource of structural variation for medical and population genetics (2019) | Ryan L. Collins

Q: What is the need for a reference for SVs across global populations?

As short-read WGS is rapidly becoming the predominant technology in large-scale human disease studies, and will probably displace conventional methods for diagnostic screening, there is a mounting need for comparable references of SVs across global populations.

Q: What are the reasons why SVs have remained elusive?

Mutation rate estimates for SVs have remained elusive owing to limited sample sizes, poor resolution of conventional technologies, technical challenges of SV discovery, and use of cell line-derived DNA in population studies1,25.

Q: What is the significance of SVs in population studies?

Owing to their size and mutational diversity, SVs can have varied consequences on protein-coding genes12 (Fig. 4a, Supplementary Fig. 17).

Q: How many SVs were retained for subsequent analyses?

After excluding low-quality SVs, which were predominantly (61.6%) composed of incompletely resolved breakpoint junctions (that is, ‘breakends’) that lack interpretable alternative allele structures for functional annotation and produce high false-discovery rates20 (Extended Data Fig. 2a), the authors retained 335,470 high-quality SVs for subsequent analyses (Supplementary Table 3).

Q: What is the mutational diversity of gnomAD-SV?

The mutational diversity of gnomAD-SV was extensive: the authors completely resolved 5,295 complex SVs across 11 mutational subclasses, of which 3,901 (73.7%) involved inverted segments (Fig. 2), confirming that inversion variation is predominantly composed of complex SVs rather than canonical inversions1,24.

Q: How many SVs were found in the 1000 Genomes Project?

This final set of high-quality SVs corresponded to a median of 7,439 SVs per genome, or more than twice the number of variants per genome captured by previous WGS-based SV studies such as the 1000 Genomes Project (3,441 SVs per genome from approximately 7× coverage WGS), which underscores the benefits of high-coverage WGS and improved multi-algorithm ensemble methods for SV discovery.

Q: How can one determine whether an SV is pathogenic?

0.32% of samples carried a very rare (allele frequency < 0.1%) SV resulting in pLoF of a gene for which incidental findings are clinically actionable, nearly half of which (that is, 0.13% of all samples) would meet diagnostic criteria as pathogenic or likely pathogenic based upon the American College of Medical Genetics (ACMG) recommendations7 (Fig. 6c).

444 | Nature | Vol 581 | 28 May 2020

Article

A structural variation reference for medical and population genetics

Ryan L. Collins 1,2,3,158; Harrison Brand 1,2,4,158; Konrad J. Karczewski 1,5; Xuefang Zhao 1,2,4; Jessica Alföldi 1,5; Laurent C. Francioli 1,5,6; Amit V. Khera 1,2; Chelsea Lowther 1,2,4; Laura D. Gauthier 1,7; Harold Wang 1,2; Nicholas A. Watts 1,5; Matthew Solomonson 1,5; Anne O’Donnell-Luria 1,5; Alexander Baumann; Ruchi Munshi; Mark Walker 1,7; Christopher W. Whelan; Yongqing Huang; Ted Brookings; Ted Sharpe; Matthew R. Stone 1,2; Elise Valkanas 1,2,3; Jack Fu 1,2,4; Grace Tiao 1,5; Kristen M. Laricchia 1,5; Valentin Ruano-Rubio; Christine Stevens; Namrata Gupta; Caroline Cusick; Lauren Margolin; Genome Aggregation Database Production Team*; Genome Aggregation Database Consortium*; Kent D. Taylor; Henry J. Lin; Stephen S. Rich; Wendy S. Post; Yii-Der Ida Chen; Jerome I. Rotter; Chad Nusbaum 1,155; Anthony Philippakis; Eric Lander 1,11,12; Stacey Gabriel; Benjamin M. Neale 1,2,5,13; Sekar Kathiresan 1,2,6,14; Mark J. Daly 1,2,5,13; Eric Banks; Daniel G. MacArthur 1,2,5,6,156,157 & Michael E. Talkowski 1,2,4,13 ✉

Structural variants (SVs) rearrange large segments of DNA and can have profound consequences in evolution and human disease2,3. As national biobanks, disease-association studies, and clinical genetic testing have grown increasingly reliant on genome sequencing, population references such as the Genome Aggregation Database (gnomAD) have become integral in the interpretation of single-nucleotide variants (SNVs). However, there are no reference maps of SVs from high-coverage genome sequencing comparable to those for SNVs. Here we present a reference of sequence-resolved SVs constructed from 14,891 genomes across diverse global populations (54% non-European) in gnomAD. We discovered a rich and complex landscape of 433,371 SVs, from which we estimate that SVs are responsible for 25–29% of all rare protein-truncating events per genome. We found strong correlations between natural selection against damaging SNVs and rare SVs that disrupt or duplicate protein-coding sequence, which suggests that genes that are highly intolerant to loss-of-function are also sensitive to increased dosage. We also uncovered modest selection against noncoding SVs in cis-regulatory elements, although selection against protein-truncating SVs was stronger than all noncoding effects. Finally, we identified very large (over one megabase), rare SVs in 3.9% of samples, and estimate that 0.13% of individuals may carry an SV that meets the existing criteria for clinically important incidental findings. This SV resource is freely distributed via the gnomAD browser and will have broad utility in population genetics, disease-association studies, and diagnostic screening.

https://doi.org/10.1038/s41586-020-2287-8

Received: 2 March 2019

Accepted: 31 March 2020

Published online: 27 May 2020

Open access

Lists of affiliations and consortium members appear at the end of the paper.

SVs are DNA rearrangements that involve at least 50 nucleotides. By virtue of their size and abundance, SVs represent an important mutational force that shapes genome evolution and function2,3, and contribute to germline and somatic diseases9–11. The profound effect of SVs is also attributable to the numerous mechanisms by which they can disrupt protein-coding genes and cis-regulatory architecture. SVs can be grouped into mutational classes that include ‘unbalanced’ gains or losses of DNA (for example, copy-number variants, CNVs), and ‘balanced’ rearrangements that occur without corresponding dosage alterations (such as inversions and translocations) (Fig. 1a). Other common forms of SVs include mobile elements that insert themselves throughout the genome, and multiallelic CNVs (MCNVs) that can exist at high copy numbers. More recently, exotic species of complex SVs have been discovered that involve two or more distinct SV signatures in a single mutational event interleaved on the same allele, and can range from CNV-flanked inversions to rare instances of localized chromosome shattering, such as chromothripsis13,14. The diversity of SVs in humans is therefore far greater than has been widely appreciated, as is their influence on genome structure and function.

Although SVs alter more nucleotides per genome than SNVs and short insertion/deletion variants (indels; <50 bp), surprisingly little is known about their mutational spectra on a global scale. The largest published population study of SVs using whole-genome sequencing (WGS) remains the 1000 Genomes Project (n = 2,504; 7× sequence coverage), and the substantial technical challenges of SV discovery from WGS have led to non-uniform SV analyses across contemporary studies16–20. Moreover, short-read WGS is unable to capture a subset of SVs accessible to more expensive niche technologies, such as long-read WGS. Owing to the combination of these challenges, SV references are dwarfed by contemporary resources for short variants, such as the Exome Aggregation Consortium (ExAC) and its successor, the Genome Aggregation Database (gnomAD), which have jointly analysed more than 140,000 individuals4,6. Publicly available resources such as ExAC and gnomAD have transformed many aspects of human genetics, including defining sets of genes constrained against damaging coding mutations and providing frequency filters for variant interpretation. As short-read WGS is rapidly becoming the predominant technology in large-scale human disease studies, and will probably displace conventional methods for diagnostic screening, there is a mounting need for comparable references of SVs across global populations.

In this study, we developed gnomAD-SV, a sequence-resolved reference for SVs from 14,891 genomes. Our analyses revealed diverse mutational patterns among SVs, and principles of selection acting against reciprocal dosage changes in genes and noncoding cis-regulatory elements. From these analyses, we determined that SVs represent more than 25% of all rare protein-truncating events per genome, emphasizing the unrealized potential of routine SV detection in WGS studies. This SV reference has been integrated into the gnomAD browser (http://gnomad.broadinstitute.org) with no restrictions on reuse so that it can be mined for new insights into genome biology and applied as a resource to interpret SVs in diagnostic screening.

SV discovery and genotyping

We analysed WGS data for 14,891 samples (average coverage of 32×) aggregated from large-scale sequencing projects, of which 14,237 (95.6%) passed all quality thresholds, representing a general adult population depleted for severe Mendelian diseases (median age of 49 years) (Supplementary Table 1, Supplementary Figs. 1, 2). This cohort included 46.1% European, 34.9% African or African American, 9.2% East Asian, and 8.7% Latino samples, as well as 1.2% samples from admixed or other populations (Fig. 1). Following family-based analyses using 970 parent–child trios for quality assessments, we pruned all first-degree relatives from the cohort, retaining 12,653 unrelated genomes for subsequent analyses.

We discovered and genotyped SVs using a cloud-based, multi-algorithm pipeline for short-read WGS (Supplementary Fig. 3), which we prototyped in a study of 519 autism quartet families. This pipeline integrated four orthogonal evidence types to capture SVs across the size and allele frequency spectra, including six classes of canonical SVs (Fig. 1a) and 11 subclasses of complex SVs (Fig. 2). We augmented this pipeline with new methods to account for the technical heterogeneity of aggregated datasets (Extended Data Fig. 1, Supplementary Figs. 4, 5), and discovered 433,371 SVs (Fig. 1c). After excluding low-quality SVs, which were predominantly (61.6%) composed of incompletely resolved breakpoint junctions (that is, ‘breakends’) that lack interpretable alternative allele structures for functional annotation and produce high false-discovery rates20 (Extended Data Fig. 2a), we retained 335,470 high-quality SVs for subsequent analyses (Supplementary Table 3). This final set of high-quality SVs corresponded to a median of 7,439 SVs per genome, or more than twice the number of variants per genome captured by previous WGS-based SV studies such as the 1000 Genomes Project (3,441 SVs per genome from approximately 7× coverage WGS), which underscores the benefits of high-coverage WGS and improved multi-algorithm ensemble methods for SV discovery.

Given that there are no gold-standard benchmarking procedures for SVs from WGS, we evaluated the technical qualities of gnomAD-SV using seven orthogonal approaches. These analyses are described in detail in Extended Data Figs. 2, 3, Supplementary Figs. 6–12, Supplementary Table 4 and Supplementary Note 1, but we highlight just a few here to demonstrate that gnomAD-SV conforms to many fundamental principles of population genetics, including Mendelian segregation, genotype distributions, and linkage disequilibrium. We found that the precision of gnomAD-SV was comparable to our previous study of 519 autism quartets that attained a 97% molecular validation rate for all de novo SV predictions: in gnomAD, analyses of 970 parent–child trios indicated a median Mendelian violation rate of 3.8% and a heterozygous de novo rate of 3.0%. We also observed that 86% of SVs were in Hardy–Weinberg equilibrium, and common SVs were in strong linkage disequilibrium with nearby SNVs or indels (median peak $R^2$ = 0.85). We performed extensive in silico confirmation of 19,316 SVs predicted from short-read WGS using matched long-read WGS from four samples21,23, finding a 94.0% confirmation rate with breakpoint-level read evidence, and revealing that 59.8% of breakpoint coordinates were accurate within a single nucleotide of the long-read data. These and other benchmarking approaches suggested that gnomAD-SV was sufficiently sensitive and specific to be used as a reference dataset for most applications in human genomics.
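The Hardy–Weinberg check mentioned above can be illustrated with a minimal sketch. The text does not say which test the authors used, so the one-degree-of-freedom chi-square test below is an assumption, and the function name and genotype counts are hypothetical.

```python
import math


def hwe_chi_square(n_ref_hom: int, n_het: int, n_alt_hom: int) -> tuple[float, float]:
    """Chi-square test (1 d.f.) of Hardy-Weinberg equilibrium for a biallelic site.

    Returns (chi-square statistic, p-value). Illustrative only; not the
    procedure used to build gnomAD-SV.
    """
    n = n_ref_hom + n_het + n_alt_hom
    if n == 0:
        raise ValueError("no genotyped samples")
    p = (2 * n_ref_hom + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p                            # alternate allele frequency
    observed = (n_ref_hom, n_het, n_alt_hom)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts: the first site matches HWE expectations,
# the second has no heterozygotes at all.
in_equilibrium = hwe_chi_square(810, 180, 10)
out_of_equilibrium = hwe_chi_square(900, 0, 100)
```

A site would be counted as being in equilibrium when its p-value exceeds a chosen significance threshold, typically after correcting for the number of sites tested.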

[Figure 1, panels a–h. Recoverable labels: a, SV classes (DEL, DUP, MCNV, INS, INV, CTX, CPX; unresolved BND discarded). b, Samples (×1,000) per dataset: gnomAD-SV 14,237; 1000G 2,504; GoNL 769; GTEx 147. c, SVs (×1,000) per dataset. d, PC1 versus PC2, coloured by population (AFR, AMR, EAS, EUR, OTH). e, SVs per genome by population and SV class. f, SVs discovered versus SV size, with Alu, SVA and LINE1 peaks marked. g, Fraction of SVs versus allele count, split into rare (AF < 1%) and common (AF > 1%). h, Singleton proportion versus SV size.]

Fig. 1 | Properties of SVs across human populations. a, SV classes catalogued in this study. We also documented unresolved non-reference ‘breakends’ (BNDs), but they were excluded from all analyses as low-quality variants. b, After quality control, we analysed 14,237 samples across continental populations, including African/African American (AFR), Latino (AMR), East Asian (EAS), and European (EUR), or other populations (OTH). Three publicly available WGS-based SV datasets are provided for comparison (1000 Genomes Project (1000G), approximately 7× coverage; Genome of the Netherlands Project (GoNL), around 13× coverage; Genotype-Tissue Expression Project (GTEx), approximately 50× coverage)1,16,17. c, We discovered 433,371 SVs, and provide counts from previous studies for comparison1,16,17. d, A principal component (PC) analysis of genotypes for 15,395 common SVs separated samples along axes corresponding to genetic ancestry. e, The median genome contained 7,439 SVs. f, Most SVs were small. Expected Alu, SVA and LINE1 mobile element insertion peaks are marked at approximately 300 bp, 2.1 kb and 6 kb, respectively. g, Most SVs were rare (allele frequency (AF) <1%), and 49.8% of SVs were singletons (solid bars). h, Allele frequencies were inversely correlated with SV size across all 335,470 resolved SVs in unrelated individuals. Values are mean and 95% confidence interval from 100-fold bootstrapping. Colour codes are consistent between a, c, e–h, and between b and d.

Population genetics and genome biology

The distribution of SVs across samples matched expectations based on human demographic history, with the top three components of genetic variance separating continental populations (Fig. 1d, Supplementary Fig. 13). African and African American samples exhibited the greatest genetic diversity and their common SVs were in weaker linkage disequilibrium with nearby short variants than Europeans, whereas East Asians featured the highest levels of homozygosity (Fig. 1e, Extended Data Fig. 4a–d, Supplementary Fig. 7). The mutational diversity of gnomAD-SV was extensive: we completely resolved 5,295 complex SVs across 11 mutational subclasses, of which 3,901 (73.7%) involved inverted segments (Fig. 2), confirming that inversion variation is predominantly composed of complex SVs rather than canonical inversions1,24. Across all SV classes, most SVs were small (median size of 331 bp) and rare (allele frequency < 1%; 92% of SVs), with half of all SVs (49.8%) appearing as ‘singletons’ (that is, only one allele observed across all samples) (Fig. 1f, g). Although the proportion of singletons varied by SV class, it was strongly dependent on SV size across all classes, which suggests that the amount of DNA rearranged is a key determinant of selection against most SVs (Fig. 1h, Extended Data Fig. 5a).
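The dependence of singleton proportion on SV size can be made concrete with a short sketch. The size bins follow the axis labels of Fig. 1h; the function names and the toy variant list are hypothetical and are not gnomAD-SV data.

```python
from collections import defaultdict

# Upper bounds (in bp, exclusive) and labels for size bins, following Fig. 1h.
SIZE_BINS = [
    (1_000, "<1 kb"),
    (10_000, "1-10 kb"),
    (100_000, "10-100 kb"),
    (1_000_000, "100 kb-1 Mb"),
]


def size_bin(size_bp: int) -> str:
    """Return the label of the size bin that contains size_bp."""
    for upper, label in SIZE_BINS:
        if size_bp < upper:
            return label
    return ">1 Mb"


def singleton_proportion_by_bin(variants):
    """variants: iterable of (size_bp, allele_count) pairs.

    Returns {bin label: fraction of variants in that bin with allele count 1}.
    """
    totals = defaultdict(int)
    singletons = defaultdict(int)
    for size_bp, allele_count in variants:
        label = size_bin(size_bp)
        totals[label] += 1
        if allele_count == 1:
            singletons[label] += 1
    return {label: singletons[label] / totals[label] for label in totals}


# Toy data for illustration only.
toy_variants = [
    (331, 1), (450, 12), (800, 1),
    (2_500, 1), (7_000, 3),
    (250_000, 1), (3_000_000, 1),
]
proportions = singleton_proportion_by_bin(toy_variants)
```

On real data, a rising singleton proportion across bins would correspond to the size trend described above.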

Mutation rate estimates for SVs have remained elusive owing to limited sample sizes, poor resolution of conventional technologies, technical challenges of SV discovery, and use of cell line-derived DNA in population studies1,25. Here, we used the Watterson estimator to project a mean mutation rate of 0.29 de novo SVs (95% confidence interval 0.13–0.44) per generation in regions of the genome accessible to short-read WGS, or roughly one new SV every 2–8 live births, with mutation rates varying markedly by SV class (Fig. 3a). Although this imperfect method extrapolates from data pooled across unrelated individuals, we previously demonstrated comparable rates from molecularly validated observations in 519 quartet families. Like mutation rates, the distribution of SVs throughout the genome was non-uniform, significantly correlated with repetitive sequence contexts, and was enriched near centromeres and telomeres (Supplementary Fig. 16). These trends were dependent on SV class, as biallelic deletions and duplications were predominantly enriched at telomeres, whereas MCNVs were enriched in centromeric segmental duplications (Fig. 3b–d). Given the reduced sensitivity of short-read WGS in repetitive sequences, this study certainly underestimates the true SV mutation rates; nevertheless, these analyses implicate several aspects of chromosomal context and SV class in determining SV mutation rates throughout the genome.
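For readers unfamiliar with the Watterson estimator, the textbook form is θ_W = K / a_n with a_n = Σ_{i=1}^{n−1} 1/i, where K is the number of segregating sites and n the number of sampled chromosomes. Under the standard neutral model for a diploid population, θ = 4·N_e·μ. The sketch below implements only these textbook formulas. The effective population size is a free parameter here, the example inputs are hypothetical, and the authors' exact procedure is not described in this section.

```python
def watterson_a(n_chromosomes: int) -> float:
    """Harmonic-sum normaliser a_n = sum_{i=1}^{n-1} 1/i."""
    if n_chromosomes < 2:
        raise ValueError("need at least two chromosomes")
    return sum(1.0 / i for i in range(1, n_chromosomes))


def watterson_theta(n_segregating_sites: int, n_chromosomes: int) -> float:
    """Watterson's estimator: theta_W = K / a_n."""
    return n_segregating_sites / watterson_a(n_chromosomes)


def mutation_rate_per_generation(theta: float, effective_population_size: float) -> float:
    """Solve theta = 4 * N_e * mu for mu (diploid, neutral model)."""
    return theta / (4.0 * effective_population_size)


def births_per_new_variant(rate_per_generation: float) -> float:
    """Expected number of births per new variant, the reciprocal of the rate."""
    return 1.0 / rate_per_generation


# Hypothetical toy inputs: 11 segregating sites among 4 chromosomes,
# with an assumed effective population size of 10,000.
theta = watterson_theta(11, 4)
mu = mutation_rate_per_generation(theta, 10_000)
```

Applying `births_per_new_variant` to the bounds of the quoted confidence interval (0.44 and 0.13 SVs per generation) gives roughly 2.3 and 7.7, which is consistent with the "2–8 live births" range stated in the text.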

[Figure 2: table of complex SV subclasses. The mutational-signature, allele-structure and size-density columns are graphical; the recoverable columns are listed here.]

Complex SV subclass                     Abbrev.               Resolved SVs   SV size    APS
All complex SVs                         CPX                   5,295          2.2 kb     0.02
Paired-duplication inversion            dupINVdup             258            155.5 kb   0.07
Paired-deletion inversion               delINVdel             616            9.7 kb     0.04
Paired-deletion/duplication inversion   delINVdup, dupINVdel  551            8.6 kb     –0.02
Deletion-flanked inversion              delINV, INVdel        623            4 kb       0.02
Insertion with insertion site deletion  dDUP-iDEL, INS-iDEL   288            3.9 kb     0.11
Duplication-flanked inversion           dupINV, INVdup        1,851          1.5 kb     0.01
Dispersed duplication                   dDUP                  1,106          0.3 kb     0.02

Fig. 2 | Complex SVs are abundant in the human genome. We resolved 5,295 complex SVs across 11 mutational subclasses, 73.7% of which involved at least one inversion. Each subclass is detailed here, including their mutational signatures, structures, abundance, density of SV sizes (vertical line indicates median size), and allele frequencies. Five pairs of subclasses have been collapsed into single rows due to mirrored or similar alternative allele structures (for example, delINV versus INVdel). Two complex SVs did not conform to any subclass (Extended Data Fig. 8).

[Figure 3, panels a–d. Recoverable values: a, mutation rate μ (SVs per generation) from the Watterson estimator: All 0.286; DEL 0.146; INS 0.095; DUP 0.040; CPX 0.004; INV 0.001. b, SV fold-enrichment along the meta-chromosome (mean of autosomes). c, SV fold-enrichment along the meta-chromosome by class: DEL n = 172,637; DUP n = 46,408; MCNV n = 1,055; INS n = 109,278; INV n = 788; CPX n = 5,295. d, SV fold-enrichment by class (ALL, DEL, DUP, MCNV, INS, INV, CPX) and chromosomal position (T, I, C); *Bonferroni P < 0.05.]

Fig. 3 | Genome-wide mutational patterns of SVs. a, Mutation rates (μ) from the Watterson estimator for each SV class. Bars represent 95% confidence intervals. Rates of molecularly validated de novo SVs from 519 quartet families are provided for comparison. b, Smoothed enrichment of SVs per 100-kb window across the average of all autosomes normalized by chromosome arm length (a ‘meta-chromosome’) (Supplementary Fig. 16). c, The distribution of SVs along the meta-chromosome was dependent on variant class. d, SV enrichment by class and chromosomal position provided as mean and 95% confidence intervals (CI). C, centromeric; I, interstitial; T, telomeric. P values were computed using a two-sided t-test and were Bonferroni-adjusted for 21 comparisons. *$P \le 2.38 \times 10^{-3}$.

Dosage sensitivity of coding and noncoding loci

Owing to their size and mutational diversity, SVs can have varied consequences on protein-coding genes12 (Fig. 4a, Supplementary Fig. 17). In principle, any SV can result in predicted loss-of-function (pLoF), either by deleting coding nucleotides or altering open-reading frames. Coding duplications can result in copy-gain of entire genes, or of a subset of exons within a gene (referred to here as intragenic exonic duplication, or IED). The average genome in gnomAD-SV contained a mean of 179.8 genes altered by biallelic SVs (144.3 pLoF, 24.3 copy-gain, and 11.2 IED), of which 11.6 were predicted to be completely inactivated by homozygous pLoF (Fig. 4b, Extended Data Fig. 4e–h). When restricted to rare (allele frequency < 1%) SVs, we observed a mean of 10.2 altered genes per genome (5.5 pLoF, 3.4 copy-gain, and 1.3 IED). By comparison, a companion gnomAD paper estimated 122.4 pLoF short variants per genome, of which 16.3 were rare. These analyses suggest that 29.4% of rare heterozygous gene inactivation events per individual are contributed by SVs, or conservatively 25.2% of pLoF events if we exclude IEDs given the context-dependence of their functional impact.
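The two percentages above appear to follow directly from the per-genome means quoted in the same paragraph. The sketch below shows one reading of that arithmetic. The formula is inferred from the numbers rather than stated in the text, and the variable names are illustrative.

```python
def sv_fraction(sv_events: float, short_variant_events: float) -> float:
    """Share of events contributed by SVs among SV plus short-variant events."""
    return sv_events / (sv_events + short_variant_events)


# Per-genome means for rare variants, as quoted in the text.
RARE_PLOF_SV = 5.5       # genes with rare pLoF SVs
RARE_IED_SV = 1.3        # genes with rare intragenic exonic duplications
RARE_PLOF_SHORT = 16.3   # rare pLoF short variants (SNVs and indels)

with_ied = sv_fraction(RARE_PLOF_SV + RARE_IED_SV, RARE_PLOF_SHORT)
without_ied = sv_fraction(RARE_PLOF_SV, RARE_PLOF_SHORT)

print(f"{with_ied:.1%}")     # 29.4%
print(f"{without_ied:.1%}")  # 25.2%
```

Under this reading, copy-gain events (3.4 per genome) are left out of both figures because they increase rather than inactivate gene dosage.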

A fundamental question in human genetics is the degree to which natural selection acts on coding and noncoding loci. The proportion of singleton variants has been established as a proxy for strength of selection; however, this metric is confounded for SVs given the strong correlation between allele frequency and SV size, among other factors. Therefore, we developed a new metric, adjusted proportion of singletons (APS), to account for SV class, size, genomic context, and other technical covariates (Extended Data Fig. 5, Supplementary Fig. 14). Under this normalized APS metric, a value of zero corresponds to a singleton proportion comparable to intergenic SVs, whereas values greater than zero reflect purifying selection, similar to the ‘mutability-adjusted proportion of singletons’ (MAPS) metric used for SNVs. Applying this APS model revealed signals of pervasive selection against nearly all classes of SVs that overlap genes, including intronic SVs, whole-gene inversions, SVs in gene promoters, and deletions as small as a single exon (Fig. 4c, Extended Data Fig. 6, Supplementary Fig. 18). The one notable exception was copy-gain duplications, which showed no clear evidence of selection beyond what could already be explained by their sizes, which were vastly larger than non-copy-gain duplications (median copy-gain duplication size = 134.8 kb; median non-copy-gain duplication size = 2.7 kb; one-tailed Wilcoxon test, W = 1.18 × 10^[…], $P < 10^{-100}$). This result could have numerous explanations, but it is consistent with the known diverse evolutionary roles of gene duplication events, including positive selection reported in humans27,28.
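The idea behind APS can be sketched in a deliberately simplified form: estimate the expected singleton proportion from a reference set (here, intergenic SVs), then average the difference between each variant's observed singleton status and that expectation. The real APS model accounts for SV class, size, genomic context and other covariates. The version below adjusts for size only, using order-of-magnitude bins, and every name and data point in it is hypothetical.

```python
import math
from collections import defaultdict


def log10_size_bin(size_bp: int) -> int:
    """Order-of-magnitude size bin, e.g. 300 bp -> 2 and 5,000 bp -> 3."""
    return int(math.floor(math.log10(size_bp)))


def expected_singleton_rate(reference_variants):
    """Singleton proportion per size bin in a reference set.

    reference_variants: iterable of (size_bp, is_singleton) pairs.
    """
    totals = defaultdict(int)
    singletons = defaultdict(int)
    for size_bp, is_singleton in reference_variants:
        b = log10_size_bin(size_bp)
        totals[b] += 1
        singletons[b] += int(is_singleton)
    return {b: singletons[b] / totals[b] for b in totals}


def adjusted_proportion_of_singletons(variants, expected):
    """Mean of (observed singleton indicator - expected rate for the size bin).

    Raises KeyError if a variant falls in a bin absent from the reference set.
    """
    residuals = [
        int(is_singleton) - expected[log10_size_bin(size_bp)]
        for size_bp, is_singleton in variants
    ]
    return sum(residuals) / len(residuals)


# Toy data for illustration only.
intergenic = [(300, True), (500, False), (2_000, True),
              (5_000, False), (8_000, False), (9_000, False)]
genic = [(400, True), (3_000, True), (6_000, False)]

baseline = expected_singleton_rate(intergenic)
aps_intergenic = adjusted_proportion_of_singletons(intergenic, baseline)
aps_genic = adjusted_proportion_of_singletons(genic, baseline)
```

By construction the reference set scores zero, matching the interpretation given in the text, and a set enriched for singletons relative to size-matched intergenic SVs scores above zero.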

Methods that quantify evolutionary constraint on a per-gene basis, such as the probability of intolerance to heterozygous pLoF variation (pLI) and the pLoF observed/expected upper fraction (LOEUF), have become core resources in human genetics. Nearly all existing metrics, including pLI and LOEUF, are derived from SNVs. Although previous studies have attempted to compute similar scores using large CNVs detected by microarray and exome sequencing29,30, or to correlate deletions with pLI, no gene-level metrics comparable to LOEUF exist for SVs at WGS resolution. To gain insight into this problem, we built a model to estimate the depletion of rare SVs per gene compared to expectations based on gene length, genomic context, and the structure of exons and introns. This model is imperfect, as current sample sizes are too sparse to derive precise gene-level metrics of constraint from SVs. Nevertheless, we found strong concordance between the depletion of rare pLoF SVs and existing pLoF and missense SNV constraint metrics (pLoF Spearman correlation test, ρ = 0.90, $P < 10^{-100}$) (Fig. 4d, Supplementary Fig. 19). Notably, a comparable positive correlation was also observed for copy-gain SVs and SNV constraint (pLoF Spearman correlation test, ρ = 0.78, $P < 10^{-100}$), whereas a weaker yet significant correlation was detected for IEDs (pLoF Spearman correlation test, ρ = 0.58, $P = 2.0 \times 10^{-11}$). As orthogonal support for these trends, we identified an inverse correlation between APS and SNV constraint across all functional categories of SVs, which was consistent with our observed depletion of rare, functional SVs in constrained genes (Extended Data Fig. 6f). These comparisons confirm that selection against most classes of gene-altering SVs mirrors patterns observed for short variants18,30. They further suggest that SNV-derived constraint metrics such as LOEUF capture a general correspondence between haploinsufficiency and triplosensitivity for a large fraction of genes in the genome. It therefore appears that the most highly pLoF-constrained genes not only are sensitive to pLoF, but also are more likely to be intolerant to increased dosage and other functional alterations.
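The correlations reported above are Spearman rank correlations between SNV constraint and the observed/expected ratio of rare SVs per bin of genes. A minimal, dependency-free sketch of Spearman's ρ (the Pearson correlation of average ranks) is shown below. The per-bin values are hypothetical and serve only to exercise the function.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical bins: LOEUF percentile versus rare pLoF SV obs/exp ratio.
loeuf_percentile = [5, 25, 50, 75, 95]
sv_obs_over_exp = [0.42, 0.61, 0.80, 0.93, 1.02]
rho = spearman_rho(loeuf_percentile, sv_obs_over_exp)
```

Because ρ depends only on ranks, it captures the monotonic trend across constraint bins without assuming a linear relationship between LOEUF and SV depletion.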

In contrast to the well-studied effects of coding variation, the effects of noncoding SVs on regulatory elements are largely unknown. There are a handful of examples of SVs with strong noncoding effects, although they are scarce in humans and model organisms31,32. In gnomAD-SV, we explored noncoding dosage sensitivity across 14 regulatory element classes, ranging from high-confidence experimentally validated enhancers to large databases of computationally predicted elements (Supplementary Table 5). We found that noncoding CNVs overlapping most element classes had increased proportions of singletons, although none exceeded the APS observed for pLoF SVs (Fig. 5a). In general, the effects of noncoding deletions appeared stronger than noncoding duplications, and CNVs predicted to delete or duplicate entire elements were under stronger selection than partial element disruption (Fig. 5b). We also observed that primary sequence conservation was correlated with selection against noncoding CNVs (Fig. 5c, d), which provides a foothold for future work on interpretation and functional effect prediction for noncoding SVs. Broadly, these results followed trends we observed for protein-coding SVs, which we interpreted as evidence for weak but widespread selection against CNVs altering most classes of annotated regulatory elements.

[Figure 4, panels a–d. Recoverable labels: a, categories of gene-overlapping SVs (pLoF, IED, copy gain, whole-gene INV) with number of SVs, median size and SVs per gene. b, Genes per genome for all SVs, rare SVs, and SNVs and indels. c, APS for pLoF (All, INS, DEL, INV/CPX), IED, whole-gene INV, promoter, intronic and intergenic SVs. d, Rare SV (obs/exp) versus pLoF SNV constraint (LOEUF) percentile for pLoF, CG, IED and INV.]

Fig. 4 | Pervasive selection against SVs in genes mirrors coding short variants. a, Four categories of gene-overlapping SVs, with counts of total SVs, median SV size, and mean SVs per gene in gnomAD-SV. b, Count of genes altered by SVs per genome. Horizontal lines indicate medians. Sample sizes per category listed in Supplementary Table 9. c, APS value for SVs overlapping genes. Bars indicate 100-fold bootstrapped 95% confidence intervals. SVs per category listed in Supplementary Table 9. d, Relationships of constraint against pLoF SNVs versus gene-overlapping SVs in 100 bins of around 175 genes each, ranked by SNV constraint. Correlations were assessed with a two-sided Spearman correlation test. Solid lines represent 21-point rolling means. See Supplementary Fig. 19 for comparisons to missense constraint.

Trait association and clinical genetics

Most large-scale trait association studies have only considered SNVs in genome-wide association studies (GWAS). Taking advantage of the sample size and resolution of gnomAD-SV, we evaluated whether SNVs associated with human traits might be in linkage disequilibrium with SVs not directly genotyped in GWAS. We identified 15,634 common SVs (allele frequency > 1%) in strong linkage disequilibrium ($R^2$ ≥ 0.8) with at least one common short variant (Supplementary Fig. 7), 14.8% of which matched a reported association from the NHGRI-EBI GWAS catalogue or a recent analysis of 4,203 phenotypes in the UK Biobank33,34. Common SVs in linkage disequilibrium with GWAS variants were enriched for genic SVs across multiple functional categories (Supplementary Table 6), and included candidate SVs such as a deletion of a thyroid enhancer in the first intron of ATP6V0D1 at a hypothyroidism-associated locus (Extended Data Fig. 7). We also identified matches for previously proposed causal SVs tagged by common SNVs, including pLoF deletions of CFHR3 or CFHR1 in nephropathies and of LCE3B or LCE3C in psoriasis35,36. These results demonstrate the value of imputing SVs into GWAS, and for the eventual unification of short variants and SVs in all trait association studies. Given the potential value of this resource, we have released these linkage disequilibrium maps in Supplementary Table 7.

As genomic medicine advances towards diagnostic screening at sequence resolution, computational methods for variant discovery from WGS and population references for interpretation will become indispensable. One category of disease-associated SVs, recurrent CNVs mediated by homologous segmental duplications known as genomic disorders, is particularly important because these CNVs collectively represent a common cause of developmental disorders. Accurate detection of large, repeat-mediated CNVs is thus crucial for WGS-based diagnostic testing, as chromosomal microarray is the recommended first-tier diagnostic screen at present for unexplained developmental disorders. Using gnomAD-SV, we evaluated our ability to detect genomic disorders in WGS data by calculating CNV carrier frequencies for 49 genomic disorders across 10,047 unrelated samples with no known neuropsychiatric disease, and found that CNV carrier frequencies in gnomAD-SV were consistent with those reported from chromosomal microarray in the UK Biobank (Pearson correlation test: correlation statistic of 0.669, P = 7.38 × 10^−13) (Fig. 6a, Supplementary Table 8, Supplementary Fig. 20). The frequencies of carriers of genomic disorders did not vary significantly among populations, with the exception of duplications of NPHP1 at 2q13, in which carrier frequencies in East Asian samples were up to 4.6-fold higher than in other populations, further highlighting the potential for variant interpretation to be confounded by the limited diversity of existing SV references (Supplementary Fig. 21).
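The carrier-frequency comparison described above can be illustrated with a short sketch. It assumes only that a carrier frequency is the number of carriers divided by the number of samples and that the two cohorts are compared locus by locus with a Pearson correlation. The carrier counts, the array cohort size, and the function names are hypothetical; only the 10,047 sample count comes from the text.

```python
import math


def carrier_frequency(n_carriers, n_samples):
    """Fraction of samples carrying a CNV at a given locus."""
    return n_carriers / n_samples


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


N_WGS = 10_047      # unrelated WGS samples, as stated in the text
N_ARRAY = 400_000   # hypothetical size of a microarray cohort

# Hypothetical carrier counts at five loci (not data from the paper).
wgs_carriers = [3, 12, 1, 25, 7]
array_carriers = [110, 520, 30, 900, 260]

wgs_freq = [carrier_frequency(c, N_WGS) for c in wgs_carriers]
array_freq = [carrier_frequency(c, N_ARRAY) for c in array_carriers]

r = pearson_r(wgs_freq, array_freq)
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
```

Both r and its square are printed because the symbol for the reported statistic is not recoverable from the text here.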

In the context of variant interpretation, the current gnomAD-SV resource will permit a screening threshold of allele frequencies less than 0.1% when matching on ancestry to the populations sampled here, and allele frequencies less than 0.004% globally. In the current release, we catalogued at least one pLoF or copy-gain variant for 36.9% and 23.7% of all autosomal genes, respectively, and 490 genes with at least one homozygous pLoF SV (Fig. 6b, Extended Data Fig. 6e, Supplementary Fig. 22). We also benchmarked carrier rates for several categories of clinically relevant variants in gnomAD-SV. First, 0.32% of samples carried a very rare (allele frequency < 0.1%) SV resulting in pLoF of a gene for which incidental findings are clinically actionable, nearly half of which (that is, 0.13% of all samples) would meet diagnostic criteria as pathogenic or likely pathogenic based upon the American College of Medical Genetics (ACMG) recommendations (Fig. 6c). Second, 7.22% of individuals were heterozygous carriers of rare pLoF SVs in known recessive developmental disorder genes. Third, we estimated that 3.8% of the general population (95% confidence interval of 3.2–4.6%) carries at least one very large (≥1 Mb) rare autosomal SV, roughly half of which (45.2%) were balanced or complex (Fig. 6d). Among these was an example of localized chromosome shattering involving at least 49 breakpoints, yet resulting in largely balanced products, reminiscent of chromothripsis, in an adult with no known severe disease or DNA repair defect13,14,22 (Fig. 6e, Extended Data Fig. 8). Collectively, these analyses highlight the potential of gnomAD-SV and WGS-based SV methods to augment disease-association studies and clinical interpretation across a broad spectrum of variant classes and study designs.
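The two screening thresholds quoted at the start of this paragraph can be applied as a simple frequency filter, sketched below. Two things here are assumptions rather than statements from the text: the record layout (the af_global and af_by_population keys) is hypothetical, and the helper min_observable_af reflects the common reading that the lowest frequency a reference can resolve is one allele among 2N autosomal chromosomes.

```python
ANCESTRY_MATCHED_MAX_AF = 0.001   # 0.1%, threshold quoted in the text
GLOBAL_MAX_AF = 0.00004           # 0.004%, threshold quoted in the text


def min_observable_af(n_samples):
    """Frequency of a single allele among 2N autosomal chromosomes."""
    return 1 / (2 * n_samples)


def passes_screen(variant, population=None):
    """Return True if the variant is rare enough to keep for interpretation.

    Uses the ancestry-matched threshold when a frequency for the requested
    population is available, and the global threshold otherwise.
    """
    by_population = variant.get("af_by_population", {})
    if population is not None and population in by_population:
        return by_population[population] < ANCESTRY_MATCHED_MAX_AF
    return variant["af_global"] < GLOBAL_MAX_AF


# Hypothetical variant record (not taken from gnomAD-SV).
example = {"af_global": 0.0002, "af_by_population": {"EAS": 0.0005}}

print(passes_screen(example, population="EAS"))  # ancestry-matched screen
print(passes_screen(example))                    # global screen
print(min_observable_af(500), min_observable_af(12500))
```

Under the 1/(2N) reading, a frequency of 0.1% corresponds to one allele in 500 individuals and 0.004% to one allele in 12,500 individuals. These figures are arithmetic illustrations, not sample sizes reported in this section.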

Discussion

Human genetic research and clinical diagnostics are becoming increasingly invested in capturing the complete landscape of variation in individual genomes. Ambitious international initiatives to generate short-read WGS in many thousands of individuals from common disease cohorts have underwritten this goal40,41, and millions of genomes will be sequenced in the coming years from national biobanks42,43. A central challenge to these efforts will be the uniform analysis and interpretation of all variation accessible to WGS, particularly SVs, which are frequently invoked as a source of added value offered by WGS. Indeed, early WGS studies in cardiovascular disease and autism have been largely consistent in their analyses of short variants, but every study has differed in its analysis of SVs18–20,40,41. Thus, while ExAC and gnomAD have prompted remarkable advances in medical and population genetics for short variants, the same gains have not yet been realized for SVs. Although gnomAD-SV is not exhaustively comprehensive, it was derived from WGS methods and a reference genome that match those currently used in many research and clinical settings, which will help to facilitate the eventual standardization of SV discovery, analysis, and interpretation across studies.

Most foundational assumptions about human genetic variation were consistent between SVs and short variants in gnomAD, most notably that SVs segregate stably on haplotypes in the population and experience selection commensurate with their predicted biological consequences. This study also spotlights unique aspects of SVs, such as their remarkable mutational diversity, their varied functional effects on coding sequence, and the intense selection against large and complex

[Figure 5, panels a to d. Panel a plots APS (axis 0.0 to 0.3), with points marked as significant (Bonferroni) or non-significant. Row labels: Protein-altering (pLoF & IED) CNVs; Protein-altering (constrained genes); All intergenic CNVs; No annotations; Any annotation; VISTA validated enhancers; ChromHMM genic enhancers; Ultraconserved elements; ChromHMM bivalent enhancers; ChromHMM enhancers; ChromHMM polycomb repressed; DNAseI hypersensitive sites; Human accelerated regions; Enhancer Atlas predictions; TF binding sites; Recombination hotspots; Predicted super enhancers; TAD boundaries; Chromatin loop boundaries. Group label: Strictly noncoding CNVs. Panel b compares APS (axis 0.0 to 0.4) for partial versus full overlap in DEL and DUP; P values recovered from the panel: 1.8 × 10^−2 and 7.5 × 10^−3. Panels c and d plot APS against phastCons percentile for DEL and DUP; correlation coefficients recovered from the panels: 0.62 and 0.66, each with P < 10^−100.]

Fig. 5 | Dosage sensitivity in the noncoding genome. a, Strength of selection (APS) for noncoding CNVs overlapping 14 categories of noncoding elements (Supplementary Table 5). Bars reflect 95% confidence intervals from 100-fold bootstrapping. Each category was compared to neutral variation (APS = 0) using a one-tailed t-test. Categories surpassing Bonferroni-corrected significance for 32 comparisons are indicated with dark shaded points. SVs per category listed in Supplementary Table 9. DEL, deletion; DUP, duplication; TAD, topologically associating domain; TF, transcription factor. b, CNVs that completely covered elements ('full') had significantly higher average APS values than CNVs that only partially covered elements ('partial'). P values calculated using a two-tailed paired two-sample t-test for the 14 categories from a. c, d, Spearman correlations between sequence conservation and APS for noncoding deletions (n = 143,353) (c) and duplications (n = 30,052) (d). Noncoding CNVs were sorted into 100-percentile bins based on the sum of the phastCons scores overlapped by the CNV. Correlations were assessed with a two-sided Spearman correlation test. Solid lines represent 21-point rolling means.
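Two of the statistical steps named in this caption, 100-fold bootstrapped 95% confidence intervals and a Bonferroni correction for 32 comparisons, can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the statistic is taken to be the mean of a per-variant score (a stand-in for APS, whose calculation is not described in this section), the interval uses a simple percentile method, a base significance level of 0.05 is assumed, and the function names and input values are hypothetical.

```python
import random


def bootstrap_mean_ci(values, n_boot=100, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement n_boot times and record each resample's mean.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lower = means[int(n_boot * alpha / 2)]
    upper = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Hypothetical per-variant scores for one annotation category.
scores = [0.1, 0.2, 0.3, 0.4, 0.5] * 20

low, high = bootstrap_mean_ci(scores)
threshold = bonferroni_threshold(0.05, 32)
print(f"95% CI: ({low:.3f}, {high:.3f}); per-test threshold: {threshold}")
```

With only 100 resamples the percentile bounds are coarse, so the interval end points shift noticeably between seeds. Fixing the seed keeps the sketch reproducible.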


References

The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data

A framework for variation discovery and genotyping using next-generation DNA sequencing data

Analysis of protein-coding genetic variation in 60,706 humans

UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age

The UK Biobank resource with deep phenotyping and genomic data
