
Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes

Konrad J. Karczewski, +95 more
- 30 Jan 2019 - 
- pp 531210
TLDR
Using an improved human mutation rate model, the authors classify human protein-coding genes along a spectrum representing tolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve gene discovery power for both common and rare diseases.
Abstract
Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes critical for an organism’s function will be depleted for such variants in natural populations, while non-essential genes will tolerate their accumulation. However, predicted loss-of-function (pLoF) variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes. Here, we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence pLoF variants in this cohort after filtering for sequencing and annotation artifacts. Using an improved model of human mutation, we classify human protein-coding genes along a spectrum representing intolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve gene discovery power for both common and rare diseases.


434 | Nature | Vol 581 | 28 May 2020
Article
The mutational constraint spectrum quantified from variation in 141,456 humans
Konrad J. Karczewski¹,² ✉, Laurent C. Francioli¹,², Grace Tiao¹,², Beryl B. Cummings¹,²,³, Jessica Alföldi¹,², Qingbo Wang¹,²,⁴, Ryan L. Collins¹,⁴,⁵, Kristen M. Laricchia¹,², Andrea Ganna¹,²,⁶, Daniel P. Birnbaum¹,², Laura D. Gauthier⁷, Harrison Brand¹,⁵, Matthew Solomonson¹,², Nicholas A. Watts¹,², Daniel Rhodes⁸, Moriel Singer-Berk¹,², Eleina M. England¹,², Eleanor G. Seaby¹,², Jack A. Kosmicki¹,²,⁴, Raymond K. Walters¹,²,⁹, Katherine Tashman¹,²,⁹, Yossi Farjoun⁷, Eric Banks⁷, Timothy Poterba¹,²,⁹, Arcturus Wang¹,²,⁹, Cotton Seed¹,²,⁹, Nicola Whiffin¹,²,¹⁰,¹¹, Jessica X. Chong¹², Kaitlin E. Samocha¹³, Emma Pierce-Hoffman¹,², Zachary Zappala¹,²,¹⁴, Anne H. O’Donnell-Luria¹,²,¹⁵,¹⁶, Eric Vallabh Minikel¹, Ben Weisburd⁷, Monkol Lek¹⁷, James S. Ware¹,¹⁰,¹¹, Christopher Vittal²,⁹, Irina M. Armean¹,², Louis Bergelson⁷, Kristian Cibulskis⁷, Kristen M. Connolly¹⁸, Miguel Covarrubias⁷, Stacey Donnelly¹, Steven Ferriera¹⁸, Stacey Gabriel¹⁸, Jeff Gentry⁷, Namrata Gupta¹,¹⁸, Thibault Jeandet⁷, Diane Kaplan⁷, Christopher Llanwarne⁷, Ruchi Munshi⁷, Sam Novod⁷, Nikelle Petrillo⁷, David Roazen⁷, Valentin Ruano-Rubio⁷, Andrea Saltzman¹, Molly Schleicher¹, Jose Soto⁷, Kathleen Tibbetts⁷, Charlotte Tolonen⁷, Gordon Wade⁷, Michael E. Talkowski¹,⁵,¹⁹, Genome Aggregation Database Consortium*, Benjamin M. Neale¹,²,⁹, Mark J. Daly¹,²,⁶,⁹ & Daniel G. MacArthur¹,²,¹⁵⁰,¹⁵¹ ✉
Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes that are crucial for the function of an organism will be depleted of such variants in natural populations, whereas non-essential genes will tolerate their accumulation. However, predicted loss-of-function (pLoF) variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes¹. Here we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence predicted loss-of-function variants in this cohort after filtering for artefacts caused by sequencing and annotation errors. Using an improved model of human mutation rates, we classify human protein-coding genes along a spectrum that represents tolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve the power of gene discovery for both common and rare diseases.
https://doi.org/10.1038/s41586-020-2308-7
Received: 27 January 2019
Accepted: 26 March 2020
Published online: 27 May 2020
Open access

¹Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
²Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA.
³Program in Biological and Biomedical Sciences, Harvard Medical School, Boston, MA, USA.
⁴Program in Bioinformatics and Integrative Genomics, Harvard Medical School, Boston, MA, USA.
⁵Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.
⁶Institute for Molecular Medicine Finland, Helsinki, Finland.
⁷Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
⁸Centre for Translational Bioinformatics, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London and Barts Health NHS Trust, London, UK.
⁹Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
¹⁰National Heart & Lung Institute and MRC London Institute of Medical Sciences, Imperial College London, London, UK.
¹¹Cardiovascular Research Centre, Royal Brompton & Harefield Hospitals NHS Trust, London, UK.
¹²Department of Pediatrics, University of Washington, Seattle, WA, USA.
¹³Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK.
¹⁴Vertex Pharmaceuticals Inc, Boston, MA, USA.
¹⁵Division of Genetics and Genomics, Boston Children’s Hospital, Boston, MA, USA.
¹⁶Department of Pediatrics, Harvard Medical School, Boston, MA, USA.
¹⁷Department of Genetics, Yale School of Medicine, New Haven, CT, USA.
¹⁸Broad Genomics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
¹⁹Department of Neurology, Harvard Medical School, Boston, MA, USA.
¹⁵⁰Present address: Centre for Population Genomics, Garvan Institute of Medical Research, and UNSW Sydney, Sydney, New South Wales, Australia.
¹⁵¹Present address: Centre for Population Genomics, Murdoch Children’s Research Institute, Melbourne, Victoria, Australia.
*Lists of authors and their affiliations appear at the end of the paper.
e-mail: konradk@broadinstitute.org; d.macarthur@garvan.org.au

The physiological function of most genes in the human genome remains unknown. In biology, as in many engineering and scientific fields, breaking the individual components of a complex system can provide valuable insight into the structure and behaviour of that system. For the discovery of gene function, a common approach is to introduce disruptive mutations into genes and determine their effects on cellular and physiological phenotypes in mutant organisms or cell lines². Such studies have yielded valuable insight into eukaryotic physiology and have guided the design of therapeutic agents³. However, although studies in model organisms and human cell lines have been crucial in deciphering the function of many human genes, they remain imperfect proxies for human physiology.

Obvious ethical and technical constraints prevent the large-scale engineering of loss-of-function (LoF) mutations in humans. However, recent exome and genome sequencing projects have revealed a surprisingly high burden of natural predicted LoF (pLoF) variation in the human population, including stop-gained, essential splice, and frameshift variants¹,⁴, which can serve as natural models for inactivation of human genes. Such variants have already revealed much about human biology and disease mechanisms, through many decades of study of the genetic basis of severe Mendelian diseases⁵, most of which are driven by disruptive variants in either the heterozygous or homozygous state. These variants have also proved valuable in identifying potential therapeutic targets: confirmed LoF variants in the PCSK9 gene have been causally linked to low levels of low-density lipoprotein cholesterol⁶, and have ultimately led to the development of several inhibitors of PCSK9 that are now in clinical use for the reduction of cardiovascular disease risk. A systematic catalogue of pLoF variants in humans and the classification of genes along a spectrum of tolerance to inactivation would provide a valuable resource for medical genetics, identifying candidate disease-causing mutations, potential therapeutic targets, and windows into the normal function of many currently uncharacterized human genes.
Several challenges arise when assessing LoF variants at scale. LoF variants are on average deleterious, and are thus typically maintained at very low frequencies in the human population. Systematic genome-wide discovery of these variants requires whole-exome or whole-genome sequencing of very large numbers of samples. In addition, LoF variants are enriched for false positives compared with synonymous or other benign variants, including mapping, genotyping (including somatic variation), and particularly, annotation errors¹, and careful filtering is required to remove such artefacts.
Population surveys of coding variation enable the evaluation of the strength of natural selection at a gene or region level. As natural selection purges deleterious variants from human populations, methods to detect selection have modelled the reduction in variation (constraint)⁷ or shift in the allele frequency distribution⁸, compared to an expectation. For analyses of selection on coding variation, synonymous variation provides a convenient baseline, controlling for other potential population genetic forces that may influence the amount of variation as well as technical features of the local sequence. A model of constraint was previously applied to define a set of 3,230 genes with a high probability of intolerance to heterozygous pLoF variation (pLI)⁴ and to estimate the selection coefficient for variants in these genes⁹. However, the ability to comprehensively characterize the degree of selection against pLoF variants is particularly limited, as for small genes, the expected number of mutations is still very low, even for samples of up to 60,000 individuals⁴,¹⁰. Furthermore, the previous dichotomization of pLI, although convenient for the characterization of a set of genes, disguises variability in the degree of selective pressure against a given class of variation and overlooks more subtle levels of intolerance to pLoF variation. With larger sample sizes, a more accurate quantitative measure of selective pressure is possible.
Here, we describe the detection of pLoF variants in a cohort of 125,748 individuals with whole-exome sequence data and 15,708 individuals with whole-genome sequence data, as part of the Genome Aggregation Database (gnomAD; https://gnomad.broadinstitute.org), the successor to the Exome Aggregation Consortium (ExAC). We develop a continuous measure of intolerance to pLoF variation, which places each gene on a spectrum of LoF intolerance. We validate this metric by comparing its distribution to several orthogonal indicators of constraint, including the incidence of structural variation and the essentiality of genes as measured using mouse gene knockout experiments and cellular inactivation assays. Finally, we demonstrate that this metric improves the interpretation of genetic variants that influence rare disease and provides insight into common disease biology. These analyses provide, to our knowledge, the most comprehensive catalogue so far of the sensitivity of human genes to disruption.
In a series of accompanying manuscripts, other complementary analyses of this dataset are described. Using an overlapping set of 14,237 whole genomes, the discovery and characterization of a wide variety of structural variants (large deletions, duplications, insertions, or other rearrangements of DNA) is reported¹¹. The value of pLoF variants for the discovery and validation of therapeutic drug targets is explored¹², and a case study of the use of these variants from gnomAD and other large reference datasets is provided to validate the safety of inhibition of LRRK2, a candidate therapeutic target for Parkinson’s disease¹³. By combining the gnomAD dataset with a large collection of RNA sequencing data from adult human tissues¹⁴, the value of tissue expression data in the interpretation of genetic variation across a range of human diseases is reported¹⁵. Finally, the effect of two understudied classes of human variation (multi-nucleotide variants¹⁶ and variants that create or disrupt open-reading frames in the 5′ untranslated region of human genes) is characterized and investigated¹⁷.
A high-quality catalogue of variation
We aggregated whole-exome sequencing data from 199,558 individuals and whole-genome sequencing data from 20,314 individuals. These data were obtained primarily from case–control studies of common adult-onset diseases, including cardiovascular disease, type 2 diabetes and psychiatric disorders. Each dataset, totalling more than 1.3 and 1.6 petabytes of raw sequencing data, respectively, was uniformly processed; joint variant calling was performed on each dataset using a standardized BWA-Picard-GATK pipeline¹⁸, and all data processing and analysis was performed using Hail¹⁹. We performed stringent sample quality control (Extended Data Fig. 1), removing samples with lower sequencing quality by a variety of metrics, samples from second-degree or closer related individuals across both data types, samples with inadequate consent for the release of aggregate data, and samples from individuals known to have a severe childhood-onset disease as well as their first-degree relatives. The final gnomAD release contains genetic variation from 125,748 exomes and 15,708 genomes from unique unrelated individuals with high-quality sequence data, spanning 6 global and 8 sub-continental ancestries (Fig. 1a, b), which we have made publicly available at https://gnomad.broadinstitute.org. We also provide subsets of the gnomAD datasets, which exclude individuals who are cases in case–control studies, or who are cases of a few particular disease types such as cancer and neurological disorders, or who are also aggregated in the Bravo TOPMed variant browser (https://bravo.sph.umich.edu).
Among these individuals, we discovered 17.2 million and 261.9 million variants in the exome and genome datasets, respectively; these variants were filtered using a custom random forest process (Supplementary Information) to 14.9 million and 229.9 million high-quality variants. Comparing our variant calls in two samples for which we had independent gold-standard variant calls, we found that our filtering achieves very high precision (more than 99% for single nucleotide variants (SNVs), over 98.5% for indels in both exomes and genomes) and recall (over 90% for SNVs and more than 82% for indels for both exomes and genomes) at the single sample level (Extended Data Fig. 2). In addition, we leveraged data from 4,568 and 212 trios included in our exome and genome call-sets, respectively, to assess the quality of our rare variants. We found that our model retains over 97.8% of the transmitted singletons (singletons in the unrelated individuals that are transmitted to an offspring) on chromosome 20 (which was not used for model training) (Extended Data Fig. 3a–d). In addition, the number of putative de novo calls after filtering is in line with expectations²⁰ (Extended Data Fig. 3e–h), and our model had a recall of 97.3% for de novo SNVs and 98% for de novo indels based on 375 independently validated de novo variants in our whole-exome trios (295 SNVs and 80 indels) (Extended Data Fig. 3i, j). Altogether, these results indicate that our filtering strategy produced a call-set with high precision and recall for both common and rare variants.
These variants reflect the expected patterns based on mutation and selection: we observe 84.9% of all possible consistently methylated CpG-to-TpG transitions that would create synonymous variants in the human exome (Supplementary Table 14), which indicates that at this sample size, we are beginning to approach mutational saturation of this highly mutable and weakly negatively selected variant class. However, we only observe 52% of methylated CpG stop-gained variants, which illustrates the action of natural selection removing a substantial fraction of gene-disrupting variants from the population (Fig. 1c–h). Across all mutational contexts, only 11.5% and 3.7% of the possible synonymous and stop-gained variants, respectively, are observed in the exome dataset, which indicates that current sample sizes remain far from capturing complete mutational saturation of the human exome (Extended Data Fig. 4).
Identifying loss-of-function variants
Some LoF variants will result in embryonic lethality in humans in a heterozygous state, whereas others are benign even at homozygosity, with a wide spectrum of effects in between. Throughout this manuscript, we define pLoF variants to be those that introduce a premature stop (stop-gained), shift the reported transcriptional frame (frameshift), or alter the two essential splice-site nucleotides immediately to the left and right of each exon (splice) found in protein-coding transcripts, and ascertain their presence in the cohort of 125,748 individuals with exome sequence data. As these variants are enriched for annotation artefacts¹, we developed the loss-of-function transcript effect estimator (LOFTEE) package, which applies stringent filtering criteria from first principles (such as removing terminal truncation variants, as well as rescued splice variants, that are predicted to escape nonsense-mediated decay) to pLoF variants annotated by the variant effect predictor (Extended Data Fig. 5a). Despite not using frequency information, we find that this method disproportionately removes pLoF variants that are common in the population, which are known to be enriched for annotation errors¹, while retaining rare, probably deleterious variation, as well as reported pathogenic variation (Fig. 2a). LOFTEE distinguishes high-confidence pLoF variants from annotation artefacts, and identifies a set of putative splice variants outside the essential splice site. The filtering strategy of LOFTEE is conservative in the interest of increasing specificity, filtering some potentially functional variants that display a frequency spectrum consistent with that of missense variation (Fig. 2b). Applying LOFTEE v1.0, we discover 443,769 high-confidence pLoF variants, of which 413,097 fall on the canonical transcripts of 16,694 genes. The number of pLoF variants per individual is consistent with previous reports¹, and is highly dependent on the frequency filters chosen (Supplementary Table 17).
Fig. 1 | Aggregation of 141,456 exome and genome sequences. a, Uniform manifold approximation and projection (UMAP)⁴⁶,⁴⁷ plot depicting the ancestral diversity of all individuals in gnomAD, using seven principal components. Note that long-range distances in the UMAP space are not a proxy for genetic distance. b, The number of individuals by population and subpopulation in the gnomAD database. Colours representing populations in a and b are consistent. c, d, The mutability-adjusted proportion of singletons⁴ (MAPS) is shown across functional categories for SNVs in exomes (c; x axis shared with e and g) and genomes (d; x axis shared with f and h). Higher values indicate an enrichment of lower frequency variants, which suggests increased deleteriousness. e, f, The proportion of possible variants observed for each functional class for each mutational type for exomes (e) and genomes (f). CpG transitions are more saturated, except where selection (for example, pLoFs) or hypomethylation (5′ untranslated region) decreases the number of observations. g, h, The total number of variants observed in each functional class for exomes (g) and genomes (h). Error bars in c–f represent 95% confidence intervals (note that in some cases these are fully contained within the plotted point).

Fig. 1b values (number of individuals): African/African American, 12,487; Latino, 17,720; Ashkenazi Jewish, 5,185; Finnish, 12,562; Non-Finnish European, 64,603 (Bulgarian, 1,335; Estonian, 2,418; North-western European, 25,410; Other non-Finnish European, 16,568; Southern European, 5,805; Swedish, 13,067); East Asian, 9,977 (Japanese, 76; Korean, 1,909; Other East Asian, 7,992); South Asian, 15,308; Other, 3,614.

Aggregating across variants, we created a gene-level pLoF frequency metric to estimate the proportion of haplotypes that contain an inactive copy of each gene. We find that 1,555 genes have an aggregate pLoF frequency of at least 0.1% across all individuals in the dataset (Extended Data Fig. 5c), and 3,270 genes have an aggregate pLoF frequency of at least 0.1% in any one population. Furthermore, we characterized the landscape of genic tolerance to homozygous inactivation, identifying 4,332 pLoF variants that are homozygous in at least one individual. Given the rarity of true homozygous LoF variants, we expected substantial enrichment of such variants for sequencing and annotation errors, and we subjected this set to additional filtering and deep manual curation before defining a set of 1,815 genes (2,636 high-confidence variants) that are likely to be tolerant to biallelic inactivation (Supplementary Data 7).
The LoF intolerance of human genes
Just as a preponderance of pLoF variants is useful for identifying LoF-tolerant genes, we can conversely characterize the intolerance of a gene to inactivation by identifying marked depletions of predicted LoF variation⁴,⁷. Here, we present a refined mutational model, which incorporates methylation, base-level coverage correction, and LOFTEE (Supplementary Information, Extended Data Fig. 6), to predict expected levels of variation under neutrality. Under this updated model, the variation in the number of synonymous variants observed is accurately captured (r = 0.979). We then applied this method to detect depletion of pLoF variation by comparing the number of observed pLoF variants against our expectation in the gnomAD exome data from 125,748 individuals, more than doubling the sample size of ExAC, the previously largest exome collection⁴. For this dataset, we computed a median of 17.9 expected pLoF variants per gene (Fig. 2c) and found that 72.1% of genes have more than 10 pLoF variants expected on the canonical transcript (Fig. 2d), and are therefore powered to be classified into the most constrained genes (Supplementary Information); these values are an increase from 13.2 and 62.8%, respectively, in ExAC.
Fig. 2 | Generating a high-confidence set of pLoF variants. a, The percentage of variants filtered by LOFTEE grouped by ClinVar status and gnomAD frequency. Despite not using frequency information, LOFTEE removes a larger proportion of common variants, and a very low proportion of reported disease-causing variation. b, MAPS (see Fig. 1c, d) is shown by LOFTEE designation and filter. Variants filtered out by LOFTEE exhibit frequency spectra that are similar to those of missense variants; predicted splice variants outside the essential splice site are rarer, and high-confidence variants are very likely to be singletons. Only SNVs with at least 80% call rate are included here. Error bars represent 95% confidence intervals. c, d, The total number of pLoF variants (c), and proportion of genes with more than ten pLoF variants (d) observed and expected (in the absence of selection) as a function of sample size (downsampled from gnomAD). Selection reduces the number of variants observed, and variant discovery approximately follows a square-root relationship with the number of samples. At current sample sizes, we would expect to identify more than 10 pLoF variants for 72.1% of genes in the absence of selection.

The smaller sample size in ExAC required a transformation of the observed and expected values for the number of pLoF variants in each gene into the pLI: this metric estimates the probability that a gene falls into the class of LoF-haploinsufficient genes (approximately 10% observed/expected variation) and is ideally used as a dichotomous metric (producing 3,230 genes with pLI > 0.9). Here, our refined model and substantially increased sample size enabled us to directly assess the degree of intolerance to pLoF variation in each gene using the continuous metric of the observed/expected ratio and to estimate a confidence interval around the ratio. We find that the median observed/expected ratio is 48%, which indicates that, as noted previously, most genes exhibit at least moderate selection against pLoF variation, and that the distribution of the observed/expected ratio is not dichotomous, but continuous (Extended Data Fig. 7a). For downstream analyses, unless otherwise specified, we use the upper bound of the 90% confidence interval, which we term the loss-of-function observed/expected upper bound fraction (LOEUF) (Extended Data Fig. 7b, c), and bin 19,197 genes into deciles of approximately 1,920 genes each. At current sample sizes, this metric enables the quantitative assessment of constraint with a built-in confidence value, and distinguishes small genes (for example, those with observed = 0, expected = 2; LOEUF = 1.34) from large genes (for example, observed = 0, expected = 100; LOEUF = 0.03), while retaining the continuous properties of the direct estimate of the ratio (Supplementary Information). At one extreme of the distribution, we observe genes with a very strong depletion of pLoF variation (first LOEUF decile aggregate observed/expected approximately 6%) (Extended Data Fig. 7e), including genes previously characterized as high pLI (Extended Data Fig. 7f). By contrast, we find unconstrained genes that are relatively tolerant of inactivation, including many that contain homozygous pLoF variants (Extended Data Fig. 7g).
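The two worked examples in this paragraph (observed = 0 with expected = 2 or 100) can be approximately reproduced with a short calculation. The sketch below is one plausible construction, not a statement of the authors' implementation: it assumes the observed count is Poisson-distributed with mean equal to the expected count times the true ratio, evaluates that likelihood on a grid of ratios from 0 to 2, normalizes it, and reads off the 5th and 95th percentiles as a 90% interval. The grid range, step size, and function names are assumptions introduced here.

```python
import math


def oe_confidence_interval(observed, expected, level=0.90,
                           max_ratio=2.0, step=0.001):
    """Return (lower, upper) bounds on the observed/expected ratio.

    Assumes a Poisson likelihood for `observed` with mean expected * ratio,
    normalized over a uniform grid of ratios in [0, max_ratio].
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    n_steps = int(round(max_ratio / step))
    grid = [i * step for i in range(n_steps + 1)]

    def likelihood(ratio):
        mean = expected * ratio
        if mean == 0.0:
            return 1.0 if observed == 0 else 0.0
        log_pmf = observed * math.log(mean) - mean - math.lgamma(observed + 1)
        return math.exp(log_pmf)

    weights = [likelihood(r) for r in grid]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("likelihood underflowed; ratio is far above max_ratio")

    tail = (1.0 - level) / 2.0
    lower = upper = None
    running = 0.0
    for ratio, weight in zip(grid, weights):
        running += weight / total
        if lower is None and running >= tail:
            lower = ratio
        if running >= 1.0 - tail:
            upper = ratio
            break
    return lower, upper


def loeuf(observed, expected):
    """Upper bound of the 90% interval under the assumptions above."""
    return oe_confidence_interval(observed, expected)[1]


small_gene = loeuf(0, 2)    # few expected pLoFs: wide interval
large_gene = loeuf(0, 100)  # many expected pLoFs: tight interval
```

Under these assumptions, `loeuf(0, 2)` comes out close to the 1.34 quoted in the text and `loeuf(0, 100)` close to 0.03, which illustrates why a gene with no observed pLoF variants is only scored as constrained when enough variants were expected in the first place.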
We note that the use of the upper bound means that LOEUF is a conservative metric in one direction: genes with low LOEUF scores are confidently depleted for pLoF variation, whereas genes with high LOEUF scores are a mixture of genes without depletion, and genes that are too small to obtain a precise estimate of the observed/expected ratio. In general, however, the scale of gnomAD means that gene length is rarely a substantive confounder for the analyses described here, and all downstream analyses are adjusted for the length of the coding sequence or filtered to genes with at least ten expected pLoFs (Supplementary Information).
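The decile binning and the optional filter on expected pLoF count can be sketched as follows. This is a minimal illustration that assumes deciles are assigned by rank, with decile 0 holding the most constrained genes; the function name, the dictionary inputs, and the toy gene identifiers are hypothetical, and the handling of ties in the actual analysis is not specified here.

```python
def loeuf_deciles(loeuf_by_gene, expected_by_gene=None, min_expected=10.0):
    """Assign each gene a decile (0 = lowest LOEUF, 9 = highest) by rank.

    If `expected_by_gene` is given, genes with fewer than `min_expected`
    expected pLoF variants are dropped before ranking.
    """
    genes = [
        gene for gene in loeuf_by_gene
        if expected_by_gene is None or expected_by_gene[gene] >= min_expected
    ]
    ranked = sorted(genes, key=lambda gene: loeuf_by_gene[gene])
    n = len(ranked)
    return {gene: (rank * 10) // n for rank, gene in enumerate(ranked)}


# Toy example with 20 made-up genes and evenly spaced scores.
scores = {f"GENE{i}": i / 10 for i in range(20)}
deciles = loeuf_deciles(scores)
```

Rank-based bins keep the groups close to equal in size, which matches the description of roughly 1,920 genes per decile for 19,197 genes.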
Validation of the LoF-intolerance score
The LOEUF metric allows us to place each gene along a continuous spectrum of tolerance to inactivation. We examined the correlation of this metric with several independent measures of genic sensitivity to disruption. First, we found that LOEUF is consistent with the expected behaviour of well-established gene sets: known haploinsufficient genes are strongly depleted of pLoF variation, whereas olfactory receptors are relatively unconstrained, and genes with a known autosomal recessive mechanism, for which selection against heterozygous disruptive variants tends to be present but weak⁹, fall in the middle of the distribution (Fig. 3a). In addition, LOEUF is positively correlated with the occurrence of 6,735 rare autosomal deletion structural variants overlapping protein-coding exons identified in a subset of 6,749 individuals with whole-genome sequencing data in this manuscript¹¹ (r = 0.13; P = 9.8 × 10⁻⁶⁸) (Fig. 3b).
This constraint metric also correlates with results in model systems: in 389 genes with orthologues that are embryonically lethal after heterozygous deletion in mouse²¹,²², we find a lower LOEUF score (mean = 0.488), compared with the remaining 18,808 genes (mean = 0.962; t-test P = 10⁻⁷⁸) (Fig. 3c). Similarly, the 678 genes that are essential for human cell viability as characterized by CRISPR screens²³ are also depleted for pLoF variation (mean LOEUF = 0.63) in the general population compared to background (18,519 genes with mean LOEUF = 0.964; t-test P = 9 × 10⁻⁷¹), whereas the 777 non-essential genes are more likely to be unconstrained (mean LOEUF = 1.34, compared to remaining 18,420 genes with mean LOEUF = 0.936; t-test P = 3 × 10⁻⁹²) (Fig. 3d).
Biological properties of constraint
We investigated the properties of genes and transcripts as a func-
tion of their tolerance to pLoF variation (LOEUF). First, we found
that LOEUF correlates with the degree of connection of a gene in
protein-interaction networks (r=−0.14; P=1.7 × 10
−51
after adjusting
for gene length) (Fig.4a) and functional characterization (Extended
Data Fig.8a). In addition, constrained genes are more likely to be ubiq-
uitously expressed across 38 tissues in theGenotype-Tissue Expression
Haploinsufcient
Autosomal recessive
Olfactory genes
0
20
40
020406080 100
LOEUF decile (%)
Percentage of gene list (%)
a
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
020406080
100
LOEUF decile (%)
Aggregate deletion
SV observed/expected
b
0
10
20
30
020406080 100
LOEUF decile (%)
Percentage of mouse het
lethal knockout genes (%)
c
Cell essential
Cell non-essential
0
5
10
15
20
25
020406080 100
LOEUF decile (%)
Percentage of essential/
non-essential genes (%)
d
Fig. 3 | The functional spectrum of pLoF impact. a, The percentage of genes
in a set of curated gene lists represented in each LOEUF decile.
Haploinsufficient genes are enriched among the most constrained genes,
whereas recessive genes are spread in the middle of the distribution, and
olfactory receptor genes are largely unconstrained. b, The occurrence of 6,735
rare LoF deletion structural variants (SVs) is correlated with LOEUF (computed
from SNVs; linear regression r=0.13; P=9.8 × 10
−68
). Error bars represent 95%
confidence intervals from bootstrapping. c, d, Constrained genes are more
likely to be lethal when heterozygously inactivated in mouse and cause cellular
lethality when disrupted in human cells (c), whereas unconstrained genes are
more likely to be tolerant of disruption in cellular models (d). For all panels,
more constrained genes are shown on the left.
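
The per-decile percentages described for panel a can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes deciles are assigned by ranking all genes on LOEUF (decile 0 being the most constrained), and the gene identifiers, LOEUF values, and function name `percent_per_decile` are hypothetical.

```python
from collections import Counter


def percent_per_decile(loeuf_by_gene, gene_list):
    """Percentage of gene_list in each LOEUF decile (index 0 = most constrained)."""
    # Rank all genes from lowest to highest LOEUF, then split ranks into tenths.
    ranked = sorted(loeuf_by_gene, key=loeuf_by_gene.get)
    n_genes = len(ranked)
    decile = {gene: rank * 10 // n_genes for rank, gene in enumerate(ranked)}
    # Ignore list members that have no LOEUF value.
    members = [gene for gene in gene_list if gene in decile]
    if not members:
        raise ValueError("no gene in gene_list has a LOEUF value")
    counts = Counter(decile[gene] for gene in members)
    return [100.0 * counts[d] / len(members) for d in range(10)]


# Hypothetical genome of 1,000 genes with evenly spaced LOEUF values.
loeuf_by_gene = {f"GENE{i:04d}": 0.002 * i for i in range(1000)}
# Hypothetical curated list: 50 genes from decile 0 and 50 from decile 1.
curated = [f"GENE{i:04d}" for i in range(50)] + [f"GENE{i:04d}" for i in range(100, 150)]

shares = percent_per_decile(loeuf_by_gene, curated)
print([round(s, 1) for s in shares])
```

A list that is enriched for constrained genes puts most of its percentage mass in the low-index deciles, which correspond to the left side of each panel.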
