1
Linking common and rare disease genetics through gene
regulatory networks
Olivier B. Bakker
1*
, Annique Claringbould
2*
, Harm-Jan Westra
1
, Henry Wiersma
1
, Floranne
Boulogne
1
, Urmo Võsa
3
, Sophie Mulcahy Symmons
1
, Iris H. Jonkers
1
, Lude Franke
1,4#
,
Patrick Deelen
1,4#
* These authors contributed equally
#
These authors contributed equally
1: University of Groningen, University Medical Center Groningen, Department of Genetics,
Groningen, The Netherlands
2: Structural and Computational Biology Unit, EMBL, Heidelberg, Germany
3: Estonian Genome Centre, Institute of Genomics, University of Tartu, Tartu, Estonia
4: Oncode Institute, Utrecht, The Netherlands
Abstract
Genetic variants identified through genome-wide association studies (GWAS) are
typically non-coding and exert small regulatory effects on downstream genes, but which
downstream genes are ultimately impacted and how they confer risk remains mostly unclear.
Conversely, variants that cause rare Mendelian diseases are often coding and have a more
direct impact on disease development. We demonstrate that common and rare genetic
diseases can be linked by studying the gene regulatory networks impacted by common
disease-associated variants. We implemented this in the ‘Downstreamer’ method and
applied it to 44 GWAS traits and find that predicted downstream “key genes” are enriched
with Mendelian disease genes, e.g. key genes for height are enriched for genes that cause
skeletal abnormalities and Ehlers-Danlos syndromes. We find that 82% of these key genes
are located outside of GWAS loci, suggesting that they result from complex trans regulation
rather than being impacted by disease-associated variants in cis. Finally, we discuss the
challenges in reconstructing gene regulatory networks and provide a roadmap to improve
identification of these highly connected genes for common traits and diseases.
. CC-BY-NC-ND 4.0 International licenseIt is made available under a
perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint
The copyright holder for thisthis version posted October 27, 2021. ; https://doi.org/10.1101/2021.10.21.21265342doi: medRxiv preprint
NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice.
2
Introduction
Genetic variation plays a major role in the development of both common and rare
diseases, yet the genetic architectures of these disease types are usually considered quite
different. Rare genetic disorders are thought to primarily be caused by a single, mostly
protein-coding genetic variant that has a large effect on disease risk. As a consequence, the
causal genes for a rare disorder can often be identified by sequencing individual patients or
families. In contrast, the genetic risks for common diseases are modulated by a large
number of mostly non-coding variants that individually exert small effects. These variants are
typically identified through genome-wide association studies (GWASs). However,
identification of the causal variants and genes affected by GWAS loci remains challenging,
in part due to linkage disequilibrium (LD) and small effect-sizes
1,2
.
Despite the differences between rare and complex diseases, it has been shown that
GWAS loci for multiple traits are enriched for genes that can cause related rare diseases
when damaged
3,4
. For instance, common variants associated to PR interval, a
measurement of heart function, have been found within the MYH6 gene
5
, which is known to
harbour mutations in individuals with familial dilated cardiomyopathy
6
. Moreover, eQTL
studies have found examples of rare disease genes that are affected by distal common
variants in trans, such as the immunodeficiency gene ISG15, which is affected by multiple
systemic lupus erythematosus–associated variants
7
. These results indicate that common
and rare diseases can result from damage to or altered regulation of the same genes,
suggesting that the same biological pathways underlie these conditions
4
. However, it is not
fully known to what extent specific genes and pathways are shared between rare and
common diseases.
Over the years, many pathway-enrichment methods have been developed that can
identify which biological pathways are enriched for common diseases
8–10
as well as
highlighting their most likely cellular context(s)
11,12
. In addition, several methods can
prioritize individual genes within GWAS susceptibility loci by studying how they are
functionally related to genes in other susceptibility loci
8,13–16
. However, these methods
confine themselves to genes in GWAS loci, potentially missing relevant trans-regulated up-
or downstream effects. In blood, expression quantitative trait locus (eQTL) mapping has
been successful in identifying the downstream trans regulatory consequences of GWAS-
associated variants (i.e. trans-eQTLs and eQTSs, where polygenic scores are linked to
expression levels)
7
. Unfortunately, large eQTL sample sizes are required to detect such
effects, and such datasets are not yet available for most tissues.
. CC-BY-NC-ND 4.0 International licenseIt is made available under a
perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint
The copyright holder for thisthis version posted October 27, 2021. ; https://doi.org/10.1101/2021.10.21.21265342doi: medRxiv preprint
3
Here we build upon the ‘omnigenic model’ hypothesis, which states that the genes that
are most important in complex diseases are those that are modulated by many different
common variants through gene regulatory networks
17,18
. The omnigenic model postulates
that a limited number of core genes exist that drive diseases, but that many peripheral
genes, which contain associated variants, contribute indirectly to disease development by
modulating the activity of the core genes. Since the omnigenic model predicts that many
core genes map outside GWAS loci, these genes will be missed by methods that prioritize
genes within GWAS loci. The omnigenic model hypothesis is supported by recent works
assessing RNA levels of blood cells
19
and molecular traits
20
and a large-scale in vitro
knockdown experiment
21
. However, these studies were performed in blood, limiting their
conclusions to GWAS studies on blood-related traits and immunological disorders.
To take this work further, we integrated (mRNA level) gene regulatory networks with
GWAS summary statistics to prioritize key genes that we suspect are more likely to directly
contribute to disease predisposition than genes in GWAS loci. We have implemented this
strategy in a software package called ‘Downstreamer’ that uses GWAS summary statistics
and gene co-regulation based on 31,499 multi-tissue RNA-seq samples in order to prioritize
these key genes. We also provide pathway, rare disease phenotype (coded by HPO terms)
and tissue enrichments to aid in the comprehensive interpretation of GWAS results.
We applied Downstreamer to 44 GWASs for a wide variety of traits (Table S1) and
show that the identified key genes are enriched for intolerance to loss-of-function (LoF) and
missense (MiS) mutations and for Mendelian disease genes that lead to similar phenotypic
outcomes as the GWAS trait. Specifically, we find that key genes for height are strongly
enriched for severe growth defects and skeletal abnormalities in humans and mice.
Additionally, key genes for auto-immune diseases point to lymphocyte checkpoints and
regulators and those for glomerular filtration rate (GFR; a measure of kidney function) are
transporters for several metabolites.
Key genes that cause Mendelian disease can therefore highlight the molecular
pathways driving the complex disease. Conversely, predicted key genes may aid in
identifying new Mendelian disease genes.
Results
To enable identification of GWAS key genes, we developed Downstreamer (Methods),
a tool that integrates GWAS summary statistics with gene expression–based co-regulation
networks. Downstreamer first converts individual variant associations to gene p-values by
aggregating associations within a 25kb window around the gene body for all protein-coding
. CC-BY-NC-ND 4.0 International licenseIt is made available under a
perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint
The copyright holder for thisthis version posted October 27, 2021. ; https://doi.org/10.1101/2021.10.21.21265342doi: medRxiv preprint
4
genes while correcting for LD between variants
9
(Fig. 1A). These gene p-values are then
converted to z-scores. We calculated gene z-scores for 44 GWAS summary statistics
reflecting a wide variety of disorders and complex traits (Table S1).
Association signals for polygenic traits cluster around transcription factors
We observed that the z-scores of individual GWASs were often weakly positively
correlated (Fig. S1A), especially for traits for which many loci have been identified (Fig. S1
B). For instance, the gene-level z-scores for height correlated positively with the gene-level
z-scores of all other traits. To investigate the source of this shared signal, we calculated the
average gene-level significance across all 44 traits while correcting for bias that might be
introduced for traits that are strongly correlated (see Methods).
We observed that 30% of the variation in this ‘average GWAS’ signal could be
explained by both the extent of LD around a gene and by the local gene density (Fig. S2).
The more extensive the LD around a gene, the higher the chance that genetic variants within
the gene are associated, especially for highly polygenic traits
22,23
. Consequently, the gene-
level z-score for these genes increases. Hence, when collapsing GWAS summary statistics
into gene z-scores, some amount of correlation between well-powered GWAS studies is to
be expected. However, this is unwanted when using gene z-scores in a pathway-enrichment
analysis. We next evaluated if the remaining 70% of the average signal was enriched for any
biological processes. After correcting for LD and gene density, we observed that 79 of the
top 500 genes are transcription factors (OR: 2.22, p-value: 4.25×10
-9
). We also saw
enrichment among the top 500 genes for pathways related to DNA binding and transcription,
for example, transcription regulator activity (OR: 1.98, p-value: 2.24×10
-11
) (Table S2).
Additionally, genes with a higher average gene z-score were enriched for intolerance to LoF
(Fig. S3). These enrichments suggest that there is a set of genes, enriched for GWAS hits,
that confer risk to many different types of traits. This is consistent with previous observations
that broad functional categories tend to be enriched for many traits
17
.
However, as these often-associated genes obscure the specific pathways and key
genes for a trait, we corrected the individual gene z-scores for the average gene z-score in
order to get disease-specific gene-level significance scores that were as specific as
possible. Downstreamer then correlates these corrected gene z-scores to gene expression
patterns, pathway memberships and tissue expression through a
generalized least squares
(G
LS) model that accounts for gene–gene correlations resulting from the relationship
between LD and sharing of biological functionality (Fig. 1B, Methods). This results in a z-
score that represents the significance of the association of a gene, pathway or tissue
. CC-BY-NC-ND 4.0 International licenseIt is made available under a
perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint
The copyright holder for thisthis version posted October 27, 2021. ; https://doi.org/10.1101/2021.10.21.21265342doi: medRxiv preprint
5
. CC-BY-NC-ND 4.0 International licenseIt is made available under a
perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint
The copyright holder for thisthis version posted October 27, 2021. ; https://doi.org/10.1101/2021.10.21.21265342doi: medRxiv preprint