Linking common and rare disease genetics through gene regulatory networks

doi:10.1101/2021.10.21.21265342

1

Linking common and rare disease genetics through gene

regulatory networks

Olivier B. Bakker

1*

, Annique Claringbould

2*

, Harm-Jan Westra

1

, Henry Wiersma

1

, Floranne

Boulogne

1

, Urmo Võsa

3

, Sophie Mulcahy Symmons

1

, Iris H. Jonkers

1

, Lude Franke

1,4#

,

Patrick Deelen

1,4#

* These authors contributed equally

#

These authors contributed equally

1: University of Groningen, University Medical Center Groningen, Department of Genetics,

Groningen, The Netherlands

2: Structural and Computational Biology Unit, EMBL, Heidelberg, Germany

3: Estonian Genome Centre, Institute of Genomics, University of Tartu, Tartu, Estonia

4: Oncode Institute, Utrecht, The Netherlands

Abstract

Genetic variants identified through genome-wide association studies (GWAS) are

typically non-coding and exert small regulatory effects on downstream genes, but which

downstream genes are ultimately impacted and how they confer risk remains mostly unclear.

Conversely, variants that cause rare Mendelian diseases are often coding and have a more

direct impact on disease development. We demonstrate that common and rare genetic

diseases can be linked by studying the gene regulatory networks impacted by common

disease-associated variants. We implemented this in the ‘Downstreamer’ method and

applied it to 44 GWAS traits and find that predicted downstream “key genes” are enriched

with Mendelian disease genes, e.g. key genes for height are enriched for genes that cause

skeletal abnormalities and Ehlers-Danlos syndromes. We find that 82% of these key genes

are located outside of GWAS loci, suggesting that they result from complex trans regulation

rather than being impacted by disease-associated variants in cis. Finally, we discuss the

challenges in reconstructing gene regulatory networks and provide a roadmap to improve

identification of these highly connected genes for common traits and diseases.

. CC-BY-NC-ND 4.0 International licenseIt is made available under a

perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint

The copyright holder for thisthis version posted October 27, 2021. ; https://doi.org/10.1101/2021.10.21.21265342doi: medRxiv preprint

NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice.

2

Introduction

Genetic variation plays a major role in the development of both common and rare

diseases, yet the genetic architectures of these disease types are usually considered quite

different. Rare genetic disorders are thought to primarily be caused by a single, mostly

protein-coding genetic variant that has a large effect on disease risk. As a consequence, the

causal genes for a rare disorder can often be identified by sequencing individual patients or

families. In contrast, the genetic risks for common diseases are modulated by a large

number of mostly non-coding variants that individually exert small effects. These variants are

typically identified through genome-wide association studies (GWASs). However,

identification of the causal variants and genes affected by GWAS loci remains challenging,

in part due to linkage disequilibrium (LD) and small effect-sizes

1,2

.

Despite the differences between rare and complex diseases, it has been shown that

GWAS loci for multiple traits are enriched for genes that can cause related rare diseases

when damaged

3,4

. For instance, common variants associated to PR interval, a

measurement of heart function, have been found within the MYH6 gene

5

, which is known to

harbour mutations in individuals with familial dilated cardiomyopathy

6

. Moreover, eQTL

studies have found examples of rare disease genes that are affected by distal common

variants in trans, such as the immunodeficiency gene ISG15, which is affected by multiple

systemic lupus erythematosus–associated variants

7

. These results indicate that common

and rare diseases can result from damage to or altered regulation of the same genes,

suggesting that the same biological pathways underlie these conditions

4

. However, it is not

fully known to what extent specific genes and pathways are shared between rare and

common diseases.

Over the years, many pathway-enrichment methods have been developed that can

identify which biological pathways are enriched for common diseases

8–10

as well as

highlighting their most likely cellular context(s)

11,12

. In addition, several methods can

prioritize individual genes within GWAS susceptibility loci by studying how they are

functionally related to genes in other susceptibility loci

8,13–16

. However, these methods

confine themselves to genes in GWAS loci, potentially missing relevant trans-regulated up-

or downstream effects. In blood, expression quantitative trait locus (eQTL) mapping has

been successful in identifying the downstream trans regulatory consequences of GWAS-

associated variants (i.e. trans-eQTLs and eQTSs, where polygenic scores are linked to

expression levels)

7

. Unfortunately, large eQTL sample sizes are required to detect such

effects, and such datasets are not yet available for most tissues.

. CC-BY-NC-ND 4.0 International licenseIt is made available under a

perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint

The copyright holder for thisthis version posted October 27, 2021. ; https://doi.org/10.1101/2021.10.21.21265342doi: medRxiv preprint

3

Here we build upon the ‘omnigenic model’ hypothesis, which states that the genes that

are most important in complex diseases are those that are modulated by many different

common variants through gene regulatory networks

17,18

. The omnigenic model postulates

that a limited number of core genes exist that drive diseases, but that many peripheral

genes, which contain associated variants, contribute indirectly to disease development by

modulating the activity of the core genes. Since the omnigenic model predicts that many

core genes map outside GWAS loci, these genes will be missed by methods that prioritize

genes within GWAS loci. The omnigenic model hypothesis is supported by recent works

assessing RNA levels of blood cells

19

and molecular traits

20

and a large-scale in vitro

knockdown experiment

21

. However, these studies were performed in blood, limiting their

conclusions to GWAS studies on blood-related traits and immunological disorders.

To take this work further, we integrated (mRNA level) gene regulatory networks with

GWAS summary statistics to prioritize key genes that we suspect are more likely to directly

contribute to disease predisposition than genes in GWAS loci. We have implemented this

strategy in a software package called ‘Downstreamer’ that uses GWAS summary statistics

and gene co-regulation based on 31,499 multi-tissue RNA-seq samples in order to prioritize

these key genes. We also provide pathway, rare disease phenotype (coded by HPO terms)

and tissue enrichments to aid in the comprehensive interpretation of GWAS results.

We applied Downstreamer to 44 GWASs for a wide variety of traits (Table S1) and

show that the identified key genes are enriched for intolerance to loss-of-function (LoF) and

missense (MiS) mutations and for Mendelian disease genes that lead to similar phenotypic

outcomes as the GWAS trait. Specifically, we find that key genes for height are strongly

enriched for severe growth defects and skeletal abnormalities in humans and mice.

Additionally, key genes for auto-immune diseases point to lymphocyte checkpoints and

regulators and those for glomerular filtration rate (GFR; a measure of kidney function) are

transporters for several metabolites.

Key genes that cause Mendelian disease can therefore highlight the molecular

pathways driving the complex disease. Conversely, predicted key genes may aid in

identifying new Mendelian disease genes.

Results

To enable identification of GWAS key genes, we developed Downstreamer (Methods),

a tool that integrates GWAS summary statistics with gene expression–based co-regulation

networks. Downstreamer first converts individual variant associations to gene p-values by

aggregating associations within a 25kb window around the gene body for all protein-coding

. CC-BY-NC-ND 4.0 International licenseIt is made available under a

perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint

The copyright holder for thisthis version posted October 27, 2021. ; https://doi.org/10.1101/2021.10.21.21265342doi: medRxiv preprint

4

genes while correcting for LD between variants

9

(Fig. 1A). These gene p-values are then

converted to z-scores. We calculated gene z-scores for 44 GWAS summary statistics

reflecting a wide variety of disorders and complex traits (Table S1).

Association signals for polygenic traits cluster around transcription factors

We observed that the z-scores of individual GWASs were often weakly positively

correlated (Fig. S1A), especially for traits for which many loci have been identified (Fig. S1

B). For instance, the gene-level z-scores for height correlated positively with the gene-level

z-scores of all other traits. To investigate the source of this shared signal, we calculated the

average gene-level significance across all 44 traits while correcting for bias that might be

introduced for traits that are strongly correlated (see Methods).

We observed that 30% of the variation in this ‘average GWAS’ signal could be

explained by both the extent of LD around a gene and by the local gene density (Fig. S2).

The more extensive the LD around a gene, the higher the chance that genetic variants within

the gene are associated, especially for highly polygenic traits

22,23

. Consequently, the gene-

level z-score for these genes increases. Hence, when collapsing GWAS summary statistics

into gene z-scores, some amount of correlation between well-powered GWAS studies is to

be expected. However, this is unwanted when using gene z-scores in a pathway-enrichment

analysis. We next evaluated if the remaining 70% of the average signal was enriched for any

biological processes. After correcting for LD and gene density, we observed that 79 of the

top 500 genes are transcription factors (OR: 2.22, p-value: 4.25×10

-9

). We also saw

enrichment among the top 500 genes for pathways related to DNA binding and transcription,

for example, transcription regulator activity (OR: 1.98, p-value: 2.24×10

-11

) (Table S2).

Additionally, genes with a higher average gene z-score were enriched for intolerance to LoF

(Fig. S3). These enrichments suggest that there is a set of genes, enriched for GWAS hits,

that confer risk to many different types of traits. This is consistent with previous observations

that broad functional categories tend to be enriched for many traits

17

.

However, as these often-associated genes obscure the specific pathways and key

genes for a trait, we corrected the individual gene z-scores for the average gene z-score in

order to get disease-specific gene-level significance scores that were as specific as

possible. Downstreamer then correlates these corrected gene z-scores to gene expression

patterns, pathway memberships and tissue expression through a

generalized least squares

(G

LS) model that accounts for gene–gene correlations resulting from the relationship

between LD and sharing of biological functionality (Fig. 1B, Methods). This results in a z-

score that represents the significance of the association of a gene, pathway or tissue

. CC-BY-NC-ND 4.0 International licenseIt is made available under a

perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint

The copyright holder for thisthis version posted October 27, 2021. ; https://doi.org/10.1101/2021.10.21.21265342doi: medRxiv preprint

5

. CC-BY-NC-ND 4.0 International licenseIt is made available under a

perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint

The copyright holder for thisthis version posted October 27, 2021. ; https://doi.org/10.1101/2021.10.21.21265342doi: medRxiv preprint

Linking common and rare disease genetics through gene regulatory networks

Figures

Citations

Brain expression quantitative trait locus and network analyses reveal downstream effects and putative drivers for brain-related diseases

An efficient not-only-linear correlation coefficient based on machine learning

“Mendelian Code” in the Genetic Structure of Common Multifactorial Diseases

A multi-ancestry genome-wide meta-analysis, fine-mapping, and target gene prioritization to characterize the genetic architecture of adiponectin.

“Mendelian Code” in the Genetic Structure of Common Multifactorial Diseases

References

Gene Ontology: tool for the unification of biology

KEGG: Kyoto Encyclopedia of Genes and Genomes

A global reference for human genetic variation.

The Reactome Pathway Knowledgebase.

The mutational constraint spectrum quantified from variation in 141,456 humans

Related Papers (5)

Properties of human disease genes and the role of genes linked to Mendelian disorders in complex disease aetiology

Deregulated Regulators: Disease-Causing cis Variants in Transcription Factor Genes.

A systematic approach to understand the functional consequences of non-protein coding risk regions

Exploring the potential relevance of human-specific genes to complex disease

Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology