A model of compound heterozygous, loss-of-function alleles is broadly consistent with observations from complex-disease GWAS datasets

doi:10.1101/048819

RESEARCH ARTICLE


Jaleal S. Sanjak^{1,2,*}, Anthony D. Long^{1,2}, Kevin R. Thornton^{1,2,*}

1 Department of Ecology and Evolutionary Biology, University of California, Irvine, Irvine, California, USA,

2 Center for Complex Biological Systems, University of California, Irvine, Irvine, California, USA

* jsanjak@uci.edu (JSS); krthornt@uci.edu (KRT)

Abstract

The genetic component of complex disease risk in humans remains largely unexplained. A corollary is that the allelic spectrum of genetic variants contributing to complex disease risk is unknown. Theoretical models that relate population genetic processes to the maintenance of genetic variation for quantitative traits may suggest profitable avenues for future experimental design. Here we use forward simulation to model a genomic region evolving under a balance between recurrent deleterious mutation and Gaussian stabilizing selection. We consider multiple genetic and demographic models, and several different methods for identifying genomic regions harboring variants associated with complex disease risk. We demonstrate that the model of gene action, relating genotype to phenotype, has a qualitative effect on several relevant aspects of the population genetic architecture of a complex trait. In particular, the genetic model impacts genetic variance component partitioning across the allele frequency spectrum and the power of statistical tests. Models with partial recessivity closely match the minor allele frequency distribution of significant hits from empirical genome-wide association studies without requiring homozygous effect sizes to be small. We highlight a particular gene-based model of incomplete recessivity that is appealing from first principles. Under that model, deleterious mutations in a genomic region partially fail to complement one another. This model of gene-based recessivity predicts the empirically observed inconsistency between twin and SNP-based estimates of dominance heritability. Furthermore, this model predicts considerable levels of unexplained variance associated with intralocus epistasis. Our results suggest a need for improved statistical tools for region-based genetic association and heritability estimation.

Author Summary

Gene action determines how mutations affect phenotype. When placed in an evolutionary context, the details of the genotype-to-phenotype model can impact the maintenance of genetic variation for complex traits. Likewise, non-equilibrium demographic history may affect patterns of genetic variation. Here, we explore the impact of genetic model and population growth on the distribution of genetic variance across the allele frequency spectrum underlying risk for a complex disease. Using forward-in-time population genetic simulations, we show that the genetic model has important impacts on the composition of variation for complex disease risk in a population. We explicitly simulate genome-wide association studies (GWAS) and perform heritability estimation on population samples. A particular model of gene-based partial recessivity, based on allelic non-complementation, aligns well with empirical results. This model is congruent with the dominance variance estimates from both SNPs and twins, and the minor allele frequency distribution of GWAS hits.

PLOS Genetics | DOI:10.1371/journal.pgen.1006573 January 19, 2017 1 / 30

OPEN ACCESS

Citation: Sanjak JS, Long AD, Thornton KR (2017) A Model of Compound Heterozygous, Loss-of-Function Alleles Is Broadly Consistent with Observations from Complex-Disease GWAS Datasets. PLoS Genet 13(1): e1006573. doi:10.1371/journal.pgen.1006573

Editor: Simon Gravel, McGill University, CANADA

Received: April 18, 2016

Accepted: January 5, 2017

Published: January 19, 2017

access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: Our simulation code and code for downstream analyses are freely available at: http://github.com/ThorntonLab/disease_sims, http://github.com/molpopgen/buRden, http://github.com/molpopgen/fwdpy, and http://github.com/molpopgen/TennessenEAonly.

Funding: This work was supported by NIH grant R01-GM115564 to KRT. This work was supported by NIH grant R01-GM115562 to ADL. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1321846. Any

Introduction

Risk for complex diseases in humans, such as diabetes and hypertension, is highly heritable yet the causal DNA sequence variants responsible for that risk remain largely unknown. Genome-wide association studies (GWAS) have found many genetic markers associated with disease risk [1]. However, follow-up studies have shown that these markers explain only a small portion of the total heritability for most traits [2, 3].

There are many hypotheses which attempt to explain the ‘missing heritability’ problem [2–5]. Genetic variance due to epistatic or gene-by-environment interactions is difficult to identify statistically because of, among other reasons, increased multiple hypothesis testing burden [6, 7], and could artificially inflate estimates of broad-sense heritability [8]. Well-tagged intermediate frequency variants may not reach genome-wide significance in an association study if they have smaller effect sizes [9, 10]. One appealing verbal hypothesis for this ‘missing heritability’ is that there are rare causal alleles of large effect that are difficult to detect [4, 11, 12]. These hypotheses are not mutually exclusive, and it is probable that a combination of models will be needed to explain all heritable disease risk [13].

The standard GWAS attempts to identify genetic polymorphisms that differ in frequency between cases and controls. A complementary approach is to estimate the heritability explained by genotyped (and imputed) markers (SNPs) under different population sampling schemes [14, 15]. Stratifying markers by minor allele frequency (MAF) prior to performing SNP-based heritability estimation allows the partitioning of genetic variation across the allele frequency spectrum to be estimated [16], which is an important summary of the genetic architecture of a complex trait [16–23]. This approach has inferred a contribution of rare alleles to genetic variance in both human height and body mass index (BMI) [16], consistent with theoretical work showing that rare alleles will have large effect sizes if fitness effects and trait effects are correlated [18, 20–25]. Yet, simulations of causal loci harboring multiple rare variants with large additive effects predict an excess of low-frequency significant markers relative to empirical findings [4, 26].

SNP-based heritability estimates have concluded that there is little missing heritability for height and BMI, and that the causal loci simply have effect sizes that are too small to reach genome-wide significance under current GWAS sample sizes [14, 16]. Further, extensions to these methods decompose genetic variance into additive and dominance components and find that dominance variance is approximately one fifth of the additive genetic variance on average across seventy-nine complex traits [27]. When taken into account together with results from GWAS, these observations can be interpreted as evidence that the genetic architecture of human traits is best explained by a model of small additive effects. However, a recent large twin study found a substantial contribution of dominance variance for fourteen out of eighteen traits [28]. The reason for this discrepancy in results remains unclear. One possibility is a statistical artifact; for example, twin studies may be prone to mistakenly infer non-additive effects when none exist. Another possibility, which we return to later, is that these apparently contradictory results are expected under a different model of gene action.

Compound Heterozygosity and Complex Traits

opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

The design, analysis, and interpretation of GWAS are heavily influenced by the “standard model” of quantitative genetics [29]. This model assigns an effect size to a mutant allele, but formally makes no concrete statement regarding the molecular nature of the allele. Early applications of this model to the problem of human complex traits include Risch’s work on the power to detect causal mutations [30, 31] and Pritchard’s work showing that rare alleles under purifying selection may contribute to heritable variation in complex traits [17]. When applied to molecular data, such as SNP genotypes in a GWAS, these models treat the SNPs themselves as the loci of interest. For example, influential power studies informing the design of GWAS assign effect sizes directly to SNPs and assume Risch’s model of multiplicative epistasis [32]. Similarly, the single-marker logistic regression used as the primary analysis of GWAS data typically assumes an additive or recessive model at the level of individual SNPs [33]. Finally, recent methods designed to estimate the heritability of a trait explained by genotyped markers assign additive and dominance effects directly to SNPs [14, 16, 27, 34]. Naturally, the results of such analyses are interpreted in light of the assumed model of gene action.

A weakness of the multiplicative epistasis model [30, 31] when applied to SNPs is that the concept of a gene, defined as a physical region where loss-of-function mutations have the same phenotype [35], is lost. Specifically, under the standard model, the genetic concept of a failure to complement is a property of SNPs and not “gene regions” (see [36] for a detailed discussion of this issue). We have recently introduced an alternative model of gene action, one in which risk mutations are unconditionally deleterious and fail to complement at the level of a “gene region” [36]. This model, influenced by the standard operational definition of a gene [35], gives rise to the sort of allelic heterogeneity typically observed for human Mendelian diseases [37], and to a distribution of GWAS “hit” minor allele frequencies [4, 26] consistent with empirical results [36]. In this article, we explore this “gene-based” model under more complex demographic scenarios as well as its properties with respect to the estimation of variance components using SNP-based approaches [34] and twin studies. We also compare this model to the standard models of strictly additive co-dominant effects, and multiplicative epistasis with dominance.

We further explore the power of several association tests to detect a causal gene region under each genetic and demographic model. We find significant heterogeneity in the performance of burden tests [36, 38, 39] across models of the trait and demographic history. We find that population expansion reduces the power to detect causal gene regions due to an increase in rare variation, in agreement with work by [22, 23]. The behavior of the tests under different models provides us with insight as to the circumstances in which each test is best suited.

In total, our results show that modeling gene action is key to modeling GWAS, and thus plays an important role in both the design and interpretation of such studies. Further, the model of gene-based recessivity best explains the differences between estimates of additive and dominance variance components from SNP-based methods [27] and from twin studies [28] and is consistent with the distribution of frequencies of significant associations in GWAS [4, 26]. Further, the genetic model plays a much more important role than the demographic model, which is expected based on previous work on additive models showing that the genetic load is approximately unaffected by changes in population size over time [21, 22]. Consistent with recent work by [23], we find that rapid population growth in the recent past increases the contribution of rare variants to total genetic variance. However, we show here that different models of gene action are qualitatively different with respect to the partitioning of genetic variance across the allele frequency spectrum. We also show that these conclusions hold under the more complex demographic models that have been proposed for human populations [21, 40].

Results and Discussion

The models

As in [36], we simulate a 100 kilobase region of the human genome contributing to a complex disease phenotype and fitness. The region evolves forward in time subject to neutral and deleterious mutation, recombination, selection, and drift. To perform genetic association and heritability estimation studies in silico, we need to impose a trait onto simulated individuals. In doing so, we introduce strong assumptions about the molecular underpinnings of a trait and its evolutionary context.

How does the molecular genetic basis of a trait under natural selection influence population genetic signatures in the genome? This question is very broad, and therefore it was necessary to restrict ourselves to a small subset of molecular and evolutionary scenarios. We analyzed a set of approaches to modeling a single gene region experiencing recurrent unconditionally deleterious mutation contributing to a quantitative trait subject to Gaussian stabilizing selection. The expected fitness effect of a mutation is always deleterious because trait effects are sampled from an exponential distribution. Therefore, we do not allow for compensatory mutations that may occur in more general models of stabilizing selection. Specifically, we studied three different genetic models and two different demographic models, holding the fitness model constant. Parameters are briefly described in Table 1.

We implemented three disease-trait models of the phenotypic form $P = G + E$. $G$ is the genetic component, and $E \sim N(0, \sigma_e^2)$ is the environmental noise expressed as a Gaussian random variable with mean 0 and variance $\sigma_e^2$. In this context, $\sigma_e^2$ should be thought of as both the contribution from the environment and from the remaining genetic variance at loci in linkage equilibrium with the simulated 100kb region. The genetic models are named the additive co-dominant (AC) model, multiplicative recessive (Mult. recessive; MR) model and the gene-based recessive (GBR) model. The MR model has a parameter, $h$, that controls the degree of recessivity; we call this model the complete MR (cMR) when $h = 0$ and the incomplete MR (iMR) when $0 < h \le 1$. Here, $h = 1$ corresponds to co-dominance, which is different from the typical formulation used when modeling the fitness effects of mutations directly. It is also important to note that here recessivity is being defined in terms of phenotypic effects; this may be unusual for those more accustomed to dealing directly with recessivity for fitness effects. An idealized relationship between dominance for fitness effects and trait effects of a mutation on an unaffected genetic background is shown in S15 Fig.

Table 1. Description of parameters used in the models.

Parameter           Description
$N$                 Population size
$P$                 Phenotype
$P_{opt}$           Optimum phenotype
$G$                 Genetic contribution to phenotype
$E$                 Environmental contribution to phenotype
$\lambda$           Mean and standard deviation of trait effects
$c_i$               Specific trait effect of site $i$
$h$                 Dominance coefficient for trait effects
$w$                 Fitness, based on Gaussian function
$\sigma_s^2$        The total inverse selection intensity
$\sigma_e^2$        Environmental variance
$V_A$               Additive genetic variance
$V_D$               Dominance genetic variance
$V_G$               Genetic variance
$V_{A; q \le x}$    Additive variance explained by variants with frequency $q \le x$

doi:10.1371/journal.pgen.1006573.t001
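As a minimal sketch of the phenotype model $P = G + E$, the following Python snippet adds Gaussian environmental noise to a vector of genetic values. The function name `add_environmental_noise`, the example genetic values, and the value of `sigma_e` are illustrative assumptions and are not taken from the simulation code used for this study.

```python
import numpy as np

def add_environmental_noise(G, sigma_e, rng):
    """Return phenotypes P = G + E, with E drawn from N(0, sigma_e^2)."""
    G = np.asarray(G, dtype=float)
    # One independent Gaussian deviate per individual.
    E = rng.normal(loc=0.0, scale=sigma_e, size=G.shape)
    return G + E

rng = np.random.default_rng(42)
# Hypothetical genetic values for four diploids (0.0 = no risk mutations).
G = np.array([0.0, 0.1, 0.25, 0.0])
# sigma_e here is an arbitrary illustrative value.
P = add_environmental_noise(G, sigma_e=0.075, rng=rng)
```

Because $E$ has mean zero, the expected phenotype of each individual equals its genetic value; the noise only adds variance $\sigma_e^2$.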

The critical conceptual difference between recessive models is whether dominance is a property of a locus (nucleotide/SNP) in a gene or the gene overall. Mathematically, this amounts to whether one first determines diploid genotypes at sites (and then multiplies across sites to get a total genetic effect) or calculates a score for each haplotype (the maternal and paternal alleles). For completely co-dominant models, this distinction is irrelevant; however, for a model with arbitrary dominance one needs to be more specific. As an example, imagine a compound heterozygote for two biallelic loci, i.e. genotype Ab/aB. In the case of traditional multiplicative recessivity the compound heterozygote is wild-type for both loci and therefore wild-type overall; this implies that these loci are in different genes (or independent functional units of the same gene) because the mutations are complementary. However, in the case of gene-based recessivity [36], neither haplotype is wild-type and so the individual is not wild-type; the failure of mutant alleles to complement defines these loci as being in the same gene [35].

For a diploid with $m_i$ causative mutations on the $i$th haplotype, we may define the additive model as

$$G_{AC} = \sum_{i=1}^{2} \sum_{j=1}^{m_i} c_{i,j}, \tag{1}$$

where $c_{i,j}$ is the effect size of the $j$th mutation on the $i$th haplotype. Each $c_{i,j}$ is sampled from an exponential distribution with mean of $\lambda$, to reflect unconditionally deleterious mutation. In other words, when a new mutation arises its effect $c$ is drawn from an exponential distribution, and remains constant throughout its entire sojourn in the population.
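The additive co-dominant model can be illustrated with a short Python sketch. The function name `additive_genetic_value`, the value of `lam`, and the numbers of mutations per haplotype are hypothetical choices made only for illustration.

```python
import numpy as np

def additive_genetic_value(hap1_effects, hap2_effects):
    """G_AC: the sum of all effect sizes carried on the two haplotypes."""
    return float(np.sum(hap1_effects) + np.sum(hap2_effects))

rng = np.random.default_rng(1)
lam = 0.1  # hypothetical mean effect size (lambda)

# Each mutation receives a single effect size, drawn once from an
# exponential distribution with mean lam, and keeps it thereafter.
hap1 = rng.exponential(scale=lam, size=2)  # haplotype carrying 2 mutations
hap2 = rng.exponential(scale=lam, size=1)  # haplotype carrying 1 mutation

G_ac = additive_genetic_value(hap1, hap2)
```

Under this model it does not matter which haplotype carries a mutation; only the total of the effect sizes enters the genetic value.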

The GBR model is the geometric mean of the sum of effect sizes on each haplotype [36]. We sum the causal mutation effects on each allele (paternal and maternal) to obtain a haplotype score. We then take the square root of the product of the haplotype scores to determine the total genetic value of the diploid:

$$G_{GBR} = \sqrt{\sum_{j=1}^{m_1} c_{1,j} \times \sum_{j=1}^{m_2} c_{2,j}} \tag{2}$$
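A minimal Python sketch of the gene-based recessive calculation follows. The function name `gbr_genetic_value` and the effect sizes are hypothetical and serve only to show how haplotype scores combine.

```python
import math

def gbr_genetic_value(hap1_effects, hap2_effects):
    """G_GBR: geometric mean of the two haplotype scores."""
    score1 = sum(hap1_effects)  # score of the first haplotype
    score2 = sum(hap2_effects)  # score of the second haplotype
    return math.sqrt(score1 * score2)

# A single mutation on one haplotype, opposite a wild-type haplotype:
# the wild-type score is zero, so the genetic value is zero.
single_het = gbr_genetic_value([0.2], [])

# A compound heterozygote (mutations at different sites, one per haplotype):
# neither haplotype is wild-type, so the genetic value is non-zero.
compound_het = gbr_genetic_value([0.2], [0.2])
```

In this sketch a lone mutation is masked by a wild-type haplotype, whereas two mutations on opposite haplotypes fail to complement, which is the behavior described for the Ab/aB genotype.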

Finally, the MR model depends on the number of positions for which a diploid is heterozygous ($m_{Aa}$) or homozygous ($m_{aa}$) for causative mutations,

$$G_{MR} = \left( \prod_{j=1}^{m_{Aa}} (1 + h c_j) \right) \left( \prod_{j=1}^{m_{aa}} (1 + 2 c_j) \right) - 1. \tag{3}$$

Thus, $h = 0$ is a model of multiplicative epistasis with complete recessivity (cMR), and $h = 1$ closely approximates the additive model when effect sizes are small.
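The two limiting cases of the MR model can be checked numerically with a small Python sketch. The function name `mr_genetic_value` and all effect sizes are hypothetical values chosen for illustration.

```python
import math

def mr_genetic_value(het_effects, hom_effects, h):
    """G_MR: product over heterozygous and homozygous sites, minus one."""
    het_term = math.prod(1.0 + h * c for c in het_effects)
    hom_term = math.prod(1.0 + 2.0 * c for c in hom_effects)
    return het_term * hom_term - 1.0

# Complete recessivity (h = 0): a diploid heterozygous at two sites,
# such as the Ab/aB compound heterozygote, has a genetic value of zero.
cmr_compound_het = mr_genetic_value([0.3, 0.2], [], h=0.0)

# Co-dominance (h = 1) with small effects: the result is close to the
# additive value, here 0.01 + 0.02 = 0.03.
small_effects = mr_genetic_value([0.01, 0.02], [], h=1.0)

# A single homozygous site contributes twice its effect size.
homozygote = mr_genetic_value([], [0.1], h=0.0)
```

The difference between `small_effects` and the additive sum is the product of the two effect sizes, which is why the approximation is only close when effects are small.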

Here, phenotypes are subject to Gaussian stabilizing selection with an optimum at zero and standard deviation of $\sigma_s = 1$ such that the fitness, $w$, of a diploid is proportional to a Gaussian
