Power analysis of transcriptome-wide association study: implications for
practical protocol choice
Chen Cao (1, ¶), Bowei Ding (2, ¶), Qing Li (1), Devin Kwok (2), Jingjing Wu (2, *), Quan Long (1, 2, 3, 4, *)

(1) Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research
Institute, University of Calgary, Calgary, Canada.
(2) Department of Mathematics & Statistics, University of Calgary, Calgary, Canada.
(3) Department of Medical Genetics, University of Calgary, Calgary, Canada.
(4) Hotchkiss Brain Institute, O’Brien Institute for Public Health, University of Calgary, Calgary,
Canada.

¶ = Joint first authors: C.C. and B.D.
* = Joint corresponding authors: J.W. (jinwu@ucalgary.ca) and Q.L. (quan.long@ucalgary.ca)

bioRxiv preprint doi: https://doi.org/10.1101/2020.07.19.211151; this version posted February 26, 2021.
The copyright holder for this preprint (which was not certified by peer review) is the author/funder,
who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a
CC-BY-NC 4.0 International license.

Abstract

The transcriptome-wide association study (TWAS) has emerged as one of several promising
techniques for integrating multi-scale ‘omics’ data into traditional genome-wide association
studies (GWAS). Unlike GWAS, which associates phenotypic variance directly with genetic
variants, TWAS uses a reference dataset to train a predictive model for gene expressions, which
allows it to associate phenotype with variants through the mediating effect of expressions.
Although effective, this core innovation of TWAS is poorly understood, since the predictive
accuracy of the genotype-expression model is generally low and further bounded by expression
heritability. This raises the question: to what degree does the accuracy of the expression model
affect the power of TWAS? Furthermore, would replacing predictions with actual,
experimentally determined expressions improve power? To answer these questions, we
compared the power of GWAS, TWAS, and a hypothetical protocol utilizing real expression
data. We derived non-centrality parameters (NCPs) for linear mixed models (LMMs) to enable
closed-form calculations of statistical power that do not rely on specific protocol
implementations. We examined two representative scenarios: causality (genotype contributes to
phenotype through expression) and pleiotropy (genotype contributes directly to both phenotype
and expression), and also tested the effects of various properties including expression
heritability. Our analysis reveals two main outcomes: (1) Under pleiotropy, the use of predicted
expressions in TWAS is superior to actual expressions. This explains why TWAS can function
with weak expression models, and shows that TWAS remains relevant even when real
expressions are available. (2) GWAS outperforms TWAS when expression heritability is below a
threshold of 0.04 under causality, or 0.06 under pleiotropy. Analysis of existing publications
suggests that TWAS has been misapplied in place of GWAS, in situations where expression
heritability is low.

Keywords: Power analysis, GWAS, TWAS, Non-centrality parameter, Expression heritability

Author Summary
We compared the effectiveness of three methods for finding genetic effects on disease in
order to quantify their strengths and help researchers choose the best protocol for their data. The
genome-wide association study (GWAS) is the standard method for identifying how the genetic
differences between individuals relate to disease. Recently, the transcriptome-wide association
study (TWAS) has improved on GWAS by also estimating the effect of each genetic variant on the
activity level (or expression) of genes related to disease. The effectiveness of TWAS is
surprising because its estimates of gene expression are very inaccurate, so we ask whether a method
using real expression data instead of estimates would perform better. Unlike past studies, which
use only simulation to compare these methods, we incorporate novel statistical calculations to
make our comparisons more accurate and universally applicable. We find that, depending on
the type of relationship between genetics, gene expression, and disease, the estimates used by
TWAS can actually be more relevant than real gene expressions. We also find that TWAS is
not always better than GWAS when the relationship between genetics and expression is weak,
and we identify specific turning points where past studies have incorrectly used TWAS instead of
GWAS.

Introduction
High-throughput sequencing instruments have enabled the rapid profiling of
transcriptomes (RNA expression of genes) [1-4], proteomes (proteins) [5-7] and other ‘omics’
data [8-10]. These ‘omics’ provide insight into the intermediary effects of genotypes on
endophenotypes, and can improve the ability of genome-wide association studies (GWAS) to
find associations between genetic variants and disease phenotypes [11-13]. The integration of
diverse ‘omics’ data sources remains a challenging and active field of research [14-17].

One approach to integrating ‘omics’ and GWAS is the transcriptome-wide association
study (TWAS), which quantitatively aggregates multiple genetic variants into a single test using
transcriptome data. Pioneered by Gamazon et al. [18], the TWAS protocol typically has two
steps. First, a model is trained to predict gene expressions from local genetic variants near the
focal genes, using a reference dataset containing both genotype and expression data. Second, the
pretrained model is used to predict expressions from genotypes in the association mapping
dataset under study, which contains genotypes and phenotypes (but not expression). The
predicted expressions are then tested for association with the phenotype of interest. TWAS can also be
conducted with summary statistics from GWAS datasets (i.e. meta-analysis), as first
demonstrated by Gusev et al. [19, 20]. TWAS has since achieved significant popularity and
success in identifying the genetic basis of complex traits [21-27], inspiring similar protocols for
other endophenotypes such as IWAS for images [28] and PWAS for proteins [29].
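
The two-step protocol described above can be sketched in a few lines of standard-library Python. This is only an illustrative toy under stated assumptions, not the implementation of Gamazon et al. or of any TWAS package: the expression model is a plain ridge regression (swapped in for the penalized models used in practice), and all names (`train_expression_model`, `predict_expression`, `association_test`) and all simulated effect sizes, sample sizes, and allele frequencies are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean


def center(values):
    """Subtract the mean from a list of numbers."""
    mu = fmean(values)
    return [v - mu for v in values]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def train_expression_model(G_ref, expr_ref, ridge=1.0):
    """Step 1: ridge regression of expression on local SNPs (reference panel)."""
    m = len(G_ref[0])
    cols = [center([row[j] for row in G_ref]) for j in range(m)]
    e = center(expr_ref)
    A = [[sum(a * b for a, b in zip(cols[j], cols[k])) + (ridge if j == k else 0.0)
          for k in range(m)] for j in range(m)]
    b = [sum(a * v for a, v in zip(cols[j], e)) for j in range(m)]
    return solve(A, b)


def predict_expression(G, weights):
    """Step 2a: predicted expression for each individual in the mapping cohort."""
    return [sum(w * g for w, g in zip(weights, row)) for row in G]


def association_test(x, y):
    """Step 2b: simple regression of phenotype on predicted expression.
    Returns the z statistic and a two-sided normal-approximation p-value."""
    xc, yc = center(x), center(y)
    sxx = sum(v * v for v in xc)
    beta = sum(a * b for a, b in zip(xc, yc)) / sxx
    resid = [b - beta * a for a, b in zip(xc, yc)]
    sigma2 = sum(r * r for r in resid) / (len(x) - 2)
    z = beta / math.sqrt(sigma2 / sxx)
    return z, 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Toy data (all values are assumptions): 5 local SNPs, allele frequency 0.3.
rng = random.Random(0)
true_w = [0.5, -0.4, 0.3, 0.0, 0.2]


def genotypes(n):
    """Simulate 0/1/2 allele counts for n individuals."""
    return [[(rng.random() < 0.3) + (rng.random() < 0.3) for _ in true_w]
            for _ in range(n)]


# Reference panel: genotypes plus measured expression.
G_ref = genotypes(300)
expr_ref = [sum(w * g for w, g in zip(true_w, row)) + rng.gauss(0, 1) for row in G_ref]
weights = train_expression_model(G_ref, expr_ref)

# Mapping cohort: genotypes plus phenotype, no expression measured.
G_gwas = genotypes(2000)
pheno = [0.5 * sum(w * g for w, g in zip(true_w, row)) + rng.gauss(0, 1)
         for row in G_gwas]
z, p = association_test(predict_expression(G_gwas, weights), pheno)
print(f"z = {z:.2f}, p = {p:.3g}")
```

In this toy setup the phenotype depends on genotype only through the genetic component of expression, so the single gene-level test on predicted expression pools the signal of all five SNPs.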

Despite its demonstrated effectiveness, important questions remain regarding the
theoretical conditions under which TWAS is superior to GWAS. First, TWAS mapping relies
entirely on predicted expressions, but as shown by many methodological papers, the mean $R^2$
between predicted and actual expressions is very low (around 0.02 ~ 0.05). This is in part due to
low expression heritability [18], which bounds the maximum predictive accuracy attainable by
the genotype-expression model. Naturally, one can ask: given sufficiently low expression
heritability, is there a point at which TWAS performs worse than GWAS? Indeed, in real data,
genes discovered with significant TWAS p-values tend to have a higher $R^2$, and thus higher expression
heritability, than average [18, 19, 30-32]. We therefore investigate the effect of expression
heritability on the power of TWAS, as well as its interactions with trait heritability, phenotypic
variance from expressions, number of causal genes, and genetic architecture. Second, as
described by Gamazon et al. [18], the key insight of TWAS is that it aggregates sensible genetic
variants to estimate “genetically regulated gene expression”, or GReX, for use in
downstream GWAS. Given this hypothesis, one may ask if actual expression data would further
improve the power of downstream GWAS over predicted expressions. This is not a trivial
question: although actual expressions do not suffer from prediction errors, they also include
experimental or environmental noise which masks the genetic component of expression. To address
this question, we introduce a hypothetical protocol associating real expressions to phenotype, which
we call “expression mediated GWAS” or emGWAS. While emGWAS is not in practical use due
to the difficulties of accessing relevant tissues (e.g., in studies of brain diseases), it can
potentially be applied to future analyses of diseases where tissues are routinely available (e.g.,
blood or cancerous tissues). More importantly, emGWAS serves as a useful benchmark for
evaluating the theoretical properties of TWAS-predicted expressions against ground truth
expression data. By analyzing the power of TWAS, GWAS, and emGWAS, we develop practical
guidelines for choosing each protocol given different expression heritability and genetic
architectures.
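
The statement that expression heritability bounds the attainable predictive accuracy can be made explicit with a short derivation. The notation below ($E$ for measured expression, $g$ for its genetic component, $\varepsilon$ for non-genetic noise, $\hat{E}$ for the predicted expression, $h_e^2$ for expression heritability) is introduced here for illustration and is not taken from the paper. The sketch assumes an additive decomposition in which the noise is independent of genotype.

```latex
\begin{align*}
E &= g + \varepsilon, \qquad \varepsilon \text{ independent of genotype}, \qquad
h_e^2 = \frac{\operatorname{Var}(g)}{\operatorname{Var}(E)} \\
\operatorname{Cov}(\hat{E}, E) &= \operatorname{Cov}(\hat{E}, g)
  \qquad \text{for any predictor } \hat{E} \text{ that depends on genotype only} \\
R^2 = \frac{\operatorname{Cov}(\hat{E}, E)^2}{\operatorname{Var}(\hat{E})\,\operatorname{Var}(E)}
  &\le \frac{\operatorname{Var}(\hat{E})\,\operatorname{Var}(g)}{\operatorname{Var}(\hat{E})\,\operatorname{Var}(E)}
  = h_e^2
\end{align*}
```

The inequality is the Cauchy-Schwarz bound on $\operatorname{Cov}(\hat{E}, g)$. Under these assumptions, no genotype-based model can exceed $R^2 = h_e^2$, however large the reference panel.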

Although an existing study compares the power of GWAS, TWAS, and a
protocol which integrates eQTLs with GWAS [33], that study is purely simulation-based,
whereas we determine power directly using traditional closed-form analysis. We derive
non-centrality parameters (NCPs) for the relevant statistical tests and the linear mixed model (LMM)
in particular (Methods). Our derivation uses a novel method to convert an LMM into a linear
regression by decorrelating the covariance structure of the LMM response variable (Methods).
To the best of our knowledge, this is the first closed-form derivation of the NCP for LMMs in the current
literature, with potential for broad applications as LMMs are the dominant models used in
GWAS and portions of the TWAS pipeline.
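
The general idea of decorrelating an LMM response can be illustrated with the standard generalized-least-squares whitening argument. The sketch below is that generic textbook construction, not the authors' derivation, which is given in their Methods and may differ in detail. It assumes a single tested predictor $x$ with no other covariates, a known kinship matrix $K$, and known variance components $\sigma_g^2$ and $\sigma_e^2$; all symbols are introduced here for illustration.

```latex
\begin{align*}
y &= x\beta + u + \varepsilon, \qquad
  u \sim N(0, \sigma_g^2 K), \quad \varepsilon \sim N(0, \sigma_e^2 I) \\
V &= \operatorname{Var}(y) = \sigma_g^2 K + \sigma_e^2 I = L L^{\top} \\
\tilde{y} &= L^{-1} y, \quad \tilde{x} = L^{-1} x
  \;\Longrightarrow\;
  \tilde{y} = \tilde{x}\beta + \tilde{\varepsilon}, \qquad \tilde{\varepsilon} \sim N(0, I) \\
\lambda &= \beta^2\, \tilde{x}^{\top} \tilde{x} = \beta^2\, x^{\top} V^{-1} x
\end{align*}
```

After whitening, the model is an ordinary linear regression with uncorrelated unit-variance errors. The 1-df test statistic $(\tilde{x}^{\top}\tilde{y})^2 / (\tilde{x}^{\top}\tilde{x})$ then follows a non-central chi-square distribution with NCP $\lambda$.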

Unlike pure simulations, which stochastically resample the alternative hypothesis to
estimate statistical power, our closed-form derivation directly calculates power from a particular
configuration of association mapping data. As a result, our method saves computational
resources, yields more accurate power estimations, and adapts easily to similar protocols such as
IWAS [28] and PWAS [29, 34]. Moreover, as the closed-form derivation avoids conducting the
actual regression, our power calculations do not depend on specific implementations of GWAS
and TWAS, which could otherwise cause our results to vary due to differences in filtering inputs
or parameter optimizations. Our work therefore characterizes the theoretical power of the
protocols across all LMM-based implementations and datasets, although we are unable to
account for power losses due to practical implementation issues.
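
To illustrate why a closed-form NCP removes the need for resampling: once the NCP of a 1-df test is known, power follows from the non-central chi-square distribution. The sketch below uses only the Python standard library. The function names and the example numbers are hypothetical, and `ncp_simple_regression` is the textbook large-sample NCP for an ordinary simple linear regression, not the LMM-specific NCP derived in this paper.

```python
from statistics import NormalDist


def power_from_ncp(ncp, alpha=0.05):
    """Power of a 1-df chi-square test (equivalently a two-sided z test)
    whose statistic has non-centrality parameter `ncp`."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    shift = ncp ** 0.5                      # mean of the z statistic under H1
    return (1.0 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)


def ncp_simple_regression(n, var_explained):
    """Textbook NCP for one predictor explaining a fraction `var_explained`
    of the response variance in a sample of size n (assumption: ordinary
    linear regression, not the paper's LMM)."""
    return n * var_explained / (1.0 - var_explained)


# Hypothetical example: 5,000 individuals, predictor explains 0.5% of variance,
# tested at a genome-wide threshold of 5e-8.
ncp = ncp_simple_regression(5000, 0.005)
print(f"NCP = {ncp:.2f}, power = {power_from_ncp(ncp, alpha=5e-8):.3f}")
```

Because the calculation is a pair of normal CDF evaluations, sweeping it over a grid of heritabilities or sample sizes is far cheaper than simulating and refitting the model at each grid point.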

In the following section we describe our novel derivation of NCPs for LMMs and our
power analyses of GWAS, TWAS, and emGWAS. We present guidelines on the applicability of
each protocol under different input conditions and discuss potential limitations of our approach
as well as areas for future research.

Materials & Methods
Mathematical definitions of GWAS, TWAS, and emGWAS protocols
While there are many variations of GWAS and TWAS [18, 19, 35-39], in this work we
assume that multiple genes contribute to phenotypic variation, and for each causal gene, multiple
single nucleotide polymorphisms (SNPs) contribute to both gene expression and phenotype. This
setting is motivated by the fact that most complex traits are known to have multiple contributing
loci, and TWAS fundamentally assumes that genes have multiple local causal variants. To ensure