(Open Access) Power analysis of transcriptome-wide association study (2020) | Bowei Ding


Power analysis of transcriptome-wide association study: implications for practical protocol choice

Chen Cao, Bowei Ding, Qing Li, Devin Kwok, Jingjing Wu 2,*, Quan Long 1,2,3,4,*

Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, Canada.

Department of Mathematics & Statistics, University of Calgary, Calgary, Canada.

Department of Medical Genetics, University of Calgary, Calgary, Canada.

Hotchkiss Brain Institute, O’Brien Institute for Public Health, University of Calgary, Calgary, Canada.

¶ = Joint first authors: C.C. and B.D.

* = Joint corresponding authors: J.W. (jinwu@ucalgary.ca) and Q.L. (quan.long@ucalgary.ca)

Abstract

The transcriptome-wide association study (TWAS) has emerged as one of several promising techniques for integrating multi-scale ‘omics’ data into traditional genome-wide association studies (GWAS). Unlike GWAS, which associates phenotypic variance directly with genetic variants, TWAS uses a reference dataset to train a predictive model for gene expressions, which allows it to associate phenotype with variants through the mediating effect of expressions. Although effective, this core innovation of TWAS is poorly understood, since the predictive accuracy of the genotype-expression model is generally low and further bounded by expression heritability. This raises the question: to what degree does the accuracy of the expression model affect the power of TWAS? Furthermore, would replacing predictions with actual, experimentally determined expressions improve power? To answer these questions, we compared the power of GWAS, TWAS, and a hypothetical protocol utilizing real expression data. We derived non-centrality parameters (NCPs) for linear mixed models (LMMs) to enable closed-form calculations of statistical power that do not rely on specific protocol implementations. We examined two representative scenarios: causality (genotype contributes to phenotype through expression) and pleiotropy (genotype contributes directly to both phenotype and expression), and also tested the effects of various properties including expression heritability. Our analysis reveals two main outcomes: (1) Under pleiotropy, the use of predicted expressions in TWAS is superior to actual expressions. This explains why TWAS can function with weak expression models, and shows that TWAS remains relevant even when real expressions are available. (2) GWAS outperforms TWAS when expression heritability is below a threshold of 0.04 under causality, or 0.06 under pleiotropy. Analysis of existing publications suggests that TWAS has been misapplied in place of GWAS, in situations where expression heritability is low.

Keywords: Power analysis, GWAS, TWAS, Non-centrality parameter, Expression heritability

bioRxiv preprint, doi: https://doi.org/10.1101/2020.07.19.211151; this version posted February 26, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.


Author Summary

We compared the effectiveness of three methods for finding genetic effects on disease in order to quantify their strengths and help researchers choose the best protocol for their data. The genome-wide association study (GWAS) is the standard method for identifying how the genetic differences between individuals relate to disease. Recently, the transcriptome-wide association study (TWAS) has improved on GWAS by also estimating the effect of each genetic variant on the activity level (or expression) of genes related to disease. The effectiveness of TWAS is surprising because its estimates of gene expressions are very inaccurate, so we ask whether a method using real expression data instead of estimates would perform better. Unlike past studies, which only use simulation to compare these methods, we incorporate novel statistical calculations to make our comparisons more accurate and universally applicable. We discover that, depending on the type of relationship between genetics, gene expression, and disease, the estimates used by TWAS could actually be more relevant than real gene expressions. We also find that TWAS is not always better than GWAS when the relationship between genetics and expression is weak, and we identify specific turning points where past studies have incorrectly used TWAS instead of GWAS.

Introduction

High-throughput sequencing instruments have enabled the rapid profiling of transcriptomes (RNA expression of genes) [1-4], proteomes (proteins) [5-7] and other ‘omics’ data [8-10]. These ‘omics’ provide insight into the intermediary effects of genotypes on endophenotypes, and can improve the ability of genome-wide association studies (GWAS) to find associations between genetic variants and disease phenotypes [11-13]. The integration of diverse ‘omics’ data sources remains a challenging and active field of research [14-17].

One approach to integrating ‘omics’ and GWAS is the transcriptome-wide association study (TWAS), which quantitatively aggregates multiple genetic variants into a single test using transcriptome data. Pioneered by Gamazon et al. [18], the TWAS protocol typically has two steps. First, a model is trained to predict gene expressions from local genetic variants near the focal genes, using a reference dataset containing both genotype and expression data. Second, the pretrained model is used to predict expressions from genotypes in the association mapping dataset under study, which contains genotypes and phenotypes (but not expression). The predicted expressions are then associated with the phenotype of interest. TWAS can also be conducted with summary statistics from GWAS datasets (i.e. meta-analysis), as first demonstrated by Gusev et al. [19, 20]. TWAS has since achieved significant popularity and success in identifying the genetic basis of complex traits [21-27], inspiring similar protocols for other endophenotypes such as IWAS for images [28] and PWAS for proteins [29].
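The two-step protocol described above can be sketched on toy data. This is only an illustration under stated assumptions, not the authors' pipeline or any published TWAS tool: the simulated genotypes, the ridge penalty, the effect sizes, and the function names (`simulate_genotypes`, `fit_expression_model`, `twas_statistic`) are all hypothetical, and a plain correlation test stands in for the association step.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_genotypes(n, m, maf=0.3):
    """Toy genotypes: n individuals x m SNPs coded as 0/1/2 allele counts."""
    return rng.binomial(2, maf, size=(n, m)).astype(float)

def fit_expression_model(G_ref, expr_ref, ridge=1.0):
    """Step 1: ridge regression of expression on local SNPs (reference panel)."""
    mu = G_ref.mean(axis=0)
    Gc = G_ref - mu
    m = G_ref.shape[1]
    w = np.linalg.solve(Gc.T @ Gc + ridge * np.eye(m),
                        Gc.T @ (expr_ref - expr_ref.mean()))
    return mu, w

def twas_statistic(G_gwas, phenotype, mu, w):
    """Step 2: predict expression in the mapping cohort, then test it."""
    pred = (G_gwas - mu) @ w
    r = np.corrcoef(pred, phenotype)[0, 1]
    n = len(phenotype)
    t = r * np.sqrt((n - 2) / (1.0 - r ** 2))  # t statistic for a correlation
    return pred, r, t

# Hypothetical "causality" setting: genotype -> expression -> phenotype.
m = 20
true_w = rng.normal(0.0, 0.3, size=m)            # assumed SNP effects on expression
G_ref = simulate_genotypes(500, m)               # reference panel with expression
expr_ref = G_ref @ true_w + rng.normal(0.0, 1.0, size=500)
G_gwas = simulate_genotypes(2000, m)             # mapping cohort, no expression data
expr_gwas = G_gwas @ true_w + rng.normal(0.0, 1.0, size=2000)  # never observed
phenotype = 0.5 * expr_gwas + rng.normal(0.0, 1.0, size=2000)

mu, w = fit_expression_model(G_ref, expr_ref)
pred, r, t = twas_statistic(G_gwas, phenotype, mu, w)
print(f"correlation = {r:.3f}, t statistic = {t:.2f}")
```

Note that the mapping cohort's true expression (`expr_gwas`) is used only to generate the phenotype; the test itself sees only the predicted values, which is the defining feature of the protocol.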

Despite its demonstrated effectiveness, important questions remain regarding the theoretical conditions under which TWAS is superior to GWAS. First: TWAS mapping relies entirely on predicted expressions, but as shown by many methodological papers, the mean R² between predicted and actual expressions is very low (around 0.02 ~ 0.05). This is in part due to low expression heritability [18], which bounds the maximum predictive accuracy attainable by the genotype-expression model. Naturally, one can ask: given sufficiently low expression heritability, is there a point at which TWAS performs worse than GWAS? Indeed, in real data, genes discovered with significant TWAS p-values tend to have a higher R², and thus expression heritability, than average [18, 19, 30-32]. We therefore investigate the effect of expression heritability on the power of TWAS, as well as its interactions with trait heritability, phenotypic variance from expressions, number of causal genes, and genetic architecture. Second: as described by Gamazon et al. [18], the key insight of TWAS is that it aggregates sensible genetic variants to estimate “genetically regulated gene expression”, or GReX [18], for use in downstream GWAS. Given this hypothesis, one may ask whether actual expression data would further improve the power of downstream GWAS over predicted expressions. This is not a trivial question: although actual expressions do not suffer from prediction errors, they also include experimental or environmental noise which masks the genetic component of expression. To address this question, we define a hypothetical protocol associating real expressions with phenotype, which we call “expression mediated GWAS” or emGWAS. While emGWAS is not in practical use due to the difficulties of accessing relevant tissues (e.g., in the studies of brain diseases), it can potentially be applied to future analyses of diseases where tissues are routinely available (e.g., blood or cancerous tissues). More importantly, emGWAS serves as a useful benchmark for evaluating the theoretical properties of TWAS-predicted expressions against ground truth expression data. By analyzing the power of TWAS, GWAS, and emGWAS, we develop practical guidelines for choosing each protocol given different expression heritability and genetic architectures.


While an existing study has compared the power of GWAS, TWAS, and a protocol which integrates eQTLs with GWAS [33], that study is purely simulation-based, whereas we determine power directly using traditional closed-form analysis. We derive non-centrality parameters (NCPs) for the relevant statistical tests and the linear mixed model (LMM) in particular (Methods). Our derivation uses a novel method to convert an LMM into a linear regression by decorrelating the covariance structure of the LMM response variable (Methods). To the best of our knowledge, this is the first closed-form derivation of the NCP for LMMs in current literature, with potential for broad applications as LMMs are the dominant models used in GWAS and portions of the TWAS pipeline.
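As a rough illustration of what decorrelating an LMM response can look like, the sketch below applies generic Cholesky whitening to toy data. It is not the authors' derivation, which is given in their Methods and is not reproduced here. The covariance form V = sigma_g2 * K + sigma_e2 * I, the toy kinship matrix, the variance components, and the function name `whiten_lmm` are all assumptions made for the example.

```python
import numpy as np

def whiten_lmm(y, X, K, sigma_g2, sigma_e2):
    """Turn y ~ N(X b, V), V = sigma_g2*K + sigma_e2*I, into an i.i.d. regression.

    With V = L L^T (Cholesky), the transformed response L^{-1} y has identity
    covariance, so ordinary least squares applies to (L^{-1} X, L^{-1} y).
    """
    n = len(y)
    V = sigma_g2 * K + sigma_e2 * np.eye(n)
    L = np.linalg.cholesky(V)
    y_w = np.linalg.solve(L, y)
    X_w = np.linalg.solve(L, X)
    return y_w, X_w, L, V

# Toy data (all values hypothetical).
rng = np.random.default_rng(1)
n, q = 200, 50
Z = rng.normal(size=(n, q))
K = Z @ Z.T / q                                   # toy kinship-like matrix
snp = rng.binomial(2, 0.3, size=n).astype(float)  # one tested SNP
X = np.column_stack([np.ones(n), snp])            # intercept + SNP
sigma_g2, sigma_e2 = 0.5, 0.5
g = np.sqrt(sigma_g2) * (Z @ rng.normal(size=q)) / np.sqrt(q)  # cov = sigma_g2 * K
e = np.sqrt(sigma_e2) * rng.normal(size=n)
y = X @ np.array([0.0, 0.3]) + g + e

y_w, X_w, L, V = whiten_lmm(y, X, K, sigma_g2, sigma_e2)
beta_hat, *_ = np.linalg.lstsq(X_w, y_w, rcond=None)
print("estimated [intercept, SNP effect]:", beta_hat)
```

Once the residuals are uncorrelated with unit variance, standard linear-regression results for the test statistic apply to the transformed design, which is what makes a closed-form treatment plausible in principle.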

Unlike pure simulations, which stochastically resample the alternative hypothesis to estimate statistical power, our closed-form derivation directly calculates power from a particular configuration of association mapping data. As a result, our method saves computational resources, yields more accurate power estimations, and adapts easily to similar protocols such as IWAS [28] and PWAS [29, 34]. Moreover, as the closed-form derivation avoids conducting the actual regression, our power calculations do not depend on specific implementations of GWAS and TWAS, which could otherwise cause our results to vary due to differences in filtering inputs or parameter optimizations. Our work therefore characterizes the theoretical power of the protocols across all LMM-based implementations and datasets, although we are unable to account for power losses due to practical implementation issues.
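To make concrete how an NCP turns into a power value without resampling, here is a minimal sketch that assumes a one-degree-of-freedom chi-square test. The NCP is treated as a given input: the paper's own NCP formulas for LMMs are in its Methods and are not reproduced here, and the function name `power_from_ncp` and the example NCP values are hypothetical. The sketch uses the fact that if Z ~ N(sqrt(ncp), 1), then Z² follows a non-central chi-square distribution with one degree of freedom.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1

def power_from_ncp(ncp, alpha=0.05):
    """Power of a two-sided 1-df test whose statistic has non-centrality `ncp`."""
    if ncp < 0:
        raise ValueError("ncp must be non-negative")
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # rejection threshold on |Z|
    shift = ncp ** 0.5                               # mean of Z under the alternative
    # P(|Z| > z_crit) when Z ~ N(shift, 1)
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)

# Hypothetical NCP values, evaluated at a nominal and a genome-wide threshold.
for ncp in (1.0, 10.0, 30.0, 50.0):
    print(ncp, power_from_ncp(ncp, alpha=0.05), power_from_ncp(ncp, alpha=5e-8))
```

With `ncp = 0` the function returns `alpha`, as expected for a test under the null hypothesis. Each call is a single closed-form evaluation, in contrast to a simulation that would repeat the regression many times per configuration.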

In the following section we describe our novel derivation of NCPs for LMMs and our power analyses of GWAS, TWAS, and emGWAS. We present guidelines on the applicability of each protocol under different input conditions and discuss potential limitations of our approach as well as areas for future research.

Materials & Methods

Mathematical definitions of GWAS, TWAS, and emGWAS protocols

While there are many variations of GWAS and TWAS [18, 19, 35-39], in this work we assume that multiple genes contribute to phenotypic variation, and for each causal gene, multiple single nucleotide polymorphisms (SNPs) contribute to both gene expression and phenotype. This setting is motivated by the fact that most complex traits are known to have multiple contributing loci, and TWAS fundamentally assumes that genes have multiple local causal variants. To ensure



