
Power analysis of transcriptome-wide association study

TLDR
Depending on the type of relationship between genetics, gene expression, and disease, the expression estimates used by TWAS can be more relevant than real gene expression measurements. TWAS is also not always better than GWAS when the relationship between genetics and expression is weak.
Abstract:
The standard genome-wide association study (GWAS) discovers genetic variants that explain phenotypic variance by associating them directly with the phenotype. With the availability of other omics data such as gene expression, the field is entering an era of multi-scale omics integration. An emerging technique is the transcriptome-wide association study (TWAS), which conducts association mapping using gene expression data from a separate reference dataset, on which a model predicting expression from genotype is trained. Despite its success in practice, two fundamental questions remain unaddressed. First, the accuracy of predicting expression from genotype is generally low in practice and is bounded by expression heritability. The question is whether such low accuracy affects the power of TWAS, and what level of accuracy is sufficient. Second, since predicting expression is a critical step in TWAS, one may ask whether using actual expression measured experimentally would improve or reduce power. Answering these questions will deepen the understanding of TWAS and yield practical guidelines for association mapping. To address them, we conducted power analyses for GWAS, TWAS, and expression mediated GWAS (emGWAS). Specifically, we derived non-centrality parameters (NCPs), enabling closed-form derivation of statistical power and a thorough power analysis that does not rely on particular implementations. We assessed the power of the three protocols under two representative scenarios: causality (genotype contributes to phenotype through expression) and pleiotropy (genotype contributes directly to both phenotype and expression). For both scenarios, we tested various properties including expression heritability.
Our analysis led to two main outcomes: (1) In the pleiotropy scenario, TWAS using predicted expression has higher power than emGWAS using actual expression, which offers insight into TWAS models and suggests applying TWAS even when expression data are available in a GWAS dataset. (2) TWAS is suboptimal compared to GWAS when expression heritability is too low. The relative ordering of TWAS and GWAS reveals a turning point in each of the causality and pleiotropy scenarios. Analysis of published discoveries shows that the selection of protocols might be improved based on the identified turning points.

Power analysis of transcriptome-wide association study: implications for practical protocol choice

Chen Cao 1,¶, Bowei Ding 2,¶, Qing Li 1, Devin Kwok 2, Jingjing Wu 2,*, Quan Long 1,2,3,4,*

1 Department of Biochemistry & Molecular Biology, Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, Canada.
2 Department of Mathematics & Statistics, University of Calgary, Calgary, Canada.
3 Department of Medical Genetics, University of Calgary, Calgary, Canada.
4 Hotchkiss Brain Institute, O’Brien Institute for Public Health, University of Calgary, Calgary, Canada.

¶ = Joint first authors: C.C. and B.D.
* = Joint corresponding authors: J.W. (jinwu@ucalgary.ca) and Q.L. (quan.long@ucalgary.ca)

Abstract

The transcriptome-wide association study (TWAS) has emerged as one of several promising techniques for integrating multi-scale ‘omics’ data into traditional genome-wide association studies (GWAS). Unlike GWAS, which associates phenotypic variance directly with genetic variants, TWAS uses a reference dataset to train a predictive model for gene expressions, which allows it to associate phenotype with variants through the mediating effect of expressions. Although effective, this core innovation of TWAS is poorly understood, since the predictive accuracy of the genotype-expression model is generally low and further bounded by expression heritability. This raises the question: to what degree does the accuracy of the expression model affect the power of TWAS? Furthermore, would replacing predictions with actual, experimentally determined expressions improve power? To answer these questions, we compared the power of GWAS, TWAS, and a hypothetical protocol utilizing real expression data. We derived non-centrality parameters (NCPs) for linear mixed models (LMMs) to enable closed-form calculations of statistical power that do not rely on specific protocol implementations. We examined two representative scenarios: causality (genotype contributes to phenotype through expression) and pleiotropy (genotype contributes directly to both phenotype and expression), and also tested the effects of various properties including expression heritability. Our analysis reveals two main outcomes: (1) Under pleiotropy, the use of predicted expressions in TWAS is superior to actual expressions. This explains why TWAS can function with weak expression models, and shows that TWAS remains relevant even when real expressions are available. (2) GWAS outperforms TWAS when expression heritability is below a threshold of 0.04 under causality, or 0.06 under pleiotropy. Analysis of existing publications suggests that TWAS has been misapplied in place of GWAS, in situations where expression heritability is low.

Keywords: Power analysis, GWAS, TWAS, Non-centrality parameter, Expression heritability

bioRxiv preprint doi: https://doi.org/10.1101/2020.07.19.211151; this version posted February 26, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.

Author Summary

We compared the effectiveness of three methods for finding genetic effects on disease in order to quantify their strengths and help researchers choose the best protocol for their data. The genome-wide association study (GWAS) is the standard method for identifying how the genetic differences between individuals relate to disease. Recently, the transcriptome-wide association study (TWAS) has improved GWAS by also estimating the effect of each genetic variant on the activity level (or expression) of genes related to disease. The effectiveness of TWAS is surprising because its estimates of gene expressions are very inaccurate, so we ask if a method using real expression data instead of estimates would perform better. Unlike past studies, which only use simulation to compare these methods, we incorporate novel statistical calculations to make our comparisons more accurate and universally applicable. We discover that depending on the type of relationship between genetics, gene expression, and disease, the estimates used by TWAS could be actually more relevant than real gene expressions. We also find that TWAS is not always better than GWAS when the relationship between genetics and expression is weak and identify specific turning points where past studies have incorrectly used TWAS instead of GWAS.

Introduction

High-throughput sequencing instruments have enabled the rapid profiling of transcriptomes (RNA expression of genes) [1-4], proteomes (proteins) [5-7] and other ‘omics’ data [8-10]. These ‘omics’ provide insight into the intermediary effects of genotypes on endophenotypes, and can improve the ability of genome-wide association studies (GWAS) to find associations between genetic variants and disease phenotypes [11-13]. The integration of diverse ‘omics’ data sources remains a challenging and active field of research [14-17].

One approach to integrating ‘omics’ and GWAS is the transcriptome-wide association study (TWAS), which quantitatively aggregates multiple genetic variants into a single test using transcriptome data. Pioneered by Gamazon et al. [18], the TWAS protocol typically has two steps. First, a model is trained to predict gene expressions from local genetic variants near the focal genes, using a reference dataset containing both genotype and expression data. Second, the pretrained model is used to predict expressions from genotypes in the association mapping dataset under study, which contains genotypes and phenotypes (but not expression). The predicted expressions are then associated to the phenotype of interest. TWAS can also be conducted with summary statistics from GWAS datasets (i.e. meta-analysis) as first demonstrated by Gusev et al. [19, 20]. TWAS has since achieved significant popularity and success in identifying the genetic basis of complex traits [21-27], inspiring similar protocols for other endophenotypes such as IWAS for images [28] and PWAS for proteins [29].

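The two-step protocol described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the pipeline of any published TWAS tool: ridge regression stands in for the expression model, the data are simulated under the causality scenario, and all names and parameter values (`fit_expression_model`, `twas_statistic`, `expr_h2`, the sample sizes, the effect size 0.3) are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ref, n_gwas, n_snps = 500, 5000, 10
expr_h2 = 0.3                       # assumed expression heritability
true_w = rng.normal(size=n_snps)    # assumed true SNP effects on expression


def simulate(n):
    """Simulate genotypes, expression, and phenotype (causality scenario)."""
    geno = rng.binomial(2, 0.3, size=(n, n_snps)).astype(float)
    grex = geno @ true_w                          # genetic part of expression
    grex = (grex - grex.mean()) / grex.std()
    expr = np.sqrt(expr_h2) * grex + np.sqrt(1 - expr_h2) * rng.normal(size=n)
    pheno = 0.3 * expr + rng.normal(size=n)       # phenotype acts through expression
    return geno, expr, pheno


def fit_expression_model(geno_ref, expr_ref, ridge=1.0):
    """Step 1: ridge regression of expression on local SNPs (reference data)."""
    g = geno_ref - geno_ref.mean(axis=0)
    e = expr_ref - expr_ref.mean()
    return np.linalg.solve(g.T @ g + ridge * np.eye(g.shape[1]), g.T @ e)


def twas_statistic(geno_gwas, pheno, weights):
    """Step 2: predict expression from genotype and test it against phenotype.

    Returns n * r^2, which is approximately chi-square with 1 degree of
    freedom when predicted expression and phenotype are unrelated.
    """
    predicted = (geno_gwas - geno_gwas.mean(axis=0)) @ weights
    r = np.corrcoef(predicted, pheno)[0, 1]
    return len(pheno) * r ** 2


# Reference dataset: genotype + expression. GWAS dataset: genotype + phenotype.
geno_ref, expr_ref, _ = simulate(n_ref)
geno_gwas, _, pheno_gwas = simulate(n_gwas)

weights = fit_expression_model(geno_ref, expr_ref)
stat = twas_statistic(geno_gwas, pheno_gwas, weights)
print(f"TWAS statistic for the simulated gene: {stat:.1f}")
```

Note that the expression measured in the GWAS dataset is discarded in step 2, which mirrors the setting in which only genotypes and phenotypes are available for association mapping.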
Despite its demonstrated effectiveness, important questions remain regarding the theoretical conditions under which TWAS is superior to GWAS. First: TWAS mapping relies entirely on predicted expressions, but as shown by many methodological papers, the mean R² between predicted and actual expressions is very low (around 0.02 to 0.05). This is in part due to low expression heritability [18], which bounds the maximum predictive accuracy attainable by the genotype-expression model. Naturally, one can ask: given sufficiently low expression heritability, is there a point at which TWAS performs worse than GWAS? Indeed in real data, genes discovered with significant TWAS p-values tend to have a higher R², and thus expression heritability, than average [18, 19, 30-32]. We therefore investigate the effect of expression heritability on the power of TWAS, as well as its interactions with trait heritability, phenotypic variance from expressions, number of causal genes, and genetic architecture. Second: as described by Gamazon et al. [18], the key insight of TWAS is that it aggregates sensible genetic variants to estimate “genetically regulated gene expression”, or GReX [18], for use in downstream GWAS. Given this hypothesis, one may ask if actual expression data would further improve the power of downstream GWAS over predicted expressions. This is not a trivial question: although actual expressions do not suffer from prediction errors, they also include experimental or environmental noise which masks the genetic component of expression. To investigate this question, we introduce a hypothetical protocol associating real expressions to phenotype, which we call “expression mediated GWAS” or emGWAS. While emGWAS is not in practical use due to the difficulties of accessing relevant tissues (e.g., in the studies of brain diseases), it can potentially be applied to future analyses of diseases where tissues are routinely available (e.g., blood or cancerous tissues). More importantly, emGWAS serves as a useful benchmark for evaluating the theoretical properties of TWAS-predicted expressions against ground truth expression data. By analyzing the power of TWAS, GWAS, and emGWAS, we develop practical guidelines for choosing each protocol given different expression heritability and genetic architectures.

One existing study compares the power of GWAS, TWAS, and a protocol that integrates eQTLs with GWAS [33], but it is purely simulation-based, whereas we determine power directly using traditional closed-form analysis. We derive non-centrality parameters (NCPs) for the relevant statistical tests and the linear mixed model (LMM) in particular (Methods). Our derivation uses a novel method to convert an LMM into a linear regression by decorrelating the covariance structure of the LMM response variable (Methods). To the best of our knowledge, this is the first closed-form derivation of the NCP for LMMs in the current literature, with potential for broad applications as LMMs are the dominant models used in GWAS and portions of the TWAS pipeline.

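The general idea of decorrelating an LMM response can be illustrated with the standard whitening transformation. The sketch below is not the authors' derivation, which is given in their Methods. It assumes the variance components are known and a response covariance of the form V = sigma_g2 * K + sigma_e2 * I, where K is a kinship-like matrix. The function name `whiten` and all simulated quantities are hypothetical.

```python
import numpy as np


def whiten(y, X, K, sigma_g2, sigma_e2):
    """Decorrelate an LMM with response covariance V = sigma_g2*K + sigma_e2*I.

    With V = L L^T (Cholesky), the transformed response L^{-1} y has identity
    covariance, so ordinary least squares applies to the transformed data.
    """
    V = sigma_g2 * K + sigma_e2 * np.eye(len(y))
    L = np.linalg.cholesky(V)
    return np.linalg.solve(L, y), np.linalg.solve(L, X), V


rng = np.random.default_rng(1)
n, m = 200, 50
sigma_g2, sigma_e2 = 0.6, 0.4                  # assumed known variance components

Z = rng.normal(size=(n, m))                    # stand-in for standardized genotypes
K = Z @ Z.T / m                                # kinship-like relatedness matrix
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + tested variable

u = Z @ rng.normal(size=m) * np.sqrt(sigma_g2 / m)      # random effect, cov = sigma_g2*K
e = rng.normal(size=n) * np.sqrt(sigma_e2)
y = X @ np.array([1.0, 0.5]) + u + e

y_star, X_star, V = whiten(y, X, K, sigma_g2, sigma_e2)

# OLS on the decorrelated data ...
beta_whitened = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
# ... agrees with generalized least squares on the original data.
V_inv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)
print(beta_whitened, beta_gls)
```

Once the model is in ordinary regression form, the usual regression expressions for test statistics and their non-centrality parameters can be applied to the transformed variables.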
Unlike pure simulations, which stochastically resample the alternative hypothesis to estimate statistical power, our closed-form derivation directly calculates power from a particular configuration of association mapping data. As a result, our method saves computational resources, yields more accurate power estimations, and adapts easily to similar protocols such as IWAS [28] and PWAS [29, 34]. Moreover, as the closed-form derivation avoids conducting the actual regression, our power calculations do not depend on specific implementations of GWAS and TWAS, which could otherwise cause our results to vary due to differences in filtering inputs or parameter optimizations. Our work therefore characterizes the theoretical power of the protocols across all LMM-based implementations and datasets, although we are unable to account for power losses due to practical implementation issues.

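To show how an NCP yields power without simulation, the following sketch computes the power of a test whose statistic follows a non-central chi-square distribution with one degree of freedom. It is a generic illustration and does not use the LMM-specific NCPs derived in this work. The NCP formula n*q/(1-q) is the textbook expression for simple linear regression, where q is the fraction of trait variance explained by the tested variable. The function names and the threshold 5e-8 are assumptions for the example.

```python
from statistics import NormalDist


def power_from_ncp(ncp, alpha=5e-8):
    """Power of a 1-df chi-square test with non-centrality parameter `ncp`.

    A 1-df non-central chi-square variable is (Z + sqrt(ncp))^2 with Z
    standard normal, so the power is a two-sided normal tail probability.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = ncp ** 0.5
    upper = 1 - std_normal.cdf(z_crit - shift)
    lower = std_normal.cdf(-z_crit - shift)
    return upper + lower


def regression_ncp(n, variance_explained):
    """Textbook NCP for simple linear regression (illustrative assumption)."""
    return n * variance_explained / (1 - variance_explained)


# Power at a genome-wide threshold for a variable explaining 1% of variance.
for n in (1000, 5000, 10000):
    ncp = regression_ncp(n, 0.01)
    print(n, round(ncp, 2), power_from_ncp(ncp))
```

Because the calculation is a direct function of the NCP, sweeping over sample size, heritability, or effect size requires no resampling, which is the computational advantage the closed-form approach offers over simulation.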
In the following section we describe our novel derivation of NCPs for LMMs and our power analyses of GWAS, TWAS, and emGWAS. We present guidelines on the applicability of each protocol under different input conditions and discuss potential limitations of our approach as well as areas for future research.

Materials & Methods

Mathematical definitions of GWAS, TWAS, and emGWAS protocols

While there are many variations of GWAS and TWAS [18, 19, 35-39], in this work we assume that multiple genes contribute to phenotypic variation, and for each causal gene, multiple single nucleotide polymorphisms (SNPs) contribute to both gene expression and phenotype. This setting is motivated by the fact that most complex traits are known to have multiple contributing loci, and TWAS fundamentally assumes that genes have multiple local causal variants. To ensure