Integrated Analysis of Gene Expression Differences in Twins Discordant for Disease and Binary Phenotypes

Sivateja Tangirala, Chirag J. Patel
28 Nov 2017
pp 226449
Abstract
While both genes and environment contribute to phenotype, deciphering environmental contributions to phenotype is a challenge. Furthermore, elucidating how different phenotypes may share similar environmental etiologies also is challenging. One way to identify environmental influences is through a discordant monozygotic (MZ) twin study design. Here, we assessed differential gene expression in MZ discordant twin pairs (affected vs. non-affected) for seven phenotypes, including chronic fatigue syndrome, obesity, ulcerative colitis, major depressive disorder, intermittent allergic rhinitis, physical activity, and intelligence quotient, comparing the spectrum of genes differentially expressed across seven phenotypes individually. Second, we performed meta-analysis for each gene to identify commonalities and differences in gene expression signatures between the seven phenotypes. In our integrative analyses, we found that there may be a common gene expression signature (with small effect sizes) across the phenotypes; however, differences between phenotypes with respect to differentially expressed genes were more prominently featured. Therefore, defining common environmentally induced pathways in phenotypes remains elusive. We make our work accessible by providing a new database (DiscTwinExprDB: http://apps.chiragjpgroup.org/disctwinexprdb/) for investigators to study non-genotypic influence on gene expression.

Scientific Reports | (2018) 8:17 | DOI:10.1038/s41598-017-18585-3
www.nature.com/scientificreports
Sivateja Tangirala & Chirag J. Patel
Gene expression is influenced by both inherited and non-inherited (or environmental) factors; however, identifying how environment influences phenotype, such as disease, remains a challenge [1]. A common approach to identify differentially expressed genes in disease is the case-control study. Case-control studies involve the matching of affected individuals with healthy controls to assess the differences of gene expression in cases versus controls. However, it is difficult to attribute differences in gene expression to inherited factors, environmental factors, or phenotypic state; further, associations may be biased due to confounding variables.
One way to partition the role of environment and inherited factors in gene expression is to use a family-based twin design, whereby twins are discordant for phenotypes. For example, monozygotic (MZ) discordant twins are twins that share the same genome but are discordant for a phenotype (e.g., one twin has a certain phenotype, the other does not). The monozygotic discordant twin study design provides a natural way to identify significant genes for a particular phenotype after controlling for non-temporally dependent variables, such as shared genetics, sex, and age [2].
Is there a consistent gene expression signature of environmental influence? Or, how much does gene expression due to potential environmental influence vary across phenotypes? We hypothesized that integrating gene expression data from multiple phenotypes can allow the elucidation of heterogeneity of discordant gene expression (how gene expression differences between twins vary) and, furthermore, gene signatures across phenotypes. More specifically, we claim it is possible to measure cross-phenotype heterogeneity by meta-analyzing across mean expression differences for each gene from discordant twin samples. As of this writing, gene expression data from discordant twin samples have not been utilized to perform such analyses.
Our study's goal was to identify significant differentially expressed genes between samples of affected and non-affected MZ twin pairs and integrate mean expression differences across seven phenotypes. We hypothesize that it is possible to detect potential environmentally modulated gene expression values shared and distinct among different phenotypes. We further claim that identifying genes that are different and shared among numerous phenotypes will shed light on shared environmental etiology in phenotypic variation.
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, 02115, USA. Correspondence and requests for materials should be addressed to C.J.P. (email: chirag_patel@hms.harvard.edu)
Received: 5 September 2017
Accepted: 14 December 2017
In order to identify genes in discordant twins, we formulated a computational approach in four parts. First, we queried public repositories such as the Gene Expression Omnibus [3] (GEO) for gene expression studies of discordant monozygotic twin pairs. Next, we identified significantly altered genes in twin pairs that are discordant for each of the seven phenotypes. We then compared gene signatures across the seven phenotypes in a pairwise fashion and by using a meta-analytic approach. Last, we hypothesized that sex may also play a role in gene expression variation; thus, we attempted to identify genes in a sex-specific manner across multiple phenotypes.
Results
Differential gene expression analysis in each of the seven phenotypes individually. For our gene expression analyses we used expression data from the Gene Expression Omnibus [3] (GEO), Array Express [4] (AE), and a study from the Database of Genotypes and Phenotypes [5] (dbGAP) (Fig. 1, Table 1). The seven phenotypes we interrogated included four diseases, namely chronic fatigue syndrome (CFS), major depressive disorder (MDD), ulcerative colitis (UC), and intermittent allergic rhinitis (IAR in vitro), and three other phenotypes, namely physical activity (PA), obesity (OB), and intelligence quotient (IQ). To ensure adequate power for detection, we used seven studies that each had at least 10 twin pair samples. The sample sources (tissues and cell lines) used by the studies included peripheral blood, lymphoblastoid cell lines, adipose tissue, muscle tissue, and colon tissue (Table 1). All results are accessible via our R Shiny web application (http://apps.chiragjpgroup.org/disctwinexprdb/).
Figure 1. Analysis Procedure. A schematic diagram depicting the analysis pipeline. (1) Data Selection involved a filtration process for selecting twin expression datasets. (2) Differential Expression Analysis was carried out (using probe or transcript-level values) to find significant differentially expressed transcripts using FDR and effect size thresholds. (3) Meta-Analytic Gene Level Summarization was carried out to summarize transcript-level differences to gene-level differences.
We performed differential gene expression analysis on each of the monozygotic discordant twin gene expression studies, performed meta-analysis of probe-level (transcript-level) values to obtain gene-level values [6], and corrected the p-values for each gene-level value using the false discovery rate (FDR) method [7]. In order to minimize the false positive rate (FPR), Sweeney et al. suggested utilizing stringent significance and effect size thresholds [6]. Therefore, we identified significant differentially expressed genes for each phenotype as those that fell under an FDR threshold of 0.05 and had an absolute mean expression difference greater than the 95th percentile of the absolute value of mean gene expression differences in that phenotype (Figs 2, S1). Figure S1 shows the empirical cumulative distribution of mean differences for each phenotype. The number of significant genes ranged from three in chronic fatigue syndrome (CFS) to 677 in intelligence quotient (IQ). Overall, the total number of unique significant genes across all seven datasets was 1,286 out of the 25,154 genes measured across those datasets (5%).
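The two-part selection rule described above (FDR below 0.05 and an absolute mean difference above the 95th percentile) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the dictionary inputs, and the linear-interpolation percentile are assumptions, and the per-gene test that produces the p-values is taken as given.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is the minimum of p * m / rank over this rank and all larger ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def percentile(values, q):
    """Percentile with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(xs) - 1)
    return xs[lower] + (xs[upper] - xs[lower]) * (pos - lower)


def select_significant(mean_diffs, pvalues, fdr=0.05, pct=95.0):
    """Return genes that pass both the FDR and the effect-size threshold.

    mean_diffs and pvalues are dicts keyed by gene symbol (hypothetical
    inputs; how they were computed is outside this sketch).
    """
    genes = sorted(mean_diffs)
    adjusted = benjamini_hochberg([pvalues[g] for g in genes])
    cutoff = percentile([abs(mean_diffs[g]) for g in genes], pct)
    return [g for g, q in zip(genes, adjusted)
            if q < fdr and abs(mean_diffs[g]) > cutoff]
```

Because the cutoff is taken on absolute values, genes with large negative differences (lower expression in the affected twin) are retained on the same footing as those with large positive differences.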
Across the seven studies (phenotypes) incorporated into our analyses, we found that intelligence quotient (IQ) had 30 of the most significant genes (with FDR less than 0.05 and mean difference greater than the absolute value effect size threshold of the 95th percentile). Out of the disease phenotypes incorporated into our study, intermittent allergic rhinitis (IAR in vitro) had the most significant gene (COQ5 [Coenzyme Q5, Methyltransferase], a gene involved in methyltransferase activity) with an FDR value of 2.4E-09 (mean difference = 105 units; that is, affected twins had higher gene expression than their unaffected co-twins).
The disease with the highest total number of significant genes in our analysis was UC (424 significant genes) and the one with the fewest was CFS (three significant genes). The non-disease phenotype with the highest total number of significant genes was IQ (677 significant genes) and the one with the fewest was physical activity (PA, 15 significant genes).
Little overlap of differentially expressed genes in discordant twins across seven phenotypes. Next, we computed the pairwise similarity of gene expression between phenotypes in two ways. First, we computed the intersection between genes found significant between phenotypes. Second, we correlated the expression differences using a nonparametric Spearman correlation.
We report the number of overlapping significant genes as a percentage of the number of overlapping measured genes for each pair of phenotypes (Tables 2, S1, and S2). We found that the phenotype pair with the highest number of overlapping significant genes was UC and IQ (16 genes or ~0.09% of total possible genes that overlapped, Table 2). The disease-disease pair with the largest number of overlapping significant genes was OB and UC (13 genes or 0.06% of the total possible genes). The average percent of overlapping genes between phenotypes was 0.009%.
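The overlap percentage reported for each phenotype pair can be illustrated with plain set operations. The sketch below assumes two hypothetical dictionaries, one mapping each phenotype label to the set of genes measured in its study and one mapping it to the set of genes called significant; neither the names nor the data layout come from the paper.

```python
from itertools import combinations


def overlap_percentages(measured, significant):
    """For each phenotype pair, count genes significant in both phenotypes and
    express that count as a percentage of the genes measured in both.

    measured and significant map a phenotype label to a set of gene symbols.
    """
    result = {}
    for a, b in combinations(sorted(measured), 2):
        shared_measured = measured[a] & measured[b]
        shared_significant = significant[a] & significant[b] & shared_measured
        if shared_measured:
            pct = 100.0 * len(shared_significant) / len(shared_measured)
        else:
            # No gene was measured in both studies, so the ratio is undefined.
            pct = float("nan")
        result[(a, b)] = (len(shared_significant), pct)
    return result
```

Using the shared measured genes as the denominator keeps pairs comparable even though the studies ran on different platforms with different gene coverage.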
The pairwise Spearman correlations between the mean expression differences for each of the phenotype pairs were modest (Table 3). The absolute value of Spearman correlation coefficients ranged from 7.9E-4 to 1.8E-1. We found no significant correlations (with an unadjusted p-value threshold of 0.05) between the mean gene expression differences.
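As a rough illustration of the correlation step, the sketch below computes a Spearman coefficient between two phenotypes over the genes measured in both. It assumes hypothetical dictionaries of gene symbol to mean expression difference, ranks ties by their average rank, and omits the significance test mentioned in the text.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; requires non-zero variance in both inputs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman_shared_genes(diffs_a, diffs_b):
    """Spearman correlation of mean differences over genes present in both dicts."""
    shared = sorted(set(diffs_a) & set(diffs_b))
    x = [diffs_a[g] for g in shared]
    y = [diffs_b[g] for g in shared]
    return pearson(average_ranks(x), average_ranks(y))
```

Restricting to shared genes matters here because the seven studies used different platforms, so each pair of phenotypes has its own set of comparable genes.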
Discordant twin gene expression is heterogeneous across seven phenotypes. We hypothesized that it is possible to identify shared environmental etiology between phenotypes by identifying genes across multiple phenotypes. We performed meta-analysis (using the DerSimonian and Laird meta-analytic technique [8]) on each gene across all seven possible phenotypes to (1) estimate the overall mean difference of each gene across seven phenotypes (genes putatively expressed in greater than one phenotype) and (2) estimate how each gene's mean expression difference varied across all of the seven studies (gene expression heterogeneity). The empirical cumulative distribution of meta-analyzed mean differences is shown in Fig. S2. The I² (heterogeneity) estimates versus the negative log (base 10) of FDR-corrected QEp values (measure of significance of I² different from 0) are depicted in Fig. S3 and the empirical cumulative distribution plot of I² values is shown in Fig. 3.
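For a single gene, the DerSimonian and Laird random-effects estimate and the I² statistic can be sketched as below. The inputs (one mean difference and one sampling variance per study) and the returned field names are assumptions made for illustration. The p-value for the pooled effect uses a normal approximation, and the p-value of the Q test (the QEp quantity in the text) is left out because it needs a chi-square distribution that the Python standard library does not provide.

```python
from math import sqrt
from statistics import NormalDist


def dersimonian_laird(effects, variances):
    """Random-effects meta-analysis of one gene across studies.

    effects: per-study mean expression differences.
    variances: per-study sampling variances of those differences.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # Method-of-moments between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = sqrt(1.0 / sum(w_star))
    # I^2: share of total variation attributed to heterogeneity, in percent.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    z = pooled / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return {"pooled": pooled, "se": se, "tau2": tau2,
            "q": q, "i2": i2, "p_value": p_value}
```

For example, two studies with differences 0.0 and 1.0 and variance 0.1 each give Q = 5, a between-study variance of 0.4, a pooled difference of 0.5, and I² = 80%.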
First, we discuss genes that were expressed over all phenotypes in discordant twins. We identified 19 out of the 25,154 total genes (0.08%) that were differentially expressed in discordant twin samples across multiple phenotypes (FDR-corrected p-value of mean difference less than 0.05, mean difference greater than the absolute value effect size threshold of the 95th percentile, and measured in more than one study; Fig. S2). The top significant differentially expressed genes (significant genes that were measured for multiple phenotypes) included those that are involved in keratinization such as KRTAP19-5 (Keratin Associated Protein 19-5) and KRTAP20-2 (Keratin Associated Protein 20-2). A third was FGF6 (Fibroblast Growth Factor 6), a gene involved in normal muscle regeneration, all with FDR values less than 3.7E-4. We found no genes that were significant overall (FDR-corrected p-value of mean difference <0.05, mean difference greater than the absolute value effect size threshold of the 95th percentile, and measured in more than one study) and that were also significant in individual disease phenotypes.

| Study identifier | Reference(s) | Number of Twin Pairs | Phenotype | Number of Genes | Platform | Sample Source (Tissue and cell lines) | Source |
|---|---|---|---|---|---|---|---|
| GSE22619 | Lepage et al. [20], Häsler et al. [21] | 10 | ulcerative colitis (UC) | 22836 | GPL570 | Primary mucosal tissue, colon | GEO |
| GSE16059 | Byrnes et al. [10] | 44 | chronic fatigue syndrome (CFS) | 22836 | GPL570 | Peripheral venous blood | GEO |
| GSE20319 | Leskinen et al. [22] | 10 | physical activity (PA) | 19429 | GPL6884 | Musculus vastus lateralis | GEO |
| GSE33476 | Yu et al. [23] | 17 | intelligence quotient (IQ) | 18638 | GPL6244 | Lymphoblastoid cell lines | GEO |
| GSE37146 | Sjogren et al. [24] | 11 | intermittent allergic rhinitis (in vitro) (IAR_invitro) | 19580 | GPL6102 | Peripheral blood mononuclear cells | GEO |
| MDD(dbGAP) | Wright et al. [25] | 28 | major depressive disorder (MDD) | 19284 | GPL13667 | Peripheral blood | dbGAP |
| E-MEXP-1425 | Pietiläinen et al. [26] | 13 | obesity (OB) | 22836 | GPL570 | Adipose tissue | Array Express |

Table 1. Summary of Datasets. This table shows the phenotype and number of genes being measured, sample size, platform, tissue, source, and reference paper for each of the seven studies.

Figure 2. Volcano plots for seven phenotypes. The mean differences versus the negative log (base 10) of FDR for the seven phenotypes (each with greater than 10 twin pairs). The blue color indicates FDR significant genes (FDR < 0.05) and the red color indicates FDR nonsignificant genes. The black lines indicate the effect size thresholds (95th percentile of absolute value of mean expression differences for each phenotype).

Phenotype PA UC IAR_invitro CFS IQ MDD OB
PA 0.08 0.01 0.00 0.00 0.00 0.00 0.00
UC 0.01 1.86 0.01 0.00 0.09 0.01 0.06
IAR_invitro 0.00 0.01 0.37 0.00 0.01 0.00 0.00
CFS 0.00 0.00 0.00 0.01 0.00 0.00 0.00
IQ 0.00 0.09 0.01 0.00 3.63 0.00 0.01
MDD 0.00 0.01 0.00 0.00 0.00 0.03 0.01
OB 0.00 0.06 0.00 0.00 0.01 0.01 0.59
Table 2. Percentages of Overlaps of Significant (FDR < 0.05 and Absolute Value Effect Size Threshold of 95th percentile) Genes. This table shows the percentages of overlapping significant genes in phenotype pairs out of the total overlapping measured genes in those pairs.
Second, we discuss overall heterogeneity of the differentially expressed genes. Out of all the 25,154 genes measured, 2,401 genes (10%) were found to have FDR-corrected QEp (measure of significance of I²) values less than 0.05, corresponding with I² values of greater than 68%. None of the overall significant (FDR-corrected p-value of mean difference less than 0.05, mean difference greater than the absolute value effect size threshold of the 95th percentile, and measured in more than one study) genes were found to also have FDR-corrected QEp values less than 0.05. Also, we found 40% of all measured genes to have I² values of 0. In fact, 11 out of the 19 significant genes were found to have I² values equal to 0. The gene with the highest I² estimate (47%) was
Phenotype PA UC IAR_invitro CFS MDD IQ OB
PA 1.00 0.00 0.02 0.02 0.01 0.02 0.00
UC 0.00 1.00 0.01 0.18 0.04 0.07 0.04
IAR_invitro 0.02 0.01 1.00 0.00 0.03 0.04 0.00
CFS 0.02 0.18 0.00 1.00 0.18 0.09 0.13
MDD 0.01 0.04 0.03 0.18 1.00 0.03 0.04
IQ 0.02 0.07 0.04 0.09 0.03 1.00 0.02
OB 0.00 0.04 0.00 0.13 0.04 0.02 1.00
Table 3. Spearman correlations of mean gene expression differences between phenotypes. This table shows the Spearman correlations between seven phenotypes in each phenotype pair.
Figure 3. Empirical Cumulative Distribution Function Plot of I² values. The distribution of all measured genes (from the seven studies) among their I² values.

References

Controlling the false discovery rate: a practical and powerful approach to multiple testing
TL;DR: In this paper, a different approach to problems of multiple significance testing is presented, which calls for controlling the expected proportion of falsely rejected hypotheses - the false discovery rate, which is equivalent to the FWER when all hypotheses are true but is smaller otherwise.

Measuring inconsistency in meta-analyses
TL;DR: A new quantity is developed, I², which the authors believe gives a better measure of the consistency between trials in a meta-analysis, which is susceptible to the number of trials included in the meta-analysis.

On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other
TL;DR: In this paper, the authors show that the limit distribution is normal if n, n$ go to infinity in any arbitrary manner, where n = m = 8 and n = n = 8.

Gene Expression Omnibus: NCBI gene expression and hybridization array data repository
TL;DR: The Gene Expression Omnibus (GEO) project was initiated in response to the growing demand for a public repository for high-throughput gene expression data and provides a flexible and open design that facilitates submission, storage and retrieval of heterogeneous data sets from high-power gene expression and genomic hybridization experiments.
Frequently Asked Questions (10)
Q1. What contributions have the authors mentioned in the paper "Integrated analysis of gene expression differences in twins discordant for disease and binary phenotypes"?

One way to identify environmental influences is through a discordant monozygotic (MZ) twin study design. Second, the authors performed meta-analysis for each gene to identify commonalities and differences in gene expression signatures between the seven phenotypes. The authors make their work accessible by providing a new database (DiscTwinExprDB: http://apps.chiragjpgroup.org/disctwinexprdb/) for investigators to study non-genotypic influence on gene expression.

In the future, the authors aim to collect twin data in an unbiased manner to ascertain the role of reverse causation in expression to deconvolve the role of phenotype on gene expression change. The authors hope that their work will inspire future studies to further understand the role of the environment in multiple phenotypes, eventually leading to the identification of environment-specific influences in multiple disease phenotypes. 

The authors hypothesize that the use of this technique provides better power than alternative transcript-level methods; in fact, the authors showed that they were able to detect more genes (a total of 10 significant genes) than the Byrnes et al. [10] investigation (which detected none).

To enhance comparison among the seven studies, the authors also computed the rank order of the expression differences in each study (available in the R Shiny web application: http://apps.chiragjpgroup.org/disctwinexprdb/). Last, the authors also performed FDR (False Discovery Rate [Benjamini-Hochberg] [7]) correction on the p-values for each gene for each study.
