Capturing heterogeneity in gene expression studies by surrogate variable analysis.

doi:10.1371/JOURNAL.PGEN.0030161

Journal Article•DOI•

Jeffrey T. Leek¹, John D. Storey¹•Institutions (1)

University of Washington¹

Sep 2007-PLOS Genetics (Public Library of Science)-Vol. 3, Iss: 9, pp 1724-1735

TL;DR: This work introduces “surrogate variable analysis” (SVA) to overcome the problems caused by heterogeneity in expression studies and shows that SVA increases the biological accuracy and reproducibility of analyses in genome-wide expression studies.

Abstract: It has unambiguously been shown that genetic, environmental, demographic, and technical factors may have substantial effects on gene expression levels. In addition to the measured variable(s) of interest, there will tend to be sources of signal due to factors that are unknown, unmeasured, or too complicated to capture through simple models. We show that failing to incorporate these sources of heterogeneity into an analysis can have widespread and detrimental effects on the study. Not only can this reduce power or induce unwanted dependence across genes, but it can also introduce sources of spurious signal to many genes. This phenomenon is true even for well-designed, randomized studies. We introduce “surrogate variable analysis” (SVA) to overcome the problems caused by heterogeneity in expression studies. SVA can be applied in conjunction with standard analysis techniques to accurately capture the relationship between expression and any modeled variables of interest. We apply SVA to disease class, time course, and genetics of gene expression studies. We show that SVA increases the biological accuracy and reproducibility of analyses in genome-wide expression studies.
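
As a rough illustration of the idea behind surrogate variables, and not the SVA algorithm itself (which the abstract does not spell out), the Python sketch below regresses the modeled variable out of each gene and takes the leading singular vectors of the residuals as stand-ins for unmodeled factors. The function name `residual_surrogates`, the simulated data, and the choice of a plain residual SVD are all assumptions made for this example.

```python
import numpy as np


def residual_surrogates(expression, primary, n_surrogates):
    """Leading sample-space singular vectors of the residual matrix.

    expression: genes x samples array; primary: one value per sample.
    Hypothetical helper for illustration, not the published SVA procedure.
    """
    y = np.asarray(expression, dtype=float)
    design = np.column_stack([np.ones(y.shape[1]), np.asarray(primary, dtype=float)])
    # Fit intercept + primary variable to every gene at once.
    coef, *_ = np.linalg.lstsq(design, y.T, rcond=None)
    residuals = y.T - design @ coef  # samples x genes
    u, s, vt = np.linalg.svd(residuals, full_matrices=False)
    return u[:, :n_surrogates]


# Simulated study: a measured two-group variable plus an unmeasured factor.
rng = np.random.default_rng(1)
n_genes = 500
group = np.repeat([0.0, 1.0], 10)    # variable of interest
hidden = np.tile([1.0, -1.0], 10)    # unmodeled factor, balanced within groups
expression = (
    np.outer(rng.normal(size=n_genes), group)
    + np.outer(2.0 * rng.normal(size=n_genes), hidden)
    + rng.normal(size=(n_genes, group.size))
)

surrogates = residual_surrogates(expression, group, n_surrogates=1)
agreement = abs(np.corrcoef(surrogates[:, 0], hidden)[0, 1])
```

In this toy setting the estimated vector tracks the hidden factor closely, and it could be added as a covariate alongside `group` in a standard per-gene analysis.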

Citations

Journal Article•DOI•

The sva package for removing batch effects and other unwanted variation in high-throughput experiments

Jeffrey T. Leek¹, W. Evan Johnson², Hilary S. Parker¹, Andrew E. Jaffe¹, John D. Storey¹•Institutions (2)

Johns Hopkins University¹, Boston University²

01 Mar 2012-Bioinformatics

TL;DR: The sva package is described, which supports surrogate variable estimation with the sva function, direct adjustment for known batch effects with the ComBat function and adjustment for batch and latent variables in prediction problems with the fsva function.

Abstract: Summary: Heterogeneity and latent variables are now widely recognized as major sources of bias and variability in high-throughput experiments. The most well-known source of latent variation in genomic experiments are batch effects—when samples are processed on different days, in different groups or by different people. However, there are also a large number of other variables that may have a major impact on high-throughput measurements. Here we describe the sva package for identifying, estimating and removing unwanted sources of variation in high-throughput experiments. The sva package supports surrogate variable estimation with the sva function, direct adjustment for known batch effects with the ComBat function and adjustment for batch and latent variables in prediction problems with the fsva function. Availability: The R package sva is freely available from http://www.bioconductor.org. Contact: jleek@jhsph.edu. Supplementary information: Supplementary data are available at Bioinformatics online.

3,343 citations

Cites background from "Capturing heterogeneity in gene exp..."

...Removing batch effects and using surrogate variables in differential expression analysis have been shown to reduce dependence, stabilize error rate estimates, and improve reproducibility (see [2, 3, 4] for more detailed information)....
[...]

Journal Article•DOI•

Genetic effects on gene expression across human tissues.

Enhancing GTEx (eGTEx) groups¹, NIH Common Fund², NHGRI, Biospecimen Core Resource—VARI, ELSI study, Genome Browser Data Integration Visualization—EBI, Lead analysts, Alexis Battle³, Christopher D. Brown⁴, Barbara E. Engelhardt¹, Stephen B. Montgomery²•Institutions (4)

Princeton University¹, Stanford University², Johns Hopkins University³, University of Pennsylvania⁴

12 Oct 2017-Nature

TL;DR: It is found that local genetic variation affects gene expression levels for the majority of genes, and inter-chromosomal genetic effects for 93 genes and 112 loci are identified, enabling a mechanistic interpretation of gene regulation and the genetic basis of disease.

Abstract: Characterization of the molecular function of the human genome and its variation across individuals is essential for identifying the cellular mechanisms that underlie human genetic traits and diseases. The Genotype-Tissue Expression (GTEx) project aims to characterize variation in gene expression levels across individuals and diverse tissues of the human body, many of which are not easily accessible. Here we describe genetic effects on gene expression levels across 44 human tissues. We find that local genetic variation affects gene expression levels for the majority of genes, and we further identify inter-chromosomal genetic effects for 93 genes and 112 loci. On the basis of the identified genetic effects, we characterize patterns of tissue specificity, compare local and distal effects, and evaluate the functional properties of the genetic effects. We also demonstrate that multi-tissue, multi-individual data can be used to identify genes and pathways affected by human disease-associated variation, enabling a mechanistic interpretation of gene regulation and the genetic basis of disease.

3,289 citations

Journal Article•DOI•

Fast, sensitive and accurate integration of single-cell data with Harmony.

Ilya Korsunsky, Nghia Millard, Jean Fan¹, Kamil Slowikowski, Fan Zhang, Kevin Wei², Yuriy Baglaenko, Michael B. Brenner², Po-Ru Loh¹,²,³, Soumya Raychaudhuri•Institutions (3)

Harvard University¹, Brigham and Women's Hospital², Broad Institute³

18 Nov 2019-Nature Methods

TL;DR: Harmony, for the integration of single-cell transcriptomic data, identifies broad and fine-grained populations, scales to large datasets, and can integrate sequencing- and imaging-based data.

Abstract: The emerging diversity of single-cell RNA-seq datasets allows for the full transcriptional characterization of cell types across a wide variety of biological and clinical conditions. However, it is challenging to analyze them together, particularly when datasets are assayed with different technologies, because biological and technical differences are interspersed. We present Harmony (https://github.com/immunogenomics/harmony), an algorithm that projects cells into a shared embedding in which cells group by cell type rather than dataset-specific conditions. Harmony simultaneously accounts for multiple experimental and biological factors. In six analyses, we demonstrate the superior performance of Harmony to previously published algorithms while requiring fewer computational resources. Harmony enables the integration of ~10⁶ cells on a personal computer. We apply Harmony to peripheral blood mononuclear cells from datasets with large experimental differences, five studies of pancreatic islet cells, mouse embryogenesis datasets and the integration of scRNA-seq with spatial transcriptomics data. Harmony, for the integration of single-cell transcriptomic data, identifies broad and fine-grained populations, scales to large datasets, and can integrate sequencing- and imaging-based data.

2,459 citations

Journal Article•DOI•

DNA methylation arrays as surrogate measures of cell mixture distribution

Eugene Andres Houseman¹, William P. Accomando², Devin C. Koestler³, Brock C. Christensen³, Carmen J. Marsit³, Heather H. Nelson⁴, John K. Wiencke⁵, Karl T. Kelsey²•Institutions (5)

Oregon State University¹, Brown University², Dartmouth College³, University of Minnesota⁴, University of California, San Francisco⁵

08 May 2012-BMC Bioinformatics

TL;DR: This work presents a method, similar to regression calibration, for inferring changes in the distribution of white blood cells between different subpopulations using DNA methylation signatures, in combination with a previously obtained external validation set consisting of signatures from purified leukocyte samples.

Abstract: Background: There has been a long-standing need in biomedical research for a method that quantifies the normally mixed composition of leukocytes beyond what is possible by simple histological or flow cytometric assessments. The latter is restricted by the labile nature of protein epitopes, requirements for cell processing, and timely cell analysis. In a diverse array of diseases and following numerous immune-toxic exposures, leukocyte composition will critically inform the underlying immuno-biology to most chronic medical conditions. Emerging research demonstrates that DNA methylation is responsible for cellular differentiation, and when measured in whole peripheral blood, serves to distinguish cancer cases from controls. Results: Here we present a method, similar to regression calibration, for inferring changes in the distribution of white blood cells between different subpopulations (e.g. cases and controls) using DNA methylation signatures, in combination with a previously obtained external validation set consisting of signatures from purified leukocyte samples. We validate the fundamental idea in a cell mixture reconstruction experiment, then demonstrate our method on DNA methylation data sets from several studies, including data from a Head and Neck Squamous Cell Carcinoma (HNSCC) study and an ovarian cancer study. Our method produces results consistent with prior biological findings, thereby validating the approach. Conclusions: Our method, in combination with an appropriate external validation set, promises new opportunities for large-scale immunological studies of both disease states and noxious exposures.

2,431 citations

Cites methods from "Capturing heterogeneity in gene exp..."

...ordinary least squares, linear mixed effects models [16], limma [17], or surrogate variable analysis [18,19], to obtain estimates B̂0 and B̂1....
[...]

Singular Value Decomposition for Genome-Wide Expression Data Processing and Modeling

Orly Alter¹, Patrick O. Brown, David Botstein•Institutions (1)

Stanford University¹

01 Mar 2001

TL;DR: Using singular value decomposition in transforming genome-wide expression data from genes x arrays space to reduced diagonalized "eigengenes" x "eigenarrays" space gives a global picture of the dynamics of gene expression, in which individual genes and arrays appear to be classified into groups of similar regulation and function, or similar cellular state and biological phenotype.

Abstract: We describe the use of singular value decomposition in transforming genome-wide expression data from genes × arrays space to reduced diagonalized “eigengenes” × “eigenarrays” space, where the eigengenes (or eigenarrays) are unique orthonormal superpositions of the genes (or arrays). Normalizing the data by filtering out the eigengenes (and eigenarrays) that are inferred to represent noise or experimental artifacts enables meaningful comparison of the expression of different genes across different arrays in different experiments. Sorting the data according to the eigengenes and eigenarrays gives a global picture of the dynamics of gene expression, in which individual genes and arrays appear to be classified into groups of similar regulation and function, or similar cellular state and biological phenotype, respectively. After normalization and sorting, the significant eigengenes and eigenarrays can be associated with observed genome-wide effects of regulators, or with measured samples, in which these regulators are overactive or underactive, respectively.
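
The decomposition and filtering step described in this abstract can be sketched in a few lines of Python with NumPy. This is a minimal sketch on simulated data: the helper names `eigen_decompose` and `filter_out` are made up for the example, and which component counts as an artifact is simply assumed here rather than inferred.

```python
import numpy as np


def eigen_decompose(expression):
    """SVD of a genes x arrays matrix.

    Returns the eigenarrays (columns of u), the singular values, the
    eigengenes (rows of vt), and each component's share of the total
    squared signal.
    """
    data = np.asarray(expression, dtype=float)
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    fractions = s**2 / np.sum(s**2)
    return u, s, vt, fractions


def filter_out(expression, drop):
    """Rebuild the matrix with the listed components set to zero."""
    u, s, vt, _ = eigen_decompose(expression)
    kept = s.copy()
    kept[np.asarray(list(drop), dtype=int)] = 0.0
    return (u * kept) @ vt


# Simulated data: a steady per-gene offset across all arrays plus noise.
rng = np.random.default_rng(0)
genes, arrays = 200, 12
offset = np.outer(5.0 * rng.normal(size=genes), np.ones(arrays))
data = offset + rng.normal(size=(genes, arrays))

u, s, vt, fractions = eigen_decompose(data)
cleaned = filter_out(data, drop=[0])  # assume component 0 is the artifact
```

Here `fractions` shows how much of the overall signal each eigengene carries, which is one simple way to decide which components to inspect before filtering.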

1,815 citations

References

Journal Article•DOI•

Cluster analysis and display of genome-wide expression patterns

Michael B. Eisen¹, Paul T. Spellman¹, Patrick O. Brown¹, David Botstein¹•Institutions (1)

Stanford University¹

08 Dec 1998-Proceedings of the National Academy of Sciences of the United States of America

TL;DR: A system of cluster analysis for genome-wide expression data from DNA microarray hybridization is described that uses standard statistical algorithms to arrange genes according to similarity in pattern of gene expression, finding in the budding yeast Saccharomyces cerevisiae that clustering gene expression data groups together efficiently genes of known similar function.

Abstract: A system of cluster analysis for genome-wide expression data from DNA microarray hybridization is described that uses standard statistical algorithms to arrange genes according to similarity in pattern of gene expression. The output is displayed graphically, conveying the clustering and the underlying expression data simultaneously in a form intuitive for biologists. We have found in the budding yeast Saccharomyces cerevisiae that clustering gene expression data groups together efficiently genes of known similar function, and we find a similar tendency in human data. Thus patterns seen in genome-wide expression experiments can be interpreted as indications of the status of cellular processes. Also, coexpression of genes of known function with poorly characterized or novel genes may provide a simple means of gaining leads to the functions of many genes for which information is not available currently.

16,371 citations

Journal Article•DOI•

Generalized Additive Models.

R. A. Brown, Trevor Hastie, Robert Tibshirani

01 Jun 1991-Biometrics

9,941 citations

Journal Article•DOI•

Principal components analysis corrects for stratification in genome-wide association studies

Alkes L. Price¹,², Nick Patterson², Robert M. Plenge²,³, Michael E. Weinblatt³, Nancy A. Shadick³, David Reich¹,²•Institutions (3)

Harvard University¹, Broad Institute², Brigham and Women's Hospital³

23 Jul 2006-Nature Genetics

TL;DR: This work describes a method that enables explicit detection and correction of population stratification on a genome-wide scale and uses principal components analysis to explicitly model ancestry differences between cases and controls.

Abstract: Population stratification—allele frequency differences between cases and controls due to systematic ancestry differences—can cause spurious associations in disease studies. We describe a method that enables explicit detection and correction of population stratification on a genome-wide scale. Our method uses principal components analysis to explicitly model ancestry differences between cases and controls. The resulting correction is specific to a candidate marker’s variation in frequency across ancestral populations, minimizing spurious associations while maximizing power to detect true associations. Our simple, efficient approach can easily be applied to disease studies with hundreds of thousands of markers. Population stratification—allele frequency differences between cases and controls due to systematic ancestry differences—can cause spurious associations in disease studies [1–8]. Because the effects of stratification vary in proportion to the number of samples [9], stratification will be an increasing problem in the large-scale association studies of the future, which will analyze thousands of samples in an effort to detect common genetic variants of weak effect. The two prevailing methods for dealing with stratification are genomic control and structured association [9–14]. Although genomic control and structured association have proven useful in a variety of contexts, they have limitations. Genomic control corrects for stratification by adjusting association statistics at each marker by a uniform overall inflation factor. However, some markers differ in their allele frequencies across ancestral populations more than others. Thus, the uniform adjustment applied by genomic control may be insufficient at markers having unusually strong differentiation across ancestral populations and may be superfluous at markers devoid of such differentiation, leading to a loss in power.
Structured association uses a program such as STRUCTURE [15] to assign the samples to discrete subpopulation clusters and then aggregates evidence of association within each cluster. If fractional membership in more than one cluster is allowed, the method cannot currently be applied to genome-wide association studies because of its intensive computational cost on large data sets. Furthermore, assignments of individuals to clusters are highly sensitive to the number of clusters, which is not well defined [14,16].
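
To make the marker-specific correction concrete, here is a minimal Python sketch under stated assumptions: ancestry axes are taken as the leading principal components of a centered genotype matrix, both the candidate genotype and the phenotype are residualized on those axes, and the test statistic is (N - K - 1) times the squared correlation of the residuals. The function names, the simulated two-population data, and the exact form of the statistic are choices made for this illustration and should be checked against the paper before reuse.

```python
import numpy as np


def ancestry_axes(genotypes, n_axes):
    """Top principal components of a samples x markers 0/1/2 matrix."""
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_axes]


def residualize(values, axes):
    """Remove the intercept and the ancestry axes by least squares."""
    design = np.column_stack([np.ones(axes.shape[0]), axes])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef


def adjusted_statistic(genotype, phenotype, axes):
    """(N - K - 1) * r^2 computed on ancestry-adjusted residuals."""
    g = residualize(np.asarray(genotype, dtype=float), axes)
    y = residualize(np.asarray(phenotype, dtype=float), axes)
    r = (g @ y) / np.sqrt((g @ g) * (y @ y))
    n, k = axes.shape
    return (n - k - 1) * r**2


# Two simulated populations with different allele frequencies and trait means.
rng = np.random.default_rng(2)
population = np.repeat([0, 1], 100)
freq = np.where(population == 1, 0.8, 0.2)
panel = rng.binomial(2, np.repeat(freq[:, None], 300, axis=1))
candidate = rng.binomial(2, freq).astype(float)  # no causal effect
phenotype = 2.0 * population + rng.normal(size=population.size)

axes = ancestry_axes(panel, n_axes=1)
naive = (population.size - 1) * np.corrcoef(candidate, phenotype)[0, 1] ** 2
adjusted = adjusted_statistic(candidate, phenotype, axes)
```

Because the candidate marker differs in frequency between the two populations, the unadjusted statistic is inflated, while the adjusted one shrinks once ancestry is modeled.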

9,387 citations

"Capturing heterogeneity in gene exp..." refers methods in this paper

...To demonstrate these issues, we considered two straightforward significance analysis applications of the well-established singular value decomposition approach previously utilized in genomics [40,41]....
[...]
...[41] also performed this singular value decomposition of whole-genome SNP genotypes (coded as 0, 1, or 2) in order to account for systematic sources of variation due to population substructure....
[...]

Journal Article•DOI•

The control of the false discovery rate in multiple testing under dependency

Yoav Benjamini, Daniel Yekutieli

01 Aug 2001-Annals of Statistics

TL;DR: In this paper, it was shown that a simple FDR controlling procedure for independent test statistics can also control the false discovery rate when test statistics have positive regression dependency on each of the test statistics corresponding to the true null hypotheses.

Abstract: Benjamini and Hochberg suggest that the false discovery rate may be the appropriate error rate to control in many applied multiple testing problems. A simple procedure was given there as an FDR controlling procedure for independent test statistics and was shown to be much more powerful than comparable procedures which control the traditional familywise error rate. We prove that this same procedure also controls the false discovery rate when the test statistics have positive regression dependency on each of the test statistics corresponding to the true null hypotheses. This condition for positive dependency is general enough to cover many problems of practical interest, including the comparisons of many treatments with a single control, multivariate normal test statistics with positive correlation matrix and multivariate $t$. Furthermore, the test statistics may be discrete, and the tested hypotheses composite without posing special difficulties. For all other forms of dependency, a simple conservative modification of the procedure controls the false discovery rate. Thus the range of problems for which a procedure with proven FDR control can be offered is greatly increased.
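
The two procedures this abstract refers to can be sketched as follows. The "simple procedure" is taken here to be the Benjamini-Hochberg step-up rule, and the "conservative modification" is taken to be the same rule with the thresholds divided by the harmonic sum 1 + 1/2 + ... + 1/m; the abstract states neither formula, so treat both as this example's reading. The function name `fdr_reject`, its `dependency` argument, and the p-values are made up for illustration.

```python
import numpy as np


def fdr_reject(pvalues, alpha=0.05, dependency="independent"):
    """Step-up FDR procedure; returns a boolean rejection mask.

    dependency="independent": plain step-up thresholds alpha * i / m.
    dependency="arbitrary": thresholds further divided by sum(1/i).
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    if m == 0:
        return reject
    ranks = np.arange(1, m + 1)
    correction = np.sum(1.0 / ranks) if dependency == "arbitrary" else 1.0
    order = np.argsort(p)
    below = p[order] <= alpha * ranks / (m * correction)
    if below.any():
        # Reject every hypothesis up to the largest rank that passes.
        last = np.max(np.nonzero(below)[0])
        reject[order[: last + 1]] = True
    return reject


pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
plain = fdr_reject(pvals, alpha=0.05)
conservative = fdr_reject(pvals, alpha=0.05, dependency="arbitrary")
```

With these ten p-values the plain rule rejects the two smallest and the conservative rule rejects only the smallest, which shows the price paid for making no assumption about dependence.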

9,335 citations

Journal Article•DOI•

Statistical significance for genomewide studies

John D. Storey, Robert Tibshirani¹•Institutions (1)

Stanford University¹

05 Aug 2003-Proceedings of the National Academy of Sciences of the United States of America

TL;DR: This work proposes an approach to measuring statistical significance in genomewide studies based on the concept of the false discovery rate, which offers a sensible balance between the number of true and false positives that is automatically calibrated and easily interpreted.

Abstract: With the increase in genomewide experiments and the sequencing of multiple genomes, the analysis of large data sets has become commonplace in biology. It is often the case that thousands of features in a genomewide data set are tested against some null hypothesis, where a number of features are expected to be significant. Here we propose an approach to measuring statistical significance in these genomewide studies based on the concept of the false discovery rate. This approach offers a sensible balance between the number of true and false positives that is automatically calibrated and easily interpreted. In doing so, a measure of statistical significance called the q value is associated with each tested feature. The q value is similar to the well known p value, except it is a measure of significance in terms of the false discovery rate rather than the false positive rate. Our approach avoids a flood of false positive results, while offering a more liberal criterion than what has been used in genome scans for linkage.
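
A q value of the kind described above can be computed from a list of p values roughly as follows. This is a simplified sketch: the proportion of true nulls is estimated at a single fixed cutoff `lam` (an assumption of this example, standing in for whatever tuning the paper uses), and the function name `qvalues` and the sample p values are hypothetical.

```python
import numpy as np


def qvalues(pvalues, lam=0.5):
    """Return (q values, estimated null proportion pi0).

    pi0 is estimated from the share of p values above lam, then each
    q value is the smallest estimated FDR at which that feature would
    be called significant.
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    pi0 = min(1.0, np.mean(p > lam) / (1.0 - lam))
    order = np.argsort(p)
    # Estimated FDR when thresholding at each sorted p value.
    fdr = pi0 * m * p[order] / np.arange(1, m + 1)
    # Enforce monotonicity: a larger p value never gets a smaller q value.
    fdr = np.minimum.accumulate(fdr[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(fdr, 1.0)
    return q, pi0


pvals = np.array([0.01, 0.02, 0.03, 0.6, 0.8, 0.9, 0.7, 0.04])
q, pi0 = qvalues(pvals)
```

Calling features with q at or below 0.1, for example, aims to keep the expected share of false positives among those calls near 10%. Note that if no p value exceeds `lam`, this crude estimate gives pi0 = 0 and every q value collapses to zero, so a more careful estimator is needed in practice.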

9,239 citations

"Capturing heterogeneity in gene exp..." refers background in this paper

...The overall goal of SVA is to provide a more accurate and reproducible parsing of signal and noise in the analysis of an expression study when EH is present. One way in which signal is commonly quantified is through a significance analysis [...
[...]