
Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments

Gordon K. Smyth

- 12 Feb 2004 -

Statistical Applications in Genetics and...

- Vol. 3, Iss: 1, pp 1-25


TLDR

The hierarchical model of Lönnstedt and Speed (2002) is developed into a practical approach for general microarray experiments with arbitrary numbers of treatments and RNA samples, and the moderated t-statistic is shown to follow a t-distribution with augmented degrees of freedom.

Abstract:

The problem of identifying differentially expressed genes in designed microarray experiments is considered. Lönnstedt and Speed (2002) derived an expression for the posterior odds of differential expression in a replicated two-color experiment using a simple hierarchical parametric model. The purpose of this paper is to develop the hierarchical model of Lönnstedt and Speed (2002) into a practical approach for general microarray experiments with arbitrary numbers of treatments and RNA samples. The model is reset in the context of general linear models with arbitrary coefficients and contrasts of interest. The approach applies equally well to both single channel and two color microarray experiments. Consistent, closed form estimators are derived for the hyperparameters in the model. The estimators proposed have robust behavior even for small numbers of arrays and allow for incomplete data arising from spot filtering or spot quality weights. The posterior odds statistic is reformulated in terms of a moderated t-statistic in which posterior residual standard deviations are used in place of ordinary standard deviations. The empirical Bayes approach is equivalent to shrinkage of the estimated sample variances towards a pooled estimate, resulting in far more stable inference when the number of arrays is small. The use of moderated t-statistics has the advantage over the posterior odds that the number of hyperparameters which need to be estimated is reduced; in particular, knowledge of the non-null prior for the fold changes is not required. The moderated t-statistic is shown to follow a t-distribution with augmented degrees of freedom. The moderated t inferential approach extends to accommodate tests of composite null hypotheses through the use of moderated F-statistics. The performance of the methods is demonstrated in a simulation study. Results are presented for two publicly available data sets.


Statistical Applications in Genetics and Molecular Biology

Volume 3, Issue 1, 2004, Article 3


Statistical Applications in Genetics and Molecular Biology is produced by The Berkeley Electronic Press (bepress).

http://www.bepress.com/sagmb


KEYWORDS: microarrays, empirical Bayes, linear models, hyperparameters, differential expression

Gordon K. Smyth, Walter and Eliza Hall Institute, 1G Royal Parade, Melbourne 3050, Australia, smyth@wehi.edu.au

1 Introduction

Microarrays are a technology for comparing the expression profiles of genes on a genomic scale across two or more RNA samples. Recent reviews of microarray data analysis include the Nature Genetics supplement (2003), Smyth et al (2003), Parmigiani et al (2003) and Speed (2003). This paper considers the problem of identifying genes which are differentially expressed across specified conditions in designed microarray experiments. This is a massive multiple testing problem in which one or more tests are conducted for each of tens of thousands of genes. The problem is complicated by the fact that the measured expression levels are often non-normally distributed and have non-identical and dependent distributions between genes. This paper addresses particularly the fact that the variability of the expression values differs between genes.

It is well established that allowance needs to be made in the analysis of microarray experiments for the amount of multiple testing, perhaps by controlling the familywise error rate or the false discovery rate, even though this reduces the power available to detect changes in expression for individual genes (Ge et al, 2002). On the other hand, the parallel nature of the inference in microarrays allows some compensating possibilities for borrowing information from the ensemble of genes which can assist in inference about each gene individually. One way that this can be done is through the application of Bayes or empirical Bayes methods (Efron, 2001, 2003). Efron et al (2001) used a non-parametric empirical Bayes approach for the analysis of factorial data with high density oligonucleotide microarray data. This approach has much potential but can be difficult to apply in practical situations, especially by less experienced practitioners. Lönnstedt and Speed (2002), considering replicated two-color microarray experiments, took instead a parametric empirical Bayes approach using a simple mixture of normal models and a conjugate prior and derived a pleasingly simple expression for the posterior odds of differential expression for each gene. The posterior odds expression has proved to be a useful means of ranking genes in terms of evidence for differential expression.

The purpose of this paper is to develop the hierarchical model of Lönnstedt and Speed (2002) into a practical approach for general microarray experiments with arbitrary numbers of treatments and RNA samples. The first step is to reset it in the context of general linear models with arbitrary coefficients and contrasts of interest. The approach applies to both single channel and two color microarrays. All of the commonly used microarray platforms such as cDNA, long-oligos and Affymetrix are therefore accommodated. The second step is to derive consistent, closed form estimators for the hyperparameters using the marginal distributions of the observed statistics. The estimators proposed here have robust behavior even for small numbers of arrays and allow for incomplete data arising from spot filtering or spot quality weights. The third step is to reformulate the posterior odds statistic in terms of a moderated t-statistic in which posterior residual standard deviations are used in place of ordinary standard deviations. This approach makes explicit what was implicit in Lönnstedt and Speed (2002), that the hierarchical model results in a shrinkage of the gene-wise residual sample variances towards a common value, resulting in far more stable inference when the number of arrays is small. The use of moderated t-statistics has the advantage over the posterior odds of reducing the number of hyperparameters which need to be estimated under the hierarchical model; in particular, knowledge of the non-null prior for the fold changes is not required. The moderated t-statistic is shown to follow a t-distribution with augmented degrees of freedom. The moderated t inferential approach extends to accommodate tests involving two or more contrasts through the use of moderated F-statistics.
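The shrinkage of gene-wise variances towards a common value can be illustrated with a minimal sketch. It assumes the usual form of the posterior variance under this kind of hierarchical model, $\tilde s_g^2 = (d_0 s_0^2 + d_g s_g^2)/(d_0 + d_g)$, which is not stated in this section; the function names and the example numbers are hypothetical and are not taken from the paper.

```python
import math

def posterior_variance(s2_g, d_g, s2_0, d_0):
    """Shrink the sample variance s2_g (on d_g df) towards the prior value s2_0 (on d_0 df).

    Assumed form: a degrees-of-freedom weighted average of the two variances.
    """
    return (d_0 * s2_0 + d_g * s2_g) / (d_0 + d_g)

def ordinary_t(beta_hat, v_g, s2_g):
    """Ordinary t-statistic: estimate divided by its standard error."""
    return beta_hat / math.sqrt(s2_g * v_g)

def moderated_t(beta_hat, v_g, s2_g, d_g, s2_0, d_0):
    """Same ratio, but with the posterior variance in the denominator."""
    return beta_hat / math.sqrt(posterior_variance(s2_g, d_g, s2_0, d_0) * v_g)

# Hypothetical gene with an unusually small sample variance on 2 df.
beta_hat, v_g, s2_g, d_g = 0.5, 0.25, 0.01, 2
s2_0, d_0 = 0.04, 4  # hypothetical prior variance and prior df

s2_tilde = posterior_variance(s2_g, d_g, s2_0, d_0)
t_ord = ordinary_t(beta_hat, v_g, s2_g)
t_mod = moderated_t(beta_hat, v_g, s2_g, d_g, s2_0, d_0)
df_mod = d_0 + d_g  # augmented degrees of freedom

print(f"posterior variance = {s2_tilde:.4f}")
print(f"ordinary t = {t_ord:.2f} on {d_g} df")
print(f"moderated t = {t_mod:.2f} on {df_mod} df")
```

In this toy case the very small sample variance inflates the ordinary t-statistic, while the moderated version is pulled back towards a more plausible value and gains degrees of freedom.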

The idea of using a t-statistic with a Bayesian adjusted denominator was also proposed by Baldi and Long (2001) who developed the useful cyberT program. Their work was limited though to two-sample control versus treatment designs and their model did not distinguish between differentially and non-differentially expressed genes. They also did not develop consistent estimators for the hyperparameters. The degrees of freedom associated with the prior distribution of the variances was set to a default value while the prior variance was simply equated to locally pooled sample variances.

Tusher et al (2001), Efron et al (2001) and Broberg (2003) have used t statistics with offset standard deviations. This is similar in principle to the moderated t-statistics used here but the offset t-statistics are not motivated by a model and do not have an associated distributional theory. Tusher et al (2001) estimated the offset by minimizing a coefficient of variation while Efron et al (2001) used a percentile of the distribution of sample standard deviations. Broberg (2003) considered the two sample problem and proposed a computationally intensive method of determining the offset by minimizing a combination of estimated false positive and false negative rates over a grid of significance levels and offsets. Cui and Churchill (2003) give a review of test statistics for differential expression for microarray experiments.
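For comparison, an offset t-statistic of the kind mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the statistic is taken to be the effect divided by the standard error plus a constant offset, the offset is chosen as a percentile of the standard errors, and the 90th percentile, the helper names and the data are all hypothetical rather than the choices of any of the cited authors.

```python
import math

def offset_t(effect, se, offset):
    """t-like score with a constant added to the standard error."""
    return effect / (se + offset)

def percentile(values, q):
    """Nearest-rank percentile (0 < q <= 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical effects and standard errors for five genes.
effects = [0.1, 0.4, 0.3, 0.2, 0.3]
ses = [0.02, 0.10, 0.15, 0.20, 0.30]

offset = percentile(ses, 90)  # hypothetical choice of percentile
ordinary = [e / s for e, s in zip(effects, ses)]
scores = [offset_t(e, s, offset) for e, s in zip(effects, ses)]

# Index of the top-ranked gene under each statistic.
top_ordinary = ordinary.index(max(ordinary))
top_offset = scores.index(max(scores))
print(top_ordinary, top_offset)
```

The gene with the tiny standard error leads the ordinary ranking but not the offset ranking, which is the stabilizing effect these statistics aim for. Unlike the moderated t-statistic, nothing in this construction supplies a reference distribution for the scores.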

Newton et al (2001), Newton and Kendziorski (2003) and Kendziorski et al (2003) have considered empirical Bayes models for expression based on gamma and log-normal distributions. Other authors have used Bayesian methods for other purposes in microarray data analysis. Ibrahim et al (2002) for example propose Bayesian models with correlated priors to model gene expression and to classify between normal and tumor tissues.

Other approaches to linear models for microarray data analysis have been described by Kerr et al (2000), Jin et al (2001), Wolfinger et al (2001), Chu et al (2002), Yang and Speed (2003) and Lönnstedt et al (2003). Kerr et al (2000) propose a single linear model for an entire microarray experiment whereas in this paper a separate linear model is fitted for each gene. The single linear model approach assumes equal variances across all genes whereas the current paper is designed to accommodate different variances. Jin et al (2001) and Wolfinger et al (2001) fit separate models for each gene but model the individual channels of two color microarray data, requiring the use of mixed linear models to accommodate the correlation between observations on the same spot. Chu et al (2002) propose mixed models for single channel oligonucleotide array experiments with multiple probes per gene. The methods of the current paper assume linear models with a single component of variance and so do not apply directly to the mixed model approach, although ideas similar to those used here could be developed. Yang and Speed (2003) and Lönnstedt et al (2003) take a linear modeling approach similar to that of the current paper.

Statistical Applications in Genetics and Molecular Biology, Vol. 3 [2004], Iss. 1, Art. 3

http://www.bepress.com/sagmb/vol3/iss1/art3

Figure 1: Example designs for two color microarrays. Panels (a) and (b); nodes are labelled A, B and Ref.

The plan of this paper is as follows. Section 2 explains the linear modelling approach to the analysis of designed experiments and specifies the response model and distributional assumptions. Section 3 sets out the prior assumptions and defines the posterior variances and moderated t-statistics. Section 4 derives marginal distributions under the hierarchical model for the observed statistics. Section 5 derives the posterior odds of differential expression and relates it to the t-statistic. The inferential approach based on moderated t and F statistics is elaborated in Section 6. Section 7 derives estimators for the hyperparameters. Section 8 compares the estimators with earlier statistics in a simulation study. Section 9 illustrates the methodology on two publicly available data sets. Finally, Section 10 makes some remarks on available software.

2 Linear Models for Microarray Data

This section describes how gene-wise linear models arise from experimental designs and states the distributional assumptions about the data which will be used in the remainder of the paper. The design of any microarray experiment can be represented in terms of a linear model for each gene. Figure 1 displays some examples of simple designs with two-color arrays using arrow notation as in Kerr and Churchill (2001). Each arrow represents a microarray. The arrow points towards the RNA sample which is labelled red and the sample at the base of the arrow is labelled green. The symbols A, B and C represent RNA sources to be compared. In experiment (a) there is only one microarray which compares RNA samples A and B. For this experiment one can only compute the log-ratios of expression $y_g = \log_2(R_g) - \log_2(G_g)$ where $R_g$ and $G_g$ are the red and green intensities for gene $g$. Design (b) is a dye-swap experiment leading to a very simple linear model with responses $y_{g1}$ and $y_{g2}$ which are log-ratios from the two microarrays and design matrix $X = \begin{pmatrix} 1 \\ -1 \end{pmatrix}$.
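The dye-swap model can be made concrete with a small least squares sketch. It assumes base-2 log-ratios and a design matrix equal to the single column $(1, -1)^T$; the intensities and function names are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

def log_ratio(red, green):
    """Log-ratio of expression for one spot, on the base-2 scale (assumed)."""
    return math.log2(red) - math.log2(green)

def fit_single_coefficient(x, y):
    """Least squares fit of y = x * beta + error with a single coefficient.

    Returns (beta_hat, unscaled_variance, residual_variance, residual_df).
    """
    xtx = sum(xi * xi for xi in x)
    beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / xtx
    residuals = [yi - xi * beta_hat for xi, yi in zip(x, y)]
    df = len(y) - 1
    s2 = sum(r * r for r in residuals) / df if df > 0 else float("nan")
    return beta_hat, 1.0 / xtx, s2, df

# Hypothetical intensities for one gene on the two dye-swapped arrays.
y1 = log_ratio(red=2048.0, green=256.0)  # array 1
y2 = log_ratio(red=256.0, green=512.0)   # array 2, dyes swapped

design = [1.0, -1.0]  # single column of the assumed design matrix
beta_hat, v, s2, df = fit_single_coefficient(design, [y1, y2])
print(beta_hat, v, s2, df)
```

With this design the coefficient estimate is simply $(y_{g1} - y_{g2})/2$, and the two arrays leave one residual degree of freedom for estimating the variance, whereas the single array of design (a) leaves none.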


Figures

Figure 1: Example designs for two color microarrays.

Table 3: Top 30 genes from the Swirl data

Table 1: Area under the Receiver Operating Curve for five statistics and three simulation scenarios.

Figure 4: False discovery rates for different gene selection statistics when the true variances are somewhat different, i.e., the prior and residual degrees of freedom are balanced. The rates are means of actual false discovery rates for 100 simulated data sets.

Table 4: Top 15 genes from the ApoAI data

Table 2: Means (standard deviations) of hyperparameter estimates for the simulated data sets. True values are $d_0/(d_0 + d_g) = 0.2, 0.5, 0.9960$, $s_0^2 = 4$, $v_0 = 2$.

Citations

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2
Journal article - Michael I. Love, +3 more - 05 Dec 2014 - Genome Biology
TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.
Journal article - Mark D. Robinson, +2 more - 01 Jan 2010 - Bioinformatics
TL;DR: edgeR is a Bioconductor software package for examining differential expression of replicated count data. It uses an overdispersed Poisson model to account for both biological and technical variability, and empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference.

limma powers differential expression analyses for RNA-sequencing and microarray studies
Journal article - Matthew E. Ritchie, +7 more - 20 Apr 2015 - Nucleic Acids Research
TL;DR: The philosophy and design of the limma package is reviewed, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2
Posted content - Michael I. Love, +2 more - 17 Nov 2014 - bioRxiv

Differential expression analysis for sequence count data.
Journal article - Simon Anders, +1 more - 27 Oct 2010 - Genome Biology
TL;DR: A method based on the negative binomial distribution, with variance and mean linked by local regression, is proposed and an implementation, DESeq, as an R/Bioconductor package is presented.

References

Significance analysis of microarrays applied to the ionizing radiation response
Journal article - Virginia Goss Tusher, +2 more - 24 Apr 2001 - Proceedings of the National Academy of S...
TL;DR: A method that assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements is described, suggesting that this repair pathway for UV-damaged DNA might play a previously unrecognized role in repairing DNA damaged by ionizing radiation.

limma: Linear Models for Microarray Data
Book chapter - Gordon K. Smyth
TL;DR: This chapter starts with the simplest replicated designs and progresses through experiments with two or more groups, direct designs, factorial designs and time course experiments with technical as well as biological replication.

Summaries of Affymetrix GeneChip probe level data
Journal article - Rafael A. Irizarry, +6 more - 15 Feb 2003 - Nucleic Acids Research
TL;DR: It is found that the performance of the current version of the default expression measure provided by Affymetrix Microarray Suite can be significantly improved by the use of probe level summaries derived from empirically motivated statistical models.

Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection
Journal article - Cheng Li, +1 more - 02 Jan 2001 - Proceedings of the National Academy of S...
TL;DR: A statistical model is proposed for the probe-level data, and model-based estimates for gene expression indexes are developed, which help to identify and handle cross-hybridizing probes and contaminating array regions.

Variance stabilization applied to microarray data calibration and to the quantification of differential expression.
Journal article - Wolfgang Huber, +4 more
TL;DR: A statistical model for microarray gene expression data that comprises data calibration, the quantifying of differential expression, and the quantification of measurement error is introduced, and a difference statistic Deltah whose variance is approximately constant along the whole intensity range is derived.

Related Papers (5)

Controlling the false discovery rate: a practical and powerful approach to multiple testing
Yoav Benjamini, +1 more - 01 Jan 1995 - Journal of the royal statistical society...

Bioconductor: open software development for computational biology and bioinformatics
Robert Gentleman, +24 more - 15 Sep 2004 - Genome Biology

Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles
Aravind Subramanian, +10 more - 25 Oct 2005 - Proceedings of the National Academy of S...

Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources.
Da-Wei Huang, +2 more - 01 Jan 2009 - Nature Protocols

Significance analysis of microarrays applied to the ionizing radiation response
Virginia Goss Tusher, +2 more - 24 Apr 2001 - Proceedings of the National Academy of S...

Frequently Asked Questions (17)

Q1. What have the authors contributed in "Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments"?

The purpose of this paper is to develop the hierarchical model of Lönnstedt and Speed (2002) into a practical approach for general microarray experiments with arbitrary numbers of treatments and RNA samples. The moderated t-statistic is shown to follow a t-distribution with augmented degrees of freedom. The performance of the methods is demonstrated in a simulation study.

Q2. What is the simplest way to solve f(y) = x?

To avoid overflow or underflow in floating point arithmetic, the authors can set $y = 1/\sqrt{x}$ when $x > 10^7$ and $y = 1/x$ when $x < 10^{-6}$ instead of performing the iteration.

Q3. What is the third step to reformulate the posterior odds statistic?

The third step is to reformulate the posterior odds statistic in terms of a moderated t-statistic in which posterior residual standard deviations are used in place of ordinary standard deviations.

Q4. What is the main goal of the Swirl experiment?

The main goal of the Swirl experiment is to identify genes with altered expression in the Swirl mutant compared to wild-type zebrafish.

Q5. What is the advantage of the moderated t statistic?

The moderated t has the advantage over the B-statistic that $B_{gj}$ depends on hyperparameters $v_{0j}$ and $p_j$ for all $j$ as well as $d_0$ and $s_0^2$, whereas $\tilde t_{gj}$ depends only on $d_0$ and $s_0^2$.

Q6. What is the unscaled variance for the contrasts of interest?

The unscaled variance for the contrasts of interest is estimated to be v0 = 3.4 meaning that the typical fold change for differentially expressed genes is estimated to be about 1.3.

Q7. What is the advantage of the moderated t inferential approach?

The moderated t inferential approach extends to accommodate tests involving two or more contrasts through the use of moderated F-statistics.

Q8. What is the estimate of v0?

Restricting to those values of $r$ for which $(r - 0.5)/(2G) < p$ ensures also that $p_{\text{target}} < 1$, so that the estimator of $v_0$ is defined.

Q9. What is the cumulative distribution function of tg?

The cumulative distribution function of $\tilde t_g$ is $F(\tilde t_g; v_g, v_0, d_0 + d_g) = p\,F\!\left(\tilde t_g \{v_g/(v_g + v_0)\}^{1/2};\, d_0 + d_g\right) + (1 - p)\,F(\tilde t_g;\, d_0 + d_g)$, where $F(\cdot\,; k)$ is the cumulative distribution function of the t-distribution on $k$ degrees of freedom.

Q10. What is the unscaled variance for the contrast?

The estimated unscaled variance for the contrast is $v_0 = 22.7$, meaning that the standard deviation of the log-ratio for a typical gene is $(0.0509)^{1/2}(22.7)^{1/2} = 1.07$, i.e., genes which are differentially expressed typically change by about two-fold.

Q11. Why is the posterior variance s2g offset?

This is because the posterior variance s̃2g offsets the small sample variances heavily in a relative sense while larger sample variances are moderated to a lesser relative degree.

Q12. is the covariance matrix assumed to be dependent on g?

If so, the covariance matrix is assumed to be evaluated at $\hat\alpha_g$ and the dependence is assumed to be such that it can be ignored to a first order approximation.

Q13. What is the purpose of this paper?

The purpose of this paper is to develop the hierarchical model of Lönnstedt and Speed (2002) into a practical approach for general microarray experiments with arbitrary numbers of treatments and RNA samples.

Q14. How many different facilities have been used to test the Limma software?

The Limma software has been tested on a wide range of microarray data sets from many different facilities and has been used routinely at the author’s institution since the middle of 2002.

Q15. What is the coefficient of contrast for design (a)?

The regression coefficient here estimates the contrast B − A on the log-scale, just as for design (a), but with two arrays there is one degree of freedom for error.

Q16. What is the current paper designed to do?

The single linear model approach assumes all equal variances across genes whereas the current paper is designed to accommodate different variances.

Q17. What is the limit for v1/2 0j s0?

In the software package Limma, which implements the methods in this paper, the user is allowed to place limits on the possible values for $v_{0j}^{1/2} s_0$.
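The arithmetic in the answer to Q10 can be checked with a few lines. This sketch assumes that 0.0509 is the estimated prior variance (the answer does not label it) and that log-ratios are on the base-2 scale, so that a log-ratio of 1 corresponds to a two-fold change; the variable names are hypothetical.

```python
import math

s2_0 = 0.0509  # assumed to be the estimated prior variance
v_0 = 22.7     # estimated unscaled variance for the contrast

# Standard deviation of the log-ratio for a typical differentially expressed gene.
sd_log_ratio = math.sqrt(s2_0) * math.sqrt(v_0)

# Corresponding fold change, assuming base-2 log-ratios.
fold_change = 2 ** sd_log_ratio

print(f"sd of log-ratio = {sd_log_ratio:.2f}")
print(f"fold change = {fold_change:.1f}")
```

The standard deviation rounds to 1.07 as quoted, and a base-2 log-ratio slightly above 1 corresponds to a change of roughly two-fold.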
