
Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments

Abstract
The problem of identifying differentially expressed genes in designed microarray experiments is considered. Lönnstedt and Speed (2002) derived an expression for the posterior odds of differential expression in a replicated two-color experiment using a simple hierarchical parametric model. The purpose of this paper is to develop the hierarchical model of Lönnstedt and Speed (2002) into a practical approach for general microarray experiments with arbitrary numbers of treatments and RNA samples. The model is reset in the context of general linear models with arbitrary coefficients and contrasts of interest. The approach applies equally well to both single channel and two color microarray experiments. Consistent, closed form estimators are derived for the hyperparameters in the model. The estimators proposed have robust behavior even for small numbers of arrays and allow for incomplete data arising from spot filtering or spot quality weights. The posterior odds statistic is reformulated in terms of a moderated t-statistic in which posterior residual standard deviations are used in place of ordinary standard deviations. The empirical Bayes approach is equivalent to shrinkage of the estimated sample variances towards a pooled estimate, resulting in far more stable inference when the number of arrays is small. The use of moderated t-statistics has the advantage over the posterior odds that the number of hyperparameters which need to be estimated is reduced; in particular, knowledge of the non-null prior for the fold changes is not required. The moderated t-statistic is shown to follow a t-distribution with augmented degrees of freedom. The moderated t inferential approach extends to accommodate tests of composite null hypotheses through the use of moderated F-statistics. The performance of the methods is demonstrated in a simulation study. Results are presented for two publicly available data sets.


Statistical Applications in Genetics
and Molecular Biology
Volume 3, Issue 1 2004 Article 3
Gordon K. Smyth
Walter and Eliza Hall Institute, smyth@wehi.edu.au
Copyright © 2004 by the authors. All rights reserved. No part of this publication may be
reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic,
mechanical, photocopying, recording, or otherwise, without the prior written permission of the
publisher, bepress, which has been given certain exclusive rights by the author. Statistical
Applications in Genetics and Molecular Biology is produced by The Berkeley Electronic Press (bepress).
http://www.bepress.com/sagmb

KEYWORDS: microarrays, empirical Bayes, linear models, hyperparameters, differential expression
Walter and Eliza Hall Institute, 1G Royal Parade, Melbourne 3050, Australia,
smyth@wehi.edu.au

1 Introduction
Microarrays are a technology for comparing the expression profiles of genes on a genomic
scale across two or more RNA samples. Recent reviews of microarray data analysis
include the Nature Genetics supplement (2003), Smyth et al (2003), Parmigiani et
al (2003) and Speed (2003). This paper considers the problem of identifying genes
which are differentially expressed across specified conditions in designed microarray
experiments. This is a massive multiple testing problem in which one or more tests
are conducted for each of tens of thousands of genes. The problem is complicated by
the fact that the measured expression levels are often non-normally distributed and
have non-identical and dependent distributions between genes. This paper addresses
particularly the fact that the variability of the expression values differs between genes.
It is well established that allowance needs to be made in the analysis of microarray
experiments for the amount of multiple testing, perhaps by controlling the familywise
error rate or the false discovery rate, even though this reduces the power available
to detect changes in expression for individual genes (Ge et al, 2002). On the other
hand, the parallel nature of the inference in microarrays allows some compensating
possibilities for borrowing information from the ensemble of genes which can assist in
inference about each gene individually. One way that this can be done is through the
application of Bayes or empirical Bayes methods (Efron, 2001, 2003). Efron et al (2001)
used a non-parametric empirical Bayes approach for the analysis of factorial data with
high density oligonucleotide microarray data. This approach has much potential but can
be difficult to apply in practical situations, especially by less experienced practitioners.
Lönnstedt and Speed (2002), considering replicated two-color microarray experiments,
took instead a parametric empirical Bayes approach using a simple mixture of normal
models and a conjugate prior and derived a pleasingly simple expression for the posterior
odds of differential expression for each gene. The posterior odds expression has proved
to be a useful means of ranking genes in terms of evidence for differential expression.
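
The multiple testing adjustment mentioned in this paragraph can be illustrated with a short sketch. The text does not name a particular procedure; the Benjamini-Hochberg step-up adjustment is used here purely as one familiar way of controlling the false discovery rate, and the p-values are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical gene-wise p-values (not taken from the paper).
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted_pvals = benjamini_hochberg(pvals)
```

Genes whose adjusted p-value falls below a chosen threshold would be declared differentially expressed at that nominal false discovery rate.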
The purpose of this paper is to develop the hierarchical model of Lönnstedt and
Speed (2002) into a practical approach for general microarray experiments with arbitrary
numbers of treatments and RNA samples. The first step is to reset it in the
context of general linear models with arbitrary coefficients and contrasts of interest.
The approach applies to both single channel and two color microarrays. All of the
commonly used microarray platforms such as cDNA, long-oligos and Affymetrix are
therefore accommodated. The second step is to derive consistent, closed form estimators
for the hyperparameters using the marginal distributions of the observed statistics.
The estimators proposed here have robust behavior even for small numbers of arrays and
allow for incomplete data arising from spot filtering or spot quality weights. The third
step is to reformulate the posterior odds statistic in terms of a moderated t-statistic
in which posterior residual standard deviations are used in place of ordinary standard
deviations. This approach makes explicit what was implicit in Lönnstedt and Speed
(2002), that the hierarchical model results in a shrinkage of the gene-wise residual sample
variances towards a common value, resulting in far more stable inference when the
number of arrays is small. The use of moderated t-statistics has the advantage over
the posterior odds of reducing the number of hyperparameters which need to be estimated
under the hierarchical model; in particular, knowledge of the non-null prior for the fold
changes is not required. The moderated t-statistic is shown to follow a t-distribution
with augmented degrees of freedom. The moderated t inferential approach extends
to accommodate tests involving two or more contrasts through the use of moderated
F-statistics.
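
The shrinkage idea described in this paragraph can be sketched numerically. The sketch assumes the usual conjugate form of the posterior variance, a degrees-of-freedom weighted average of the prior variance and the sample variance, which this introduction describes only in words. The hyperparameter values `d0` and `s0_sq` are hypothetical and taken as given rather than estimated, and all function and variable names are illustrative.

```python
import math


def moderated_t(beta_hat, s2, v, d, d0, s0_sq):
    """Moderated t-statistic for one gene.

    beta_hat : estimated coefficient or contrast
    s2       : residual sample variance on d degrees of freedom
    v        : unscaled variance of beta_hat
    d0, s0_sq: prior degrees of freedom and prior variance (assumed given)
    """
    # Posterior variance: the sample variance shrunk towards the prior value.
    s2_post = (d0 * s0_sq + d * s2) / (d0 + d)
    t_mod = beta_hat / math.sqrt(s2_post * v)
    # Augmented degrees of freedom for the reference t-distribution.
    return s2_post, t_mod, d0 + d


def ordinary_t(beta_hat, s2, v):
    """Ordinary t-statistic using the unmoderated sample variance."""
    return beta_hat / math.sqrt(s2 * v)


# Two hypothetical genes with the same effect but very different variances.
d, d0, s0_sq, v = 2, 4, 0.25, 0.5
post_small, t_small, df_small = moderated_t(1.0, 0.01, v, d, d0, s0_sq)
post_large, t_large, df_large = moderated_t(1.0, 1.0, v, d, d0, s0_sq)
t_small_ordinary = ordinary_t(1.0, 0.01, v)
t_large_ordinary = ordinary_t(1.0, 1.0, v)
```

In this toy setting the gene with the tiny sample variance has its statistic pulled down sharply, while the gene with the large sample variance is pulled up slightly, which is the stabilizing effect the paragraph refers to.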
The idea of using a t-statistic with a Bayesian adjusted denominator was also pro-
posed by Baldi and Long (2001) who developed the useful cyberT program. Their work
was limited though to two-sample control versus treatment designs and their model did
not distinguish between differentially and non-differentially expressed genes. They also
did not develop consistent estimators for the hyperparameters. The degrees of freedom
associated with the prior distribution of the variances was set to a default value while
the prior variance was simply equated to locally pooled sample variances.
Tusher et al (2001), Efron et al (2001) and Broberg (2003) have used t statistics
with offset standard deviations. This is similar in principle to the moderated t-statistics
used here but the offset t-statistics are not motivated by a model and do not have an
associated distributional theory. Tusher et al (2001) estimated the offset by minimizing
a coefficient of variation while Efron et al (2001) used a percentile of the distribution
of sample standard deviations. Broberg (2003) considered the two sample problem and
proposed a computationally intensive method of determining the offset by minimizing a
combination of estimated false positive and false negative rates over a grid of significance
levels and offsets. Cui and Churchill (2003) give a review of test statistics for differential
expression for microarray experiments.
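
For comparison, an offset t-statistic of the kind described above can be sketched as follows. The offset is taken here as a percentile of the gene-wise standard errors, in the spirit of the percentile approach attributed to Efron et al (2001). The particular percentile, the nearest-rank rule and the data are assumptions made for illustration only.

```python
def offset_t(beta_hat, se, offset):
    """t-like statistic with a constant offset added to the standard error."""
    return beta_hat / (se + offset)


def nearest_rank_percentile(values, q):
    """q-th percentile (integer q in 1..100) by the nearest-rank rule."""
    ordered = sorted(values)
    k = max(1, (q * len(ordered) + 99) // 100)  # integer ceiling of q*n/100
    return ordered[k - 1]


# Hypothetical effects and standard errors for five genes.
effects = [0.5, 0.5, 1.0, 2.0, 1.0]
std_errors = [0.05, 0.2, 0.4, 0.8, 1.0]

offset = nearest_rank_percentile(std_errors, 60)  # illustrative choice
ordinary = [b / s for b, s in zip(effects, std_errors)]
shifted = [offset_t(b, s, offset) for b, s in zip(effects, std_errors)]

# Index of the top-ranked gene under each statistic.
top_ordinary = max(range(len(ordinary)), key=ordinary.__getitem__)
top_shifted = max(range(len(shifted)), key=shifted.__getitem__)
```

With these numbers the ordinary statistic ranks the gene with the smallest standard error first, whereas the offset statistic favors the gene with the largest effect.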
Newton et al (2001), Newton and Kendziorski (2003) and Kendziorski et al (2003)
have considered empirical Bayes models for expression based on gamma and log-normal
distributions. Other authors have used Bayesian methods for other purposes in mi-
croarray data analysis. Ibrahim et al (2002) for example propose Bayesian models with
correlated priors to model gene expression and to classify between normal and tumor
tissues.
Other approaches to linear models for microarray data analysis have been described
by Kerr et al (2000), Jin et al (2001), Wolfinger et al (2001), Chu et al (2002), Yang
and Speed (2003) and Lönnstedt et al (2003). Kerr et al (2000) propose a single linear
model for an entire microarray experiment whereas in this paper a separate linear model
is fitted for each gene. The single linear model approach assumes equal variances
across genes whereas the current paper is designed to accommodate different variances.
Jin et al (2001) and Wolfinger et al (2001) fit separate models for each gene but model
the individual channels of two color microarray data requiring the use of mixed linear
models to accommodate the correlation between observations on the same spot. Chu
et al (2002) propose mixed models for single channel oligonucleotide array experiments
with multiple probes per gene. The methods of the current paper assume linear models
with a single component of variance and so do not apply directly to the mixed model
approach, although ideas similar to those used here could be developed. Yang and
Speed (2003) and Lönnstedt et al (2003) take a linear modeling approach similar to
that of the current paper.
Statistical Applications in Genetics and Molecular Biology, Vol. 3 [2004], Iss. 1, Art. 3
http://www.bepress.com/sagmb/vol3/iss1/art3

Figure 1: Example designs for two color microarrays. [Figure not reproduced: four panels,
(a) to (d), of arrow diagrams connecting the RNA sources A, B and C and a reference sample Ref.]
The plan of this paper is as follows. Section 2 explains the linear modelling approach
to the analysis of designed experiments and specifies the response model and distribu-
tional assumptions. Section 3 sets out the prior assumptions and defines the posterior
variances and moderated t-statistics. Section 4 derives marginal distributions under the
hierarchical model for the observed statistics. Section 5 derives the posterior odds of
differential expression and relates it to the t-statistic. The inferential approach based
on moderated t and F statistics is elaborated in Section 6. Section 7 derives estimators
for the hyperparameters. Section 8 compares the estimators with earlier statistics in a
simulation study. Section 9 illustrates the methodology on two publicly available data
sets. Finally, Section 10 makes some remarks on available software.
2 Linear Models for Microarray Data
This section describes how gene-wise linear models arise from experimental designs and
states the distributional assumptions about the data which will be used in the remainder
of the paper. The design of any microarray experiment can be represented in terms of
a linear model for each gene. Figure 1 displays some examples of simple designs with
two-color arrays using arrow notation as in Kerr and Churchill (2001). Each arrow
represents a microarray. The arrow points towards the RNA sample which is labelled
red and the sample at the base of the arrow is labelled green. The symbols A, B and C
represent RNA sources to be compared. In experiment (a) there is only one microarray
which compares RNA samples A and B. For this experiment one can only compute the
log-ratios of expression $y_g = \log_2(R_g) - \log_2(G_g)$ where $R_g$ and $G_g$ are the red and green
intensities for gene $g$. Design (b) is a dye-swap experiment leading to a very simple
linear model with responses $y_{g1}$ and $y_{g2}$ which are log-ratios from the two microarrays
and design matrix
$$X = \begin{pmatrix} 1 \\ -1 \end{pmatrix}.$$
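
The dye-swap model can be worked through with a small least squares sketch, assuming a one-column design with entries 1 and -1 for the two arrays. The gene names and log-ratios are invented for illustration, and the helper function is hypothetical rather than part of any package.

```python
def fit_single_coefficient(x, y):
    """Least squares fit of y = x * beta + error for a one-column design x.

    Returns (beta_hat, unscaled_variance, residual_variance, residual_df).
    """
    xtx = sum(xi * xi for xi in x)
    beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / xtx
    residuals = [yi - xi * beta_hat for xi, yi in zip(x, y)]
    df = len(y) - 1  # one coefficient is estimated
    s2 = sum(r * r for r in residuals) / df if df > 0 else float("nan")
    return beta_hat, 1.0 / xtx, s2, df


# Dye-swap design: the second array has the dyes reversed.
dye_swap_design = [1, -1]

# Hypothetical log-ratios (array 1, array 2) for two genes.
log_ratios = {"gene1": [1.2, -0.8], "gene2": [0.1, 0.3]}

# A separate linear model is fitted for each gene.
fits = {
    gene: fit_single_coefficient(dye_swap_design, y)
    for gene, y in log_ratios.items()
}
```

For each gene the estimated coefficient is half the difference of the two log-ratios, and with two arrays and one coefficient there is a single residual degree of freedom.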

References

Significance analysis of microarrays applied to the ionizing radiation response
TL;DR: A method that assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements is described, suggesting that this repair pathway for UV-damaged DNA might play a previously unrecognized role in repairing DNA damaged by ionizing radiation.

limma: Linear Models for Microarray Data
TL;DR: This chapter starts with the simplest replicated designs and progresses through experiments with two or more groups, direct designs, factorial designs and time course experiments with technical as well as biological replication.

Summaries of Affymetrix GeneChip probe level data
TL;DR: It is found that the performance of the current version of the default expression measure provided by Affymetrix Microarray Suite can be significantly improved by the use of probe level summaries derived from empirically motivated statistical models.

Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection
TL;DR: A statistical model is proposed for the probe-level data, and model-based estimates for gene expression indexes are developed, which help to identify and handle cross-hybridizing probes and contaminating array regions.

Variance stabilization applied to microarray data calibration and to the quantification of differential expression
TL;DR: A statistical model for microarray gene expression data that comprises data calibration, the quantifying of differential expression, and the quantification of measurement error is introduced, and a difference statistic Δh whose variance is approximately constant along the whole intensity range is derived.