Book Chapter

limma: Linear Models for Microarray Data

01 Jan 2005-pp 397-420
TL;DR: This chapter starts with the simplest replicated designs and progresses through experiments with two or more groups, direct designs, factorial designs and time course experiments with technical as well as biological replication.
Abstract: A survey is given of differential expression analyses using the linear modeling features of the limma package. The chapter starts with the simplest replicated designs and progresses through experiments with two or more groups, direct designs, factorial designs and time course experiments. Experiments with technical as well as biological replication are considered. Empirical Bayes test statistics are explained. The use of quality weights, adaptive background correction and control spots in conjunction with linear modelling is illustrated on the β7 data.


limma:
Linear Models for Microarray and RNA-Seq Data
User’s Guide
Gordon K. Smyth, Matthew Ritchie, Natalie Thorne,
James Wettenhall, Wei Shi and Yifang Hu
Bioinformatics Division, The Walter and Eliza Hall Institute
of Medical Research, Melbourne, Australia
First edition 2 December 2002
Last revised 14 November 2021
This free open-source software implements academic research
by the authors and co-workers. If you use it, please support
the project by citing the appropriate journal articles listed in
Section 2.1.

Contents
1 Introduction
2 Preliminaries
2.1 Citing limma
2.2 Installation
2.3 How to get help
3 Quick Start
3.1 A brief introduction to R
3.2 Sample limma Session
3.3 Data Objects
4 Reading Microarray Data
4.1 Scope of this Chapter
4.2 Recommended Files
4.3 The Targets Frame
4.4 Reading Two-Color Intensity Data
4.5 Reading Single-Channel Agilent Intensity Data
4.6 Reading Illumina BeadChip Data
4.7 Image-derived Spot Quality Weights
4.8 Reading Probe Annotation
4.9 Printer Layout
4.10 The Spot Types File
5 Quality Assessment
6 Pre-Processing Two-Color Data
6.1 Background Correction
6.2 Within-Array Normalization
6.3 Between-Array Normalization
6.4 Using Objects from the marray Package
7 Filtering unexpressed probes
8 Linear Models Overview
8.1 Introduction
8.2 Single-Channel Designs
8.3 Common Reference Designs
8.4 Direct Two-Color Designs
9 Single-Channel Experimental Designs
9.1 Introduction
9.2 Two Groups
9.3 Several Groups
9.4 Additive Models and Blocking
9.4.1 Paired Samples
9.4.2 Blocking
9.5 Interaction Models: 2 × 2 Factorial Designs
9.5.1 Questions of Interest
9.5.2 Analysing as for a Single Factor
9.5.3 A Nested Interaction Formula
9.5.4 Classic Interaction Models
9.6 Time Course Experiments
9.6.1 Replicated time points
9.6.2 Many time points
9.7 Multi-level Experiments
10 Two-Color Experiments with a Common Reference
10.1 Introduction
10.2 Two Groups
10.3 Several Groups
11 Direct Two-Color Experimental Designs
11.1 Introduction
11.2 Simple Comparisons
11.2.1 Replicate Arrays
11.2.2 Dye Swaps
11.3 A Correlation Approach to Technical Replication
12 Separate Channel Analysis of Two-Color Data
13 Statistics for Differential Expression
13.1 Summary Top-Tables
13.2 Fitted Model Objects
13.3 Multiple Testing Across Contrasts
14 Array Quality Weights
14.1 Introduction
14.2 Example 1
14.3 Example 2
14.4 When to Use Array Weights
15 RNA-Seq Data
15.1 Introduction
15.2 Making a count matrix
15.3 Normalization and filtering
15.4 Differential expression: limma-trend
15.5 Differential expression: voom
15.6 Voom with sample quality weights
15.7 Differential splicing
16 Two-Color Case Studies
16.1 Swirl Zebrafish: A Single-Group Experiment
16.2 Apoa1 Knockout Mice: A Two-Group Common-Reference Experiment
16.3 Weaver Mutant Mice: A Composite 2x2 Factorial Experiment
16.3.1 Background
16.3.2 Sample Preparation and Hybridizations
16.3.3 Data input
16.3.4 Annotation
16.3.5 Quality Assessment and Normalization
16.3.6 Setting Up the Linear Model
16.3.7 Probe Filtering and Array Quality Weights
16.3.8 Differential expression
16.4 Bob1 Mutant Mice: Arrays With Duplicate Spots
17 Single-Channel Case Studies
17.1 Lrp Mutant E. Coli Strain with Affymetrix Arrays
17.1.1 Background
17.1.2 Downloading the data
17.1.3 Background correction and normalization
17.1.4 Gene annotation
17.1.5 Differential expression
17.2 Effect of Estrogen on Breast Cancer Tumor Cells: A 2x2 Factorial Experiment with Affymetrix Arrays
17.3 Comparing Mammary Progenitor Cell Populations with Illumina BeadChips
17.3.1 Introduction
17.3.2 The target RNA samples
17.3.3 The expression profiles
17.3.4 How many probes are truly expressed?
17.3.5 Normalization and filtering
17.3.6 Within-patient correlations
17.3.7 Differential expression between cell types
17.3.8 Signature genes for luminal progenitor cells
17.4 Time Course Effects of Corn Oil on Rat Thymus with Agilent 4x44K Arrays
17.4.1 Introduction
17.4.2 Data availability
17.4.3 Reading the data
17.4.4 Gene annotation
17.4.5 Background correction and normalize
17.4.6 Gene filtering
17.4.7 Differential expression
17.4.8 Gene ontology analysis
18 RNA-Seq Case Studies
18.1 Profiles of Yoruba HapMap Individuals
18.1.1 Background
18.1.2 Data availability
18.1.3 Yoruba Individuals and FASTQ Files
18.1.4 Mapping reads to the reference genome
18.1.5 Annotation
18.1.6 DGEList object
18.1.7 Filtering
18.1.8 Scale normalization
18.1.9 Linear modeling
18.1.10 Gene set testing
18.1.11 Session information
18.1.12 Acknowledgements
18.2 Differential Splicing after Pasilla Knockdown
18.2.1 Background
18.2.2 GEO samples and SRA Files
18.2.3 Mapping reads to the reference genome
18.2.4 Read counts for exons
18.2.5 Assemble DGEList and sum counts for technical replicates
18.2.6 Gene annotation
18.2.7 Filtering
18.2.8 Scale normalization
18.2.9 Linear modelling
18.2.10 Alternate splicing
18.2.11 Session information
18.2.12 Acknowledgements

Citations
Journal Article
TL;DR: The philosophy and design of the limma package is reviewed, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.
Abstract: limma is an R/Bioconductor software package that provides an integrated solution for analysing data from gene expression experiments. It contains rich features for handling complex experimental designs and for information borrowing to overcome the problem of small sample sizes. Over the past decade, limma has been a popular choice for gene discovery through differential expression analyses of microarray and high-throughput PCR data. The package contains particularly strong facilities for reading, normalizing and exploring such data. Recently, the capabilities of limma have been significantly expanded in two important directions. First, the package can now perform both differential expression and differential splicing analyses of RNA sequencing (RNA-seq) data. All the downstream analysis tools previously restricted to microarray data are now available for RNA-seq as well. These capabilities allow users to analyse both RNA-seq and microarray data with very similar pipelines. Second, the package is now able to go past the traditional gene-wise expression analyses in a variety of ways, analysing expression profiles in terms of co-regulated sets of genes or in terms of higher-order expression signatures. This provides enhanced possibilities for biological interpretation of gene expression differences. This article reviews the philosophy and design of the limma package, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.

22,147 citations


Cites methods from "limma: Linear Models for Microarray..."

  • ...The limma package is a core component of Bioconductor, an R-based open-source software development project in statistical genomics [16, 64]....

    [...]

  • ...Both observation-level [56, 64, 24] and sample-specific weights [54] can be used in an analysis....

    [...]

Journal Article
TL;DR: A method based on the negative binomial distribution, with variance and mean linked by local regression, is proposed and an implementation, DESeq, as an R/Bioconductor package is presented.
Abstract: High-throughput sequencing assays such as RNA-Seq, ChIP-Seq or barcode counting provide quantitative readouts in the form of count data. To infer differential signal in such data correctly and with good statistical power, estimation of data variability throughout the dynamic range and a suitable error model are required. We propose a method based on the negative binomial distribution, with variance and mean linked by local regression and present an implementation, DESeq, as an R/Bioconductor package.

13,356 citations


Cites methods from "limma: Linear Models for Microarray..."

  • ...An empirical Bayes procedure, similar to the one originally developed for the limma package [ 24-26 ], determines how to combine these two sources of information optimally....

    [...]

Journal Article
TL;DR: The hierarchical model of Lonnstedt and Speed (2002) is developed into a practical approach for general microarray experiments with arbitrary numbers of treatments and RNA samples and the moderated t-statistic is shown to follow a t-distribution with augmented degrees of freedom.
Abstract: The problem of identifying differentially expressed genes in designed microarray experiments is considered. Lonnstedt and Speed (2002) derived an expression for the posterior odds of differential expression in a replicated two-color experiment using a simple hierarchical parametric model. The purpose of this paper is to develop the hierarchical model of Lonnstedt and Speed (2002) into a practical approach for general microarray experiments with arbitrary numbers of treatments and RNA samples. The model is reset in the context of general linear models with arbitrary coefficients and contrasts of interest. The approach applies equally well to both single channel and two color microarray experiments. Consistent, closed form estimators are derived for the hyperparameters in the model. The estimators proposed have robust behavior even for small numbers of arrays and allow for incomplete data arising from spot filtering or spot quality weights. The posterior odds statistic is reformulated in terms of a moderated t-statistic in which posterior residual standard deviations are used in place of ordinary standard deviations. The empirical Bayes approach is equivalent to shrinkage of the estimated sample variances towards a pooled estimate, resulting in far more stable inference when the number of arrays is small. The use of moderated t-statistics has the advantage over the posterior odds that the number of hyperparameters which need to be estimated is reduced; in particular, knowledge of the non-null prior for the fold changes is not required. The moderated t-statistic is shown to follow a t-distribution with augmented degrees of freedom. The moderated t inferential approach extends to accommodate tests of composite null hypotheses through the use of moderated F-statistics. The performance of the methods is demonstrated in a simulation study. Results are presented for two publicly available data sets.

11,864 citations


Cites background or methods from "limma: Linear Models for Microarray..."

  • ...The responses are assumed to be suitably normalized to remove dye-bias and other technological artifacts; see for example Huber et al (2002) or Smyth and Speed (2003). In the case of high density oligonucleotide array, the probes are assumed to have been normalized to produce an expression summary, represented here as ygi, for each gene on each array as in Li and Wong (2001) or Irizarry et al (2003)....

    [...]

  • ...In many gene discovery experiments for which microarrays are used the primary aim is to rank the genes in order of evidence against H0 rather than to assign absolute p-values (Smyth et al, 2003)....

    [...]

  • ...Recent reviews of microarray data analysis include the Nature Genetics supplement (2003), Smyth et al (2003), Parmigiani et al (2003) and Speed (2003). This paper considers the problem of identifying genes which are differentially expressed across specified conditions in designed microarray experiments....

    [...]

  • ...The methods described in this paper, including linear models and contrasts as well as moderated t and F statistics and posterior odds, are implemented in the software package Limma for the R computing environment (Smyth et al, 2003)....

    [...]

Journal Article
12 Aug 2015-eLife
TL;DR: It is shown that recently reported non-canonical sites do not mediate repression despite binding the miRNA, which indicates that the vast majority of functional sites are canonical.
Abstract: Proteins are built by using the information contained in molecules of messenger RNA (mRNA). Cells have several ways of controlling the amounts of different proteins they make. For example, a so-called ‘microRNA’ molecule can bind to an mRNA molecule to cause it to be more rapidly degraded and less efficiently used, thereby reducing the amount of protein built from that mRNA. Indeed, microRNAs are thought to help control the amount of protein made from most human genes, and biologists are working to predict the amount of control imparted by each microRNA on each of its mRNA targets. All RNA molecules are made up of a sequence of bases, each commonly known by a single letter—‘A’, ‘U’, ‘C’ or ‘G’. These bases can each pair up with one specific other base—‘A’ pairs with ‘U’, and ‘C’ pairs with ‘G’. To direct the repression of an mRNA molecule, a region of the microRNA known as a ‘seed’ binds to a complementary sequence in the target mRNA. ‘Canonical sites’ are regions in the mRNA that contain the exact sequence of partner bases for the bases in the microRNA seed. Some canonical sites are more effective at mRNA control than others. ‘Non-canonical sites’ also exist in which the pairing between the microRNA seed and mRNA does not completely match. Previous work has suggested that many non-canonical sites can also control mRNA degradation and usage. Agarwal et al. first used large experimental datasets from many sources to investigate microRNA activity in more detail. As expected, when mRNAs had canonical sites that matched the microRNA, mRNA levels and usage tended to drop. However, no effect was observed when the mRNAs only had recently identified non-canonical sites. This suggests that microRNAs primarily bind to canonical sites to control protein production. Based on these results, Agarwal et al. further developed a statistical model that predicts the effects of microRNAs binding to canonical sites. 
The updated model considers 14 different features of the microRNA, microRNA site, or mRNA—including the mRNA sequence around the site—to predict which sites within mRNAs are most effectively targeted by microRNAs. Tests showed that Agarwal et al.'s model was as good as experimental approaches at identifying the effective target sites, and was better than existing computational models. The model has been used to power the latest version of a freely available resource called TargetScan, and so could prove a valuable resource for researchers investigating the many important roles of microRNAs in controlling protein production.

5,365 citations


Cites background from "limma: Linear Models for Microarray..."

  • ...6.9 (Smyth, 2004, 2005), computing differential expression information with the provided eBayes function....

    [...]

Journal Article
TL;DR: Application of GOseq to a prostate cancer data set shows that GOseq dramatically changes the results, highlighting categories more consistent with the known biology.
Abstract: We present GOseq, an application for performing Gene Ontology (GO) analysis on RNA-seq data. GO analysis is widely used to reduce complexity and highlight biological processes in genome-wide expression studies, but standard methods give biased results on RNA-seq data due to over-detection of differential expression for long and highly expressed transcripts. Application of GOseq to a prostate cancer data set shows that GOseq dramatically changes the results, highlighting categories more consistent with the known biology.

5,034 citations

References
Journal Article
TL;DR: In this paper, a different approach to problems of multiple significance testing is presented, which calls for controlling the expected proportion of falsely rejected hypotheses (the false discovery rate). This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise.
Abstract: The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented. It calls for controlling the expected proportion of falsely rejected hypotheses, the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise. Therefore, in problems where the control of the false discovery rate rather than that of the FWER is desired, there is potential for a gain in power. A simple sequential Bonferroni-type procedure is proved to control the false discovery rate for independent test statistics, and a simulation study shows that the gain in power is substantial. The use of the new procedure and the appropriateness of the criterion are illustrated with examples.

83,420 citations


"limma: Linear Models for Microarray..." refers methods in this paper

  • ...The most popular form of adjustment is "fdr" which is Benjamini and Hochberg’s method to control the false discovery rate [5]....

    [...]
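
As a rough illustration of the "fdr" adjustment quoted above, the following base-R sketch applies Benjamini and Hochberg's step-up adjustment by hand and compares it with `p.adjust`, where `"fdr"` is an alias for `"BH"`. The p-values are invented for illustration and do not come from any limma analysis.

```r
# Hypothetical raw p-values for six genes (illustrative only)
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.60)

# Benjamini-Hochberg by hand: scale the i-th smallest p-value by n/i,
# then enforce monotonicity working down from the largest p-value
n <- length(p)
o <- order(p, decreasing = TRUE)
bh.manual <- numeric(n)
bh.manual[o] <- pmin(1, cummin(n / (n:1) * p[o]))

# Built-in adjustment; "fdr" is an alias for "BH" in p.adjust
bh.builtin <- p.adjust(p, method = "fdr")

print(cbind(p, bh.manual, bh.builtin))
```

Genes whose adjusted p-value falls below a chosen cutoff would then be selected, with the cutoff interpreted as the expected proportion of false discoveries among them.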

Journal Article
TL;DR: The hierarchical model of Lonnstedt and Speed (2002) is developed into a practical approach for general microarray experiments with arbitrary numbers of treatments and RNA samples and the moderated t-statistic is shown to follow a t-distribution with augmented degrees of freedom.
Abstract: The problem of identifying differentially expressed genes in designed microarray experiments is considered. Lonnstedt and Speed (2002) derived an expression for the posterior odds of differential expression in a replicated two-color experiment using a simple hierarchical parametric model. The purpose of this paper is to develop the hierarchical model of Lonnstedt and Speed (2002) into a practical approach for general microarray experiments with arbitrary numbers of treatments and RNA samples. The model is reset in the context of general linear models with arbitrary coefficients and contrasts of interest. The approach applies equally well to both single channel and two color microarray experiments. Consistent, closed form estimators are derived for the hyperparameters in the model. The estimators proposed have robust behavior even for small numbers of arrays and allow for incomplete data arising from spot filtering or spot quality weights. The posterior odds statistic is reformulated in terms of a moderated t-statistic in which posterior residual standard deviations are used in place of ordinary standard deviations. The empirical Bayes approach is equivalent to shrinkage of the estimated sample variances towards a pooled estimate, resulting in far more stable inference when the number of arrays is small. The use of moderated t-statistics has the advantage over the posterior odds that the number of hyperparameters which need to be estimated is reduced; in particular, knowledge of the non-null prior for the fold changes is not required. The moderated t-statistic is shown to follow a t-distribution with augmented degrees of freedom. The moderated t inferential approach extends to accommodate tests of composite null hypotheses through the use of moderated F-statistics. The performance of the methods is demonstrated in a simulation study. Results are presented for two publicly available data sets.

11,864 citations


"limma: Linear Models for Microarray..." refers methods in this paper

  • ...Empirical Bayes and other shrinkage methods are used to borrow information across genes, making the analyses stable even for experiments with a small number of arrays [1, 2]....

    [...]

  • ...The moderated t-statistic is $\tilde{t}_{jk} = \hat{\beta}_{jk} / (u_{jk} \tilde{s}_j)$. This can be shown to follow a t-distribution on $f_0 + f_j$ degrees of freedom if $\beta_{jk} = 0$ [1]....

    [...]

  • ...inference about each individual gene [1]....

    [...]

  • ...Limma uses linear models to analyze designed microarray experiments [3, 1]....

    [...]
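
The moderated t-statistic quoted in the excerpts above can be sketched numerically in base R. The posterior variance used here, (f0*s0^2 + fj*sj^2)/(f0 + fj), is the shrinkage estimator from the empirical Bayes model of Smyth (2004), written in the f0, fj notation of the excerpt. All numbers, including the prior values `f0` and `s0.sq`, are invented for illustration; in a real analysis they would be estimated from the data rather than supplied by hand.

```r
# Illustrative inputs for one gene j and coefficient k (all values made up)
beta.hat <- 1.2    # estimated log-fold-change
u        <- 0.5    # unscaled standard error of the coefficient
s.sq     <- 0.30   # ordinary residual variance for this gene
f.j      <- 4      # residual degrees of freedom
s0.sq    <- 0.10   # prior variance (assumed here, normally estimated)
f0       <- 3      # prior degrees of freedom (assumed here)

# Posterior variance: weighted average of the prior and gene-wise variances
s.tilde.sq <- (f0 * s0.sq + f.j * s.sq) / (f0 + f.j)

# Ordinary and moderated t-statistics
t.ordinary  <- beta.hat / (u * sqrt(s.sq))
t.moderated <- beta.hat / (u * sqrt(s.tilde.sq))

# Two-sided p-values: the moderated statistic gains f0 degrees of freedom
p.ordinary  <- 2 * pt(-abs(t.ordinary),  df = f.j)
p.moderated <- 2 * pt(-abs(t.moderated), df = f0 + f.j)

print(c(t.ordinary = t.ordinary, t.moderated = t.moderated))
print(c(p.ordinary = p.ordinary, p.moderated = p.moderated))
```

In this made-up case the gene-wise variance is larger than the prior, so shrinkage pulls it down and the moderated statistic is larger than the ordinary one; a gene with an unusually small variance would be moved in the opposite direction.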

Journal Article
TL;DR: This article proposes normalization methods that are based on robust local regression and account for intensity and spatial dependence in dye biases for different types of cDNA microarray experiments.
Abstract: There are many sources of systematic variation in cDNA microarray experiments which affect the measured gene expression levels (e.g. differences in labeling efficiency between the two fluorescent dyes). The term normalization refers to the process of removing such variation. A constant adjustment is often used to force the distribution of the intensity log ratios to have a median of zero for each slide. However, such global normalization approaches are not adequate in situations where dye biases can depend on spot overall intensity and/or spatial location within the array. This article proposes normalization methods that are based on robust local regression and account for intensity and spatial dependence in dye biases for different types of cDNA microarray experiments. The selection of appropriate controls for normalization is discussed and a novel set of controls (microarray sample pool, MSP) is introduced to aid in intensity-dependent normalization. Lastly, to allow for comparisons of expression levels across slides, a robust method based on maximum likelihood estimation is proposed to adjust for scale differences among slides.

3,605 citations


"limma: Linear Models for Microarray..." refers background or methods in this paper

  • ...The idea of up-weighting the titration spots is in the same spirit as the composite normalization method proposed by [40] but is more flexible and generally applicable....

    [...]

  • ...A whole-library-pool means that one makes a pool of a library of probes, and prints spots from the pool at various concentrations [40]....

    [...]

Journal Article
01 Jul 2002
TL;DR: A statistical model for microarray gene expression data that comprises data calibration, the quantifying of differential expression, and the quantification of measurement error is introduced, and a difference statistic Δh whose variance is approximately constant along the whole intensity range is derived.
Abstract: We introduce a statistical model for microarray gene expression data that comprises data calibration, the quantification of differential expression, and the quantification of measurement error. In particular, we derive a transformation h for intensity measurements, and a difference statistic Δh whose variance is approximately constant along the whole intensity range. This forms a basis for statistical inference from microarray data, and provides a rational data pre-processing strategy for multivariate analyses. For the transformation h, the parametric form h(x) = arsinh(a + bx) is derived from a model of the variance-versus-mean dependence for microarray intensity data, using the method of variance stabilizing transformations. For large intensities, h coincides with the logarithmic transformation, and Δh with the log-ratio. The parameters of h together with those of the calibration between experiments are estimated with a robust variant of maximum-likelihood estimation. We demonstrate our approach on data sets from different experimental platforms, including two-colour cDNA arrays and a series of Affymetrix oligonucleotide arrays. Availability: Software is freely available for academic use as an R package at http://www.dkfz.de/abt0840/whuber

2,323 citations


"limma: Linear Models for Microarray..." refers methods in this paper

  • ...Another option is "vsn" normalization, a model-based method of stabilizing the variances which includes background correction [8, 9]....

    [...]

Journal Article
01 Dec 2003-Methods
TL;DR: Print-tip loess normalization is presented as a well-tested general-purpose normalization method that has given good results on a wide range of arrays and can be refined by using quality weights for individual spots.

2,084 citations

Frequently Asked Questions (10)
Q1. What is the way to estimate the residual variability of a gene?

Including the dye-effect in the model in this way uses up one degree of freedom which might otherwise be used to estimate the residual variability, but it is valuable if many genes show non-negligible dye-effects. 

With Affymetrix or single-channel data, or with two-color with a common reference, you will need as many coefficients as you have distinct RNA sources, no more and no less. 
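
The rule of one coefficient per distinct RNA source can be illustrated with a small base-R sketch. The targets table, the array names and the group labels below are hypothetical; only `model.matrix` from base R is used, so nothing here depends on limma itself.

```r
# Hypothetical targets: six arrays from three distinct RNA sources
targets <- data.frame(
  Array  = paste0("A", 1:6),
  Target = c("WT", "WT", "KO1", "KO1", "KO2", "KO2")
)

# Group-means parametrization: one coefficient per RNA source
f <- factor(targets$Target, levels = c("WT", "KO1", "KO2"))
design <- model.matrix(~ 0 + f)
colnames(design) <- levels(f)
print(design)

# The number of coefficients matches the number of distinct RNA sources,
# and the design matrix has full column rank
n.sources <- nlevels(f)
stopifnot(ncol(design) == n.sources, qr(design)$rank == n.sources)
```

A design matrix of this form would typically be passed to limma's `lmFit`, with comparisons between sources then expressed as contrasts of the fitted coefficients.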

Oshlack et al [23] show that loess normalization can tolerate up to about 30% asymmetric differential expression while still giving good results. 
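
To make the idea of loess normalization under asymmetric differential expression concrete, here is a small simulation sketch in base R. It is not limma's implementation: `stats::loess` with a robust fit is used as a stand-in, and the bias curve, the noise level and the 20% share of up-regulated probes are all assumptions chosen for illustration.

```r
set.seed(42)
n <- 2000

# Simulated A-values (average log-intensity) and an intensity-dependent dye bias
A    <- runif(n, min = 6, max = 14)
bias <- 0.5 * sin((A - 6) / 3)

# 20% of probes are up-regulated by one log2 unit and none are down-regulated,
# so the differential expression is entirely asymmetric
is.de <- seq_len(n) <= 0.2 * n
M <- bias + ifelse(is.de, 1, 0) + rnorm(n, sd = 0.2)

# Robust local regression of M on A; stats::loess is a stand-in here,
# not the routine that limma itself uses
fit    <- loess(M ~ A, span = 0.3, degree = 1, family = "symmetric")
M.norm <- M - fitted(fit)

# After normalization, non-DE probes should sit near zero and
# DE probes near their true log-fold-change of one
print(tapply(M.norm, is.de, median))
```

Raising the simulated share of up-regulated probes is one way to explore where a robust fit of this kind starts to break down.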

Spatial heterogeneity on individual arrays can be highlighted by examining imageplots of the background intensities, for example
> imageplot(log2(RG$Gb[,1]), RG$printer)
plots the green background for the first array. 

Marray provides some normalization methods which are not in limma including 2-D loess normalization and print-tip-scale normalization. 

If there are at least two arrays with each dye-orientation, then it is possible to estimate and adjust for any probe-specific dye effects. 
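
A dye-swap design with a dye-effect term can be sketched as follows. The example assumes four arrays comparing a mutant with wild-type, two in each dye orientation, and uses simulated log-ratios for a single probe; `lm.fit` from base R stands in for limma's `lmFit`, and the effect sizes are invented.

```r
# Hypothetical dye-swap experiment: four arrays comparing mutant vs wild-type,
# two arrays in each dye orientation (+1 = mutant in Cy5, -1 = mutant in Cy3)
design <- cbind(DyeEffect = 1, MUvsWT = c(1, -1, 1, -1))

# Simulated log-ratios for one probe: true log-fold-change 1.5, dye bias 0.4
set.seed(1)
M <- 0.4 + 1.5 * design[, "MUvsWT"] + rnorm(4, sd = 0.1)

# Least-squares fit; lm.fit is used as a base-R stand-in for limma's lmFit
fit <- lm.fit(design, M)
print(fit$coefficients)

# Two coefficients on four arrays leave two residual degrees of freedom;
# without the dye-effect column there would be three
print(fit$df.residual)
```

The intercept column absorbs the probe-specific dye bias, which is why the dye effect costs one residual degree of freedom. With only one array in each orientation, no degrees of freedom would remain to estimate the residual variability.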

In these cases one should either use global "loess" normalization or else use robust spline normalization
> MA <- normalizeWithinArrays(RG, method="robustspline")
which is an empirical Bayes compromise between print-tip and global loess normalization, with 5-parameter regression splines used in place of the loess curves. 

If you use the read.ilmn, nec or neqc functions to process Illumina BeadChip data, please cite: Shi, W, Oshlack, A, and Smyth, GK (2010). 

If the same channel has been used for the common reference throughout the experiment, then the expression log-ratios may be analysed exactly as if they were log-expression values from a single channel experiment. 

The TIFF images have then been processed using an image analysis program such as ArrayVision, ImaGene, GenePix, QuantArray or SPOT to acquire the red and green foreground and background intensities for each spot.