Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

doi:10.1186/S13059-014-0550-8


Journal Article

Michael I. Love¹,², Wolfgang Huber, Simon Anders

Harvard University¹, Max Planck Society²

05 Dec 2014 - Genome Biology (BioMed Central) - Vol. 15, Iss. 12, p. 550

TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.


Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html .
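To give a feel for what shrinkage estimation of a fold change means, here is a minimal Python sketch under a strong simplifying assumption: a zero-centered normal prior combined with a normal approximation to the likelihood. It is not DESeq2's actual procedure, which the abstract does not spell out; the function name `shrink_lfc` and its parameters are hypothetical.

```python
def shrink_lfc(lfc, se, prior_sd):
    """Posterior mean of a log fold change under a zero-centered normal prior.

    Assumes lfc ~ Normal(true_lfc, se^2) and true_lfc ~ Normal(0, prior_sd^2).
    """
    weight = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    return weight * lfc

# A noisy estimate (large standard error) is pulled strongly toward zero,
# while a precise estimate is left almost unchanged.
noisy = shrink_lfc(2.0, se=1.0, prior_sd=1.0)
precise = shrink_lfc(2.0, se=0.1, prior_sd=1.0)
```

In this toy model the amount of shrinkage depends only on how the standard error compares with the prior width, which is the intuition behind "stability" for genes with little information.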


Citations


Journal Article

A survey of best practices for RNA-seq data analysis

Ana Conesa¹, Pedro Madrigal²,³, Sonia Tarazona⁴, David Gomez-Cabrero, Alejandra Cervera⁵, Andrew McPherson⁶, Michał Wojciech Szcześniak⁷, Daniel J. Gaffney³, Laura L. Elo⁸, Xuegong Zhang⁹, Ali Mortazavi¹⁰

University of Florida¹, University of Cambridge², Wellcome Trust Sanger Institute³, Polytechnic University of Valencia⁴, University of Helsinki⁵, Simon Fraser University⁶, Adam Mickiewicz University in Poznań⁷, Åbo Akademi University⁸, Tsinghua University⁹, University of California, Irvine¹⁰

26 Jan 2016 - Genome Biology

TL;DR: All of the major steps in RNA-seq data analysis are reviewed, including experimental design, quality control, read alignment, quantification of gene and transcript levels, visualization, differential gene expression, alternative splicing, functional analysis, gene fusion detection and eQTL mapping.


Abstract: RNA-sequencing (RNA-seq) has a wide variety of applications, but no single analysis pipeline can be used in all cases. We review all of the major steps in RNA-seq data analysis, including experimental design, quality control, read alignment, quantification of gene and transcript levels, visualization, differential gene expression, alternative splicing, functional analysis, gene fusion detection and eQTL mapping. We highlight the challenges associated with each step. We discuss the analysis of small RNAs and the integration of RNA-seq with other functional genomics techniques. Finally, we discuss the outlook for novel technologies that are changing the state of the art in transcriptomics.


1,963 citations

Cites background or methods from "Moderated estimation of fold change..."

...[58], and maSigPro [213] can perform multiple comparisons,...
[...]
...DESeq2, like edgeR, uses the negative binomial as the reference distribution and provides its own normalization approach [48, 58]....
[...]
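The normalization approach mentioned in the excerpt above is commonly described as the median-of-ratios method. The following Python sketch shows that idea on a small gene-by-sample count matrix; it is an illustration under that assumption, not the package's code, and the function name `size_factors` is hypothetical.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors; counts[i][j] is the count of gene i in sample j."""
    n_samples = len(counts[0])
    # Use only genes without zero counts so that the geometric mean is positive.
    usable = [row for row in counts if all(k > 0 for k in row)]
    log_geo_means = [sum(math.log(k) for k in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of each gene's count to that gene's geometric mean across samples.
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# The second sample was sequenced twice as deeply as the first.
example = [[10, 20], [100, 200], [5, 10], [0, 7]]
factors = size_factors(example)
```

Dividing each sample's counts by its size factor puts the samples on a common scale; taking the median makes the estimate robust to a minority of strongly changing genes.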

Journal Article

Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression

Christoph Hafemeister, Rahul Satija¹

New York University¹

23 Dec 2019 - Genome Biology

TL;DR: It is proposed that the Pearson residuals from “regularized negative binomial regression,” where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity.


Abstract: Single-cell RNA-seq (scRNA-seq) data exhibits significant cell-to-cell variation due to technical factors, including the number of molecules detected in each cell, which can confound biological heterogeneity with technical effects. To address this, we present a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments. We propose that the Pearson residuals from “regularized negative binomial regression,” where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity. Importantly, we show that an unconstrained negative binomial model may overfit scRNA-seq data, and overcome this by pooling information across genes with similar abundances to obtain stable parameter estimates. Our procedure omits the need for heuristic steps including pseudocount addition or log-transformation and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression. Our approach can be applied to any UMI-based scRNA-seq dataset and is freely available as part of the R package sctransform, with a direct interface to our single-cell toolkit Seurat.
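As a small illustration of the Pearson residuals the abstract refers to, the Python sketch below computes the residual of one observed count under a negative binomial model with mean `mu` and inverse-dispersion parameter `theta`, using the standard variance form mu + mu^2/theta. In the described method `mu` would come from the fitted regression; here it is simply passed in, and the function name is hypothetical.

```python
import math

def nb_pearson_residual(x, mu, theta):
    """Pearson residual of count x under a negative binomial with mean mu.

    The variance is taken to be mu + mu**2 / theta (theta > 0).
    """
    variance = mu + mu * mu / theta
    return (x - mu) / math.sqrt(variance)

# An observed count of 10 where the model expects 4 with theta = 2:
# variance = 4 + 16 / 2 = 12, residual = 6 / sqrt(12).
residual = nb_pearson_residual(10, mu=4.0, theta=2.0)
```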


1,898 citations

Journal Article

Visualization and analysis of gene expression in tissue sections by spatial transcriptomics

Patrik L. Ståhl¹,², Fredrik Salmén¹, Sanja Vickovic¹, Anna Lundmark¹,², José Fernández Navarro¹,², Jens P. Magnusson², Stefania Giacomello¹, Michaela Asp¹, Jakub Orzechowski Westholm³, Mikael Huss³, Annelie Mollbrink¹, Sten Linnarsson², Simone Codeluppi², Åke Borg⁴, Fredrik Pontén⁵, Paul I. Costea¹, Pelin Sahlén¹, Jan Mulder³, Olaf Bergmann², Joakim Lundeberg¹, Jonas Frisén²

Royal Institute of Technology¹, Karolinska Institutet², Science for Life Laboratory³, Lund University⁴, Uppsala University⁵

01 Jul 2016 - Science

TL;DR: By positioning histological sections on arrayed reverse transcription primers with unique positional barcodes, this work demonstrates high-quality RNA-sequencing data with maintained two-dimensional positional information from the mouse brain and human breast cancer.


Abstract: Analysis of the pattern of proteins or messenger RNAs (mRNAs) in histological tissue sections is a cornerstone in biomedical research and diagnostics. This typically involves the visualization of a few proteins or expressed genes at a time. We have devised a strategy, which we call “spatial transcriptomics,” that allows visualization and quantitative analysis of the transcriptome with spatial resolution in individual tissue sections. By positioning histological sections on arrayed reverse transcription primers with unique positional barcodes, we demonstrate high-quality RNA-sequencing data with maintained two-dimensional positional information from the mouse brain and human breast cancer. Spatial transcriptomics provides quantitative gene expression data and visualization of the distribution of mRNAs within tissue sections and enables novel types of bioinformatics analyses, valuable in research and diagnostics.


1,741 citations

Journal Article

An improved ATAC-seq protocol reduces background and enables interrogation of frozen tissues

M. Ryan Corces¹, Alexandro E. Trevino¹, Emily G. Hamilton¹, Peyton Greenside¹, Nicholas A Sinnott-Armstrong¹, Sam Vesuna¹, Ansuman T. Satpathy¹, Adam J. Rubin¹, Kathleen S. Montine¹, Beijing Wu¹, Arwa Kathiria¹, Seung Woo Cho¹, Maxwell R. Mumbach¹, Ava C. Carter¹, Maya Kasowski¹, Lisa A. Orloff¹, Viviana I. Risca¹, Anshul Kundaje¹, Paul A. Khavari¹, Thomas J. Montine¹, William J. Greenleaf¹, Howard Y. Chang¹

Stanford University¹

28 Aug 2017 - Nature Methods

TL;DR: The Omni-ATAC protocol generates chromatin accessibility profiles from archival frozen tissue samples and 50-μm sections, revealing the activities of disease-associated DNA elements in distinct human brain structures.


Abstract: We present Omni-ATAC, an improved ATAC-seq protocol for chromatin accessibility profiling that works across multiple applications with substantial improvement of signal-to-background ratio and information content. The Omni-ATAC protocol generates chromatin accessibility profiles from archival frozen tissue samples and 50-μm sections, revealing the activities of disease-associated DNA elements in distinct human brain structures. The Omni-ATAC protocol enables the interrogation of personal regulomes in tissue context and translational studies.


1,452 citations

Journal Article

The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads.

Yang Liao¹,², Gordon K. Smyth¹,², Wei Shi¹,²

Walter and Eliza Hall Institute of Medical Research¹, University of Melbourne²

07 May 2019 - Nucleic Acids Research

TL;DR: This work presents Rsubread, a Bioconductor software package that provides high-performance alignment and read counting functions for RNA-seq reads. It integrates read mapping and quantification in a single package and has no software dependencies other than R itself.


Abstract: We present Rsubread, a Bioconductor software package that provides high-performance alignment and read counting functions for RNA-seq reads. Rsubread is based on the successful Subread suite with the added ease-of-use of the R programming environment, creating a matrix of read counts directly as an R object ready for downstream analysis. It integrates read mapping and quantification in a single package and has no software dependencies other than R itself. We demonstrate Rsubread's ability to detect exon-exon junctions de novo and to quantify expression at the level of either genes, exons or exon junctions. The resulting read counts can be input directly into a wide range of downstream statistical analyses using other Bioconductor packages. Using SEQC data and simulations, we compare Rsubread to TopHat2, STAR and HTSeq as well as to counting functions in the Bioconductor infrastructure packages. We consider the performance of these tools on the combined quantification task starting from raw sequence reads through to summary counts, and in particular evaluate the performance of different combinations of alignment and counting algorithms. We show that Rsubread is faster and uses less memory than competitor tools and produces read count summaries that more accurately correlate with true values.


1,420 citations

Cites methods from "Moderated estimation of fold change..."

...Rsubread can work with Bioconductor packages limma, edgeR and DESeq2 to complete an entire RNA-seq analysis in R from read mapping through to the discovery of genes that exhibit significant expression changes (21,22)....
[...]
...Bioconductor contains many highly cited packages for the analysis of RNA-seq read counts, including limma (13,14), edgeR (15) and DESeq2 (16) for differential expression analyses and DEXSeq (4) for analysis of differential splicing....
[...]
...SAF is a Simplified Annotation Format with columns GeneID, Chr, Start, End and Strand. featureCounts produces a matrix of genewise counts suitable for input to gene expression analysis packages such as limma (13), edgeR (15) or DESeq2 (16)....
[...]


References


Journal Article

Controlling the false discovery rate: a practical and powerful approach to multiple testing

Yoav Benjamini, Yosef Hochberg

01 Jan 1995 - Journal of the Royal Statistical Society, Series B (Methodological)

TL;DR: This paper presents a different approach to problems of multiple significance testing: controlling the expected proportion of falsely rejected hypotheses, the false discovery rate, which is equivalent to the FWER when all hypotheses are true but is smaller otherwise.


Abstract: The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented. It calls for controlling the expected proportion of falsely rejected hypotheses, the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise. Therefore, in problems where the control of the false discovery rate rather than that of the FWER is desired, there is potential for a gain in power. A simple sequential Bonferroni-type procedure is proved to control the false discovery rate for independent test statistics, and a simulation study shows that the gain in power is substantial. The use of the new procedure and the appropriateness of the criterion are illustrated with examples.


83,420 citations

"Moderated estimation of fold change..." refers to methods in this paper

...The Wald test P values from the subset of genes that pass an independent filtering step, described in the next section, are adjusted for multiple testing using the procedure of Benjamini and Hochberg [21]....
[...]
...The Wald test p-values from the subset of genes that pass an independent filtering step, described in the next section, are adjusted for multiple testing using the procedure of Benjamini and Hochberg [20]....
[...]
...For all algorithms returning P values, the P values from genes with non-zero sum of read counts across samples were adjusted using the Benjamini–Hochberg procedure [21]....
[...]
...The Wald test P values from the subset of genes that pass the independent filtering step are adjusted for multiple testing using the procedure of Benjamini and Hochberg [21]....
[...]
...The Wald test p-values from the subset of genes which pass the independent filtering step are adjusted for multiple testing using the procedure of Benjamini and Hochberg [20]....
[...]
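The Benjamini-Hochberg adjustment referred to in the excerpts above can be sketched in a few lines of Python. This is the standard step-up adjustment written for illustration, not code taken from DESeq2; the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and
    # taking a running minimum so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Genes whose adjusted value falls below a chosen threshold (for example 0.1) would be reported, with the expected fraction of false discoveries among them controlled at that level for independent tests.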

Journal Article

Handbook of Mathematical Functions

Milton Abramowitz, Irene A. Stegun, Donald A. McQuarrie

01 Feb 1966 - American Journal of Physics

46,339 citations

Journal Article

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Mark D. Robinson¹, Davis J. McCarthy¹, Gordon K. Smyth¹

Walter and Eliza Hall Institute of Medical Research¹

01 Jan 2010 - Bioinformatics

TL;DR: edgeR is a Bioconductor software package for examining differential expression of replicated count data. It uses an overdispersed Poisson model to account for both biological and technical variability, and empirical Bayes methods to moderate the degree of overdispersion across transcripts, improving the reliability of inference.


Abstract: Summary: It is expected that emerging digital gene expression (DGE) technologies will overtake microarray technologies in the near future for many functional genomics applications. One of the fundamental data analysis tasks, especially for gene expression studies, involves determining whether there is evidence that counts for a transcript or exon are significantly different across experimental conditions. edgeR is a Bioconductor software package for examining differential expression of replicated count data. An overdispersed Poisson model is used to account for both biological and technical variability. Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference. The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated. The software may have other applications beyond sequencing data, such as proteome peptide count data. Availability: The package is freely available under the LGPL licence from the Bioconductor web site (http://bioconductor.org).


29,413 citations

"Moderated estimation of fold change..." refers to methods in this paper

...The Negative Binomial based approaches compared were DESeq (old) [4], edgeR [32], edgeR with the robust option [33], DSS [6] and EBSeq [34]....
[...]

Book

Generalized Linear Models

Peter McCullagh¹, John A. Nelder

Imperial College London¹

01 Jan 1983

TL;DR: A generalization of the analysis of variance is given for generalized linear models using log-likelihoods, illustrated by examples relating to four distributions: the Normal, Binomial (probit analysis, etc.), Poisson (contingency tables), and gamma (variance components).


Abstract: The technique of iterative weighted linear regression can be used to obtain maximum likelihood estimates of the parameters with observations distributed according to some exponential family and systematic effects that can be made linear by a suitable transformation. A generalization of the analysis of variance is given for these models using log-likelihoods. These generalized linear models are illustrated by examples relating to four distributions: the Normal, Binomial (probit analysis, etc.), Poisson (contingency tables) and gamma (variance components).
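The iterative weighted linear regression named in the abstract can be illustrated with a small Python sketch for one concrete case: a Poisson model with a log link, an intercept and a single covariate. The model choice, the function name `poisson_irls` and the fixed iteration count are assumptions made for this illustration, not details taken from the book.

```python
import math

def poisson_irls(x, y, n_iter=25):
    """Fit log(mu_i) = b0 + b1 * x_i by iteratively reweighted least squares."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    for _ in range(n_iter):
        # Accumulate the weighted normal equations (X'WX) b = X'Wz.
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi
            mu = math.exp(eta)
            w = mu                     # Poisson working weight under the log link
            z = eta + (yi - mu) / mu   # working response
            s_w += w
            s_wx += w * xi
            s_wxx += w * xi * xi
            s_wz += w * z
            s_wxz += w * xi * z
        # Solve the 2x2 system directly.
        det = s_w * s_wxx - s_wx * s_wx
        b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        b1 = (s_w * s_wxz - s_wx * s_wz) / det
    return b0, b1

# Two groups of three counts; x marks group membership.
group = [0, 0, 0, 1, 1, 1]
counts = [2, 3, 4, 8, 10, 12]
b0, b1 = poisson_irls(group, counts)
```

With a binary covariate the maximum likelihood fit reproduces the group means, so `b0` approaches the log of the first group's mean and `b1` the log ratio of the two means, which gives a simple check on the iteration.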


23,215 citations

Book

The Elements of Statistical Learning: Data Mining, Inference, and Prediction

Trevor Hastie¹, Robert Tibshirani, Jerome H. Friedman

University of New South Wales¹

28 Jul 2013

TL;DR: This book describes the important ideas in data mining, machine learning, and bioinformatics in a common conceptual framework; the emphasis is on concepts rather than mathematics, with a liberal use of color graphics.


Abstract: During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting, the first comprehensive treatment of this topic in any book. This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression and path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for "wide" data (p bigger than n), including multiple testing and false discovery rates. Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.


19,261 citations