Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

doi:10.1186/S13059-014-0550-8

Journal Article•DOI•

Michael I. Love¹,², Wolfgang Huber, Simon Anders

Harvard University¹, Max Planck Society²

05 Dec 2014-Genome Biology (BioMed Central)-Vol. 15, Iss. 12, p. 550

TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.

Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html .
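The idea of shrinking fold changes can be illustrated with a minimal Python sketch. This is not the DESeq2 implementation: it assumes a normal approximation for the estimated log2 fold change and a zero-centered normal prior, and the function name `shrink_lfc` and its parameters `standard_error` and `prior_sd` are hypothetical. Under those assumptions, the posterior mean pulls noisy estimates strongly toward zero and leaves precise estimates almost unchanged.

```python
def shrink_lfc(lfc_estimate, standard_error, prior_sd):
    """Posterior mean of a log2 fold change under a zero-centered normal prior.

    Simplified normal-normal model (an assumption for illustration only):
    the weight is the share of total variance attributed to the prior.
    """
    prior_var = prior_sd ** 2
    weight = prior_var / (prior_var + standard_error ** 2)
    return weight * lfc_estimate


# Same raw estimate, different precision (made-up numbers).
noisy = shrink_lfc(2.0, standard_error=1.0, prior_sd=1.0)    # imprecise estimate
precise = shrink_lfc(2.0, standard_error=0.1, prior_sd=1.0)  # precise estimate
print(noisy, precise)
```

The imprecise estimate is halved, while the precise one stays close to 2, which is the behaviour that makes shrunken fold changes easier to compare across genes.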

Citations

Journal Article•DOI•

Genomic analyses identify molecular subtypes of pancreatic cancer

Peter Bailey¹, David K. Chang², Katia Nones¹,³, Amber L. Johns⁴, Ann-Marie Patch¹,³, Marie-Claude Gingras⁵, David Miller¹,⁴, Angelika N. Christ¹, Timothy J. C. Bruxner¹, Michael C.J. Quinn¹,³, Craig Nourse¹,², Murtaugh Lc⁶, Ivon Harliwong¹, Senel Idrisoglu¹, Suzanne Manning¹, Ehsan Nourbakhsh¹, Shivangi Wani¹,³, J. Lynn Fink¹, Oliver Holmes¹,³, Chin⁴, Matthew J. Anderson¹, Stephen H. Kazakoff¹,³, Conrad Leonard¹,³, Felicity Newell¹, Nicola Waddell¹, Scott Wood¹,³, Qinying Xu¹,³, Peter J. Wilson¹, Nicole Cloonan¹,³, Karin S. Kassahn¹,⁷,⁸, Darrin Taylor¹, Kelly Quek¹, Alan J. Robertson¹, Lorena Pantano⁹, Laura Mincarelli², Luis Navarro Sanchez², Lisa Evers², Jianmin Wu⁴, Mark Pinese⁴, Mark J. Cowley⁴, Jones²,⁴, Emily K. Colvin⁴, Adnan Nagrial⁴, Emily S. Humphrey⁴, Lorraine A. Chantrill⁴,¹⁰, Amanda Mawson⁴, Jeremy L. Humphris⁴, Angela Chou⁴,¹¹, Marina Pajic⁴,¹², Christopher J. Scarlett⁴,¹³, Andreia V. Pinho⁴, Marc Giry-Laterriere⁴, Ilse Rooman⁴, Jaswinder S. Samra¹⁴, James G. Kench⁴,¹⁵,¹⁶, Jessica A. Lovell⁴, Neil D. Merrett¹², Christopher W. Toon⁴, Krishna Epari¹⁷, Nam Q. Nguyen¹⁸, Andrew Barbour¹⁹, Nikolajs Zeps²⁰, Kim Moran-Jones², Nigel B. Jamieson², Janet Graham²,²¹, Fraser Duthie²², Karin A. Oien⁴,²², Hair J²², Robert Grützmann²³, Anirban Maitra²⁴, Christine A. Iacobuzio-Donahue²⁵, Christopher L. Wolfgang²⁶, Richard A. Morgan²⁶, Rita T. Lawlor, Corbo, Claudio Bassi, Borislav Rusev, Paola Capelli²⁷, Roberto Salvia, Giampaolo Tortora, Debabrata Mukhopadhyay²⁸, Gloria M. Petersen²⁸, Munzy Dm⁵, William E. Fisher⁵, Saadia A. Karim, Eshleman²⁶, Ralph H. Hruban²⁶, Christian Pilarsky²³, Jennifer P. Morton, Owen J. Sansom², Aldo Scarpa²⁷, Elizabeth A. Musgrove², Ulla-Maja Bailey², Oliver Hofmann²,⁹, R. L. Sutherland⁴, David A. Wheeler⁵, Anthony J. Gill⁴,¹⁶, Richard A. Gibbs⁵, John V. Pearson¹,³, Andrew V. Biankin, Sean M. Grimmond¹,²,²⁹

University of Queensland¹, University of Glasgow², QIMR Berghofer Medical Research Institute³, Garvan Institute of Medical Research⁴, Baylor College of Medicine⁵, University of Utah⁶, University of Adelaide⁷, South Australia Pathology⁸, Harvard University⁹, Campbelltown Hospital¹⁰, St. Vincent's Health System¹¹, University of New South Wales¹², University of Newcastle¹³, Royal North Shore Hospital¹⁴, Royal Prince Alfred Hospital¹⁵, University of Sydney¹⁶, Fiona Stanley Hospital¹⁷, Royal Adelaide Hospital¹⁸, Princess Alexandra Hospital¹⁹, University of Western Australia²⁰, Beatson West of Scotland Cancer Centre²¹, Southern General Hospital²², Dresden University of Technology²³, University of Texas MD Anderson Cancer Center²⁴, Memorial Sloan Kettering Cancer Center²⁵, Johns Hopkins University School of Medicine²⁶, University of Verona²⁷, Mayo Clinic²⁸, University of Melbourne²⁹

03 Mar 2016-Nature

TL;DR: Detailed genomic analysis of 456 pancreatic ductal adenocarcinomas identified 32 recurrently mutated genes that aggregate into 10 pathways: KRAS, TGF-β, WNT, NOTCH, ROBO/SLIT signalling, G1/S transition, SWI-SNF, chromatin modification, DNA repair and RNA processing.

Abstract: Integrated genomic analysis of 456 pancreatic ductal adenocarcinomas identified 32 recurrently mutated genes that aggregate into 10 pathways: KRAS, TGF-β, WNT, NOTCH, ROBO/SLIT signalling, G1/S transition, SWI-SNF, chromatin modification, DNA repair and RNA processing. Expression analysis defined 4 subtypes: (1) squamous; (2) pancreatic progenitor; (3) immunogenic; and (4) aberrantly differentiated endocrine exocrine (ADEX) that correlate with histopathological characteristics. Squamous tumours are enriched for TP53 and KDM6A mutations, upregulation of the TP63∆N transcriptional network, hypermethylation of pancreatic endodermal cell-fate determining genes and have a poor prognosis. Pancreatic progenitor tumours preferentially express genes involved in early pancreatic development (FOXA2/3, PDX1 and MNX1). ADEX tumours displayed upregulation of genes that regulate networks involved in KRAS activation, exocrine (NR5A2 and RBPJL), and endocrine differentiation (NEUROD1 and NKX2-2). Immunogenic tumours contained upregulated immune networks including pathways involved in acquired immune suppression. These data infer differences in the molecular evolution of pancreatic cancer subtypes and identify opportunities for therapeutic development.

2,443 citations

Journal Article•DOI•

Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences

Charlotte Soneson¹,², Michael I. Love³, Mark D. Robinson¹,²

Swiss Institute of Bioinformatics¹, University of Zurich², Harvard University³

30 Dec 2015-F1000Research

TL;DR: It is illustrated that while the presence of differential isoform usage can lead to inflated false discovery rates in differential expression analyses on simple count matrices and transcript-level abundance estimates improve the performance in simulated data, the difference is relatively minor in several real data sets.

Abstract: High-throughput sequencing of cDNA (RNA-seq) is used extensively to characterize the transcriptome of cells. Many transcriptomic studies aim at comparing either abundance levels or the transcriptome composition between given conditions, and as a first step, the sequencing reads must be used as the basis for abundance quantification of transcriptomic features of interest, such as genes or transcripts. Various quantification approaches have been proposed, ranging from simple counting of reads that overlap given genomic regions to more complex estimation of underlying transcript abundances. In this paper, we show that gene-level abundance estimates and statistical inference offer advantages over transcript-level analyses, in terms of performance and interpretability. We also illustrate that the presence of differential isoform usage can lead to inflated false discovery rates in differential gene expression analyses on simple count matrices but that this can be addressed by incorporating offsets derived from transcript-level abundance estimates. We also show that the problem is relatively minor in several real data sets. Finally, we provide an R package (tximport) to help users integrate transcript-level abundance estimates from common quantification pipelines into count-based statistical inference engines.

2,420 citations

Journal Article•DOI•

A pathology atlas of the human cancer transcriptome

Mathias Uhlén¹,², Cheng Zhang¹, Sunjae Lee¹, Evelina Sjöstedt¹,³, Linn Fagerberg¹, Gholamreza Bidkhori¹, Rui Benfeitas¹, Muhammad Arif¹, Zhengtao Liu¹, Fredrik Edfors¹, Kemal Sanli¹, Kalle von Feilitzen¹, Per Oksvold¹, Emma Lundberg¹, Sophia Hober¹, Peter Nilsson¹, Johanna Sofia Margareta Mattsson³, Jochen M. Schwenk¹, Hans Brunnström⁴, Bengt Glimelius³, Tobias Sjöblom³, Per-Henrik Edqvist³, Dijana Djureinovic³, Patrick Micke³, Cecilia Lindskog³, Adil Mardinoglu¹,⁵, Fredrik Pontén³

Royal Institute of Technology¹, Technical University of Denmark², Uppsala University³, Lund University⁴, Chalmers University of Technology⁵

18 Aug 2017-Science

TL;DR: A Human Pathology Atlas has been created as part of the Human Protein Atlas program to explore the prognostic role of each protein-coding gene in 17 different cancers, and reveals that gene expression of individual tumors within a particular cancer varied considerably and could exceed the variation observed between distinct cancer types.

Abstract: Cancer is one of the leading causes of death, and there is great interest in understanding the underlying molecular mechanisms involved in the pathogenesis and progression of individual tumors. We used systems-level approaches to analyze the genome-wide transcriptome of the protein-coding genes of 17 major cancer types with respect to clinical outcome. A general pattern emerged: Shorter patient survival was associated with up-regulation of genes involved in cell growth and with down-regulation of genes involved in cellular differentiation. Using genome-scale metabolic models, we show that cancer patients have widespread metabolic heterogeneity, highlighting the need for precise and personalized medicine for cancer treatment. All data are presented in an interactive open-access database (www.proteinatlas.org/pathology) to allow genome-wide exploration of the impact of individual proteins on clinical outcomes.

2,276 citations

Journal Article•DOI•

Shifting the limits in wheat research and breeding using a fully annotated reference genome

Rudi Appels¹,², Kellye Eversole, Nils Stein³ (+204 more authors; 45 institutions)

17 Aug 2018-Science

TL;DR: This annotated reference sequence of wheat is a resource that can now drive disruptive innovation in wheat improvement, as this community resource establishes the foundation for accelerating wheat research and application through improved understanding of wheat biology and genomics-assisted breeding.

Abstract: An annotated reference sequence representing the hexaploid bread wheat genome in 21 pseudomolecules has been analyzed to identify the distribution and genomic context of coding and noncoding elements across the A, B, and D subgenomes. With an estimated coverage of 94% of the genome and containing 107,891 high-confidence gene models, this assembly enabled the discovery of tissue- and developmental stage-related coexpression networks by providing a transcriptome atlas representing major stages of wheat development. Dynamics of complex gene families involved in environmental adaptation and end-use quality were revealed at subgenome resolution and contextualized to known agronomic single-gene or quantitative trait loci. This community resource establishes the foundation for accelerating wheat research and application through improved understanding of wheat biology and genomics-assisted breeding.

2,118 citations

Cites background from "Moderated estimation of fold change..."


Posted Content•DOI•

Comprehensive integration of single cell data

Tim Stuart, Andrew Butler¹, Paul J. Hoffman, Christoph Hafemeister, Efthymia Papalexi¹, William M. Mauck¹, Marlon Stoeckius², Peter Smibert², Rahul Satija¹

New York University¹, Harvard University²

02 Nov 2018-bioRxiv

TL;DR: This work presents a strategy for comprehensive integration of single cell data, including the assembly of harmonized references, and the transfer of information across datasets, and demonstrates how anchoring can harmonize in-situ gene expression and scRNA-seq datasets.

Abstract: Single cell transcriptomics (scRNA-seq) has transformed our ability to discover and annotate cell types and states, but deep biological understanding requires more than a taxonomic listing of clusters. As new methods arise to measure distinct cellular modalities, including high-dimensional immunophenotypes, chromatin accessibility, and spatial positioning, a key analytical challenge is to integrate these datasets into a harmonized atlas that can be used to better understand cellular identity and function. Here, we develop a computational strategy to "anchor" diverse datasets together, enabling us to integrate and compare single cell measurements not only across scRNA-seq technologies, but different modalities as well. After demonstrating substantial improvement over existing methods for data integration, we anchor scRNA-seq experiments with scATAC-seq datasets to explore chromatin differences in closely related interneuron subsets, and project single cell protein measurements onto a human bone marrow atlas to annotate and characterize lymphocyte populations. Lastly, we demonstrate how anchoring can harmonize in-situ gene expression and scRNA-seq datasets, allowing for the transcriptome-wide imputation of spatial gene expression patterns, and the identification of spatial relationships between mapped cell types in the visual cortex. Our work presents a strategy for comprehensive integration of single cell data, including the assembly of harmonized references, and the transfer of information across datasets. Availability: Installation instructions, documentation, and tutorials are available at: https://www.satijalab.org/seurat

2,037 citations

Cites methods from "Moderated estimation of fold change..."

...To identify differentially-expressed genes between the CD69+ and CD69- sorted populations, we used DESeq2 [88] and filtered for significant genes with a log2-fold change in expression greater than 1.5 and a q-value of less than 0.01 [89]....
...To identify differentially-expressed genes between the CD69+ and CD69- sorted populations, we used DESeq2 [Love et al., 2014] and filtered for significant genes with a log2-fold change in expression greater than 1....
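The filtering step described in the excerpt can be sketched in Python as follows. This is an illustration only: the `GeneResult` record, the function `filter_significant`, and the example values are hypothetical, and the thresholds are read literally as a one-sided log2 fold change cutoff of 1.5 together with a q-value cutoff of 0.01.

```python
from dataclasses import dataclass


@dataclass
class GeneResult:
    gene: str
    log2_fold_change: float
    q_value: float


def filter_significant(results, min_log2_fc=1.5, max_q=0.01):
    """Keep genes with log2 fold change above min_log2_fc and q-value below max_q."""
    return [
        r for r in results
        if r.log2_fold_change > min_log2_fc and r.q_value < max_q
    ]


# Made-up results table, not data from any study.
results = [
    GeneResult("geneA", 2.3, 0.001),   # passes both thresholds
    GeneResult("geneB", 1.2, 0.0005),  # fold change too small
    GeneResult("geneC", 3.1, 0.2),     # q-value too large
    GeneResult("geneD", -2.0, 0.001),  # strongly down-regulated, fails the one-sided cutoff
]
selected = [r.gene for r in filter_significant(results)]
print(selected)
```

To select changes in both directions, the comparison could use `abs(r.log2_fold_change)` instead; the excerpt does not say which was intended.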

References

Journal Article•DOI•

Controlling the false discovery rate: a practical and powerful approach to multiple testing

Yoav Benjamini, Yosef Hochberg

01 Jan 1995-Journal of the Royal Statistical Society Series B (Methodological)

TL;DR: This paper presents a different approach to problems of multiple significance testing: controlling the expected proportion of falsely rejected hypotheses, the false discovery rate, which is equivalent to the FWER when all hypotheses are true but is smaller otherwise.

Abstract: Summary: The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented. It calls for controlling the expected proportion of falsely rejected hypotheses, the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise. Therefore, in problems where the control of the false discovery rate rather than that of the FWER is desired, there is potential for a gain in power. A simple sequential Bonferroni-type procedure is proved to control the false discovery rate for independent test statistics, and a simulation study shows that the gain in power is substantial. The use of the new procedure and the appropriateness of the criterion are illustrated with examples.

83,420 citations

"Moderated estimation of fold change..." refers methods in this paper

...The Wald test P values from the subset of genes that pass an independent filtering step, described in the next section, are adjusted for multiple testing using the procedure of Benjamini and Hochberg [21]....
...The Wald test p-values from the subset of genes that pass an independent filtering step, described in the next section, are adjusted for multiple testing using the procedure of Benjamini and Hochberg [20]....
...For all algorithms returning P values, the P values from genes with non-zero sum of read counts across samples were adjusted using the Benjamini–Hochberg procedure [21]....
...The Wald test P values from the subset of genes that pass the independent filtering step are adjusted for multiple testing using the procedure of Benjamini and Hochberg [21]....
...The Wald test p-values from the subset of genes which pass the independent filtering step are adjusted for multiple testing using the procedure of Benjamini and Hochberg [20]....
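The adjustment mentioned in these excerpts can be sketched in plain Python. This is a minimal illustration of the Benjamini-Hochberg step-up adjustment, assuming a simple list of p-values; the function name `benjamini_hochberg` and the example values are hypothetical, and independent filtering is not modelled.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and
    # taking a running minimum so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four genes.
p = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(p)
print(adj)
```

Genes whose adjusted value falls at or below a chosen threshold are then reported at that false discovery rate.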

Journal Article•DOI•

Handbook of Mathematical Functions

Milton Abramowitz, Irene A. Stegun, Donald A. McQuarrie

01 Feb 1966-American Journal of Physics

46,339 citations

Journal Article•DOI•

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Mark D. Robinson¹, Davis J. McCarthy¹, Gordon K. Smyth¹

Walter and Eliza Hall Institute of Medical Research¹

01 Jan 2010-Bioinformatics

TL;DR: edgeR is a Bioconductor software package for examining differential expression of replicated count data. It uses an overdispersed Poisson model to account for both biological and technical variability, and empirical Bayes methods to moderate the degree of overdispersion across transcripts, improving the reliability of inference.

Abstract: Summary: It is expected that emerging digital gene expression (DGE) technologies will overtake microarray technologies in the near future for many functional genomics applications. One of the fundamental data analysis tasks, especially for gene expression studies, involves determining whether there is evidence that counts for a transcript or exon are significantly different across experimental conditions. edgeR is a Bioconductor software package for examining differential expression of replicated count data. An overdispersed Poisson model is used to account for both biological and technical variability. Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference. The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated. The software may have other applications beyond sequencing data, such as proteome peptide count data. Availability: The package is freely available under the LGPL licence from the Bioconductor web site (http://bioconductor.org).

29,413 citations

"Moderated estimation of fold change..." refers methods in this paper

...The Negative Binomial based approaches compared were DESeq (old) [4], edgeR [32], edgeR with the robust option [33], DSS [6] and EBSeq [34]....
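What "overdispersed" means for count data can be shown with a short Python sketch. It assumes the common negative binomial parameterization in which the variance is the mean plus the dispersion times the squared mean. The method-of-moments estimate below is a deliberately naive illustration, not the estimator used by edgeR, DESeq2, or the other packages listed, and the function names and counts are hypothetical.

```python
from statistics import mean, variance


def nb_variance(mu, dispersion):
    """Variance of a negative binomial count with the given mean and dispersion."""
    return mu + dispersion * mu * mu


def moment_dispersion(counts):
    """Naive method-of-moments dispersion from replicate counts, floored at zero."""
    m = mean(counts)
    v = variance(counts)  # sample variance with n - 1 in the denominator
    return max((v - m) / (m * m), 0.0)


# Made-up replicate counts for one gene.
counts = [10, 20, 30]
alpha = moment_dispersion(counts)
print(alpha, nb_variance(mean(counts), alpha))
```

A dispersion of zero recovers the Poisson case, where the variance equals the mean. With few replicates this raw estimate is very noisy, which is the motivation for moderating it across genes.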

Book•

Generalized Linear Models

Peter McCullagh¹, John A. Nelder

Imperial College London¹

01 Jan 1983

TL;DR: In this paper, a generalization of the analysis of variance is given for these models using log-likelihoods, illustrated by examples relating to four distributions: the Normal, Binomial (probit analysis, etc.), Poisson (contingency tables), and gamma (variance components).

Abstract: The technique of iterative weighted linear regression can be used to obtain maximum likelihood estimates of the parameters with observations distributed according to some exponential family and systematic effects that can be made linear by a suitable transformation. A generalization of the analysis of variance is given for these models using log-likelihoods. These generalized linear models are illustrated by examples relating to four distributions: the Normal, Binomial (probit analysis, etc.), Poisson (contingency tables) and gamma (variance components).
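The iterative weighted linear regression described in the abstract can be sketched in plain Python for the simplest count-data case: a Poisson model with a log link, one intercept and one covariate. The function name `fit_poisson_irls`, the fixed iteration count, and the example data are assumptions made for illustration, not taken from the book.

```python
import math


def fit_poisson_irls(x, y, iterations=25):
    """Fit log(mu) = b0 + b1 * x by iteratively reweighted least squares."""
    # Start from the intercept-only fit.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        # Working response and weights for a Poisson model with log link.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu
        # Weighted least squares of z on (1, x) via the 2x2 normal equations.
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
    return b0, b1


# Made-up two-group design: x = 0 for control, x = 1 for treated.
x = [0, 0, 0, 1, 1, 1]
y = [2, 3, 4, 8, 10, 12]
b0, b1 = fit_poisson_irls(x, y)
print(math.exp(b0), math.exp(b0 + b1))
```

For a two-group design the fitted means should match the group averages, which gives a simple check that the iteration has converged. A production implementation would add a convergence test and guard against a singular design.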

23,215 citations

Book•

The Elements of Statistical Learning: Data Mining, Inference, and Prediction

Trevor Hastie¹, Robert Tibshirani, Jerome H. Friedman

University of New South Wales¹

28 Jul 2013

TL;DR: In this paper, the authors describe the important ideas in these areas in a common conceptual framework, and the emphasis is on concepts rather than mathematics, with a liberal use of color graphics.

Abstract: During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting, the first comprehensive treatment of this topic in any book. This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression and path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for "wide" data (p bigger than n), including multiple testing and false discovery rates. Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.

19,261 citations