Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

doi:10.1186/S13059-014-0550-8

Journal Article•DOI•

Michael I. Love¹,², Wolfgang Huber, Simon Anders•Institutions (2)

Harvard University¹, Max Planck Society²

05 Dec 2014-Genome Biology (BioMed Central)-Vol. 15, Iss: 12, pp 550-550

TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.


Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html .
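As a rough intuition for what shrinkage of fold changes does, the sketch below uses a simple normal-normal approximation: a noisy log2 fold change estimate is pulled toward zero, and the pull is stronger when its standard error is large. This is an illustrative simplification, not the fitting procedure of DESeq2 (which the abstract does not detail); the function name, the zero-centred normal prior, and all numbers are assumptions made for the example.

```python
def shrink_log2_fold_change(lfc_estimate: float, std_error: float, prior_sd: float) -> float:
    """Posterior mean of a log2 fold change under a zero-centred normal prior.

    Simplifying assumptions (not taken from the paper):
      lfc_estimate ~ Normal(true_lfc, std_error**2)
      true_lfc     ~ Normal(0, prior_sd**2)
    """
    if std_error < 0 or prior_sd <= 0:
        raise ValueError("std_error must be >= 0 and prior_sd must be > 0")
    # Weight on the data: near 1 for precise estimates, near 0 for noisy ones.
    weight = prior_sd**2 / (prior_sd**2 + std_error**2)
    return weight * lfc_estimate


# Same raw estimate, different precision (all numbers are made up).
well_measured = shrink_log2_fold_change(2.0, std_error=0.2, prior_sd=1.0)
poorly_measured = shrink_log2_fold_change(2.0, std_error=2.0, prior_sd=1.0)
print(well_measured, poorly_measured)
```

The poorly measured gene ends up with a much smaller shrunken value than the well-measured one, which is the sense in which shrinkage lets an analysis focus on the strength of differential expression rather than its mere presence.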


Citations


Journal Article•DOI•

HTSeq—a Python framework to work with high-throughput sequencing data


Simon Anders, Paul Theodor Pyl, Wolfgang Huber

15 Jan 2015-Bioinformatics

TL;DR: This work presents HTSeq, a Python library to facilitate the rapid development of custom scripts for high-throughput sequencing data analysis, and presents htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes.

Abstract: Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard workflows, custom scripts are needed. Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data, such as genomic coordinates, sequences, sequencing reads, alignments, gene model information and variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes. Availability and implementation: HTSeq is released as open-source software under the GNU General Public Licence and is available from http://www-huber.embl.de/HTSeq or from the Python Package Index at https://pypi.python.org/pypi/HTSeq. Contact: sanders@fs.tum.de

15,744 citations

Journal Article•DOI•

Comprehensive Integration of Single-Cell Data.

Tim Stuart, Andrew Butler¹, Paul J. Hoffman, Christoph Hafemeister, Efthymia Papalexi¹, William M. Mauck¹, Yuhan Hao¹, Marlon Stoeckius², Peter Smibert², Rahul Satija¹

New York University¹, Harvard University²

13 Jun 2019-Cell

TL;DR: A strategy to "anchor" diverse datasets together, enabling us to integrate single-cell measurements not only across scRNA-seq technologies, but also across different modalities.


7,892 citations

Cites methods from "Moderated estimation of fold change..."

...To identify differentially-expressed genes between the CD69+ and CD69- sorted populations, we used DESeq2 [Love et al., 2014] and filtered for significant genes with a log2-fold change in expression greater than 1.5 and a q-value of less than 0.01 [Storey and Tibshirani, 2003]....
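The filtering step described in this excerpt can be sketched as follows. This is a minimal illustration, assuming the differential expression results are already available as records with a log2 fold change and a q-value; the `GeneResult` type, the gene names, and the numbers are hypothetical. The excerpt does not say whether the fold-change cutoff applies to the absolute value, so the sketch reads it literally by default and exposes an `absolute` switch.

```python
from typing import List, NamedTuple


class GeneResult(NamedTuple):
    """One row of a differential expression result table (hypothetical layout)."""
    gene: str
    log2_fold_change: float
    q_value: float


def filter_significant(results: List[GeneResult],
                       lfc_cutoff: float = 1.5,
                       q_cutoff: float = 0.01,
                       absolute: bool = False) -> List[GeneResult]:
    """Keep genes with log2 fold change > lfc_cutoff and q-value < q_cutoff."""
    kept = []
    for r in results:
        lfc = abs(r.log2_fold_change) if absolute else r.log2_fold_change
        if lfc > lfc_cutoff and r.q_value < q_cutoff:
            kept.append(r)
    return kept


# Made-up example data.
example = [
    GeneResult("geneA", 2.3, 0.001),   # passes both cutoffs
    GeneResult("geneB", 1.2, 0.0005),  # fold change too small
    GeneResult("geneC", -2.0, 0.002),  # passes only if absolute=True
    GeneResult("geneD", 1.8, 0.05),    # q-value too large
]
up_only = filter_significant(example)
both_directions = filter_significant(example, absolute=True)
```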

Journal Article•DOI•

Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown

Mihaela Pertea¹, Daehwan Kim¹, Geo Pertea¹, Jeffrey T. Leek¹, Steven L. Salzberg¹

Johns Hopkins University¹

01 Sep 2016-Nature Protocols

TL;DR: This protocol describes all the steps necessary to process a large set of raw sequencing reads and create lists of gene transcripts, expression levels, and differentially expressed genes and transcripts.


Abstract: High-throughput sequencing of mRNA (RNA-seq) has become the standard method for measuring and comparing the levels of gene expression in a wide variety of species and conditions. RNA-seq experiments generate very large, complex data sets that demand fast, accurate and flexible software to reduce the raw read data to comprehensible results. HISAT (hierarchical indexing for spliced alignment of transcripts), StringTie and Ballgown are free, open-source software tools for comprehensive analysis of RNA-seq experiments. Together, they allow scientists to align reads to a genome, assemble transcripts including novel splice variants, compute the abundance of these transcripts in each sample and compare experiments to identify differentially expressed genes and transcripts. This protocol describes all the steps necessary to process a large set of raw sequencing reads and create lists of gene transcripts, expression levels, and differentially expressed genes and transcripts. The protocol's execution time depends on the computing resources, but it typically takes under 45 min of computer time. HISAT, StringTie and Ballgown are available from http://ccb.jhu.edu/software.shtml.


3,755 citations

Journal Article•DOI•

Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19.

Daniel Blanco-Melo¹, Benjamin E. Nilsson-Payant¹, Wen-Chun Liu¹, Skyler Uhl¹, Daisy A. Hoagland¹, Rasmus Møller¹, Tristan X. Jordan¹, Kohei Oishi¹, Maryline Panis¹, David H. Sachs¹, Taia T. Wang², Robert E. Schwartz³, Jean K. Lim¹, Randy A. Albrecht¹, Benjamin R. tenOever¹

Icahn School of Medicine at Mount Sinai¹, Stanford University², Cornell University³

28 May 2020-Cell

TL;DR: It is proposed that reduced innate antiviral defenses coupled with exuberant inflammatory cytokine production are the defining and driving features of COVID-19.


3,286 citations

Cites background or methods from "Moderated estimation of fold change..."

...1.10 | Illumina | http://basespace.illumina.com/dashboard
DESeq2 | Love et al., 2014 | https://bioconductor.org/packages/release/bioc/html/DESeq2.html
STRING | Szklarczyk et al., 2019 | https://string-db.org/
gplots | CRAN | https://cran.r-project.org/web/packages/gplots/index.html
PMA | Witten et al., 2009 | https://cran.r-project.org/web/packages/PMA/index.html
ggplot2 | Tidyverse | https://ggplot2.tidyverse.org/
Bowtie2 | Langmead and Salzberg, 2012 | http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
ImmGen | Yoshida et al., 2019 | http://www.immgen.org/ ...
...Raw reads were aligned to the human genome (hg19) using the RNA-Seq Alignment App on Basespace (Illumina, CA), following differential expression analysis using DESeq2 (Love et al., 2014)....

Journal Article•DOI•

The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update

Enis Afgan¹, Dannon Baker¹, Bérénice Batut², Marius van den Beek³, Dave Bouvier⁴, Martin Čech⁴, John Chilton⁴, Dave Clements¹, Nate Coraor⁴, Björn Grüning², Aysam Guerler¹, Jennifer Hillman-Jackson⁴, Saskia Hiltemann⁵, Vahid Jalili⁶, Helena Rasche², Nicola Soranzo⁷, Jeremy Goecks⁶, James Taylor¹, Anton Nekrutenko⁴, Daniel Blankenberg⁸

Johns Hopkins University¹, University of Freiburg², PSL Research University³, Pennsylvania State University⁴, Erasmus University Rotterdam⁵, Oregon Health & Science University⁶, Norwich Research Park⁷, Cleveland Clinic Lerner Research Institute⁸

02 Jul 2018-Nucleic Acids Research

TL;DR: Improvements to Galaxy's core framework, user interface, tools, and training materials enable Galaxy to be used for analyzing tens of thousands of datasets, and >5500 tools are now available from the Galaxy ToolShed.


Abstract: Galaxy (homepage: https://galaxyproject.org, main public server: https://usegalaxy.org) is a web-based scientific analysis platform used by tens of thousands of scientists across the world to analyze large biomedical datasets such as those found in genomics, proteomics, metabolomics and imaging. Started in 2005, Galaxy continues to focus on three key challenges of data-driven biomedical science: making analyses accessible to all researchers, ensuring analyses are completely reproducible, and making it simple to communicate analyses so that they can be reused and extended. During the last two years, the Galaxy team and the open-source community around Galaxy have made substantial improvements to Galaxy's core framework, user interface, tools, and training materials. Framework and user interface improvements now enable Galaxy to be used for analyzing tens of thousands of datasets, and >5500 tools are now available from the Galaxy ToolShed. The Galaxy community has led an effort to create numerous high-quality tutorials focused on common types of genomic analyses. The Galaxy developer and user communities continue to grow and be integral to Galaxy's development. The number of Galaxy public servers, developers contributing to the Galaxy framework and its tools, and users of the main Galaxy server have all increased substantially.


2,601 citations

Cites background from "Moderated estimation of fold change..."

...Examples of new tools include: GEMINI for exploring genetic variation (12); mothur for analyzing rRNA gene sequences (13); QIIME for quantitative microbiome analysis from raw DNA sequencing data (14); deepTools for explorative analysis of deeply sequence data (15,16); HiCexplorer (17) for analysis and visualization of Hi-C data; ChemicalToolBox for comprehensive access to cheminformatics libraries and drug discovery tools (18); minimap2 (https://arxiv.org/abs/ 1708.01492) and poretools for long read sequencing analysis (19); MultiQC (20) to aggregate multiple results into a single report; a new RNA-seq analysis tool suite with modern analysis tools such as Kallisto (21), Salmon (22), Deseq2 (23) and STAR-Fusion (24), and GenomeSpace (25), a cloud-based interoperability tool....


References


Journal Article•DOI•

Moderated statistical tests for assessing differences in tag abundance

Mark D. Robinson¹, Gordon K. Smyth¹

Walter and Eliza Hall Institute of Medical Research¹

01 Nov 2007-Bioinformatics

TL;DR: Tests of differential expression for replicated DGE data are developed, using the negative binomial distribution to model overdispersion relative to the Poisson and conditional weighted likelihood to moderate the level of overdispersion across genes.

Abstract: Motivation: Digital gene expression (DGE) technologies measure gene expression by counting sequence tags. They are sensitive technologies for measuring gene expression on a genomic scale, without the need for prior knowledge of the genome sequence. As the cost of sequencing DNA decreases, the number of DGE datasets is expected to grow dramatically. Various tests of differential expression have been proposed for replicated DGE data using binomial, Poisson, negative binomial or pseudo-likelihood (PL) models for the counts, but none of these are usable when the number of replicates is very small. Results: We develop tests using the negative binomial distribution to model overdispersion relative to the Poisson, and use conditional weighted likelihood to moderate the level of overdispersion across genes. Not only is our strategy applicable even with the smallest number of libraries, but it also proves to be more powerful than previous strategies when more libraries are available. The methodology is equally applicable to other counting technologies, such as proteomic spectral counts. Availability: An R package can be accessed from http://bioinf.wehi.edu.au/resources/ Contact: smyth@wehi.edu.au Supplementary information: http://bioinf.wehi.edu.au/resources/

856 citations

"Moderated estimation of fold change..." refers methods in this paper

...edgeR [2, 3] moderates the dispersion estimate for each gene toward a common estimate across all genes, or toward a local estimate from genes with similar expression strength, using a weighted conditional likelihood....

Journal Article•DOI•

On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data.

Michael A. Newton¹, Christina Kendziorski, Craig S. Richmond, Frederick R. Blattner, Kam-Wah Tsui

University of Wisconsin-Madison¹

01 Jan 2001-Journal of Computational Biology

TL;DR: This work considers the problem of inferring fold changes in gene expression from cDNA microarray data and derives estimates of gene expression changes within a simple hierarchical model that accounts for measurement error and fluctuations in absolute gene expression levels.


Abstract: We consider the problem of inferring fold changes in gene expression from cDNA microarray data. Standard procedures focus on the ratio of measured fluorescent intensities at each spot on the microarray, but to do so is to ignore the fact that the variation of such ratios is not constant. Estimates of gene expression changes are derived within a simple hierarchical model that accounts for measurement error and fluctuations in absolute gene expression levels. Significant gene expression changes are identified by deriving the posterior odds of change within a similar model. The methods are tested via simulation and are applied to a panel of Escherichia coli microarrays.


795 citations

Journal Article•DOI•

baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data

Thomas J. Hardcastle¹, Krystyna A. Kelly¹

University of Cambridge¹

10 Aug 2010-BMC Bioinformatics

TL;DR: A framework for defining patterns of differential expression is proposed and a novel algorithm, baySeq, is developed, which uses an empirical Bayes approach to detect these patterns of differential expression within a set of sequencing samples.


Abstract: High throughput sequencing has become an important technology for studying expression levels in many types of genomic, and particularly transcriptomic, data. One key way of analysing such data is to look for elements of the data which display particular patterns of differential expression in order to take these forward for further analysis and validation. We propose a framework for defining patterns of differential expression and develop a novel algorithm, baySeq, which uses an empirical Bayes approach to detect these patterns of differential expression within a set of sequencing samples. The method assumes a negative binomial distribution for the data and derives an empirically determined prior distribution from the entire dataset. We examine the performance of the method on real and simulated data. Our method performs at least as well, and often better, than existing methods for analyses of pairwise differential expression in both real and simulated data. When we compare methods for the analysis of data from experimental designs involving multiple sample groups, our method again shows substantial gains in performance. We believe that this approach thus represents an important step forward for the analysis of count data from sequencing experiments.


792 citations

"Moderated estimation of fold change..." refers methods in this paper

...BaySeq [7] and ShrinkBayes [8] estimate priors for a Bayesian model over all genes, and then provide posterior probabilities or false discovery rates for the case of differential expression....

Journal Article•DOI•

Therapeutic targeting of BET bromodomain proteins in castration-resistant prostate cancer

Irfan A. Asangani¹, Vijaya L. Dommeti¹, Xiaoju Wang¹, Rohit Malik¹, Marcin Cieslik¹, Rendong Yang², June Escara-Wilke¹, Kari Wilder-Romans¹, Sudheer Dhanireddy¹, Carl G. Engelke¹, Mathew K. Iyer¹, Xiaojun Jing¹, Yi-Mi Wu¹, Xuhong Cao¹, Zhaohui S. Qin², Shaomeng Wang¹, Felix Y. Feng¹, Arul M. Chinnaiyan¹

University of Michigan¹, Emory University²

12 Jun 2014-Nature

TL;DR: It is shown that AR-signalling-competent human CRPC cell lines are preferentially sensitive to bromodomain and extraterminal (BET) inhibition, which provides a novel epigenetic approach for the concerted blockade of oncogenic drivers in advanced prostate cancer.

Abstract: Men who develop metastatic castration-resistant prostate cancer (CRPC) invariably succumb to the disease. Progression to CRPC after androgen ablation therapy is predominantly driven by deregulated androgen receptor (AR) signalling. Despite the success of recently approved therapies targeting AR signalling, such as abiraterone and second-generation anti-androgens including MDV3100 (also known as enzalutamide), durable responses are limited, presumably owing to acquired resistance. Recently, JQ1 and I-BET762, two selective small-molecule inhibitors that target the amino-terminal bromodomains of BRD4, have been shown to exhibit anti-proliferative effects in a range of malignancies. Here we show that AR-signalling-competent human CRPC cell lines are preferentially sensitive to bromodomain and extraterminal (BET) inhibition. BRD4 physically interacts with the N-terminal domain of AR and can be disrupted by JQ1 (refs 11, 13). Like the direct AR antagonist MDV3100, JQ1 disrupted AR recruitment to target gene loci. By contrast with MDV3100, JQ1 functions downstream of AR, and more potently abrogated BRD4 localization to AR target loci and AR-mediated gene transcription, including induction of the TMPRSS2-ERG gene fusion and its oncogenic activity. In vivo, BET bromodomain inhibition was more efficacious than direct AR antagonism in CRPC xenograft mouse models. Taken together, these studies provide a novel epigenetic approach for the concerted blockade of oncogenic drivers in advanced prostate cancer.

784 citations

"Moderated estimation of fold change..." refers background in this paper

..., [39]; see also the DiffBind package [40, 41]), barcode-based assays (e....

Journal Article•

Replicated microarray data


Ingrid Lönnstedt

01 Jan 2001-Statistica Sinica

TL;DR: This paper presents an empirical Bayes method for analysing replicated microarray data and presents the results of a simulation study estimating the ROC curve of B and three other statistics for determining differential expression: the average and two simple modifications of the usual t-statistic.


Abstract: cDNA microarrays permit us to study the expression of thousands of genes simultaneously. They are now used in many different contexts to compare mRNA levels between two or more samples of cells. Microarray experiments typically give us expression measurements on a large number of genes, say 10,000-20,000, but with few, if any, replicates for each gene. Traditional methods using means and standard deviations to detect differential expression are not completely satisfactory in this context, and so a different approach seems desirable. In this paper we present an empirical Bayes method for analysing replicated microarray data. Data from all the genes in a replicate set of experiments are combined into estimates of parameters of a prior distribution. These parameter estimates are then combined at the gene level with means and standard deviations to form a statistic B which can be used to decide whether differential expression has occurred. The statistic B avoids the problems of using averages or t-statistics. The method is illustrated using data from an experiment comparing the expression of genes in the livers of SR-BI transgenic mice with that of the corresponding wild-type mice. In addition we present the results of a simulation study estimating the ROC curve of B and three other statistics for determining differential expression: the average and two simple modifications of the usual t-statistic. B was found to be the most powerful of the four, though the margin was not great. The data were simulated to resemble the SR-BI data.


737 citations

"Moderated estimation of fold change..." refers background in this paper

...In high-throughput assays, this limitation can be overcome by pooling information across genes; specifically, by exploiting assumptions about the similarity of the variances of different genes measured in the same experiment [1]....