Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

doi:10.1186/S13059-014-0550-8

Journal Article•DOI•

Michael I. Love¹,², Wolfgang Huber, Simon Anders•Institutions (2)

Harvard University¹, Max Planck Society²

05 Dec 2014-Genome Biology (BioMed Central)-Vol. 15, Iss: 12, p. 550

TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.

Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html .
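
The idea of shrinkage mentioned in the abstract can be illustrated with a toy calculation. The sketch below is not the DESeq2 algorithm: it assumes a simple normal-normal model in which a noisy estimate is pulled toward a prior mean by a precision-weighted average. The function name `shrink_toward_prior` and all numbers are hypothetical and chosen only for illustration.

```python
def shrink_toward_prior(estimate, sampling_var, prior_mean, prior_var):
    """Precision-weighted average of a noisy estimate and a prior mean.

    Assumption: normal likelihood and normal prior (a toy stand-in,
    not the estimator used by DESeq2).
    """
    if sampling_var <= 0 or prior_var <= 0:
        raise ValueError("variances must be positive")
    # Weight on the data: close to 1 when the estimate is precise,
    # close to 0 when it is noisy.
    weight = prior_var / (prior_var + sampling_var)
    return weight * estimate + (1.0 - weight) * prior_mean


# Two hypothetical genes with the same raw log2 fold change of 2.0,
# one measured precisely and one measured noisily.
well_measured = shrink_toward_prior(2.0, sampling_var=0.1, prior_mean=0.0, prior_var=1.0)
noisy = shrink_toward_prior(2.0, sampling_var=4.0, prior_mean=0.0, prior_var=1.0)

print(f"well measured: {well_measured:.3f}")
print(f"noisy:         {noisy:.3f}")
```

Under these assumptions the noisy estimate is moved much closer to zero than the precise one, which is the qualitative behavior the abstract describes as improving stability and interpretability.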

Citations

Journal Article•DOI•

HTSeq—a Python framework to work with high-throughput sequencing data

Simon Anders, Paul Theodor Pyl, Wolfgang Huber

15 Jan 2015-Bioinformatics

TL;DR: This work presents HTSeq, a Python library to facilitate the rapid development of custom scripts for high-throughput sequencing data analysis, and presents htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes.

Abstract: Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard workflows, custom scripts are needed. Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data, such as genomic coordinates, sequences, sequencing reads, alignments, gene model information and variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes. Availability and implementation: HTSeq is released as open-source software under the GNU General Public Licence and available from http://www-huber.embl.de/HTSeq or from the Python Package Index at https://pypi.python.org/pypi/HTSeq. Contact: sanders@fs.tum.de

15,744 citations

Journal Article•DOI•

Comprehensive Integration of Single-Cell Data.

Tim Stuart, Andrew Butler¹, Paul J. Hoffman, Christoph Hafemeister, Efthymia Papalexi¹, William M. Mauck¹, Yuhan Hao¹, Marlon Stoeckius², Peter Smibert², Rahul Satija¹•Institutions (2)

New York University¹, Harvard University²

13 Jun 2019-Cell

TL;DR: A strategy to "anchor" diverse datasets together, enabling us to integrate single-cell measurements not only across scRNA-seq technologies, but also across different modalities.

7,892 citations

Cites methods from "Moderated estimation of fold change..."

...To identify differentially-expressed genes between the CD69+ and CD69- sorted populations, we used DESeq2 [Love et al., 2014] and filtered for significant genes with a log2-fold change in expression greater than 1.5 and a q-value of less than 0.01 [Storey and Tibshirani, 2003]....
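
The filtering step quoted above can be sketched as follows. This is a minimal illustration, not the citing authors' code: the `GeneResult` record, the `filter_significant` helper, and the example genes are hypothetical, and the thresholds follow the excerpt's wording literally (log2 fold change greater than 1.5, q-value less than 0.01).

```python
from dataclasses import dataclass


@dataclass
class GeneResult:
    """Hypothetical per-gene record from a differential expression table."""
    gene: str
    log2_fold_change: float
    q_value: float


def filter_significant(results, min_lfc=1.5, max_q=0.01):
    """Keep genes whose log2 fold change exceeds min_lfc and whose
    q-value is below max_q (strict inequalities, as in the excerpt)."""
    return [
        r for r in results
        if r.log2_fold_change > min_lfc and r.q_value < max_q
    ]


# Made-up example values, for illustration only.
table = [
    GeneResult("geneA", 2.3, 0.001),   # passes both thresholds
    GeneResult("geneB", 1.2, 0.0005),  # fold change too small
    GeneResult("geneC", 3.1, 0.2),     # q-value too large
]

selected = [r.gene for r in filter_significant(table)]
print(selected)
```

Whether down-regulated genes should also be kept (by thresholding the absolute log2 fold change) is not stated in the excerpt, so the sketch does not assume it.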
[...]

Journal Article•DOI•

Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown

Mihaela Pertea¹, Daehwan Kim¹, Geo Pertea¹, Jeffrey T. Leek¹, Steven L. Salzberg¹•Institutions (1)

Johns Hopkins University¹

01 Sep 2016-Nature Protocols

TL;DR: This protocol describes all the steps necessary to process a large set of raw sequencing reads and create lists of gene transcripts, expression levels, and differentially expressed genes and transcripts.

Abstract: High-throughput sequencing of mRNA (RNA-seq) has become the standard method for measuring and comparing the levels of gene expression in a wide variety of species and conditions. RNA-seq experiments generate very large, complex data sets that demand fast, accurate and flexible software to reduce the raw read data to comprehensible results. HISAT (hierarchical indexing for spliced alignment of transcripts), StringTie and Ballgown are free, open-source software tools for comprehensive analysis of RNA-seq experiments. Together, they allow scientists to align reads to a genome, assemble transcripts including novel splice variants, compute the abundance of these transcripts in each sample and compare experiments to identify differentially expressed genes and transcripts. This protocol describes all the steps necessary to process a large set of raw sequencing reads and create lists of gene transcripts, expression levels, and differentially expressed genes and transcripts. The protocol's execution time depends on the computing resources, but it typically takes under 45 min of computer time. HISAT, StringTie and Ballgown are available from http://ccb.jhu.edu/software.shtml.

3,755 citations

Journal Article•DOI•

Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19.

Daniel Blanco-Melo¹, Benjamin E. Nilsson-Payant¹, Wen-Chun Liu¹, Skyler Uhl¹, Daisy A. Hoagland¹, Rasmus Møller¹, Tristan X. Jordan¹, Kohei Oishi¹, Maryline Panis¹, David H. Sachs¹, Taia T. Wang², Robert E. Schwartz³, Jean K. Lim¹, Randy A. Albrecht¹, Benjamin R. tenOever¹•Institutions (3)

Icahn School of Medicine at Mount Sinai¹, Stanford University², Cornell University³

28 May 2020-Cell

TL;DR: It is proposed that reduced innate antiviral defenses coupled with exuberant inflammatory cytokine production are the defining and driving features of COVID-19.

3,286 citations

Cites background or methods from "Moderated estimation of fold change..."

...1.10 | Ilumina | http://basespace.illumina.com/dashboard
DESeq2 | Love et al., 2014 | https://bioconductor.org/packages/release/bioc/html/DESeq2.html
STRING | Szklarczyk et al., 2019 | https://string-db.org/
gplots | CRAN | https://cran.r-project.org/web/packages/gplots/index.html
PMA | Witten et al., 2009 | https://cran.r-project.org/web/packages/PMA/index.html
ggplot2 | Tidyverse | https://ggplot2.tidyverse.org/
Bowtie2 | Langmead and Salzberg, 2012 | http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
ImmGen | Yoshida et al., 2019 | http://www.immgen.org/ ...
[...]
...Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2....
[...]
...Raw reads were aligned to the human genome (hg19) using the RNA-Seq Alignment App on Basespace (Illumina, CA), following differential expression analysis using DESeq2 (Love et al., 2014)....
[...]

Journal Article•DOI•

The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update

Enis Afgan¹, Dannon Baker¹, Bérénice Batut², Marius van den Beek³, Dave Bouvier⁴, Martin Čech⁴, John Chilton⁴, Dave Clements¹, Nate Coraor⁴, Björn Grüning², Aysam Guerler¹, Jennifer Hillman-Jackson⁴, Saskia Hiltemann⁵, Vahid Jalili⁶, Helena Rasche², Nicola Soranzo⁷, Jeremy Goecks⁶, James Taylor¹, Anton Nekrutenko⁴, Daniel Blankenberg⁸•Institutions (8)

Johns Hopkins University¹, University of Freiburg², PSL Research University³, Pennsylvania State University⁴, Erasmus University Rotterdam⁵, Oregon Health & Science University⁶, Norwich Research Park⁷, Cleveland Clinic Lerner Research Institute⁸

02 Jul 2018-Nucleic Acids Research

TL;DR: Improvements to Galaxy's core framework, user interface, tools, and training materials enable Galaxy to be used for analyzing tens of thousands of datasets, and >5500 tools are now available from the Galaxy ToolShed.

Abstract: Galaxy (homepage: https://galaxyproject.org, main public server: https://usegalaxy.org) is a web-based scientific analysis platform used by tens of thousands of scientists across the world to analyze large biomedical datasets such as those found in genomics, proteomics, metabolomics and imaging. Started in 2005, Galaxy continues to focus on three key challenges of data-driven biomedical science: making analyses accessible to all researchers, ensuring analyses are completely reproducible, and making it simple to communicate analyses so that they can be reused and extended. During the last two years, the Galaxy team and the open-source community around Galaxy have made substantial improvements to Galaxy's core framework, user interface, tools, and training materials. Framework and user interface improvements now enable Galaxy to be used for analyzing tens of thousands of datasets, and >5500 tools are now available from the Galaxy ToolShed. The Galaxy community has led an effort to create numerous high-quality tutorials focused on common types of genomic analyses. The Galaxy developer and user communities continue to grow and be integral to Galaxy's development. The number of Galaxy public servers, developers contributing to the Galaxy framework and its tools, and users of the main Galaxy server have all increased substantially.

2,601 citations

Cites background from "Moderated estimation of fold change..."

...Examples of new tools include: GEMINI for exploring genetic variation (12); mothur for analyzing rRNA gene sequences (13); QIIME for quantitative microbiome analysis from raw DNA sequencing data (14); deepTools for explorative analysis of deeply sequence data (15,16); HiCexplorer (17) for analysis and visualization of Hi-C data; ChemicalToolBox for comprehensive access to cheminformatics libraries and drug discovery tools (18); minimap2 (https://arxiv.org/abs/1708.01492) and poretools for long read sequencing analysis (19); MultiQC (20) to aggregate multiple results into a single report; a new RNA-seq analysis tool suite with modern analysis tools such as Kallisto (21), Salmon (22), Deseq2 (23) and STAR-Fusion (24), and GenomeSpace (25), a cloud-based interoperability tool....
[...]

References

Book•

Generalized Linear Models

John H. Schuenemeyer, Peter McCullagh, John A. Nelder

25 Aug 2003

10,034 citations

Journal Article•DOI•

voom: precision weights unlock linear model analysis tools for RNA-seq read counts

Charity W. Law¹,², Yunshun Chen¹,², Wei Shi¹,², Gordon K. Smyth¹,²•Institutions (2)

Walter and Eliza Hall Institute of Medical Research¹, University of Melbourne²

03 Feb 2014-Genome Biology

TL;DR: New normal linear modeling strategies are presented for analyzing read counts from RNA-seq experiments, and the voom method estimates the mean-variance relationship of the log-counts, generates a precision weight for each observation and enters these into the limma empirical Bayes analysis pipeline.

Abstract: New normal linear modeling strategies are presented for analyzing read counts from RNA-seq experiments. The voom method estimates the mean-variance relationship of the log-counts, generates a precision weight for each observation and enters these into the limma empirical Bayes analysis pipeline. This opens access for RNA-seq analysts to a large body of methodology developed for microarrays. Simulation studies show that voom performs as well or better than count-based RNA-seq methods even when the data are generated according to the assumptions of the earlier methods. Two case studies illustrate the use of linear modeling and gene set testing methods.

4,475 citations

"Moderated estimation of fold change..." refers methods in this paper

...Other methods compared were the voom normalization method followed by linear modeling using the limma package [35] and the SAMseq permutation method of the samr package [23]....
[...]

Journal Article•DOI•

Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation

Davis J. McCarthy¹, Yunshun Chen¹, Gordon K. Smyth¹•Institutions (1)

Walter and Eliza Hall Institute of Medical Research¹

01 May 2012-Nucleic Acids Research

TL;DR: A flexible statistical framework is developed for the analysis of read counts from RNA-Seq gene expression studies, and parallel computational approaches are developed to make non-linear model fitting faster and more reliable, making the application of GLMs to genomic data more convenient and practical.

Abstract: A flexible statistical framework is developed for the analysis of read counts from RNA-Seq gene expression studies. It provides the ability to analyse complex experiments involving multiple treatment conditions and blocking variables while still taking full account of biological variation. Biological variation between RNA samples is estimated separately from the technical variation associated with sequencing technologies. Novel empirical Bayes methods allow each gene to have its own specific variability, even when there are relatively few biological replicates from which to estimate such variability. The pipeline is implemented in the edgeR package of the Bioconductor project. A case study analysis of carcinoma data demonstrates the ability of generalized linear model methods (GLMs) to detect differential expression in a paired design, and even to detect tumour-specific expression changes. The case study demonstrates the need to allow for gene-specific variability, rather than assuming a common dispersion across genes or a fixed relationship between abundance and variability. Genewise dispersions de-prioritize genes with inconsistent results and allow the main analysis to focus on changes that are consistent between biological replicates. Parallel computational approaches are developed to make non-linear model fitting faster and more reliable, making the application of GLMs to genomic data more convenient and practical. Simulations demonstrate the ability of adjusted profile likelihood estimators to return accurate estimators of biological variability in complex situations. When variation is gene-specific, empirical Bayes estimators provide an advantageous compromise between the extremes of assuming common dispersion or separate genewise dispersion. The methods developed here can also be applied to count data arising from DNA-Seq applications, including ChIP-Seq for epigenetic marks and DNA methylation analyses.

4,127 citations

Journal Article•DOI•

Differential analysis of gene regulation at transcript resolution with RNA-seq

Cole Trapnell¹, David G. Hendrickson¹,², Martin Sauvageau¹,², Loyal A. Goff¹,², John L. Rinn¹,², Lior Pachter³•Institutions (3)

Harvard University¹, Massachusetts Institute of Technology², University of California, Berkeley³

01 Jan 2013-Nature Biotechnology

TL;DR: Cuffdiff 2, an algorithm that estimates expression at transcript-level resolution and controls for variability evident across replicate libraries, robustly identifies differentially expressed transcripts and genes and reveals differential splicing and promoter-preference changes.

Abstract: Differential analysis of gene and transcript expression using high-throughput RNA sequencing (RNA-seq) is complicated by several sources of measurement variability and poses numerous statistical challenges. We present Cuffdiff 2, an algorithm that estimates expression at transcript-level resolution and controls for variability evident across replicate libraries. Cuffdiff 2 robustly identifies differentially expressed transcripts and genes and reveals differential splicing and promoter-preference changes. We demonstrate the accuracy of our approach through differential analysis of lung fibroblasts in response to loss of the developmental transcription factor HOXA1, which we show is required for lung fibroblast and HeLa cell cycle progression. Loss of HOXA1 results in significant expression level changes in thousands of individual transcripts, along with isoform switching events in key regulators of the cell cycle. Cuffdiff 2 performs robust differential analysis in RNA-seq experiments at transcript resolution, revealing a layer of regulation not readily observable with other high-throughput technologies.

3,319 citations

"Moderated estimation of fold change..." refers background or methods in this paper

...For the benchmarks using real data, the Cuffdiff 2 [27] method of the Cufflinks suite was included....
[...]
...In contrast, several algorithms [27, 28] work with probabilistic assignments of reads to transcripts, where multiple, overlapping transcripts can originate from each gene....
[...]
...It has been noted that the total read count approach can result in false detection of differential expression when in fact only transcript isoform lengths change, and even in a wrong sign of LFCs in extreme cases [27]....
[...]

Journal Article•DOI•

Software for computing and annotating genomic ranges.

Michael F. Lawrence¹, Wolfgang Huber², Hervé Pagès³, Patrick Aboyoun³, Marc R. J. Carlson³, Robert Gentleman¹, Martin Morgan³, Vincent J. Carey⁴•Institutions (4)

Genentech¹, European Bioinformatics Institute², Fred Hutchinson Cancer Research Center³, Brigham and Women's Hospital⁴

08 Aug 2013-PLOS Computational Biology

TL;DR: This work describes Bioconductor infrastructure for representing and computing on annotated genomic ranges and integrating genomic data with the statistical computing features of R and its extensions, including those for sequence analysis, differential expression analysis and visualization.

Abstract: We describe Bioconductor infrastructure for representing and computing on annotated genomic ranges and integrating genomic data with the statistical computing features of R and its extensions. At the core of the infrastructure are three packages: IRanges, GenomicRanges, and GenomicFeatures. These packages provide scalable data structures for representing annotated ranges on the genome, with special support for transcript structures, read alignments and coverage vectors. Computational facilities include efficient algorithms for overlap and nearest neighbor detection, coverage calculation and other range operations. This infrastructure directly supports more than 80 other Bioconductor packages, including those for sequence analysis, differential expression analysis and visualization.

3,005 citations

"Moderated estimation of fold change..." refers methods in this paper

...The count matrix and metadata including the gene model and sample information are stored in an S4 class derived from the SummarizedExperiment class of the GenomicRanges package [59]....
[...]