Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

doi:10.1186/S13059-014-0550-8

Journal Article•DOI•

Michael I. Love¹,², Wolfgang Huber, Simon Anders

Harvard University¹, Max Planck Society²

05 Dec 2014-Genome Biology (BioMed Central)-Vol. 15, Iss. 12, p. 550

TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.

Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html .
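To make the idea of fold-change shrinkage concrete, here is a minimal Python sketch. It is not DESeq2's actual procedure: it assumes a simple normal approximation with a zero-centered normal prior, and the function name `shrink_lfc`, the prior width and the two example genes are hypothetical.

```python
def shrink_lfc(lfc, se, prior_sd):
    """Posterior mean of a log2 fold change under a zero-centered normal prior.

    Assumption: normal likelihood with standard error `se` and a normal
    prior with standard deviation `prior_sd`. The raw estimate is scaled by
    prior_var / (prior_var + se^2), so noisy estimates move closer to zero.
    """
    prior_var = prior_sd ** 2
    weight = prior_var / (prior_var + se ** 2)
    return weight * lfc


# Two hypothetical genes with the same raw estimate but different precision.
well_measured = shrink_lfc(lfc=2.0, se=0.2, prior_sd=1.0)
poorly_measured = shrink_lfc(lfc=2.0, se=2.0, prior_sd=1.0)

print(round(well_measured, 3), round(poorly_measured, 3))
```

The precisely measured gene keeps most of its estimated effect, while the imprecise one is pulled strongly toward zero. This is the sense in which shrinkage can make estimates more stable and comparable across genes.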

Citations

Journal Article•DOI•

HTSeq—a Python framework to work with high-throughput sequencing data

Simon Anders, Paul Theodor Pyl, Wolfgang Huber

15 Jan 2015-Bioinformatics

TL;DR: This work presents HTSeq, a Python library to facilitate the rapid development of custom scripts for high-throughput sequencing data analysis, and presents htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes.

Abstract: Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard workflows, custom scripts are needed. Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data, such as genomic coordinates, sequences, sequencing reads, alignments, gene model information and variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes. Availability and implementation: HTSeq is released as open-source software under the GNU General Public Licence and is available from http://www-huber.embl.de/HTSeq or from the Python Package Index at https://pypi.python.org/pypi/HTSeq. Contact: sanders@fs.tum.de

15,744 citations

Journal Article•DOI•

Comprehensive Integration of Single-Cell Data.

Tim Stuart, Andrew Butler¹, Paul J. Hoffman, Christoph Hafemeister, Efthymia Papalexi¹, William M. Mauck¹, Yuhan Hao¹, Marlon Stoeckius², Peter Smibert², Rahul Satija¹

New York University¹, Harvard University²

13 Jun 2019-Cell

TL;DR: A strategy to "anchor" diverse datasets together, enabling us to integrate single-cell measurements not only across scRNA-seq technologies, but also across different modalities.

7,892 citations

Cites methods from "Moderated estimation of fold change..."

...To identify differentially-expressed genes between the CD69+ and CD69- sorted populations, we used DESeq2 [Love et al., 2014] and filtered for significant genes with a log2-fold change in expression greater than 1.5 and a q-value of less than 0.01 [Storey and Tibshirani, 2003]....
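The thresholding step described in this excerpt can be sketched in Python as follows. This is an illustration only: the `GeneResult` record, the function `filter_significant` and the example values are hypothetical, and the sketch applies the cutoff to the signed log2 fold change exactly as the excerpt words it (an analysis interested in both directions would compare the absolute value instead).

```python
from typing import NamedTuple


class GeneResult(NamedTuple):
    """One row of a differential expression result table (hypothetical layout)."""
    gene: str
    log2_fold_change: float
    q_value: float


def filter_significant(results, min_lfc=1.5, max_q=0.01):
    """Keep genes with log2 fold change > min_lfc and q-value < max_q."""
    return [
        r for r in results
        if r.log2_fold_change > min_lfc and r.q_value < max_q
    ]


# Made-up example values, not data from the cited study.
results = [
    GeneResult("geneA", 2.3, 0.001),   # passes both thresholds
    GeneResult("geneB", 1.2, 0.0005),  # effect too small
    GeneResult("geneC", 3.1, 0.2),     # not significant
    GeneResult("geneD", -2.0, 0.001),  # strong decrease, fails the signed cutoff
]
significant = filter_significant(results)
print([r.gene for r in significant])
```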

Journal Article•DOI•

Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown

Mihaela Pertea¹, Daehwan Kim¹, Geo Pertea¹, Jeffrey T. Leek¹, Steven L. Salzberg¹

Johns Hopkins University¹

01 Sep 2016-Nature Protocols

TL;DR: This protocol describes all the steps necessary to process a large set of raw sequencing reads and create lists of gene transcripts, expression levels, and differentially expressed genes and transcripts.

Abstract: High-throughput sequencing of mRNA (RNA-seq) has become the standard method for measuring and comparing the levels of gene expression in a wide variety of species and conditions. RNA-seq experiments generate very large, complex data sets that demand fast, accurate and flexible software to reduce the raw read data to comprehensible results. HISAT (hierarchical indexing for spliced alignment of transcripts), StringTie and Ballgown are free, open-source software tools for comprehensive analysis of RNA-seq experiments. Together, they allow scientists to align reads to a genome, assemble transcripts including novel splice variants, compute the abundance of these transcripts in each sample and compare experiments to identify differentially expressed genes and transcripts. This protocol describes all the steps necessary to process a large set of raw sequencing reads and create lists of gene transcripts, expression levels, and differentially expressed genes and transcripts. The protocol's execution time depends on the computing resources, but it typically takes under 45 min of computer time. HISAT, StringTie and Ballgown are available from http://ccb.jhu.edu/software.shtml.

3,755 citations

Journal Article•DOI•

Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19.

Daniel Blanco-Melo¹, Benjamin E. Nilsson-Payant¹, Wen-Chun Liu¹, Skyler Uhl¹, Daisy A. Hoagland¹, Rasmus Møller¹, Tristan X. Jordan¹, Kohei Oishi¹, Maryline Panis¹, David H. Sachs¹, Taia T. Wang², Robert E. Schwartz³, Jean K. Lim¹, Randy A. Albrecht¹, Benjamin R. tenOever¹

Icahn School of Medicine at Mount Sinai¹, Stanford University², Cornell University³

28 May 2020-Cell

TL;DR: It is proposed that reduced innate antiviral defenses coupled with exuberant inflammatory cytokine production are the defining and driving features of COVID-19.

3,286 citations

Cites background or methods from "Moderated estimation of fold change..."

...1.10 | Illumina | http://basespace.illumina.com/dashboard
DESeq2 | Love et al., 2014 | https://bioconductor.org/packages/release/bioc/html/DESeq2.html
STRING | Szklarczyk et al., 2019 | https://string-db.org/
gplots | CRAN | https://cran.r-project.org/web/packages/gplots/index.html
PMA | Witten et al., 2009 | https://cran.r-project.org/web/packages/PMA/index.html
ggplot2 | Tidyverse | https://ggplot2.tidyverse.org/
Bowtie2 | Langmead and Salzberg, 2012 | http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
ImmGen | Yoshida et al., 2019 | http://www.immgen.org/
ll...
[...]
...Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2....
[...]
...Raw reads were aligned to the human genome (hg19) using the RNA-Seq Alignment App on Basespace (Illumina, CA), followed by differential expression analysis using DESeq2 (Love et al., 2014)....
[...]

Journal Article•DOI•

The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update

Enis Afgan¹, Dannon Baker¹, Bérénice Batut², Marius van den Beek³, Dave Bouvier⁴, Martin Čech⁴, John Chilton⁴, Dave Clements¹, Nate Coraor⁴, Björn Grüning², Aysam Guerler¹, Jennifer Hillman-Jackson⁴, Saskia Hiltemann⁵, Vahid Jalili⁶, Helena Rasche², Nicola Soranzo⁷, Jeremy Goecks⁶, James Taylor¹, Anton Nekrutenko⁴, Daniel Blankenberg⁸

Johns Hopkins University¹, University of Freiburg², PSL Research University³, Pennsylvania State University⁴, Erasmus University Rotterdam⁵, Oregon Health & Science University⁶, Norwich Research Park⁷, Cleveland Clinic Lerner Research Institute⁸

02 Jul 2018-Nucleic Acids Research

TL;DR: Improvements to Galaxy's core framework, user interface, tools, and training materials enable Galaxy to be used for analyzing tens of thousands of datasets, and >5500 tools are now available from the Galaxy ToolShed.

Abstract: Galaxy (homepage: https://galaxyproject.org, main public server: https://usegalaxy.org) is a web-based scientific analysis platform used by tens of thousands of scientists across the world to analyze large biomedical datasets such as those found in genomics, proteomics, metabolomics and imaging. Started in 2005, Galaxy continues to focus on three key challenges of data-driven biomedical science: making analyses accessible to all researchers, ensuring analyses are completely reproducible, and making it simple to communicate analyses so that they can be reused and extended. During the last two years, the Galaxy team and the open-source community around Galaxy have made substantial improvements to Galaxy's core framework, user interface, tools, and training materials. Framework and user interface improvements now enable Galaxy to be used for analyzing tens of thousands of datasets, and >5500 tools are now available from the Galaxy ToolShed. The Galaxy community has led an effort to create numerous high-quality tutorials focused on common types of genomic analyses. The Galaxy developer and user communities continue to grow and be integral to Galaxy's development. The number of Galaxy public servers, developers contributing to the Galaxy framework and its tools, and users of the main Galaxy server have all increased substantially.

2,601 citations

Cites background from "Moderated estimation of fold change..."

...Examples of new tools include: GEMINI for exploring genetic variation (12); mothur for analyzing rRNA gene sequences (13); QIIME for quantitative microbiome analysis from raw DNA sequencing data (14); deepTools for explorative analysis of deeply sequenced data (15,16); HiCexplorer (17) for analysis and visualization of Hi-C data; ChemicalToolBox for comprehensive access to cheminformatics libraries and drug discovery tools (18); minimap2 (https://arxiv.org/abs/1708.01492) and poretools for long read sequencing analysis (19); MultiQC (20) to aggregate multiple results into a single report; a new RNA-seq analysis tool suite with modern analysis tools such as Kallisto (21), Salmon (22), DESeq2 (23) and STAR-Fusion (24), and GenomeSpace (25), a cloud-based interoperability tool....
[...]

References

Journal Article•DOI•

GC-content normalization for RNA-Seq data.

Davide Risso¹, Katja Schwartz², Gavin Sherlock², Sandrine Dudoit³

University of Padua¹, Stanford University², University of California, Berkeley³

17 Dec 2011-BMC Bioinformatics

TL;DR: The authors' within-lane normalization procedures, followed by between-lane normalization, reduce GC-content bias and lead to more accurate estimates of expression fold-changes and tests of differential expression.

Abstract: Transcriptome sequencing (RNA-Seq) has become the assay of choice for high-throughput studies of gene expression. However, as is the case with microarrays, major technology-related artifacts and biases affect the resulting expression measures. Normalization is therefore essential to ensure accurate inference of expression levels and subsequent analyses thereof. We focus on biases related to GC-content and demonstrate the existence of strong sample-specific GC-content effects on RNA-Seq read counts, which can substantially bias differential expression analysis. We propose three simple within-lane gene-level GC-content normalization approaches and assess their performance on two different RNA-Seq datasets, involving different species and experimental designs. Our methods are compared to state-of-the-art normalization procedures in terms of bias and mean squared error for expression fold-change estimation and in terms of Type I error and p-value distributions for tests of differential expression. The exploratory data analysis and normalization methods proposed in this article are implemented in the open-source Bioconductor R package EDASeq. Our within-lane normalization procedures, followed by between-lane normalization, reduce GC-content bias and lead to more accurate estimates of expression fold-changes and tests of differential expression. Such results are crucial for the biological interpretation of RNA-Seq experiments, where downstream analyses can be sensitive to the supplied lists of genes.

714 citations

"Moderated estimation of fold change..." refers to background or methods in this paper

..., using cqn [12] or EDASeq [13]), which may differ from gene to gene....
[...]
...However, it can be advantageous to calculate gene-specific normalization factors s_ij to account for further sources of technical biases such as GC content, gene length or the like, using published methods [12, 13], and these can be supplied instead....
[...]
...Alternatively, the user can supply normalization constants s_ij calculated using other methods (e.g., using cqn [13] or EDASeq [14]), which may differ from gene to gene....
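The role of these normalization factors can be written out as a short formula. This is a sketch that assumes the negative binomial model used in the DESeq2 paper; the symbols K_ij (read count for gene i in sample j), q_ij (a quantity proportional to the true expression level) and α_i (gene-wise dispersion) are not defined in the excerpts quoted here.

```latex
K_{ij} \sim \mathrm{NB}\!\left(\mu_{ij},\, \alpha_i\right),
\qquad
\mu_{ij} = s_{ij}\, q_{ij}
```

With sample-level size factors, s_ij = s_j for every gene i. Gene-specific factors such as those from cqn or EDASeq simply replace that constant with a value that may vary from gene to gene.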
[...]

Journal Article•DOI•

High-throughput screening of a CRISPR/Cas9 library for functional genomics in human cells

Yuexin Zhou¹, Shiyou Zhu¹, Changzu Cai¹, Pengfei Yuan¹, Chunmei Li¹, Yanyi Huang¹, Wensheng Wei¹

Peking University¹

22 May 2014-Nature

TL;DR: The development of a focused CRISPR/Cas-based (clustered regularly interspaced short palindromic repeats/CRISPR-associated) lentiviral library in human cells and a method of gene identification based on functional screening and high-throughput sequencing analysis are reported.

Abstract: Targeted genome editing technologies are powerful tools for studying biology and disease, and have a broad range of research applications. In contrast to the rapid development of toolkits to manipulate individual genes, large-scale screening methods based on the complete loss of gene expression are only now beginning to be developed. Here we report the development of a focused CRISPR/Cas-based (clustered regularly interspaced short palindromic repeats/CRISPR-associated) lentiviral library in human cells and a method of gene identification based on functional screening and high-throughput sequencing analysis. Using knockout library screens, we successfully identified the host genes essential for the intoxication of cells by anthrax and diphtheria toxins, which were confirmed by functional validation. The broad application of this powerful genetic screening strategy will not only facilitate the rapid identification of genes important for bacterial toxicity but will also enable the discovery of genes that participate in other biological processes.

695 citations

"Moderated estimation of fold change..." refers to background in this paper

..., [43]), ribosome profiling [44] and CRISPR/Cas-library assays [45]....
[...]

Journal Article•DOI•

Independent filtering increases detection power for high-throughput experiments

Richard Bourgon¹, Robert Gentleman, Wolfgang Huber

European Bioinformatics Institute¹

25 May 2010-Proceedings of the National Academy of Sciences of the United States of America

TL;DR: In an application to microarray data, it was found that gene-by-gene filtering by overall variance followed by a t-test increased the number of discoveries by 50%, and it was shown that this particular statistic pair induces a lower bound on fold-change among the set of discoveries.

Abstract: With high-dimensional data, variable-by-variable statistical testing is often used to select variables whose behavior differs across conditions. Such an approach requires adjustment for multiple testing, which can result in low statistical power. A two-stage approach that first filters variables by a criterion independent of the test statistic, and then only tests variables which pass the filter, can provide higher power. We show that use of some filter/test statistics pairs presented in the literature may, however, lead to loss of type I error control. We describe other pairs which avoid this problem. In an application to microarray data, we found that gene-by-gene filtering by overall variance followed by a t-test increased the number of discoveries by 50%. We also show that this particular statistic pair induces a lower bound on fold-change among the set of discoveries. Independent filtering—using filter/test pairs that are independent under the null hypothesis but correlated under the alternative—is a general approach that can substantially increase the efficiency of experiments.

693 citations

"Moderated estimation of fold change..." refers to background in this paper

...However, the loss can be reduced if genes are omitted from the testing that have little or no chance of being detected as differentially expressed, provided that the criterion for omission is independent of the test statistic under the null [21] (see Methods)....
[...]
...Independent filtering. Independent filtering does not compromise type-I error control as long as the distribution of the test statistic is marginally independent of the filter statistic under the null hypothesis [21], and we argue in the following that this is the case in our application....
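The two excerpts above describe filtering before multiple-testing correction. A minimal Python sketch of that idea follows, assuming the filter statistic is the mean count per gene and the correction is Benjamini-Hochberg. The function names, the fixed threshold of 10 and the example numbers are hypothetical, and this is not the DESeq2 implementation.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def filter_then_adjust(mean_counts, p_values, min_mean):
    """Adjust only genes whose mean count passes the filter; others get None."""
    keep = [i for i, m in enumerate(mean_counts) if m >= min_mean]
    adjusted_kept = benjamini_hochberg([p_values[i] for i in keep])
    out = [None] * len(p_values)
    for i, adj in zip(keep, adjusted_kept):
        out[i] = adj
    return out


# Made-up example: five genes, two of them with very low counts.
mean_counts = [500.0, 2.0, 120.0, 1.0, 80.0]
p_values = [0.001, 0.6, 0.012, 0.9, 0.04]

unfiltered = benjamini_hochberg(p_values)
filtered = filter_then_adjust(mean_counts, p_values, min_mean=10.0)

hits_unfiltered = sum(a < 0.05 for a in unfiltered)
hits_filtered = sum(a is not None and a < 0.05 for a in filtered)
print(hits_unfiltered, hits_filtered)
```

In this toy example, dropping the two low-count genes reduces the multiple-testing burden, so one more gene falls below an adjusted threshold of 0.05. The filter uses only the mean count, never the p-values themselves, which is the kind of independence condition the excerpts require.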
[...]

Journal Article•DOI•

Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms

Rob Patro¹, Stephen M. Mount², Carl Kingsford¹

Carnegie Mellon University¹, University of Maryland, College Park²

01 May 2014-Nature Biotechnology

TL;DR: Sailfish, a computational method for quantifying the abundance of previously annotated RNA isoforms from RNA-seq data, exemplifies the potential of lightweight algorithms for efficiently processing sequencing reads.

Abstract: A new algorithm speeds up the quantification of transcripts from RNA-seq data by doing away with read mapping.

612 citations

"Moderated estimation of fold change..." refers to background or methods in this paper

...12% ## low counts [2] : 3152, 27% ## (mean count < 6) ## [1] see 'cooksCutoff' argument of ?results ## [2] see 'independentFiltering' argument of ?results...
[...]
...## function (q) ## coefs[1] + coefs[2]/q ## <environment: 0xe210658> ## attr(,"coefficients") ## asymptDisp extraPois ## 0....
[...]
...This workflow allows users to import transcript abundance estimates from a variety of external software, including the following methods: • Sailfish [2] • Salmon [3] • kallisto [4] • RSEM [5] Some advantages of using the above methods for transcript abundance estimation are: (i) this approach corrects for potential changes in gene length across samples (e....
[...]
...
data <- plotPCA(rld, intgroup=c("condition", "type"), returnData=TRUE)
percentVar <- round(100 * attr(data, "percentVar"))
ggplot(data, aes(PC1, PC2, color=condition, shape=type)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  coord_fixed()
...
[...]
...12% ## [1] see 'cooksCutoff' argument of ?results ## [2] see metadata(res)$ihwResult on hypothesis weighting...
[...]

Journal Article•DOI•

Removing technical variability in RNA-seq data using conditional quantile normalization

Kasper D. Hansen¹, Rafael A. Irizarry¹, Zhijin Wu²

Johns Hopkins University¹, Brown University²

01 Apr 2012-Biostatistics

TL;DR: A statistical methodology is described that improves precision by 42% without loss of accuracy and combines robust generalized regression to remove systematic bias introduced by deterministic features such as GC-content and quantile normalization to correct for global distortions.

Abstract: The ability to measure gene expression on a genome-wide scale is one of the most promising accomplishments in molecular biology. Microarrays, the technology that first permitted this, were riddled with problems due to unwanted sources of variability. Many of these problems are now mitigated, after a decade's worth of statistical methodology development. The recently developed RNA sequencing (RNA-seq) technology has generated much excitement in part due to claims of reduced variability in comparison to microarrays. However, we show that RNA-seq data demonstrate unwanted and obscuring variability similar to what was first observed in microarrays. In particular, we find guanine-cytosine content (GC-content) has a strong sample-specific effect on gene expression measurements that, if left uncorrected, leads to false positives in downstream results. We also report on commonly observed data distortions that demonstrate the need for data normalization. Here, we describe a statistical methodology that improves precision by 42% without loss of accuracy. Our resulting conditional quantile normalization algorithm combines robust generalized regression to remove systematic bias introduced by deterministic features such as GC-content and quantile normalization to correct for global distortions.

566 citations

"Moderated estimation of fold change..." refers to background or methods in this paper

..., using cqn [12] or EDASeq [13]), which may differ from gene to gene....
[...]
...However, it can be advantageous to calculate gene-specific normalization factors s_ij to account for further sources of technical biases such as GC content, gene length or the like, using published methods [12, 13], and these can be supplied instead....
[...]