Journal Article

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

05 Dec 2014, Genome Biology (BioMed Central), Vol. 15, Iss. 12, p. 550
TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html.
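
The idea of shrinking fold changes can be illustrated with a toy calculation. The sketch below is not the DESeq2 procedure (which fits a generalized linear model per gene); it assumes a simple normal-normal model with a zero-centered prior, and the function name `shrink_lfc` and the parameter `prior_sd` are hypothetical names chosen for illustration.

```python
def shrink_lfc(lfc_mle, se, prior_sd):
    """Posterior mean of a log2 fold change under a zero-centered normal prior.

    Assumed toy model (not DESeq2's GLM):
        estimate ~ Normal(true_lfc, se^2)
        true_lfc ~ Normal(0, prior_sd^2)
    """
    if se <= 0 or prior_sd <= 0:
        raise ValueError("se and prior_sd must be positive")
    # Fraction of the raw estimate that is kept; small when se is large.
    weight = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    return weight * lfc_mle


# Two hypothetical genes with the same raw estimate but different precision.
precise = shrink_lfc(lfc_mle=2.0, se=0.2, prior_sd=1.0)
noisy = shrink_lfc(lfc_mle=2.0, se=2.0, prior_sd=1.0)
print(round(precise, 3), round(noisy, 3))
```

In this toy model the imprecisely estimated gene is pulled much closer to zero than the precisely estimated one, which is the qualitative behavior the abstract describes as improving the stability of estimates.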


Citations
Journal Article
TL;DR: This work presents HTSeq, a Python library to facilitate the rapid development of custom scripts for high-throughput sequencing data analysis, and presents htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes.
Abstract: Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard workflows, custom scripts are needed. Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data, such as genomic coordinates, sequences, sequencing reads, alignments, gene model information and variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes. Availability and implementation: HTSeq is released as open-source software under the GNU General Public Licence and is available from http://www-huber.embl.de/HTSeq or from the Python Package Index at https://pypi.python.org/pypi/HTSeq. Contact: sanders@fs.tum.de
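
To make "counting the overlap of reads with genes" concrete, here is a minimal pure-Python sketch. It does not use the HTSeq API; the function `count_reads`, the half-open interval convention, the single-chromosome simplification, and the `no_feature` / `ambiguous` labels are all assumptions made for illustration.

```python
def count_reads(reads, genes):
    """Count reads per gene on a single chromosome and strand (toy example).

    reads: list of (start, end) half-open intervals.
    genes: dict mapping gene name -> (start, end) half-open interval.
    A read is assigned to a gene only if it overlaps exactly one gene.
    """
    counts = {name: 0 for name in genes}
    counts["no_feature"] = 0  # read overlaps no gene
    counts["ambiguous"] = 0   # read overlaps more than one gene
    for r_start, r_end in reads:
        hits = [
            name
            for name, (g_start, g_end) in genes.items()
            if r_start < g_end and g_start < r_end  # half-open overlap test
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["no_feature"] += 1
        else:
            counts["ambiguous"] += 1
    return counts


# Hypothetical gene models and read alignments.
genes = {"geneA": (100, 200), "geneB": (180, 300)}
reads = [(110, 150), (190, 195), (250, 280), (400, 450), (150, 185)]
counts = count_reads(reads, genes)
print(counts)
```

The linear scan over all genes is only suitable for small examples; a real tool would use an indexed structure that supports querying by genomic coordinates, as the abstract notes.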

15,744 citations

Journal Article
13 Jun 2019, Cell
TL;DR: A strategy to "anchor" diverse datasets together, enabling us to integrate single-cell measurements not only across scRNA-seq technologies, but also across different modalities.

7,892 citations


Cites methods from "Moderated estimation of fold change..."

  • ...To identify differentially-expressed genes between the CD69+ and CD69- sorted populations, we used DESeq2 [Love et al., 2014] and filtered for significant genes with a log2-fold change in expression greater than 1.5 and a q-value of less than 0.01 [Storey and Tibshirani, 2003]....

    [...]
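
The filtering step quoted above can be sketched as follows. This is an illustrative example only: the record layout, the gene names, and the `use_abs` switch are hypothetical, and the default thresholds are simply the ones stated in the excerpt (log2 fold change greater than 1.5, q-value less than 0.01). Whether the citing authors applied the threshold to the signed or the absolute fold change is not stated, so the sketch follows the literal wording by default.

```python
def filter_significant(results, lfc_threshold=1.5, q_threshold=0.01, use_abs=False):
    """Return gene names passing a log2 fold change and q-value filter.

    results: list of dicts with keys "gene", "log2_fold_change", "q_value".
    use_abs: if True, compare the absolute fold change (both directions).
    """
    selected = []
    for row in results:
        lfc = row["log2_fold_change"]
        q = row["q_value"]
        if lfc is None or q is None:
            continue  # skip genes without a usable estimate
        effect = abs(lfc) if use_abs else lfc
        if effect > lfc_threshold and q < q_threshold:
            selected.append(row["gene"])
    return selected


# Hypothetical differential expression results.
results = [
    {"gene": "geneA", "log2_fold_change": 2.3, "q_value": 0.001},
    {"gene": "geneB", "log2_fold_change": 1.2, "q_value": 0.0001},
    {"gene": "geneC", "log2_fold_change": -2.0, "q_value": 0.005},
    {"gene": "geneD", "log2_fold_change": 3.0, "q_value": 0.2},
    {"gene": "geneE", "log2_fold_change": None, "q_value": None},
]
print(filter_significant(results))
print(filter_significant(results, use_abs=True))
```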

Journal Article
TL;DR: This protocol describes all the steps necessary to process a large set of raw sequencing reads and create lists of gene transcripts, expression levels, and differentially expressed genes and transcripts.
Abstract: High-throughput sequencing of mRNA (RNA-seq) has become the standard method for measuring and comparing the levels of gene expression in a wide variety of species and conditions. RNA-seq experiments generate very large, complex data sets that demand fast, accurate and flexible software to reduce the raw read data to comprehensible results. HISAT (hierarchical indexing for spliced alignment of transcripts), StringTie and Ballgown are free, open-source software tools for comprehensive analysis of RNA-seq experiments. Together, they allow scientists to align reads to a genome, assemble transcripts including novel splice variants, compute the abundance of these transcripts in each sample and compare experiments to identify differentially expressed genes and transcripts. This protocol describes all the steps necessary to process a large set of raw sequencing reads and create lists of gene transcripts, expression levels, and differentially expressed genes and transcripts. The protocol's execution time depends on the computing resources, but it typically takes under 45 min of computer time. HISAT, StringTie and Ballgown are available from http://ccb.jhu.edu/software.shtml.

3,755 citations

Journal Article
28 May 2020, Cell
TL;DR: It is proposed that reduced innate antiviral defenses coupled with exuberant inflammatory cytokine production are the defining and driving features of COVID-19.

3,286 citations


Cites background or methods from "Moderated estimation of fold change..."

  • ...1.10 (Illumina) http://basespace.illumina.com/dashboard; DESeq2 (Love et al., 2014) https://bioconductor.org/packages/release/bioc/html/DESeq2.html; STRING (Szklarczyk et al., 2019) https://string-db.org/; gplots (CRAN) https://cran.r-project.org/web/packages/gplots/index.html; PMA (Witten et al., 2009) https://cran.r-project.org/web/packages/PMA/index.html; ggplot2 (Tidyverse) https://ggplot2.tidyverse.org/; Bowtie2 (Langmead and Salzberg, 2012) http://bowtie-bio.sourceforge.net/bowtie2/index.shtml; ImmGen (Yoshida et al., 2019) http://www.immgen.org/; ll...

    [...]

  • ...Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2....

    [...]

  • ...Raw reads were aligned to the human genome (hg19) using the RNA-Seq Alignment App on Basespace (Illumina, CA), followed by differential expression analysis using DESeq2 (Love et al., 2014)....

    [...]

Journal Article
TL;DR: Improvements to Galaxy's core framework, user interface, tools, and training materials enable Galaxy to be used for analyzing tens of thousands of datasets, and >5500 tools are now available from the Galaxy ToolShed.
Abstract: Galaxy (homepage: https://galaxyproject.org, main public server: https://usegalaxy.org) is a web-based scientific analysis platform used by tens of thousands of scientists across the world to analyze large biomedical datasets such as those found in genomics, proteomics, metabolomics and imaging. Started in 2005, Galaxy continues to focus on three key challenges of data-driven biomedical science: making analyses accessible to all researchers, ensuring analyses are completely reproducible, and making it simple to communicate analyses so that they can be reused and extended. During the last two years, the Galaxy team and the open-source community around Galaxy have made substantial improvements to Galaxy's core framework, user interface, tools, and training materials. Framework and user interface improvements now enable Galaxy to be used for analyzing tens of thousands of datasets, and >5500 tools are now available from the Galaxy ToolShed. The Galaxy community has led an effort to create numerous high-quality tutorials focused on common types of genomic analyses. The Galaxy developer and user communities continue to grow and be integral to Galaxy's development. The number of Galaxy public servers, developers contributing to the Galaxy framework and its tools, and users of the main Galaxy server have all increased substantially.

2,601 citations


Cites background from "Moderated estimation of fold change..."

  • ...Examples of new tools include: GEMINI for exploring genetic variation (12); mothur for analyzing rRNA gene sequences (13); QIIME for quantitative microbiome analysis from raw DNA sequencing data (14); deepTools for explorative analysis of deeply sequenced data (15,16); HiCexplorer (17) for analysis and visualization of Hi-C data; ChemicalToolBox for comprehensive access to cheminformatics libraries and drug discovery tools (18); minimap2 (https://arxiv.org/abs/1708.01492) and poretools for long read sequencing analysis (19); MultiQC (20) to aggregate multiple results into a single report; a new RNA-seq analysis tool suite with modern analysis tools such as Kallisto (21), Salmon (22), DESeq2 (23) and STAR-Fusion (24), and GenomeSpace (25), a cloud-based interoperability tool....

    [...]

References
Journal Article
TL;DR: New normal linear modeling strategies are presented for analyzing read counts from RNA-seq experiments, and the voom method estimates the mean-variance relationship of the log-counts, generates a precision weight for each observation and enters these into the limma empirical Bayes analysis pipeline.
Abstract: New normal linear modeling strategies are presented for analyzing read counts from RNA-seq experiments. The voom method estimates the mean-variance relationship of the log-counts, generates a precision weight for each observation and enters these into the limma empirical Bayes analysis pipeline. This opens access for RNA-seq analysts to a large body of methodology developed for microarrays. Simulation studies show that voom performs as well or better than count-based RNA-seq methods even when the data are generated according to the assumptions of the earlier methods. Two case studies illustrate the use of linear modeling and gene set testing methods.
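
Two of the ingredients named in this abstract, log-counts and per-observation precision weights, can be sketched in a few lines. This is a simplified illustration, not the limma implementation: it assumes the log-counts-per-million transform with offsets of 0.5 and 1 as described in the voom paper, it omits the fitting of the mean-variance trend entirely, and the function names and the `fitted_sqrt_sd` input (trend values predicted for each observation) are hypothetical.

```python
import math


def log_cpm(counts, lib_sizes):
    """Log2 counts per million for a genes x samples table of raw counts.

    Assumed transform: log2((count + 0.5) / (library_size + 1) * 1e6).
    """
    return [
        [math.log2((c + 0.5) / (n + 1.0) * 1e6) for c, n in zip(row, lib_sizes)]
        for row in counts
    ]


def precision_weights(fitted_sqrt_sd):
    """Inverse-variance weights from predicted square-root standard deviations.

    If the trend predicts sqrt(sd) = s for an observation, its variance is
    s**4, so the precision weight is 1 / s**4.
    """
    return [[1.0 / s ** 4 for s in row] for row in fitted_sqrt_sd]


# Hypothetical data: 2 genes x 2 samples.
counts = [[0, 10], [500, 450]]
lib_sizes = [999_999, 1_999_999]
y = log_cpm(counts, lib_sizes)

# Hypothetical trend predictions; a real analysis would fit these from the data.
weights = precision_weights([[1.2, 1.0], [0.5, 0.5]])
print(y[0][0], weights[1][0])
```

The weights would then enter a weighted linear model fit for each gene, so that observations with a higher predicted variance contribute less to the estimate.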

4,475 citations


"Moderated estimation of fold change..." refers methods in this paper

  • ...Other methods compared were the voom normalization method followed by linear modeling using the limma package [35] and the SAMseq permutation method of the samr package [23]....

    [...]

Journal Article
TL;DR: A flexible statistical framework is developed for the analysis of read counts from RNA-Seq gene expression studies, and parallel computational approaches are developed to make non-linear model fitting faster and more reliable, making the application of GLMs to genomic data more convenient and practical.
Abstract: A flexible statistical framework is developed for the analysis of read counts from RNA-Seq gene expression studies. It provides the ability to analyse complex experiments involving multiple treatment conditions and blocking variables while still taking full account of biological variation. Biological variation between RNA samples is estimated separately from the technical variation associated with sequencing technologies. Novel empirical Bayes methods allow each gene to have its own specific variability, even when there are relatively few biological replicates from which to estimate such variability. The pipeline is implemented in the edgeR package of the Bioconductor project. A case study analysis of carcinoma data demonstrates the ability of generalized linear model methods (GLMs) to detect differential expression in a paired design, and even to detect tumour-specific expression changes. The case study demonstrates the need to allow for gene-specific variability, rather than assuming a common dispersion across genes or a fixed relationship between abundance and variability. Genewise dispersions de-prioritize genes with inconsistent results and allow the main analysis to focus on changes that are consistent between biological replicates. Parallel computational approaches are developed to make non-linear model fitting faster and more reliable, making the application of GLMs to genomic data more convenient and practical. Simulations demonstrate the ability of adjusted profile likelihood estimators to return accurate estimators of biological variability in complex situations. When variation is gene-specific, empirical Bayes estimators provide an advantageous compromise between the extremes of assuming common dispersion or separate genewise dispersion. The methods developed here can also be applied to count data arising from DNA-Seq applications, including ChIP-Seq for epigenetic marks and DNA methylation analyses.

4,127 citations

Journal Article
TL;DR: Cuffdiff 2, an algorithm that estimates expression at transcript-level resolution and controls for variability evident across replicate libraries, robustly identifies differentially expressed transcripts and genes and reveals differential splicing and promoter-preference changes.
Abstract: Differential analysis of gene and transcript expression using high-throughput RNA sequencing (RNA-seq) is complicated by several sources of measurement variability and poses numerous statistical challenges. We present Cuffdiff 2, an algorithm that estimates expression at transcript-level resolution and controls for variability evident across replicate libraries. Cuffdiff 2 robustly identifies differentially expressed transcripts and genes and reveals differential splicing and promoter-preference changes. We demonstrate the accuracy of our approach through differential analysis of lung fibroblasts in response to loss of the developmental transcription factor HOXA1, which we show is required for lung fibroblast and HeLa cell cycle progression. Loss of HOXA1 results in significant expression level changes in thousands of individual transcripts, along with isoform switching events in key regulators of the cell cycle. Cuffdiff 2 performs robust differential analysis in RNA-seq experiments at transcript resolution, revealing a layer of regulation not readily observable with other high-throughput technologies.

3,319 citations


"Moderated estimation of fold change..." refers background or methods in this paper

  • ...For the benchmarks using real data, the Cuffdiff 2 [27] method of the Cufflinks suite was included....

    [...]

  • ...In contrast, several algorithms [27, 28] work with probabilistic assignments of reads to transcripts, where multiple, overlapping transcripts can originate from each gene....

    [...]

  • ...It has been noted that the total read count approach can result in false detection of differential expression when in fact only transcript isoform lengths change, and even in a wrong sign of LFCs in extreme cases [27]....

    [...]

Journal Article
TL;DR: This work describes Bioconductor infrastructure for representing and computing on annotated genomic ranges and integrating genomic data with the statistical computing features of R and its extensions, including those for sequence analysis, differential expression analysis and visualization.
Abstract: We describe Bioconductor infrastructure for representing and computing on annotated genomic ranges and integrating genomic data with the statistical computing features of R and its extensions. At the core of the infrastructure are three packages: IRanges, GenomicRanges, and GenomicFeatures. These packages provide scalable data structures for representing annotated ranges on the genome, with special support for transcript structures, read alignments and coverage vectors. Computational facilities include efficient algorithms for overlap and nearest neighbor detection, coverage calculation and other range operations. This infrastructure directly supports more than 80 other Bioconductor packages, including those for sequence analysis, differential expression analysis and visualization.

3,005 citations


"Moderated estimation of fold change..." refers methods in this paper

  • ...The count matrix and metadata including the gene model and sample information are stored in an S4 class derived from the SummarizedExperiment class of the GenomicRanges package [59]....

    [...]