Journal ArticleDOI

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

05 Dec 2014-Genome Biology (BioMed Central)-Vol. 15, Iss: 12, pp 550-550
TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html .
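
To make the described workflow concrete, here is a minimal sketch of a typical DESeq2 call sequence in R; the objects counts (a gene-by-sample integer matrix), coldata (a data frame with a condition factor), and the condition level names "treated" and "control" are illustrative placeholders, not anything specified in the article.

    library(DESeq2)

    ## Build the dataset from a gene-by-sample count matrix and a sample table;
    ## the design formula names the factor(s) to test.
    dds <- DESeqDataSetFromMatrix(countData = counts,
                                  colData   = coldata,
                                  design    = ~ condition)

    ## One call estimates size factors, dispersions (shrunk towards the fitted
    ## mean-dispersion trend) and fits the negative binomial GLM per gene.
    dds <- DESeq(dds)

    ## Wald-test results for a contrast, plus moderated (shrunken) log2 fold
    ## changes for ranking and visualization.
    res <- results(dds, contrast = c("condition", "treated", "control"))
    res_shrunk <- lfcShrink(dds, contrast = c("condition", "treated", "control"),
                            type = "normal")
    head(res_shrunk[order(res_shrunk$padj), ])

In the 2014 release described in this article the shrunken fold changes were returned directly by results(); later releases moved the shrinkage step to the separate lfcShrink() helper used above.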


Citations
Journal ArticleDOI
TL;DR: A longitudinal human study reports that infants at risk of asthma exhibit transient gut microbial dysbiosis during the first 100 days of life, with certain bacterial genera decreased in these children, suggesting a potential causal role for the loss of these microbes.
Abstract: Asthma is the most prevalent pediatric chronic disease and affects more than 300 million people worldwide. Recent evidence in mice has identified a "critical window" early in life where gut microbial changes (dysbiosis) are most influential in experimental asthma. However, current research has yet to establish whether these changes precede or are involved in human asthma. We compared the gut microbiota of 319 subjects enrolled in the Canadian Healthy Infant Longitudinal Development (CHILD) Study, and show that infants at risk of asthma exhibited transient gut microbial dysbiosis during the first 100 days of life. The relative abundance of the bacterial genera Lachnospira, Veillonella, Faecalibacterium, and Rothia was significantly decreased in children at risk of asthma. This reduction in bacterial taxa was accompanied by reduced levels of fecal acetate and dysregulation of enterohepatic metabolites. Inoculation of germ-free mice with these four bacterial taxa ameliorated airway inflammation in their adult progeny, demonstrating a causal role of these bacterial taxa in averting asthma development. These results enhance the potential for future microbe-based diagnostics and therapies, potentially in the form of probiotics, to prevent the development of asthma and other related allergic diseases in children.

1,195 citations

Journal ArticleDOI
TL;DR: The steps of a typical single‐cell RNA‐seq analysis, including pre‐processing (quality control, normalization, data correction, feature selection, and dimensionality reduction) and cell‐ and gene‐level downstream analysis, are detailed.
Abstract: Single-cell RNA-seq has enabled gene expression to be studied at an unprecedented resolution. The promise of this technology is attracting a growing user base for single-cell analysis methods. As more analysis tools are becoming available, it is becoming increasingly difficult to navigate this landscape and produce an up-to-date workflow to analyse one's data. Here, we detail the steps of a typical single-cell RNA-seq analysis, including pre-processing (quality control, normalization, data correction, feature selection, and dimensionality reduction) and cell- and gene-level downstream analysis. We formulate current best-practice recommendations for these steps based on independent comparison studies. We have integrated these best-practice recommendations into a workflow, which we apply to a public dataset to further illustrate how these steps work in practice. Our documented case study can be found at https://www.github.com/theislab/single-cell-tutorial This review will serve as a workflow tutorial for new entrants into the field, and help established users update their analysis pipelines.
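
As a rough illustration of those pre-processing and downstream steps, the sketch below uses the Seurat R package as one possible toolchain (the review's own case study uses a different pipeline; see the linked tutorial). The object umi_counts is a placeholder gene-by-cell UMI count matrix, and every threshold (minimum features, mitochondrial cutoff, number of PCs, clustering resolution) is an arbitrary example rather than a recommendation from the review.

    library(Seurat)

    srt <- CreateSeuratObject(counts = umi_counts, min.cells = 3, min.features = 200)

    ## Quality control: filter cells by detected genes and mitochondrial fraction.
    srt[["percent.mt"]] <- PercentageFeatureSet(srt, pattern = "^MT-")
    srt <- subset(srt, subset = nFeature_RNA > 500 & percent.mt < 10)

    ## Normalization, feature selection, scaling and dimensionality reduction.
    srt <- NormalizeData(srt)
    srt <- FindVariableFeatures(srt, nfeatures = 2000)
    srt <- ScaleData(srt)
    srt <- RunPCA(srt, npcs = 30)

    ## Cell-level downstream analysis: neighbourhood graph, clustering, UMAP.
    srt <- FindNeighbors(srt, dims = 1:30)
    srt <- FindClusters(srt, resolution = 0.5)
    srt <- RunUMAP(srt, dims = 1:30)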

1,180 citations

Posted ContentDOI
14 Mar 2019-bioRxiv
TL;DR: It is proposed that the Pearson residuals from 'regularized negative binomial regression', where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity.
Abstract: Single-cell RNA-seq (scRNA-seq) data exhibits significant cell-to-cell variation due to technical factors, including the number of molecules detected in each cell, which can confound biological heterogeneity with technical effects. To address this, we present a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments. We propose that the Pearson residuals from 'regularized negative binomial regression', where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity. Importantly, we show that an unconstrained negative binomial model may overfit scRNA-seq data, and overcome this by pooling information across genes with similar abundances to obtain stable parameter estimates. Our procedure omits the need for heuristic steps including pseudocount addition or log-transformation, and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression. Our approach can be applied to any UMI-based scRNA-seq dataset and is freely available as part of the R package sctransform, with a direct interface to our single-cell toolkit Seurat.
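
A minimal sketch of that interface is shown below, assuming a gene-by-cell UMI count matrix umi_counts (genes as rows); the object names are placeholders, and in current sctransform versions the Pearson residual matrix is returned in the y component of the fitted object.

    library(sctransform)

    ## Fit per-gene regularized negative binomial models with sequencing depth
    ## as a covariate and return Pearson residuals as variance-stabilized values.
    vst_out <- vst(umi_counts)
    pearson_res <- vst_out$y   # genes x cells matrix of Pearson residuals

    ## The same transformation is available inside Seurat as a single step:
    ## srt <- Seurat::SCTransform(srt)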

1,175 citations


Cites background or result from "Moderated estimation of fold change..."

  • ...The first set aims to identify 'size factors' for individual cells, as is commonly performed for bulk RNA-seq [Love et al., 2014]....


  • ...This is consistent with previous observations in both bulk and single cell RNA-seq that count data is overdispersed [Risso et al., 2018, Grün et al., 2014, Love et al., 2014, Robinson et al., 2010]....


Journal ArticleDOI
TL;DR: This review surveys the current landscape of available tools, focuses on the principles of error correction, base modification detection, and long-read transcriptomics analysis, and highlights the challenges that remain.
Abstract: Long-read technologies are overcoming early limitations in accuracy and throughput, broadening their application domains in genomics. Dedicated analysis tools that take into account the characteristics of long-read data are thus required, but the fast pace of development of such tools can be overwhelming. To assist in the design and analysis of long-read sequencing projects, we review the current landscape of available tools and present an online interactive database, long-read-tools.org, to facilitate their browsing. We further focus on the principles of error correction, base modification detection, and long-read transcriptomics analysis and highlight the challenges that remain.

1,172 citations


Cites methods from "Moderated estimation of fold change..."

  • ...The popular tools for short-read differential gene expression analysis, such as limma [143], edgeR [144, 145], and DESeq2 [146], can also be used for long-read differential isoform or gene expression analyses....


Journal ArticleDOI
27 Jul 2017-Cell
TL;DR: It is found that Fusobacterium (F.) nucleatum was abundant in colorectal cancer tissues from patients with recurrence after chemotherapy and was associated with patient clinicopathological characteristics; bioinformatic and functional studies demonstrated that F. nucleatum promoted colorectal cancer resistance to chemotherapy.

1,164 citations


Cites background or methods from "Moderated estimation of fold change..."

  • ...The RNA-seq data analysis was performed according to the TopHat-HTSeq-DeSeq2 frame (Anders et al., 2013)....


  • ...Differential analyses were performed to the count files using DESeq2 packages, following standard normalization procedures (Love et al., 2014)....


  • ...Key Resources Table (Software and Algorithms) excerpt, listing among others: ImageJ (National Institutes of Health), FindTar3, miRDB, FlowJo, ZEN 2011 Light Edition (ZEISS), R (R Development Core Team, https://www.r-project.org/), TopHat2 (Kim et al., 2013), DESeq2 (Love et al., 2014, https://www.bioconductor.org/), HTSeq (Anders et al., 2015), and GSVA (Hänzelmann et al., 2013)....

References
Journal ArticleDOI
TL;DR: In this paper, a different approach to problems of multiple significance testing is presented: it calls for controlling the expected proportion of falsely rejected hypotheses, the false discovery rate (FDR), which is equivalent to the FWER when all hypotheses are true but is smaller otherwise.
Abstract: The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented. It calls for controlling the expected proportion of falsely rejected hypotheses - the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise. Therefore, in problems where the control of the false discovery rate rather than that of the FWER is desired, there is potential for a gain in power. A simple sequential Bonferroni-type procedure is proved to control the false discovery rate for independent test statistics, and a simulation study shows that the gain in power is substantial. The use of the new procedure and the appropriateness of the criterion are illustrated with examples.
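
As a worked example of the step-up procedure, the snippet below applies the Benjamini-Hochberg adjustment to a toy vector of p-values, first by hand and then via R's built-in p.adjust; the numbers are invented purely for illustration.

    ## Toy p-values, already sorted for readability.
    p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.590)
    m <- length(p)
    o <- order(p)

    ## The BH-adjusted value for the i-th smallest p-value is the minimum of
    ## p_(j) * m / j over j >= i, capped at 1.
    adj_sorted <- rev(cummin(rev(p[o] * m / seq_len(m))))
    adj <- pmin(adj_sorted, 1)[order(o)]        # map back to the original order

    ## Matches the built-in implementation.
    all.equal(adj, p.adjust(p, method = "BH"))  # TRUE

    ## Controlling the FDR at 0.05 rejects the hypotheses with adj <= 0.05.
    which(adj <= 0.05)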

83,420 citations


"Moderated estimation of fold change..." refers methods in this paper

  • ...The Wald test P values from the subset of genes that pass an independent filtering step, described in the next section, are adjusted for multiple testing using the procedure of Benjamini and Hochberg [21]....


  • ...The Wald test p-values from the subset of genes that pass an independent filtering step, described in the next section, are adjusted for multiple testing using the procedure of Benjamini and Hochberg [20]....


  • ...For all algorithms returning P values, the P values from genes with non-zero sum of read counts across samples were adjusted using the Benjamini–Hochberg procedure [21]....


  • ...The Wald test P values from the subset of genes that pass the independent filtering step are adjusted for multiple testing using the procedure of Benjamini and Hochberg [21]....


  • ...The Wald test p-values from the subset of genes which pass the independent filtering step are adjusted for multiple testing using the procedure of Benjamini and Hochberg [20]....


Journal ArticleDOI
TL;DR: edgeR is a Bioconductor software package for examining differential expression of replicated count data; it uses an overdispersed Poisson model to account for both biological and technical variability, and empirical Bayes methods to moderate the degree of overdispersion across transcripts, improving the reliability of inference.
Abstract: It is expected that emerging digital gene expression (DGE) technologies will overtake microarray technologies in the near future for many functional genomics applications. One of the fundamental data analysis tasks, especially for gene expression studies, involves determining whether there is evidence that counts for a transcript or exon are significantly different across experimental conditions. edgeR is a Bioconductor software package for examining differential expression of replicated count data. An overdispersed Poisson model is used to account for both biological and technical variability. Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference. The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated. The software may have other applications beyond sequencing data, such as proteome peptide count data. Availability: The package is freely available under the LGPL licence from the Bioconductor web site (http://bioconductor.org).
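
For reference, a minimal edgeR call sequence in R corresponding to the classic two-group workflow described above; counts (a gene-by-sample integer matrix) and group (a factor of experimental conditions) are illustrative placeholders.

    library(edgeR)

    y <- DGEList(counts = counts, group = group)
    y <- calcNormFactors(y)          # TMM normalization factors

    ## Empirical Bayes moderation of the per-gene (tagwise) dispersions
    ## towards the common dispersion.
    y <- estimateCommonDisp(y)
    y <- estimateTagwiseDisp(y)

    ## Exact test for differences between two groups of negative binomial counts.
    et <- exactTest(y)
    topTags(et)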

29,413 citations


"Moderated estimation of fold change..." refers methods in this paper

  • ...The Negative Binomial based approaches compared were DESeq (old) [4], edgeR [32], edgeR with the robust option [33], DSS [6] and EBSeq [34]....


Book
01 Jan 1983
TL;DR: A generalization of the analysis of variance is given for generalized linear models using log-likelihoods, illustrated by examples relating to four distributions: the Normal, Binomial (probit analysis, etc.), Poisson (contingency tables), and gamma (variance components).
Abstract: The technique of iterative weighted linear regression can be used to obtain maximum likelihood estimates of the parameters with observations distributed according to some exponential family and systematic effects that can be made linear by a suitable transformation. A generalization of the analysis of variance is given for these models using log-likelihoods. These generalized linear models are illustrated by examples relating to four distributions: the Normal, Binomial (probit analysis, etc.), Poisson (contingency tables) and gamma (variance components).
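
To make the fitting procedure concrete, the snippet below runs a bare-bones version of iteratively (re)weighted least squares for a Poisson log-linear model on simulated data and compares the result with R's glm; the simulated data and starting values are arbitrary and purely illustrative.

    set.seed(1)
    x <- rnorm(100)
    y <- rpois(100, lambda = exp(0.5 + 1.2 * x))
    X <- cbind(1, x)

    beta <- c(log(mean(y)), 0)              # crude starting values
    for (iter in 1:25) {
      eta <- drop(X %*% beta)
      mu  <- exp(eta)                       # inverse of the log link
      z   <- eta + (y - mu) / mu            # working response
      w   <- mu                             # working weights (Poisson, log link)
      beta <- solve(t(X) %*% (w * X), t(X) %*% (w * z))
    }

    ## Coefficients agree with the built-in GLM fit.
    cbind(irls = drop(beta), glm = coef(glm(y ~ x, family = poisson)))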

23,215 citations

Book
28 Jul 2013
TL;DR: The authors describe the important ideas in data mining, machine learning, and bioinformatics within a common conceptual framework; the emphasis is on concepts rather than mathematics, with a liberal use of color graphics.
Abstract: During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting, the first comprehensive treatment of this topic in any book. This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression and path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for "wide" data (p bigger than n), including multiple testing and false discovery rates. Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.

19,261 citations