Journal ArticleDOI

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

05 Dec 2014-Genome Biology (BioMed Central)-Vol. 15, Iss: 12, pp 550-550
TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html .
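For readers who want to map the abstract onto the software, below is a minimal, hedged R sketch of a two-condition DESeq2 analysis. The count matrix counts_mat, the sample table sample_info, and the coefficient name "condition_treated_vs_control" are placeholders, not objects from the paper; in current releases the fold-change shrinkage described here corresponds to lfcShrink(..., type = "normal").

    ## Minimal DESeq2 sketch (placeholder inputs).
    library(DESeq2)

    dds <- DESeqDataSetFromMatrix(countData = counts_mat,      # genes x samples raw counts
                                  colData   = sample_info,     # data.frame with a 'condition' column
                                  design    = ~ condition)
    dds <- DESeq(dds)                        # size factors, shrunken dispersions, Wald tests
    res <- results(dds)                      # log2 fold changes with BH-adjusted p-values
    shr <- lfcShrink(dds, coef = "condition_treated_vs_control",
                     type = "normal")        # shrunken log2 fold changes
    head(shr[order(shr$padj), ])             # genes ranked by adjusted p-value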


Citations
Journal ArticleDOI
15 Jan 2019-Immunity
TL;DR: The analysis of RNA expression patterns of more than 76,000 individual microglia in mice during development, in old age, and after brain injury uncovered at least nine transcriptionally distinct microglial states, which expressed unique sets of genes and were localized in the brain using specific markers.

1,156 citations

Journal ArticleDOI
TL;DR: A method for the differential analysis of gene expression data that utilizes bootstrapping in conjunction with response error linear modeling to decouple biological variance from inferential variance is described.
Abstract: We describe sleuth (http://pachterlab.github.io/sleuth), a method for the differential analysis of gene expression data that utilizes bootstrapping in conjunction with response error linear modeling to decouple biological variance from inferential variance. sleuth is implemented in an interactive shiny app that utilizes kallisto quantifications and bootstraps for fast and accurate analysis of data from RNA-seq experiments.
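As a companion to the abstract, a hedged sketch of the sleuth workflow follows. The sample table s2c (assumed to have sample, condition, and path columns pointing at kallisto output directories) is a placeholder, not something defined in the cited article.

    ## Sketch of a sleuth analysis on kallisto quantifications and bootstraps.
    library(sleuth)

    so <- sleuth_prep(s2c, ~ condition)         # load kallisto results and bootstraps
    so <- sleuth_fit(so, ~ condition, "full")   # full model
    so <- sleuth_fit(so, ~ 1, "reduced")        # intercept-only model
    so <- sleuth_lrt(so, "reduced", "full")     # likelihood-ratio test
    res <- sleuth_results(so, "reduced:full", test_type = "lrt")
    head(res[order(res$qval), ])                # transcripts ranked by q-value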

1,154 citations

Journal ArticleDOI
22 Oct 2015-Nature
TL;DR: The developmental mechanism regulating the preferential induction of collecting duct versus kidney mesenchyme progenitors is identified and kidney organoids that contain nephrons associated with a collecting duct network surrounded by renal interstitium and endothelial cells are generated.
Abstract: The human kidney contains up to 2 million epithelial nephrons responsible for blood filtration. Regenerating the kidney requires the induction of the more than 20 distinct cell types required for excretion and the regulation of pH, and electrolyte and fluid balance. We have previously described the simultaneous induction of progenitors for both collecting duct and nephrons via the directed differentiation of human pluripotent stem cells. Paradoxically, although both are of intermediate mesoderm in origin, collecting duct and nephrons have distinct temporospatial origins. Here we identify the developmental mechanism regulating the preferential induction of collecting duct versus kidney mesenchyme progenitors. Using this knowledge, we have generated kidney organoids that contain nephrons associated with a collecting duct network surrounded by renal interstitium and endothelial cells. Within these organoids, individual nephrons segment into distal and proximal tubules, early loops of Henle, and glomeruli containing podocytes elaborating foot processes and undergoing vascularization. When transcription profiles of kidney organoids were compared to human fetal tissues, they showed highest congruence with first trimester human kidney. Furthermore, the proximal tubules endocytose dextran and differentially apoptose in response to cisplatin, a nephrotoxicant. Such kidney organoids represent powerful models of the human organ for future applications, including nephrotoxicity screening, disease modelling and as a source of cells for therapy.

1,143 citations

Journal ArticleDOI
08 May 2015-Science
TL;DR: Tissues exhibit characteristic transcriptional signatures that show stability in postmortem samples and are dominated by a relatively small number of genes, though few genes are exclusive to a particular tissue, and expression varies more across tissues than across individuals.
Abstract: Transcriptional regulation and posttranscriptional processing underlie many cellular and organismal phenotypes. We used RNA sequence data generated by the Genotype-Tissue Expression (GTEx) project to investigate the patterns of transcriptome variation across individuals and tissues. Tissues exhibit characteristic transcriptional signatures that show stability in postmortem samples. These signatures are dominated by a relatively small number of genes—which is most clearly seen in blood—though few are exclusive to a particular tissue and vary more across tissues than individuals. Genes exhibiting high interindividual expression variation include disease candidates associated with sex, ethnicity, and age. Primary transcription is the major driver of cellular specificity, with splicing playing mostly a complementary role, except for the brain, which exhibits a more divergent splicing program. In contrast, variation in splicing, despite its stochasticity, may play a comparatively greater role in defining individual phenotypes.

1,131 citations

Journal ArticleDOI
TL;DR: This article describes a computational workflow for low-level analyses of scRNA-seq data, based primarily on software packages from the open-source Bioconductor project, which covers basic steps including quality control, data exploration and normalization, as well as more complex procedures such as cell cycle phase assignment.
Abstract: Single-cell RNA sequencing (scRNA-seq) is widely used to profile the transcriptome of individual cells. This provides biological resolution that cannot be matched by bulk RNA sequencing, at the cost of increased technical noise and data complexity. The differences between scRNA-seq and bulk RNA-seq data mean that the analysis of the former cannot be performed by recycling bioinformatics pipelines for the latter. Rather, dedicated single-cell methods are required at various steps to exploit the cellular resolution while accounting for technical noise. This article describes a computational workflow for low-level analyses of scRNA-seq data, based primarily on software packages from the open-source Bioconductor project. It covers basic steps including quality control, data exploration and normalization, as well as more complex procedures such as cell cycle phase assignment, identification of highly variable and correlated genes, clustering into subpopulations and marker gene detection. Analyses were demonstrated on gene-level count data from several publicly available datasets involving haematopoietic stem cells, brain-derived cells, T-helper cells and mouse embryonic stem cells. This will provide a range of usage scenarios from which readers can construct their own analysis pipelines.
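To make the quality-control and normalization steps concrete, here is a hedged sketch using the scran and scater Bioconductor packages. Function names follow current releases (the original workflow used earlier equivalents), sce is assumed to be a SingleCellExperiment of raw counts, and the QC thresholds are illustrative only.

    ## Sketch of low-level scRNA-seq processing (placeholder object and thresholds).
    library(scater)
    library(scran)

    qc   <- perCellQCMetrics(sce)                        # per-cell library size, detected genes
    keep <- qc$sum > 1e4 & qc$detected > 1000            # illustrative QC cut-offs
    sce  <- sce[, keep]

    clusters <- quickCluster(sce)                        # rough clusters for pooling cells
    sce <- computeSumFactors(sce, clusters = clusters)   # deconvolution size factors
    sce <- logNormCounts(sce)                            # log-normalized expression values

    dec <- modelGeneVar(sce)                             # per-gene variance decomposition
    hvg <- getTopHVGs(dec, n = 2000)                     # highly variable genes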

1,128 citations


Cites methods from "Moderated estimation of fold change..."

  • ...Size factors can be computed with several different approaches, e.g., using the estimateSizeFactorsFromMatrix function in the DESeq2 package (Anders & Huber, 2010; Love et al., 2014), or with the calcNormFactors function (Robinson & Oshlack, 2010) in the edgeR package....

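A hedged illustration of the two normalization routes mentioned in the quoted passage; counts_mat is a placeholder genes-by-samples count matrix, and note that the DESeq2 function is exported as estimateSizeFactorsForMatrix in current releases.

    ## Two common ways to compute normalization factors for a count matrix.
    library(DESeq2)
    library(edgeR)

    sf <- estimateSizeFactorsForMatrix(counts_mat)   # DESeq2 median-of-ratios size factors

    dge <- DGEList(counts = counts_mat)
    dge <- calcNormFactors(dge)                      # edgeR TMM normalization factors
    dge$samples$norm.factors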

References
Journal ArticleDOI
TL;DR: In this paper, a different approach to problems of multiple significance testing is presented, which calls for controlling the expected proportion of falsely rejected hypotheses, the false discovery rate; this error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise.
Abstract: The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented. It calls for controlling the expected proportion of falsely rejected hypotheses, the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise. Therefore, in problems where the control of the false discovery rate rather than that of the FWER is desired, there is potential for a gain in power. A simple sequential Bonferroni-type procedure is proved to control the false discovery rate for independent test statistics, and a simulation study shows that the gain in power is substantial. The use of the new procedure and the appropriateness of the criterion are illustrated with examples.
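Since the step-up procedure is short, a hedged illustration in R follows: p.adjust(..., method = "BH") is the built-in implementation, and the manual rule rejects the k smallest p-values, where k is the largest i with p_(i) <= (i/m) * alpha. The simulated p-values are placeholders.

    ## Benjamini-Hochberg step-up procedure on a vector of p-values.
    set.seed(1)
    p <- c(runif(90), rbeta(10, 1, 50))        # mostly null p-values plus a few small ones

    padj <- p.adjust(p, method = "BH")         # built-in BH adjustment
    sum(padj <= 0.1)                           # discoveries at FDR 10%

    # Manual step-up rule, for comparison with the adjusted p-values above.
    bh_reject <- function(p, alpha = 0.1) {
      m <- length(p)
      o <- order(p)
      below <- p[o] <= (seq_len(m) / m) * alpha
      k <- if (any(below)) max(which(below)) else 0L
      rejected <- rep(FALSE, m)
      if (k > 0) rejected[o[seq_len(k)]] <- TRUE
      rejected
    }
    sum(bh_reject(p, alpha = 0.1))             # agrees with sum(padj <= 0.1)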

83,420 citations


"Moderated estimation of fold change..." refers methods in this paper

  • ...The Wald test P values from the subset of genes that pass an independent filtering step, described in the next section, are adjusted for multiple testing using the procedure of Benjamini and Hochberg [21]....


  • ...The Wald test p-values from the subset of genes that pass an independent filtering step, described in the next section, are adjusted for multiple testing using the procedure of Benjamini and Hochberg [20]....


  • ...For all algorithms returning P values, the P values from genes with non-zero sum of read counts across samples were adjusted using the Benjamini–Hochberg procedure [21]....


  • ...The Wald test P values from the subset of genes that pass the independent filtering step are adjusted for multiple testing using the procedure of Benjamini and Hochberg [21]....


  • ...The Wald test p-values from the subset of genes which pass the independent filtering step are adjusted for multiple testing using the procedure of Benjamini and Hochberg [20]....


Journal ArticleDOI
TL;DR: edgeR is a Bioconductor software package for examining differential expression of replicated count data; it uses an overdispersed Poisson model to account for both biological and technical variability, and empirical Bayes methods to moderate the degree of overdispersion across transcripts, improving the reliability of inference.
Abstract: Summary: It is expected that emerging digital gene expression (DGE) technologies will overtake microarray technologies in the near future for many functional genomics applications. One of the fundamental data analysis tasks, especially for gene expression studies, involves determining whether there is evidence that counts for a transcript or exon are significantly different across experimental conditions. edgeR is a Bioconductor software package for examining differential expression of replicated count data. An overdispersed Poisson model is used to account for both biological and technical variability. Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference. The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated. The software may have other applications beyond sequencing data, such as proteome peptide count data. Availability: The package is freely available under the LGPL licence from the Bioconductor web site (http://bioconductor.org).
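For concreteness, a hedged sketch of the classic edgeR workflow the abstract summarizes; counts_mat (a genes-by-samples matrix of raw counts) and group (a two-level factor of conditions) are placeholder inputs.

    ## Sketch of an edgeR differential expression analysis (placeholder inputs).
    library(edgeR)

    dge <- DGEList(counts = counts_mat, group = group)
    dge <- calcNormFactors(dge)        # TMM normalization factors
    dge <- estimateDisp(dge)           # empirical Bayes moderated dispersions
    et  <- exactTest(dge)              # two-group exact test on the negative binomial model
    topTags(et)                        # top genes with BH-adjusted p-values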

29,413 citations


"Moderated estimation of fold change..." refers methods in this paper

  • ...The Negative Binomial based approaches compared were DESeq (old) [4], edgeR [32], edgeR with the robust option [33], DSS [6] and EBSeq [34]....


Book
01 Jan 1983
TL;DR: In this work, a generalization of the analysis of variance is given for generalized linear models using log-likelihoods, illustrated by examples relating to four distributions: the Normal, Binomial (probit analysis, etc.), Poisson (contingency tables), and gamma (variance components).
Abstract: The technique of iterative weighted linear regression can be used to obtain maximum likelihood estimates of the parameters with observations distributed according to some exponential family and systematic effects that can be made linear by a suitable transformation. A generalization of the analysis of variance is given for these models using log-likelihoods. These generalized linear models are illustrated by examples relating to four distributions; the Normal, Binomial (probit analysis, etc.), Poisson (contingency tables) and gamma (variance components).
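The iterative weighted linear regression mentioned in the abstract (IRLS) is compact enough to sketch. Below is a hedged toy implementation for a Poisson GLM with log link, checked against R's built-in glm; irls_poisson and the simulated data are illustrative, not part of the reference.

    ## Toy IRLS for a Poisson GLM with log link (illustrative only).
    irls_poisson <- function(X, y, tol = 1e-8, maxit = 25) {
      beta <- rep(0, ncol(X))
      for (it in seq_len(maxit)) {
        eta <- drop(X %*% beta)
        mu  <- exp(eta)                         # inverse link
        w   <- mu                               # weights: (dmu/deta)^2 / Var(Y) = mu for Poisson
        z   <- eta + (y - mu) / mu              # working response
        beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
        if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
        beta <- beta_new
      }
      beta
    }

    set.seed(1)
    x <- rnorm(100); X <- cbind(1, x)
    y <- rpois(100, lambda = exp(0.5 + 0.8 * x))
    irls_poisson(X, y)                          # close to c(0.5, 0.8)
    coef(glm(y ~ x, family = poisson))          # should agree with the IRLS estimates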

23,215 citations

Book
28 Jul 2013
TL;DR: In this book, the authors describe the important ideas in statistical learning in a common conceptual framework; the emphasis is on concepts rather than mathematics, with a liberal use of color graphics.
Abstract: During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting---the first comprehensive treatment of this topic in any book. This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression and path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for ``wide'' data (p bigger than n), including multiple testing and false discovery rates. Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.

19,261 citations