Journal Article

Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression

23 Dec 2019-Genome Biology (BioMed Central)-Vol. 20, Iss: 1, pp 296-296
TL;DR: It is proposed that the Pearson residuals from “regularized negative binomial regression,” where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity.
Abstract: Single-cell RNA-seq (scRNA-seq) data exhibits significant cell-to-cell variation due to technical factors, including the number of molecules detected in each cell, which can confound biological heterogeneity with technical effects. To address this, we present a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments. We propose that the Pearson residuals from “regularized negative binomial regression,” where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity. Importantly, we show that an unconstrained negative binomial model may overfit scRNA-seq data, and overcome this by pooling information across genes with similar abundances to obtain stable parameter estimates. Our procedure omits the need for heuristic steps including pseudocount addition or log-transformation and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression. Our approach can be applied to any UMI-based scRNA-seq dataset and is freely available as part of the R package sctransform, with a direct interface to our single-cell toolkit Seurat.
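As a rough illustration of the residual computation described in the abstract, the sketch below computes clipped Pearson residuals for one gene. It assumes a log link with log10 sequencing depth as the covariate, and parameters that have already been fitted and regularized. The function name, the argument names, and the clipping bound (the square root of the number of cells) are assumptions made for this example, not the sctransform API.

```python
import math

def nb_pearson_residuals(counts, depths, beta0, beta1, theta):
    """Pearson residuals for one gene under a negative binomial GLM.

    counts : UMI counts of the gene, one value per cell
    depths : total UMIs per cell (the sequencing-depth covariate)
    beta0, beta1, theta : assumed to be already fitted and regularized
    """
    # Assumption: clip residuals to +/- sqrt(number of cells).
    clip = math.sqrt(len(counts))
    residuals = []
    for x, m in zip(counts, depths):
        mu = math.exp(beta0 + beta1 * math.log10(m))  # expected count
        sd = math.sqrt(mu + mu * mu / theta)          # NB standard deviation
        z = (x - mu) / sd
        residuals.append(max(-clip, min(clip, z)))
    return residuals

# Two cells of equal depth; beta1 = 0 so the expected count is 10 in both.
example = nb_pearson_residuals([10, 0], [1000, 1000], math.log(10), 0.0, 10.0)
```

A count equal to its expected value yields a residual of zero, while a count far from expectation is capped by the clipping bound, so a single extreme cell cannot dominate downstream analyses.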


Citations
Journal Article
24 Jun 2021-Cell
TL;DR: Weighted-nearest neighbor analysis is an unsupervised framework that learns the relative utility of each data type in each cell, enabling an integrative analysis of multiple modalities.

3,369 citations

Journal Article
Jonas Schulte-Schrepping, Nico Reusch, Daniela Paclik, Kevin Baßler, Stephan Schlickeiser, Bowen Zhang, Benjamin Krämer, Tobias Krammer, Sophia Brumhard, Lorenzo Bonaguro, Elena De Domenico, Daniel Wendisch, Martin Grasshoff, Theodore S. Kapellos, Michael Beckstette, Tal Pecht, Adem Saglam, Oliver Dietrich, Henrik E. Mei, Axel Schulz, Claudia Conrad, Désirée Kunkel, Ehsan Vafadarnejad, Cheng-Jian Xu, Arik Horne, Miriam Herbert, Anna Drews, Charlotte Thibeault, Moritz Pfeiffer, Stefan Hippenstiel, Andreas C. Hocke, Holger Müller-Redetzky, Katrin-Moira Heim, Felix Machleidt, Alexander Uhrig, Laure Bosquillon de Jarcy, Linda Jürgens, Miriam Stegemann, Christoph R. Glösenkamp, Hans-Dieter Volk, Christine Goffinet, Markus Landthaler, Emanuel Wyler, Philipp Georg, Maria Schneider, Chantip Dang-Heine, Nick Neuwinger, Kai Kappert, Rudolf Tauber, Victor M. Corman, Jan Raabe, Kim Melanie Kaiser, Michael To Vinh, Gereon Rieke, Christian Meisel, Thomas Ulas, Matthias Becker, Robert Geffers, Martin Witzenrath, Christian Drosten, Norbert Suttorp, Christof von Kalle, Florian Kurth, Kristian Händler, Joachim L. Schultze, Anna C. Aschenbrenner, Yang Li, Jacob Nattermann, Birgit Sawitzki, Antoine-Emmanuel Saliba, Leif E. Sander, Angel Angelov, Robert Bals, Alexander Bartholomäus, Anke Becker, Daniela Bezdan, Ezio Bonifacio, Peer Bork, Thomas Clavel, Maria Colomé-Tatché, Andreas Diefenbach, Alexander T. Dilthey, Nicole Fischer, Konrad U. Förstner, Julia-Stefanie Frick, Julien Gagneur, Alexander Goesmann, Torsten Hain, Michael Hummel, Stefan Janssen, Jörn Kalinowski, René Kallies, Birte Kehr, Andreas Keller, Sarah Kim-Hellmuth, Christoph Klein, Oliver Kohlbacher, Jan O. Korbel, Ingo Kurth, Kerstin U. Ludwig, Oliwia Makarewicz, Manja Marz, Alice C. McHardy, Christian Mertes, Markus M. Nöthen, Peter Nürnberg, Uwe Ohler, Stephan Ossowski, Jörg Overmann, Silke Peter, Klaus Pfeffer, Anna R. Poetsch, Alfred Pühler, Nikolaus Rajewsky, Markus Ralser, Olaf Rieß, Stephan Ripke, Ulisses Nunes da Rocha, Philip Rosenstiel, Philipp H. Schiffer, Eva-Christina Schulte, Alexander Sczyrba, Oliver Stegle, Jens Stoye, Fabian J. Theis, Janne Vehreschild, Jörg Vogel, Max von Kleist, Andreas Walker, Jörn Walter, Dagmar Wieczorek, John Ziebuhr
17 Sep 2020-Cell
TL;DR: This study provides detailed insights into the systemic immune response to SARS-CoV-2 infection and it reveals profound alterations in the myeloid cell compartment associated with severe COVID-19.

1,042 citations

Journal Article
TL;DR: This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years in single-cell data science.
Abstract: The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands-or even millions-of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.

677 citations

Journal Article
TL;DR: It is reported that a remarkable shift in epithelial cell phenotypes occurs in the peripheral lung in PF, and several previously unrecognized epithelial cell phenotypes are identified, including a KRT5−/KRT17+ pathologic, ECM-producing epithelial cell population that was highly enriched in PF lungs.
Abstract: Pulmonary fibrosis (PF) is a form of chronic lung disease characterized by pathologic epithelial remodeling and accumulation of extracellular matrix (ECM). To comprehensively define the cell types, mechanisms, and mediators driving fibrotic remodeling in lungs with PF, we performed single-cell RNA sequencing of single-cell suspensions from 10 nonfibrotic control and 20 PF lungs. Analysis of 114,396 cells identified 31 distinct cell subsets/states. We report that a remarkable shift in epithelial cell phenotypes occurs in the peripheral lung in PF and identify several previously unrecognized epithelial cell phenotypes, including a KRT5−/KRT17+ pathologic, ECM-producing epithelial cell population that was highly enriched in PF lungs. Multiple fibroblast subtypes were observed to contribute to ECM expansion in a spatially discrete manner. Together, these data provide high-resolution insights into the complexity and plasticity of the distal lung epithelium in human disease and indicate a diversity of epithelial and mesenchymal cells contribute to pathologic lung fibrosis.

453 citations


Cites methods from "Normalization and variance stabiliz..."

  • ...The slingshot wrapper function was performed with the UMAP dimensionality reduction and cluster labels as in Seurat objects to identify the trajectory....

    [...]

  • ...Cell type annotation and doublet removal Markers specific for major cell types PTPRC+ (immune cells), EPCAM+ (epithelial cells), PECAM1+/PTPRC− (endothelial cells), and PTPRC−/EPCAM−/PECAM1− (mesenchymal cells) were used to split Seurat clusters into four subgroups (fig....

    [...]

  • ...To test for robustness of the differentially expressed analysis and assess for batch effects, we applied latent.vars function embedded in Seurat FindMarkers to assign processing site, flow cell, or processing site and flow cell as latent variables....

    [...]

  • ...We defined inclusion criteria for cells based on observations from the entire dataset, removed low-quality cells accordingly, then performed dimensionality reduction, and unsupervised clustering of the 114,396 recovered cells using the Seurat (25, 26) package in R (see Materials and Methods and fig....

    [...]

  • ...be due in part to the normalization and variance stabilization approach used in Seurat V3 (26)....

    [...]

Journal Article
25 Jun 2020-Cell
TL;DR: Viral-Track is introduced, a computational method that globally scans unmapped scRNA-seq data for the presence of viral RNA, enabling transcriptional cell sorting of infected versus bystander cells and provides a robust technology for dissecting the mechanisms of viral-infection and pathology.

368 citations

References
Journal Article
TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html .

47,038 citations

Journal Article
TL;DR: edgeR is a Bioconductor software package for examining differential expression of replicated count data. It uses an overdispersed Poisson model to account for both biological and technical variability, and empirical Bayes methods to moderate the degree of overdispersion across transcripts, improving the reliability of inference.
Abstract: Summary: It is expected that emerging digital gene expression (DGE) technologies will overtake microarray technologies in the near future for many functional genomics applications. One of the fundamental data analysis tasks, especially for gene expression studies, involves determining whether there is evidence that counts for a transcript or exon are significantly different across experimental conditions. edgeR is a Bioconductor software package for examining differential expression of replicated count data. An overdispersed Poisson model is used to account for both biological and technical variability. Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference. The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated. The software may have other applications beyond sequencing data, such as proteome peptide count data. Availability: The package is freely available under the LGPL licence from the Bioconductor web site (http://bioconductor.org).

29,413 citations


"Normalization and variance stabiliz..." refers background or result in this paper

  • ..., that the majority of genes are not differentially expressed across conditions) [28]....

    [...]

  • ...We demonstrate that a regularization step, a common step in bulk RNA-seq analysis [22, 28] where parameter estimates are pooled across genes with similar mean abundance, can effectively overcome this challenge and yield reproducible models....

    [...]

  • ...This is consistent with previous observations in both bulk and single-cell RNA-seq that count data is overdispersed [9, 12, 14, 28]....

    [...]
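The regularization idea quoted in the excerpts above, pooling parameter estimates across genes with similar mean abundance, can be pictured as smoothing noisy per-gene estimates along the axis of mean expression. The sketch below uses a Gaussian (Nadaraya-Watson) kernel smoother as one simple way to do this. The function name, the default bandwidth, and the toy numbers are hypothetical, and this is not presented as the exact procedure used by the paper or its software.

```python
import math

def kernel_smooth(log_means, estimates, bandwidth=0.3):
    """Smooth per-gene parameter estimates over log10 mean expression.

    Each gene's regularized value is a weighted average of all genes'
    estimates, with weights that decay as mean abundance differs.
    """
    smoothed = []
    for x0 in log_means:
        weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
                   for x in log_means]
        total = sum(weights)
        smoothed.append(
            sum(w * e for w, e in zip(weights, estimates)) / total
        )
    return smoothed

# Toy data: the middle gene has an implausibly large raw estimate.
raw = [1.0, 10.0, 1.0]
regularized = kernel_smooth([0.0, 0.1, 0.2], raw)
```

In this toy case the outlying middle estimate is pulled toward its neighbors of similar abundance, which is the stabilizing effect the excerpt describes.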

Journal Article
TL;DR: A method based on the negative binomial distribution, with variance and mean linked by local regression, is proposed and an implementation, DESeq, as an R/Bioconductor package is presented.
Abstract: High-throughput sequencing assays such as RNA-Seq, ChIP-Seq or barcode counting provide quantitative readouts in the form of count data. To infer differential signal in such data correctly and with good statistical power, estimation of data variability throughout the dynamic range and a suitable error model are required. We propose a method based on the negative binomial distribution, with variance and mean linked by local regression and present an implementation, DESeq, as an R/Bioconductor package.

13,356 citations

Journal Article
TL;DR: The hierarchical model of Lonnstedt and Speed (2002) is developed into a practical approach for general microarray experiments with arbitrary numbers of treatments and RNA samples and the moderated t-statistic is shown to follow a t-distribution with augmented degrees of freedom.
Abstract: The problem of identifying differentially expressed genes in designed microarray experiments is considered. Lonnstedt and Speed (2002) derived an expression for the posterior odds of differential expression in a replicated two-color experiment using a simple hierarchical parametric model. The purpose of this paper is to develop the hierarchical model of Lonnstedt and Speed (2002) into a practical approach for general microarray experiments with arbitrary numbers of treatments and RNA samples. The model is reset in the context of general linear models with arbitrary coefficients and contrasts of interest. The approach applies equally well to both single channel and two color microarray experiments. Consistent, closed form estimators are derived for the hyperparameters in the model. The estimators proposed have robust behavior even for small numbers of arrays and allow for incomplete data arising from spot filtering or spot quality weights. The posterior odds statistic is reformulated in terms of a moderated t-statistic in which posterior residual standard deviations are used in place of ordinary standard deviations. The empirical Bayes approach is equivalent to shrinkage of the estimated sample variances towards a pooled estimate, resulting in far more stable inference when the number of arrays is small. The use of moderated t-statistics has the advantage over the posterior odds that the number of hyperparameters which need to be estimated is reduced; in particular, knowledge of the non-null prior for the fold changes is not required. The moderated t-statistic is shown to follow a t-distribution with augmented degrees of freedom. The moderated t inferential approach extends to accommodate tests of composite null hypotheses through the use of moderated F-statistics. The performance of the methods is demonstrated in a simulation study. Results are presented for two publicly available data sets.
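The variance shrinkage behind the moderated t-statistic can be sketched in a few lines. The example below assumes the prior variance and prior degrees of freedom are already known (in practice they are estimated from all genes). The function and argument names are illustrative and do not correspond to any package's API.

```python
import math

def moderated_t(beta_hat, s2, df, v, s0_2, d0):
    """Moderated t-statistic for one gene.

    beta_hat : estimated coefficient or contrast
    s2, df   : the gene's sample variance and residual degrees of freedom
    v        : unscaled variance factor of beta_hat from the design
    s0_2, d0 : prior variance and prior degrees of freedom (assumed given)
    """
    # Shrink the gene-wise variance toward the pooled prior value.
    s2_post = (d0 * s0_2 + df * s2) / (d0 + df)
    t = beta_hat / math.sqrt(s2_post * v)
    # The statistic is referred to a t-distribution with d0 + df
    # degrees of freedom (the "augmented" degrees of freedom).
    return t, d0 + df

t_stat, total_df = moderated_t(beta_hat=2.0, s2=1.0, df=4, v=0.5,
                               s0_2=3.0, d0=4)
```

Here the gene's own variance of 1.0 is averaged with the prior variance of 3.0 using the degrees of freedom as weights, giving a posterior variance of 2.0, so a gene with an unusually small sample variance no longer produces an inflated statistic.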

11,864 citations

Journal Article
13 Jun 2019-Cell
TL;DR: A strategy to "anchor" diverse datasets together, enabling us to integrate single-cell measurements not only across scRNA-seq technologies, but also across different modalities.

7,892 citations
