

Construction of continuously expandable single-cell atlases through integration of heterogeneous datasets in a generalized cell-embedding space

28 Apr 2021, bioRxiv (Cold Spring Harbor Laboratory)

TL;DR: SCALEX, a deep generative framework that maps cells into a generalized, batch-invariant cell-embedding space, outperforms competing methods, especially for datasets with partial overlaps, accurately aligning similar cell populations while retaining true biological differences.
Abstract: Single-cell RNA-seq and ATAC-seq analyses have been widely applied to decipher cell-type and regulation complexities. However, experimental conditions often confound biological variations when comparing data from different samples. For integrative single-cell data analysis, we have developed SCALEX, a deep generative framework that maps cells into a generalized, batch-invariant cell-embedding space. We demonstrate that SCALEX accurately and efficiently integrates heterogeneous single-cell data using multiple benchmarks. It outperforms competing methods, especially for datasets with partial overlaps, accurately aligning similar cell populations while retaining true biological differences. We demonstrate the advantages of SCALEX by constructing continuously expandable single-cell atlases for human, mouse, and COVID-19, which were assembled from multiple data sources and can keep growing through the inclusion of new incoming data. Analyses based on these atlases revealed the complex cellular landscapes of human and mouse tissues and identified multiple peripheral immune subtypes associated with COVID-19 disease severity.


Lei Xiong
Tsinghua University https://orcid.org/0000-0002-2392-114X
Kang Tian
Tsinghua University
Yuzhe Li
Peking University
Qiangfeng Zhang ( qczhang@tsinghua.edu.cn )
Tsinghua University https://orcid.org/0000-0002-4913-0338
Article
Keywords: COVID-19, SCALEX, disease severity, immune subtypes
Posted Date: April 28th, 2021
DOI: https://doi.org/10.21203/rs.3.rs-398163/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License. 

Lei Xiong (1,2,4), Kang Tian (1,2,4), Yuzhe Li (1,3), Qiangfeng Cliff Zhang (1,2,*)
1. MOE Key Laboratory of Bioinformatics, Beijing Advanced Innovation Center for Structural Biology & Frontier Research Center for Biological Structure, Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing, China 100084
2. Tsinghua-Peking Center for Life Sciences, Beijing, China 100084
3. Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China 100871
4. Co-first authorship
* Correspondence: qczhang@tsinghua.edu.cn (Q.C.Z.)
INTRODUCTION
Single-cell RNA sequencing (scRNA-seq) and assay for transposase-accessible chromatin using sequencing (scATAC-seq) technologies enable decomposition of diverse cell types and states to elucidate their function and regulation in tissues and heterogeneous systems [1-4]. Efforts like the Human Cell Atlas project [5] and the Tabula Muris Consortium [6] are constructing a single-cell reference landscape for a new era of highly resolved cell research. With the explosive accumulation of single-cell studies, integrative analysis of data from experiments of different contexts is essential for characterizing heterogeneous cell populations [7]. However, potentially informative biological insights are often confounded by batch effects that reflect different donors, conditions, and/or analytical platforms [8,9].
Integration methods have been developed to remove batch effects in single-cell datasets [10-16]. One common strategy is to identify similar cells or cell populations across batches. This includes the mutual nearest neighborhood (MNN) method [10], which identifies corresponding pairs of cells between two batches by searching for mutual nearest neighbors in gene expression. Scanorama [11] generalizes the neighbor search from two batches to multiple batches. Seurat v2 [13] applies canonical correlation analysis (CCA) to identify common cell populations in low-dimensional embeddings across data batches, while Seurat v3 [14] introduces “cell anchors” to mitigate the problem of mixing non-overlapping populations, an issue experienced in Seurat v2. Harmony [16] also applies population matching across batches, specifically through a fuzzy clustering algorithm.
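
The mutual-nearest-neighbor search mentioned above can be illustrated with a small sketch. This is not the published MNN implementation: the function names, the Euclidean distance, the brute-force search, and the toy two-gene data are all assumptions made for illustration. It only shows the pairing rule, in which a cell in one batch and a cell in another are paired when each lies among the other's k nearest neighbors.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(query, reference, k):
    """For each cell in `query`, return the index set of its k nearest cells in `reference`."""
    neighbors = []
    for q in query:
        ranked = sorted(range(len(reference)),
                        key=lambda j: euclidean(q, reference[j]))
        neighbors.append(set(ranked[:k]))
    return neighbors


def mutual_nearest_neighbors(batch_a, batch_b, k=1):
    """Return (i, j) pairs where cell i of batch_a and cell j of batch_b
    are each within the other's k nearest neighbors."""
    a_to_b = k_nearest(batch_a, batch_b, k)
    b_to_a = k_nearest(batch_b, batch_a, k)
    return [(i, j)
            for i in range(len(batch_a))
            for j in sorted(a_to_b[i])
            if i in b_to_a[j]]


# Hypothetical toy data: three cells in batch A, two cells in batch B.
batch_a = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]
batch_b = [[0.5, 0.2], [5.3, 4.8]]

pairs = mutual_nearest_neighbors(batch_a, batch_b, k=1)
print(pairs)  # [(0, 0), (1, 1)]
```

In this toy case the third cell of batch A has a nearest neighbor in batch B, but that neighbor's own nearest cell is a different one, so no pair is formed for it. Real methods work on many genes, use approximate neighbor search, and go on to compute correction vectors from the pairs, none of which is shown here.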

It is notable that all of these cell similarity-based methods are local-based, wherein cell correspondences across batches are identified through the similarity of individual cells or cell anchors/clusters. Accordingly, these methods all suffer from two common limitations. First, they are prone to mixing cell populations that only exist in some batches. This becomes a severe problem for the integration of datasets that contain non-overlapping cell populations in each batch (i.e., partially-overlapping data). Second, these methods can only remove batch effects from the current batches being assessed but cannot manage batch effects from additional, subsequently obtained batches. Each time a new batch is added, an entirely new integration process is therefore required that again examines the previous batches. This severely limits the capacity to integrate new single-cell sequencing datasets.
As an alternative to the cell similarity-based local methods, scVI [17] applies a conditional variational autoencoder (VAE) [18] framework to model the inherent distribution/structure of the input single-cell data. A VAE is a deep generative method that comprises an encoder and a decoder, wherein the encoder projects all high-dimensional input data into a low-dimensional embedding, and the decoder maps the embedding back to the original data space. The VAE framework can maintain the same global internal data structure between the high- and low-dimensional spaces [19]. However, scVI includes a set of batch-conditioned parameters in its encoder that restrains the encoder from learning a batch-invariant embedding space, limiting its generalizability to new batches.
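
For orientation, the standard VAE training objective (the evidence lower bound) can be written as below. The symbols are notation introduced here for illustration and do not come from the text above: x is a cell's profile, z its low-dimensional embedding, s its batch label, q_φ the encoder, p_θ the decoder, and p(z) the prior. The second line sketches what conditioning both networks on the batch label looks like, which is the kind of encoder dependence on s that the paragraph above attributes to scVI. It is a generic conditional form, not scVI's exact likelihood.

```latex
% Standard VAE objective (evidence lower bound), maximized during training
\mathcal{L}(\theta,\phi;x)
  = \mathbb{E}_{q_\phi(z\mid x)}\big[\log p_\theta(x\mid z)\big]
  - D_{\mathrm{KL}}\big(q_\phi(z\mid x)\,\|\,p(z)\big)

% Generic conditional variant: batch label s enters both encoder and decoder
\mathcal{L}(\theta,\phi;x,s)
  = \mathbb{E}_{q_\phi(z\mid x,s)}\big[\log p_\theta(x\mid z,s)\big]
  - D_{\mathrm{KL}}\big(q_\phi(z\mid x,s)\,\|\,p(z)\big)
```

The first term rewards accurate reconstruction of the input, and the second keeps the embedding distribution close to the prior. In the conditional form the embedding z is computed from both x and s, so an encoder trained this way expects a batch label it has already seen.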
We previously applied VAE and designed SCALE (Single-Cell ATAC-seq Analysis via Latent feature Extraction) to model and analyze single-cell ATAC-seq data [20]. We found that the VAE framework in SCALE can disentangle cell-type-related and batch-related features in a low-dimensional embedding space. Here, having redesigned the VAE framework, we introduce SCALEX as a method for integration of heterogeneous single-cell data. We demonstrate that SCALEX integration is accurate, scalable, and computationally efficient for multiple benchmark datasets from scRNA-seq and scATAC-seq studies. As a specific advantage, SCALEX accomplishes data integration by projecting all single-cell data into a generalized cell-embedding space using a batch-free encoder and a batch-specific decoder. Since the encoder is trained to only preserve batch-invariant biological variations, the resulting cell-embedding space is a generalized one, i.e., common to all projected data. SCALEX is therefore able to accurately integrate partially-overlapping datasets without mixing of non-overlapping cell populations. By design, SCALEX runs very efficiently on huge datasets. These two advantages make SCALEX especially useful for the construction and research utilization of large-scale single-cell atlas studies, based on integrating data from heterogeneous sources. New data can be projected to augment an existing atlas, enabling continuous expansion and improvement of an atlas. We demonstrated these functionalities of SCALEX in the construction and analyses of atlases for human, mouse, and COVID-19 PBMCs.
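
The asymmetry between a batch-free encoder and a batch-specific decoder can be sketched in a few lines. This is a toy illustration and not the SCALEX implementation: the single linear layers, the random untrained weights, the one-hot concatenation used to make the decoder batch-specific, and every name in the code are assumptions. Sampling, variance terms, and training are left out. The only point is where the batch label enters.

```python
import random

random.seed(0)

# Hypothetical toy dimensions.
N_GENES, N_LATENT, N_BATCHES = 6, 2, 3


def random_matrix(rows, cols):
    """Small random weights standing in for trained parameters."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights, vector):
    """Matrix-vector product: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


# Encoder weights see genes only; decoder weights see latent features plus batch.
W_ENC = random_matrix(N_LATENT, N_GENES)
W_DEC = random_matrix(N_GENES, N_LATENT + N_BATCHES)


def encode(expression):
    """Batch-free: the embedding depends on the expression profile alone."""
    return linear(W_ENC, expression)


def decode(latent, batch):
    """Batch-specific: the batch label enters only at reconstruction time."""
    return linear(W_DEC, latent + one_hot(batch, N_BATCHES))


cell = [1.0, 0.0, 2.0, 0.0, 0.5, 3.0]
z = encode(cell)                 # no batch label needed to embed a cell
recon_b0 = decode(z, batch=0)    # same embedding, reconstructed for batch 0
recon_b1 = decode(z, batch=1)    # same embedding, reconstructed for batch 1
```

Because `encode` takes no batch argument, a cell from a batch that was not part of training can still be embedded with the same function, which is the property that makes projection of new data into an existing space possible in this kind of design.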
RESULTS
Projecting single-cell data into a generalized cell-embedding space
The central goal of single-cell data integration is to identify and align similar cells across different batches, while retaining true biological variations within and across cell-types. The fundamental concept underlying SCALEX is disentangling batch-related components away from batch-invariant components of single-cell data and projecting the batch-invariant components into a generalized, batch-invariant cell-embedding space. To accomplish this, SCALEX implements a batch-free encoder and a batch-specific decoder in an asymmetric VAE framework [18] (Fig. 1a; Methods). While the batch-free encoder extracts only biological-related latent features (z) from input



