
Construction of continuously expandable single-cell atlases through integration of heterogeneous datasets in a generalized cell-embedding space

TLDR
SCALEX is a deep generative framework that maps cells into a generalized, batch-invariant cell-embedding space. It outperforms competing methods, especially for datasets with partial overlaps, accurately aligning similar cell populations while retaining true biological differences.
Abstract
Single-cell RNA-seq and ATAC-seq analyses have been widely applied to decipher cell-type and regulation complexities. However, experimental conditions often confound biological variations when comparing data from different samples. For integrative single-cell data analysis, we have developed SCALEX, a deep generative framework that maps cells into a generalized, batch-invariant cell-embedding space. We demonstrate that SCALEX accurately and efficiently integrates heterogeneous single-cell data using multiple benchmarks. It outperforms competing methods, especially for datasets with partial overlaps, accurately aligning similar cell populations while retaining true biological differences. We demonstrate the advantages of SCALEX by constructing continuously expandable single-cell atlases for human, mouse, and COVID-19, which were assembled from multiple data sources and can keep growing through the inclusion of new incoming data. Analyses based on these atlases revealed the complex cellular landscapes of human and mouse tissues and identified multiple peripheral immune subtypes associated with COVID-19 disease severity.


Construction of continuously expandable single-cell
atlases through integration of heterogeneous
datasets in a generalized cell-embedding space
Lei Xiong
Tsinghua University https://orcid.org/0000-0002-2392-114X
Kang Tian
Tsinghua University
Yuzhe Li
Peking University
Qiangfeng Zhang ( qczhang@tsinghua.edu.cn )
Tsinghua University https://orcid.org/0000-0002-4913-0338
Article
Keywords: COVID-19, SCALEX, disease severity, immune subtypes
Posted Date: April 28th, 2021
DOI: https://doi.org/10.21203/rs.3.rs-398163/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License. 

Lei Xiong¹,²,⁴; Kang Tian¹,²,⁴; Yuzhe Li¹,³; Qiangfeng Cliff Zhang¹,²,*
¹ MOE Key Laboratory of Bioinformatics, Beijing Advanced Innovation Center for Structural
Biology & Frontier Research Center for Biological Structure, Center for Synthetic and Systems
Biology, School of Life Sciences, Tsinghua University, Beijing, China 100084
² Tsinghua-Peking Center for Life Sciences, Beijing, China 100084
³ Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China 100871
⁴ Co-first authorship
* Correspondence: qczhang@tsinghua.edu.cn (Q.C.Z.)
INTRODUCTION
Single-cell RNA sequencing (scRNA-seq) and assay for transposase-accessible
chromatin using sequencing (scATAC-seq) technologies enable decomposition of
diverse cell-types and states to elucidate their function and regulation in tissues and
heterogeneous systems¹⁻⁴. Efforts like the Human Cell Atlas project⁵ and Tabula Muris
Consortium⁶ are constructing a single-cell reference landscape for a new era of highly
resolved cell research. With the explosive accumulation of single-cell studies,
integrative analysis of data from experiments of different contexts is essential for
characterizing heterogeneous cell populations⁷. However, potentially informative
biological insights are often confounded by batch effects that reflect different donors,
conditions, and/or analytical platforms⁸,⁹.
Integration methods have been developed to remove batch effects in single-cell
datasets¹⁰⁻¹⁶. One common strategy is to identify similar cells or cell populations across
batches. This includes the mutual nearest neighbors (MNN) method¹⁰, which
identifies corresponding pairs of cells between two batches by searching for mutual
nearest neighbors in gene expression space. Scanorama¹¹ generalizes the neighbor
search from two batches to multiple batches. Seurat v2¹³ applies
canonical correlation analysis (CCA) to identify common cell populations in
low-dimensional embeddings across data batches, while Seurat v3¹⁴ introduces “cell
anchors” to mitigate the problem of mixing non-overlapping populations, an issue
experienced in Seurat v2. Harmony¹⁶ also applies population matching across batches,
specifically through a fuzzy clustering algorithm.
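
The mutual-nearest-neighbor idea described above can be illustrated with a minimal NumPy sketch. This is not the published MNN implementation: the function name `mutual_nearest_neighbors`, the choice of Euclidean distance, the value of `k`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def mutual_nearest_neighbors(batch_a, batch_b, k=3):
    """Return (i, j) pairs where cell i of batch_a and cell j of batch_b
    are each among the other's k nearest neighbors (Euclidean distance)."""
    # Pairwise squared distances, shape (n_a, n_b).
    diff = batch_a[:, None, :] - batch_b[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    # k nearest cells in B for every cell in A, and in A for every cell in B.
    nn_a_to_b = np.argsort(dist, axis=1)[:, :k]
    nn_b_to_a = np.argsort(dist, axis=0)[:k, :].T
    pairs = []
    for i in range(batch_a.shape[0]):
        for j in nn_a_to_b[i]:
            if i in nn_b_to_a[j]:
                pairs.append((i, int(j)))
    return pairs

# Toy data (hypothetical): batch A holds two populations, batch B only the first,
# shifted by a small "batch effect".
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
batch_a = np.vstack([centers[0] + rng.normal(scale=0.1, size=(5, 2)),
                     centers[1] + rng.normal(scale=0.1, size=(5, 2))])
batch_b = centers[0] + 0.5 + rng.normal(scale=0.1, size=(5, 2))
pairs = mutual_nearest_neighbors(batch_a, batch_b, k=3)
```

Each returned pair links one cell in batch A to one cell in batch B; a correction method would then use such pairs to estimate and remove the shift between batches.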

It is notable that all of these cell similarity-based methods are local, in that
cell correspondence across batches is identified through the similarity of individual
cells or cell anchors/clusters. Accordingly, these methods all suffer from two common
limitations. First, they are prone to mixing cell populations that only exist in some
batches. This becomes a severe problem for the integration of datasets that contain
non-overlapping cell populations in each batch (i.e., partially-overlapping data). Second,
these methods can only remove batch effects from the current batches being assessed
but cannot manage batch effects from additional, subsequently obtained batches.
Each time a new batch is added, an entirely new integration process is therefore required
that re-examines the previous batches. This severely limits the capacity to integrate new
single-cell sequencing datasets.
As an alternative to the cell similarity-based local methods, scVI¹⁷ applies a
conditional variational autoencoder (VAE)¹⁸ framework to model the inherent
distribution/structure of the input single-cell data. A VAE is a deep generative model
that comprises an encoder and a decoder, wherein the encoder projects the
high-dimensional input data into a low-dimensional embedding, and the decoder maps
the embedding back to the original data space. The VAE framework can maintain the same
global internal data structure between the high- and low-dimensional spaces¹⁹. However,
scVI includes a set of batch-conditioned parameters in its encoder, which restrains the
encoder from learning a batch-invariant embedding space and limits its generalizability
to new batches.
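
Two standard ingredients of a VAE with a Gaussian latent space are the reparameterized sampling step and the closed-form KL term that pulls the embedding toward a standard normal prior. The sketch below shows both in NumPy. It is a generic textbook formulation and is not taken from scVI or SCALEX; the function names and array shapes are illustrative assumptions.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)

# Hypothetical encoder outputs for 4 cells and a 10-dimensional latent space.
rng = np.random.default_rng(0)
mu = np.zeros((4, 10))
log_var = np.zeros((4, 10))
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

When the encoder output already matches the prior (zero mean, unit variance), the KL term is zero; a training loss would add this term to a reconstruction error computed from the decoder output.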
We previously applied VAE and designed SCALE (Single-Cell ATAC-seq
Analysis via Latent feature Extraction) to model and analyze single-cell ATAC-seq
data²⁰. We found that the VAE framework in SCALE can disentangle cell-type-related
and batch-related features in a low-dimensional embedding space. Here, having
redesigned the VAE framework, we introduce SCALEX as a method for integration of
heterogeneous single-cell data. We demonstrate that SCALEX integration is accurate,
scalable, and computationally efficient for multiple benchmark datasets from scRNA-seq
and scATAC-seq studies. As a specific advantage, SCALEX accomplishes data
integration by projecting all single-cell data into a generalized cell-embedding
space using a batch-free encoder and a batch-specific decoder. Since the encoder is
trained to only preserve batch-invariant biological variations, the resulting
cell-embedding space is a generalized one, i.e., common to all projected data. SCALEX is
therefore able to accurately integrate partially-overlapping datasets without mixing of
non-overlapping cell populations. By design, SCALEX runs very efficiently on huge
datasets. These two advantages make SCALEX especially useful for the construction
and research utilization of large-scale single-cell atlases based on integrating data
from heterogeneous sources. New data can be projected to augment an existing atlas,
enabling continuous expansion and improvement of an atlas. We demonstrated these
functionalities of SCALEX in the construction and analyses of atlases for human,
mouse, and COVID-19 PBMCs.
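
The asymmetry between a batch-free encoder and a batch-specific decoder can be sketched with a toy forward pass. Everything below is an illustrative assumption rather than the SCALEX implementation: the layers are single linear maps with random, untrained weights, the sizes are hypothetical, and feeding the batch label to the decoder as a concatenated one-hot vector is just one simple way to make a decoder batch-specific.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_latent, n_batches = 50, 10, 3   # hypothetical sizes

# Untrained, randomly initialized weights; for illustration only.
W_enc = rng.normal(scale=0.1, size=(n_genes, n_latent))
W_dec = rng.normal(scale=0.1, size=(n_latent + n_batches, n_genes))

def encode(x):
    """Batch-free encoder: the embedding depends on expression only."""
    return x @ W_enc

def decode(z, batch_id):
    """Batch-specific decoder: the batch label is an extra input."""
    one_hot = np.eye(n_batches)[batch_id]
    return np.concatenate([z, one_hot], axis=1) @ W_dec

x = rng.poisson(1.0, size=(4, n_genes)).astype(float)   # toy count matrix
z = encode(x)                                           # no batch label needed
recon_b0 = decode(z, np.zeros(4, dtype=int))            # reconstruct as batch 0
recon_b1 = decode(z, np.ones(4, dtype=int))             # reconstruct as batch 1
```

Because `encode` never sees a batch label, cells from a new batch can be embedded with the same function, while the same embedding decodes differently depending on the batch it is assigned to.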
RESULTS
Projecting single-cell data into a generalized cell-embedding space
The central goal of single-cell data integration is to identify and align similar cells
across different batches, while retaining true biological variations within and across
cell-types. The fundamental concept underlying SCALEX is disentangling
batch-related components away from batch-invariant components of single-cell data and
projecting the batch-invariant components into a generalized, batch-invariant
cell-embedding space. To accomplish this, SCALEX implements a batch-free encoder and
a batch-specific decoder in an asymmetric VAE framework¹⁸ (Fig. 1a, Methods). While
the batch-free encoder extracts only biology-related latent features (z) from input

References
Fast, sensitive and accurate integration of single-cell data with Harmony
TL;DR: Harmony, for the integration of single-cell transcriptomic data, identifies broad and fine-grained populations, scales to large datasets, and can integrate sequencing- and imaging-based data.

From Louvain to Leiden: guaranteeing well-connected communities
TL;DR: In this article, the authors show that the Louvain algorithm may yield arbitrarily badly connected communities and, in the worst case, communities may even be disconnected, especially when running the algorithm iteratively.

Tackling the widespread and critical impact of batch effects in high-throughput data
TL;DR: It is argued that batch effects (as well as other technical and biological artefacts) are widespread and critical to address and experimental and computational approaches for doing so are reviewed.

Single-cell transcriptomics of 20 mouse organs creates a "Tabula Muris"
TL;DR: A compendium of single-cell transcriptomic data from the model organism Mus musculus that comprises more than 100,000 cells from 20 organs and tissues is presented, representing a new resource for cell biology and enabling the direct and controlled comparison of gene expression in cell types that are shared between tissues.

Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors
TL;DR: This work presents a strategy for batch correction based on the detection of mutual nearest neighbors (MNNs) in the high-dimensional expression space and demonstrates the superiority of this approach compared with existing methods by using both simulated and real scRNA-seq data sets.
Frequently Asked Questions (13)
Q1. What contributions have the authors mentioned in the paper "Construction of continuously expandable single-cell atlases through integration of heterogeneous datasets in a generalized cell-embedding space" ?

The authors demonstrate that SCALEX accurately and efficiently integrates heterogeneous single-cell data using multiple benchmarks. The authors demonstrate the advantages of SCALEX by constructing continuously expandable single-cell atlases for human, mouse, and COVID-19, which were assembled from multiple data sources and can keep growing through the inclusion of new incoming data.

Single-cell RNA sequencing (scRNA-seq) and assay for transposase-accessible chromatin using sequencing (scATAC-seq) technologies enable decomposition of diverse cell-types and states to elucidate their function and regulation in tissues and heterogeneous systems¹⁻⁴.

SCALEX can be used to integrate scATAC-seq data as well as cross-modality data (e.g. scRNA-seq and scATAC-seq) (Methods). 

The accurate, scalable, and efficient integration performance of SCALEX depends on its encoder’s capacity to project cells from various sources into a generalized, batch-invariant cell-embedding space.

The authors used Uniform Manifold Approximation and Projection (UMAP)³⁶ embeddings to visualize the integration performance of all methods (Methods).

Seurat v3 and Harmony may have obtained a high batch entropy mixing score because of misaligning different cell-types together. 

COVID-19 dataset composition, including healthy controls and influenza patients, as well as mild/moderate, severe, and convalescent COVID-19 patients. 

Note that all of the raw datasets displayed strong batch effects: cell-types that were common in different batches were separately distributed. 

Dot plot of canonical markers of cell-types of reference pancreas dataset; dot color represents average expression level, while dot size represents the proportion of cells in the group expressing the marker. 

The authors applied SCALEX integration to two large and complex datasets: the mouse atlas dataset (comprising multiple organs from two studies assayed by 10X, Smart-seq2, and Microwell-seq⁶,⁵¹) (Fig. 4a) and the human atlas dataset (comprising multiple organs from two studies assayed by 10X and Microwell-seq³⁹,⁵²).

Total counts of each cell were normalized to the median of the total counts of all cells by using the normalize_total function, with the parameter target_sum=None, in the Scanpy⁶⁹ package.
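
As a rough illustration of what this normalization step computes, the NumPy sketch below rescales every cell so that its total count equals the median total over all cells. In Scanpy the corresponding call would be `sc.pp.normalize_total(adata, target_sum=None)`; the helper name `normalize_to_median_total` and the toy matrix are assumptions, and the sketch assumes every cell has at least one count.

```python
import numpy as np

def normalize_to_median_total(counts):
    """Scale each cell (row) so its total equals the median total over all cells."""
    totals = counts.sum(axis=1)
    target = np.median(totals)
    return counts * (target / totals)[:, None]

# Toy cells-by-genes count matrix (hypothetical values).
counts = np.array([[1.0, 1.0, 2.0],    # total 4
                   [2.0, 2.0, 4.0],    # total 8
                   [5.0, 5.0, 6.0]])   # total 16
normalized = normalize_to_median_total(counts)
```

After scaling, all three toy cells have the same total (the median, 8), while the relative proportions of genes within each cell are unchanged.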

By only considering the degree of batch mixing but ignoring cell-type differences, the batch entropy mixing score is not ideally suited for assessing batch mixing for partially-overlapping datasets.
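
The limitation noted here follows from how such a score is built: it looks only at batch labels in each cell's neighborhood. The sketch below is a simplified stand-in, not the exact procedure used in the paper; it averages the Shannon entropy of batch labels among each cell's `k` nearest neighbors, and the function name, the value of `k`, and the toy embeddings are assumptions.

```python
import numpy as np

def batch_entropy_mixing(embedding, batches, k=5):
    """Mean Shannon entropy of batch labels among each cell's k nearest neighbors."""
    diff = embedding[:, None, :] - embedding[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    np.fill_diagonal(dist, np.inf)          # exclude the cell itself
    neighbors = np.argsort(dist, axis=1)[:, :k]
    labels = np.unique(batches)
    entropies = []
    for idx in neighbors:
        freq = np.array([(batches[idx] == b).mean() for b in labels])
        freq = freq[freq > 0]
        entropies.append(-np.sum(freq * np.log(freq)))
    return float(np.mean(entropies))

# Toy embeddings (hypothetical): same points, with batch 1 either
# interleaved with batch 0 or shifted far away from it.
rng = np.random.default_rng(0)
batches = np.array([0] * 20 + [1] * 20)
mixed = rng.normal(size=(40, 2))
separated = mixed + np.where(batches[:, None] == 1, 100.0, 0.0)
score_mixed = batch_entropy_mixing(mixed, batches)
score_separated = batch_entropy_mixing(separated, batches)
```

Cell-type labels never enter the computation, so an integration that merges two different cell types from different batches would also raise the score, which is why a high value alone does not show that the alignment is biologically correct.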