Construction of continuously expandable single-cell atlases through integration of heterogeneous datasets in a generalized cell-embedding space

Lei Xiong¹, Kang Tian¹, Yuzhe Li², Qiangfeng Zhang¹

Tsinghua University¹, Peking University²

28 Apr 2021 - bioRxiv (Cold Spring Harbor Laboratory)

TL;DR: SCALEX is a deep generative framework that maps cells into a generalized, batch-invariant cell-embedding space. It outperforms competing methods, especially for datasets with partial overlaps, accurately aligning similar cell populations while retaining true biological differences.

Abstract: Single-cell RNA-seq and ATAC-seq analyses have been widely applied to decipher cell-type and regulation complexities. However, experimental conditions often confound biological variations when comparing data from different samples. For integrative single-cell data analysis, we have developed SCALEX, a deep generative framework that maps cells into a generalized, batch-invariant cell-embedding space. We demonstrate that SCALEX accurately and efficiently integrates heterogenous single-cell data using multiple benchmarks. It outperforms competing methods, especially for datasets with partial overlaps, accurately aligning similar cell populations while retaining true biological differences. We demonstrate the advantages of SCALEX by constructing continuously expandable single-cell atlases for human, mouse, and COVID-19, which were assembled from multiple data sources and can keep growing through the inclusion of new incoming data. Analyses based on these atlases revealed the complex cellular landscapes of human and mouse tissues and identified multiple peripheral immune subtypes associated with COVID-19 disease severity.

Summary (4 min read)

INTRODUCTION

Single-cell RNA sequencing (scRNA-seq) and assay for transposase-accessible chromatin using sequencing (scATAC-seq) technologies enable decomposition of diverse cell-types and states to elucidate their function and regulation in tissues and heterogeneous systems [1] [2] [3] [4] .
With the explosive accumulation of single-cell studies, integrative analysis of data from experiments of different contexts is essential for characterizing heterogenous cell populations 7 .
One common strategy is to identify similar cells or cell populations across batches.
Since the encoder is trained to only preserve batch-invariant biological variations, the resulting cell-embedding space is a generalized one, i.e., common to all projected data.
These two advantages make SCALEX especially useful for the construction and research utilization of large-scale single-cell atlas studies, based on integrating data from heterogeneous sources.

Projecting single-cell data into a generalized cell-embedding space

The central goal of single-cell data integration is to identify and align similar cells across different batches, while retaining true biological variations within and across cell-types.
The fundamental concept underlying SCALEX is disentangling batch-related components away from batch-invariant components of single-cell data and projecting the batch-invariant components into a generalized, batch-invariant cell-embedding space.
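As a rough illustration of this idea, the sketch below pairs a toy encoder that never sees the batch label with a toy decoder that does. The linear layers, the one-hot batch conditioning, and every name (`encode`, `decode`, `W_enc`, `W_dec`) are assumptions made for illustration only; this summary does not specify the actual network architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_latent, n_batches = 50, 10, 3

# Toy parameters (hypothetical; a real model would learn these by training).
W_enc = rng.normal(size=(n_genes, n_latent))
W_dec = rng.normal(size=(n_latent + n_batches, n_genes))

def encode(x):
    """Batch-free encoder: the embedding depends on expression values only."""
    return x @ W_enc

def decode(z, batch_ids):
    """Batch-specific decoder: the reconstruction is conditioned on the batch."""
    one_hot = np.eye(n_batches)[batch_ids]
    return np.concatenate([z, one_hot], axis=1) @ W_dec

x = rng.poisson(2.0, size=(4, n_genes)).astype(float)  # 4 toy cells
z = encode(x)                                          # shared embedding
x_hat_b0 = decode(z, np.array([0, 0, 0, 0]))           # decoded as batch 0
x_hat_b1 = decode(z, np.array([1, 1, 1, 1]))           # decoded as batch 1
```

Because `encode` takes no batch argument, cells from a batch that was never seen during training can still be projected into the same space, while batch-specific variation is left to the decoder.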

SCALEX integration is accurate, scalable, and accommodates diverse data types

For comparison, the authors included several other methods in the analyses, including Seurat v3, Harmony, Conos, BBKNN, MNN, Scanorama, and scVI .
Overall, SCALEX, Seurat v3, and Harmony achieved the best integration performance for most of the datasets by merging common cell-types across batches while keeping disparate cell-types apart (Fig. S1 ).
Indeed, by only considering the degree of batch mixing but ignoring cell-type differences, the batch entropy mixing score is not ideally suited for assessing batch mixing for partially-overlapping datasets.
SCALEX integrated the mouse brain scATAC-seq dataset (two batches assayed by snATAC and 10X) 40 very well, aligning common cell subpopulations and separating distinct ones (Fig. 1f ).

SCALEX integrates partially-overlapping datasets

Partially-overlapping datasets present a major challenge for single-cell data integration for local cell similarity-based methods 13, 14 , often leading to over-correction (i.e., mixing of distinct cell-types).
The liver dataset is a partially-overlapping dataset where the hepatocyte population contains multiple subtypes specific to different batches: three subtypes are specific to LIVER_GSE124395, and two other subtypes only appear in LIVER_GSE115469 (Fig. S3 ).
The authors noticed that SCALEX kept the five hepatocyte subtypes apart, whereas Seurat v3 mixed all five and Harmony mixed the hepatocyte-SCD and hepatocyte-TAT-AS1 cells (Fig. 2a ).
To characterize the performance of SCALEX on partially-overlapping datasets, the authors constructed test datasets with a range of common cell-types, down-sampled from the six major cell-types in the pancreas dataset .
SCALEX integration was accurate for all cases, aligning the same cell-types without over-correction, whereas both Seurat v3 and Harmony frequently mixed the cell-types, particularly for the low-overlapping cases (Fig. 2b , Fig. S4 ).

Projection of unseen data into an existing cell-embedding space

The accurate, scalable, and efficient integration performance of SCALEX depends on its encoder's capacity to project cells from various sources into a generalized, batch-invariant cell-embedding space.
The authors speculate that once a cell-embedding space has been constructed after integration of existing data, SCALEX should be able to use the same encoder to project additional (i.e., previously unseen) data onto the same embedding space.
Cell-types were validated by the expression of their canonical markers, including rare cells such as Schwann cells and epsilon cells (Fig. S6b ).
The authors projected three new batches [43] [44] [45] for pancreas tissues (Fig. 3b ) into this "pancreas cell space" using the same encoder trained on the pancreas dataset.
The authors benchmarked annotation accuracy by calculating the adjusted Rand Index (ARI) 46 , the Normalized Mutual Information (NMI) 47 , and the F1 score using the cell-type information in the original studies as a gold standard .
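To make two of these metrics concrete, here is a minimal pure-Python sketch of the adjusted Rand Index and a macro-averaged F1 score. The function names, the choice of macro averaging, and the toy labels are assumptions for illustration; the summary does not state how the F1 score was averaged, and NMI is omitted for brevity.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the contingency table of two labelings."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    pair_counts = Counter(zip(labels_true, labels_pred))
    sum_ij = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def macro_f1(labels_true, labels_pred):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    pairs = list(zip(labels_true, labels_pred))
    scores = []
    for c in sorted(set(labels_true) | set(labels_pred)):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example: original-study labels vs. transferred labels (made-up values).
truth = ["alpha", "alpha", "beta", "beta", "delta", "delta"]
pred = ["alpha", "alpha", "beta", "delta", "delta", "delta"]
ari = adjusted_rand_index(truth, pred)
f1 = macro_f1(truth, pred)
```

ARI compares the two groupings irrespective of label names, whereas F1 requires the predicted label strings to match the gold-standard ones.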

Expanding an existing cell space by including new data

The ability to project new single-cell data into a generalized cell-embedding space allows SCALEX to readily extend this cell space.
SCALEX projection enables post hoc annotation of unknown cell-types in the existing cell space using new data.
The authors found that these cells displayed high expression levels for known epithelial genes .
The authors then projected these epithelial cells onto the pancreas cell space and found that a group of antigen-presenting airway epithelial (SLC16A7+ epithelial) cells were projected onto the same location of the uncharacterized cells (Fig. 3f ).

SCALEX supports construction of expandable single-cell atlases

The ability to combine partially-overlapping data onto a generalized cell-embedding space makes SCALEX a powerful tool to construct a single-cell atlas from a collection of diverse and large datasets.
Common cell-types (including B, T, and endothelial cells in all tissues, and proximal tubule, urothelial, and hepatocytic cells in certain tissues) were well-aligned together at the same position in the cell space.
Importantly, atlases generated with SCALEX can be used and further expanded by projecting new single-cell data to support comparative studies of cells both in the original atlas and in the new data.
The authors found that the same cell-types in the new data batches were correctly projected onto the same locations on the cell-embedding space of the initial mouse atlas (Fig. 4d ), which was also confirmed by the accurate cell-type annotations for the new data by label transfer from the corresponding cell-types in the initial atlas (Fig. 4e ; Methods).
Following the same strategy, the authors also constructed a human atlas by SCALEX integration of multiple tissues from two studies (GSE134255, GSE159929) (Fig. S8a,b ).
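The label transfer mentioned above for newly projected batches can be pictured as a nearest-neighbor vote in the shared embedding space. The sketch below is one plausible way to do this, not necessarily the procedure used in the paper: the function name `knn_label_transfer`, the value `k=5`, the Euclidean distance, and the toy two-dimensional embeddings are all assumptions.

```python
import numpy as np
from collections import Counter

def knn_label_transfer(ref_emb, ref_labels, query_emb, k=5):
    """Assign each query cell the majority label of its k nearest reference cells."""
    ref_emb = np.asarray(ref_emb, dtype=float)
    query_emb = np.asarray(query_emb, dtype=float)
    # Squared Euclidean distances, shape (n_query, n_reference).
    d2 = ((query_emb[:, None, :] - ref_emb[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return [Counter(ref_labels[j] for j in row).most_common(1)[0][0]
            for row in nearest]

# Toy reference atlas: two well-separated cell-types in a 2-D embedding.
rng = np.random.default_rng(0)
ref_emb = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
                     rng.normal(5.0, 0.3, size=(20, 2))])
ref_labels = ["B cell"] * 20 + ["T cell"] * 20

# Two newly projected cells, one near each reference population.
query_emb = np.array([[0.1, -0.2], [4.9, 5.1]])
predicted = knn_label_transfer(ref_emb, ref_labels, query_emb)
```

This only works because reference and query cells share one embedding space; with per-batch embeddings the distances would not be comparable.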

An integrative SCALEX COVID-19 PBMC atlas

These studies often suffer from small sample size and/or limited sampling of various disease states 58, 64 .
Cells across different studies were integrated accurately with the same cell-types aligned together, confirming integration performance of SCALEX (Fig. 5c , Fig. S9d ).
Also enriched in severe patients, a plasma cell subpopulation (MZB1-Plasma) displayed decreased expression for antibody production and was enriched for GO terms of immune and inflammatory responses (Fig. S10c,d ).
Thus, the SCALEX COVID-19 PBMC atlas, generated by integrating a highly diverse collection of single-cell data from individual studies, identified multiple immune cell-types showing dysregulation during COVID-19 disease progression.

Comparative analysis of the SCALEX COVID-19 PBMC atlas and the SC4 consortium study

Recently, a large-scale effort of the Single Cell Consortium for COVID-19 in China (SC4) has generated a single-cell atlas that contains over 1 million cells (including PBMCs and other tissues) from 171 COVID-19 patients and 25 healthy controls 65 (Fig. S11a ).
The proportions of CD14 monocytes, megakaryocytes, plasma cells, and pro T cells were elevated with increasing disease severity, while the proportion of pDC and mDC cells decreased (Fig. 5g ).
Integration of the SC4 data further substantially improved both the scope and resolution of the SCALEX COVID-19 PBMC atlas.
First, this data added macrophages and epithelial cells to the cell space, enabling investigation of their potential involvement in COVID-19.
The integration also supported more precise characterization of specific cell subpopulations.

DISCUSSION

SCALEX provides a VAE framework for integration of heterogeneous single-cell data by disentangling batch-invariant components from batch-related variations and projecting the batch-invariant components into a generalized, low-dimensional cell-embedding space.
SCALEX achieves data integration by projecting all single cells into a generalized cell-embedding space using a universal data projector (i.e., the encoder).
SCALEX's ability to informatively combine data from heterogenous studies and platforms makes it particularly suitable for the current era of single-cell biological research.
The loss function is then transformed into the evidence lower bound (ELBO).
The ELBO can be further decomposed into two terms.

Methods

The first term is the reconstruction term, which minimizes the distance between the generated output data and the original input data.
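For reference, a standard VAE objective of this kind can be written as follows. This is the textbook ELBO with the decoder additionally conditioned on a batch label b, to reflect the batch-specific decoder described in the paper. The notation (x for input expression, z for the latent embedding, b for the batch, \phi and \theta for encoder and decoder parameters) and the use of a fixed prior p(z) are assumptions, since the exact formula is not reproduced in this summary.

```latex
\mathrm{ELBO}(x)
  = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z, b)\bigr]}_{\text{reconstruction term}}
  \;-\; \underbrace{D_{\mathrm{KL}}\bigl(q_\phi(z \mid x) \,\|\, p(z)\bigr)}_{\text{regularization term}}
```

In this form the second term is a regularizer that keeps the approximate posterior over z close to the prior; maximizing the ELBO trades it off against reconstruction quality.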
The authors downloaded gene expression matrices and preprocessed them using the following procedure: i).
(5) Repeated (2)-(4) for 100 iterations with different randomly chosen cells and calculated the average, E, as the final batch entropy mixing score.
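Only step (5) of this procedure survives in the summary, so the sketch below fills in the earlier steps with assumptions: sample a set of cells, find each cell's k nearest neighbors in the embedding, compute the Shannon entropy of the batch labels among those neighbors, and average. The function name, `n_cells=100`, `k=15`, and the unnormalized natural-log entropy are all illustrative choices, not the paper's exact definition.

```python
import numpy as np

def batch_entropy_mixing(emb, batches, n_cells=100, k=15, n_iter=100, seed=0):
    """Average neighborhood batch entropy over repeated random samples of cells."""
    rng = np.random.default_rng(seed)
    emb = np.asarray(emb, dtype=float)
    batch_ids, batch_codes = np.unique(batches, return_inverse=True)
    n_batches = len(batch_ids)
    scores = []
    for _ in range(n_iter):
        picked = rng.choice(len(emb), size=min(n_cells, len(emb)), replace=False)
        d2 = ((emb[picked, None, :] - emb[None, :, :]) ** 2).sum(axis=2)
        # Drop the nearest hit, which is normally the sampled cell itself.
        neighbors = np.argsort(d2, axis=1)[:, 1:k + 1]
        entropies = []
        for row in neighbors:
            p = np.bincount(batch_codes[row], minlength=n_batches) / k
            p = p[p > 0]
            entropies.append(-(p * np.log(p)).sum())
        scores.append(np.mean(entropies))
    return float(np.mean(scores))

# Toy check: two batches that overlap vs. two batches far apart.
rng = np.random.default_rng(1)
mixed = rng.normal(size=(200, 2))
batches = np.array([0, 1] * 100)
separated = mixed + 100.0 * batches[:, None]
score_mixed = batch_entropy_mixing(mixed, batches, n_iter=10)
score_separated = batch_entropy_mixing(separated, batches, n_iter=10)
```

A score like this rewards any mixing of batches, including mixing of different cell-types, which is why it can be misleading for partially-overlapping datasets.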
All other parameters were kept at their default values.
After PCA, the authors used the RunHarmony function for integration.

Differential gene expression analysis and Gene Ontology term enrichment analysis.

Differential gene expression analysis was performed on all expressed genes using the rank_genes_groups function with method="t-test" in the Scanpy package, comparing two selected cell-types in the COVID-19 single-cell atlas.
A gene was considered differentially expressed when its log2 fold change between the two compared conditions was > 1 and its Benjamini-Hochberg adjusted P-value was < 0.01.
The top 200 highly expressed genes sorted by scores (implemented in Scanpy) of each cell-type were used as the input for GO analysis, and enriched GO terms were acquired for each group of cells from the "GO_Biological_Process_2018" dataset using the Python package GSEApy.
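The filtering rule above can be sketched in a few lines. The Benjamini-Hochberg adjustment is implemented by hand here so the example is self-contained; in practice the adjusted P-values would come from the testing tool itself. The gene names and numbers are made-up placeholders, and the helper names are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def is_differentially_expressed(log2fc, padj, lfc_cutoff=1.0, padj_cutoff=0.01):
    """Apply the stated thresholds: log2 fold change > 1 and adjusted P < 0.01."""
    return log2fc > lfc_cutoff and padj < padj_cutoff

# Placeholder genes with made-up statistics.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
log2fc = [2.5, 0.4, 1.8, 3.0]
pvals = [1e-6, 1e-8, 0.02, 0.004]

padj = benjamini_hochberg(pvals)
de_genes = [g for g, lfc, q in zip(genes, log2fc, padj)
            if is_differentially_expressed(lfc, q)]
```

Here GENE_B fails the fold-change cutoff despite its small P-value, and GENE_C fails the adjusted P-value cutoff despite a large fold change.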
The authors defined the inflammatory score and the cytokine score for each cell following Ren et al.

Lei Xiong

Tsinghua University https://orcid.org/0000-0002-2392-114X

Kang Tian

Tsinghua University

Yuzhe Li

Peking University

Qiangfeng Zhang (  qczhang@tsinghua.edu.cn )

Tsinghua University https://orcid.org/0000-0002-4913-0338

Article

Keywords: COVID-19, SCALEX, disease severity, immune subtypes

Posted Date: April 28th, 2021

DOI: https://doi.org/10.21203/rs.3.rs-398163/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Lei Xiong 1,2,4, Kang Tian 1,2,4, Yuzhe Li 1,3, Qiangfeng Cliff Zhang 1,2,*

1 MOE Key Laboratory of Bioinformatics, Beijing Advanced Innovation Center for Structural Biology & Frontier Research Center for Biological Structure, Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing, China 100084

2 Tsinghua-Peking Center for Life Sciences, Beijing, China 100084

3 Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China 100871

4 Co-first authorship

* Correspondence: qczhang@tsinghua.edu.cn (Q.C.Z.)

INTRODUCTION

Single-cell RNA sequencing (scRNA-seq) and assay for transposase-accessible chromatin using sequencing (scATAC-seq) technologies enable decomposition of diverse cell-types and states to elucidate their function and regulation in tissues and heterogeneous systems [1-4]. Efforts like the Human Cell Atlas project and Tabula Muris Consortium are constructing a single-cell reference landscape for a new era of highly resolved cell research. With the explosive accumulation of single-cell studies, integrative analysis of data from experiments of different contexts is essential for characterizing heterogenous cell populations. However, potentially informative biological insights are often confounded by batch effects that reflect different donors, conditions, and/or analytical platforms [8,9].

Integration methods have been developed to remove batch effects in single-cell datasets [10-16]. One common strategy is to identify similar cells or cell populations across batches. This includes the mutual nearest neighborhood (MNN) method which identifies correspondent pairs of cells between two batches by searching for mutual nearest neighbors in gene expression. Scanorama generalizes the process of neighbor searching from within two batches to a multiple-batch manner. Seurat v2 applies canonical correlation analysis (CCA) to identify common cell populations in low-dimensional embeddings across data batches, while Seurat v3 introduces "cell anchors" to mitigate the problem of mixing non-overlapping populations, an issue experienced in Seurat v2. Harmony also applies population matching across batches, specifically through a fuzzy clustering algorithm.

It is notable that all of these cell similarity-based methods are local-based, wherein cell-correspondence across batches is identified through the similarity of individual cells or cell anchors/clusters. Accordingly, these methods all suffer from two common limitations. First, they are prone to mixing cell populations that only exist in some batches. This becomes a severe problem for the integration of datasets that contain non-overlapping cell populations in each batch (i.e., partially-overlapping data). Second, these methods can only remove batch effects from the current batches being assessed but cannot manage batch effects from additional, subsequently obtained batches. So each time a new batch is added, it requires an entirely new integration process that again examines the previous batches. This severely limits the capacity to integrate new single-cell sequencing datasets.

As an alternative to the cell similarity-based local methods, scVI applies a conditional variational autoencoder (VAE) framework to model the inherent distribution/structure of the input single-cell data. VAE is a deep generative method that comprises an encoder and a decoder, wherein the encoder projects all high-dimensional input data into a low-dimensional embedding, and the decoder recovers them back to the original data space. The VAE framework can maintain the same global internal data structure between the high- and low-dimensional spaces. However, scVI includes a set of batch-conditioned parameters into its encoder that restrains the encoder from learning a batch-invariant embedding space, limiting its generalizability with new batches.

We previously applied VAE and designed SCALE (Single-Cell ATAC-seq Analysis via Latent feature Extraction) to model and analyze single-cell ATAC-seq data. We found that the VAE framework in SCALE can disentangle cell-type-related and batch-related features in a low-dimensional embedding space. Here, having redesigned the VAE framework, we introduce SCALEX as a method for integration of heterogeneous single-cell data. We demonstrate that SCALEX integration is accurate, scalable, and computationally efficient for multiple benchmark datasets from scRNA-seq and scATAC-seq studies. As a specific advantage, SCALEX accomplishes data integration through projecting all single-cell data into a generalized cell-embedding space using a batch-free encoder and a batch-specific decoder. Since the encoder is trained to only preserve batch-invariant biological variations, the resulting cell-embedding space is a generalized one, i.e., common to all projected data. SCALEX is therefore able to accurately integrate partially-overlapping datasets without mixing of non-overlapping cell populations. By design, SCALEX runs very efficiently on huge datasets. These two advantages make SCALEX especially useful for the construction and research utilization of large-scale single-cell atlas studies, based on integrating data from heterogeneous sources. New data can be projected to augment an existing atlas, enabling continuous expansion and improvement of an atlas. We demonstrated these functionalities of SCALEX in the construction and analyses of atlases for human, mouse, and COVID-19 PBMCs.

RESULTS

Projecting single-cell data into a generalized cell-embedding space

The central goal of single-cell data integration is to identify and align similar cells across different batches, while retaining true biological variations within and across cell-types. The fundamental concept underlying SCALEX is disentangling batch-related components away from batch-invariant components of single-cell data and projecting the batch-invariant components into a generalized, batch-invariant cell-embedding space. To accomplish this, SCALEX implements a batch-free encoder and a batch-specific decoder in an asymmetric VAE framework (Fig. 1a. Methods). While the batch-free encoder extracts only biological-related latent features (z) from input

Frequently Asked Questions (13)

Q1. What contributions have the authors mentioned in the paper "Construction of continuously expandable single-cell atlases through integration of heterogeneous datasets in a generalized cell-embedding space" ?

The authors demonstrate that SCALEX accurately and efficiently integrates heterogenous single-cell data using multiple benchmarks. The authors demonstrate the advantages of SCALEX by constructing continuously expandable single-cell atlases for human, mouse, and COVID-19, which were assembled from multiple data sources and can keep growing through the inclusion of new incoming data.

Q2. What are the main features of the scATAC-seq technologies?

Single-cell RNA sequencing (scRNA-seq) and assay for transposase-accessible chromatin using sequencing (scATAC-seq) technologies enable decomposition of diverse cell-types and states to elucidate their function and regulation in tissues and heterogeneous systems [1-4].

Q3. What is the way to integrate scATAC data?

SCALEX can be used to integrate scATAC-seq data as well as cross-modality data (e.g. scRNA-seq and scATAC-seq) (Methods).

Q4. What is the way to integrate a cell-embedding space?

The accurate, scalable, and efficient integration performance of SCALEX depends on its encoder's capacity to project cells from various sources into a generalized, batch-invariant cell-embedding space.

Q5. How did the authors visualize the integration performance of all methods?

The authors used Uniform Manifold Approximation and Projection (UMAP) [36] embeddings to visualize the integration performance of all methods (Methods).

Q6. What is the composition of the COVID-19 dataset?

The COVID-19 dataset includes healthy controls and influenza patients, as well as mild/moderate, severe, and convalescent COVID-19 patients.

Q7. Why did Seurat v3 and Harmony achieve the integration performance?

Seurat v3 and Harmony may have obtained a high batch entropy mixing score because of misaligning different cell-types together.

Q8. What is the composition of the COVID-19 atlas?

The COVID-19 dataset underlying the atlas includes healthy controls and influenza patients, as well as mild/moderate, severe, and convalescent COVID-19 patients.

Q9. What did the authors use to visualize the integration performance of the raw datasets?

Note that all of the raw datasets displayed strong batch effects: cell-types that were common in different batches were separately distributed.

Q10. What is the corresponding expression level of the marker?

Dot plot of canonical markers of cell-types of reference pancreas dataset; dot color represents average expression level, while dot size represents the proportion of cells in the group expressing the marker.

Q11. What datasets were used to build a single cell atlas?

The authors applied SCALEX integration to two large and complex datasets: the mouse atlas dataset (comprising multiple organs from two studies assayed by 10X, Smart-seq2, and Microwell-seq [6,51]) (Fig. 4a) and the human atlas dataset (comprising multiple organs from two studies assayed by 10X and Microwell-seq [39,52]).

Q12. What was the function used to normalize the total counts of cells?

Total counts of each cell were normalized to the median of the total counts of all cells by using the normalize_total function, with parameter target_sum=None, in the Scanpy [69] package.
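To show what this normalization does, here is a small pure-Python sketch that rescales every cell to the median total count. It is meant to mirror the behavior described above for Scanpy's normalize_total with target_sum=None; the function name `normalize_to_median_total`, the handling of empty cells, and the toy count matrix are illustrative assumptions.

```python
from statistics import median

def normalize_to_median_total(counts):
    """Scale each cell (row) so its total equals the median total across cells."""
    totals = [sum(cell) for cell in counts]
    # Use only cells with at least one count when computing the target.
    target = median([t for t in totals if t > 0])
    return [
        [value * target / total if total > 0 else 0.0 for value in cell]
        for cell, total in zip(counts, totals)
    ]

# Toy matrix: 3 cells x 3 genes with totals 4, 8, and 16 (median 8).
counts = [
    [1, 1, 2],
    [2, 2, 4],
    [4, 4, 8],
]
normalized = normalize_to_median_total(counts)
row_sums = [sum(cell) for cell in normalized]
```

After scaling, the three toy cells have identical profiles, which shows that differences in sequencing depth alone are removed while relative expression within each cell is preserved.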

Q13. How did the UMAP score evaluate batch mixing?

By only considering the degree of batch mixing but ignoring cell-type differences, the batch entropy mixing score is not ideally suited for assessing batch mixing for partially-overlapping datasets.
