What is the way to integrate scATAC data?

SCALEX can be used to integrate scATAC-seq data as well as cross-modality data (e.g. scRNA-seq and scATAC-seq) (Methods).

What is the way to integrate a cell-embedding space?

The accurate, scalable, and efficient integration performance of SCALEX depends on its encoder’s capacity to project cells from various sources into a generalized, batch-invariant cell-embedding space.

How did the authors visualize the integration performance of all methods?

The authors used Uniform Manifold Approximation and Projection (UMAP)36 embeddings to visualize the integration performance of all methods (Methods).

What is the composition of the COVID-19 dataset?

COVID-19 dataset composition, including healthy controls and influenza patients, as well as mild/moderate, severe, and convalescent COVID-19 patients.

Why did Seurat v3 and Harmony achieve a high batch entropy mixing score?

Seurat v3 and Harmony may have obtained a high batch entropy mixing score because of misaligning different cell-types together.

What is the composition of the COVID-19 atlas?

COVID-19 dataset composition, including healthy controls and influenza patients, as well as mild/moderate, severe, and convalescent COVID-19 patients.

What did the authors use to visualize the integration performance of the raw datasets?

Note that all of the raw datasets displayed strong batch effects: cell-types that were common in different batches were separately distributed.

What is the corresponding expression level of the marker?

Dot plot of canonical markers of cell-types of reference pancreas dataset; dot color represents average expression level, while dot size represents the proportion of cells in the group expressing the marker.
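As a rough illustration of the two quantities such a dot plot encodes, the sketch below computes, for a single marker, the average expression per group (dot color) and the fraction of cells in the group expressing it (dot size). The function name `dot_plot_stats`, the toy values, and the rule that a cell "expresses" a marker when its value is above zero are assumptions made for this example, not details taken from the paper.

```python
def dot_plot_stats(expression, groups):
    """Per-group mean expression and fraction of expressing cells for one marker.

    expression: per-cell expression values of a single marker gene.
    groups: per-cell group (cell-type) labels, same length as expression.
    """
    stats = {}
    for label in sorted(set(groups)):
        values = [v for v, g in zip(expression, groups) if g == label]
        stats[label] = {
            # Dot color: average expression level within the group.
            "mean_expression": sum(values) / len(values),
            # Dot size: proportion of cells above zero (assumed cutoff).
            "fraction_expressing": sum(v > 0 for v in values) / len(values),
        }
    return stats


# Toy data (made up): three "alpha" cells and three "beta" cells.
expression = [0.0, 2.0, 4.0, 0.0, 0.0, 1.0]
groups = ["alpha", "alpha", "alpha", "beta", "beta", "beta"]
stats = dot_plot_stats(expression, groups)
```

In this toy case the marker would be drawn as a large, strongly colored dot for "alpha" and a small, faint dot for "beta".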

What datasets were used to build a single cell atlas?

The authors applied SCALEX integration to two large and complex datasets—the mouse atlas dataset (comprising multiple organs from two studies assayed by 10X, Smart-seq2, and Microwell-seq6,51) (Fig. 4a) and the human atlas dataset (comprising multiple organs from two studies assayed by 10X and Microwell-seq39,52).

What was the function used to normalize the total counts of cells?

Total counts of each cell were normalized to the median of the total counts of all cells by using the normalize_total function, with parameter target_sum=None, in the Scanpy69 package.
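To make the normalization step concrete, here is a minimal pure-Python sketch of scaling each cell to the median total count. It is an illustrative re-implementation of the idea, not Scanpy's code; the function name `normalize_to_median_total`, the toy matrix, and the choice to ignore empty cells when taking the median are assumptions of this example.

```python
from statistics import median


def normalize_to_median_total(counts):
    """Scale every cell (row) so that its total count equals the median total."""
    totals = [sum(row) for row in counts]
    # Assumption of this sketch: empty cells are ignored when taking the median.
    target = median(t for t in totals if t > 0)
    normalized = []
    for row, total in zip(counts, totals):
        if total == 0:
            normalized.append([0.0 for _ in row])  # nothing to rescale
        else:
            normalized.append([value * target / total for value in row])
    return normalized


# Toy count matrix (made up): 3 cells x 2 genes, with totals 4, 8 and 40.
counts = [[1, 3], [2, 6], [10, 30]]
normalized = normalize_to_median_total(counts)
row_totals = [sum(row) for row in normalized]
```

After scaling, every cell in the toy matrix has the same total (the median, 8), so differences in sequencing depth no longer dominate comparisons between cells.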

Why is the batch entropy mixing score not ideally suited for evaluating batch mixing?

By only considering the degree of batch mixing but ignoring cell-type differences, the batch entropy mixing score is not ideally suited for assessing batch mixing for partially-overlapping datasets.
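The limitation can be seen in a simplified version of such a score. The sketch below averages the Shannon entropy of batch labels among each cell's nearest neighbors; it is an assumption-laden stand-in, not the exact formula from the paper's Methods, and the function name `batch_entropy_mixing`, the choice of k, and the toy coordinates are all hypothetical. Note that the function never receives cell-type labels, so it cannot tell correct mixing from misalignment of different cell-types.

```python
import math
from collections import Counter


def batch_entropy_mixing(points, batches, k=3):
    """Mean Shannon entropy (bits) of batch labels among each cell's k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        # Rank all other cells by squared Euclidean distance to cell i.
        neighbors = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, points[j])),
        )[:k]
        freq = Counter(batches[j] for j in neighbors)
        entropy = -sum((n / k) * math.log2(n / k) for n in freq.values())
        scores.append(entropy)
    return sum(scores) / len(scores)


batches = ["A"] * 4 + ["B"] * 4

# Batches far apart in the embedding: every neighborhood holds a single batch.
separated_points = [(0.0,), (1.0,), (2.0,), (3.0,), (100.0,), (101.0,), (102.0,), (103.0,)]
separated = batch_entropy_mixing(separated_points, batches)

# Batches overlapping in the embedding: every neighborhood holds both batches.
mixed_points = [(0.0,), (2.0,), (4.0,), (6.0,), (0.1,), (2.1,), (4.1,), (6.1,)]
mixed = batch_entropy_mixing(mixed_points, batches)
```

The overlapping layout scores higher than the separated one regardless of whether the overlapping cells share a cell-type, which is why a high score alone does not show that partially-overlapping datasets were integrated correctly.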

(Open Access) Construction of continuously expandable single-cell atlases through integration of heterogeneous datasets in a generalized cell-embedding space (2021) | Lei Xiong

Q: What contributions have the authors mentioned in the paper "Construction of continuously expandable single-cell atlases through integration of heterogeneous datasets in a generalized cell-embedding space"?

The authors demonstrate that SCALEX accurately and efficiently integrates heterogenous single-cell data using multiple benchmarks. The authors demonstrate the advantages of SCALEX by constructing continuously expandable single-cell atlases for human, mouse, and COVID-19, which were assembled from multiple data sources and can keep growing through the inclusion of new incoming data.

Construction of continuously expandable single-cell atlases through integration of heterogeneous datasets in a generalized cell-embedding space

Lei Xiong

Tsinghua University https://orcid.org/0000-0002-2392-114X

Kang Tian

Tsinghua University

Yuzhe Li

Peking University

Qiangfeng Zhang (qczhang@tsinghua.edu.cn)

Tsinghua University https://orcid.org/0000-0002-4913-0338

Article

Keywords: COVID-19, SCALEX, disease severity, immune subtypes

Posted Date: April 28th, 2021

DOI: https://doi.org/10.21203/rs.3.rs-398163/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.


Lei Xiong1,2,4, Kang Tian1,2,4, Yuzhe Li1,3, Qiangfeng Cliff Zhang1,2,*

1 MOE Key Laboratory of Bioinformatics, Beijing Advanced Innovation Center for Structural Biology & Frontier Research Center for Biological Structure, Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing, China 100084

2 Tsinghua-Peking Center for Life Sciences, Beijing, China 100084

3 Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China 100871

4 Co-first authorship

* Correspondence: qczhang@tsinghua.edu.cn (Q.C.Z.)

ABSTRACT

Single-cell RNA-seq and ATAC-seq analyses have been widely applied to decipher cell-type and regulation complexities. However, experimental conditions often confound biological variations when comparing data from different samples. For integrative single-cell data analysis, we have developed SCALEX, a deep generative framework that maps cells into a generalized, batch-invariant cell-embedding space. We demonstrate that SCALEX accurately and efficiently integrates heterogenous single-cell data using multiple benchmarks. It outperforms competing methods, especially for datasets with partial overlaps, accurately aligning similar cell populations while retaining true biological differences. We demonstrate the advantages of SCALEX by constructing continuously expandable single-cell atlases for human, mouse, and COVID-19, which were assembled from multiple data sources and can keep growing through the inclusion of new incoming data. Analyses based on these atlases revealed the complex cellular landscapes of human and mouse tissues and identified multiple peripheral immune subtypes associated with COVID-19 disease severity.

INTRODUCTION

Single-cell RNA sequencing (scRNA-seq) and assay for transposase-accessible chromatin using sequencing (scATAC-seq) technologies enable decomposition of diverse cell-types and states to elucidate their function and regulation in tissues and heterogeneous systems1-4. Efforts like the Human Cell Atlas project and Tabula Muris Consortium are constructing a single-cell reference landscape for a new era of highly resolved cell research. With the explosive accumulation of single-cell studies, integrative analysis of data from experiments of different contexts is essential for characterizing heterogenous cell populations. However, potentially informative biological insights are often confounded by batch effects that reflect different donors, conditions, and/or analytical platforms8,9.

Integration methods have been developed to remove batch effects in single-cell datasets10-16. One common strategy is to identify similar cells or cell populations across batches. This includes the mutual nearest neighborhood (MNN) method, which identifies correspondent pairs of cells between two batches by searching for mutual nearest neighbors in gene expression. Scanorama generalizes the process of neighbor searching from within two batches to a multiple-batch manner. Seurat v2 applies canonical correlation analysis (CCA) to identify common cell populations in low-dimensional embeddings across data batches, while Seurat v3 introduces “cell anchors” to mitigate the problem of mixing non-overlapping populations, an issue experienced in Seurat v2. Harmony also applies population matching across batches, specifically through a fuzzy clustering algorithm.
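The mutual-nearest-neighbor idea described above can be sketched in a few lines. This is a minimal illustration of the pairing step only, using brute-force Euclidean search on toy coordinates; the function names `nearest` and `mutual_nearest_pairs`, the value of k, and the data are assumptions of this example and do not reproduce any published implementation.

```python
def nearest(query, reference, k):
    """Indices of the k reference cells closest to `query` (squared Euclidean distance)."""
    order = sorted(
        range(len(reference)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(query, reference[j])),
    )
    return set(order[:k])


def mutual_nearest_pairs(batch1, batch2, k=1):
    """Pairs (i, j) where cell j is among i's neighbors in batch2 and vice versa."""
    nn_1to2 = [nearest(cell, batch2, k) for cell in batch1]
    nn_2to1 = [nearest(cell, batch1, k) for cell in batch2]
    return [
        (i, j)
        for i, neighbors in enumerate(nn_1to2)
        for j in sorted(neighbors)
        if i in nn_2to1[j]
    ]


# Toy expression profiles (made up): the third cell of batch1 has no counterpart in batch2.
batch1 = [(0.0, 0.0), (5.0, 5.0), (20.0, 20.0)]
batch2 = [(0.5, 0.5), (5.5, 5.5)]
pairs = mutual_nearest_pairs(batch1, batch2, k=1)
```

Here the first two cells of each batch pair up, while the batch1-only cell finds a nearest neighbor in batch2 but is not chosen in return, so it forms no pair. Because correspondence is decided cell by cell, the outcome depends on which cells happen to be present in each batch.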

It is notable that all of these cell similarity-based methods are local-based, wherein cell correspondences across batches are identified through the similarity of individual cells or cell anchors/clusters. Accordingly, these methods all suffer from two common limitations. First, they are prone to mixing cell populations that only exist in some batches. This becomes a severe problem for the integration of datasets that contain non-overlapping cell populations in each batch (i.e., partially-overlapping data). Second, these methods can only remove batch effects from the current batches being assessed but cannot manage batch effects from additional, subsequently obtained batches. Each time a new batch is added, an entirely new integration process is therefore required, one that again examines the previous batches. This severely limits the capacity to integrate new single-cell sequencing datasets.

As an alternative to the cell similarity-based local methods, scVI applies a conditional variational autoencoder (VAE) framework to model the inherent distribution/structure of the input single-cell data. A VAE is a deep generative method that comprises an encoder and a decoder, wherein the encoder projects all high-dimensional input data into a low-dimensional embedding, and the decoder maps the embedding back to the original data space. The VAE framework can maintain the same global internal data structure between the high- and low-dimensional spaces. However, scVI incorporates a set of batch-conditioned parameters into its encoder, which restrains the encoder from learning a batch-invariant embedding space and limits its generalizability to new batches.

We previously applied VAE and designed SCALE (Single-Cell ATAC-seq Analysis via Latent feature Extraction) to model and analyze single-cell ATAC-seq data. We found that the VAE framework in SCALE can disentangle cell-type-related and batch-related features in a low-dimensional embedding space. Here, having redesigned the VAE framework, we introduce SCALEX as a method for integration of heterogeneous single-cell data. We demonstrate that SCALEX integration is accurate, scalable, and computationally efficient for multiple benchmark datasets from scRNA-seq and scATAC-seq studies. As a specific advantage, SCALEX accomplishes data integration through projecting all single-cell data into a generalized cell-embedding space using a batch-free encoder and a batch-specific decoder. Since the encoder is trained to only preserve batch-invariant biological variations, the resulting cell-embedding space is a generalized one, i.e., common to all projected data. SCALEX is therefore able to accurately integrate partially-overlapping datasets without mixing of non-overlapping cell populations. By design, SCALEX runs very efficiently on huge datasets. These two advantages make SCALEX especially useful for the construction and research utilization of large-scale single-cell atlas studies, based on integrating data from heterogeneous sources. New data can be projected to augment an existing atlas, enabling continuous expansion and improvement of an atlas. We demonstrated these functionalities of SCALEX in the construction and analyses of atlases for human, mouse, and COVID-19 PBMCs.
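The asymmetry between a batch-free encoder and a batch-specific decoder can be pictured with a toy model. The sketch below is not the SCALEX network: it uses fixed linear maps instead of a trained VAE, and the class name `AsymmetricAutoencoder`, the hand-picked weights, and the per-batch offsets standing in for batch-specific decoder parameters are all assumptions made purely for illustration.

```python
def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


class AsymmetricAutoencoder:
    def __init__(self, enc_weights, dec_weights, batch_offsets):
        self.enc_weights = enc_weights      # latent_dim x n_genes
        self.dec_weights = dec_weights      # n_genes x latent_dim
        self.batch_offsets = batch_offsets  # batch label -> per-gene offset

    def encode(self, x):
        # Batch-free: the encoder never sees a batch label.
        return matvec(self.enc_weights, x)

    def decode(self, z, batch):
        # Batch-specific: the batch label only enters on the decoder side.
        base = matvec(self.dec_weights, z)
        return [b + o for b, o in zip(base, self.batch_offsets[batch])]


# Hypothetical weights: 2 genes, 1 latent dimension, 2 batches.
model = AsymmetricAutoencoder(
    enc_weights=[[0.5, 0.5]],
    dec_weights=[[1.0], [1.0]],
    batch_offsets={"A": [0.0, 0.0], "B": [1.0, -1.0]},
)
z = model.encode([2.0, 4.0])
recon_a = model.decode(z, "A")
recon_b = model.decode(z, "B")
```

Because `encode` takes no batch argument, a cell from a batch that was never seen before can still be placed in the same embedding, which is the property that allows new data to be projected onto an existing atlas.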

RESULTS

Projecting single-cell data into a generalized cell-embedding space

The central goal of single-cell data integration is to identify and align similar cells across different batches, while retaining true biological variations within and across cell-types. The fundamental concept underlying SCALEX is disentangling batch-related components away from batch-invariant components of single-cell data and projecting the batch-invariant components into a generalized, batch-invariant cell-embedding space. To accomplish this, SCALEX implements a batch-free encoder and a batch-specific decoder in an asymmetric VAE framework (Fig. 1a, Methods). While the batch-free encoder extracts only biological-related latent features (z) from input

References

Fast, sensitive and accurate integration of single-cell data with Harmony.

From Louvain to Leiden: guaranteeing well-connected communities

Tackling the widespread and critical impact of batch effects in high-throughput data

Single-cell transcriptomics of 20 mouse organs creates a "Tabula Muris"

Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.
