Construction of continuously expandable single-cell atlases through integration of heterogeneous datasets in a generalized cell-embedding space

Lei Xiong1, Kang Tian1, Yuzhe Li2, Qiangfeng Zhang1
28 Apr 2021 - bioRxiv (Cold Spring Harbor Laboratory)
TL;DR: SCALEX is a deep generative framework that maps cells into a generalized, batch-invariant cell-embedding space; it outperforms competing methods, especially for datasets with partial overlaps, accurately aligning similar cell populations while retaining true biological differences.
Abstract: Single-cell RNA-seq and ATAC-seq analyses have been widely applied to decipher cell-type and regulation complexities. However, experimental conditions often confound biological variations when comparing data from different samples. For integrative single-cell data analysis, we have developed SCALEX, a deep generative framework that maps cells into a generalized, batch-invariant cell-embedding space. We demonstrate that SCALEX accurately and efficiently integrates heterogenous single-cell data using multiple benchmarks. It outperforms competing methods, especially for datasets with partial overlaps, accurately aligning similar cell populations while retaining true biological differences. We demonstrate the advantages of SCALEX by constructing continuously expandable single-cell atlases for human, mouse, and COVID-19, which were assembled from multiple data sources and can keep growing through the inclusion of new incoming data. Analyses based on these atlases revealed the complex cellular landscapes of human and mouse tissues and identified multiple peripheral immune subtypes associated with COVID-19 disease severity.

Summary (4 min read)

INTRODUCTION

  • Single-cell RNA sequencing (scRNA-seq) and assay for transposase-accessible chromatin using sequencing (scATAC-seq) technologies enable decomposition of diverse cell-types and states to elucidate their function and regulation in tissues and heterogeneous systems [1-4].
  • With the explosive accumulation of single-cell studies, integrative analysis of data from experiments of different contexts is essential for characterizing heterogenous cell populations [7].
  • One common strategy is to identify similar cells or cell populations across batches.
  • Since the encoder is trained to only preserve batch-invariant biological variations, the resulting cell-embedding space is a generalized one, i.e., common to all projected data.
  • These two advantages make SCALEX especially useful for the construction and research utilization of large-scale single-cell atlas studies, based on integrating data from heterogeneous sources.

Projecting single-cell data into a generalized cell-embedding space

  • The central goal of single-cell data integration is to identify and align similar cells across different batches, while retaining true biological variations within and across cell-types.
  • The fundamental concept underlying SCALEX is disentangling batch-related components away from batch-invariant components of single-cell data and projecting the batch-invariant components into a generalized, batch-invariant cell-embedding space.

SCALEX integration is accurate, scalable, and accommodates diverse data types

  • For comparison, the authors included several other methods in the analyses, including Seurat v3, Harmony, Conos, BBKNN, MNN, Scanorama, and scVI.
  • Overall, SCALEX, Seurat v3, and Harmony achieved the best integration performance for most of the datasets by merging common cell-types across batches while keeping disparate cell-types apart (Fig. S1).
  • Indeed, by only considering the degree of batch mixing but ignoring cell-type differences, the batch entropy mixing score is not ideally suited for assessing batch mixing for partially-overlapping datasets.
  • SCALEX integrated the mouse brain scATAC-seq dataset (two batches assayed by snATAC and 10X) [40] very well, aligning common cell subpopulations and separating distinct ones (Fig. 1f).

SCALEX integrates partially-overlapping datasets

  • Partially-overlapping datasets present a major challenge for single-cell data integration for local cell similarity-based methods [13, 14], often leading to over-correction (i.e., mixing of distinct cell-types).
  • The liver dataset is a partially-overlapping dataset where the hepatocyte population contains multiple subtypes specific to different batches: three subtypes are specific to LIVER_GSE124395, and two other subtypes only appear in LIVER_GSE115469 (Fig. S3).
  • The authors noticed that SCALEX kept the five hepatocyte subtypes apart, whereas Seurat v3 mixed all five and Harmony mixed the hepatocyte-SCD and hepatocyte-TAT-AS1 cells (Fig. 2a).
  • To characterize the performance of SCALEX on partially-overlapping datasets, the authors constructed test datasets with a range of common cell-types, down-sampled from the six major cell-types in the pancreas dataset.
  • SCALEX integration was accurate for all cases, aligning the same cell-types without over-correction, whereas both Seurat v3 and Harmony frequently mixed the cell-types, particularly for the low-overlapping cases (Fig. 2b, Fig. S4).

Projection of unseen data into an existing cell-embedding space

  • The accurate, scalable, and efficient integration performance of SCALEX depends on its encoder's capacity to project cells from various sources into a generalized, batch-invariant cell-embedding space.
  • The authors speculate that once a cell-embedding space has been constructed after integration of existing data, SCALEX should be able to use the same encoder to project additional (i.e., previously unseen) data onto the same embedding space.
  • Cell-types were validated by the expression of their canonical markers, including rare cells such as Schwann cells and epsilon cells (Fig. S6b).
  • The authors projected three new batches [43-45] for pancreas tissues (Fig. 3b) into this "pancreas cell space" using the same encoder trained on the pancreas dataset.
  • The authors benchmarked annotation accuracy by calculating the adjusted Rand Index (ARI) [46], the Normalized Mutual Information (NMI) [47], and the F1 score, using the cell-type information in the original studies as a gold standard.
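Of the three annotation metrics named above, the adjusted Rand Index can be illustrated from its standard pair-counting definition. The sketch below is not the authors' evaluation code; the function name `adjusted_rand_index` and the toy labels are hypothetical, and it uses only the Python standard library.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same cells."""
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two cells")
    # Pairs of cells grouped together in both labelings (contingency table cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs grouped together in each labeling on its own.
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    # Agreement expected by chance, and the maximum attainable agreement.
    expected = together_true * together_pred / comb(n, 2)
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (together_both - expected) / (maximum - expected)


# Toy example: predicted clusters match the reference up to renaming.
reference = ["alpha", "alpha", "beta", "beta"]
predicted = [1, 1, 0, 0]
score = adjusted_rand_index(reference, predicted)
```

The score depends only on which cells are grouped together, so relabeling clusters does not change it; this is why it suits comparing predicted annotations with the cell-type names used in the original studies.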

Expanding an existing cell space by including new data

  • The ability to project new single-cell data into a generalized cell-embedding space allows SCALEX to readily extend this cell space.
  • SCALEX projection enables post hoc annotation of unknown cell-types in the existing cell space using new data.
  • The authors found that these cells displayed high expression levels for known epithelial genes.
  • The authors then projected these epithelial cells onto the pancreas cell space and found that a group of antigen-presenting airway epithelial (SLC16A7+ epithelial) cells were projected onto the same location as the uncharacterized cells (Fig. 3f).

SCALEX supports construction of expandable single-cell atlases

  • The ability to combine partially-overlapping data onto a generalized cell-embedding space makes SCALEX a powerful tool to construct a single-cell atlas from a collection of diverse and large datasets.
  • Common cell-types (including B, T, and endothelial cells in all tissues, and proximal tubule, urothelial, and hepatocytic cells in certain tissues) were well-aligned at the same position in the cell space.
  • Importantly, atlases generated with SCALEX can be used and further expanded by projecting new single-cell data to support comparative studies of cells both in the original atlas and in the new data.
  • The authors found that the same cell-types in the new data batches were correctly projected onto the same locations on the cell-embedding space of the initial mouse atlas (Fig. 4d), which was also confirmed by the accurate cell-type annotations for the new data by label transfer from the corresponding cell-types in the initial atlas (Fig. 4e; Methods).
  • Following the same strategy, the authors also constructed a human atlas by SCALEX integration of multiple tissues from two studies (GSE134255, GSE159929) (Fig. S8a,b).
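The label-transfer step mentioned above can be pictured as a nearest-neighbor vote in the shared embedding space. The summary does not describe the authors' exact procedure, so the k-nearest-neighbor majority vote below is an assumption, and `transfer_labels`, the toy coordinates, and the value of `k` are all hypothetical.

```python
import math
from collections import Counter


def transfer_labels(atlas_embedding, atlas_labels, query_embedding, k=3):
    """Give each query cell the majority label of its k nearest atlas cells.

    Assumption: atlas and query cells were embedded by the same encoder,
    so Euclidean distances between them are meaningful.
    """
    predicted = []
    for query_cell in query_embedding:
        # Brute-force neighbor search; fine for a toy example only.
        nearest = sorted(
            range(len(atlas_embedding)),
            key=lambda i: math.dist(query_cell, atlas_embedding[i]),
        )[:k]
        votes = Counter(atlas_labels[i] for i in nearest)
        predicted.append(votes.most_common(1)[0][0])
    return predicted


# Toy atlas with two well-separated cell-types in a 2-D embedding.
atlas = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels = ["B cell", "B cell", "B cell", "T cell", "T cell", "T cell"]
new_cells = [(0.5, 0.5), (10.5, 10.5)]
transferred = transfer_labels(atlas, labels, new_cells)
```

Because the atlas embedding is fixed, new batches can be annotated this way without re-integrating the original data.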

An integrative SCALEX COVID-19 PBMC atlas

  • These studies often suffer from small sample size and/or limited sampling of various disease states [58, 64].
  • Cells across different studies were integrated accurately with the same cell-types aligned together, confirming the integration performance of SCALEX (Fig. 5c, Fig. S9d).
  • Also enriched in severe patients, a plasma cell subpopulation (MZB1-Plasma cells) displayed decreased expression for antibody production and was enriched for GO terms of immune and inflammatory responses (Fig. S10c,d).
  • Thus, the SCALEX COVID-19 PBMC atlas, generated by integrating a highly diverse collection of single-cell data from individual studies, identified multiple immune cell-types showing dysregulation during COVID-19 disease progression.

Comparative analysis of the SCALEX COVID-19 PBMC atlas and the SC4 consortium study

  • Recently, a large-scale effort of the Single Cell Consortium for COVID-19 in China (SC4) has generated a single-cell atlas that contains over 1 million cells (including PBMCs and other tissues) from 171 COVID-19 patients and 25 healthy controls [65] (Fig. S11a).
  • The proportions of CD14 monocytes, megakaryocytes, plasma cells, and pro T cells were elevated with increasing disease severity, while the proportion of pDC and mDC cells decreased (Fig. 5g).
  • Integration of the SC4 data further substantially improved both the scope and resolution of the SCALEX COVID-19 PBMC atlas.
  • First, this data added macrophages and epithelial cells to the cell space, enabling investigation of their potential involvement in COVID-19.
  • The integration also supported more precise characterization of specific cell subpopulations.

DISCUSSION

  • SCALEX provides a VAE framework for integration of heterogeneous single-cell data by disentangling batch-invariant components from batch-related variations and projecting the batch-invariant components into a generalized, low-dimensional cell-embedding space.
  • SCALEX achieves data integration by projecting all single cells into a generalized cell-embedding space using a universal data projector (i.e., the encoder).
  • SCALEX's ability to informatively combine data from heterogenous studies and platforms makes it particularly suitable for the current era of single-cell biological research.
  • Then the loss function is transformed into the evidence lower bound (ELBO).
  • The ELBO can be further decomposed into two terms.
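For orientation, the two-term decomposition of the ELBO has the following standard form for a variational autoencoder. The notation is an assumption and is not taken from the paper: x is a cell's input profile, z its latent embedding, b its batch label (included in the decoder because the text describes a batch-specific decoder and a batch-free encoder), q the encoder distribution, and p(z) the prior.

```latex
\mathcal{L}_{\mathrm{ELBO}}(x)
  = \underbrace{\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z, b)\big]}_{\text{reconstruction term}}
  \;-\; \underbrace{D_{\mathrm{KL}}\big(q(z \mid x)\,\|\,p(z)\big)}_{\text{regularization term}}
```

Maximizing the first term pushes the decoder output toward the original input, while the second term keeps the encoder distribution close to the prior.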

Methods

  • The first term is the reconstruction term, which minimizes the distance between the generated output data and the original input data.
  • The authors downloaded gene expression matrices and preprocessed them using the following procedure: i).
  • (5) Repeated steps (2)-(4) for 100 iterations with different randomly chosen cells and calculated the average, E, as the final batch entropy mixing score.
  • All other parameters were kept at their default values.
  • After PCA, the authors used the RunHarmony function for integration.
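The batch entropy mixing score referred to in step (5) above can be sketched as follows. Steps (1)-(4) are not reproduced in this summary, so the neighborhood entropy used here (Shannon entropy of batch labels among each sampled cell's nearest neighbors) is an assumption rather than the authors' exact formula, and the function name, defaults, and toy data are hypothetical.

```python
import math
import random


def batch_entropy_mixing(embedding, batches, k=3, n_cells=4, n_iter=100, seed=0):
    """Average neighborhood batch entropy over repeated random samples of cells."""
    rng = random.Random(seed)
    n = len(embedding)

    def neighborhood_entropy(i):
        # k nearest neighbors of cell i by brute-force Euclidean distance.
        nearest = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(embedding[i], embedding[j]),
        )[:k]
        counts = {}
        for j in nearest:
            counts[batches[j]] = counts.get(batches[j], 0) + 1
        # Shannon entropy of the batch composition of the neighborhood.
        return -sum((c / k) * math.log(c / k) for c in counts.values())

    per_iteration = []
    for _ in range(n_iter):
        chosen = rng.sample(range(n), n_cells)  # different random cells each iteration
        per_iteration.append(sum(neighborhood_entropy(i) for i in chosen) / n_cells)
    return sum(per_iteration) / n_iter


positions = [(float(i),) for i in range(8)]
# Batches interleaved along the axis: every neighborhood contains both batches.
mixed = batch_entropy_mixing(positions, ["a", "b"] * 4)
# Batches far apart: every neighborhood contains a single batch.
apart = [(0.0,), (1.0,), (2.0,), (3.0,), (100.0,), (101.0,), (102.0,), (103.0,)]
separated = batch_entropy_mixing(apart, ["a"] * 4 + ["b"] * 4)
```

A score of this kind rewards any mixing of batches, whether or not the mixed cells share a cell-type, which is the limitation noted earlier for partially-overlapping datasets.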

Differential gene expression analysis and Gene Ontology term enrichment analysis.

  • Differential gene expression analysis was performed on all expressed genes using the rank_genes_groups function with method="t-test" in the Scanpy package, comparing two given cell-types in a COVID-19 single-cell atlas.
  • A gene was considered differentially expressed when its log2-fold change between the two conditions being compared was > 1 and its Benjamini-Hochberg adjusted P-value was < 0.01.
  • The top 200 highly expressed genes of each cell-type, sorted by scores (implemented in Scanpy), were used as the input for GO analysis, and enriched GO terms were acquired for each group of cells from the "GO_Biological_Process_2018" dataset using the Python package GSEApy.
  • The authors defined the inflammatory score and the cytokine score for each cell following Ren et al.
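The differential-expression criterion above (log2-fold change > 1 and Benjamini-Hochberg adjusted P-value < 0.01) can be illustrated without Scanpy. The sketch below applies the standard Benjamini-Hochberg step-up adjustment to raw P-values and then filters genes; the function names, gene names, and numbers are hypothetical, and in practice Scanpy reports adjusted P-values itself.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_de_genes(genes, log2fc, pvalues, lfc_cutoff=1.0, padj_cutoff=0.01):
    """Keep genes passing both the fold-change and the adjusted P-value cutoffs."""
    padj = benjamini_hochberg(pvalues)
    return [
        gene
        for gene, lfc, q in zip(genes, log2fc, padj)
        if lfc > lfc_cutoff and q < padj_cutoff
    ]


# Toy input: only GENE_A passes both thresholds.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
log2fc = [2.0, 3.0, 0.5, 1.5]
pvalues = [0.001, 0.02, 0.0001, 0.5]
selected = select_de_genes(genes, log2fc, pvalues)
```

The filter follows the text literally and keeps only up-regulated genes; if down-regulated genes are also of interest, the absolute value of the fold change would be compared instead.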


Construction of continuously expandable single-cell
atlases through integration of heterogeneous
datasets in a generalized cell-embedding space
Lei Xiong
Tsinghua University https://orcid.org/0000-0002-2392-114X
Kang Tian
Tsinghua University
Yuzhe Li
Peking University
Qiangfeng Zhang ( qczhang@tsinghua.edu.cn )
Tsinghua University https://orcid.org/0000-0002-4913-0338
Article
Keywords: COVID-19, SCALEX, disease severity, immune subtypes
Posted Date: April 28th, 2021
DOI: https://doi.org/10.21203/rs.3.rs-398163/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License. 

Lei Xiong1,2,4, Kang Tian1,2,4, Yuzhe Li1,3, Qiangfeng Cliff Zhang1,2,*
1 MOE Key Laboratory of Bioinformatics, Beijing Advanced Innovation Center for Structural Biology & Frontier Research Center for Biological Structure, Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing, China 100084
2 Tsinghua-Peking Center for Life Sciences, Beijing, China 100084
3 Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China 100871
4 Co-first authorship
* Correspondence: qczhang@tsinghua.edu.cn (Q.C.Z.)
ABSTRACT
Single-cell RNA-seq and ATAC-seq analyses have been widely applied to decipher cell-
type and regulation complexities. However, experimental conditions often confound
biological variations when comparing data from different samples. For integrative single-
cell data analysis, we have developed SCALEX, a deep generative framework that maps
cells into a generalized, batch-invariant cell-embedding space. We demonstrate that
SCALEX accurately and efficiently integrates heterogenous single-cell data using
multiple benchmarks. It outperforms competing methods, especially for datasets with
partial overlaps, accurately aligning similar cell populations while retaining true
biological differences. We demonstrate the advantages of SCALEX by constructing
continuously expandable single-cell atlases for human, mouse, and COVID-19, which
were assembled from multiple data sources and can keep growing through the inclusion
of new incoming data. Analyses based on these atlases revealed the complex cellular
landscapes of human and mouse tissues and identified multiple peripheral immune
subtypes associated with COVID-19 disease severity.
INTRODUCTION
Single-cell RNA sequencing (scRNA-seq) and assay for transposase-accessible
chromatin using sequencing (scATAC-seq) technologies enable decomposition of
diverse cell-types and states to elucidate their function and regulation in tissues and
heterogeneous systems [1-4]. Efforts like the Human Cell Atlas project [5] and Tabula
Muris Consortium [6] are constructing a single-cell reference landscape for a new era of
highly resolved cell research. With the explosive accumulation of single-cell studies,
integrative analysis of data from experiments of different contexts is essential for
characterizing heterogenous cell populations [7]. However, potentially informative
biological insights are often confounded by batch effects that reflect different donors,
conditions, and/or analytical platforms [8,9].
Integration methods have been developed to remove batch effects in single-cell
datasets [10-16]. One common strategy is to identify similar cells or cell populations
across batches. This includes the mutual nearest neighbors (MNN) method [10], which
identifies corresponding pairs of cells between two batches by searching for mutual
nearest neighbors in gene expression. Scanorama [11] generalizes the neighbor search
from two batches to multiple batches. Seurat v2 [13] applies canonical correlation
analysis (CCA) to identify common cell populations in low-dimensional embeddings
across data batches, while Seurat v3 [14] introduces “cell anchors” to mitigate the
problem of mixing non-overlapping populations, an issue experienced in Seurat v2.
Harmony [16] also applies population matching across batches, specifically through a
fuzzy clustering algorithm.

It is notable that all of these cell similarity-based methods are local-based, wherein
cell correspondences across batches are identified through the similarity of individual
cells or cell anchors/clusters. Accordingly, these methods all suffer from two common
limitations. First, they are prone to mixing cell populations that only exist in some
batches. This becomes a severe problem for the integration of datasets that contain
non-overlapping cell populations in each batch (i.e., partially-overlapping data). Second,
these methods can only remove batch effects from the current batches being assessed
but cannot manage batch effects from additional, subsequently obtained batches. Each
time a new batch is added, an entirely new integration process is therefore required that
again examines the previous batches. This severely limits the capacity to integrate new
single-cell sequencing datasets.
As an alternative to the cell similarity-based local methods, scVI [17] applies a
conditional variational autoencoder (VAE) [18] framework to model the inherent
distribution/structure of the input single-cell data. VAE is a deep generative method
that comprises an encoder and a decoder, wherein the encoder projects all
high-dimensional input data into a low-dimensional embedding, and the decoder recovers
them back to the original data space. The VAE framework can maintain the same global
internal data structure between the high- and low-dimensional spaces [19]. However, scVI
includes a set of batch-conditioned parameters in its encoder that restrains the encoder
from learning a batch-invariant embedding space, limiting its generalizability to new
batches.
We previously applied VAE and designed SCALE (Single-Cell ATAC-seq
Analysis via Latent feature Extraction) to model and analyze single-cell ATAC-seq
data [20]. We found that the VAE framework in SCALE can disentangle cell-type-related
and batch-related features in a low-dimensional embedding space. Here, having
redesigned the VAE framework, we introduce SCALEX as a method for integration of
heterogeneous single-cell data. We demonstrate that SCALEX integration is accurate,
scalable, and computationally efficient for multiple benchmark datasets from
scRNA-seq and scATAC-seq studies. As a specific advantage, SCALEX accomplishes data
integration through projecting all single-cell data into a generalized cell-embedding
space using a batch-free encoder and a batch-specific decoder. Since the encoder is
trained to only preserve batch-invariant biological variations, the resulting
cell-embedding space is a generalized one, i.e., common to all projected data. SCALEX is
therefore able to accurately integrate partially-overlapping datasets without mixing of
non-overlapping cell populations. By design, SCALEX runs very efficiently on huge
datasets. These two advantages make SCALEX especially useful for the construction
and research utilization of large-scale single-cell atlas studies, based on integrating data
from heterogeneous sources. New data can be projected to augment an existing atlas,
enabling continuous expansion and improvement of an atlas. We demonstrated these
functionalities of SCALEX in the construction and analyses of atlases for human,
mouse, and COVID-19 PBMCs.
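The asymmetry described above, an encoder that never sees the batch label paired with a decoder that does, can be pictured with a deliberately tiny, untrained sketch. Everything in it is an assumption made for illustration: the class name, the linear layers, and the choice to pass the batch to the decoder as a one-hot vector are hypothetical stand-ins, not the SCALEX architecture or training procedure.

```python
import random


def linear(weights, vector):
    """Matrix-vector product with the matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


class AsymmetricAutoencoderSketch:
    """Toy model: batch-free encoder, batch-aware decoder (random weights, no training)."""

    def __init__(self, n_genes, n_latent, n_batches, seed=0):
        rng = random.Random(seed)
        self.n_batches = n_batches
        # Encoder weights act on the expression profile only.
        self.enc = [[rng.gauss(0, 0.1) for _ in range(n_genes)] for _ in range(n_latent)]
        # Decoder weights act on the embedding plus a one-hot batch indicator.
        self.dec = [
            [rng.gauss(0, 0.1) for _ in range(n_latent + n_batches)]
            for _ in range(n_genes)
        ]

    def encode(self, x):
        # No batch argument: the same function embeds cells from any batch,
        # including batches that were not available when the model was built.
        return linear(self.enc, x)

    def decode(self, z, batch):
        # Batch information enters only on the way back to expression space.
        return linear(self.dec, z + one_hot(batch, self.n_batches))


model = AsymmetricAutoencoderSketch(n_genes=4, n_latent=2, n_batches=2)
cell = [1.0, 0.0, 2.0, 3.0]
embedding = model.encode(cell)
recon_batch0 = model.decode(embedding, batch=0)
recon_batch1 = model.decode(embedding, batch=1)
```

Since `encode` takes no batch argument, projecting a previously unseen batch requires nothing beyond the trained encoder, which is the property the atlas-expansion use case relies on.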
RESULTS
Projecting single-cell data into a generalized cell-embedding space
The central goal of single-cell data integration is to identify and align similar cells
across different batches, while retaining true biological variations within and across
cell-types. The fundamental concept underlying SCALEX is disentangling
batch-related components away from batch-invariant components of single-cell data and
projecting the batch-invariant components into a generalized, batch-invariant
cell-embedding space. To accomplish this, SCALEX implements a batch-free encoder and
a batch-specific decoder in an asymmetric VAE framework [18] (Fig. 1a; Methods). While
the batch-free encoder extracts only biological-related latent features (z) from input
References
Posted ContentDOI
24 May 2019-bioRxiv
TL;DR: In this paper, the authors present epiScanpy, a computational framework for the analysis of single-cell DNA methylation and singlecell ATAC-seq data, which makes the many existing RNA-seq workflows from scanpy available to large-scale singlecell data from other -omics modalities.
Abstract: Epigenetic single-cell measurements reveal a layer of regulatory information not accessible to single-cell transcriptomics, however single-cell-omics analysis tools mainly focus on gene expression data. To address this issue, we present epiScanpy, a computational framework for the analysis of single-cell DNA methylation and single-cell ATAC-seq data. EpiScanpy makes the many existing RNA-seq workflows from scanpy available to large-scale single-cell data from other -omics modalities. We introduce and compare multiple feature space constructions for epigenetic data and show the feasibility of common clustering, dimension reduction and trajectory learning techniques. We benchmark epiScanpy by interrogating different single-cell brain mouse atlases of DNA methylation, ATAC-seq and transcriptomics. We find that differentially methylated and differentially open markers between cell clusters enrich transcriptome-based cell type labels by orthogonal epigenetic information.

46 citations

Posted ContentDOI
23 Nov 2020-medRxiv
TL;DR: An overview of a single-cell data resource derived from samples from COVID-19 patients along with initial observations and guidance on data reuse and exploration is provided.
Abstract: In late 2019 and through 2020, the COVID-19 pandemic swept the world, presenting both scientific and medical challenges associated with understanding and treating a previously unknown disease. To help address the need for great understanding of COVID-19, the scientific community mobilized and banded together rapidly to characterize SARS-CoV-2 infection, pathogenesis and its distinct disease trajectories. The urgency of COVID-19 provided a pressing use-case for leveraging relatively new tools, technologies, and nascent collaborative networks. Single-cell biology is one such example that has emerged over the last decade as a powerful approach that provides unprecedented resolution to the cellular and molecular underpinnings of biological processes. Early foundational work within the single-cell community, including the Human Cell Atlas, utilized published and unpublished data to characterize the putative target cells of SARS-CoV-2 sampled from diverse organs based on expression of the viral receptor ACE2 and associated entry factors TMPRSS2 and CTSL (Muus et al., 2020; Sungnak et al., 2020; Ziegler et al., 2020). This initial characterization of reference data provided an important foundation for framing infection and pathology in the airway as well as other organs. However, initial community analysis was limited to samples derived from uninfected donors and other previously-sampled disease indications. This report provides an overview of a single-cell data resource derived from samples from COVID-19 patients along with initial observations and guidance on data reuse and exploration.

30 citations

Frequently Asked Questions (13)
Q1. What contributions have the authors mentioned in the paper "Construction of continuously expandable single-cell atlases through integration of heterogeneous datasets in a generalized cell-embedding space"?

The authors demonstrate that SCALEX accurately and efficiently integrates heterogenous single-cell data using multiple benchmarks. The authors demonstrate the advantages of SCALEX by constructing continuously expandable single-cell atlases for human, mouse, and COVID-19, which were assembled from multiple data sources and can keep growing through the inclusion of new incoming data. 

Single-cell RNA sequencing (scRNA-seq) and assay for transposase-accessible chromatin using sequencing (scATAC-seq) technologies enable decomposition of diverse cell-types and states to elucidate their function and regulation in tissues and heterogeneous systems [1-4].

SCALEX can be used to integrate scATAC-seq data as well as cross-modality data (e.g. scRNA-seq and scATAC-seq) (Methods). 

The accurate, scalable, and efficient integration performance of SCALEX depends on its encoder’s capacity to project cells from various sources into a generalized, batch-invariant cell-embedding space.

The authors used Uniform Manifold Approximation and Projection (UMAP) [36] embeddings to visualize the integration performance of all methods (Methods).

Seurat v3 and Harmony may have obtained a high batch entropy mixing score because of misaligning different cell-types together. 

COVID-19 dataset composition, including healthy controls and influenza patients, as well as mild/moderate, severe, and convalescent COVID-19 patients. 

Note that all of the raw datasets displayed strong batch effects: cell-types that were common in different batches were separately distributed. 

Dot plot of canonical markers of cell-types of reference pancreas dataset; dot color represents average expression level, while dot size represents the proportion of cells in the group expressing the marker. 

The authors applied SCALEX integration to two large and complex datasets: the mouse atlas dataset (comprising multiple organs from two studies assayed by 10X, Smart-seq2, and Microwell-seq [6,51]) (Fig. 4a) and the human atlas dataset (comprising multiple organs from two studies assayed by 10X and Microwell-seq [39,52]).

Total counts of each cell were normalized to the median of the total counts of all cells by using the normalize_total function with the parameter target_sum=None in the Scanpy [69] package. iv).
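The median-based normalization described above can be shown with a small standard-library sketch: every cell's counts are rescaled so that its total equals the median of the per-cell totals before normalization. The function name `normalize_to_median` and the toy count matrix are hypothetical; in a real workflow Scanpy's `normalize_total` would be applied to an AnnData object instead.

```python
from statistics import median


def normalize_to_median(counts):
    """Rescale each cell (row) so its total equals the median total across cells."""
    totals = [sum(cell) for cell in counts]
    target = median(totals)
    return [
        # Cells with zero total counts are left as zeros to avoid dividing by zero.
        [value * target / total if total else 0.0 for value in cell]
        for cell, total in zip(counts, totals)
    ]


# Three cells with totals 2, 4 and 8; the median total is 4.
raw = [[1, 1], [2, 2], [4, 4]]
normalized = normalize_to_median(raw)
row_totals = [sum(cell) for cell in normalized]
```

After this step, differences in sequencing depth between cells no longer dominate comparisons of their expression profiles.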

By only considering the degree of batch mixing but ignoring cell-type differences, the batch entropy mixing score is not ideally suited for assessing batch mixing for partially-overlapping datasets.