Construction of continuously expandable single-cell atlases through integration of heterogeneous datasets in a generalized cell-embedding space
Summary
INTRODUCTION
- Single-cell RNA sequencing (scRNA-seq) and assay for transposase-accessible chromatin using sequencing (scATAC-seq) technologies enable decomposition of diverse cell-types and states to elucidate their function and regulation in tissues and heterogeneous systems [1] [2] [3] [4].
- With the explosive accumulation of single-cell studies, integrative analysis of data from experiments of different contexts is essential for characterizing heterogeneous cell populations 7.
- One common strategy is to identify similar cells or cell populations across batches.
- Since the encoder is trained to only preserve batch-invariant biological variations, the resulting cell-embedding space is a generalized one, i.e., common to all projected data.
- These two advantages make SCALEX especially useful for the construction and research utilization of large-scale single-cell atlas studies, based on integrating data from heterogeneous sources.
Projecting single-cell data into a generalized cell-embedding space
- The central goal of single-cell data integration is to identify and align similar cells across different batches, while retaining true biological variations within and across cell-types.
- The fundamental concept underlying SCALEX is disentangling batch-related components away from batch-invariant components of single-cell data and projecting the batch-invariant components into a generalized, batch-invariant cell-embedding space.
SCALEX integration is accurate, scalable, and accommodates diverse data types
- For comparison, the authors included several other methods in the analyses, including Seurat v3, Harmony, Conos, BBKNN, MNN, Scanorama, and scVI.
- Overall, SCALEX, Seurat v3, and Harmony achieved the best integration performance for most of the datasets by merging common cell-types across batches while keeping disparate cell-types apart (Fig. S1).
- Indeed, by only considering the degree of batch mixing but ignoring cell-type differences, the batch entropy mixing score is not ideally suited for assessing batch mixing for partially-overlapping datasets.
- SCALEX integrated the mouse brain scATAC-seq dataset (two batches assayed by snATAC and 10X) 40 very well, aligning common cell subpopulations and separating distinct ones (Fig. 1f).
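The limitation of a batch-only mixing measure can be illustrated with a toy calculation. The sketch below is a simplified stand-in, not the authors' batch entropy mixing score: it computes the Shannon entropy of the batch labels in a single neighborhood. The function name `batch_entropy` and the two hypothetical neighborhoods are assumptions made for illustration.

```python
import math
from collections import Counter

def batch_entropy(batch_labels):
    """Shannon entropy (in bits) of the batch labels in one neighborhood."""
    n = len(batch_labels)
    counts = Counter(batch_labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical neighborhood A: one cell-type, drawn evenly from two batches.
neighborhood_a = [("batch1", "alpha"), ("batch2", "alpha")] * 5
# Hypothetical neighborhood B: two distinct cell-types wrongly mixed together,
# each coming from a different batch (over-correction).
neighborhood_b = [("batch1", "alpha"), ("batch2", "beta")] * 5

entropy_a = batch_entropy([batch for batch, _ in neighborhood_a])
entropy_b = batch_entropy([batch for batch, _ in neighborhood_b])

# Both neighborhoods mix the two batches equally, so a measure that looks
# only at batch labels cannot tell correct alignment (A) from
# over-correction (B).
print(entropy_a, entropy_b)
```

Because the cell-type labels never enter the calculation, the two neighborhoods receive the same value, which is why a high mixing score alone does not imply a good integration of partially-overlapping data.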
SCALEX integrates partially-overlapping datasets
- Partially-overlapping datasets present a major challenge for single-cell data integration for local cell similarity-based methods 13, 14, often leading to over-correction (i.e., mixing of distinct cell-types).
- The liver dataset is a partially-overlapping dataset where the hepatocyte population contains multiple subtypes specific to different batches: three subtypes are specific to LIVER_GSE124395, and two other subtypes only appear in LIVER_GSE115469 (Fig. S3).
- The authors noticed that SCALEX kept the five hepatocyte subtypes apart, whereas Seurat v3 mixed all five and Harmony mixed the hepatocyte-SCD and hepatocyte-TAT-AS1 cells (Fig. 2a).
- To characterize the performance of SCALEX on partially-overlapping datasets, the authors constructed test datasets with a range of common cell-types, down-sampled from the six major cell-types in the pancreas dataset.
- SCALEX integration was accurate for all cases, aligning the same cell-types without over-correction, whereas both Seurat v3 and Harmony frequently mixed the cell-types, particularly for the low-overlapping cases (Fig. 2b, Fig. S4).
Projection of unseen data into an existing cell-embedding space
- The accurate, scalable, and efficient integration performance of SCALEX depends on its encoder's capacity to project cells from various sources into a generalized, batch-invariant cell-embedding space.
- The authors speculate that once a cell-embedding space has been constructed after integration of existing data, SCALEX should be able to use the same encoder to project additional (i.e., previously unseen) data onto the same embedding space.
- Cell-types were validated by the expression of their canonical markers, including rare cells such as Schwann cells and epsilon cells (Fig. S6b).
- The authors projected three new batches [43] [44] [45] for pancreas tissues (Fig. 3b) into this "pancreas cell space" using the same encoder trained on the pancreas dataset.
- The authors benchmarked annotation accuracy by calculating the adjusted Rand Index (ARI) 46, the Normalized Mutual Information (NMI) 47, and the F1 score, using the cell-type information in the original studies as a gold standard.
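As a rough illustration of how two of these scores compare predicted annotations against gold-standard labels, the sketch below implements the ARI and a macro-averaged F1 score using only the Python standard library. It is not the authors' code: the summary does not say which implementation or which F1 averaging they used, the function names and toy labels are hypothetical, and in practice library routines such as scikit-learn's `adjusted_rand_score`, `normalized_mutual_info_score`, and `f1_score` would normally be used instead.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """ARI from pair counts; assumes at least two cells."""
    n = len(true_labels)
    # Pairs of cells that share both the true and the predicted label.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (maximum - expected)

def macro_f1(true_labels, pred_labels):
    """Unweighted mean of per-class F1 over the gold-standard classes."""
    pairs = list(zip(true_labels, pred_labels))
    scores = []
    for label in sorted(set(true_labels)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical gold-standard and transferred annotations for four cells.
gold = ["alpha", "alpha", "beta", "delta"]
predicted = ["alpha", "alpha", "beta", "beta"]

ari = adjusted_rand_index(gold, predicted)
f1 = macro_f1(gold, predicted)
print(round(ari, 3), round(f1, 3))
```

The ARI is invariant to how clusters are named, whereas the F1 score requires the predicted labels to use the same names as the gold standard, so the two scores capture complementary aspects of annotation quality.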
Expanding an existing cell space by including new data
- The ability to project new single-cell data into a generalized cell-embedding space allows SCALEX to readily extend this cell space.
- SCALEX projection enables post hoc annotation of unknown cell-types in the existing cell space using new data.
- The authors found that these cells displayed high expression levels for known epithelial genes.
- The authors then projected these epithelial cells onto the pancreas cell space and found that a group of antigen-presenting airway epithelial (SLC16A7+ epithelial) cells were projected onto the same location of the uncharacterized cells (Fig. 3f).
SCALEX supports construction of expandable single-cell atlases
- The ability to combine partially-overlapping data onto a generalized cell-embedding space makes SCALEX a powerful tool to construct a single-cell atlas from a collection of diverse and large datasets.
- Common cell-types (including B, T, and endothelial cells in all tissues and proximal tubule, urothelial, and hepatocytic cells in certain tissues) were well-aligned together at the same position in the cell space.
- Importantly, atlases generated with SCALEX can be used and further expanded by projecting new single-cell data to support comparative studies of cells both in the original atlas and in the new data.
- The authors found that the same cell-types in the new data batches were correctly projected onto the same locations on the cell-embedding space of the initial mouse atlas (Fig. 4d), which was also confirmed by the accurate cell-type annotations for the new data by label transfer from the corresponding cell-types in the initial atlas (Fig. 4e; Methods).
- Following the same strategy, the authors also constructed a human atlas by SCALEX integration of multiple tissues from two studies (GSE134255, GSE159929) (Fig. S8a,b).
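Label transfer in a shared embedding space can be sketched as follows. The summary does not specify the transfer procedure, so this example assumes a simple k-nearest-neighbor majority vote. The function `transfer_labels`, the 2-D coordinates, and the labels are hypothetical and stand in for real atlas embeddings.

```python
import math
from collections import Counter

def transfer_labels(ref_embeddings, ref_labels, query_embeddings, k=3):
    """Assign each query cell the majority label of its k nearest reference cells."""
    transferred = []
    for query in query_embeddings:
        # Rank reference cells by Euclidean distance to the query cell.
        order = sorted(
            range(len(ref_embeddings)),
            key=lambda i: math.dist(query, ref_embeddings[i]),
        )
        votes = Counter(ref_labels[i] for i in order[:k])
        transferred.append(votes.most_common(1)[0][0])
    return transferred

# Hypothetical atlas cells already placed in a 2-D cell-embedding space.
atlas_embeddings = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
                    (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
atlas_labels = ["B cell", "B cell", "B cell", "T cell", "T cell", "T cell"]

# Hypothetical new cells projected into the same space by the same encoder.
new_embeddings = [(0.15, 0.15), (4.9, 5.0)]

transferred = transfer_labels(atlas_embeddings, atlas_labels, new_embeddings, k=3)
print(transferred)
```

This only works because new and existing cells share one embedding space: if the new batch were embedded separately, distances to atlas cells would not be meaningful.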
An integrative SCALEX COVID-19 PBMC atlas
- These studies often suffer from small sample size and/or limited sampling of various disease states 58, 64.
- Cells across different studies were integrated accurately with the same cell-types aligned together, confirming integration performance of SCALEX (Fig. 5c, Fig. S9d).
- Also enriched in severe patients, a plasma cell subpopulation (MZB1-Plasma) displayed decreased expression for antibody production and was enriched for GO terms of immune and inflammatory responses (Fig. S10c,d).
- Thus, the SCALEX COVID-19 PBMC atlas, generated by integrating a highly diverse collection of single-cell data from individual studies, identified multiple immune cell-types showing dysregulations during COVID-19 disease progression.
Comparative analysis of the SCALEX COVID-19 PBMC atlas and the SC4 consortium study
- Recently, a large-scale effort of the Single Cell Consortium for COVID-19 in China (SC4) has generated a single-cell atlas that contains over 1 million cells (including PBMCs and other tissues) from 171 COVID-19 patients and 25 healthy controls 65 (Fig. S11a).
- The proportions of CD14 monocytes, megakaryocytes, plasma cells, and pro T cells were elevated with increasing disease severity, while the proportion of pDC and mDC cells decreased (Fig. 5g).
- Integration of the SC4 data further substantially improved both the scope and resolution of the SCALEX COVID-19 PBMC atlas.
- First, this data added macrophages and epithelial cells to the cell space, enabling investigation of their potential involvement in COVID-19.
- The integration also supported more precise characterization of specific cell subpopulations.
DISCUSSION
- SCALEX provides a VAE framework for integration of heterogeneous single-cell data by disentangling batch-invariant components from batch-related variations and projecting the batch-invariant components into a generalized, low-dimensional cell-embedding space.
- SCALEX achieves data integration by projecting all single cells into a generalized cell-embedding space using a universal data projector (i.e., the encoder).
- SCALEX's ability to informatively combine data from heterogeneous studies and platforms makes it particularly suitable for the current era of single-cell biological research.
- The loss function is then transformed into the evidence lower bound (ELBO).
- The ELBO can be further decomposed into two terms.
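For reference, the standard VAE form of this two-term decomposition is written below. The notation is the conventional one and is assumed here, since the summary does not give the paper's own symbols: $x$ is the input profile of a cell, $z$ its latent embedding, $q_\phi(z \mid x)$ the encoder, $p_\theta(x \mid z)$ the decoder, and $p(z)$ the prior. The paper's exact formulation may differ.

```latex
\mathcal{L}_{\mathrm{ELBO}}(x)
  = \mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]
  - D_{\mathrm{KL}}\bigl(q_\phi(z \mid x) \,\|\, p(z)\bigr)
```

In this standard form, the expectation is the reconstruction term and the Kullback-Leibler divergence is a regularization term that keeps the encoder's latent distribution close to the prior.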
Methods
- The first term is the reconstruction term, which minimizes the distance between the generated output data and the original input data.
- The authors downloaded gene expression matrices and preprocessed them using the following procedure: i).
- (5) Repeated (2)-(4) for 100 iterations with different randomly chosen cells and calculated the average, E, as the final batch entropy mixing score.
- All other parameters were kept at their default values.
- After PCA, the authors used the RunHarmony function for integration.
Differential gene expression analysis and Gene Ontology term enrichment analysis.
- Differential gene expression analysis was performed on all expressed genes using the rank_genes_groups function with method="t-test" in the Scanpy package, to compare two given cell-types in a COVID-19 single-cell atlas.
- A gene was considered differentially expressed when its log2 fold change between the two conditions being compared was > 1 and its Benjamini-Hochberg adjusted P-value was < 0.01.
- The top 200 highly expressed genes sorted by scores (implemented in Scanpy) of each cell-type were used as the input for GO analysis, and enriched GO terms were acquired for each group of cells of the "GO_Biological_Process_2018" dataset using the Python package GSEApy.
- The authors defined the inflammatory score and the cytokine score for each cell following Ren et al.
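The differential-expression cutoff described above can be made explicit with a small standard-library sketch. It is illustrative only: Scanpy's `rank_genes_groups` already reports adjusted P-values, so the Benjamini-Hochberg routine here just shows what that adjustment does, and the gene names, fold changes, and P-values are invented placeholders.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-gene results: (gene, log2 fold change, raw P-value).
results = [
    ("GENE_A", 2.3, 0.0001),
    ("GENE_B", 1.5, 0.02),
    ("GENE_C", 0.4, 0.0002),
    ("GENE_D", 1.2, 0.003),
]

adjusted_p = benjamini_hochberg([p for _, _, p in results])

# Thresholds stated in the text: log2 fold change > 1 and adjusted P < 0.01.
de_genes = [
    gene
    for (gene, log2fc, _), q in zip(results, adjusted_p)
    if log2fc > 1 and q < 0.01
]
print(de_genes)
```

Applying both filters together matters: GENE_C is highly significant but changes too little, while GENE_B changes enough but does not survive multiple-testing correction.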
Frequently Asked Questions (13)
Q2. What are the main features of the scATAC-seq technologies?
Single-cell RNA sequencing (scRNA-seq) and assay for transposase-accessible chromatin using sequencing (scATAC-seq) technologies enable decomposition of diverse cell-types and states to elucidate their function and regulation in tissues and heterogeneous systems 1-4.
Q3. What is the way to integrate scATAC data?
SCALEX can be used to integrate scATAC-seq data as well as cross-modality data (e.g. scRNA-seq and scATAC-seq) (Methods).
Q4. What is the way to integrate a cell-embedding space?
The accurate, scalable, and efficient integration performance of SCALEX depends on its encoder's capacity to project cells from various sources into a generalized, batch-invariant cell-embedding space.
Q5. How did the authors visualize the integration performance of all methods?
The authors used Uniform Manifold Approximation and Projection (UMAP) 36 embeddings to visualize the integration performance of all methods (Methods).
Q6. What is the composition of the COVID-19 dataset?
COVID-19 dataset composition, including healthy controls and influenza patients, as well as mild/moderate, severe, and convalescent COVID-19 patients.
Q7. Why did Seurat v3 and Harmony achieve the integration performance?
Seurat v3 and Harmony may have obtained a high batch entropy mixing score because of misaligning different cell-types together.
Q8. What is the composition of the COVID-19 atlas?
COVID-19 dataset composition, including healthy controls and influenza patients, as well as mild/moderate, severe, and convalescent COVID-19 patients.
Q9. What did the authors use to visualize the integration performance of the raw datasets?
Note that all of the raw datasets displayed strong batch effects: cell-types that were common in different batches were separately distributed.
Q10. What is the corresponding expression level of the marker?
Dot plot of canonical markers of cell-types of reference pancreas dataset; dot color represents average expression level, while dot size represents the proportion of cells in the group expressing the marker.
Q11. What datasets were used to build a single cell atlas?
The authors applied SCALEX integration to two large and complex datasets: the mouse atlas dataset (comprising multiple organs from two studies assayed by 10X, Smart-seq2, and Microwell-seq 6,51) (Fig. 4a) and the human atlas dataset (comprising multiple organs from two studies assayed by 10X and Microwell-seq 39,52).
Q12. What was the function used to normalize the total counts of cells?
Total counts of each cell were normalized to the median of the total counts of all cells by using the normalize_total function, with parameter target_sum=None, in the Scanpy package 69.
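What this normalization does can be shown with a plain-Python sketch. It is a simplified illustration of median-based total-count scaling, not Scanpy's implementation; the function name `normalize_to_median` and the toy count matrix are assumptions.

```python
from statistics import median

def normalize_to_median(counts):
    """Scale each cell (row) so its total equals the median total across cells."""
    totals = [sum(cell) for cell in counts]
    target = median(totals)
    return [
        # Cells with zero total counts are left as zeros to avoid division by zero.
        [value * target / total if total > 0 else 0.0 for value in cell]
        for cell, total in zip(counts, totals)
    ]

# Hypothetical raw counts: three cells (rows) by three genes (columns),
# with very different sequencing depths (totals 4, 8, and 20).
raw_counts = [
    [1, 1, 2],
    [2, 2, 4],
    [10, 0, 10],
]

normalized = normalize_to_median(raw_counts)
for cell in normalized:
    print(cell, sum(cell))
```

After scaling, every cell has the same total (here the median, 8), so differences between cells reflect relative expression rather than sequencing depth.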
Q13. How did the UMAP score evaluate batch mixing?
By only considering the degree of batch mixing but ignoring cell-type differences, the batch entropy mixing score is not ideally suited for assessing batch mixing for partially-overlapping datasets.