SpiceMix: Integrative single-cell spatial modeling for inferring cell identity

Benjamin Chidester¹, Tianming Zhou¹, Jian Ma¹

30 Nov 2020, bioRxiv (Cold Spring Harbor Laboratory)

TL;DR: Spatial transcriptomics technologies promise to reveal spatial relationships of cell-type composition in complex tissues but the development of computational methods that can utilize the unique properties of spatial transcriptome data to unveil cell identities remains a challenge.

Abstract: Spatial transcriptomics technologies promise to reveal spatial relationships of cell-type composition in complex tissues. However, the development of computational methods that capture the unique properties of single-cell spatial transcriptome data to unveil cell identities remains a challenge. Here, we report SPICEMIX, a new probabilistic model that enables effective joint analysis of spatial information and gene expression of single cells based on spatial transcriptome data. Both simulation and real data evaluations demonstrate that SPICEMIX consistently improves upon the inference of the intrinsic cell types compared with existing approaches. As a proof-of-principle, we use SPICEMIX to analyze single-cell spatial transcriptome data of the mouse primary visual cortex acquired by seqFISH+ and STARmap. We find that SPICEMIX can improve cell identity assignments and uncover potentially new cell subtypes. SPICEMIX is a generalizable framework for analyzing spatial transcriptome data that may provide critical insights into the cell-type composition and spatial organization of cells in complex tissues.

Summary

Introduction

The compositions of different cell types in various human tissues remain poorly understood due to the complex interplay among intrinsic, spatial, and temporal factors that collectively contribute to cell identity [1–3].
Single-cell RNA-seq (scRNA-seq) has greatly advanced our understanding of complex cell types in different tissues [4–6], but its utility in disentangling spatial factors in particular is inherently limited by the dissociation of cells from their spatial context.
An earlier HMRF-based method by Zhu et al. [19] relies on the assumptions that spatial subtypes are discrete and exhibit homogeneous spatial patterns, which prevents it from learning the underlying mixture of diverse factors of cell identity with varied spatial patterns (e.g., distinct layer-like structures or diffuse patterns).
Here, the authors report SPICEMIX (Spatial Identification of Cells using Matrix Factorization), a new integrative framework to model spatial transcriptome data.
SPICEMIX has the potential to provide critical new insights into the cell composition based on spatial transcriptome data.

Overview of SPICEMIX

SPICEMIX models the cell-to-cell relationships of the spatial transcriptome by a new probabilistic graphical model formulation, the NMF-HMRF (Fig. 1).
Crucially, SPICEMIX learns the parameters of the model that best explain the input spatial transcriptome data, while simultaneously learning the underlying metagenes and their proportions that define the identities of the cells.
The authors compared the inference of SPICEMIX to that of NMF and HMRF, since they are the fundamental underlying models of many relevant computational methods.
In particular, in the seqFISH+ analysis, the layer-specific excitatory neurons identified by SPICEMIX corresponded closely with their associated layers (Fig. 3c), whereas several excitatory clusters from the original analysis in [12] were incorrectly dispersed across as many as three layers (see Fig. 3h in [12]).
Notably, as annotated in Fig. 3b, metagene 7 is expressed at a high proportion among oligodendrocytes, distinguishing them from OPCs, while the expression of metagene 8, which is also present in OPCs, distinguishes the rare Oligo-2 type from Oligo-1.

Discussion

The authors developed SPICEMIX, an unsupervised method for modeling the diverse factors that collectively contribute to cell identity based on single-cell spatial transcriptome data.
Incorporating additional modalities such as scRNA-seq data may improve the inference of the latent variables and parameters of the model, which could further improve the modeling of cellular heterogeneity.
In addition, further enhancements could be made to the probabilistic model of SPICEMIX including additional priors, such as sparsity, to tailor toward particular application contexts.
As the area of spatial transcriptomics continues to thrive and data become more widely available, SPICEMIX will be a uniquely useful tool for enabling new discoveries.

Graphical model formulation

The authors' formulation of the NMF-HMRF in SPICEMIX enhances standard NMF by modeling the spatial correlations among samples (i.e., cells in this context) via the HMRF [29].
Any graph construction method for determining edges, such as distance thresholding or Delaunay triangulation, can be used.
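As a rough illustration of the distance-thresholding option mentioned above, the following minimal sketch builds an undirected neighbor graph from 2D cell coordinates. The function name `build_neighbor_graph`, the `radius` parameter, and the example coordinates are hypothetical and are not taken from the SPICEMIX software.

```python
import math
from itertools import combinations


def build_neighbor_graph(coords, radius):
    """Return undirected edges (i, j) with i < j for cells within `radius`."""
    edges = []
    for i, j in combinations(range(len(coords)), 2):
        # math.dist computes the Euclidean distance between two points.
        if math.dist(coords[i], coords[j]) <= radius:
            edges.append((i, j))
    return edges


# Hypothetical cell centroids in arbitrary units (illustrative only).
cells = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
edges = build_neighbor_graph(cells, radius=1.5)
```

In this toy input the fourth cell is far from the others and ends up with no neighbors; a Delaunay triangulation would instead connect every cell, which is one reason the choice of construction method may matter.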
The observations are related to the hidden variables via the potential function φ, which captures the NMF formulation.
$U_x$ measures the inner product between the metagene proportions of neighboring cells $i$ and $j$, weighted by a learned pairwise correlation matrix $\Sigma_x^{-1}$, which captures the spatial affinity of metagenes.
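The weighted inner product described above can be sketched as a bilinear form between the metagene weights of two neighboring cells. This is only an illustration under stated assumptions: the function name `pairwise_energy` and the numbers are made up, and any normalization or sign convention used by the full model is not given in this summary and is omitted here.

```python
def pairwise_energy(x_i, x_j, sigma_inv):
    """Bilinear form x_i^T * sigma_inv * x_j between two cells' metagene weights."""
    return sum(
        x_i[a] * sigma_inv[a][b] * x_j[b]
        for a in range(len(x_i))
        for b in range(len(x_j))
    )


# Hypothetical symmetric 2x2 affinity matrix for two metagenes (values made up).
sigma_inv = [[-1.0, 0.5],
             [0.5, 2.0]]
x_i = [0.8, 0.2]  # metagene proportions of cell i
x_j = [0.1, 0.9]  # metagene proportions of neighboring cell j
energy = pairwise_energy(x_i, x_j, sigma_inv)
```

Because the matrix is learned rather than fixed, the same code can express attraction between some metagene pairs and repulsion between others, which is the flexibility the text attributes to the model.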

Parameter priors

This prior can be viewed as a regularization that allows us to control the importance of the spatial relationships during inference.
Alternating estimation of hidden states and model parameters.
To infer the hidden states and model parameters of the NMF-HMRF model in SPICEMIX, the authors optimize the data likelihood via coordinate ascent, alternating between optimizing hidden states and model parameters.
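The alternating scheme can be illustrated on a much simpler problem. The sketch below fits a rank-1 non-negative factorization by alternating exact updates of a single metagene and the per-cell weights. It is not the SPICEMIX update rule (it has no spatial term and no priors); the function name and the toy matrix are assumptions chosen only to show the alternation pattern.

```python
def alternating_rank1_nmf(Y, n_iter=20):
    """Fit Y[g][i] ~ m[g] * x[i] with m, x >= 0 by alternating exact updates."""
    n_genes, n_cells = len(Y), len(Y[0])
    m = [1.0] * n_genes   # "parameter" block: one metagene
    x = [1.0] * n_cells   # "hidden state" block: one weight per cell
    errors = []
    for _ in range(n_iter):
        # Update the hidden states with the metagene held fixed.
        mm = sum(v * v for v in m)
        x = [max(0.0, sum(m[g] * Y[g][i] for g in range(n_genes)) / mm)
             for i in range(n_cells)]
        # Update the metagene with the hidden states held fixed.
        xx = sum(v * v for v in x)
        m = [max(0.0, sum(Y[g][i] * x[i] for i in range(n_cells)) / xx)
             for g in range(n_genes)]
        # Track the squared reconstruction error after each full pass.
        errors.append(sum((Y[g][i] - m[g] * x[i]) ** 2
                          for g in range(n_genes) for i in range(n_cells)))
    return m, x, errors


# Toy expression matrix (2 genes x 3 cells) that is exactly rank 1.
Y = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0]]
m, x, errors = alternating_rank1_nmf(Y)
```

Each block update is the exact minimizer given the other block, so the reconstruction error cannot increase from one pass to the next, which is the property that makes coordinate-wise schemes of this kind attractive.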

Estimation of hidden states

Given the model parameters, estimating the hidden states (Eq. 9) is a quadratic program that can be solved efficiently via iterated conditional modes (ICM) [41] using the software package Gurobi [42] (see Supplementary Methods A.1 for more details).
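To give a feel for ICM-style updates, the sketch below optimizes one node at a time while holding its neighbors fixed. It uses a simple smoothing energy with scalar, non-negative states purely as a stand-in: SPICEMIX itself learns spatial affinities rather than assuming smoothness, and its per-node problem is a quadratic program solved with Gurobi. The function names, the energy, and the toy chain graph are all assumptions for illustration.

```python
def energy(x, y, edges, lam):
    """E(x) = sum_i (x_i - y_i)^2 + lam * sum_(i,j) (x_i - x_j)^2."""
    fit = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    smooth = sum((x[i] - x[j]) ** 2 for i, j in edges)
    return fit + lam * smooth


def icm(y, edges, lam=1.0, n_sweeps=10):
    """Node-wise exact updates of E(x) subject to x_i >= 0."""
    n = len(y)
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    x = list(y)
    for _ in range(n_sweeps):
        for i in range(n):
            # Minimizer for node i with its neighbors fixed, clamped at zero.
            total = y[i] + lam * sum(x[j] for j in nbrs[i])
            x[i] = max(0.0, total / (1.0 + lam * len(nbrs[i])))
    return x


# Toy chain of three cells with one outlying observation.
y = [0.0, 10.0, 0.0]
edges = [(0, 1), (1, 2)]
x_hat = icm(y, edges, lam=1.0)
```

Each node update solves a small problem conditioned on the current neighbor states, which mirrors the structure described in the text even though the per-node objective here is far simpler.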
Algorithm 1 NMF-HMRF model-fitting and hidden state estimation.
Derive an initial estimate $M^{(0)}$ using K-means clustering, assuming no spatial relationships.

P (Y,X|Θ)P (Θ) = argmax

The authors note that they can estimate metagenes, spatial affinity, and the noise level independently.
The MAP estimation of $\Sigma_x^{-1}$ is a convex problem and is solved with the Adam optimizer [43].
See Supplementary Methods A.2 for details of the optimization method.

Initialization

To produce initial estimates of the model parameters and hidden states, the authors do the following.
First, the authors use a common strategy for initializing NMF, which is to cluster the data using K-means clustering, with K equal to the number of metagenes, and use the means of the clusters as an estimate of the metagenes.
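The K-means initialization can be sketched with a plain Lloyd's algorithm whose cluster means serve as initial metagene estimates. This is a minimal illustration, not the authors' code: the function name `kmeans`, the deterministic choice of the first K cells as starting centroids, and the toy expression values are assumptions.

```python
def kmeans(points, k, n_iter=50):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    # Deterministic start for illustration: the first k points as centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each cell goes to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels


# Toy data: 4 cells measured on 2 genes (values made up).
cells = [[5.0, 0.1], [0.2, 4.0], [4.8, 0.0], [0.0, 4.2]]
centroids, labels = kmeans(cells, k=2)
# Each centroid is one initial metagene estimate (one value per gene).
```

With K set to the number of metagenes, the centroids give a non-negative starting point whenever the expression values are non-negative, which suits an NMF-style model.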
This produces, in only a few quick iterations, an appropriate initial estimate for the algorithm, which will be subsequently refined.
The authors observed that if T0 is too large, it can cause the algorithm to prematurely reach a local minimum before spatial relationships are considered.
This value can be easily tuned by experimentation, and in their analysis, the authors found that just 5 iterations were necessary.

Empirical running time

The GPU is used only for roughly the first 5 iterations, when the spatial affinity matrix $\Sigma_x^{-1}$ changes significantly.
Later on, most of the time is spent solving quadratic programs.

Generation of simulated data

The authors generated simulated spatial transcriptomic data following expression and spatial patterns similar to cells in the mouse primary visual cortex.
The two inhibitory neuron types were scattered sparsely throughout several layers.
For excitatory neurons, the layer-specific metagene defined the subtype.
The authors generated the value for each gene for each metagene from the Gamma distribution with a scale parameter of 1.
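Drawing metagene values from a Gamma distribution with scale 1 can be sketched with the standard library. The shape parameter is not stated in this summary, so the value `shape=2.0` below is an assumption, as are the function name, the matrix dimensions, and the seed.

```python
import random


def simulate_metagenes(n_genes, n_metagenes, shape=2.0, scale=1.0, seed=0):
    """Return an n_genes x n_metagenes matrix of Gamma(shape, scale) draws."""
    rng = random.Random(seed)
    # random.gammavariate(alpha, beta) takes shape alpha and scale beta.
    return [[rng.gammavariate(shape, scale) for _ in range(n_metagenes)]
            for _ in range(n_genes)]


# Small illustrative matrix: 5 genes by 3 metagenes.
M = simulate_metagenes(n_genes=5, n_metagenes=3)
```

Gamma draws are strictly positive, so the simulated metagenes satisfy the non-negativity that an NMF-based model expects.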
Steps of data processing include: constructing the neighbor graph of cells, selection of hyperparameters for SPICEMIX, NMF, and HMRF, random seed selection, the choice of the number of metagenes, and the choice of the number of clusters for hierarchical clustering.

Figure caption excerpts

Cell-type labels shown in the figure panels: Oligo, SMC, Endo, Micro, VIP, eL2/3, eL4, SST, eL6, eL5a, eL5b, OPC, Astro, Astro/Oligo, Oligo-1, Oligo-2.
Note that colors of cells and labels throughout the figure correspond to the cell-type assignments of SPICEMIX.
Panel a (left) highlights that SPICEMIX further delineated inhibitory neurons into VIPs and SSTs, enclosed by the orange dashed circle, and delineated Oligos and OPCs into separate subtypes: Astro/Oligo, Oligo-1 (light ), Oligo-2, and OPC (red), enclosed by the red dashed circle.
The colored boxes following the name of each marker gene correspond to their known associated cell type.
Average expression of inferred metagenes within SPICEMIX cell types.

Figures (2)

Figure 1: Overview of SPICEMIX. Gene expression measurements and a neighbor graph are extracted from in situ single-cell spatial transcriptome data and fed into the SPICEMIX framework. SPICEMIX decomposes the expression $y_i$ in cell $i$ into a mixture of metagenes weighted by the hidden state $x_i$. Spatial interaction between neighboring cells $i$ and $j$ is modeled by an inner product of their hidden states, weighted by inferred spatial affinities between metagenes $\Sigma$. Collectively, the mixture weights for individual cells $X$, the metagene spatial affinity $\Sigma$, and $K$ metagenes $M$, all inferred by SPICEMIX, provide unique insight into the latent intrinsic and spatial factors of cell identity.

Figure 2: Overview of the simulated spatial transcriptome data of the mouse cortex, and performance comparison between SPICEMIX, NMF, and HMRF. a. Illustration of three major cell types distributed in four layers. In this depiction, excitatory and inhibitory neurons are star-shaped and glial cells are ovals. Subtypes are distinguished by their colors. b. Dendrogram showing the similarity of the expression profiles of the eight subtypes (top), their metagene profiles (middle), and their colors and shapes used in panel a. The top four rows correspond to metagenes that determine major type, the next six rows correspond to metagenes that determine subtypes or are layer-specific, and the bottom three rows correspond to noise metagenes. c. Simulated expression of metagenes 2 and 8, from a single sample, in their spatial context (top) and the estimated expression of those metagenes by SPICEMIX (middle) and NMF (bottom). Visualizations in e and f are of the same sample. d. Box plots of the adjusted Rand index (ARI) that measures the quality of the matching between the identified cell types for each method and the true simulated cell types. The optimal number of cell types for NMF was determined by the Calinski-Harabasz index (‘NMF’ in the legend) or by maximizing the ARI score (‘NMF*’ in the legend). Results are reported across four simulation scenarios with varying noise levels. e. Assignments of excitatory neurons for each method in their spatial context. Colors were assigned to cells by the closest matching simulated type. Cells assigned to the incorrect cell types have bright colors. Cells assigned to the correct cell types have faint colors. Cells in orange belong to a cell type that does not match any simulated cell type. f. UMAP plots of raw expression values of cells (left) and the learned latent states of SPICEMIX (right). Colors match those of the spatial maps.

SPICEMIX: Integrative single-cell spatial modeling for inferring cell identity

Benjamin Chidester¹,#, Tianming Zhou¹,#, and Jian Ma¹,*

¹Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA

#These two authors contributed equally.

*Correspondence: jianma@cs.cmu.edu

Abstract

Spatial transcriptomics technologies promise to reveal spatial relationships of cell-type composition in complex tissues. However, the development of computational methods that capture the unique properties of single-cell spatial transcriptome data to unveil cell identities remains a challenge. Here, we report SPICEMIX, a new method based on probabilistic, latent variable modeling that enables effective joint analysis of spatial information and gene expression of single cells from spatial transcriptome data. Both simulation and real data evaluations demonstrate that SPICEMIX markedly improves upon the inference of cell types compared with existing approaches. Applications of SPICEMIX to single-cell spatial transcriptome data of the mouse primary visual cortex acquired by seqFISH+ and STARmap show that SPICEMIX can enhance the inference of cell identities and uncover potentially new cell subtypes with important biological processes. SPICEMIX is a generalizable framework for analyzing spatial transcriptome data to provide critical insights into the cell-type composition and spatial organization of cells in complex tissues.

bioRxiv preprint doi: https://doi.org/10.1101/2020.11.29.383067; this version posted March 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Introduction

The compositions of different cell types in various human tissues remain poorly understood due to the complex interplay among intrinsic, spatial, and temporal factors that collectively contribute to cell identity [1–3]. Single-cell RNA-seq (scRNA-seq) has greatly advanced our understanding of complex cell types in different tissues [4–6], but its utility in disentangling spatial factors in particular is inherently limited by the dissociation of cells from their spatial context. To address this limitation, new spatial transcriptomics technologies based on multiplexed imaging and sequencing [7–17] are able to reveal spatial information of gene expression of dozens to tens of thousands of genes in individual cells in situ within the tissue context.

However, the development of computational methods that capture the unique properties of the spatially resolved single-cell transcriptome data to unveil single-cell identities remains a challenge [18]. Zhu et al. [19] previously proposed the use of a hidden Markov random field (HMRF) to model spatial domains after distinguishing spatial and intrinsic genes (based on scRNA-seq). The major drawback of the method of [19] is that it cannot learn contributions of spatial and intrinsic factors to gene expression directly from spatial transcriptome data. In addition, the model relies on the assumptions that spatial subtypes are discrete and exhibit homogeneous spatial patterns, which prohibits it from learning the underlying mixture of diverse factors of cell identity with varied spatial patterns (e.g., distinct layer-like structures or diffuse patterns). Several other methods have been developed to study the relationship of known cell types in local neighborhoods [20], to explore the spatial variance of genes [21–24], and to align scRNA-seq with spatial transcriptome data [25–27]. But no existing method seeks to jointly model spatial patterns of the cells and their expression profiles to reveal cell identity, which is of vital importance to fully utilize spatial transcriptome data.

Here, we report SPICEMIX (Spatial Identification of Cells using Matrix Factorization), a new integrative framework to model spatial transcriptome data. SPICEMIX uses latent variable modeling to express the interplay of spatial and intrinsic factors that comprise cell identity. Crucially, SPICEMIX enhances the non-negative matrix factorization (NMF) [28] of gene expression with a novel integration with the graphical representation of the spatial relationship of cells. Thus, the learned spatial patterns can elucidate the relationship of intrinsic and spatial factors, leading to much more meaningful representations of cell identity. Application to the spatial transcriptome data of the mouse primary visual cortex acquired by seqFISH+ [12] and STARmap [13] demonstrated that the latent representations learned by SPICEMIX can refine the identification of cell types, uncover subtypes missed by other approaches, and reveal important biological processes. SPICEMIX has the potential to provide critical new insights into the cell composition based on spatial transcriptome data.

Results

Overview of SPICEMIX

SPICEMIX models the cell-to-cell relationships of the spatial transcriptome by a new probabilistic graphical model formulation, the NMF-HMRF (Fig. 1). The input of the model consists of gene expression measurements and spatial coordinates of cells from spatial transcriptome data (e.g., seqFISH+ [12] and STARmap [13]). From the spatial coordinates, an undirected graph is constructed to capture pairwise spatial relationships, where each cell is a node in the graph. For each node, a latent state vector explains the observed gene expression of the cell. More critically, unique to the NMF-HMRF model is the integration of NMF into the HMRF [29] to represent the observations as mixtures of latent factors, modeled by metagenes, where the proportions are the hidden states of the graph. In contrast, in a standard HMRF, hidden states are assumed to be discrete, thus restricting the expressiveness of the model.

In the NMF-HMRF model of SPICEMIX, the potential functions of the graph capture the probabilistic relationships between variables in the model. The potential functions for observations capture the likelihood of the observation given the hidden state of the cell. The potential functions for edges capture the spatial affinity between the metagene proportions of neighboring cells. In a standard HMRF, it is assumed that neighboring nodes will have similar hidden states, resulting in a spatial smoothing effect that is inadequate to describe the heterogeneous spatial patterns of the cells. However, in the formulation of SPICEMIX, we do not assume such a relationship a priori, but rather allow the method to learn spatial affinities from the spatial transcriptome data. Crucially, SPICEMIX learns the parameters of the model that best explain the input spatial transcriptome data, while simultaneously learning the underlying metagenes and their proportions that define the identities of the cells. This is achieved by a new optimization algorithm that alternates between maximizing the joint posterior distribution of the parameters in the model and maximizing the posterior distribution of the metagenes in the matrix factorization. The learned parameters, metagenes, and proportions provide biological insights into the latent representation. See Methods for the detailed description of the SPICEMIX model.

Evaluation of SPICEMIX on simulated spatial transcriptome data

We first evaluated SPICEMIX on simulated data that we designed to model the mouse cortex, which has served as a prominent case study for several spatial transcriptomic methods, including seqFISH+ [12] and STARmap [13] (Fig. 2a-b; see Methods for detailed simulation strategy). This region of the brain consists of cell types that exhibit strong, layer-wise patterns of expression as well as cell types that sparsely populate the entire tissue. The goal of the evaluation was to infer the latent metagenes describing gene expression and to reveal the underlying simulated cell types. We compared the inference of SPICEMIX to that of NMF and HMRF, since they are the fundamental underlying models of many relevant computational methods. This comparison also aimed to demonstrate the advantage of the integration of these two models in SPICEMIX, rather than using either alone. We assessed performance by quantitatively comparing the cell types learned from each method with the simulated true cell types, using the adjusted Rand index (ARI). For SPICEMIX and NMF, we applied additional hierarchical clustering to the learned latent representation to group cells into clusters. The number of clusters was determined objectively by maximizing the Calinski-Harabasz (CH) index [30]. The strategy for choosing other hyperparameters for SPICEMIX and NMF is described in Methods. The number of clusters, or discrete states, for HMRF was chosen automatically during operation, given an upper bound, and the smoothing parameter was chosen manually to maximize the ARI, representing its best-case performance. We devised four simulation scenarios for evaluation, which varied the randomness of the data in terms of both the noise variance and the variance of the true hidden states (see Methods).
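For readers unfamiliar with the adjusted Rand index used in this evaluation, the following minimal sketch computes it from two label lists using the standard pair-counting formula. The function name and the toy labels are illustrative; this is not the evaluation code used in the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two clusterings of the same cells."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs of cells grouped together in both, in truth only, in prediction only.
    sum_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every cell in one cluster in both
    return (sum_both - expected) / (max_index - expected)


# Toy example: the prediction splits one true cluster in two.
ari = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

The index is 1.0 for identical partitions regardless of how the cluster labels are named, and it is corrected for chance agreement, which makes it suitable for comparing methods that return different numbers of clusters.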

We found that SPICEMIX consistently produced the best ARI score (0.6-0.8 on average; the maximum value being 1.0) across all scenarios (Fig. 2d). In contrast, NMF achieved an ARI between 0.2 and 0.4 on average, a reduction by more than 50%. As expected, as the variance of the expression values or hidden states increased, the performance of all methods decreased (Fig. 2d). To ensure that the CH index was not favorably biased towards SPICEMIX, we also evaluated NMF when the number of clusters was instead chosen to maximize the ARI directly rather than according to the CH index (denoted as “NMF*” in Fig. 2d). The resulting ARI ranged from 0.5 to 0.65. Thus, the best-case scenario for NMF was significantly worse than the performance of SPICEMIX. In addition, HMRF achieved a far lower ARI, between 0.1 and 0.2 on average. Looking closer at an example simulated sample reveals that the superior cell type inference of SPICEMIX was due to its successful recovery of both layer-specific and sparse spatial patterns of metagenes (Fig. 2c; metagene 8 shows layer-specific localization whereas metagene 2 has a more diffuse pattern). The precise recovery of these metagenes led to a much clearer separation of the simulated cell types in the learned latent space of SPICEMIX (Fig. 2f). Notably, this resulted in a clear and accurate delineation of the layer-specific excitatory neurons in the sample (Fig. 2e). We found that, in contrast, the metagenes learned by NMF lacked spatial coherence (Fig. 2c). Consequently, NMF often failed to reveal the excitatory neurons according to their layer-specific enrichment (Fig. 2e). Also, in contrast to both SPICEMIX and NMF, HMRF smoothed over sparse cell types and yet still failed to detect clear layer-wise boundaries (Fig. 2e), despite having optimized the smoothing parameter. Specifically, the spatial patterns of the boundaries between HMRF clusters are not consistent with the ground truth (dashed vertical lines in Fig. 2e), especially in layer L4, where green, yellow, and blue cell types show an interleaving pattern. This same phenomenon was also manifested in our real data application (see later sections and Fig. S4).

Taken together, we showed that the novel integration of matrix factorization and spatial modeling in SPICEMIX yields superior inference of underlying cell identities across a variety of settings, compared to either NMF or HMRF alone. This improvement was seen for cell types with either sparse or layer-specific spatial patterns, both of which are prevalent in real data from complex tissues (e.g., the mouse cortex data used in this work). In addition, our evaluation also confirmed the effectiveness and robustness of our new optimization scheme for fitting the SPICEMIX model to spatial transcriptome data.

SPICEMIX reﬁnes cell identity inference from seqFISH+ data

We applied our method to the data acquired by seqFISH+ [12]. Specifically, we sought a robust model of the spatial variation of gene expression using SPICEMIX that would reveal both intrinsic factors of expression as well as spatial patterns, thereby unveiling cell identities more accurately. Here, we used the data of five separate samples of the mouse primary visual cortex, all from the same mouse but from contiguous layers, each from a distinct image or field-of-view (FOV), with single-cell expression of 2,470 genes in 523 cells [12]. We compared the cell identities revealed by SPICEMIX to those of NMF and Eng et al. [12].

The clustering of the learned latent representation of SPICEMIX revealed five excitatory neural subtypes, two inhibitory neural subtypes, and eight glial subtypes (Fig. 3a), supported by scRNA-seq marker genes [31] (Fig. 3b (left)). Although the assignment of major types was consistent between SPICEMIX, NMF, and [12] (Fig. 3b (middle) and Fig. S1), SPICEMIX refined and expanded the identification of cell subtypes (Fig. 3b (middle)). In particular, the identification of layer-specific excitatory neurons by SPICEMIX had a high correspondence with their associated layer (Fig. 3c), whereas several excitatory clusters from the original analysis in [12] were incorrectly dispersed across as many as three layers (see Fig. 3h in [12]). Furthermore, SPICEMIX correctly distinguished eL5b and eL6 neurons, which were mixed together in several clusters in [12] (Fig. 3b (middle)). The expression of marker genes Col6a1 and Ctgf [31] confirmed the identity of these cells (Fig. 3b (left)).

Beyond mere discrete cell type assignments, the metagenes and spatial affinities learned by SPICEMIX provided new insight into the underlying factors of glial cell states. The metagenes of SPICEMIX tend to capture either expression patterns of specific cell types, expressed at high levels, or patterns shared across cell types, expressed at lower levels (Fig. 3b (right)). Notably, as annotated in Fig. 3b (right), metagene 7 is expressed at a high proportion among oligodendrocytes, distinguishing them from OPCs, while the expression of metagene 8, which is also present in OPCs, distinguished the rare Oligo-2 type from Oligo-1. This separation is confirmed by the expression patterns of the OPC marker gene Cspg4, the differentiating Oligo marker gene Tcf7l2 [32], and the mature Oligo marker gene Mog [33] (Fig. 3b (left)). Furthermore, the expression of the latter two marker genes supports the hypothesis that the Oligo-2 cells of SPICEMIX are likely in an intermediate transition during maturation from OPCs to oligodendrocytes, corresponding to the proportions of metagenes 7 and 8, rather than constituting a discrete cell type. Also, the learned metagene spatial affinities reveal that metagene 7 has a strong affinity for metagenes 3 and 4 (highlighted by black arrows in Fig. 3d (right)), which are expressed primarily by the excitatory neurons of deeper tissue layers (eL5a, eL5b, and eL6) (Fig. 3b (right)). Thus, the spatial affinity of this oligodendrocyte-specific metagene 7 led to the separation of the Oligo-1 cells from OPCs, which, in contrast, do not have a strong affinity with any particular excitatory neuron type (Fig. 3d). Without spatial information to help decompose the highly similar expression profiles of these cell types, both NMF and Eng et al. [12] failed to distinguish these cells from other oligodendrocytes or OPCs (Fig. S1 and Fig. 3b (middle), respectively). Lastly, SPICEMIX revealed an additional separation of a cluster of [12] into SMC and Endo cells, which can be confirmed by the expression of their respective marker genes (i.e., Bgn highly expressed in SMC but not Endo cells, and Flt1 highly expressed in both SMC and Endo cells [31]) (Fig. 3b).

Together, by analyzing the seqFISH+ data with SPICEMIX, we identified cell subtypes of the mouse cortex whose spatial distributions are more consistent with prior experiments. We also delineated rarer subtypes that were not distinguished by other methods. This analysis strongly demonstrates the advantages and unique capabilities of SPICEMIX.

SPICEMIX reveals spatially-enriched cell types and subtypes from STARmap data

Next, we applied SPICEMIX to a single-cell spatial transcriptome dataset of the mouse cortex acquired by STARmap [13]. As in the analysis of the seqFISH+ dataset, the learned latent representation of cell identity of SPICEMIX provided a better characterization of cell subtypes and offered additional insight into their underlying factors. We analyzed a single sample consisting of 930 cells passing quality control, all from a single image or FOV, with expression measurements for 1020 genes. To distinguish cell-type labels between methods, we append an asterisk to the end of the cell labels of Wang et al. [13] when referenced.

We found that SPICEMIX produced more accurate cell labels than [13] and revealed subtypes missed both in [13] and by NMF (Fig. 4, Fig. S2). In comparison to NMF, SPICEMIX uncovered the following additional subtypes: SST inhibitory neuron, Oligo, Astro/Oligo, and two eL6 subtypes (Fig. 4a, b (left); supported by known marker genes [13, 31]). In comparison to the clusters from [13], SPICEMIX refined the assignment of excitatory neurons and further delineated the Oligo type into three subtypes: Oligo-1, Oligo-2, and Astro/Oligo (Fig. 4b (middle)). Specifically, SPICEMIX was able to learn the layer-like structure of excitatory neurons in tissue (Fig. 4c), thereby improving upon the assignments reported in Fig. 5d in [13], which erroneously mixed several neuron subtypes across layer boundaries. We noted that ≥15 eL2/3* or eL4* cells of [13] in fact resided not in layers L2-L4 but in layers L5 and L6 (black ‘×’ in the middle panel in Fig. 4c) and ≥15 eL5* neurons of [13] resided outside of layer L5 (black dots in the bottom panel in Fig. 4c), which is not consistent with the spatial association of those neurons. The refinement by SPICEMIX is especially notable in the reassignment of 36 cells in excitatory

Frequently Asked Questions (16)

Q1. What contributions have the authors mentioned in the paper "Spicemix: integrative single-cell spatial modeling for inferring cell identity" ?

Here, the authors report SPICEMIX, a new method based on probabilistic, latent variable modeling that enables effective joint analysis of spatial information and gene expression of single cells from spatial transcriptome data. Applications of SPICEMIX to single-cell spatial transcriptome data of the mouse primary visual cortex acquired by seqFISH+ and STARmap show that SPICEMIX can enhance the inference of cell identities and uncover potentially new cell subtypes with important biological processes.

Q2. What future work is mentioned in the paper "SpiceMix: Integrative single-cell spatial modeling for inferring cell identity"?

As future work, SPICEMIX could be further enhanced by incorporating additional modalities such as scRNA-seq data. This additional data may improve the inference of the latent variables and parameters of the model, which could further improve the modeling of cellular heterogeneity. In addition, further enhancements could be made to the probabilistic model of SPICEMIX, including additional priors, such as sparsity, to tailor it toward particular application contexts. The refined cell identities from SPICEMIX also have the potential to improve future studies of cell-cell interactions [37].

Q3. What are the primary categories of cells in the mouse cortex?

Cells in the mouse cortex are classified into three primary categories: inhibitory neurons, excitatory neurons, and non-neurons or glial cells [31, 44].

Q4. How long does SPICEMIX take to run?

SPICEMIX takes 0.5-2 hours to run on a spatial transcriptome dataset with 2,000 genes and 1,000 cells on a machine with eight 3.6 GHz CPUs and one GeForce 1080 Ti GPU.

Q5. What is the main drawback of scRNA-seq?

Single-cell RNA-seq (scRNA-seq) has greatly advanced our understanding of complex cell types in different tissues [4–6], but its utility in disentangling spatial factors in particular is inherently limited by the dissociation of cells from their spatial context.

Q6. Why do the compositions of different cell types in various human tissues remain poorly understood?

The compositions of different cell types in various human tissues remain poorly understood due to the complex interplay among intrinsic, spatial, and temporal factors that collectively contribute to cell identity [1–3].

Q7. Was SPICEMIX able to recover sparsely expressed metagenes?

Sparsely expressed metagenes, such as metagene 8, which led to the identification of PVALB inhibitory neurons, were also successfully recovered by SPICEMIX.

Q8. How does SPICEMIX obtain its initial estimate?

The algorithm uses a few iterations of NMF to provide an initial estimate of the metagenes and latent states. Because this is a reasonable starting point, a good initial estimate is expected to be found efficiently.
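The idea of running only a few NMF iterations as a warm start can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an expression matrix `Y` of shape genes × cells factored as `M @ X`, and it uses the standard Lee-Seung multiplicative updates, which the source does not specify. The function name `nmf_init` and the parameters `n_iter`, `seed`, and `eps` are hypothetical.

```python
import numpy as np

def nmf_init(Y, K, n_iter=5, seed=0, eps=1e-12):
    """Run a few NMF iterations so that Y (G x N) is roughly M (G x K) @ X (K x N).

    Assumption: Lee-Seung multiplicative updates for the Frobenius objective.
    The source only says "a few iterations of NMF"; the exact solver is not given.
    """
    rng = np.random.default_rng(seed)
    G, N = Y.shape
    M = rng.random((G, K)) + eps  # non-negative random start for metagenes
    X = rng.random((K, N)) + eps  # non-negative random start for latent states
    for _ in range(n_iter):
        # Each update keeps entries non-negative and does not increase the error.
        X *= (M.T @ Y) / (M.T @ M @ X + eps)
        M *= (Y @ X.T) / (M @ X @ X.T + eps)
    return M, X
```

A small `n_iter` keeps the warm start cheap; the result is only meant as a starting point for the full model, not as a converged factorization.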

Q9. What is shown for the metagenes in oligodendrocytes?

The expression of important marker genes for myelin-sheath formation in oligodendrocytes is plotted against the relative expression of metagenes 12 and 13 of the same cells.

Q10. What is the significance of the asterisk after the p-value?

An asterisk after the p-value means that the result is significant under the threshold of 0.05 (see Supplementary Methods B.1 for details).

Q11. How is the scaling ambiguity between M and X resolved?

To resolve the scaling ambiguity between M and X, the authors constrain the columns of M to sum to one, so as to lie in the (G − 1)-dimensional simplex, $S_{G-1}$.
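To see why this constraint removes the ambiguity, note that scaling a column of M by a constant and dividing the matching row of X by the same constant leaves the product unchanged. The sketch below shows one way to move that scale into X so each column of M sums to one. It assumes M has shape genes × metagenes with strictly positive column sums and X has shape metagenes × cells; the function name `normalize_metagenes` is hypothetical.

```python
import numpy as np

def normalize_metagenes(M, X):
    """Rescale so each column of M sums to one while leaving M @ X unchanged.

    Assumption: M is (G x K) with strictly positive column sums, X is (K x N).
    """
    col_sums = M.sum(axis=0)          # shape (K,), one scale per metagene
    M_norm = M / col_sums             # each column now sums to one
    X_scaled = X * col_sums[:, None]  # absorb the scale into the matching row of X
    return M_norm, X_scaled
```

After this step the columns of M are comparable across metagenes, and all information about magnitude lives in X.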

Q12. What is the label of the cell type in the legend?

The labels in the legend are the SPICEMIX cell type, followed by a dash, followed by the cell type of [13], denoted by an asterisk.

Q13. How are the metagene proportions of an individual cell defined?

Given the class-specific metagene proportions, which the authors denote by the K-dimensional vector $b_c$ for cell type $c$, the proportions for an individual cell $i$ are given by

$$v_i = \frac{\tilde{v}_i}{\sum_k \tilde{v}_{i,k}}, \qquad \tilde{v}_i = b_c + \eta_i,$$

where $\eta_i \sim \mathcal{N}(0, \sigma_x \Sigma_c)$ is a K-dimensional Gaussian random variable that controls the cell-to-cell variation of metagene proportion.
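The sampling rule in this answer (add Gaussian noise to the class-specific vector, then normalize to sum to one) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' simulation code: the function name and arguments are hypothetical, and the clipping step is an added safeguard that is not described in the source.

```python
import numpy as np

def sample_metagene_proportions(b_c, Sigma_c, sigma_x, n_cells, seed=0):
    """Draw per-cell metagene proportions for one cell type.

    b_c: (K,) class-specific proportions; Sigma_c: (K, K) covariance;
    sigma_x: scalar noise scale. Returns an (n_cells, K) array of proportions.
    """
    rng = np.random.default_rng(seed)
    K = b_c.shape[0]
    # eta_i ~ N(0, sigma_x * Sigma_c), one row per cell
    eta = rng.multivariate_normal(np.zeros(K), sigma_x * Sigma_c, size=n_cells)
    v_tilde = b_c + eta
    # Assumption (not in the source): clip to a small positive value so that
    # the normalization below always yields valid proportions.
    v_tilde = np.clip(v_tilde, 1e-12, None)
    return v_tilde / v_tilde.sum(axis=1, keepdims=True)
```

With a small `sigma_x`, cells of the same type stay close to `b_c`; larger values spread them out across the simplex.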

Q14. What did the authors observe when T0 is too large?

The authors observed that if T0 is too large, it can cause the algorithm to prematurely reach a local minimum before spatial relationships are considered.

Q15. What enhancements could be made to the probabilistic model of SPICEMIX?

Further enhancements could be made to the probabilistic model of SPICEMIX, including additional priors, such as sparsity, to tailor it toward particular application contexts.

Q16. What is the significance of the correlations between the genes?

The authors found that the correlations of seven of the eleven genes were significant (p < 0.05, after a two-step FDR correction for multiple testing) (Fig. 4f and Fig. S6b), supporting their hypothesis.
