SpiceMix: Integrative single-cell spatial modeling for inferring cell identity
Summary (3 min read)
Introduction
- The compositions of different cell types in various human tissues remain poorly understood due to the complex interplay among intrinsic, spatial, and temporal factors that collectively contribute to cell identity [1–3].
- Single-cell RNA-seq (scRNA-seq) has greatly advanced the understanding of complex cell types in different tissues [4–6], but its utility in disentangling spatial factors in particular is inherently limited by the dissociation of cells from their spatial context.
- In addition, the model relies on the assumptions that spatial subtypes are discrete and exhibit homogeneous spatial patterns, which prohibits it from learning the underlying mixture of diverse factors of cell identity with varied spatial patterns (e.g., distinct layer-like structures or diffuse patterns).
- Here, the authors report SPICEMIX (Spatial Identification of Cells using Matrix Factorization), a new integrative framework to model spatial transcriptome data.
- SPICEMIX has the potential to provide critical new insights into the cell composition based on spatial transcriptome data.
Overview of SPICEMIX
- SPICEMIX models the cell-to-cell relationships of the spatial transcriptome by a new probabilistic graphical model formulation, the NMF-HMRF (Fig. 1).
- Crucially, SPICEMIX learns the parameters of the model that best explain the input spatial transcriptome data, while simultaneously learning the underlying metagenes and their proportions that define the identities of the cells.
- The authors compared the inference of SPICEMIX to that of NMF and HMRF, since they are the fundamental underlying models of many relevant computational methods.
- In particular, the identification of layer-specific excitatory neurons by SPICEMIX had a high correspondence with their associated layer (Fig. 3c), whereas several excitatory clusters from the original analysis in [12] were incorrectly dispersed across as many as three layers (see Fig. 3h in [12]).
- Notably, as annotated in Fig. 3b, metagene 7 is expressed at a high proportion among oligodendrocytes, distinguishing them from OPCs, while the expression of metagene 8, which is also present in OPCs, distinguishes the rare Oligo-2 type from Oligo-1.
Discussion
- The authors developed SPICEMIX, an unsupervised method for modeling the diverse factors that collectively contribute to cell identity based on single-cell spatial transcriptome data.
- Additional data, such as scRNA-seq, may improve the inference of the latent variables and parameters of the model, which could further improve the modeling of cellular heterogeneity.
- In addition, further enhancements could be made to the probabilistic model of SPICEMIX, including additional priors such as sparsity, to tailor it toward particular application contexts.
- As the area of spatial transcriptomics continues to thrive and data become more widely available, SPICEMIX will be a uniquely useful tool for enabling new discoveries.
Graphical model formulation
- The authors' formulation of the NMF-HMRF in SPICEMIX enhances standard NMF by modeling the spatial correlations among samples (i.e., cells in this context) via the HMRF [29].
- Any graph construction method for determining edges, such as distance thresholding or Delaunay triangulation, can be used.
- The observations are related to the hidden variables via the potential function φ, which captures the NMF formulation.
- $U_x$ measures the inner product between the metagene proportions of neighboring cells $i$ and $j$, weighted by a learned pairwise correlation matrix $\Sigma_x^{-1}$, which captures the spatial affinity of metagenes.
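The graph construction and the pairwise spatial term described above can be illustrated with a minimal sketch. It assumes distance thresholding for the edges and sums the weighted inner product over all edges; the function names, the toy coordinates, the values of `sigma_inv`, and the sign convention are all hypothetical choices for illustration, not the authors' implementation.

```python
import math

def threshold_graph(coords, radius):
    """Return undirected edges (i, j), i < j, for cells closer than `radius`."""
    edges = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= radius:
                edges.append((i, j))
    return edges

def spatial_potential(X, edges, sigma_inv):
    """Sum x_i^T * sigma_inv * x_j over all edges (i, j) of the graph."""
    total = 0.0
    for i, j in edges:
        for a, xa in enumerate(X[i]):
            for b, xb in enumerate(X[j]):
                total += xa * sigma_inv[a][b] * xb
    return total

# Toy example: three cells, two metagenes (all values are made up).
coords = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
X = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]      # metagene proportions per cell
sigma_inv = [[-1.0, 0.5], [0.5, -1.0]]        # hypothetical spatial affinity
edges = threshold_graph(coords, radius=1.5)   # only cells 0 and 1 are neighbors
potential = spatial_potential(X, edges, sigma_inv)
```

Swapping `threshold_graph` for a Delaunay triangulation would change only the edge list; the pairwise term is computed the same way.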
Parameter priors
- This prior can be viewed as a regularization that controls the importance of the spatial relationships during inference.
- Alternating estimation of hidden states and model parameters.
- To infer the hidden states and model parameters of the NMF-HMRF model in SPICEMIX, the authors optimize the data likelihood via coordinate ascent, alternating between optimizing hidden states and model parameters.
$$\hat{X}, \hat{\Theta} = \arg\max_{X,\,\Theta}\; P(Y, X \mid \Theta)\, P(\Theta)$$
- The authors note that they can estimate metagenes, spatial affinity, and the noise level independently.
- The MAP estimation problem for $\Sigma_x^{-1}$ is convex and is solved with the Adam optimizer [43].
- See Supplementary Methods A.2 for details of the optimization method.
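The coordinate ascent described above can be written as two alternating updates of the objective $P(Y, X \mid \Theta)\, P(\Theta)$. The iteration superscript $(t)$ is notation assumed here for illustration; the summary does not give the authors' exact update equations.

```latex
\begin{aligned}
X^{(t+1)} &= \arg\max_{X}\; P\big(Y, X \mid \Theta^{(t)}\big), \\
\Theta^{(t+1)} &= \arg\max_{\Theta}\; P\big(Y, X^{(t+1)} \mid \Theta\big)\, P(\Theta).
\end{aligned}
```

Each step holds one block of variables fixed, so neither update can decrease the objective when solved exactly.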
Initialization
- To produce initial estimates of the model parameters and hidden states, the authors do the following.
- First, the authors use a common strategy for initializing NMF, which is to cluster the data using K-means clustering, with K equal to the number of metagenes, and use the means of the clusters as an estimate of the metagenes.
- This produces, in only a few quick iterations, an appropriate initial estimate for the algorithm, which will be subsequently refined.
- The authors observed that if T0 is too large, it can cause the algorithm to prematurely reach a local minimum before spatial relationships are considered.
- This value can be easily tuned by experimentation, and in their analysis, the authors found that just 5 iterations were sufficient.
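The K-means step of the initialization above can be sketched as follows. This is a plain Lloyd's-algorithm illustration with hypothetical names and toy data; it covers only the clustering step (not the subsequent NMF iterations), and it is not the authors' code.

```python
import random

def kmeans_metagenes(Y, K, n_iter=20, seed=0):
    """Cluster cells (rows of Y) with K-means and return the K cluster means,
    which serve as an initial estimate of the metagenes."""
    rng = random.Random(seed)
    centers = [list(row) for row in rng.sample(Y, K)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(K)]
        for row in Y:
            # Assign each cell to the nearest center (squared Euclidean distance).
            dists = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centers]
            clusters[dists.index(min(dists))].append(row)
        for k, members in enumerate(clusters):
            if members:  # keep the previous center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers

# Toy expression matrix: 4 cells x 2 genes, with two obvious groups.
Y = [[5.0, 0.1], [4.8, 0.0], [0.2, 3.9], [0.0, 4.1]]
metagenes = kmeans_metagenes(Y, K=2)
```

Because cluster means of non-negative expression values are themselves non-negative, they are a valid starting point for a non-negative factorization.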
Empirical running time
- The GPU is used only for roughly the first 5 iterations, during which the spatial affinity matrix $\Sigma_x^{-1}$ changes significantly.
- Later on, most of the time is spent solving quadratic programs.
Generation of simulated data
- The authors generated simulated spatial transcriptomic data following expression and spatial patterns similar to cells in the mouse primary visual cortex.
- The two inhibitory neuron types were scattered sparsely throughout several layers.
- For excitatory neurons, the layer-specific metagene defined the subtype.
- The authors generated the value for each gene for each metagene from the Gamma distribution with a scale parameter of 1.
- Steps of data processing include: constructing the neighbor graph of cells, selection of hyperparameters for SPICEMIX, NMF, and HMRF, random seed selection, the choice of the number of metagenes, and the choice of the number of clusters for hierarchical clustering.
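The Gamma sampling of metagene values mentioned above can be sketched with the standard library. The summary states only the scale parameter (1); the shape value, matrix dimensions, and function name below are assumptions made for illustration.

```python
import random

def simulate_metagenes(n_genes, n_metagenes, shape=2.0, seed=0):
    """Draw a genes x metagenes matrix with i.i.d. Gamma(shape, scale=1) entries.

    The shape parameter is an assumed placeholder; only the scale of 1 comes
    from the text.
    """
    rng = random.Random(seed)
    # random.gammavariate(alpha, beta) uses alpha as shape and beta as scale.
    return [[rng.gammavariate(shape, 1.0) for _ in range(n_metagenes)]
            for _ in range(n_genes)]

M = simulate_metagenes(n_genes=4, n_metagenes=3)
```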
- Figure legend (cell-type labels): VIP, SST, eL2/3, eL4, eL5a, eL5b, eL6, Astro, Astro/Oligo, Oligo, Oligo-1, Oligo-2, OPC, Micro, SMC, Endo.
- Note that colors throughout the figure of cells and labels correspond to the cell-type assignments of SPICEMIX.
- Panel a (left) highlights that SPICEMIX further delineated inhibitory neurons into VIPs and SSTs (enclosed by the orange dashed circle), and delineated Oligos and OPCs into separate subtypes: Astro/Oligo, Oligo-1 (light ), Oligo-2, and OPC (red), enclosed by the red dashed circle.
- The colored boxes following the name of each marker gene correspond to their known associated cell type.
- Average expression of inferred metagenes within SPICEMIX cell types.
References
- Adam [43], cited for methods: "...The MAP estimate of $\Sigma_x^{-1}$ is convex and is solved by the optimizer Adam [43]...."
- Murphy (2012), cited for methods: "...First, to make inference tractable, we approximate the joint probability of the hidden states by the pseudo-likelihood (Murphy, 2012), which is the product of conditional probabilities of the hidden state of individual nodes given that of their neighbors,..."
- Murphy (2012), cited for methods: "...By the Hammersley-Clifford theorem (Murphy, 2012), the likelihood of the data for the pairwise HMRF can be formulated as the product of pairwise dependencies between nodes,..."
- Murphy (2012), cited for methods: "...12 is an approximation by the mean-field assumption (Murphy, 2012), which is used, in addition to the pseudo-likelihood assumption, to make the inference of model parameters tractable...."
- Lee and Seung (2001), cited for background: "...It builds upon non-negative matrix factorization (NMF) (Lee and Seung, 2001), which has become a popular paradigm for latent variable modeling of gene expression (Brunet et al...."
Frequently Asked Questions (16)
Q2. What future work is mentioned in the paper "SpiceMix: Integrative single-cell spatial modeling for inferring cell identity"?
As future work, SPICEMIX could be further enhanced by incorporating additional modalities such as scRNA-seq data. This additional data may improve the inference of the latent variables and parameters of the model, which could further improve the modeling of cellular heterogeneity. In addition, further enhancements could be made to the probabilistic model of SPICEMIX, including additional priors such as sparsity, to tailor it toward particular application contexts. The refined cell identity from SPICEMIX also has the potential to improve future studies of cell-cell interactions [37].
Q3. What are the primary categories of neurons in the mouse cortex?
Cells in the mouse cortex are classified into three primary categories: inhibitory neurons, excitatory neurons, and non-neurons or glial cells [31, 44].
Q4. How long does SPICEMIX take to run?
SPICEMIX takes 0.5-2 hours to run on a spatial transcriptome dataset with 2,000 genes and 1,000 cells on a machine with eight 3.6 GHz CPUs and one GeForce 1080 Ti GPU.
Q5. What is the main drawback of scRNA-seq?
Single-cell RNA-seq (scRNA-seq) has greatly advanced the understanding of complex cell types in different tissues [4–6], but its utility in disentangling spatial factors in particular is inherently limited by the dissociation of cells from their spatial context.
Q6. What is the main reason why the compositions of different cell types in various human tissues remain poorly understood?
The compositions of different cell types in various human tissues remain poorly understood due to the complex interplay among intrinsic, spatial, and temporal factors that collectively contribute to cell identity [1–3].
Q7. Which sparsely expressed metagenes did SPICEMIX recover?
Sparsely expressed metagenes, such as metagene 8, which led to the identification of PVALB inhibitory neurons, were also successfully recovered by SPICEMIX.
Q8. How many iterations of NMF are needed to find an initial estimate?
Since the algorithm uses a few iterations of NMF to provide an initial estimate, which is a reasonable starting point, it is expected to find a good initial estimate of metagenes and latent states efficiently.
Q9. What is the expression of metagenes in oligodendrocytes?
The expression of important marker genes for myelin-sheath formation in oligodendrocytes plotted against the relative expression of metagenes 12 and 13 of the same cells.
Q10. What is the significance of the asterisk after the p-value?
An asterisk after the p-value means that the result is significant under the threshold of 0.05 (see Supplementary Methods B.1 for details).
Q11. What is the simplest way to solve the scaling ambiguity between M and X?
To resolve the scaling ambiguity between $M$ and $X$, the authors constrain the columns of $M$ to sum to one, so as to lie in the $(G-1)$-dimensional simplex, $S_{G-1}$.
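A small sketch can show why this constraint removes the ambiguity without changing the fit. It assumes the factorization has the form of a genes-by-metagenes matrix M times a metagenes-by-cells matrix X; the function names and toy numbers are hypothetical.

```python
def normalize_metagenes(M, X):
    """Scale each column of M to sum to one and rescale the matching row of X,
    so that the product M @ X is unchanged."""
    K = len(M[0])
    col_sums = [sum(row[k] for row in M) for k in range(K)]
    M_norm = [[row[k] / col_sums[k] for k in range(K)] for row in M]
    X_scaled = [[value * col_sums[k] for value in X[k]] for k in range(K)]
    return M_norm, X_scaled

def matmul(A, B):
    """Plain-Python matrix product, used here only to check the result."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

M = [[2.0, 1.0], [6.0, 1.0], [0.0, 2.0]]   # G = 3 genes, K = 2 metagenes
X = [[1.0, 0.5], [2.0, 4.0]]               # K = 2 metagenes, N = 2 cells
M_norm, X_scaled = normalize_metagenes(M, X)
product_before = matmul(M, X)
product_after = matmul(M_norm, X_scaled)
```

Any per-metagene scale factor moved from a column of M into the matching row of X leaves the product the same, so fixing the column sums picks one representative among the equivalent solutions.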
Q12. What is the label of the cell type in the legend?
The labels in the legend are the SPICEMIX cell type, followed by a dash, followed by the cell type of [13], denoted by an asterisk.
Q13. How are the metagene proportions of an individual cell generated in the simulation?
Given the class-specific metagene proportions, which the authors denote by the K-dimensional vector $b_c$ for cell type $c$, the proportions for an individual cell $i$ are given by $v_i = \tilde{v}_i / \sum_k \tilde{v}_{i,k}$ with $\tilde{v}_i = b_c + \eta_i$, where $\eta_i \sim \mathcal{N}(0, \sigma_x \Sigma_c)$ is a K-dimensional Gaussian random variable that controls the cell-to-cell variation of metagene proportions.
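The sampling rule in this answer (add Gaussian noise to the class-specific vector, then normalize so the entries sum to one) can be sketched as below. Two simplifications are assumptions of this sketch and not from the source: the class covariance is taken to be the identity, and noisy values are floored at a small positive number so the normalized proportions stay positive.

```python
import random

def simulate_proportions(b_c, sigma_x, n_cells, seed=0, floor=1e-8):
    """Sample per-cell metagene proportions around the class vector b_c.

    Assumes an identity class covariance, so each entry receives independent
    Gaussian noise with variance sigma_x; the floor is an added safeguard.
    """
    rng = random.Random(seed)
    std = sigma_x ** 0.5
    cells = []
    for _ in range(n_cells):
        v_tilde = [max(b + rng.gauss(0.0, std), floor) for b in b_c]
        total = sum(v_tilde)
        cells.append([v / total for v in v_tilde])  # normalize to sum to one
    return cells

cells = simulate_proportions([0.6, 0.3, 0.1], sigma_x=0.01, n_cells=5)
```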
Q14. What happens if T0 is too large?
The authors observed that if T0 is too large, it can cause the algorithm to prematurely reach a local minimum before spatial relationships are considered.
Q15. What enhancements could be made to the probabilistic model of SPICEMIX?
Further enhancements could be made to the probabilistic model of SPICEMIX, including additional priors such as sparsity, to tailor it toward particular application contexts.
Q16. What is the significance of the correlations between the genes?
The authors found that the correlations of seven of the eleven genes were significant (p < 0.05, after a two-step FDR correction for multiple testing) (Fig. 4f and Fig. S6b), supporting their hypothesis.