SpiceMix: Integrative single-cell spatial modeling for inferring cell identity

30 Nov 2020 - bioRxiv (Cold Spring Harbor Laboratory)
TL;DR: Spatial transcriptomics technologies promise to reveal spatial relationships of cell-type composition in complex tissues but the development of computational methods that can utilize the unique properties of spatial transcriptome data to unveil cell identities remains a challenge.
Abstract: Spatial transcriptomics technologies promise to reveal spatial relationships of cell-type composition in complex tissues. However, the development of computational methods that capture the unique properties of single-cell spatial transcriptome data to unveil cell identities remains a challenge. Here, we report SPICEMIX, a new probabilistic model that enables effective joint analysis of spatial information and gene expression of single cells based on spatial transcriptome data. Both simulation and real data evaluations demonstrate that SPICEMIX consistently improves upon the inference of the intrinsic cell types compared with existing approaches. As a proof-of-principle, we use SPICEMIX to analyze single-cell spatial transcriptome data of the mouse primary visual cortex acquired by seqFISH+ and STARmap. We find that SPICEMIX can improve cell identity assignments and uncover potentially new cell subtypes. SPICEMIX is a generalizable framework for analyzing spatial transcriptome data that may provide critical insights into the cell-type composition and spatial organization of cells in complex tissues.

Summary (3 min read)

Introduction

  • The compositions of different cell types in various human tissues remain poorly understood due to the complex interplay among intrinsic, spatial, and temporal factors that collectively contribute to cell identity [1–3].
  • Single-cell RNA-seq (scRNA-seq) has greatly advanced the understanding of complex cell types in different tissues [4–6], but its utility in disentangling spatial factors in particular is inherently limited by the dissociation of cells from their spatial context.
  • In addition, the HMRF-based model of Zhu et al. [19] relies on the assumptions that spatial subtypes are discrete and exhibit homogeneous spatial patterns, which prohibits it from learning the underlying mixture of diverse factors of cell identity with varied spatial patterns (e.g., distinct layer-like structures or diffuse patterns).
  • Here, the authors report SPICEMIX (Spatial Identification of Cells using Matrix Factorization), a new integrative framework to model spatial transcriptome data.
  • SPICEMIX has the potential to provide critical new insights into the cell composition based on spatial transcriptome data.

Overview of SPICEMIX

  • SPICEMIX models the cell-to-cell relationships of the spatial transcriptome by a new probabilistic graphical model formulation, the NMF-HMRF (Fig. 1).
  • Crucially, SPICEMIX learns the parameters of the model that best explain the input spatial transcriptome data, while simultaneously learning the underlying metagenes and their proportions that define the identities of the cells.
  • The authors compared the inference of SPICEMIX to that of NMF and HMRF, since they are the fundamental underlying models of many relevant computational methods.
  • In particular, the identification of layer-specific excitatory neurons by SPICEMIX had a high correspondence with their associated layer (Fig. 3c), whereas several excitatory clusters from the original analysis in [12] were incorrectly dispersed across as many as three layers (see Fig. 3h in [12]).
  • Notably, as annotated in Fig. 3b (right), metagene 7 is expressed at a high proportion among oligodendrocytes, distinguishing them from OPCs, while the expression of metagene 8, which is also present in OPCs, distinguished the rare Oligo-2 type from Oligo-1.

Discussion

  • The authors developed SPICEMIX, an unsupervised method for modeling the diverse factors that collectively contribute to cell identity based on single-cell spatial transcriptome data.
  • This additional data may improve the inference of the latent variables and parameters of the model, which could further improve the modeling of cellular heterogeneity.
  • In addition, further enhancements could be made to the probabilistic model of SPICEMIX including additional priors, such as sparsity, to tailor toward particular application contexts.
  • As the area of spatial transcriptomics continues to thrive and data become more widely available, SPICEMIX will be a uniquely useful tool for enabling new discoveries.

Graphical model formulation

  • The authors' formulation of the NMF-HMRF in SPICEMIX enhances standard NMF by modeling the spatial correlations among samples (i.e., cells in this context) via the HMRF [29].
  • Any graph construction method for determining edges, such as distance thresholding or Delaunay triangulation, can be used.
  • The observations are related to the hidden variables via the potential function φ, which captures the NMF formulation.
  • U_x measures the inner product between the metagene proportions of neighboring cells i and j, weighted by a learned, pairwise correlation matrix Σ_x^{-1}, which captures the spatial affinity of metagenes.
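
To make the two ingredients above concrete, the following is a minimal sketch, assuming distance thresholding for the graph and a plain sum of weighted inner products for the edge term. The function names, the toy coordinates, the toy affinity matrix, and the sign convention are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def threshold_graph(coords, radius):
    """Return undirected edges (i, j) with i < j for cells closer than `radius`."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    i_idx, j_idx = np.where(np.triu(dist < radius, k=1))
    return list(zip(i_idx.tolist(), j_idx.tolist()))

def pairwise_energy(X, edges, affinity):
    """Sum of x_i^T A x_j over all edges; X holds one column per cell."""
    return sum(float(X[:, i] @ affinity @ X[:, j]) for i, j in edges)

# Toy example: three nearby cells and one distant cell, two metagenes.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
edges = threshold_graph(coords, radius=1.5)

X = np.array([[1.0, 0.0, 0.5, 0.0],     # proportion of metagene 0 per cell
              [0.0, 1.0, 0.5, 1.0]])    # proportion of metagene 1 per cell
affinity = np.array([[-1.0, 0.5],       # hypothetical stand-in for the learned matrix
                     [0.5, -1.0]])
energy = pairwise_energy(X, edges, affinity)
```

The distant cell contributes nothing because it has no edges, which is the only way spatial position enters this term.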

Parameter priors

  • This prior can be viewed as a regularization that allows us to control the importance of the spatial relationships during inference.
  • Alternating estimation of hidden states and model parameters.
  • To infer the hidden states and model parameters of the NMF-HMRF model in SPICEMIX, the authors optimize the data likelihood via coordinate ascent, alternating between optimizing hidden states and model parameters.
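
The alternating scheme can be illustrated on plain NMF, leaving out the HMRF term entirely. The sketch below swaps in the standard Lee-Seung multiplicative updates as the two alternating steps; this is a stand-in chosen for brevity, not the optimizer described for SPICEMIX, and all names and sizes are assumptions.

```python
import numpy as np

def nmf_alternating(Y, K, n_iter=200, seed=0, eps=1e-12):
    """Alternate between updating hidden states X and metagenes M for Y ~ M @ X."""
    rng = np.random.default_rng(seed)
    G, N = Y.shape
    M = rng.random((G, K)) + 0.1
    X = rng.random((K, N)) + 0.1
    errors = []
    for _ in range(n_iter):
        # Step 1: update hidden states with the metagenes held fixed.
        X *= (M.T @ Y) / (M.T @ M @ X + eps)
        # Step 2: update metagenes with the hidden states held fixed.
        M *= (Y @ X.T) / (M @ X @ X.T + eps)
        errors.append(float(np.linalg.norm(Y - M @ X)))
    return M, X, errors

# Toy data with an exact rank-3 non-negative factorization.
rng = np.random.default_rng(1)
Y = rng.random((20, 3)) @ rng.random((3, 50))
M, X, errors = nmf_alternating(Y, K=3)
```

Each step only improves one block of variables given the other, which is the same structure as the coordinate ascent described above.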

Estimation of hidden states

  • Given the parameters, estimating the hidden states (Eq. 9) is a quadratic program that can be solved efficiently via iterated conditional modes (ICM) [41] using the software package Gurobi [42] (see Supplementary Methods A.1 for more details).
  • Algorithm 1 NMF-HMRF model-fitting and hidden state estimation.
  • Derive an initial estimate M^(0) using K-means clustering, assuming no spatial relationships.
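
The ICM idea, updating one cell at a time while its neighbors stay fixed, can be sketched as follows. This is a simplified illustration under stated assumptions: the per-cell quadratic program is solved by projected gradient steps rather than Gurobi, the hidden states are only constrained to be non-negative, and the energy is a squared reconstruction error plus a symmetric affinity term. None of the names come from the original method.

```python
import numpy as np

def total_energy(Y, M, X, neighbors, affinity, sigma2):
    """Reconstruction error plus affinity-weighted inner products over edges."""
    recon = ((Y - M @ X) ** 2).sum() / (2.0 * sigma2)
    spatial = sum(X[:, i] @ affinity @ X[:, j]
                  for i, nbrs in enumerate(neighbors) for j in nbrs if i < j)
    return float(recon + spatial)

def icm_sweep(Y, M, X, neighbors, affinity, sigma2, n_steps=20):
    """One pass over all cells; each cell solves its own non-negative QP approximately."""
    H = M.T @ M / sigma2                        # Hessian of the per-cell quadratic
    step = 1.0 / np.linalg.eigvalsh(H).max()    # step size that cannot increase the objective
    for i, nbrs in enumerate(neighbors):
        nbr_sum = X[:, nbrs].sum(axis=1) if nbrs else np.zeros(X.shape[0])
        # Linear term: data fit plus the influence of the fixed neighbor states.
        lin = -M.T @ Y[:, i] / sigma2 + affinity @ nbr_sum
        x = X[:, i].copy()
        for _ in range(n_steps):
            x = np.maximum(x - step * (H @ x + lin), 0.0)   # gradient step, then project
        X[:, i] = x
    return X

rng = np.random.default_rng(0)
G, K, N = 12, 3, 6
M = rng.random((G, K))
Y = M @ rng.random((K, N)) + 0.01 * rng.standard_normal((G, N))
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]      # chain graph (toy)
affinity = -0.1 * np.eye(K)                                 # symmetric, hypothetical values
X0 = np.full((K, N), 0.5)
before = total_energy(Y, M, X0, neighbors, affinity, sigma2=0.01)
X1 = icm_sweep(Y, M, X0.copy(), neighbors, affinity, sigma2=0.01)
after = total_energy(Y, M, X1, neighbors, affinity, sigma2=0.01)
```

Because each cell's subproblem is convex once its neighbors are fixed, every update can only lower the total energy, even when the affinity matrix itself is not positive definite.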

argmax_Θ P(Y, X | Θ) P(Θ)

  • The authors note that they can estimate metagenes, spatial affinity, and the noise level independently.
  • The MAP estimate of Σ_x^{-1} is convex and is solved by the optimizer Adam [43].
  • See Supplementary Methods A.2 for details of the optimization method.

Initialization

  • To produce initial estimates of the model parameters and hidden states, the authors do the following.
  • First, the authors use a common strategy for initializing NMF, which is to cluster the data using K-means clustering, with K equal to the number of metagenes, and use the means of the clusters as an estimate of the metagenes.
  • This produces, in only a few quick iterations, an appropriate initial estimate for the algorithm, which will be subsequently refined.
  • The authors observed that if T0 is too large, it can cause the algorithm to prematurely reach a local minimum before spatial relationships are considered.
  • This value can be easily tuned by experimentation, and in their analysis, the authors found that just 5 iterations were necessary.
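
The K-means step of this initialization can be sketched with a few lines of NumPy. This is a minimal illustration, assuming Lloyd's algorithm with random initial centers and cells stored as columns; the function name and toy sizes are hypothetical, and the subsequent few NMF iterations are not shown.

```python
import numpy as np

def kmeans_metagenes(Y, K, n_iter=10, seed=0):
    """Cluster cells (columns of Y) and return the cluster means as a G x K matrix."""
    rng = np.random.default_rng(seed)
    cells = Y.T                                              # N x G, one row per cell
    centers = cells[rng.choice(len(cells), size=K, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each cell to its nearest center (squared Euclidean distance).
        d2 = ((cells[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for k in range(K):
            members = cells[labels == k]
            if len(members) > 0:                             # keep the old center if a cluster empties
                centers[k] = members.mean(axis=0)
    return centers.T, labels

rng = np.random.default_rng(2)
Y = rng.random((30, 80))             # toy non-negative expression matrix, genes x cells
M0, labels = kmeans_metagenes(Y, K=4)
```

Cluster means of non-negative data are themselves non-negative, so the result is a valid starting point for NMF.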

Empirical running time

  • The GPU is used only for roughly the first 5 iterations, when the spatial affinity matrix Σ_x^{-1} changes significantly.
  • Afterwards, most of the running time is spent solving quadratic programs.

Generation of simulated data

  • The authors generated simulated spatial transcriptomic data following expression and spatial patterns similar to cells in the mouse primary visual cortex.
  • The two inhibitory neuron types were scattered sparsely throughout several layers.
  • For excitatory neurons, the layer-specific metagene defined the subtype.
  • The authors generated the value for each gene for each metagene from the Gamma distribution with a scale parameter of 1.
  • Steps of data processing include: constructing the neighbor graph of cells, selection of hyperparameters for SPICEMIX, NMF, and HMRF, random seed selection, the choice of the number of metagenes, and the choice of the number of clusters for hierarchical clustering.
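
A heavily simplified sketch of such a simulation is shown below. Only the Gamma distribution with scale 1 comes from the text; the Gamma shape, the Dirichlet draw for metagene proportions, the Gaussian noise level, the clipping at zero, and the matrix sizes are all assumptions made for illustration, and no spatial layer structure is modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)
G, K, N = 100, 8, 500        # genes, metagenes, cells (illustrative sizes)
gamma_shape = 2.0            # hypothetical; the text specifies only scale = 1
noise_sd = 0.2               # hypothetical noise level

# Metagenes: one Gamma-distributed value per gene per metagene.
M = rng.gamma(shape=gamma_shape, scale=1.0, size=(G, K))

# Hidden states: per-cell metagene proportions (assumed Dirichlet, summing to 1).
X = rng.dirichlet(np.ones(K), size=N).T

# Observed expression: mixture of metagenes plus noise, clipped to stay non-negative.
Y = np.maximum(M @ X + noise_sd * rng.standard_normal((G, N)), 0.0)
```

Varying `noise_sd` and the spread of the columns of `X` would correspond to the two sources of randomness mentioned for the simulation scenarios.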

Figure caption notes

  • Cell-type labels appearing in the figure panels: VIP, SST, eL2/3, eL4, eL5a, eL5b, eL6, Micro, SMC, Endo, OPC, Astro, Astro/Oligo, Oligo, Oligo-1, Oligo-2.
  • Colors of cells and labels throughout the figure correspond to the cell-type assignments of SPICEMIX.
  • Panel a (left) highlights that SPICEMIX further delineated inhibitory neurons into VIPs and SSTs (enclosed by the orange dashed circle) and delineated Oligos and OPCs into separate subtypes, namely Astro/Oligo, Oligo-1, Oligo-2, and OPC (enclosed by the red dashed circle).
  • The colored boxes following the name of each marker gene correspond to its known associated cell type.
  • A further panel shows the average expression of inferred metagenes within SPICEMIX cell types.


SPICEMIX: Integrative single-cell spatial modeling for inferring cell identity
Benjamin Chidester (1,#), Tianming Zhou (1,#), and Jian Ma (1,*)
(1) Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
(#) These two authors contributed equally.
(*) Correspondence: jianma@cs.cmu.edu
Abstract
Spatial transcriptomics technologies promise to reveal spatial relationships of cell-type composition
in complex tissues. However, the development of computational methods that capture the unique
properties of single-cell spatial transcriptome data to unveil cell identities remains a challenge. Here,
we report SPICEMIX, a new method based on probabilistic, latent variable modeling that enables ef-
fective joint analysis of spatial information and gene expression of single cells from spatial transcrip-
tome data. Both simulation and real data evaluations demonstrate that SPICEMIX markedly improves
upon the inference of cell types compared with existing approaches. Applications of SPICEMIX to
single-cell spatial transcriptome data of the mouse primary visual cortex acquired by seqFISH+ and
STARmap show that SPICEMIX can enhance the inference of cell identities and uncover potentially
new cell subtypes with important biological processes. SPICEMIX is a generalizable framework for
analyzing spatial transcriptome data to provide critical insights into the cell-type composition and
spatial organization of cells in complex tissues.
bioRxiv preprint doi: https://doi.org/10.1101/2020.11.29.383067; this version posted March 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Introduction
The compositions of different cell types in various human tissues remain poorly understood due to the
complex interplay among intrinsic, spatial, and temporal factors that collectively contribute to cell identity
[1–3]. Single-cell RNA-seq (scRNA-seq) has greatly advanced our understanding of complex cell
types in different tissues [4–6], but its utility in disentangling spatial factors in particular is inherently
limited by the dissociation of cells from their spatial context. To address this limitation, new spatial
transcriptomics technologies based on multiplexed imaging and sequencing [7–17] are able to reveal spatial
information of gene expression of dozens to tens of thousands of genes in individual cells in situ within
the tissue context.
However, the development of computational methods that capture the unique properties of the spa-
tially resolved single-cell transcriptome data to unveil single-cell identities remains a challenge [18].
Zhu et al. [19] previously proposed the use of a hidden Markov random field (HMRF) to model spatial
domains after distinguishing spatial and intrinsic genes (based on scRNA-seq). The major drawback of
the method of [19] is that it cannot learn contributions of spatial and intrinsic factors to gene expression
directly from spatial transcriptome data. In addition, the model relies on the assumptions that spatial
subtypes are discrete and exhibit homogeneous spatial patterns, which prohibits it from learning the un-
derlying mixture of diverse factors of cell identity with varied spatial patterns (e.g., distinct layer-like
structures or diffuse patterns). Several other methods have been developed to study the relationship
of known cell types in local neighborhoods [20], to explore the spatial variance of genes [21–24], and
to align scRNA-seq with spatial transcriptome data [25–27]. But no existing method seeks to jointly
model spatial patterns of the cells and their expression profiles to reveal cell identity, which is of vital
importance to fully utilize spatial transcriptome data.
Here, we report SPICEMIX (Spatial Identification of Cells using Matrix Factorization), a new in-
tegrative framework to model spatial transcriptome data. SPICEMIX uses latent variable modeling to
express the interplay of spatial and intrinsic factors that comprise cell identity. Crucially, SPICEMIX
enhances the non-negative matrix factorization (NMF) [28] of gene expression with a novel integration
with the graphical representation of the spatial relationship of cells. Thus, the learned spatial patterns
can elucidate the relationship of intrinsic and spatial factors, leading to much more meaningful represen-
tations of cell identity. Application to the spatial transcriptome data of the mouse primary visual cortex
acquired by seqFISH+ [12] and STARmap [13] demonstrated that the latent representations learned by
SPICEMIX can refine the identification of cell types, uncover subtypes missed by other approaches, and
reveal important biological processes. SPICEMIX has the potential to provide critical new insights into
the cell composition based on spatial transcriptome data.
Results
Overview of SPICEMIX
SPICEMIX models the cell-to-cell relationships of the spatial transcriptome by a new probabilistic graph-
ical model formulation, the NMF-HMRF (Fig. 1). The input of the model consists of gene expression
measurements and spatial coordinates of cells from spatial transcriptome data (e.g., seqFISH+ [12] and
STARmap [13]). From the spatial coordinates, an undirected graph is constructed to capture pairwise
spatial relationships, where each cell is a node in the graph. For each node, a latent state vector explains
the observed gene expression of the cell. More critically, unique to the NMF-HMRF model is the inte-
gration of NMF into the HMRF [29] to represent the observations as mixtures of latent factors, modeled
by metagenes, where the proportions are the hidden states of the graph. In contrast, in a standard HMRF,
hidden states are assumed to be discrete, thus restricting the expressiveness of the model.
In the NMF-HMRF model of SPICEMIX, the potential functions of the graph capture the probabilis-
tic relationships between variables in the model. The potential functions for observations capture the
likelihood of the observation given the hidden state of the cell. The potential functions for edges capture
the spatial affinity between the metagene proportions of neighboring cells. In a standard HMRF, it is
assumed that neighboring nodes will have similar hidden states, resulting in a spatial smoothing effect
that is inadequate to describe the heterogeneous spatial patterns of the cells. However, in the formula-
tion of SPICEMIX, we do not assume such a relationship a priori, but rather allow the method to learn
spatial affinities from the spatial transcriptome data. Crucially, SPICEMIX learns the parameters of the
model that best explain the input spatial transcriptome data, while simultaneously learning the underlying
metagenes and their proportions that define the identities of the cells. This is achieved by a new opti-
mization algorithm that alternates between maximizing the joint posterior distribution of the parameters
in the model and maximizing the posterior distribution of the metagenes in the matrix factorization. The
learned parameters, metagenes, and proportions provide biological insights into the latent representation.
See Methods for the detailed description of the SPICEMIX model.
Evaluation of SPICEMIX on simulated spatial transcriptome data
We first evaluated SPICEMIX on simulated data that we designed to model the mouse cortex, which has
served as a prominent case study for several spatial transcriptomic methods, including seqFISH+ [12] and
STARmap [13] (Fig. 2a-b; see Methods for detailed simulation strategy). This region of the brain con-
sists of cell types that exhibit strong, layer-wise patterns of expression as well as cell types that sparsely
populate the entire tissue. The goal of the evaluation was to infer the latent metagenes describing gene
expression and to reveal the underlying simulated cell types. We compared the inference of SPICEMIX
to that of NMF and HMRF, since they are the fundamental underlying models of many relevant compu-
tational methods. This comparison also aimed to demonstrate the advantage of the integration of these
two models in SPICEMIX, rather than using either alone. We assessed performance by quantitatively
comparing the cell types learned from each method with the simulated true cell types, using the adjusted
Rand index (ARI). For SPICEMIX and NMF, we applied additional hierarchical clustering to the learned
latent representation to group cells into clusters. The number of clusters was determined objectively by
maximizing the Calinski-Harabasz (CH) index [30]. The strategy for choosing other hyperparameters for
SPICEMIX and NMF is described in Methods. The number of clusters, or discrete states, for HMRF was
chosen automatically during operation, given an upper bound, and the smoothing parameter was chosen
manually to maximize the ARI, representing its best-case performance. We devised four simulation sce-
narios for evaluation, which varied the randomness of the data in terms of both the noise variance and
the variance of the true hidden states (see Methods).
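
For reference, the ARI can be computed directly from the contingency table of two labelings. The following is a minimal sketch of the standard formula, not the evaluation code used in this work; the function names are illustrative.

```python
import numpy as np

def comb2(n):
    """Number of unordered pairs among n items (works elementwise on arrays)."""
    return n * (n - 1) / 2.0

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    _, t = np.unique(labels_true, return_inverse=True)
    _, p = np.unique(labels_pred, return_inverse=True)
    # Contingency table: rows are true types, columns are predicted clusters.
    table = np.zeros((t.max() + 1, p.max() + 1))
    np.add.at(table, (t, p), 1)
    sum_cells = comb2(table).sum()
    sum_rows = comb2(table.sum(axis=1)).sum()
    sum_cols = comb2(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / comb2(len(labels_true))
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:          # degenerate case, e.g. a single cluster in both
        return 1.0
    return float((sum_cells - expected) / (max_index - expected))

same_up_to_renaming = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is invariant to how clusters are named, so a perfect clustering scores 1.0 regardless of label permutation, while a labeling that splits every true type evenly scores below zero.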
We found that SPICEMIX consistently produced the best ARI score (0.6-0.8 on average; the maxi-
mum value being 1.0) across all scenarios (Fig. 2d). In contrast, NMF achieved an ARI between 0.2-0.4
on average, a reduction by more than 50%. As expected, as the variance of the expression values or
hidden states increased, the performance of all methods decreased (Fig. 2d). To ensure that the CH index
was not favorably biased towards SPICEMIX, we also evaluated NMF when the number of clusters was
instead chosen to maximize the ARI directly rather than according to the CH index (denoted as “NMF*”
in Fig. 2d). The resulting ARI ranged from 0.5-0.65. Thus, the best-case scenario for NMF was sig-
nificantly worse than the performance of SPICEMIX. In addition, HMRF achieved a far lower ARI,
between 0.1-0.2 on average. Looking closer at an example simulated sample reveals that the superior
cell type inference of SPICEMIX was due to its successful recovery of both layer-specific and sparse
spatial patterns of metagenes (Fig. 2c; metagene 8 shows layer-specific localization whereas metagene 2
has a more diffuse pattern). The precise recovery of these metagenes led to a much clearer separation
of the simulated cell types in the learned latent space of SPICEMIX (Fig. 2f). Notably, this resulted in a
clear and accurate delineation of the layer-specific excitatory neurons in the sample (Fig. 2e). We found
that, in contrast, the metagenes learned by NMF lacked spatial coherence (Fig. 2c). Consequently, NMF
often failed to reveal the excitatory neurons according to their layer-specific enrichment (Fig. 2e). Also,
in contrast to both SPICEMIX and NMF, HMRF smoothed over sparse cell types and yet still failed to
detect clear layer-wise boundaries (Fig. 2e), despite having optimized the smoothing parameter. Specif-
ically, the spatial patterns of the boundaries between HMRF clusters are not consistent with the ground
truth (dashed vertical lines in Fig. 2e), especially in layer L4, where green, yellow, and blue cell types
show an interleaving pattern. This same phenomenon was also manifested in our real data application
(see later sections and Fig. S4).
Taken together, we showed that the novel integration of matrix factorization and spatial modeling in
SPICEMIX yields superior inference of underlying cell identities across a variety of settings, compared
to either NMF or HMRF alone. This improvement was seen for cell types with either sparse or layer-
specific spatial patterns, both of which are prevalent in real data from complex tissues (e.g., the mouse
cortex data used in this work). In addition, our evaluation also confirmed the effectiveness and robustness
of our new optimization scheme for fitting the SPICEMIX model to spatial transcriptome data.
SPICEMIX refines cell identity inference from seqFISH+ data
We applied our method to the data acquired by seqFISH+ [12]. Specifically, we sought a robust model
of the spatial variation of gene expression using SPICEMIX that would reveal both intrinsic factors of
expression as well as spatial patterns, thereby unveiling cell identities more accurately. Here, we used
the data of five separate samples of the mouse primary visual cortex, all from the same mouse but from
contiguous layers, each from a distinct image or field-of-view (FOV), with single-cell expression of
2,470 genes in 523 cells [12]. We compared the cell identities revealed by SPICEMIX to those of NMF
and Eng et al. [12].
The clustering of the learned latent representation of SPICEMIX revealed five excitatory neural subtypes,
two inhibitory neural subtypes, and eight glial subtypes (Fig. 3a), supported by scRNA-seq marker
genes [31] (Fig. 3b (left)). Although the assignment of major types was consistent between SPICEMIX,
NMF, and [12] (Fig. 3b (middle) and Fig. S1), SPICEMIX refined and expanded the identification of
cell subtypes (Fig. 3b (middle)). In particular, the identification of layer-specific excitatory neurons by
SPICEMIX had a high correspondence with their associated layer (Fig. 3c), whereas several excitatory
clusters from the original analysis in [12] were incorrectly dispersed across as many as three layers (see
Fig. 3h in [12]). Furthermore, SPICEMIX correctly distinguished eL5b and eL6 neurons, which were
mixed together in several clusters in [12] (Fig. 3b (middle)). The expression of marker genes Col6a1 and
Ctgf [31] confirmed the identity of these cells (Fig. 3b (left)).
Beyond mere discrete cell type assignments, the metagenes and spatial affinities learned by SPICEMIX
provided new insight into the underlying factors of glial cell states. The metagenes of SPICEMIX tend to
capture either expression patterns of specific cell types, expressed at high levels, or patterns shared across
cell types, expressed at lower levels (Fig. 3b (right)). Notably, as annotated in Fig. 3b (right), metagene 7
is expressed at a high proportion among oligodendrocytes, distinguishing them from OPCs, while the ex-
pression of metagene 8, which is also present in OPCs, distinguished the rare Oligo-2 type from Oligo-1.
This separation is confirmed by the expression patterns of the OPC marker gene Cspg4, the differen-
tiating Oligo marker gene Tcf7l2 [32], and the mature Oligo marker gene Mog [33] (Fig. 3b (left)).
Furthermore, the expression of the latter two marker genes supports the hypothesis that the Oligo-2 cells
of SPICEMIX are likely in an intermediate transition during maturation from OPCs to oligodendrocytes,
corresponding to the proportions of metagenes 7 and 8, rather than constituting a discrete cell type. Also,
the learned metagene spatial affinities reveal that metagene 7 has a strong affinity for metagenes 3 and
4 (highlighted by black arrows in Fig. 3d (right)), which are expressed primarily by the excitatory neurons
of deeper tissue layers (eL5a, eL5b, and eL6) (Fig. 3b (right)). Thus, the spatial affinity of this
oligodendrocyte-specific metagene 7 led to the separation of the Oligo-1 cells from OPCs, which, in
contrast, do not have a strong affinity with any particular excitatory neuron type (Fig. 3d). In contrast,
without spatial information to help decompose the highly similar expression profiles of these cell types,
both NMF and Eng et al. [12] failed to distinguish these cells from other oligodendrocytes or OPCs
(Fig. S1 and Fig. 3b (middle), respectively). Lastly, SPICEMIX revealed an additional separation of a
cluster of [12] into SMC and Endo cells, which can be confirmed by the expression of their respective
marker genes (i.e., Bgn highly expressed in SMC but not Endo cells, and Flt1 highly expressed in both
SMC and Endo cells [31]) (Fig. 3b).
Together, by analyzing the seqFISH+ data with SPICEMIX, we identified cell subtypes of the mouse
cortex whose spatial distributions are more consistent with prior experiments. We also delineated rarer
subtypes that were not distinguished by other methods. This analysis strongly demonstrates the advan-
tages and unique capabilities of SPICEMIX.
SPICEMIX reveals spatially-enriched cell types and subtypes from STARmap data
Next, we applied SPICEMIX to a single-cell spatial transcriptome dataset of the mouse cortex acquired
by STARmap [13]. As in the analysis of the seqFISH+ dataset, the learned latent representation of cell
identity of SPICEMIX provided a better characterization of cell subtypes and offered additional insight
into their underlying factors. We analyzed a single sample consisting of 930 cells passing quality control,
all from a single image or FOV, with expression measurements for 1020 genes. To distinguish cell-type
labels between methods, we append an asterisk to the end of the cell labels of Wang et al. [13] when
referenced.
We found that SPICEMIX produced more accurate cell labels than [13] and revealed subtypes missed
both in [13] and by NMF (Fig. 4, Fig. S2). In comparison to NMF, SPICEMIX uncovered the following
additional subtypes: SST inhibitory neuron, Oligo, Astro/Oligo, and two eL6 subtypes (Fig. 4a, b (left);
supported by known marker genes [13, 31]). In comparison to the clusters from [13], SPICEMIX refined
the assignment of excitatory neurons and further delineated the Oligo type into three subtypes: Oligo-1,
Oligo-2, and Astro/Oligo (Fig. 4b (middle)). Specifically, SPICEMIX was able to learn the layer-like
structure of excitatory neurons in tissue (Fig. 4c), thereby improving upon the assignments reported in
Fig. 5d in [13], which erroneously mixed several neuron subtypes across layer boundaries. We noted
that 15 eL2/3* or eL4* cells of [13] in fact resided not in layers L2-L4 but in layers L5 and L6
(black × in the middle panel in Fig. 4c) and 15 eL5* neurons of [13] resided outside of layer L5
(black dots in the bottom panel in Fig. 4c), which is not consistent with the spatial association of those
neurons. The refinement by SPICEMIX is especially notable in the reassignment of 36 cells in excitatory
Citations
More filters
01 Nov 2016
TL;DR: Single-cell genomics has now made it possible to create a comprehensive atlas of human cells and has reopened definitions of a cell's identity and of the ways in which identity is regulated by the cell's molecular circuitry.
Abstract: Single-cell genomics has now made it possible to create a comprehensive atlas of human cells. At the same time, it has reopened definitions of a cell's identity and of the ways in which identity is regulated by the cell's molecular circuitry. Emerging computational analysis methods, especially in single-cell RNA sequencing (scRNA-seq), have already begun to reveal, in a data-driven way, the diverse simultaneous facets of a cell's identity, from discrete cell types to continuous dynamic transitions and spatial locations. These developments will eventually allow a cell to be represented as a superposition of 'basis vectors', each determining a different (but possibly dependent) aspect of cellular organization and function. However, computational methods must also overcome considerable challenges-from handling technical noise and data scale to forming new abstractions of biology. As the scale of single-cell experiments continues to increase, new computational approaches will be essential for constructing and characterizing a reference map of cell identities.

372 citations

Journal ArticleDOI
TL;DR: In this article, a broad collection of approaches ranging from batch correction of individual omics datasets to association of chromatin accessibility and genetic variation with transcription are reviewed, as the number of single-cell experiments with multiple data modalities increases.
Abstract: The development of single-cell multimodal assays provides a powerful tool for investigating multiple dimensions of cellular heterogeneity, enabling new insights into development, tissue homeostasis and disease. A key challenge in the analysis of single-cell multimodal data is to devise appropriate strategies for tying together data across different modalities. The term ‘data integration’ has been used to describe this task, encompassing a broad collection of approaches ranging from batch correction of individual omics datasets to association of chromatin accessibility and genetic variation with transcription. Although existing integration strategies exploit similar mathematical ideas, they typically have distinct goals and rely on different principles and assumptions. Consequently, new definitions and concepts are needed to contextualize existing methods and to enable development of new methods. As the number of single-cell experiments with multiple data modalities increases, Argelaguet and colleagues review the concepts and challenges of data integration.

150 citations

Journal ArticleDOI
TL;DR: In this paper , the authors identify the key biological questions in spatial analysis of tissues and develop the requisite computational tools to address them, and group these biological problems and related computational algorithms into classes across length scales, thus characterizing common issues that need to be addressed.
Abstract: Methods for profiling RNA and protein expression in a spatially resolved manner are rapidly evolving, making it possible to comprehensively characterize cells and tissues in health and disease. To maximize the biological insights obtained using these techniques, it is critical to both clearly articulate the key biological questions in spatial analysis of tissues and develop the requisite computational tools to address them. Developers of analytical tools need to decide on the intrinsic molecular features of each cell that need to be considered, and how cell shape and morphological features are incorporated into the analysis. Also, optimal ways to compare different tissue samples at various length scales are still being sought. Grouping these biological problems and related computational algorithms into classes across length scales, thus characterizing common issues that need to be addressed, will facilitate further progress in spatial transcriptomics and proteomics.

95 citations

Journal ArticleDOI
TL;DR: A comprehensive review of the state-of-the-art spatial transcriptomics data analysis methods and pipelines can be found in this article, where the authors discuss how they operate on different technological platforms.
Abstract: Spatial transcriptomics is a rapidly growing field that promises to comprehensively characterize tissue organization and architecture at the single-cell or subcellular resolution. Such information provides a solid foundation for mechanistic understanding of many biological processes in both health and disease that cannot be obtained by using traditional technologies. The development of computational methods plays important roles in extracting biological signals from raw data. Various approaches have been developed to overcome technology-specific limitations such as spatial resolution, gene coverage, sensitivity, and technical biases. Downstream analysis tools formulate spatial organization and cell-cell communications as quantifiable properties, and provide algorithms to derive such properties. Integrative pipelines further assemble multiple tools in one package, allowing biologists to conveniently analyze data from beginning to end. In this review, we summarize the state of the art of spatial transcriptomic data analysis methods and pipelines, and discuss how they operate on different technological platforms.

66 citations

Journal ArticleDOI
TL;DR: In this paper, the authors comment on the computational challenges associated with spatially resolved transcriptomic data and highlight the opportunities for standardized benchmarking metrics and data-sharing infrastructure in spurring innovation moving forward.
Abstract: Spatially resolved transcriptomic data demand new computational analysis methods to derive biological insights. Here, we comment on these associated computational challenges as well as highlight the opportunities for standardized benchmarking metrics and data-sharing infrastructure in spurring innovation moving forward.

18 citations

References
Proceedings Article
01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.

111,197 citations


"SpiceMix: Integrative single-cell s..." refers methods in this paper

  • ...The MAP estimate of $\Sigma_x^{-1}$ is convex and is solved by the optimizer Adam [59]....

    [...]

  • ...The MAP estimate of $\Sigma_x^{-1}$ is convex and is solved by the optimizer Adam [43]....

    [...]

Journal ArticleDOI
TL;DR: Burrows-Wheeler Alignment tool (BWA) is implemented, a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.
Abstract: Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ~10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net

43,862 citations

Book
24 Aug 2012
TL;DR: This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach, and is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.
Abstract: Today's Web-enabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach. The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudo-code for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic methods, the book stresses a principled model-based approach, often using the language of graphical models to specify models in a concise and intuitive way. Almost all the models described have been implemented in a MATLAB software package--PMTK (probabilistic modeling toolkit)--that is freely available online. The book is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.

8,059 citations


"SpiceMix: Integrative single-cell s..." refers methods in this paper

  • ...First, to make inference tractable, we approximate the joint probability of the hidden states by the pseudo-likelihood (Murphy, 2012), which is the product of conditional probabilities of the hidden state of individual nodes given that of their neighbors,...

    [...]

  • ...By the Hammersley-Clifford theorem (Murphy, 2012), the likelihood of the data for the pairwise HMRF can be formulated as the product of pairwise dependencies between nodes,...

    [...]

  • ...12 is an approximation by the mean-field assumption (Murphy, 2012), which is used, in addition to the pseudo-likelihood assumption, to make the inference of model parameters tractable....

    [...]

Journal ArticleDOI
13 Jun 2019-Cell
TL;DR: A strategy to "anchor" diverse datasets together, enabling us to integrate single-cell measurements not only across scRNA-seq technologies, but also across different modalities.

7,892 citations

Proceedings Article
01 Jan 2000
TL;DR: Two different multiplicative algorithms for non-negative matrix factorization are analyzed and one algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence.
Abstract: Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. Two different multiplicative algorithms for NMF are analyzed. They differ only slightly in the multiplicative factor used in the update rules. One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence. The monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the Expectation-Maximization algorithm. The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence.

7,345 citations


"SpiceMix: Integrative single-cell s..." refers background in this paper

  • ...It builds upon non-negative matrix factorization (NMF) (Lee and Seung, 2001), which has become a popular paradigm for latent variable modeling of gene expression (Brunet et al....

    [...]

Frequently Asked Questions (16)
Q1. What contributions have the authors mentioned in the paper "SpiceMix: Integrative single-cell spatial modeling for inferring cell identity"?

Here, the authors report SPICEMIX, a new method based on probabilistic, latent variable modeling that enables effective joint analysis of spatial information and gene expression of single cells from spatial transcriptome data. Applications of SPICEMIX to single-cell spatial transcriptome data of the mouse primary visual cortex acquired by seqFISH+ and STARmap show that SPICEMIX can enhance the inference of cell identities and uncover potentially new cell subtypes with important biological processes.

As future work, SPICEMIX could be further enhanced by incorporating additional modalities such as scRNA-seq data. This additional data may improve the inference of the latent variables and parameters of the model, which could further improve the modeling of cellular heterogeneity. In addition, further enhancements could be made to the probabilistic model of SPICEMIX including additional priors, such as sparsity, to tailor toward particular application contexts. The refined cell identity with SPICEMIX also has the potential to improve future studies of cell-cell interactions [37].

Cells in the mouse cortex are classified into three primary categories: inhibitory neurons, excitatory neurons, and non-neurons or glial cells [31, 44]. 

SPICEMIX takes 0.5-2 hours to run on a spatial transcriptome dataset with 2,000 genes and 1,000 cells on a machine with eight 3.6 GHz CPUs and one GeForce 1080 Ti GPU. 

Single-cell RNA-seq (scRNA-seq) has greatly advanced our understanding of complex cell types in different tissues [4–6], but its utility in disentangling spatial factors in particular is inherently limited by the dissociation of cells from their spatial context.

The compositions of different cell types in various human tissues remain poorly understood due to the complex interplay among intrinsic, spatial, and temporal factors that collectively contribute to cell identity [1–3]. 

Sparsely expressed metagenes, such as metagene 8, which led to the identification of PVALB inhibitory neurons, were also successfully recovered by SPICEMIX. 

Since the algorithm uses a few iterations of NMF to provide an initial estimate, which is a reasonable starting point, it is expected to find a good initial estimate of metagenes and latent states efficiently. 
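The initialization idea can be illustrated with a minimal sketch. It assumes the least-squares multiplicative updates of Lee and Seung and a factorization Y ≈ M X with Y of shape genes × cells; the function name `nmf_init`, the number of iterations, and the toy data are hypothetical and are not taken from the SpiceMix implementation.

```python
import numpy as np

def nmf_init(Y, K, n_iter=5, seed=0, eps=1e-10):
    """Run a few multiplicative-update NMF iterations so that Y ~= M @ X.

    Y: non-negative (G x N) expression matrix (assumed layout: genes x cells).
    Returns M (G x K, metagenes) and X (K x N, latent states).
    """
    rng = np.random.default_rng(seed)
    G, N = Y.shape
    # Random positive starting point (hypothetical choice for this sketch).
    M = rng.random((G, K)) + eps
    X = rng.random((K, N)) + eps
    for _ in range(n_iter):
        # Lee-Seung updates for the squared-error objective ||Y - M X||_F^2.
        X *= (M.T @ Y) / (M.T @ M @ X + eps)
        M *= (Y @ X.T) / (M @ X @ X.T + eps)
    return M, X

# Toy non-negative data with an exact rank-4 structure.
rng = np.random.default_rng(1)
Y = rng.random((50, 4)) @ rng.random((4, 30))
M0, X0 = nmf_init(Y, K=4)
```

Because the updates are multiplicative, M0 and X0 stay non-negative, and a handful of iterations already moves the factors from a random start toward the data before any spatial term is considered.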

The expression of important marker genes for myelin-sheath formation in oligodendrocytes plotted against the relative expression of metagenes 12 and 13 of the same cells. 

An asterisk after the p-value means that the result is significant under the threshold of 0.05 (see Supplementary Methods B.1 for details). 

To resolve the scaling ambiguity between $M$ and $X$, the authors constrain the columns of $M$ to sum to one, so as to lie in the $(G-1)$-dimensional simplex, $\mathbb{S}_{G-1}$.
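A short sketch of why this constraint removes the ambiguity: any per-metagene scale can be moved from M into X without changing the product M X. The sketch assumes M has shape G × K and X has shape K × N; the helper name `normalize_metagenes` and the compensating rescaling of X are illustrative assumptions, not the authors' code.

```python
import numpy as np

def normalize_metagenes(M, X):
    """Rescale so each column of M sums to one while keeping M @ X fixed."""
    col_sums = M.sum(axis=0)           # one scale factor per metagene, shape (K,)
    M_norm = M / col_sums              # columns now lie on the simplex
    X_scaled = X * col_sums[:, None]   # absorb the scale into the latent states
    return M_norm, X_scaled

rng = np.random.default_rng(0)
M = rng.random((6, 3))   # hypothetical: G = 6 genes, K = 3 metagenes
X = rng.random((3, 4))   # hypothetical: N = 4 cells
M_norm, X_scaled = normalize_metagenes(M, X)
```

After the call, `M_norm @ X_scaled` reproduces `M @ X`, so fixing the column sums of M picks one representative from each family of equivalent factorizations.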

The labels in the legend are the SPICEMIX cell type, followed by a dash, followed by the cell type of [13], denoted by an asterisk. 

Given the class-specific metagene proportions, which the authors denote by the K-dimensional vector $b_c$ for cell type $c$, the proportions for an individual cell are given by $v_i = \tilde{v}_i / \sum_k \tilde{v}_{i,k}$, with $\tilde{v}_i = b_c + \eta_i$, where $\eta_i \sim \mathcal{N}(0, \sigma_x \Sigma_c)$ is a K-dimensional Gaussian random variable that controls the cell-to-cell variation of metagene proportion.
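The sampling step described above can be sketched as follows. All numeric values (b_c, Σ_c, σ_x, the number of cells) are hypothetical and chosen only for illustration; the noise scale is kept small so that the perturbed proportions stay positive, since this sketch does not handle negative draws.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3                                   # number of metagenes (hypothetical)
b_c = np.array([0.6, 0.3, 0.1])         # class-specific proportions for cell type c (hypothetical)
Sigma_c = np.eye(K)                     # class-specific covariance (hypothetical)
sigma_x = 1e-4                          # noise scale (hypothetical, small on purpose)
n_cells = 5

# eta_i ~ N(0, sigma_x * Sigma_c), one row per cell.
eta = rng.multivariate_normal(np.zeros(K), sigma_x * Sigma_c, size=n_cells)

# v_tilde_i = b_c + eta_i, then normalize each cell to sum to one.
v_tilde = b_c + eta
v = v_tilde / v_tilde.sum(axis=1, keepdims=True)
```

Each row of `v` is one cell's metagene proportions: close to `b_c`, with cell-to-cell variation governed by `sigma_x * Sigma_c`.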

The authors observed that if T0 is too large, it can cause the algorithm to prematurely reach a local minimum before spatial relationships are considered. 

The authors found that the correlations of seven of the eleven genes were significant (p < 0.05, after a two-step FDR correction for multiple testing) (Fig. 4f and Fig. S6b), supporting their hypothesis.