Dimensionality reduction for visualizing single-cell data using UMAP.

TL;DR: Comparing the performance of UMAP with five other tools, it is found that UMAP provides the fastest run times, highest reproducibility and the most meaningful organization of cell clusters.
Abstract: Advances in single-cell technologies have enabled high-resolution dissection of tissue composition. Several tools for dimensionality reduction are available to analyze the large number of parameters generated in single-cell studies. Recently, a nonlinear dimensionality-reduction technique, uniform manifold approximation and projection (UMAP), was developed for the analysis of any type of high-dimensional data. Here we apply it to biological data, using three well-characterized mass cytometry and single-cell RNA sequencing datasets. Comparing the performance of UMAP with five other tools, we find that UMAP provides the fastest run times, highest reproducibility and the most meaningful organization of cell clusters. The work highlights the use of UMAP for improved visualization and interpretation of single-cell data.

Evaluation of UMAP as an alternative to t-SNE for single-cell data
Etienne Becht1, Charles-Antoine Dutertre1, Immanuel W.H. Kwok1, Lai Guan Ng1, Florent Ginhoux1, Evan W. Newell1*
1 Singapore Immunology Network (SigN), Agency for Science, Technology and Research (A*STAR)
* Corresponding author. evan_newell@immunol.a-star.edu.sg
Abstract
Uniform Manifold Approximation and Projection (UMAP) is a recently-published non-linear
dimensionality reduction technique. Another such algorithm, t-SNE, has been the default
method for such tasks in the past years. Herein we comment on the usefulness of UMAP for
high-dimensional cytometry and single-cell RNA sequencing, notably highlighting faster
runtime and consistency, meaningful organization of cell clusters and preservation of
continuums in UMAP compared to t-SNE.
Introduction
The last decades have witnessed a large increase in the number of parameters analysed in
single cell cytometry studies. It currently reaches around 20 for flow-cytometry, 40 for mass-
cytometry, and more than 20,000 in single-cell RNA-sequencing. In this context,
dimensionality reduction techniques have been pivotal in enabling researchers to visualize
high-dimensional data. While principal component analysis has historically been the main
technique used for dimensionality reduction (DR), recent years have highlighted the
importance of non-linear DR techniques to avoid overcrowding issues. Common such
techniques[1] include Isomap, Diffusion Map and t-SNE[2] (also renamed viSNE[3]). t-SNE
is currently the most commonly-used technique and is efficient at highlighting local structure
in the data, which for cytometry notably translates to the representation of cell populations as
distinct clusters. t-SNE however suffers from limitations such as loss of large-scale
information (the inter-cluster relationships), slow computation time and inability to
meaningfully represent very large datasets[4]. A new algorithm, called Uniform Manifold
Approximation and Projection (UMAP), has recently been published by McInnes and
Healy[5]. They claim that compared to t-SNE it preserves as much of the local and more of
the global data structure, with a shorter runtime. Since t-SNE has been extremely prevalent
in the field of cytometry broadly encompassing flow and mass-cytometry as well as single-
cell RNA-sequencing (scRNAseq), we tested these claims on well-characterized single-cell
datasets[6-8]. We confirm that UMAP is an order of magnitude faster than t-SNE. In addition
to this straightforward advantage, we argue that UMAP is not only able to create informative
clusters, but is also able to organize these clusters in a meaningful way. We illustrate these
claims by showing that UMAP can order clusters of T and NK cells from 8 human
organs[7] in a way that identifies both major cell lineages and a common axis that
broadly recapitulates their differentiation stages. We also show that UMAP allows for an
easier visualization of multibranched cellular trajectories by using a mass-cytometry[6] and a
scRNAseq[8] dataset, both recapitulating hematopoiesis.
Results
Faster runtime, equivalent local information and superior global structure
bioRxiv preprint doi: https://doi.org/10.1101/298430; this version posted April 10, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.

We ran UMAP and t-SNE simultaneously on a dataset covering 39 samples originating from
8 distinct human tissues enriched for T and NK cells, comprising more than 350,000 events with 42
protein targets[7]. As observed by McInnes and Healy[5], we measured runtimes that were
significantly lower for UMAP (5 minutes on average for 200,000 cells, versus 2 hours and 22
minutes for Barnes-Hut t-SNE) across a large range of dataset sizes (Figure S1). Using
Phenograph[9] clustering and manual cluster labeling, we classified events into 7 broad cell
populations (Figure S2A). UMAP and t-SNE were both successful at pulling together only
clusters corresponding to similar cell populations with generally very good correspondence
with Phenograph clustering (Figure 1A and Figure S2B). However t-SNE separated cell
populations into distinct clusters more commonly than UMAP, notably splitting CD8 T cells,
gamma-delta T cells and contaminating cells (likely including B cells) into two distinct
clusters each. Although this suggests that t-SNE might be more sensitive in segregating
populations that differ, we were unable to test this quantitatively. We also note that
although these cells were not always segregated into completely distinct clusters by UMAP,
these cell populations remained similarly identifiable in UMAP as compared to t-SNE (Figure
S2B). In addition, UMAP appeared more stable than t-SNE, being more consistent across
distinct replicates and independent subsamples, which should facilitate consistency in its
interpretation (Figure S3). By color-coding the tissues of origin on the UMAP and t-SNE
maps, we observed that t-SNE grouped cell clusters according to their origin more often than
UMAP (Figure 1B and Figure S4). UMAP instead ordered events according to their origin
within each major cluster, roughly from cord-blood and PBMCs, to liver and spleen, and to
tonsils on the one end to skin, gut and lung on the other end. The sample type was not
given as an input to either of these two algorithms. Instead we observed that UMAP was able
to recapitulate the differentiation stage of T cells within each major cluster, as seen by the
expression levels of events for the resident-memory markers CD69 and CD103, the memory
T cell marker CD45RO and naive cells marker CCR7 on the UMAP projection (Figure 1C).
By contrast, while t-SNE identified similar continuums within clusters, they had no apparent
structure along a common axis that made them easily identifiable (Figure 1D).
UMAP better represents the multi-branched trajectory of hematopoietic development
To investigate how UMAP handles continuity of cell phenotypes we applied it alongside t-
SNE on the well-documented topic of bone-marrow hematopoiesis using both a mass-
cytometry (>86,000 events, 25 parameters, 24 cell populations annotated by its authors[6])
and a scRNAseq dataset (three sample types, 51,252 cells, 25,912 dimensions[8]). On the
mass-cytometry dataset, UMAP visually revealed 8 major cell clusters (Figure 2A). One was
composed of all B cell subsets (and close to a small cluster of plasma cells) and one of all T
cell subsets. Four small homogeneous clusters corresponded to macrophages, NK cells,
eosinophils and non-classical monocytes. The last cluster contained 11 out of the 24
manually-gated populations and appeared most interesting with respect to hematopoiesis.
Indeed, these populations were ordered according to a five-leaf branched structure that was
consistent with hematopoietic differentiation: hematopoietic stem cells (HSC) overlapped
with multipotent progenitors (MPP). These cells neighbored common lymphoid progenitors
(CLP) on one side, and common myeloid progenitors (CMP) on the other. CMP led to
megakaryocyte-erythroid progenitors (MEP) which led to unlabelled erythrocytes (Figure S5), and to
granulocyte-monocyte progenitors (GMP). GMP then led to classical monocytes that further
led to myeloid dendritic cells on one branch and to cells labeled as intermediate monocytes
on another branch. UMAP linked basophils to a population of
Lin- cKit+ Sca1- CD34+ FcγRII/III+ FcεRIα+ cells, consistent with a previously-described
phenotype for committed basophil progenitors[10]. These putative progenitors appeared
closer to CMP than GMP, a topic that is still intensely debated. Neutrophils were gated-out
from the dataset by its authors and are thus absent from this representation[6]. t-SNE
identified relatively similar clusters with a few differences (Figure 2B), notably singling out
more clusters than UMAP. CD4+ T cells were separated from other T cell subsets. As noted
by others[11], t-SNE expands low density areas and tends to ignore global relationships.
Thus, while some paths from HSC and MPP to differentiated populations were still apparent
(notably from HSC to monocytes), the overall structure was less clear, as no narrow “neck”
led to larger terminal clusters. t-SNE also separated basophils from their putative precursors
close to CMP and GMP and pDC from CLP. The density of events in the dimensionally-
reduced space also appeared less uniform in t-SNE, with large clusters in the t-SNE space
being less dense than the smaller ones. In contrast, the density of UMAP clusters appeared
more uniform, which could help avoid biases in interpreting phenotypic heterogeneity in large
versus small clusters (Figure S6).
From the scRNAseq dataset we analyzed collectively the transcriptomes of cells isolated
from Bone Marrow (BM), cKit+ BM and Peripheral Blood (PB) to facilitate identifying mature
versus progenitor cell populations (Figure 2C). We first removed low-abundance cell types
such as basophils and eosinophils, contaminants such as mature erythrocytes as well as
outlier cells originating from unique samples and highly expressing mitochondrial transcripts
(Figure S7). Using published cell signatures specific for mouse BM cell populations[12], we
were able to identify cell clusters that corresponded to MPP, MEP, macrophages, B cells, T
cells and NK cells (Figure 2D). Consistent with the UMAP projection of the mass-cytometry
dataset, MPP were found in the middle of a larger group of clusters that led to differentiated
cells originating from PB samples (Figure 2C). PB events consisted of distinct clusters of
lymphocytes (T, NK and B cells), macrophages, MEP and neutrophils (Figure 2D). Although
this does not prove that cells lying between MPPs and differentiated cells are committed
progenitors, these results suggest that UMAP could be used as a hypothesis generating tool
to identify putative markers for such cells. By investigating a small cluster of cells lying in
between MPP and mature B cells in the UMAP projection, we were indeed able to identify
the pre-B cell marker Vpreb3[13] and to hypothesize that Chchd10 could be another gene
marker for pre-B cells in mouse bone marrow (Figure 2E). These conclusions and
hypotheses would have been more difficult to draw using t-SNE which blurred the
relationship of terminal clusters to MPPs (Figure 2D and Figure 2E).
Discussion
Our analysis and the examples provided show that UMAP yields representations that are
as meaningful as those of t-SNE, particularly in its ability to resolve even subtly differing cell
populations. In addition, it provides the useful and intuitively pleasing feature that it
preserves more of the global structure, and notably, the continuity of the cell subsets. In
addition to making plots easier to interpret, we highlight that this also improves its utility for
generating hypotheses related to cellular development. On a practical level, UMAP outputs
are faster to compute and more reproducible than those from t-SNE. Altogether, based on its
ease of use, these results and our other experience so far, we anticipate that UMAP will be a
highly valuable tool that can be rapidly adopted by the single-cell analysis community.
Methods
Datasets
The main characteristics of the datasets we analyzed are presented in Table 1. For the
Wong et al. dataset, based on live CD45+ cell events available on FlowRepository (see
table), non-B (CD19-), non-monocyte (CD14-) cells were selected using FlowJo software. In order
to partially equalize weighting of each human tissue, a maximum of 10,000 events were
randomly sampled from each of the 39 samples prior to analysis. Other datasets were used
as described in the table.
Table 1
Dataset     Identifier                     Single-cell technique  Organism  Tissues                Samples  Single cells  Analyzed parameters
Wong        FlowRepository FR-FCM-ZZTM     Mass-cytometry         Human     8 distinct tissues     39       327,457       39
Samusik_01  FlowRepository FR-FCM-ZZPH     Mass-cytometry         Mouse     Bone marrow            1        86,864        38
Han         Figshare 865e694ad06d5857db4b  Single-cell RNAseq     Mouse     Bone marrow and blood  14       51,252        25,912
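The per-sample downsampling described in the Datasets paragraph (at most 10,000 randomly chosen events per sample) can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `downsample_per_sample`, the dictionary layout and the fixed seed are assumptions made for the example.

```python
import random


def downsample_per_sample(events_by_sample, max_events=10_000, seed=0):
    """Randomly keep at most `max_events` events from each sample.

    `events_by_sample` maps a sample name to a list of events (rows).
    Samples with fewer events than the cap are kept whole.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    kept = {}
    for name, events in events_by_sample.items():
        if len(events) > max_events:
            kept[name] = rng.sample(events, max_events)  # sampling without replacement
        else:
            kept[name] = list(events)
    return kept


# Toy example with two hypothetical samples of very different sizes.
toy = {"sample_A": list(range(25)), "sample_B": list(range(4))}
capped = downsample_per_sample(toy, max_events=10)
```

Capping every sample at the same size keeps very large samples from dominating the embedding, which is the "partial equalization" of tissue weighting mentioned in the text.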
Transformations and pre-processing
For the bone marrow mass-cytometry data we used an arcsinh transformation with a
cofactor of 1, and a logicle transform (parameters w=0.25, t=16409, m=4.5, a=0) for the
Wong dataset. For the scRNAseq dataset we transformed counts into reads per million (thus
normalizing the total number of counts per cell).
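Two of the transformations named above are simple enough to sketch in a few lines. The example below assumes the conventional definitions: arcsinh(x / cofactor) for the arcsinh transformation, and a rescaling of each cell so that its counts sum to one million for reads per million. The function names are hypothetical, and the logicle transform is not covered here.

```python
import math


def arcsinh_transform(values, cofactor=1.0):
    """Apply arcsinh(x / cofactor) to every intensity value."""
    return [math.asinh(v / cofactor) for v in values]


def reads_per_million(counts):
    """Rescale one cell's gene counts so that they sum to one million."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]  # empty cell: nothing to rescale
    return [c * 1_000_000 / total for c in counts]


# Toy inputs: three marker intensities and a two-gene count vector.
transformed = arcsinh_transform([0.0, 1.0, 100.0], cofactor=1.0)
rpm = reads_per_million([1, 3])
```

With a cofactor of 1 the arcsinh transformation is close to linear near zero and close to logarithmic for large intensities, which is why it is a common choice for mass-cytometry data.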
Running UMAP and t-SNE
For both mass-cytometry datasets we ran UMAP using 15 nearest neighbors (nn), a min_dist of 0.2 and
the Euclidean distance. For the scRNAseq dataset we computed 100 approximate principal
components using the IRLBA R package and used them as an input for both t-SNE and
UMAP. We then ran UMAP using 30 nn, a min_dist of 0.1 and
the correlation metric. For t-SNE we used the Barnes-Hut[14] implementation of the
algorithm available in the Rtsne R package, with default parameters.
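For readers who want to reuse the reported settings, the sketch below collects them in one place and shows how they could be passed to the Python umap-learn package. The text does not state which UMAP implementation was used, so the choice of umap-learn, the names `UMAP_PARAMS` and `run_umap`, and the `random_state` value are assumptions; only the parameter values come from the text.

```python
# Parameter values as reported in the text; the keys follow the argument
# names of umap-learn's UMAP class (n_neighbors, min_dist, metric).
UMAP_PARAMS = {
    "mass_cytometry": {"n_neighbors": 15, "min_dist": 0.2, "metric": "euclidean"},
    "scrnaseq": {"n_neighbors": 30, "min_dist": 0.1, "metric": "correlation"},
}

# The scRNAseq data were first reduced to this many principal components.
N_PRINCIPAL_COMPONENTS = 100


def run_umap(data, dataset_kind, random_state=0):
    """Embed `data` (cells x features) in 2D with the reported settings.

    Requires the third-party umap-learn package, imported lazily so that
    the parameter table above can be used without it.
    """
    import umap

    reducer = umap.UMAP(
        n_components=2,
        random_state=random_state,  # assumption: fixed for reproducibility
        **UMAP_PARAMS[dataset_kind],
    )
    return reducer.fit_transform(data)
```

A call such as `run_umap(matrix, "mass_cytometry")` would return one 2D coordinate per cell. For the scRNAseq data, the input would be the 100 principal components rather than the raw expression matrix.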
Cell annotations
For the Samusik_01 dataset we used cell annotations provided by the authors and available
from the public repository. For the Wong et al. dataset we used Phenograph clustering (with
the default parameter k=30) and manually labeled the clusters into broad cell populations. For
the Han et al. dataset we used the AUCell R package[15], which computes the AUC of
gene sets within each single cell, using gene sets from the Haemopedia[12] resource to
annotate cell lineages. We then manually thresholded these AUC scores to obtain
categorical labels. Cells that were assigned to multiple lineages were set to unlabeled.
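The thresholding step for the Han et al. dataset can be illustrated with a small sketch. It assumes that each cell already has one AUC score per lineage; the scores and thresholds below are made-up values, the function name `label_cells` is hypothetical, and treating cells that pass no threshold as unlabeled is an assumption (the text only specifies the multi-lineage case).

```python
def label_cells(auc_scores, thresholds, unlabeled="unlabeled"):
    """Turn per-lineage AUC scores into one categorical label per cell.

    `auc_scores` is a list with one dict per cell (lineage -> AUC score).
    `thresholds` maps each lineage to its manually chosen cutoff.
    """
    labels = []
    for scores in auc_scores:
        hits = [lineage for lineage, score in scores.items()
                if score >= thresholds[lineage]]
        # Exactly one lineage above its cutoff gives a label; zero hits
        # (assumption) or several hits (as in the text) stay unlabeled.
        labels.append(hits[0] if len(hits) == 1 else unlabeled)
    return labels


# Toy example with two hypothetical lineages and three cells.
toy_thresholds = {"B cell": 0.5, "T cell": 0.5}
toy_scores = [
    {"B cell": 0.9, "T cell": 0.1},  # clearly B cell
    {"B cell": 0.7, "T cell": 0.8},  # passes both cutoffs
    {"B cell": 0.2, "T cell": 0.3},  # passes neither cutoff
]
toy_labels = label_cells(toy_scores, toy_thresholds)
```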
Figure legends
Figure 1
UMAP and t-SNE projections of the Wong et al. dataset colored according to A) broad cell
lineages, B) tissue of origin, and for C) UMAP and D) t-SNE, the expression of CD69,
CD103, CD45RO and CCR7. For C) and D), blue denotes minimal expression, beige
intermediate and red high expression.
Figure 2
A) UMAP and B) t-SNE projection of the Samusik_01 dataset. Events are color-coded
according to manual gates provided by the authors of the dataset. C) UMAP and t-SNE
projections of the Han dataset, color-coded by tissue of origin or D) by cell populations. E)
Expression of the V-set pre-B cell surrogate light chain 3 (Vpreb3) and Chchd10 genes
on the UMAP and t-SNE projections of the Han dataset. Blue denotes minimal expression,
beige intermediate and red high expression.
Supplementary Figure 1
Runtime of both UMAP (red) and t-SNE (blue) on randomly-selected subsets of the Wong
dataset using various sampling sizes. 3 subsamples were selected for each subset size and
input to both algorithms. Vertical lines represent standard deviations and are too short to be
visible for most data points.
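The benchmarking procedure behind Supplementary Figure 1 (several subset sizes, 3 random subsamples per size, mean and standard deviation of the runtime) can be sketched with the standard library alone. The `benchmark` function is a hypothetical helper, and `sorted` is used only as a cheap stand-in for an embedding call such as UMAP or t-SNE.

```python
import random
import statistics
import time


def benchmark(embed, data, sizes, n_replicates=3, seed=0):
    """Time `embed` on random subsets of `data` for each subset size.

    Returns {size: (mean_seconds, standard_deviation_seconds)}.
    """
    rng = random.Random(seed)
    results = {}
    for size in sizes:
        timings = []
        for _ in range(n_replicates):
            subset = rng.sample(data, size)  # new random subsample per replicate
            start = time.perf_counter()
            embed(subset)
            timings.append(time.perf_counter() - start)
        results[size] = (statistics.mean(timings), statistics.stdev(timings))
    return results


# Toy data: 200 events with 3 parameters each.
toy_rng = random.Random(1)
toy_data = [[toy_rng.random() for _ in range(3)] for _ in range(200)]
timings = benchmark(sorted, toy_data, sizes=[50, 100])
```

Replacing `sorted` with a function that runs an actual embedding would give the kind of runtime-versus-size curve shown in the figure.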
Supplementary Figure 2
A) Phenotypic characterization of the phenograph clusters. Each cluster medoid is
represented after column-wise Z-score transformation. B) Identification of each phenograph
cluster on both the UMAP (left) and t-SNE projections. For clarity, only twelve clusters are shown per plot.
Supplementary Figure 3
Datapoints were colored according to their position on the UMAP (left) or t-SNE (right)
projection for the full Wong dataset. Then 3 subsets of various sizes were randomly selected
and input to UMAP and t-SNE. The resulting projections were colored according to the full
dataset projections in order to compare positions across random subsets and replicates.
Supplementary Figure 4
UMAP and t-SNE projections of the Wong dataset individually color-coded by tissue of
origin.
Supplementary Figure 5
Expression of Ter119 (a marker for mature erythrocytes) on the UMAP projection of the
Samusik_01 dataset.
Supplementary Figure 6
Heatmap of the density of a 300x300 square grid of the UMAP or t-SNE projections for the
Samusik_01 dataset. The number of events in each bin is color-coded.
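The density heatmap of Supplementary Figure 6 amounts to counting events in the cells of a square grid laid over the 2D projection. A minimal sketch follows; the function name `density_grid` and the choice to span the grid between the minimum and maximum coordinates are assumptions made for the example.

```python
def density_grid(points, n_bins=300):
    """Count 2D points in each cell of an n_bins x n_bins grid.

    The grid spans the observed range of each axis. The result is a list
    of rows, indexed as grid[y_bin][x_bin].
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)

    def bin_index(value, low, high):
        if high == low:
            return 0  # degenerate axis: everything falls in the first bin
        index = int((value - low) / (high - low) * n_bins)
        return min(index, n_bins - 1)  # the maximum value goes in the last bin

    grid = [[0] * n_bins for _ in range(n_bins)]
    for x, y in points:
        grid[bin_index(y, y_min, y_max)][bin_index(x, x_min, x_max)] += 1
    return grid


# Toy projection with four events on a 2 x 2 grid.
toy_grid = density_grid([(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (0.5, 0.5)], n_bins=2)
```

Comparing the spread of the per-bin counts between two projections is one way to quantify how uniform the density of each embedding is.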
Supplementary Figure 7
Top: UMAP projection of the full Han dataset annotated by AUC scores for various cell
lineages (red : high score, blue : low score). Bottom: full Han dataset colored by sample
type, Sample ID and pre-filtering status.
Acknowledgements
We thank members of the Singapore Immunology Network and notably members of the E.N.
laboratory. We thank Shamin Li, Yannick Simoni, Melissa Chng, Yang Cheng, Jack Wee Lim

References
Journal Article
TL;DR: A new technique called t-SNE that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map, a variation of Stochastic Neighbor Embedding that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large datasets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of datasets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the datasets.

30,124 citations

Journal ArticleDOI
22 Dec 2000-Science
TL;DR: An approach to solving dimensionality reduction problems that uses easily measured local metric information to learn the underlying global geometry of a data set and efficiently computes a globally optimal solution, and is guaranteed to converge asymptotically to the true structure.
Abstract: Scientists working with large volumes of high-dimensional data, such as global climate patterns, stellar spectra, or human gene distributions, regularly confront the problem of dimensionality reduction: finding meaningful low-dimensional structures hidden in their high-dimensional observations. The human brain confronts the same problem in everyday perception, extracting from its high-dimensional sensory inputs (30,000 auditory nerve fibers or 10^6 optic nerve fibers) a manageably small number of perceptually relevant features. Here we describe an approach to solving dimensionality reduction problems that uses easily measured local metric information to learn the underlying global geometry of a data set. Unlike classical techniques such as principal component analysis (PCA) and multidimensional scaling (MDS), our approach is capable of discovering the nonlinear degrees of freedom that underlie complex natural observations, such as human handwriting or images of a face under different viewing conditions. In contrast to previous algorithms for nonlinear dimensionality reduction, ours efficiently computes a globally optimal solution, and, for an important class of data manifolds, is guaranteed to converge asymptotically to the true structure.

13,652 citations
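
The abstract above describes learning global geometry from local metric information. A minimal sketch of that idea, assuming the usual neighbourhood-graph reading of it (connect each point to its nearest neighbours, then take shortest-path lengths as estimates of distance along the manifold), is given below. The function names `knn_graph` and `geodesic_distances`, the half-circle toy data and the choice `k=2` are illustrative assumptions, not taken from the paper, and the subsequent embedding step is not shown.

```python
import heapq
import math


def knn_graph(points, k):
    """Build a symmetric graph linking each point to its k nearest neighbours.

    Edge weights are Euclidean distances (the "local metric information").
    """
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        nearest = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in nearest[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def geodesic_distances(graph, source):
    """Shortest-path length from `source` to every node (Dijkstra)."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Toy data: 50 evenly spaced points on a half circle of radius 1.
n = 50
points = [
    (math.cos(math.pi * t / (n - 1)), math.sin(math.pi * t / (n - 1)))
    for t in range(n)
]
graph = knn_graph(points, k=2)
geo = geodesic_distances(graph, 0)

straight = math.dist(points[0], points[-1])  # distance through the ambient space
along_curve = geo[n - 1]                     # graph estimate of the arc length
print(f"straight-line: {straight:.3f}, along the curve: {along_curve:.3f}")
```

On this toy curve the graph distance between the two end points approximates the arc length (close to pi), whereas the straight-line distance is 2, which is the sense in which purely local measurements can recover nonlinear global structure.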

Journal ArticleDOI
TL;DR: An analytical strategy for integrating scRNA-seq data sets based on common sources of variation is introduced, enabling the identification of shared populations across data sets and downstream comparative analysis.
Abstract: Computational single-cell RNA-seq (scRNA-seq) methods have been successfully applied to experiments representing a single condition, technology, or species to discover and define cellular phenotypes. However, identifying subpopulations of cells that are present across multiple data sets remains challenging. Here, we introduce an analytical strategy for integrating scRNA-seq data sets based on common sources of variation, enabling the identification of shared populations across data sets and downstream comparative analysis. We apply this approach, implemented in our R toolkit Seurat (http://satijalab.org/seurat/), to align scRNA-seq data sets of peripheral blood mononuclear cells under resting and stimulated conditions, hematopoietic progenitors sequenced using two profiling technologies, and pancreatic cell 'atlases' generated from human and mouse islets. In each case, we learn distinct or transitional cell states jointly across data sets, while boosting statistical power through integrated analysis. Our approach facilitates general comparisons of scRNA-seq data sets, potentially deepening our understanding of how distinct cell states respond to perturbation, disease, and evolution.

7,741 citations

Posted Content
TL;DR: The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance.
Abstract: UMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical scalable algorithm that applies to real world data. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance. Furthermore, UMAP has no computational restrictions on embedding dimension, making it viable as a general purpose dimension reduction technique for machine learning.

5,390 citations

Journal ArticleDOI
02 Sep 2018
TL;DR: Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction.
Abstract: Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. UMAP has a rigorous mathematical foundation, but is simple to use, with a scikit-learn compatible API. UMAP is among the fastest manifold learning implementations available – significantly faster than most t-SNE implementations.

4,141 citations

Frequently Asked Questions (7)
Q1. What contributions have the authors mentioned in the paper "Evaluation of UMAP as an alternative to t-SNE for single-cell data"?

In this context, dimensionality reduction techniques have been pivotal in enabling researchers to visualize high-dimensional data. Since t-SNE has been extremely prevalent in the field of cytometry, broadly encompassing flow and mass cytometry as well as single-cell RNA-sequencing (scRNAseq), the authors tested these claims on well-characterized single-cell datasets [6-8]. In addition to this straightforward advantage, the authors argue that UMAP is not only able to create informative clusters, but is also able to organize these clusters in a meaningful way. The authors illustrate these claims by showing that UMAP can order clusters of T and NK cells from 8 human organs [7] in a way that identifies both major cell lineages and a common axis that broadly recapitulates their differentiation stages. The authors also show that UMAP allows for an easier visualization of multibranched cellular trajectories by using a mass-cytometry [6] and an scRNAseq [8] dataset, both recapitulating hematopoiesis.

GMP then led to classical monocytes that further led to myeloid dendritic cells on one branch and to cells labeled as intermediate monocytes on another branch. 

t-SNE is currently the most commonly-used technique and is efficient at highlighting local structure in the data, which for cytometry notably translates to the representation of cell populations as distinct clusters. 

The density of events in the dimensionally reduced space also appeared less uniform in t-SNE, with large clusters in the t-SNE space being less dense than the smaller ones.

For the Han et al. dataset the authors used the AUCell R package [15], which computes the AUC of gene sets within each single cell, using gene sets from the Haemopedia [12] resource to annotate cell lineages.
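
To make the idea of a per-cell gene-set AUC concrete, here is a simplified sketch in Python. It is not the AUCell implementation (AUCell is an R package and its exact procedure is not described here); it only assumes the general approach of ranking a cell's genes by expression and measuring how early the gene set is recovered among the top-ranked genes. The function name `gene_set_auc`, the `top_fraction` parameter and the toy expression values are hypothetical.

```python
def gene_set_auc(expression, gene_set, top_fraction=0.05):
    """Score one cell for one gene set (simplified, illustrative only).

    expression:   dict mapping gene name -> expression value in this cell
    gene_set:     iterable of gene names
    top_fraction: share of top-ranked genes over which the curve is integrated
                  (hypothetical parameter, chosen for this sketch)
    Returns a value in [0, 1]; 1 means the set's genes occupy the top ranks.
    """
    # Rank genes from highest to lowest expression (ties broken by name).
    ranked = sorted(expression, key=lambda g: (-expression[g], g))
    max_rank = max(1, int(len(ranked) * top_fraction))

    members = {g for g in gene_set if g in expression}
    if not members:
        return 0.0

    # Recovery curve: number of set genes seen within the top r genes.
    recovered = 0
    area = 0
    for gene in ranked[:max_rank]:
        if gene in members:
            recovered += 1
        area += recovered

    # Best possible area: every set gene sits at the very top of the ranking.
    max_area = sum(min(r, len(members)) for r in range(1, max_rank + 1))
    return area / max_area


# Toy cell with ten genes, g0 most expressed and g9 least expressed.
cell = {f"g{i}": 10 - i for i in range(10)}
top_score = gene_set_auc(cell, {"g0", "g1"}, top_fraction=0.5)
low_score = gene_set_auc(cell, {"g8", "g9"}, top_fraction=0.5)
print(top_score, low_score)
```

Scoring every cell against several lineage gene sets and assigning each cell the best-scoring lineage would be one way to use such a score for annotation; whether the authors did exactly that is not stated here.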

They claim that compared to t-SNE it preserves as much of the local and more of the global data structure, with a shorter runtime. 

In addition, UMAP appeared more stable than t-SNE, being more consistent across distinct replicates and independent subsampling, which should facilitate consistency in its interpretation (Figure S3).

Trending Questions (1)
What are the problems with using UMAP visualizations for single-cell sequencing?

The provided paper does not mention any problems with UMAP visualizations for single-cell sequencing.