Journal Article

Dimensionality reduction for visualizing single-cell data using UMAP.

Etienne Becht¹, Leland McInnes, John Healy, Charles-Antoine Dutertre¹, Immanuel Kwok¹, Lai Guan Ng¹, Florent Ginhoux¹, Evan W. Newell¹,²

Agency for Science, Technology and Research¹, Fred Hutchinson Cancer Research Center²

01 Jan 2019 - Nature Biotechnology (Springer Science and Business Media LLC) - Vol. 37, Iss. 1, pp. 38-44

TL;DR: Comparing the performance of UMAP with five other tools, it is found that UMAP provides the fastest run times, highest reproducibility and the most meaningful organization of cell clusters.

Abstract: Advances in single-cell technologies have enabled high-resolution dissection of tissue composition. Several tools for dimensionality reduction are available to analyze the large number of parameters generated in single-cell studies. Recently, a nonlinear dimensionality-reduction technique, uniform manifold approximation and projection (UMAP), was developed for the analysis of any type of high-dimensional data. Here we apply it to biological data, using three well-characterized mass cytometry and single-cell RNA sequencing datasets. Comparing the performance of UMAP with five other tools, we find that UMAP provides the fastest run times, highest reproducibility and the most meaningful organization of cell clusters. The work highlights the use of UMAP for improved visualization and interpretation of single-cell data.

Summary (2 min read)

Introduction

The optimal power flow (OPF) problem optimizes an objective, such as power loss or generation cost, subject to the power flow equations and operational constraints.
Recently, convex relaxations of the OPF problem have been proposed.
SDP/SOCP relaxation is not always exact, especially when the underlying network is not radial.
In [14], the authors perturbed the objective function; this technique is guaranteed to work in some cases, but in the general case the authors still do not have a rank one feasible solution.

B. OPF and SDP relaxation

The OPF problem seeks to optimize an objective, e.g. total line loss or generation cost, subject to the power flow equations (1) and various operational constraints.
The voltage magnitude at each load bus $i \in N$ needs to be maintained within a prescribed region, i.e. $V_i^{\min} \le |V_i| \le V_i^{\max}$.
Notice that this optimization is non-convex because of the rank constraint $\operatorname{rank} W \le 1$.

III. AN ADMM HEURISTIC FOR THE OPF

The authors apply the ADMM method to derive a heuristic for the nonconvex OPF problem (5).
The rank constraint leads to a tractable non-convex minimization, so the ADMM provides a sequence of convex programs that approximately solves the original non-convex OPF.
Another important property of their algorithm is that if the initial iterates $Z^0, \Lambda^0$ are Hermitian matrices, then all $W^k, Z^k, \Lambda^k$ are Hermitian matrices (Proposition 3).

A. Feasible point

The convergence of the ADMM heuristic for a non-convex problem is still an open question [16].
If it converges, then the authors are guaranteed to have a rank one feasible point of the OPF.
Finally, since W ∗ ∈ C, the authors can conclude that W ∗ is a rank one feasible point in the optimal power flow problem.

B. Stopping criterion and ρ

For the stopping criterion, the authors use the one from [16].
When the algorithm converges, $R^k$ and $S^k$ should be zero.
This gives rise to the following stopping criterion: $\|R^k\|_F \le \epsilon^{\mathrm{pri}}$, $\|S^k\|_F \le \epsilon^{\mathrm{dual}}$.
Lastly, the choice of ρ can be automated based on these residuals.

C. Overall algorithm

Notice that when ρ = 0, the optimization (11) is an SDP relaxation of the OPF.
This helps us not to get trapped, but the price the authors pay is a possible oscillation.

A. Two bus network

It is shown in [3] that the feasible region becomes two disjoint regions under some reactive power constraints on q1, q2, as shown by the black lines on the ellipse.
Hence, a feasible power injection (rank 1 solution) is obtained in this simple network.
Here the mesh network is a ring with 10 nodes and 10 links.
Indeed, when their heuristic converges, it recovers the rank one solution although the SDP relaxation always generates a full rank solution.

V. CONCLUSION

The authors propose a non-convex ADMM heuristic for the OPF.
By introducing a redundant variable whose rank is one, the authors can split the minimization into two steps, where the first step is a convex optimization, and the second step is a rank constrained minimization.
Then, the authors show that the second step, a non-convex optimization, can be carried out analytically.
Moreover, the authors observe the convergence of their heuristic under the existence of hidden rank one solution in the SDP relaxation of the OPF.
Inspired by this, the convergence proof under this

2 Minimum eigenvalue of the solution from the SDP relaxation is greater than 0.01 in all cases.

Evaluation of UMAP as an alternative to t-SNE for single-cell data

Etienne Becht, Charles-Antoine Dutertre, Immanuel W.H. Kwok, Lai Guan Ng, Florent Ginhoux, Evan W. Newell

Singapore Immunology Network (SIgN), Agency for Science, Technology and Research (A*STAR)

Corresponding author: evan_newell@immunol.a-star.edu.sg

Abstract

Uniform Manifold Approximation and Projection (UMAP) is a recently published non-linear dimensionality reduction technique. Another such algorithm, t-SNE, has been the default method for this task in recent years. Herein we comment on the usefulness of UMAP for high-dimensional cytometry and single-cell RNA sequencing, notably highlighting faster runtime and consistency, meaningful organization of cell clusters and preservation of continuums in UMAP compared to t-SNE.

Introduction

The last decades have witnessed a large increase in the number of parameters analysed in single-cell cytometry studies. It currently reaches around 20 for flow-cytometry, 40 for mass-cytometry, and more than 20,000 in single-cell RNA-sequencing. In this context, dimensionality reduction techniques have been pivotal in enabling researchers to visualize high-dimensional data. While principal component analysis has historically been the main technique used for dimensionality reduction (DR), recent years have highlighted the importance of non-linear DR techniques to avoid overcrowding issues. Common such techniques[1] include Isomap, Diffusion Map and t-SNE[2] (also renamed viSNE[3]). t-SNE is currently the most commonly used technique and is efficient at highlighting local structure in the data, which for cytometry notably translates to the representation of cell populations as distinct clusters. t-SNE however suffers from limitations such as loss of large-scale information (the inter-cluster relationships), slow computation time and inability to meaningfully represent very large datasets[4]. A new algorithm, called Uniform Manifold Approximation and Projection (UMAP), has recently been published by McInnes and Healy[5]. They claim that, compared to t-SNE, it preserves as much of the local and more of the global data structure, with a shorter runtime. Since t-SNE has been extremely prevalent in the field of cytometry, broadly encompassing flow and mass-cytometry as well as single-cell RNA-sequencing (scRNAseq), we tested these claims on well-characterized single-cell datasets[6-8]. We confirm that UMAP is an order of magnitude faster than t-SNE. In addition to this straightforward advantage, we argue that UMAP is not only able to create informative clusters, but is also able to organize these clusters in a meaningful way. We illustrate these claims by showing that UMAP can order clusters of T and NK cells from 8 human organs[7] in a way that identifies both major cell lineages and a common axis that broadly recapitulates their differentiation stages. We also show that UMAP allows for an easier visualization of multibranched cellular trajectories by using a mass-cytometry[6] and a scRNAseq[8] dataset, both recapitulating hematopoiesis.

Results

Faster runtime, equivalent local information and superior global structure

bioRxiv preprint, this version posted April 10, 2018; doi: https://doi.org/10.1101/298430. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.

We ran UMAP and t-SNE simultaneously on a dataset covering 39 samples originating from 8 distinct human tissues enriched for T and NK cells, comprising more than 350,000 events with 42 protein targets[7]. As observed by McInnes and Healy[5], we measured runtimes that were significantly lower (5 minutes on average for UMAP for 200,000 cells, versus 2 hours and 22 minutes for Barnes-Hut t-SNE) across a large range of dataset sizes (Figure S1). Using Phenograph[9] clustering and manual cluster labeling, we classified events into 7 broad cell populations (Figure S2A). UMAP and t-SNE were both successful at pulling together only clusters corresponding to similar cell populations, with generally very good correspondence with Phenograph clustering (Figure 1A and Figure S2B). However, t-SNE separated cell populations into distinct clusters more commonly than UMAP, notably splitting CD8 T cells, gamma-delta T cells and contaminating cells (likely including B cells) into two distinct clusters each. Although this suggests that t-SNE might be more sensitive in segregating populations that differ, we were unable to test this quantitatively. We also note that although these cells were not always segregated into completely distinct clusters by UMAP, these cell populations remained similarly identifiable in UMAP as compared to t-SNE (Figure S2B). In addition, UMAP appeared more stable than t-SNE, being more consistent across distinct replicates and independent subsampling, which should facilitate consistency in its interpretation (Figure S3). By color-coding the tissues of origin on the UMAP and t-SNE maps, we observed that t-SNE grouped cell clusters according to their origin more often than UMAP (Figure 1B and Figure S4). UMAP instead ordered events according to their origin within each major cluster, roughly from cord-blood and PBMCs, to liver and spleen, and to tonsils on the one end, to skin, gut and lung on the other end. The sample type was not given as an input to either of these two algorithms. Instead we observed that UMAP was able to recapitulate the differentiation stage of T cells within each major cluster, as seen by the expression levels of events for the resident-memory markers CD69 and CD103, the memory T cell marker CD45RO and the naive cell marker CCR7 on the UMAP projection (Figure 1C). By contrast, while t-SNE identified similar continuums within clusters, they had no apparent structure along a common axis that made them easily identifiable (Figure 1D).

UMAP better represents the multi-branched trajectory of hematopoietic development

To investigate how UMAP handles continuity of cell phenotypes, we applied it alongside t-SNE on the well-documented topic of bone-marrow hematopoiesis using both a mass-cytometry dataset (>86,000 events, 25 parameters, 24 cell populations annotated by its authors[6]) and a scRNAseq dataset (three sample types, 51,252 cells, 25,912 dimensions[8]). On the mass-cytometry dataset, UMAP visually revealed 8 major cell clusters (Figure 2A). One was composed of all B cell subsets (and close to a small cluster of plasma cells) and one of all T cell subsets. Four small homogeneous clusters corresponded to macrophages, NK cells, eosinophils and non-classical monocytes. The last cluster contained 11 out of the 24 manually gated populations and appeared most interesting with respect to hematopoiesis. Indeed, these populations were ordered according to a five-leaf branched structure that was consistent with hematopoietic differentiation: hematopoietic stem cells (HSC) overlapped with multipotent progenitors (MPP). These cells neighbored common lymphoid progenitors (CLP) on one side, and common myeloid progenitors (CMP) on the other. CMP led to myeloid-erythroid progenitors (MEP), which led to unlabelled erythrocytes (Figure S5), and to granulocyte-myeloid progenitors (GMP). GMP then led to classical monocytes that further led to myeloid dendritic cells on one branch and to cells labeled as intermediate monocytes on another branch. UMAP linked basophils to a population of Lin− cKit Sca1− CD34 FcγRII/III FcεRIα cells, consistent with a previously described phenotype for committed basophil progenitors[10]. These putative progenitors appeared closer to CMP than GMP, a topic that is still intensely debated. Neutrophils were gated out from the dataset by its authors and are thus absent from this representation[6]. t-SNE identified relatively similar clusters with a few differences (Figure 2B), notably singling out more clusters than UMAP. CD4 T cells were separated from other T cell subsets. As noted by others[11], t-SNE expands low density areas and tends to ignore global relationships. Thus, while some paths from HSC and MPP to differentiated populations were still apparent, notably from HSC to monocytes, the overall structure was less clear, as no narrow “neck” led to larger terminal clusters. t-SNE also separated basophils from their putative precursors close to CMP and GMP, and pDC from CLP. The density of events in the dimensionally-reduced space also appeared less uniform in t-SNE, with large clusters in the t-SNE space being less dense than the smaller ones. In contrast, the density of UMAP clusters appeared more uniform, which could help avoid biases in interpreting phenotypic heterogeneity in large versus small clusters (Figure S6).

From the scRNAseq dataset we analyzed collectively the transcriptomes of cells isolated from Bone Marrow (BM), cKit BM and Peripheral Blood (PB) to facilitate identifying mature versus progenitor cell populations (Figure 2C). We first removed low-abundance cell types such as basophils and eosinophils, contaminants such as mature erythrocytes, as well as outlier cells originating from unique samples and highly expressing mitochondrial transcripts (Figure S7). Using published cell signatures specific for mouse BM cell populations[12], we were able to identify cell clusters that corresponded to MPP, MEP, macrophages, B cells, T cells and NK cells (Figure 2D). Consistent with the UMAP projection of the mass-cytometry dataset, MPP were found in the middle of a larger group of clusters that led to differentiated cells originating from PB samples (Figure 2C). PB events consisted of distinct clusters of lymphocytes (T, NK and B cells), macrophages, MEP and neutrophils (Figure 2D). Although this does not prove that cells lying between MPPs and differentiated cells are committed progenitors, these results suggest that UMAP could be used as a hypothesis-generating tool to identify putative markers for such cells. By investigating a small cluster of cells lying between MPP and mature B cells in the UMAP projection, we were indeed able to identify the pre-B cell marker Vpreb3[13] and to hypothesize that Chchd10 could be another gene marker for pre-B cells in mouse bone marrow (Figure 2E). These conclusions and hypotheses would have been more difficult to draw using t-SNE, which blurred the relationship of terminal clusters to MPPs (Figure 2D and Figure 2E).

Discussion

Our analysis and the examples provided show that UMAP seems to yield representations that are as meaningful as those of t-SNE, particularly in its ability to resolve even subtly differing cell populations. In addition, it provides the useful and intuitively pleasing feature that it preserves more of the global structure, and notably, the continuity of the cell subsets. In addition to making plots easier to interpret, we highlight that this also improves its utility for generating hypotheses related to cellular development. On a practical level, UMAP outputs are faster to compute and more reproducible than those from t-SNE. Altogether, based on its ease of use, these results and our other experience so far, we anticipate that UMAP will be a highly valuable tool that can be rapidly adopted by the single-cell analysis community.

Methods

Datasets

The main characteristics of the datasets we analyzed are presented in Table 1. For the Wong et al. dataset, non-B (CD19), non-monocyte (CD14) events were selected using FlowJo software from the live CD45+ cell events available on FlowRepository (see table). In order to partially equalize the weighting of each human tissue, a maximum of 10,000 events were randomly sampled from each of the 39 samples prior to analysis. Other datasets were used as described in the table.

Dataset | Identifier | Single-cell technique | Organism | Tissues | Samples | Single cells | Analyzed parameters
Wong | FlowRepository FR-FCM-ZZTM | Mass-cytometry | Human | 8 distinct tissues | 39 | 327,457 | 39
Samusik_01 | FlowRepository FR-FCM-ZZPH | Mass-cytometry | Mouse | Bone marrow | 1 | 86,864 | 38
Han | Figshare 865e694ad06d5857db4b | Single-cell RNAseq | Mouse | Bone marrow and blood | 14 | 51,252 | 25,912
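
The per-sample cap described for the Wong et al. dataset can be illustrated with a minimal sketch. This is not the code used for the analysis: the function name `subsample_events`, the dictionary input layout and the fixed seed are assumptions made for illustration, and Python is used here only for convenience.

```python
import random


def subsample_events(events_by_sample, max_events=10_000, seed=0):
    """Keep at most `max_events` randomly chosen events per sample.

    `events_by_sample` maps a sample identifier to a list of events
    (hypothetical layout). Samples at or below the cap are kept whole.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    subsampled = {}
    for sample_id, events in events_by_sample.items():
        if len(events) > max_events:
            # Sampling without replacement, so no event is duplicated.
            subsampled[sample_id] = rng.sample(events, max_events)
        else:
            subsampled[sample_id] = list(events)
    return subsampled


# Toy usage with a small cap instead of 10,000.
toy = {"tissue_a": list(range(25)), "tissue_b": list(range(5))}
capped = subsample_events(toy, max_events=10)
```

Capping rather than sampling a fixed fraction keeps large samples from dominating the embedding while leaving small samples untouched, which matches the stated goal of partially equalizing the weight of each tissue.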

Transformations and pre-processing

For the bone marrow mass-cytometry data we used an arcsinh transformation with a cofactor of 1, and for the Wong dataset a logicle transform (parameters w=0.25, t=16409, m=4.5, a=0). For the scRNAseq dataset we transformed counts into reads per million (thus normalizing the number of counts per cell to 1).
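
As a minimal sketch of two of these transformations, the snippet below applies an arcsinh transform with a cofactor and rescales a cell's counts to reads per million. It assumes the common convention arcsinh(x / cofactor); the function names are illustrative, and the logicle transform is not covered. Rescaling to one million differs from normalizing the counts to sum to one only by a constant factor.

```python
import math


def arcsinh_transform(values, cofactor=1.0):
    """Apply asinh(x / cofactor) to each intensity (assumed convention)."""
    return [math.asinh(v / cofactor) for v in values]


def counts_per_million(counts):
    """Rescale one cell's counts so that they sum to one million."""
    total = sum(counts)
    if total == 0:
        # An empty cell has no counts to rescale.
        return [0.0 for _ in counts]
    return [c * 1_000_000 / total for c in counts]


# Toy usage on a single cell.
intensities = arcsinh_transform([0.0, 1.0, 100.0], cofactor=1.0)
cpm = counts_per_million([1, 3, 6])
```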

Running UMAP and t-SNE

For both mass-cytometry datasets we ran UMAP using 15 nearest neighbors (nn), a min_dist of 0.2 and euclidean distance. For the scRNAseq dataset we computed 100 approximate principal components using the IRLBA R package and used them as an input for both t-SNE and UMAP. We then ran UMAP using 30 nn, a min_dist of 0.1 and the “correlation” metric. For t-SNE we ran the Barnes-Hut[14] implementation of the t-SNE algorithm through the Rtsne R package, using default parameters.
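
The UMAP settings above can be collected in one place, as in the sketch below. The text does not state which UMAP implementation was run, so the use of the Python `umap-learn` package in `make_umap` is an assumption; the dictionary keys and the names `UMAP_SETTINGS` and `make_umap` are illustrative. Only the parameter values come from the text.

```python
# Parameter values quoted in the text; the layout is illustrative.
UMAP_SETTINGS = {
    "mass_cytometry": {"n_neighbors": 15, "min_dist": 0.2, "metric": "euclidean"},
    "scRNAseq": {"n_neighbors": 30, "min_dist": 0.1, "metric": "correlation"},
}


def make_umap(dataset_kind):
    """Build a reducer with the settings for `dataset_kind`.

    Assumes the umap-learn package is installed; it is imported lazily so
    that the settings can be inspected without it.
    """
    import umap

    return umap.UMAP(**UMAP_SETTINGS[dataset_kind])
```

With `umap-learn` available, `make_umap("scRNAseq").fit_transform(pcs)` would embed a matrix `pcs` of principal components, here a hypothetical array with one row per cell.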

Cell annotations

For the Samusik_01 dataset we used cell annotations provided by the authors and available from the public repository. For the Wong et al. dataset we used Phenograph clustering (with the default parameter k=30) and manually labeled the clusters into broad cell populations. For the Han et al. dataset we used the AUCell R package[15], which computes the AUC of gene sets within each single cell, using gene sets from the Haemopedia[12] resource to annotate cell lineages. We then manually thresholded these AUC scores to obtain categorical labels. Cells that were assigned to multiple lineages were set to unlabeled.
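
The thresholding step for the Han et al. dataset can be sketched as follows. The lineage names, scores and thresholds are made-up toy values, and two details are assumptions not stated in the text: a score passes when it is greater than or equal to its threshold, and cells passing no threshold are also left unlabeled.

```python
def assign_lineage(auc_scores, thresholds):
    """Turn per-lineage AUC scores for one cell into a categorical label.

    A cell gets a lineage label only if exactly one score passes its
    threshold; cells passing several thresholds are set to "unlabeled".
    """
    passing = [
        lineage
        for lineage, score in auc_scores.items()
        if score >= thresholds[lineage]  # assumed comparison rule
    ]
    return passing[0] if len(passing) == 1 else "unlabeled"


# Toy usage with hypothetical thresholds.
thresholds = {"B cell": 0.3, "T cell": 0.3}
single = assign_lineage({"B cell": 0.4, "T cell": 0.1}, thresholds)
ambiguous = assign_lineage({"B cell": 0.4, "T cell": 0.5}, thresholds)
none_passing = assign_lineage({"B cell": 0.1, "T cell": 0.1}, thresholds)
```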

Figure legends

Figure 1

UMAP and t-SNE projections of the Wong et al. dataset colored according to A) broad cell lineages, B) tissue of origin, and for C) UMAP and D) t-SNE, the expression of CD69, CD103, CD45RO and CCR7. For C) and D), blue denotes minimal expression, beige intermediate and red high expression.

Figure 2

A) UMAP and B) t-SNE projection of the Samusik_01 dataset. Events are color-coded according to manual gates provided by the authors of the dataset. C) UMAP and t-SNE projections of the Han dataset, color-coded by tissue of origin or D) by cell populations. E) Expression of the V-set pre-B cell surrogate light chain 3 (Vpreb3) and Chchd10 genes on the UMAP and t-SNE projections of the Han dataset. Blue denotes minimal expression, beige intermediate and red high expression.

Supplementary Figure 1

Runtime of both UMAP (red) and t-SNE (blue) on randomly selected subsets of the Wong dataset using various sampling sizes. 3 subsamples were selected for each subset size and input to both algorithms. Vertical lines represent standard deviations and are too short to be visible for most data points.

Supplementary Figure 2

A) Phenotypic characterization of the Phenograph clusters. Each cluster medoid is represented after column-wise Z-score transformation. B) Identification of each Phenograph cluster on both the UMAP (left) and t-SNE (right) projections. For clarity, only twelve clusters are shown per plot.

Supplementary Figure 3

Datapoints were colored according to their position on the UMAP (left) or t-SNE (right) projection for the full Wong dataset. Then 3 subsets of various sizes were randomly selected and input to UMAP and t-SNE. The resulting projections were colored according to the full dataset projections in order to compare positions across random subsets and replicates.

Supplementary Figure 4

UMAP and t-SNE projections of the Wong dataset individually color-coded by tissue of origin.

Supplementary Figure 5

Expression of Ter119 (a marker for mature erythrocytes) on the UMAP projection of the Samusik_01 dataset.

Supplementary Figure 6

Heatmap of the density of a 300x300 square grid of the UMAP or t-SNE projections for the Samusik_01 dataset. The number of events in each bin is color-coded.
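
The binning behind such a density heatmap can be sketched as below. This is an illustrative reconstruction, not the code used for the figure: the function name `grid_density`, the list-of-tuples input and the choice to span the grid from the minimum to the maximum coordinate on each axis are assumptions.

```python
def grid_density(points, bins=300):
    """Count 2D points in each cell of a `bins` x `bins` square grid.

    The grid spans the observed coordinate range on each axis (assumed).
    Returns a nested list where counts[i][j] is the number of points in
    column i (x axis) and row j (y axis).
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    # Fall back to a span of 1.0 if all points share a coordinate.
    x_span = (max(xs) - x_min) or 1.0
    y_span = (max(ys) - y_min) or 1.0

    counts = [[0] * bins for _ in range(bins)]
    for x, y in points:
        # Points on the upper edge are clamped into the last bin.
        i = min(int((x - x_min) / x_span * bins), bins - 1)
        j = min(int((y - y_min) / y_span * bins), bins - 1)
        counts[i][j] += 1
    return counts


# Toy usage on a 2 x 2 grid instead of 300 x 300.
toy_points = [(0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
toy_grid = grid_density(toy_points, bins=2)
```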

Supplementary Figure 7

Top: UMAP projection of the full Han dataset annotated by AUC scores for various cell lineages (red: high score, blue: low score). Bottom: full Han dataset colored by sample type, sample ID and pre-filtering status.

Acknowledgements

We thank members of the Singapore Immunology Network and notably members of the E.N. laboratory. We thank Shamin Li, Yannick Simoni, Melissa Chng, Yang Cheng, Jack Wee Lim

Frequently Asked Questions (7)

Q1. What contributions have the authors mentioned in the paper "Evaluation of umap as an alternative to t-sne for single-cell data" ?

In this context, dimensionality reduction techniques have been pivotal in enabling researchers to visualize high-dimensional data. Since t-SNE has been extremely prevalent in the field of cytometry, broadly encompassing flow and mass-cytometry as well as single-cell RNA-sequencing (scRNAseq), the authors tested these claims on well-characterized single-cell datasets[6-8]. In addition to this straightforward advantage, the authors argue that UMAP is not only able to create informative clusters, but is also able to organize these clusters in a meaningful way. The authors illustrate these claims by showing that UMAP can order clusters of T and NK cells from 8 human organs[7] in a way that identifies both major cell lineages and a common axis that broadly recapitulates their differentiation stages. The authors also show that UMAP allows for an easier visualization of multibranched cellular trajectories by using a mass-cytometry[6] and a scRNAseq[8] dataset, both recapitulating hematopoiesis.

Q2. What led to myeloid dendritic cells on one branch?

GMP then led to classical monocytes that further led to myeloid dendritic cells on one branch and to cells labeled as intermediate monocytes on another branch.

Q3. What is the common technique used in cytometry?

t-SNE is currently the most commonly-used technique and is efficient at highlighting local structure in the data, which for cytometry notably translates to the representation of cell populations as distinct clusters.

Q4. What was the density of events in the t-SNE space?

The density of events in the dimensionally-reduced space also appeared less uniform in t-SNE, with large clusters in the t-SNE space being less dense than the smaller ones.

Q5. What was the main characteristic of the datasets used?

For the Han et al. dataset the authors used the AUCell R package[15], which computes the AUC of gene sets within each single cell, using gene sets from the Haemopedia[12] resource to annotate cell lineages.

Q6. What do they claim to be the advantages of UMAP over t-SNE?

They claim that compared to t-SNE it preserves as much of the local and more of the global data structure, with a shorter runtime.

Q7. How stable was UMAP compared with t-SNE?

In addition, UMAP appeared more stable than t-SNE, being more consistent across distinct replicates and independent subsampling, which should facilitate consistency in its interpretation (Figure S3).
