pcaReduce: hierarchical clustering of single cell transcriptional profiles

doi:10.1186/S12859-016-0984-Y

Žurauskien

˙

eandYauBMC Bioinformatics

(2016) 17:140

DOI 10.1186/s12859-016-0984-y

METHODOLOGY ARTICLE Open Access

pcaReduce: hierarchical clustering of

single cell transcriptional profiles

Justina Žurauskien

˙

e

1

and Christopher Yau

1,2*

Abstract

Background: Advances in single cell genomics provide a way of routinely generating transcriptomics data at the

single cell level. A frequent requirement of single cell expression analysis is the identification of novel patterns of

heterogeneity across single cells that might explain complex cellular states or tissue composition. To date, classical

statistical analysis tools have being routinely applied, but there is considerable scope for the development of novel

statistical approaches that are better adapted to the challenges of inferring cellular hierarchies.

Results: We have developed a novel agglomerative clustering method that we call pcaReduce to generate a cell state

hierarchy where e ach cluster branch is associated with a principal component of variation that can be used to

differentiate two cell states. Using two real single cell datasets, we compared our approach to other commonly used

statistical techniques, such as K-means and hierarchical clustering. We found that pcaReduce was able to give more

consistent clustering structures when compared to broad and detailed cell type labels.

Conclusions: Our novel integration of principal components analysis and hierarchical clustering establishes a

connection between the representation of the expression data and the number of cell types that can be discovered.

In doing so we found that pcaReduce performs better than either technique in isolation in terms of characterising

putative cell states. Our methodology is complimentary to other single cell clustering techniques and adds to a

growing palette of single cell bioinformatics tools for profiling heterogeneous cell populations.

Keywords: Single cell RNA-Seq, Hierarchical clustering, Gene expression

Background

Recent advances in single cell RNA sequencing (scRNA-

seq) technology has enabled the routine high-throughput

collection of quantitative gene expression measurements

across a range of tissue types and diversity of cellular states

at the level of the single cell [1–6]. The application of sin-

gle cell gene expression profiling has identified cell-to-cell

expression variability in phenotypically and/or genetically

identical cells that is masked in standard “population”

gene expression studies where the transcriptomes of thou-

sands to millions of cells are simultaneously measured and

averaged. This expression variability is driven by stochas-

tic gene expression mechanisms whose effects cannot be

measured in the context of a population of cells but only

*Correspondence: cyau@well.ox.ac.uk

1

Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt

Drive, OX3 7BN Oxford, UK

2

Department of Statistics, University of Oxford, 1 S. Parks Rd, OX1 3TG Oxford,

UK

through the microcosm of a single cell. Consequently,

scRNA-seq has increasingly become the method of choice

in discovering molecular underpinnings of complex and

rare cell populations [7, 8], assessing tissue composition

[9–11], studying various diseases [12] and cell develop-

ment/lineage processes [13–16].

Our particular focus in this article is the utility of

scRNA-seq data to enable the identification of function-

ally distinct sub-populations that each possesses a dif-

ferent pattern of gene expression activity [17, 18]. These

sub-populations could indicate different cell types that

exhibit relatively stable, static behaviour but also cell states

representing intermediate stages in transient processes.

Traditionally, cell types have been defined by the func-

tional behaviour of certain cellular features, for example,

CD14+ monocytes show CD14 expression, but with the

availability of scRNA-seq the potential exists to develop a

richer taxonomy of cell types by extending the molecular

features used for characterisation to consider the whole

˙

eandYau.Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0

International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and

reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the

Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver

(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Žurauskien

˙

eandYauBMC Bioinformatics

(2016) 17:140

Page 2 of 11

transcriptome. The population of CD14 expressing mono-

cytes might in fact be a collection of distinct cell subtypes

each sharing a common CD14 expression signature but

also possessing a unique expression pattern of their own.

Unbiased discovery of cell types from scRNA-seq data

can be automated using unsupervised clustering algo-

rithms. Given expression profiles for a collection of single

cells, the objective of the algorithm is to partition the cells

into a number of cell types such that each cell type has

a significantly distinctive expression signature from the

others. Single cell analytical software pipelines have been

developed recently for single cell analysis that include

procedures for unbiased cell type identification.

In RaceID [19], K-means clustering with gap statistics

was used to identify six intestinal cell types, while rare

cell types were identified by examining outliers that could

not be explained by a background noise model. The Back-

SPIN method [9] uses a customized version of the SPIN

algorithm [20]. When applying the BackSPIN method to

3005 mouse cortex and hippocampus cells, the algorithm

identified 77 groups, stopping after 12 splits along the

deepest branch. This was manually limited to 5 splits per

branch and subsequent merging of the neuronal groups

into three main groups (each containing four subgroups)

and oligodendrocytes into one group (containing 3 sub-

groups) resulted in the nine main cell classes given in

the study [9]. Seurat [13, 21] applies a two-dimensional

t-SNE projection [22] to the most significant principal

component scores (a process they refer to as “spectral

t-SNE”) from a principal component analysis of the sin-

gle cell expression matrix. Density clustering (DBSCAN)

[23] is then used to identify cell type clusters in the two-

dimensional space. Seurat was used to classify 44,808

Drop-seq single cell expression profiles into 39 retinal cell

populations [21]. The SINCERA package [24] uses hier-

archical clustering using centered Pearson’s correlation

and average linkage as default settings for the similarity

measurement and linkage method respectively. SNN-cliq

[25] uses the concept of shared nearest neighbour (SNN)

to define similarities between data points (cells) and

achieves clustering by a graph theory-based algorithm.

Finally, the SC3 approach [26] which uses spectral trans-

formations, the K-means algorithm and complete-linkage

to perform consensus clustering.

A limitation of these methods is that they do not estab-

lish a connection between the representation of the data

to the number and nature of the cell types that can be

resolved. For example, Fig. 1 illustrates three clustering

structures derived from a single cell study of mouse sen-

sory neurons [27]. Four broad sensory neuronal cell types

(NF, TH, PEP, NP) were identified by examining clusters

of cells in the subspace spanned by the first few prin-

cipal components (PC2-4 shown in Fig. 1a) and using

expression of key (known) cell markers to label the clus-

ters. Using information contained in additional principal

components, the four major cell types could then be sub-

divided into further distinct cell subtypes. The presence

of these refined cell subtypes is clearly not obvious from

a visual inspection of the data in the subspace spanned by

PC2-4 (Fig. 1b,c).

We have developed an agglomerative clustering

approach that integrates principal components analysis

(PCA) and hierarchical clustering that we call pcaReduce.

This method seeks to establish a connection between the

reduced representations given by principal components

analysis and the number of resolvable cell types (clusters).

The approach is driven by the expectation that informa-

tion pertaining to large, broad classes of cell types are

likely be contained in low dimensional PC representa-

tions whilst refined cell type structure are only defined

in high dimensional PC representations. Our proposed

method is similar to the “iterative” principal component

analysis approaches used to establish putative cell types

in [27, 28]. However, the question we asked was whether

Fig. 1 Cellular hierarchies. Three hierarchically related clustering structures for a single cell mouse neuronal dataset [27]. The data has been

projected on to the first four principle directions, we report the three that allows best data visualisation; we used the given cellular labels to colour

cells according to the a 4, b 8, and c 11 cell subtypes identified in the original study

Žurauskien

˙

eandYauBMC Bioinformatics

(2016) 17:140

Page 3 of 11

we could recapitulate such results fully automatically in

an unsupervised fashion without prior knowledge of cell

type markers that was used in these studies. This ques-

tion is important in study conditions where there maybe

little or unreliable prior knowledge of cell types. We will

test our methods against existing approaches but note

that many existing approaches typically utilise different

combinations of standard dimensionality reduction and

clustering algorithms. Therefore, in our investigations,

instead of using these packages directly, we will explore

the utility of the constituents components (which might

be shared between approaches).

Methods

Let X

n×d

denote a gene expression matrix, where n

is the number of cells measured across d number of

genes; i.e. each cell x

i

={x

i1

, ..., x

id

} is a d-dimensional

object. Further assume that Y

n×q

denotes a score matrix,

obtained after projecting data into first q principle direc-

tions, and Y

i

denotes a subset of cells, Y

i

⊂ Y.

Our clustering algorithm begins by performing a

K-means clustering operation on the projection of the

original gene expression matrix, X

n×d

,tothetopK − 1

principal directions. The number of initial clusters K is

set to a sufficiently large value, say 30, to ensure most

cell types will be captured. Once the initial clusters are

determined, we take two subsets (Y

i

, Y

j

)thatoriginate

from a pair of clusters (i, j) respectively, and calculate the

probability for those observations to be merged together,

p({Y

i

, Y

j

}|μ

ij

, 

ij

). We assume that the probability den-

sity function is a multivariate Gaussian with mean and

covariance matrix given by:

μ

ij

=

n

i

n

i

+ n

j

μ

i

+

n

j

n

i

+ n

j

μ

j

,



ij

=

n

i

n

i

+ n

j



i

+

n

j

n

i

+ n

j



j

,

(1)

where (n

i

, n

j

), (μ

i

, μ

j

) and (

i

, 

j

) denote the sizes, cen-

troids and covariances of the clusters i and j respec-

tively. We repeat this for all possible pairs (i, j).Wethen

choose to merge two clusters by either (i) picking the pair

that has the highest probability or (ii) sampling a pair

of clusters to merge in proportion to their (normalised)

merged probabilities. The number of clusters will now

decrease to K − 1. We then project the data matrix on

to the first K − 2 principal directions, i.e. removing the

(K − 1)-st principal component that explains the lowest

degree of variance in the data, removing this dimen-

sion from the existing cluster centroids and covariance

matrices.

The above clustering operation is then repeated so that

after every merge operation we remove a principal direc-

tion until only a single cluster remains. If sampling-based

merge operations are used, the whole process can be

repeated to obtain a number of alternative clusterings.

This will be useful for assessing the stability of the cluster-

ing results. Algorithm 1 gives a pseudo-code description.

Algorithm 1: The pcaReduce algorithm. Here

X

n×d

is a gene expression matrix with n cells

(given in rows) and d genes (in columns); q is the

number of dimensions – effectively this refers to

the number of levels in the hierarchy; Y is a score

matrix, which is the output of PCA algorithm; μ

ij

and 

ij

definition are given in Eq. (1); (i) and (ii)

denote two different merging settings: (i) merg-

ing is based on largest probability P(i, j) value; (ii)

merging is based on sampling according to P(i, j)

distribution.

Input: X

n×d

and q ;

Output:acollectionofq clusterings;

1 Y ←− PCA(X

n×d

, dim=q);

2 (μ, ) ←− kmeans(Y, K = q + 1);

3 Q ←− q − 1;

4 for r = 1, ..., Q do

5 for all possible pairs (i, j) in

{1, ..., K}×{1, ..., K} do

6 P(i, j) ←− p



Y

i

∪ Y

j

|μ

ij

, 

ij



;

7 end for

8 (i) either choose pair (i, j) with largest

probability P(i, j) and merge clusters i and j;

9 (ii) or sample a pair (i, j) with probability

P(i, j) and merge clusters i and j;

10 q ←− q − 1;

11 Y ← Y

n×q

(i.e. remove last dimension);

12 update (μ, );

13 K ← K − 1;

14 end for

Figure 2 gives an illustration of our method using an

autoencoder representation. Autoencoders are feedfor-

ward neural networks that accept d-dimensional input

data and report d-dimensional output data. The d nodes

in the input and output network layers are connected

via one or more hidden layers. Data transformations are

applied between each layer of the network. If a hid-

den layer has fewer nodes than its predecessor then

the information from the previous layer in the autoen-

coder network is forced into a lower dimensional form

hence performing dimensionality reduction. Each hidden

layer encodes a reduced dimensional representation of

the input data. In an autoencoder, the parameters gov-

erning the data transformations between the layers are

fitted to minimise the mean-squared error between the

original input data and the output representation. It can

Žurauskien

˙

eandYauBMC Bioinformatics

(2016) 17:140

Page 4 of 11

Fig. 2 Method illustration using an autoencoder network. Clustering is applied to the data representation at each linear hidden layer. If there are

K − 1 linear hidden units, the data is projected into a subspace spanned by the top K − 1 principal components. Consistency between the

clusterings at each layer is maintained by enforcing a hierarchical constraint. a Graphical interpretation of an autoencoder network(s). b

Corresponding hierarchical structure

be shown that, when using linear transformations, the

optimal autoencoder is equivalent to doing principal com-

ponents analysis [29]. Using this analogy it is now clear

that pcaReduce can be seen as performing hierarchical

clustering on the different hidden layers of a linear autoen-

coder network linking different clustering structures to

different hidden layer representations of the input data.

This analogy is useful to explain our algorithm and why we

remove one principal component after each merge oper-

ation. We are in fact maintaining clustering consistency

across the hidden layer representations of the input data

in the autoencoder network.

Results

To demonstrate the performance of pcaReduce method,

we considered two single cell RNA-seq dataset exam-

ples. The first contains a collection of cells originating

from diverse biological tissues [30]; and the second dataset

the mouse sensory neuronal cells [27] discussed in the

Introduction. These were selected as they contained pre-

existing hierarchical cluster structures that can be used

to assess unsupervised algorithmic performance. Here we

show that pcaReduce can be applied to re-capture the

known cellular hierarchies and we compare to other statis-

tical techniques, which are commonly applied to address

similar cell sub-typing problems. Below, all examples were

implemented using the free statistical computing platform

R (www.r-project.org).

Cells from disparate tissues

We obtained single cell RNA-seq dataset [30] for 300 cells

whose transcriptional measurements were taken across

8,686 genes (see Additional file 1, Section A for further

details on data preparation). The data were derived from

11 cell types: K562 – myeloid (chronic leukemia), HL60 –

myeloid (acute leukemia), CRL-2339 – lymphoblastoid;

Žurauskien

˙

eandYauBMC Bioinformatics

(2016) 17:140

Page 5 of 11

iPS – pluripotent; CRL-2338 – epithelial, BJ – fibrob-

last (from human foreskin), Kera – foreskin keratinocyte;

NPC – neural progenitor cells, GW(16, 21, 21+3) – gesta-

tional week (16,21, 21+3 weeks), fetal cortex (see Fig. 3a).

In addition, as specified in the original study by [30], these

cell types could also be grouped into four disparate tis-

sue sources: blood, stem, skin and neural tissues. We refer

to these as the cell line-level and tissue-level classifica-

tions respectively and use these as ground-truth classes

in our performance assessment; i.e. we will focus on data

partitions into K = 11 and K = 4clusters.

We applied pcaReduce to this dataset to construct a

hierarchical clustering of cells. First, we initially projected

the data into the subspace spanned by the first 30 prin-

cipal components following a PCA and performed an

initial K-means clustering to get initial cluster assign-

ments (using K = 31 clusters) [31]. After this, we applied

different merging strategies to construct the cellular hier-

archies: first, when merging is performed based on the

most probable cluster merge value (see (i) in Algorithm 1)

and, secondly, when merge candidates are probabilistically

sampled (see (ii) in Algorithmic overview section). The

former gives a single hierarchical clustering whilst the lat-

ter can give a range of candidates hierarchies based on

repeated sampling.

We compared the hierarchical clustering given by

pcaReduce for levels K = 4andK = 11 to the true

cell line and tissue level classifications respectively using

the Adjusted Rand Index [32]. Note that, in Fig. 3a, the

projection of the eleven cell lines in two-dimensional

principal component space cannot be separated into dis-

tinct groups. It is only possible to do this in higher

dimensional representations. Figure 3b illustrates the per-

formance of pcaReduce using the sampling-based merge

operation where each line corresponds to a single run

of the method. Although, pcaReduce has no knowledge

ofthetruenumberofcelllineortissuelabels,thecor-

respondence between the hierarchical clustering output

of pcaReduce and the true classification peaks at around

levels 4 and 11 of the hierarchies respectively which it

discovers without any prior knowledge.

In order to gain an understanding of the misclassifica-

tions, we looked specifically at the most probable hier-

archical structure identified using pcaReduce (Fig. 3c).

Compared to the known cell line and tissue labels (see

Additional file 1: Figure S1), the 11-cluster structure

given by pcaReduce did not fully differentiate the 11 cell

types. This is not unsurprising since the 11 cell types

included a set of three maturing neural cell types (GW16,

GW21 and GW21+3) that are highly related. Interestingly,

PCA2

200 100 0 100 200 300

150 50 0 50 100 200

K562

HL60

2339

iPS

2338

BJ

Kera

NPC

GW21

GW21+3

GW16

5 1015202530

0.0 0.2 0.4 0.6 0.8 1.0

K = 4

K = 11

ARANDI score

Number of clusters

AB

PCA1

C

19 K56223 K562 54 HL60 22 233817 2339 39 Kera

37 BJ

1 Kera

15 NPC13 iPS 10 iPS

26 GW16

8 GW21

16 GW21+3

96 Blood

22 Dermal

17 Blood

77 Dermal

65 Neural

23 Stem

300 Cells

Fig. 3 Application of pcaReduce to single cell RNA sequencing of 11 cell lines. a Projection of the data on to the first two principal components.

b Performance of pcaReduce, the horizontal axis corresponds to a level in the hierarchical cluster structure reported by pcaReduce, the vertical axis

show the Adjusted Rand Index (ARANDI) score between the tissue-level (green) and cell-line level labels (in blue) and the clustering reported by

each level of the hierarchical clustering of pcaReduce. Each line correspond to a single run of pcaReduce using probabilistic sampling. c The most

probable cellular hierarchy identified using pcaReduce

pcaReduce: hierarchical clustering of single cell transcriptional profiles

Citations

SCENIC: single-cell regulatory network inference and clustering.

SC3: consensus clustering of single-cell RNA-seq data

SCENIC: Single-Cell Regulatory Network Inference And Clustering

Challenges in unsupervised clustering of single-cell RNA-seq data.

A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications.

References

Visualizing Data using t-SNE

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

A density-based algorithm for discovering clusters a density-based algorithm for discovering clusters in large spatial databases with noise

A density-based algorithm for discovering clusters in large spatial Databases with Noise

TopHat: discovering splice junctions with RNA-Seq

Related Papers (5)

Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq

Visualizing Data using t-SNE

Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets

Massively parallel digital transcriptional profiling of single cells

The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells