scispace - formally typeset
Open AccessJournal ArticleDOI

pcaReduce: hierarchical clustering of single cell transcriptional profiles

Reads0
Chats0
TLDR
This work has developed a novel agglomerative clustering method that is complimentary to other single cell clustering techniques and adds to a growing palette of single cell bioinformatics tools for profiling heterogeneous cell populations.
Abstract
Advances in single cell genomics provide a way of routinely generating transcriptomics data at the single cell level. A frequent requirement of single cell expression analysis is the identification of novel patterns of heterogeneity across single cells that might explain complex cellular states or tissue composition. To date, classical statistical analysis tools have being routinely applied, but there is considerable scope for the development of novel statistical approaches that are better adapted to the challenges of inferring cellular hierarchies. We have developed a novel agglomerative clustering method that we call pcaReduce to generate a cell state hierarchy where each cluster branch is associated with a principal component of variation that can be used to differentiate two cell states. Using two real single cell datasets, we compared our approach to other commonly used statistical techniques, such as K-means and hierarchical clustering. We found that pcaReduce was able to give more consistent clustering structures when compared to broad and detailed cell type labels. Our novel integration of principal components analysis and hierarchical clustering establishes a connection between the representation of the expression data and the number of cell types that can be discovered. In doing so we found that pcaReduce performs better than either technique in isolation in terms of characterising putative cell states. Our methodology is complimentary to other single cell clustering techniques and adds to a growing palette of single cell bioinformatics tools for profiling heterogeneous cell populations.

read more

Content maybe subject to copyright    Report

Žurauskien
˙
eandYauBMC Bioinformatics
(2016) 17:140
DOI 10.1186/s12859-016-0984-y
METHODOLOGY ARTICLE Open Access
pcaReduce: hierarchical clustering of
single cell transcriptional profiles
Justina Žurauskien
˙
e
1
and Christopher Yau
1,2*
Abstract
Background: Advances in single cell genomics provide a way of routinely generating transcriptomics data at the
single cell level. A frequent requirement of single cell expression analysis is the identification of novel patterns of
heterogeneity across single cells that might explain complex cellular states or tissue composition. To date, classical
statistical analysis tools have being routinely applied, but there is considerable scope for the development of novel
statistical approaches that are better adapted to the challenges of inferring cellular hierarchies.
Results: We have developed a novel agglomerative clustering method that we call pcaReduce to generate a cell state
hierarchy where e ach cluster branch is associated with a principal component of variation that can be used to
differentiate two cell states. Using two real single cell datasets, we compared our approach to other commonly used
statistical techniques, such as K-means and hierarchical clustering. We found that pcaReduce was able to give more
consistent clustering structures when compared to broad and detailed cell type labels.
Conclusions: Our novel integration of principal components analysis and hierarchical clustering establishes a
connection between the representation of the expression data and the number of cell types that can be discovered.
In doing so we found that pcaReduce performs better than either technique in isolation in terms of characterising
putative cell states. Our methodology is complimentary to other single cell clustering techniques and adds to a
growing palette of single cell bioinformatics tools for profiling heterogeneous cell populations.
Keywords: Single cell RNA-Seq, Hierarchical clustering, Gene expression
Background
Recent advances in single cell RNA sequencing (scRNA-
seq) technology has enabled the routine high-throughput
collection of quantitative gene expression measurements
across a range of tissue types and diversity of cellular states
at the level of the single cell [1–6]. The application of sin-
gle cell gene expression profiling has identified cell-to-cell
expression variability in phenotypically and/or genetically
identical cells that is masked in standard “population”
gene expression studies where the transcriptomes of thou-
sands to millions of cells are simultaneously measured and
averaged. This expression variability is driven by stochas-
tic gene expression mechanisms whose effects cannot be
measured in the context of a population of cells but only
*Correspondence: cyau@well.ox.ac.uk
1
Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt
Drive, OX3 7BN Oxford, UK
2
Department of Statistics, University of Oxford, 1 S. Parks Rd, OX1 3TG Oxford,
UK
through the microcosm of a single cell. Consequently,
scRNA-seq has increasingly become the method of choice
in discovering molecular underpinnings of complex and
rare cell populations [7, 8], assessing tissue composition
[9–11], studying various diseases [12] and cell develop-
ment/lineage processes [13–16].
Our particular focus in this article is the utility of
scRNA-seq data to enable the identification of function-
ally distinct sub-populations that each possesses a dif-
ferent pattern of gene expression activity [17, 18]. These
sub-populations could indicate different cell types that
exhibit relatively stable, static behaviour but also cell states
representing intermediate stages in transient processes.
Traditionally, cell types have been defined by the func-
tional behaviour of certain cellular features, for example,
CD14+ monocytes show CD14 expression, but with the
availability of scRNA-seq the potential exists to develop a
richer taxonomy of cell types by extending the molecular
features used for characterisation to consider the whole
© 2016 Žurauskien
˙
eandYau.Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Žurauskien
˙
eandYauBMC Bioinformatics
(2016) 17:140
Page 2 of 11
transcriptome. The population of CD14 expressing mono-
cytes might in fact be a collection of distinct cell subtypes
each sharing a common CD14 expression signature but
also possessing a unique expression pattern of their own.
Unbiased discovery of cell types from scRNA-seq data
can be automated using unsupervised clustering algo-
rithms. Given expression profiles for a collection of single
cells, the objective of the algorithm is to partition the cells
into a number of cell types such that each cell type has
a significantly distinctive expression signature from the
others. Single cell analytical software pipelines have been
developed recently for single cell analysis that include
procedures for unbiased cell type identification.
In RaceID [19], K-means clustering with gap statistics
was used to identify six intestinal cell types, while rare
cell types were identified by examining outliers that could
not be explained by a background noise model. The Back-
SPIN method [9] uses a customized version of the SPIN
algorithm [20]. When applying the BackSPIN method to
3005 mouse cortex and hippocampus cells, the algorithm
identified 77 groups, stopping after 12 splits along the
deepest branch. This was manually limited to 5 splits per
branch and subsequent merging of the neuronal groups
into three main groups (each containing four subgroups)
and oligodendrocytes into one group (containing 3 sub-
groups) resulted in the nine main cell classes given in
the study [9]. Seurat [13, 21] applies a two-dimensional
t-SNE projection [22] to the most significant principal
component scores (a process they refer to as spectral
t-SNE”) from a principal component analysis of the sin-
gle cell expression matrix. Density clustering (DBSCAN)
[23] is then used to identify cell type clusters in the two-
dimensional space. Seurat was used to classify 44,808
Drop-seq single cell expression profiles into 39 retinal cell
populations [21]. The SINCERA package [24] uses hier-
archical clustering using centered Pearsons correlation
and average linkage as default settings for the similarity
measurement and linkage method respectively. SNN-cliq
[25] uses the concept of shared nearest neighbour (SNN)
to define similarities between data points (cells) and
achieves clustering by a graph theory-based algorithm.
Finally, the SC3 approach [26] which uses spectral trans-
formations, the K-means algorithm and complete-linkage
to perform consensus clustering.
A limitation of these methods is that they do not estab-
lish a connection between the representation of the data
to the number and nature of the cell types that can be
resolved. For example, Fig. 1 illustrates three clustering
structures derived from a single cell study of mouse sen-
sory neurons [27]. Four broad sensory neuronal cell types
(NF, TH, PEP, NP) were identified by examining clusters
of cells in the subspace spanned by the first few prin-
cipal components (PC2-4 shown in Fig. 1a) and using
expression of key (known) cell markers to label the clus-
ters. Using information contained in additional principal
components, the four major cell types could then be sub-
divided into further distinct cell subtypes. The presence
of these refined cell subtypes is clearly not obvious from
a visual inspection of the data in the subspace spanned by
PC2-4 (Fig. 1b,c).
We have developed an agglomerative clustering
approach that integrates principal components analysis
(PCA) and hierarchical clustering that we call pcaReduce.
This method seeks to establish a connection between the
reduced representations given by principal components
analysis and the number of resolvable cell types (clusters).
The approach is driven by the expectation that informa-
tion pertaining to large, broad classes of cell types are
likely be contained in low dimensional PC representa-
tions whilst refined cell type structure are only defined
in high dimensional PC representations. Our proposed
method is similar to the “iterative” principal component
analysis approaches used to establish putative cell types
in [27, 28]. However, the question we asked was whether
Fig. 1 Cellular hierarchies. Three hierarchically related clustering structures for a single cell mouse neuronal dataset [27]. The data has been
projected on to the first four principle directions, we report the three that allows best data visualisation; we used the given cellular labels to colour
cells according to the a 4, b 8, and c 11 cell subtypes identified in the original study

Žurauskien
˙
eandYauBMC Bioinformatics
(2016) 17:140
Page 3 of 11
we could recapitulate such results fully automatically in
an unsupervised fashion without prior knowledge of cell
type markers that was used in these studies. This ques-
tion is important in study conditions where there maybe
little or unreliable prior knowledge of cell types. We will
test our methods against existing approaches but note
that many existing approaches typically utilise different
combinations of standard dimensionality reduction and
clustering algorithms. Therefore, in our investigations,
instead of using these packages directly, we will explore
the utility of the constituents components (which might
be shared between approaches).
Methods
Let X
n×d
denote a gene expression matrix, where n
is the number of cells measured across d number of
genes; i.e. each cell x
i
={x
i1
, ..., x
id
} is a d-dimensional
object. Further assume that Y
n×q
denotes a score matrix,
obtained after projecting data into first q principle direc-
tions, and Y
i
denotes a subset of cells, Y
i
Y.
Our clustering algorithm begins by performing a
K-means clustering operation on the projection of the
original gene expression matrix, X
n×d
,tothetopK 1
principal directions. The number of initial clusters K is
set to a sufficiently large value, say 30, to ensure most
cell types will be captured. Once the initial clusters are
determined, we take two subsets (Y
i
, Y
j
)thatoriginate
from a pair of clusters (i, j) respectively, and calculate the
probability for those observations to be merged together,
p({Y
i
, Y
j
}|μ
ij
,
ij
). We assume that the probability den-
sity function is a multivariate Gaussian with mean and
covariance matrix given by:
μ
ij
=
n
i
n
i
+ n
j
μ
i
+
n
j
n
i
+ n
j
μ
j
,
ij
=
n
i
n
i
+ n
j
i
+
n
j
n
i
+ n
j
j
,
(1)
where (n
i
, n
j
),
i
, μ
j
) and (
i
,
j
) denote the sizes, cen-
troids and covariances of the clusters i and j respec-
tively. We repeat this for all possible pairs (i, j).Wethen
choose to merge two clusters by either (i) picking the pair
that has the highest probability or (ii) sampling a pair
of clusters to merge in proportion to their (normalised)
merged probabilities. The number of clusters will now
decrease to K 1. We then project the data matrix on
to the first K 2 principal directions, i.e. removing the
(K 1)-st principal component that explains the lowest
degree of variance in the data, removing this dimen-
sion from the existing cluster centroids and covariance
matrices.
The above clustering operation is then repeated so that
after every merge operation we remove a principal direc-
tion until only a single cluster remains. If sampling-based
merge operations are used, the whole process can be
repeated to obtain a number of alternative clusterings.
This will be useful for assessing the stability of the cluster-
ing results. Algorithm 1 gives a pseudo-code description.
Algorithm 1: The pcaReduce algorithm. Here
X
n×d
is a gene expression matrix with n cells
(given in rows) and d genes (in columns); q is the
number of dimensions effectively this refers to
the number of levels in the hierarchy; Y is a score
matrix, which is the output of PCA algorithm; μ
ij
and
ij
definition are given in Eq. (1); (i) and (ii)
denote two different merging settings: (i) merg-
ing is based on largest probability P(i, j) value; (ii)
merging is based on sampling according to P(i, j)
distribution.
Input: X
n×d
and q ;
Output:acollectionofq clusterings;
1 Y ←− PCA(X
n×d
, dim=q);
2 (μ, ) ←− kmeans(Y, K = q + 1);
3 Q ←− q 1;
4 for r = 1, ..., Q do
5 for all possible pairs (i, j) in
{1, ..., K}×{1, ..., K} do
6 P(i, j) ←− p
Y
i
Y
j
|μ
ij
,
ij
;
7 end for
8 (i) either choose pair (i, j) with largest
probability P(i, j) and merge clusters i and j;
9 (ii) or sample a pair (i, j) with probability
P(i, j) and merge clusters i and j;
10 q ←− q 1;
11 Y Y
n×q
(i.e. remove last dimension);
12 update (μ, );
13 K K 1;
14 end for
Figure 2 gives an illustration of our method using an
autoencoder representation. Autoencoders are feedfor-
ward neural networks that accept d-dimensional input
data and report d-dimensional output data. The d nodes
in the input and output network layers are connected
via one or more hidden layers. Data transformations are
applied between each layer of the network. If a hid-
den layer has fewer nodes than its predecessor then
the information from the previous layer in the autoen-
coder network is forced into a lower dimensional form
hence performing dimensionality reduction. Each hidden
layer encodes a reduced dimensional representation of
the input data. In an autoencoder, the parameters gov-
erning the data transformations between the layers are
fitted to minimise the mean-squared error between the
original input data and the output representation. It can

Žurauskien
˙
eandYauBMC Bioinformatics
(2016) 17:140
Page 4 of 11
Fig. 2 Method illustration using an autoencoder network. Clustering is applied to the data representation at each linear hidden layer. If there are
K 1 linear hidden units, the data is projected into a subspace spanned by the top K 1 principal components. Consistency between the
clusterings at each layer is maintained by enforcing a hierarchical constraint. a Graphical interpretation of an autoencoder network(s). b
Corresponding hierarchical structure
be shown that, when using linear transformations, the
optimal autoencoder is equivalent to doing principal com-
ponents analysis [29]. Using this analogy it is now clear
that pcaReduce can be seen as performing hierarchical
clustering on the different hidden layers of a linear autoen-
coder network linking different clustering structures to
different hidden layer representations of the input data.
This analogy is useful to explain our algorithm and why we
remove one principal component after each merge oper-
ation. We are in fact maintaining clustering consistency
across the hidden layer representations of the input data
in the autoencoder network.
Results
To demonstrate the performance of pcaReduce method,
we considered two single cell RNA-seq dataset exam-
ples. The first contains a collection of cells originating
from diverse biological tissues [30]; and the second dataset
the mouse sensory neuronal cells [27] discussed in the
Introduction. These were selected as they contained pre-
existing hierarchical cluster structures that can be used
to assess unsupervised algorithmic performance. Here we
show that pcaReduce can be applied to re-capture the
known cellular hierarchies and we compare to other statis-
tical techniques, which are commonly applied to address
similar cell sub-typing problems. Below, all examples were
implemented using the free statistical computing platform
R (www.r-project.org).
Cells from disparate tissues
We obtained single cell RNA-seq dataset [30] for 300 cells
whose transcriptional measurements were taken across
8,686 genes (see Additional file 1, Section A for further
details on data preparation). The data were derived from
11 cell types: K562 myeloid (chronic leukemia), HL60
myeloid (acute leukemia), CRL-2339 lymphoblastoid;

Žurauskien
˙
eandYauBMC Bioinformatics
(2016) 17:140
Page 5 of 11
iPS pluripotent; CRL-2338 epithelial, BJ fibrob-
last (from human foreskin), Kera foreskin keratinocyte;
NPC neural progenitor cells, GW(16, 21, 21+3) gesta-
tional week (16,21, 21+3 weeks), fetal cortex (see Fig. 3a).
In addition, as specified in the original study by [30], these
cell types could also be grouped into four disparate tis-
sue sources: blood, stem, skin and neural tissues. We refer
to these as the cell line-level and tissue-level classifica-
tions respectively and use these as ground-truth classes
in our performance assessment; i.e. we will focus on data
partitions into K = 11 and K = 4clusters.
We applied pcaReduce to this dataset to construct a
hierarchical clustering of cells. First, we initially projected
the data into the subspace spanned by the first 30 prin-
cipal components following a PCA and performed an
initial K-means clustering to get initial cluster assign-
ments (using K = 31 clusters) [31]. After this, we applied
different merging strategies to construct the cellular hier-
archies: first, when merging is performed based on the
most probable cluster merge value (see (i) in Algorithm 1)
and, secondly, when merge candidates are probabilistically
sampled (see (ii) in Algorithmic overview section). The
former gives a single hierarchical clustering whilst the lat-
ter can give a range of candidates hierarchies based on
repeated sampling.
We compared the hierarchical clustering given by
pcaReduce for levels K = 4andK = 11 to the true
cell line and tissue level classifications respectively using
the Adjusted Rand Index [32]. Note that, in Fig. 3a, the
projection of the eleven cell lines in two-dimensional
principal component space cannot be separated into dis-
tinct groups. It is only possible to do this in higher
dimensional representations. Figure 3b illustrates the per-
formance of pcaReduce using the sampling-based merge
operation where each line corresponds to a single run
of the method. Although, pcaReduce has no knowledge
ofthetruenumberofcelllineortissuelabels,thecor-
respondence between the hierarchical clustering output
of pcaReduce and the true classification peaks at around
levels 4 and 11 of the hierarchies respectively which it
discovers without any prior knowledge.
In order to gain an understanding of the misclassifica-
tions, we looked specifically at the most probable hier-
archical structure identified using pcaReduce (Fig. 3c).
Compared to the known cell line and tissue labels (see
Additional file 1: Figure S1), the 11-cluster structure
given by pcaReduce did not fully differentiate the 11 cell
types. This is not unsurprising since the 11 cell types
included a set of three maturing neural cell types (GW16,
GW21 and GW21+3) that are highly related. Interestingly,
PCA2
200 100 0 100 200 300
150 50 0 50 100 200
K562
HL60
2339
iPS
2338
BJ
Kera
NPC
GW21
GW21+3
GW16
5 1015202530
0.0 0.2 0.4 0.6 0.8 1.0
K = 4
K = 11
ARANDI score
Number of clusters
AB
PCA1
C
19 K56223 K562 54 HL60 22 233817 2339 39 Kera
37 BJ
1 Kera
15 NPC13 iPS 10 iPS
26 GW16
8 GW21
16 GW21+3
96 Blood
22 Dermal
17 Blood
77 Dermal
65 Neural
23 Stem
300 Cells
Fig. 3 Application of pcaReduce to single cell RNA sequencing of 11 cell lines. a Projection of the data on to the first two principal components.
b Performance of pcaReduce, the horizontal axis corresponds to a level in the hierarchical cluster structure reported by pcaReduce, the vertical axis
show the Adjusted Rand Index (ARANDI) score between the tissue-level (green) and cell-line level labels (in blue) and the clustering reported by
each level of the hierarchical clustering of pcaReduce. Each line correspond to a single run of pcaReduce using probabilistic sampling. c The most
probable cellular hierarchy identified using pcaReduce

Citations
More filters
Journal ArticleDOI

SCENIC: single-cell regulatory network inference and clustering.

TL;DR: On a compendium of single-cell data from tumors and brain, it is demonstrated that cis-regulatory analysis can be exploited to guide the identification of transcription factors and cell states.
Posted ContentDOI

SCENIC: Single-Cell Regulatory Network Inference And Clustering

TL;DR: SCENIC (Single Cell rEgulatory Network Inference and Clustering) is the first method to analyze scRNA-seq data using a network-centric, rather than cell-centric approach and allows for the simultaneous tracing of genomic regulatory programs and the mapping of cellular identities emerging from these programs.
Journal ArticleDOI

Challenges in unsupervised clustering of single-cell RNA-seq data.

TL;DR: This Review discusses the multiple algorithmic options for clustering scRNA-seq data, including various technical, biological and computational considerations.
Journal ArticleDOI

A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications.

TL;DR: A practical guide to help researchers design their first scRNA-seq studies, including introductory information on experimental hardware, protocol choice, quality control, data analysis and biological interpretation is presented.
References
More filters
Journal Article

Visualizing Data using t-SNE

TL;DR: A new technique called t-SNE that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map, a variation of Stochastic Neighbor Embedding that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
Journal ArticleDOI

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

TL;DR: EdgeR as mentioned in this paper is a Bioconductor software package for examining differential expression of replicated count data, which uses an overdispersed Poisson model to account for both biological and technical variability and empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference.
Proceedings Article

A density-based algorithm for discovering clusters a density-based algorithm for discovering clusters in large spatial databases with noise

TL;DR: In this paper, a density-based notion of clusters is proposed to discover clusters of arbitrary shape, which can be used for class identification in large spatial databases and is shown to be more efficient than the well-known algorithm CLAR-ANS.
Proceedings Article

A density-based algorithm for discovering clusters in large spatial Databases with Noise

TL;DR: DBSCAN, a new clustering algorithm relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape, is presented which requires only one input parameter and supports the user in determining an appropriate value for it.
Journal ArticleDOI

TopHat: discovering splice junctions with RNA-Seq

TL;DR: The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer.
Related Papers (5)