Townes et al. Genome Biology (2019) 20:295
https://doi.org/10.1186/s13059-019-1861-6

METHOD   Open Access
Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model

F. William Townes (1,2), Stephanie C. Hicks (3), Martin J. Aryee (1,4,5,6) and Rafael A. Irizarry (1,7,*)

*Correspondence: rafa@ds.dfci.harvard.edu
1 Department of Biostatistics, Harvard University, Cambridge, MA, USA
7 Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
Full list of author information is available at the end of the article
Abstract
Single-cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have
incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial
sampling with no zero inflation. Current normalization procedures such as log of counts per million and feature
selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial
methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature
selection using deviance. These methods outperform the current practice in a downstream clustering assessment
using ground truth datasets.
Keywords: Gene expression, Single cell, RNA-Seq, Dimension reduction, Variable genes, Principal component
analysis, GLM-PCA
Background
Single-cell RNA-Seq (scRNA-Seq) is a powerful tool for
profiling gene expression patterns in individual cells, facil-
itating a variety of analyses such as identification of novel
cell types [1, 2]. In a typical protocol, single cells are iso-
lated in liquid droplets, and messenger RNA (mRNA) is
captured from each cell, converted to cDNA by reverse
transcriptase (RT), then amplified using polymerase chain
reaction (PCR) [3–5]. Finally, fragments are sequenced,
and expression of a gene in a cell is quantified by the
number of sequencing reads that mapped to that gene
[6]. A crucial difference between scRNA-Seq and tradi-
tional bulk RNA-Seq is the low quantity of mRNA isolated
from individual cells, which requires a larger number of
PCR cycles to produce enough material for sequencing
(bulk RNA-Seq comingles thousands of cells per sam-
ple). For example, the popular 10x Genomics protocol
uses 14 cycles [5]. Thus, many of the reads counted in
scRNA-Seq are duplicates of a single mRNA molecule in
the original cell [7]. Full-length protocols such as SMART-
Seq2 [8] analyze these read counts directly, and several
methods have been developed to facilitate this [9].
However, in many experiments, it is desirable to ana-
lyze larger numbers of cells than possible with full-length
protocols, and isoform-level inference may be unneces-
sary. Under such conditions, it is advantageous to include
unique molecular identifiers (UMIs) which enable com-
putational removal of PCR duplicates [10, 11], producing
UMI counts. Although a zero UMI count is equivalent to a
zero read count, nonzero read counts are larger than their
corresponding UMI counts. In general, all scRNA-Seq
data contain large numbers of zero counts (often > 90%
of the data). Here, we focus on the analysis of scRNA-Seq
data with UMI counts.
Starting from raw counts, a scRNA-Seq data analysis
typically includes normalization, feature selection, and
dimension reduction steps. Normalization seeks to adjust
for differences in experimental conditions between sam-
ples (individual cells), so that these do not confound
true biological differences. For example, the efficiency
of mRNA capture and RT is variable between samples
(technical variation), causing different cells to have differ-
ent total UMI counts, even if the number of molecules
in the original cells is identical. Feature selection refers
to excluding uninformative genes such as those which
exhibit no meaningful biological variation across sam-
ples. Since scRNA-Seq experiments usually examine cells
within a single tissue, only a small fraction of genes are
expected to be informative since many genes are bio-
logically variable only across different tissues. Dimension
reduction aims to embed each cell’s high-dimensional
expression profile into a low-dimensional representation
to facilitate visualization and clustering.
While a plethora of methods [5, 12–15] have been devel-
oped for each of these steps, here, we describe what is
considered to be the standard pipeline [15]. First, raw
counts are normalized by scaling of sample-specific size
factors, followed by log transformation, which attempts to
reduce skewness. Next, feature selection involves identi-
fying the top 500–2000 genes by computing either their
coefficient of variation (highly variable genes [16, 17]) or
average expression level (highly expressed genes) across
all cells [15]. Alternatively, highly dropout genes may be
retained. Principal component analysis (PCA) [19] is
the most popular dimension reduction method (see for
example tutorials for Seurat [17] and Cell Ranger [5]).
PCA compresses each cell's 2000-dimensional expression
profile into, say, a 10-dimensional vector of principal com-
ponent coordinates or latent factors. Prior to PCA, data
are usually centered and scaled so that each gene has
mean 0 and standard deviation 1 (z-score transformation).
Finally, a clustering algorithm can be applied to group cells
with similar representations in the low-dimensional PCA
space.
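For concreteness, the following is a minimal sketch of this standard pipeline in Python using numpy and scikit-learn. The count matrix Y (cells as rows, genes as columns), the use of k-means for the clustering step, and all default parameters are illustrative assumptions rather than a prescription from the text.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def standard_pipeline(Y, n_top_genes=2000, n_pcs=10, n_clusters=8, pseudocount=1.0):
    """Sketch of the standard workflow: CPM -> log -> HVG -> z-score -> PCA -> clustering.

    Y: (cells x genes) matrix of raw UMI counts (hypothetical input).
    """
    # 1. Normalize by cell-specific size factors (counts per million).
    size_factors = Y.sum(axis=1, keepdims=True) / 1e6
    cpm = Y / size_factors

    # 2. Log transform with an arbitrary pseudocount.
    log_cpm = np.log2(pseudocount + cpm)

    # 3. Feature selection: keep the genes with the largest coefficient
    #    of variation of the normalized counts ("highly variable genes").
    cv = cpm.std(axis=0) / (cpm.mean(axis=0) + 1e-12)
    top = np.argsort(cv)[::-1][:n_top_genes]
    X = log_cpm[:, top]

    # 4. Center and scale each gene (z-score transformation).
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

    # 5. PCA to a low-dimensional representation, then cluster the cells.
    factors = PCA(n_components=n_pcs).fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(factors)
    return factors, labels
```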
Despite the appealing simplicity of this standard
pipeline, the characteristics of scRNA-Seq UMI counts
present difficulties at each stage. Many normalization
schemes derived from bulk RNA-Seq cannot compute size
factors stably in the presence of large numbers of zeros
[20]. A numerically stable and popular method is to set
the size factor for each cell as the total counts divided by
$10^6$ (counts per million, CPM). Note that CPM does not
alter zeros, which dominate scRNA-Seq data. Log trans-
formation is not possible for exact zeros, so it is common
practice to add a small pseudocount such as 1 to all nor-
malized counts prior to taking the log. The choice of pseu-
docount is arbitrary and can introduce subtle biases in the
transformed data [21]. For a statistical interpretation of
the pseudocount, see the "Methods" section. Similarly, the
use of highly variable genes for feature selection is some-
what arbitrary since the observed variability will depend
on the pseudocount: pseudocounts close to zero arbitrar-
ily increase the variance of genes with zero counts. Finally,
PCA implicitly relies on Euclidean geometry, which may
not be appropriate for highly sparse, discrete, and skewed
data, even after normalizations and transformations [22].
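A toy calculation (hypothetical numbers, not drawn from any dataset discussed here) makes the pseudocount dependence concrete: for a gene that is zero in half the cells, the log-scale variance used to rank "highly variable" genes grows as the pseudocount shrinks.

```python
import numpy as np

# Hypothetical gene: zero in half the cells and a single count in the other half,
# with every cell at 2,000 total UMIs, so the CPM values are 0 or 500.
cpm = np.array([0.0] * 50 + [500.0] * 50)

for c in (1.0, 0.1, 0.01):
    log_vals = np.log2(c + cpm)
    print(f"pseudocount={c:<4}  var of log2(c + CPM) = {log_vals.var():.1f}")

# Smaller pseudocounts stretch the gap between zero and nonzero cells,
# inflating the apparent variance used to rank "highly variable" genes.
```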
Widely used methods for the analysis of scRNA-Seq
lack statistically rigorous justification based on a plausible data-generating mechanism for UMI counts.
Instead, it appears many of the techniques have been bor-
rowed from the data analysis pipelines developed for read
counts, especially those based on bulk RNA-Seq [23]. For
example, models based on the lognormal distribution can-
not account for exact zeros, motivating the development
of zero-inflated lognormal models for scRNA-Seq read
counts [24–27]. Alternatively, ZINB-WaVE uses a zero-
inflated negative binomial model for dimension reduction
of read counts [28]. However, as shown below, the sam-
pling distribution of UMI counts is not zero inflated [29]
and differs markedly from read counts, so application of
read count models to UMI counts needs either theoretical
or empirical justification.
We present a unifying statistical foundation for scRNA-
Seq with UMI counts based on the multinomial dis-
tribution. The multinomial model adequately describes
negative control data, and there is no need to model zero
inflation. We show the mechanism by which PCA on
log-normalized UMI counts can lead to distorted low-
dimensional factors and false discoveries. We identify the
source of the frequently observed and undesirable fact
that the fraction of zeros reported in each cell drives the
first principal component in most experiments [30]. To
remove these distortions, we propose the use of GLM-
PCA, a generalization of PCA to exponential family like-
lihoods [31]. GLM-PCA operates on raw counts, avoiding
the pitfalls of normalization. We also demonstrate that
applying PCA to deviance or Pearson residuals provides
a useful and fast approximation to GLM-PCA. We pro-
vide a closed-form deviance statistic as a feature selection
method. We systematically compare the performance of
all combinations of methods using ground truth datasets
and assessment procedures from [15]. We conclude by
suggesting best practices.
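As an illustration of the deviance idea, the sketch below ranks genes by a binomial deviance against a null model of constant relative abundance across cells. It is a plausible rendering of the statistic described here, not a verbatim copy of the paper's closed-form expression (which appears in its Methods, outside this excerpt); Y is again a hypothetical cells-by-genes UMI count matrix.

```python
import numpy as np

def binomial_deviance(Y):
    """Per-gene deviance against a null model of constant relative abundance.

    Y: (cells x genes) raw UMI counts. A larger deviance means the gene departs
    more from the constant-abundance null, i.e., it is more informative.
    Sketch only; see the paper's Methods for the exact closed-form statistic.
    """
    n = Y.sum(axis=1, keepdims=True)      # total UMIs per cell, n_i
    pi_hat = Y.sum(axis=0) / n.sum()      # pooled relative abundance per gene
    mu = n * pi_hat                       # expected counts under the null

    with np.errstate(divide="ignore", invalid="ignore"):
        term1 = Y * np.log(Y / mu)
        term2 = (n - Y) * np.log((n - Y) / (n - mu))
    # Convention: 0 * log(0) = 0.
    term1 = np.where(Y == 0, 0.0, term1)
    term2 = np.where(n - Y == 0, 0.0, term2)
    return 2.0 * (term1 + term2).sum(axis=0)

# Usage: keep the top 2000 genes by deviance.
# top_genes = np.argsort(binomial_deviance(Y))[::-1][:2000]
```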
Results and discussion
Datasets
We used 9 public UMI count datasets to benchmark our
methods (Table 1). The first dataset was a highly con-
trolled experiment specifically designed to understand the
technical variability. No actual cells were used to gener-
ate this dataset. Instead, each droplet received the same
ratio of 92 synthetic spike-in RNA molecules from Exter-
nal RNA Controls Consortium (ERCC). We refer to this
dataset as the technical replicates negative control as there
is no biological variability whatsoever, and in principle,
each expression profile should be the same.
The second and third datasets contained cells from
homogeneous populations purified using fluorescence-
activated cell sorting (FACS). We refer to these datasets
as biological replicates negative controls. Because these
cells were all the same type, we did not expect
to observe any significant differences in unsupervised
analysis. The 10× Zheng monocytes data had low total UMI counts, while the SMARTer Tung data had high counts.

Table 1 Single cell RNA-Seq datasets used

Number  Author       Tissue     Cells   MTU     Notes
1       Zheng [5]    ERCC       1015    11,125  Spike-in only; technical negative control
2       Zheng [5]    Monocytes  2612    782     1 cell type; biological negative control
3       Tung [32]    iPSCs      57      24,170  1 cell type; biological negative control
4       Duo [15]     PBMCs      3994    1215    4 equal clusters of FACS-purified cells
5       Duo [15]     PBMCs      3994    1298    8 equal clusters of FACS-purified cells
6       Haber [33]   Intestine  533     3755    Authors computationally identified 12 types
7       Muraro [34]  Pancreas   2282    18,795  Authors computationally identified 9 types
8       Zheng [5]    PBMCs      68,579  1292    Benchmarking computational speed

Species: all H. sapiens except Haber (M. musculus). Protocols: all 10× except Muraro (CEL-Seq2) and Tung (SMARTer). MTU, median total UMI count; iPSCs, induced pluripotent stem cells.

The fourth and fifth datasets were created by [15].
The authors allocated FACS-purified peripheral blood
mononuclear cells (PBMCs) from 10× data [5] equally
into four (Zheng 4eq dataset) and eight (Zheng 8eq
dataset) clusters, respectively. In these positive control
datasets, the cluster identity of all cells was assigned inde-
pendently of gene expression (using FACS), so they served
as the ground truth labels.
The sixth and seventh datasets contained a wider variety
of cell types. However, the cluster identities were deter-
mined computationally by the original authors’ unsuper-
vised analyses and could not serve as a ground truth. The
10× Haber intestinal dataset had low total UMI counts,
while the CEL-Seq2 Muraro pancreas dataset had high
counts.
The final Zheng dataset consisted of a larger number
of unsorted PBMCs and was used to compare computa-
tional speed of different dimension reduction algorithms.
We refer to it as the PBMC 68K dataset.
UMI count distribution differs from reads
To illustrate the marked difference between UMI count
distributions and read count distributions, we created
histograms from individual genes and spike-ins of the
negative control data. Here, the UMI counts are the com-
putationally de-duplicated versions of the read counts;
both measurements are from the same experiment,
so no differences are due to technical or biological
variation. The results suggest that while read counts
appear zero-inflated and multimodal, UMI counts fol-
low a discrete distribution with no zero inflation
(Additional file 1: Figure S1). The apparent zero inflation
in read counts is a result of PCR duplicates.
Multinomial sampling distribution for UMI counts
Consider a single cell $i$ containing $t_i$ total mRNA transcripts. Let $n_i$ be the total number of UMIs for the same cell. When the cell is processed by a scRNA-Seq protocol, it is lysed, then some fraction of the transcripts are captured by beads within the droplets. A series of complex biochemical reactions occur, including attachment of barcodes and UMIs, and reverse transcription of the captured mRNA to a cDNA molecule. Finally, the cDNA is sequenced, and PCR duplicates are removed to generate the UMI counts [5]. In each of these stages, some fraction of the molecules from the previous stage are lost [5, 7, 32]. In particular, reverse transcriptase is an inefficient and error-prone enzyme [35]. Therefore, the number of UMI counts representing the cell is much less than the number of transcripts in the original cell ($n_i \ll t_i$). Specifically, $n_i$ typically ranges from 1000 to 10,000, while $t_i$ is estimated to be approximately 200,000 for a typical mammalian cell [36]. Furthermore, which molecules are selected and successfully become UMIs is a random process. Let $x_{ij}$ be the true number of mRNA transcripts of gene $j$ in cell $i$, and $y_{ij}$ be the UMI count for the same gene and cell. We define the relative abundance $\pi_{ij}$ as the true number of mRNA transcripts represented by gene $j$ in cell $i$ divided by the total number of mRNA transcripts in cell $i$. Relative abundance is given by $\pi_{ij} = x_{ij}/t_i$, where total transcripts $t_i = \sum_j x_{ij}$. Since $n_i \ll t_i$, there is a "competition to be counted" [37]; genes with large relative abundance $\pi_{ij}$ in the original cell are more likely to have nonzero UMI counts, but genes with small relative abundances may be observed with UMI counts of exact zeros. The UMI counts $y_{ij}$ are a multinomial sample of the true biological counts $x_{ij}$, containing only relative information about expression patterns in the cell [37, 38].
The multinomial distribution can be approximated
by independent Poisson distributions and overdispersed
(Dirichlet) multinomials by independent negative bino-
mial distributions. These approximations are useful for
computational tractability. Details are provided in the
"Methods" section.
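A short simulation sketch (with arbitrary, illustrative parameters) shows the sense in which the Poisson approximation is adequate: per-cell multinomial draws and independent Poisson draws with matched means $\mu_{ij} = n_i \pi_j$ produce nearly identical zero fractions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_cells, n_genes = 2000, 500
pi = rng.dirichlet(np.full(n_genes, 0.05))        # relative abundances, sum to 1
n_umi = rng.integers(1000, 3000, size=n_cells)    # total UMIs per cell

# Multinomial sampling of UMI counts within each cell.
Y_mult = np.vstack([rng.multinomial(n, pi) for n in n_umi])

# Independent Poisson approximation with matched means mu_ij = n_i * pi_j.
Y_pois = rng.poisson(n_umi[:, None] * pi[None, :])

print("zero fraction (multinomial):", (Y_mult == 0).mean())
print("zero fraction (Poisson):    ", (Y_pois == 0).mean())
```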
The multinomial model makes two predictions which
we verified using negative control data. First, the fraction
of zeros in a sample (cell or droplet) is inversely related
to the total number of UMIs in that sample. Second,
the probability of an endogenous gene or ERCC spike-in
having zero counts is a decreasing function of its mean
expression (equations provided in the "Methods" section).
Both of these predictions were validated by the negative
control data (Fig. 1). In particular, the empirical probabil-
ity of a gene being zero across droplets was well calibrated
to the theoretical prediction based on the multinomial
model. This also demonstrates that UMI counts are not
zero inflated, consistent with [29].
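The exact equations are given in the "Methods" section; as a reference point, under the multinomial model the marginal count for gene $j$ in cell $i$ is binomial with size $n_i$ and probability $\pi_{ij}$, so the zero probability takes the form below (a restatement under the model as described here, not a quotation of the paper's equations).

```latex
% Zero probability for gene j in cell i with n_i total UMIs: the multinomial
% marginal is binomial(n_i, pi_ij), and the Poisson approximation follows.
% Both are decreasing in n_i and in the expected count n_i * pi_ij.
\Pr(y_{ij} = 0) \;=\; (1 - \pi_{ij})^{n_i} \;\approx\; e^{-n_i \pi_{ij}}
```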
To further validate the multinomial model, we assessed
goodness-of-fit of seven possible null distributions to
both the Tung and Zheng monocytes negative control
datasets (Additional file 1: Figure S2). When applied to
UMI counts, the multinomial, Dirichlet-multinomial, and
Poisson (as approximation to multinomial) distributions
fit best. When applied to read counts, the zero-inflated
lognormal was the best fitting distribution followed by the
Dirichlet-multinomial.
These results are consistent with [39], which also found
that the relationship between average expression and zero
probability follows the theoretical curve predicted by a
Poisson model using negative control data processed with
Indrop [4] and Dropseq [3] protocols. These are droplet
protocols with typically low counts. It has been argued
that the Poisson model is insufficient to describe the sam-
pling distribution of genes with high counts and the neg-
ative binomial model is more appropriate [11]. The Tung
dataset contained high counts, and we nevertheless found
the Poisson gave a better fit than the negative binomial.
However, the difference was not dramatic, so our results
do not preclude the negative binomial as a reasonable
sampling distribution for UMI counts. Taken together, these results suggest our data-generating mechanism is an accurate model of technical noise in real data.

Fig. 1 Multinomial model adequately characterizes sampling distributions of technical and biological replicates negative control data. a Fraction of zeros is plotted against the total number of UMIs in each droplet for the technical replicates. b As a, but for cells in the biological replicates (monocytes). c After down-sampling replicates to 10,000 UMIs per droplet to remove variability due to the differences in sequencing depth, the fraction of zeros is computed for each gene and plotted against the log of expression across all samples for the technical replicates data. The solid curve is the theoretical probability of observing a zero as a function of the expected counts derived from the multinomial model (blue) and its Poisson approximation (green). d As c, but for the biological replicates (monocytes) dataset and after down-sampling to 575 UMIs per cell. Here, we also add the theoretical probability derived from a negative binomial model (red)
Normalization and log transformation distort UMI data
Standard scRNA-Seq analysis involves normalizing raw
counts using size factors, applying a log transformation
with a pseudocount, and then centering and scaling each
gene before dimension reduction. The most popular nor-
malization is counts per million (CPM). The CPM are
defined as $(y_{ij}/n_i) \times 10^6$ (i.e., the size factor is $n_i/10^6$). This is equivalent to the maximum likelihood estimator (MLE) for relative abundance $\hat\pi_{ij}$ multiplied by $10^6$. The log-CPM are then $\log_2(c + \hat\pi_{ij} \cdot 10^6) = \log_2(\tilde\pi_{ij}) + C$, where $\tilde\pi_{ij}$ is a maximum a posteriori (MAP) estimator for $\pi_{ij}$ (mathematical justification and interpretation of this approach are provided in the "Methods" section). The additive constant $C$ is irrelevant if data are centered for each gene after log transformation, as is common practice. Thus, normalization of raw counts is equivalent to using MLEs or MAP estimators of the relative abundances.
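In code, the point that CPM is simply a rescaled MLE is immediate (a minimal numpy sketch; Y is a hypothetical cells-by-genes UMI count matrix and c is the pseudocount).

```python
import numpy as np

def log_cpm(Y, c=1.0):
    """log2(c + CPM) for a hypothetical (cells x genes) UMI count matrix Y."""
    n = Y.sum(axis=1, keepdims=True)   # total UMIs per cell, n_i
    pi_mle = Y / n                     # MLE of relative abundance, pi_hat = y / n
    cpm = pi_mle * 1e6                 # CPM is the MLE rescaled by 10^6
    return np.log2(c + cpm)            # c is the arbitrary pseudocount
```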
Log transformation of MLEs is not possible for UMI
counts due to exact zeros, while log transformation of
MAP estimators of $\pi_{ij}$ systematically distorts the differences between zero and nonzero UMI counts, depending on the arbitrary pseudocount $c$ (derivations provided in the "Methods" section). To illustrate this phenomenon,
we examined the distribution of an illustrative gene
before and after the log transform with varying normal-
izations using the biological replicates negative control
data (Fig. 2). Consistent with our theoretical predictions,
this artificially caused the distribution to appear zero
inflated and exaggerated differences between cells based
on whether the count was zero or nonzero.
Focusing on the entire negative control datasets, we
applied PCA to log-CPM values. We observed a strong
correlation (r = 0.8 for technical and r = 0.98 for
monocytes biological replicates) between the first princi-
pal component (PC) and the fraction of zeros, consistent
with [30]. Application of PCA to CPM values without log
transform reduced this correlation to r = 0.1 for tech-
nical and r = 0.7 for monocytes biological replicates.
Additionally, the first PC of log-CPM correlated with the
log of total UMI, which is consistent with the multinomial
model (Fig. 3). Note that in datasets with strong biological
variability, the nuisance variation from zero fraction and
total counts could appear in secondary PCs rather than
the first PC, but it would still confound downstream anal-
yses. Based on these results, the log transformation is not
necessary and in fact detrimental for the analysis of UMI
counts. The benefits of avoiding normalization by instead
directly modeling raw counts have been demonstrated in
the context of differential expression [40]. Where normal-
ization is unavoidable, we propose the use of approximate
multinomial deviance residuals (defined in the "Residuals and z-scores" section) instead of log-transformed CPM.
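This diagnostic is straightforward to reproduce on any UMI count matrix; a self-contained sketch (with Y again a hypothetical cells-by-genes matrix) computes the correlation between the first PC of z-scored log-CPM and each cell's fraction of zeros.

```python
import numpy as np
from sklearn.decomposition import PCA

def pc1_vs_zero_fraction(Y, c=1.0):
    """Correlation between the first PC of z-scored log-CPM and the per-cell
    fraction of zeros. Y is a hypothetical (cells x genes) UMI count matrix."""
    n = Y.sum(axis=1, keepdims=True)
    log_cpm = np.log2(c + (Y / n) * 1e6)
    z = (log_cpm - log_cpm.mean(axis=0)) / (log_cpm.std(axis=0) + 1e-12)
    pc1 = PCA(n_components=2).fit_transform(z)[:, 0]
    zero_frac = (Y == 0).mean(axis=1)
    return np.corrcoef(pc1, zero_frac)[0, 1]
```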
Zero inflation is an artifact of log normalization
To see how normalization and log transformation introduce the appearance of zero inflation, consider the following example. Let $y_{ij}$ be the observed UMI counts following a multinomial distribution with size $n_i$ for each cell and relative abundance $\pi_j$ for each gene, constant across cells. Focusing on a single gene $j$, $y_{ij}$ follows a binomial distribution with parameters $n_i$ and $\pi_j$. Assume $\pi_j = 10^{-4}$ and the $n_i$ range from 1000 to 3000, which is consistent with the biological replicates negative control data (Fig. 1 and Additional file 1: Figure S1). Under this assumption, we expect to see about 74–90% zeros, 9–22% ones, and less than 4% values above one. However, notice that after normalization to CPM and log transformation, all the zeros remain $\log_2(1 + 0) = 0$, yet the ones turn into values ranging from $\log_2(1 + 1/3000 \times 10^6) = \log_2(334) \approx 8.4$ to $\log_2(1001) \approx 10$. The few values that are 2 will have values ranging from $\log_2(668) \approx 9.4$ to $\log_2(2001) \approx 11$.
The large, artificial gap between zero and nonzero val-
ues makes the log-normalized data appear zero-inflated
Fig. 2 Example of how current approaches to normalization and transformation artificially distort differences between zero and nonzero counts. a UMI count distribution for gene ENSG00000114391 in the monocytes biological replicates negative control dataset. b Counts per million (CPM) distribution for the exact same count data. c Distribution of $\log_2(1 + \mathrm{CPM})$ values for the exact same count data
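The arithmetic in this worked example can be checked directly; the snippet below evaluates the binomial probabilities and the log-CPM values under the stated assumptions ($\pi_j = 10^{-4}$, total counts between 1000 and 3000, pseudocount of 1).

```python
import numpy as np
from scipy.stats import binom

pi_j = 1e-4
for n in (1000, 3000):
    p0 = binom.pmf(0, n, pi_j)
    p1 = binom.pmf(1, n, pi_j)
    print(f"n_i={n}: P(zero)={p0:.2f}  P(one)={p1:.2f}  P(>one)={1 - p0 - p1:.3f}")

# Log-CPM values after adding a pseudocount of 1: zeros stay at log2(1 + 0) = 0,
# while single counts jump to roughly 8.4-10 and counts of two to roughly 9.4-11.
for n in (1000, 3000):
    for y in (1, 2):
        cpm = y / n * 1e6
        print(f"n_i={n}, count={y}: log2(1 + CPM) = {np.log2(1 + cpm):.1f}")
```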
