Townes et al. Genome Biology (2019) 20:295
https://doi.org/10.1186/s13059-019-1861-6

METHOD   Open Access
Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model

F. William Townes (1,2), Stephanie C. Hicks (3), Martin J. Aryee (1,4,5,6) and Rafael A. Irizarry (1,7,*)

*Correspondence: rafa@ds.dfci.harvard.edu
1 Department of Biostatistics, Harvard University, Cambridge, MA, USA
7 Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
Full list of author information is available at the end of the article
Abstract
Single-cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have
incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial
sampling with no zero inflation. Current normalization procedures such as log of counts per million and feature
selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial
methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature
selection using deviance. These methods outperform the current practice in a downstream clustering assessment
using ground truth datasets.
Keywords: Gene expression, Single cell, RNA-Seq, Dimension reduction, Variable genes, Principal component
analysis, GLM-PCA
Background
Single-cell RNA-Seq (scRNA-Seq) is a powerful tool for
profiling gene expression patterns in individual cells, facil-
itating a variety of analyses such as identification of novel
cell types [1, 2]. In a typical protocol, single cells are iso-
lated in liquid droplets, and messenger RNA (mRNA) is
captured from each cell, converted to cDNA by reverse
transcriptase (RT), then amplified using polymerase chain
reaction (PCR) [3–5]. Finally, fragments are sequenced,
and expression of a gene in a cell is quantified by the
number of sequencing reads that mapped to that gene
[6]. A crucial difference between scRNA-Seq and tradi-
tional bulk RNA-Seq is the low quantity of mRNA isolated
from individual cells, which requires a larger number of
PCR cycles to produce enough material for sequencing
(bulk RNA-Seq comingles thousands of cells per sam-
ple). For example, the popular 10x Genomics protocol
uses 14 cycles [5]. Thus, many of the reads counted in
scRNA-Seq are duplicates of a single mRNA molecule in
the original cell [7]. Full-length protocols such as SMART-
Seq2 [8] analyze these read counts directly, and several
methods have been developed to facilitate this [9].
However, in many experiments, it is desirable to ana-
lyze larger numbers of cells than possible with full-length
protocols, and isoform-level inference may be unneces-
sary. Under such conditions, it is advantageous to include
unique molecular identifiers (UMIs) which enable com-
putational removal of PCR duplicates [10, 11], producing
UMI counts. Although a zero UMI count is equivalent to a
zero read count, nonzero read counts are larger than their
corresponding UMI counts. In general, all scRNA-Seq
data contain large numbers of zero counts (often > 90%
of the data). Here, we focus on the analysis of scRNA-Seq
data with UMI counts.
Starting from raw counts, a scRNA-Seq data analysis
typically includes normalization, feature selection, and
dimension reduction steps. Normalization seeks to adjust
for differences in experimental conditions between sam-
ples (individual cells), so that these do not confound
true biological differences. For example, the efficiency
of mRNA capture and RT is variable between samples
(technical variation), causing different cells to have differ-
ent total UMI counts, even if the number of molecules
in the original cells is identical. Feature selection refers
to excluding uninformative genes such as those which
exhibit no meaningful biological variation across sam-
ples. Since scRNA-Seq experiments usually examine cells
within a single tissue, only a small fraction of genes are
expected to be informative since many genes are bio-
logically variable only across different tissues. Dimension
reduction aims to embed each cell’s high-dimensional
expression profile into a low-dimensional representation
to facilitate visualization and clustering.
While a plethora of methods [5, 12–15] have been devel-
oped for each of these steps, here, we describe what is
considered to be the standard pipeline [15]. First, raw
counts are normalized by scaling of sample-specific size
factors, followed by log transformation, which attempts to
reduce skewness. Next, feature selection involves identi-
fying the top 500–2000 genes by computing either their
coefficient of variation (highly variable genes [16, 17]) or
average expression level (highly expressed genes) across
all cells [15]. Alternatively, highly dropout genes may be
retained. Principal component analysis (PCA) [19] is
the most popular dimension reduction method (see for
example tutorials for Seurat [17] and Cell Ranger [5]).
PCA compresses each cell's 2000-dimensional expression
profile into, say, a 10-dimensional vector of principal com-
ponent coordinates or latent factors. Prior to PCA, data
are usually centered and scaled so that each gene has
mean 0 and standard deviation 1 (z-score transformation).
Finally, a clustering algorithm can be applied to group cells
with similar representations in the low-dimensional PCA
space.
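For concreteness, the following is a minimal sketch of this standard pipeline in Python using numpy and scikit-learn. The count matrix Y (cells as rows, genes as columns), the use of k-means for the clustering step, and all default parameters are illustrative assumptions rather than a prescription from the text.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def standard_pipeline(Y, n_top_genes=2000, n_pcs=10, n_clusters=8, pseudocount=1.0):
    """Sketch of the standard workflow: CPM -> log -> HVG -> z-score -> PCA -> clustering.

    Y: (cells x genes) matrix of raw UMI counts (hypothetical input).
    """
    # 1. Normalize by cell-specific size factors (counts per million).
    size_factors = Y.sum(axis=1, keepdims=True) / 1e6
    cpm = Y / size_factors

    # 2. Log transform with an arbitrary pseudocount.
    log_cpm = np.log2(pseudocount + cpm)

    # 3. Feature selection: keep the genes with the largest coefficient
    #    of variation of the normalized counts ("highly variable genes").
    cv = cpm.std(axis=0) / (cpm.mean(axis=0) + 1e-12)
    top = np.argsort(cv)[::-1][:n_top_genes]
    X = log_cpm[:, top]

    # 4. Center and scale each gene (z-score transformation).
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

    # 5. PCA to a low-dimensional representation, then cluster the cells.
    factors = PCA(n_components=n_pcs).fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(factors)
    return factors, labels
```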
Despite the appealing simplicity of this standard
pipeline, the characteristics of scRNA-Seq UMI counts
present difficulties at each stage. Many normalization
schemes derived from bulk RNA-Seq cannot compute size
factors stably in the presence of large numbers of zeros
[20]. A numerically stable and popular method is to set
the size factor for each cell as the total counts divided by
$10^6$ (counts per million, CPM). Note that CPM does not
alter zeros, which dominate scRNA-Seq data. Log trans-
formation is not possible for exact zeros, so it is common
practice to add a small pseudocount such as 1 to all nor-
malized counts prior to taking the log. The choice of pseu-
docount is arbitrary and can introduce subtle biases in the
transformed data [21]. For a statistical interpretation of
the pseudocount, see the "Methods" section. Similarly, the
use of highly variable genes for feature selection is some-
what arbitrary since the observed variability will depend
on the pseudocount: pseudocounts close to zero arbitrar-
ily increase the variance of genes with zero counts. Finally,
PCA implicitly relies on Euclidean geometry, which may
not be appropriate for highly sparse, discrete, and skewed
data, even after normalizations and transformations [22].
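A toy calculation (hypothetical numbers, not drawn from any dataset discussed here) makes the pseudocount dependence concrete: for a gene that is zero in half the cells, the log-scale variance used to rank "highly variable" genes grows as the pseudocount shrinks.

```python
import numpy as np

# Hypothetical gene: zero in half the cells and a single count in the other half,
# with every cell at 2,000 total UMIs, so the CPM values are 0 or 500.
cpm = np.array([0.0] * 50 + [500.0] * 50)

for c in (1.0, 0.1, 0.01):
    log_vals = np.log2(c + cpm)
    print(f"pseudocount={c:<4}  var of log2(c + CPM) = {log_vals.var():.1f}")

# Smaller pseudocounts stretch the gap between zero and nonzero cells,
# inflating the apparent variance used to rank "highly variable" genes.
```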
Widely used methods for the analysis of scRNA-Seq
lack statistically rigorous justification based on a plausible data-generating mechanism for UMI counts.
Instead, it appears many of the techniques have been bor-
rowed from the data analysis pipelines developed for read
counts, especially those based on bulk RNA-Seq [23]. For
example, models based on the lognormal distribution can-
not account for exact zeros, motivating the development
of zero-inflated lognormal models for scRNA-Seq read
counts [24–27]. Alternatively, ZINB-WaVE uses a zero-
inflated negative binomial model for dimension reduction
of read counts [28]. However, as shown below, the sam-
pling distribution of UMI counts is not zero inflated [29]
and differs markedly from read counts, so application of
read count models to UMI counts needs either theoretical
or empirical justification.
We present a unifying statistical foundation for scRNA-
Seq with UMI counts based on the multinomial dis-
tribution. The multinomial model adequately describes
negative control data, and there is no need to model zero
inflation. We show the mechanism by which PCA on
log-normalized UMI counts can lead to distorted low-
dimensional factors and false discoveries. We identify the
source of the frequently observed and undesirable fact
that the fraction of zeros reported in each cell drives the
first principal component in most experiments [30]. To
remove these distortions, we propose the use of GLM-
PCA, a generalization of PCA to exponential family like-
lihoods [31]. GLM-PCA operates on raw counts, avoiding
the pitfalls of normalization. We also demonstrate that
applying PCA to deviance or Pearson residuals provides
a useful and fast approximation to GLM-PCA. We pro-
vide a closed-form deviance statistic as a feature selection
method. We systematically compare the performance of
all combinations of methods using ground truth datasets
and assessment procedures from [15]. We conclude by
suggesting best practices.
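As an illustration of the deviance idea, the sketch below ranks genes by a binomial deviance against a null model of constant relative abundance across cells. It is a plausible rendering of the statistic described here, not a verbatim copy of the paper's closed-form expression (which appears in its Methods, outside this excerpt); Y is again a hypothetical cells-by-genes UMI count matrix.

```python
import numpy as np

def binomial_deviance(Y):
    """Per-gene deviance against a null model of constant relative abundance.

    Y: (cells x genes) raw UMI counts. A larger deviance means the gene departs
    more from the constant-abundance null, i.e., it is more informative.
    Sketch only; see the paper's Methods for the exact closed-form statistic.
    """
    n = Y.sum(axis=1, keepdims=True)      # total UMIs per cell, n_i
    pi_hat = Y.sum(axis=0) / n.sum()      # pooled relative abundance per gene
    mu = n * pi_hat                       # expected counts under the null

    with np.errstate(divide="ignore", invalid="ignore"):
        term1 = Y * np.log(Y / mu)
        term2 = (n - Y) * np.log((n - Y) / (n - mu))
    # Convention: 0 * log(0) = 0.
    term1 = np.where(Y == 0, 0.0, term1)
    term2 = np.where(n - Y == 0, 0.0, term2)
    return 2.0 * (term1 + term2).sum(axis=0)

# Usage: keep the top 2000 genes by deviance.
# top_genes = np.argsort(binomial_deviance(Y))[::-1][:2000]
```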
Results and discussion
Datasets
We used 9 public UMI count datasets to benchmark our
methods (Table 1). The first dataset was a highly con-
trolled experiment specifically designed to understand the
technical variability. No actual cells were used to gener-
ate this dataset. Instead, each droplet received the same
ratio of 92 synthetic spike-in RNA molecules from Exter-
nal RNA Controls Consortium (ERCC). We refer to this
dataset as the technical replicates negative control as there
is no biological variability whatsoever, and in principle,
each expression profile should be the same.
The second and third datasets contained cells from
homogeneous populations purified using fluorescence-
activated cell sorting (FACS). We refer to these datasets
as biological replicates negative controls. Because these
cells were all the same type, we did not expect
to observe any significant differences in unsupervised
analysis. The 10× Zheng monocytes data had low total UMI counts, while the SMARTer Tung data had high counts.

Table 1 Single cell RNA-Seq datasets used

Number  Author       Tissue     Cells   MTU     Notes
1       Zheng [5]    ERCC       1015    11,125  Spike-in only; technical negative control
2       Zheng [5]    Monocytes  2612    782     1 cell type; biological negative control
3       Tung [32]    iPSCs      57      24,170  1 cell type; biological negative control
4       Duo [15]     PBMCs      3994    1215    4 equal clusters of FACS-purified cells
5       Duo [15]     PBMCs      3994    1298    8 equal clusters of FACS-purified cells
6       Haber [33]   Intestine  533     3755    Authors computationally identified 12 types
7       Muraro [34]  Pancreas   2282    18,795  Authors computationally identified 9 types
8       Zheng [5]    PBMCs      68,579  1292    Benchmarking computational speed

Species: all H. sapiens except Haber (M. musculus). Protocols: all 10× except Muraro (CEL-Seq2) and Tung (SMARTer). MTU, median total UMI count; iPSCs, induced pluripotent stem cells.

The fourth and fifth datasets were created by [15].
The authors allocated FACS-purified peripheral blood
mononuclear cells (PBMCs) from 10× data [5] equally
into four (Zheng 4eq dataset) and eight (Zheng 8eq
dataset) clusters, respectively. In these positive control
datasets, the cluster identity of all cells was assigned inde-
pendently of gene expression (using FACS), so they served
as the ground truth labels.
The sixth and seventh datasets contained a wider variety
of cell types. However, the cluster identities were deter-
mined computationally by the original authors’ unsuper-
vised analyses and could not serve as a ground truth. The
10× Haber intestinal dataset had low total UMI counts,
while the CEL-Seq2 Muraro pancreas dataset had high
counts.
The final Zheng dataset consisted of a larger number
of unsorted PBMCs and was used to compare computa-
tional speed of different dimension reduction algorithms.
We refer to it as the PBMC 68K dataset.
UMI count distribution differs from reads
To illustrate the marked difference between UMI count
distributions and read count distributions, we created
histograms from individual genes and spike-ins of the
negative control data. Here, the UMI counts are the com-
putationally de-duplicated versions of the read counts;
both measurements are from the same experiment,
so no differences are due to technical or biological
variation. The results suggest that while read counts
appear zero-inflated and multimodal, UMI counts fol-
low a discrete distribution with no zero inflation
(Additional file 1: Figure S1). The apparent zero inflation
in read counts is a result of PCR duplicates.
Multinomial sampling distribution for UMI counts
Consider a single cell $i$ containing $t_i$ total mRNA transcripts. Let $n_i$ be the total number of UMIs for the same cell. When the cell is processed by a scRNA-Seq protocol, it is lysed, then some fraction of the transcripts are captured by beads within the droplets. A series of complex biochemical reactions occur, including attachment of barcodes and UMIs, and reverse transcription of the captured mRNA to a cDNA molecule. Finally, the cDNA is sequenced, and PCR duplicates are removed to generate the UMI counts [5]. In each of these stages, some fraction of the molecules from the previous stage are lost [5, 7, 32]. In particular, reverse transcriptase is an inefficient and error-prone enzyme [35]. Therefore, the number of UMI counts representing the cell is much less than the number of transcripts in the original cell ($n_i \ll t_i$). Specifically, $n_i$ typically ranges from 1000 to 10,000, while $t_i$ is estimated to be approximately 200,000 for a typical mammalian cell [36]. Furthermore, which molecules are selected and successfully become UMIs is a random process. Let $x_{ij}$ be the true number of mRNA transcripts of gene $j$ in cell $i$, and $y_{ij}$ be the UMI count for the same gene and cell. We define the relative abundance $\pi_{ij}$ as the true number of mRNA transcripts represented by gene $j$ in cell $i$ divided by the total number of mRNA transcripts in cell $i$. Relative abundance is given by $\pi_{ij} = x_{ij}/t_i$, where total transcripts $t_i = \sum_j x_{ij}$. Since $n_i \ll t_i$, there is a "competition to be counted" [37]; genes with large relative abundance $\pi_{ij}$ in the original cell are more likely to have nonzero UMI counts, but genes with small relative abundances may be observed with UMI counts of exact zeros. The UMI counts $y_{ij}$ are a multinomial sample of the true biological counts $x_{ij}$, containing only relative information about expression patterns in the cell [37, 38].
The multinomial distribution can be approximated
by independent Poisson distributions and overdispersed
(Dirichlet) multinomials by independent negative bino-
mial distributions. These approximations are useful for
computational tractability. Details are provided in the
"Methods" section.
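A short simulation sketch (with arbitrary, illustrative parameters) shows the sense in which the Poisson approximation is adequate: per-cell multinomial draws and independent Poisson draws with matched means $\mu_{ij} = n_i \pi_j$ produce nearly identical zero fractions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_cells, n_genes = 2000, 500
pi = rng.dirichlet(np.full(n_genes, 0.05))        # relative abundances, sum to 1
n_umi = rng.integers(1000, 3000, size=n_cells)    # total UMIs per cell

# Multinomial sampling of UMI counts within each cell.
Y_mult = np.vstack([rng.multinomial(n, pi) for n in n_umi])

# Independent Poisson approximation with matched means mu_ij = n_i * pi_j.
Y_pois = rng.poisson(n_umi[:, None] * pi[None, :])

print("zero fraction (multinomial):", (Y_mult == 0).mean())
print("zero fraction (Poisson):    ", (Y_pois == 0).mean())
```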
The multinomial model makes two predictions which
we verified using negative control data. First, the fraction
of zeros in a sample (cell or droplet) is inversely related
to the total number of UMIs in that sample. Second,
the probability of an endogenous gene or ERCC spike-in
having zero counts is a decreasing function of its mean
expression (equations provided in the "Methods" section).
Both of these predictions were validated by the negative
control data (Fig. 1). In particular, the empirical probabil-
ity of a gene being zero across droplets was well calibrated
to the theoretical prediction based on the multinomial
model. This also demonstrates that UMI counts are not
zero inflated, consistent with [29].
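The exact equations are given in the "Methods" section; as a reference point, under the multinomial model the marginal count for gene $j$ in cell $i$ is binomial with size $n_i$ and probability $\pi_{ij}$, so the zero probability takes the form below (a restatement under the model as described here, not a quotation of the paper's equations).

```latex
% Zero probability for gene j in cell i with n_i total UMIs: the multinomial
% marginal is binomial(n_i, pi_ij), and the Poisson approximation follows.
% Both are decreasing in n_i and in the expected count n_i * pi_ij.
\Pr(y_{ij} = 0) \;=\; (1 - \pi_{ij})^{n_i} \;\approx\; e^{-n_i \pi_{ij}}
```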
To further validate the multinomial model, we assessed
goodness-of-fit of seven possible null distributions to
both the Tung and Zheng monocytes negative control
datasets (Additional file 1: Figure S2). When applied to
UMI counts, the multinomial, Dirichlet-multinomial, and
Poisson (as approximation to multinomial) distributions
fit best. When applied to read counts, the zero-inflated
lognormal was the best fitting distribution followed by the
Dirichlet-multinomial.
These results are consistent with [39], which also found
that the relationship between average expression and zero
probability follows the theoretical curve predicted by a
Poisson model using negative control data processed with
Indrop [4] and Dropseq [3] protocols. These are droplet
protocols with typically low counts. It has been argued
that the Poisson model is insufficient to describe the sam-
pling distribution of genes with high counts and the neg-
ative binomial model is more appropriate [11]. The Tung
dataset contained high counts, and we nevertheless found
the Poisson gave a better fit than the negative binomial.
However, the difference was not dramatic, so our results
do not preclude the negative binomial as a reasonable
sampling distribution for UMI counts. Taken together, these results suggest our data-generating mechanism is an accurate model of technical noise in real data.

Fig. 1 Multinomial model adequately characterizes sampling distributions of technical and biological replicates negative control data. a Fraction of zeros is plotted against the total number of UMIs in each droplet for the technical replicates. b As a, but for cells in the biological replicates (monocytes). c After down-sampling replicates to 10,000 UMIs per droplet to remove variability due to the differences in sequencing depth, the fraction of zeros is computed for each gene and plotted against the log of expression across all samples for the technical replicates data. The solid curve is the theoretical probability of observing a zero as a function of the expected counts derived from the multinomial model (blue) and its Poisson approximation (green). d As c, but for the biological replicates (monocytes) dataset and after down-sampling to 575 UMIs per cell. Here, we also add the theoretical probability derived from a negative binomial model (red)
Normalization and log transformation distort UMI data
Standard scRNA-Seq analysis involves normalizing raw
counts using size factors, applying a log transformation
with a pseudocount, and then centering and scaling each
gene before dimension reduction. The most popular nor-
malization is counts per million (CPM). The CPM are
defined as $(y_{ij}/n_i) \times 10^6$ (i.e., the size factor is $n_i/10^6$). This is equivalent to the maximum likelihood estimator (MLE) for relative abundance $\hat\pi_{ij}$ multiplied by $10^6$. The log-CPM are then $\log_2(c + \hat\pi_{ij} \cdot 10^6) = \log_2(\tilde\pi_{ij}) + C$, where $\tilde\pi_{ij}$ is a maximum a posteriori (MAP) estimator for $\pi_{ij}$ (mathematical justification and interpretation of this approach are provided in the "Methods" section). The additive constant $C$ is irrelevant if data are centered for each gene after log transformation, as is common practice. Thus, normalization of raw counts is equivalent to using MLEs or MAP estimators of the relative abundances.
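In code, the point that CPM is simply a rescaled MLE is immediate (a minimal numpy sketch; Y is a hypothetical cells-by-genes UMI count matrix and c is the pseudocount).

```python
import numpy as np

def log_cpm(Y, c=1.0):
    """log2(c + CPM) for a hypothetical (cells x genes) UMI count matrix Y."""
    n = Y.sum(axis=1, keepdims=True)   # total UMIs per cell, n_i
    pi_mle = Y / n                     # MLE of relative abundance, pi_hat = y / n
    cpm = pi_mle * 1e6                 # CPM is the MLE rescaled by 10^6
    return np.log2(c + cpm)            # c is the arbitrary pseudocount
```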
Log transformation of MLEs is not possible for UMI
counts due to exact zeros, while log transformation of
MAP estimators of $\pi_{ij}$ systematically distorts the differences between zero and nonzero UMI counts, depending on the arbitrary pseudocount $c$ (derivations provided in the "Methods" section). To illustrate this phenomenon,
we examined the distribution of an illustrative gene
before and after the log transform with varying normal-
izations using the biological replicates negative control
data (Fig. 2). Consistent with our theoretical predictions,
this artificially caused the distribution to appear zero
inflated and exaggerated differences between cells based
on whether the count was zero or nonzero.
Focusing on the entire negative control datasets, we
applied PCA to log-CPM values. We observed a strong
correlation (r = 0.8 for technical and r = 0.98 for
monocytes biological replicates) between the first princi-
pal component (PC) and the fraction of zeros, consistent
with [30]. Application of PCA to CPM values without log
transform reduced this correlation to r = 0.1 for tech-
nical and r = 0.7 for monocytes biological replicates.
Additionally, the first PC of log-CPM correlated with the
log of total UMI, which is consistent with the multinomial
model (Fig. 3). Note that in datasets with strong biological
variability, the nuisance variation from zero fraction and
total counts could appear in secondary PCs rather than
the first PC, but it would still confound downstream anal-
yses. Based on these results, the log transformation is not
necessary and in fact detrimental for the analysis of UMI
counts. The benefits of avoiding normalization by instead
directly modeling raw counts have been demonstrated in
the context of differential expression [40]. Where normal-
ization is unavoidable, we propose the use of approximate
multinomial deviance residuals (defined in the "Residuals and z-scores" section) instead of log-transformed CPM.
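This diagnostic is straightforward to reproduce on any UMI count matrix; a self-contained sketch (with Y again a hypothetical cells-by-genes matrix) computes the correlation between the first PC of z-scored log-CPM and each cell's fraction of zeros.

```python
import numpy as np
from sklearn.decomposition import PCA

def pc1_vs_zero_fraction(Y, c=1.0):
    """Correlation between the first PC of z-scored log-CPM and the per-cell
    fraction of zeros. Y is a hypothetical (cells x genes) UMI count matrix."""
    n = Y.sum(axis=1, keepdims=True)
    log_cpm = np.log2(c + (Y / n) * 1e6)
    z = (log_cpm - log_cpm.mean(axis=0)) / (log_cpm.std(axis=0) + 1e-12)
    pc1 = PCA(n_components=2).fit_transform(z)[:, 0]
    zero_frac = (Y == 0).mean(axis=1)
    return np.corrcoef(pc1, zero_frac)[0, 1]
```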
Zero inflation is an artifact of log normalization
To see how normalization and log transformation introduce the appearance of zero inflation, consider the following example. Let $y_{ij}$ be the observed UMI counts following a multinomial distribution with size $n_i$ for each cell and relative abundance $\pi_j$ for each gene, constant across cells. Focusing on a single gene $j$, $y_{ij}$ follows a binomial distribution with parameters $n_i$ and $\pi_j$. Assume $\pi_j = 10^{-4}$ and the $n_i$ range from 1000 to 3000, which is consistent with the biological replicates negative control data (Fig. 1 and Additional file 1: Figure S1). Under this assumption, we expect to see about 74–90% zeros, 9–22% ones, and less than 4% values above one. However, notice that after normalization to CPM and log transformation, all the zeros remain $\log_2(1 + 0) = 0$, yet the ones turn into values ranging from $\log_2(1 + 1/3000 \times 10^6) = \log_2(334) \approx 8.4$ to $\log_2(1001) \approx 10$. The few values that are 2 will have values ranging from $\log_2(668) \approx 9.4$ to $\log_2(2001) \approx 11$.
The large, artificial gap between zero and nonzero val-
ues makes the log-normalized data appear zero-inflated
Fig. 2 Example of how current approaches to normalization and transformation artificially distort differences between zero and nonzero counts. a UMI count distribution for gene ENSG00000114391 in the monocytes biological replicates negative control dataset. b Counts per million (CPM) distribution for the exact same count data. c Distribution of $\log_2(1 + \mathrm{CPM})$ values for the exact same count data
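The arithmetic in this worked example can be checked directly; the snippet below evaluates the binomial probabilities and the log-CPM values under the stated assumptions ($\pi_j = 10^{-4}$, total counts between 1000 and 3000, pseudocount of 1).

```python
import numpy as np
from scipy.stats import binom

pi_j = 1e-4
for n in (1000, 3000):
    p0 = binom.pmf(0, n, pi_j)
    p1 = binom.pmf(1, n, pi_j)
    print(f"n_i={n}: P(zero)={p0:.2f}  P(one)={p1:.2f}  P(>one)={1 - p0 - p1:.3f}")

# Log-CPM values after adding a pseudocount of 1: zeros stay at log2(1 + 0) = 0,
# while single counts jump to roughly 8.4-10 and counts of two to roughly 9.4-11.
for n in (1000, 3000):
    for y in (1, 2):
        cpm = y / n * 1e6
        print(f"n_i={n}, count={y}: log2(1 + CPM) = {np.log2(1 + cpm):.1f}")
```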
