scispace - formally typeset
Search or ask a question
Posted ContentDOI

Normalization of Single Cell RNA Sequencing Data Using both Control and Target Genes

22 Mar 2016-bioRxiv (Cold Spring Harbor Laboratory)-pp 045070
TL;DR: This work develops an alternative statistical method, which it refers to as scPLS, for more accurate inference of confounding effects, based on partial least squares and models control and target genes jointly to better infer and control for confounding effects.
Abstract: Single cell RNA sequencing (scRNAseq) technique is becoming increasingly popular for unbiased and high-resolutional transcriptome analysis of heterogeneous cell populations. Despite its many advantages, scRNAseq, like any other genomic sequencing technique, is susceptible to the influence of confounding effects. Controlling for confounding effects in scRNAseq data is thus a crucial step for proper data normalization and accurate downstream analysis. Several recent methodological studies have demonstrated the use of control genes for controlling for confounding effects in scRNAseq studies; the control genes are used to infer the confounding effects, which are then used to normalize target genes of primary interest. However, these methods can be suboptimal as they ignore the rich information contained in the target genes. Here, we develop an alternative statistical method, which we refer to as scPLS, for more accurate inference of confounding effects. Our method is based on partial least squares and models control and target genes jointly to better infer and control for confounding effects. To accompany our method, we develop a novel expectation maximization algorithm for scalable inference. Our algorithm is an order of magnitude faster than standard ones, making scPLS applicable to hundreds of cells and hundreds of thousands of genes. With extensive simulations and comparisons with other methods, we demonstrate the effectiveness of scPLS. We apply scPLS to analyze three scRNAseq data sets to further illustrate its benefits in removing technical confounding effects as well as for removing cell cycle effects.

Summary (2 min read)

Target Genes

  • Single cell RNA sequencing technique is becoming increasingly popular for unbiased and high-resolutional transcriptome analysis of heterogeneous cell populations.
  • Finally, the authors apply scPLS to analyze two scRNAseq data sets to illustrate its benefits in removing technical confounding effects as well as for removing cell cycle effects.
  • These hidden confounding factors can cause systematic bias, are notoriously difficult to control for, and are the focus of the present study.
  • In the Simulations section the authors present comparisons between scPLS and several existing methods using simulations.

Review of Previous Methods

  • Many statistical methods have been developed in sequencing- and array-based genomic studies to infer hidden confounding factors and control for hidden confounding effects.
  • The application-specific methods become inconvenient in cases where there are multiple variables of interest (e.g. in eQTL mapping problems).
  • In contrast, the second subcategory of unsupervised methods are recently developed to take advantage of a set of control genes for inferring the confounding factors29,37.
  • Similarly, most scRNAseq studies include a set of control genes that are known to have varying expression levels across cell cycles.
  • The two subcategories of unsupervised methods use different strategies to infer the confounding factors.

Results

  • The authors provide modeling details for scPLS here.
  • Consistent with the clustering performance comparison, the authors found that scPLS also yielded more accurate proportion of variance estimates (Fig. 2b).
  • In the target genes, the confounding factors and structured biological factors explain a median of 18% and 30% of gene expression variance, respectively.
  • The results are shown in Table 1 and are overall consistent with the simulations.
  • To demonstrate its effectiveness there, the authors applied scPLS and several other methods to a second dataset that was used for demonstrating cell cycle influence37.

Discussion

  • The authors have presented scPLS for removing hidden confounding effects in scRNAseq studies.
  • Importantly, the performance of scPLS is robust to the number of genes included in the control set and yields comparable results even when a much smaller number of control genes is used.
  • In fact, low-rank factors inferred from many data sets using standard factor models have been linked to important biological pathways or transcription factors42–46.
  • The authors have been mainly focused on comparing the performance of different confounding effects removing methods by evaluating the clustering performance as the target downstream analysis.
  • Like many other methods for scRNAseq21 or bulk58,59 RNAseq studies, scPLS requires a data transformation step that converts the count data into quantitative expression data.

Methods

  • The authors list the EM algorithm below, with detailed derivation provided later.
  • The naive EM algorithm is computationally expensive: it scales quadratically with the number of genes and linearly with the number of cells/samples.
  • Therefore, the authors apply the EM-in-chunks algorithm with chunk size 500 throughout the rest of the paper.
  • In the E step, the authors calculate the expectation of the log likelihood function for complete data.
  • Ten replicates were performed for each setting on an Intel Xeon E5-2670 2.6 GHz CPU.

Author Contributions

  • The authors declare that they have no competing interests.
  • Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations, also known as Publisher's note.
  • This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
  • If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Did you find this useful? Give us your feedback

Content maybe subject to copyright    Report

1
Scientific RepoRts | 7: 13587 | DOI:10.1038/s41598-017-13665-w
www.nature.com/scientificreports
Controlling for Confounding Eects
in Single Cell RNA Sequencing
Studies Using both Control and
Target Genes
Mengjie Chen
1,2
& Xiang Zhou
3,4
Single cell RNA sequencing (scRNAseq) technique is becoming increasingly popular for unbiased
and high-resolutional transcriptome analysis of heterogeneous cell populations. Despite its many
advantages, scRNAseq, like any other genomic sequencing technique, is susceptible to the inuence
of confounding eects. Controlling for confounding eects in scRNAseq data is a crucial step for
accurate downstream analysis. Here, we present a novel statistical method, which we refer to as scPLS
(single cell partial least squares), for robust and accurate inference of confounding eects. scPLS takes
advantage of the fact that genes in a scRNAseq study often can be naturally classied into two sets: a
control set of genes that are free of eects of the predictor variables and a target set of genes that are
of primary interest. By modeling the two sets of genes jointly using the partial least squares regression,
scPLS is capable of making full use of the data to improve the inference of confounding eects. With
extensive simulations and comparisons with other methods, we demonstrate the eectiveness of
scPLS. Finally, we apply scPLS to analyze two scRNAseq data sets to illustrate its benets in removing
technical confounding eects as well as for removing cell cycle eects.
Single-cell RNA sequencing (scRNAseq) has emerged as a powerful tool in genomics. While the traditional RNA
sequencing, known as the bulk RNAseq, measures gene expression levels averaged across many dierent cells in
a sample of potentially heterogeneous cell population, scRNAseq can measure gene expression levels directly at
the single cell resolution. As a result, scRNAseq is less inuenced by the variation of cell type and cell composition
across dierent samples–a major confounding in the analyses of bulk RNAseq studies. Because of this benet and
its high resolution, scRNAseq provides unprecedented insights into many basic biological questions that are pre-
viously dicult to address. For example, scRNAseq has been applied to classify novel cell subtypes
1,2
and cellular
states
3,4
, reconstruct cell lineage and quantify progressive gene expression during development
58
, perform spatial
mapping and re-localization
9,10
, identify dierentially expressed genes and gene expression modulars
1113
, and
investigate the genetic basis of gene expression variation by detecting heterogenic allelic specic expressions
14,15
.
Like any other genomic sequencing experiment, scRNAseq studies are inuenced by many factors that can
introduce unwanted variation in the sequencing data and confound the down-stream analysis
16
. However, such
unwanted variation are oen exacerbated in scRNAseq experiments due to a range of scRNAseq specic con-
ditions that include amplication bias, low amount of input material and low transcript capture eciency
17
;
dropout events that are driven by both biological and technical factors
18,19
; global changes in expression due to
transcriptional bursts
20
; as well as changes in cell cycle and cell size
21
. Indeed, adjusting for confounding fac-
tors in scRNAseq data has been shown to be crucial for accurate estimation of gene expression levels and suc-
cessful down-stream analysis
1618,22,23
. However, depending on the source, adjusting for confounding factors in
scRNAseq can be non-trivial. Some confounding eects, such as read sampling noise and drop-out events, are
direct consequences of low sequencing-depth, which are random in nature and can be readily addressed by prob-
abilistic modeling using existing statistical methods
18,2225
. Other confounding eects are inherent to a particular
1
Department of Medicine, University of Chicago, Chicago, IL 60637, USA.
2
Department of Human Genetics,
University of Chicago, Chicago, IL 60637, USA.
3
Department of Biostatistics, University of Michigan, Ann Arbor, MI
48109, USA.
4
Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA. Correspondence and
requests for materials should be addressed to M.C. (email: mengjiechen@uchicago.edu) or X.Z. (email: xzhousph@
umich.edu)
Received: 18 May 2017
Accepted: 29 September 2017
Published: xx xx xxxx
OPEN

www.nature.com/scientificreports/
2
Scientific RepoRts | 7: 13587 | DOI:10.1038/s41598-017-13665-w
experimental protocol and can cause amplication bias, but can be easily mitigated by using new protocols
26
. Yet
other confounding eects are due to observable batches and can be adjusted for by including batch labels and
technician ids as covariates or dealt with other statistical methods
27,28
. However, many confounding factors are
hidden and are dicult or even impossible to measure. Common hidden confounding factors include various
technical artifacts during library preparation and sequencing, and unwanted biological confounders such as cell
cycle status. ese hidden confounding factors can cause systematic bias, are notoriously dicult to control for,
and are the focus of the present study.
To eectively infer and control for hidden confounding factors in scRNAseq studies, we develop a novel sta-
tistical method, which we refer to as scPLS (single cell partial least squares). scPLS takes advantage of the fact
that genes in a scRNAseq study can oen be naturally classied into two sets: a control set of genes that are free of
eects of the predictor variables and a target set of genes that are of primary interest. By modeling the two sets of
genes jointly using the partial least squares regression, scPLS is capable of making full use of the data to improve
the inference of confounding factors. scPLS is closely related to and bridges between two existing subcategories
of methods for transcriptome analysis: a subcategory of methods that treat control and target genes in the same
fashion (e.g. PCA
2932
and LMM
3335
), and another subcategory of methods that use control genes alone for infer-
ring confounding factors (e.g. RUV
29,36
and scLVM
37
). By bridging between the two subcategories of methods,
scPLS enjoys robust performance across a range of application scenarios. scPLS is also computationally ecient:
with a new block-wise expectation maximization (EM) algorithm, it is scalable to thousands of cells and tens of
thousands of genes. Using simulations and two real data applications, we show how scPLS can be used to remove
confounding eects and enable accurate down-stream analysis in scRNAseq studies. Our method is implemented
as a part of the Citrus project and is freely available at: http://chenmengjie.github.io/Citrus/.
e paper is organized as follows. In the Review of Previous Methods section, we provide a brief review of
existing statistical methods for removing confounding eects in transcriptome analysis and describe how scPLS
is related to and motivated from these methods. In the Method Overview Section, we provide a methodologi-
cal description of the scPLS model, with inference details provided in the Methods Section. In the Simulations
section we present comparisons between scPLS and several existing methods using simulations. In Real Data
Applications section, we apply scPLS to two real scRNAseq data sets to remove technical confounding eects or
cell cycle eects. Finally, we conclude the paper with a summary and discussion.
Review of Previous Methods
Many statistical methods have been developed in sequencing- and array-based genomic studies to infer hidden
confounding factors and control for hidden confounding eects. Based on their targeted application, these statis-
tical methods can be generally classied into two categories.
e rst category of methods are supervised and application-specic: these methods are designed to infer
the confounding factors in the presence of a known predictor variable, and to correct for the confounding eects
without removing the eects of the predictor variable. For example, scientists are oen interested in identifying
genes that are dierentially expressed between two pre-determined treatment conditions or that are associated
with a measured predictor variable of interest. To remove the confounding eects in these applications, meth-
ods, include SVA
30
, sparse regression models
38,39
, and, more recently, RUV
40,41
, are developed. Although these
application-specic methods are widely applied in many genomics studies, their usage is naturally restricted to
cases where the primary variable of interest is known. e application-specic methods become inconvenient in
cases where there are multiple variables of interest (e.g. in eQTL mapping problems). ey also become inappli-
cable when the primary variable of interest is not observed (e.g. in clustering problems).
e second category of methods are unsupervised, and are designed to infer the confounding factors without
knowing or using the predictor variable of interest. Our scPLS belongs to this category. Notable applications of
unsupervised methods in scRNAseq studies include cell type clustering and classication
18
. Existing unsuper-
vised statistical methods can be further classied into two subcategories. e rst subcategory of methods treat
all genes in the same fashion and use all of them to infer the confounding factors. For example, the principal
component analysis (PCA) or the factor model extracts the principal components or factors from all genes (or all
highly variable genes) as surrogates for the confounding factors
2932
. e inferred factors are treated as covariates
whose eects are further removed from gene expression levels before downstream analyses. Similarly, the linear
mixed models (LMMs) construct a sample relatedness matrix based on all genes to capture the inuence of the
confounding factors
3335
. e relatedness matrix are then included in the downstream analyses to control for the
confounding eects. In contrast, the second subcategory of unsupervised methods are recently developed to take
advantage of a set of control genes for inferring the confounding factors
29,37
. ese methods divide genes into
two sets: a control set of genes that are known to be free of eects of interest a priori and a target set of genes that
are of primary interest. Unlike the rst subcategory, the second subcategory of methods treat the two gene sets
dierently in inferring the confounding factors: the confounding factors are only inferred from the control set,
and are then used to remove the confounding eects in the target genes for subsequent downstream analysis. For
example, scRNAseq studies oen add ERCC spike-in controls prior to the PCR amplication and sequencing
steps. e spike-in controls can be used to capture the hidden confounding technical factors associated with the
experimental procedures, which are further used to remove technical confounding eects (e.g. reverse transcrip-
tion or PCR amplication confounding eects) from the target genes
33
. Similarly, most scRNAseq studies include
a set of control genes that are known to have varying expression levels across cell cycles. ese cell cycle genes
can be used to capture the unmeasured cell cycle status of each cell, which are further used to remove cell cycle
eects in the target genes
37
. Prominent methods in the second subcategory include the unsupervised version of
RUV
29,36
and scLVM
37
.
e two subcategories of unsupervised methods use dierent strategies to infer the confounding factors.
erefore, these two sets of methods are expected to perform well in dierent settings. Specically, the rst

www.nature.com/scientificreports/
3
Scientific RepoRts | 7: 13587 | DOI:10.1038/s41598-017-13665-w
subcategory of methods have the advantage of using information contained in all genes to accurately infer the
confounding eects. However, when the predictor variable of interest inuences a large number of genes, then
this subcategory of methods may incorrectly remove the primary eects of interest. On the other hand, the sec-
ond subcategory of methods infer confounding factors only from the control genes and are thus not prone to
mistakenly removing the primary eects of interest. However, these methods overlook one important fact–that
the hidden confounding factors not only inuence the control genes but also the target genes, i.e. the exact reason
that we need to remove such confounding eects in the rst place. Because the confounding factors inuence
both control and target genes, using control genes alone to infer the confounding factors can be suboptimal as it
fails to use the information from target genes.
To more eectively infer and control for hidden confounding factors in scRNAseq studies, we develop a novel
statistical method, which we refer to as scPLS (single cell partial least squares). scPLS bridges between the two
subcategories of unsupervised methods and eectively includes each as a special case. Like the rst subcategory
of methods, scPLS models both control and target genes jointly to infer the confounding factors. Like the second
subcategory of methods, scPLS is capable of taking advantage of a control set to guild the inference of confound-
ing factors. scPLS builds upon the partial least squares regression model and relies on a key modeling assumption
that only target genes contain the primary eects of interest or other systematic biological variations. By incor-
porating such systematic variations in the target genes only, we can jointly model both control and target genes
to infer the confounding eects while avoiding mis-removing the primary eects of interest. erefore, scPLS
has the potential to make full use of the data to improve the inference of confounding factors and the removal of
confounding eects.
Results
scPLS Method Overview. We provide modeling details for scPLS here. While the formulation of scPLS is
general, we focus on its application in scRNAseq. e scRNAseq data resembles that of the bulk RNAseq data and
consists of a gene expression matrix on n cells and
p q+
genes. We consider dividing the genes into two sets: a con-
trol set that contains q control genes and a target set that contains p genes of primary interest. e control genes are
selected based on the purpose of the analysis. For example, the control set would contain ERCC spike-ins if we want
to remove technical confounding factors, and would contain cell cycle genes if we want to remove cell cycle eects.
We use the following partial least squares regression to jointly model both control and target genes:
ΛεεΨ=+
~
xz,MVN(0,)
(1)
xxixixiii
~yz u ,MVN(0,)
(2)
yuyi yi yi
i
ii
ΛΛ εε Ψ=+ +
where for
i
'th individual cell,
x
i
is a q-vector of expression levels for q control genes;
y
i
is a p-vector of expression
levels for p target genes;
z
i
is
k
z
-vector of unknown confounding factors that aect both control and target genes;
the coecients of the confounding factors are represented by the
q
by
k
z
loading matrix
for the control genes
and the
p
by
k
z
loading matrix
for the target genes;
u
i
is a
k
u
-vector of unknown factors in the target genes and
potentially represents the predictors of interest or other structured variations (see below);
Λ
u
is a
p
by
k
u
loading
matrix;
ε
xi
is a
q
-vector of idiosyncratic error with covariance
diag(, ,)
xi xxq1
22
σσΨ =
;
yi
ε
is a
p
-vector of idiosyn-
cratic error with covariance
diag(, ,)
yi yyp1
22
σσΨ =
; MVN denotes the multivariate normal distribution. We
assume that
ε
xi
,
ε
yi
,
z
i
, and
u
i
are all independent from each other. Following standard latent factor models, we
further assume that
z IMVN(0, )
i
~
and
~u IMVN(0, )
i
. We model transformed data instead of the raw read
counts. We also assume that the expression levels of each gene have been centered to have mean zero, which
allows us to ignore the intercept.
scPLS includes two types of unknown latent factors. e rst set of factors,
z
i
, represents the unknown con-
founding factors that aect both control and target genes. e eects of
z
i
on the control and target genes are
captured in the loading matrices
and
, respectively. We call
z
i
the confounding factors throughout the text,
and we aim to remove the confounding eects
Λ z
y i
from the target genes. e second set of factors,
u
i
, aims to
capture a low dimensional structure of the expression level of
p
target genes. e factors
u
i
can represent the
unknown predictor variables of interest, specic experimental perturbations, cell subpopulations, gene signatures
or other intermediate factors that coordinately regulate a set of genes. erefore, the factors
u
i
can be interpreted
as cell subtypes, treatment status, transcription factors or regulators of biological pathways in dierent stud-
ies
4246
. Although
u
i
could be of direct biological interest in many data sets, we do not explicitly examine the
inferred
u
i
here. Rather, we view modeling
u
i
in the target genes as a way to better capture the complex variance
structure there and to facilitate the precise estimation of confounding factors
z
i
. For simplicity, we call
u
i
the
biological factors throughout the text, though we note that
u
i
could well represent non-biological processes such
as treatment or environmental eects. us, the expression levels of the control genes can be described by a linear
combination of the confounding factors
z
i
and residual errors; the expression levels of the target genes can be
described by a linear combination of the confounding factors
z
i
, the biological factors
u
i
and residual errors. For
both types of confounding factors, we are interested in inferring the factor eects
z
y i
Λ
and
Λ u
u i
rather than the
individual factors
z
i
and
u
i
. erefore, unlike in standard factor models, we are not concerned with the identia-
bility of the factors. Figure1 shows an illustration of scPLS.
scPLS is closely related to the two subcategories of unsupervised methods described in the previous Section.
Specically, without the biological eects term
Λ u
u i
, scPLS eectively reduces to the rst subcategory of methods
that treat all genes in the same fashion for inferring the confounding factors. Without the Equation2 term, scPLS
eectively reduces to the second subcategory of methods that use only control genes for inference. (Note that,
aer inferring the confounding factors
z
i
from Equation1, the second subcategory of methods still use a reduced

www.nature.com/scientificreports/
4
Scientific RepoRts | 7: 13587 | DOI:10.1038/s41598-017-13665-w
version of Equation2 without the biological eects term
Λ u
u i
to remove the confounding eects.) By including
both modeling terms, scPLS can robustly control for confounding eects across a range of scenarios. erefore,
scPLS provides a exible modeling framework that eectively includes the two subcategories of unsupervised
methods as special cases and has the potential to outperform these previous methods.
Simulations. We performed a simulation study to compare scPLS with other methods. Specically, we sim-
ulated gene expression levels for 50 control genes and 1,000 target genes for 200 cells. ese 200 cells come from
two equal-sized groups, representing two treatment conditions or two cell subpopulations. Among the 1,000
target genes, only 200 of them are dierentially expressed (DE) between the two groups and thus represent the
signature of the two groups. e eect sizes of the DE genes were simulated from a normal distribution and we
scaled the eects further so that the group label explains twenty percent of phenotypic variation (PVE) in expres-
sion levels in the DE genes. In addition to the group eects, we set
k k2, 5
zu
==
and simulated each element of
z
i
and
u
i
from a standard normal distribution. We simulated each element of
x
Λ
from
σ−.N(025, )
l
2
and each
element of
from
σ.N(0 25, )
l
2
. Note that
and
were simulated dierently to capture the fact that the eect
sizes of the confounding factors could be dierent for control and target genes. We simulated each element of
Λ
u
from
N(0,)
b
2
σ
. We simulated each element of
ε
xi
and
ε
yi
from a standard normal distribution. We set
σ =.04
l
2
and
σ =.06
b
2
to ensure that, in non-DE genes, the confounding factors
z
i
explain 20% PVE in either the control or the
target genes; the biological factors
u
i
explain 30% PVE of the target genes; and the residual errors to explain the
rest of PVE. To vary signal strength in the data, we also created a series of sub data sets by varying the number of
non DE genes in the data, so that the proportion of variance explained by DE genes in total equal to a xed per-
centage (PDE, in the range of 20–100%, with 10% increments). Aer we simulated gene expression levels, we
further converted these continuous values into count data by using a Poisson distribution: the nal observation
for ith cell and jth gene
c
ij
is from
µ +~c NwPoi( exp( ))
ij ij
, with
w
ij
being the continuous gene expression levels
simulated above and
N 500000, log(10/500000)µ==
, which ensures an average read count of 10. Note that,
because of the residual errors, the resulting count data are over-dispersed with respect to a Poisson distribution.
We considered three dierent simulation scenarios. In scenario I, the confounding factors
z
i
are independent
of group labels. In scenario II, the confounding factors are correlated with group labels. To simulate correlated
data, we simulated each element of
z
i
from
N(0,1)
if the corresponding sample belongs to the rst group, but
from
−.N(025,1)
if the corresponding sample belongs to the second group. Finally, we also considered a scenario
III where there is no biological factor (i.e. data were simulated eectively under the PCA modeling assumption
and all genes could be used to infer the confounding factors). We performed 10 simulation replicates for each
scenario. For scenario I and II, we further introduced dropout events that are commonly observed in scRNAseq
data. is was done by going through one gene at a time and setting the expression level for
j
th gene
c
ij
to zero
with probability
ij
π
that depends on the expression level through
l
og c
ij
1
ij
ij
=
π
π
.
We compared scPLS to four dierent methods: (1) PCA and (2) LMM (implemented in GEMMA
47,48
) use all
genes to infer the confounding eects; while (3) RUVseq (version 1.2.0); which we simply refer to as RUV in the
following text) and (4) scLVM (version 0.99.1) use only control genes to infer the confounding eects. We note
that while some of these methods are developed not specically for scRNAseq, these methods represent a range
of strategies to deal with confounding factors. We used default settings in all the above methods. We used the
Figure 1. Illustration of scPLS. We model the expression level of genes in the control set
X()
and genes in the
target set
Y()
jointly. Both control and target genes are aected by the common confounding factors (Z) with
eects
and
in the two gene sets, respectively. e target genes are also inuenced by biological factors (U)
with eects
u
Λ
. e biological factors represent intermediate factors that coordinately regulate a set of genes,
and are introduced to better capture the complex variance structure in the target genes.
E
x
and
E
y
represent
residual errors. scPLS aims to remove the confounding eects
ΛZ
y
in the target genes.

www.nature.com/scientificreports/
5
Scientific RepoRts | 7: 13587 | DOI:10.1038/s41598-017-13665-w
count data directly for RUV and used log transformed data (i.e.
+clog(1)
ij
) for all other methods. For PCA and
RUV, we set the number of latent factors to be the true number (i.e. 2). Such number is determined automatically
by the soware itself for scLVM, and is not needed for LMM. We compared dierent methods based on clustering
performance. In particular, for each of these methods, we obtained corrected data and applied k-means method
to cluster cells into two subpopulations. We the compared the clusters inferred from the corrected data with the
truth and used adjusted rand index (ARI) to measure clustering performance. ARI is computed across a range of
signal strength that is measured as PDE explained above. Intuitively, if a method performs well in removing con-
founding factors, then the corrected data from this method can be used to better infer the two cell subpopulations
and thus yields a higher ARI score.
Overall, scPLS performs the best in both scenarios I and II, with or without dropout events (Fig.2a). e
addition of dropout events in either of the two scenarios reduces the performance of all methods but does not
change their relative rank of performance. e superior performance of scPLS also suggests that properly using
both control and target genes can lead to eective removal of confounding eects. Among the rest of the methods,
PCA and LMM performs better than RUV and scLVM, suggesting that target genes contain a substantial amount
of information for removing confounding eects. Beside the comparison of clustering performance, for each gene
in turn, we also used dierent methods to estimate the proportion of gene expression variance contributed by
confounding factors. Consistent with the clustering performance comparison, we found that scPLS also yielded
more accurate proportion of variance estimates (Fig.2b).
To examine the robustness of scPLS, we applied scPLS to the same data but with a reduced number of control
genes (Fig.3a). Because scPLS does not completely rely on the information contained in the control genes, it
achieves robust performance even if we only use a much smaller subset of control genes. We also examined the
performance of scPLS in Scenario III where there is no biological eects (Fig.3b) and found that scPLS performs
well there. As it is oen unknown whether a low-rank structural variation exists in a real data set, our simulation
suggests that we can always include the biological factors
u
i
in the model even in the absence of such factors. In
addition, scPLS is not sensitive with respect to the number of biological factors used in tting the model, and
achieves similar power for a range of reasonable
k
u
values (Fig.3c).
Real Data Applications. Next, we applied scPLS to two real data sets. e rst dataset is used to demonstrate
the eectiveness of scPLS in removing the technical confounding eects by using ERCC spike-ins. Removing
technical confounding eects is a common and important task in transcriptome analysis. e second dataset is
used to demonstrate the eectiveness of scPLS in removing cell cycle eects by using a known set of cell cycle
genes. Removing cell cycle eects can reveal gene expression heterogeneity that is otherwise obscured.
Removing Technical Confounding Factors. e rst dataset consists of 251 samples from
22
. Among
these, 119 are mouse embryonic stem cells (mESCs), including 74 mESCs cultured in a two-inhibitor (2i)
medium and 45 mESCs cultured in a serum medium. e remaining 132 cells are control “cells” that are obtained
by mixing single cells cultured in each condition (i.e. these control “cells” are similar to bulk seq data in terms of
consisting a mixture of cell types, but are prepared and sequenced using single cell protocol). e control cells
include 76 cells cultured in 2i and 56 cells cultured in serum. Because the control cells are homogeneous within
each culture condition, when we cluster these cell, we would expect the only true cluster detectable among these
cells is the culture condition. erefore, we decide to focus on these control cells to compare the performance of
dierent methods for removing technical eects.
We obtained the raw UMI counts data directly from the authors. e data contains measurements for 92
ERCC spike-ins and 23,459 genes. Due to the low coverage of this dataset (median coverage equals one), we
ltered out lowly expressed genes and selected only genes that have at least ve counts and spike-ins that have at
least one count in more than a third of the cells. is ltering step resulted in a total of 32 ERCC spike-ins that
were used as the controls and 2,795 genes that were used as the targets.
As in the simulations, we log transformed the count data and centered the transformed values for scPLS, PCA,
LMM and scLVM. We used the count data for RUV. In this data, scPLS infers
k 1
z
=
confounding factors and
k 1
u
=
biological factors. In the target genes, the confounding factors and structured biological factors explain a
median of 18% and 30% of gene expression variance, respectively. e PVE by the confounding and biological
factors can be as high as 73.7% and 77.9%, respectively, in the target genes.
We applied scPLS and the other four methods to remove confounding eects in the data. Since control cells
are homogeneous within each culture condition, we reasoned that if the method is eective in removing con-
founding eects, then the corrected data from the corresponding method could be used to better reveal two
clusters that correspond to the two known culture conditions. For the clustering analysis, we applied the four
dierent clustering approaches on the uncorrected or corrected data from dierent methods. e four clustering
approaches include: (1) kmeans, where we applied the k-means method directly on the uncorrected or corrected
data; (2) PCA, where we extracted the top ve PCs from either the uncorrected or corrected data and then applied
the k-means method using the top PCs; (3) tSNE, where we used tSNE to either the uncorrected or corrected data
and then applied the k-means method on the extracted tSNE factors; (4) SC3, where we used a recently devel-
oped state-of-the-art single cell clustering method single cell census clustering (SC3)
49
. For all these clustering
approaches, we set the number of clusters to two and measured clustering performance by the adjusted Rand
Index (ARI). e results are shown in Table1 and are overall consistent with the simulations. Specically, scPLS
outperforms the other methods in three out of the four clustering approaches. scPLS performs slightly worse
than RUV when tSNE was used to cluster data–but tSNE works extremely poorly in this data presumably because
tSNE’s non-linearity assumption does not t the data well.

Citations
More filters
Journal ArticleDOI
TL;DR: This paper introduces a general framework, RUV*, that both unites and generalizes existing RUV approaches and provides conditions under which RUV2 and RUV4 are equivalent, and implements RUVB, a version of RUV* based on Bayesian factor analysis.
Abstract: Unwanted variation, including hidden confounding, is a well-known problem in many fields, particularly large-scale gene expression studies. Recent proposals to use control genes --- genes assumed to be unassociated with the covariates of interest --- have led to new methods to deal with this problem. Going by the moniker Removing Unwanted Variation (RUV), there are many versions --- RUV1, RUV2, RUV4, RUVinv, RUVrinv, RUVfun. In this paper, we introduce a general framework, RUV*, that both unites and generalizes these approaches. This unifying framework helps clarify connections between existing methods. In particular we provide conditions under which RUV2 and RUV4 are equivalent. The RUV* framework also preserves an advantage of RUV approaches --- their modularity --- which facilitates the development of novel methods based on existing matrix imputation algorithms. We illustrate this by implementing RUVB, a version of RUV* based on Bayesian factor analysis. In realistic simulations based on real data we found that RUVB is competitive with existing methods in terms of both power and calibration, although we also highlight the challenges of providing consistently reliable calibration among data sets.

12 citations

References
More filters
Proceedings Article
01 Jan 2012
TL;DR: A beta-negative binomial (BNB) process is proposed, leading to a beta-gamma-Poisson process, which may be viewed as a \multiscoop" generalization of the beta-Bernoulli process.
Abstract: A beta-negative binomial (BNB) process is proposed, leading to a beta-gamma-Poisson process, which may be viewed as a \multiscoop" generalization of the beta-Bernoulli process. The BNB process is augmented into a beta-gamma-gamma-Poisson hierarchical structure, and applied as a nonparametric Bayesian prior for an innite Poisson factor analysis model. A nite approximation for the beta process L evy random measure is constructed for convenient implementation. Ecient MCMC computations are performed with data augmentation and marginalization techniques. Encouraging results are shown on document count matrix factorization.

264 citations

Journal ArticleDOI
TL;DR: Single-cell RNA-seq on viable PDX cells identified a candidate tumor cell subgroup associated with anti- cancer drug resistance, a powerful approach for identifying unique tumor cell-specific gene expression profiles which could facilitate the development of optimized clinical anti-cancer strategies.
Abstract: Intra-tumoral genetic and functional heterogeneity correlates with cancer clinical prognoses. However, the mechanisms by which intra-tumoral heterogeneity impacts therapeutic outcome remain poorly understood. RNA sequencing (RNA-seq) of single tumor cells can provide comprehensive information about gene expression and single-nucleotide variations in individual tumor cells, which may allow for the translation of heterogeneous tumor cell functional responses into customized anti-cancer treatments. We isolated 34 patient-derived xenograft (PDX) tumor cells from a lung adenocarcinoma patient tumor xenograft. Individual tumor cells were subjected to single cell RNA-seq for gene expression profiling and expressed mutation profiling. Fifty tumor-specific single-nucleotide variations, including KRAS G12D , were observed to be heterogeneous in individual PDX cells. Semi-supervised clustering, based on KRAS G12D mutant expression and a risk score representing expression of 69 lung adenocarcinoma-prognostic genes, classified PDX cells into four groups. PDX cells that survived in vitro anti-cancer drug treatment displayed transcriptome signatures consistent with the group characterized by KRAS G12D and low risk score. Single-cell RNA-seq on viable PDX cells identified a candidate tumor cell subgroup associated with anti-cancer drug resistance. Thus, single-cell RNA-seq is a powerful approach for identifying unique tumor cell-specific gene expression profiles which could facilitate the development of optimized clinical anti-cancer strategies.

236 citations

Journal ArticleDOI
TL;DR: In this article, a generative statistical model that accurately quantifies technical noise with the help of external RNA spike-ins was proposed to distinguish biological variability from the high level of technical noise that affects scRNA-seq protocols.
Abstract: Single-cell RNA-sequencing (scRNA-seq) facilitates identification of new cell types and gene regulatory networks as well as dissection of the kinetics of gene expression and patterns of allele-specific expression. However, to facilitate such analyses, separating biological variability from the high level of technical noise that affects scRNA-seq protocols is vital. Here we describe and validate a generative statistical model that accurately quantifies technical noise with the help of external RNA spike-ins. Applying our approach to investigate stochastic allele-specific expression in individual cells, we demonstrate that a large fraction of stochastic allele-specific expression can be explained by technical noise, especially for lowly and moderately expressed genes: we predict that only 17.8% of stochastic allele-specific expression patterns are attributable to biological noise with the remainder due to technical noise.

214 citations

Journal ArticleDOI
TL;DR: Comparisons of cell types from utricles and cochleae demonstrate divergence between auditory and vestibular cells, despite a common origin, and suggest possible plasticity among cells at the sensory–nonsensory boundary.
Abstract: In the inner ear, cochlear and vestibular sensory epithelia utilize grossly similar cell types to transduce different stimuli: sound and acceleration. Each individual sensory epithelium is composed of highly heterogeneous populations of cells based on physiological and anatomical criteria. However, limited numbers of each cell type have impeded transcriptional characterization. Here we generated transcriptomes for 301 single cells from the utricular and cochlear sensory epithelia of newborn mice to circumvent this challenge. Cluster analysis indicates distinct profiles for each of the major sensory epithelial cell types, as well as less-distinct sub-populations. Asynchrony within utricles allows reconstruction of the temporal progression of cell-type-specific differentiation and suggests possible plasticity among cells at the sensory-nonsensory boundary. Comparisons of cell types from utricles and cochleae demonstrate divergence between auditory and vestibular cells, despite a common origin. These results provide significant insights into the developmental processes that form unique inner ear cell types.

200 citations

Journal ArticleDOI
TL;DR: In Cyclebase version 3.0, the content of the database is updated to reflect changes to genome annotation, added new mRNA and protein expression data, and integrated cell-cycle phenotype information from high-content screens and model-organism databases.
Abstract: The eukaryotic cell division cycle is a highly regulated process that consists of a complex series of events and involves thousands of proteins. Researchers have studied the regulation of the cell cycle in several organisms, employing a wide range of high-throughput technologies, such as microarray-based mRNA expression profiling and quantitative proteomics. Due to its complexity, the cell cycle can also fail or otherwise change in many different ways if important genes are knocked out, which has been studied in several microscopy-based knockdown screens. The data from these many large-scale efforts are not easily accessed, analyzed and combined due to their inherent heterogeneity. To address this, we have created Cyclebase—available at http://www.cyclebase.org—an online database that allows users to easily visualize and download results from genome-wide cell-cycle-related experiments. In Cyclebase version 3.0, we have updated the content of the database to reflect changes to genome annotation, added new mRNA and protein expression data, and integrated cell-cycle phenotype information from high-content screens and model-organism databases. The new version of Cyclebase also features a new web interface, designed around an overview figure that summarizes all the cell-cycle-related data for a gene.

193 citations

Frequently Asked Questions (2)
Q1. What have the authors contributed in "Controlling for confounding effects in single cell rna sequencing studies using both control and target genes" ?

Here, the authors present a novel statistical method, which they refer to as scPLS ( single cell partial least squares ), for robust and accurate inference of confounding effects. ScPLS takes advantage of the fact that genes in a scRNAseq study often can be naturally classified into two sets: a control set of genes that are free of effects of the predictor variables and a target set of genes that are of primary interest. With extensive simulations and comparisons with other methods, the authors demonstrate the effectiveness of scPLS. Finally, the authors apply scPLS to analyze two scRNAseq data sets to illustrate its benefits in removing technical confounding effects as well as for removing cell cycle effects. 

Exploring the use of biological factors in scPLS is an interesting avenue for future research. Therefore, it would be important to evaluate the performance of scPLS in other analysis settings in future studies. Therefore, extending their framework to modeling count data65,66 is another promising avenue for future research. One potential disadvantage of scPLS is that it does not model raw count data directly.