scispace - formally typeset
Search or ask a question
Posted ContentDOI

Normalization of Single Cell RNA Sequencing Data Using both Control and Target Genes

22 Mar 2016-bioRxiv (Cold Spring Harbor Laboratory)-pp 045070
TL;DR: This work develops an alternative statistical method, which it refers to as scPLS, for more accurate inference of confounding effects, based on partial least squares and models control and target genes jointly to better infer and control for confounding effects.
Abstract: Single cell RNA sequencing (scRNAseq) technique is becoming increasingly popular for unbiased and high-resolutional transcriptome analysis of heterogeneous cell populations. Despite its many advantages, scRNAseq, like any other genomic sequencing technique, is susceptible to the influence of confounding effects. Controlling for confounding effects in scRNAseq data is thus a crucial step for proper data normalization and accurate downstream analysis. Several recent methodological studies have demonstrated the use of control genes for controlling for confounding effects in scRNAseq studies; the control genes are used to infer the confounding effects, which are then used to normalize target genes of primary interest. However, these methods can be suboptimal as they ignore the rich information contained in the target genes. Here, we develop an alternative statistical method, which we refer to as scPLS, for more accurate inference of confounding effects. Our method is based on partial least squares and models control and target genes jointly to better infer and control for confounding effects. To accompany our method, we develop a novel expectation maximization algorithm for scalable inference. Our algorithm is an order of magnitude faster than standard ones, making scPLS applicable to hundreds of cells and hundreds of thousands of genes. With extensive simulations and comparisons with other methods, we demonstrate the effectiveness of scPLS. We apply scPLS to analyze three scRNAseq data sets to further illustrate its benefits in removing technical confounding effects as well as for removing cell cycle effects.

Summary (2 min read)

Target Genes

  • Single cell RNA sequencing technique is becoming increasingly popular for unbiased and high-resolutional transcriptome analysis of heterogeneous cell populations.
  • Finally, the authors apply scPLS to analyze two scRNAseq data sets to illustrate its benefits in removing technical confounding effects as well as for removing cell cycle effects.
  • These hidden confounding factors can cause systematic bias, are notoriously difficult to control for, and are the focus of the present study.
  • In the Simulations section the authors present comparisons between scPLS and several existing methods using simulations.

Review of Previous Methods

  • Many statistical methods have been developed in sequencing- and array-based genomic studies to infer hidden confounding factors and control for hidden confounding effects.
  • The application-specific methods become inconvenient in cases where there are multiple variables of interest (e.g. in eQTL mapping problems).
  • In contrast, the second subcategory of unsupervised methods are recently developed to take advantage of a set of control genes for inferring the confounding factors29,37.
  • Similarly, most scRNAseq studies include a set of control genes that are known to have varying expression levels across cell cycles.
  • The two subcategories of unsupervised methods use different strategies to infer the confounding factors.

Results

  • The authors provide modeling details for scPLS here.
  • Consistent with the clustering performance comparison, the authors found that scPLS also yielded more accurate proportion of variance estimates (Fig. 2b).
  • In the target genes, the confounding factors and structured biological factors explain a median of 18% and 30% of gene expression variance, respectively.
  • The results are shown in Table 1 and are overall consistent with the simulations.
  • To demonstrate its effectiveness there, the authors applied scPLS and several other methods to a second dataset that was used for demonstrating cell cycle influence37.

Discussion

  • The authors have presented scPLS for removing hidden confounding effects in scRNAseq studies.
  • Importantly, the performance of scPLS is robust to the number of genes included in the control set and yields comparable results even when a much smaller number of control genes is used.
  • In fact, low-rank factors inferred from many data sets using standard factor models have been linked to important biological pathways or transcription factors42–46.
  • The authors have been mainly focused on comparing the performance of different confounding effects removing methods by evaluating the clustering performance as the target downstream analysis.
  • Like many other methods for scRNAseq21 or bulk58,59 RNAseq studies, scPLS requires a data transformation step that converts the count data into quantitative expression data.

Methods

  • The authors list the EM algorithm below, with detailed derivation provided later.
  • The naive EM algorithm is computationally expensive: it scales quadratically with the number of genes and linearly with the number of cells/samples.
  • Therefore, the authors apply the EM-in-chunks algorithm with chunk size 500 throughout the rest of the paper.
  • In the E step, the authors calculate the expectation of the log likelihood function for complete data.
  • Ten replicates were performed for each setting on an Intel Xeon E5-2670 2.6 GHz CPU.

Author Contributions

  • The authors declare that they have no competing interests.
  • Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations, also known as Publisher's note.
  • This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
  • If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Did you find this useful? Give us your feedback

Content maybe subject to copyright    Report

1
Scientific RepoRts | 7: 13587 | DOI:10.1038/s41598-017-13665-w
www.nature.com/scientificreports
Controlling for Confounding Eects
in Single Cell RNA Sequencing
Studies Using both Control and
Target Genes
Mengjie Chen
1,2
& Xiang Zhou
3,4
Single cell RNA sequencing (scRNAseq) technique is becoming increasingly popular for unbiased
and high-resolutional transcriptome analysis of heterogeneous cell populations. Despite its many
advantages, scRNAseq, like any other genomic sequencing technique, is susceptible to the inuence
of confounding eects. Controlling for confounding eects in scRNAseq data is a crucial step for
accurate downstream analysis. Here, we present a novel statistical method, which we refer to as scPLS
(single cell partial least squares), for robust and accurate inference of confounding eects. scPLS takes
advantage of the fact that genes in a scRNAseq study often can be naturally classied into two sets: a
control set of genes that are free of eects of the predictor variables and a target set of genes that are
of primary interest. By modeling the two sets of genes jointly using the partial least squares regression,
scPLS is capable of making full use of the data to improve the inference of confounding eects. With
extensive simulations and comparisons with other methods, we demonstrate the eectiveness of
scPLS. Finally, we apply scPLS to analyze two scRNAseq data sets to illustrate its benets in removing
technical confounding eects as well as for removing cell cycle eects.
Single-cell RNA sequencing (scRNAseq) has emerged as a powerful tool in genomics. While the traditional RNA
sequencing, known as the bulk RNAseq, measures gene expression levels averaged across many dierent cells in
a sample of potentially heterogeneous cell population, scRNAseq can measure gene expression levels directly at
the single cell resolution. As a result, scRNAseq is less inuenced by the variation of cell type and cell composition
across dierent samples–a major confounding in the analyses of bulk RNAseq studies. Because of this benet and
its high resolution, scRNAseq provides unprecedented insights into many basic biological questions that are pre-
viously dicult to address. For example, scRNAseq has been applied to classify novel cell subtypes
1,2
and cellular
states
3,4
, reconstruct cell lineage and quantify progressive gene expression during development
58
, perform spatial
mapping and re-localization
9,10
, identify dierentially expressed genes and gene expression modulars
1113
, and
investigate the genetic basis of gene expression variation by detecting heterogenic allelic specic expressions
14,15
.
Like any other genomic sequencing experiment, scRNAseq studies are inuenced by many factors that can
introduce unwanted variation in the sequencing data and confound the down-stream analysis
16
. However, such
unwanted variation are oen exacerbated in scRNAseq experiments due to a range of scRNAseq specic con-
ditions that include amplication bias, low amount of input material and low transcript capture eciency
17
;
dropout events that are driven by both biological and technical factors
18,19
; global changes in expression due to
transcriptional bursts
20
; as well as changes in cell cycle and cell size
21
. Indeed, adjusting for confounding fac-
tors in scRNAseq data has been shown to be crucial for accurate estimation of gene expression levels and suc-
cessful down-stream analysis
1618,22,23
. However, depending on the source, adjusting for confounding factors in
scRNAseq can be non-trivial. Some confounding eects, such as read sampling noise and drop-out events, are
direct consequences of low sequencing-depth, which are random in nature and can be readily addressed by prob-
abilistic modeling using existing statistical methods
18,2225
. Other confounding eects are inherent to a particular
1
Department of Medicine, University of Chicago, Chicago, IL 60637, USA.
2
Department of Human Genetics,
University of Chicago, Chicago, IL 60637, USA.
3
Department of Biostatistics, University of Michigan, Ann Arbor, MI
48109, USA.
4
Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA. Correspondence and
requests for materials should be addressed to M.C. (email: mengjiechen@uchicago.edu) or X.Z. (email: xzhousph@
umich.edu)
Received: 18 May 2017
Accepted: 29 September 2017
Published: xx xx xxxx
OPEN

www.nature.com/scientificreports/
2
Scientific RepoRts | 7: 13587 | DOI:10.1038/s41598-017-13665-w
experimental protocol and can cause amplication bias, but can be easily mitigated by using new protocols
26
. Yet
other confounding eects are due to observable batches and can be adjusted for by including batch labels and
technician ids as covariates or dealt with other statistical methods
27,28
. However, many confounding factors are
hidden and are dicult or even impossible to measure. Common hidden confounding factors include various
technical artifacts during library preparation and sequencing, and unwanted biological confounders such as cell
cycle status. ese hidden confounding factors can cause systematic bias, are notoriously dicult to control for,
and are the focus of the present study.
To eectively infer and control for hidden confounding factors in scRNAseq studies, we develop a novel sta-
tistical method, which we refer to as scPLS (single cell partial least squares). scPLS takes advantage of the fact
that genes in a scRNAseq study can oen be naturally classied into two sets: a control set of genes that are free of
eects of the predictor variables and a target set of genes that are of primary interest. By modeling the two sets of
genes jointly using the partial least squares regression, scPLS is capable of making full use of the data to improve
the inference of confounding factors. scPLS is closely related to and bridges between two existing subcategories
of methods for transcriptome analysis: a subcategory of methods that treat control and target genes in the same
fashion (e.g. PCA
2932
and LMM
3335
), and another subcategory of methods that use control genes alone for infer-
ring confounding factors (e.g. RUV
29,36
and scLVM
37
). By bridging between the two subcategories of methods,
scPLS enjoys robust performance across a range of application scenarios. scPLS is also computationally ecient:
with a new block-wise expectation maximization (EM) algorithm, it is scalable to thousands of cells and tens of
thousands of genes. Using simulations and two real data applications, we show how scPLS can be used to remove
confounding eects and enable accurate down-stream analysis in scRNAseq studies. Our method is implemented
as a part of the Citrus project and is freely available at: http://chenmengjie.github.io/Citrus/.
e paper is organized as follows. In the Review of Previous Methods section, we provide a brief review of
existing statistical methods for removing confounding eects in transcriptome analysis and describe how scPLS
is related to and motivated from these methods. In the Method Overview Section, we provide a methodologi-
cal description of the scPLS model, with inference details provided in the Methods Section. In the Simulations
section we present comparisons between scPLS and several existing methods using simulations. In Real Data
Applications section, we apply scPLS to two real scRNAseq data sets to remove technical confounding eects or
cell cycle eects. Finally, we conclude the paper with a summary and discussion.
Review of Previous Methods
Many statistical methods have been developed in sequencing- and array-based genomic studies to infer hidden
confounding factors and control for hidden confounding eects. Based on their targeted application, these statis-
tical methods can be generally classied into two categories.
e rst category of methods are supervised and application-specic: these methods are designed to infer
the confounding factors in the presence of a known predictor variable, and to correct for the confounding eects
without removing the eects of the predictor variable. For example, scientists are oen interested in identifying
genes that are dierentially expressed between two pre-determined treatment conditions or that are associated
with a measured predictor variable of interest. To remove the confounding eects in these applications, meth-
ods, include SVA
30
, sparse regression models
38,39
, and, more recently, RUV
40,41
, are developed. Although these
application-specic methods are widely applied in many genomics studies, their usage is naturally restricted to
cases where the primary variable of interest is known. e application-specic methods become inconvenient in
cases where there are multiple variables of interest (e.g. in eQTL mapping problems). ey also become inappli-
cable when the primary variable of interest is not observed (e.g. in clustering problems).
e second category of methods are unsupervised, and are designed to infer the confounding factors without
knowing or using the predictor variable of interest. Our scPLS belongs to this category. Notable applications of
unsupervised methods in scRNAseq studies include cell type clustering and classication
18
. Existing unsuper-
vised statistical methods can be further classied into two subcategories. e rst subcategory of methods treat
all genes in the same fashion and use all of them to infer the confounding factors. For example, the principal
component analysis (PCA) or the factor model extracts the principal components or factors from all genes (or all
highly variable genes) as surrogates for the confounding factors
2932
. e inferred factors are treated as covariates
whose eects are further removed from gene expression levels before downstream analyses. Similarly, the linear
mixed models (LMMs) construct a sample relatedness matrix based on all genes to capture the inuence of the
confounding factors
3335
. e relatedness matrix are then included in the downstream analyses to control for the
confounding eects. In contrast, the second subcategory of unsupervised methods are recently developed to take
advantage of a set of control genes for inferring the confounding factors
29,37
. ese methods divide genes into
two sets: a control set of genes that are known to be free of eects of interest a priori and a target set of genes that
are of primary interest. Unlike the rst subcategory, the second subcategory of methods treat the two gene sets
dierently in inferring the confounding factors: the confounding factors are only inferred from the control set,
and are then used to remove the confounding eects in the target genes for subsequent downstream analysis. For
example, scRNAseq studies oen add ERCC spike-in controls prior to the PCR amplication and sequencing
steps. e spike-in controls can be used to capture the hidden confounding technical factors associated with the
experimental procedures, which are further used to remove technical confounding eects (e.g. reverse transcrip-
tion or PCR amplication confounding eects) from the target genes
33
. Similarly, most scRNAseq studies include
a set of control genes that are known to have varying expression levels across cell cycles. ese cell cycle genes
can be used to capture the unmeasured cell cycle status of each cell, which are further used to remove cell cycle
eects in the target genes
37
. Prominent methods in the second subcategory include the unsupervised version of
RUV
29,36
and scLVM
37
.
e two subcategories of unsupervised methods use dierent strategies to infer the confounding factors.
erefore, these two sets of methods are expected to perform well in dierent settings. Specically, the rst

www.nature.com/scientificreports/
3
Scientific RepoRts | 7: 13587 | DOI:10.1038/s41598-017-13665-w
subcategory of methods have the advantage of using information contained in all genes to accurately infer the
confounding eects. However, when the predictor variable of interest inuences a large number of genes, then
this subcategory of methods may incorrectly remove the primary eects of interest. On the other hand, the sec-
ond subcategory of methods infer confounding factors only from the control genes and are thus not prone to
mistakenly removing the primary eects of interest. However, these methods overlook one important fact–that
the hidden confounding factors not only inuence the control genes but also the target genes, i.e. the exact reason
that we need to remove such confounding eects in the rst place. Because the confounding factors inuence
both control and target genes, using control genes alone to infer the confounding factors can be suboptimal as it
fails to use the information from target genes.
To more eectively infer and control for hidden confounding factors in scRNAseq studies, we develop a novel
statistical method, which we refer to as scPLS (single cell partial least squares). scPLS bridges between the two
subcategories of unsupervised methods and eectively includes each as a special case. Like the rst subcategory
of methods, scPLS models both control and target genes jointly to infer the confounding factors. Like the second
subcategory of methods, scPLS is capable of taking advantage of a control set to guild the inference of confound-
ing factors. scPLS builds upon the partial least squares regression model and relies on a key modeling assumption
that only target genes contain the primary eects of interest or other systematic biological variations. By incor-
porating such systematic variations in the target genes only, we can jointly model both control and target genes
to infer the confounding eects while avoiding mis-removing the primary eects of interest. erefore, scPLS
has the potential to make full use of the data to improve the inference of confounding factors and the removal of
confounding eects.
Results
scPLS Method Overview. We provide modeling details for scPLS here. While the formulation of scPLS is
general, we focus on its application in scRNAseq. e scRNAseq data resembles that of the bulk RNAseq data and
consists of a gene expression matrix on n cells and
p q+
genes. We consider dividing the genes into two sets: a con-
trol set that contains q control genes and a target set that contains p genes of primary interest. e control genes are
selected based on the purpose of the analysis. For example, the control set would contain ERCC spike-ins if we want
to remove technical confounding factors, and would contain cell cycle genes if we want to remove cell cycle eects.
We use the following partial least squares regression to jointly model both control and target genes:
ΛεεΨ=+
~
xz,MVN(0,)
(1)
xxixixiii
~yz u ,MVN(0,)
(2)
yuyi yi yi
i
ii
ΛΛ εε Ψ=+ +
where for
i
'th individual cell,
x
i
is a q-vector of expression levels for q control genes;
y
i
is a p-vector of expression
levels for p target genes;
z
i
is
k
z
-vector of unknown confounding factors that aect both control and target genes;
the coecients of the confounding factors are represented by the
q
by
k
z
loading matrix
for the control genes
and the
p
by
k
z
loading matrix
for the target genes;
u
i
is a
k
u
-vector of unknown factors in the target genes and
potentially represents the predictors of interest or other structured variations (see below);
Λ
u
is a
p
by
k
u
loading
matrix;
ε
xi
is a
q
-vector of idiosyncratic error with covariance
diag(, ,)
xi xxq1
22
σσΨ =
;
yi
ε
is a
p
-vector of idiosyn-
cratic error with covariance
diag(, ,)
yi yyp1
22
σσΨ =
; MVN denotes the multivariate normal distribution. We
assume that
ε
xi
,
ε
yi
,
z
i
, and
u
i
are all independent from each other. Following standard latent factor models, we
further assume that
z IMVN(0, )
i
~
and
~u IMVN(0, )
i
. We model transformed data instead of the raw read
counts. We also assume that the expression levels of each gene have been centered to have mean zero, which
allows us to ignore the intercept.
scPLS includes two types of unknown latent factors. e rst set of factors,
z
i
, represents the unknown con-
founding factors that aect both control and target genes. e eects of
z
i
on the control and target genes are
captured in the loading matrices
and
, respectively. We call
z
i
the confounding factors throughout the text,
and we aim to remove the confounding eects
Λ z
y i
from the target genes. e second set of factors,
u
i
, aims to
capture a low dimensional structure of the expression level of
p
target genes. e factors
u
i
can represent the
unknown predictor variables of interest, specic experimental perturbations, cell subpopulations, gene signatures
or other intermediate factors that coordinately regulate a set of genes. erefore, the factors
u
i
can be interpreted
as cell subtypes, treatment status, transcription factors or regulators of biological pathways in dierent stud-
ies
4246
. Although
u
i
could be of direct biological interest in many data sets, we do not explicitly examine the
inferred
u
i
here. Rather, we view modeling
u
i
in the target genes as a way to better capture the complex variance
structure there and to facilitate the precise estimation of confounding factors
z
i
. For simplicity, we call
u
i
the
biological factors throughout the text, though we note that
u
i
could well represent non-biological processes such
as treatment or environmental eects. us, the expression levels of the control genes can be described by a linear
combination of the confounding factors
z
i
and residual errors; the expression levels of the target genes can be
described by a linear combination of the confounding factors
z
i
, the biological factors
u
i
and residual errors. For
both types of confounding factors, we are interested in inferring the factor eects
z
y i
Λ
and
Λ u
u i
rather than the
individual factors
z
i
and
u
i
. erefore, unlike in standard factor models, we are not concerned with the identia-
bility of the factors. Figure1 shows an illustration of scPLS.
scPLS is closely related to the two subcategories of unsupervised methods described in the previous Section.
Specically, without the biological eects term
Λ u
u i
, scPLS eectively reduces to the rst subcategory of methods
that treat all genes in the same fashion for inferring the confounding factors. Without the Equation2 term, scPLS
eectively reduces to the second subcategory of methods that use only control genes for inference. (Note that,
aer inferring the confounding factors
z
i
from Equation1, the second subcategory of methods still use a reduced

www.nature.com/scientificreports/
4
Scientific RepoRts | 7: 13587 | DOI:10.1038/s41598-017-13665-w
version of Equation2 without the biological eects term
Λ u
u i
to remove the confounding eects.) By including
both modeling terms, scPLS can robustly control for confounding eects across a range of scenarios. erefore,
scPLS provides a exible modeling framework that eectively includes the two subcategories of unsupervised
methods as special cases and has the potential to outperform these previous methods.
Simulations. We performed a simulation study to compare scPLS with other methods. Specically, we sim-
ulated gene expression levels for 50 control genes and 1,000 target genes for 200 cells. ese 200 cells come from
two equal-sized groups, representing two treatment conditions or two cell subpopulations. Among the 1,000
target genes, only 200 of them are dierentially expressed (DE) between the two groups and thus represent the
signature of the two groups. e eect sizes of the DE genes were simulated from a normal distribution and we
scaled the eects further so that the group label explains twenty percent of phenotypic variation (PVE) in expres-
sion levels in the DE genes. In addition to the group eects, we set
k k2, 5
zu
==
and simulated each element of
z
i
and
u
i
from a standard normal distribution. We simulated each element of
x
Λ
from
σ−.N(025, )
l
2
and each
element of
from
σ.N(0 25, )
l
2
. Note that
and
were simulated dierently to capture the fact that the eect
sizes of the confounding factors could be dierent for control and target genes. We simulated each element of
Λ
u
from
N(0,)
b
2
σ
. We simulated each element of
ε
xi
and
ε
yi
from a standard normal distribution. We set
σ =.04
l
2
and
σ =.06
b
2
to ensure that, in non-DE genes, the confounding factors
z
i
explain 20% PVE in either the control or the
target genes; the biological factors
u
i
explain 30% PVE of the target genes; and the residual errors to explain the
rest of PVE. To vary signal strength in the data, we also created a series of sub data sets by varying the number of
non DE genes in the data, so that the proportion of variance explained by DE genes in total equal to a xed per-
centage (PDE, in the range of 20–100%, with 10% increments). Aer we simulated gene expression levels, we
further converted these continuous values into count data by using a Poisson distribution: the nal observation
for ith cell and jth gene
c
ij
is from
µ +~c NwPoi( exp( ))
ij ij
, with
w
ij
being the continuous gene expression levels
simulated above and
N 500000, log(10/500000)µ==
, which ensures an average read count of 10. Note that,
because of the residual errors, the resulting count data are over-dispersed with respect to a Poisson distribution.
We considered three dierent simulation scenarios. In scenario I, the confounding factors
z
i
are independent
of group labels. In scenario II, the confounding factors are correlated with group labels. To simulate correlated
data, we simulated each element of
z
i
from
N(0,1)
if the corresponding sample belongs to the rst group, but
from
−.N(025,1)
if the corresponding sample belongs to the second group. Finally, we also considered a scenario
III where there is no biological factor (i.e. data were simulated eectively under the PCA modeling assumption
and all genes could be used to infer the confounding factors). We performed 10 simulation replicates for each
scenario. For scenario I and II, we further introduced dropout events that are commonly observed in scRNAseq
data. is was done by going through one gene at a time and setting the expression level for
j
th gene
c
ij
to zero
with probability
ij
π
that depends on the expression level through
l
og c
ij
1
ij
ij
=
π
π
.
We compared scPLS to four dierent methods: (1) PCA and (2) LMM (implemented in GEMMA
47,48
) use all
genes to infer the confounding eects; while (3) RUVseq (version 1.2.0); which we simply refer to as RUV in the
following text) and (4) scLVM (version 0.99.1) use only control genes to infer the confounding eects. We note
that while some of these methods are developed not specically for scRNAseq, these methods represent a range
of strategies to deal with confounding factors. We used default settings in all the above methods. We used the
Figure 1. Illustration of scPLS. We model the expression level of genes in the control set
X()
and genes in the
target set
Y()
jointly. Both control and target genes are aected by the common confounding factors (Z) with
eects
and
in the two gene sets, respectively. e target genes are also inuenced by biological factors (U)
with eects
u
Λ
. e biological factors represent intermediate factors that coordinately regulate a set of genes,
and are introduced to better capture the complex variance structure in the target genes.
E
x
and
E
y
represent
residual errors. scPLS aims to remove the confounding eects
ΛZ
y
in the target genes.

www.nature.com/scientificreports/
5
Scientific RepoRts | 7: 13587 | DOI:10.1038/s41598-017-13665-w
count data directly for RUV and used log transformed data (i.e.
+clog(1)
ij
) for all other methods. For PCA and
RUV, we set the number of latent factors to be the true number (i.e. 2). Such number is determined automatically
by the soware itself for scLVM, and is not needed for LMM. We compared dierent methods based on clustering
performance. In particular, for each of these methods, we obtained corrected data and applied k-means method
to cluster cells into two subpopulations. We the compared the clusters inferred from the corrected data with the
truth and used adjusted rand index (ARI) to measure clustering performance. ARI is computed across a range of
signal strength that is measured as PDE explained above. Intuitively, if a method performs well in removing con-
founding factors, then the corrected data from this method can be used to better infer the two cell subpopulations
and thus yields a higher ARI score.
Overall, scPLS performs the best in both scenarios I and II, with or without dropout events (Fig.2a). e
addition of dropout events in either of the two scenarios reduces the performance of all methods but does not
change their relative rank of performance. e superior performance of scPLS also suggests that properly using
both control and target genes can lead to eective removal of confounding eects. Among the rest of the methods,
PCA and LMM performs better than RUV and scLVM, suggesting that target genes contain a substantial amount
of information for removing confounding eects. Beside the comparison of clustering performance, for each gene
in turn, we also used dierent methods to estimate the proportion of gene expression variance contributed by
confounding factors. Consistent with the clustering performance comparison, we found that scPLS also yielded
more accurate proportion of variance estimates (Fig.2b).
To examine the robustness of scPLS, we applied scPLS to the same data but with a reduced number of control
genes (Fig.3a). Because scPLS does not completely rely on the information contained in the control genes, it
achieves robust performance even if we only use a much smaller subset of control genes. We also examined the
performance of scPLS in Scenario III where there is no biological eects (Fig.3b) and found that scPLS performs
well there. As it is oen unknown whether a low-rank structural variation exists in a real data set, our simulation
suggests that we can always include the biological factors
u
i
in the model even in the absence of such factors. In
addition, scPLS is not sensitive with respect to the number of biological factors used in tting the model, and
achieves similar power for a range of reasonable
k
u
values (Fig.3c).
Real Data Applications. Next, we applied scPLS to two real data sets. e rst dataset is used to demonstrate
the eectiveness of scPLS in removing the technical confounding eects by using ERCC spike-ins. Removing
technical confounding eects is a common and important task in transcriptome analysis. e second dataset is
used to demonstrate the eectiveness of scPLS in removing cell cycle eects by using a known set of cell cycle
genes. Removing cell cycle eects can reveal gene expression heterogeneity that is otherwise obscured.
Removing Technical Confounding Factors. e rst dataset consists of 251 samples from
22
. Among
these, 119 are mouse embryonic stem cells (mESCs), including 74 mESCs cultured in a two-inhibitor (2i)
medium and 45 mESCs cultured in a serum medium. e remaining 132 cells are control “cells” that are obtained
by mixing single cells cultured in each condition (i.e. these control “cells” are similar to bulk seq data in terms of
consisting a mixture of cell types, but are prepared and sequenced using single cell protocol). e control cells
include 76 cells cultured in 2i and 56 cells cultured in serum. Because the control cells are homogeneous within
each culture condition, when we cluster these cell, we would expect the only true cluster detectable among these
cells is the culture condition. erefore, we decide to focus on these control cells to compare the performance of
dierent methods for removing technical eects.
We obtained the raw UMI counts data directly from the authors. e data contains measurements for 92
ERCC spike-ins and 23,459 genes. Due to the low coverage of this dataset (median coverage equals one), we
ltered out lowly expressed genes and selected only genes that have at least ve counts and spike-ins that have at
least one count in more than a third of the cells. is ltering step resulted in a total of 32 ERCC spike-ins that
were used as the controls and 2,795 genes that were used as the targets.
As in the simulations, we log transformed the count data and centered the transformed values for scPLS, PCA,
LMM and scLVM. We used the count data for RUV. In this data, scPLS infers
k 1
z
=
confounding factors and
k 1
u
=
biological factors. In the target genes, the confounding factors and structured biological factors explain a
median of 18% and 30% of gene expression variance, respectively. e PVE by the confounding and biological
factors can be as high as 73.7% and 77.9%, respectively, in the target genes.
We applied scPLS and the other four methods to remove confounding eects in the data. Since control cells
are homogeneous within each culture condition, we reasoned that if the method is eective in removing con-
founding eects, then the corrected data from the corresponding method could be used to better reveal two
clusters that correspond to the two known culture conditions. For the clustering analysis, we applied the four
dierent clustering approaches on the uncorrected or corrected data from dierent methods. e four clustering
approaches include: (1) kmeans, where we applied the k-means method directly on the uncorrected or corrected
data; (2) PCA, where we extracted the top ve PCs from either the uncorrected or corrected data and then applied
the k-means method using the top PCs; (3) tSNE, where we used tSNE to either the uncorrected or corrected data
and then applied the k-means method on the extracted tSNE factors; (4) SC3, where we used a recently devel-
oped state-of-the-art single cell clustering method single cell census clustering (SC3)
49
. For all these clustering
approaches, we set the number of clusters to two and measured clustering performance by the adjusted Rand
Index (ARI). e results are shown in Table1 and are overall consistent with the simulations. Specically, scPLS
outperforms the other methods in three out of the four clustering approaches. scPLS performs slightly worse
than RUV when tSNE was used to cluster data–but tSNE works extremely poorly in this data presumably because
tSNE’s non-linearity assumption does not t the data well.

Citations
More filters
Journal ArticleDOI
TL;DR: This paper introduces a general framework, RUV*, that both unites and generalizes existing RUV approaches and provides conditions under which RUV2 and RUV4 are equivalent, and implements RUVB, a version of RUV* based on Bayesian factor analysis.
Abstract: Unwanted variation, including hidden confounding, is a well-known problem in many fields, particularly large-scale gene expression studies. Recent proposals to use control genes --- genes assumed to be unassociated with the covariates of interest --- have led to new methods to deal with this problem. Going by the moniker Removing Unwanted Variation (RUV), there are many versions --- RUV1, RUV2, RUV4, RUVinv, RUVrinv, RUVfun. In this paper, we introduce a general framework, RUV*, that both unites and generalizes these approaches. This unifying framework helps clarify connections between existing methods. In particular we provide conditions under which RUV2 and RUV4 are equivalent. The RUV* framework also preserves an advantage of RUV approaches --- their modularity --- which facilitates the development of novel methods based on existing matrix imputation algorithms. We illustrate this by implementing RUVB, a version of RUV* based on Bayesian factor analysis. In realistic simulations based on real data we found that RUVB is competitive with existing methods in terms of both power and calibration, although we also highlight the challenges of providing consistently reliable calibration among data sets.

12 citations

References
More filters
Journal ArticleDOI
TL;DR: This paper explores the performance of five factor analysis algorithms, Bayesian as well as classical, on problems with biological context using both simulated and real data and demonstrates that if the underlying network is sparse it is still possible to reconstruct hidden activity profiles of TFs to some degree without prior connectivity information.
Abstract: Most existing algorithms for the inference of the structure of gene regulatory networks from gene expression data assume that the activity levels of transcription factors (TFs) are proportional to their mRNA levels. This assumption is invalid for most biological systems. However, one might be able to reconstruct unobserved activity profiles of TFs from the expression profiles of target genes. A simple model is a two-layer network with unobserved TF variables in the first layer and observed gene expression variables in the second layer. TFs are connected to regulated genes by weighted edges. The weights, known as factor loadings, indicate the strength and direction of regulation. Of particular interest are methods that produce sparse networks, networks with few edges, since it is known that most genes are regulated by only a small number of TFs, and most TFs regulate only a small number of genes. In this paper, we explore the performance of five factor analysis algorithms, Bayesian as well as classical, on problems with biological context using both simulated and real data. Factor analysis (FA) models are used in order to describe a larger number of observed variables by a smaller number of unobserved variables, the factors, whereby all correlation between observed variables is explained by common factors. Bayesian FA methods allow one to infer sparse networks by enforcing sparsity through priors. In contrast, in the classical FA, matrix rotation methods are used to enforce sparsity and thus to increase the interpretability of the inferred factor loadings matrix. However, we also show that Bayesian FA models that do not impose sparsity through the priors can still be used for the reconstruction of a gene regulatory network if applied in conjunction with matrix rotation methods. Finally, we show the added advantage of merging the information derived from all algorithms in order to obtain a combined result. Most of the algorithms tested are successful in reconstructing the connectivity structure as well as the TF profiles. Moreover, we demonstrate that if the underlying network is sparse it is still possible to reconstruct hidden activity profiles of TFs to some degree without prior connectivity information.

105 citations

Journal ArticleDOI
TL;DR: A binomial mixed model and an efficient, sampling-based algorithm (MACAU: Mixed model association for count data via data augmentation) for approximate parameter estimation and p-value computation are presented to address the challenges of modeling bisulfite sequencing data.
Abstract: Identifying sources of variation in DNA methylation levels is important for understanding gene regulation. Recently, bisulfite sequencing has become a popular tool for investigating DNA methylation levels. However, modeling bisulfite sequencing data is complicated by dramatic variation in coverage across sites and individual samples, and because of the computational challenges of controlling for genetic covariance in count data. To address these challenges, we present a binomial mixed model and an efficient, sampling-based algorithm (MACAU: Mixed model association for count data via data augmentation) for approximate parameter estimation and p-value computation. This framework allows us to simultaneously account for both the over-dispersed, count-based nature of bisulfite sequencing data, as well as genetic relatedness among individuals. Using simulations and two real data sets (whole genome bisulfite sequencing (WGBS) data from Arabidopsis thaliana and reduced representation bisulfite sequencing (RRBS) data from baboons), we show that our method provides well-calibrated test statistics in the presence of population structure. Further, it improves power to detect differentially methylated sites: in the RRBS data set, MACAU detected 1.6-fold more age-associated CpG sites than a beta-binomial model (the next best approach). Changes in these sites are consistent with known age-related shifts in DNA methylation levels, and are enriched near genes that are differentially expressed with age in the same population. Taken together, our results indicate that MACAU is an efficient, effective tool for analyzing bisulfite sequencing data, with particular salience to analyses of structured populations. MACAU is freely available at www.xzlab.org/software.html.

104 citations

Journal ArticleDOI
TL;DR: It is found that the inferred phenotypes are associated with locus genotypes and environmental conditions and can explain genetic associations to genes in trans for the first time.
Abstract: Even within a defined cell type, the expression level of a gene differs in individual samples. The effects of genotype, measured factors such as environmental conditions, and their interactions have been explored in recent studies. Methods have also been developed to identify unmeasured intermediate factors that coherently influence transcript levels of multiple genes. Here, we show how to bring these two approaches together and analyse genetic effects in the context of inferred determinants of gene expression. We use a sparse factor analysis model to infer hidden factors, which we treat as intermediate cellular phenotypes that in turn affect gene expression in a yeast dataset. We find that the inferred phenotypes are associated with locus genotypes and environmental conditions and can explain genetic associations to genes in trans. For the first time, we consider and find interactions between genotype and intermediate phenotypes inferred from gene expression levels, complementing and extending established results.

86 citations

01 Jun 2014
TL;DR: In this article, the authors proposed a method to detect the presence of cancer in the human genome using gene sequencing data from National Human Genome Research Institute (U.S.). Centers of Excellence in Genomic Science (1P50HG006193-01)
Abstract: National Human Genome Research Institute (U.S.). Centers of Excellence in Genomic Science (1P50HG006193-01)

81 citations

Journal ArticleDOI
TL;DR: The possibility that tumor segmental aneuploidy makes significant contributions to variation in the lactic acidosis/hypoxia gene signatures in human cancers is suggested and latent factor analysis is demonstrated to be a powerful means to uncover such a linkage.
Abstract: Tumor microenvironmental stresses, such as hypoxia and lactic acidosis, play important roles in tumor progression. Although gene signatures reflecting the influence of these stresses are powerful approaches to link expression with phenotypes, they do not fully reflect the complexity of human cancers. Here, we describe the use of latent factor models to further dissect the stress gene signatures in a breast cancer expression dataset. The genes in these latent factors are coordinately expressed in tumors and depict distinct, interacting components of the biological processes. The genes in several latent factors are highly enriched in chromosomal locations. When these factors are analyzed in independent datasets with gene expression and array CGH data, the expression values of these factors are highly correlated with copy number alterations (CNAs) of the corresponding BAC clones in both the cell lines and tumors. Therefore, variation in the expression of these pathway-associated factors is at least partially caused by variation in gene dosage and CNAs among breast cancers. We have also found the expression of two latent factors without any chromosomal enrichment is highly associated with 12q CNA, likely an instance of “trans”-variations in which CNA leads to the variations in gene expression outside of the CNA region. In addition, we have found that factor 26 (1q CNA) is negatively correlated with HIF-1α protein and hypoxia pathways in breast tumors and cell lines. This agrees with, and for the first time links, known good prognosis associated with both a low hypoxia signature and the presence of CNA in this region. Taken together, these results suggest the possibility that tumor segmental aneuploidy makes significant contributions to variation in the lactic acidosis/hypoxia gene signatures in human cancers and demonstrate that latent factor analysis is a powerful means to uncover such a linkage.

56 citations

Frequently Asked Questions (2)
Q1. What have the authors contributed in "Controlling for confounding effects in single cell rna sequencing studies using both control and target genes" ?

Here, the authors present a novel statistical method, which they refer to as scPLS ( single cell partial least squares ), for robust and accurate inference of confounding effects. ScPLS takes advantage of the fact that genes in a scRNAseq study often can be naturally classified into two sets: a control set of genes that are free of effects of the predictor variables and a target set of genes that are of primary interest. With extensive simulations and comparisons with other methods, the authors demonstrate the effectiveness of scPLS. Finally, the authors apply scPLS to analyze two scRNAseq data sets to illustrate its benefits in removing technical confounding effects as well as for removing cell cycle effects. 

Exploring the use of biological factors in scPLS is an interesting avenue for future research. Therefore, it would be important to evaluate the performance of scPLS in other analysis settings in future studies. Therefore, extending their framework to modeling count data65,66 is another promising avenue for future research. One potential disadvantage of scPLS is that it does not model raw count data directly.