Posted Content•DOI•

Normalization of Single Cell RNA Sequencing Data Using both Control and Target Genes

Mengjie Chen¹, Xiang Zhou²•Institutions (2)

University of North Carolina at Chapel Hill¹, University of Michigan²

22 Mar 2016-bioRxiv (Cold Spring Harbor Laboratory)-pp 045070

TL;DR: This work develops an alternative statistical method, which it refers to as scPLS, for more accurate inference of confounding effects, based on partial least squares and models control and target genes jointly to better infer and control for confounding effects.

read less

Abstract: Single cell RNA sequencing (scRNAseq) technique is becoming increasingly popular for unbiased and high-resolutional transcriptome analysis of heterogeneous cell populations. Despite its many advantages, scRNAseq, like any other genomic sequencing technique, is susceptible to the influence of confounding effects. Controlling for confounding effects in scRNAseq data is thus a crucial step for proper data normalization and accurate downstream analysis. Several recent methodological studies have demonstrated the use of control genes for controlling for confounding effects in scRNAseq studies; the control genes are used to infer the confounding effects, which are then used to normalize target genes of primary interest. However, these methods can be suboptimal as they ignore the rich information contained in the target genes. Here, we develop an alternative statistical method, which we refer to as scPLS, for more accurate inference of confounding effects. Our method is based on partial least squares and models control and target genes jointly to better infer and control for confounding effects. To accompany our method, we develop a novel expectation maximization algorithm for scalable inference. Our algorithm is an order of magnitude faster than standard ones, making scPLS applicable to hundreds of cells and hundreds of thousands of genes. With extensive simulations and comparisons with other methods, we demonstrate the effectiveness of scPLS. We apply scPLS to analyze three scRNAseq data sets to further illustrate its benefits in removing technical confounding effects as well as for removing cell cycle effects.

...read moreread less

Summary (2 min read)

Jump to: [Target Genes] – [Review of Previous Methods] – [Results] – [Discussion] – [Methods] and [Author Contributions]

Target Genes

Single cell RNA sequencing technique is becoming increasingly popular for unbiased and high-resolutional transcriptome analysis of heterogeneous cell populations.
Finally, the authors apply scPLS to analyze two scRNAseq data sets to illustrate its benefits in removing technical confounding effects as well as for removing cell cycle effects.
These hidden confounding factors can cause systematic bias, are notoriously difficult to control for, and are the focus of the present study.
In the Simulations section the authors present comparisons between scPLS and several existing methods using simulations.

Review of Previous Methods

Many statistical methods have been developed in sequencing- and array-based genomic studies to infer hidden confounding factors and control for hidden confounding effects.
The application-specific methods become inconvenient in cases where there are multiple variables of interest (e.g. in eQTL mapping problems).
In contrast, the second subcategory of unsupervised methods are recently developed to take advantage of a set of control genes for inferring the confounding factors29,37.
Similarly, most scRNAseq studies include a set of control genes that are known to have varying expression levels across cell cycles.
The two subcategories of unsupervised methods use different strategies to infer the confounding factors.

Results

The authors provide modeling details for scPLS here.
Consistent with the clustering performance comparison, the authors found that scPLS also yielded more accurate proportion of variance estimates (Fig. 2b).
In the target genes, the confounding factors and structured biological factors explain a median of 18% and 30% of gene expression variance, respectively.
The results are shown in Table 1 and are overall consistent with the simulations.
To demonstrate its effectiveness there, the authors applied scPLS and several other methods to a second dataset that was used for demonstrating cell cycle influence37.

Discussion

The authors have presented scPLS for removing hidden confounding effects in scRNAseq studies.
Importantly, the performance of scPLS is robust to the number of genes included in the control set and yields comparable results even when a much smaller number of control genes is used.
In fact, low-rank factors inferred from many data sets using standard factor models have been linked to important biological pathways or transcription factors42–46.
The authors have been mainly focused on comparing the performance of different confounding effects removing methods by evaluating the clustering performance as the target downstream analysis.
Like many other methods for scRNAseq21 or bulk58,59 RNAseq studies, scPLS requires a data transformation step that converts the count data into quantitative expression data.

Methods

The authors list the EM algorithm below, with detailed derivation provided later.
The naive EM algorithm is computationally expensive: it scales quadratically with the number of genes and linearly with the number of cells/samples.
Therefore, the authors apply the EM-in-chunks algorithm with chunk size 500 throughout the rest of the paper.
In the E step, the authors calculate the expectation of the log likelihood function for complete data.
Ten replicates were performed for each setting on an Intel Xeon E5-2670 2.6 GHz CPU.

Author Contributions

The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations, also known as Publisher's note.
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Did you find this useful? Give us your feedback

Content maybe subject to copyright Report

Scientific RepoRts | 7: 13587 | DOI:10.1038/s41598-017-13665-w

www.nature.com/scientificreports

Controlling for Confounding Eects

in Single Cell RNA Sequencing

Studies Using both Control and

Target Genes

Mengjie Chen

1,2

& Xiang Zhou

3,4

Single cell RNA sequencing (scRNAseq) technique is becoming increasingly popular for unbiased

and high-resolutional transcriptome analysis of heterogeneous cell populations. Despite its many

advantages, scRNAseq, like any other genomic sequencing technique, is susceptible to the inuence

of confounding eects. Controlling for confounding eects in scRNAseq data is a crucial step for

accurate downstream analysis. Here, we present a novel statistical method, which we refer to as scPLS

(single cell partial least squares), for robust and accurate inference of confounding eects. scPLS takes

advantage of the fact that genes in a scRNAseq study often can be naturally classied into two sets: a

control set of genes that are free of eects of the predictor variables and a target set of genes that are

of primary interest. By modeling the two sets of genes jointly using the partial least squares regression,

scPLS is capable of making full use of the data to improve the inference of confounding eects. With

extensive simulations and comparisons with other methods, we demonstrate the eectiveness of

scPLS. Finally, we apply scPLS to analyze two scRNAseq data sets to illustrate its benets in removing

technical confounding eects as well as for removing cell cycle eects.

Single-cell RNA sequencing (scRNAseq) has emerged as a powerful tool in genomics. While the traditional RNA

sequencing, known as the bulk RNAseq, measures gene expression levels averaged across many dierent cells in

a sample of potentially heterogeneous cell population, scRNAseq can measure gene expression levels directly at

the single cell resolution. As a result, scRNAseq is less inuenced by the variation of cell type and cell composition

across dierent samples–a major confounding in the analyses of bulk RNAseq studies. Because of this benet and

its high resolution, scRNAseq provides unprecedented insights into many basic biological questions that are pre-

viously dicult to address. For example, scRNAseq has been applied to classify novel cell subtypes

1,2

and cellular

states

3,4

, reconstruct cell lineage and quantify progressive gene expression during development

5–8

, perform spatial

mapping and re-localization

9,10

, identify dierentially expressed genes and gene expression modulars

11–13

, and

investigate the genetic basis of gene expression variation by detecting heterogenic allelic specic expressions

14,15

Like any other genomic sequencing experiment, scRNAseq studies are inuenced by many factors that can

introduce unwanted variation in the sequencing data and confound the down-stream analysis

. However, such

unwanted variation are oen exacerbated in scRNAseq experiments due to a range of scRNAseq specic con-

ditions that include amplication bias, low amount of input material and low transcript capture eciency

;

dropout events that are driven by both biological and technical factors

18,19

; global changes in expression due to

transcriptional bursts

; as well as changes in cell cycle and cell size

. Indeed, adjusting for confounding fac-

tors in scRNAseq data has been shown to be crucial for accurate estimation of gene expression levels and suc-

cessful down-stream analysis

16–18,22,23

. However, depending on the source, adjusting for confounding factors in

scRNAseq can be non-trivial. Some confounding eects, such as read sampling noise and drop-out events, are

direct consequences of low sequencing-depth, which are random in nature and can be readily addressed by prob-

abilistic modeling using existing statistical methods

18,22–25

. Other confounding eects are inherent to a particular

Department of Medicine, University of Chicago, Chicago, IL 60637, USA.

Department of Human Genetics,

University of Chicago, Chicago, IL 60637, USA.

Department of Biostatistics, University of Michigan, Ann Arbor, MI

48109, USA.

Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA. Correspondence and

requests for materials should be addressed to M.C. (email: mengjiechen@uchicago.edu) or X.Z. (email: xzhousph@

umich.edu)

Received: 18 May 2017

Accepted: 29 September 2017

Published: xx xx xxxx

OPEN

www.nature.com/scientificreports/

Scientific RepoRts | 7: 13587 | DOI:10.1038/s41598-017-13665-w

experimental protocol and can cause amplication bias, but can be easily mitigated by using new protocols

. Yet

other confounding eects are due to observable batches and can be adjusted for by including batch labels and

technician ids as covariates or dealt with other statistical methods

27,28

. However, many confounding factors are

hidden and are dicult or even impossible to measure. Common hidden confounding factors include various

technical artifacts during library preparation and sequencing, and unwanted biological confounders such as cell

cycle status. ese hidden confounding factors can cause systematic bias, are notoriously dicult to control for,

and are the focus of the present study.

To eectively infer and control for hidden confounding factors in scRNAseq studies, we develop a novel sta-

tistical method, which we refer to as scPLS (single cell partial least squares). scPLS takes advantage of the fact

that genes in a scRNAseq study can oen be naturally classied into two sets: a control set of genes that are free of

eects of the predictor variables and a target set of genes that are of primary interest. By modeling the two sets of

genes jointly using the partial least squares regression, scPLS is capable of making full use of the data to improve

the inference of confounding factors. scPLS is closely related to and bridges between two existing subcategories

of methods for transcriptome analysis: a subcategory of methods that treat control and target genes in the same

fashion (e.g. PCA

29–32

and LMM

33–35

), and another subcategory of methods that use control genes alone for infer-

ring confounding factors (e.g. RUV

29,36

and scLVM

). By bridging between the two subcategories of methods,

scPLS enjoys robust performance across a range of application scenarios. scPLS is also computationally ecient:

with a new block-wise expectation maximization (EM) algorithm, it is scalable to thousands of cells and tens of

thousands of genes. Using simulations and two real data applications, we show how scPLS can be used to remove

confounding eects and enable accurate down-stream analysis in scRNAseq studies. Our method is implemented

as a part of the Citrus project and is freely available at: http://chenmengjie.github.io/Citrus/.

e paper is organized as follows. In the Review of Previous Methods section, we provide a brief review of

existing statistical methods for removing confounding eects in transcriptome analysis and describe how scPLS

is related to and motivated from these methods. In the Method Overview Section, we provide a methodologi-

cal description of the scPLS model, with inference details provided in the Methods Section. In the Simulations

section we present comparisons between scPLS and several existing methods using simulations. In Real Data

Applications section, we apply scPLS to two real scRNAseq data sets to remove technical confounding eects or

cell cycle eects. Finally, we conclude the paper with a summary and discussion.

Review of Previous Methods

Many statistical methods have been developed in sequencing- and array-based genomic studies to infer hidden

confounding factors and control for hidden confounding eects. Based on their targeted application, these statis-

tical methods can be generally classied into two categories.

e rst category of methods are supervised and application-specic: these methods are designed to infer

the confounding factors in the presence of a known predictor variable, and to correct for the confounding eects

without removing the eects of the predictor variable. For example, scientists are oen interested in identifying

genes that are dierentially expressed between two pre-determined treatment conditions or that are associated

with a measured predictor variable of interest. To remove the confounding eects in these applications, meth-

ods, include SVA

, sparse regression models

38,39

, and, more recently, RUV

40,41

, are developed. Although these

application-specic methods are widely applied in many genomics studies, their usage is naturally restricted to

cases where the primary variable of interest is known. e application-specic methods become inconvenient in

cases where there are multiple variables of interest (e.g. in eQTL mapping problems). ey also become inappli-

cable when the primary variable of interest is not observed (e.g. in clustering problems).

e second category of methods are unsupervised, and are designed to infer the confounding factors without

knowing or using the predictor variable of interest. Our scPLS belongs to this category. Notable applications of

unsupervised methods in scRNAseq studies include cell type clustering and classication

1–8

. Existing unsuper-

vised statistical methods can be further classied into two subcategories. e rst subcategory of methods treat

all genes in the same fashion and use all of them to infer the confounding factors. For example, the principal

component analysis (PCA) or the factor model extracts the principal components or factors from all genes (or all

highly variable genes) as surrogates for the confounding factors

29–32

. e inferred factors are treated as covariates

whose eects are further removed from gene expression levels before downstream analyses. Similarly, the linear

mixed models (LMMs) construct a sample relatedness matrix based on all genes to capture the inuence of the

confounding factors

33–35

. e relatedness matrix are then included in the downstream analyses to control for the

confounding eects. In contrast, the second subcategory of unsupervised methods are recently developed to take

advantage of a set of control genes for inferring the confounding factors

29,37

. ese methods divide genes into

two sets: a control set of genes that are known to be free of eects of interest a priori and a target set of genes that

are of primary interest. Unlike the rst subcategory, the second subcategory of methods treat the two gene sets

dierently in inferring the confounding factors: the confounding factors are only inferred from the control set,

and are then used to remove the confounding eects in the target genes for subsequent downstream analysis. For

example, scRNAseq studies oen add ERCC spike-in controls prior to the PCR amplication and sequencing

steps. e spike-in controls can be used to capture the hidden confounding technical factors associated with the

experimental procedures, which are further used to remove technical confounding eects (e.g. reverse transcrip-

tion or PCR amplication confounding eects) from the target genes

. Similarly, most scRNAseq studies include

a set of control genes that are known to have varying expression levels across cell cycles. ese cell cycle genes

can be used to capture the unmeasured cell cycle status of each cell, which are further used to remove cell cycle

eects in the target genes

. Prominent methods in the second subcategory include the unsupervised version of

RUV

29,36

and scLVM

e two subcategories of unsupervised methods use dierent strategies to infer the confounding factors.

erefore, these two sets of methods are expected to perform well in dierent settings. Specically, the rst

www.nature.com/scientificreports/

Scientific RepoRts | 7: 13587 | DOI:10.1038/s41598-017-13665-w

subcategory of methods have the advantage of using information contained in all genes to accurately infer the

confounding eects. However, when the predictor variable of interest inuences a large number of genes, then

this subcategory of methods may incorrectly remove the primary eects of interest. On the other hand, the sec-

ond subcategory of methods infer confounding factors only from the control genes and are thus not prone to

mistakenly removing the primary eects of interest. However, these methods overlook one important fact–that

the hidden confounding factors not only inuence the control genes but also the target genes, i.e. the exact reason

that we need to remove such confounding eects in the rst place. Because the confounding factors inuence

both control and target genes, using control genes alone to infer the confounding factors can be suboptimal as it

fails to use the information from target genes.

To more eectively infer and control for hidden confounding factors in scRNAseq studies, we develop a novel

statistical method, which we refer to as scPLS (single cell partial least squares). scPLS bridges between the two

subcategories of unsupervised methods and eectively includes each as a special case. Like the rst subcategory

of methods, scPLS models both control and target genes jointly to infer the confounding factors. Like the second

subcategory of methods, scPLS is capable of taking advantage of a control set to guild the inference of confound-

ing factors. scPLS builds upon the partial least squares regression model and relies on a key modeling assumption

that only target genes contain the primary eects of interest or other systematic biological variations. By incor-

porating such systematic variations in the target genes only, we can jointly model both control and target genes

to infer the confounding eects while avoiding mis-removing the primary eects of interest. erefore, scPLS

has the potential to make full use of the data to improve the inference of confounding factors and the removal of

confounding eects.

Results

scPLS Method Overview. We provide modeling details for scPLS here. While the formulation of scPLS is

general, we focus on its application in scRNAseq. e scRNAseq data resembles that of the bulk RNAseq data and

consists of a gene expression matrix on n cells and

p q+

genes. We consider dividing the genes into two sets: a con-

trol set that contains q control genes and a target set that contains p genes of primary interest. e control genes are

selected based on the purpose of the analysis. For example, the control set would contain ERCC spike-ins if we want

to remove technical confounding factors, and would contain cell cycle genes if we want to remove cell cycle eects.

We use the following partial least squares regression to jointly model both control and target genes:

ΛεεΨ=+

xz,MVN(0,)

(1)

xxixixiii

~yz u ,MVN(0,)

(2)

yuyi yi yi

ΛΛ εε Ψ=+ +

where for

'th individual cell,

is a q-vector of expression levels for q control genes;

is a p-vector of expression

levels for p target genes;

-vector of unknown confounding factors that aect both control and target genes;

the coecients of the confounding factors are represented by the

loading matrix

for the control genes

and the

loading matrix

for the target genes;

is a

-vector of unknown factors in the target genes and

potentially represents the predictors of interest or other structured variations (see below);

is a

matrix;

is a

-vector of idiosyncratic error with covariance

diag(, ,)

xi xxq1

σσΨ =

;

is a

-vector of idiosyn-

cratic error with covariance

diag(, ,)

yi yyp1

σσΨ =

; MVN denotes the multivariate normal distribution. We

assume that

, and

are all independent from each other. Following standard latent factor models, we

further assume that

z IMVN(0, )

and

~u IMVN(0, )

. We model transformed data instead of the raw read

counts. We also assume that the expression levels of each gene have been centered to have mean zero, which

allows us to ignore the intercept.

scPLS includes two types of unknown latent factors. e rst set of factors,

, represents the unknown con-

founding factors that aect both control and target genes. e eects of

on the control and target genes are

captured in the loading matrices

and

, respectively. We call

the confounding factors throughout the text,

and we aim to remove the confounding eects

Λ z

y i

from the target genes. e second set of factors,

, aims to

capture a low dimensional structure of the expression level of

target genes. e factors

can represent the

unknown predictor variables of interest, specic experimental perturbations, cell subpopulations, gene signatures

or other intermediate factors that coordinately regulate a set of genes. erefore, the factors

can be interpreted

as cell subtypes, treatment status, transcription factors or regulators of biological pathways in dierent stud-

ies

42–46

. Although

could be of direct biological interest in many data sets, we do not explicitly examine the

inferred

here. Rather, we view modeling

in the target genes as a way to better capture the complex variance

structure there and to facilitate the precise estimation of confounding factors

. For simplicity, we call

the

biological factors throughout the text, though we note that

could well represent non-biological processes such

as treatment or environmental eects. us, the expression levels of the control genes can be described by a linear

combination of the confounding factors

and residual errors; the expression levels of the target genes can be

described by a linear combination of the confounding factors

, the biological factors

and residual errors. For

both types of confounding factors, we are interested in inferring the factor eects

y i

and

Λ u

u i

rather than the

individual factors

and

. erefore, unlike in standard factor models, we are not concerned with the identia-

bility of the factors. Figure1 shows an illustration of scPLS.

scPLS is closely related to the two subcategories of unsupervised methods described in the previous Section.

Specically, without the biological eects term

Λ u

u i

, scPLS eectively reduces to the rst subcategory of methods

that treat all genes in the same fashion for inferring the confounding factors. Without the Equation2 term, scPLS

eectively reduces to the second subcategory of methods that use only control genes for inference. (Note that,

aer inferring the confounding factors

from Equation1, the second subcategory of methods still use a reduced

www.nature.com/scientificreports/

Scientific RepoRts | 7: 13587 | DOI:10.1038/s41598-017-13665-w

version of Equation2 without the biological eects term

Λ u

u i

to remove the confounding eects.) By including

both modeling terms, scPLS can robustly control for confounding eects across a range of scenarios. erefore,

scPLS provides a exible modeling framework that eectively includes the two subcategories of unsupervised

methods as special cases and has the potential to outperform these previous methods.

Simulations. We performed a simulation study to compare scPLS with other methods. Specically, we sim-

ulated gene expression levels for 50 control genes and 1,000 target genes for 200 cells. ese 200 cells come from

two equal-sized groups, representing two treatment conditions or two cell subpopulations. Among the 1,000

target genes, only 200 of them are dierentially expressed (DE) between the two groups and thus represent the

signature of the two groups. e eect sizes of the DE genes were simulated from a normal distribution and we

scaled the eects further so that the group label explains twenty percent of phenotypic variation (PVE) in expres-

sion levels in the DE genes. In addition to the group eects, we set

k k2, 5

and simulated each element of

and

from a standard normal distribution. We simulated each element of

from

σ−.N(025, )

and each

element of

from

σ.N(0 25, )

. Note that

and

were simulated dierently to capture the fact that the eect

sizes of the confounding factors could be dierent for control and target genes. We simulated each element of

from

N(0,)

. We simulated each element of

and

from a standard normal distribution. We set

σ =.04

and

σ =.06

to ensure that, in non-DE genes, the confounding factors

explain 20% PVE in either the control or the

target genes; the biological factors

explain 30% PVE of the target genes; and the residual errors to explain the

rest of PVE. To vary signal strength in the data, we also created a series of sub data sets by varying the number of

non DE genes in the data, so that the proportion of variance explained by DE genes in total equal to a xed per-

centage (PDE, in the range of 20–100%, with 10% increments). Aer we simulated gene expression levels, we

further converted these continuous values into count data by using a Poisson distribution: the nal observation

for ith cell and jth gene

is from

µ +~c NwPoi( exp( ))

ij ij

, with

being the continuous gene expression levels

simulated above and

N 500000, log(10/500000)µ==

, which ensures an average read count of 10. Note that,

because of the residual errors, the resulting count data are over-dispersed with respect to a Poisson distribution.

We considered three dierent simulation scenarios. In scenario I, the confounding factors

are independent

of group labels. In scenario II, the confounding factors are correlated with group labels. To simulate correlated

data, we simulated each element of

from

N(0,1)

if the corresponding sample belongs to the rst group, but

from

−.N(025,1)

if the corresponding sample belongs to the second group. Finally, we also considered a scenario

III where there is no biological factor (i.e. data were simulated eectively under the PCA modeling assumption

and all genes could be used to infer the confounding factors). We performed 10 simulation replicates for each

scenario. For scenario I and II, we further introduced dropout events that are commonly observed in scRNAseq

data. is was done by going through one gene at a time and setting the expression level for

th gene

to zero

with probability

that depends on the expression level through

og c

π−

We compared scPLS to four dierent methods: (1) PCA and (2) LMM (implemented in GEMMA

47,48

) use all

genes to infer the confounding eects; while (3) RUVseq (version 1.2.0); which we simply refer to as RUV in the

following text) and (4) scLVM (version 0.99.1) use only control genes to infer the confounding eects. We note

that while some of these methods are developed not specically for scRNAseq, these methods represent a range

of strategies to deal with confounding factors. We used default settings in all the above methods. We used the

Figure 1. Illustration of scPLS. We model the expression level of genes in the control set

X()

and genes in the

target set

Y()

jointly. Both control and target genes are aected by the common confounding factors (Z) with

eects

and

in the two gene sets, respectively. e target genes are also inuenced by biological factors (U)

with eects

. e biological factors represent intermediate factors that coordinately regulate a set of genes,

and are introduced to better capture the complex variance structure in the target genes.

and

represent

residual errors. scPLS aims to remove the confounding eects

ΛZ

in the target genes.

www.nature.com/scientificreports/

Scientific RepoRts | 7: 13587 | DOI:10.1038/s41598-017-13665-w

count data directly for RUV and used log transformed data (i.e.

+clog(1)

) for all other methods. For PCA and

RUV, we set the number of latent factors to be the true number (i.e. 2). Such number is determined automatically

by the soware itself for scLVM, and is not needed for LMM. We compared dierent methods based on clustering

performance. In particular, for each of these methods, we obtained corrected data and applied k-means method

to cluster cells into two subpopulations. We the compared the clusters inferred from the corrected data with the

truth and used adjusted rand index (ARI) to measure clustering performance. ARI is computed across a range of

signal strength that is measured as PDE explained above. Intuitively, if a method performs well in removing con-

founding factors, then the corrected data from this method can be used to better infer the two cell subpopulations

and thus yields a higher ARI score.

Overall, scPLS performs the best in both scenarios I and II, with or without dropout events (Fig.2a). e

addition of dropout events in either of the two scenarios reduces the performance of all methods but does not

change their relative rank of performance. e superior performance of scPLS also suggests that properly using

both control and target genes can lead to eective removal of confounding eects. Among the rest of the methods,

PCA and LMM performs better than RUV and scLVM, suggesting that target genes contain a substantial amount

of information for removing confounding eects. Beside the comparison of clustering performance, for each gene

in turn, we also used dierent methods to estimate the proportion of gene expression variance contributed by

confounding factors. Consistent with the clustering performance comparison, we found that scPLS also yielded

more accurate proportion of variance estimates (Fig.2b).

To examine the robustness of scPLS, we applied scPLS to the same data but with a reduced number of control

genes (Fig.3a). Because scPLS does not completely rely on the information contained in the control genes, it

achieves robust performance even if we only use a much smaller subset of control genes. We also examined the

performance of scPLS in Scenario III where there is no biological eects (Fig.3b) and found that scPLS performs

well there. As it is oen unknown whether a low-rank structural variation exists in a real data set, our simulation

suggests that we can always include the biological factors

in the model even in the absence of such factors. In

addition, scPLS is not sensitive with respect to the number of biological factors used in tting the model, and

achieves similar power for a range of reasonable

values (Fig.3c).

Real Data Applications. Next, we applied scPLS to two real data sets. e rst dataset is used to demonstrate

the eectiveness of scPLS in removing the technical confounding eects by using ERCC spike-ins. Removing

technical confounding eects is a common and important task in transcriptome analysis. e second dataset is

used to demonstrate the eectiveness of scPLS in removing cell cycle eects by using a known set of cell cycle

genes. Removing cell cycle eects can reveal gene expression heterogeneity that is otherwise obscured.

Removing Technical Confounding Factors. e rst dataset consists of 251 samples from

. Among

these, 119 are mouse embryonic stem cells (mESCs), including 74 mESCs cultured in a two-inhibitor (2i)

medium and 45 mESCs cultured in a serum medium. e remaining 132 cells are control “cells” that are obtained

by mixing single cells cultured in each condition (i.e. these control “cells” are similar to bulk seq data in terms of

consisting a mixture of cell types, but are prepared and sequenced using single cell protocol). e control cells

include 76 cells cultured in 2i and 56 cells cultured in serum. Because the control cells are homogeneous within

each culture condition, when we cluster these cell, we would expect the only true cluster detectable among these

cells is the culture condition. erefore, we decide to focus on these control cells to compare the performance of

dierent methods for removing technical eects.

We obtained the raw UMI counts data directly from the authors. e data contains measurements for 92

ERCC spike-ins and 23,459 genes. Due to the low coverage of this dataset (median coverage equals one), we

ltered out lowly expressed genes and selected only genes that have at least ve counts and spike-ins that have at

least one count in more than a third of the cells. is ltering step resulted in a total of 32 ERCC spike-ins that

were used as the controls and 2,795 genes that were used as the targets.

As in the simulations, we log transformed the count data and centered the transformed values for scPLS, PCA,

LMM and scLVM. We used the count data for RUV. In this data, scPLS infers

k 1

confounding factors and

k 1

biological factors. In the target genes, the confounding factors and structured biological factors explain a

median of 18% and 30% of gene expression variance, respectively. e PVE by the confounding and biological

factors can be as high as 73.7% and 77.9%, respectively, in the target genes.

We applied scPLS and the other four methods to remove confounding eects in the data. Since control cells

are homogeneous within each culture condition, we reasoned that if the method is eective in removing con-

founding eects, then the corrected data from the corresponding method could be used to better reveal two

clusters that correspond to the two known culture conditions. For the clustering analysis, we applied the four

dierent clustering approaches on the uncorrected or corrected data from dierent methods. e four clustering

approaches include: (1) kmeans, where we applied the k-means method directly on the uncorrected or corrected

data; (2) PCA, where we extracted the top ve PCs from either the uncorrected or corrected data and then applied

the k-means method using the top PCs; (3) tSNE, where we used tSNE to either the uncorrected or corrected data

and then applied the k-means method on the extracted tSNE factors; (4) SC3, where we used a recently devel-

oped state-of-the-art single cell clustering method single cell census clustering (SC3)

. For all these clustering

approaches, we set the number of clusters to two and measured clustering performance by the adjusted Rand

Index (ARI). e results are shown in Table1 and are overall consistent with the simulations. Specically, scPLS

outperforms the other methods in three out of the four clustering approaches. scPLS performs slightly worse

than RUV when tSNE was used to cluster data–but tSNE works extremely poorly in this data presumably because

tSNE’s non-linearity assumption does not t the data well.

HTML Viewer

Citations

PDF

Open Access

More filters

Journal Article•DOI•

Unifying and Generalizing Methods for Removing Unwanted Variation Based on Negative Controls

[...]

David Gerard, Matthew Stephens

23 May 2017-arXiv: Methodology

TL;DR: This paper introduces a general framework, RUV*, that both unites and generalizes existing RUV approaches and provides conditions under which RUV2 and RUV4 are equivalent, and implements RUVB, a version of RUV* based on Bayesian factor analysis.

...read moreread less

Abstract: Unwanted variation, including hidden confounding, is a well-known problem in many fields, particularly large-scale gene expression studies. Recent proposals to use control genes --- genes assumed to be unassociated with the covariates of interest --- have led to new methods to deal with this problem. Going by the moniker Removing Unwanted Variation (RUV), there are many versions --- RUV1, RUV2, RUV4, RUVinv, RUVrinv, RUVfun. In this paper, we introduce a general framework, RUV*, that both unites and generalizes these approaches. This unifying framework helps clarify connections between existing methods. In particular we provide conditions under which RUV2 and RUV4 are equivalent. The RUV* framework also preserves an advantage of RUV approaches --- their modularity --- which facilitates the development of novel methods based on existing matrix imputation algorithms. We illustrate this by implementing RUVB, a version of RUV* based on Bayesian factor analysis. In realistic simulations based on real data we found that RUVB is competitive with existing methods in terms of both power and calibration, although we also highlight the challenges of providing consistently reliable calibration among data sets.

...read moreread less

12 citations

Frequently Asked Questions (2)

Q1. What have the authors contributed in "Controlling for confounding effects in single cell rna sequencing studies using both control and target genes" ?

Here, the authors present a novel statistical method, which they refer to as scPLS ( single cell partial least squares ), for robust and accurate inference of confounding effects. ScPLS takes advantage of the fact that genes in a scRNAseq study often can be naturally classified into two sets: a control set of genes that are free of effects of the predictor variables and a target set of genes that are of primary interest. With extensive simulations and comparisons with other methods, the authors demonstrate the effectiveness of scPLS. Finally, the authors apply scPLS to analyze two scRNAseq data sets to illustrate its benefits in removing technical confounding effects as well as for removing cell cycle effects.

Q2. What are the future works mentioned in the paper "Controlling for confounding effects in single cell rna sequencing studies using both control and target genes" ?

Exploring the use of biological factors in scPLS is an interesting avenue for future research. Therefore, it would be important to evaluate the performance of scPLS in other analysis settings in future studies. Therefore, extending their framework to modeling count data65,66 is another promising avenue for future research. One potential disadvantage of scPLS is that it does not model raw count data directly.

Normalization of Single Cell RNA Sequencing Data Using both Control and Target Genes

Summary (2 min read)

Target Genes

Review of Previous Methods

Results

Discussion

Methods

Author Contributions

Citations

References

Related Papers (5)

Frequently Asked Questions (2)

Q1. What have the authors contributed in "Controlling for confounding effects in single cell rna sequencing studies using both control and target genes" ?

Q2. What are the future works mentioned in the paper "Controlling for confounding effects in single cell rna sequencing studies using both control and target genes" ?