What are the sources of unwanted variation in scRNA-seq?

Some sources of unwanted variation include read depth, capture efficiency, amplification biases, batch effects, and cell cycle [Hicks et al., 2018, Phipson et al., 2017, Lun and Marioni, 2017, Dabney and Meyer, 2012, Kolodziejczyk et al., 2015].

What is the reason why the spike-ins do not have the same variability?

As cells need to be lysed and RNA needs to be extracted from the cells, the authors believe that technical factors affect the measurements for endogenous genes while the spike-ins do not experience the same variability.

What is the likely explanation for the stably expressed cytosolic ribo?

Based on their analysis, the cytosolic ribosomal genes appear to be stably expressed proportional to the total RNA content of a cell.

What is the reason that the normalization of cells by total cell count is necessary?

for this reason – that normalization by total cell count is effectively necessary to adjust for global technical effects, but that normalization by cell total also obscures absolutely stable expression – absolutely stable genes are especially difficult to identify.

What is the likely explanation for the stably expressed genes?

Expression patterns of cytosolic ribosomal genes observed in bulk GTEx experiments further support the conclusion that cytosolic ribosomal genes are proportionally stably expressed.

What is the problem with introducing and measuring spike-ins?

The technical challenge of accurately introducing and measuring a smaller volume of spike-ins reduces their effectiveness as negative controls in scRNA-seq [Robinson and Oshlack, 2010].

What is the correlation between the cytosolic ribosomal genes and the beta?

The expressions of the cytosolic ribosomal genes indicate that they could be effective as reference genes, as most cells express cytosolic ribosomal genes and at a fairly high level (Supplementary Figure 7).

(Open Access) Stably expressed genes in single-cell RNA sequencing. (2020) | Julie M. Deeke

Q: What are the contributions in "Stably expressed genes in single-cell rna-sequencing" ?

Edu 1. CC-BY 4. 0 International license under a not certified by peer review ) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

Q: What are the main reasons for using external spike-in references?

Both Brennecke et al. [2013] and Grün et al. [2014] propose using external spike-in references to remove some of the technical noise present in the data.

Stably expressed genes in single-cell RNA-sequencing

Julie M. Deeke

, Johann A. Gagnon-Bartsch

Department of Statistics, University of Michigan, Ann Arbor

* Corresponding author

Abstract

Motivation: In single-cell RNA-sequencing (scRNA-seq) experiments, RNA transcripts

are extracted and measured from isolated cells to understand gene expression at the cellular

level. Measurements from this technology are aﬀected by many technical artifacts, including

batch eﬀects. In analogous bulk gene expression experiments, external references, e.g., synthetic

gene spike-ins often from the External RNA Controls Consortium (ERCC), may be incorpo-

rated to the experimental protocol for use in adjusting measurements for technical artifacts.

In scRNA-seq experiments, the use of external spike-ins is controversial due to dissimilarities

with endogenous genes and uncertainty about suﬃcient precision of their introduction. Instead,

endogenous genes with highly stable expression could be used as references within scRNA-seq

to help normalize the data. First, however, a speciﬁc notion of stable expression at the single

cell level needs to be formulated; genes could be stable in absolute expression, in proportion to

cell volume, or in proportion to total gene expression. Diﬀerent types of stable genes will be

useful for diﬀerent normalizations and will need diﬀerent methods for discovery.

Results: We compile gene sets whose products are associated with cellular structures and

record these gene sets for future reuse and analysis. We ﬁnd that genes whose ﬁnal product are

associated with the cytosolic ribosome have expressions that are highly stable with respect to

the total RNA content. Notably, these genes appear to be stable in bulk measurements as well.

Supplementary information: The Supplement is available on bioRxiv, and the gene set

database is available through GitHub.

Contact: johanngb@umich.edu

.CC-BY 4.0 International licenseunder a

not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which wasthis version posted November 21, 2018. ; https://doi.org/10.1101/475426doi: bioRxiv preprint

1 Introduction

Single-cell RNA-sequencing (scRNA-seq) experiments measure gene expression at the cellular

level, capturing details at a resolution previously not possible. However, challenges arise due to

unwanted variation that scRNA-seq experiences. Some sources of unwanted variation include

read depth, capture eﬃciency, ampliﬁcation biases, batch eﬀects, and cell cycle [Hicks et al.,

2018, Phipson et al., 2017, Lun and Marioni, 2017, Dabney and Meyer, 2012, Kolodziejczyk

et al., 2015]. Methods have been developed to remove some sources of unwanted variation,

often using certain sets of reference genes to aid in removing either speciﬁc or general eﬀects.

For example, Chen and Zhou [2017] use Bayesian methods to identify control genes that are

unassociated with a factor of interest and adjust the target genes based on the control genes.

Buettner et al. [2015] uses genes that have been annotated as associated with the cell cycle to

remove cell cycle eﬀects from the data. Both Brennecke et al. [2013] and Gr¨un et al. [2014]

propose using external spike-in references to remove some of the technical noise present in the

data. Finally, Lin et al. [2018] propose using stably expressed genes as a form of negative

controls to remove unwanted variation using a procedure called scMerge.

Commercially generated, synthetic, external spike-in references (often External RNA Con-

trols Consortium (ERCC) spike-ins) can be used in bulk gene expression experiments [Baker

et al., 2005, Jiang et al., 2011, Risso et al., 2014, Pine et al., 2016]. The ability to incorporate

external spike-in references in scRNA-seq varies based on the cell isolation protocol. Spike-ins

are not typically included in droplet-based isolation protocols but can be in plate- and well-

based isolation methods including the Fluidigm C1 system [Macosko et al., 2015, Bacher and

Kendziorski, 2016, Lun et al., 2016, Tung et al., 2017].

Limitations to the use of spike-ins are not limited to scRNA-seq contexts. A critique in both

single cell and bulk experiments is that the spike-ins possess qualities that are dissimilar to

endogenous genes [Gr¨un and vanOudenaarden, 2015, Tung et al., 2017]. Spike-ins are designed

to exhibit artiﬁcially wide ranging characteristics, like length and proportion of guanine and

cytosine bases in the nucleic acid sequence, in order to understand how these characteristics

might aﬀect downstream results. Speciﬁc to scRNA-seq, the quantity of solution added for

each cell is much smaller, so minor pipetting errors (with the Fluidigm C1 system) or other

volume errors aﬀect results much more than with a larger quantity of solution [Tung et al.,

2017]. The technical challenge of accurately introducing and measuring a smaller volume of

spike-ins reduces their eﬀectiveness as negative controls in scRNA-seq [Robinson and Oshlack,

2010].

Endogenous genes that are reasonably stably expressed have been proposed for use in nor-

malization of microarray data [Eisenberg and Levanon, 2003, 2013, Gagnon-Bartsch and Speed,

2012]. However, it is unclear that the single cell expression of these same genes exhibit the

same stability as in bulk experiments. For example, a gene may be expressed with a bursting

mechanism, increasing its variability [Jiang et al., 2017, Suter et al., 2011, Fukaya et al., 2016].

Bulk expression data might identify the gene as stably expressed, but that same classiﬁcation

at the single cell level would be inappropriate.

There is a need to discover single cell-speciﬁc stably expressed genes. Lin et al. [2017] propose

a method of creating an index at the single cell level for generating a set of stably expressed

genes across all cell conditions. Desired characteristics of stably expressed genes from Lin et al.

[2017] include a distribution with a small proportion of measurements with low values and a

small variance among the measurements with high values as estimated from parameters of a

Gamma-Gaussian model.

The goals of this paper are: (1) to clarify the notion of ”stable expression” at the single cell

level, and in particular to deﬁne multiple such notions, (2) to propose a method in which to

identify a set of genes that exhibit stable expression, (3) to organize sets of genes based on the

cellular component with which the ﬁnal gene product is associated, and (4) to suggest the set

of cytosolic ribosomal genes as stably expressed with respect to total RNA content.

.CC-BY 4.0 International licenseunder a

not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which wasthis version posted November 21, 2018. ; https://doi.org/10.1101/475426doi: bioRxiv preprint

2 Approach

2.1 Notions of Stability

We ﬁrst consider explicitly what it means for a gene to be stable. The idea of stable expression

has previously been addressed either implicitly or without much elaboration. However, the

notion of stability at the single cell level is inherently ambiguous and requires a precise deﬁnition.

We consider multiple notions of stability.

One notion of a stably expressed gene would be that the gene is expressed at a constant

absolute level. In other words, the number of RNA molecules present within each cell should be

approximately constant, e.g., each cell has about 10 RNA molecules of that gene. We refer to

this notion of stability as ”absolute stability.” Genes that are absolutely stable could replace the

external spike-ins, as they are expected to be present at a ﬁxed absolute amount in each cell.

Like spike-ins, these genes could be used to pick up certain technical eﬀects, such as reaction

eﬃciency.

A second notion of stable expression would be that genes are expressed at a constant pro-

portion with respect to cell volume; that is, they are stable in terms of concentration. We refer

to this as ”stable in concentration.” In addition to picking up technical eﬀects, genes that are

stable in concentration could also pick up and adjust for eﬀects that are associated with cell

size.

Yet another notion of a stably expressed gene would be that the gene is expressed at a

constant proportion with respect to the total RNA content of the cell from all genes. We

refer to this as ”stable in proportion to total RNA content,” or, when it is clear from context,

simply as ”proportionally stable.” In practice, we expect that sets of genes that are stable in

concentration will be similar to sets of genes that are proportionally stable, provided that total

RNA scales with cell size; in that case, both are likely to pick up cell size eﬀects.

The notions of stability described above are biological in nature; they make no reference to

measurements of gene expression, such as those provided by scRNA-seq data. In a hypothetical,

extremely high quality dataset, these biological notions of stability would map clearly to features

of the data. An absolutely stable gene would have a small standard deviation in terms of raw

counts; a proportionally stable gene would have a small standard deviation after dividing the

raw counts by total cell count.

Real data, however, is subject to technical factors that strongly aﬀect the observed counts.

In particular, some factors, like reaction eﬃciency, have a strong global eﬀect on all counts for

a given cell, eﬀectively introducing a random scaling factor for each cell. Thus in real data,

genes that are absolutely stable will not necessarily appear particularly stable in terms of their

raw counts.

Normalizing the raw counts by the total cell count can adjust for the random scaling eﬀect,

and such normalizations are common (e.g., rpkm is a variant of this). After normalizing by total

cell count, however, the notion of stability that is most relevant is proportional stability. That

is, genes that appear stable in the normalized data would be those genes that are proportionally

stable, and genes that are in truth absolutely stable would not necessarily appear stable in the

data. Indeed, for this reason – that normalization by total cell count is eﬀectively necessary to

adjust for global technical eﬀects, but that normalization by cell total also obscures absolutely

stable expression – absolutely stable genes are especially diﬃcult to identify. Note also that

similar comments apply to eﬀorts to discover stably expressed genes in bulk tissue.

Importantly, note that a stably expressed gene can be viewed as the opposite of a diﬀer-

entially expressed gene or a highly variable gene. That is, the notion of stability, whether

implicit or explicit, also implies the notion of instability or variability. In practical terms, the

normalization that is applied to the data may not simply ”clean” the data, but also alter the

biological interpretation of the data, and determine which biological questions can (and cannot)

be answered by the data. For example, normalizing the data against a set of absolutely stable

genes would allow one to identify ”absolutely diﬀerentially expressed genes,” while normalizing

the data against a set of proportionally stable genes would allow one to identify ”proportionally

diﬀerentially expressed genes.” These two sets of diﬀerentially expressed genes could be quite

diﬀerent. Thus, ﬁnding sets of genes that exhibit diﬀerent notions of stability would allow for

diﬀerent types of normalization, which provide diﬀerent biological insights.

.CC-BY 4.0 International licenseunder a

not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which wasthis version posted November 21, 2018. ; https://doi.org/10.1101/475426doi: bioRxiv preprint

2.2 Localization of Stable Genes

The ﬁnal product of a gene (protein, ribosomal RNA, etc.) is often localized to speciﬁc struc-

ture(s) within the cell, e.g., nucleus, cell membrane, etc. We may therefore associate a gene

with the location(s) where that gene’s ﬁnal product(s) are active.

We hypothesize that certain structures may be enriched with genes that are absolutely

stable, while other structures may be enriched with genes that are stable in concentration. For

example, because each cell has one nucleus, there may be a set of nuclear genes that are constant

in absolute expression. In contrast, there may be a set of genes enriched in the cytosol that are

reasonably constant in concentration.

We perform our analysis under the hypothesis that structures are important for identifying

stably expressed genes. We therefore create gene sets for each cell structure and assess the sets

as a whole.

3 Methods

3.1 Mapping Genes to Cell Structures

The Gene Ontology Consortium maintains a database that speciﬁes the cellular component(s)

with which each gene’s ﬁnal product is associated [Xin et al., 2016, Wu et al., 2013]. Many

of the annotations provided by the Gene Ontology Consortium are highly detailed (e.g., ”mi-

tochondrial respiratory chain complex I”); we coarsen these annotations into ten categories

corresponding to major cellular structures; see the Supplement for details. The ten categories

are: nucleus, endoplasmic reticulum, Golgi body, cytoplasm, membrane, ribosome, mitochon-

dria, mitochondrial ribosome, ribonucleoprotein complex, and cytosolic ribosome. We allow

a single gene to be associated with more than one cellular structure, and some genes are not

associated with any category. We will refer to these sets of genes as ”nuclear genes”, ”cytoplasm

genes,” etc.

3.2 Expression Data

We downloaded six datasets containing human cells processed by three diﬀerent lab groups and

from four diﬀerent tissue types from Gene Expression Omnibus; see the Supplement for more

details [Edgar et al., 2002, Arguel et al., 2017, Das et al., 2017, Tung et al., 2017]. We selected

these datasets because they were conducted with the Fluidigm C1 platform and included ERCC

spike-ins. Note also that all but one of these datasets make use of unique molecular identiﬁers

(UMIs) [Kivioja et al., 2012]. We ﬁlter to the genes that are expressed at least once in all

datasets to ensure that genes are expressed in a wide variety of tissue types and to compare

the datasets more directly.

3.3 Absolute Stability

Genes that are absolutely stable would ideally appear stable in the data in terms of raw counts;

however, as noted in Section 2, due to technical factors that result in strong variation in library

size from well to well, simply looking for genes that are stable in terms of raw counts is not

a feasible way to discover absolutely stable genes. Instead, we leverage the stable absolute

”expression” of the ERCC spike-ins, and ﬁnd absolutely stable genes by looking for those genes

that have a high correlation with the ERCCs.

More speciﬁcally, for each of the six datasets we perform the following analysis. We begin

with raw counts of UMIs. We transform the raw counts by log + 1. In addition, for each cell

we also compute the sum of the raw UMIs of all ERCCs; we refer to this as the ”ERCC total.”

Finally, for each gene, we compute Pearson’s correlation (across all cells) between that gene’s

(log + 1) expression and the log of the ERCC total.

We then summarize the correlations by ﬁnding, for each gene, the mean of the correlations

across the six datasets. Thus, for each gene, we now have an average correlation of that gene’s

expression with the ERCC total, and we regard this as a measure of that gene’s absolute

stability. To see if any cellular structures are enriched for stably expressed genes, we plot

.CC-BY 4.0 International licenseunder a

not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which wasthis version posted November 21, 2018. ; https://doi.org/10.1101/475426doi: bioRxiv preprint

histograms of the correlations subsetted by structure, and inspect these histograms to see if

any structures have especially high correlations.

3.4 Proportional Stability

We also attempt to ﬁnd genes that are proportionally stable. Our method is similar to the one

for absolute stability but with an adjusted cell total replacing the ERCC total. For a given

structure S, we sum a cell’s overall measured expression after removing the set of all genes

associated with structure S; we refer to this as an ”adjusted cell total”. For each gene, we

compute Pearson’s correlation (across all cells) between that gene’s (log + 1) expression and

the log of the adjusted cell total.

Again, we summarize correlations by ﬁnding, for each gene, the mean of the correlations

across the six datasets. For each structure, we plot a histogram of that structure’s correlations

with that structure’s adjusted cell total.

4 Results

4.1 Absolute Stability

Figure 1 shows a histogram of the correlations with the ERCC total; histograms of correlations

separated by gene sets can be found in Supplementary Figures 2 and 3. The most notable

observation is that the all correlations were smaller in absolute value than 0.3. The weak

correlations indicate that no gene captures the same artifacts as the spike-ins, and vice versa.

One possible explanation for this is that the spike-ins are not exposed to some technical eﬀect(s)

that endogenous genes are aﬀected by. As cells need to be lysed and RNA needs to be extracted

from the cells, we believe that technical factors aﬀect the measurements for endogenous genes

while the spike-ins do not experience the same variability.

The spike-in measurements exhibit lower variability than the biological measurements within

our datasets. We compared the biological cell total to the ERCC cell total on a log scale for

each of the datasets (Figure 2). We see that the variability for the ERCC measurements is

considerably smaller than the variability for the biological total of a cell. Figures examining the

mean and standard deviation of each of the genes and ERCC spike-ins separates demonstrate

that, again, the ERCC spike-ins have much lower variability when compared to biological genes

with similar overall expression (Supplementary Figure 1).

Overall, the spike-ins appear to have measurements that are more absolutely stable than

the endogenous genes. Since each dataset captures one type of cell and Figure 2 displays cell

totals, technical eﬀects likely contribute most of the variation. The technical factors aﬀecting

the biological cell totals appear larger than technical factors aﬀecting the spike-ins, indicating

that they do not appear to be capturing the same technical eﬀects that are aﬀecting the bio-

logical measurements. Using spike-ins to identify genes that are absolutely stably expressed is

inappropriate.

4.2 Proportional Stability

We examine the measures of proportional stability in Figure 3 and Supplementary Figure 4.

Unlike the correlations with the ERCC measurements, we see large correlations, with values

ranging from -0.03 to 0.95. The cytosolic ribosomal genes exhibit the highest measures of

proportional stability based on their correlations.

Table 1 displays the distribution of cell structures among the genes in our dataset. We also

show the distribution of diﬀerent subsets of genes, including the genes of Eisenberg and Levanon

[2003] and of Lin et al. [2017]. Of the genes with the highest average correlations, 44 of the

top 45 are cytosolic ribosomal genes. 61 of 101 cytosolic ribosomal genes are in the top 100

genes by correlation; for comparison, 69 of 4,402 nuclear genes, 51 of 470 housekeeping genes

from Eisenberg and Levanon [2003], and 37 of 967 single cell housekeeping genes from Lin et al.

[2017] are in the 100 genes with the highest correlations.

Ribosomal genes, from which the cytosolic ribosomal genes are a subset, have been previously

identiﬁed as a plausible option for reference genes for microarray experiments [Thorrez et al.,

.CC-BY 4.0 International licenseunder a

not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which wasthis version posted November 21, 2018. ; https://doi.org/10.1101/475426doi: bioRxiv preprint

Stably expressed genes in single-cell RNA sequencing.

Figures

Citations

Single-cell transcriptomes of developing and adult olfactory receptor neurons in Drosophila .

Enhancer control of transcriptional bursting

Single-cell transcriptomes of developing and adult olfactory receptor neurons in Drosophila

Data analysis guidelines for single-cell RNA-seq in biomedical studies and clinical applications

Introduction to the Special Issue on GIW/ABACBS 2019.

References

Gene Expression Omnibus: NCBI gene expression and hybridization array data repository

A scaling normalization method for differential expression analysis of RNA-seq data

Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets

Massively parallel digital transcriptional profiling of single cells

Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets

Related Papers (5)

Gene Expression Recovery For Single Cell RNA Sequencing

SAVER: gene expression recovery for single-cell RNA sequencing.

SAVER: Gene expression recovery for UMI-based single cell RNA sequencing

From single-cell to cell-pool transcriptomes: Stochasticity in gene expression and RNA splicing

RNA-sequence analysis of human B-cells

Frequently Asked Questions (12)

Q1. What are the contributions in "Stably expressed genes in single-cell rna-sequencing" ?

Q2. What are the sources of unwanted variation in scRNA-seq?

Q3. What are the characteristics of stably expressed genes?

Q4. What is the reason why the spike-ins do not have the same variability?

Q5. What is the way to find stable genes?

Q6. What are the main reasons for using external spike-in references?

Q7. What is the likely explanation for the stably expressed cytosolic ribo?

Q8. What are the goals of this paper?

Q9. What is the reason that the normalization of cells by total cell count is necessary?

Q10. What is the likely explanation for the stably expressed genes?

Q11. What is the problem with introducing and measuring spike-ins?

Q12. What is the correlation between the cytosolic ribosomal genes and the beta?