scispace - formally typeset
Open AccessJournal ArticleDOI

Stably expressed genes in single-cell RNA sequencing.

Reads0
Chats0
TLDR
It is found that genes whose final products are associated with the cytosolic ribosome have expressions that are highly stable with respect to the total RNA content, and these genes appear to be stable in bulk measurements as well.
Abstract
Motivation In single-cell RNA-sequencing (scRNA-seq) experiments, RNA transcripts are extracted and measured from isolated cells to understand gene expression at the cellular level. Measurements from this technology are affected by many technical artifacts, including batch effects. In analogous bulk gene expression experiments, external references, e.g. synthetic gene spike-ins often from the External RNA Controls Consortium (ERCC), may be incorporated to the experimental protocol for use in adjusting measurements for technical artifacts. In scRNA-seq experiments, the use of external spike-ins is controversial due to dissimilarities with endogenous genes and uncertainty about sufficient precision of their introduction. Instead, endogenous genes with highly stable expression could be used as references within scRNA-seq to help normalize the data. First, however, a specific notion of stable expression at the single-cell level needs to be formulated; genes could be stable in absolute expression, in proportion to cell volume, or in proportion to total gene expression. Different types of stable genes will be useful for different normalizations and will need different methods for discovery. Results We compile gene sets whose products are associated with cellular structures and record these gene sets for future reuse and analysis. We find that genes whose final products are associated with the cytosolic ribosome have expressions that are highly stable with respect to the total RNA content. Notably, these genes appear to be stable in bulk measurements as well. Supplementary information Supplementary data are available through GitHub (johanngb/sc-stable).

read more

Content maybe subject to copyright    Report

Stably expressed genes in single-cell RNA-sequencing
Julie M. Deeke
1
, Johann A. Gagnon-Bartsch
1*
,
1
Department of Statistics, University of Michigan, Ann Arbor
* Corresponding author
Abstract
Motivation: In single-cell RNA-sequencing (scRNA-seq) experiments, RNA transcripts
are extracted and measured from isolated cells to understand gene expression at the cellular
level. Measurements from this technology are affected by many technical artifacts, including
batch effects. In analogous bulk gene expression experiments, external references, e.g., synthetic
gene spike-ins often from the External RNA Controls Consortium (ERCC), may be incorpo-
rated to the experimental protocol for use in adjusting measurements for technical artifacts.
In scRNA-seq experiments, the use of external spike-ins is controversial due to dissimilarities
with endogenous genes and uncertainty about sufficient precision of their introduction. Instead,
endogenous genes with highly stable expression could be used as references within scRNA-seq
to help normalize the data. First, however, a specific notion of stable expression at the single
cell level needs to be formulated; genes could be stable in absolute expression, in proportion to
cell volume, or in proportion to total gene expression. Different types of stable genes will be
useful for different normalizations and will need different methods for discovery.
Results: We compile gene sets whose products are associated with cellular structures and
record these gene sets for future reuse and analysis. We find that genes whose final product are
associated with the cytosolic ribosome have expressions that are highly stable with respect to
the total RNA content. Notably, these genes appear to be stable in bulk measurements as well.
Supplementary information: The Supplement is available on bioRxiv, and the gene set
database is available through GitHub.
Contact: johanngb@umich.edu
1
.CC-BY 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted November 21, 2018. ; https://doi.org/10.1101/475426doi: bioRxiv preprint

1 Introduction
Single-cell RNA-sequencing (scRNA-seq) experiments measure gene expression at the cellular
level, capturing details at a resolution previously not possible. However, challenges arise due to
unwanted variation that scRNA-seq experiences. Some sources of unwanted variation include
read depth, capture efficiency, amplification biases, batch effects, and cell cycle [Hicks et al.,
2018, Phipson et al., 2017, Lun and Marioni, 2017, Dabney and Meyer, 2012, Kolodziejczyk
et al., 2015]. Methods have been developed to remove some sources of unwanted variation,
often using certain sets of reference genes to aid in removing either specific or general effects.
For example, Chen and Zhou [2017] use Bayesian methods to identify control genes that are
unassociated with a factor of interest and adjust the target genes based on the control genes.
Buettner et al. [2015] uses genes that have been annotated as associated with the cell cycle to
remove cell cycle effects from the data. Both Brennecke et al. [2013] and Gr¨un et al. [2014]
propose using external spike-in references to remove some of the technical noise present in the
data. Finally, Lin et al. [2018] propose using stably expressed genes as a form of negative
controls to remove unwanted variation using a procedure called scMerge.
Commercially generated, synthetic, external spike-in references (often External RNA Con-
trols Consortium (ERCC) spike-ins) can be used in bulk gene expression experiments [Baker
et al., 2005, Jiang et al., 2011, Risso et al., 2014, Pine et al., 2016]. The ability to incorporate
external spike-in references in scRNA-seq varies based on the cell isolation protocol. Spike-ins
are not typically included in droplet-based isolation protocols but can be in plate- and well-
based isolation methods including the Fluidigm C1 system [Macosko et al., 2015, Bacher and
Kendziorski, 2016, Lun et al., 2016, Tung et al., 2017].
Limitations to the use of spike-ins are not limited to scRNA-seq contexts. A critique in both
single cell and bulk experiments is that the spike-ins possess qualities that are dissimilar to
endogenous genes [Gr¨un and vanOudenaarden, 2015, Tung et al., 2017]. Spike-ins are designed
to exhibit artificially wide ranging characteristics, like length and proportion of guanine and
cytosine bases in the nucleic acid sequence, in order to understand how these characteristics
might affect downstream results. Specific to scRNA-seq, the quantity of solution added for
each cell is much smaller, so minor pipetting errors (with the Fluidigm C1 system) or other
volume errors affect results much more than with a larger quantity of solution [Tung et al.,
2017]. The technical challenge of accurately introducing and measuring a smaller volume of
spike-ins reduces their effectiveness as negative controls in scRNA-seq [Robinson and Oshlack,
2010].
Endogenous genes that are reasonably stably expressed have been proposed for use in nor-
malization of microarray data [Eisenberg and Levanon, 2003, 2013, Gagnon-Bartsch and Speed,
2012]. However, it is unclear that the single cell expression of these same genes exhibit the
same stability as in bulk experiments. For example, a gene may be expressed with a bursting
mechanism, increasing its variability [Jiang et al., 2017, Suter et al., 2011, Fukaya et al., 2016].
Bulk expression data might identify the gene as stably expressed, but that same classification
at the single cell level would be inappropriate.
There is a need to discover single cell-specific stably expressed genes. Lin et al. [2017] propose
a method of creating an index at the single cell level for generating a set of stably expressed
genes across all cell conditions. Desired characteristics of stably expressed genes from Lin et al.
[2017] include a distribution with a small proportion of measurements with low values and a
small variance among the measurements with high values as estimated from parameters of a
Gamma-Gaussian model.
The goals of this paper are: (1) to clarify the notion of ”stable expression” at the single cell
level, and in particular to define multiple such notions, (2) to propose a method in which to
identify a set of genes that exhibit stable expression, (3) to organize sets of genes based on the
cellular component with which the final gene product is associated, and (4) to suggest the set
of cytosolic ribosomal genes as stably expressed with respect to total RNA content.
2
.CC-BY 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted November 21, 2018. ; https://doi.org/10.1101/475426doi: bioRxiv preprint

2 Approach
2.1 Notions of Stability
We first consider explicitly what it means for a gene to be stable. The idea of stable expression
has previously been addressed either implicitly or without much elaboration. However, the
notion of stability at the single cell level is inherently ambiguous and requires a precise definition.
We consider multiple notions of stability.
One notion of a stably expressed gene would be that the gene is expressed at a constant
absolute level. In other words, the number of RNA molecules present within each cell should be
approximately constant, e.g., each cell has about 10 RNA molecules of that gene. We refer to
this notion of stability as ”absolute stability.” Genes that are absolutely stable could replace the
external spike-ins, as they are expected to be present at a fixed absolute amount in each cell.
Like spike-ins, these genes could be used to pick up certain technical effects, such as reaction
efficiency.
A second notion of stable expression would be that genes are expressed at a constant pro-
portion with respect to cell volume; that is, they are stable in terms of concentration. We refer
to this as ”stable in concentration.” In addition to picking up technical effects, genes that are
stable in concentration could also pick up and adjust for effects that are associated with cell
size.
Yet another notion of a stably expressed gene would be that the gene is expressed at a
constant proportion with respect to the total RNA content of the cell from all genes. We
refer to this as ”stable in proportion to total RNA content,” or, when it is clear from context,
simply as ”proportionally stable.” In practice, we expect that sets of genes that are stable in
concentration will be similar to sets of genes that are proportionally stable, provided that total
RNA scales with cell size; in that case, both are likely to pick up cell size effects.
The notions of stability described above are biological in nature; they make no reference to
measurements of gene expression, such as those provided by scRNA-seq data. In a hypothetical,
extremely high quality dataset, these biological notions of stability would map clearly to features
of the data. An absolutely stable gene would have a small standard deviation in terms of raw
counts; a proportionally stable gene would have a small standard deviation after dividing the
raw counts by total cell count.
Real data, however, is subject to technical factors that strongly affect the observed counts.
In particular, some factors, like reaction efficiency, have a strong global effect on all counts for
a given cell, effectively introducing a random scaling factor for each cell. Thus in real data,
genes that are absolutely stable will not necessarily appear particularly stable in terms of their
raw counts.
Normalizing the raw counts by the total cell count can adjust for the random scaling effect,
and such normalizations are common (e.g., rpkm is a variant of this). After normalizing by total
cell count, however, the notion of stability that is most relevant is proportional stability. That
is, genes that appear stable in the normalized data would be those genes that are proportionally
stable, and genes that are in truth absolutely stable would not necessarily appear stable in the
data. Indeed, for this reason that normalization by total cell count is effectively necessary to
adjust for global technical effects, but that normalization by cell total also obscures absolutely
stable expression absolutely stable genes are especially difficult to identify. Note also that
similar comments apply to efforts to discover stably expressed genes in bulk tissue.
Importantly, note that a stably expressed gene can be viewed as the opposite of a differ-
entially expressed gene or a highly variable gene. That is, the notion of stability, whether
implicit or explicit, also implies the notion of instability or variability. In practical terms, the
normalization that is applied to the data may not simply ”clean” the data, but also alter the
biological interpretation of the data, and determine which biological questions can (and cannot)
be answered by the data. For example, normalizing the data against a set of absolutely stable
genes would allow one to identify ”absolutely differentially expressed genes,” while normalizing
the data against a set of proportionally stable genes would allow one to identify ”proportionally
differentially expressed genes.” These two sets of differentially expressed genes could be quite
different. Thus, finding sets of genes that exhibit different notions of stability would allow for
different types of normalization, which provide different biological insights.
3
.CC-BY 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted November 21, 2018. ; https://doi.org/10.1101/475426doi: bioRxiv preprint

2.2 Localization of Stable Genes
The final product of a gene (protein, ribosomal RNA, etc.) is often localized to specific struc-
ture(s) within the cell, e.g., nucleus, cell membrane, etc. We may therefore associate a gene
with the location(s) where that gene’s final product(s) are active.
We hypothesize that certain structures may be enriched with genes that are absolutely
stable, while other structures may be enriched with genes that are stable in concentration. For
example, because each cell has one nucleus, there may be a set of nuclear genes that are constant
in absolute expression. In contrast, there may be a set of genes enriched in the cytosol that are
reasonably constant in concentration.
We perform our analysis under the hypothesis that structures are important for identifying
stably expressed genes. We therefore create gene sets for each cell structure and assess the sets
as a whole.
3 Methods
3.1 Mapping Genes to Cell Structures
The Gene Ontology Consortium maintains a database that specifies the cellular component(s)
with which each gene’s final product is associated [Xin et al., 2016, Wu et al., 2013]. Many
of the annotations provided by the Gene Ontology Consortium are highly detailed (e.g., ”mi-
tochondrial respiratory chain complex I”); we coarsen these annotations into ten categories
corresponding to major cellular structures; see the Supplement for details. The ten categories
are: nucleus, endoplasmic reticulum, Golgi body, cytoplasm, membrane, ribosome, mitochon-
dria, mitochondrial ribosome, ribonucleoprotein complex, and cytosolic ribosome. We allow
a single gene to be associated with more than one cellular structure, and some genes are not
associated with any category. We will refer to these sets of genes as ”nuclear genes”, ”cytoplasm
genes,” etc.
3.2 Expression Data
We downloaded six datasets containing human cells processed by three different lab groups and
from four different tissue types from Gene Expression Omnibus; see the Supplement for more
details [Edgar et al., 2002, Arguel et al., 2017, Das et al., 2017, Tung et al., 2017]. We selected
these datasets because they were conducted with the Fluidigm C1 platform and included ERCC
spike-ins. Note also that all but one of these datasets make use of unique molecular identifiers
(UMIs) [Kivioja et al., 2012]. We filter to the genes that are expressed at least once in all
datasets to ensure that genes are expressed in a wide variety of tissue types and to compare
the datasets more directly.
3.3 Absolute Stability
Genes that are absolutely stable would ideally appear stable in the data in terms of raw counts;
however, as noted in Section 2, due to technical factors that result in strong variation in library
size from well to well, simply looking for genes that are stable in terms of raw counts is not
a feasible way to discover absolutely stable genes. Instead, we leverage the stable absolute
”expression” of the ERCC spike-ins, and find absolutely stable genes by looking for those genes
that have a high correlation with the ERCCs.
More specifically, for each of the six datasets we perform the following analysis. We begin
with raw counts of UMIs. We transform the raw counts by log + 1. In addition, for each cell
we also compute the sum of the raw UMIs of all ERCCs; we refer to this as the ”ERCC total.”
Finally, for each gene, we compute Pearson’s correlation (across all cells) between that gene’s
(log + 1) expression and the log of the ERCC total.
We then summarize the correlations by finding, for each gene, the mean of the correlations
across the six datasets. Thus, for each gene, we now have an average correlation of that gene’s
expression with the ERCC total, and we regard this as a measure of that gene’s absolute
stability. To see if any cellular structures are enriched for stably expressed genes, we plot
4
.CC-BY 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted November 21, 2018. ; https://doi.org/10.1101/475426doi: bioRxiv preprint

histograms of the correlations subsetted by structure, and inspect these histograms to see if
any structures have especially high correlations.
3.4 Proportional Stability
We also attempt to find genes that are proportionally stable. Our method is similar to the one
for absolute stability but with an adjusted cell total replacing the ERCC total. For a given
structure S, we sum a cell’s overall measured expression after removing the set of all genes
associated with structure S; we refer to this as an ”adjusted cell total”. For each gene, we
compute Pearson’s correlation (across all cells) between that gene’s (log + 1) expression and
the log of the adjusted cell total.
Again, we summarize correlations by finding, for each gene, the mean of the correlations
across the six datasets. For each structure, we plot a histogram of that structure’s correlations
with that structure’s adjusted cell total.
4 Results
4.1 Absolute Stability
Figure 1 shows a histogram of the correlations with the ERCC total; histograms of correlations
separated by gene sets can be found in Supplementary Figures 2 and 3. The most notable
observation is that the all correlations were smaller in absolute value than 0.3. The weak
correlations indicate that no gene captures the same artifacts as the spike-ins, and vice versa.
One possible explanation for this is that the spike-ins are not exposed to some technical effect(s)
that endogenous genes are affected by. As cells need to be lysed and RNA needs to be extracted
from the cells, we believe that technical factors affect the measurements for endogenous genes
while the spike-ins do not experience the same variability.
The spike-in measurements exhibit lower variability than the biological measurements within
our datasets. We compared the biological cell total to the ERCC cell total on a log scale for
each of the datasets (Figure 2). We see that the variability for the ERCC measurements is
considerably smaller than the variability for the biological total of a cell. Figures examining the
mean and standard deviation of each of the genes and ERCC spike-ins separates demonstrate
that, again, the ERCC spike-ins have much lower variability when compared to biological genes
with similar overall expression (Supplementary Figure 1).
Overall, the spike-ins appear to have measurements that are more absolutely stable than
the endogenous genes. Since each dataset captures one type of cell and Figure 2 displays cell
totals, technical effects likely contribute most of the variation. The technical factors affecting
the biological cell totals appear larger than technical factors affecting the spike-ins, indicating
that they do not appear to be capturing the same technical effects that are affecting the bio-
logical measurements. Using spike-ins to identify genes that are absolutely stably expressed is
inappropriate.
4.2 Proportional Stability
We examine the measures of proportional stability in Figure 3 and Supplementary Figure 4.
Unlike the correlations with the ERCC measurements, we see large correlations, with values
ranging from -0.03 to 0.95. The cytosolic ribosomal genes exhibit the highest measures of
proportional stability based on their correlations.
Table 1 displays the distribution of cell structures among the genes in our dataset. We also
show the distribution of different subsets of genes, including the genes of Eisenberg and Levanon
[2003] and of Lin et al. [2017]. Of the genes with the highest average correlations, 44 of the
top 45 are cytosolic ribosomal genes. 61 of 101 cytosolic ribosomal genes are in the top 100
genes by correlation; for comparison, 69 of 4,402 nuclear genes, 51 of 470 housekeeping genes
from Eisenberg and Levanon [2003], and 37 of 967 single cell housekeeping genes from Lin et al.
[2017] are in the 100 genes with the highest correlations.
Ribosomal genes, from which the cytosolic ribosomal genes are a subset, have been previously
identified as a plausible option for reference genes for microarray experiments [Thorrez et al.,
5
.CC-BY 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
The copyright holder for this preprint (which wasthis version posted November 21, 2018. ; https://doi.org/10.1101/475426doi: bioRxiv preprint

Figures
Citations
More filters
Journal ArticleDOI

Single-cell transcriptomes of developing and adult olfactory receptor neurons in Drosophila .

TL;DR: The authors used single-cell transcriptomes of Drosophila olfactory (ORNs), thermo-, and hygro-sensory neurons at an early developmental and adult stage using singlecell and single-nucleus RNA sequencing.
Journal Article

Enhancer control of transcriptional bursting

TL;DR: In this article, the authors examined transcriptional bursting in living Drosophila embryos and found that different developmental enhancers positioned downstream of synthetic reporter genes produce transcriptional bursts with similar amplitudes and duration but generate very different bursting frequencies.
Posted ContentDOI

Single-cell transcriptomes of developing and adult olfactory receptor neurons in Drosophila

TL;DR: This work generated single-cell transcriptomes of Drosophila olfactory receptor neurons, thermosensory and hygrosensory neurons from the third antennal segment at an early developmental and adult stage to uncover type-specific and broadly expressed genes that could modulate adult sensory responses.
Journal ArticleDOI

Data analysis guidelines for single-cell RNA-seq in biomedical studies and clinical applications

TL;DR: In this article , the authors present a workflow for typical scRNA-seq data analysis, covering raw data processing and quality control, basic data analysis applicable for almost all scRNAseq data sets, and advanced data analysis that should be tailored to specific scientific questions.
References
More filters
Journal ArticleDOI

Gene Expression Omnibus: NCBI gene expression and hybridization array data repository

TL;DR: The Gene Expression Omnibus (GEO) project was initiated in response to the growing demand for a public repository for high-throughput gene expression data and provides a flexible and open design that facilitates submission, storage and retrieval of heterogeneous data sets from high-power gene expression and genomic hybridization experiments.
Journal ArticleDOI

A scaling normalization method for differential expression analysis of RNA-seq data

TL;DR: A simple and effective method for performing normalization is outlined and dramatically improved results for inferring differential expression in simulated and publicly available data sets are shown.
Related Papers (5)
Frequently Asked Questions (12)
Q1. What are the contributions in "Stably expressed genes in single-cell rna-sequencing" ?

Edu 1. CC-BY 4. 0 International license under a not certified by peer review ) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. 

Some sources of unwanted variation include read depth, capture efficiency, amplification biases, batch effects, and cell cycle [Hicks et al., 2018, Phipson et al., 2017, Lun and Marioni, 2017, Dabney and Meyer, 2012, Kolodziejczyk et al., 2015]. 

Desired characteristics of stably expressed genes from Lin et al. [2017] include a distribution with a small proportion of measurements with low values and a small variance among the measurements with high values as estimated from parameters of a Gamma-Gaussian model. 

As cells need to be lysed and RNA needs to be extracted from the cells, the authors believe that technical factors affect the measurements for endogenous genes while the spike-ins do not experience the same variability. 

Genes that are absolutely stable would ideally appear stable in the data in terms of raw counts; however, as noted in Section 2, due to technical factors that result in strong variation in library size from well to well, simply looking for genes that are stable in terms of raw counts is not a feasible way to discover absolutely stable genes. 

Both Brennecke et al. [2013] and Grün et al. [2014] propose using external spike-in references to remove some of the technical noise present in the data. 

Based on their analysis, the cytosolic ribosomal genes appear to be stably expressed proportional to the total RNA content of a cell. 

The goals of this paper are: (1) to clarify the notion of ”stable expression” at the single cell level, and in particular to define multiple such notions, (2) to propose a method in which to identify a set of genes that exhibit stable expression, (3) to organize sets of genes based on the cellular component with which the final gene product is associated, and (4) to suggest the set of cytosolic ribosomal genes as stably expressed with respect to total RNA content. 

for this reason – that normalization by total cell count is effectively necessary to adjust for global technical effects, but that normalization by cell total also obscures absolutely stable expression – absolutely stable genes are especially difficult to identify. 

Expression patterns of cytosolic ribosomal genes observed in bulk GTEx experiments further support the conclusion that cytosolic ribosomal genes are proportionally stably expressed. 

The technical challenge of accurately introducing and measuring a smaller volume of spike-ins reduces their effectiveness as negative controls in scRNA-seq [Robinson and Oshlack, 2010]. 

The expressions of the cytosolic ribosomal genes indicate that they could be effective as reference genes, as most cells express cytosolic ribosomal genes and at a fairly high level (Supplementary Figure 7).