xCell: Digitally portraying the tissue cellular heterogeneity landscape

Abstract
Tissues are a complex milieu consisting of numerous cell types. For example, understanding the cellular heterogeneity of the tumor microenvironment is an emerging field of research. Numerous methods have been published in recent years for the enumeration of cell subsets from tissue expression profiles. However, the available methods suffer from three major problems: inferring cell subsets based on gene sets learned and verified from limited sources; providing only a partial portrayal of the full cellular heterogeneity; and insufficient validation in mixed tissues. To address these issues we developed xCell, a novel gene-signature based method for inferring 64 immune and stroma cell types. We first curated and harmonized 1,822 transcriptomic profiles of pure human cell types from various sources, employed a curve fitting approach for linear comparison of cell types, and introduced a novel spillover compensation technique for separating closely related cell types. We test the ability of our model, learned from pure cell types, to infer enrichments of cell types in mixed tissues, using both comprehensive in silico analyses and comparison to cytometry immunophenotyping, and show that our scores outperform previously published methods. Finally, we explore the cell type enrichments in tumor samples and show that the cellular heterogeneity of the tumor microenvironment uniquely characterizes different cancer types. We provide our method for inferring cell type abundances as a public resource to allow researchers to portray the cellular heterogeneity landscape of tissue expression profiles: http://xCell.ucsf.edu/.


METHOD Open Access
xCell: digitally portraying the tissue cellular
heterogeneity landscape
Dvir Aran*, Zicheng Hu and Atul J. Butte*
Abstract
Tissues are complex milieus consisting of numerous cell types. Several recent methods have attempted to enumerate
cell subsets from transcriptomes. However, the available methods have used limited sources for training and give only
a partial portrayal of the full cellular landscape. Here we present xCell, a novel gene signature-based method, and use it
to infer 64 immune and stromal cell types. We harmonized 1822 pure human cell type transcriptomes from various
sources and employed a curve fitting approach for linear comparison of cell types and introduced a novel spillover
compensation technique for separating them. Using extensive in silico analyses and comparison to cytometry
immunophenotyping, we show that xCell outperforms other methods. xCell is available at http://xCell.ucsf.edu/.
Background
In addition to malignant proliferating cells, tumors are
also composed of numerous distinct non-cancerous cell
types and activation states of those cell types. Together
these are termed the tumor microenvironment, which
has been in the research spotlight in recent years and is
being further explored by novel techniques. The most
studied set of non-cancerous cell types are the tumor-
infiltrating lymphocytes (TILs). However, TILs are only
part of a variety of innate and adaptive immune cells,
stromal cells, and many other cell types that are found
in the tumor and interact with the malignant cells. This
complex and dynamic microenvironment is now recog-
nized to be important both in promoting and inhibiting
tumor growth, invasion, and metastasis [1, 2]. Under-
standing the cellular heterogeneity composing the tumor
microenvironment is key for improving existing treat-
ments, the discovery of predictive biomarkers, and
development of novel therapeutic strategies.
Traditional approaches for dissecting the cellular het-
erogeneity in liquid tissues are difficult to apply in solid
tumors [3]. Therefore, in the past decade, several
methods have been published for digitally dissecting the
tumor microenvironment using gene expression profiles
[4–7] (reviewed in [8]). Recently, a multitude of studies
have been published applying published and novel
techniques on publicly available tumor sample resources,
such as The Cancer Genome Atlas (TCGA) [6, 9–13].
Two general types of techniques are used: deconvolving
the complete cellular composition and assessing enrich-
ments of individual cell types.
At least seven major issues raise concerns that the in
silico methods could be prone to errors and cannot
reliably portray the cellular heterogeneity of the tumor
microenvironment. First, current techniques depend on
the expression profiles of purified cell types to identify
reference genes and therefore rely heavily on the data
source from which the references are inferred and could
thus be inclined to overfit these data. Second, current
methods focus on only a very narrow range of the tumor
microenvironment, usually a subset of immune cell
types, and thus do not account for the further richness
of cell types in the microenvironment, including blood
vessels and other different forms of cell subsets [14, 15].
A third problem is the ability of cancer cells to imitate
other cell types by expressing immune-specific genes,
such as a macrophage-like expression pattern in tumors
with parainflammation [16]; only a few of the methods
take this into account. Fourth, the ability of existing
methods to estimate cell abundance has not yet been
comprehensively validated in mixed samples. Cytometry
is a common method for counting cell types in a
mixture and, when performed in combination with gene
expression profiling, can allow validation of the estima-
tions. However, in most studies that included cytometry
validation, these analyses were performed on only a very
limited number of cell types and a limited number of
samples [7, 13].
* Correspondence: dvir.aran@ucsf.edu; atul.butte@ucsf.edu
Institute for Computational Health Sciences, University of California, San
Francisco, California 94158, USA
© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Aran et al. Genome Biology (2017) 18:220
DOI 10.1186/s13059-017-1349-1
A fifth challenge is that deconvolution approaches are
prone to many different biases because of the strict de-
pendencies among all cell types that are inferred. This
could highly affect reliability when analyzing tumor
samples, which are prone to form non-conventional ex-
pression profiles. A sixth problem comes with inferring
an increasing number of closely related cell types [10].
Finally, deconvolution analysis heavily relies on the
structure of the reference matrix, which limits its appli-
cation to the resource used to develop the matrix. One
such deconvolution approach is CIBERSORT, the most
comprehensive study to date, which allows the enumeration
of 22 immune subsets [7]. Newman et al. [7] performed
adequate evaluation across data sources and
validated the estimations using cytometry immunophe-
notyping. However, the shortcomings of deconvolution
approaches are apparent in CIBERSORT, which is
limited to Affymetrix microarray studies.
On the other hand, gene set enrichment analysis
(GSEA) is a simple technique that can be easily applied
across data types and quickly applied to cancer studies.
In GSEA each gene signature is used independently of
all other signatures and it is thus protected from the
limitations of deconvolution approaches. However,
because of this independence, it is often hard to
differentiate between closely related cell types.
In addition, gene signature-based methods only provide
enrichment scores and thus do not allow comparison
across cell types and cannot provide insights into the
abundance of cell types in the mixture.
Here, we present xCell, a novel method that integrates
the advantages of gene set enrichment with deconvolution
approaches. We present a compendium of newly gener-
ated gene signatures for 64 cell types, spanning multiple
adaptive and innate immunity cells, hematopoietic
progenitors, epithelial cells, and extracellular matrix cells
derived from thousands of expression profiles. Using in
silico mixtures, we transform the enrichment scores to a
linear scale, and using a spillover compensation technique
we reduce dependencies between closely related cell types.
We evaluate these adjusted scores in RNA-seq and micro-
array data from primary cell type samples from various
independent sources. We examine their ability to digitally
dissect the tumor microenvironment by in silico analyses,
and perform the most comprehensive comparison to date
with cytometry immunophenotyping. We compare our in-
ferences with available methods and show that scores
from xCell are more reliable for digital dissection of mixed
tissues. Finally, we apply our method on TCGA tumor
samples to portray a full tumor microenvironment land-
scape across thousands of samples. We provide these esti-
mations to the community and hope that this resource
will allow researchers to gain a better perspective of the
complex cellular heterogeneity in tumor tissues.
Results
Generating a gene signature compendium of cell types
To generate our compendium of gene signatures for cell
types, we collected gene expression profiles from six
sources: the FANTOM5 project, from which we anno-
tated 719 samples from 39 cell types analyzed by the
Cap Analysis Gene Expression (CAGE) technique [17];
the ENCODE project, from which we annotated 115
samples from 17 cell types analyzed by RNA-seq [18];
the Blueprint project, from which we annotated 144
samples from 28 cell types analyzed by RNA-seq [19];
the IRIS project, from which we annotated 95 samples
from 13 cell types analyzed by Affymetrix microarrays
[20]; the Novershtern et al. [21] study, from which we
annotated 180 samples from 24 cell types analyzed by
Affymetrix microarrays; and the Human Primary Cells
Atlas (HPCA), a collection of Affymetrix microarrays
composed of many different Gene Expression Omnibus
(GEO) datasets, from which we annotated 569 samples
from 41 cell types [22] (Fig. 1a). Altogether we collected
and curated gene expression profiles from 1822 samples
of pure cell types, annotated to 64 distinct cell types and
cell subsets (Fig. 1b; Additional file 1). Of those, 54 cell
types were found in at least two of these data sources. For
cell types with five or more samples in a data source, we
left one sample out for testing. Altogether, 97 samples
were left out, and all of the model training described
below was performed on the remaining 1725 samples.
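As an illustration only, the hold-out rule described above (one sample set aside for every cell type that has five or more samples within a data source) could be sketched as follows. The tuple layout, the function name `split_holdout`, and the random choice of which sample to hold out are assumptions made for this sketch; the text does not state how the held-out sample was picked.

```python
import random
from collections import defaultdict


def split_holdout(samples, min_samples=5, seed=0):
    """Split samples into training and test lists.

    samples: list of (sample_id, source, cell_type) tuples with unique ids
             (hypothetical layout, chosen for this sketch).
    One sample is held out per (source, cell_type) group that has at least
    `min_samples` members; smaller groups go entirely to training.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sample_id, source, cell_type in samples:
        groups[(source, cell_type)].append(sample_id)

    train, test = [], []
    for key in sorted(groups):
        ids = sorted(groups[key])
        if len(ids) >= min_samples:
            held_out = rng.choice(ids)  # assumption: chosen at random
            test.append(held_out)
            train.extend(i for i in ids if i != held_out)
        else:
            train.extend(ids)
    return train, test
```

Applied to the full collection, a rule of this kind is what partitions the 1822 curated samples into the 97 test samples and 1725 training samples reported above.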
Our strategy for selecting reliable cell type gene signa-
tures is shown in Fig. 1c (see Additional file 2: Figure S1
and Methods for a full description and technical
details). For each data source independently we identified
genes that are overexpressed in one cell type compared
to all other cell types. We applied different
thresholds for choosing sets of genes to represent the
cell type gene signatures; hence, from each source, we
generated dozens of signatures per cell type. This
scheme yielded 6573 gene signatures corresponding to
64 cell types. Importantly, since our primary aim is to
develop a tool for studying the cellular heterogeneity in
the tumor microenvironment, we applied a methodology
we previously developed [16] to filter out genes that tend
to be overexpressed in a set of 634 carcinoma cell lines
from the Cancer Cell Line Encyclopedia (CCLE) [23].
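The idea of deriving several candidate signatures per cell type by sweeping a threshold can be sketched roughly as below. This is a simplified stand-in, not the procedure from the Methods: the overexpression criterion (mean log expression in the target cell type minus the highest mean among all other cell types), the threshold values, the `min_genes` cutoff, and the toy gene names and numbers are all assumptions for illustration. The `excluded` set plays the role of the genes filtered out because they tend to be overexpressed in carcinoma cell lines.

```python
def derive_signatures(mean_expr, cell_type, diffs=(1.0, 3.0, 5.0),
                      excluded=frozenset(), min_genes=2):
    """Return {threshold: sorted gene list} for one cell type.

    mean_expr: {gene: {cell_type: mean log2 expression}} (hypothetical layout).
    A gene enters the signature for threshold `diff` when its mean in
    `cell_type` exceeds the highest mean among the other cell types by at
    least `diff`. Signatures with fewer than `min_genes` genes are dropped.
    """
    signatures = {}
    for diff in diffs:
        genes = []
        for gene, by_type in mean_expr.items():
            if gene in excluded:
                continue  # e.g. genes overexpressed in cancer cell lines
            others = [v for ct, v in by_type.items() if ct != cell_type]
            if others and by_type[cell_type] - max(others) >= diff:
                genes.append(gene)
        if len(genes) >= min_genes:
            signatures[diff] = sorted(genes)
    return signatures


# Toy values, invented purely for illustration.
toy_expr = {
    "G1": {"T": 9.0, "B": 3.0, "Mono": 2.5},
    "G2": {"T": 7.0, "B": 5.5, "Mono": 4.0},
    "G3": {"T": 8.0, "B": 4.0, "Mono": 6.5},
    "G4": {"T": 3.0, "B": 9.0, "Mono": 2.0},
    "G5": {"T": 8.5, "B": 5.0, "Mono": 5.0},
}
toy_signatures = derive_signatures(toy_expr, "T", excluded=frozenset({"G3"}))
```

Stricter thresholds give shorter, more specific gene lists, which is why a single data source can yield many candidate signatures for the same cell type.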
Next, we used single-sample GSEA (ssGSEA) to score
each sample based on all signatures. ssGSEA is a well-
known method for determining a single, aggregate score
of the enrichment of a set of genes in the top of a
ranked gene expression profile [24]. To choose the most
reliable signatures we tested their performance in identi-
fying the corresponding cell type in each of the data

sources. To prevent overfitting, each signature learned
from one data source was tested in other sources, but not
in the data source from which it was originally inferred.
To reduce biases resulting from a small number of genes
and from the analysis of different platforms, instead of
one signature per cell type, the top three ranked signatures
from each data source were chosen. Altogether we gener-
ated 489 gene signatures corresponding to 64 cell types
spanning multiple adaptive and innate immunity cells,
hematopoietic progenitors, epithelial cells, and extracellular
matrix cells (Additional file 3). Observing the scores in
the 97 test primary cell type samples affirmed their ability
to identify the corresponding cell type compared to other
cell types across data sources (Additional file 2: Figure S2).
We defined the raw enrichment score per cell type to be
the average ssGSEA score across all of that cell type's
corresponding signatures.
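To make the scoring step concrete, the following is a minimal sketch of a rank-based enrichment score in the spirit of ssGSEA, followed by the averaging over signatures that defines the raw score. It is a simplified illustration rather than the reference implementation: the rank weighting exponent `alpha=0.25`, the dictionary input format, and the function names are assumptions of this sketch.

```python
def enrichment_score(expr, gene_set, alpha=0.25):
    """Simplified single-sample enrichment score.

    expr: {gene: expression value} for one sample.
    Genes are ordered from highest to lowest expression. The score is the
    sum, over that ordering, of the difference between the rank-weighted
    cumulative fraction of signature genes and the cumulative fraction of
    non-signature genes.
    """
    ordered = sorted(expr, key=lambda g: (-expr[g], g))
    n = len(ordered)
    in_set = [g in gene_set for g in ordered]
    n_in = sum(in_set)
    if n_in == 0 or n_in == n:
        raise ValueError("gene set must cover some, but not all, genes")

    # The top-ranked gene gets rank n, the bottom-ranked gene gets rank 1.
    weights = [(n - i) ** alpha if hit else 0.0 for i, hit in enumerate(in_set)]
    total_weight = sum(weights)
    miss_step = 1.0 / (n - n_in)

    p_hit = p_miss = score = 0.0
    for weight, hit in zip(weights, in_set):
        if hit:
            p_hit += weight / total_weight
        else:
            p_miss += miss_step
        score += p_hit - p_miss
    return score


def raw_score(expr, signatures, alpha=0.25):
    """Average the enrichment score over all signatures of one cell type."""
    scores = [enrichment_score(expr, set(sig), alpha) for sig in signatures]
    return sum(scores) / len(scores)
```

A signature whose genes sit at the top of the ranked profile yields a positive score and one whose genes sit at the bottom yields a negative score; averaging over several signatures per cell type dampens the influence of any single gene list.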
Spillover compensation between closely related cell types
Our primary objective is to accurately identify enrich-
ment of cell types in mixtures. To imitate such ad-
mixtures, we performed an array of simulations of
gene expression combinations for different cell types
to assess the accuracy and sensitivity of our gene sig-
natures. We generated such in silico expression pro-
files using different data sources and different set s of
cell type s in mixtures and by choosing randomly one
sample per cell type from all available samples in the
data source. The simulations revealed that our raw
scores reliably predict even small changes in the pro-
portions of cell types , distinguish between most cell
types, and are reliable in different transcriptomic ana-
lysis platforms (Additional file 2: Figure S3). However,
the simulations also revealed tha t raw scores of RNA-
seq samples are not linearly a ssociated with the abun-
dance and th at th ey do not allow comparisons across
cell types (Additional file 2: Figure S4). Thus, using the
training samples we generated synthetic expression pro-
files by mixing the cell type of interest with other, non-
related cell types. We then fit a formula that transforms
the raw scores to cell type abundances. We found that the
transformed scores showed resemblance to the known
fractions of the cell types in simulations, thus enabling
comparison of scores across cell types, and not just across
samples (Additional file 2: Figure S5).
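The two steps in this paragraph, building synthetic mixtures and fitting a formula that maps raw scores to abundances, can be sketched as below. The text does not give the fitted formula here, so the power-law form `raw = a * fraction ** b` is an assumed stand-in, fitted by ordinary least squares in log space. The function names and the toy numbers are likewise hypothetical.

```python
import math


def mix_profiles(profile_a, profile_b, fraction_a):
    """Linear in silico mixture of two expression profiles (gene -> value)."""
    return {g: fraction_a * profile_a[g] + (1.0 - fraction_a) * profile_b[g]
            for g in profile_a}


def fit_power_law(fractions, raw_scores):
    """Fit raw = a * fraction ** b by least squares on log-transformed data.

    Assumes all fractions and raw scores are strictly positive.
    """
    xs = [math.log(f) for f in fractions]
    ys = [math.log(s) for s in raw_scores]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def to_linear(raw, a, b):
    """Invert the fitted curve so the score is proportional to abundance."""
    return (max(raw, 0.0) / a) ** (1.0 / b)


# Synthetic stand-in for raw scores observed at known mixing fractions.
known_fractions = [0.01, 0.05, 0.1, 0.2, 0.4]
observed_raw = [0.8 * f ** 0.5 for f in known_fractions]
a_fit, b_fit = fit_power_law(known_fractions, observed_raw)
recovered = [to_linear(r, a_fit, b_fit) for r in observed_raw]
```

Because each cell type gets its own fitted parameters, the transformed scores land on a common scale, which is what makes comparison across cell types possible in addition to comparison across samples.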
The simulations also revealed another limitation of the
raw scores: closely related cell types tend to have correl-
ating scores (Additional file 2: Figure S5). That is, scores
may show enrichment for a cell type due to a spillover
effect between closely related cell types. This problem
mimics the spillover problem in flow cytometry, in
which fluorescent signals correlate with each other due
to spectrum overlaps. Inspired by the compensation
method used in flow cytometry studies [25], we lever-
aged our simulations to generate a spillover matrix that
allows correcting for correlations between cell types. To
better compensate for low abundances in mixtures, we
created a simulated dataset where each sample contains
25% of the cell type of interest with the rest from a non-
related cell type and produced a spillover matrix, a
representation of the dependencies of scores between
different cell types.
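By analogy with flow cytometry compensation, the correction can be pictured as solving a small linear system. The sketch below assumes that observed scores are a linear combination of the true signals, with a spillover matrix `K` in which `K[i][j]` is the score registered for cell type `j` per unit of cell type `i` (diagonal normalized to 1). That linear model, the clipping of negative results to zero, and the matrix values in the example are assumptions made for illustration, not the exact procedure used by xCell.

```python
def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("spillover matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def compensate(scores, spillover):
    """Recover x from scores[j] = sum_i x[i] * spillover[i][j]."""
    n = len(scores)
    transposed = [[spillover[i][j] for i in range(n)] for j in range(n)]
    return [max(0.0, value) for value in solve_linear(transposed, scores)]


# Hypothetical two-cell-type example (order: CD4+ T cells, CD8+ T cells).
K = [[1.0, 0.3],
     [0.2, 1.0]]
observed = [0.18, 0.43]           # scores inflated by mutual spillover
corrected = compensate(observed, K)
```

In this toy case the observed scores were constructed from true signals of 0.1 and 0.4, and solving the system removes the contribution each cell type leaks into the other.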
Applying the spillover correction procedure on the
pure cell types (Fig. 2a) and simulated expression pro-
files (Fig. 2b, c; Additional file 2: Figures S5 and S6)
showed that this method was able to successfully reduce
associations between closely related cell types. For example,
we generated simulated mixtures using an independent
data source of multiple cell types that was not
used for the development of the method (GSE60424)
[26], and used our method to infer the underlying
abundances. We observed decent performance in recap-
itulating the cell type distributions. However, before cor-
recting for spillovers, there were false associations
between CD4+ and CD8+ T cells, as well as between
monocytes and neutrophils. The spillover correction was
able to reduce these associations significantly without
harming the correlations on the diagonal (Fig. 2b). In
addition, we generated simulated mixtures using the train-
ing samples (Additional file 2: Figure S5) and the test
samples (Additional file 2: Figure S6). In the 18 simu-
lated mixtures using the test samples, we observed an
overall average decrease of 17.1% in significant correla-
tions off the diagonal (Fig. 2c; Additional file 2: Figure S5).
Unexpectedly, following the spillover compensation we
observed slightly improved associations on the diagonal
between the scores and the underlying abundances (1.4%
average improvement).
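One way to picture the evaluation above is as a table of correlations between each cell type's scores and each cell type's known abundances across simulated mixtures: diagonal entries should be high and off-diagonal entries low. The sketch below computes plain Pearson coefficients and their averages; it does not reproduce the significance testing mentioned in the text, and the function names and toy values are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_table(scores, truth):
    """table[s][t] = correlation of scores for type s with truth for type t."""
    return {s: {t: pearson(scores[s], truth[t]) for t in truth} for s in scores}


def summarize(table):
    """Mean diagonal correlation and mean absolute off-diagonal correlation."""
    diagonal = [table[c][c] for c in table]
    off = [abs(table[s][t]) for s in table for t in table[s] if s != t]
    return sum(diagonal) / len(diagonal), sum(off) / len(off)


# Toy abundances across four simulated mixtures (invented values).
truth = {"CD4": [0.1, 0.2, 0.3, 0.4], "CD8": [0.3, 0.1, 0.4, 0.2]}
# Before compensation each score leaks half of the other cell type's signal.
before = {
    "CD4": [a + 0.5 * b for a, b in zip(truth["CD4"], truth["CD8"])],
    "CD8": [b + 0.5 * a for a, b in zip(truth["CD4"], truth["CD8"])],
}
after = {"CD4": list(truth["CD4"]), "CD8": list(truth["CD8"])}

diag_before, off_before = summarize(correlation_table(before, truth))
diag_after, off_after = summarize(correlation_table(after, truth))
```

In this toy setup removing the leakage lowers the off-diagonal average and raises the diagonal average, which is the same direction of change reported for the simulated mixtures.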
Finally, many of the cell types we estimate are not ex-
pected to be in a given mixture; however, the pipeline
Fig. 1 xCell study design. a A summary of the data sources used in the study to generate the gene signatures, showing the number of pure cell
types and the number of samples curated from each. b Our compendium of 64 human cell type gene signatures grouped into five cell type families.
c The xCell pipeline. Using the data sources and based on different thresholds, we derived gene signatures for 64 cell types. Of this collection of
6573 signatures, we chose the 489 most reliable signatures, three for each cell type from each data source where available. The raw score is then
the average single-sample GSEA (ssGSEA) score of all signatures corresponding to the cell type. Using simulations of gene expression for each cell
type, we derived a function to transform the non-linear association between the scores and the cell type abundances to a linear scale. Using the
simulations we also derived the dependencies between cell type scores and applied a spillover compensation method to adjust the scores.

