What are the contributions in "Xcell: digitally portraying the tissue cellular heterogeneity landscape" ?

Here the authors present xCell, a novel gene signature-based method, and use it to infer 64 immune and stromal cell types. The authors harmonized 1822 pure human cell type transcriptomes from various sources, employed a curve fitting approach for linear comparison of cell types, and introduced a novel spillover compensation technique for separating them. Using extensive in silico analyses and comparison to cytometry immunophenotyping, the authors show that xCell outperforms other methods. xCell integrates the advantages of gene set enrichment with deconvolution approaches. The authors present a compendium of newly generated gene signatures for 64 cell types, spanning multiple adaptive and innate immunity cells, hematopoietic progenitors, epithelial cells, and extracellular matrix cells, derived from thousands of expression profiles. The authors examine the ability of these signatures to digitally dissect the tumor microenvironment by in silico analyses, and perform the most comprehensive comparison to date with cytometry immunophenotyping.
The authors compare their inferences with available methods and show that scores from xCell are more reliable for digital dissection of mixed tissues. The authors provide these estimations to the community and hope that this resource will allow researchers to gain a better perspective of the complex cellular heterogeneity in tumor tissues. As background, the paper notes that tumors contain numerous non-cancerous cell types, together termed the tumor microenvironment, which has been in the research spotlight in recent years and is being further explored by novel techniques. The most studied set of non-cancerous cell types are the tumor-infiltrating lymphocytes (TILs). Among the limitations of existing methods, the authors note that current methods focus on only a very narrow range of the tumor microenvironment, usually a subset of immune cell types, and thus do not account for the further richness of cell types in the microenvironment, including blood vessels and other different forms of cell subsets [14, 15], and that in most studies that included cytometry validation, these analyses were performed on only a very limited number of cell types and a limited number of samples.

What is the reliable method for predicting cell type enrichment?

Their method, which is gene signature-based, is more reliable due to its reliance on a group of signatures for each cell type, learned from multiple data sources, which increases the ability to distinguish the signal from the noise.

What did the xCell compensation technique remove from the simulated mixtures?

Their compensation technique was able to completely remove associations between cell types, while previously published signatures showed considerable dependencies between closely related cell types, such as between CD8+

How many nonnegligible scores were detected in the test mixtures?

Applying this procedure to the test simulated mixtures enabled detection of about half of the non-expected nonnegligible scores as non-significant (46.9% change—from 56.4% non-negligible scores to 28.8% with p value > 0.2), while detecting as non-significant only 15.3% of nonnegligible scores for cell types used for generating the mixture (from 88.6% non-negligible scores to 75.1%) (Additional file 4).

What are the reasons for the lower success when inferring abundances in real samples?

Other explanations for the lower success when inferring abundances in real samples are possible: the expression patterns of marker genes in mixtures may well differ from those in purified cells.

What is the significance of the correlations between xCell and cell populations?

Despite the generally improved ability of xCell to estimate cell populations, the authors note that in some cases the observed correlations were relatively low, emphasizing the difficulty of estimating cell subsets in mixed samples and the need for cautious examination and further validation of findings.

(Open Access) xCell: Digitally portraying the tissue cellular heterogeneity landscape (2017) | Dvir Aran

Q: What are the future works in "Xcell: digitally portraying the tissue cellular heterogeneity landscape" ?

The authors provide a simple web tool, xCell (http://xCell.ucsf.edu/), to the community and hope that further studies will utilize it for the discovery of novel predictive and prognostic biomarkers, and new therapeutic targets.

Q: How do the authors reduce dependencies between closely related cell types?

Using in silico mixtures, the authors transform the enrichment scores to a linear scale, and using a spillover compensation technique the authors reduce dependencies between closely related cell types.

Q: What is the function to transform the non-linear association between the scores?

Using simulations of gene expression for each cell type, the authors derived a function to transform the non-linear association between the scores to a linear scale.

METHOD Open Access

xCell: digitally portraying the tissue cellular heterogeneity landscape

Dvir Aran, Zicheng Hu and Atul J. Butte

Abstract

Tissues are complex milieus consisting of numerous cell types. Several recent methods have attempted to enumerate cell subsets from transcriptomes. However, the available methods have used limited sources for training and give only a partial portrayal of the full cellular landscape. Here we present xCell, a novel gene signature-based method, and use it to infer 64 immune and stromal cell types. We harmonized 1822 pure human cell type transcriptomes from various sources, employed a curve fitting approach for linear comparison of cell types, and introduced a novel spillover compensation technique for separating them. Using extensive in silico analyses and comparison to cytometry immunophenotyping, we show that xCell outperforms other methods. xCell is available at http://xCell.ucsf.edu/.

Background

In addition to malignant proliferating cells, tumors are also composed of numerous distinct non-cancerous cell types and activation states of those cell types. Together these are termed the tumor microenvironment, which has been in the research spotlight in recent years and is being further explored by novel techniques. The most studied set of non-cancerous cell types are the tumor-infiltrating lymphocytes (TILs). However, TILs are only part of a variety of innate and adaptive immune cells, stromal cells, and many other cell types that are found in the tumor and interact with the malignant cells. This complex and dynamic microenvironment is now recognized to be important both in promoting and inhibiting tumor growth, invasion, and metastasis [1, 2]. Understanding the cellular heterogeneity composing the tumor microenvironment is key for improving existing treatments, the discovery of predictive biomarkers, and development of novel therapeutic strategies.

Traditional approaches for dissecting the cellular heterogeneity in liquid tissues are difficult to apply in solid tumors [3]. Therefore, in the past decade, several methods have been published for digitally dissecting the tumor microenvironment using gene expression profiles [4–7] (reviewed in [8]). Recently, a multitude of studies have been published applying published and novel techniques on publicly available tumor sample resources, such as The Cancer Genome Atlas (TCGA) [6, 9–13]. Two general types of techniques are used: deconvolving the complete cellular composition and assessing enrichments of individual cell types.

At least seven major issues raise concerns that the in silico methods could be prone to errors and cannot reliably portray the cellular heterogeneity of the tumor microenvironment. First, current techniques depend on the expression profiles of purified cell types to identify reference genes and therefore rely heavily on the data source from which the references are inferred and could thus be inclined to overfit these data. Second, current methods focus on only a very narrow range of the tumor microenvironment, usually a subset of immune cell types, and thus do not account for the further richness of cell types in the microenvironment, including blood vessels and other different forms of cell subsets [14, 15]. A third problem is the ability of cancer cells to “imitate” other cell types by expressing immune-specific genes, such as a macrophage-like expression pattern in tumors with parainflammation [16]; only a few of the methods take this into account. Fourth, the ability of existing methods to estimate cell abundance has not yet been comprehensively validated in mixed samples. Cytometry is a common method for counting cell types in a mixture and, when performed in combination with gene expression profiling, can allow validation of the estimations. However, in most studies that included cytometry validation, these analyses were performed on only a very limited number of cell types and a limited number of samples [7, 13].

* Correspondence: dvir.aran@ucsf.edu; atul.butte@ucsf.edu

Institute for Computational Health Sciences, University of California, San Francisco, California 94158, USA

© The Author(s). This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Aran et al. Genome Biology (2017) 18:220

DOI 10.1186/s13059-017-1349-1

A fifth challenge is that deconvolution approaches are prone to many different biases because of the strict dependencies among all cell types that are inferred. This could highly affect reliability when analyzing tumor samples, which are prone to form non-conventional expression profiles. A sixth problem comes with inferring an increasing number of closely related cell types [10]. Finally, deconvolution analysis heavily relies on the structure of the reference matrix, which limits its application to the resource used to develop the matrix. One such deconvolution approach is CIBERSORT, the most comprehensive study to date, which allows the enumeration of 22 immune subsets [7]. Newman et al. [7] performed adequate evaluation across data sources and validated the estimations using cytometry immunophenotyping. However, the shortcomings of deconvolution approaches are apparent in CIBERSORT, which is limited to Affymetrix microarray studies.

On the other hand, gene set enrichment analysis (GSEA) is a simple technique which can be easily applied across data types and can be quickly applied for cancer studies. In GSEA each gene signature is used independently of all other signatures and it is thus protected from the limitations of deconvolution approaches. However, because of this independence, it is often hard to differentiate between closely related cell types. In addition, gene signature-based methods only provide enrichment scores and thus do not allow comparison across cell types and cannot provide insights into the abundance of cell types in the mixture.

Here, we present xCell, a novel method that integrates the advantages of gene set enrichment with deconvolution approaches. We present a compendium of newly generated gene signatures for 64 cell types, spanning multiple adaptive and innate immunity cells, hematopoietic progenitors, epithelial cells, and extracellular matrix cells derived from thousands of expression profiles. Using in silico mixtures, we transform the enrichment scores to a linear scale, and using a spillover compensation technique we reduce dependencies between closely related cell types. We evaluate these adjusted scores in RNA-seq and microarray data from primary cell type samples from various independent sources. We examine their ability to digitally dissect the tumor microenvironment by in silico analyses, and perform the most comprehensive comparison to date with cytometry immunophenotyping. We compare our inferences with available methods and show that scores from xCell are more reliable for digital dissection of mixed tissues. Finally, we apply our method on TCGA tumor samples to portray a full tumor microenvironment landscape across thousands of samples. We provide these estimations to the community and hope that this resource will allow researchers to gain a better perspective of the complex cellular heterogeneity in tumor tissues.

Results

Generating a gene signature compendium of cell types

To generate our compendium of gene signatures for cell types, we collected gene expression profiles from six sources: the FANTOM5 project, from which we annotated 719 samples from 39 cell types analyzed by the Cap Analysis Gene Expression (CAGE) technique [17]; the ENCODE project, from which we annotated 115 samples from 17 cell types analyzed by RNA-seq [18]; the Blueprint project, from which we annotated 144 samples from 28 cell types analyzed by RNA-seq [19]; the IRIS project, from which we annotated 95 samples from 13 cell types analyzed by Affymetrix microarrays [20]; the Novershtern et al. [21] study, from which we annotated 180 samples from 24 cell types analyzed by Affymetrix microarrays; and the Human Primary Cells Atlas (HPCA), a collection of Affymetrix microarrays composed of many different Gene Expression Omnibus (GEO) datasets, from which we annotated 569 samples from 41 cell types [22] (Fig. 1a). Altogether we collected and curated gene expression profiles from 1822 samples of pure cell types, annotated to 64 distinct cell types and cell subsets (Fig. 1b; Additional file 1). Of those, 54 cell types were found in at least two of these data sources. For cell types with five or more samples in a data source, we left one sample out for testing. Altogether, 97 samples were left out, and all of the model training described below was performed on the remaining 1725 samples.
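The hold-out rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function name `hold_out_split`, the record layout `(source, cell_type, sample_id)`, and the toy sample identifiers are hypothetical, and the left-out sample is chosen at random because the text does not say how it was picked. The per-source counts are the ones reported in the paragraph above.

```python
import random
from collections import defaultdict

# Sample counts per data source, as reported in the text above.
SOURCE_SAMPLES = {
    "FANTOM5": 719, "ENCODE": 115, "Blueprint": 144,
    "IRIS": 95, "Novershtern": 180, "HPCA": 569,
}


def hold_out_split(samples, min_samples=5, seed=0):
    """Split (source, cell_type, sample_id) records into train and test.

    For every (source, cell_type) group with at least `min_samples`
    samples, one randomly chosen sample is moved to the test set.
    The random choice is an assumption of this sketch.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for source, cell_type, sample_id in samples:
        groups[(source, cell_type)].append(sample_id)

    train, test = [], []
    for (source, cell_type), ids in sorted(groups.items()):
        ids = sorted(ids)
        held = rng.choice(ids) if len(ids) >= min_samples else None
        for sample_id in ids:
            record = (source, cell_type, sample_id)
            (test if sample_id == held else train).append(record)
    return train, test


# The six sources add up to the 1822 curated samples.
total = sum(SOURCE_SAMPLES.values())

# Toy annotation table (hypothetical sample identifiers).
toy = [("IRIS", "B-cells", f"b{i}") for i in range(6)]
toy += [("IRIS", "NK cells", f"nk{i}") for i in range(3)]
train, test = hold_out_split(toy)
```

In the toy table only the B-cell group reaches five samples, so exactly one B-cell sample is held out and the three NK cell samples all stay in the training set.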

Our strategy for selecting reliable cell type gene signatures is shown in Fig. 1c (see Additional file 2: Figure S1 and “Methods” for a full description and technical details). For each data source independently we identified genes that are overexpressed in one cell type compared to all other cell types. We applied different thresholds for choosing sets of genes to represent the cell type gene signatures; hence, from each source, we generated dozens of signatures per cell type. This scheme yielded 6573 gene signatures corresponding to 64 cell types. Importantly, since our primary aim is to develop a tool for studying the cellular heterogeneity in the tumor microenvironment, we applied a methodology we previously developed [16] to filter out genes that tend to be overexpressed in a set of 634 carcinoma cell lines from the Cancer Cell Line Encyclopedia (CCLE) [23].
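One way to picture the thresholding scheme is the sketch below. It is illustrative only: the exact criteria are given in the paper's Methods, not in this section, so the quantile comparison, the parameter names (`low_q`, `high_q`, `min_diff`, `exclude`), and all expression values are assumptions. The `exclude` argument stands in for the list of genes filtered out because they tend to be overexpressed in carcinoma cell lines.

```python
def quantile(values, q):
    """Simple index-based quantile of `values` (no interpolation)."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, round(q * (len(ordered) - 1))))
    return ordered[index]


def signature_genes(profiles, target, low_q=0.25, high_q=0.75,
                    min_diff=2.0, exclude=()):
    """Genes overexpressed in `target` compared to every other cell type.

    A gene is kept when a low quantile of its expression in the target
    cell type exceeds the high quantile in each other cell type by at
    least `min_diff`. All thresholds are hypothetical.
    """
    genes = sorted(profiles[target][0])
    selected = []
    for gene in genes:
        if gene in exclude:
            continue
        target_low = quantile([s[gene] for s in profiles[target]], low_q)
        worst_other = max(
            quantile([s[gene] for s in samples], high_q)
            for cell_type, samples in profiles.items()
            if cell_type != target
        )
        if target_low - worst_other >= min_diff:
            selected.append(gene)
    return selected


# Toy log-scale expression values (hypothetical).
profiles = {
    "CD8+ T-cells": [
        {"CD8A": 9.0, "CD3E": 8.5, "CD19": 1.0, "ACTB": 12.0},
        {"CD8A": 8.6, "CD3E": 8.9, "CD19": 1.2, "ACTB": 11.8},
    ],
    "B-cells": [
        {"CD8A": 1.1, "CD3E": 1.5, "CD19": 9.2, "ACTB": 12.1},
        {"CD8A": 0.9, "CD3E": 1.3, "CD19": 8.8, "ACTB": 11.9},
    ],
    "Monocytes": [
        {"CD8A": 1.0, "CD3E": 1.0, "CD19": 1.1, "ACTB": 12.2},
        {"CD8A": 0.8, "CD3E": 1.2, "CD19": 0.9, "ACTB": 12.0},
    ],
}

# Varying the threshold yields several candidate signatures per cell type.
signatures = {
    d: signature_genes(profiles, "CD8+ T-cells", min_diff=d)
    for d in (2.0, 7.2)
}
```

Looser thresholds give longer signatures and stricter ones give shorter, more specific signatures, which is how a single data source can produce dozens of candidates per cell type.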

Next, we used single-sample GSEA (ssGSEA) to score each sample based on all signatures. ssGSEA is a well-known method for determining a single, aggregate score of the enrichment of a set of genes in the top of a ranked gene expression profile [24]. To choose the most reliable signatures we tested their performance in identifying the corresponding cell type in each of the data sources. To prevent overfitting, each signature learned from one data source was tested in other sources, but not in the data source from which it was originally inferred. To reduce biases resulting from a small number of genes and from the analysis of different platforms, instead of one signature per cell type, the top three ranked signatures from each data source were chosen. Altogether we generated 489 gene signatures corresponding to 64 cell types spanning multiple adaptive and innate immunity cells, hematopoietic progenitors, epithelial cells, and extracellular matrix cells (Additional file 3). Observing the scores in the 97 test primary cell type samples affirmed their ability to identify the corresponding cell type compared to other cell types across data sources (Additional file 2: Figure S2). We defined the raw enrichment score per cell type to be the average ssGSEA score from all of the cell type's corresponding signatures.
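The averaging step that defines the raw score can be sketched as follows. Note the substitution: `rank_enrichment` is a deliberately simplified rank-based statistic (the mean rank percentile of the signature genes), used here only as a stand-in for ssGSEA, which computes a different, weighted running-sum statistic. The function names and the toy expression values are hypothetical.

```python
def rank_enrichment(expression, signature):
    """Mean rank percentile (0 to 1) of signature genes in one sample.

    Simplified stand-in for ssGSEA, not the ssGSEA statistic itself.
    `expression` maps gene name to expression value and must contain
    at least two genes.
    """
    ordered = sorted(expression, key=expression.get)  # lowest first
    n = len(ordered)
    rank = {gene: i / (n - 1) for i, gene in enumerate(ordered)}
    present = [gene for gene in signature if gene in rank]
    if not present:
        return 0.0
    return sum(rank[gene] for gene in present) / len(present)


def raw_score(expression, signatures):
    """Average enrichment over all signatures of one cell type."""
    scores = [rank_enrichment(expression, s) for s in signatures]
    return sum(scores) / len(scores)


# One toy sample that resembles a T cell (hypothetical values).
sample = {"CD8A": 9.0, "CD3E": 8.0, "CD19": 1.0, "MS4A1": 0.5, "ACTB": 12.0}

# Several signatures per cell type, as in the compendium.
cd8_signatures = [["CD8A", "CD3E"], ["CD8A"]]
b_cell_signatures = [["CD19", "MS4A1"], ["CD19"]]

cd8_raw = raw_score(sample, cd8_signatures)
b_cell_raw = raw_score(sample, b_cell_signatures)
```

Averaging over several signatures, rather than relying on one, is the property the text credits with reducing biases from small gene sets and from platform differences.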

Spillover compensation between closely related cell types

Our primary objective is to accurately identify enrichment of cell types in mixtures. To imitate such admixtures, we performed an array of simulations of gene expression combinations for different cell types to assess the accuracy and sensitivity of our gene signatures. We generated such in silico expression profiles using different data sources and different sets of cell types in mixtures and by choosing randomly one sample per cell type from all available samples in the data source. The simulations revealed that our raw scores reliably predict even small changes in the proportions of cell types, distinguish between most cell types, and are reliable in different transcriptomic analysis platforms (Additional file 2: Figure S3). However, the simulations also revealed that raw scores of RNA-seq samples are not linearly associated with the abundance and that they do not allow comparisons across cell types (Additional file 2: Figure S4). Thus, using the training samples we generated synthetic expression profiles by mixing the cell type of interest with other, non-related cell types. We then fit a formula that transforms the raw scores to cell type abundances. We found that the transformed scores showed resemblance to the known fractions of the cell types in simulations, thus enabling comparison of scores across cell types, and not just across samples (Additional file 2: Figure S5).
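The two steps in this paragraph, building in silico mixtures and fitting a transform from raw scores to abundances, can be illustrated with a small sketch. The section does not state the functional form of the fitted formula, so the power law `raw = a * fraction**b` used here is an assumption chosen for illustration, as are the function names and the synthetic raw scores.

```python
import math


def mix_profiles(target, background, fraction):
    """Linear in silico mixture of two expression profiles.

    `fraction` of the signal comes from `target`, the rest from
    `background`. Both map gene name to expression value.
    """
    return {
        gene: fraction * target[gene] + (1.0 - fraction) * background[gene]
        for gene in target
    }


def fit_power_law(fractions, raw_scores):
    """Least-squares fit of raw = a * fraction**b on log-log axes."""
    xs = [math.log(f) for f in fractions]
    ys = [math.log(r) for r in raw_scores]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def to_linear_scale(raw, a, b):
    """Invert the fitted curve so the score tracks abundance linearly."""
    return (raw / a) ** (1.0 / b)


# Known mixing fractions of the cell type of interest.
fractions = [0.01, 0.05, 0.1, 0.25, 0.5]
# Hypothetical raw scores that rise non-linearly with abundance.
raw_scores = [0.8 * f ** 0.5 for f in fractions]

a, b = fit_power_law(fractions, raw_scores)
recovered = [to_linear_scale(r, a, b) for r in raw_scores]
```

Because the transform is fitted per cell type against known fractions, the transformed scores of different cell types land on a common scale, which is what makes comparison across cell types possible.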

The simulations also revealed another limitation of the raw scores: closely related cell types tend to have correlating scores (Additional file 2: Figure S5). That is, scores may show enrichment for a cell type due to a “spillover effect” between closely related cell types. This problem mimics the spillover problem in flow cytometry, in which fluorescent signals correlate with each other due to spectrum overlaps. Inspired by the compensation method used in flow cytometry studies [25], we leveraged our simulations to generate a spillover matrix that allows correcting for correlations between cell types. To better compensate for low abundances in mixtures, we created a simulated dataset where each sample contains 25% of the cell type of interest with the rest from a non-related cell type and produced a spillover matrix, a representation of the dependencies of scores between different cell types.
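The idea of compensation can be sketched as a small linear system, in the spirit of flow cytometry compensation. This is a sketch under assumptions rather than the xCell procedure: the matrix orientation (`spillover[i][j]` is the score of cell type `i` produced per unit of cell type `j`, with a unit diagonal), the plain linear solve, the clipping of negative values to zero, and every number below are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def compensate(observed, spillover):
    """Remove spillover from observed scores and clip negatives to zero."""
    corrected = solve_linear(spillover, observed)
    return [max(0.0, value) for value in corrected]


cell_types = ["CD4+ T-cells", "CD8+ T-cells", "Monocytes"]

# Hypothetical spillover matrix: CD4+ and CD8+ scores leak into each other.
spillover = [
    [1.0, 0.3, 0.0],
    [0.2, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# A mixture with no CD4+ T cells at all.
true_abundance = [0.0, 0.4, 0.1]
observed = [
    sum(k * t for k, t in zip(row, true_abundance)) for row in spillover
]
compensated = compensate(observed, spillover)
```

Before compensation the toy mixture shows a spurious CD4+ T cell score of 0.12 that is driven entirely by the CD8+ T cells; after compensation that score returns to zero while the CD8+ and monocyte scores are unchanged.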

Applying the spillover correction procedure on the pure cell types (Fig. 2a) and simulated expression profiles (Fig. 2b, c; Additional file 2: Figures S5 and S6) showed that this method was able to successfully reduce associations between closely related cell types. For example, we generated simulated mixtures using an independent data source of multiple cell types that was not used for the development of the method (GSE60424) [26], and used our method to infer the underlying abundances. We observed decent performance in recapitulating the cell type distributions. However, before correcting for spillovers, there were false associations between CD4+ and CD8+ T cells, as well as between monocytes and neutrophils. The spillover correction was able to reduce these associations significantly without harming the correlations on the diagonal (Fig. 2b). In addition, we generated simulated mixtures using the training samples (Additional file 2: Figure S5) and the test samples (Additional file 2: Figure S6). In the 18 simulated mixtures using the test samples, we observed an overall average decrease of 17.1% in significant correlations off the diagonal (Fig. 2c; Additional file 2: Figure S5). Unexpectedly, following the spillover compensation we observed slightly improved associations on the diagonal between the scores and the underlying abundances (1.4% average improvement).

Finally, many of the cell types we estimate are not expected to be in a given mixture; however, the pipeline


Fig. 1 xCell study design. a A summary of the data sources used in the study to generate the gene signatures, showing the number of pure cell types and number of samples curated from them. b Our compendium of 64 human cell type gene signatures grouped into five cell type families. c The xCell pipeline. Using the data sources and based on different thresholds, we derived gene signatures for 64 cell types. Of this collection of 6573 signatures, we chose the 489 most reliable signatures, three for each cell type from each data source where available. The raw score is then the average single-sample GSEA (ssGSEA) score of all signatures corresponding to the cell type. Using simulations of gene expression for each cell type, we derived a function to transform the non-linear association between the scores to a linear scale. Using the simulations we also derive the dependencies between cell type scores and apply a spillover compensation method to adjust the scores



References

An integrated encyclopedia of DNA elements in the human genome.

Robust enumeration of cell subsets from tissue expression profiles

The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity

GSVA: gene set variation analysis for microarray and RNA-seq data.

Type, Density, and Location of Immune Cells Within Human Colorectal Tumors Predict Clinical Outcome

