Tutorial: guidelines for annotating single-cell transcriptomic maps using automated and manual methods.

doi:10.1038/S41596-021-00534-0

Zoe A. Clarke (1,2,*), Tallulah S. Andrews (2,3,4,*), Jawairia Atif (3,4,*), Delaram Pouyabahar (1,2,*), Brendan T. Innes (1,2), Sonya A. MacParland (3,4,5,+), Gary D. Bader (1,2,6,7,+)

1 - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
2 - The Donnelly Centre, University of Toronto, Toronto, Ontario, Canada
3 - Ajmera Transplant Centre, Toronto General Hospital Research Institute, Toronto, Ontario, Canada
4 - Department of Immunology, University of Toronto, Toronto, Ontario, Canada
5 - Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada
6 - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
7 - Lunenfeld-Tanenbaum Research Institute, Toronto, Ontario, Canada

*equal contribution
+corresponding: s.macparland@utoronto.ca, gary.bader@utoronto.ca

Abstract

Single-cell transcriptomics can profile thousands of cells in a single experiment and identify novel cell types, states and dynamics in a wide range of tissues and organisms. Standard experimental protocols and analysis workflows have been developed to create single-cell transcriptomic maps from tissues. This Tutorial focuses on how to interpret these data to identify cell types, states and other biologically relevant patterns with the objective of creating an annotated map of cells. We recommend a three-step workflow including automatic cell annotation (wherever possible), manual cell annotation and verification. Frequently encountered challenges are discussed, as well as strategies to address them. Guiding principles and specific recommendations for software tools and resources that can be used for each step are covered and an R notebook is included to help run the recommended workflow. Basic familiarity with computer software is assumed and basic knowledge of programming (e.g. in the R language) is recommended.

Introduction

Single-cell genomics enables the molecular profiling of thousands of cells in a single experiment [1–3] to create comprehensive maps of cellular heterogeneity in multicellular systems [4,5]. In particular, single-cell RNA sequencing (scRNA-seq) and single-nuclei RNA sequencing (snRNA-seq) can be used to measure single-cell transcriptomes and map novel cell types [6], states [7] and dynamics [8] in a wide range of tissues and organisms.

Single-cell transcriptomics data are often presented as a two-dimensional “map” organizing cells based on the similarity of their gene expression profiles. Data visualized in this way naturally reveal groups (or “clusters”) of highly similar cells, as well as gradients and other transcript-based patterns. These patterns must be interpreted and annotated to define cell types and states to support biological discovery (Figure 1). Standard experimental protocols and analysis workflows detail how to create single-cell transcriptomic maps from tissues [9–12]. Briefly, tissues are dissociated into single cells and profiled using a single-cell transcriptomic technology. Computational analysis is then used to perform quality control filtering on the results (e.g. removing low-quality cells), quantify the expression of each mapped gene in each cell [13], identify groups of similar cells using a clustering algorithm [14–18], and visualize all cells in two dimensions using techniques such as tSNE [19] or UMAP [20] to produce an unannotated “single-cell map” image (Box 1) [21]. To interpret this map biologically, it is necessary to determine which cell types or cell states are represented by clusters or other patterns (e.g. gradients) observed in the data. These interpretations can then be labeled on the map, which helps place them in a conceptual framework useful for better understanding tissue biology. This Tutorial offers a guide to the map interpretation and labeling process, starting from clustered data and resulting in a completely annotated single-cell map (Figure 1).

The general workflow for annotating cells in scRNA-seq data has three major steps: automatic annotation, manual annotation, and verification (Figure 2). First, automatic annotation uses a predefined set of “marker genes” (i.e. genes that are specifically expressed in a known cell type) or reference single-cell data (i.e. an existing expertly annotated single-cell map) to identify and label individual cells or cell clusters by matching their gene expression patterns (signatures) to those of known cell types. A second major step is manual annotation, which involves studying genes and gene functions specific to each cell cluster or pattern to verify automatic cell annotations and identify novel cell types and states. Finally, verification can confirm the identity and function of select cell types using independent methods, such as new validation experiments.

Step 1: Automatic cell annotation

Automatic cell annotation is an efficient way to label cells or cell clusters using a computer algorithm and an appropriate set of prior biological knowledge. The general principle is to identify a gene expression signal (pattern, signature) in a single cell or cell cluster that matches a characteristic gene expression signature of a known cell type or state; the cell or cluster is then assigned the respective label. Labels often have an associated confidence score.

There are two major automatic cell annotation approaches. One is to use known marker genes for each of the cell types that are likely to be found in the sample to be annotated (referred to as “marker-based automatic annotation”). In this case, known relationships between marker genes and cell types are obtained from databases, such as SCSig [22], PanglaoDB [23], and CellMarker [24], or manually from the literature. Then cells or clusters are labeled according to the marker genes they characteristically express. The second approach is to compare single-cell RNA-seq data to be annotated (the ‘query’ data set) to an existing, similar, expertly annotated scRNA-seq data set (the ‘reference’ data set), and transfer the label from a reference cell or cluster to a sufficiently similar one in the query (referred to as “reference-based automatic annotation”). Reference single-cell data are obtained from sources such as Gene Expression Omnibus (GEO) [25], the Single Cell Expression Atlas [26] or cell atlas projects [27,28].

Automatic cell annotation methods can be applied to individual cells (either before or after clustering) or to clusters of cells (only after clustering). In the case of annotating clusters, the gene expression profile for each cluster is determined by averaging the expression profiles of all cells within the cluster. Annotating individual cells is ideal, as this reduces the chance of missing important differences between cells. However, some scRNA-seq experimental data are based on low numbers of transcript reads per cell, so there may be insufficient data for cell-based annotation to function correctly, making clustered data sets easier to work with. Annotating clusters is faster, as there are fewer clusters than cells to process; it can also be more accurate than the single-cell approach because it is based on more reliable expression level estimates averaged across all cells in a cluster. However, not all cells can be easily grouped into clusters, especially for dynamic systems like developing tissues [29] or tissues that contain gene expression gradients [30,31].
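The cluster-averaging step described above can be sketched in a few lines. This is a minimal illustration in plain Python, not code from the Tutorial's R notebook; the function name `cluster_mean_profiles`, the dictionary-based data layout and the toy expression values are all assumptions made for the example.

```python
from collections import defaultdict


def cluster_mean_profiles(expression, cluster_of):
    """Average per-cell expression values within each cluster.

    expression: dict mapping cell id -> dict of gene -> normalized expression
    cluster_of: dict mapping cell id -> cluster label
    Returns a dict mapping cluster label -> dict of gene -> mean expression.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for cell, profile in expression.items():
        cluster = cluster_of[cell]
        counts[cluster] += 1
        for gene, value in profile.items():
            sums[cluster][gene] += value
    # A gene absent from a cell's dict contributes zero to the cluster sum.
    return {
        cluster: {gene: total / counts[cluster] for gene, total in genes.items()}
        for cluster, genes in sums.items()
    }


# Toy input (hypothetical values, for illustration only).
expression = {
    "cell1": {"CD3E": 2.0, "MS4A1": 0.0},
    "cell2": {"CD3E": 4.0, "MS4A1": 0.0},
    "cell3": {"CD3E": 0.0, "MS4A1": 3.0},
}
cluster_of = {"cell1": "c0", "cell2": "c0", "cell3": "c1"}
profiles = cluster_mean_profiles(expression, cluster_of)
```

Each cluster is reduced to a single averaged profile, which is why cluster-level annotation needs far fewer comparisons than cell-level annotation, at the cost of hiding differences between cells within a cluster.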

A major challenge with automatic cell annotation is that many cell types do not have well-characterized gene expression signatures, resulting in incomplete or inaccurate labeling for some cells. Automated methods typically work better for major cell types and may not be able to effectively distinguish subtypes. Thus, automatic cell annotation is useful to quickly identify known cell types and highlight unknown cell types for further exploration. The main caveats and recommendations for automatic cell annotation are summarised in Table 1.

Marker-based automatic annotation

Marker-based automatic annotation labels cells or cell clusters based on the characteristic expression of known marker genes. To be successful, the marker gene or gene set (a collection of marker genes) should be specifically and consistently expressed in a given cell, cluster or class of cells (e.g. immune cells). Markers are readily available for well-characterized organisms and cell types (e.g. human PBMC samples [32]). Marker-based automatic annotation works well once a relevant and sufficiently large set of marker genes is collected [33].

To label individual cells, one of the most reliable marker-based annotation tools is Semi-supervised Category Identification and Assignment (SCINA) [34]. SCINA assumes each marker follows a bimodal gene expression distribution, where one peak corresponds to cells from the associated cell type and the other peak contains the rest of the cells in the experiment. A cell of a particular type is assumed to have expression in the upper part of this distribution for all the markers of that cell type, consequently requiring markers provided as input to SCINA to be specific to only one cell type. AUCell [35] is another good marker-based labeling method that classifies individual cells or clusters. AUCell ranks the genes in each cell by decreasing expression value, and cells are labeled according to their most active (highly expressed) marker gene sets. AUCell works best with cell types that have a sufficiently large set of marker genes such that multiple markers are detected in each cell. It has the advantage of scoring a whole set of marker genes at once, which may increase sensitivity over methods that examine each marker gene independently.
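The idea of ranking genes within a cell and scoring whole marker sets can be illustrated with a simplified sketch. This is not AUCell's actual statistic (which is based on an area under a recovery curve); here the score is simply the fraction of a marker set found among the top-ranked genes of a cell. The function names, the `top_n` and `min_score` parameters and the toy expression values are assumptions made for illustration.

```python
def top_ranked_genes(cell_profile, top_n):
    """Return the set of the top_n most highly expressed genes in one cell."""
    # Rank by decreasing expression; break ties alphabetically for determinism.
    ranked = sorted(cell_profile, key=lambda gene: (-cell_profile[gene], gene))
    return set(ranked[:top_n])


def score_marker_sets(cell_profile, marker_sets, top_n):
    """Score each marker set as the fraction of its genes among the top_n."""
    top = top_ranked_genes(cell_profile, top_n)
    return {
        label: len(top & set(markers)) / len(markers)
        for label, markers in marker_sets.items()
    }


def label_cell(cell_profile, marker_sets, top_n=3, min_score=0.5):
    """Assign the best-scoring label, or 'unassigned' below min_score."""
    scores = score_marker_sets(cell_profile, marker_sets, top_n)
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else "unassigned"


# Illustrative marker sets and toy cells (hypothetical expression values).
marker_sets = {"T cell": ["CD3E", "CD3D"], "B cell": ["MS4A1", "CD79A"]}
t_like = {"ACTB": 6.0, "CD3E": 5.0, "CD3D": 4.0,
          "ALB": 0.1, "MS4A1": 0.0, "CD79A": 0.0}
unknown = {"ACTB": 6.0, "GAPDH": 5.0, "ALB": 3.0,
           "CD3E": 0.0, "CD3D": 0.0, "MS4A1": 0.0, "CD79A": 0.0}

t_label = label_cell(t_like, marker_sets)
unknown_label = label_cell(unknown, marker_sets)
```

Because the score pools evidence across all genes in a set, a cell can still be labeled when one marker is missing, provided the set is large enough; the `min_score` cutoff is what allows cells with no convincing match to stay unlabeled.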

To label whole clusters, Gene Set Variation Analysis (GSVA) [36] has been shown in benchmarks to be fast and reliable [37]. GSVA works similarly to AUCell: given a database of marker gene sets, it identifies sets that are enriched in the gene expression profile of a cluster. The GSVA software has the practical advantage that it can annotate all clusters in one operation.
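As a rough illustration of cluster-level gene set enrichment, the sketch below scores each marker set against each averaged cluster profile and labels all clusters in a single call. The score used here (mean expression of set genes minus mean expression of all other genes) is a deliberately simple stand-in and is not the statistic computed by GSVA; the function names and toy values are assumptions made for the example.

```python
from statistics import mean


def enrichment_score(cluster_profile, gene_set):
    """Mean expression of genes in the set minus mean of all other genes."""
    in_set = [v for g, v in cluster_profile.items() if g in gene_set]
    out_set = [v for g, v in cluster_profile.items() if g not in gene_set]
    if not in_set or not out_set:
        return 0.0  # Score is undefined without both groups; treat as neutral.
    return mean(in_set) - mean(out_set)


def annotate_clusters(cluster_profiles, marker_sets):
    """Label every cluster with its highest-scoring marker set."""
    labels = {}
    for cluster, profile in cluster_profiles.items():
        scores = {
            label: enrichment_score(profile, set(genes))
            for label, genes in marker_sets.items()
        }
        labels[cluster] = max(scores, key=scores.get)
    return labels


# Toy averaged cluster profiles (hypothetical values, for illustration only).
marker_sets = {"T cell": ["CD3E", "CD3D"], "B cell": ["MS4A1", "CD79A"]}
cluster_profiles = {
    "c0": {"CD3E": 3.0, "CD3D": 2.0, "MS4A1": 0.0, "CD79A": 0.0},
    "c1": {"CD3E": 0.0, "CD3D": 0.0, "MS4A1": 3.0, "CD79A": 1.0},
}
cluster_labels = annotate_clusters(cluster_profiles, marker_sets)
```

Note that this toy version always assigns a label; a practical workflow would add a minimum score so that clusters without a convincing match remain unlabeled.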

Marker-based automatic cell annotation methods often have the advantage that they will only assign labels to cells associated with known markers and other cells will remain unlabeled [33]. However, this depends on the specific tool and the parameters used; see Table 2 and Supplementary Table 1 for details on which tools have the option to leave cells unlabeled. A disadvantage of these tools is that markers are not easily accessible for all cell types.

Reference-based automatic cell annotation

Reference-based cell annotation is based on the concept of “guilt-by-association”, whereby a cell or cluster label in the reference data is transferred to an unlabeled cell or cluster in the query data with a similar gene expression profile. Consequently, this approach is only possible if high-quality and relevant annotated reference single-cell data are available. Studying the original clustering and annotation steps performed on the reference data can help determine its quality, and ensure that errors in the reference will not be propagated to new data. Tissue-specific reference data can be obtained from public databases (e.g. the Gene Expression Omnibus [25] or the Expression Atlas [26]) or large cell atlas projects (e.g. the Human Cell Atlas [27], the Tabula Muris or Mouse Cell Atlas [5], or others [4,28,38–40]), although the required associated cell annotations are not always easily available. These atlases typically contain hundreds of thousands of cells and dozens of different annotated cell types.

scmap [41] is one of the best performing tools for reference-based automatic cell or cluster annotation, in terms of both accuracy of assigned labels and avoiding incorrect labeling of novel cell types [33]. Other tools for reference-based automatic annotation include SingleCellNet [42] and SingleR [43]. SingleCellNet has high accuracy when all cell types are well represented in the reference data but low accuracy if the reference data are incomplete or represent a poor match [33]. The main advantage of SingleR is that a reasonable, general reference data set is included with the tool, but this may not perform as well as a reference specifically matched to the query data set. An alternative to using specific software packages for reference-based cell annotation is to train a machine learning tool, such as a support vector machine (SVM) [44] or random forest classifier [45], on selected reference data. This model can then be applied to classify cells or clusters as specific cell types in novel data. These methods can outperform any of the prepackaged automatic-annotation software tools [33] but require substantial computational expertise to use.
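The “guilt-by-association” principle behind reference-based annotation can be sketched as a nearest-profile label transfer. The example below compares each query profile to averaged reference profiles by cosine similarity and leaves a query unassigned when no reference is similar enough. It is a generic illustration, not the implementation of scmap, SingleR or SingleCellNet; the function names, the `min_similarity` cutoff of 0.7 and the toy values are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two gene -> expression dicts on shared genes."""
    genes = set(a) & set(b)
    dot = sum(a[g] * b[g] for g in genes)
    norm_a = math.sqrt(sum(a[g] ** 2 for g in genes))
    norm_b = math.sqrt(sum(b[g] ** 2 for g in genes))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An all-zero profile carries no information.
    return dot / (norm_a * norm_b)


def transfer_labels(query_profiles, reference_profiles, min_similarity=0.7):
    """Give each query the label of its most similar reference profile.

    Queries whose best similarity is below min_similarity stay 'unassigned',
    which helps avoid forcing a known label onto a novel cell type.
    """
    labels = {}
    for query_id, query in query_profiles.items():
        sims = {
            label: cosine_similarity(query, ref)
            for label, ref in reference_profiles.items()
        }
        best = max(sims, key=sims.get)
        labels[query_id] = best if sims[best] >= min_similarity else "unassigned"
    return labels


# Toy reference and query profiles (hypothetical values, for illustration).
reference_profiles = {
    "T cell": {"CD3E": 3.0, "MS4A1": 0.0, "ALB": 0.0},
    "B cell": {"CD3E": 0.0, "MS4A1": 3.0, "ALB": 0.0},
}
query_profiles = {
    "q0": {"CD3E": 2.0, "MS4A1": 0.1, "ALB": 0.0},  # resembles the T cell reference
    "q1": {"CD3E": 0.0, "MS4A1": 0.2, "ALB": 4.0},  # cell type absent from reference
}
transferred = transfer_labels(query_profiles, reference_profiles)
```

The second query shows why a similarity cutoff matters: without it, a cell type missing from the reference would silently receive the label of whichever reference profile happened to be closest.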

Another approach to reference-based cell annotation is to integrate a query data set with a reference data set using an integration algorithm, enabling clusters to be identified that span


