Tutorial: guidelines for annotating single-cell transcriptomic maps using automated and manual methods.

Abstract
Single-cell transcriptomics can profile thousands of cells in a single experiment and identify novel cell types, states and dynamics in a wide variety of tissues and organisms. Standard experimental protocols and analysis workflows have been developed to create single-cell transcriptomic maps from tissues. This tutorial focuses on how to interpret these data to identify cell types, states and other biologically relevant patterns with the objective of creating an annotated map of cells. We recommend a three-step workflow including automatic cell annotation (wherever possible), manual cell annotation and verification. Frequently encountered challenges are discussed, as well as strategies to address them. Guiding principles and specific recommendations for software tools and resources that can be used for each step are covered, and an R notebook is included to help run the recommended workflow. Basic familiarity with computer software is assumed, and basic knowledge of programming (e.g., in the R language) is recommended. This tutorial provides guidelines for interpreting single-cell transcriptomic maps to identify cell types, states and other biologically relevant patterns.

Zoe A. Clarke¹,²,*, Tallulah S. Andrews²,³,⁴,*, Jawairia Atif³,⁴,*, Delaram Pouyabahar¹,²,*,
Brendan T. Innes¹,², Sonya A. MacParland³,⁴,⁵,+, Gary D. Bader¹,²,⁶,⁷,+
1 - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
2 - The Donnelly Centre, University of Toronto, Toronto, Ontario, Canada
3 - Ajmera Transplant Centre, Toronto General Hospital Research Institute, Toronto, Ontario, Canada
4 - Department of Immunology, University of Toronto, Toronto, Ontario, Canada
5 - Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada
6 - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
7 - Lunenfeld-Tanenbaum Research Institute, Toronto, Ontario, Canada
*equal contribution
+corresponding: s.macparland@utoronto.ca, gary.bader@utoronto.ca

Introduction
Single-cell genomics enables the molecular profiling of thousands of cells in a single
experiment [1–3] to create comprehensive maps of cellular heterogeneity in multicellular systems [4,5].
In particular, single-cell RNA sequencing (scRNA-seq) and single-nuclei RNA sequencing
(snRNA-seq) can be used to measure single-cell transcriptomes and map novel cell types [6],
states [7] and dynamics [8] in a wide range of tissues and organisms.
Single-cell transcriptomics data are often presented as a two-dimensional “map”
organizing cells based on the similarity of their gene expression profiles. Data visualized in this
way naturally identifies groups (or “clusters”) of highly similar cells, as well as gradients and other
transcript-based patterns. Such patterns must be interpreted and annotated to define cell types
and states to support biological discovery (Figure 1). Standard experimental protocols and
analysis workflows detail how to create single-cell transcriptomic maps from tissues [9–12]. Briefly,
tissues are dissociated into single cells and profiled using a single-cell transcriptomic technology.
Computational analysis is then used to perform quality control filtering on the results (e.g.
removing low-quality cells), quantify the expression of each mapped gene in each cell [13], identify
groups of similar cells using a clustering algorithm [14–18], and visualize all cells in two dimensions
using techniques such as tSNE [19] or UMAP [20] to produce an unannotated “single-cell map” image
(Box 1) [21]. To interpret this map biologically, it is necessary to determine which cell types or cell
states are represented by clusters or other patterns (e.g. gradients) observed in the data. These
interpretations can then be labeled on the map, which helps place them in a conceptual framework
useful for better understanding tissue biology. This Tutorial offers a guide to the map interpretation
and labeling process, starting from clustered data and resulting in a completely annotated single-
cell map (Figure 1). The general workflow for annotating cells in scRNA-seq data has three major
steps: automatic annotation, manual annotation, and verification (Figure 2). First, automatic
annotation uses a predefined set of “marker genes” (i.e. genes that are specifically expressed in
a known cell type) or reference single-cell data (i.e. an existing expertly annotated single-cell map)
to identify and label individual cells or cell clusters by matching their gene expression patterns
(signatures) to those of known cell types. A second major step is manual annotation, which
involves studying genes and gene functions specific to each cell cluster or pattern to verify
automatic cell annotations and identify novel cell types and states. Finally, verification can confirm
the identity and function of select cell types using independent methods, such as new validation
experiments.
Step 1: Automatic cell annotation
Automatic cell annotation is an efficient way to label cells or cell clusters using a computer
algorithm and an appropriate set of prior biological knowledge. The general principle is to identify
a gene expression signal (pattern, signature) in a single cell or cell cluster that matches a
characteristic gene expression signature of a known cell type or state; the cell or cluster is then
assigned the respective label. Labels often have an associated confidence score.
There are two major automatic cell annotation approaches. One is to use known marker
genes for each of the cell types that are likely to be found in the sample to be annotated (referred
to as “marker-based automatic annotation”). In this case, known relationships between marker
genes and cell types are obtained from databases, such as SCSig [22], PanglaoDB [23], and
CellMarker [24], or manually from the literature. Then cells or clusters are labeled according to the
marker genes they characteristically express. The second approach is to compare single-cell
RNA-seq data to be annotated (the ‘query’ data set) to an existing, similar, expertly annotated
scRNA-seq data set (the ‘reference’ data set), and transfer the label from a reference cell or
cluster to a sufficiently similar one in the query (referred to as “reference-based automatic
annotation”). Reference single-cell data are obtained from sources such as Gene Expression
Omnibus (GEO) [25], the Single Cell Expression Atlas [26] or cell atlas projects [27,28].
Automatic cell annotation methods can be applied to individual cells (either before or after
clustering) or to clusters of cells, which occurs only after clustering the cells. In the case of
annotating clusters, the gene expression profile for each cluster is determined by averaging the
expression profiles of all cells within the cluster. Annotating individual cells is ideal, as this reduces
the chance of missing important differences between cells. However, some scRNA-seq
experimental data are based on low numbers of transcript reads per cell, so there may be
insufficient data for cell-based annotation to function correctly, making clustered data sets easier
to work with. Annotating clusters is faster, as there are fewer clusters than cells to process; it can
also be more accurate than the single-cell approach, considering it is based on more reliable
expression level estimates averaged across all cells in a cluster. However, not all cells can be
easily grouped into clusters, especially for dynamic systems like developing tissues [29] or tissues
that contain gene expression gradients [30,31].
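The cluster-level averaging described above can be sketched in a few lines. This is a minimal illustration in plain Python, not code from the tutorial's R notebook; the function name `cluster_mean_profiles` and the toy values are assumptions made for the example.

```python
from collections import defaultdict


def cluster_mean_profiles(expression, cluster_ids):
    """Average per-cell expression vectors within each cluster.

    expression  -- list of per-cell expression vectors (one value per gene)
    cluster_ids -- list of cluster labels, one per cell
    Returns a dict mapping each cluster label to its mean expression vector.
    """
    if len(expression) != len(cluster_ids):
        raise ValueError("need exactly one cluster label per cell")
    sums = {}
    counts = defaultdict(int)
    for cell, cluster in zip(expression, cluster_ids):
        if cluster not in sums:
            sums[cluster] = [0.0] * len(cell)
        for g, value in enumerate(cell):
            sums[cluster][g] += value
        counts[cluster] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


# Hypothetical toy data: 4 cells x 3 genes, assigned to two clusters.
cells = [
    [2.0, 0.0, 1.0],
    [4.0, 0.0, 3.0],
    [0.0, 5.0, 1.0],
    [0.0, 7.0, 1.0],
]
clusters = ["A", "A", "B", "B"]
profiles = cluster_mean_profiles(cells, clusters)
print(profiles)
```

Each cluster is then represented by a single vector, which is why cluster-level annotation needs far fewer comparisons than cell-level annotation.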
A major challenge with automatic cell annotation is that many cell types do not have well-
characterized gene expression signatures, resulting in incomplete or inaccurate labeling for some
cells. Automated methods typically work better for major cell types and may not be able to
effectively distinguish subtypes. Thus, automatic cell annotation is useful to quickly identify known
cell types and highlight unknown cell types for further exploration. The main caveats and
recommendations for automatic cell annotation are summarised in Table 1.
Marker-based automatic annotation
Marker-based automatic annotation labels cells or cell clusters based on the characteristic
expression of known marker genes. To be successful, the marker gene or gene set (a collection
of marker genes) should be specifically and consistently expressed in a given cell, cluster or class
of cells (e.g. immune cells). Markers are readily available for well-characterized organisms and
cell types (e.g. human PBMC samples [32]). Marker-based automatic annotation works well once a
relevant and sufficiently large set of marker genes is collected [33].
To label individual cells, one of the most reliable marker-based annotation tools is
Semi-supervised Category Identification and Assignment (SCINA) [34]. SCINA assumes each marker
follows a bimodal gene expression distribution, where one peak corresponds to cells from the
associated cell type and the other peak contains the rest of the cells in the experiment. A cell of
a particular type is assumed to have expression in the upper part of this distribution for all the
markers of that cell type, consequently requiring markers provided as input to SCINA to be specific
to only one cell type. AUCell [35] is another good marker-based labeling method that classifies
individual cells or clusters. AUCell ranks the genes in each cell by decreasing expression value,
and cells are labeled according to their most active (highly expressed) marker gene sets. AUCell
works best with cell types that have a sufficiently large set of marker genes such that multiple
markers are detected in each cell. It has the advantage of scoring a whole set of marker genes at
once, which may increase sensitivity over methods that examine each marker gene
independently.
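The ranking idea behind AUCell can be illustrated with a simplified score: rank the detected genes of one cell by decreasing expression, then measure how early the markers of a gene set appear among the top-ranked genes. The sketch below is an assumption-laden toy in plain Python, not the AUCell implementation (which is an R package and chooses its thresholds differently); `recovery_auc`, `label_cell`, `top_n`, `min_score` and the toy expression values are all hypothetical.

```python
def recovery_auc(cell_expression, gene_names, marker_set, top_n):
    """Normalized area under the marker recovery curve over the top_n ranks."""
    # Rank genes by decreasing expression; undetected genes (value 0) are dropped.
    order = sorted(range(len(gene_names)), key=lambda i: -cell_expression[i])
    ranked = [gene_names[i] for i in order if cell_expression[i] > 0][:top_n]
    hits = 0
    area = 0
    for gene in ranked:
        if gene in marker_set:
            hits += 1
        area += hits  # height of the recovery curve at this rank
    area += hits * (top_n - len(ranked))  # curve stays flat over unused ranks
    # Best case: every marker sits at the very top of the ranking.
    k = min(len(marker_set), top_n)
    max_area = k * (k + 1) // 2 + k * (top_n - k)
    return area / max_area if max_area else 0.0


def label_cell(cell_expression, gene_names, marker_sets, top_n, min_score=0.5):
    """Return (label, scores); the label is 'unassigned' below min_score."""
    scores = {
        cell_type: recovery_auc(cell_expression, gene_names, set(markers), top_n)
        for cell_type, markers in marker_sets.items()
    }
    best = max(scores, key=scores.get)
    label = best if scores[best] >= min_score else "unassigned"
    return label, scores


# Hypothetical toy example with two small marker sets.
genes = ["CD3E", "CD3D", "MS4A1", "CD79A", "ALB", "ACTB"]
marker_sets = {"T cell": ["CD3E", "CD3D"], "B cell": ["MS4A1", "CD79A"]}

label_t, scores_t = label_cell([5.0, 4.0, 0.0, 0.0, 0.0, 6.0], genes, marker_sets, top_n=3)
label_b, scores_b = label_cell([0.0, 0.0, 7.0, 3.0, 0.0, 5.0], genes, marker_sets, top_n=3)
label_x, scores_x = label_cell([0.0, 0.0, 0.0, 0.0, 9.0, 4.0], genes, marker_sets, top_n=3)
print(label_t, label_b, label_x)
```

Because the score depends on the whole marker set, a cell can still be labeled when one marker is missing, which is the sensitivity advantage the text attributes to set-based scoring.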
To label whole clusters, Gene Set Variation Analysis (GSVA) [36] has been benchmarked to
be fast and reliable [37]. GSVA works similarly to AUCell: given a database of marker gene sets, it
identifies sets that are enriched in the gene expression profile of a cluster. The GSVA software
has a practical advantage that it can annotate all clusters in one operation.
Marker-based automatic cell annotation methods often have the advantage that they will
only assign labels to cells associated with known markers and other cells will remain unlabeled [33].
However, this depends on the specific tool and the parameters used; see Table 2 and
Supplementary Table 1 for details on which tools have the option to leave cells unlabeled. A
disadvantage of these tools is that markers are not easily accessible for all cell types.
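The bimodal-marker assumption described for SCINA, combined with the option of leaving cells unlabeled, can be illustrated with a deliberately crude stand-in. SCINA itself fits a statistical model; the sketch below instead splits each marker into a low and a high mode with one-dimensional two-means and labels a cell only when it sits in the high mode for every marker of exactly one cell type. The function names and toy values are hypothetical.

```python
def split_threshold(values, iterations=20):
    """1-D two-means: return a cut-off between low and high expression modes."""
    low, high = min(values), max(values)
    if low == high:
        return None  # marker is flat, so there are no two modes to separate
    for _ in range(iterations):
        cut = (low + high) / 2
        lows = [v for v in values if v <= cut]
        highs = [v for v in values if v > cut]
        low, high = sum(lows) / len(lows), sum(highs) / len(highs)
    return (low + high) / 2


def label_cells(expression, gene_names, markers):
    """Label each cell, or leave it 'unlabeled' when no single type matches."""
    gene_index = {g: i for i, g in enumerate(gene_names)}
    cuts = {}
    for genes in markers.values():
        for g in genes:
            cuts[g] = split_threshold([cell[gene_index[g]] for cell in expression])
    labels = []
    for cell in expression:
        matches = [
            cell_type
            for cell_type, genes in markers.items()
            if all(cuts[g] is not None and cell[gene_index[g]] > cuts[g] for g in genes)
        ]
        # Ambiguous cells (several matches) and unmatched cells stay unlabeled.
        labels.append(matches[0] if len(matches) == 1 else "unlabeled")
    return labels


# Hypothetical toy data: 6 cells x 4 marker genes.
genes = ["CD3E", "CD3D", "MS4A1", "CD79A"]
markers = {"T cell": ["CD3E", "CD3D"], "B cell": ["MS4A1", "CD79A"]}
cells = [
    [5.0, 4.0, 0.0, 0.0],
    [6.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 7.0, 6.0],
    [1.0, 0.0, 6.0, 5.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
]
labels = label_cells(cells, genes, markers)
print(labels)
```

Requiring every marker of a type to be in the high mode mirrors the text's point that markers supplied to such a method need to be specific to one cell type.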
Reference-based automatic cell annotation
Reference-based cell annotation is based on the concept of “guilt-by-association”,
whereby a cell or cluster label in the reference data is transferred to an unlabeled cell or cluster
in the query data with a similar gene expression profile. Consequently, this approach is only
possible if high-quality and relevant annotated reference single-cell data are available. Studying
the original clustering and annotation steps performed on the reference data can help determine
its quality, and ensure that errors in the reference will not be propagated to new data.
Tissue-specific reference data can be obtained from public databases (e.g. the Gene Expression
Omnibus [25] or the Expression Atlas [26]) or large cell atlas projects (e.g. the Human Cell Atlas [27], the
Tabula Muris or Mouse Cell Atlas [5], or others [4,28,38–40]), although the required associated cell
annotations are not always easily available. These atlases typically contain hundreds of
thousands of cells and dozens of different annotated cell types.
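The “guilt-by-association” principle can be sketched as a similarity search: compare each query profile with the mean profile of every annotated reference cluster and transfer the label of the best match, leaving the query unassigned when no match is good enough. The example below is a minimal illustration in plain Python, not the algorithm of any specific tool; the use of Pearson correlation, the 0.7 cut-off, the function names and the toy profiles are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def transfer_labels(query_profiles, reference_profiles, min_similarity=0.7):
    """Give each query profile the label of its most similar reference profile."""
    labels = []
    for profile in query_profiles:
        sims = {
            label: pearson(profile, ref)
            for label, ref in reference_profiles.items()
        }
        best = max(sims, key=sims.get)
        # Weak matches are left unassigned rather than forced into a known type.
        labels.append(best if sims[best] >= min_similarity else "unassigned")
    return labels


# Hypothetical reference cluster profiles over five shared genes.
reference = {
    "hepatocyte": [9.0, 8.0, 0.0, 0.0, 1.0],
    "T cell": [0.0, 1.0, 8.0, 9.0, 1.0],
}
query = [
    [7.0, 9.0, 1.0, 0.0, 2.0],  # resembles the hepatocyte profile
    [1.0, 0.0, 6.0, 8.0, 0.0],  # resembles the T cell profile
    [1.0, 2.0, 2.0, 1.0, 9.0],  # resembles neither reference profile
]
transferred = transfer_labels(query, reference)
print(transferred)
```

The rejection threshold matters because a cell type absent from the reference would otherwise inherit the label of whichever reference cluster happens to be closest.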
scmap [41] is one of the best performing tools for reference-based automatic cell or cluster
annotation, in terms of both accuracy of assigned labels and avoiding incorrect labeling of novel
cell types [33]. Other tools for reference-based automatic annotation include SingleCellNet [42] and
SingleR [43]. SingleCellNet has high accuracy when all cell types are well represented in the
reference data but low accuracy if the reference data are incomplete or represent a poor
match [33]. The main advantage of SingleR is that a reasonable, general reference data set is
included with the tool, but this may not perform as well as a reference specifically matched to the
query data set. An alternative to using specific software packages for reference-based cell
annotation is to train a machine learning tool, such as a support vector machine (SVM) [44] or
random forest classifier [45], on selected reference data. This model can then be applied to classify
cells or clusters as specific cell types in novel data. These methods can outperform any of the
prepackaged automatic-annotation software tools [33] but require substantial computational
expertise to use.
Another approach to reference-based cell annotation is to integrate a query data set with
a reference data set using an integration algorithm, enabling clusters to be identified that span

References
Random Forests
Scikit-learn: Machine Learning in Python
Support-Vector Networks
Visualizing Data using t-SNE
Gene Expression Omnibus: NCBI gene expression and hybridization array data repository