GigaScience, 2020, 1–9
doi: xx.xxxx/xxxx
Manuscript in Preparation
Paper
A single-cell RNA-seq Training and Analysis Suite
using the Galaxy Framework
Mehmet Tekman^1,*, Bérénice Batut^1,*, Alexander Ostrovsky^2, Christophe Antoniewski^3, Dave Clements^2, Fidel Ramirez^4, Graham J Etherington^5, Hans-Rudolf Hotz^6, Jelle Scholtalbers^7, Jonathan R Manning^8, Lea Bellenger^3, Maria A Doyle^9, Mohammad Heydarian^2, Ni Huang^8,10, Nicola Soranzo^5, Pablo Moreno^8, Stefan Mautner^1, Irene Papatheodorou^8, Anton Nekrutenko^11, James Taylor^2, Daniel Blankenberg^12, Rolf Backofen^1 and Björn Grüning^1,*

^1 Chair of Bioinformatics, University of Freiburg, Freiburg, Germany; ^2 Department of Biology, Johns Hopkins University, Baltimore, Maryland, USA; ^3 ARTbio, Sorbonne Université, CNRS FR 3631, Inserm US 037, Paris, France and Institut de Biologie Paris Seine, Paris, France; ^4 Boehringer Ingelheim, Biberach, Germany; ^5 Earlham Institute, Norwich Research Park, Norwich, NR4 7UZ, United Kingdom; ^6 Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland and Swiss Institute of Bioinformatics, Basel, Switzerland; ^7 European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany; ^8 European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton, United Kingdom; ^9 Research Computing Facility, Peter MacCallum Cancer Centre, Melbourne, Victoria 3000, Australia and Sir Peter MacCallum Department of Oncology, The University of Melbourne, Victoria 3010, Australia; ^10 Wellcome Sanger Institute, Cambridge, United Kingdom; ^11 Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, USA; ^12 Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, Ohio, USA

* tekman@informatik.uni-freiburg.de; berenice.batut@gmail.com; gruening@informatik.uni-freiburg.de
These authors contributed equally to this work.
Abstract
Background: The vast ecosystem of single-cell RNA-seq tools has until recently been plagued by an excess of diverging analysis strategies, inconsistent file formats, and compatibility issues between different software suites. The uptake of 10x Genomics datasets has begun to calm this diversity, and the bioinformatics community leans once more towards the large computing requirements and the statistically-driven methods needed to process and understand these ever-growing datasets.
Results: Here we outline several Galaxy workflows and learning resources for scRNA-seq, with the aim of providing a comprehensive analysis environment paired with a thorough user learning experience that bridges the knowledge gap between the computational methods and the underlying cell biology. The Galaxy reproducible bioinformatics framework provides tools, workflows and trainings that not only enable users to perform one-click 10x preprocessing, but also empower them to demultiplex raw sequencing from custom tagged and full-length sequencing protocols. The downstream analysis supports a wide range of high-quality interoperable suites separated into common stages of analysis: inspection, filtering, normalization, confounder removal and clustering. The teaching resources cover an assortment of different concepts from computer science to cell biology. Access to all resources is provided at the singlecell.usegalaxy.eu portal.
Conclusions: The reproducible and training-oriented Galaxy framework provides a sustainable HPC environment for users to run flexible analyses on both 10x and alternative platforms. The tutorials from the Galaxy Training Network, along with the frequent training workshops hosted by the Galaxy Community, provide a means for users to learn, publish and teach scRNA-seq analysis.
Key words: scRNA; Galaxy; resources; HPC; single-cell; 10x; Training; Web
bioRxiv preprint doi: https://doi.org/10.1101/2020.06.06.137570; this version posted August 28, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Key Points
Single-cell RNA-seq has stabilised around 10x Genomics datasets.
Galaxy provides rich and reproducible scRNA-seq workflows with a wide range of robust tools.
The Galaxy Training Network provides tutorials for the processing of both 10x and non-10x datasets.
Background
Single-Cell RNA-seq and cellular heterogeneity. The continuing rise of single-cell technologies has led to unprecedented levels of analysis of cell heterogeneity within tissue samples, providing new insights into developmental and differentiation pathways for a wide range of disciplines. Gene expression studies are now performed at a cellular level of resolution, which, compared to bulk RNA-seq methods, allows researchers to model their tissue samples as distributions of different expression profiles instead of as an average.
Pathways from Single-cell data. The various expression profiles uncovered within tissue samples imply discrete cell types which are related to one another across an "expression landscape". The relationships between the more distinct profiles are inferred via distance metrics or manifold learning techniques. Ultimately, the aim is to model the continuous biological process of cell differentiation from multipotent stem cells to distinct mature cell types, and to infer lineage and differentiation pathways between transient cell types [1].
Elucidating Cell Identity. Trajectory analysis, which integrates the up- or down-regulation of significant genes along lineage branches, can then be performed in order to uncover the factors and extracellular triggers that can coerce a pluripotent cell to become biased towards one cell fate outcome over another. This undertaking has created a new frontier of exploration in cell biology, where researchers have assembled reference maps for different cell lines with the purpose of fully recording these cell dynamics and their characteristics, in order to create a global "atlas" of cells [2, 3].
Pitfalls and Technical Challenges
Sequencing sensitivity and Normalization. With each new protocol comes a host of new technical problems to overcome. The first wave of software utilities for the analysis of single-cell datasets were statistical packages, aimed at tackling the issue of "dropout events" during sequencing, which would manifest as a high prevalence of zero-entries in over 80% of the feature-count matrix. These zeroes were problematic, since they could not be trivially ignored: either the cell did not produce any molecules for that transcript, or the sequencer simply did not detect them. Normalisation techniques originally developed for bulk RNA-seq had to be adapted to accommodate this uncertainty, and new ones were created that harness hurdle models, impute data via manifold learning techniques, or pool subsets of cells together to build general linear models [4].
Improvements in sequencing. With the downstream analysis packages attempting to solve the dropouts via stochastic methods, the upstream sequencing technologies also aspired to improve the capture efficiency via new well-, droplet-, and flow-cytometry-based protocols, all of which lend importance to the process of barcoding sequencing reads.
In each protocol, cells are tagged with cell barcodes such that any reads derived from them can be unambiguously assigned to the cell of origin. Unique molecular identifiers (UMIs) are also employed to mitigate the effects of amplification bias of transcripts within the same cell. The detection, extraction, and (de-)multiplexing of cell barcodes and UMIs is therefore one of the first hurdles researchers encounter when receiving raw FASTQ data from a sequencing facility.
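As a minimal illustration of what this extraction step involves (not part of the Galaxy tooling itself), the sketch below splits a read into UMI and cell barcode by position. The 6 bp lengths and their order are hypothetical, CEL-seq2-style values; real offsets are protocol-specific and must be confirmed with the sequencing facility.

```python
# Positional barcode/UMI extraction sketch. Assumed (hypothetical) layout:
# the first 6 bp of read 1 are the UMI, the next 6 bp the cell barcode.
UMI_LEN = 6
BC_LEN = 6

def split_read1(seq):
    """Split a read-1 sequence into (umi, cell_barcode, remainder)."""
    umi = seq[:UMI_LEN]
    barcode = seq[UMI_LEN:UMI_LEN + BC_LEN]
    rest = seq[UMI_LEN + BC_LEN:]
    return umi, barcode, rest

umi, bc, rest = split_read1("AACGTTGGACCTTACCGTTAGA")
print(umi, bc)  # AACGTT GGACCT
```

In practice this slicing is performed over every record of the FASTQ file, and the extracted barcode decides which cell the paired transcript read is assigned to.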
The Burgeoning Software Ecosystem
Since the field's inception, several different packages and many pipelines have been developed to assist researchers in the analysis of scRNA-seq [5, 6]. The vast majority of these packages were written in the R programming language, since many of the novel normalisation methods developed to handle the dropout events depended on statistical packages that were primarily R-based [7]. Standalone analysis suites emerged as the different authors of these packages rapidly expanded their methods to encapsulate all facets of the single-cell analysis, often creating compatibility issues with previous package versions. The Bioconductor repository provided some much-needed stability in this regard by hosting stable releases, but researchers were still prone to building directly from repository sources in order to reap the benefits of new features in the upstream versions [8, 9].
Nonexchangeable Data Formats. Another issue was the proliferation of many different and quickly evolving R-based file formats for processing and storing the data, such as SingleCellExperiment as used by the Scater suite, SCSeq from RaceID, and SeuratObject from Seurat [10, 11]. Many new packages would cater only to one format or suite, leading to the common problem that data processed in one suite could not be reliably processed by methods in another. This incompatibility between packages fuelled the choice of one analysis suite over another, or conversely required researchers to dig deeper into the internal semantics of R S4 objects in order to manually slot data components together [12]. These problems only accelerated the rapid development of these suites, leading to further version instability. As a result of this analysis diversity, there are many tutorials on how to perform scRNA-seq analysis, each oriented around one of these pipelines [13].
Error propagation and Analysis Uncertainty. Different pipelines produce different results, where the stochastic nature of the analyses means that any uncertainty in a crucial quality control step upstream, such as filtering or the removal of unwanted variability, can propagate forward into the downstream sections to yield wildly different results on the same data. This uncertainty, and the statistically-driven methods to overcome it, leaves a wide knowledge gap for researchers simply trying to understand the underlying dynamics of cell identity.
Rise of 10x Genomics
10x Launch. In 2015, 10x Genomics released their GemCode product, which was a droplet-seq based protocol capable of sequencing tens of thousands of cells with an average cell quality higher than other facilities [14]. This unprecedented level of throughput steadily gained traction amongst researchers and startups seeking to perform single-cell analysis, and thus 10x datasets began to prevail in the field.
10x Analysis Software. 10x Genomics provided software that was able to perform much of the pre-processing, producing feature-count matrices in a transparent HDF5-based format which offered a means of efficient matrix storage and exchange, and conclusively removed the restriction for downstream analysis modules to be written in R.
ScanPy, a popular alternative. The ScanPy suite [15], written in Python and using its own HDF5-based AnnData format, became a valid alternative for analysing 10x datasets. The Seurat developers had similar aspirations and soon adopted the LOOM format, another HDF5 variant. However, the popularity of ScanPy rose as it began to integrate the methods of other standalone packages into its codebase, making it the natural choice for users who wanted to achieve more without compromising on compatibility between different suites [9].
Solutions in the Cloud
Accessible Science. As the size of datasets scaled, so did the computing resources required to analyse them, both in terms of processing power and storage. Galaxy is an open-source biocomputing infrastructure that exemplifies the three main tenets of science: reproducibility, peer review, and open access, all freely accessible within the web browser [16]. It hosts a wide range of highly-cited bioinformatics tools in many different versions, and enables users to freely create their own workflows via a seamless drag-and-drop interface.
Reproducible Workflows. Galaxy can make use of Conda or containers to set up tool environments, ensuring that the bioinformatics tools will always be able to run, even when the library dependencies for a tool have changed, by building tools under locked version dependencies and bundling them together in a self-contained environment [17]. These environments provide a concise solution to the package version instability that plagues scRNA-seq analysis notebooks, both in terms of reproducibility and analysis flexibility. A user could keep the quality control results obtained from an older version of ScanPy, whilst running a newer ScanPy version at the clustering stage to reap the benefits of later improvements in that algorithm. By allowing the user to select from multiple versions of the same tool, and by further permitting different versions of the tools within a workflow, Galaxy enables an unprecedented level of free-flow analysis by letting researchers pick and choose the best aspects of a tool without worrying about the underlying software libraries [18]. The burdens of software incompatibility and programming-language choice that previously plagued the scRNA-seq analysis ecosystem are thus lifted from the user.
User-driven Custom Workflows. Analyses are not limited to one pipeline either, as the datasets which are passed between tools can easily be interpreted by any other tool capable of reading that dataset. In the case of scRNA-seq, Galaxy can convert between CSV, MTX, LOOM and AnnData formats. This interchange of modules from different tools further extends the flexibility of the analysis by again letting the user decide which component of a tool is best suited for a specific part of an analysis.
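To make the MTX format mentioned above concrete, the sketch below shows a toy round-trip through the MatrixMarket coordinate layout that many scRNA-seq tools use for sparse count matrices: a header, the matrix dimensions, then one 1-based (row, col, value) triplet per non-zero entry. This is an illustrative minimal reader/writer, not any tool's actual converter.

```python
# Minimal MatrixMarket (MTX) coordinate-format round-trip for a small
# dense matrix; only non-zero entries are stored.
def dense_to_mtx(matrix):
    rows, cols = len(matrix), len(matrix[0])
    entries = [(r + 1, c + 1, v)
               for r, row in enumerate(matrix)
               for c, v in enumerate(row) if v != 0]
    lines = ["%%MatrixMarket matrix coordinate integer general",
             f"{rows} {cols} {len(entries)}"]
    lines += [f"{r} {c} {v}" for r, c, v in entries]
    return "\n".join(lines)

def mtx_to_dense(text):
    lines = [l for l in text.splitlines() if not l.startswith("%")]
    rows, cols, _ = map(int, lines[0].split())
    matrix = [[0] * cols for _ in range(rows)]
    for line in lines[1:]:
        r, c, v = map(int, line.split())
        matrix[r - 1][c - 1] = v
    return matrix

counts = [[0, 3, 0], [1, 0, 0]]  # toy genes-by-cells matrix
assert mtx_to_dense(dense_to_mtx(counts)) == counts
```

Because dropout-heavy count matrices are mostly zeroes, this coordinate representation is what makes exchanging large matrices between suites tractable.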
Training Resources. Galaxy also provides a wide range of learning resources, with the aim of guiding users step-by-step through an analysis, often reproducing the results of published works. The teaching and training materials are part of the Galaxy Training Network (GTN), a worldwide collaborative effort to produce high-quality teaching material in order to educate users in how to analyse their data, and in turn to train others using the same materials via easily deployable workshops, backed by monthly stable releases of the GTN materials [19]. Training materials are provided on a wide variety of different topics, and workshops are hosted regularly, as advertised on the Galaxy Events web portal. The GTN has grown rapidly since its inception and gains new volunteers every year, who each contribute to and coordinate training and teaching events, maintain topics and subtopics, translate tutorials into multiple languages, and provide peer review on new material [20].
Methods
Stable Workflows in Galaxy. The analysis of scRNA-seq within Galaxy was a two-pronged effort concentrated on bringing high-quality single-cell tools into Galaxy, and on providing the necessary workflows and training to accompany them. As mentioned in the previous section, this effort needed to overcome incompatible file formats and the package instability caused by rapid development, and to establish a standardised basis for the analysis.
Tutorials. The tutorials are split into two main parts, as outlined in Figure 1: first, the pre-processing stage, which constructs a count matrix from the initial sequencing data; second, the cluster-based downstream analysis on the count matrix. These stages are very different from one another in terms of their information content: the pre-processing stage requires the researcher to be more familiar with wet-lab sequencing protocols than the average bioinformatician would normally be, and the downstream analysis stage requires the researcher to be familiar with statistical concepts that a wet-lab scientist might not know well. The tutorials are designed to broadly appeal to both the biologist and the statistician, as well as to complete beginners to the entire topic.
Pre-processing Workflows
The pre-processing scRNA-seq materials tackle the two most common use-cases that researchers will encounter when they first enter the field: processing scRNA-seq data from 10x Genomics, and processing data generated from alternative protocols. For instance, microwell-based protocols have been known to yield more features and display lower levels of dropouts compared to 10x, and so we accommodate them by providing a more customizable path through the pre-processing stage [21].
Figure 1. The main stages of single-cell analysis, separated broadly into the upper and lower stages of pre-processing and downstream analysis, respectively. The upper part illustrates the two main routes to generating a count matrix from sequencing data: via one-click quantification solutions, or through manual demultiplexing. The lower part describes the four main stages required to perform cluster-based analysis from the count matrix: filtering, normalisation, confounder removal, and embedding.

Barcode Extraction. Before the era of 10x Genomics, scRNA-seq data had to be demultiplexed, mapped, and quantified. The demultiplexing stage entails an intimate knowledge of cell barcodes and Unique Molecular Identifiers (UMIs), which are protocol dependent, and expects that the bioinformatician knows exactly where and how the data was generated. One common pitfall at this very first stage is estimating how many cells to expect from the FASTQ input data, and this requires three crucial pieces of information: which reads contain the barcodes (or more precisely, which subset of both the forward and reverse reads contains the barcodes); of these barcodes, which specific ones were actually used for the analysis; and how to resolve barcode mismatches/errors.
Barcode Estimation. Naive strategies involve using a known barcode template and querying it against the FASTQ data to profile the number of reads that align to a specific barcode, often employing 'knee' methods to estimate this amount [22]. However, this approach is not robust, since certain cells are more likely to be over-represented compared to others, and some cell barcodes may contain more unmappable reads than others, meaning that higher library read depth is not necessarily correlated with a better-defined cell. Ultimately, the bioinformatician must inquire directly with the sequencing lab as to which cell barcodes were used, as these are often not specific to the protocol but to the technician who designed them, with the idea that they should not align to a specific reference genome or transcriptome.
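The 'knee' idea can be sketched in a few lines: rank barcodes by read count and cut at the largest fold-change between consecutive ranked barcodes. This is a deliberately naive illustration of the concept, not a production method, and the robustness caveats above apply to it in full.

```python
# Naive 'knee' cutoff sketch: the largest log fold-change between
# consecutive ranked barcode counts marks the putative cell/background
# boundary. Real implementations are considerably more robust.
import math

def knee_cutoff(barcode_counts):
    """Return the number of barcodes kept as putative real cells."""
    ranked = sorted(barcode_counts.values(), reverse=True)
    drops = [math.log(ranked[i] / ranked[i + 1]) if ranked[i + 1] else float("inf")
             for i in range(len(ranked) - 1)]
    return drops.index(max(drops)) + 1

counts = {"AAAC": 9000, "GGTA": 8500, "CTTG": 7800, "TGCA": 40, "ACGT": 25}
print(knee_cutoff(counts))  # 3: the sharp drop falls after the third barcode
```

On real data the ranked-count curve is far noisier, which is precisely why the text recommends confirming the barcode list with the sequencing lab rather than trusting the knee alone.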
One-click Pre-processing
Quantification with Cell Ranger. 10x Genomics simplified the scRNA package ecosystem by using a language-independent file format and streamlining much of the barcode particularities with their Cell Ranger pipeline, allowing researchers to focus more on the internal biological variability of their datasets [23].
Quantification with STARsolo. The pre-processing workflow (titled "10X StarSolo Workflow") in Galaxy uses the RNA STARsolo utility as a drop-in replacement for Cell Ranger, not only because it is a feature update of the already existing RNA STAR tool in Galaxy, but also because it boasts a ten-fold speedup in comparison to Cell Ranger and does not require Illumina lane-read information to perform the processing [24, 25].
Other Approaches. The pre-processing workflows for these "one-click" solutions consume the same datasets and yield approximately the same count matrices by following similar modes of barcode discovery and quantification. Within Galaxy, there are also Alevin (paired with Salmon) and scPipe, both of which can perform the necessary demultiplexing, (alignment-free) mapping, and quantification stages in a single step [26, 27, 28].
Flexible Pre-processing
CELSeq2 Barcoding. The custom pre-processing workflow (titled "CELSeq2: Single Batch mm10") is modelled after the CEL-seq2 protocol, using the barcoding strategies of the Freiburg Max Planck Institute laboratory as its main template, but the workflow is flexible enough to accommodate any droplet- or well-based protocol, such as SMART-seq2 and Drop-seq [29].
Manual Demultiplexing and Quantification. The training pictographically guides users through the concepts of extracting cell barcodes from the protocol, explains the significance of UMIs in the process of read deduplication with illustrative examples, and instructs the user in performing further quality controls on their data during the post-mapping process via RNA STAR and other tools that are native to Galaxy.
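The UMI deduplication concept taught in the training can be summarised in a small sketch: reads sharing the same cell barcode, UMI, and mapped gene are collapsed to one molecule, so counts reflect molecules rather than PCR copies. The tuples below are hypothetical toy data, and real deduplicators additionally handle UMI sequencing errors.

```python
# UMI-based deduplication sketch: collapse (cell, UMI, gene) duplicates.
from collections import defaultdict

def umi_counts(reads):
    """reads: iterable of (cell_barcode, umi, gene) tuples."""
    molecules = defaultdict(set)
    for cell, umi, gene in reads:
        molecules[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in molecules.items()}

reads = [("AAAC", "TTG", "GeneA"),
         ("AAAC", "TTG", "GeneA"),   # PCR duplicate: same cell/UMI/gene
         ("AAAC", "ACT", "GeneA"),
         ("GGTA", "TTG", "GeneA")]
print(umi_counts(reads))  # {('AAAC', 'GeneA'): 2, ('GGTA', 'GeneA'): 1}
```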
Training the User. At each stage, the user's knowledge is queried via question prompts and expandable answer-box dialogs, as well as helpful hints for future processing in comment boxes, all written in the transparent Markdown specification developed for contributing to the GTN.
Downstream Workflows
Common Stages of Analysis. The downstream modules are defined by the five main stages of downstream scRNA-seq analysis: filtering, normalisation, confounder removal, clustering, and trajectory inference. There are three workflows to aid in this process (two of which are shown in Figure 2), each sporting a different well-established scRNA-seq pipeline tool.
Quality Control with Scater. The Scater pipeline follows a visualise-filter-visualise paradigm, which provides an intuitive means to perform quality control on a count matrix through repeated incremental changes to a dataset using PCA and library-size based metrics [30]. Once this pre-analysis stage is complete, the full downstream analysis (comprising the five stages mentioned above) can be performed by workflows based on the following suites: RaceID and ScanPy.
Downstream Analysis with the RaceID Suite. RaceID was developed initially to analyse rare cell transcriptomes whilst being robust against noise, and is thus ideal for working with smaller datasets in the range of 300 to 1,000 cells. Due to its complex cell lineage and fate prediction models, it can also be used on larger datasets with some scaling costs.
Downstream Analysis with the ScanPy Suite. ScanPy was developed as the Python alternative to the innumerable scRNA-seq packages written in R, then the dominant language for such analyses, and it was one of the first packages with native 10x Genomics support. Since then it has grown substantially, and has been re-implementing many of the newer R-based methods released in Bioconductor as "recipe" modules, thereby providing a single source from which to perform many different variants of the same analysis.
The workflows derived from both of these suites emulate the five main stages of analysis mentioned previously, where filtering, normalisation, and confounder removal are typically separated into distinct stages, though sometimes merged into one step depending on the tool.
Filtering
Cell and Gene Removal. During the filtering stage, low-quality or unwanted cells are removed from the initial count matrix using commonly-used parameters such as minimum gene detection sensitivity and minimum library size; low-quality genes are also removed under similar metrics, where the minimum number of cells in which a gene must be detected is decided. The Scater pre-analysis workflow can also be used here to provide a PCA-based method of feature selection, so that only the highly variable genes are left in the analysis.
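The two filtering criteria just described can be sketched on a toy dense matrix: drop cells detecting too few genes, then drop genes detected in too few of the remaining cells. The thresholds here are arbitrary examples, not recommended defaults.

```python
# Toy cell/gene filtering on a dense cells-by-genes matrix.
def filter_matrix(counts, min_genes_per_cell=2, min_cells_per_gene=2):
    # Keep cells (rows) in which at least min_genes_per_cell genes are detected.
    cells = [row for row in counts
             if sum(1 for v in row if v > 0) >= min_genes_per_cell]
    # Keep genes (columns) detected in at least min_cells_per_gene of those cells.
    keep = [g for g in range(len(cells[0]))
            if sum(1 for row in cells if row[g] > 0) >= min_cells_per_gene]
    return [[row[g] for g in keep] for row in cells]

counts = [[5, 0, 2],   # cell 1: 2 genes detected
          [0, 0, 1],   # cell 2: only 1 gene detected -> removed
          [3, 0, 4]]   # cell 3: 2 genes detected
print(filter_matrix(counts))  # [[5, 2], [3, 4]]
```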
Disadvantages of Filtering. There is always the danger of over-filtering a dataset, whereby overzealous lower-bound thresholds on gene variability can have the undesired effect of removing essential housekeeping genes. These relatively uniformly expressed genes are often required for setting a baseline against which the more desired differentially expressed genes can be selected. It is therefore important that the user first performs a naive analysis and only later refines their filtering thresholds to boost the biological signal.
Normalisation
Library Size Normalisation. The normalisation step aims to remove any technical factors that are not relevant to the analysis, such as the library size, where cells sharing the same identity are likely to differ from one another more by the number of transcripts they exhibit than by more relevant biological factors.
Intrinsic Cell Factors. The first and foremost is cell capture efficiency, where different cells yield more or fewer transcripts based on the amplification and coverage conditions they are sequenced in. The second is the presence of dropout events, which manifest as a prevalence of "zeroes" in the final count matrix. Whether a "zero" is imputable to the lack of detection of an existing molecule or to the absence of the molecule in the cell is uncertain. This uncertainty alone has led to a wide selection of different normalisation techniques that try to model this expression via hurdle models, impute the data via manifold learning techniques, or work around the issue by pooling subsets of cells together [31].
In this regard, both the RaceID and ScanPy workflows offer many different normalisation techniques, and users are encouraged to take advantage of the branching workflow model of Galaxy to explore all possible options.
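The simplest of these techniques, library-size normalisation, can be shown in a few lines: scale each cell to a common total count and log-transform, so that cells differ by expression profile rather than sequencing depth. The target sum of 10,000 is an arbitrary but widely used convention, not a value mandated by any particular suite.

```python
# Library-size normalisation sketch: counts-per-10k followed by log1p.
import math

def normalise(cell_counts, target_sum=10_000):
    total = sum(cell_counts)
    return [math.log1p(v * target_sum / total) for v in cell_counts]

cell_a = [100, 300, 600]      # 1,000 reads total
cell_b = [1000, 3000, 6000]   # same profile at 10x the sequencing depth
print(normalise(cell_a) == normalise(cell_b))  # True
```

After this step the two cells above become identical, which is exactly the intent: depth differences vanish while relative expression is preserved.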
Confounder Removal
Regression of Cell Cycle Effects. Other sources of variability stem from unwanted biological contributions known as confounder effects, such as cell cycle effects and transcription. Depending on the stage of the cell cycle at which a cell was sequenced, two cells of the same type might cluster differently because one might have more transcripts due to being in the M-phase of the cell cycle. Library sizes notwithstanding, it is the variability in specific cell cycle genes that can be the main driving factor in the overall variability. Thankfully, these effects are easy to regress out, and we replicate an entire standalone ScanPy workflow dedicated to detecting and visualising these effects, based on the original notebook [32].

Figure 2. Downstream analysis workflows as shown in the Galaxy Workflow Editor for (top) RaceID and (bottom) ScanPy, each displaying modules symbolizing the five main stages of analysis.
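"Regressing out" a confounder can be illustrated with a minimal least-squares sketch: fit each gene's expression against a per-cell cell-cycle score and keep only the residuals, removing the variability explained by the score. The scores and expression values below are hypothetical, and real pipelines fit richer models per gene.

```python
# Minimal 'regress out' sketch: ordinary least-squares residuals of one
# gene's expression against a per-cell confounder score.
def regress_out(expression, score):
    n = len(score)
    mean_x = sum(score) / n
    mean_y = sum(expression) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(score, expression))
    var = sum((x - mean_x) ** 2 for x in score)
    slope = cov / var
    return [y - slope * (x - mean_x) - mean_y for x, y in zip(score, expression)]

score = [0.1, 0.5, 0.9]   # toy per-cell cell-cycle scores
gene = [1.0, 3.0, 5.0]    # expression fully explained by the score
print([round(r, 6) for r in regress_out(gene, score)])  # [0.0, 0.0, 0.0]
```

Here the gene's variability is entirely cell-cycle driven, so nothing remains after regression; in real data the residuals retain the biological signal of interest.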
Transcriptional Bursting. The transcription effects are harder to model, as these are semi-stochastic and as yet still not well understood. In bulk RNA-seq, the expression of genes undergoing transcription is averaged to give "high" or "low" signals, producing a global effect that gives the false impression that transcription is a continuous process. The reality is more complex: cells undergo transcription in "bursts" of activity followed by periods of no activity, at irregular intervals [33]. At the bulk level these discrete processes are smoothed to give a continuous effect, but at the cell level it means that even two directly adjacent cells of the same type, normalised to the same number of transcripts, can still have different levels of expression for a gene due to this process. This is not something that can be corrected for, but it does educate users about which factors they can or cannot control in an analysis, and how much variability they can expect to see.
Clustering and Projection
Dimension Reduction and Clustering. Once users have obtained a count matrix they are confident with, they are then guided through the process of dimension reduction (with a choice of different distance metrics), choosing a suitable low-dimensional embedding, and performing clustering through commonly-used techniques such as k-means, hierarchical, and neighbourhood community detection. These complex techniques are illustrated in layman's terms through the use of helpful images and community examples. For example, the GTN ScanPy tutorial explains the Louvain clustering approach [34] via a standalone slide deck to assist in the workflow [35].
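Of the clustering techniques named above, k-means is the simplest to sketch: alternately assign points to their nearest centroid and move each centroid to its cluster mean. The 2D points and fixed initial centroids below are toy values chosen to keep the example deterministic; real scRNA-seq clustering operates on high-dimensional embeddings.

```python
# Toy k-means sketch on 2D points with fixed initial centroids.
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2 +
                                        (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        centroids = [(sum(p[0] for p in c) / len(c),
                      sum(p[1] for p in c) / len(c))
                     for c in clusters if c]
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)  # two centroids, near (0.33, 0.33) and (10.33, 10.33)
```

Note that k-means requires the number of clusters up front, which is one reason neighbourhood community detection methods such as Louvain are preferred for single-cell data.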
Commonly-used Embeddings. The clustering and cluster inspection stages are notably separated into distinct utilities here, with the understanding that the same initial clustering can appear dissimilar under different projections, e.g. t-distributed Stochastic Neighbor Embedding (tSNE) against Uniform Manifold Approximation and Projection (UMAP) [36, 37]. Ultimately the user is encouraged to experiment with the plotting parameters to yield the best-looking clusters.
Static Plots or Interactive Environments. Cluster inspection tools
are available that allow users to easily generate static plots
tailored to pipeline-specific information as originally defined
This preprint (which was not certified by peer review) was made available under a CC-BY 4.0 International license; the copyright holder is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. This version was posted August 28, 2020. doi: https://doi.org/10.1101/2020.06.06.137570

Citations

Revealing the vectors of cellular identity with single-cell genomics

TL;DR: Single-cell genomics has now made it possible to create a comprehensive atlas of human cells and has reopened definitions of a cell's identity and of the ways in which identity is regulated by the cell's molecular circuitry.

From bench to bedside: Single-cell analysis for cancer immunotherapy

TL;DR: A review of sample processing and computational analysis regarding their application to translational cancer immunotherapy research, in which the authors identify predictors of response using single-cell technologies.

Building Domain-Specific Machine Learning Workflows: A Conceptual Framework for the State-of-the-Practice

TL;DR: A discussion of available textual and graphical workflow systems and their support for six key challenges that a domain expert faces in transforming their problem into a computational workflow, and then into an executable implementation.
References

STAR: ultrafast universal RNA-seq aligner

TL;DR: The Spliced Transcripts Alignment to a Reference (STAR) software, based on a previously undescribed RNA-seq alignment algorithm that uses a sequential maximum mappable seed search in uncompressed suffix arrays followed by a seed clustering and stitching procedure, outperforms other aligners by a factor of >50 in mapping speed.

Visualizing Data using t-SNE

TL;DR: A new technique called t-SNE that visualizes high-dimensional data by giving each datapoint a location in a two- or three-dimensional map; a variation of Stochastic Neighbor Embedding that is much easier to optimize and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.

Fast unfolding of communities in large networks

TL;DR: A heuristic method based on modularity optimization that is shown to outperform all other known community detection methods in terms of computation time, while the quality of the communities detected is very good, as measured by the so-called modularity.

Salmon provides fast and bias-aware quantification of transcript expression

TL;DR: Salmon is the first transcriptome-wide quantifier to correct for fragment GC-content bias, which substantially improves the accuracy of abundance estimates and the sensitivity of subsequent differential expression analysis.
Frequently Asked Questions (17)
Q1. What are the contributions mentioned in the paper "A single-cell rna-seq training and analysis suite using the galaxy framework" ?

Here the authors outline several Galaxy workflows and learning resources for scRNA-seq, with the aim of providing a comprehensive analysis environment paired with a thorough user learning experience that bridges the knowledge gap between the computational methods and the underlying cell biology. The Galaxy reproducible bioinformatics framework provides tools, workflows and trainings that not only enable users to perform one-click 10x preprocessing, but also empowers them to demultiplex raw sequencing from custom tagged and full-length sequencing protocols. The reproducible and training-oriented Galaxy framework provides a sustainable HPC environment for users to run flexible analyses on both 10x and alternative platforms. The tutorials from the Galaxy Training Network along with the frequent training workshops hosted by the Galaxy Community provide a means for users to learn, publish and teach scRNA-seq analysis.

ScanPy was developed as the Python alternative to the innumerable R-based packages for scRNA-seq (R being then the dominant language for such analyses), and it was one of the first packages with native 10x Genomics support.

The downstream modules are defined by the five main stages of downstream scRNA-seq analysis: filtering, normalisation, confounder removal, clustering, and trajectory inference.

The first wave of software utilities to deal with the analysis of single-cell datasets were statistical packages, aimed at tackling the issue of “dropout events” during sequencing, which would manifest as zero entries in over 80% of the feature-count matrix.
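That sparsity is easy to quantify directly. The toy matrix below is hypothetical, purely to illustrate the "fraction of zeros" figure the early statistical packages were built around:

```python
def zero_fraction(matrix):
    """Fraction of zero entries in a feature-count matrix -- the
    'dropout' sparsity that early scRNA-seq statistical tools modelled."""
    entries = [v for row in matrix for v in row]
    return sum(1 for v in entries if v == 0) / len(entries)

# Hypothetical cells-by-genes count matrix
toy = [[0, 0, 3, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 2, 0]]
print(zero_fraction(toy))  # 0.8 -- the sparsity regime described above
```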

The teaching and training materials are part of the Galaxy Training Network (GTN), which is a worldwide collaborative effort to produce high-quality teaching material in order to educate users in how to analyse their data, and in turn to train others using the same materials via easily deployable workshops backed by monthly stable releases of the GTN materials [19].

Other sources of variability stem from unwanted biological contributions known as confounder effects, such as cell cycle effects and transcriptional bursting.

One common pitfall at this very first stage is estimating how many cells to expect from the FASTQ input data, and this requires three crucial pieces of information: which reads contain the barcodes (or precisely, which subset of both the forward and reverse reads contains the barcodes); of these barcodes, which specific ones were actually used for the analysis; and how to resolve barcode mismatches/errors.
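The third point, resolving barcode mismatches, is commonly handled by matching each observed barcode against the protocol's whitelist within a small Hamming distance. The function and the toy whitelist below are a hypothetical sketch of that idea, not the implementation used by the Galaxy preprocessing tools:

```python
def correct_barcode(observed, whitelist, max_mismatches=1):
    """Match an observed cell barcode against a known whitelist,
    tolerating up to `max_mismatches` substitution errors. Returns
    the corrected barcode, or None when zero or several whitelist
    entries fit (an ambiguous hit cannot be safely assigned)."""
    if observed in whitelist:
        return observed
    candidates = [bc for bc in whitelist
                  if len(bc) == len(observed)
                  and sum(a != b for a, b in zip(bc, observed)) <= max_mismatches]
    return candidates[0] if len(candidates) == 1 else None

# Hypothetical 8-bp whitelist
whitelist = {"AAACCCAA", "TTTGGGTT", "GGGAAACC"}
print(correct_barcode("AAACCCAA", whitelist))  # exact hit: 'AAACCCAA'
print(correct_barcode("AAACCCAT", whitelist))  # one error, corrected: 'AAACCCAA'
print(correct_barcode("AATCCCAT", whitelist))  # two errors, rejected: None
```

Real droplet protocols use much larger whitelists and often weight candidates by base qualities, but the distance-based matching principle is the same.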

The normalisation step aims to remove any technical factors that are not relevant to the analysis, such as the library size, where cells sharing the same identity are likely to differ from one another more by the number of transcripts they exhibit than due to more relevant biological factors.
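One common scheme for removing the library-size factor is to rescale each cell's counts to a shared total, e.g. the median library size. The sketch below assumes a plain cells-by-genes list-of-lists matrix with made-up counts; the actual Galaxy/ScanPy normalisation tools offer several such methods:

```python
from statistics import median

def normalise_library_size(counts):
    """Scale each cell's gene counts so every cell sums to the same
    total (the median library size across cells), removing sequencing
    depth as a technical factor."""
    totals = [sum(cell) for cell in counts]
    target = median(totals)
    return [[g * target / t for g in cell]
            for cell, t in zip(counts, totals)]

# Two hypothetical cells of the same type at different depths:
# identical proportions, but the second was sequenced 3x deeper.
matrix = [[10, 30, 60],    # library size 100
          [30, 90, 180]]   # library size 300
norm = normalise_library_size(matrix)
print(norm[0] == norm[1])  # True -- the depth difference is removed
```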

The GTN also makes use of language translation tools to provide international interpretations of the training materials in order to reach a wider, more internationally diverse audience.

The advent of scRNA-seq analysis within the Galaxy framework re-echoes the efforts to standardise the analysis of scRNA-seq with the promise of presenting reproducible research.

The tutorials are designed to broadly appeal to both the biologist and the statistician, as well as complete beginners to the entire topic. 

Standalone analysis suites emerged as the different authors of these packages rapidly expanded their methods to encapsulate all facets of the single-cell analysis, often creating compatibility issues with previous package versions.

The Galaxy framework abstracts the user from the many nontrivial technicalities of the analysis, and exposes them to a legible interface of tools that they can pick and choose from.

The analysis of scRNA-seq within Galaxy was a two-pronged effort concentrated on bringing high-quality single-cell tools into Galaxy, and providing the necessary workflows and training to accompany them.


This incompatibility between packages fuelled a choice of one analysis suite over another, or conversely required researchers to dig deeper into the internal semantics of R S4 objects in order to manually slot data components together [12]. 

These tutorials can also declare prerequisites, so that users can review required concepts from previous tutorials, e.g. quality control checks from bulk RNA-seq still being used in scRNA-seq.