GigaScience, 2020, 1–9
doi: xx.xxxx/xxxx
Manuscript in Preparation
Paper
A single-cell RNA-seq Training and Analysis Suite
using the Galaxy Framework
Mehmet Tekman^1,*, Bérénice Batut^1,*, Alexander Ostrovsky^2, Christophe Antoniewski^3, Dave Clements^2, Fidel Ramirez^4, Graham J Etherington^5, Hans-Rudolf Hotz^6, Jelle Scholtalbers^7, Jonathan R Manning^8, Lea Bellenger^3, Maria A Doyle^9, Mohammad Heydarian^2, Ni Huang^8,10, Nicola Soranzo^5, Pablo Moreno^8, Stefan Mautner^1, Irene Papatheodorou^8, Anton Nekrutenko^11, James Taylor^2, Daniel Blankenberg^12, Rolf Backofen^1 and Björn Grüning^1,*

^1 Chair of Bioinformatics, University of Freiburg, Freiburg, Germany; ^2 Department of Biology, Johns Hopkins University, Baltimore, Maryland, USA; ^3 ARTbio, Sorbonne Université, CNRS FR 3631, Inserm US 037, Paris, France and Institut de Biologie Paris Seine, Paris, France; ^4 Boehringer Ingelheim, Biberach, Germany; ^5 Earlham Institute, Norwich Research Park, Norwich, NR4 7UZ, United Kingdom; ^6 Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland and Swiss Institute of Bioinformatics, Basel, Switzerland; ^7 European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany; ^8 European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton, United Kingdom; ^9 Research Computing Facility, Peter MacCallum Cancer Centre, Melbourne, Victoria 3000, Australia and Sir Peter MacCallum Department of Oncology, The University of Melbourne, Victoria 3010, Australia; ^10 Wellcome Sanger Institute, Cambridge, United Kingdom; ^11 Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, USA; ^12 Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, Ohio, USA

* tekman@informatik.uni-freiburg.de; berenice.batut@gmail.com; gruening@informatik.uni-freiburg.de
These authors contributed equally to this work.
Abstract
Background: The vast ecosystem of single-cell RNA-seq tools has until recently been plagued by an excess of diverging analysis strategies, inconsistent file formats, and compatibility issues between different software suites. The uptake of 10x Genomics datasets has begun to calm this diversity, and the bioinformatics community leans once more towards the large computing requirements and the statistically-driven methods needed to process and understand these ever-growing datasets.
Results: Here we outline several Galaxy workflows and learning resources for scRNA-seq, with the aim of providing a comprehensive analysis environment paired with a thorough user learning experience that bridges the knowledge gap between the computational methods and the underlying cell biology. The Galaxy reproducible bioinformatics framework provides tools, workflows and trainings that not only enable users to perform one-click 10x preprocessing, but also empower them to demultiplex raw sequencing from custom tagged and full-length sequencing protocols. The downstream analysis supports a wide range of high-quality interoperable suites separated into common stages of analysis: inspection, filtering, normalization, confounder removal and clustering. The teaching resources cover an assortment of different concepts from computer science to cell biology. Access to all resources is provided at the singlecell.usegalaxy.eu portal.
Conclusions: The reproducible and training-oriented Galaxy framework provides a sustainable HPC environment for users to run flexible analyses on both 10x and alternative platforms. The tutorials from the Galaxy Training Network, along with the frequent training workshops hosted by the Galaxy Community, provide a means for users to learn, publish and teach scRNA-seq analysis.
Key words: scRNA; Galaxy; resources; HPC; single-cell; 10x; Training; Web
bioRxiv preprint doi: https://doi.org/10.1101/2020.06.06.137570; this version posted August 28, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Key Points
Single-cell RNA-seq has stabilised around 10x Genomics datasets.
Galaxy provides rich and reproducible scRNA-seq workflows with a wide range of robust tools.
The Galaxy Training Network provides tutorials for the processing of both 10x and non-10x datasets.
Background
Single-Cell RNA-seq and cellular heterogeneity. The continuing rise of single-cell technologies has led to unprecedented levels of analysis of cell heterogeneity within tissue samples, providing new insights into developmental and differentiation pathways for a wide range of disciplines. Gene expression studies are now performed at a cellular level of resolution, which, compared to bulk RNA-seq methods, allows researchers to model their tissue samples as distributions of different expression profiles instead of as an average.
Pathways from Single-cell data. The various expression profiles uncovered within tissue samples imply discrete cell types which are related to one another across an "expression landscape". The relationships between the more distinct profiles are inferred via distance metrics or manifold learning techniques. Ultimately, the aim is to model the continuous biological process of cell differentiation from multipotent stem cells to distinct mature cell types, and to infer lineage and differentiation pathways between transient cell types [1].
Elucidating Cell Identity. Trajectory analysis, which integrates the up- or down-regulation of significant genes along lineage branches, can then be performed in order to uncover the factors and extracellular triggers that can coerce a pluripotent cell to become biased towards one cell fate outcome over another. This undertaking has created a new frontier of exploration in cell biology, where researchers have assembled reference maps for different cell lines with the purpose of fully recording these cell dynamics and their characteristics, in order to create a global "atlas" of cells [2, 3].
Pitfalls and Technical Challenges
Sequencing sensitivity and Normalization. With each new protocol comes a host of new technical problems to overcome. The first wave of software utilities for the analysis of single-cell datasets were statistical packages, aimed at tackling the issue of "dropout events" during sequencing, which would manifest as a high prevalence of zero-entries in over 80% of the feature-count matrix. These zeroes were problematic, since they could not be trivially ignored: either the cell did not produce any molecules for that transcript, or the sequencer simply did not detect them. Normalisation techniques originally developed for bulk RNA-seq had to be adapted to accommodate this uncertainty, and new ones were created that harness hurdle models, impute data via manifold learning techniques, or pool subsets of cells together to build general linear models [4].
Improvements in sequencing. With the downstream analysis packages attempting to solve the dropouts via stochastic methods, the upstream sequencing technologies also aspired to improve the capture efficiency via new well-, droplet-, and flow-cytometry-based protocols, all of which lend importance to the process of barcoding sequencing reads.
In each protocol, cells are tagged with cell barcodes such that any reads derived from them can be unambiguously assigned to the cell of origin. Unique molecular identifiers (UMIs) are also employed to mitigate the effects of amplification bias of transcripts within the same cell. The detection, extraction, and (de-)multiplexing of cell barcodes and UMIs is therefore one of the first hurdles researchers encounter when receiving raw FASTQ data from a sequencing facility.
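As a minimal illustration of what this extraction step involves (not part of the Galaxy tooling itself), the sketch below splits a read into UMI and cell barcode by position. The 6 bp lengths and their order are hypothetical, CEL-seq2-style values; real offsets are protocol-specific and must be confirmed with the sequencing facility.

```python
# Positional barcode/UMI extraction sketch. Assumed (hypothetical) layout:
# the first 6 bp of read 1 are the UMI, the next 6 bp the cell barcode.
UMI_LEN = 6
BC_LEN = 6

def split_read1(seq):
    """Split a read-1 sequence into (umi, cell_barcode, remainder)."""
    umi = seq[:UMI_LEN]
    barcode = seq[UMI_LEN:UMI_LEN + BC_LEN]
    rest = seq[UMI_LEN + BC_LEN:]
    return umi, barcode, rest

umi, bc, rest = split_read1("AACGTTGGACCTTACCGTTAGA")
print(umi, bc)  # AACGTT GGACCT
```

In practice this slicing is performed over every record of the FASTQ file, and the extracted barcode decides which cell the paired transcript read is assigned to.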
The Burgeoning Software Ecosystem
Since the field's inception, several different packages and many pipelines have been developed to assist researchers in the analysis of scRNA-seq [5, 6]. The vast majority of these packages were written in the R programming language, since many of the novel normalisation methods developed to handle the dropout events depended on statistical packages that were primarily R-based [7]. Standalone analysis suites emerged as the different authors of these packages rapidly expanded their methods to encapsulate all facets of the single-cell analysis, often creating compatibility issues with previous package versions. The Bioconductor repository provided some much-needed stability in this regard by hosting stable releases, but researchers were still prone to building directly from repository sources in order to reap the benefits of new features in the upstream versions [8, 9].
Nonexchangeable Data Formats. Another issue was the proliferation of many different and quickly evolving R-based file formats for processing and storing the data, such as SingleCellExperiment as used by the Scater suite, SCSeq from RaceID, and SeuratObject from Seurat [10, 11]. Many new packages would cater only to one format or suite, leading to the common problem that data processed in one suite could not be reliably processed by methods in another. This incompatibility between packages fuelled the choice of one analysis suite over another, or conversely required researchers to dig deeper into the internal semantics of R S4 objects in order to manually slot data components together [12]. These problems only accelerated the rapid development of these suites, leading to further version instability. As a result of this analysis diversity, there are many tutorials on how to perform scRNA-seq analysis, each oriented around one of these pipelines [13].
Error propagation and Analysis Uncertainty. Different pipelines produce different results, where the stochastic nature of the analyses means that any uncertainty in a crucial quality control step upstream, such as filtering or the removal of unwanted variability, can propagate forward into the downstream sections to yield wildly different results on the same data. This uncertainty, and the statistically-driven methods to overcome it, leaves a wide knowledge gap for researchers simply trying to understand the underlying dynamics of cell identity.
Rise of 10x Genomics
10x Launch. In 2015, 10x Genomics released their GemCode product, which was a droplet-seq based protocol capable of sequencing tens of thousands of cells with an average cell quality higher than other facilities [14]. This unprecedented level of throughput steadily gained traction amongst researchers and startups seeking to perform single-cell analysis, and thus 10x datasets began to prevail in the field.
10x Analysis Software. 10x Genomics provided software that was able to perform much of the pre-processing, producing feature-count matrices in a transparent HDF5-based format which offered a means of efficient matrix storage and exchange, and conclusively removed the restriction for downstream analysis modules to be written in R.
ScanPy, a popular alternative. The ScanPy suite [15], written in Python and using its own HDF5-based AnnData format, became a valid alternative for analysing 10x datasets. The Seurat developers had similar aspirations and soon adopted the LOOM format, another HDF5 variant. However, the popularity of ScanPy rose as it began to integrate the methods of other standalone packages into its codebase, making it the natural choice for users who wanted to achieve more without compromising on compatibility between different suites [9].
Solutions in the Cloud
Accessible Science. As the size of datasets scaled, so did the computing resources required to analyse them, both in terms of processing power and storage. Galaxy is an open-source biocomputing infrastructure that exemplifies the three main tenets of science: reproducibility, peer review, and open access, all freely accessible within the web browser [16]. It hosts a wide range of highly-cited bioinformatics tools in many different versions, and enables users to freely create their own workflows via a seamless drag-and-drop interface.
Reproducible Workflows. Galaxy can make use of Conda or containers to set up tool environments, ensuring that the bioinformatics tools will always be able to run, even when the library dependencies for a tool have changed, by building tools under locked version dependencies and bundling them together in a self-contained environment [17]. These environments provide a concise solution to the package version instability that plagues scRNA-seq analysis notebooks, both in terms of reproducibility and analysis flexibility. A user could keep the quality control results obtained from an older version of ScanPy, whilst running a newer ScanPy version at the clustering stage to reap the benefits of later improvements in that algorithm. By allowing the user to select from multiple versions of the same tool, and by further permitting different versions of the tools within a workflow, Galaxy enables an unprecedented level of free-flow analysis by letting researchers pick and choose the best aspects of a tool without worrying about the underlying software libraries [18]. The burdens of software incompatibility and programming-language choice that previously plagued the scRNA-seq analysis ecosystem are thus lifted from the user.
User-driven Custom Workflows. Analyses are not limited to one pipeline either, as the datasets which are passed between tools can easily be interpreted by any other tool capable of reading that dataset. In the case of scRNA-seq, Galaxy can convert between CSV, MTX, LOOM and AnnData formats. This interchange of modules from different tools further extends the flexibility of the analysis by again letting the user decide which component of a tool is best suited for a specific part of an analysis.
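To make the MTX format mentioned above concrete, the sketch below shows a toy round-trip through the MatrixMarket coordinate layout that many scRNA-seq tools use for sparse count matrices: a header, the matrix dimensions, then one 1-based (row, col, value) triplet per non-zero entry. This is an illustrative minimal reader/writer, not any tool's actual converter.

```python
# Minimal MatrixMarket (MTX) coordinate-format round-trip for a small
# dense matrix; only non-zero entries are stored.
def dense_to_mtx(matrix):
    rows, cols = len(matrix), len(matrix[0])
    entries = [(r + 1, c + 1, v)
               for r, row in enumerate(matrix)
               for c, v in enumerate(row) if v != 0]
    lines = ["%%MatrixMarket matrix coordinate integer general",
             f"{rows} {cols} {len(entries)}"]
    lines += [f"{r} {c} {v}" for r, c, v in entries]
    return "\n".join(lines)

def mtx_to_dense(text):
    lines = [l for l in text.splitlines() if not l.startswith("%")]
    rows, cols, _ = map(int, lines[0].split())
    matrix = [[0] * cols for _ in range(rows)]
    for line in lines[1:]:
        r, c, v = map(int, line.split())
        matrix[r - 1][c - 1] = v
    return matrix

counts = [[0, 3, 0], [1, 0, 0]]  # toy genes-by-cells matrix
assert mtx_to_dense(dense_to_mtx(counts)) == counts
```

Because dropout-heavy count matrices are mostly zeroes, this coordinate representation is what makes exchanging large matrices between suites tractable.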
Training Resources. Galaxy also provides a wide range of learning resources, with the aim of guiding users step-by-step through an analysis, often reproducing the results of published works. The teaching and training materials are part of the Galaxy Training Network (GTN), a worldwide collaborative effort to produce high-quality teaching material in order to educate users in how to analyse their data, and in turn to train others using the same materials via easily deployable workshops, backed by monthly stable releases of the GTN materials [19]. Training materials are provided on a wide variety of different topics, and workshops are hosted regularly, as advertised on the Galaxy Events web portal. The GTN has grown rapidly since its inception and gains new volunteers every year, who each contribute to and coordinate training and teaching events, maintain topics and subtopics, translate tutorials into multiple languages, and provide peer review on new material [20].
Methods
Stable Workflows in Galaxy. The analysis of scRNA-seq within Galaxy was a two-pronged effort concentrated on bringing high-quality single-cell tools into Galaxy, and on providing the necessary workflows and training to accompany them. As mentioned in the previous section, this effort needed to overcome incompatible file formats and the package instability caused by rapid development, and to establish a standardised basis for the analysis.
Tutorials. The tutorials are split into two main parts, as outlined in Figure 1: first, the pre-processing stage, which constructs a count matrix from the initial sequencing data; second, the cluster-based downstream analysis on the count matrix. These stages are very different from one another in terms of their information content: the pre-processing stage requires the researcher to be more familiar with wet-lab sequencing protocols than the average bioinformatician would normally be, and the downstream analysis stage requires the researcher to be familiar with statistical concepts that a wet-lab scientist might not know well. The tutorials are designed to broadly appeal to both the biologist and the statistician, as well as to complete beginners to the entire topic.
Pre-processing Workflows
The pre-processing scRNA-seq materials tackle the two most common use-cases that researchers will encounter when they first enter the field: processing scRNA-seq data from 10x Genomics, and processing data generated from alternative protocols. For instance, microwell-based protocols have been known to yield more features and display lower levels of dropouts compared to 10x, and so we accommodate them by providing a more customizable path through the pre-processing stage [21].
Figure 1. The main stages of single-cell analysis, separated broadly into the upper and lower stages of pre-processing and downstream analysis, respectively. The upper part illustrates the two main routes to generating a count matrix from sequencing data: via one-click quantification solutions, or through manual demultiplexing. The lower part describes the four main stages required to perform cluster-based analysis from the count matrix: filtering, normalisation, confounder removal, and embedding.

Barcode Extraction. Before the era of 10x Genomics, scRNA-seq data had to be demultiplexed, mapped, and quantified. The demultiplexing stage entails an intimate knowledge of cell barcodes and Unique Molecular Identifiers (UMIs), which are protocol dependent, and expects that the bioinformatician knows exactly where and how the data was generated. One common pitfall at this very first stage is estimating how many cells to expect from the FASTQ input data, and this requires three crucial pieces of information: which reads contain the barcodes (or more precisely, which subset of both the forward and reverse reads contains the barcodes); of these barcodes, which specific ones were actually used for the analysis; and how to resolve barcode mismatches/errors.
Barcode Estimation. Naive strategies involve using a known barcode template and querying it against the FASTQ data to profile the number of reads that align to a specific barcode, often employing 'knee' methods to estimate this amount [22]. However, this approach is not robust, since certain cells are more likely to be over-represented compared to others, and some cell barcodes may contain more unmappable reads than others, meaning that higher library read depth is not necessarily correlated with a better-defined cell. Ultimately, the bioinformatician must inquire directly with the sequencing lab as to which cell barcodes were used, as these are often not specific to the protocol but to the technician who designed them, with the idea that they should not align to a specific reference genome or transcriptome.
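The 'knee' idea can be sketched in a few lines: rank barcodes by read count and cut at the largest fold-change between consecutive ranked barcodes. This is a deliberately naive illustration of the concept, not a production method, and the robustness caveats above apply to it in full.

```python
# Naive 'knee' cutoff sketch: the largest log fold-change between
# consecutive ranked barcode counts marks the putative cell/background
# boundary. Real implementations are considerably more robust.
import math

def knee_cutoff(barcode_counts):
    """Return the number of barcodes kept as putative real cells."""
    ranked = sorted(barcode_counts.values(), reverse=True)
    drops = [math.log(ranked[i] / ranked[i + 1]) if ranked[i + 1] else float("inf")
             for i in range(len(ranked) - 1)]
    return drops.index(max(drops)) + 1

counts = {"AAAC": 9000, "GGTA": 8500, "CTTG": 7800, "TGCA": 40, "ACGT": 25}
print(knee_cutoff(counts))  # 3: the sharp drop falls after the third barcode
```

On real data the ranked-count curve is far noisier, which is precisely why the text recommends confirming the barcode list with the sequencing lab rather than trusting the knee alone.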
One-click Pre-processing
Quantification with Cell Ranger. 10x Genomics simplified the scRNA package ecosystem by using a language-independent file format and streamlining much of the barcode particularities with their Cell Ranger pipeline, allowing researchers to focus more on the internal biological variability of their datasets [23].
Quantification with STARsolo. The pre-processing workflow (titled "10X StarSolo Workflow") in Galaxy uses the RNA STARsolo utility as a drop-in replacement for Cell Ranger, not only because it is a feature update of the already existing RNA STAR tool in Galaxy, but also because it boasts a ten-fold speedup in comparison to Cell Ranger and does not require Illumina lane-read information to perform the processing [24, 25].
Other Approaches. The pre-processing workflows for these "one-click" solutions consume the same datasets and yield approximately the same count matrices by following similar modes of barcode discovery and quantification. Within Galaxy, there are also Alevin (paired with Salmon) and scPipe, both of which can perform the necessary demultiplexing, (alignment-free) mapping, and quantification stages in a single step [26, 27, 28].
Flexible Pre-processing
CELSeq2 Barcoding. The custom pre-processing workflow (titled "CELSeq2: Single Batch mm10") is modelled after the CEL-seq2 protocol, using the barcoding strategies of the Freiburg Max Planck Institute laboratory as its main template, but the workflow is flexible enough to accommodate any droplet- or well-based protocol, such as SMART-seq2 and Drop-seq [29].
Manual Demultiplexing and Quantification. The training pictographically guides users through the concepts of extracting cell barcodes from the protocol, explains the significance of UMIs in the process of read deduplication with illustrative examples, and instructs the user in performing further quality controls on their data during the post-mapping process via RNA STAR and other tools that are native to Galaxy.
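The UMI deduplication concept taught in the training can be summarised in a small sketch: reads sharing the same cell barcode, UMI, and mapped gene are collapsed to one molecule, so counts reflect molecules rather than PCR copies. The tuples below are hypothetical toy data, and real deduplicators additionally handle UMI sequencing errors.

```python
# UMI-based deduplication sketch: collapse (cell, UMI, gene) duplicates.
from collections import defaultdict

def umi_counts(reads):
    """reads: iterable of (cell_barcode, umi, gene) tuples."""
    molecules = defaultdict(set)
    for cell, umi, gene in reads:
        molecules[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in molecules.items()}

reads = [("AAAC", "TTG", "GeneA"),
         ("AAAC", "TTG", "GeneA"),   # PCR duplicate: same cell/UMI/gene
         ("AAAC", "ACT", "GeneA"),
         ("GGTA", "TTG", "GeneA")]
print(umi_counts(reads))  # {('AAAC', 'GeneA'): 2, ('GGTA', 'GeneA'): 1}
```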
Training the User. At each stage, the user's knowledge is queried via question prompts and expandable answer-box dialogs, as well as helpful hints for future processing in comment boxes, all written in the transparent Markdown specification developed for contributing to the GTN.
Downstream Workflows
Common Stages of Analysis. The downstream modules are defined by the five main stages of downstream scRNA-seq analysis: filtering, normalisation, confounder removal, clustering, and trajectory inference. There are three workflows to aid in this process (two of which are shown in Figure 2), each sporting a different well-established scRNA-seq pipeline tool.
Quality Control with Scater. The Scater pipeline follows a visualise-filter-visualise paradigm, which provides an intuitive means to perform quality control on a count matrix through repeated incremental changes to a dataset using PCA and library-size based metrics [30]. Once this pre-analysis stage is complete, the full downstream analysis (comprising the five stages mentioned above) can be performed by workflows based on the following suites: RaceID and ScanPy.
Downstream Analysis with the RaceID Suite. RaceID was developed initially to analyse rare cell transcriptomes whilst being robust against noise, and is thus ideal for working with smaller datasets in the range of 300 to 1,000 cells. Due to its complex cell lineage and fate prediction models, it can also be used on larger datasets with some scaling costs.
Downstream Analysis with the ScanPy Suite. ScanPy was developed as the Python alternative to the innumerable scRNA-seq packages written in R, then the dominant language for such analyses, and it was one of the first packages with native 10x Genomics support. Since then it has grown substantially, and has been re-implementing many of the newer R-based methods released in Bioconductor as "recipe" modules, thereby providing a single source from which to perform many different variants of the same analysis.
The workflows derived from both of these suites emulate the five main stages of analysis mentioned previously, where filtering, normalisation, and confounder removal are typically separated into distinct stages, though sometimes merged into one step depending on the tool.
Filtering
Cell and Gene Removal. During the filtering stage, low-quality or unwanted cells are removed from the initial count matrix using commonly-used parameters such as minimum gene detection sensitivity and minimum library size; low-quality genes are also removed under similar metrics, where the minimum number of cells in which a gene must be detected is decided. The Scater pre-analysis workflow can also be used here to provide a PCA-based method of feature selection, so that only the highly variable genes are left in the analysis.
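The two filtering criteria just described can be sketched on a toy dense matrix: drop cells detecting too few genes, then drop genes detected in too few of the remaining cells. The thresholds here are arbitrary examples, not recommended defaults.

```python
# Toy cell/gene filtering on a dense cells-by-genes matrix.
def filter_matrix(counts, min_genes_per_cell=2, min_cells_per_gene=2):
    # Keep cells (rows) in which at least min_genes_per_cell genes are detected.
    cells = [row for row in counts
             if sum(1 for v in row if v > 0) >= min_genes_per_cell]
    # Keep genes (columns) detected in at least min_cells_per_gene of those cells.
    keep = [g for g in range(len(cells[0]))
            if sum(1 for row in cells if row[g] > 0) >= min_cells_per_gene]
    return [[row[g] for g in keep] for row in cells]

counts = [[5, 0, 2],   # cell 1: 2 genes detected
          [0, 0, 1],   # cell 2: only 1 gene detected -> removed
          [3, 0, 4]]   # cell 3: 2 genes detected
print(filter_matrix(counts))  # [[5, 2], [3, 4]]
```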
Disadvantages of Filtering. There is always the danger of over-filtering a dataset, whereby overzealous lower-bound thresholds on gene variability can have the undesired effect of removing essential housekeeping genes. These relatively uniformly expressed genes are often required for setting a baseline against which the more desired differentially expressed genes can be selected. It is therefore important that the user first performs a naive analysis and only later refines their filtering thresholds to boost the biological signal.
Normalisation
Library Size Normalisation. The normalisation step aims to remove any technical factors that are not relevant to the analysis, such as the library size, where cells sharing the same identity are likely to differ from one another more by the number of transcripts they exhibit than by more relevant biological factors.
Intrinsic Cell Factors. The first and foremost is cell capture efficiency, where different cells yield more or fewer transcripts based on the amplification and coverage conditions they are sequenced in. The second is the presence of dropout events, which manifest as a prevalence of "zeroes" in the final count matrix. Whether a "zero" is imputable to the lack of detection of an existing molecule or to the absence of the molecule in the cell is uncertain. This uncertainty alone has led to a wide selection of different normalisation techniques that try to model this expression via hurdle models, impute the data via manifold learning techniques, or work around the issue by pooling subsets of cells together [31].
In this regard, both the RaceID and ScanPy workflows offer many different normalisation techniques, and users are encouraged to take advantage of the branching workflow model of Galaxy to explore all possible options.
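The simplest of these techniques, library-size normalisation, can be shown in a few lines: scale each cell to a common total count and log-transform, so that cells differ by expression profile rather than sequencing depth. The target sum of 10,000 is an arbitrary but widely used convention, not a value mandated by any particular suite.

```python
# Library-size normalisation sketch: counts-per-10k followed by log1p.
import math

def normalise(cell_counts, target_sum=10_000):
    total = sum(cell_counts)
    return [math.log1p(v * target_sum / total) for v in cell_counts]

cell_a = [100, 300, 600]      # 1,000 reads total
cell_b = [1000, 3000, 6000]   # same profile at 10x the sequencing depth
print(normalise(cell_a) == normalise(cell_b))  # True
```

After this step the two cells above become identical, which is exactly the intent: depth differences vanish while relative expression is preserved.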
Confounder Removal
Regression of Cell Cycle Effects. Other sources of variability stem from unwanted biological contributions known as confounder effects, such as cell cycle effects and transcription. Depending on the stage of the cell cycle at which a cell was sequenced, two cells of the same type might cluster differently because one might have more transcripts due to being in the M-phase of the cell cycle. Library sizes notwithstanding, it is the variability in specific cell cycle genes that can be the main driving factor in the overall variability. Thankfully, these effects are easy to regress out, and we replicate an entire standalone ScanPy workflow dedicated to detecting and visualising these effects, based on the original notebook [32].

Figure 2. Downstream analysis workflows as shown in the Galaxy Workflow Editor for (top) RaceID and (bottom) ScanPy, each displaying modules symbolizing the five main stages of analysis.
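"Regressing out" a confounder can be illustrated with a minimal least-squares sketch: fit each gene's expression against a per-cell cell-cycle score and keep only the residuals, removing the variability explained by the score. The scores and expression values below are hypothetical, and real pipelines fit richer models per gene.

```python
# Minimal 'regress out' sketch: ordinary least-squares residuals of one
# gene's expression against a per-cell confounder score.
def regress_out(expression, score):
    n = len(score)
    mean_x = sum(score) / n
    mean_y = sum(expression) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(score, expression))
    var = sum((x - mean_x) ** 2 for x in score)
    slope = cov / var
    return [y - slope * (x - mean_x) - mean_y for x, y in zip(score, expression)]

score = [0.1, 0.5, 0.9]   # toy per-cell cell-cycle scores
gene = [1.0, 3.0, 5.0]    # expression fully explained by the score
print([round(r, 6) for r in regress_out(gene, score)])  # [0.0, 0.0, 0.0]
```

Here the gene's variability is entirely cell-cycle driven, so nothing remains after regression; in real data the residuals retain the biological signal of interest.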
Transcriptional Bursting. The transcription effects are harder to model, as these are semi-stochastic and as yet still not well understood. In bulk RNA-seq, the expression of genes undergoing transcription is averaged to give "high" or "low" signals, producing a global effect that gives the false impression that transcription is a continuous process. The reality is more complex: cells undergo transcription in "bursts" of activity followed by periods of no activity, at irregular intervals [33]. At the bulk level these discrete processes are smoothed to give a continuous effect, but at the cell level it means that even two directly adjacent cells of the same type, normalised to the same number of transcripts, can still have different levels of expression for a gene due to this process. This is not something that can be corrected for, but it does educate users about which factors they can or cannot control in an analysis, and how much variability they can expect to see.
Clustering and Projection
Dimension Reduction and Clustering. Once users have obtained a count matrix they are confident with, they are then guided through the process of dimension reduction (with a choice of different distance metrics), choosing a suitable low-dimensional embedding, and performing clustering through commonly-used techniques such as k-means, hierarchical, and neighbourhood community detection. These complex techniques are illustrated in layman's terms through the use of helpful images and community examples. For example, the GTN ScanPy tutorial explains the Louvain clustering approach [34] via a standalone slide deck to assist in the workflow [35].
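Of the clustering techniques named above, k-means is the simplest to sketch: alternately assign points to their nearest centroid and move each centroid to its cluster mean. The 2D points and fixed initial centroids below are toy values chosen to keep the example deterministic; real scRNA-seq clustering operates on high-dimensional embeddings.

```python
# Toy k-means sketch on 2D points with fixed initial centroids.
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2 +
                                        (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        centroids = [(sum(p[0] for p in c) / len(c),
                      sum(p[1] for p in c) / len(c))
                     for c in clusters if c]
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)  # two centroids, near (0.33, 0.33) and (10.33, 10.33)
```

Note that k-means requires the number of clusters up front, which is one reason neighbourhood community detection methods such as Louvain are preferred for single-cell data.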
Commonly-used Embeddings. The clustering and cluster inspection stages are notably separated into distinct utilities here, with the understanding that the same initial clustering can appear dissimilar under different projections, e.g. t-distributed Stochastic Neighbor Embedding (tSNE) against Uniform Manifold Approximation and Projection (UMAP) [36, 37]. Ultimately the user is encouraged to experiment with the plotting parameters to yield the best-looking clusters.
Static Plots or Interactive Environments. Cluster inspection tools
are available that allow users to easily generate static plots
tailored to pipeline-specific information as originally defined
This preprint (which was not certified by peer review) was made available under a CC-BY 4.0 International license; the copyright holder is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. This version was posted August 28, 2020. doi: https://doi.org/10.1101/2020.06.06.137570

Citations

Revealing the vectors of cellular identity with single-cell genomics

TL;DR: Single-cell genomics has now made it possible to create a comprehensive atlas of human cells and has reopened definitions of a cell's identity and of the ways in which identity is regulated by the cell's molecular circuitry.

From bench to bedside: Single-cell analysis for cancer immunotherapy

TL;DR: A review of sample processing and computational analysis regarding their application to translational cancer immunotherapy research, in which the authors identify predictors of response using single-cell technologies.

Building Domain-Specific Machine Learning Workflows: A Conceptual Framework for the State-of-the-Practice

TL;DR: A discussion of available textual and graphical workflow systems and their support for six key challenges that a domain expert faces in transforming their problem into a computational workflow, and then into an executable implementation.
References

STAR: ultrafast universal RNA-seq aligner

TL;DR: The Spliced Transcripts Alignment to a Reference (STAR) software, based on a previously undescribed RNA-seq alignment algorithm that uses a sequential maximum mappable seed search in uncompressed suffix arrays followed by a seed clustering and stitching procedure, outperforms other aligners by a factor of >50 in mapping speed.

Visualizing Data using t-SNE

TL;DR: A new technique called t-SNE that visualizes high-dimensional data by giving each datapoint a location in a two- or three-dimensional map; a variation of Stochastic Neighbor Embedding that is much easier to optimize and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.

Fast unfolding of communities in large networks

TL;DR: A heuristic method based on modularity optimization that is shown to outperform all other known community detection methods in terms of computation time, while the quality of the communities detected is very good, as measured by the so-called modularity.

Salmon provides fast and bias-aware quantification of transcript expression

TL;DR: Salmon is the first transcriptome-wide quantifier to correct for fragment GC-content bias, which substantially improves the accuracy of abundance estimates and the sensitivity of subsequent differential expression analysis.
Frequently Asked Questions (17)
Q1. What are the contributions mentioned in the paper "A single-cell rna-seq training and analysis suite using the galaxy framework" ?

Here the authors outline several Galaxy workflows and learning resources for scRNA-seq, with the aim of providing a comprehensive analysis environment paired with a thorough user learning experience that bridges the knowledge gap between the computational methods and the underlying cell biology. The Galaxy reproducible bioinformatics framework provides tools, workflows and trainings that not only enable users to perform one-click 10x preprocessing, but also empowers them to demultiplex raw sequencing from custom tagged and full-length sequencing protocols. The reproducible and training-oriented Galaxy framework provides a sustainable HPC environment for users to run flexible analyses on both 10x and alternative platforms. The tutorials from the Galaxy Training Network along with the frequent training workshops hosted by the Galaxy Community provide a means for users to learn, publish and teach scRNA-seq analysis.

ScanPy was developed as the Python alternative to the innumerable R-based packages for scRNA-seq (R being then the dominant language for such analyses), and it was one of the first packages with native 10x Genomics support.

The downstream modules are defined by the five main stages of downstream scRNA-seq analysis: filtering, normalisation, confounder removal, clustering, and trajectory inference.

The first wave of software utilities to deal with the analysis of single-cell datasets were statistical packages, aimed at tackling the issue of “dropout events” during sequencing, which would manifest as zero entries in over 80% of the feature-count matrix.
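That sparsity is easy to quantify directly. The toy matrix below is hypothetical, purely to illustrate the "fraction of zeros" figure the early statistical packages were built around:

```python
def zero_fraction(matrix):
    """Fraction of zero entries in a feature-count matrix -- the
    'dropout' sparsity that early scRNA-seq statistical tools modelled."""
    entries = [v for row in matrix for v in row]
    return sum(1 for v in entries if v == 0) / len(entries)

# Hypothetical cells-by-genes count matrix
toy = [[0, 0, 3, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 2, 0]]
print(zero_fraction(toy))  # 0.8 -- the sparsity regime described above
```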

The teaching and training materials are part of the Galaxy Training Network (GTN), which is a worldwide collaborative effort to produce high-quality teaching material in order to educate users in how to analyse their data, and in turn to train others using the same materials via easily deployable workshops backed by monthly stable releases of the GTN materials [19].

Other sources of variability stem from unwanted biological contributions known as confounder effects, such as cell cycle effects and transcriptional bursting.

One common pitfall at this very first stage is estimating how many cells to expect from the FASTQ input data, and this requires three crucial pieces of information: which reads contain the barcodes (or precisely, which subset of both the forward and reverse reads contains the barcodes); of these barcodes, which specific ones were actually used for the analysis; and how to resolve barcode mismatches/errors.
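The third point, resolving barcode mismatches, is commonly handled by matching each observed barcode against the protocol's whitelist within a small Hamming distance. The function and the toy whitelist below are a hypothetical sketch of that idea, not the implementation used by the Galaxy preprocessing tools:

```python
def correct_barcode(observed, whitelist, max_mismatches=1):
    """Match an observed cell barcode against a known whitelist,
    tolerating up to `max_mismatches` substitution errors. Returns
    the corrected barcode, or None when zero or several whitelist
    entries fit (an ambiguous hit cannot be safely assigned)."""
    if observed in whitelist:
        return observed
    candidates = [bc for bc in whitelist
                  if len(bc) == len(observed)
                  and sum(a != b for a, b in zip(bc, observed)) <= max_mismatches]
    return candidates[0] if len(candidates) == 1 else None

# Hypothetical 8-bp whitelist
whitelist = {"AAACCCAA", "TTTGGGTT", "GGGAAACC"}
print(correct_barcode("AAACCCAA", whitelist))  # exact hit: 'AAACCCAA'
print(correct_barcode("AAACCCAT", whitelist))  # one error, corrected: 'AAACCCAA'
print(correct_barcode("AATCCCAT", whitelist))  # two errors, rejected: None
```

Real droplet protocols use much larger whitelists and often weight candidates by base qualities, but the distance-based matching principle is the same.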

The normalisation step aims to remove any technical factors that are not relevant to the analysis, such as the library size, where cells sharing the same identity are likely to differ from one another more by the number of transcripts they exhibit than due to more relevant biological factors.
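One common scheme for removing the library-size factor is to rescale each cell's counts to a shared total, e.g. the median library size. The sketch below assumes a plain cells-by-genes list-of-lists matrix with made-up counts; the actual Galaxy/ScanPy normalisation tools offer several such methods:

```python
from statistics import median

def normalise_library_size(counts):
    """Scale each cell's gene counts so every cell sums to the same
    total (the median library size across cells), removing sequencing
    depth as a technical factor."""
    totals = [sum(cell) for cell in counts]
    target = median(totals)
    return [[g * target / t for g in cell]
            for cell, t in zip(counts, totals)]

# Two hypothetical cells of the same type at different depths:
# identical proportions, but the second was sequenced 3x deeper.
matrix = [[10, 30, 60],    # library size 100
          [30, 90, 180]]   # library size 300
norm = normalise_library_size(matrix)
print(norm[0] == norm[1])  # True -- the depth difference is removed
```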

The GTN also makes use of language translation tools to provide international interpretations of the training materials in order to reach a wider, more internationally diverse audience.

The advent of scRNA-seq analysis within the Galaxy framework re-echoes the efforts to standardise the analysis of scRNA-seq with the promise of presenting reproducible research.

The tutorials are designed to broadly appeal to both the biologist and the statistician, as well as complete beginners to the entire topic. 

Standalone analysis suites emerged as the different authors of these packages rapidly expanded their methods to encapsulate all facets of the single-cell analysis, often creating compatibility issues with previous package versions.

The Galaxy framework abstracts the user from the many nontrivial technicalities of the analysis, and exposes them to a legible interface of tools that they can pick and choose from.

The analysis of scRNA-seq within Galaxy was a two-pronged effort concentrated on bringing high-quality single-cell tools into Galaxy, and providing the necessary workflows and training to accompany them.


This incompatibility between packages fuelled a choice of one analysis suite over another, or conversely required researchers to dig deeper into the internal semantics of R S4 objects in order to manually slot data components together [12]. 

These tutorials can also declare prerequisites, so that users can review required concepts from previous tutorials, e.g. quality control checks from bulk RNA-seq still being used in scRNA-seq.