ANAQUIN: a software toolkit for the analysis of spike-in controls for next generation sequencing



Genome analysis
Ted Wong¹, Ira W. Deveson¹,², Simon A. Hardwick¹,³ and Tim R. Mercer¹,³,*
¹Genomics and Epigenetics Division, Garvan Institute of Medical Research, Sydney, NSW, Australia, ²Faculty of Science, School of Biotechnology and Biomolecular Sciences, UNSW, Sydney, NSW, Australia and ³Faculty of Medicine, St Vincents Clinical School, UNSW, Sydney, NSW, Australia
*To whom correspondence should be addressed.
Associate Editor: Bonnie Berger
Received on September 26, 2016; revised on January 5, 2017; editorial decision on January 18, 2017; accepted on January 23, 2017
Abstract
Summary: Spike-in controls are synthetic nucleic-acid sequences that are added to a user’s sample and constitute internal standards for subsequent steps in the next generation sequencing workflow. The Anaquin software toolkit can be used to analyze the performance of spike-in controls at multiple steps during RNA sequencing or genome sequencing analysis, providing useful diagnostic statistics, data visualization and sample normalization.
Availability and Implementation: The software is implemented in C++/R and is freely available under BSD license. The source code is available from github.com/student-t/Anaquin, binaries and user manual from www.sequin.xyz/software and R package from bioconductor.org/packages/Anaquin
Contact: anaquin@garvan.org.au or t.mercer@garvan.org.au
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
Next-generation sequencing (NGS) is widely used in biological research and is being increasingly used for clinical diagnosis. However, NGS experiments are confounded by technical variation, biases and artifacts that arise during library preparation, sequencing and subsequent bioinformatic analysis.
Spike-in controls are RNA or DNA molecules that can be directly added to a user’s sample prior to sequencing (Deveson et al., 2016; Hardwick et al., 2016; Jiang et al., 2011; Zook et al., 2012). Spike-in controls are typically synthetic sequences that can be distinguished from the natural RNA/DNA sequences in the sample. This enables spike-ins to be analyzed in parallel to the accompanying natural sample, acting as internal quantitative and qualitative controls.
The analysis of spike-in controls enables an assessment of multiple steps during the NGS workflow (see Fig. 1). This includes measuring features of the NGS library (such as library complexity, quality and sequencing error), determining diagnostic statistics (such as sensitivity and specificity), and supporting quality control and troubleshooting.
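As an illustration of the diagnostic statistics mentioned above, the following minimal sketch computes sensitivity and specificity from a set of known spike-in features and a set of features reported by an analysis. The function name, the position-based representation and the example values are assumptions made for illustration only; this is not Anaquin's implementation.

```python
def sensitivity_specificity(truth, called, universe):
    """Compute (sensitivity, specificity) from sets of feature identifiers.

    truth    -- features known to be present in the spike-ins
    called   -- features reported by the analysis (assumed to lie in universe)
    universe -- every feature that was assessed
    """
    tp = len(truth & called)             # present and reported
    fn = len(truth - called)             # present but missed
    fp = len(called - truth)             # reported but absent
    tn = len(universe - truth - called)  # absent and not reported
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical example: ten assessed positions on a synthetic chromosome.
universe = set(range(1, 11))
truth = {2, 5, 8}    # positions where spike-ins carry a known variant
called = {2, 5, 9}   # positions reported by a variant caller

sens, spec = sensitivity_specificity(truth, called, universe)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

Because the true state of every spike-in feature is known by design, all four cells of the confusion matrix can be counted directly, which is not possible for the accompanying natural sample.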
There is a range of statistical and bioinformatic strategies to analyze spike-in controls. The erccdashboard software provides easy analysis and visualization of ERCC RNA spike-in controls that are commonly used in microarray and RNA sequencing experiments (Munro et al., 2014). However, as spike-ins are being increasingly adopted in genome sequencing and metagenomics, there is a growing need for the analysis of spike-ins in diverse experimental contexts.
To facilitate the analysis of spike-in controls for NGS, we have developed a software toolkit, termed Anaquin. This toolkit allows users to evaluate the performance of spike-in controls and the accompanying RNA/DNA sample. This toolkit is compatible with most common bioinformatics tools and data formats, and is easily integrated into standard NGS workflows.
© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
Bioinformatics, 33(11), 2017, 1723–1724
doi: 10.1093/bioinformatics/btx038
Advance Access Publication Date: 27 January 2017
Applications Note
Downloaded from https://academic.oup.com/bioinformatics/article/33/11/1723/2959850 by guest on 16 August 2022

2 Results
2.1 Implementation
Anaquin is implemented in both the C++ and R programming languages. The C++ command-line software is run in a UNIX environment, and is useful for computationally intensive analysis (e.g. analyzing large .BAM alignment files) and for integration with other command-line tools and data formats. The related R package is distributed by Bioconductor and is ideal for data visualization and statistical analysis.
Anaquin has been designed for integration with NGS bioinformatics pipelines of third-party software. Accordingly, the software supports the use of standard data formats (such as SAM, BAM, BED, VCF, GTF etc.) and has been tested in conjunction with popular third-party software (such as Cufflinks, DESeq2, TopHat2, STAR, GATK, VarScan, etc.). Users can also convert non-supported data formats into simple tab-delimited text formats that are supported by Anaquin. Where appropriate, Anaquin also supports multiple replicate input files.
Anaquin may also require reference information files to help with the analysis of spike-in controls. Common examples include (i) mixture files that indicate the concentration of spike-in controls in a mixture, (ii) sequence files that provide the sequence of the spike-ins, or of an artificial in silico chromosome or genome to which spike-ins align, or (iii) annotation files that indicate the coordinates of spike-in controls with respect to the aforementioned in silico chromosome or genome.
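To make the role of a mixture file concrete, the sketch below reads a small tab-delimited table that maps spike-in identifiers to concentrations. The column names (`id`, `concentration`), the identifiers and the values are hypothetical and chosen for illustration; the actual layout of the mixture files distributed with the spike-ins may differ and should be checked against the user manual.

```python
import csv
import io


def read_mixture(handle):
    """Return {spike_in_id: concentration} from a tab-delimited table.

    The header names 'id' and 'concentration' are assumptions for this sketch.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["id"]: float(row["concentration"]) for row in reader}


# Hypothetical mixture table; identifiers and values are invented.
EXAMPLE = (
    "id\tconcentration\n"
    "S_001\t30.0\n"
    "S_002\t7.5\n"
    "S_003\t0.9375\n"
)

mixture = read_mixture(io.StringIO(EXAMPLE))
for spike_in, concentration in sorted(mixture.items()):
    print(f"{spike_in}\t{concentration}")
```

A lookup table of this kind is what allows a measured abundance for each spike-in to be compared against its known input concentration in later steps.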
2.2 Tools
The Anaquin toolkit is organized in a hierarchical fashion, with a range of different tools that can be used to assess the performance of spike-in controls at several steps during the user’s NGS workflow of third-party software. For example, in RNA sequencing experiments, Anaquin can be used to assess split-read alignments (RnaAlign), isoform assembly (RnaAssembly), and gene and isoform quantification (RnaExpression). In addition, Anaquin enables normalization between multiple samples (RnaSubsample) and can be used to assess differential gene expression between libraries (RnaFoldChange). For DNA sequencing experiments, Anaquin provides tools to assess the performance of read alignment (VarAlign) and variant identification (VarDiscover), and to calibrate sequencing coverage between samples or replicates (VarSubsample).
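One simple way to think about calibration between samples is to down-sample each library so that its spike-in read count matches that of the library with the fewest spike-in reads. The sketch below illustrates this idea only; it is an assumption made for illustration and is not a description of the algorithm used by RnaSubsample or VarSubsample. The sample names and read counts are invented.

```python
import random


def subsample_fractions(spike_in_reads):
    """Fraction of reads to keep per sample so spike-in counts are equalised.

    spike_in_reads -- {sample_name: number of reads aligned to spike-ins}
    The smallest count is used as the common target.
    """
    target = min(spike_in_reads.values())
    return {sample: target / count for sample, count in spike_in_reads.items()}


def subsample(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [read for read in reads if rng.random() < fraction]


# Hypothetical spike-in read counts for three libraries.
counts = {"A": 200000, "B": 100000, "C": 400000}
fractions = subsample_fractions(counts)
for sample in sorted(fractions):
    print(f"{sample}: keep {fractions[sample]:.2f} of reads")
```

Sampling each read independently yields a retained count that is close to, but not exactly, the target; this is usually acceptable for large libraries.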
Anaquin calculates a range of summary statistics derived from spike-in controls within an NGS library. For example, Anaquin can measure the minimal expression sufficient for the de novo assembly of RNA isoforms or assess the accuracy of alternative isoform measurements. Anaquin also provides detailed statistics on individual spike-in controls in CSV format that can be easily exported for further investigation.
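A common summary statistic of this kind is the agreement between the known input concentration of each spike-in and its measured abundance. The sketch below fits an ordinary least-squares line on the log2 scale and reports the slope, intercept and coefficient of determination. The function name and the example values are hypothetical, and the choice of a log2 linear model is an assumption for illustration rather than a statement of how Anaquin computes its statistics.

```python
import math


def log2_linear_fit(expected, observed):
    """Least-squares fit of log2(observed) against log2(expected).

    Both inputs must be equal-length sequences of positive numbers.
    Returns (slope, intercept, r_squared).
    """
    xs = [math.log2(value) for value in expected]
    ys = [math.log2(value) for value in observed]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical data: measured abundance is exactly three times the input.
expected = [1, 2, 4, 8, 16]
observed = [3, 6, 12, 24, 48]

slope, intercept, r_squared = log2_linear_fit(expected, observed)
print(f"slope={slope:.3f} intercept={intercept:.3f} R2={r_squared:.3f}")
```

A slope near one indicates that measured abundance scales proportionally with input concentration, while a low coefficient of determination points to noisy quantification.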
Finally, Anaquin also generates template code for R. This enables the easy import of spike-in data into R for further analysis with the wide range of statistical and bioinformatics tools available through the Bioconductor project. This includes the ability to quickly visualize spike-in data using scatter plots (to investigate dependence between variables) and receiver operating characteristic (ROC) curves (to assess diagnostic performance; see Supplementary Data S1 for example plots provided by Anaquin). Notably, the assessment of spike-ins enables users to optimize input parameters for third-party tools and/or set filtering criteria in order to maximize the performance of their bioinformatic workflow.
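The ROC analysis described above can be sketched in a few lines. The example below ranks scored calls, labelled as true or false according to the spike-in reference, and computes the ROC points and the area under the curve by the trapezoidal rule. It is written in Python for illustration and does not reproduce the R template code generated by Anaquin; the function names, scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return ROC points (false positive rate, true positive rate).

    scores -- confidence score per call (higher means more confident)
    labels -- 1 if the call matches the spike-in reference, else 0
    Both classes must be present. Tied scores are treated as one threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every call sharing this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical scored calls: three true and three false.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(f"AUC={auc(points):.3f}")
```

Each point on the curve corresponds to one candidate score threshold, so the same ranking can guide the choice of a filtering criterion that balances true and false positives.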
It is important to note that the range of possible analyses with spike-in controls is diverse and will continue to expand. Spike-in controls allow empirical evaluation of almost any aspect of the NGS workflow and can inform novel statistical analyses yet to be developed. Accordingly, we anticipate that additional tools will be added to Anaquin in conjunction with continued research and development of spike-in controls.
Funding
The authors would like to thank the following funding sources: T.W. is supported by a Paramor Family fellowship. I.W.D. and S.A.H. are supported by Australian Postgraduate Award scholarships. T.R.M. is supported by an Australian National Health and Medical Research Council (NHMRC) fellowship (APP1062470).
Conflict of Interest: Garvan Institute of Medical Research has filed patent applications on aspects of spike-in design.
References
Deveson,I.W. et al. (2016) Representing genetic variation with synthetic DNA standards. Nat. Methods, 13, 784–791.
Hardwick,S.A. et al. (2016) Spliced synthetic genes as internal controls in RNA sequencing experiments. Nat. Methods, 13, 792–798.
Jiang,L. et al. (2011) Synthetic spike-in standards for RNA-seq experiments. Genome Res., 21, 1543–1551.
Munro,S.A. et al. (2014) Assessing technical performance in differential gene expression experiments with external spike-in RNA control ratio mixtures. Nat. Commun., 5, 5125.
Zook,J.M. et al. (2012) Synthetic spike-in standards improve run-specific systematic error analysis for DNA and RNA sequencing. PLoS One, 7, e41356.
[Figure 1 panels: schematic workflow for an NGS experiment using spike-in controls.
Laboratory: spike-in controls are added to the user’s RNA/DNA sample to form a combined sample, followed by library preparation and next-generation sequencing.
Bioinformatics (command line), alignment to the human and synthetic genomes: third-party tools BWA, Bowtie2, TopHat2, STAR etc.; formats .BAM, .SAM; Anaquin tools RnaAlign and VarAlign (alignment performance).
Bioinformatics (command line), normalisation: Anaquin tools RnaSubsample and VarSubsample (calibration of multiple samples); formats .BAM, .SAM.
Bioinformatics (command line), RNA sequencing analysis (gene assembly and expression): third-party tools Cufflinks, StringTie, DESeq2, edgeR, Kallisto etc.; formats .GTF, .TXT; Anaquin tools RnaAssembly (isoform assembly), RnaExpression (gene and isoform expression), RnaFoldChange (differential gene expression).
Bioinformatics (command line), genome sequencing analysis (variant identification): third-party tools GATK, MuTect, VarScan etc.; formats .VCF, .TXT; Anaquin tools VarDiscover (variant identification), VarFrequency (allele frequency).
Bioinformatics (R), RNA sequencing: plotLinear (gene expression), plotLogistic (isoform assembly), plotLOD (fold-change sensitivity), plotROC (fold-change sensitivity).
Bioinformatics (R), genome sequencing: plotROC (diagnostic power), plotLinear (allele frequency), plotLOD (detection sensitivity).
Output results: data visualisation, report (.PDF), descriptive statistics.]
Fig. 1. Schematic overview of next-generation sequencing workflow, with analytical steps using Anaquin and third-party tools indicated