Dual RNA-seq to Elucidate the Plant–Pathogen Duel
Sanushka Naidoo¹*, Erik Andrei Visser¹, Lizahn Zwart¹, Yves du Toit¹, Vijai Bhadauria² and Louise Simone Shuey¹*
¹Department of Genetics, Forestry and Agricultural Biotechnology Institute, Genomics Research Institute, University of Pretoria, Pretoria, South Africa.
²Crop Development Centre and Department of Plant Sciences, University of Saskatchewan, Saskatoon, SK, Canada.
*Correspondence: Sanushka.Naidoo@fabi.up.ac.za and Louise.Shuey@fabi.up.ac.za
https://doi.org/10.21775/cimb.027.127
Abstract
RNA-sequencing technology has been widely adopted to investigate host responses during infection with pathogens. Dual RNA-sequencing (RNA-seq) allows the simultaneous capture of pathogen-specific transcripts during infection, providing a more complete view of the interaction. In this review, we focus on the design of dual RNA-seq experiments and the application of downstream data analysis to gain biological insight into both sides of the interaction. Recent literature in this area demonstrates the power of the dual RNA-seq approach and shows that it is not limited to model systems where genomic resources are available. Sequencing costs continue to decrease and single-cell transcriptomics is becoming more feasible. In combination with proteomics and metabolomics studies, these technological advances are likely to contribute to our understanding of the temporal and spatial aspects of dynamic plant–pathogen interactions.
A dual approach in planta
The interaction between plants and pathogens is an active and dynamic process that can be likened to a duel. Plants have complex defence mechanisms that can be rendered ineffective when pathogens interfere with one of the various processes required for host defence. These processes include penetration resistance, recognition by Pattern Recognition Receptors (PRRs), phytohormone signalling pathways, secretory pathways, secondary metabolite production, and plant cell death (Dou and Zhou, 2012). Until recently, transcriptomic approaches have been applied in the host and pathogen separately to obtain the gene expression profile of each organism and gain insight into infection biology or host defence mechanisms.
RNA sequencing (RNA-seq) is a powerful technology that does not rely on any prior knowledge of transcripts and can generate vast quantities of data at much lower cost than older techniques such as microarrays (Pareek et al., 2011; Wilhelm and Landry, 2009). An advantage of RNA-seq in the field of plant–pathogen interactions is that both plant and pathogen transcripts can be detected simultaneously and accurately in the same sample. This tactic, known as dual RNA-seq, in planta RNA-seq, simultaneous RNA-seq, or comparative RNA-seq, is a relatively new technique in both the plant and medical fields. In plants, it allows for the study of plant–pathogen interactions in herbaceous crops (Chen et al., 2013; Kunjeti et al., 2012; Lowe et al., 2014) as well as trees (Hayden et al., 2014; Liang et al., 2014; Teixeira et al., 2014). This review outlines technical considerations for dual RNA-seq experiments, summarizes recent insights drawn from such approaches in plant–pathogen interactions, and provides an overview of the next generation of dual approaches. Since this technique is most useful for studying interactions with pathogens that have complex prokaryotic and eukaryotic genomes, viral pathogens have not been included in this review.
Curr. Issues Mol. Biol. Vol. 27
It’s all in the design
Experimental design considerations for a dual RNA-seq experiment can be divided into three broad categories: sample generation, data generation and data analysis. An overview of the process can be found in Fig. 8.1.
Figure 8.1 Flow chart of a dual RNA-seq experiment with example software programs. (A) Experimental design considerations can include comparisons between resistant and susceptible interactions, normalized to a mock-inoculated control. (B) Different library preparation options include enrichments for mRNA, small RNAs, stranded RNA and total RNA. (C) The sequencing platform can vary based on availability and the aim of the study. Paired-end sequencing on the Illumina platform is a common approach for RNA-seq. Deep sequencing is required for dual RNA-seq approaches. Read length can vary depending on the application and sequencing platform used. (D) Downstream read quality control can be implemented. Filtering for a minimum Phred quality score of Q30 is generally optimal, but the threshold is data dependent. (E) Dual RNA-seq is performed by mapping to both the host and pathogen reference genome sequences, or to the host first with the remaining reads mapped to the pathogen reference. Endophyte contamination can be removed by mapping to common contaminant sequences obtained from databases such as RefSeq. (F) If a reference is not available, a de novo transcriptome can be assembled from the reads. (G) The mapping approach will differ based on the type of reference used, and different methods can be used when mapping to the host or the pathogen. (H) Other programs commonly used for expression quantification include HTSeq, featureCounts (Liao et al., 2014) and limma (Ritchie et al., 2015). (I) Differential Gene Expression (DGE) analysis can be performed using a number of methods. The two examples listed here can be used for both transcriptome and genome-based DGE analysis. (J) Genes identified as differentially expressed can be used in programs and databases such as BiNGO (Maere et al., 2005), MapMan (Thimm et al., 2004) and KEGG (Ogata et al., 1999) to predict biological significance (Maere et al., 2005).
Sample generation
When considering experimental design for sample generation, the main factors include trial design, sample harvesting approach and sample handling (reviewed in Yang and Wei, 2015).
Two important trial design and sample harvesting considerations for dual RNA-seq experiments are the predicted gene number in the pathogen and host genomes, and the relative amounts of pathogen and host cells within a given sample (Westermann et al., 2012). Both of these factors influence the amount of pathogen RNA relative to host RNA within a sample. A lower ratio of pathogen to host RNA requires greater sequencing depth to capture the full extent of biological variation within the pathogen.
A dual RNA-seq experiment considering an interaction between a eukaryotic host and a prokaryotic pathogen requires approximately 10 to 20 times as many reads as would usually be required. This is partly due to the smaller amount of cellular RNA within prokaryotic cells relative to eukaryotic cells (Westermann et al., 2012). While the relative amounts of cellular RNA between host and pathogen are more similar in eukaryote–eukaryote interactions, the higher read coverage is still necessary due to the lower quantity of pathogen versus host cells, which results in less pathogen RNA per sample.
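The scaling argument above can be made concrete with a small calculation. The following is a minimal sketch, not a published method: it assumes that reads are sampled in proportion to each organism's share of the sequenced RNA, and the function name `total_reads_needed` and all numbers are hypothetical.

```python
import math


def total_reads_needed(target_pathogen_reads: int, pathogen_fraction: float) -> int:
    """Total library size so that, on average, `target_pathogen_reads`
    reads originate from the pathogen.

    Assumption: reads are drawn in proportion to the pathogen's share of
    the sequenced RNA (`pathogen_fraction`, between 0 and 1).
    """
    if not 0.0 < pathogen_fraction <= 1.0:
        raise ValueError("pathogen_fraction must be in (0, 1]")
    return math.ceil(target_pathogen_reads / pathogen_fraction)


# Hypothetical scenario: 2 million pathogen reads are wanted per sample.
for fraction in (0.25, 0.05, 0.01):
    needed = total_reads_needed(2_000_000, fraction)
    print(f"pathogen share {fraction:.2f}: sequence about {needed:,} reads")
```

Under these assumptions, a pathogen share of 5 to 10% of the library corresponds to the 10- to 20-fold increase in reads mentioned in the text.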
An important trial design consideration specific to dual RNA-seq experiments is the inclusion of a control for pathogen gene expression. This can be done by comparing in planta expression of pathogen genes to in planta gene expression of a non-pathogenic strain and/or pathogen gene expression in an agar culture or spore suspension (Kawahara et al., 2012). Synthetic RNA spike-ins can also be included to quantify both pathogen and host RNA (Box 8.1).
Data generation
The main experimental design considerations for data generation include the level of sample replication, library construction and sequencing (Liu et al., 2014).
Sample replication
One of the first factors to consider in experimental design is the level of sample replication (Auer and Doerge, 2010). Sample replication is divided into technical replication, which is defined as performing the same analysis multiple times on the same sample, and biological replication, a study-dependent term that can be loosely defined as harvesting the same type of sample from the same type of organism under the same conditions.
Technical variation arises when errors occur in the experimental procedure and can be accounted for through technical replication. Illumina sequencing produces negligible technical variability, removing the need for technical replication in RNA-seq experiments (Marioni et al., 2008). However, when coverage is low for certain transcripts, technical variation can still arise (McIntyre et al., 2011). Thus, technical replication should be considered for dual RNA-seq experiments where there is low representation of pathogen RNA within a sample, resulting in low coverage of pathogen transcripts.
Box 8.1 Total RNA quantification
It is not always possible to accurately predict the amounts of host and pathogen RNA that will be present in a sample. While it is possible to measure the amount of host and pathogen DNA in a sample using qRT-PCR, this is not always an accurate reflection of total host and pathogen RNA. This problem can be circumvented by the addition of RNA spike-ins to samples. An RNA spike-in for RNA-seq is a mixture of synthesized RNA transcripts of known sequence, concentration and abundance. While inclusion of an RNA spike-in could increase the cost of sequencing due to increased coverage requirements, it can be used to measure sensitivity and accuracy of sequencing as well as to detect biases that can occur during RNA-seq (Jiang et al., 2011). Furthermore, standard curves can be generated from RNA spike-ins. This allows for more accurate quantification of transcript abundance (Jiang et al., 2011). In dual RNA-seq experiments, it is possible to use these standard curves to estimate host and pathogen RNA levels within a sample. However, it is important to ensure that none of the spike-in sequences are present in the genome of either host or pathogen, as this could preclude accurate quantification of genes containing similar sequences and the use of those spike-in sequences (Jiang et al., 2011).
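The standard-curve idea in Box 8.1 can be illustrated as follows. This is a minimal sketch under stated assumptions rather than the procedure of Jiang et al. (2011): it assumes a linear relationship between log10 spike-in amount and log10 length-normalized read count, fits it by ordinary least squares, and inverts the fit for a transcript of unknown abundance. The function names and all values are hypothetical.

```python
import math


def fit_standard_curve(known_amounts, observed_counts):
    """Least-squares fit of log10(count) = slope * log10(amount) + intercept.

    `known_amounts` are the spiked-in quantities (arbitrary units) and
    `observed_counts` the length-normalized read counts of the same spike-ins.
    """
    xs = [math.log10(a) for a in known_amounts]
    ys = [math.log10(c) for c in observed_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_amount(count, slope, intercept):
    """Invert the fitted curve to estimate the amount behind a read count."""
    return 10 ** ((math.log10(count) - intercept) / slope)


# Hypothetical spike-in series: a tenfold dilution ladder and its read counts.
amounts = [1, 10, 100, 1000]
counts = [20, 200, 2000, 20000]
slope, intercept = fit_standard_curve(amounts, counts)

# Estimated amount for a host or pathogen transcript with 500 normalized reads.
print(estimate_amount(500, slope, intercept))
```

In a real experiment the fit would be checked for linearity across the dynamic range before any host or pathogen estimate is read off it.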
While technical replication can often be excluded due to the reliability of the technology, biological replication remains crucial to all RNA-seq experiments. Besides accounting for biological variation (Hansen et al., 2011; Nettleton, 2014), biological replication significantly affects the power and accuracy of differential expression analyses. Liu et al. (2014) showed that increasing the number of biological replicates sequenced increased the number of accurately identified differentially expressed genes, whereas increased read depth produced diminishing returns for both statistical power and the precision with which differential expression is detected. This is especially important in dual RNA-seq experiments where biological variation is introduced from both pathogen and host.
Library construction and sequencing
The main factors to consider during library construction and sequencing are depletion methods, strandedness, insert size, read length, and read depth. The use of strand-specific rather than non-strand-specific libraries (reviewed in Levin et al., 2010) allows the accurate detection of anti-sense transcription and can allow accurate expression quantification of overlapping transcripts. Thus, strand-specific sequencing in dual RNA-seq experiments could enable detection of evidence suggesting host–pathogen interaction through anti-sense transcription.
The choice of insert size is dependent on the complexity of the transcriptome and the target RNA species (reviewed in Head et al., 2014). Insert size selection can limit which RNA species can be analysed, because inclusion of a size selection step during library preparation results in loss of transcripts shorter than the selected insert size. Insert size selection also imposes an upper limit on read length, since reads longer than the insert size will sequence into adapters, providing no new information.
Apart from insert size, the choice of read length is dependent mainly on the objectives of the study and the quality of the reference sequence used for mapping. If a high-quality and well-annotated reference sequence is available, increasing read length above 50 bp is unnecessary for accurate detection of differential expression (Chhangawala et al., 2015). Similarly, sequencing of paired-end instead of single-end reads does not significantly affect detection of differential expression in these cases (Chhangawala et al., 2015). Conversely, when studying organisms with less well-defined reference sequences, sequencing of longer paired-end reads increases the accuracy of splice junction detection (Chhangawala et al., 2015). When no reference sequence is available, it is often assumed that longer reads equate to increased accuracy for de novo assembly. Similar to the detection of differential expression, however, there seems to be a species-specific threshold beyond which increasing read length becomes redundant (Chang et al., 2014).
In cases where a high-quality reference is available, less coverage is required for accurate transcript identification and quantification, compared to cases where a reference is missing. This is because gaps in an assembly arising from low coverage can be filled using the underlying reference sequence. For studies relying on de novo assembly, a predicted minimum of 30× total reference coverage is required (Martin and Wang, 2011), while genome-guided assembly can be accomplished with coverage below 10× (Denoeud et al., 2008).
For an RNA-seq experiment to be representative, it is important to make sure that the number of reads is sufficient to account for the least represented RNA species. This is also referred to as sufficient sequencing depth. To obtain adequate depth for a dual RNA-seq experiment, enough reads need to be sequenced to have at least 1× coverage of the least represented pathogen transcript in the sample with the lowest ratio of pathogen to host RNA. However, it is almost impossible to know this information in advance when performing de novo RNA-seq experiments.
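Although the required depth is hard to predict beforehand, the 1× criterion can be checked after a pilot run. The sketch below assumes the common approximation that mean coverage equals mapped reads multiplied by read length and divided by transcript length; the transcript names, lengths and counts are hypothetical.

```python
def mean_coverage(mapped_reads: int, read_length: int, transcript_length: int) -> float:
    """Approximate mean per-base coverage of one transcript."""
    return mapped_reads * read_length / transcript_length


def transcripts_below_coverage(transcripts, read_length, min_coverage=1.0):
    """Return the names of transcripts whose mean coverage is below `min_coverage`.

    `transcripts` maps a name to a (length_bp, mapped_reads) tuple.
    """
    return sorted(
        name
        for name, (length_bp, mapped_reads) in transcripts.items()
        if mean_coverage(mapped_reads, read_length, length_bp) < min_coverage
    )


# Hypothetical pathogen transcripts from the sample with the least pathogen RNA.
pathogen_transcripts = {
    "effector_A": (1500, 40),
    "effector_B": (900, 5),
    "housekeeping": (2000, 800),
}
print(transcripts_below_coverage(pathogen_transcripts, read_length=100))
```

Transcripts reported by such a check indicate that deeper sequencing, or enrichment of pathogen RNA, may be needed before they can be quantified reliably.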
Techniques to deplete or enrich certain RNA species, such as RNA fractionation and poly(A) selection, can enhance detection of transcripts with low expression in eukaryotes (Sims et al., 2014). Depletion of the rRNA fraction can further reduce the required sequencing depth of an experiment and, unlike poly(A) selection, allow for detection of non-poly(A) transcripts. Although depletion-based methods allow for selection of non-poly(A) RNA species, these methods can bias quantification of abundant transcripts and decrease exon coverage and power to detect splice junctions due to the presence of sequenced introns from pre-mRNA in eukaryotes (Martin and Wang, 2011; Sims et al., 2014).
Data analysis
As with sample and data generation, data analysis considerations are dependent on the underlying biological questions. Data analysis for the majority of RNA-seq experiments follows three sequential steps: (1) quality control, (2) mapping, expression quantification and differential expression (DE) analysis, and (3) downstream analysis. Due to the variety of tools and platforms that can be used for RNA-seq data analysis (Grant et al., 2011), programs typically used for RNA-seq data analysis may be created for specific analysis types and tested within a specific experimental context. Thus, it is often advisable to repeat an analysis using different programs and compare the outputs.
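One simple way to compare the outputs of two programs is to look at the overlap between their lists of differentially expressed genes. The following sketch is an illustration only; the Jaccard index is an assumed choice of summary statistic, and the gene identifiers are hypothetical.

```python
def compare_de_calls(calls_a, calls_b):
    """Summarize agreement between two lists of differentially expressed genes."""
    a, b = set(calls_a), set(calls_b)
    shared = a & b
    union = a | b
    # Jaccard index: shared calls divided by all calls made by either program.
    jaccard = len(shared) / len(union) if union else 1.0
    return {
        "shared": sorted(shared),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "jaccard": jaccard,
    }


# Hypothetical DE gene lists produced by two different programs.
program_a = ["gene1", "gene2", "gene3", "gene4"]
program_b = ["gene2", "gene3", "gene5"]
print(compare_de_calls(program_a, program_b))
```

Genes called by only one program are natural candidates for closer inspection before they are carried into downstream analysis.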
Quality control
Quality control for dual RNA-seq studies is similar to that used for traditional RNA-seq studies. However, contaminant filtering becomes more complicated as reads from both host and pathogen need to be retained. While reads originating from the host and pathogen can be separated by mapping to the host and pathogen reference sequences (Schulze et al., 2015; Westermann et al., 2012), contamination of various forms should be considered in order to improve the accuracy and efficiency with which genes and transcripts are mapped and quantified. Contamination may occur in two main forms: non-mRNA species (which constitute the majority of the total RNA extracted) and reads representing mRNA extracted from organisms (saprophytes and endophytes) other than the organisms of interest. These forms of contamination may skew the quantification of genes and transcripts when assembling and mapping reads to the reference. Westermann et al. (2012) provide insight into dealing with contaminating RNA, which is species and study dependent.
Contamination in the form of RNA extracted from endophytic or saprophytic organisms is especially important in plant–pathogen interaction studies. Saprophytes may be present at the sites of wounding due to the degradation of tissue that occurs, while endophytes colonize areas below the surface of the plant tissue without causing symptoms. Thus RNA from these types of organisms can be present in RNA-seq libraries. While surface sterilization could be used to decrease the presence of saprophytes, the process is time consuming and may result in damage to host RNA. Surface sterilization could also result in decreased pathogen representation, which is counterproductive for a dual RNA-seq experiment. Therefore, removal of these contaminating sequences requires bioinformatics intervention. This can be accomplished through stringent mapping of data to a database of common contaminant cDNA sequences constructed from databases such as RefSeq, UniRef100 and GenBank (Ikeue et al., 2015). In cases where reference genomes are available for known endophytes and saprophytes, stringent alignment to these references could also be used to filter reads (Zuluaga et al., 2015).
Mapping, expression quantification and differential expression analysis
Mapping is the reconstruction of the transcriptome through alignment of reads to a reference sequence. In dual RNA-seq experiments, reads are mapped to the host reference sequence and the unaligned sequences are retained and mapped to the pathogen reference sequence (Teixeira et al., 2014). A common program used for read alignment to a reference is the short read aligner Bowtie, which is part of the TopHat package of the Tuxedo pipeline (Trapnell et al., 2012). Box 8.2 describes mapping and splice site determination. Bowtie allows the user to set the number of mismatches between the query and reference sequence, effectively setting a stringency threshold for the alignment (Langmead and Salzberg, 2012). This threshold in turn determines how readily reads are assigned to the host or pathogen reference.
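The host-first assignment logic and the effect of the mismatch threshold can be shown with a toy example. This sketch is not how Bowtie or TopHat work internally: it uses a naive, ungapped scan with Hamming distance on short made-up sequences, purely to illustrate how the allowed number of mismatches changes which reads end up in the host, pathogen or unassigned sets. All names and sequences are hypothetical.

```python
def aligns(read: str, reference: str, max_mismatches: int) -> bool:
    """True if `read` matches any ungapped window of `reference`
    with at most `max_mismatches` mismatching bases."""
    n = len(read)
    for start in range(len(reference) - n + 1):
        window = reference[start:start + n]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            return True
    return False


def assign_reads(reads, host_ref, pathogen_ref, max_mismatches=0):
    """Map to the host first; only reads that fail go on to the pathogen."""
    assigned = {"host": [], "pathogen": [], "unassigned": []}
    for read in reads:
        if aligns(read, host_ref, max_mismatches):
            assigned["host"].append(read)
        elif aligns(read, pathogen_ref, max_mismatches):
            assigned["pathogen"].append(read)
        else:
            assigned["unassigned"].append(read)
    return assigned


# Made-up reference fragments and reads, for illustration only.
host_ref = "ACGTACGTTAGCCGATAGGCTA"
pathogen_ref = "TTGACCGGTTAACCGGATCCAA"
reads = ["TAGCCGAT", "CCGGTTAA", "CCGGTTAT", "GGGGGGGG"]

strict = assign_reads(reads, host_ref, pathogen_ref, max_mismatches=0)
relaxed = assign_reads(reads, host_ref, pathogen_ref, max_mismatches=1)
print(strict)
print(relaxed)
```

In this toy data set the read "CCGGTTAT" is left unassigned under the strict setting and is assigned to the pathogen once one mismatch is allowed, which mirrors the trade-off between stringency and sensitivity described above.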
Once the reads have been assembled and filtered into host and pathogen libraries, transcript abundance quantification and differential expression analysis can be performed (Boxes 8.3 and 8.4, respectively). Expression levels are quantified by counting the number of reads mapped to each gene/transcript, normalized across the length of the gene/transcript to account for bias across abundant gene regions, relative to the number of reads in the original library. Programs like Cufflinks (Trapnell et al., 2012) and RSEM (Li and Dewey, 2011) can be used to accurately quantify the relative abundance of genes and transcripts. Differential expression analysis is commonly performed using packages such as Cuffdiff, DESeq and edgeR (Anders and Huber, 2010; Robinson et al., 2010; Trapnell et al., 2013).
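The length and library-size normalization described above corresponds to the widely used reads per kilobase per million mapped reads (RPKM) measure. The sketch below is a simplified illustration, not the model used by Cufflinks or RSEM. It assumes that each organism's total of mapped reads is used as the library size, which is one possible choice for dual RNA-seq data; the gene names and counts are hypothetical.

```python
def rpkm(read_count: int, length_bp: int, library_size: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    if length_bp <= 0 or library_size <= 0:
        raise ValueError("length_bp and library_size must be positive")
    return read_count * 1e9 / (length_bp * library_size)


def rpkm_table(counts):
    """Compute RPKM for every gene in one organism's library.

    `counts` maps a gene name to a (read_count, length_bp) tuple; the
    library size is taken as the sum of read counts in `counts`
    (an assumption made for this sketch).
    """
    library_size = sum(reads for reads, _ in counts.values())
    return {
        gene: rpkm(reads, length_bp, library_size)
        for gene, (reads, length_bp) in counts.items()
    }


# Hypothetical host and pathogen count tables, normalized separately.
host_counts = {"host_gene_1": (6000, 2000), "host_gene_2": (4000, 1000)}
pathogen_counts = {"path_gene_1": (50, 1000), "path_gene_2": (150, 3000)}

print(rpkm_table(host_counts))
print(rpkm_table(pathogen_counts))
```

Normalizing host and pathogen separately keeps the much larger host library from dominating the pathogen values, but it also means that values are only comparable within an organism.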