
Genome-guided discovery of natural products through multiplexed low coverage whole-genome sequencing of soil Actinomycetes on Oxford Nanopore Flongle

Abstract
Genome mining is an important tool for discovery of new natural products; however, the number of publicly available genomes for natural product-rich microbes such as Actinomycetes, relative to human pathogens with smaller genomes, is small. To obtain contiguous DNA assemblies and identify large (ca. 10 to greater than 100 Kb) biosynthetic gene clusters (BGCs) with high GC (>70%) and repeat content, it is necessary to use long-read sequencing methods when sequencing Actinomycete genomes. One of the hurdles to long-read sequencing is the higher cost.
In the current study, we assessed Flongle, a recently launched platform by Oxford Nanopore Technologies, as a low-cost DNA sequencing option to obtain contiguous DNA assemblies and analyze BGCs. To make the workflow more cost-effective, we multiplexed up to four samples in a single Flongle sequencing experiment while expecting low sequencing coverage per sample. We hypothesized that contiguous DNA assemblies might enable analysis of BGCs even at low sequencing depth. To assess the value of these assemblies, we collected high-resolution mass-spectrometry data and conducted a multi-omics analysis to connect BGCs to secondary metabolites.
In total, we assembled genomes for 20 distinct strains across seven sequencing experiments. In each experiment, at least 50% of the bases were in reads longer than 10 Kb, which facilitated the assembly of reads into contigs with an average N50 value of 3.5 Mb. The programs antiSMASH and PRISM predicted 629 and 295 BGCs, respectively. We connected BGCs to metabolites for N,N-dimethyl cyclic-ditryptophan, a novel lassopeptide and three known Actinomycete-associated siderophores, namely mirubactin, heterobactin and salinichelin.
Importance
Short-read sequencing of GC-rich genomes such as those of Actinomycetes results in fragmented genome assemblies and truncated biosynthetic gene clusters (often 10 to >100 Kb long), which hinders our ability to understand the biosynthetic potential of a given strain and predict the molecules that can be produced. The current study demonstrates that contiguous DNA assemblies, suitable for analysis of BGCs, can be obtained through low-coverage, multiplexed sequencing on Flongle, which provides a new low-cost workflow ($30-40 per strain) for sequencing Actinomycete strain libraries.

Rahim Rajwani¹, Shannon I. Ohlemacher¹, Gengxiang Zhao¹, Hong-bing Liu¹, Carole A. Bewley¹#
¹ Laboratory of Bioorganic Chemistry, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, United States
Correspondence to Dr. Carole A. Bewley (caroleb@nih.gov)
bioRxiv preprint doi: https://doi.org/10.1101/2021.08.11.456034; this version posted August 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Introduction
Clinical pathogens are increasingly resistant to currently used antimicrobials, causing over 700,000 deaths worldwide each year (1). New antimicrobials are urgently needed to counter antimicrobial resistance and to prevent annual deaths from rising above 10 million by 2050 (1). One prolific source of new antimicrobials is a group of Gram-positive, mycelium-forming bacteria, the Actinomycetes. Several currently used antibiotics, including vancomycin, rifamycin, and streptomycin, were isolated from Actinomycetes, and these bacteria still hold enormous potential for the discovery of new medicines (2).
Genome sequencing is now an important component of natural products research. Whole-genome sequencing (WGS) enables identification of the genes responsible for the biosynthesis of natural products (3). The genes required for the biosynthesis of a natural product are often positionally clustered on the genome and are referred to as biosynthetic gene clusters (BGCs) (4). BGC sequences can be used to predict possible structures of the resulting natural product (5), assess the novelty of the compound (6) and dereplicate compounds from a strain collection (7). Despite the merits offered by WGS, the number of Actinomycete genomes remains limited. Several rare genera are not represented by a complete genome, and the majority of currently available genomes were sequenced using Illumina short-read technology, which results in highly fragmented assemblies. In fragmented genome assemblies, BGCs span multiple contigs and cannot be detected or analyzed by commonly used BGC prediction tools such as antiSMASH (8, 9).
Long-read sequencing technologies (e.g. PacBio or Oxford Nanopore Technologies, ONT) produce the contiguous genome sequences needed to analyze secondary metabolite gene clusters. Notably, PacBio assemblies achieve a consensus accuracy over 99.999%; however, PacBio is generally less accessible due to the upfront cost of sequencing instruments and higher per-sample sequencing costs. By contrast, ONT devices are inexpensive and do not require the upfront purchase of an expensive sequencing instrument. Nevertheless, ONT data result in a lower consensus accuracy (99.9%) and often require polishing with Illumina reads to obtain reference-quality genomes. We hypothesized that while BGC identification requires a contiguous DNA sequence, it might be less affected by the lower consensus accuracy of a Nanopore assembly, since most BGC analysis steps involve inferring homology between distantly related amino acid sequences using profile hidden Markov models. If this is true, contiguous DNA assemblies obtained at ca. 10x coverage using ONT would suffice for BGC analysis, allowing complete genome sequencing at a significantly lower cost. While such ONT-sequenced genomes would still require error correction with Illumina reads to reach reference quality, they could be used on their own to sequence a strain collection, build a catalog and compare BGCs for dereplication or identification of potentially new compounds, which might be particularly useful to natural product research and drug discovery programs.
To assess the feasibility of obtaining contiguous assemblies from ca. 10x sequencing depth, predicting BGCs, and connecting BGCs to metabolites, we conducted the current multi-omics study. We sequenced 20 new soil-derived Actinomycete strains and analyzed their metabolomes using high-resolution mass spectrometry (HRMS). For sequencing, we specifically selected Flongle, a recently launched ONT sequencing device that costs $90 USD and can generate up to 1-2 gigabases of sequence output. With a typical Actinomycete genome being 8-10 Mb, a single Flongle experiment might be sufficient to sequence 3-4 strains at 20-30x coverage. Sequencing workflows based on Flongle could be broadly applicable to small and large studies due to the modular experimental design. In the current study, we obtained 288-797 Mb of data per experiment across the seven successful sequencing experiments (of ten attempted), with read-length N50 values over 10 Kb. Assembly of the reads resulted in contiguous assemblies (average contig N50 value = 3.5 Mb and average number of contigs = 47.3). antiSMASH 5 predicted a total of 629 BGCs from these assemblies. Through a combined analysis with metabolomics data, we were able to connect BGCs to their secondary metabolites. The study demonstrates the utility of low-coverage Nanopore-only assemblies as a rapid and low-cost sequencing option to advance natural product research.
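The multiplexing arithmetic behind these estimates can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the yield is assumed to split evenly across barcodes (real runs rarely do), and the cost figure covers the flow cell alone, not library preparation or other reagents.

```python
def expected_coverage(total_yield_bp: float, n_strains: int, genome_size_bp: float) -> float:
    """Mean per-strain coverage, assuming the yield is split evenly across strains."""
    if n_strains <= 0 or genome_size_bp <= 0:
        raise ValueError("n_strains and genome_size_bp must be positive")
    return total_yield_bp / (n_strains * genome_size_bp)


def flow_cell_cost_per_strain(flow_cell_cost_usd: float, n_strains: int) -> float:
    """Share of the flow-cell price per strain (reagents are not included)."""
    if n_strains <= 0:
        raise ValueError("n_strains must be positive")
    return flow_cell_cost_usd / n_strains


if __name__ == "__main__":
    # 1 Gb of output, four strains, 8 Mb genomes
    print(expected_coverage(1e9, 4, 8e6))
    # $90 flow cell shared by four strains
    print(flow_cell_cost_per_strain(90, 4))
```

With 1 Gb of output, four strains and 8 Mb genomes this gives roughly 31x per strain, and a 10 Mb genome gives 25x, which is in line with the 20-30x range quoted above.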
Results
An in silico analysis to study the effect of sequencing coverage and read length on BGC detection
We first analyzed what level of sequencing coverage would be sufficient for contiguous assemblies and BGC detection using Oxford Nanopore sequencing. For this purpose, three Actinomycete genomes, previously sequenced at high coverage, were downloaded from the European Nucleotide Archive, and their reads were downsampled to 60x, 30x, 15x and 7x coverage (assuming a genome size of 8 Mb) before assembly and BGC detection. While the actual sizes of the three genomes differed (Table S1), assuming a fixed expected genome size of 8 Mb allowed us to mimic a prospective sequencing experiment in which Actinomycete genome sizes would not be known. In the downsampling analysis, assembly size and the number of predicted genes nearly plateaued at ca. 15x coverage. Similarly, a sharp decline in the number of contigs and the number of mismatches per 100 Kb was observed at ca. 15-20x coverage (Figure 1). At approximately the same coverage of 15-20x, 72-96% of BGCs were detected by antiSMASH, and a further increase in coverage led to detection of only 1-6 additional BGCs (Figure 1). Moreover, most of the BGCs were complete, i.e. not located at the edge of a contig. Relative to antiSMASH, PRISM predicted fewer BGCs. This could be because antiSMASH was run in ‘relaxed’ mode in this study, whereas PRISM does not have this option. Nevertheless, the trend relative to coverage was similar between antiSMASH and PRISM.
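The downsampling step can be illustrated with a small sketch. The text does not say which tool was used, so the random subsampling below is an assumption, and `downsample_reads` is a hypothetical helper: reads are drawn in random order until their combined length reaches the target coverage multiplied by the assumed 8 Mb genome size.

```python
import random


def downsample_reads(read_lengths, target_coverage, genome_size_bp=8_000_000, seed=0):
    """Pick reads at random until their total length reaches the target depth.

    Returns (indices of chosen reads, total bases chosen). If the input holds
    fewer bases than requested, every read is returned.
    """
    target_bases = target_coverage * genome_size_bp
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)  # fixed seed keeps the subset reproducible

    chosen, total = [], 0
    for i in order:
        if total >= target_bases:
            break
        chosen.append(i)
        total += read_lengths[i]
    return chosen, total


if __name__ == "__main__":
    lengths = [10_000] * 10_000  # toy data set: 100 Mb in 10 Kb reads
    for depth in (7, 15):
        idx, bases = downsample_reads(lengths, depth)
        print(depth, len(idx), bases)
```

Using a fixed genome size rather than the true one means the realized coverage differs between strains, which matches the prospective setting described above where the genome size is not known in advance.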
Assembly contiguity, and therefore BGC detection, in a Nanopore sequencing experiment is also related to read length. In another computational experiment, we evaluated whether longer reads might enable a more contiguous DNA assembly and better BGC detection at fixed coverage. For this purpose, simulated Nanopore reads of average length 500, 1000, 2000 and 4000 bp were generated at 10x coverage of a Streptomyces genome (GB4-14) using BadRead. The resulting reads were assembled and analyzed for assembly contiguity and BGC detection. A ca. 2-fold increase in average read length was associated with a 2-fold reduction in the number of contigs (Figure S3). Improved assembly contiguity also led to a reduction in the number of BGCs on contig edges (incomplete) and an increased number of complete BGCs, with little to no change in sequencing coverage (Figure S4).
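The distinction between complete BGCs and those on a contig edge can be made concrete with a simplified geometric check. antiSMASH applies its own criterion, so the code below is a stand-in and not its implementation; the `Cluster` type, the coordinates and the `margin` parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    contig_length: int  # length of the contig carrying the cluster (bp)
    start: int          # 0-based start of the predicted cluster
    end: int            # 0-based, exclusive end of the predicted cluster


def is_on_contig_edge(cluster: Cluster, margin: int = 0) -> bool:
    """True if the cluster touches (or lies within `margin` bp of) a contig end."""
    return cluster.start <= margin or cluster.end >= cluster.contig_length - margin


def count_complete(clusters, margin: int = 0) -> int:
    """Number of clusters that do not sit on a contig edge."""
    return sum(not is_on_contig_edge(c, margin) for c in clusters)


if __name__ == "__main__":
    clusters = [
        Cluster(100_000, 0, 30_000),        # starts at the contig start: incomplete
        Cluster(100_000, 20_000, 60_000),   # interior: complete
        Cluster(100_000, 70_000, 100_000),  # runs to the contig end: incomplete
    ]
    print(count_complete(clusters))
```

More contiguous assemblies have fewer contig ends, so fewer clusters can be cut off in this way, which is the effect described in the paragraph above.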
Overall, these computational experiments suggested that contiguous DNA assemblies can be obtained and complete BGCs detected at low sequencing coverage using long reads from Oxford Nanopore Technologies; this should allow for a dramatic reduction in cost per genome through multiplexing. The computational experiments were followed up with prospective sequencing of Actinomycete genomes using Flongle and with more detailed analyses of BGCs, described in the following sections.
Nanopore sequencing, genome assembly and quality assessment
A total of ten sequencing experiments were conducted, each attempting to sequence four Actinomycete strains (Figure 2). Impurities in the starting genomic DNA (as measured by the ratio of UV absorbance at 260 and 280 nm) and low pore occupancy (caused by insufficient loading of the library or inhibition of adapter ligation) resulted in three unsuccessful experiments with a total output of <100 megabases (Mb) per experiment. The remaining seven experiments yielded 288-797 Mb over 18-24 hours. The longest read for each sample was over 80 Kb.
Across the experiments, we tried different buffers for bead-based purification to apply size selection and increase the read length N50 values relative to standard protocols (Table S2) (Figure 2). One of our initial experiments, using 0.5x of the standard buffer concentration, was not successful, resulting in read N50 values of 1-1.6 Kb for three out of five samples sequenced in the experiment. In two subsequent successful experiments (AET670 and AFK704), we utilized 0.15x of a modified buffer containing 0.5 M MgCl2 + 5% PEG in TE buffer (10 mM Tris-Cl pH 8.0, 1 mM EDTA) for bead-based purification after barcode ligation, as described previously (10). Read length N50 values in these experiments were 11.6-15.1 Kb (Figure 2). In three experiments (AEZ324, AFA498 and AFA876), the concentration used for the modified buffer-based size selection was reduced to 0.1x, which led to a further increase in read length N50s (10.5-23.3 Kb) accompanied by increased sample loss. Applying size selection after barcode ligation ensures approximately equal fragment lengths for pooling of samples and adapter ligation. However, barcode ligation to longer fragments could be less efficient if shorter fragments are present in the mixture. We therefore tested buffer-based size selection (0.1x beads in modified buffer) before barcode ligation in later experiments (flow cell IDs AFK426 and AFK406) (Table S2). A more consistent output was observed, possibly due to more efficient barcode/adapter ligation to longer DNA fragments.
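Because the read length N50 is the metric used to compare the size-selection conditions above, a minimal sketch of how it is computed may help. The function below uses the standard definition (the length L such that reads of length L or longer contain at least half of all sequenced bases); the function name and the toy read lengths are illustrative only.

```python
def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer form of running >= total / 2
            return length


if __name__ == "__main__":
    # Many short fragments barely move the N50 when one long read dominates
    print(n50([1_000, 1_000, 1_000, 1_000, 8_000]))
    # Evenly sized reads give an N50 equal to the common length
    print(n50([10_000, 10_000, 10_000]))
```

The same function applied to contig lengths gives the contig N50 used to describe assembly contiguity.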
Across the seven successful runs, 3,814,434,062 bases in 751,459 reads were generated. Upon demultiplexing, the median number of bases per sample was 77.5 Mb (theoretical coverage of 9.5x with an expected 8 Mb genome size). Three strains were sequenced at <2.5x theoretical coverage (<20 Mb per strain) and were excluded from further analysis. Subsequently, 25 samples (20 distinct isolates) were de novo assembled with Canu and polished with Racon and medaka (Figure 3). The median length of the obtained assemblies was 8.5 Mb (average: 7.9 Mb, maximum: 9.4 Mb), typical of Actinomycete genome sizes. The only exception was a 3 Mb assembly for GB8-002, which was also sequenced at the lowest coverage (4.0x) (Figure 3).
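The exclusion rule described here (theoretical coverage below 2.5x, i.e. under 20 Mb for an expected 8 Mb genome) could be applied to demultiplexed yields roughly as follows. The sample names and base counts are invented for illustration, and `split_by_coverage` is a hypothetical helper, not part of any tool named in the text.

```python
EXPECTED_GENOME_SIZE_BP = 8_000_000  # fixed assumption used for all strains
MIN_COVERAGE = 2.5                   # samples below this depth are set aside


def theoretical_coverage(bases: int, genome_size_bp: int = EXPECTED_GENOME_SIZE_BP) -> float:
    """Sequenced bases divided by the expected genome size."""
    return bases / genome_size_bp


def split_by_coverage(bases_per_sample: dict, min_coverage: float = MIN_COVERAGE):
    """Return (kept, excluded) dicts mapping sample name to theoretical coverage."""
    kept, excluded = {}, {}
    for name, bases in bases_per_sample.items():
        cov = theoretical_coverage(bases)
        (kept if cov >= min_coverage else excluded)[name] = cov
    return kept, excluded


if __name__ == "__main__":
    # Hypothetical per-barcode yields after demultiplexing
    yields = {"sample_A": 80_000_000, "sample_B": 15_000_000, "sample_C": 20_000_000}
    kept, excluded = split_by_coverage(yields)
    print("assemble:", kept)
    print("exclude:", excluded)
```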
We assessed the accuracy and quality of these low-coverage genomes by comparing them with genomes sequenced at high coverage on MinION or PacBio. In particular, two strains, GA3-008 and GB4-14, were previously sequenced by our lab at 10-fold higher coverage using MinION and PacBio, respectively (Table 1). Despite the lower sequence coverage, genome contiguity was only slightly affected on Flongle, and all were assembled into <10 contigs. The size of the assembly differed by 6.1 Kb (GA3-008) and 19.6 Kb (GB4-14) due to insertion/deletion (indel) errors. Despite many mismatches and indel errors, 87.5-100% of the BGCs detected in the MinION or PacBio assemblies were also detected in these Flongle assemblies by antiSMASH.
Taxonomy and BGCs
antiSMASH predicted 629 BGCs of 29 different types across all assembled genomes from this study (Figure S5). Seventy-nine percent (497/629) of these BGCs were complete (i.e. not located on a contig