What was the effect of size selection on the genomes of Actinomycetes?

Bead-based size selection further depleted shorter DNA fragments and enriched the sequencing reads for longer sequences (10 Kb+).

How many genomes could be sequenced on a single Flongle device?

Their sequencing and assembly results showed that up to four near-complete genomes of Actinomycete strains could be sequenced on a single Flongle device.

How many BGCs were predicted using antiSMASH?

The antiSMASH-predicted BGC sequences were extracted from each strain's genome assemblies and aligned in all possible strain pair combinations using minimap2, allowing for 5% sequence divergence (43).
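The "all possible strain pair combinations" step can be illustrated with a short sketch. The strain labels, file names, and the `asm20` preset below are assumptions made for illustration; the text states only that minimap2 was used and that about 5% sequence divergence was allowed. The commands are built but not run.

```python
from itertools import combinations


def pairwise_alignment_jobs(bgc_fastas, preset="asm20"):
    """Build one minimap2 command (as an argument list) per unordered strain pair.

    bgc_fastas maps a strain label to the FASTA file holding its BGC sequences.
    The preset is an assumption, not the authors' documented setting.
    """
    jobs = []
    for target, query in combinations(sorted(bgc_fastas), 2):
        jobs.append(["minimap2", "-x", preset, bgc_fastas[target], bgc_fastas[query]])
    return jobs


# Hypothetical inputs: 20 strains, one BGC FASTA file each.
bgc_fastas = {f"strain_{i:02d}": f"strain_{i:02d}_bgcs.fasta" for i in range(1, 21)}
jobs = pairwise_alignment_jobs(bgc_fastas)

print(len(jobs))          # number of unordered pairs
print(" ".join(jobs[0]))  # first command line
```

With 20 strains this yields one job per unordered pair, so each pair is aligned once rather than in both directions.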

What was the metabolite predicted as a possible metabolite?

Based on sequence information, cyclic-di-tryptophan c(WW) was predicted as a possible metabolite that can be methylated on nitrogen by the N-methyltransferase, as reported previously for Actinosynnema mirum (13).

How many BGCs were detected in the Actinomycete genome?

For this purpose, three Actinomycete genomes, previously sequenced at high coverage, were downloaded from the European Nucleotide Archive and their reads were downsampled to 60x, 30x, 15x and 7x coverage (assuming a genome size of 8 Mb) before assembling and detecting BGCs.

What was the interesting observation on BGC analysis?

An interesting observation on BGC analysis was that active site specificity for various BGC classes (NRPS, PKS and CDPS) in the Flongle assemblies was correctly predicted.

How many starting pores were found in Flongle flow cells?

For instance, Flongle flow cells were less consistent in the number of starting pores (<60 out of an expected 126 in most cases), which affected total sequencing output and resulted in lower-than-desired coverage for a few samples.

What is the way to obtain contiguous and accurate genome assemblies?

A common strategy to obtain contiguous and accurate genome assemblies is through polishing contiguous nanopore assemblies with Illumina reads.

What was the metabolite structure predicted from the BGCs?

More complex structures that better resemble final products of PKS pathways were predicted from a more contiguous Flongle (Figure 3) or PacBio assembly (Table 1) of the same strain but were not detected in the metabolite extracts.

How many RiPPs are predicted using metaminer?

Given a list of short peptides, metaminer constructs possible RiPP products based on knowledge of post-translational modifications within RiPPs.

How many BGCs were detected in the MiBiG database?

encoding known antibiotic classes with no metabolites detected - glycopeptide, aminoglycoside and aminocoumarin

In addition to the above-described BGCs whose metabolites were expressed and detected by HRMS/MS, many other BGCs with sequence homology to known antibiotic BGCs were also identified in the sequencing data.

What is the m/z of the precursor peptide?

The precursor m/z (1041.504 [M+2H]2+ and 694.672 [M+3H]3+) of the matched spectra was consistent with the predicted core peptide after loss of one water molecule (-18.010).
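As a quick consistency check on the two precursor values, the neutral monoisotopic mass can be recovered from each m/z by treating them as the doubly and triply protonated ions of the same molecule. This is a minimal sketch; the function name is hypothetical and the proton mass is the standard value, not a figure from the paper.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton (standard value)


def neutral_mass(mz, charge):
    """Neutral monoisotopic mass of an ion observed as [M + charge*H]^(charge+)."""
    return mz * charge - charge * PROTON_MASS


mass_from_2plus = neutral_mass(1041.504, 2)
mass_from_3plus = neutral_mass(694.672, 3)

print(round(mass_from_2plus, 3), round(mass_from_3plus, 3))
print(abs(mass_from_2plus - mass_from_3plus) < 0.01)  # do the two charge states agree?
```

Both charge states point to a neutral mass of about 2080.99 Da, which is what one would expect if the two spectra come from the same peptide.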

(Open Access) Genome-guided discovery of natural products through multiplexed low coverage whole-genome sequencing of soil Actinomycetes on Oxford Nanopore Flongle (2021) | Rahim Rajwani

Q: How did the authors analyze the coverage of the genomes?

The authors first analyzed what level of sequencing coverage would be sufficient for contiguous assemblies and BGC detection using Oxford Nanopore sequencing.

Genome-guided discovery of natural products through multiplexed low coverage whole-genome sequencing of soil Actinomycetes on Oxford Nanopore Flongle

Rahim Rajwani, Shannon I. Ohlemacher, Gengxiang Zhao, Hong-bing Liu, Carole A. Bewley

Laboratory of Bioorganic Chemistry, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, United States

Correspondence to Dr. Carole A. Bewley (caroleb@nih.gov)

bioRxiv preprint doi: https://doi.org/10.1101/2021.08.11.456034; this version posted August 12, 2021.

Abstract

Genome mining is an important tool for the discovery of new natural products; however, the number of publicly available genomes for natural product-rich microbes such as Actinomycetes is small relative to that for human pathogens with smaller genomes. To obtain contiguous DNA assemblies and identify large (ca. 10 to greater than 100 Kb) biosynthetic gene clusters (BGCs) with high GC (>70%) and repeat content, it is necessary to use long-read sequencing methods when sequencing Actinomycete genomes. One of the hurdles to long-read sequencing is its higher cost.

In the current study, we assessed Flongle, a recently launched platform by Oxford Nanopore Technologies, as a low-cost DNA sequencing option to obtain contiguous DNA assemblies and analyze BGCs. To make the workflow more cost-effective, we multiplexed up to four samples in a single Flongle sequencing experiment while expecting low sequencing coverage per sample. We hypothesized that contiguous DNA assemblies might enable analysis of BGCs even at low sequencing depth. To assess the value of these assemblies, we collected high-resolution mass-spectrometry data and conducted a multi-omics analysis to connect BGCs to secondary metabolites.

In total, we assembled genomes for 20 distinct strains across seven sequencing experiments. In each experiment, 50% of the bases were in reads longer than 10 Kb, which facilitated the assembly of reads into contigs with an average N50 value of 3.5 Mb. The programs antiSMASH and PRISM predicted 629 and 295 BGCs, respectively. We connected BGCs to metabolites for N,N-dimethyl cyclic-ditryptophan, a novel lassopeptide and three known Actinomycete-associated siderophores, namely mirubactin, heterobactin and salinichelin.

Importance

Short-read sequencing of GC-rich genomes such as those of Actinomycetes results in a fragmented genome assembly and truncated biosynthetic gene clusters (often 10 to >100 Kb long), which hinders our ability to understand the biosynthetic potential of a given strain and predict the molecules that can be produced. The current study demonstrates that contiguous DNA assemblies, suitable for analysis of BGCs, can be obtained through low-coverage, multiplexed sequencing on Flongle, which provides a new low-cost workflow ($30-40 per strain) for sequencing Actinomycete strain libraries.

Introduction

Clinical pathogens are increasingly becoming resistant to currently used antimicrobials, causing over 700,000 deaths worldwide (1). New antimicrobials are urgently needed to alleviate antimicrobial resistance and prevent annual deaths from rising above 10 million by 2050 (1). One of the prolific sources of new antimicrobials is a group of Gram-positive, mycelium-forming bacteria, the Actinomycetes. Several currently used antibiotics, including vancomycin, rifamycin, and streptomycin, were isolated from Actinomycetes, and these bacteria still hold enormous potential for the future discovery of new medicines (2).

Genome sequencing is now an important component of natural products research. Whole-genome sequencing (WGS) enables identification of the genes responsible for the biosynthesis of natural products (3). Often, genes required for the biosynthesis of a natural product positionally cluster on the genome and are referred to as biosynthetic gene clusters (BGCs) (4). The BGC sequences can be used to predict possible structures of the resulting natural product (5), assess novelty of the compound (6) and dereplicate compounds from a strain collection (7). Despite the merits offered by WGS, the number of Actinomycete genomes remains limited. Several rare genera are not represented by a complete genome, and the majority of currently available genomes were sequenced using Illumina short-read technology, which results in highly fragmented assemblies. BGCs span multiple contigs in fragmented genome assemblies and cannot be detected or analyzed by commonly used BGC prediction tools such as antiSMASH (8, 9).

Long-read sequencing technologies (e.g. PacBio or Oxford Nanopore Technologies, ONT) produce the contiguous genome sequences needed to analyze secondary metabolite gene clusters. Notably, PacBio assemblies achieve consensus accuracy over 99.999%; however, PacBio is generally less accessible due to the upfront cost of sequencing instruments and higher per-sample sequencing costs. By contrast, ONT does not require the upfront cost of an expensive sequencing instrument and the devices are inexpensive. Nevertheless, ONT data result in a lower consensus accuracy (99.9%) and often require polishing with Illumina reads to obtain reference-quality genomes. We hypothesized that while BGC identification requires a contiguous DNA sequence, it might be less affected by the lower consensus accuracy of a Nanopore assembly, since most BGC analysis steps involve inferring homology between distantly related amino acid sequences using profile Hidden Markov models. If this is true, contiguous DNA assemblies can be obtained at ca. 10x coverage using ONT, allowing complete genome sequencing at a significantly lower cost. While such ONT-sequenced genomes would still require error correction with Illumina reads, they could be used on their own to sequence a strain collection, build a catalog and compare BGCs for dereplication or identification of potentially new compounds, which might be particularly useful to natural product research and drug discovery programs.

To assess the feasibility of obtaining contiguous assemblies from ca. 10x sequencing depth, predicting BGCs, and connecting BGCs to metabolites, we conducted the current multi-omics study. We sequenced 20 new soil-derived Actinomycete strains and analyzed their metabolome using high-resolution mass spectrometry (HRMS). For sequencing, we specifically selected Flongle, a recently launched ONT sequencing device that costs $90 USD and can generate up to 1-2 gigabases of sequence output. With a typical Actinomycete genome being 8-10 Mb, a single Flongle experiment might be sufficient to sequence 3-4 strains at 20-30x coverage. Sequencing workflows based on Flongle could be broadly applicable to small and large studies due to the modular experimental design. In the current study, we obtained 300-850 Mb of data per experiment across ten sequencing experiments with read-length N50 values over 10 Kb. Assembly of the reads resulted in contiguous assemblies (average contig N50 value = 3.5 Mb and average number of contigs = 47.3). AntiSMASH5 predicted a total of 629 BGCs from these assemblies. Through a combined analysis with metabolomics data, we were able to connect BGCs to their secondary metabolites. The study demonstrates the utility of low coverage nanopore-only assemblies as a rapid and low-cost sequencing option to advance natural product research.
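The multiplexing arithmetic behind the "3-4 strains per Flongle" estimate can be sketched as below. This is an illustrative calculation only; the function name is hypothetical, and it assumes the output is shared equally among the pooled strains, which real barcoded runs only approximate.

```python
def theoretical_coverage(total_bases, n_strains, genome_size):
    """Expected per-strain coverage if the output is split evenly between strains."""
    return total_bases / (n_strains * genome_size)


# Output and genome-size ranges quoted in the text: 1-2 gigabases, 8-10 Mb genomes.
for total_bases in (1e9, 2e9):
    for n_strains in (3, 4):
        for genome_size in (8e6, 10e6):
            cov = theoretical_coverage(total_bases, n_strains, genome_size)
            print(f"{total_bases / 1e9:.0f} Gb, {n_strains} strains, "
                  f"{genome_size / 1e6:.0f} Mb genome: {cov:.1f}x")
```

At the low end of the output range (1 gigabase), four 10 Mb genomes would each receive about 25x, which falls inside the 20-30x range given above.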

Results

An in silico analysis to study the effect of sequencing coverage and read length on BGC detection

We first analyzed what level of sequencing coverage would be sufficient for contiguous assemblies and BGC detection using Oxford Nanopore sequencing. For this purpose, three Actinomycete genomes, previously sequenced at high coverage, were downloaded from the European Nucleotide Archive and their reads were downsampled to 60x, 30x, 15x and 7x coverage (assuming a genome size of 8 Mb) before assembling and detecting BGCs. While the actual sizes of the three genomes differed (Table S1), assuming a fixed expected genome size of 8 Mb allowed us to determine the utility of a prospective sequencing experiment in which Actinomycete genome sizes would not be known. In the downsampling analysis, assembly size and number of predicted genes nearly plateaued at ca. 15x coverage. Similarly, a sharp decline in the number of contigs and the number of mismatches per 100 Kb was observed at ca. 15-20x coverage (Figure 1). At approximately the same coverage of 15-20x, 72-96% of BGCs were detected by antiSMASH, and a further increase in coverage led to detection of only 1-6 additional BGCs (Figure 1). Moreover, most of the BGCs were not located at the edge of a contig and are therefore referred to as complete. Relative to antiSMASH, PRISM predicted a lower number of BGCs. This could be because antiSMASH was run in a ‘relaxed’ mode in this study whereas PRISM does not have this option. Nevertheless, the trend relative to coverage was similar between antiSMASH and PRISM.
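One simple way to downsample a read set to a target coverage under a fixed assumed genome size is to draw reads at random until the required number of bases is reached. The sketch below illustrates that idea on simulated read lengths; it is not the authors' tool or procedure, and the function name, seed and read-length distribution are assumptions.

```python
import random


def downsample_reads(read_lengths, target_coverage, genome_size=8_000_000, seed=0):
    """Pick reads at random until their total length reaches the target coverage.

    Returns the indices of the selected reads and the number of bases selected.
    """
    target_bases = target_coverage * genome_size
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)

    kept, total = [], 0
    for idx in order:
        if total >= target_bases:
            break
        kept.append(idx)
        total += read_lengths[idx]
    return kept, total


# Simulated high-coverage read set (hypothetical lengths, 1-20 Kb).
rng = random.Random(1)
read_lengths = [rng.randint(1_000, 20_000) for _ in range(100_000)]

for coverage in (60, 30, 15, 7):
    kept, total = downsample_reads(read_lengths, coverage)
    print(f"{coverage}x target: {len(kept)} reads, {total / 8_000_000:.2f}x obtained")
```

Because the genome size is fixed at 8 Mb rather than measured, the coverage actually achieved for a given strain will differ when its true genome is larger or smaller, which is the situation the paragraph above describes.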

Assembly contiguity, and therefore BGC detection, in a Nanopore sequencing experiment is also related to read length. In another computational experiment, we evaluated whether longer reads might enable a more contiguous DNA assembly and BGC detection at fixed coverage. For this purpose, simulated Nanopore reads of average length 500, 1000, 2000 and 4000 bp were generated at 10x coverage of a Streptomyces genome (GB4-14) using BadRead. The resulting reads were assembled and analyzed for assembly contiguity and BGC detection. It was observed that a ca. 2-fold increase in average read length was associated with a 2-fold reduction in the number of contigs (Figure S3). Improved assembly contiguity also led to a reduction in the number of BGCs on contig edges (incomplete) and an increased number of complete BGCs with little to no change in sequencing coverage (Figure S4).
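The complete versus incomplete distinction used here depends only on where a predicted BGC sits on its contig. BGC prediction tools generally report this themselves; the sketch below re-derives it from coordinates purely for illustration. The class, the `margin` parameter and the example coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BGCRegion:
    contig: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def is_complete(region, contig_lengths, margin=0):
    """A BGC is treated as complete if it does not reach either contig end.

    margin widens the edge zone: a region within `margin` bases of an end
    is also treated as lying on the edge.
    """
    contig_length = contig_lengths[region.contig]
    touches_start = region.start <= margin
    touches_end = region.end >= contig_length - margin
    return not (touches_start or touches_end)


# Hypothetical contigs and predicted regions.
contig_lengths = {"contig_1": 3_500_000, "contig_2": 120_000}
regions = [
    BGCRegion("contig_1", 1_200_000, 1_260_000),  # well inside the contig
    BGCRegion("contig_2", 0, 45_000),             # starts at the contig start
    BGCRegion("contig_2", 90_000, 120_000),       # runs to the contig end
]

for region in regions:
    label = "complete" if is_complete(region, contig_lengths) else "incomplete"
    print(region.contig, region.start, region.end, label)
```

Longer reads help here indirectly: fewer, longer contigs mean fewer contig ends, so fewer clusters can be cut off by one.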

Overall, these computational experiments suggested that contiguous DNA assemblies and complete BGCs can be obtained at low sequencing coverage using long reads from Oxford Nanopore Technologies; this should allow for a dramatic reduction in cost per genome through multiplexing. The computational experiments were followed up with prospective sequencing of Actinomycete genomes using Flongle and more detailed analyses of BGCs, described in the following sections.

Nanopore sequencing, genome assembly and quality assessment

A total of ten sequencing experiments were conducted, each with an attempt to sequence four Actinomycete strains (Figure 2). Impurities in the starting genomic DNA (as measured by the ratio of the UV absorbance at 260 and 280 nm) and low pore occupancy (caused by insufficient loading of the library or inhibition of adapter ligation) resulted in three unsuccessful experiments with a total output <100 megabases (Mb) per experiment. The remaining seven experiments yielded 288-797 Mb over 18-24 hours. The longest read for each sample was over 80 Kb.

Across the experiments, we tried different buffers for bead-based purification to apply size selection and increase the read length N50 values relative to standard protocols (Table S2) (Figure 2). One of our initial experiments, using 0.5x of the standard buffer concentration, was not successful, resulting in read N50 values of 1-1.6 Kb for three out of five samples sequenced in the experiment. In two subsequent successful experiments (AET670 and AFK704), we utilized 0.15x of a modified buffer containing 0.5 M MgCl2 + 5% PEG in TE buffer (10 mM Tris-Cl pH 8.0, 1 mM EDTA) for bead-based purification after barcode ligation, as described previously (10). Read length N50 values in these experiments were 11.6-15.1 Kb (Figure 2). In three experiments (AEZ324, AFA498 and AFA876), the concentration of the modified buffer-based size selection was reduced to 0.1x, which led to a further increase in read length N50s (10.5-23.3 Kb) accompanied by increased sample loss. Application of size selection after barcode ligation ensures approximately equal fragment lengths for pooling of samples and adapter ligation. However, ligation of barcodes to longer fragments could be less efficient if shorter fragments are present in the mixture. We therefore tested buffer-based size selection (0.1x beads in modified buffer) before barcode ligation in later experiments (flow cell IDs AFK426 and AFK406) (Table S2). A more consistent output was observed, possibly due to more efficient barcode/adapter ligation to longer DNA fragments.

Across the seven successful runs, 3,814,434,062 bases in 751,459 reads were generated. Upon demultiplexing, the median number of bases per sample was 77.5 Mb (theoretical coverage of 9.5x with an expected 8 Mb genome size). Three strains were sequenced at <2.5x theoretical coverage (<20 Mb per strain) and were excluded from further analysis. Subsequently, 25 samples (20 distinct isolates) were de novo assembled with Canu and polished with Racon and medaka (Figure 3). The median length of the obtained assemblies was 8.5 Mb (average: 7.9 Mb, maximum: 9.4 Mb), typical of Actinomycete genome sizes. The only exception was a 3 Mb assembly for GB8-002, which was also sequenced at the lowest coverage (4.0x) (Figure 3).
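The exclusion rule described above (less than 2.5x theoretical coverage, i.e. less than 20 Mb of reads for an expected 8 Mb genome) can be written as a small filter. The sample names and yields below are invented for illustration, and the function names are hypothetical.

```python
EXPECTED_GENOME_SIZE = 8_000_000  # bases, the fixed assumption used in the text
MIN_COVERAGE = 2.5                # samples below this are excluded


def coverage_at_expected_size(bases, genome_size=EXPECTED_GENOME_SIZE):
    """Theoretical coverage of a sample given its demultiplexed yield in bases."""
    return bases / genome_size


def split_by_coverage(sample_bases, min_coverage=MIN_COVERAGE):
    """Separate samples into those kept for assembly and those excluded."""
    kept, excluded = {}, {}
    for sample, bases in sample_bases.items():
        coverage = coverage_at_expected_size(bases)
        (kept if coverage >= min_coverage else excluded)[sample] = coverage
    return kept, excluded


# Hypothetical per-sample yields after demultiplexing.
sample_bases = {"sample_A": 80_000_000, "sample_B": 18_000_000, "sample_C": 32_000_000}
kept, excluded = split_by_coverage(sample_bases)

print("kept:", kept)
print("excluded:", excluded)
```

With these made-up yields, the 18 Mb sample falls below the threshold and is set aside, while a sample at 4x is still passed on to assembly.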

We assessed the accuracy and quality of these low coverage genomes by comparing them with genomes sequenced at high coverage on MinION or PacBio. In particular, two strains, GA3-008 and GB4-14, were previously sequenced by our lab at 10-fold higher coverage using MinION and PacBio, respectively (Table 1). Despite the lower sequence coverage, the genomes’ contiguity was only slightly affected on Flongle and all were assembled into <10 contigs. The size of the assembly differed by 6.1 Kb (GA3-008) and 19.6 Kb (GB4-14) due to insertion/deletion (indel) errors. Despite many mismatches and indel errors, 87.5-100% of the BGCs detected in the MinION or PacBio assemblies were also detected in these Flongle assemblies by antiSMASH.

Taxonomy and BGCs

antiSMASH predicted 629 BGCs of 29 different types across all assembled genomes from this study (Figure S5). Seventy-nine percent (497/629) of these BGCs were complete (i.e. not located on a contig


References

Minimap2: pairwise alignment for nucleotide sequences

Open Babel: An open chemical toolbox

QUAST: quality assessment tool for genome assemblies

Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.

Fast and accurate de novo genome assembly from long uncorrected reads
