Title:
Mining all publicly available expression data to compute dynamic microbial transcriptional regulatory networks

Authors:
Anand V. Sastry 1,*, Saugat Poudel 1, Kevin Rychel 1, Reo Yoo 1, Cameron R. Lamoureux 1, Siddharth Chauhan 1, Zachary B. Haiman 1, Tahani Al Bulushi 1, Yara Seif 1,†, Bernhard O. Palsson 1,2

Affiliations:
1 Department of Bioengineering, University of California San Diego, La Jolla, CA, 92093, USA.
2 Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Lyngby, 2800, Denmark.

* Corresponding author: Anand V. Sastry (avsastry@eng.ucsd.edu)
† Current address: Merck & Co., Inc., South San Francisco, CA 94080, USA
bioRxiv preprint, this version posted July 2, 2021 (doi: https://doi.org/10.1101/2021.07.01.450581), made available under a CC-BY-NC 4.0 International license. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

Abstract:
We are firmly in the era of biological big data. Millions of omics datasets are publicly accessible
and can be employed to support scientific research or build a holistic view of an organism. Here,
we introduce a workflow that converts all public gene expression data for a microbe into a dynamic
representation of the organism’s transcriptional regulatory network. This five-step process walks
researchers through the mining, processing, curation, analysis, and characterization of all
available expression data, using Bacillus subtilis as an example. The resulting reconstruction of
the B. subtilis regulatory network can be leveraged to predict new regulons and analyze datasets
in the context of all published data. The results are hosted at https://imodulondb.org/, and
additional analyses can be performed using the PyModulon Python package. As the number of
publicly available datasets increases, this pipeline will be applicable to a wide range of microbial
pathogens and cell factories.
Introduction
Over the past few decades, advances in sequencing technologies have resulted in an exponential increase in the availability of public genomics datasets [1,2]. Public RNA sequencing datasets, in particular, have been integrated to provide a broad view of an organism's transcriptomic state [3,4], generate new biological hypotheses [5,6], and infer co-expression networks and transcriptional regulation [7,8].
Independent Component Analysis (ICA) has recently emerged as a promising method to extract knowledge from large transcriptomics compendia [9–16]. ICA is a machine learning algorithm designed to separate mixed signals into their original source components [17]. For example, given a set of microphones interspersed in a crowded room, ICA can be applied to the microphones' recordings (with no additional input) to not only recreate the individual voices in the room, but also infer the relative distances between each microphone and each voice. Similarly, ICA can be applied to transcriptomics datasets to extract gene modules whose expression patterns are statistically independent of other genes. We call these independently modulated groups of genes iModulons. Most iModulons are nearly identical to regulons, or groups of genes regulated by the same transcriptional regulator, in model bacteria [9,10], and can be used to discover new regulons or gene functions in less-characterized organisms [11,18].
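To make the source-separation idea concrete, the short sketch below uses scikit-learn's FastICA to unmix three synthetic signals from their mixtures alone; the signal shapes, mixing matrix, and parameters are illustrative and are not taken from this study.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Three synthetic source signals (the "voices in the room")
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t),                 # smooth periodic signal
                np.sign(np.sin(3 * t)),        # square wave
                rng.laplace(size=t.size)]      # spiky noise-like signal

# Mix the sources with a random matrix (the "microphone recordings")
mixing = rng.normal(size=(3, 3))
mixed = sources @ mixing.T                     # shape: (n_observations, n_signals)

# ICA recovers statistically independent components from the mixtures alone
ica = FastICA(n_components=3, random_state=0)
recovered = ica.fit_transform(mixed)           # estimated source signals
estimated_mixing = ica.mixing_                 # estimated mixing matrix
```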
ICA simultaneously computes the activity of each iModulon under every experimental condition in the dataset; these activities represent the activity state of a transcriptional regulator. iModulon activities have intuitive interpretations, leading to a dynamic representation of an organism's transcriptional regulatory network (TRN) [10,11,19–21]. Even for pairwise comparisons, iModulons can simplify the analysis from thousands of differentially expressed genes to merely dozens of differentially activated iModulons [22,23].
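As a schematic of such a pairwise comparison, the toy example below averages replicate activities per condition and reports iModulons whose activity change exceeds a cutoff; the activity matrix, sample names, and threshold are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy iModulon activity matrix: rows = iModulons, columns = samples (two replicates each)
A = pd.DataFrame(
    np.array([[ 0.2,  0.5,  8.1,  7.6],
              [-0.1,  0.3,  0.4, -0.2],
              [ 1.0,  0.8, -6.3, -5.9]]),
    index=["iModulon_1", "iModulon_2", "iModulon_3"],
    columns=["ctrl_rep1", "ctrl_rep2", "stress_rep1", "stress_rep2"],
)

# Average replicates, then compare the two conditions
mean_activity = pd.DataFrame({
    "ctrl": A[["ctrl_rep1", "ctrl_rep2"]].mean(axis=1),
    "stress": A[["stress_rep1", "stress_rep2"]].mean(axis=1),
})
diff = mean_activity["stress"] - mean_activity["ctrl"]

# A handful of differentially activated iModulons instead of thousands of genes
threshold = 5  # illustrative cutoff on the change in activity
print(diff[diff.abs() > threshold])
```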

iModulons have many properties that lend themselves to knowledge generation from large datasets. In terms of accuracy, ICA outcompeted 42 other regulatory module detection algorithms, including WGCNA and biclustering algorithms, in detecting known regulons across E. coli, yeast, and human transcriptomics data [24]. The FastICA algorithm [25] is also relatively fast compared to deep learning approaches, computing components in a matter of minutes rather than hours [26]. Unlike many other decompositions, the independent components themselves are conserved across different datasets [27,28], batches [29], and dimensionalities within the same dataset [26,30]. This property enables us to use precomputed iModulon structures to interpret new transcriptomic datasets [9]. Altogether, these properties make ICA and iModulons a powerful tool for interpreting the vast ocean of publicly available transcriptomic data and advancing our understanding of transcriptome organization.
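One simple way to exploit this conservation is sketched below with placeholder matrices: a new sample's centered expression profile is fit against a precomputed gene-weight matrix M by least squares to infer activities for the known iModulons. The projection shown is a generic illustration, not necessarily the exact procedure used in the cited studies.

```python
import numpy as np

rng = np.random.default_rng(42)
n_genes, n_imodulons, n_new_samples = 4000, 60, 5

# Precomputed iModulon structure: M maps activities to expression (X is approximately M @ A)
M = rng.normal(size=(n_genes, n_imodulons))        # placeholder gene-weight matrix
X_new = rng.normal(size=(n_genes, n_new_samples))  # placeholder centered log-TPM of new samples

# Infer activities of the known iModulons in the new samples by least squares
A_new, *_ = np.linalg.lstsq(M, X_new, rcond=None)
print(A_new.shape)  # (n_imodulons, n_new_samples)
```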
We have outlined a five-step workflow (Figure 1a) that enables researchers to build and characterize the iModulon structure for any microbe with sufficient public data. The first two steps are to download and process all publicly available RNA-seq data for a given organism. Third, the data must be inspected to ensure quality control, and curated to include all appropriate metadata. Next, ICA can be applied to the high-quality compendium to produce independent components. Finally, the independent components are processed into iModulons and can subsequently be characterized. To facilitate iModulon characterization, interpretation, and visualization, we present PyModulon, a Python library for iModulon analysis (https://pymodulon.readthedocs.io/en/latest/). We have made the entire workflow available on GitHub (https://github.com/avsastry/modulome-workflow), and will be disseminating iModulons through our interactive website (https://imodulondb.org/).
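A minimal usage sketch for PyModulon is given below. The import path, constructor arguments, and method names follow the PyModulon documentation to the best of our reading, but the file names are hypothetical and the snippet should be checked against the documentation linked above.

```python
import pandas as pd
from pymodulon.core import IcaData  # see https://pymodulon.readthedocs.io/en/latest/

# Hypothetical input files: gene weights (genes x iModulons), activities (iModulons x samples),
# and a table of known regulator-gene interactions
M = pd.read_csv("M.csv", index_col=0)
A = pd.read_csv("A.csv", index_col=0)
trn = pd.read_csv("trn.csv")

ica_data = IcaData(M, A, trn=trn)

# Compare each iModulon against known regulons to propose regulators
# (method name per the PyModulon docs; verify before use)
ica_data.compute_trn_enrichment()
print(ica_data.imodulon_table.head())
```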

Figure 1: Overview of public B. subtilis RNA-seq data. a) Graphical representation of the five-step workflow.
b) Pie chart illustrating the quality control process. Numbers at the beginning of arrows represent the
number of datasets before the quality control step, and numbers at the end represent the number of passed
datasets after the step. c) Number of high-quality RNA-seq datasets for B. subtilis in NCBI SRA over time.
d) Scatter plot of the top two principal components of the B. subtilis expression compendium. Points are
colored based on the growth phase parsed from the literature. e) Bar chart showing the expression of dnaA
across four projects. Points show individual replicates, while bars show the average expression for a given
condition. Bars with a red star serve as the reference condition for the project. The legend describes the
bars in the spore_genes project.

Results
Here, we demonstrate how to build the iModulon structure of Bacillus subtilis from publicly available RNA-seq datasets using five steps (Figure 1a). All code to reproduce this pipeline is available at https://github.com/avsastry/modulome-workflow/. Since this process results in the totality of iModulons that can currently be computed for an organism, we have named the resulting database the "B. subtilis Modulome".
Steps 1 and 2: Compile and process all publicly available RNA-seq datasets for B. subtilis
Using Entrez Direct [31], we have created a script that compiles the metadata for all publicly available RNA-seq data for a given organism in NCBI SRA (https://github.com/avsastry/modulome-workflow/tree/main/1_download_metadata). Since the iModulon structure improves with the number of datasets, we recommend that at least 50 unique conditions be available for an organism before proceeding with the remainder of the pipeline. As of August 2020, we identified 718 datasets labelled as Bacillus subtilis RNA-seq data.
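The sketch below illustrates this compilation step by calling Entrez Direct (esearch piped into efetch) from Python and counting unique BioSamples as a rough proxy for unique conditions; the query string and the proxy are assumptions, and the authors' actual script lives in the linked repository.

```python
import io
import subprocess
import pandas as pd

# Assumed SRA query; the authors' script is in 1_download_metadata of the linked repository
query = '"Bacillus subtilis"[Organism] AND "rna seq"[Strategy]'
cmd = f"esearch -db sra -query '{query}' | efetch -format runinfo"
runinfo = subprocess.run(cmd, shell=True, capture_output=True, text=True, check=True).stdout

metadata = pd.read_csv(io.StringIO(runinfo))
print(f"{len(metadata)} SRA runs found")

# Rough readiness check: at least ~50 unique conditions, using BioSample as a proxy
print(f"{metadata['BioSample'].nunique()} unique BioSamples")
```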
Although iModulons can be computed from both microarray and RNA-seq datasets, microarray datasets tend to produce more uncharacterized iModulons, induce stronger batch effects through platform heterogeneity, and have been largely superseded by RNA-seq in current publications [27]. For these reasons, we have designed the first two steps specifically for compiling and processing RNA-seq data.
The B. subtilis datasets were subsequently processed using the RNA-seq pipeline available at
https://github.com/avsastry/modulome-workflow/tree/main/2_process_data (Figure S1). Ten
datasets failed to complete the processing pipeline, resulting in expression counts for 708
datasets.
Step 3: Quality control, metadata curation, and normalization
The B. subtilis compendium was subjected to five quality control criteria (Figure 1b, https://github.com/avsastry/modulome-workflow/tree/main/3_quality_control). The final high-quality B. subtilis compendium contained 265 RNA-seq datasets (Figure 1c). As part of the quality control procedure, manual curation of experimental metadata was performed to identify which samples were replicates. Inaccurate or insufficient metadata reporting can hinder the widespread utilization of public data and potentially prevent subsequent interpretation [2,32].
Therefore, we inspected the literature to identify the strain and media used in each study, any additional treatments or temperature changes, and the growth stage (e.g., mid-exponential, stationary, biofilm), if reported. During curation, we removed some non-traditional RNA-seq datasets, such as TermSeq or RiboSeq.
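The toy example below illustrates two operations from this step: dropping samples that failed QC or lack curated condition labels, and centering each project's log-TPM values on its reference condition (cf. Figure 1e). The column names, flags, and values are illustrative and do not reproduce the exact criteria used here.

```python
import pandas as pd

# Toy metadata and log-TPM tables; columns, flags, and values are illustrative
metadata = pd.DataFrame({
    "project":             ["projA", "projA", "projA", "projB", "projB"],
    "condition":           ["wt_glc", "wt_glc", "heat",  "wt",    None],
    "reference_condition": [True,     True,     False,   True,    False],
    "passed_fastqc":       [True,     True,     True,    True,    False],
}, index=["s1", "s2", "s3", "s4", "s5"])

log_tpm = pd.DataFrame(
    [[5.0, 5.2, 7.9, 4.0, 4.1],
     [2.0, 2.1, 2.0, 3.5, 3.6]],
    index=["geneA", "geneB"], columns=metadata.index)

# 1) Drop samples that failed QC or lack curated condition metadata
passed = metadata[metadata["passed_fastqc"] & metadata["condition"].notna()]

# 2) Center each project's expression on its reference condition (cf. Figure 1e)
centered = []
for project, group in passed.groupby("project"):
    ref_samples = group[group["reference_condition"]].index
    ref_mean = log_tpm[ref_samples].mean(axis=1)
    centered.append(log_tpm[group.index].sub(ref_mean, axis=0))
X = pd.concat(centered, axis=1)
print(X.round(2))
```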

References

Scikit-learn: Machine Learning in Python
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.

Ultrafast and memory-efficient alignment of short DNA sequences to the human genome
Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches and can be used simultaneously to achieve even greater alignment speeds.

A density-based algorithm for discovering clusters in large spatial databases with noise
DBSCAN, a new clustering algorithm relying on a density-based notion of clusters and designed to discover clusters of arbitrary shape, is presented; it requires only one input parameter and supports the user in determining an appropriate value for it.

featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features
featureCounts is a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments; it implements highly efficient chromosome hashing and feature blocking techniques.

Independent component analysis, a new concept? (Pierre Comon, 1994)
An efficient algorithm is proposed that allows the computation of the ICA of a data matrix within polynomial time and may be seen as an extension of principal component analysis (PCA).
Frequently Asked Questions (17)
Q1. What contributions have the authors mentioned in the paper "Mining all publicly available expression data to compute dynamic microbial transcriptional regulatory networks" ?

Here, the authors introduce a workflow that converts all public gene expression data for a microbe into a dynamic representation of the organism's transcriptional regulatory network. This five-step process walks researchers through the mining, processing, curation, analysis, and characterization of all available expression data, using Bacillus subtilis as an example.

Hierarchical clustering was used to identify samples that did not conform to a typical expression profile, as these samples often use non-standard library preparation methods, such as ribosome sequencing and 3’ or 5’ end sequencing 3. 

To guarantee a high-quality expression dataset for B. subtilis, data that failed any of the following four FASTQC metrics were discarded: per base sequence quality, per sequence quality scores, per base N content, and adapter content.

Revealing 29 sets of independently modulated genes in Staphylococcus aureus, their regulators, and role in key physiological response.

iModulon motifs were compared to known motifs using the compare_motifs function in PyModulon, which is a wrapper for TOMTOM 63 using an E-value of 0.001. 

Since the number of dimensions selected in ICA can alter the results, the authors applied the above procedure to the B. subtilis dataset multiple times, ranging the number of dimensions from 10 to 260 (i.e., the approximate size of the dataset) with a step size of 10. 

Information including the strain description, base media, carbon source, treatments, and temperature were pulled from the literature. 

iModulon enrichments against known regulons were computed using Fisher's Exact Test, with the false discovery rate (FDR) controlled at 10⁻⁵ using the Benjamini-Hochberg correction. 
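A minimal sketch of such an enrichment test, using SciPy's Fisher's exact test and the Benjamini-Hochberg correction from statsmodels on invented toy gene sets:

```python
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

all_genes = {f"g{i}" for i in range(4000)}        # background gene set (toy size)
imodulon = {f"g{i}" for i in range(40)}           # genes in one iModulon (toy)
regulons = {
    "RegA": {f"g{i}" for i in range(35)},         # mostly overlaps the iModulon
    "RegB": {f"g{i}" for i in range(2000, 2050)}, # unrelated gene set
}

pvals = []
for name, regulon in regulons.items():
    overlap = len(imodulon & regulon)
    table = [[overlap, len(imodulon - regulon)],
             [len(regulon - imodulon), len(all_genes - imodulon - regulon)]]
    _, p = fisher_exact(table, alternative="greater")
    pvals.append(p)

# Benjamini-Hochberg correction with the FDR controlled at 1e-5, as in the text
reject, qvals, *_ = multipletests(pvals, alpha=1e-5, method="fdr_bh")
for (name, _), q, r in zip(regulons.items(), qvals, reject):
    print(name, f"q = {q:.2e}", "enriched" if r else "not enriched")
```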

The resulting independent components (ICs) were clustered using DBSCAN 56 to identify robust ICs, with an epsilon of 0.1 and a minimum cluster seed size of 50. 
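A scaled-down sketch of this robust-ICA step is shown below: FastICA is run several times with different seeds and the pooled components are clustered with DBSCAN. The run count, component number, and minimum cluster size are reduced so the toy example runs quickly (the study uses 100 runs, an epsilon of 0.1, and a minimum cluster size of 50), and the 1 - |correlation| distance between components is an assumption about how components are compared.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import FastICA

# Toy expression matrix containing three hidden signals (genes x samples)
rng = np.random.default_rng(0)
S_true = rng.laplace(size=(200, 3))                   # toy gene weights
A_true = rng.normal(size=(3, 40))                     # toy activities
X = S_true @ A_true + 0.1 * rng.normal(size=(200, 40))

# Repeated ICA runs with different random seeds (the study: 100 runs, tol = 1e-7)
components = []
for seed in range(10):
    S = FastICA(n_components=3, random_state=seed,
                tol=1e-7, max_iter=1000).fit_transform(X)
    components.append(S.T)                            # rows = components over genes
components = np.vstack(components)

# Cluster the pooled components; distance = 1 - |Pearson r| since IC sign/scale is arbitrary
dist = np.clip(1 - np.abs(np.corrcoef(components)), 0, None)
np.fill_diagonal(dist, 0)
labels = DBSCAN(eps=0.1, min_samples=5, metric="precomputed").fit_predict(dist)
print(f"{len(set(labels) - {-1})} robust component clusters found")
```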

12. Karczewski, K. J., Snyder, M., Altman, R. B. & Tatonetti, N. P. Coherent functional modules improve transcription factor target identification, cooperativity prediction, and disease association. 

Additional functions for gene set enrichment analysis are located in the enrichment module, including a generalized gene set enrichment function and an implementation of the Benjamini-Hochberg false discovery rate (FDR). 

The scikit-learn 54 implementation of K-means clustering, using three clusters, can be applied to the absolute values of the gene weights in each independent component. 
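A sketch of this K-means-based thresholding applied to the absolute gene weights of a single component; the weight distribution and the rule for discarding the near-zero cluster are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy absolute gene weights for one independent component
rng = np.random.default_rng(1)
weights = np.concatenate([rng.normal(0, 0.02, 950),    # background genes
                          rng.normal(0.4, 0.05, 50)])  # strongly weighted genes
abs_w = np.abs(weights).reshape(-1, 1)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(abs_w)

# Treat the cluster whose centroid is closest to zero as background; keep the rest
centroids = [abs_w[labels == k].mean() for k in range(3)]
background = int(np.argmin(centroids))
members = np.where(labels != background)[0]
print(f"{members.size} genes selected for this component")
```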

To compute the optimal independent components, an extension of ICA was performed on the RNA-seq dataset as described in McConn et al. 

Liao, Y., Smyth, G. K. & Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. 

In order to identify the most significant genes in each component, the authors iteratively removed genes with the largest absolute value and computed the D’Agostino K2 test statistic 57 for the resulting distribution. 
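This iteration can be sketched with SciPy's normaltest, which returns the D'Agostino K2 statistic; the K2 cutoff below is a placeholder, whereas the actual threshold is tuned against known regulons:

```python
import numpy as np
from scipy.stats import normaltest

# Toy gene weights: a near-normal background plus a set of outlier genes
rng = np.random.default_rng(2)
weights = np.concatenate([rng.normal(0, 0.02, 950),
                          rng.normal(0.5, 0.05, 50)])

order = np.argsort(-np.abs(weights))   # gene indices, largest |weight| first
cutoff = 550                           # placeholder K2 cutoff

n_removed = 0
for n_removed in range(len(weights) - 20):
    k2, _ = normaltest(weights[order[n_removed:]])  # D'Agostino K2 of the remaining weights
    if k2 < cutoff:
        break

imodulon_genes = order[:n_removed]     # the removed, most significant genes
print(f"{imodulon_genes.size} genes above the K2-based threshold")
```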

Global iModulon activity clustering was performed using the clustermap function in the Python Seaborn package 69 with the distance metric $d_{x,y} = 1 - |\rho_{x,y}|$, where $|\rho_{x,y}|$ is the absolute value of the Spearman R correlation between two iModulon activity profiles. 
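A sketch of this clustering on a toy activity matrix, passing a linkage built from the 1 - |rho| distance to seaborn's clustermap:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(3)
A = rng.normal(size=(12, 30))            # toy activities: iModulons x conditions

rho, _ = spearmanr(A, axis=1)            # pairwise Spearman correlation between rows
dist = 1 - np.abs(rho)                   # d_{x,y} = 1 - |rho_{x,y}|
np.fill_diagonal(dist, 0)

row_linkage = linkage(squareform(dist, checks=False), method="average")
sns.clustermap(A, row_linkage=row_linkage, col_cluster=False)
plt.show()
```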

The scikit-learn (v0.23.2) 54 implementation of FastICA 55 was executed 100 times with random seeds and a convergence tolerance of 10⁻⁷. 

Trending Questions (1)
How to construct a gene regulatory network from omics data of bacteria?

The paper provides a five-step workflow to construct a gene regulatory network from publicly available gene expression data for a microbe, using Bacillus subtilis as an example.