Title:
Mining all publicly available expression data to compute dynamic microbial transcriptional regulatory networks

Authors:
Anand V. Sastry 1,*, Saugat Poudel 1, Kevin Rychel 1, Reo Yoo 1, Cameron R. Lamoureux 1, Siddharth Chauhan 1, Zachary B. Haiman 1, Tahani Al Bulushi 1, Yara Seif 1,†, Bernhard O. Palsson 1,2

Affiliations:
1 Department of Bioengineering, University of California San Diego, La Jolla, CA, 92093, USA.
2 Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Lyngby, 2800, Denmark.

* Corresponding author: Anand V. Sastry (avsastry@eng.ucsd.edu)
† Current address: Merck & Co., Inc., South San Francisco, CA 94080, USA
bioRxiv preprint, this version posted July 2, 2021 (doi: https://doi.org/10.1101/2021.07.01.450581), made available under a CC-BY-NC 4.0 International license. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.

Abstract:
We are firmly in the era of biological big data. Millions of omics datasets are publicly accessible
and can be employed to support scientific research or build a holistic view of an organism. Here,
we introduce a workflow that converts all public gene expression data for a microbe into a dynamic
representation of the organism’s transcriptional regulatory network. This five-step process walks
researchers through the mining, processing, curation, analysis, and characterization of all
available expression data, using Bacillus subtilis as an example. The resulting reconstruction of
the B. subtilis regulatory network can be leveraged to predict new regulons and analyze datasets
in the context of all published data. The results are hosted at https://imodulondb.org/, and
additional analyses can be performed using the PyModulon Python package. As the number of
publicly available datasets increases, this pipeline will be applicable to a wide range of microbial
pathogens and cell factories.
Introduction
Over the past few decades, advances in sequencing technologies have resulted in an exponential increase in the availability of public genomics datasets [1,2]. Public RNA sequencing datasets, in particular, have been integrated to provide a broad view of an organism's transcriptomic state [3,4], generate new biological hypotheses [5,6], and infer co-expression networks and transcriptional regulation [7,8].
Independent Component Analysis (ICA) has recently emerged as a promising method to extract knowledge from large transcriptomics compendia [9–16]. ICA is a machine learning algorithm designed to separate mixed signals into their original source components [17]. For example, given a set of microphones interspersed in a crowded room, ICA can be applied to the microphones' recordings (with no additional input) to not only recreate the individual voices in the room, but also infer the relative distances between each microphone and each voice. Similarly, ICA can be applied to transcriptomics datasets to extract gene modules whose expression patterns are statistically independent of other genes. We call these independently modulated groups of genes iModulons. Most iModulons are nearly identical to regulons, or groups of genes regulated by the same transcriptional regulator, in model bacteria [9,10], and can be used to discover new regulons or gene functions in less-characterized organisms [11,18].
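To make the source-separation idea concrete, the short sketch below uses scikit-learn's FastICA to unmix three synthetic signals from their mixtures alone; the signal shapes, mixing matrix, and parameters are illustrative and are not taken from this study.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Three synthetic source signals (the "voices in the room")
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t),                 # smooth periodic signal
                np.sign(np.sin(3 * t)),        # square wave
                rng.laplace(size=t.size)]      # spiky noise-like signal

# Mix the sources with a random matrix (the "microphone recordings")
mixing = rng.normal(size=(3, 3))
mixed = sources @ mixing.T                     # shape: (n_observations, n_signals)

# ICA recovers statistically independent components from the mixtures alone
ica = FastICA(n_components=3, random_state=0)
recovered = ica.fit_transform(mixed)           # estimated source signals
estimated_mixing = ica.mixing_                 # estimated mixing matrix
```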
ICA simultaneously computes the activity of each iModulon under every experimental condition in the dataset; these activities represent the activity state of a transcriptional regulator. iModulon activities have intuitive interpretations, leading to a dynamic representation of an organism's transcriptional regulatory network (TRN) [10,11,19–21]. Even for pairwise comparisons, iModulons can simplify the analysis from thousands of differentially expressed genes to merely dozens of differentially activated iModulons [22,23].
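As a schematic of such a pairwise comparison, the toy example below averages replicate activities per condition and reports iModulons whose activity change exceeds a cutoff; the activity matrix, sample names, and threshold are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy iModulon activity matrix: rows = iModulons, columns = samples (two replicates each)
A = pd.DataFrame(
    np.array([[ 0.2,  0.5,  8.1,  7.6],
              [-0.1,  0.3,  0.4, -0.2],
              [ 1.0,  0.8, -6.3, -5.9]]),
    index=["iModulon_1", "iModulon_2", "iModulon_3"],
    columns=["ctrl_rep1", "ctrl_rep2", "stress_rep1", "stress_rep2"],
)

# Average replicates, then compare the two conditions
mean_activity = pd.DataFrame({
    "ctrl": A[["ctrl_rep1", "ctrl_rep2"]].mean(axis=1),
    "stress": A[["stress_rep1", "stress_rep2"]].mean(axis=1),
})
diff = mean_activity["stress"] - mean_activity["ctrl"]

# A handful of differentially activated iModulons instead of thousands of genes
threshold = 5  # illustrative cutoff on the change in activity
print(diff[diff.abs() > threshold])
```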

iModulons have many properties that lend themselves to knowledge generation from large datasets. In terms of accuracy, ICA outcompeted 42 other regulatory module detection algorithms, including WGCNA and biclustering algorithms, in detecting known regulons across E. coli, yeast, and human transcriptomics data [24]. The FastICA algorithm [25] is also relatively fast compared to deep learning approaches, computing components in a matter of minutes rather than hours [26]. Unlike many other decompositions, the independent components themselves are conserved across different datasets [27,28], batches [29], and dimensionalities within the same dataset [26,30]. This property enables us to use precomputed iModulon structures to interpret new transcriptomic datasets [9]. Altogether, these properties make ICA and iModulons a powerful tool for interpreting the vast ocean of publicly available transcriptomic data and advancing our understanding of transcriptome organization.
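One simple way to exploit this conservation is sketched below with placeholder matrices: a new sample's centered expression profile is fit against a precomputed gene-weight matrix M by least squares to infer activities for the known iModulons. The projection shown is a generic illustration, not necessarily the exact procedure used in the cited studies.

```python
import numpy as np

rng = np.random.default_rng(42)
n_genes, n_imodulons, n_new_samples = 4000, 60, 5

# Precomputed iModulon structure: M maps activities to expression (X is approximately M @ A)
M = rng.normal(size=(n_genes, n_imodulons))        # placeholder gene-weight matrix
X_new = rng.normal(size=(n_genes, n_new_samples))  # placeholder centered log-TPM of new samples

# Infer activities of the known iModulons in the new samples by least squares
A_new, *_ = np.linalg.lstsq(M, X_new, rcond=None)
print(A_new.shape)  # (n_imodulons, n_new_samples)
```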
We have outlined a five-step workflow (Figure 1a) that enables researchers to build and characterize the iModulon structure for any microbe with sufficient public data. The first two steps are to download and process all publicly available RNA-seq data for a given organism. Third, the data must be inspected to ensure quality control, and curated to include all appropriate metadata. Next, ICA can be applied to the high-quality compendium to produce independent components. Finally, the independent components are processed into iModulons and can subsequently be characterized. To facilitate iModulon characterization, interpretation, and visualization, we present PyModulon, a Python library for iModulon analysis (https://pymodulon.readthedocs.io/en/latest/). We have made the entire workflow available on GitHub (https://github.com/avsastry/modulome-workflow), and will be disseminating iModulons through our interactive website (https://imodulondb.org/).
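A minimal usage sketch for PyModulon is given below. The import path, constructor arguments, and method names follow the PyModulon documentation to the best of our reading, but the file names are hypothetical and the snippet should be checked against the documentation linked above.

```python
import pandas as pd
from pymodulon.core import IcaData  # see https://pymodulon.readthedocs.io/en/latest/

# Hypothetical input files: gene weights (genes x iModulons), activities (iModulons x samples),
# and a table of known regulator-gene interactions
M = pd.read_csv("M.csv", index_col=0)
A = pd.read_csv("A.csv", index_col=0)
trn = pd.read_csv("trn.csv")

ica_data = IcaData(M, A, trn=trn)

# Compare each iModulon against known regulons to propose regulators
# (method name per the PyModulon docs; verify before use)
ica_data.compute_trn_enrichment()
print(ica_data.imodulon_table.head())
```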

Figure 1: Overview of public B. subtilis RNA-seq data. a) Graphical representation of the five-step workflow.
b) Pie chart illustrating the quality control process. Numbers at the beginning of arrows represent the
number of datasets before the quality control step, and numbers at the end represent the number of passed
datasets after the step. c) Number of high-quality RNA-seq datasets for B. subtilis in NCBI SRA over time.
d) Scatter plot of the top two principal components of the B. subtilis expression compendium. Points are
colored based on the growth phase parsed from the literature. e) Bar chart showing the expression of dnaA
across four projects. Points show individual replicates, while bars show the average expression for a given
condition. Bars with a red star serve as the reference condition for the project. The legend describes the
bars in the spore_genes project.

Results
Here, we demonstrate how to build the iModulon structure of Bacillus subtilis from publicly available RNA-seq datasets using five steps (Figure 1a). All code to reproduce this pipeline is available at https://github.com/avsastry/modulome-workflow/. Since this process results in the totality of iModulons that can currently be computed for an organism, we have named the resulting database the "B. subtilis Modulome".
Steps 1 and 2: Compile and process all publicly available RNA-seq datasets for B. subtilis
Using Entrez Direct [31], we have created a script that compiles the metadata for all publicly available RNA-seq data for a given organism in NCBI SRA (https://github.com/avsastry/modulome-workflow/tree/main/1_download_metadata). Since the iModulon structure improves with the number of datasets, we recommend that at least 50 unique conditions be available for an organism before proceeding with the remainder of the pipeline. As of August 2020, we identified 718 datasets labelled as Bacillus subtilis RNA-seq data.
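The sketch below illustrates this compilation step by calling Entrez Direct (esearch piped into efetch) from Python and counting unique BioSamples as a rough proxy for unique conditions; the query string and the proxy are assumptions, and the authors' actual script lives in the linked repository.

```python
import io
import subprocess
import pandas as pd

# Assumed SRA query; the authors' script is in 1_download_metadata of the linked repository
query = '"Bacillus subtilis"[Organism] AND "rna seq"[Strategy]'
cmd = f"esearch -db sra -query '{query}' | efetch -format runinfo"
runinfo = subprocess.run(cmd, shell=True, capture_output=True, text=True, check=True).stdout

metadata = pd.read_csv(io.StringIO(runinfo))
print(f"{len(metadata)} SRA runs found")

# Rough readiness check: at least ~50 unique conditions, using BioSample as a proxy
print(f"{metadata['BioSample'].nunique()} unique BioSamples")
```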
Although iModulons can be computed from both microarray and RNA-seq datasets, microarray datasets tend to produce more uncharacterized iModulons, induce stronger batch effects through platform heterogeneity, and have been largely superseded by RNA-seq in current publications [27]. For these reasons, we have designed the first two steps specifically for compiling and processing RNA-seq data.
The B. subtilis datasets were subsequently processed using the RNA-seq pipeline available at
https://github.com/avsastry/modulome-workflow/tree/main/2_process_data (Figure S1). Ten
datasets failed to complete the processing pipeline, resulting in expression counts for 708
datasets.
Step 3: Quality control, metadata curation, and normalization
The B. subtilis compendium was subjected to five quality control criteria (Figure 1b, https://github.com/avsastry/modulome-workflow/tree/main/3_quality_control). The final high-quality B. subtilis compendium contained 265 RNA-seq datasets (Figure 1c). As part of the quality control procedure, manual curation of experimental metadata was performed to identify which samples were replicates. Inaccurate or insufficient metadata reporting can hinder the widespread utilization of public data and potentially prevent subsequent interpretation [2,32].
Therefore, we inspected the literature to identify the strain and media used in each study, any additional treatments or temperature changes, and the growth stage (e.g., mid-exponential, stationary, biofilm), if reported. During curation, we removed some non-traditional RNA-seq datasets, such as TermSeq or RiboSeq.
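The toy example below illustrates two operations from this step: dropping samples that failed QC or lack curated condition labels, and centering each project's log-TPM values on its reference condition (cf. Figure 1e). The column names, flags, and values are illustrative and do not reproduce the exact criteria used here.

```python
import pandas as pd

# Toy metadata and log-TPM tables; columns, flags, and values are illustrative
metadata = pd.DataFrame({
    "project":             ["projA", "projA", "projA", "projB", "projB"],
    "condition":           ["wt_glc", "wt_glc", "heat",  "wt",    None],
    "reference_condition": [True,     True,     False,   True,    False],
    "passed_fastqc":       [True,     True,     True,    True,    False],
}, index=["s1", "s2", "s3", "s4", "s5"])

log_tpm = pd.DataFrame(
    [[5.0, 5.2, 7.9, 4.0, 4.1],
     [2.0, 2.1, 2.0, 3.5, 3.6]],
    index=["geneA", "geneB"], columns=metadata.index)

# 1) Drop samples that failed QC or lack curated condition metadata
passed = metadata[metadata["passed_fastqc"] & metadata["condition"].notna()]

# 2) Center each project's expression on its reference condition (cf. Figure 1e)
centered = []
for project, group in passed.groupby("project"):
    ref_samples = group[group["reference_condition"]].index
    ref_mean = log_tpm[ref_samples].mean(axis=1)
    centered.append(log_tpm[group.index].sub(ref_mean, axis=0))
X = pd.concat(centered, axis=1)
print(X.round(2))
```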

References

Scikit-learn: Machine Learning in Python
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.

Ultrafast and memory-efficient alignment of short DNA sequences to the human genome
Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches and can be used simultaneously to achieve even greater alignment speeds.

A density-based algorithm for discovering clusters in large spatial databases with noise
DBSCAN, a new clustering algorithm relying on a density-based notion of clusters and designed to discover clusters of arbitrary shape, is presented; it requires only one input parameter and supports the user in determining an appropriate value for it.

featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features
featureCounts is a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments; it implements highly efficient chromosome hashing and feature blocking techniques.

Independent component analysis, a new concept? (Pierre Comon, 1994)
An efficient algorithm is proposed that allows the computation of the ICA of a data matrix within polynomial time and may be seen as an extension of principal component analysis (PCA).
Frequently Asked Questions (17)
Q1. What contributions have the authors mentioned in the paper "Mining all publicly available expression data to compute dynamic microbial transcriptional regulatory networks" ?

Here, the authors introduce a workflow that converts all public gene expression data for a microbe into a dynamic representation of the organism's transcriptional regulatory network. This five-step process walks researchers through the mining, processing, curation, analysis, and characterization of all available expression data, using Bacillus subtilis as an example.

Hierarchical clustering was used to identify samples that did not conform to a typical expression profile, as these samples often use non-standard library preparation methods, such as ribosome sequencing and 3’ or 5’ end sequencing 3. 

To guarantee a high-quality expression dataset for B. subtilis, data that failed any of the following four FASTQC metrics were discarded: per base sequence quality, per sequence quality scores, per base N content, and adapter content.

Revealing 29 sets of independently modulated genes in Staphylococcus aureus, their regulators, and role in key physiological response.

iModulon motifs were compared to known motifs using the compare_motifs function in PyModulon, which is a wrapper for TOMTOM 63 using an E-value of 0.001. 

Since the number of dimensions selected in ICA can alter the results, the authors applied the above procedure to the B. subtilis dataset multiple times, ranging the number of dimensions from 10 to 260 (i.e., the approximate size of the dataset) with a step size of 10. 

Information including the strain description, base media, carbon source, treatments, and temperature were pulled from the literature. 

iModulon enrichments against known regulons were computed using Fisher's Exact Test, with the false discovery rate (FDR) controlled at 10⁻⁵ using the Benjamini-Hochberg correction. 
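A minimal sketch of such an enrichment test, using SciPy's Fisher's exact test and the Benjamini-Hochberg correction from statsmodels on invented toy gene sets:

```python
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

all_genes = {f"g{i}" for i in range(4000)}        # background gene set (toy size)
imodulon = {f"g{i}" for i in range(40)}           # genes in one iModulon (toy)
regulons = {
    "RegA": {f"g{i}" for i in range(35)},         # mostly overlaps the iModulon
    "RegB": {f"g{i}" for i in range(2000, 2050)}, # unrelated gene set
}

pvals = []
for name, regulon in regulons.items():
    overlap = len(imodulon & regulon)
    table = [[overlap, len(imodulon - regulon)],
             [len(regulon - imodulon), len(all_genes - imodulon - regulon)]]
    _, p = fisher_exact(table, alternative="greater")
    pvals.append(p)

# Benjamini-Hochberg correction with the FDR controlled at 1e-5, as in the text
reject, qvals, *_ = multipletests(pvals, alpha=1e-5, method="fdr_bh")
for (name, _), q, r in zip(regulons.items(), qvals, reject):
    print(name, f"q = {q:.2e}", "enriched" if r else "not enriched")
```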

The resulting independent components (ICs) were clustered using DBSCAN 56 to identify robust ICs, with an epsilon of 0.1 and a minimum cluster seed size of 50. 
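A scaled-down sketch of this robust-ICA step is shown below: FastICA is run several times with different seeds and the pooled components are clustered with DBSCAN. The run count, component number, and minimum cluster size are reduced so the toy example runs quickly (the study uses 100 runs, an epsilon of 0.1, and a minimum cluster size of 50), and the 1 - |correlation| distance between components is an assumption about how components are compared.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import FastICA

# Toy expression matrix containing three hidden signals (genes x samples)
rng = np.random.default_rng(0)
S_true = rng.laplace(size=(200, 3))                   # toy gene weights
A_true = rng.normal(size=(3, 40))                     # toy activities
X = S_true @ A_true + 0.1 * rng.normal(size=(200, 40))

# Repeated ICA runs with different random seeds (the study: 100 runs, tol = 1e-7)
components = []
for seed in range(10):
    S = FastICA(n_components=3, random_state=seed,
                tol=1e-7, max_iter=1000).fit_transform(X)
    components.append(S.T)                            # rows = components over genes
components = np.vstack(components)

# Cluster the pooled components; distance = 1 - |Pearson r| since IC sign/scale is arbitrary
dist = np.clip(1 - np.abs(np.corrcoef(components)), 0, None)
np.fill_diagonal(dist, 0)
labels = DBSCAN(eps=0.1, min_samples=5, metric="precomputed").fit_predict(dist)
print(f"{len(set(labels) - {-1})} robust component clusters found")
```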

12. Karczewski, K. J., Snyder, M., Altman, R. B. & Tatonetti, N. P. Coherent functional modules improve transcription factor target identification, cooperativity prediction, and disease association. 

Additional functions for gene set enrichment analysis are located in the enrichment module, including a generalized gene set enrichment function and an implementation of the Benjamini-Hochberg false discovery rate (FDR). 

The scikit-learn 54 implementation of K-means clustering, using three clusters, can be applied to the absolute values of the gene weights in each independent component. 
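A sketch of this K-means-based thresholding applied to the absolute gene weights of a single component; the weight distribution and the rule for discarding the near-zero cluster are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy absolute gene weights for one independent component
rng = np.random.default_rng(1)
weights = np.concatenate([rng.normal(0, 0.02, 950),    # background genes
                          rng.normal(0.4, 0.05, 50)])  # strongly weighted genes
abs_w = np.abs(weights).reshape(-1, 1)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(abs_w)

# Treat the cluster whose centroid is closest to zero as background; keep the rest
centroids = [abs_w[labels == k].mean() for k in range(3)]
background = int(np.argmin(centroids))
members = np.where(labels != background)[0]
print(f"{members.size} genes selected for this component")
```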

To compute the optimal independent components, an extension of ICA was performed on the RNA-seq dataset as described in McConn et al. 

Liao, Y., Smyth, G. K. & Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. 

In order to identify the most significant genes in each component, the authors iteratively removed genes with the largest absolute value and computed the D’Agostino K2 test statistic 57 for the resulting distribution. 
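This iteration can be sketched with SciPy's normaltest, which returns the D'Agostino K2 statistic; the K2 cutoff below is a placeholder, whereas the actual threshold is tuned against known regulons:

```python
import numpy as np
from scipy.stats import normaltest

# Toy gene weights: a near-normal background plus a set of outlier genes
rng = np.random.default_rng(2)
weights = np.concatenate([rng.normal(0, 0.02, 950),
                          rng.normal(0.5, 0.05, 50)])

order = np.argsort(-np.abs(weights))   # gene indices, largest |weight| first
cutoff = 550                           # placeholder K2 cutoff

n_removed = 0
for n_removed in range(len(weights) - 20):
    k2, _ = normaltest(weights[order[n_removed:]])  # D'Agostino K2 of the remaining weights
    if k2 < cutoff:
        break

imodulon_genes = order[:n_removed]     # the removed, most significant genes
print(f"{imodulon_genes.size} genes above the K2-based threshold")
```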

Global iModulon activity clustering was performed using the clustermap function in the Python Seaborn package 69 with the distance metric $d_{x,y} = 1 - |\rho_{x,y}|$, where $|\rho_{x,y}|$ is the absolute value of the Spearman R correlation between two iModulon activity profiles. 
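A sketch of this clustering on a toy activity matrix, passing a linkage built from the 1 - |rho| distance to seaborn's clustermap:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(3)
A = rng.normal(size=(12, 30))            # toy activities: iModulons x conditions

rho, _ = spearmanr(A, axis=1)            # pairwise Spearman correlation between rows
dist = 1 - np.abs(rho)                   # d_{x,y} = 1 - |rho_{x,y}|
np.fill_diagonal(dist, 0)

row_linkage = linkage(squareform(dist, checks=False), method="average")
sns.clustermap(A, row_linkage=row_linkage, col_cluster=False)
plt.show()
```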

The scikit-learn (v0.23.2) 54 implementation of FastICA 55 was executed 100 times with random seeds and a convergence tolerance of 10⁻⁷. 

Trending Questions (1)
How to construct a gene regulatory network from omics data of bacteria?

The paper provides a five-step workflow to construct a gene regulatory network from publicly available gene expression data for a microbe, using Bacillus subtilis as an example.