Mining all publicly available expression data to compute dynamic microbial transcriptional regulatory networks
Citations
Machine learning from Pseudomonas aeruginosa transcriptomes identifies independently modulated sets of genes associated with known transcriptional regulators
Machine Learning of All Mycobacterium tuberculosis H37Rv RNA-seq Data Reveals a Structured Interplay between Metabolism, Stress Response, and Infection
Machine Learning Uncovers a Data-Driven Transcriptional Regulatory Network for the Crenarchaeal Thermoacidophile Sulfolobus acidocaldarius.
Machine Learning of Pseudomonas aeruginosa transcriptomes identifies independently modulated sets of genes associated with known transcriptional regulators
Advanced transcriptomic analysis reveals the role of efflux pumps and media composition in antibiotic responses of Pseudomonas aeruginosa
References
Scikit-learn: Machine Learning in Python
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome
A density-based algorithm for discovering clusters in large spatial databases with noise
featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features
Independent component analysis, a new concept?
Frequently Asked Questions (17)
Q2. What were the main criteria for identifying samples that did not conform to the typical expression profile?
Hierarchical clustering was used to identify samples that did not conform to a typical expression profile, as these samples often use non-standard library preparation methods, such as ribosome sequencing and 3’ or 5’ end sequencing 3.
Q3. What were the four metrics used to evaluate the quality of the B. subtilis expression data?
To guarantee a high-quality expression dataset for B. subtilis, data that failed any of the following four FastQC metrics were discarded: per base sequence quality, per sequence quality scores, per base N content, and adapter content.
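The filter described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each sample has a FastQC `summary.txt` whose lines are tab-separated as status, module name, file name, and it treats only a `FAIL` status as failing. The function name `passes_fastqc` is hypothetical.

```python
# The four FastQC modules named in the answer, lower-cased for matching.
REQUIRED_MODULES = {
    "per base sequence quality",
    "per sequence quality scores",
    "per base n content",
    "adapter content",
}

def passes_fastqc(summary_text):
    """Return False if any of the four required modules is marked FAIL."""
    for line in summary_text.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        status = parts[0].strip()
        module = parts[1].strip().lower()
        if module in REQUIRED_MODULES and status == "FAIL":
            return False
    return True

# Made-up summary text for illustration only.
good = "PASS\tPer base sequence quality\ts1.fastq\nWARN\tAdapter Content\ts1.fastq\n"
bad = "PASS\tPer base sequence quality\ts2.fastq\nFAIL\tAdapter Content\ts2.fastq\n"
print(passes_fastqc(good), passes_fastqc(bad))
```

Whether a `WARN` status should also lead to exclusion is not stated in the answer; the sketch keeps such samples.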
Q4. What did the cited Staphylococcus aureus study reveal?
Revealing 29 sets of independently modulated genes in Staphylococcus aureus, their regulators, and role in key physiological response.
Q5. How were iModulon motifs compared to known motifs?
iModulon motifs were compared to known motifs using the compare_motifs function in PyModulon, which is a wrapper for TOMTOM 63 using an E-value of 0.001.
Q6. How many dimensions were used to determine the optimal independent components?
Since the number of dimensions selected in ICA can alter the results, the authors applied the above procedure to the B. subtilis dataset multiple times, ranging the number of dimensions from 10 to 260 (i.e., the approximate size of the dataset) with a step size of 10.
Q7. What information was pulled from the literature?
Information including the strain description, base media, carbon source, treatments, and temperature were pulled from the literature.
Q8. How was the false discovery rate calculated?
iModulon enrichments against known regulons were computed using Fisher’s Exact Test, with the false discovery rate (FDR) controlled at 10⁻⁵ using the Benjamini-Hochberg correction.
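As a rough illustration of this procedure, the sketch below computes a Fisher's exact test p-value for the overlap between an iModulon and a regulon, then applies the Benjamini-Hochberg step-up rule at an FDR of 10⁻⁵. It assumes a one-sided test (over-representation), which the answer does not specify; the function names and the toy gene sets are hypothetical and are not the PyModulon API.

```python
import numpy as np
from scipy.stats import fisher_exact

def enrichment_pvalue(imodulon_genes, regulon_genes, all_genes):
    """One-sided Fisher's exact test for the overlap of two gene sets."""
    imod, reg, universe = set(imodulon_genes), set(regulon_genes), set(all_genes)
    both = len(imod & reg)
    only_imod = len(imod - reg)
    only_reg = len(reg - imod)
    neither = len(universe) - both - only_imod - only_reg
    table = [[both, only_imod], [only_reg, neither]]
    _, p = fisher_exact(table, alternative="greater")
    return p

def benjamini_hochberg(pvalues, fdr=1e-5):
    """Boolean mask of p-values accepted by the BH step-up procedure."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = fdr * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    mask = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest rank meeting its threshold
        mask[order[: k + 1]] = True
    return mask

# Toy example: 4000 genes, one 30-gene iModulon, two candidate regulons.
genes = [f"g{i}" for i in range(4000)]
imodulon = genes[:30]
regulon_a = genes[:25] + genes[100:105]   # strong overlap
regulon_b = genes[2000:2030]              # no overlap
pvals = [enrichment_pvalue(imodulon, r, genes) for r in (regulon_a, regulon_b)]
significant = benjamini_hochberg(pvals, fdr=1e-5)
```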
Q9. How were robust independent components identified for B. subtilis?
The resulting independent components (ICs) were clustered using DBSCAN 56 to identify robust ICs, using an epsilon of 0.1 and minimum cluster seed size of 50.
Q10. What is the role of coherent functional modules in breast cancer?
12. Karczewski, K. J., Snyder, M., Altman, R. B. & Tatonetti, N. P. Coherent functional modules improve transcription factor target identification, cooperativity prediction, and disease association.
Q11. What functions are located in the enrichment module?
Additional functions for gene set enrichment analysis are located in the enrichment module, including a generalized gene set enrichment function and an implementation of the Benjamini-Hochberg false discovery rate (FDR).
Q12. How can K-means clustering be applied to the gene weights?
The scikit-learn 54 implementation of K-means clustering, using three clusters, can be applied to the absolute values of the gene weights in each independent component.
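A minimal sketch of this idea, using scikit-learn's `KMeans` on the absolute weights of a single component. The answer does not say how the three clusters are turned into a gene set, so the sketch assumes that the cluster with the smallest centre is background and that genes in the other two clusters are kept. The function name `kmeans_gene_mask` and the synthetic weights are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_gene_mask(weights, n_clusters=3, seed=0):
    """Cluster |weights| into n_clusters; flag genes outside the lowest cluster."""
    abs_w = np.abs(np.asarray(weights, dtype=float)).reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(abs_w)
    # Assumption: the cluster with the smallest centre holds background genes.
    background = np.argmin(km.cluster_centers_.ravel())
    return km.labels_ != background

# Synthetic component: 1000 near-zero weights plus 10 clearly weighted genes.
rng = np.random.default_rng(0)
weights = np.concatenate([
    rng.normal(0.0, 0.01, size=1000),
    np.full(5, 0.3),
    np.full(5, -0.6),
])
mask = kmeans_gene_mask(weights)
```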
Q13. How were the optimal independent components computed?
To compute the optimal independent components, an extension of ICA was performed on the RNA-seq dataset as described in McConn et al.
Q14. What is the nf-core framework for a genome-wide quality control program?
Y., Smyth, G. K. & Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features.
Q15. How did the authors determine the significant genes in each component?
In order to identify the most significant genes in each component, the authors iteratively removed genes with the largest absolute value and computed the D’Agostino K2 test statistic 57 for the resulting distribution.
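The iterative procedure can be sketched with `scipy.stats.normaltest`, which returns the D'Agostino K² statistic. The answer does not state when the iteration stops, so the sketch assumes it stops once K² falls below a user-supplied `cutoff`; that parameter, its value, and the function name are hypothetical.

```python
import numpy as np
from scipy.stats import normaltest

def significant_genes(weights, cutoff, min_remaining=20):
    """Remove genes by decreasing |weight| until the rest looks Gaussian.

    Returns the indices of the removed (significant) genes.
    """
    w = np.asarray(weights, dtype=float)
    order = np.argsort(-np.abs(w))  # largest absolute weight first
    n_removed = 0
    # Keep at least min_remaining genes so the test stays well defined.
    while len(w) - n_removed > min_remaining:
        k2, _ = normaltest(w[order[n_removed:]])
        if k2 < cutoff:  # assumed stopping rule
            break
        n_removed += 1
    return order[:n_removed]

# Synthetic component: Gaussian background plus 8 strongly weighted genes.
rng = np.random.default_rng(1)
weights = np.concatenate([rng.normal(0.0, 0.01, size=2000),
                          [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]])
selected = significant_genes(weights, cutoff=50)
```

With this toy input, a single remaining outlier keeps K² far above the cutoff, so the loop only stops once all eight planted genes are gone.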
Q16. What distance metric was used to cluster iModulon activities?
Global iModulon activity clustering was performed using the clustermap function in the Python Seaborn package 69 using the following distance metric: $d_{x,y} = 1 - |\rho_{x,y}|$, where $|\rho_{x,y}|$ is the absolute value of the Spearman R correlation between two iModulon activity profiles.
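The distance metric can be computed directly with NumPy and SciPy, as in the sketch below. It assumes an activity matrix with iModulons as rows and samples as columns; the function name, the toy matrix, and the choice of average linkage are assumptions, since the answer does not name a linkage method.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform
from scipy.stats import rankdata

def activity_distance(activities):
    """Pairwise 1 - |Spearman rho| between rows (iModulon activity profiles)."""
    ranks = rankdata(np.asarray(activities, dtype=float), axis=1)
    rho = np.corrcoef(ranks)  # Pearson on ranks equals Spearman
    dist = 1.0 - np.abs(rho)
    np.fill_diagonal(dist, 0.0)  # remove floating-point residue
    return dist

# Toy activities: rows 0 and 1 are perfectly anti-correlated.
x = np.arange(10, dtype=float)
rng = np.random.default_rng(0)
A = np.vstack([x, -x, rng.permutation(x)])
D = activity_distance(A)

# Condensed form for SciPy; average linkage is an assumed choice.
Z = linkage(squareform(D, checks=False), method="average")
```

A precomputed linkage such as `Z` can be handed to `seaborn.clustermap` through its `row_linkage` and `col_linkage` arguments. Because the metric uses the absolute correlation, anti-correlated profiles are treated as close.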
Q17. How many times was the implementation of fastICA executed?
The scikit-learn (v0.23.2) 54 implementation of FastICA 55 was executed 100 times with random seeds and a convergence tolerance of 10⁻⁷.
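The repeated FastICA runs and the DBSCAN step mentioned in Q9 can be combined in a short sketch. The defaults mirror the reported values (100 runs, tolerance 10⁻⁷, epsilon 0.1, minimum cluster seed size 50), but several details are assumptions: the expression matrix is taken as genes by samples, components are compared with the distance 1 minus absolute cosine similarity, and each cluster is summarised by a sign-aligned mean. The function name and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import FastICA

def robust_components(X, n_components, n_runs=100, eps=0.1,
                      min_samples=50, tol=1e-7):
    """Run FastICA with different seeds and keep components that recur."""
    runs = []
    for seed in range(n_runs):
        ica = FastICA(n_components=n_components, tol=tol,
                      max_iter=1000, random_state=seed)
        runs.append(ica.fit_transform(X))  # genes x components
    S = np.hstack(runs)
    S = S / np.linalg.norm(S, axis=0)  # unit-length columns
    # Assumed distance: 1 - |cosine similarity|, so sign flips are ignored.
    dist = np.clip(1.0 - np.abs(S.T @ S), 0.0, 1.0)
    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric="precomputed").fit_predict(dist)
    centroids = []
    for lab in sorted(set(labels) - {-1}):  # -1 marks DBSCAN noise
        members = S[:, labels == lab]
        signs = np.sign(members.T @ members[:, 0])  # align to first member
        centroids.append((members * signs).mean(axis=1))
    if not centroids:
        return np.empty((X.shape[0], 0))
    return np.column_stack(centroids)

# Small synthetic demo; far fewer runs than the 100 reported, to keep it quick.
rng = np.random.default_rng(0)
sources = rng.laplace(size=(500, 3))
mixing = rng.normal(size=(3, 40))
X = sources @ mixing + 0.01 * rng.normal(size=(500, 40))
M = robust_components(X, n_components=3, n_runs=5, min_samples=3)
```

FastICA may emit convergence warnings at such a tight tolerance; the sketch leaves them visible rather than suppressing them.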