Journal ArticleDOI

Improving gene function predictions using independent transcriptional components

TL;DR: In this article, the authors use a consensus independent component analysis and guilt-by-association approach to predict over 23,000 functional groups comprised of over 55,000 coding and non-coding transcripts using publicly available transcriptomic profiles.
Abstract: The interpretation of high throughput sequencing data is limited by our incomplete functional understanding of coding and non-coding transcripts. Reliably predicting the function of such transcripts can overcome this limitation. Here we report the use of a consensus independent component analysis and guilt-by-association approach to predict over 23,000 functional groups comprised of over 55,000 coding and non-coding transcripts using publicly available transcriptomic profiles. We show that, compared to using Principal Component Analysis, Independent Component Analysis-derived transcriptional components enable more confident functionality predictions, improve predictions when new members are added to the gene sets, and are less affected by gene multi-functionality. Predictions generated using human or mouse transcriptomic data are made available for exploration in a publicly available web portal. Our understanding of the function of many transcripts is still incomplete, limiting the interpretability of transcriptomic data. Here the authors use consensus-independent component analysis, together with a guilt-by-association approach, to improve the prediction of gene function.
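
The guilt-by-association step named in the abstract can be illustrated with a minimal sketch. It assumes that gene loadings on the transcriptional components have already been computed (for example by an ICA run, which is not shown), and it uses a simple mean-difference profile as the gene-set statistic. The function names, the toy gene names, and the numbers are all hypothetical, and this is not claimed to be the authors' exact procedure.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def gene_set_profile(loadings, members):
    """Per component: mean loading of member genes minus mean of non-members.

    This mean difference is an assumed, simplified gene-set statistic.
    """
    n_comp = len(next(iter(loadings.values())))
    inside = [v for g, v in loadings.items() if g in members]
    outside = [v for g, v in loadings.items() if g not in members]
    return [
        sum(v[k] for v in inside) / len(inside)
        - sum(v[k] for v in outside) / len(outside)
        for k in range(n_comp)
    ]


def predict_members(loadings, members):
    """Rank non-member genes by correlation with the gene-set profile."""
    profile = gene_set_profile(loadings, members)
    scores = {
        g: pearson(v, profile) for g, v in loadings.items() if g not in members
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy loadings: hypothetical genes on 3 components (invented numbers).
loadings = {
    "geneA": [2.0, 0.1, -0.2],
    "geneB": [1.8, -0.1, 0.0],
    "geneC": [1.9, 0.0, -0.1],
    "geneD": [-0.1, 1.5, 0.2],
    "geneE": [0.0, -0.2, 1.7],
}
ranking = predict_members(loadings, members={"geneA", "geneB"})
for gene, score in ranking:
    print(f"{gene}\t{score:.3f}")
```

In this toy example geneC, whose loadings resemble those of the known members, ranks first, which is the behaviour guilt-by-association relies on: genes that behave like known members of a functional group are predicted to share its function.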


Citations
Posted ContentDOI
02 Jul 2021-bioRxiv
TL;DR: In this paper, the authors introduce a workflow that converts all public gene expression data for a microbe into a dynamic representation of the organism's transcriptional regulatory network, which can be used to predict new regulons and analyze datasets in the context of all published data.
Abstract: We are firmly in the era of biological big data. Millions of omics datasets are publicly accessible and can be employed to support scientific research or build a holistic view of an organism. Here, we introduce a workflow that converts all public gene expression data for a microbe into a dynamic representation of the organism’s transcriptional regulatory network. This five-step process walks researchers through the mining, processing, curation, analysis, and characterization of all available expression data, using Bacillus subtilis as an example. The resulting reconstruction of the B. subtilis regulatory network can be leveraged to predict new regulons and analyze datasets in the context of all published data. The results are hosted at https://imodulondb.org/, and additional analyses can be performed using the PyModulon Python package. As the number of publicly available datasets increases, this pipeline will be applicable to a wide range of microbial pathogens and cell factories.

20 citations

Posted ContentDOI
11 Apr 2021-bioRxiv
TL;DR: PRECISE 2.0 is a large-scale, high-quality transcriptomics dataset, nearly three times the size of the original PRECISE dataset for Escherichia coli K-12 MG1655 (278 RNA-seq datasets), which may be analyzed at multiple scales to yield important biological insights.
Abstract: Uncovering the structure of the transcriptional regulatory network (TRN) that modulates gene expression in prokaryotes remains an important challenge. Transcriptomics data is plentiful, necessitating the development of scalable methods for converting this data into useful knowledge about the TRN. Previously, we published the PRECISE dataset for Escherichia coli K-12 MG1655, containing 278 RNA-seq datasets created using a standardized protocol. Here, we present PRECISE 2.0, which is nearly three times the size of the original PRECISE dataset and also created using a standardized protocol. We analyze PRECISE 2.0 at multiple scales, demonstrating multiple analytical strategies for extracting knowledge from this dataset. Specifically, we: (1) highlight patterns in gene expression across the dataset; (2) utilize independent component analysis to extract 218 independently modulated groups of genes (iModulons) that describe the TRN at the systems level; (3) demonstrate the utility of iModulons over traditional differential expression analysis; and (4) uncover 6 new potential regulons. Thus, PRECISE 2.0 is a large-scale, high-quality transcriptomics dataset which may be analyzed at multiple scales to yield important biological insights.

19 citations

Journal ArticleDOI
29 Jan 2022-Oncogene
TL;DR: In this article, the authors describe a gene expression signature of oncogene-induced replication stress, which characterizes many aggressive cancers. The signature was derived in part from triple-negative breast cancer (TNBC) and non-transformed cell lines engineered to overexpress CDC25A, CCNE1 or MYC.
Abstract: Oncogene-induced replication stress characterizes many aggressive cancers. Several treatments are being developed that target replication stress; however, identification of tumors with high levels of replication stress remains challenging. We describe a gene expression signature of oncogene-induced replication stress. A panel of triple-negative breast cancer (TNBC) and non-transformed cell lines were engineered to overexpress CDC25A, CCNE1 or MYC, which resulted in slower replication kinetics. RNA sequencing analysis revealed a set of 52 commonly upregulated genes. In parallel, mRNA expression analysis of patient-derived tumor samples (TCGA, n = 10,592) also revealed differential gene expression in tumors with amplification of oncogenes that trigger replication stress (CDC25A, CCNE1, MYC, CCND1, MYB, MOS, KRAS, ERBB2, and E2F1). Upon integration, we identified a six-gene signature of oncogene-induced replication stress (NAT10, DDX27, ZNF48, C8ORF33, MOCS3, and MPP6). Immunohistochemical analysis of NAT10 in breast cancer samples (n = 330) showed strong correlation with expression of phospho-RPA (R = 0.451, p = 1.82 × 10^−20) and γH2AX (R = 0.304, p = 2.95 × 10^−9). Finally, we applied our oncogene-induced replication stress signature to patient samples from TCGA (n = 8,862) and GEO (n = 13,912) to define the levels of replication stress across 27 tumor subtypes, identifying diffuse large B cell lymphoma, ovarian cancer, TNBC and colorectal carcinoma as cancer subtypes with high levels of oncogene-induced replication stress.

9 citations

Journal ArticleDOI
TL;DR: In this review, two additional biological omics approaches are incorporated into the molecular diagnostic process of neuromuscular diseases to address cases in which the causative genetic variant cannot be definitely established or only variants of uncertain significance are found.
Abstract: The diagnosis of neuromuscular diseases (NMDs) has been progressively evolving from the grouping of clinical symptoms and signs towards the molecular definition. Optimal clinical, biochemical, electrophysiological, and histopathological characterization is very helpful to achieve molecular diagnosis, which is essential for establishing prognosis, treatment and genetic counselling. Currently, the genetic approach includes both the gene-targeted analysis in specific clinically recognizable diseases, as well as genomic analysis based on next-generation sequencing, analyzing either the clinical exome/genome or the whole exome or genome. However, as of today, there are still many patients in whom the causative genetic variant cannot be definitely established and variants of uncertain significance are often found. In this review, we address these drawbacks by incorporating two additional biological omics approaches into the molecular diagnostic process of NMDs. First, functional genomics by introducing experimental cell and molecular biology to analyze and validate the variant for its biological effect in an in-house translational diagnostic program, and second, incorporating a multi-omics approach including RNA-seq, metabolomics, and proteomics in the molecular diagnosis of neuromuscular disease. Both translational diagnostics programs and omics are being implemented as part of the diagnostic process in academic centers and referral hospitals and, therefore, an increase in the proportion of neuromuscular patients with a molecular diagnosis is expected. This improvement in the process and diagnostic performance of patients will allow solving aspects of their health problems in a precise way and will allow them and their families to take a step forward in their lives.

6 citations

Journal ArticleDOI
TL;DR: In this article, the authors investigated the potential pathogenic mechanism of glioma-related epilepsy (GRE) by analyzing the dynamic expression profiles of microRNA/mRNA/lncRNA in brain tissues.
Abstract: Background: Seizures are a common symptom in glioma patients, and they can cause brain dysfunction. However, the mechanism by which glioma-related epilepsy (GRE) causes alterations in brain networks remains elusive. Objective: To investigate the potential pathogenic mechanism of GRE by analyzing the dynamic expression profiles of microRNA/mRNA/lncRNA in brain tissues of glioma patients. Methods: Brain tissues of 16 patients with GRE and 9 patients with glioma without epilepsy (GNE) were collected. The total RNA was dephosphorylated, labeled, and hybridized to the Agilent Human miRNA Microarray, Release 19.0, 8 × 60 K. The cDNA was labeled and hybridized to the Agilent LncRNA + mRNA Human Gene Expression Microarray V3.0, 4 × 180 K. The raw data was extracted from hybridized images using Agilent Feature Extraction, and quantile normalization was performed using the Agilent GeneSpring. P-value < 0.05 and absolute fold change > 2 were considered the threshold of differential expression data. Data analyses were performed using R and Bioconductor. Results: We found that 3 differentially expressed miRNAs (miR-10a-5p, miR-10b-5p, miR-629-3p), 6 differentially expressed lncRNAs (TTN-AS1, LINC00641, SNHG14, LINC00894, SNHG1, OIP5-AS1), and 49 differentially expressed mRNAs play a critical role in developing GRE. The expression of GABARAPL1, GRAMD1B, and IQSEC3 was validated as more than twofold higher in the GRE group than in the GNE group in the validation cohort. Pathways including ECM receptor interaction and long-term potentiation (LTP) may contribute to the disease’s progression. We also built a lncRNA-microRNA-gene regulatory network with structural and functional significance. Conclusion: These findings can offer a fresh perspective on GRE-induced brain network changes.
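
The differential expression threshold stated in the Methods (P-value < 0.05 and absolute fold change > 2) can be written as a small filter. This is a sketch under assumptions: fold change is taken to be a positive GRE/GNE ratio, so "absolute fold change > 2" is read as more than 2-fold up or more than 2-fold down, and the probe names and values are invented for illustration. The study itself used R and Bioconductor, not this code.

```python
import math


def is_differentially_expressed(fold_change, p_value, fc_cutoff=2.0, p_cutoff=0.05):
    """Return True if p < p_cutoff and the change exceeds fc_cutoff in either direction.

    fold_change is assumed to be a positive ratio (e.g. GRE / GNE), so a
    value of 0.25 counts as a 4-fold decrease.
    """
    if fold_change <= 0:
        raise ValueError("fold_change must be a positive ratio")
    return p_value < p_cutoff and abs(math.log2(fold_change)) > math.log2(fc_cutoff)


# Hypothetical records: (probe name, fold change, p-value).
records = [
    ("probe_up", 3.2, 0.01),      # passes: >2-fold up, significant
    ("probe_down", 0.25, 0.003),  # passes: 4-fold down, significant
    ("probe_small", 1.4, 0.001),  # fails: change too small
    ("probe_noisy", 4.0, 0.20),   # fails: not significant
]
hits = [name for name, fc, p in records if is_differentially_expressed(fc, p)]
print(hits)
```

Working on the log2 scale treats increases and decreases symmetrically, which is why a ratio of 0.25 passes the same cutoff as a ratio of 4.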

4 citations

References
Journal ArticleDOI
TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html .

47,038 citations

Journal ArticleDOI
TL;DR: The Gene Set Enrichment Analysis (GSEA) method as discussed by the authors focuses on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation.
Abstract: Although genomewide RNA expression analysis has become a routine tool in biomedical research, extracting biological insight from such information remains a major challenge. Here, we describe a powerful analytical method called Gene Set Enrichment Analysis (GSEA) for interpreting gene expression data. The method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation. We demonstrate how GSEA yields insights into several cancer-related data sets, including leukemia and lung cancer. Notably, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer, GSEA reveals many biological pathways in common. The GSEA method is embodied in a freely available software package, together with an initial database of 1,325 biologically defined gene sets.

34,830 citations

Journal ArticleDOI
TL;DR: The authors show the double-slit interference effect in the strong-field ionization of neon dimers by employing COLTRIMS method to record the momentum distribution of the photoelectrons in the molecular frame.
Abstract: Wave-particle duality is an inherent peculiarity of the quantum world. The double-slit experiment has been frequently used for understanding different aspects of this fundamental concept. The occurrence of interference rests on the lack of which-way information and on the absence of decoherence mechanisms, which could scramble the wave fronts. Here, we report on the observation of two-center interference in the molecular-frame photoelectron momentum distribution upon ionization of the neon dimer by a strong laser field. Postselection of ions, which are measured in coincidence with electrons, allows choosing the symmetry of the residual ion, leading to observation of both, gerade and ungerade, types of interference.

7,160 citations

Journal ArticleDOI
TL;DR: The Gene Expression Omnibus is an international public repository for high-throughput microarray and next-generation sequence functional genomic data sets submitted by the research community and supports archiving of raw data, processed data and metadata which are indexed, cross-linked and searchable.
Abstract: The Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) is an international public repository for high-throughput microarray and next-generation sequence functional genomic data sets submitted by the research community. The resource supports archiving of raw data, processed data and metadata which are indexed, cross-linked and searchable. All data are freely available for download in a variety of formats. GEO also provides several web-based tools and strategies to assist users to query, analyse and visualize data. This article reports current status and recent database developments, including the release of GEO2R, an R-based web application that helps users analyse GEO data.

6,683 citations

Journal ArticleDOI
TL;DR: Kallisto pseudoaligns reads to a reference, producing a list of transcripts that are compatible with each read while avoiding alignment of individual bases, which removes a major computational bottleneck in RNA-seq analysis.
Abstract: We present kallisto, an RNA-seq quantification program that is two orders of magnitude faster than previous approaches and achieves similar accuracy. Kallisto pseudoaligns reads to a reference, producing a list of transcripts that are compatible with each read while avoiding alignment of individual bases. We use kallisto to analyze 30 million unaligned paired-end RNA-seq reads in <10 min on a standard laptop computer. This removes a major computational bottleneck in RNA-seq analysis.
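
The idea of finding the transcripts compatible with a read without base-level alignment can be illustrated with a toy k-mer lookup. This is a deliberately simplified sketch: kallisto itself works on a transcriptome de Bruijn graph with further optimizations, and the function names, sequences, and k value below are hypothetical.

```python
def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = {}
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(name)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all k-mers in the read.

    No per-base alignment is computed; a read is compatible with a
    transcript only if every one of its k-mers occurs in that transcript.
    """
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()
    return compatible or set()


# Toy transcriptome with invented sequences.
transcripts = {
    "tx1": "ACGTACGGTCA",
    "tx2": "TTACGGTCAGG",
    "tx3": "GGGCCCAAATT",
}
index = build_kmer_index(transcripts, k=5)
shared = compatible_transcripts("ACGGTCA", index, k=5)   # region common to tx1 and tx2
unique = compatible_transcripts("ACGTACG", index, k=5)   # region found only in tx1
print(sorted(shared), sorted(unique))
```

A read from a region shared by two transcripts yields both names, and resolving such ambiguity is then left to a separate quantification step, which this sketch does not cover.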

6,468 citations