Open access | Journal Article | DOI: 10.1038/S41467-021-21671-W

Improving gene function predictions using independent transcriptional components

05 Mar 2021-Nature Communications (Nature Publishing Group)-Vol. 12, Iss: 1, pp 1464-1464
Abstract: The interpretation of high-throughput sequencing data is limited by our incomplete functional understanding of coding and non-coding transcripts. Reliably predicting the function of such transcripts can overcome this limitation. Here we report the use of a consensus independent component analysis and guilt-by-association approach to predict over 23,000 functional groups comprising over 55,000 coding and non-coding transcripts using publicly available transcriptomic profiles. We show that, compared to using Principal Component Analysis, Independent Component Analysis-derived transcriptional components enable more confident functionality predictions, improve predictions when new members are added to the gene sets, and are less affected by gene multi-functionality. Predictions generated using human or mouse transcriptomic data are made available for exploration in a publicly available web portal. Our understanding of the function of many transcripts is still incomplete, limiting the interpretability of transcriptomic data. Here the authors use consensus-independent component analysis, together with a guilt-by-association approach, to improve the prediction of gene function.


6 results found

Open access | Posted Content | DOI: 10.1101/2021.07.01.450581
Anand V. Sastry1, Saugat Poudel1, Kevin Rychel1, Reo Yoo1 +7 more | Institutions (2)
02 Jul 2021-bioRxiv
Abstract: We are firmly in the era of biological big data. Millions of omics datasets are publicly accessible and can be employed to support scientific research or build a holistic view of an organism. Here, we introduce a workflow that converts all public gene expression data for a microbe into a dynamic representation of the organism’s transcriptional regulatory network. This five-step process walks researchers through the mining, processing, curation, analysis, and characterization of all available expression data, using Bacillus subtilis as an example. The resulting reconstruction of the B. subtilis regulatory network can be leveraged to predict new regulons and analyze datasets in the context of all published data. The results are hosted at, and additional analyses can be performed using the PyModulon Python package. As the number of publicly available datasets increases, this pipeline will be applicable to a wide range of microbial pathogens and cell factories.

4 Citations

Open access | Posted Content | DOI: 10.1101/2021.04.08.439047
11 Apr 2021-bioRxiv
Abstract: Uncovering the structure of the transcriptional regulatory network (TRN) that modulates gene expression in prokaryotes remains an important challenge. Transcriptomics data is plentiful, necessitating the development of scalable methods for converting this data into useful knowledge about the TRN. Previously, we published the PRECISE dataset for Escherichia coli K-12 MG1655, containing 278 RNA-seq datasets created using a standardized protocol. Here, we present PRECISE 2.0, which is nearly three times the size of the original PRECISE dataset and also created using a standardized protocol. We analyze PRECISE 2.0 at multiple scales, demonstrating multiple analytical strategies for extracting knowledge from this dataset. Specifically, we: (1) highlight patterns in gene expression across the dataset; (2) utilize independent component analysis to extract 218 independently modulated groups of genes (iModulons) that describe the TRN at the systems level; (3) demonstrate the utility of iModulons over traditional differential expression analysis; and (4) uncover 6 new potential regulons. Thus, PRECISE 2.0 is a large-scale, high-quality transcriptomics dataset which may be analyzed at multiple scales to yield important biological insights.

3 Citations

Open access | Posted Content | DOI: 10.1101/2021.11.25.470058
Yi-Heng Zhu1, Chengxin Zhang2, Yan Liu1, Gilbert S. Omenn2 +3 more | Institutions (2)
27 Nov 2021-bioRxiv
Abstract: Gene Ontology (GO) has been widely used to annotate functions of genes and gene products. We proposed a new method (TripletGO) to deduce GO terms of protein-coding and non-coding genes, through the integration of four complementary pipelines built on transcript expression profiling, genetic sequence alignment, protein sequence alignment and naive probability, respectively. TripletGO was tested on a large set of 5,754 genes from 8 species (human, mouse, arabidopsis, rat, fly, budding yeast, fission yeast, and nematoda) and 2,433 proteins with available expression data from the CAFA3 experiment and achieved function annotation accuracy significantly beyond the current state-of-the-art approaches. Detailed analyses show that the major advantage of TripletGO lies in the coupling of a new triplet-network based profiling method with the feature space mapping technique which can accurately recognize function patterns from transcript expressions. Meanwhile, the combination of multiple complementary models, especially those from transcript expression and protein-level alignments, improves the coverage and accuracy of the final GO annotation results. The standalone package and an online server of TripletGO are freely available at

Open access | Journal Article | DOI: 10.1007/S12094-021-02704-8
Jinwei Lei1, Shipeng Guo1, Kang Li1, Jiao Tian1 +5 more | Institutions (1)
Abstract: Lysophosphatidic acid (LPA) is a bioactive molecule which participates in many physical and pathological processes. Although LPA receptor 6 (LPAR6), the last identified LPA receptor, has been reported to have diverse effects in multiple cancers, including breast cancer, its effects and functioning mechanisms are not fully known. Multiple public databases were used to investigate the mRNA expression of LPAR6, its prognostic value, and potential mechanisms in breast cancer. Western blotting was performed to validate the differential expression of LPAR6 in breast cancer tissues and their adjacent tissues. Furthermore, in vitro experiments were used to explore the effects of LPAR6 on breast cancer. Additionally, TargetScan and miRWalk were used to identify potential upstream regulatory miRNAs, and the relationship between miR-27a-3p and LPAR6 was validated via real-time polymerase chain reaction and an in vitro rescue assay. LPAR6 was significantly downregulated in breast cancer at transcriptional and translational levels. Decreased LPAR6 expression in breast cancer is significantly correlated with poor overall survival, disease-free survival, and distal metastasis-free survival, particularly for hormone receptor-positive patients, regardless of lymph node metastatic status. In vitro gain- and loss-of-function assays indicated that LPAR6 attenuated breast cancer cell proliferation. The analyses of TCGA and METABRIC datasets revealed that LPAR6 may regulate the cell cycle signal pathway. Furthermore, the expression of LPAR6 could be positively regulated by miR-27a-3p. The knockdown of miR-27a-3p increased cell proliferation, and ectopic expression of LPAR6 could partly rescue this phenotype. LPAR6 acts as a tumor suppressor in breast cancer and is positively regulated by miR-27a-3p.

Topics: Breast cancer (63%), LPAR6 (55%), microRNA (54%)

Open access | Journal Article | DOI: 10.3390/IJMS22084274
Abstract: The diagnosis of neuromuscular diseases (NMDs) has been progressively evolving from the grouping of clinical symptoms and signs towards the molecular definition. Optimal clinical, biochemical, electrophysiological, and histopathological characterization is very helpful to achieve molecular diagnosis, which is essential for establishing prognosis, treatment and genetic counselling. Currently, the genetic approach includes both the gene-targeted analysis in specific clinically recognizable diseases, as well as genomic analysis based on next-generation sequencing, analyzing either the clinical exome/genome or the whole exome or genome. However, as of today, there are still many patients in whom the causative genetic variant cannot be definitely established and variants of uncertain significance are often found. In this review, we address these drawbacks by incorporating two additional biological omics approaches into the molecular diagnostic process of NMDs. First, functional genomics by introducing experimental cell and molecular biology to analyze and validate the variant for its biological effect in an in-house translational diagnostic program, and second, incorporating a multi-omics approach including RNA-seq, metabolomics, and proteomics in the molecular diagnosis of neuromuscular disease. Both translational diagnostics programs and omics are being implemented as part of the diagnostic process in academic centers and referral hospitals and, therefore, an increase in the proportion of neuromuscular patients with a molecular diagnosis is expected. This improvement in the process and diagnostic performance of patients will allow solving aspects of their health problems in a precise way and will allow them and their families to take a step forward in their lives.

Topics: Exome (53%), Molecular diagnostics (52%)


37 results found

Open access | Journal Article | DOI: 10.1186/S13059-014-0550-8
05 Dec 2014-Genome Biology
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at .

Topics: MRNA Sequencing (54%), Integrator complex (51%), Count data (50%)

29,675 Citations

Open access | Journal Article | DOI: 10.1073/PNAS.0506580102
Abstract: Although genomewide RNA expression analysis has become a routine tool in biomedical research, extracting biological insight from such information remains a major challenge. Here, we describe a powerful analytical method called Gene Set Enrichment Analysis (GSEA) for interpreting gene expression data. The method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation. We demonstrate how GSEA yields insights into several cancer-related data sets, including leukemia and lung cancer. Notably, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer, GSEA reveals many biological pathways in common. The GSEA method is embodied in a freely available software package, together with an initial database of 1,325 biologically defined gene sets.

26,320 Citations

Open access | Journal Article | DOI: 10.1093/NAR/GKS1193
Abstract: The Gene Expression Omnibus (GEO) is an international public repository for high-throughput microarray and next-generation sequence functional genomic data sets submitted by the research community. The resource supports archiving of raw data, processed data and metadata which are indexed, cross-linked and searchable. All data are freely available for download in a variety of formats. GEO also provides several web-based tools and strategies to assist users to query, analyse and visualize data. This article reports current status and recent database developments, including the release of GEO2R, an R-based web application that helps users analyse GEO data.

Topics: Metadata (53%)

4,737 Citations

Journal Article | DOI: 10.1038/NBT.3519
Abstract: We present kallisto, an RNA-seq quantification program that is two orders of magnitude faster than previous approaches and achieves similar accuracy. Kallisto pseudoaligns reads to a reference, producing a list of transcripts that are compatible with each read while avoiding alignment of individual bases. We use kallisto to analyze 30 million unaligned paired-end RNA-seq reads in <10 min on a standard laptop computer. This removes a major computational bottleneck in RNA-seq analysis.

4,396 Citations

Open access | Journal Article | DOI: 10.1038/S41467-018-07882-8
Maksim Kunitski1, Nicolas Eicke2, Pia Huber1, Jonas Köhler1 +12 more | Institutions (2)
Abstract: Wave-particle duality is an inherent peculiarity of the quantum world. The double-slit experiment has been frequently used for understanding different aspects of this fundamental concept. The occurrence of interference rests on the lack of which-way information and on the absence of decoherence mechanisms, which could scramble the wave fronts. Here, we report on the observation of two-center interference in the molecular-frame photoelectron momentum distribution upon ionization of the neon dimer by a strong laser field. Postselection of ions, which are measured in coincidence with electrons, allows choosing the symmetry of the residual ion, leading to observation of both, gerade and ungerade, types of interference.

Topics: Ionization (55%), Neon (54%)

4,138 Citations
