Showing papers in "Journal of Computational Biology in 2021"

Open Access

Journal Article•DOI•

Diagnosis of Autism Spectrum Disorder Based on Functional Brain Networks with Deep Learning.

Wutao Yin¹, Sakib Mostafa¹, Fang-Xiang Wu¹•Institutions (1)

04 Feb 2021-Journal of Computational Biology

TL;DR: Results show that the proposed deep learning methods for diagnosis of ASD from functional brain networks constructed with brain functional magnetic resonance imaging (fMRI) data outperform the state-of-the-art methods.

Abstract: Autism spectrum disorder (ASD) is a neurological and developmental disorder. Traditional diagnosis of ASD is typically performed through observation of behaviors and an interview with the patient. However, these diagnosis methods are time-consuming and can sometimes be misleading. By integrating machine learning algorithms with neuroimages, a diagnosis method can possibly be established to distinguish ASD subjects from typical control subjects. In this study, we develop deep learning methods for diagnosis of ASD from functional brain networks constructed with brain functional magnetic resonance imaging (fMRI) data. The entire Autism Brain Imaging Data Exchange 1 (ABIDE 1) data set is utilized to investigate the performance of our proposed methods. First, we construct the brain networks from brain fMRI images and define the raw features based on such brain networks. Second, we employ an autoencoder (AE) to learn the advanced features from the raw features. Third, we train a deep neural network (DNN) with the advanced features, which achieves a classification accuracy of 76.2% and an area under the receiver operating characteristic curve (AUC) of 79.7%. As a comparison, we also apply the same advanced features to train several traditional machine learning algorithms to benchmark the classification performance. Finally, we combine the DNN with the pretrained AE and train it with the raw features, which achieves a classification accuracy of 79.2% and an AUC of 82.4%. These results show that our proposed deep learning methods outperform the state-of-the-art methods.
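The first step of the abstract above (building a brain network and deriving raw features from it) can be illustrated with a small sketch. The abstract does not say how the networks or features are defined, so this example assumes one common choice: Pearson correlation between regional time series, with the upper triangle of the correlation matrix used as the feature vector. The names `pearson`, `connectivity_features`, and the toy region labels are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length time series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def connectivity_features(region_series):
    """Flatten the upper triangle of the region-by-region correlation matrix.

    Assumption: edges of the functional brain network are Pearson
    correlations; the source abstract does not specify this.
    """
    names = sorted(region_series)
    features = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            features.append(pearson(region_series[names[i]],
                                    region_series[names[j]]))
    return features

# Toy data: three hypothetical regions with four time points each.
toy = {
    "roi_a": [1.0, 2.0, 3.0, 4.0],
    "roi_b": [2.0, 4.0, 6.0, 8.0],
    "roi_c": [4.0, 3.0, 2.0, 1.0],
}
raw_features = connectivity_features(toy)
```

A feature vector of this kind would then be the input to an autoencoder; with R regions it has R(R-1)/2 entries.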

47 citations

Journal Article•DOI•

Representation of k-Mer Sets Using Spectrum-Preserving String Sets.

Amatur Rahman¹, Paul Medvedev¹•Institutions (1)

Pennsylvania State University¹

20 Apr 2021-Journal of Computational Biology

TL;DR: This work proves a lower bound on the size of the optimal SPSS and proposes a greedy method called UST that results in a smaller representation than unitigs and is nearly optimal with respect to the lower bound.

Abstract: Given the popularity and elegance of \(k\)-mer based tools, finding a space-efficient way to represent a set of \(k\)-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of \(k\)-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first one is a compression algorithm, UST-Compress, which we show can store a set of \(k\)-mers using an order-of-magnitude less disk space than other lossless compression tools. The second one is an exact static \(k\)-mer membership index, UST-FM, which we show improves index size by 10–44% compared to other state-of-the-art low memory indices. Our tool is publicly available at: https://github.com/medvedevgroup/UST/.
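To make the idea of a spectrum-preserving string set concrete, here is a minimal sketch. It assumes that an SPSS is a set of strings, each of length at least k, whose k-mers are exactly the input set with no k-mer repeated, and it ignores reverse complements for simplicity. The function names are hypothetical and this is not the UST algorithm itself.

```python
def spectrum(strings, k):
    """List every k-mer occurrence (with repeats) in the given strings."""
    return [s[i:i + k] for s in strings for i in range(len(s) - k + 1)]

def is_spss(strings, kmers, k):
    """Check the assumed SPSS conditions against a target k-mer set."""
    occurrences = spectrum(strings, k)
    return (
        all(len(s) >= k for s in strings)
        and len(occurrences) == len(set(occurrences))  # no k-mer repeated
        and set(occurrences) == set(kmers)             # same spectrum
    )

def weight(strings):
    """Total number of characters needed to store the representation."""
    return sum(len(s) for s in strings)

kmers = {"ACG", "CGT", "GTA", "CGA"}
one_per_kmer = sorted(kmers)   # trivial representation: each k-mer by itself
merged = ["ACGTA", "CGA"]      # overlapping k-mers glued into a longer string
```

In this toy case both representations preserve the spectrum, but the merged one stores 8 characters instead of 12, which is the kind of saving the path-cover formulation aims to maximize.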

19 citations

Journal Article•DOI•

Supervised Adversarial Alignment of Single-Cell RNA-seq Data

Songwei Ge¹, Haohan Wang¹, Amir H. Alavi¹, Eric P. Xing¹, Ziv Bar-Joseph¹•Institutions (1)

Carnegie Mellon University¹

19 Jan 2021-Journal of Computational Biology

TL;DR: By overcoming batch effects, this method was able to correctly separate cell types, improving on several prior methods suggested for this task. Analysis of the top features used by the network indicates that, by taking the batch impact into account, the reduced representation is much better able to focus on key genes for each cell type.

Abstract: Dimensionality reduction is an important first step in the analysis of single cell RNA-seq (scRNA-seq) data. In addition to enabling the visualization of the profiled cells, such representations are used by many downstream analysis methods ranging from pseudo-time reconstruction to clustering to alignment of scRNA-seq data from different experiments, platforms, and labs. Both supervised and unsupervised methods have been proposed to reduce the dimension of scRNA-seq data. However, all methods to date are sensitive to batch effects. When batches correlate with cell types, as is often the case, their impact can lead to representations that are batch specific rather than cell type specific. To overcome this, we developed a domain adversarial neural network model for learning a reduced dimension representation of scRNA-seq data. The adversarial model tries to simultaneously optimize two objectives. The first is the accuracy of cell type assignment and the second is the inability to distinguish the batch (domain). We tested the method by using the resulting representation to align several different datasets. As we show, by overcoming batch effects our method was able to correctly separate cell types, improving on several prior methods suggested for this task. Analysis of the top features used by the network indicates that by taking the batch impact into account, the reduced representation is much better able to focus on key genes for each cell type.

16 citations

Journal Article•DOI•

Mapping the Nonstructural Transmembrane Proteins of Severe Acute Respiratory Syndrome Coronavirus 2.

Sunil Thomas¹•Institutions (1)

Lankenau Institute for Medical Research¹

28 Jun 2021-Journal of Computational Biology

TL;DR: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), responsible for coronavirus disease 2019 (COVID-19), has wreaked havoc on the health and economy of humanity.

Abstract: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), responsible for coronavirus disease 2019 (COVID-19), has wreaked havoc on the health and economy of humanity. In addition, the ...

16 citations

Journal Article•DOI•

Computing the rearrangement distance of natural genomes

Leonard Bohnenkämper¹, Marília D. V. Braga¹, Daniel Doerr¹, Jens Stoye¹•Institutions (1)

Bielefeld University¹

20 Apr 2021-Journal of Computational Biology

TL;DR: An algorithm solving the genomic distance problem for natural genomes, in which any marker may occur an arbitrary number of times, is presented, based on a new graph data structure, the multi-relational diagram, that allows an elegant extension of the ILP to count runs of markers that are under- or over-represented in one genome with respect to the other and need to be inserted or deleted.

Abstract: The computation of genomic distances has been a very active field of computational comparative genomics over the last 25 years. Substantial results include the polynomial-time computability of the inversion distance by Hannenhalli and Pevzner in 1995 and the introduction of the double-cut and join (DCJ) distance by Yancopoulos, Attie and Friedberg in 2005. Both results, however, rely on the assumption that the genomes under comparison contain the same set of unique markers (syntenic genomic regions, sometimes also referred to as genes). In 2015, Shao, Lin and Moret relaxed this condition by allowing for duplicate markers in the analysis. This generalized version of the genomic distance problem is NP-hard, and they give an ILP solution that is efficient enough to be applied to real-world datasets. A restriction of their approach is that it can be applied only to balanced genomes, which have equal numbers of duplicates of any marker. Therefore, it still needs a delicate preprocessing of the input data in which excessive copies of unbalanced markers have to be removed.

14 citations

Journal Article•DOI•

Identification of Differentially Expressed Genes Associated with Idiopathic Pulmonary Arterial Hypertension by Integrated Bioinformatics Approaches.

Enfa Zhao¹, Hang Xie¹, Yushun Zhang¹•Institutions (1)

Xi'an Jiaotong University¹

06 Jan 2021-Journal of Computational Biology

TL;DR: The identified candidate key genes and pathways help to understand the molecular mechanisms underlying the pathogenesis of IPAH and may be novel biomarkers in IPAH diagnosis.

Abstract: Idiopathic pulmonary arterial hypertension (IPAH) is a fatal cardiovascular disease with significant morbidity and mortality. However, its potential molecular mechanisms and key genes have not been fully evaluated. The gene expression profile of GSE33463, including 30 individuals diagnosed with IPAH and 41 normal controls, was downloaded from the Gene Expression Omnibus database. The differentially expressed genes (DEGs) were identified using the limma package in R. Gene Ontology (GO) annotation and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were carried out to gain further insight into the possible functions of the identified DEGs. Then, the protein-protein interaction (PPI) network of all DEGs was constructed. Nodes with higher degree centrality (≥10) were considered hub proteins in the PPI network. Area under the curve (AUC) values obtained from receiver operating characteristic (ROC) curve analysis were used to assess the diagnostic effectiveness of hub genes in discriminating IPAH from normal individuals. Sixty-nine DEGs were identified, including 41 upregulated and 28 downregulated DEGs. The GO enrichment analysis indicated that genes were significantly enriched in oxygen carrier activity, oxygen binding, heme binding, molecular carrier activity, and antioxidant activity. KEGG pathway enrichment showed that genes were mainly involved in cytokine and cytokine receptor, chemokine signaling pathway, interleukin-17 signaling pathway, and Toll-like receptor (TLR) signaling pathway. JUN, ALAS2, HBD, EPB42, TLR7, SLC4A1, and CXCR4 were identified as the hub gene nodes. The area under the ROC curve indicated that three hub genes have high diagnostic value in IPAH, with an AUC of 0.934 [95% confidence interval (CI): 0.849-0.979] for TLR7, 0.910 (95% CI: 0.818-0.965) for JUN, and 0.895 (95% CI: 0.800-0.955) for CXCR4. The identified candidate key genes and pathways help us understand the molecular mechanisms underlying the pathogenesis of IPAH. TLR7, JUN, and CXCR4 may be novel biomarkers in IPAH diagnosis.
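The hub-selection rule in this abstract (nodes with degree of at least 10 in the PPI network) is simple enough to sketch directly. The example below is only an illustration on a made-up edge list; the function name `hub_nodes` and the toy protein labels are hypothetical, and the study itself may have used dedicated network software.

```python
from collections import defaultdict

def hub_nodes(edges, min_degree=10):
    """Return nodes whose number of distinct neighbors is >= min_degree."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-interactions when counting degree
        neighbors[u].add(v)
        neighbors[v].add(u)
    return sorted(node for node, nbrs in neighbors.items()
                  if len(nbrs) >= min_degree)

# Toy network: one protein connected to ten others, plus two extra edges.
toy_edges = [("HUB", f"P{i}") for i in range(10)] + [("P0", "P1"), ("P1", "P2")]
hubs = hub_nodes(toy_edges)
```

Using sets of neighbors rather than raw edge counts means that duplicated edges in the input do not inflate a node's degree.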

14 citations

Journal Article•DOI•

From Alpha to Zeta: Identifying Variants and Subtypes of SARS-CoV-2 Via Clustering.

Andrew Melnyk¹, Fatemeh Mohebbi¹, Sergey Knyazev¹, Bikram Sahoo¹, Roya Hosseini¹, Pavel Skums¹, Alexander Zelikovsky¹,², Murray Patterson¹•Institutions (2)

Georgia State University¹, I.M. Sechenov First Moscow State Medical University²

25 Oct 2021-Journal of Computational Biology

TL;DR: In this paper, the authors identify novel variants and subtypes of SARS-CoV-2 by clustering sequences, adapting methods originally designed for haplotyping intrahost viral populations.

Abstract: The availability of millions of SARS-CoV-2 (Severe Acute Respiratory Syndrome-Coronavirus-2) sequences in public databases such as GISAID (Global Initiative on Sharing All Influenza Data) and EMBL-EBI (European Molecular Biology Laboratory-European Bioinformatics Institute) (the United Kingdom) allows a detailed study of the evolution, genomic diversity, and dynamics of a virus as never before. Here, we identify novel variants and subtypes of SARS-CoV-2 by clustering sequences, adapting methods originally designed for haplotyping intrahost viral populations. We assess our results using clustering entropy, the first time it has been used in this context. Our clustering approach reaches lower entropies compared with other methods, and we are able to boost this even further through gap filling and Monte Carlo-based entropy minimization. Moreover, our method clearly identifies the well-known Alpha variant in the U.K. and GISAID data sets, and is also able to detect the much less represented (<1% of the sequences) Beta (South Africa), Epsilon (California), and Gamma and Zeta (Brazil) variants in the GISAID data set. Finally, we show that each variant identified has high selective fitness, based on the growth rate of its cluster over time. This demonstrates that our clustering approach is a viable alternative for detecting even rare subtypes in very large data sets.
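The abstract does not define clustering entropy, so the following is only a sketch of one plausible formulation: for each cluster of aligned sequences, sum the Shannon entropy of the characters in every alignment column, then average over clusters weighted by cluster size. All function names are hypothetical, and the paper's exact definition may differ.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the characters in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def cluster_entropy(sequences):
    """Sum of column entropies over equal-length aligned sequences."""
    return sum(column_entropy(col) for col in zip(*sequences))

def clustering_entropy(clusters):
    """Size-weighted average of cluster entropies (assumed formulation)."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)

# Two ways of splitting the same four toy sequences into two clusters.
good = [["AC", "AT"], ["GG", "GG"]]
bad = [["AC", "GG"], ["AT", "GG"]]
good_score = clustering_entropy(good)
bad_score = clustering_entropy(bad)
```

Under this formulation, grouping similar sequences together gives a lower score than mixing dissimilar ones, which matches the abstract's use of lower entropy as the sign of a better clustering.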

11 citations

Journal Article•DOI•

Simpler and Faster Development of Tumor Phylogeny Pipelines.

Sarwan Ali¹, Simone Ciccolella², Lorenzo Lucarella², Gianluca Della Vedova², Murray Patterson¹•Institutions (2)

Georgia State University¹, University of Milano-Bicocca²

25 Oct 2021-Journal of Computational Biology

TL;DR: plastic (PipeLine Amalgamating Single-cell Tree Inference Components) is an easy-to-use and quick-to-adapt pipeline that integrates three different steps: (1) to simplify the input data, (2) to infer tumor phylogenies, and (3) to compare the phylogenies.

Abstract: In recent years, there has been an increasing number of single-cell sequencing studies, producing a considerable number of new data sets. This has particularly affected the field of cancer analysis, where more and more articles are published using this sequencing technique, which allows for capturing more detailed information regarding the specific genetic mutations on each individually sampled cell. As the amount of information increases, it is necessary to have more sophisticated and rapid tools for analyzing the samples. To this end, we developed plastic (PipeLine Amalgamating Single-cell Tree Inference Components), an easy-to-use and quick-to-adapt pipeline that integrates three different steps: (1) to simplify the input data, (2) to infer tumor phylogenies, and (3) to compare the phylogenies. We have created a pipeline submodule for each of those steps and developed new in-memory data structures that allow for easy and transparent sharing of the information across the tools implementing the above steps. While we use existing open source tools for those steps, we have extended the tool used for simplifying the input data, incorporating two machine learning procedures, which greatly reduce the running time without affecting the quality of the downstream analysis. Moreover, we have introduced the capability of producing some plots to quickly visualize results.

10 citations

Journal Article•DOI•

Genome Rearrangement Distance with Reversals, Transpositions, and Indels.

Alexsandro Oliveira Alexandrino¹, Andre Rodrigues Oliveira¹, Ulisses Dias¹, Zanoni Dias¹•Institutions (1)

State University of Campinas¹

04 Mar 2021-Journal of Computational Biology

TL;DR: In this study, it is proved that computing the rearrangement distance for the following models is NP-hard: reversals and indels on unsigned strings; transpositions and indels on unsigned strings; and reversals, transpositions, and indels on signed and unsigned strings.

Abstract: The rearrangement distance is a well-known problem in the field of comparative genomics. Given two genomes, the rearrangement distance is the minimum number of rearrangements in a set of allowed rearrangements (rearrangement model), which transforms one genome into the other. In rearrangement distance problems, a genome is modeled as a string, where each element represents a conserved region within the two genomes. When the orientation of the genes is known, it is represented by (plus or minus) signs assigned to the elements of the string. Two of the most studied rearrangements are reversals, which invert a segment of the genome, and transpositions, which exchange the relative positions of two adjacent segments of the genome. The first works in genome rearrangements considered that the genomes being compared had the same genetic material and that rearrangement events were restricted to reversals, transpositions, or both. El-Mabrouk extended the reversal model on signed strings to include the operations of insertion and deletion of segments in the genome, which allowed the comparison of genomes with different genetic material. Other studies also addressed this problem and, recently, this problem was proved to be solvable in polynomial time by Willing et al. For unsigned strings, we still observe a lack of results. That said, in this study we prove that computing the rearrangement distance for the following models is NP-Hard: reversals and indels on unsigned strings; transpositions and indels on unsigned strings; and reversals, transpositions, and indels on signed and unsigned strings. Along with the NP-hardness proofs, we present a 2-approximation algorithm for reversals on unsigned strings and 3-approximation algorithms for the other models.
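The two rearrangements described in this abstract can be sketched on a genome modeled as a Python list of integers, with a sign encoding orientation. The half-open index convention and the function names below are choices made for this illustration, not notation from the paper.

```python
def signed_reversal(genome, i, j):
    """Invert the segment genome[i:j]; on a signed string the signs flip too."""
    segment = [-g for g in reversed(genome[i:j])]
    return genome[:i] + segment + genome[j:]

def unsigned_reversal(genome, i, j):
    """Invert the segment genome[i:j] when orientation is unknown."""
    return genome[:i] + genome[i:j][::-1] + genome[j:]

def transposition(genome, i, j, k):
    """Exchange the adjacent segments genome[i:j] and genome[j:k]."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]

genome = [1, 2, 3, 4, 5]
after_reversal = signed_reversal(genome, 1, 3)      # flips and negates 2, 3
after_transposition = transposition(genome, 1, 3, 5)  # swaps [2, 3] with [4, 5]
```

The rearrangement distance is then the length of a shortest sequence of such operations (plus insertions and deletions, in the models with indels) that turns one genome into the other; finding that minimum is the part shown to be NP-hard for the listed models.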

10 citations

Journal Article•DOI•

Essential Protein Prediction Based on node2vec and XGBoost.

Nian Wang¹, Min Zeng¹, Yiming Li¹, Fang-Xiang Wu², Min Li¹•Institutions (2)

Central South University¹, University of Saskatchewan²

15 Jul 2021-Journal of Computational Biology

TL;DR: The authors propose a model, Ess-NEXG, to identify essential proteins, which integrates biological information, including orthologous information, subcellular localization information, RNA-Seq information, and the PPI network.

Abstract: Essential proteins are vital to the survival of organisms and cells. Identification of essential proteins lays a solid foundation for understanding protein functions and discovering drug targets. Traditional biological experiments are expensive and time-consuming. Recently, many computational methods have been proposed. However, noise in the protein-protein interaction (PPI) networks affects the efficiency of essential protein prediction. It is necessary to construct a credible PPI network by using other useful biological information to reduce the effects of this noise. In this article, we propose a model, Ess-NEXG, to identify essential proteins, which integrates biological information, including orthologous information, subcellular localization information, RNA-Seq information, and the PPI network. In our model, first, we constructed a credible weighted PPI network by using different types of biological information. Second, we extracted the topological features of proteins in the constructed weighted PPI network by using the node2vec technique. Last, we used eXtreme Gradient Boosting (XGBoost) to predict essential proteins by using the topological features of proteins. The extensive results show that our model has better performance than other computational methods.

9 citations

Journal Article•DOI•

Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss

Brandon Legried¹, Erin K. Molloy², Tandy Warnow³, Sebastien Roch¹•Institutions (3)

University of Wisconsin-Madison¹, University of California, Los Angeles², University of Illinois at Urbana–Champaign³

20 May 2021-Journal of Computational Biology

TL;DR: It is shown that species trees are identifiable under a standard stochastic model for GDL, and that the polynomial-time algorithm ASTRAL-multi, a recent development in the ASTRAL suite of methods, is statistically consistent under this GDL model.

Abstract: Phylogenomics—the estimation of species trees from multi-locus datasets—is a common step in many biological studies. However, this estimation is challenged by the fact that genes can evolve under processes, including incomplete lineage sorting (ILS) and gene duplication and loss (GDL), that make their trees different from the species tree. In this paper, we address the challenge of estimating the species tree under GDL. We show that species trees are identifiable under a standard stochastic model for GDL, and that the polynomial-time algorithm ASTRAL-multi, a recent development in the ASTRAL suite of methods, is statistically consistent under this GDL model. We also provide a simulation study evaluating ASTRAL-multi for species tree estimation under GDL. All scripts and datasets used in this study are available on the Illinois Data Bank: https://doi.org/10.13012/B2IDB-2626814_V1.

Journal Article•DOI•

EPTool: A New Enhancing PSSM Tool for Protein Secondary Structure Prediction.

Yuzhi Guo¹,², Jiaxiang Wu¹, Hehuan Ma², Sheng Wang², Junzhou Huang²•Institutions (2)

Tencent¹, University of Texas at Arlington²

20 Apr 2021-Journal of Computational Biology

TL;DR: Program EPTool is the implementation of Bagging MSA Learning, which provides a complete training and evaluation workflow for the enhancing PSSM model, and is capable of handling different input data set and various computing algorithms to train the enhancing model.

Abstract: Recently, a deep learning-based method for enhancing the Position-Specific Scoring Matrix (PSSM), Bagging Multiple Sequence Alignment (MSA) Learning (Guo et al.), has been proposed, and its effectiveness has been demonstrated empirically. The program EPTool is the implementation of Bagging MSA Learning, which provides a complete training and evaluation workflow for the enhancing PSSM model. It is capable of handling different input data sets and various computing algorithms to train the enhancing model, and eventually improves the PSSM quality for those proteins with insufficient homologous sequences. In addition, EPTool is equipped with several convenient applications, such as a PSSM features calculator and PSSM features visualization. In this article, we present EPTool and briefly introduce its functionalities and applications. Detailed usage instructions are also provided.

Journal Article•DOI•

NetMix: A Network-Structured Mixture Model for Reduced-Bias Estimation of Altered Subnetworks

Matthew A. Reyna¹, Uthsav Chitra², Rebecca Elyanow²,³, Benjamin J. Raphael²•Institutions (3)

Emory University¹, Princeton University², Brown University³

05 Jan 2021-Journal of Computational Biology

TL;DR: In this paper, NetMix, an algorithm that uses Gaussian mixture models to obtain less biased estimates of the parameters of the Altered Subset Distribution (ASD), is introduced and shown to outperform existing methods in identifying altered subnetworks.

Abstract: A classic problem in computational biology is the identification of altered subnetworks: subnetworks of an interaction network that contain genes/proteins that are differentially expressed, highly mutated, or otherwise aberrant compared to other genes/proteins. Numerous methods have been developed to solve this problem under various assumptions, but the statistical properties of these methods are often unknown. For example, some widely-used methods are reported to output very large subnetworks that are difficult to interpret biologically. In this work, we formulate the identification of altered subnetworks as the problem of estimating the parameters of a class of probability distributions which we call the Altered Subset Distribution (ASD). We derive a connection between a popular method, jActiveModules, and the maximum likelihood estimator (MLE) of the ASD. We show that the MLE is statistically biased, explaining the large subnetworks output by jActiveModules. We introduce NetMix, an algorithm that uses Gaussian mixture models to obtain less biased estimates of the parameters of the ASD. We demonstrate that NetMix outperforms existing methods in identifying altered subnetworks on both simulated and real data, including the identification of differentially expressed genes from both microarray and RNA-seq experiments and the identification of cancer driver genes in somatic mutation data.

Journal Article•DOI•

Screening of Severe Acute Respiratory Syndrome Coronavirus 2 RNA-Dependent RNA Polymerase Inhibitors Using Computational Approach.

Poonam Dhankhar¹, Vikram Dalal², Viney Kumar³•Institutions (3)

Virginia Commonwealth University¹, Washington University in St. Louis², Indian Institute of Technology Roorkee³

29 Nov 2021-Journal of Computational Biology

TL;DR: The authors used the computational approach of drug development to screen the antiviral molecules from two antiviral libraries (Life Chemicals [LC] and ASINEX) against RdRP and found molecules that could be potential inhibitors of SARS-CoV-2 RdRP.

Abstract: The detrimental effect of the coronavirus disease 2019 (COVID-19) pandemic has manifested itself as a global crisis. Currently, no specific treatment options are available for COVID-19, so therapeutic interventions to tackle the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection must be urgently established. Therefore, cohesive and multidimensional efforts are required to identify new therapies or investigate the efficacy of small molecules and existing drugs against SARS-CoV-2. Since the RNA-dependent RNA polymerase (RdRP) of SARS-CoV-2 is a promising therapeutic target, this study addresses the identification of antiviral molecules that can specifically target SARS-CoV-2 RdRP. The computational approach of drug development was used to screen the antiviral molecules from two antiviral libraries (Life Chemicals [LC] and ASINEX) against RdRP. Here, we report six antiviral molecules (F3407-4105, F6523-2250, F6559-0746 from LC and BDG 33693278, BDG 33693315, LAS 34156196 from ASINEX), which show substantial interactions with key amino acid residues of the active site of SARS-CoV-2 RdRP and exhibit higher binding affinity (>7.5 kcal mol⁻¹) than Galidesivir, a Food and Drug Administration-approved inhibitor of the same. Further, molecular dynamics simulation and Molecular Mechanics Poisson-Boltzmann Surface Area results confirmed that the identified molecules formed more stable RdRP-inhibitor complexes than the RdRP-Galidesivir complex. Our findings suggest that these molecules could be potential inhibitors of SARS-CoV-2 RdRP. However, further in vitro and preclinical experiments would be required to validate these potential inhibitors of SARS-CoV-2 protein.

Journal Article•DOI•

GSEAplot: A Package for Customizing Gene Set Enrichment Analysis in R.

Sarah E. Innis¹, Kelsie Reinaltt¹, Mete Civelek¹, Warren D. Anderson¹•Institutions (1)

University of Virginia¹

14 Jun 2021-Journal of Computational Biology

TL;DR: In this paper, Gene Set Enrichment Analysis (GSEA) is used to identify differentially expressed gene sets that are enriched for annotated biological functions, and the existing GSEA R code is not in the form of a fle...

Abstract: Gene Set Enrichment Analysis (GSEA) is used to identify differentially expressed gene sets that are enriched for annotated biological functions. The existing GSEA R code is not in the form of a fle...

Journal Article•DOI•

Combining Algorithms to Find Signatures That Predict Risk in Early-Stage Stomach Cancer

James B. Nation¹, Justin Cabot-Miller², Oren Segal³, Robert Lucito³, Kira Adaricheva³•Institutions (3)

University of Hawaii¹, Yale University², Hofstra University³

28 Sep 2021-Journal of Computational Biology

TL;DR: In this article, two mathematical algorithms, lattice up-stream targeting (LUST) and D-basis, were applied to the identification of prognostic signatures from cancer gene expression data.

Abstract: This study applied two mathematical algorithms, lattice up-stream targeting (LUST) and D-basis, to the identification of prognostic signatures from cancer gene expression data. The LUST algorithm l...

Journal Article•DOI•

GeneDMRs: An R Package for Gene-Based Differentially Methylated Regions Analysis.

Xiao Wang¹, Dan Hao², Haja N. Kadarmideen¹•Institutions (2)

Technical University of Denmark¹, Northwest A&F University²

04 Mar 2021-Journal of Computational Biology

TL;DR: This study used the public reduced representation bisulfite sequencing data of mouse for evaluating software and revealing novel biologically significant results to supplement the previous research.

Abstract: DNA methylation in gene or gene body could influence gene transcription. Moreover, methylation in gene regions along with CpG island regions could modulate the transcription to undetectable gene expression levels. Therefore, it is necessary to investigate the methylation levels within the gene, gene body, CpG island regions, and their overlapped regions and then identify the gene-based differentially methylated regions (GeneDMRs). In this study, R package GeneDMRs aims to facilitate computing gene-based methylation rate using next-generation sequencing-based methylome data. The user-friendly GeneDMRs package is presented to analyze the methylation levels in each gene/promoter/exon/intron/CpG island/CpG island shore or each overlapped region (e.g., gene-CpG island/promoter-CpG island/exon-CpG island/intron-CpG island/gene-CpG island shore/promoter-CpG island shore/exon-CpG island shore/intron-CpG island shore). GeneDMRs can also interpret complex interplays between methylation levels and gene expression differences or similarities across physiological conditions or disease states. We used the public reduced representation bisulfite sequencing data of mouse (GSE62392) for evaluating software and revealing novel biologically significant results to supplement the previous research. In addition, the whole-genome bisulfite sequencing data of cattle (GSE106538) given the much larger size was used for further evaluation.

Journal Article•DOI•

IMFLer: A Web Application for Interactive Metabolic Flux Analysis and Visualization

Rudolfs Petrovs¹, Egils Stalidzans¹,², Agris Pentjuss¹•Institutions (2)

University of Latvia¹, Latvia University of Agriculture²

23 Aug 2021-Journal of Computational Biology

TL;DR: IMFLer is an interactive metabolic flux analyzer and visualizer that enables the reading and management of metabolic model layout maps, as well as immediate visualization of results from both flux balance analysis (FBA) and flux variability analysis (FVA).

Abstract: Increasing genome-wide data in biological sciences and medicine has contributed to the development of a variety of visualization tools. Several automatic, semiautomatic, and manual visualization tools have already been developed. Some even have integrated flux balance analysis (FBA), but in most cases, it depends on separately installed third-party software that is proprietary, does not allow customization of its functionality, and has many restrictions for easy data distribution and analysis. In this study, we present an interactive metabolic flux analyzer and visualizer (IMFLer), a static single-page web application that enables the reading and management of metabolic model layout maps, as well as immediate visualization of results from both FBA and flux variability analysis (FVA). IMFLer uses the Escher Builder tool to load, show, edit, and save metabolic pathway maps. This makes IMFLer an attractive and easily applicable tool with a user-friendly interface. Moreover, it allows faster interpretation of results from FBA and FVA and improves data interoperability by using a standardized file format for the genome-scale metabolic model. IMFLer is a fully open-source tool that enables the rapid visualization and interpretation of the results of FBA and FVA with no setup time and no programming skills required, available at https://lv-csbg.github.io/IMFLer/.

Journal Article•DOI•

Deep Large-Scale Multitask Learning Network for Gene Expression Inference.

Kamran Ghasedi Dizaji¹, Wei Chen¹, Heng Huang¹•Institutions (1)

University of Pittsburgh¹

20 May 2021-Journal of Computational Biology

TL;DR: A novel deep multi-task learning algorithm is proposed that automatically learns the biological interrelations among target genes and uses this information to enhance prediction; it can effectively learn the interrelations from the large-scale tasks of the gene expression inference problem and does not suffer from cost-prohibitive operations.

Abstract: Gene expression profiling empowers many biological studies in various fields by comprehensively characterizing cellular status under different experimental conditions. Despite recent advances in high-throughput technologies, profiling the whole-genome set is still challenging and expensive. Because the expression patterns of different genes are highly correlated, this issue can be addressed by a cost-effective approach that collects only a small subset of genes, called landmark genes, as representatives of the entire genome set and estimates the remaining ones, called target genes, via a computational model. Several shallow and deep regression models have been presented in the literature for inferring the expressions of target genes. However, the shallow models suffer from underfitting due to their insufficient capacity to capture the complex nature of gene expression data, and the existing deep models are prone to overfitting because they do not use the interrelations of target genes in the learning framework. To address these challenges, we formulate gene expression inference as a multi-task learning problem and propose a novel deep multi-task learning algorithm that automatically learns the biological interrelations among target genes and uses this information to enhance prediction. In particular, we employ a multi-layer sub-network with low-dimensional latent variables for learning the interrelations among target genes (i.e., distinct predictive tasks), and impose a seamless and easy-to-implement regularization on deep models. Unlike conventional, complicated multi-task learning methods, which can only deal with tens or hundreds of tasks, our proposed algorithm can effectively learn the interrelations from the large-scale (\(\sim\)10,000) tasks of the gene expression inference problem, and does not suffer from cost-prohibitive operations. Experimental results indicate the superiority of our method compared to the existing gene expression inference models and alternative multi-task learning algorithms on two large-scale datasets.

Journal Article•DOI•

Deep Learning of Sequence Patterns for CCCTC-Binding Factor-Mediated Chromatin Loop Formation.

Shuzhen Kuang¹, Liangjiang Wang¹•Institutions (1)

Clemson University¹

04 Feb 2021-Journal of Computational Biology

TL;DR: A deep learning model, called DeepCTCFLoop, is developed to predict whether a chromatin loop can be formed between a pair of convergent or tandem CTCF motifs using only the DNA sequences of the motifs and their flanking regions, and it is shown that DNA motifs binding to several transcription factors, including ZNF384, ZNF263, ASCL1, SP1, and ZEB1, may constitute the complex sequence patterns for CTCF-mediated chromatin loop formation.

Abstract: The three-dimensional (3D) organization of the human genome is of crucial importance for gene regulation, and the CCCTC-binding factor (CTCF) plays an important role in chromatin interactions. However, it is still unclear what sequence patterns in addition to CTCF motif pairs determine chromatin loop formation. To discover the underlying sequence patterns, we have developed a deep learning model, called DeepCTCFLoop, to predict whether a chromatin loop can be formed between a pair of convergent or tandem CTCF motifs using only the DNA sequences of the motifs and their flanking regions. Our results suggest that DeepCTCFLoop can accurately distinguish the CTCF motif pairs forming chromatin loops from the ones not forming loops. It significantly outperforms CTCF-MP, a machine learning model based on word2vec and boosted trees, when using DNA sequences only. Furthermore, we show that DNA motifs binding to several transcription factors, including ZNF384, ZNF263, ASCL1, SP1, and ZEB1, may constitute the complex sequence patterns for CTCF-mediated chromatin loop formation. DeepCTCFLoop has also been applied to disease-associated sequence variants to identify candidates that may disrupt chromatin loop formation. Therefore, our results provide useful information for understanding the mechanism of 3D genome organization and may also help annotate and prioritize the noncoding sequence variants associated with human diseases.

Journal Article•DOI•

Integrated Analysis of an lncRNA-Associated ceRNA Network Reveals Potential Biomarkers for Hepatocellular Carcinoma.

Jie Yang, Qing-Chun Xu, Zhen-Yu Wang, Xun Lu, Liu-Kui Pan, Jun Wu, Chen Wang

04 Mar 2021-Journal of Computational Biology

TL;DR: In this article, a competing endogenous RNA (ceRNA) network was constructed based on potential long-noncoding RNA (lncRNA)-microRNA (miRNA)-mRNA interactions.

Abstract: Hepatocellular carcinoma (HCC) is a common malignant tumor worldwide. In this study, we aimed to explore the potential biomarkers and key regulatory pathways related to HCC using integrated bioinformatic analysis and validation. The microarray data of GSE12717 and GSE54238 were downloaded from the Gene Expression Omnibus database. A competing endogenous RNA (ceRNA) network was constructed based on potential long-noncoding RNA (lncRNA)-microRNA (miRNA)-mRNA interactions. A total of 191 mRNAs, 8 miRNAs, and 5 lncRNAs were selected to construct the ceRNA network. Gene Ontology and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis were used to predict their biological functions. The PI3K-Akt signaling pathway was significantly enriched. Kaplan-Meier survival analysis based on the Gene Expression Profiling Interactive Analysis (GEPIA) database was conducted for the weighted mRNAs and lncRNAs. The results showed that SRC, GMPS, CDK2, FEN1, EZH2, ZWINT, MTHFD1L, GINS2, and MAPKAPK5-AS1 were significantly upregulated in tumor tissues. The relative expression levels of these genes were significantly upregulated in HCC patients based on the StarBase database. For further validation, the expression levels of these genes were detected by real-time quantitative reverse transcription-polymerase chain reaction in 20 HCC tumor tissues and paired paracancerous tissues. Receiver operating characteristic analysis revealed that CDK2, MTHFD1L, SRC, ZWINT, and MAPKAPK5-AS1 had significant diagnostic value in HCC, but further studies are needed to explore their mechanisms in HCC.

Journal Article•DOI•

Toward an Information Theory of Quantitative Genetics

David J. Galas¹, James Kunert-Graf¹, Lisa Uechi¹, Nikita A. Sakhanenko¹•Institutions (1)

Pacific Northwest Diabetes Research Institute¹

14 Jun 2021-Journal of Computational Biology

TL;DR: In this article, the authors present a framework built around a set of relationships that both unifies the information measures for the discrete functions and uses them to express key quantitative genetic relationships, and a general approach is described for inferring functional relationships in genotype and phenotype data.

Abstract: Quantitative genetics has evolved dramatically in the past century, and the proliferation of genetic data, in quantity as well as type, enables the characterization of complex interactions and mechanisms beyond the scope of its theoretical foundations. In this article, we argue that revisiting the framework for analysis is important and we begin to lay the foundations of an alternative formulation of quantitative genetics based on information theory. Information theory can provide sensitive and unbiased measures of statistical dependencies among variables, and it provides a natural mathematical language for an alternative view of quantitative genetics. In the previous work, we examined the information content of discrete functions and applied this approach and methods to the analysis of genetic data. In this article, we present a framework built around a set of relationships that both unifies the information measures for the discrete functions and uses them to express key quantitative genetic relationships. Information theory measures of variable interdependency are used to identify significant interactions, and a general approach is described for inferring functional relationships in genotype and phenotype data. We present information-based measures of the genetic quantities: penetrance, heritability, and degrees of statistical epistasis. Our scope here includes the consideration of both two- and three-variable dependencies and independently segregating variants, which captures additive effects, genetic interactions, and two-phenotype pleiotropy. This formalism and the theoretical approach naturally apply to higher multivariable interactions and complex dependencies, and can be adapted to account for population structure, linkage, and nonrandomly segregating markers. This article thus focuses on presenting the initial groundwork for a full formulation of quantitative genetics based on information theory.
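
The kind of dependency measure the abstract refers to can be illustrated with a plug-in estimate of mutual information between a discrete genotype and a discrete phenotype. This is a minimal sketch only, not the authors' formalism: the function name `mutual_information` and the toy genotype/phenotype vectors are hypothetical, and the plug-in estimator is assumed purely for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    px = Counter(xs)            # marginal counts of X
    py = Counter(ys)            # marginal counts of Y
    pxy = Counter(zip(xs, ys))  # joint counts of (X, Y)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# Hypothetical toy data: genotype coded 0/1/2, binary phenotype.
genotype = [0, 0, 1, 1, 2, 2, 0, 1, 2, 0, 1, 2]
phenotype = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]
print(mutual_information(genotype, phenotype))
```

A value of zero indicates no detectable statistical dependency in the sample; larger values indicate stronger dependency, without assuming that the genotype-phenotype relationship is additive or linear.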

Journal Article•DOI•

Identification of Potential Hub Genes of Atherosclerosis Through Bioinformatic Analysis.

Yang Yin, Yang-Fan Zou, Yu Xiao¹, Tian-Xi Wang², Ya-Ni Wang³, Zhi-Cheng Dong, Yu-Hu Huo⁴, Bo-Chen Yao, Ling-Bing Meng, Shuang-Xia Du•Institutions (4)

Peking University¹, Hebei University of Technology², Hebei Medical University³, Tongji University⁴

06 Jan 2021-Journal of Computational Biology

TL;DR: The DEGs and hub genes identified in this study may help to understand the potential etiology of the occurrence and development of AS.

Abstract: Cardiovascular and cerebrovascular diseases, which mainly consist of atherosclerosis (AS), are major causes of death. A great deal of research has been carried out to clarify the molecular mechanisms of AS. However, the etiology of AS remains poorly understood. To screen the potential genes of AS occurrence and development, GSE43292 and GSE57691 were obtained from the Gene Expression Omnibus (GEO) database in this study for bioinformatic analysis. First, GEO2R was used to identify differentially expressed genes (DEGs) and the functional annotation of DEGs was performed by gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis. The Search Tool for the Retrieval of Interacting Genes (STRING) tool was used to construct the protein-protein interaction network and the most important modules and core genes were mined. The results show that a total of 211 DEGs are identified. The functional changes of DEGs are mainly associated with the cellular process, catalytic activity, and protein binding. Eighteen genes were identified as core genes. Bioinformatic analysis showed that the core genes are mainly enriched in numerous processes related to actin. In conclusion, the DEGs and hub genes identified in this study may help us understand the potential etiology of the occurrence and development of AS.

Journal Article•DOI•

A Fast Algorithm for Computing the Fourier Spectrum of a Fractional Period

Jiasong Wang¹, Changchuan Yin²•Institutions (2)

Nanjing University¹, University of Illinois at Chicago²

04 Mar 2021-Journal of Computational Biology

TL;DR: In this article, a fast algorithm for computing Fourier power spectra at fractional periods of real sequences is presented, which can be beneficial in many digital signal processing applications.

Abstract: Directly computing Fourier power spectra at fractional periods of real sequences can be beneficial in many digital signal processing applications. In this article, we present a fast algorithm to co...
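
For orientation, the quantity in question can be evaluated directly from the definition of the Fourier sum at an arbitrary (possibly non-integer) period. The sketch below is the naive direct evaluation, not the fast algorithm of the article, whose details are not given in this listing. The function name `power_at_period` and the test signal are hypothetical.

```python
import cmath
import math


def power_at_period(x, period):
    """Direct evaluation of |sum_n x[n] * exp(-2*pi*i*n/period)|^2.

    Costs O(N) per period; the period may be any positive real number.
    """
    w = -2j * math.pi / period
    s = sum(v * cmath.exp(w * n) for n, v in enumerate(x))
    return abs(s) ** 2


# Hypothetical test signal: a cosine with fractional period 3.6 samples.
signal = [math.cos(2 * math.pi * n / 3.6) for n in range(360)]
print(power_at_period(signal, 3.6), power_at_period(signal, 3.0))
```

The power is large at the true period of the signal and close to zero at a period it does not contain.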

Journal Article•DOI•

Inferring MicroRNA-Disease Associations Based on the Identification of a Functional Module.

Buwen Cao¹,², Shuguang Deng², Hua Qin², Jiawei Luo¹, Guanghui Li³, Cheng Liang⁴•Institutions (4)

Hunan University¹, Hunan City University², East China Jiaotong University³, Shandong Normal University⁴

06 Jan 2021-Journal of Computational Biology

TL;DR: This study constructs a miRNA functional similarity network derived from a disease similarity network and a known miRNA-disease relationship network, presents an improved K-means algorithm to detect miRNA functional modules, and uses 243 diseases to validate the performance of the proposed method.

Abstract: Inferring potential associations between microRNAs (miRNAs) and human diseases can help people understand the pathogenesis of complex human diseases. Several computational approaches have been pres...

Journal Article•DOI•

MetaProb 2: Metagenomic Reads Binning Based on Assembly Using Minimizers and K-Mers Statistics.

Francesco Andreace¹, Cinzia Pizzi¹, Matteo Comin¹•Institutions (1)

University of Padua¹

26 Aug 2021-Journal of Computational Biology

TL;DR: Current technologies allow the sequencing of microbial communities directly from the environment without prior culturing; MetaProb 2 addresses the binning of the resulting metagenomic reads, based on assembly using minimizers and k-mer statistics.

Abstract: Current technologies allow the sequencing of microbial communities directly from the environment without prior culturing. One of the major problems when analyzing a microbial sample is to taxonomic...

Journal Article•DOI•

Identification of Selective Inhibitors of LdDHFR Enzyme Using Pharmacoinformatic Methods

Vishnu Sharma, Prasad V. Bharatam

06 Jan 2021-Journal of Computational Biology

TL;DR: The present study identifies hits that can be further designed and modified as potent LdDHFR inhibitors, and two hits were found to be more selective than the reported potent LdDHFR inhibitor.

Abstract: Dihydrofolate reductase (DHFR) is a well-known enzyme of the folate metabolic pathway and it is a validated drug target for leishmaniasis. However, only a few leads are reported against Leishmania ...

Journal Article•DOI•

Multiscale Feedback Loops in SARS-CoV-2 Viral Evolution.

Christopher L. Barrett¹, Andrei C. Bura¹, Qijun He¹, Fenix W. D. Huang¹, Thomas J. X. Li¹, Michael S. Waterman¹, Christian M. Reidys¹•Institutions (1)

University of Virginia¹

04 Mar 2021-Journal of Computational Biology

TL;DR: Evidence that macrolevel pandemic dynamics, such as social distancing, modulate the genomic evolution of SARS-CoV-2 is provided, which complements the prevalent paradigm that microlevel observables control macrolevel parameters such as death rates and infection patterns.

Abstract: COVID-19 is an infectious disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The viral genome is considered to be relatively stable and the mutations that have been observed and reported thus far are mainly focused on the coding region. This article provides evidence that macrolevel pandemic dynamics, such as social distancing, modulate the genomic evolution of SARS-CoV-2. This view complements the prevalent paradigm that microlevel observables control macrolevel parameters such as death rates and infection patterns. First, we observe differences in mutational signals for geospatially separated populations such as the prevalence of A23404G in CA versus NY and WA. We show that the feedback between macrolevel dynamics and the viral population can be captured employing a transfer entropy framework. Second, we observe complex interactions within mutational clades. Namely, when C14408T first appeared in the viral population, the frequency of A23404G spiked in the subsequent week. Third, we identify a noncoding mutation, G29540A, within the segment between the coding gene of the N protein and the ORF10 gene, which is largely confined to NY ([Formula: see text]95%). These observations indicate that macrolevel sociobehavioral measures have an impact on the viral genomics and may be useful for the dashboard-like tracking of its evolution. Finally, despite the fact that SARS-CoV-2 is a genetically robust organism, our findings suggest that we are dealing with a high degree of adaptability. Owing to its ample spread, mutations of unusual form are observed and a high complexity of mutational interaction is exhibited.
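
The transfer entropy idea mentioned in the abstract can be sketched for two discrete time series, for example a discretized macrolevel signal and a discretized mutation-frequency signal. This is an illustrative plug-in estimator with history length 1, not the framework used by the authors. The function name `transfer_entropy` and the toy series are hypothetical.

```python
from collections import Counter
from math import log2


def transfer_entropy(source, target):
    """Plug-in transfer entropy from source to target, in bits (history 1).

    TE = sum p(x1, x0, y0) * log2( p(x1 | x0, y0) / p(x1 | x0) ),
    where x0, x1 are consecutive target values and y0 is the source value.
    """
    triples = list(zip(target[1:], target[:-1], source[:-1]))
    n = len(triples)
    c_full = Counter(triples)                            # (x1, x0, y0)
    c_past = Counter((x0, y0) for _, x0, y0 in triples)  # (x0, y0)
    c_self = Counter((x1, x0) for x1, x0, _ in triples)  # (x1, x0)
    c_x0 = Counter(x0 for _, x0, _ in triples)           # x0
    te = 0.0
    for (x1, x0, y0), c in c_full.items():
        p_given_both = c / c_past[(x0, y0)]
        p_given_self = c_self[(x1, x0)] / c_x0[x0]
        te += (c / n) * log2(p_given_both / p_given_self)
    return te


# Hypothetical toy series: the target copies the source with a lag of one step.
source = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1]
target = [0] + source[:-1]
print(transfer_entropy(source, target))
```

A positive value means that the past of the source improves prediction of the target beyond what the target's own past provides. With real data, short series and fine discretization bias plug-in estimates upward, so significance is usually assessed against shuffled surrogates.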

Journal Article•DOI•

Exploration of DNA Methylation-Driven Genes in Papillary Thyroid Carcinoma Based on the Cancer Genome Atlas.

Yanwei Chen¹, Keke Wang¹, Mengyuan Shang¹, Shuangshuang Zhao¹, Zheng Zhang¹, Haizhen Yang¹, Zheming Chen¹, Rui Du¹, Qilong Wang¹, Baoding Chen¹•Institutions (1)

Jiangsu University¹

06 Jan 2021-Journal of Computational Biology

TL;DR: The results suggest the crucial roles of TNFRSF1A, CLDN1, and CASP1 in the tumorigenesis of PTC and provide a vital bioinformatic basis for further experimental validations and clinical applications.

Abstract: Although the incidence of thyroid carcinoma is reported to be the highest among malignancies of the endocrine system, its diagnosis is still unsatisfactory. This study sought to explore the key DNA methylation-driven genes in the development of papillary thyroid carcinoma (PTC) via a bioinformatic analysis based on the Cancer Genome Atlas (TCGA) database, validated using the Gene Expression Omnibus (GEO) database. The level 3 DNA methylation, mRNA expression, and clinical data of 499 patients with PTC were obtained from the TCGA database. The R packages LIMMA, edgeR, and MethylMix were applied to explore the DNA methylation-driven genes in PTC. The ConsensusPathDB software, DAVID, and STRING databases were used for Gene Ontology (GO) enrichment analysis, Kyoto Encyclopedia of Genes and Genomes pathway analysis, and protein-protein interaction network construction, respectively. To verify the result, the explored genes were validated using the GSE97466 data set retrieved from the GEO database. Fifty-seven (57) methylation-driven genes were detected via MethylMix based on a beta mixture model that compared the DNA methylation state of tumor tissues with that of normal tissues. Eventually, three genes (TNFRSF1A, CLDN1, and CASP1) were identified as the most promising biomarkers for the diagnosis or treatment of PTC. These results suggest the crucial roles of TNFRSF1A, CLDN1, and CASP1 in the tumorigenesis of PTC and provide a vital bioinformatic basis for further experimental validations and clinical applications.

Journal Article•DOI•

Metabolic Pathway Prediction Using Non-Negative Matrix Factorization with Improved Precision.

Abdur Rahman M. A. Basher¹, Ryan J. McLaughlin¹, Steven J. Hallam•Institutions (1)

University of British Columbia¹

14 Sep 2021-Journal of Computational Biology

TL;DR: Triple non-negative matrix factorization with community detection (triUMPF) combines three stages of NMF to capture myriad relationships between enzymes and pathways within a graph network.

Abstract: Machine learning provides a probabilistic framework for metabolic pathway inference from genomic sequence information at different levels of complexity and completion. However, several challenges, including pathway features engineering, multiple mapping of enzymatic reactions, and emergent or distributed metabolism within populations or communities of cells, can limit prediction performance. In this article, we present triUMPF (triple non-negative matrix factorization [NMF] with community detection for metabolic pathway inference), which combines three stages of NMF to capture myriad relationships between enzymes and pathways within a graph network. This is followed by community detection to extract a higher-order structure based on the clustering of vertices that share similar statistical properties. We evaluated triUMPF performance by using experimental datasets manifesting diverse multi-label properties, including Tier 1 genomes from the BioCyc collection of organismal Pathway/Genome Databases and low complexity microbial communities. Resulting performance metrics equaled or exceeded other prediction methods on organismal genomes with improved precision on multi-organismal datasets.
