
Showing papers in "Journal of Computational Biology in 2021"


Journal ArticleDOI
TL;DR: Results show that the proposed deep learning methods for diagnosis of ASD from functional brain networks constructed with brain functional magnetic resonance imaging (fMRI) data outperform the state-of-the-art methods.
Abstract: Autism spectrum disorder (ASD) is a neurological and developmental disorder. Traditional diagnosis of ASD is typically performed through the observation of behaviors and interview of a patient. However, these diagnosis methods are time-consuming and can sometimes be misleading. By integrating machine learning algorithms with neuroimages, a diagnosis method can possibly be established to distinguish ASD subjects from typical control subjects. In this study, we develop deep learning methods for diagnosis of ASD from functional brain networks constructed with brain functional magnetic resonance imaging (fMRI) data. The entire Autism Brain Imaging Data Exchange 1 (ABIDE 1) data set is utilized to investigate the performance of our proposed methods. First, we construct the brain networks from brain fMRI images and define the raw features based on such brain networks. Second, we employ an autoencoder (AE) to learn the advanced features from the raw features. Third, we train a deep neural network (DNN) with the advanced features, which achieves a classification accuracy of 76.2% and an area under the receiver operating characteristic curve (AUC) of 79.7%. As a comparison, we also apply the same advanced features to train several traditional machine learning algorithms to benchmark the classification performance. Finally, we combine the DNN with the pretrained AE and train it with the raw features, which achieves a classification accuracy of 79.2% and an AUC of 82.4%. These results show that our proposed deep learning methods outperform the state-of-the-art methods.
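The first step of the abstract above (building a brain network and deriving raw features from it) can be illustrated with a minimal sketch. The abstract does not say how the networks are constructed; Pearson correlation between region-of-interest (ROI) signals is assumed here as one common choice, and the function names `pearson` and `connectivity_features` are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long signals (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def connectivity_features(roi_signals: List[Sequence[float]]) -> List[float]:
    """Flatten the upper triangle of the ROI-by-ROI correlation matrix.

    Assumption: each edge weight of the functional network is the Pearson
    correlation between two ROI time series; the paper may use another measure.
    """
    n = len(roi_signals)
    return [
        pearson(roi_signals[i], roi_signals[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]


# Three toy ROI time series (made-up values, not fMRI data).
signals = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
features = connectivity_features(signals)
```

For n ROIs this yields n(n-1)/2 features per subject, which is the kind of raw feature vector an autoencoder could then compress.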

47 citations


Journal ArticleDOI
TL;DR: This work proves a lower bound on the size of the optimal SPSS and proposes a greedy method called UST that results in a smaller representation than unitigs and is nearly optimal with respect to the lower bound.
Abstract: Given the popularity and elegance of \(k\)-mer based tools, finding a space-efficient way to represent a set of \(k\)-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of \(k\)-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first one is a compression algorithm, UST-Compress, which we show can store a set of \(k\)-mers using an order-of-magnitude less disk space than other lossless compression tools. The second one is an exact static \(k\)-mer membership index, UST-FM, which we show improves index size by 10–44% compared to other state-of-the-art low memory indices. Our tool is publicly available at: https://github.com/medvedevgroup/UST/.
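The idea of a spectrum-preserving string set can be sketched with a simple greedy path cover. This is not the UST algorithm: as simplifying assumptions, it works on the plain (uncompacted) node-centric de Bruijn graph, ignores reverse complements, and uses the hypothetical names `greedy_spss` and `spectrum`.

```python
from typing import Iterable, List, Set

ALPHABET = "ACGT"


def greedy_spss(kmers: Iterable[str]) -> List[str]:
    """Cover every k-mer by exactly one path and spell each path as a string."""
    kmer_set: Set[str] = set(kmers)
    if not kmer_set:
        return []
    k = len(next(iter(kmer_set)))
    if k < 2 or any(len(x) != k for x in kmer_set):
        raise ValueError("all k-mers must share the same length k >= 2")

    visited: Set[str] = set()
    strings: List[str] = []
    for start in sorted(kmer_set):  # sorted only to make the output deterministic
        if start in visited:
            continue
        visited.add(start)
        spelled = start
        # Extend to the right while an unvisited successor k-mer exists.
        while True:
            overlap = spelled[-(k - 1):]
            nxt = [overlap + c for c in ALPHABET
                   if overlap + c in kmer_set and overlap + c not in visited]
            if not nxt:
                break
            visited.add(nxt[0])
            spelled += nxt[0][-1]
        # Extend to the left while an unvisited predecessor k-mer exists.
        while True:
            overlap = spelled[:k - 1]
            prev = [c + overlap for c in ALPHABET
                    if c + overlap in kmer_set and c + overlap not in visited]
            if not prev:
                break
            visited.add(prev[0])
            spelled = prev[0][0] + spelled
        strings.append(spelled)
    return strings


def spectrum(strings: Iterable[str], k: int) -> Set[str]:
    """Set of all k-mers occurring in the given strings."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}


example = {"ACG", "CGT", "GTA", "TAC", "CGG", "GGA", "TTT"}
spss = greedy_spss(example)
```

Each k-mer is consumed by exactly one path, so the output strings contain the input k-mers and nothing else, while overlapping k-mers share characters.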

19 citations


Journal ArticleDOI
TL;DR: By overcoming batch effects, this method was able to correctly separate cell types, improving on several prior methods suggested for this task. Analysis of the top features used by the network indicates that, by taking the batch impact into account, the reduced representation is much better able to focus on key genes for each cell type.
Abstract: Dimensionality reduction is an important first step in the analysis of single cell RNA-seq (scRNA-seq) data. In addition to enabling the visualization of the profiled cells, such representations are used by many downstream analysis methods ranging from pseudo-time reconstruction to clustering to alignment of scRNA-seq data from different experiments, platforms, and labs. Both supervised and unsupervised methods have been proposed to reduce the dimension of scRNA-seq. However, all methods to date are sensitive to batch effects. When batches correlate with cell types, as is often the case, their impact can lead to representations that are batch rather than cell type specific. To overcome this, we developed a domain adversarial neural network model for learning a reduced dimension representation of scRNA-seq data. The adversarial model tries to simultaneously optimize two objectives. The first is the accuracy of cell type assignment and the second is the inability to distinguish the batch (domain). We tested the method by using the resulting representation to align several different datasets. As we show, by overcoming batch effects our method was able to correctly separate cell types, improving on several prior methods suggested for this task. Analysis of the top features used by the network indicates that by taking the batch impact into account, the reduced representation is much better able to focus on key genes for each cell type.

16 citations


Journal ArticleDOI
TL;DR: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), responsible for coronavirus disease 2019 (COVID-19), has wreaked havoc on the health and economy of humanity.
Abstract: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), responsible for coronavirus disease 2019 (COVID-19), has wreaked havoc on the health and economy of humanity. In addition, the ...

16 citations


Journal ArticleDOI
TL;DR: An algorithm solving the genomic distance problem for natural genomes, in which any marker may occur an arbitrary number of times, is presented, based on a new graph data structure, the multi-relational diagram, that allows an elegant extension of the ILP to count runs of markers that are under- or over-represented in one genome with respect to the other and need to be inserted or deleted.
Abstract: The computation of genomic distances has been a very active field of computational comparative genomics over the last 25 years. Substantial results include the polynomial-time computability of the inversion distance by Hannenhalli and Pevzner in 1995 and the introduction of the double-cut and join (DCJ) distance by Yancopoulos, Attie and Friedberg in 2005. Both results, however, rely on the assumption that the genomes under comparison contain the same set of unique markers (syntenic genomic regions, sometimes also referred to as genes). In 2015, Shao, Lin and Moret relax this condition by allowing for duplicate markers in the analysis. This generalized version of the genomic distance problem is NP-hard, and they give an ILP solution that is efficient enough to be applied to real-world datasets. A restriction of their approach is that it can be applied only to balanced genomes, that have equal numbers of duplicates of any marker. Therefore it still needs a delicate preprocessing of the input data in which excessive copies of unbalanced markers have to be removed.

14 citations


Journal ArticleDOI
TL;DR: The identified candidate key genes and pathways help to understand the molecular mechanisms underlying the pathogenesis of IPAH and may be novel biomarkers in IPAH diagnosis.
Abstract: Idiopathic pulmonary arterial hypertension (IPAH) is a fatal cardiovascular disease with significant morbidity and mortality. However, its potential molecular mechanisms and potential key genes have not been fully evaluated. The gene expression profile of GSE33463, including 30 individuals diagnosed with IPAH and 41 normal controls, was downloaded from the Gene Expression Omnibus database. The differentially expressed genes (DEGs) were identified using the limma package in R. Gene Ontology (GO) annotation and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses were carried out to gain further insight into the possible functions of the identified DEGs. Then, the protein-protein interaction (PPI) network of all DEGs was constructed. Nodes with higher degree centrality (≥10) were considered as hub proteins in the PPI network. Area under the curve (AUC) values obtained from the receiver operating characteristic (ROC) curve analysis were utilized to assess the diagnostic effectiveness of hub genes in discriminating IPAH from normal individuals. Sixty-nine DEGs were identified, including 41 upregulated and 28 downregulated DEGs. The GO enrichment analysis indicated that genes were significantly enriched in oxygen carrier activity, oxygen binding, heme binding, molecular carrier activity, and antioxidant activity. KEGG pathway enrichment showed that genes were mainly involved in cytokine and cytokine receptor, chemokine signaling pathway, interleukin-17 signaling pathway, and Toll-like receptor (TLR) signaling pathway. JUN, ALAS2, HBD, EPB42, TLR7, SLC4A1, and CXCR4 were identified as the hub gene nodes. The area under the ROC curve indicated that three hub genes have high diagnostic value in IPAH, with an AUC of 0.934 [95% confidence interval (CI): 0.849-0.979] for TLR7, 0.910 (95% CI: 0.818-0.965) for JUN, and 0.895 (95% CI: 0.800-0.955) for CXCR4.
The identified candidate key genes and pathways help us understand the molecular mechanisms underlying the pathogenesis of IPAH. TLR7, JUN, and CXCR4 may be novel biomarkers in IPAH diagnosis.
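The hub-selection rule described in this abstract (nodes with degree of at least 10 in the PPI network) can be sketched as follows. The threshold is read here as the raw number of distinct interaction partners, which is an assumption, and the function name `hub_nodes` and the toy edge list are hypothetical rather than taken from the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def hub_nodes(edges: Iterable[Tuple[str, str]], min_degree: int = 10) -> Dict[str, int]:
    """Return {node: degree} for nodes with at least `min_degree` distinct partners.

    The network is treated as undirected; self-loops and duplicate edges
    do not inflate the degree.
    """
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue
        neighbors[u].add(v)
        neighbors[v].add(u)
    return {node: len(nb) for node, nb in neighbors.items() if len(nb) >= min_degree}


# Toy network: one protein "H" interacting with ten partners, plus one extra edge.
toy_edges = [("H", f"p{i}") for i in range(10)] + [("p0", "p1"), ("H", "p0")]
hubs = hub_nodes(toy_edges)
```

In practice the edge list would come from a PPI resource and the selected hubs would then be evaluated individually, for example by ROC analysis as in the abstract.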

14 citations


Journal ArticleDOI
TL;DR: In this paper, the authors identify novel variants and subtypes of SARS-CoV-2 by clustering sequences, adapting methods originally designed for haplotyping intrahost viral populations.
Abstract: The availability of millions of SARS-CoV-2 (Severe Acute Respiratory Syndrome-Coronavirus-2) sequences in public databases such as GISAID (Global Initiative on Sharing All Influenza Data) and EMBL-EBI (European Molecular Biology Laboratory-European Bioinformatics Institute) (the United Kingdom) allows a detailed study of the evolution, genomic diversity, and dynamics of a virus as never before. Here, we identify novel variants and subtypes of SARS-CoV-2 by clustering sequences, adapting methods originally designed for haplotyping intrahost viral populations. We assess our results using clustering entropy, the first time it has been used in this context. Our clustering approach reaches lower entropies compared with other methods, and we are able to boost this even further through gap filling and Monte Carlo-based entropy minimization. Moreover, our method clearly identifies the well-known Alpha variant in the U.K. and GISAID data sets, and is also able to detect the much less represented (<1% of the sequences) Beta (South Africa), Epsilon (California), and Gamma and Zeta (Brazil) variants in the GISAID data set. Finally, we show that each variant identified has high selective fitness, based on the growth rate of its cluster over time. This demonstrates that our clustering approach is a viable alternative for detecting even rare subtypes in very large data sets.
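A rough sketch of how an entropy score for a clustering of aligned sequences might be computed is given here. The abstract does not define clustering entropy, so the definition used (per-column Shannon entropy summed within a cluster, then averaged over clusters weighted by cluster size) is an assumption, as are the function names.

```python
from collections import Counter
from math import log2
from typing import List, Sequence


def column_entropy(column: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the characters in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return sum(-(c / n) * log2(c / n) for c in counts.values())


def cluster_entropy(sequences: List[str]) -> float:
    """Sum of column entropies over equally long, aligned sequences."""
    return sum(column_entropy(col) for col in zip(*sequences))


def clustering_entropy(clusters: List[List[str]]) -> float:
    """Size-weighted average of the cluster entropies (assumed definition)."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)


# Toy aligned sequences (made up): mixing two groups gives a higher score
# than separating them into homogeneous clusters.
mixed = clustering_entropy([["AAAA", "AAAA", "TTTT", "TTTT"]])
separated = clustering_entropy([["AAAA", "AAAA"], ["TTTT", "TTTT"]])
```

Under this definition a lower value means more homogeneous clusters, which matches the abstract's use of lower entropy as the better outcome.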

11 citations


Journal ArticleDOI
TL;DR: plastic (PipeLine Amalgamating Single-cell Tree Inference Components) is an easy-to-use and quick-to-adapt pipeline that integrates three different steps: (1) simplifying the input data, (2) inferring tumor phylogenies, and (3) comparing the phylogenies.
Abstract: In recent years, there has been an increasing number of single-cell sequencing studies, producing a considerable number of new data sets. This has particularly affected the field of cancer analysis, where more and more articles are published using this sequencing technique that allows for capturing more detailed information regarding the specific genetic mutations on each individually sampled cell. As the amount of information increases, it is necessary to have more sophisticated and rapid tools for analyzing the samples. To this end, we developed plastic (PipeLine Amalgamating Single-cell Tree Inference Components), an easy-to-use and quick-to-adapt pipeline that integrates three different steps: (1) to simplify the input data, (2) to infer tumor phylogenies, and (3) to compare the phylogenies. We have created a pipeline submodule for each of those steps and developed new in-memory data structures that allow for easy and transparent sharing of the information across the tools implementing the above steps. While we use existing open source tools for those steps, we have extended the tool used for simplifying the input data, incorporating two machine learning procedures, which greatly reduce the running time without affecting the quality of the downstream analysis. Moreover, we have introduced the capability of producing some plots to quickly visualize results.

10 citations


Journal ArticleDOI
TL;DR: In this study, it is proved that computing the rearrangement distance for the following models is NP-hard: reversals and indels on unsigned strings; transpositions and indels on unsigned strings; and reversals, transpositions, and indels on signed and unsigned strings.
Abstract: The rearrangement distance is a well-known problem in the field of comparative genomics. Given two genomes, the rearrangement distance is the minimum number of rearrangements in a set of allowed rearrangements (rearrangement model), which transforms one genome into the other. In rearrangement distance problems, a genome is modeled as a string, where each element represents a conserved region within the two genomes. When the orientation of the genes is known, it is represented by (plus or minus) signs assigned to the elements of the string. Two of the most studied rearrangements are reversals, which invert a segment of the genome, and transpositions, which exchange the relative positions of two adjacent segments of the genome. The first works in genome rearrangements considered that the genomes being compared had the same genetic material and that rearrangement events were restricted to reversals, transpositions, or both. El-Mabrouk extended the reversal model on signed strings to include the operations of insertion and deletion of segments in the genome, which allowed the comparison of genomes with different genetic material. Other studies also addressed this problem and, recently, this problem was proved to be solvable in polynomial time by Willing et al. For unsigned strings, we still observe a lack of results. That said, in this study we prove that computing the rearrangement distance for the following models is NP-Hard: reversals and indels on unsigned strings; transpositions and indels on unsigned strings; and reversals, transpositions, and indels on signed and unsigned strings. Along with the NP-hardness proofs, we present a 2-approximation algorithm for reversals on unsigned strings and 3-approximation algorithms for the other models.
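The two rearrangement operations defined in this abstract can be made concrete with a short sketch. Genomes are modeled as lists of integers, with a negative value standing for a reversed orientation; the function names and index conventions are illustrative choices, not the paper's notation.

```python
from typing import List


def reversal(genome: List[int], i: int, j: int, signed: bool = True) -> List[int]:
    """Invert the segment genome[i..j] (inclusive); flip signs if the genome is signed."""
    if not (0 <= i <= j < len(genome)):
        raise ValueError("require 0 <= i <= j < len(genome)")
    segment = genome[i:j + 1][::-1]
    if signed:
        segment = [-g for g in segment]
    return genome[:i] + segment + genome[j + 1:]


def transposition(genome: List[int], i: int, j: int, k: int) -> List[int]:
    """Exchange the adjacent segments genome[i:j] and genome[j:k]."""
    if not (0 <= i < j < k <= len(genome)):
        raise ValueError("require 0 <= i < j < k <= len(genome)")
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


identity = [1, 2, 3, 4, 5]
after_signed_reversal = reversal(identity, 1, 3)                    # [1, -4, -3, -2, 5]
after_unsigned_reversal = reversal(identity, 1, 3, signed=False)    # [1, 4, 3, 2, 5]
after_transposition = transposition(identity, 1, 2, 4)              # [1, 3, 4, 2, 5]
```

The rearrangement distance is then the length of a shortest sequence of such operations (plus insertions and deletions in the indel models) that transforms one genome into the other; computing it is the hard part that the paper addresses.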

10 citations


Journal ArticleDOI
TL;DR: Wang et al. proposed a model, Ess-NEXG, to identify essential proteins; it integrates biological information, including orthologous information, subcellular localization information, RNA-Seq information, and the PPI network.
Abstract: Essential proteins are a vital part of the survival of organisms and cells. Identification of essential proteins lays a solid foundation for understanding protein functions and discovering drug targets. Traditional biological experiments are expensive and time-consuming. Recently, many computational methods have been proposed. However, noise in protein-protein interaction (PPI) networks affects the efficiency of essential protein prediction. It is necessary to construct a credible PPI network by using other useful biological information to reduce the effects of this noise. In this article, we propose a model, Ess-NEXG, to identify essential proteins, which integrates biological information, including orthologous information, subcellular localization information, RNA-Seq information, and the PPI network. In our model, first, we construct a credible weighted PPI network by using different types of biological information. Second, we extract the topological features of proteins in the constructed weighted PPI network by using the node2vec technique. Last, we use eXtreme Gradient Boosting (XGBoost) to predict essential proteins by using the topological features of proteins. Extensive results show that our model has better performance than other computational methods.

9 citations


Journal ArticleDOI
TL;DR: It is shown that species trees are identifiable under a standard stochastic model for GDL, and that the polynomial-time algorithm ASTRAL-multi, a recent development in the ASTRAL suite of methods, is statistically consistent under this GDL model.
Abstract: Phylogenomics—the estimation of species trees from multi-locus datasets—is a common step in many biological studies. However, this estimation is challenged by the fact that genes can evolve under processes, including incomplete lineage sorting (ILS) and gene duplication and loss (GDL), that make their trees different from the species tree. In this paper, we address the challenge of estimating the species tree under GDL. We show that species trees are identifiable under a standard stochastic model for GDL, and that the polynomial-time algorithm ASTRAL-multi, a recent development in the ASTRAL suite of methods, is statistically consistent under this GDL model. We also provide a simulation study evaluating ASTRAL-multi for species tree estimation under GDL. All scripts and datasets used in this study are available on the Illinois Data Bank: https://doi.org/10.13012/B2IDB-2626814_V1.

Journal ArticleDOI
TL;DR: The program EPTool is the implementation of Bagging MSA Learning, which provides a complete training and evaluation workflow for the enhancing PSSM model, and is capable of handling different input data sets and various computing algorithms to train the enhancing model.
Abstract: Recently, a deep learning-based method for enhancing the Position-Specific Scoring Matrix (PSSM), Bagging Multiple Sequence Alignment (MSA) Learning (Guo et al.), has been proposed, and its effectiveness has been demonstrated empirically. The program EPTool is the implementation of Bagging MSA Learning, which provides a complete training and evaluation workflow for the enhancing PSSM model. It is capable of handling different input data sets and various computing algorithms to train the enhancing model, and eventually improves the PSSM quality for those proteins with insufficient homologous sequences. In addition, EPTool is equipped with several convenient applications, such as a PSSM feature calculator and PSSM feature visualization. In this article, we present EPTool and briefly introduce its functionalities and applications. Detailed, accessible instructions are also provided.

Journal ArticleDOI
TL;DR: NetMix, an algorithm that uses Gaussian mixture models to obtain less biased estimates of the parameters of the Altered Subset Distribution (ASD), is introduced and shown to outperform existing methods in identifying altered subnetworks on both simulated and real data.
Abstract: A classic problem in computational biology is the identification of altered subnetworks: subnetworks of an interaction network that contain genes/proteins that are differentially expressed, highly mutated, or otherwise aberrant compared to other genes/proteins. Numerous methods have been developed to solve this problem under various assumptions, but the statistical properties of these methods are often unknown. For example, some widely-used methods are reported to output very large subnetworks that are difficult to interpret biologically. In this work, we formulate the identification of altered subnetworks as the problem of estimating the parameters of a class of probability distributions which we call the Altered Subset Distribution (ASD). We derive a connection between a popular method, jActiveModules, and the maximum likelihood estimator (MLE) of the ASD. We show that the MLE is statistically biased, explaining the large subnetworks output by jActiveModules. We introduce NetMix, an algorithm that uses Gaussian mixture models to obtain less biased estimates of the parameters of the ASD. We demonstrate that NetMix outperforms existing methods in identifying altered subnetworks on both simulated and real data, including the identification of differentially expressed genes from both microarray and RNA-seq experiments and the identification of cancer driver genes in somatic mutation data.

Journal ArticleDOI
TL;DR: A computational drug development approach was used to screen antiviral molecules from two antiviral libraries (Life Chemicals [LC] and ASINEX) against RdRP, and six molecules were identified as potential inhibitors of SARS-CoV-2 RdRP.
Abstract: The detrimental effect of the coronavirus disease 2019 (COVID-19) pandemic has manifested itself as a global crisis. Currently, no specific treatment options are available for COVID-19, so therapeutic interventions to tackle the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection must be urgently established. Therefore, cohesive and multidimensional efforts are required to identify new therapies or investigate the efficacy of small molecules and existing drugs against SARS-CoV-2. Since the RNA-dependent RNA Polymerase (RdRP) of SARS-CoV-2 is a promising therapeutic target, this study addresses the identification of antiviral molecules that can specifically target SARS-CoV-2 RdRP. The computational approach of drug development was used to screen the antiviral molecules from two antiviral libraries (Life Chemicals [LC] and ASINEX) against RdRP. Here, we report six antiviral molecules (F3407-4105, F6523-2250, F6559-0746 from LC and BDG 33693278, BDG 33693315, LAS 34156196 from ASINEX), which show substantial interactions with key amino acid residues of the active site of SARS-CoV-2 RdRP and exhibit higher binding affinity (>7.5 kcal mol⁻¹) than Galidesivir, a Food and Drug Administration-approved inhibitor of the same. Further, molecular dynamics simulation and Molecular Mechanics Poisson-Boltzmann Surface Area results confirmed that the identified molecules formed more stable RdRP-inhibitor complexes than the RdRP-Galidesivir complex. Our findings suggest that these molecules could be potential inhibitors of SARS-CoV-2 RdRP. However, further in vitro and preclinical experiments would be required to validate these potential inhibitors of SARS-CoV-2 protein.

Journal ArticleDOI
TL;DR: In this paper, Gene Set Enrichment Analysis (GSEA) is used to identify differentially expressed gene sets that are enriched for annotated biological functions, and the existing GSEA R code is not in the form of a fle...
Abstract: Gene Set Enrichment Analysis (GSEA) is used to identify differentially expressed gene sets that are enriched for annotated biological functions. The existing GSEA R code is not in the form of a fle...

Journal ArticleDOI
TL;DR: In this article, two mathematical algorithms, lattice up-stream targeting (LUST) and D-basis, were applied to the identification of prognostic signatures from cancer gene expression data.
Abstract: This study applied two mathematical algorithms, lattice up-stream targeting (LUST) and D-basis, to the identification of prognostic signatures from cancer gene expression data. The LUST algorithm l...

Journal ArticleDOI
TL;DR: This study used the public reduced representation bisulfite sequencing data of mouse for evaluating software and revealing novel biologically significant results to supplement the previous research.
Abstract: DNA methylation in gene or gene body could influence gene transcription. Moreover, methylation in gene regions along with CpG island regions could modulate the transcription to undetectable gene expression levels. Therefore, it is necessary to investigate the methylation levels within the gene, gene body, CpG island regions, and their overlapped regions and then identify the gene-based differentially methylated regions (GeneDMRs). In this study, R package GeneDMRs aims to facilitate computing gene-based methylation rate using next-generation sequencing-based methylome data. The user-friendly GeneDMRs package is presented to analyze the methylation levels in each gene/promoter/exon/intron/CpG island/CpG island shore or each overlapped region (e.g., gene-CpG island/promoter-CpG island/exon-CpG island/intron-CpG island/gene-CpG island shore/promoter-CpG island shore/exon-CpG island shore/intron-CpG island shore). GeneDMRs can also interpret complex interplays between methylation levels and gene expression differences or similarities across physiological conditions or disease states. We used the public reduced representation bisulfite sequencing data of mouse (GSE62392) for evaluating software and revealing novel biologically significant results to supplement the previous research. In addition, the whole-genome bisulfite sequencing data of cattle (GSE106538) given the much larger size was used for further evaluation.

Journal ArticleDOI
TL;DR: IMFLer as discussed by the authors is an interactive metabolic flux analyzer and visualizer that enables the reading and management of metabolic model layout maps, as well as immediate visualization of results from both FBA and flux variability analysis (FVA).
Abstract: Increasing genome-wide data in biological sciences and medicine has contributed to the development of a variety of visualization tools. Several automatic, semiautomatic, and manual visualization tools have already been developed. Some even have integrated flux balance analysis (FBA), but in most cases, it depends on separately installed third-party software that is proprietary, does not allow customization of its functionality, and has many restrictions on easy data distribution and analysis. In this study, we present an interactive metabolic flux analyzer and visualizer (IMFLer), a static single-page web application that enables the reading and management of metabolic model layout maps, as well as immediate visualization of results from both FBA and flux variability analysis (FVA). IMFLer uses the Escher Builder tool to load, show, edit, and save metabolic pathway maps. This makes IMFLer an attractive and easily applicable tool with a user-friendly interface. Moreover, it allows faster interpretation of results from FBA and FVA and improves data interoperability by using a standardized file format for the genome-scale metabolic model. IMFLer is a fully open-source tool that enables the rapid visualization and interpretation of the results of FBA and FVA with no setup time and no programming skills required, available at https://lv-csbg.github.io/IMFLer/.

Journal ArticleDOI
TL;DR: A novel deep multi-task learning algorithm is proposed that automatically learns the biological interrelations among target genes and uses this information to enhance prediction; it can effectively learn the interrelations from the large-scale tasks of the gene expression inference problem and does not suffer from cost-prohibitive operations.
Abstract: Gene expression profiling empowers many biological studies in various fields by comprehensive characterization of cellular status under different experimental conditions. Despite the recent advances in high-throughput technologies, profiling the whole-genome set is still challenging and expensive. Based on the fact that there is high correlation among the expression patterns of different genes, the above issue can be addressed by a cost-effective approach that collects only a small subset of genes, called landmark genes, as the representative of the entire genome set and estimates the remaining ones, called target genes, via a computational model. Several shallow and deep regression models have been presented in the literature for inferring the expressions of target genes. However, the shallow models suffer from underfitting due to their insufficient capacity in capturing the complex nature of gene expression data, and the existing deep models are prone to overfitting because they do not use the interrelations of target genes in the learning framework. To address these challenges, we formulate the gene expression inference as a multi-task learning problem and propose a novel deep multi-task learning algorithm that automatically learns the biological interrelations among target genes and utilizes such information to enhance the prediction. In particular, we employ a multi-layer sub-network with low dimensional latent variables for learning the interrelations among target genes (i.e., distinct predictive tasks), and impose a seamless and easy-to-implement regularization on deep models. Unlike the conventional complicated multi-task learning methods, which can only deal with tens or hundreds of tasks, our proposed algorithm can effectively learn the interrelations from the large-scale (\(\sim \)10,000) tasks on the gene expression inference problem, and does not suffer from cost-prohibitive operations.
Experimental results indicate the superiority of our method compared to the existing gene expression inference models and alternative multi-task learning algorithms on two large-scale datasets.

Journal ArticleDOI
TL;DR: A deep learning model is developed, called DeepCTCFLoop, to predict whether a chromatin loop can be formed between a pair of convergent or tandem CTCF motifs using only the DNA sequences of the motifs and their flanking regions, and it is shown that DNA motifs binding to several transcription factors, including ZNF384, ZNF263, ASCL1, SP1, and ZEB1, may constitute the complex sequence patterns for CTCF-mediated chromatin loop formation.
Abstract: The three-dimensional (3D) organization of the human genome is of crucial importance for gene regulation, and the CCCTC-binding factor (CTCF) plays an important role in chromatin interactions. However, it is still unclear what sequence patterns in addition to CTCF motif pairs determine chromatin loop formation. To discover the underlying sequence patterns, we have developed a deep learning model, called DeepCTCFLoop, to predict whether a chromatin loop can be formed between a pair of convergent or tandem CTCF motifs using only the DNA sequences of the motifs and their flanking regions. Our results suggest that DeepCTCFLoop can accurately distinguish the CTCF motif pairs forming chromatin loops from the ones not forming loops. It significantly outperforms CTCF-MP, a machine learning model based on word2vec and boosted trees, when using DNA sequences only. Furthermore, we show that DNA motifs binding to several transcription factors, including ZNF384, ZNF263, ASCL1, SP1, and ZEB1, may constitute the complex sequence patterns for CTCF-mediated chromatin loop formation. DeepCTCFLoop has also been applied to disease-associated sequence variants to identify candidates that may disrupt chromatin loop formation. Therefore, our results provide useful information for understanding the mechanism of 3D genome organization and may also help annotate and prioritize the noncoding sequence variants associated with human diseases.

Journal ArticleDOI
TL;DR: In this article, a competing endogenous RNA (ceRNA) network was constructed based on potential long-noncoding RNA (lncRNA)-microRNA (miRNA)-mRNA interactions.
Abstract: Hepatocellular carcinoma (HCC) is a common malignant tumor worldwide. In this study, we aimed to explore the potential biomarkers and key regulatory pathways related to HCC using integrated bioinformatic analysis and validation. The microarray data of GSE12717 and GSE54238 were downloaded from the Gene Expression Omnibus database. A competing endogenous RNA (ceRNA) network was constructed based on potential long-noncoding RNA (lncRNA)-microRNA (miRNA)-mRNA interactions. A total of 191 mRNAs, 8 miRNAs, and 5 lncRNAs were selected to construct the ceRNA network. Gene Ontology and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis were used to predict their biological functions. The PI3K-Akt signaling pathway was significantly enriched. Kaplan-Meier survival analysis based on the Gene Expression Profiling Interactive Analysis (GEPIA) database was conducted for the weighted mRNAs and lncRNAs. The results showed that SRC, GMPS, CDK2, FEN1, EZH2, ZWINT, MTHFD1L, GINS2, and MAPKAPK5-AS1 were significantly upregulated in tumor tissues. The relative expression levels of these genes were significantly upregulated in HCC patients based on the StarBase database. For further validation, the expression levels of these genes were detected by real-time quantitative reverse transcription-polymerase chain reaction in 20 HCC tumor tissues and paired paracancerous tissues. Receiver operating characteristic analysis revealed that CDK2, MTHFD1L, SRC, ZWINT, and MAPKAPK5-AS1 had significant diagnostic value in HCC, but further studies are needed to explore their mechanisms in HCC.

Journal ArticleDOI
TL;DR: In this article, the authors present a framework built around a set of relationships that both unifies the information measures for the discrete functions and uses them to express key quantitative genetic relationships, and a general approach is described for inferring functional relationships in genotype and phenotype data.
Abstract: Quantitative genetics has evolved dramatically in the past century, and the proliferation of genetic data, in quantity as well as type, enables the characterization of complex interactions and mechanisms beyond the scope of its theoretical foundations. In this article, we argue that revisiting the framework for analysis is important and we begin to lay the foundations of an alternative formulation of quantitative genetics based on information theory. Information theory can provide sensitive and unbiased measures of statistical dependencies among variables, and it provides a natural mathematical language for an alternative view of quantitative genetics. In previous work, we examined the information content of discrete functions and applied this approach to the analysis of genetic data. In this article, we present a framework built around a set of relationships that both unifies the information measures for the discrete functions and uses them to express key quantitative genetic relationships. Information theory measures of variable interdependency are used to identify significant interactions, and a general approach is described for inferring functional relationships in genotype and phenotype data. We present information-based measures of the genetic quantities: penetrance, heritability, and degrees of statistical epistasis. Our scope here includes the consideration of both two- and three-variable dependencies and independently segregating variants, which captures additive effects, genetic interactions, and two-phenotype pleiotropy. This formalism and the theoretical approach naturally apply to higher multivariable interactions and complex dependencies, and can be adapted to account for population structure, linkage, and nonrandomly segregating markers. This article thus focuses on presenting the initial groundwork for a full formulation of quantitative genetics based on information theory.
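
As a rough illustration of the kind of dependency measure this abstract refers to, the sketch below computes the standard plug-in estimate of mutual information between two discrete variables. It is not the authors' framework: the function name `mutual_information`, the 0/1/2 genotype coding, and the toy data are all assumptions made for the example.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical toy data: genotype coded 0/1/2, binary phenotype
genotype = [0, 0, 1, 1, 2, 2, 0, 1, 2, 0]
phenotype = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0]
print(mutual_information(genotype, phenotype))
```

A value of zero indicates no detectable dependence in the sample, and larger values indicate stronger dependence. Plug-in estimates are biased upward for small samples, so a real analysis would need a significance test or a bias correction.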

Journal ArticleDOI
TL;DR: The DEGs and hub genes identified in this study may help to understand the potential etiology of the occurrence and development of AS.
Abstract: Cardiovascular and cerebrovascular diseases, which mainly consist of atherosclerosis (AS), are major causes of death. A great deal of research has been carried out to clarify the molecular mechanisms of AS. However, the etiology of AS remains poorly understood. To screen the potential genes of AS occurrence and development, GSE43292 and GSE57691 were obtained from the Gene Expression Omnibus (GEO) database in this study for bioinformatic analysis. First, GEO2R was used to identify differentially expressed genes (DEGs) and the functional annotation of DEGs was performed by gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis. The Search Tool for the Retrieval of Interacting Genes (STRING) tool was used to construct the protein-protein interaction network and the most important modules and core genes were mined. The results show that a total of 211 DEGs are identified. The functional changes of DEGs are mainly associated with the cellular process, catalytic activity, and protein binding. Eighteen genes were identified as core genes. Bioinformatic analysis showed that the core genes are mainly enriched in numerous processes related to actin. In conclusion, the DEGs and hub genes identified in this study may help us understand the potential etiology of the occurrence and development of AS.

Journal ArticleDOI
TL;DR: In this article, a fast algorithm for computing Fourier power spectra at fractional periods of real sequences is presented, which can be beneficial in many digital signal processing applications.
Abstract: Directly computing Fourier power spectra at fractional periods of real sequences can be beneficial in many digital signal processing applications. In this article, we present a fast algorithm to co...
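
To make the quantity concrete, the sketch below evaluates the Fourier power of a real sequence at an arbitrary (possibly fractional) period by direct summation. This is the naive baseline, not the fast algorithm of the article, whose details are truncated above. The function name `power_at_period` and the test signal are assumptions made for the example.

```python
import cmath
import math


def power_at_period(x, period):
    """Return |sum_n x[n] * exp(-2*pi*i*n/period)|^2 by direct summation."""
    if period <= 0:
        raise ValueError("period must be positive")
    omega = 2.0 * math.pi / period  # angular frequency for this period
    s = sum(v * cmath.exp(-1j * omega * n) for n, v in enumerate(x))
    return abs(s) ** 2


# Hypothetical signal whose true period is 3.6 samples (not an integer)
signal = [math.cos(2 * math.pi * n / 3.6) for n in range(360)]
for p in (3.0, 3.6, 4.0):
    print(p, power_at_period(signal, p))
```

Each evaluation costs time linear in the sequence length, so scanning many candidate periods this way becomes expensive, which is the cost a fast algorithm would aim to reduce.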

Journal ArticleDOI
TL;DR: This study constructs a miRNA functional similarity network derived from a disease similarity network and a known miRNA-disease relationship network, presents an improved K-means algorithm to detect miRNA functional modules, and uses 243 diseases to validate the performance of the proposed method.
Abstract: Inferring potential associations between microRNAs (miRNAs) and human diseases can help people understand the pathogenesis of complex human diseases. Several computational approaches have been pres...

Journal ArticleDOI
TL;DR: Current technologies allow the sequencing of microbial communities directly from the environment without prior culturing, and one of the major problems when analyzing a microbial sample is taxonomic classification.
Abstract: Current technologies allow the sequencing of microbial communities directly from the environment without prior culturing. One of the major problems when analyzing a microbial sample is to taxonomic...

Journal ArticleDOI
TL;DR: The present study identifies hits that can be further designed and modified as potent LdDHFR inhibitors, and two hits were found to be more selective than the reported potent LdDHFR inhibitor.
Abstract: Dihydrofolate reductase (DHFR) is a well-known enzyme of the folate metabolic pathway and it is a validated drug target for leishmaniasis. However, only a few leads are reported against Leishmania ...

Journal ArticleDOI
TL;DR: Evidence that macrolevel pandemic dynamics, such as social distancing, modulate the genomic evolution of SARS-CoV-2 is provided, which complements the prevalent paradigm that microlevel observables control macrolevel parameters such as death rates and infection patterns.
Abstract: COVID-19 is an infectious disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The viral genome is considered to be relatively stable and the mutations that have been observed and reported thus far are mainly focused on the coding region. This article provides evidence that macrolevel pandemic dynamics, such as social distancing, modulate the genomic evolution of SARS-CoV-2. This view complements the prevalent paradigm that microlevel observables control macrolevel parameters such as death rates and infection patterns. First, we observe differences in mutational signals for geospatially separated populations such as the prevalence of A23404G in CA versus NY and WA. We show that the feedback between macrolevel dynamics and the viral population can be captured employing a transfer entropy framework. Second, we observe complex interactions within mutational clades. Namely, when C14408T first appeared in the viral population, the frequency of A23404G spiked in the subsequent week. Third, we identify a noncoding mutation, G29540A, within the segment between the coding gene of the N protein and the ORF10 gene, which is largely confined to NY ([Formula: see text]95%). These observations indicate that macrolevel sociobehavioral measures have an impact on the viral genomics and may be useful for the dashboard-like tracking of its evolution. Finally, despite the fact that SARS-CoV-2 is a genetically robust organism, our findings suggest that we are dealing with a high degree of adaptability. Owing to its ample spread, mutations of unusual form are observed and a high complexity of mutational interaction is exhibited.

Journal ArticleDOI
TL;DR: The results suggest the crucial roles of TNFRSF1A, CLDN1, and CASP1 in the tumorigenesis of PTC and provide a vital bioinformatic basis for further experimental validations and clinical applications.
Abstract: Although the incidence of thyroid carcinoma is reported to be the highest among malignancies of the endocrine system, its diagnosis is still unsatisfactory. This study sought to explore the key DNA methylation-driven genes in the development of papillary thyroid carcinoma (PTC) via a bioinformatic analysis based on The Cancer Genome Atlas (TCGA) database, with validation using the Gene Expression Omnibus (GEO) database. The level 3 DNA methylation, mRNA expression, and clinical data of 499 patients with PTC were obtained from the TCGA database. The R packages limma, edgeR, and MethylMix were applied to explore the DNA methylation-driven genes in PTC. The ConsensusPathDB software and the DAVID and STRING databases were used for Gene Ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes pathway analyses, as well as for protein-protein interaction network construction. To verify the results, the identified genes were validated using the GSE97466 data set retrieved from the GEO database. Fifty-seven methylation-driven genes were detected via MethylMix, which is based on a beta mixture model and compares the DNA methylation state of tumor tissues with that of normal tissues. Eventually, three genes (TNFRSF1A, CLDN1, and CASP1) were identified as the most promising potential biomarkers for the diagnosis or treatment of PTC. These results suggest the crucial roles of TNFRSF1A, CLDN1, and CASP1 in the tumorigenesis of PTC and provide a vital bioinformatic basis for further experimental validations and clinical applications.

Journal ArticleDOI
TL;DR: Triple non-negative matrix factorization with community detection for metabolic pathway inference (triUMPF) combines three stages of NMF to capture myriad relationships between enzymes and pathways within a graph network.
Abstract: Machine learning provides a probabilistic framework for metabolic pathway inference from genomic sequence information at different levels of complexity and completion. However, several challenges, including pathway features engineering, multiple mapping of enzymatic reactions, and emergent or distributed metabolism within populations or communities of cells, can limit prediction performance. In this article, we present triUMPF (triple non-negative matrix factorization [NMF] with community detection for metabolic pathway inference), which combines three stages of NMF to capture myriad relationships between enzymes and pathways within a graph network. This is followed by community detection to extract a higher-order structure based on the clustering of vertices that share similar statistical properties. We evaluated triUMPF performance by using experimental datasets manifesting diverse multi-label properties, including Tier 1 genomes from the BioCyc collection of organismal Pathway/Genome Databases and low complexity microbial communities. Resulting performance metrics equaled or exceeded other prediction methods on organismal genomes with improved precision on multi-organismal datasets.