
Showing papers in "Briefings in Bioinformatics in 2023"


Journal ArticleDOI
TL;DR: Zhang et al. proposed a dual-view drug representation learning network for drug-drug interaction prediction, which employs local and global representation learning modules iteratively and learns drug substructures from the single drug and the drug pair simultaneously.
Abstract: Drug-drug interaction (DDI) prediction identifies interactions in drug combinations whose adverse side effects, caused by physicochemical incompatibility, have attracted much attention. Previous studies usually model drug information from single or dual views of whole drug molecules but ignore the detailed interactions among atoms, which leads to incomplete and noisy information and limits the accuracy of DDI prediction. In this work, we propose a novel dual-view drug representation learning network for DDI prediction ('DSN-DDI'), which employs local and global representation learning modules iteratively and learns drug substructures from the single drug ('intra-view') and the drug pair ('inter-view') simultaneously. Comprehensive evaluations demonstrate that DSN-DDI significantly improves performance on DDI prediction for existing drugs, achieving a relative accuracy improvement of 13.01% and an over 99% accuracy under the transductive setting. More importantly, DSN-DDI achieves a relative accuracy improvement of 7.07% on unseen drugs and shows its usefulness for real-world DDI applications. Finally, DSN-DDI exhibits good transferability on synergistic drug combination prediction and thus can serve as a generalized framework in the drug discovery field.

6 citations


Journal ArticleDOI
TL;DR: In this article, the effect of pH on the solubility of proteins is investigated, and the results show that pH-dependent predictions can achieve an accuracy comparable with that of experimental methods.
Abstract: Solubility is a property of central importance for the use of proteins in research in molecular and cell biology and in applications in biotechnology and medicine. Since experimental methods for measuring protein solubility are material intensive and time consuming, computational methods have recently emerged to enable the rapid and inexpensive screening of solubility for large libraries of proteins, as is routinely required in development pipelines. Here, we describe the development of one such method to include in the predictions the effect of the pH on solubility. We illustrate the resulting pH-dependent predictions on a variety of antibodies and other proteins to demonstrate that these predictions achieve an accuracy comparable with that of experimental methods. We make this method publicly available at https://www-cohsoftware.ch.cam.ac.uk/index.php/camsolph, as version 3.0 of CamSol.

5 citations


Journal ArticleDOI
TL;DR: In this article, a new prediction approach, termed SSMF-BLNP, based on organically combining selective similarity matrix fusion (SSMF) and bidirectional linear neighborhood label propagation (BLNP), is proposed to predict lncRNA-disease associations.
Abstract: Recent studies have revealed that long noncoding RNAs (lncRNAs) are closely linked to several human diseases, providing new opportunities for their use in detection and therapy. Many graph propagation and similarity fusion approaches can be used for predicting potential lncRNA-disease associations. However, existing similarity fusion approaches suffer from noise and self-similarity loss in the fusion process. To address these problems, a new prediction approach, termed SSMF-BLNP, based on organically combining selective similarity matrix fusion (SSMF) and bidirectional linear neighborhood label propagation (BLNP), is proposed in this paper to predict lncRNA-disease associations. In SSMF, self-similarity networks of lncRNAs and diseases are obtained by selective preprocessing and nonlinear iterative fusion. The fusion process assigns weights to each initial similarity network and introduces a unit matrix that can reduce noise and compensate for the loss of self-similarity. In BLNP, the initial lncRNA-disease associations are employed in both lncRNA and disease directions as label information for linear neighborhood label propagation. The propagation is then performed on the self-similarity network obtained from SSMF to derive the scoring matrix for predicting the relationships between lncRNAs and diseases. Experimental results showed that SSMF-BLNP performed better than seven other state-of-the-art approaches. Furthermore, a case study demonstrated up to 100% and 80% accuracy in 10 lncRNAs associated with hepatocellular carcinoma and 10 lncRNAs associated with renal cell carcinoma, respectively. The source code and datasets used in this paper are available at: https://github.com/RuiBingo/SSMF-BLNP.
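
The label-propagation half of such a pipeline can be sketched in a few lines. This is a generic linear neighborhood label propagation update, not the authors' exact BLNP formulation; the similarity matrix `S`, seed label matrix `Y` and damping factor `alpha` are illustrative assumptions.

```python
import numpy as np

def label_propagation(S, Y, alpha=0.8, n_iter=50):
    """Iterative label propagation on a similarity network.

    Repeats F <- alpha * S_norm @ F + (1 - alpha) * Y, so scores diffuse
    along similarity edges while staying anchored to the seed labels Y.
    """
    S_norm = S / S.sum(axis=1, keepdims=True)  # row-normalize neighborhood weights
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * S_norm @ F + (1 - alpha) * Y
    return F
```

Entities strongly connected to labeled seeds end up with higher association scores than weakly connected ones, which is the core intuition behind scoring unobserved lncRNA-disease pairs.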

5 citations


Journal ArticleDOI
TL;DR: In this article, the authors rigorously benchmarked nine state-of-the-art conformational B-cell epitope prediction webservers, including generic and antibody-specific methods, on a dataset of over 250 antibody-antigen structures.
Abstract: Accurate in silico prediction of conformational B-cell epitopes would lead to major improvements in disease diagnostics, drug design and vaccine development. A variety of computational methods, mainly based on machine learning approaches, have been developed in recent decades to tackle this challenging problem. Here, we rigorously benchmarked nine state-of-the-art conformational B-cell epitope prediction webservers, including generic and antibody-specific methods, on a dataset of over 250 antibody-antigen structures. The results of our assessment and statistical analyses show that all the methods achieve very low performance, and some do not perform better than randomly generated patches of surface residues. In addition, we also found that commonly used consensus strategies that combine the results from multiple webservers are at best only marginally better than random. Finally, we applied all the predictors to the SARS-CoV-2 spike protein as an independent case study, and showed that they perform poorly in general, which largely recapitulates our benchmarking conclusions. We hope that these results will lead to greater caution when using these tools until the biases and issues that limit current methods have been addressed, promote the use of state-of-the-art evaluation methodologies in future publications and suggest new strategies to improve the performance of conformational B-cell epitope prediction methods.

4 citations


Journal ArticleDOI
TL;DR: Zhang et al. proposed a comprehensive end-to-end single-cell multimodal analysis framework named Deep Parametric Inference (DPI), which can transform single-cell data into a multimodal parameter space by inferring individual modal parameters.
Abstract: The proliferation of single-cell multimodal sequencing technologies has enabled us to understand cellular heterogeneity with multiple views, providing novel and actionable biological insights into the disease-driving mechanisms. Here, we propose a comprehensive end-to-end single-cell multimodal analysis framework named Deep Parametric Inference (DPI). DPI transforms single-cell multimodal data into a multimodal parameter space by inferring individual modal parameters. Analysis of cord blood mononuclear cells (CBMC) reveals that the multimodal parameter space can characterize the heterogeneity of cells more comprehensively than individual modalities. Furthermore, comparisons with the state-of-the-art methods on multiple datasets show that DPI has superior performance. Additionally, DPI can reference and query cell types without batch effects. As a result, DPI can successfully analyze the progression of COVID-19 disease in peripheral blood mononuclear cells (PBMC). Notably, we further propose a cell state vector field and analyze the transformation pattern of bone marrow cell (BMC) states. In conclusion, DPI is a powerful single-cell multimodal analysis framework that can provide new biological insights for biomedical researchers. The Python packages, datasets and user-friendly manuals of DPI are freely available at https://github.com/studentiz/dpi.

4 citations


Journal ArticleDOI
TL;DR: Zhang et al. proposed a network structure refinement method for gene regulatory networks (NSRGRN) that effectively combines topological properties and edge importance measures during GRN inference.
Abstract: The elucidation of gene regulatory networks (GRNs) is one of the central challenges of systems biology, which is crucial for understanding pathogenesis and curing diseases. Various computational methods have been developed for GRN inference, but identifying redundant regulation remains a fundamental problem. Although considering topological properties and edge importance measures simultaneously can identify and reduce redundant regulations, how to address their respective weaknesses whilst leveraging their strengths is a critical problem faced by researchers. Here, we propose a network structure refinement method for GRN (NSRGRN) that effectively combines the topological properties and edge importance measures during GRN inference. NSRGRN has two major parts. The first part constructs a preliminary ranking list of gene regulations to avoid starting the GRN inference from a directed complete graph. The second part develops a novel network structure refinement (NSR) algorithm to refine the network structure from local and global topology perspectives. Specifically, the Conditional Mutual Information with Directionality and network motifs are applied to optimise the local topology, and the lower and upper networks are used to balance the bilateral relationship between the local topology's optimisation and the global topology's maintenance. NSRGRN is compared with six state-of-the-art methods on three datasets (26 networks in total), and it shows the best all-round performance. Furthermore, when acting as a post-processing step, the NSR algorithm can improve the results of other methods in most datasets.
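
As a toy illustration of building a preliminary ranking list of candidate regulations before refinement, the sketch below scores gene pairs by absolute Pearson correlation. This is a deliberately simple stand-in for the conditional mutual information measures NSRGRN actually uses; the `expr` matrix (genes x samples) and `top_k` cutoff are invented inputs.

```python
import numpy as np

def rank_edges_by_correlation(expr, top_k):
    """Rank candidate gene regulations by absolute Pearson correlation.

    expr: array of shape (n_genes, n_samples).
    Returns the top_k directed (i, j) pairs, avoiding starting inference
    from a directed complete graph.
    """
    n_genes = expr.shape[0]
    C = np.corrcoef(expr)  # gene-by-gene correlation matrix
    edges = [(abs(C[i, j]), i, j)
             for i in range(n_genes) for j in range(n_genes) if i != j]
    edges.sort(reverse=True)  # strongest associations first
    return [(i, j) for _, i, j in edges[:top_k]]
```

A real GRN method would then prune this list further with directionality and motif constraints, as the abstract describes for the NSR step.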

4 citations


Journal ArticleDOI
TL;DR: In this article, the authors introduce new multivariate and non-parametric batch effect correction methods based on Partial Least Squares Discriminant Analysis (PLSDA) for microbiome data.
Abstract: Microbial communities are highly dynamic and sensitive to changes in the environment. Thus, microbiome data are highly susceptible to batch effects, defined as sources of unwanted variation that are not related to and obscure any factors of interest. Existing batch effect correction methods have been primarily developed for gene expression data. As such, they do not consider the inherent characteristics of microbiome data, including zero inflation, overdispersion and correlation between variables. We introduce new multivariate and non-parametric batch effect correction methods based on Partial Least Squares Discriminant Analysis (PLSDA). PLSDA-batch first estimates treatment and batch variation with latent components, then subtracts batch-associated components from the data. The resulting batch-effect-corrected data can then be input in any downstream statistical analysis. Two variants are proposed to handle unbalanced batch × treatment designs and to avoid overfitting when estimating the components via variable selection. We compare our approaches with popular methods managing batch effects, namely, removeBatchEffect, ComBat and Surrogate Variable Analysis, in simulations and three case studies using various visual and numerical assessments. We show that our three methods lead to competitive performance in removing batch variation while preserving treatment variation, especially for unbalanced batch × treatment designs. Our downstream analyses show selections of biologically relevant taxa. This work demonstrates that batch effect correction methods can improve microbiome research outputs. Reproducible code and vignettes are available on GitHub.
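
The core "subtract batch-associated variation" idea can be illustrated with a one-component sketch. This is not the PLSDA-batch algorithm itself (which estimates latent components via partial least squares); here the batch direction is simply the difference of the two batch means, and `X` and `batch` are assumed toy inputs.

```python
import numpy as np

def remove_batch_component(X, batch):
    """Project out a single batch-associated direction from a data matrix.

    Assumes exactly two batches; the batch direction is taken as the
    difference of batch means (a simplified stand-in for PLSDA-batch's
    latent components), then deflated from the centred data.
    """
    X = X - X.mean(axis=0)                         # column-centre the data
    b0, b1 = np.unique(batch)
    d = X[batch == b0].mean(axis=0) - X[batch == b1].mean(axis=0)
    d = d / np.linalg.norm(d)                      # unit batch direction
    return X - np.outer(X @ d, d)                  # subtract the batch component
```

After deflation the two batch means coincide, while variation orthogonal to the batch direction (e.g. treatment structure, in the ideal orthogonal case) is untouched.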

4 citations


Journal ArticleDOI
TL;DR: Wang et al. developed eccDNA Atlas (http://lcbb.swjtu.edu.cn/eccDNAatlas), a user-friendly database of eccDNAs that aims to provide a high-quality and integrated resource for browsing, searching and analyzing eccDNAs from multiple species.
Abstract: Extrachromosomal circular DNA (eccDNA) represents a large category of non-mitochondrial and non-plasmid circular extrachromosomal DNA, playing an indispensable role in various aspects such as tumorigenesis and immune responses. However, information on the characteristics and functions of eccDNA is fragmented, hidden behind an abundant literature and massive whole-genome sequencing (WGS) data, which have not been sufficiently used for the identification of eccDNAs. Therefore, establishing an integrated repository portal is essential for identifying and analyzing eccDNAs. Here, we developed eccDNA Atlas (http://lcbb.swjtu.edu.cn/eccDNAatlas), a user-friendly database of eccDNAs that aims to provide a high-quality and integrated resource for browsing, searching and analyzing eccDNAs from multiple species. eccDNA Atlas currently contains 629 987 eccDNAs and 8221 ecDNAs manually curated from the literature and 1105 ecDNAs predicted by AmpliconArchitect based on WGS data, involving 66 diseases, 57 tissues and 319 cell lines. The content of each eccDNA entry covers multiple aspects such as sequence, disease, function, characteristics and validation strategies. Furthermore, abundant annotations and analysis utilities are provided to explore existing eccDNAs in eccDNA Atlas or user-defined eccDNAs, including oncogenes, typical enhancers, super enhancers, CTCF-binding sites, SNPs, chromatin accessibility, eQTLs, gene expression, survival and genome visualization. Overall, eccDNA Atlas provides an integrated eccDNA data warehouse and serves as an important tool for future research.

3 citations


Journal ArticleDOI
TL;DR: A comprehensive assessment of ML approaches in network pharmacology can be found in this paper, where the authors provide a broad overview of the current state of the art of ML in drug discovery.
Abstract: Network pharmacology is an emerging area of systematic drug research that attempts to understand drug actions and interactions with multiple targets. Network pharmacology has changed the paradigm from 'one-target, one-drug' to highly potent 'multi-target drugs'. Nevertheless, this synergistic approach currently faces many challenges, particularly in mining effective information such as drug targets, mechanisms of action, and drug-organism interactions from massive, heterogeneous data. To overcome bottlenecks in multi-target drug discovery, computational algorithms are highly welcomed by the scientific community. Machine learning (ML) and especially its subfield deep learning (DL) have seen impressive advances. Techniques developed within these fields are now able to analyze and learn from huge amounts of data in disparate formats. In terms of network pharmacology, ML can improve discovery and decision making from big data. Opportunities to apply ML occur in all stages of network pharmacology research. Examples include the screening of biologically active small molecules, target identification, metabolic pathway identification, protein-protein interaction network analysis, hub gene analysis and finding binding affinities between compounds and target proteins. This review summarizes the premier algorithmic concepts of ML in network pharmacology and forecasts future opportunities, potential applications as well as several remaining challenges of implementing ML in network pharmacology. To our knowledge, this study provides the first comprehensive assessment of ML approaches in network pharmacology, and we hope that it encourages additional efforts toward the development and acceptance of network pharmacology in the pharmaceutical industry.

3 citations


Journal ArticleDOI
Yamao Chen, Sheng Zhou, Ming Li, Fangqing Zhao, Ji Qi 
TL;DR: An unsupervised and manifold-learning-based algorithm, STEEL, is proposed, which identifies different cell types from spatial transcriptomes by clustering cells/beads that exhibit both highly similar gene expression profiles and close spatial distance in the manner of graphs.
Abstract: Advances in spatial transcriptomics extend the use of single-cell technologies to unveil the expression landscape of tissues with valuable spatial context. Here, we propose an unsupervised and manifold-learning-based algorithm, Spatial Transcriptome based cEll typE cLustering (STEEL), which identifies domains from spatial transcriptomes by clustering beads that exhibit both highly similar gene expression profiles and close spatial distance in the manner of graphs. Comprehensive evaluation of STEEL on spatial transcriptomic datasets from the 10X Visium platform demonstrates that it not only achieves a high resolution to characterize fine structures of the mouse brain but also enables the integration of multiple individually analyzed tissue slides into a larger one. STEEL outperforms previous methods in effectively distinguishing different cell types/domains of various tissues on Slide-seq datasets, which feature higher bead density but lower transcript detection efficiency. Application of STEEL to spatial transcriptomes of early-stage mouse embryos (E9.5-E12.5) successfully delineates a progressive development landscape of tissues from the ectoderm, mesoderm and endoderm layers, and further profiles dynamic changes in cell differentiation in the heart and other organs. With the advancement of spatial transcriptome technologies, our method will have great applicability in domain identification and gene expression atlas reconstruction.
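
The "cluster beads that are both expression-similar and spatially close" idea can be sketched as graph construction plus connected components. This is a simplified illustration, not STEEL's actual manifold-learning algorithm; the thresholds `d_max` and `corr_min` are invented for the example.

```python
import numpy as np

def spatial_expression_domains(coords, expr, d_max, corr_min):
    """Assign beads to domains via a joint spatial/expression graph.

    Two beads are connected only if they lie within d_max of each other
    AND their expression profiles correlate above corr_min; domains are
    the connected components of the resulting graph.
    """
    n = len(coords)
    C = np.corrcoef(expr)                       # bead-by-bead expression similarity
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            close = np.linalg.norm(coords[i] - coords[j]) <= d_max
            similar = C[i, j] >= corr_min
            if close and similar:
                adj[i].append(j)
                adj[j].append(i)
    labels, comp = [-1] * n, 0                  # connected components by DFS
    for s in range(n):
        if labels[s] != -1:
            continue
        stack, labels[s] = [s], comp
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = comp
                    stack.append(v)
        comp += 1
    return labels
```

Requiring both conditions is what keeps spatially adjacent but transcriptionally distinct beads in separate domains.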

3 citations


Journal ArticleDOI
TL;DR: In this article, a data-driven analysis pipeline covering species inference, genome-specific data preprocessing and regression modeling was developed for DNA methylation array analysis for undesigned genomes.
Abstract: The arrival of the Infinium DNA methylation BeadChips for mice and other nonhuman mammalian species has outpaced the development of the informatics that supports their use for epigenetics study in model organisms. Here, we present informatics infrastructure and methods to allow easy DNA methylation analysis on multiple species, including domesticated animals and inbred laboratory mice (in SeSAMe version 1.16.0+). First, we developed a data-driven analysis pipeline covering species inference, genome-specific data preprocessing and regression modeling. We targeted genomes of 310 species and 37 inbred mouse strains and showed that genome-specific preprocessing prevents artifacts and yields more accurate measurements than generic pipelines. Second, we uncovered the dynamics of the epigenome evolution in different genomic territories and tissue types through comparative analysis. We identified a catalog of inbred mouse strain-specific methylation differences, some of which are linked to the strains' immune, metabolic and neurological phenotypes. By streamlining DNA methylation array analysis for undesigned genomes, our methods extend epigenome research to broad species contexts.

Journal ArticleDOI
TL;DR: In this paper, the authors present a graph-based model of the brain connectome, based on the representation of regions of interest as nodes and of functional or anatomical connections as edges.
Abstract: Networks are present in many aspects of our lives (communication networks, the World Wide Web, social networks) and can be used to conveniently describe biological and clinical data, such as the interactions of proteins in an organism or the connections of neurons in the brain. Therefore, network science, focusing on the network representations of physical, biological and social phenomena and leading to predictive models of these phenomena, currently represents a vast field of application and research for many scientific and social disciplines. The mathematical background for the study and analysis of networks has its roots in the theory of graphs, which allows studying real phenomena in a quantitative way. According to the formalism coming from graph theory, nodes of the graph represent entities, whereas edges represent the associations among them. Currently, in bioinformatics and systems biology, there is a growing interest in analyzing associations among biological molecules at a network level. Since the study of associations at a system-level scale has shown great potential, the use of networks has become the de facto standard for representing such associations, and its application fields span from molecular biology to brain connectome analysis [1]. Molecules of different types, e.g. genes, proteins, ribonucleic acids and metabolites, have fundamental roles in the mechanisms of cellular processes. The study of their structure and interactions is crucial for different reasons, including the development of new drugs and the discovery of disease pathways. Thus, modeling the complete set of interactions and associations among biological molecules as a graph is convenient for a variety of reasons. Networks provide a simple and intuitive representation of heterogeneous and complex biological processes.
Moreover, they facilitate modeling and understanding of complicated molecular mechanisms combining graph theory, machine learning and deep learning techniques. While proteomics and genomics data, represented as data streams or data tables, are mainly used to screen large populations in case–control studies (e.g. for early detection of diseases), interactomics data are represented as graphs and add a new dimension of analysis, allowing, for instance, the graph-based comparison of organisms' properties. In general, complex biological systems represented as networks provide an integrated way to look into the dynamic behavior of the cellular system through the interactions of components. For instance, biological networks, also referred to as Protein–Protein Interaction Networks, model biochemical interactions among proteins: nodes represent the proteins from a given organism, and edges represent the protein–protein interactions [2]. Similarly, a gene regulatory network (GRN) is a collection of genes in a cell that interact with each other and with other substances in the cell, such as proteins or metabolites, thereby governing the rates at which genes in the network are transcribed into mRNA. Likewise, the graph-based modeling of the whole system of brain elements and their relations, the so-called brain connectome, is based on the representation of regions of interest as nodes and of functional or anatomical connections as edges [3]. Furthermore, recent discoveries in biology have elucidated that the interplay of molecules of different types (e.g. genes, proteins and ribonucleic acids) is a constitutive block of mechanisms inside cells. Consequently, models describing this interplay should be able to consider the presence of multiple different agents and associations, i.e. multiple different types of nodes and edges, yielding so-called heterogeneous networks [4].
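
The graph formalism described above is easy to make concrete. The sketch below builds an undirected interaction network as an adjacency dictionary and flags high-degree nodes as putative hubs; the protein names and degree threshold are purely illustrative.

```python
from collections import defaultdict

def build_network(edges):
    """Build an undirected interaction network as an adjacency dictionary."""
    adj = defaultdict(set)
    for u, v in edges:      # each edge is a symmetric interaction
        adj[u].add(v)
        adj[v].add(u)
    return adj

def hubs(adj, min_degree):
    """Return nodes whose degree meets the threshold (putative hub genes/proteins)."""
    return sorted(n for n, nbrs in adj.items() if len(nbrs) >= min_degree)
```

The same dictionary-of-sets structure extends naturally to heterogeneous networks by tagging each node with a type (gene, protein, RNA) alongside its name.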

Journal ArticleDOI
TL;DR: Wang et al. developed the largest-to-date online TCM active-ingredient-based pharmacotranscriptomic platform, integrated traditional Chinese medicine (ITCM), for the effective screening of active ingredients.
Abstract: With the emergence of high-throughput technologies, computational screening based on gene expression profiles has become one of the most effective methods for drug discovery. More importantly, profile-based approaches remarkably enhance novel drug-disease pair discovery without relying on drug- or disease-specific prior knowledge, which has been widely used in modern medicine. However, profile-based systematic screening of active ingredients of traditional Chinese medicine (TCM) has been scarcely performed due to inadequate pharmacotranscriptomic data. Here, we develop the largest-to-date online TCM active-ingredient-based pharmacotranscriptomic platform, integrated traditional Chinese medicine (ITCM), for the effective screening of active ingredients. First, we performed unified high-throughput experiments and constructed the largest data repository of 496 representative active ingredients, which is five times larger than the previous one built by our team. Transcriptome-based multi-scale analysis was also performed to elucidate their mechanisms. Then, we developed six state-of-the-art signature search methods to screen active ingredients and determined the optimal signature size for all methods. Moreover, we integrated them into a screening strategy, TCM-Query, to identify potential active ingredients for a specific disease. In addition, we comprehensively collected TCM-related resources by literature mining. Finally, we applied ITCM to the active ingredient bavachinin and two diseases, prostate cancer and COVID-19, to demonstrate its power for drug discovery. ITCM aims to comprehensively explore the active ingredients of TCM and boost studies of pharmacological action and drug discovery. ITCM is available at http://itcm.biotcm.net.
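
One simple member of the signature-search family described here is a rank-correlation match between a compound's expression profile and a disease signature, where a strongly negative score suggests the compound reverses the disease signature. This is an illustrative Spearman-style score, not ITCM's TCM-Query strategy; the input profiles are assumed toy vectors over the same genes.

```python
import numpy as np

def signature_score(drug_profile, disease_signature):
    """Spearman-style rank correlation between two expression signatures.

    A score near -1 means the drug profile is anti-correlated with (i.e.
    tends to reverse) the disease signature over the shared genes.
    """
    dr = np.argsort(np.argsort(drug_profile)).astype(float)      # ranks
    di = np.argsort(np.argsort(disease_signature)).astype(float)
    dr -= dr.mean()
    di -= di.mean()
    return float(dr @ di / np.sqrt((dr @ dr) * (di @ di)))
```

In a screening setting, compounds would be ranked by this score against a disease signature and the most negative hits carried forward as candidates.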

Journal ArticleDOI
TL;DR: In this paper, the impact of structural dynamic information on binding affinity prediction was evaluated by comparing models trained on different dimensional descriptors, using three targets (i.e. JAK1, TAF1-BD2 and DDR1) and their corresponding ligands as examples.
Abstract: Binding affinity prediction largely determines the discovery efficiency of lead compounds in drug discovery. Recently, machine learning (ML)-based approaches have attracted much attention in hopes of enhancing the predictive performance of traditional physics-based approaches. In this study, we evaluated the impact of structural dynamic information on binding affinity prediction by comparing models trained on descriptors of different dimensionality, using three targets (i.e. JAK1, TAF1-BD2 and DDR1) and their corresponding ligands as examples. Here, 2D descriptors are traditional ECFP4 fingerprints, 3D descriptors are the energy terms of the Smina and NNscore scoring functions and 4D descriptors contain the structural dynamic information derived from trajectories based on molecular dynamics (MD) simulations. We systematically investigated the MD-refined binding affinity prediction performance of three classical ML algorithms (i.e. RF, SVR and XGB) as well as two common virtual screening methods, namely Glide docking and MM/PBSA. The outcomes of the ML models built using various dimensional descriptors and their combinations reveal that MD refinement with the optimized protocol can improve the predictive performance on the TAF1-BD2 target, which has considerable structural flexibility, but not on the less flexible JAK1 and DDR1 targets, when taking docking poses rather than crystal structures as the initial structures. The results highlight the importance of the initial structures to the final performance of the model through conformational analysis of the three targets with different flexibility.
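
A descriptor-based affinity model of the kind compared here can be sketched with a deliberately lightweight learner. The k-nearest-neighbor regressor below is a stand-in for the RF/SVR/XGB models in the study, and the descriptor vectors and affinities are invented toy data.

```python
import numpy as np

def knn_affinity(train_X, train_y, query, k=3):
    """Predict binding affinity for a query descriptor vector.

    Returns the mean affinity of the k training compounds nearest to the
    query in descriptor space (Euclidean distance).
    """
    dist = np.linalg.norm(train_X - query, axis=1)  # distance to every compound
    nearest = np.argsort(dist)[:k]                  # indices of k nearest
    return train_y[nearest].mean()
```

Swapping ECFP4 bits, scoring-function energy terms, or MD-derived features in as `train_X` is exactly the kind of descriptor comparison the abstract describes, with the learner held fixed.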

Journal ArticleDOI
TL;DR: In this paper, the authors present a complete analysis from raw reads to DE and functional enrichment analysis, visually illustrating how results are not absolute truths and how algorithmic decisions can greatly impact results and interpretation.
Abstract: DNA and RNA sequencing technologies have revolutionized biology and biomedical sciences, sequencing full genomes and transcriptomes at very high speeds and reasonably low costs. RNA sequencing (RNA-Seq) enables transcript identification and quantification, but once sequencing has concluded, researchers can easily be overwhelmed with questions such as how to go from raw data to differential expression (DE), pathway analysis and interpretation. Several pipelines and procedures have been developed to this effect. Even though there is no unique way to perform RNA-Seq analysis, it usually follows these steps: 1) raw read quality check, 2) alignment of reads to a reference genome, 3) summarization of aligned reads according to an annotation file, 4) DE analysis and 5) gene set analysis and/or functional enrichment analysis. Each step requires researchers to make decisions, and the wide variety of options and resulting large volumes of data often lead to interpretation challenges. There also seems to be insufficient guidance on how best to obtain relevant information and derive actionable knowledge from transcription experiments. In this paper, we explain the RNA-Seq steps in detail and outline differences and similarities of different popular options, as well as advantages and disadvantages. We also discuss non-coding RNA analysis, multi-omics, meta-transcriptomics and the use of artificial intelligence methods complementing the arsenal of tools available to researchers. Lastly, we perform a complete analysis from raw reads to DE and functional enrichment analysis, visually illustrating how results are not absolute truths and how algorithmic decisions can greatly impact results and interpretation.
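
The transition from step 3 (a count matrix) to step 4 (DE) can be sketched with the simplest possible arithmetic: counts-per-million normalization followed by a per-gene log2 fold change on a toy gene x sample matrix. Real pipelines would use dedicated tools such as DESeq2 or edgeR with proper dispersion modeling rather than this bare calculation.

```python
import numpy as np

def cpm(counts):
    """Counts-per-million: scale each sample (column) by its library size."""
    return counts / counts.sum(axis=0, keepdims=True) * 1e6

def log2_fold_change(counts, group, pseudo=1.0):
    """Per-gene log2 fold change of group-1 mean CPM over group-0 mean CPM.

    counts: genes x samples matrix; group: 0/1 label per sample;
    pseudo: pseudocount to avoid log of zero.
    """
    norm = cpm(counts)
    m0 = norm[:, group == 0].mean(axis=1)
    m1 = norm[:, group == 1].mean(axis=1)
    return np.log2((m1 + pseudo) / (m0 + pseudo))
```

Note that library-size normalization matters even in this toy: a gene with identical raw counts across samples still shows a negative fold change if the group-1 libraries are larger.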

Journal ArticleDOI
TL;DR: In this paper, the authors systematically evaluated nine ligand-based target fishing methods based on target and ligand-target pair statistical strategies, which will help practitioners choose among multiple TF methods.
Abstract: Identification of potential targets for known bioactive compounds and novel synthetic analogs is of considerable significance. In silico target fishing (TF) has become an alternative strategy given the expense and labor of wet-lab experiments, the explosive growth of bioactivity data and the rapid development of high-throughput technologies. However, these TF methods are based on different algorithms, molecular representations and training datasets, which may lead to different results when predicting the same query molecules. This can be confusing for practitioners in practical applications. Therefore, this study systematically evaluated nine popular ligand-based TF methods based on target and ligand-target pair statistical strategies, which will help practitioners make choices among multiple TF methods. The evaluation results showed that SwissTargetPrediction was the best method for producing the most reliable predictions while enriching more targets. The high-recall similarity ensemble approach (SEA) was able to find real targets for more compounds than other TF methods. Therefore, SwissTargetPrediction and SEA can be considered primary selection methods in future studies. In addition, the results showed that k = 5 was the optimal number of experimental candidate targets. Finally, a novel ensemble TF method based on consensus voting is proposed to improve the prediction performance. The precision of the ensemble TF method outperforms that of the individual TF methods, indicating that the ensemble TF method can more effectively identify real targets within a given top-k threshold. The results of this study can be used as a reference to guide practitioners in selecting the most effective methods in computational drug discovery.
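
The consensus-voting ensemble described in the closing sentences can be sketched directly; the per-method ranked target lists and the vote threshold below are invented examples, not the paper's actual data.

```python
from collections import Counter

def consensus_targets(predictions, top_k, min_votes):
    """Majority-vote ensemble over per-method ranked target lists.

    predictions: list of ranked target lists, one per TF method.
    Keeps targets that appear in the top_k of at least min_votes methods.
    """
    votes = Counter(t for ranked in predictions for t in ranked[:top_k])
    return sorted(t for t, v in votes.items() if v >= min_votes)
```

Raising `min_votes` trades recall for precision, which mirrors the paper's finding that consensus predictions within a top-k threshold are more precise than any single method.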

Journal ArticleDOI
TL;DR: DeepTracer-2.0 is an artificial-intelligence-based pipeline that can build amino acid and nucleic acid backbones from a single cryo-EM map and even predict the best-fitting residues according to the density of side chains.
Abstract: Cryo-electron microscopy (cryo-EM) allows a macromolecular structure such as a protein-DNA/RNA complex to be reconstructed as a three-dimensional coulomb potential map. The structural information of these macromolecular complexes forms the foundation for understanding molecular mechanisms, including those of many human diseases. However, the model building of large macromolecular complexes is often difficult and time-consuming. We recently developed DeepTracer-2.0, an artificial-intelligence-based pipeline that can build amino acid and nucleic acid backbones from a single cryo-EM map and even predict the best-fitting residues according to the density of side chains. The experiments showed improved accuracy and efficiency when benchmarking the performance on independent experimental maps of protein-DNA/RNA complexes and demonstrated the promising future of macromolecular modeling from cryo-EM maps. Our method and pipeline could benefit researchers worldwide who work in molecular biomedicine and drug discovery, and substantially increase the throughput of cryo-EM model building. The pipeline has been integrated into the web portal https://deeptracer.uw.edu/.

Journal ArticleDOI
TL;DR: In this article, effective small molecule clustering in the positive dataset, together with a putative negative dataset generation strategy, was adopted in the process of model construction, and the proposed strategy turned out to reduce the false discovery rate successfully.
Abstract: Owing to its promise for improving drug efficacy, polypharmacology has emerged as a new theme in drug discovery for complex diseases. In the discovery of novel multi-target drugs (MTDs), in silico strategies are essential for their high throughput and low cost. However, current research mostly targets typical, closely related target pairs. Because of the intricate pathogenesis networks of complex diseases, many distantly related targets are found to play a crucial role in synergistic treatment. Therefore, an innovative method to develop drugs that simultaneously engage distantly related target pairs is of utmost importance. At the same time, reducing the false discovery rate in the design of MTDs remains a daunting technical difficulty. In this research, effective small-molecule clustering in the positive dataset, together with a putative negative dataset generation strategy, was adopted during model construction. Comprehensive assessment on 10 target pairs with hierarchical similarity levels showed that the proposed strategy successfully reduced the false discovery rate. Models built with much smaller numbers of inhibitor molecules achieved considerable yields and better false-hit controllability than before. To further evaluate the generalization ability, an in-depth assessment of high-throughput virtual screening on the ChEMBL database was conducted. As a result, this novel strategy hierarchically improved the enrichment factors for each target pair (especially for distantly related/unrelated target pairs), in line with target pair similarity levels.
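The enrichment factor used to assess the virtual screen above has a standard definition: the active rate in the top-ranked fraction of the library divided by the active rate in the whole library. A minimal sketch with synthetic labels (1 = active, ranked best-first; the numbers are invented for illustration):

```python
def enrichment_factor(ranked_labels, fraction):
    """EF at a given fraction of a ranked screening list:
    (actives in the top fraction / size of that fraction)
    divided by (total actives / library size)."""
    n = len(ranked_labels)
    n_sel = max(1, int(n * fraction))
    hits_sel = sum(ranked_labels[:n_sel])
    hits_total = sum(ranked_labels)
    return (hits_sel / n_sel) / (hits_total / n)

# 100 compounds, 10 actives; the screen puts 5 actives in the top 10
ranked = [1] * 5 + [0] * 5 + [1] * 5 + [0] * 85
print(enrichment_factor(ranked, 0.10))  # → 5.0
```

An EF of 1.0 means the ranking is no better than random selection; the paper's strategy aims to push EF well above 1 even for distantly related target pairs.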

Journal ArticleDOI
TL;DR: TEINet as discussed by the authors employs two separately pretrained encoders to transform TCR and epitope sequences into numerical vectors, which are subsequently fed into a fully connected neural network to predict their binding specificities.
Abstract: The adaptive immune response to foreign antigens is initiated by T-cell receptor (TCR) recognition of the antigens. Recent experimental advances have enabled the generation of a large amount of TCR data and their cognate antigenic targets, allowing machine learning models to predict the binding specificity of TCRs. In this work, we present TEINet, a deep learning framework that utilizes transfer learning to address this prediction problem. TEINet employs two separately pretrained encoders to transform TCR and epitope sequences into numerical vectors, which are subsequently fed into a fully connected neural network to predict their binding specificities. A major challenge for binding specificity prediction is the lack of a unified approach to sampling negative data. Here, we first assess the current negative sampling approaches comprehensively and suggest that the Unified Epitope approach is the most suitable one. Subsequently, we compare TEINet with three baseline methods and observe that TEINet achieves an average AUROC of 0.760, outperforming the baselines by 6.4-26%. Furthermore, we investigate the impact of the pretraining step and notice that excessive pretraining may lower its transferability to the final prediction task. Our results and analysis show that TEINet can make an accurate prediction using only the TCR sequence (CDR3β) and the epitope sequence, providing novel insights into the interactions between TCRs and epitopes.
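The AUROC reported above has a direct rank-based interpretation (the Mann-Whitney statistic): it is the probability that a randomly chosen binding pair is scored above a randomly chosen non-binding pair. A minimal pure-Python sketch, with invented scores and labels:

```python
def auroc(scores, labels):
    """AUROC via the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs where the positive outscores the
    negative; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auroc([0.9, 0.3, 0.6, 0.2], [1, 1, 0, 0]))  # → 0.75
```

Because AUROC depends only on the ranking of positives versus negatives, the choice of negative sampling scheme (the paper's central concern) directly shapes what this number measures.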

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a new deep contrastive clustering algorithm called scDCCA, which integrates a denoising auto-encoder and a dual contrastive learning module into a deep clustering framework.
Abstract: The advances in single-cell ribonucleic acid sequencing (scRNA-seq) allow researchers to explore cellular heterogeneity and human diseases at cell resolution. Cell clustering is a prerequisite in scRNA-seq analysis since it can recognize cell identities. However, the high dimensionality, noise and significant sparsity of scRNA-seq data make it a major challenge. Although many methods have emerged, they still fail to fully explore the intrinsic properties of cells and the relationships among cells, which seriously affects downstream clustering performance. Here, we propose a new deep contrastive clustering algorithm called scDCCA. It integrates a denoising auto-encoder and a dual contrastive learning module into a deep clustering framework to extract valuable features and realize cell clustering. Specifically, to characterize and learn data representations robustly, scDCCA utilizes a denoising Zero-Inflated Negative Binomial model-based auto-encoder to extract low-dimensional features. Meanwhile, scDCCA incorporates a dual contrastive learning module to capture the pairwise proximity of cells. By increasing the similarities between positive pairs and the differences between negative ones, the contrasts at both the instance and the cluster level help the model learn more discriminative features and achieve better cell segregation. Furthermore, scDCCA joins feature learning with clustering, which realizes representation learning and cell clustering in an end-to-end manner. Experimental results on 14 real datasets validate that scDCCA outperforms eight state-of-the-art methods in terms of accuracy, generalizability, scalability and efficiency. Cell visualization and biological analysis demonstrate that scDCCA significantly improves clustering and facilitates downstream analysis for scRNA-seq data. The code is available at https://github.com/WJ319/scDCCA.
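The contrastive objective described above — pulling positive pairs together and pushing negative pairs apart — is typically an InfoNCE-style loss over similarity scores. A minimal numeric sketch (cosine similarity, one positive, a list of negatives; the vectors and temperature are invented, and this is not scDCCA's exact formulation):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, tau=0.5):
    """-log( exp(sim(a,p)/tau) / sum over {positive} + negatives ),
    computed with a max-shift for numerical stability."""
    logits = ([cosine(anchor, positive) / tau]
              + [cosine(anchor, n) / tau for n in negatives])
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]

good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])  # positive aligned
bad = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])   # negative aligned
```

The loss is small when the anchor matches its positive and large when it matches a negative instead; scDCCA applies this contrast at both the instance and the cluster level.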

Journal ArticleDOI
TL;DR: Dockey as mentioned in this paper is a flexible and intuitive graphical interface tool with seamless integration of several useful tools, which implements a complete docking pipeline covering molecular sanitization, molecular preparation, paralleled docking execution, interaction detection and conformation visualization.
Abstract: Molecular docking is a structure-based, computer-aided drug design approach that plays a pivotal role in drug discovery and pharmaceutical research. AutoDock is the most widely used molecular docking tool for the study of protein-ligand interactions and virtual screening. Although many tools have been developed to streamline and automate the AutoDock docking pipeline, some of them still use outdated graphical user interfaces and have not been updated for a long time. Meanwhile, some lack cross-platform compatibility and evaluation metrics for screening lead compound candidates. To overcome these limitations, we have developed Dockey, a flexible and intuitive graphical interface tool with seamless integration of several useful tools, which implements a complete docking pipeline covering molecular sanitization, molecular preparation, parallel docking execution, interaction detection and conformation visualization. Specifically, Dockey can detect the non-covalent interactions between small molecules and proteins and perform cross-docking between multiple receptors and ligands. It has the capacity to automatically dock thousands of ligands to multiple receptors and analyze the corresponding docking results in parallel. All generated data are kept in a project file that can be shared between any systems and computers with Dockey pre-installed. We anticipate that these unique characteristics will make it attractive for researchers to conduct large-scale molecular docking without complicated operations, particularly for beginners. Dockey is implemented in Python and freely available at https://github.com/lmdu/dockey.

Journal ArticleDOI
TL;DR: UniDL4BioPep as discussed by the authors is a universal deep learning model architecture for transfer learning in bioactive peptide binary classification modeling, which can directly assist users in training a high-performance deep-learning model with a fixed architecture.
Abstract: Identification of potent peptides through model prediction can reduce benchwork in wet experiments. However, the conventional process of model building can be complex and time consuming due to challenges such as peptide representation, feature selection, model selection and hyperparameter tuning. Recently, advanced pretrained deep learning-based language models (LMs) have been released for protein sequence embedding and applied to structure and function prediction. Based on these developments, we have developed UniDL4BioPep, a universal deep-learning model architecture for transfer learning in bioactive peptide binary classification modeling. It can directly assist users in training a high-performance deep-learning model with a fixed architecture and achieve cutting-edge performance to meet the demands of efficient discovery of novel bioactive peptides. To the best of our knowledge, this is the first time that a pretrained biological language model has been utilized for peptide embeddings and successfully predicts peptide bioactivities through large-scale evaluations of those embeddings. The model was also validated through uniform manifold approximation and projection analysis. By combining the LM with a convolutional neural network, UniDL4BioPep achieved better performance than the respective state-of-the-art models for 15 out of 20 different bioactivity dataset prediction tasks. The accuracy, Matthews correlation coefficient and area under the curve were 0.7-7%, 1.23-26.7% and 0.3-25.6% higher, respectively. A user-friendly web server of UniDL4BioPep for the tested bioactivities is established and freely accessible at https://nepc2pvmzy.us-east-1.awsapprunner.com. The source codes, datasets and templates of UniDL4BioPep for other bioactivity fitting and prediction tasks are available at https://github.com/dzjxzyd/UniDL4BioPep.
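The Matthews correlation coefficient quoted above is computed from the binary confusion matrix and, unlike accuracy, stays informative on the imbalanced datasets common in peptide bioactivity prediction. A small reference implementation:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts:
    +1 = perfect, 0 = no better than chance, -1 = total disagreement."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

print(mcc(50, 50, 0, 0))    # perfect classifier → 1.0
print(mcc(25, 25, 25, 25))  # chance-level → 0.0
```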

Journal ArticleDOI
TL;DR: In this article , the authors proposed a DDI extraction framework, which integrates the article-level and sentence-level position information of the instances into the model to strengthen the connections between instances generated from the same article or sentence, and introduced a comprehensive similarity-matching method that uses string and word sense similarity to improve the matching accuracy between the target drug and external text.
Abstract: Determining drug-drug interactions (DDIs) is an important part of pharmacovigilance and has a vital impact on public health. Compared with drug trials, obtaining DDI information from scientific articles is a faster, lower-cost and still highly credible approach. However, current DDI text extraction methods consider the instances generated from articles to be independent and ignore the potential connections between different instances in the same article or sentence. Effective use of external text data could improve prediction accuracy, but existing methods cannot extract key information from external data accurately and reasonably, resulting in low utilization of external data. In this study, we propose a DDI extraction framework, instance position embedding and key external text for DDI (IK-DDI), which adopts instance position embedding and key external text to extract DDI information. The proposed framework integrates the article-level and sentence-level position information of the instances into the model to strengthen the connections between instances generated from the same article or sentence. Moreover, we introduce a comprehensive similarity-matching method that uses string and word-sense similarity to improve the matching accuracy between the target drug and external text. Furthermore, a key-sentence search method is used to obtain key information from external data. Therefore, IK-DDI can make full use of the connections between instances and the information contained in external text data to improve the efficiency of DDI extraction. Experimental results show that IK-DDI outperforms existing methods on both macro-averaged and micro-averaged metrics, which suggests our method provides a complete framework for extracting relationships between biomedical entities and processing external text data.
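The string-similarity half of the matching step can be illustrated with the standard library's `difflib`. This is a crude stand-in for the paper's combined string and word-sense similarity, with invented drug names, but it shows why character-level matching helps link a target drug to spelling variants in external text:

```python
from difflib import SequenceMatcher

def best_match(name, candidates):
    """Return the candidate mention most similar to `name` by
    character-level similarity ratio (case-insensitive)."""
    return max(candidates,
               key=lambda c: SequenceMatcher(None, name.lower(),
                                             c.lower()).ratio())

print(best_match("aspirin", ["acetaminophen", "aspirine", "ibuprofen"]))
# → 'aspirine'
```

In practice a word-sense component (e.g. synonym lookup) would catch matches that pure string similarity misses, such as brand versus generic names.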

Journal ArticleDOI
TL;DR: Huang et al. as discussed by the authors conducted a systematic evaluation of event-based tools for differential splicing analysis in B-cell acute lymphoblastic leukemia (ALL) and uncovered the differential splicing of TCF12.
Abstract: RNA alternative splicing, a post-transcriptional stage in eukaryotes, is crucial in cellular homeostasis and disease processes. Due to the rapid development of next-generation sequencing (NGS) technology and the flood of NGS data, the detection of differential splicing from RNA-seq data has become mainstream. A range of bioinformatic tools has been developed. However, until now, an independent and comprehensive comparison of available algorithms/tools at the event level has been lacking. Here, 21 different tools are subjected to systematic evaluation, based on simulated RNA-seq data into which exact differential splicing events are introduced. We observe immense discrepancies among these tools. SUPPA, DARTS, rMATS and LeafCutter outperform the other event-based tools. We also examine the abilities of the tools to identify novel splicing events, which shows that most event-based tools are unsuitable for discovering novel splice sites. To improve the overall performance, we present two methodological approaches, i.e. low-expression transcript filtering and tool-pair combination. Finally, a new protocol for selecting tools to perform differential splicing analysis for different analytical tasks (e.g. precision and recall rate) is proposed. Under this protocol, we analyze the distinct splicing landscape in the DUX4/IGH subgroup of B-cell acute lymphoblastic leukemia and uncover the differential splicing of TCF12. All codes needed to reproduce the results are available at https://github.com/mhjiang97/Benchmarking_DS.
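Event-based tools such as rMATS quantify each splicing event as a length-normalized percent spliced-in (PSI); differential splicing is then a shift in PSI between conditions. A minimal sketch with invented read counts (real tools also model count uncertainty and replicates):

```python
def psi(inclusion_reads, skipping_reads, inclusion_len, skipping_len):
    """Length-normalized percent spliced-in for an exon-skipping event:
    reads are divided by the effective length of the junctions that
    support each isoform before taking the inclusion fraction."""
    inc = inclusion_reads / inclusion_len
    skp = skipping_reads / skipping_len
    return inc / (inc + skp)

# inclusion is supported by two junctions (effective length 2),
# skipping by one (effective length 1); counts are synthetic
delta = psi(20, 10, 2, 1) - psi(5, 15, 2, 1)
print(round(delta, 3))  # ΔPSI between the two conditions
```

A large |ΔPSI| with adequate read support is the basic signal the benchmarked tools test for significance.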

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a computational model to predict potential circRNA-disease associations based on collaborative learning with circRNA multi-view functional annotations.
Abstract: Emerging studies have shown that circular RNAs (circRNAs) are involved in a variety of biological processes and play a key role in disease diagnosis and treatment. Although many methods, including traditional machine learning and deep learning, have been developed to predict associations between circRNAs and diseases, the biological function of circRNAs has not been fully exploited. Some methods have explored disease-related circRNAs based on different views, but how to efficiently use multi-view data about circRNAs is still not well studied. Therefore, we propose a computational model, CLCDA, to predict potential circRNA-disease associations based on collaborative learning with circRNA multi-view functional annotations. First, we extract circRNA multi-view functional annotations and build circRNA association networks, respectively, to enable effective network fusion. Then, a collaborative deep learning framework for multi-view information is designed to obtain circRNA multi-source information features, making full use of the internal relationships among circRNA multi-view information. We build a network consisting of circRNAs and diseases by their functional similarity and extract consistency description information for circRNAs and diseases. Last, we predict potential associations between circRNAs and diseases based on a graph autoencoder. Our computational model performs better in predicting candidate disease-related circRNAs than existing ones. Furthermore, case studies in which several common diseases were used to find unknown related circRNAs demonstrate the high practicability of the method. The experiments show that CLCDA can efficiently predict disease-related circRNAs and is helpful for the diagnosis and treatment of human disease.

Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors introduced a knowledge-distillation insights drug-target affinity prediction model with feature fusion inputs to make fast, accurate and explainable predictions, which outperformed previous state-of-the-art models.
Abstract: Rapid and accurate prediction of drug-target affinity can accelerate and improve the drug discovery process. Recent studies show that deep learning models may have the potential to provide fast and accurate drug-target affinity prediction. However, existing deep learning models still have disadvantages that make it difficult to complete the task satisfactorily: complex-based models rely heavily on the time-consuming docking process, while complex-free models lack interpretability. In this study, we introduce a novel knowledge-distillation-based drug-target affinity prediction model with feature-fusion inputs to make fast, accurate and explainable predictions. We benchmarked the model on public affinity prediction and virtual screening datasets. The results show that it outperformed previous state-of-the-art complex-free models and achieved performance comparable to previous complex-based models. Finally, we study the interpretability of this model through visualization and find that it can provide meaningful explanations for pairwise interactions. We believe this model can further improve drug-target affinity prediction through its higher accuracy and reliable interpretability.
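Knowledge distillation generally means training a fast student model to match a slower teacher's softened outputs; the standard objective is a temperature-scaled KL divergence. A minimal sketch (the logits are invented, and the paper's exact distillation target may differ from this classification-style form):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax with a max-shift for stability."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    rescaled by T^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return temperature ** 2 * sum(pi * math.log(pi / qi)
                                  for pi, qi in zip(p, q))

# the loss vanishes when the student reproduces the teacher exactly
print(distill_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

A higher temperature exposes the teacher's "dark knowledge" (relative preferences among non-top outputs), which is what lets a docking-free student inherit behavior from a slower complex-based teacher.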

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a persistent Tor-algebra (PTA) model for a unified algebraic representation of the multiphysical interactions, where protein structures and interactions are described as a series of face rings and Tor modules.
Abstract: Protein-protein interactions (PPIs) play crucial roles in almost all biological processes, from cell signaling and membrane transport to metabolism and immune systems. Efficient characterization of PPIs at the molecular level is key to the fundamental understanding of PPI mechanisms. Even with the gigantic number of PPI models from graphs, networks, geometry and topology, it remains a great challenge to design functional models that efficiently characterize the complicated multiphysical information within PPIs. Here we propose the persistent Tor-algebra (PTA) model for a unified algebraic representation of the multiphysical interactions. Mathematically, our PTA is an inherently algebraic method of data analysis. In our PTA model, protein structures and interactions are described as a series of face rings and Tor modules, from which the PTA model is developed. The multiphysical information within/between biomolecules is implicitly characterized by PTA and further represented as PTA barcodes. To test our PTA model, we consider PTA-based ensemble learning for PPI binding affinity prediction. The two most commonly used datasets, i.e. SKEMPI and AB-Bind, are employed. Our model outperforms all existing models of which we are aware. Mathematically, our PTA model provides a highly efficient way to characterize molecular structures and interactions.

Journal ArticleDOI
TL;DR: Lin et al. as discussed by the authors performed a comprehensive evaluation of existing tools for differential abundance analysis of correlated microbiome data (DAA-c) using real data-based simulations and found that the linear model-based methods LinDA, MaAsLin2 and LDM are more robust than methods based on generalized linear models.
Abstract: Differential abundance analysis (DAA) is one central statistical task in microbiome data analysis. A robust and powerful DAA tool can help identify highly confident microbial candidates for further biological validation. Current microbiome studies frequently generate correlated samples from different microbiome sampling schemes such as spatial and temporal sampling. In the past decade, a number of DAA tools for correlated microbiome data (DAA-c) have been proposed. Disturbingly, different DAA-c tools could sometimes produce quite discordant results. To recommend the best practice to the field, we performed the first comprehensive evaluation of existing DAA-c tools using real data-based simulations. Overall, the linear model-based methods LinDA, MaAsLin2 and LDM are more robust than methods based on generalized linear models. The LinDA method is the only method that maintains reasonable performance in the presence of strong compositional effects.
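The compositional effects mentioned above arise because sequencing yields relative, not absolute, abundances: multiplying all of a sample's counts by a constant carries no information. Log-ratio transforms, which underlie linear-model methods such as LinDA, are invariant to that scaling. The centered log-ratio (CLR) transform illustrates the idea (this is a generic sketch, not LinDA's full bias-correction procedure; zero counts would need a pseudocount in practice):

```python
import math

def clr(counts):
    """Centered log-ratio transform: log of each count minus the mean
    log across the sample, so per-sample scaling cancels out."""
    logs = [math.log(c) for c in counts]
    center = sum(logs) / len(logs)
    return [l - center for l in logs]

# a 10x sequencing-depth difference leaves the CLR values unchanged
print(clr([10.0, 20.0, 30.0]))
print(clr([100.0, 200.0, 300.0]))
```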

Journal ArticleDOI
TL;DR: Wu et al. as mentioned in this paper proposed a novel single-cell deep fusion clustering model, which contains two modules, i.e., an attributed feature clustering module and a structure-attention feature clustering module.
Abstract: Clustering methods have been widely used on single-cell RNA-seq data to investigate tumor heterogeneity. Since traditional clustering methods struggle to capture the structure of such high-dimensional data, deep clustering methods have drawn increasing attention in recent years due to their promising strengths on the task. However, existing methods consider either the attribute information of each cell or the structure information between different cells; in other words, they cannot sufficiently make use of all of this information simultaneously. To this end, we propose a novel single-cell deep fusion clustering model, which contains two modules, i.e. an attributed feature clustering module and a structure-attention feature clustering module. More concretely, two elegantly designed autoencoders are built to handle both features regardless of their data types. Experiments have demonstrated the validity of the proposed approach, showing that it efficiently fuses attribute, structure and attention information on single-cell RNA-seq data. This work will be further beneficial for investigating cell subpopulations and the tumor microenvironment. The Python implementation of our work is freely available at https://github.com/DayuHuu/scDFC.

Journal ArticleDOI
TL;DR: In this paper , a single-cell state transition across-samples of RNA-seq data (scSTAR) is presented, which overcomes this limitation by constructing a paired-cell projection between biological conditions with an arbitrary time span by maximizing the covariance between two feature spaces.
Abstract: Cell-state transition can reveal additional information from single-cell ribonucleic acid (RNA)-sequencing data in time-resolved biological phenomena. However, most current methods are based on the time derivative of the gene expression state, which restricts them to the short-term evolution of cell states. Here, we present single-cell State Transition Across-samples of RNA-seq data (scSTAR), which overcomes this limitation by constructing a paired-cell projection between biological conditions with an arbitrary time span, maximizing the covariance between two feature spaces using partial least squares and minimum squared error methods. In mouse ageing data, the response to stress in CD4+ memory T cell subtypes was found to be associated with ageing. A novel Treg subtype characterized by mTORC activation was identified to be associated with antitumour immune suppression, which was confirmed by immunofluorescence microscopy and survival analysis in 11 cancers from The Cancer Genome Atlas (TCGA). On melanoma data, scSTAR improved immunotherapy-response prediction accuracy from 0.8 to 0.96.
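The covariance-maximization step at the heart of the partial least squares approach can be sketched directly: the first PLS component is the top singular pair of the cross-covariance matrix between the two feature spaces. A pure-Python illustration on synthetic paired data (the data, dimensions and power-iteration solver are all invented for this sketch and are not scSTAR's implementation):

```python
import math
import random

random.seed(0)
n, p, q = 200, 3, 2
# synthetic paired samples: Y is a noisy linear image of X
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
W = [[1.0, 0.0], [0.5, -1.0], [0.0, 0.3]]
Y = [[sum(X[i][a] * W[a][b] for a in range(p)) + 0.05 * random.gauss(0, 1)
      for b in range(q)] for i in range(n)]

def center(M):
    means = [sum(col) / len(M) for col in zip(*M)]
    return [[v - m for v, m in zip(row, means)] for row in M]

Xc, Yc = center(X), center(Y)
# cross-covariance matrix C (p x q)
C = [[sum(Xc[i][a] * Yc[i][b] for i in range(n)) / (n - 1)
      for b in range(q)] for a in range(p)]

def normalize(x):
    nrm = math.sqrt(sum(v * v for v in x))
    return [v / nrm for v in x]

# power iteration on C C^T yields the top left singular vector u;
# v = C^T u (normalized) is the matching right singular vector
u = normalize([1.0] * p)
for _ in range(200):
    Ctu = [sum(C[a][b] * u[a] for a in range(p)) for b in range(q)]
    u = normalize([sum(C[a][b] * Ctu[b] for b in range(q)) for a in range(p)])
v = normalize([sum(C[a][b] * u[a] for a in range(p)) for b in range(q)])

def proj_cov(wx, wy):
    """Covariance between the samples projected onto (wx, wy)."""
    sx = [sum(x * w for x, w in zip(row, wx)) for row in Xc]
    sy = [sum(y * w for y, w in zip(row, wy)) for row in Yc]
    return sum(a * b for a, b in zip(sx, sy)) / (n - 1)

best = proj_cov(u, v)  # equals the top singular value of C
```

No pair of axis-aligned directions can beat the covariance captured by (u, v), which is why this direction pair is a natural axis along which to pair cells across conditions.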