Showing papers in "Bioinformatics in 2022"


Journal ArticleDOI
TL;DR: HTSeq 2.0 as mentioned in this paper provides a more extensive API including a new representation for sparse genomic data, enhancements in htseq-count to suit single cell omics, a new script for data using cell and molecular barcodes, improved documentation, testing and deployment, bug fixes, and Python 3 support.
Abstract: HTSeq 2.0 provides a more extensive API including a new representation for sparse genomic data, enhancements in htseq-count to suit single cell omics, a new script for data using cell and molecular barcodes, improved documentation, testing and deployment, bug fixes, and Python 3 support. HTSeq 2.0 is released as an open-source software under the GNU General Public License and available from the Python Package Index at https://pypi.python.org/pypi/HTSeq. The source code is available on Github at https://github.com/htseq/htseq. Supplementary data are available at Bioinformatics online.

157 citations


Journal ArticleDOI
TL;DR: HTSeq 2.0 provides a more extensive application programming interface including a new representation for sparse genomic data, enhancements for htseq-count to suit single-cell omics, a new script for data using cell and molecular barcodes, improved documentation, testing and deployment, bug fixes and Python 3 support.
Abstract: Abstract Summary HTSeq 2.0 provides a more extensive application programming interface including a new representation for sparse genomic data, enhancements for htseq-count to suit single-cell omics, a new script for data using cell and molecular barcodes, improved documentation, testing and deployment, bug fixes and Python 3 support. Availability and implementation HTSeq 2.0 is released as an open-source software under the GNU General Public License and is available from the Python Package Index at https://pypi.python.org/pypi/HTSeq. The source code is available on Github at https://github.com/htseq/htseq. Supplementary information Supplementary data are available at Bioinformatics online.

157 citations


Journal ArticleDOI
TL;DR: YaHS, as mentioned in this paper, is a command-line tool for the construction of chromosome-scale scaffolds from Hi-C data. It can be run with a single-line command, requires minimal input from users (an assembly file and an alignment file) that is compatible with similar tools, and provides assembly results in multiple formats.
Abstract: Abstract Summary We present YaHS, a user-friendly command-line tool for the construction of chromosome-scale scaffolds from Hi-C data. It can be run with a single-line command, requires minimal input from users (an assembly file and an alignment file) which is compatible with similar tools and provides assembly results in multiple formats, thereby enabling rapid, robust and scalable construction of high-quality genome assemblies with high accuracy and contiguity. Availability and implementation YaHS is implemented in C and licensed under the MIT License. The source code, documentation and tutorial are available at https://github.com/sanger-tol/yahs. Supplementary information Supplementary data are available at Bioinformatics online.

125 citations


Journal ArticleDOI
TL;DR: An update to GTDB-Tk is presented that uses a divide-and-conquer approach where user genomes are initially placed into a bacterial reference tree with family-level representatives followed by placement into an appropriate class-level subtree comprising species representatives.
Abstract: The Genome Taxonomy Database (GTDB) and associated taxonomic classification toolkit (GTDB-Tk) have been widely adopted by the microbiology community. However, the growing size of the GTDB bacterial reference tree has resulted in GTDB-Tk requiring substantial amounts of memory (~320 GB) which limits its adoption and ease of use. Here we present an update to GTDB-Tk that uses a divide-and-conquer approach where user genomes are initially placed into a bacterial reference tree with family-level representatives followed by placement into an appropriate class-level subtree comprising species representatives. This substantially reduces the memory requirements of GTDB-Tk while having minimal impact on classification. Availability GTDB-Tk is implemented in Python and licenced under the GNU General Public Licence v3.0. Source code and documentation are available at: https://github.com/ecogenomics/gtdbtk. Contact p.chaumeil@uq.edu.au or donovan.parks@gmail.com

86 citations


Journal ArticleDOI
TL;DR: ProteinBERT as discussed by the authors is a deep language model specifically designed for proteins, which combines language modeling with a novel task of Gene Ontology (GO) annotation prediction and obtains near state-of-the-art performance, sometimes exceeding it, on multiple benchmarks.
Abstract: Abstract Summary Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data. Availability and implementation Code and pretrained model weights are available at https://github.com/nadavbra/protein_bert. Supplementary information Supplementary data are available at Bioinformatics online.

75 citations


Journal ArticleDOI
TL;DR: ABlooper as mentioned in this paper is an end-to-end equivariant deep learning-based CDR loop structure prediction tool, which predicts the structure of CDR loops with high accuracy and provides a confidence estimate for each prediction.
Abstract: Abstract Motivation Antibodies are a key component of the immune system and have been extensively used as biotherapeutics. Accurate knowledge of their structure is central to understanding their antigen-binding function. The key area for antigen binding and the main area of structural variation in antibodies are concentrated in the six complementarity determining regions (CDRs), with the most important for binding and most variable being the CDR-H3 loop. The sequence and structural variability of CDR-H3 make it particularly challenging to model. Recently deep learning methods have offered a step change in our ability to predict protein structures. Results In this work, we present ABlooper, an end-to-end equivariant deep learning-based CDR loop structure prediction tool. ABlooper rapidly predicts the structure of CDR loops with high accuracy and provides a confidence estimate for each of its predictions. On the models of the Rosetta Antibody Benchmark, ABlooper makes predictions with an average CDR-H3 RMSD of 2.49 Å, which drops to 2.05 Å when considering only its 75% most confident predictions. Availability and implementation https://github.com/oxpig/ABlooper. Supplementary information Supplementary data are available at Bioinformatics online.
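
The two figures quoted in the abstract above (an average CDR-H3 RMSD, and the average over the 75% most confident predictions) can be illustrated with a minimal sketch. This is not ABlooper's code: the function names are hypothetical, the coordinates are assumed to be already superposed, and the confidence value is assumed to be a score where higher means more confident.

```python
import math


def rmsd(pred, ref):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points.

    Assumption: both structures are already expressed in the same frame,
    so no superposition step is performed here.
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum((p - r) ** 2 for a, b in zip(pred, ref) for p, r in zip(a, b))
    return math.sqrt(squared / len(pred))


def mean_rmsd_most_confident(rmsds, confidences, fraction=0.75):
    """Average RMSD over the `fraction` of predictions with the highest confidence."""
    if len(rmsds) != len(confidences) or not rmsds:
        raise ValueError("rmsds and confidences must be non-empty and equally long")
    # Sort by the hypothetical confidence score, most confident first.
    ranked = sorted(zip(confidences, rmsds), key=lambda pair: pair[0], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return sum(value for _, value in ranked[:keep]) / keep
```

Filtering by confidence before averaging is what lets a reported error drop when only the most confident predictions are considered.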

53 citations


Journal ArticleDOI
TL;DR: GSEApy as discussed by the authors uses a Rust implementation to enable it to calculate the same enrichment statistic as GSEA for a collection of pathways and also provides an interface between Python and Enrichr web services, as well as for BioMart.
Abstract: Abstract Motivation Gene set enrichment analysis (GSEA) is a commonly used algorithm for characterizing gene expression changes. However, the currently available tools used to perform GSEA have a limited ability to analyze large datasets, which is particularly problematic for the analysis of single-cell data. To overcome this limitation, we developed a GSEA package in Python (GSEApy), which could efficiently analyze large single-cell datasets. Results We present a package (GSEApy) that performs GSEA in either the command line or Python environment. GSEApy uses a Rust implementation to enable it to calculate the same enrichment statistic as GSEA for a collection of pathways. The Rust implementation of GSEApy is 3-fold faster than the Numpy version of GSEApy (v0.10.8) and uses >4-fold less memory. GSEApy also provides an interface between Python and Enrichr web services, as well as for BioMart. The Enrichr application programming interface enables GSEApy to perform over-representation analysis for an input gene list. Furthermore, GSEApy consists of several tools, each designed to facilitate a particular type of enrichment analysis. Availability and implementation The new GSEApy with Rust extension is deposited in PyPI: https://pypi.org/project/gseapy/. The GSEApy source code is freely available at https://github.com/zqfang/GSEApy. Also, the documentation website is available at https://gseapy.rtfd.io/. Supplementary information Supplementary data are available at Bioinformatics online.
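
As a rough illustration of the enrichment statistic mentioned in the abstract above, the sketch below computes the classic weighted running-sum score for one gene set. It is a plain-Python sketch of the general GSEA idea, not GSEApy's API or its Rust implementation; the function name and the `weight` parameter are assumptions made for the example.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score for genes already sorted by decreasing score.

    Walking down the ranked list, the running sum increases at genes in the
    set (in proportion to |score| ** weight) and decreases at all other genes.
    The score is the running-sum value with the largest absolute deviation from zero.
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    hit_weights = [abs(s) ** weight if hit else 0.0 for s, hit in zip(scores, in_set)]
    total_hit = sum(hit_weights)
    n_miss = len(ranked_genes) - sum(in_set)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list without covering all of it")

    running = 0.0
    best = 0.0
    for w, hit in zip(hit_weights, in_set):
        running += w / total_hit if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

A set concentrated at the top of the ranking gives a score near +1, and a set concentrated at the bottom gives a score near -1.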

50 citations


Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a novel deep learning framework by utilizing the information bottleneck principle and transfer learning to predict the toxicity of peptides as well as proteins, which achieved a higher prediction performance than state-of-the-art methods on the peptide dataset.
Abstract: Abstract Motivation Recently, peptides have emerged as a promising class of pharmaceuticals for various diseases treatment poised between traditional small molecule drugs and therapeutic proteins. However, one of the key bottlenecks preventing them from therapeutic peptides is their toxicity toward human cells, and few available algorithms for predicting toxicity are specially designed for short-length peptides. Results We present ToxIBTL, a novel deep learning framework by utilizing the information bottleneck principle and transfer learning to predict the toxicity of peptides as well as proteins. Specifically, we use evolutionary information and physicochemical properties of peptide sequences and integrate the information bottleneck principle into a feature representation learning scheme, by which relevant information is retained and the redundant information is minimized in the obtained features. Moreover, transfer learning is introduced to transfer the common knowledge contained in proteins to peptides, which aims to improve the feature representation capability. Extensive experimental results demonstrate that ToxIBTL not only achieves a higher prediction performance than state-of-the-art methods on the peptide dataset, but also has a competitive performance on the protein dataset. Furthermore, a user-friendly online web server is established as the implementation of the proposed ToxIBTL. Availability and implementation The proposed ToxIBTL and data can be freely accessible at http://server.wei-group.net/ToxIBTL. Our source code is available at https://github.com/WLYLab/ToxIBTL. Supplementary information Supplementary data are available at Bioinformatics online.

42 citations


Journal ArticleDOI
TL;DR: ToxIBTL is presented, a novel deep learning framework by utilizing the information bottleneck principle and transfer learning to predict the toxicity of peptides as well as proteins and achieves a higher prediction performance than state-of-the-art methods on the peptide dataset, but also has a competitive performance on the protein dataset.
Abstract: MOTIVATION Recently, peptides have emerged as a promising class of pharmaceuticals for various diseases treatment poised between traditional small molecule drugs and therapeutic proteins. However, one of the key bottlenecks preventing them from therapeutic peptides is their toxicity toward human cells, and few available algorithms for predicting toxicity are specially designed for short-length peptides. RESULTS We present ToxIBTL, a novel deep learning framework by utilizing the information bottleneck principle and transfer learning to predict the toxicity of peptides as well as proteins. Specifically, we use evolutionary information and physicochemical properties of peptide sequences and integrate the information bottleneck principle into a feature representation learning scheme, by which relevant information is retained and the redundant information is minimized in the obtained features. Moreover, transfer learning is introduced to transfer the common knowledge contained in proteins to peptides, which aims to improve the feature representation capability. Extensive experimental results demonstrate that ToxIBTL not only achieves a higher prediction performance than state-of-the-art methods on the peptide dataset, but also has a competitive performance on the protein dataset. Furthermore, a user-friendly online web server is established as the implementation of the proposed ToxIBTL. AVAILABILITY The proposed ToxIBTL can be freely accessible at http://server.wei-group.net/ToxIBTL. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

36 citations


Journal ArticleDOI
TL;DR: It is found that main-chain flexibility associated with apo-holo pairs of conformers negatively correlates with the predicted local model quality score plDDT, indicating that plDDT values in a single 3D model could be used to infer local conformational changes linked to ligand binding transitions.
Abstract: MOTIVATION After the outstanding breakthrough of AlphaFold in predicting protein 3D models, new questions appeared and remain unanswered. The ensemble nature of proteins, for example, challenges the structural prediction methods because the models should represent a set of conformers instead of single structures. The evolutionary and structural features captured by effective deep learning techniques may unveil the information to generate several diverse conformations from a single sequence. Here we address the performance of AlphaFold2 predictions obtained through ColabFold under this ensemble paradigm. RESULTS Using a curated collection of apo-holo pairs of conformers, we found that AlphaFold2 predicts the holo form of a protein in ∼70% of the cases, being unable to reproduce the observed conformational diversity with the same error for both conformers. More importantly, we found that AlphaFold2's performance worsens with the increasing conformational diversity of the studied protein. This impairment is related to the heterogeneity in the degree of conformational diversity found between different members of the homologous family of the protein under study. Finally, we found that main-chain flexibility associated with apo-holo pairs of conformers negatively correlates with the predicted local model quality score plDDT, indicating that plDDT values in a single 3D model could be used to infer local conformational changes linked to ligand binding transitions. AVAILABILITY Data and code used in this manuscript are publicly available at https://gitlab.com/sbgunq/publications/af2confdiv-oct2021. SUPPLEMENTARY INFORMATION Supplementary data is available at the journal's web site.

33 citations


Journal ArticleDOI
TL;DR: Plotsr is an efficient tool to visualize structural similarities and rearrangements between multiple genomes that can be used to compare genomes on chromosome level or to zoom in on any selected region.
Abstract: Summary Third-generation genomic technologies have led to a sharp increase in the number of high-quality genome assemblies. This allows the comparison of multiple assembled genomes of individual species and demands for new tools for visualising their structural properties. Here we present plotsr, an efficient tool to visualize structural similarities and rearrangements between multiple genomes. It can be used to compare genomes on chromosome level or to zoom in on any selected region. In addition, plotsr can augment the visualisation with regional identifiers (e.g. genes or genomic markers) or histogram tracks for continuous features (e.g. GC content or polymorphism density). Availability and implementation plotsr is implemented as a python package and uses the standard matplotlib library for plotting. It is freely available under the MIT license at GitHub (https://github.com/schneebergerlab/plotsr) and bioconda (https://anaconda.org/bioconda/plotsr). Contact Manish Goel (manish.goel@lmu.de), Korbinian Schneeberger (k.schneeberger@lmu.de)

Journal ArticleDOI
TL;DR: The Optimized Dynamic Genome/Graph Implementation (ODGI) as discussed by the authors is a suite of tools that implements scalable algorithms and has an efficient in-memory representation of DNA pangenome graphs in the form of variation graphs.
Abstract: Pangenome graphs provide a complete representation of the mutual alignment of collections of genomes. These models offer the opportunity to study the entire genomic diversity of a population, including structurally complex regions. Nevertheless, analyzing hundreds of gigabase-scale genomes using pangenome graphs is difficult as it is not well-supported by existing tools. Hence, fast and versatile software is required to ask advanced questions to such data in an efficient way. We wrote Optimized Dynamic Genome/Graph Implementation (ODGI), a novel suite of tools that implements scalable algorithms and has an efficient in-memory representation of DNA pangenome graphs in the form of variation graphs. ODGI supports pre-built graphs in the Graphical Fragment Assembly format. ODGI includes tools for detecting complex regions, extracting pangenomic loci, removing artifacts, exploratory analysis, manipulation, validation and visualization. Its fast parallel execution facilitates routine pangenomic tasks, as well as pipelines that can quickly answer complex biological questions of gigabase-scale pangenome graphs. ODGI is published as free software under the MIT open source license. Source code can be downloaded from https://github.com/pangenome/odgi and documentation is available at https://odgi.readthedocs.io. ODGI can be installed via Bioconda https://bioconda.github.io/recipes/odgi/README.html or GNU Guix https://github.com/pangenome/odgi/blob/master/guix.scm. Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: This work compares the pathways generated by state-of-the-art protein structure prediction methods to experimental data about protein folding pathways and suggests that recent advances in structure prediction do not yet provide an enhanced understanding of protein folding.
Abstract: Abstract Summary Motivation. Predicting the native state of a protein has long been considered a gateway problem for understanding protein folding. Recent advances in structural modeling driven by deep learning have achieved unprecedented success at predicting a protein’s crystal structure, but it is not clear if these models are learning the physics of how proteins dynamically fold into their equilibrium structure or are just accurate knowledge-based predictors of the final state. Results. In this work, we compare the pathways generated by state-of-the-art protein structure prediction methods to experimental data about protein folding pathways. The methods considered were AlphaFold 2, RoseTTAFold, trRosetta, RaptorX, DMPfold, EVfold, SAINT2 and Rosetta. We find evidence that their simulated dynamics capture some information about the folding pathway, but their predictive ability is worse than a trivial classifier using sequence-agnostic features like chain length. The folding trajectories produced are also uncorrelated with experimental observables such as intermediate structures and the folding rate constant. These results suggest that recent advances in structure prediction do not yet provide an enhanced understanding of protein folding. Availability. The data underlying this article are available in GitHub at https://github.com/oxpig/structure-vs-folding/ Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: Plotsr is an efficient tool to visualize structural similarities and rearrangements between genomes that can be used to compare genomes on chromosome level or to zoom in on any selected region.
Abstract: SUMMARY Third-generation genome sequencing technologies have led to a sharp increase in the number of high-quality genome assemblies. This allows the comparison of multiple assembled genomes of individual species and demands new tools for visualising their structural properties. Here we present plotsr, an efficient tool to visualize structural similarities and rearrangements between genomes. It can be used to compare genomes on chromosome level or to zoom in on any selected region. In addition, plotsr can augment the visualisation with regional identifiers (e.g. genes or genomic markers) or histogram tracks for continuous features (e.g. GC content or polymorphism density). AVAILABILITY AND IMPLEMENTATION plotsr is implemented as a python package and uses the standard matplotlib library for plotting. It is freely available under the MIT license at GitHub (https://github.com/schneebergerlab/plotsr) and bioconda (https://anaconda.org/bioconda/plotsr). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
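TL;DR: ggtranscript is an R package, built as a ggplot2 extension, that provides a fast and flexible method to visualize and compare transcripts.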
Abstract: Motivation The advent of long-read sequencing technologies has increased demand for the visualisation and interpretation of transcripts. However, tools that perform such visualizations remain inflexible and lack the ability to easily identify differences between transcript structures. Here, we introduce ggtranscript, an R package that provides a fast and flexible method to visualize and compare transcripts. As a ggplot2 extension, ggtranscript inherits the functionality and familiarity of ggplot2 making it easy to use. Availability and implementation ggtranscript is available at https://github.com/dzhang32/ggtranscript, DOI: https://doi.org/10.5281/zenodo.6374061

Journal ArticleDOI
TL;DR: In this article, the authors compare the pathways generated by state-of-the-art protein structure prediction methods to experimental data about protein folding pathways and find evidence that their simulated dynamics capture some information about the folding pathway, but their predictive ability is worse than a trivial classifier using sequence-agnostic features like chain length.
Abstract: Abstract Summary Motivation. Predicting the native state of a protein has long been considered a gateway problem for understanding protein folding. Recent advances in structural modeling driven by deep learning have achieved unprecedented success at predicting a protein’s crystal structure, but it is not clear if these models are learning the physics of how proteins dynamically fold into their equilibrium structure or are just accurate knowledge-based predictors of the final state. Results. In this work, we compare the pathways generated by state-of-the-art protein structure prediction methods to experimental data about protein folding pathways. The methods considered were AlphaFold 2, RoseTTAFold, trRosetta, RaptorX, DMPfold, EVfold, SAINT2 and Rosetta. We find evidence that their simulated dynamics capture some information about the folding pathway, but their predictive ability is worse than a trivial classifier using sequence-agnostic features like chain length. The folding trajectories produced are also uncorrelated with experimental observables such as intermediate structures and the folding rate constant. These results suggest that recent advances in structure prediction do not yet provide an enhanced understanding of protein folding. Availability. The data underlying this article are available in GitHub at https://github.com/oxpig/structure-vs-folding/ Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: plotgardener as mentioned in this paper is a coordinate-based genomic data visualization package that offers a new paradigm for multi-plot figure generation in R. Plotgardener allows precise, programmatic control over the placement, esthetics and arrangements of plots while maximizing user experience through fast and memory-efficient data access, support for a wide variety of data and file types, and tight integration with the Bioconductor environment.
Abstract: The R programming language is one of the most widely used programming languages for transforming raw genomic datasets into meaningful biological conclusions through analysis and visualization, which has been largely facilitated by infrastructure and tools developed by the Bioconductor project. However, existing plotting packages rely on relative positioning and sizing of plots, which is often sufficient for exploratory analysis but is poorly suited for the creation of publication-quality multi-panel images inherent to scientific manuscript preparation. We present plotgardener, a coordinate-based genomic data visualization package that offers a new paradigm for multi-plot figure generation in R. Plotgardener allows precise, programmatic control over the placement, esthetics and arrangements of plots while maximizing user experience through fast and memory-efficient data access, support for a wide variety of data and file types, and tight integration with the Bioconductor environment. Plotgardener also allows precise placement and sizing of ggplot2 plots, making it an invaluable tool for R users and data scientists from virtually any discipline. Package: https://bioconductor.org/packages/plotgardener, Code: https://github.com/PhanstielLab/plotgardener, Documentation: https://phanstiellab.github.io/plotgardener/. Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: LLPSDB v2.0 as mentioned in this paper is a new version of the database, with more than double the amount of curated data and a new class, "Ambiguous system".
Abstract: Emerging evidence has suggested that liquid-liquid phase separation (LLPS) of proteins plays a vital role both in a wide range of biological processes and in related diseases. Whether a protein undergoes phase separation is not only determined by the chemical and physical properties of the biomolecules themselves, but is also regulated by environmental conditions such as temperature, ionic strength, pH, as well as volume excluded by other macromolecules. A web accessible database LLPSDB was developed recently by our group, in which all the proteins involved in LLPS in vitro as well as corresponding experimental conditions were curated comprehensively from the published literature. With the rapid increase of investigations in biomolecular LLPS and growing popularity of LLPSDB, we updated the database, and developed a new version LLPSDB v2.0. Compared with the previously released version, more than double the amount of data is curated, and a new class 'Ambiguous system' is added. In addition, the web interface is improved, so that users can search the database by selecting the option 'phase separation status' alone or combined with other options. We anticipate that this updated database will serve as a more comprehensive and helpful resource for users. LLPSDB v2.0 is freely available at: http://bio-comp.org.cn/llpsdbv2. Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: igv.js as mentioned in this paper is an embeddable JavaScript implementation of the Integrative Genomics Viewer (IGV), which can be easily dropped into any web page with a single line of code and has no external dependencies.
Abstract: igv.js is an embeddable JavaScript implementation of the Integrative Genomics Viewer (IGV). It can be easily dropped into any web page with a single line of code and has no external dependencies. The viewer runs completely in the web browser, with no backend server and no data pre-processing required. The igv.js JavaScript component can be installed from NPM at https://www.npmjs.com/package/igv. The source code is available at https://github.com/igvteam/igv.js under the MIT open-source license. IGV-Web, the end-user application built around igv.js, is available at https://igv.org/app. The source code is available at https://github.com/igvteam/igv-webapp under the MIT open-source license. Supplementary information is available at Bioinformatics online.

Journal ArticleDOI
TL;DR: An R/Bioconductor package implementing commonly used normalization and differential analysis methods, and three supervised learning models to identify microbiome markers, which allows comparison of different methods of differential analysis and confounder analysis.
Abstract: SUMMARY Characterizing biomarkers based on microbiome profiles has great potential for translational medicine and precision medicine. Here, we present microbiomeMarker, an R/Bioconductor package implementing commonly used normalization and differential analysis methods, and three supervised learning models to identify microbiome markers. microbiomeMarker also allows comparison of different methods of differential analysis and confounder analysis. It uses standardized input and output formats, which renders it highly scalable and extensible, and allows it to seamlessly interface with other microbiome packages and tools. In addition, the package provides a set of functions to visualize and interpret the identified microbiome markers. AVAILABILITY AND IMPLEMENTATION microbiomeMarker is freely available from Bioconductor (https://www.bioconductor.org/packages/microbiomeMarker). Source code is available and maintained at GitHub (https://github.com/yiluheihei/microbiomeMarker). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: It is shown that some statistical properties of k-mer minimizers can be exploited by minimal perfect hashing to substantially improve the space/time trade-off of the dictionary compared to the best-known solutions.
Abstract: Motivation A dictionary of k-mers is a data structure that stores a set of n distinct k-mers and supports membership queries. This data structure is at the heart of many important tasks in computational biology. High-throughput sequencing of DNA can produce very large k-mer sets, in the size of billions of strings – in such cases, the memory consumption and query efficiency of the data structure are a concrete challenge. Results To tackle this problem, we describe a compressed and associative dictionary for k-mers, that is: a data structure where strings are represented in compact form and each of them is associated to a unique integer identifier in the range [0, n). We show that some statistical properties of k-mer minimizers can be exploited by minimal perfect hashing to substantially improve the space/time trade-off of the dictionary compared to the best-known solutions. Availability The C++ implementation of the dictionary is available at https://github.com/jermp/sshash. Contact giulio.ermanno.pibiri@isti.cnr.it
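
For readers unfamiliar with minimizers, the sketch below shows the basic notion the abstract above relies on: the minimizer of a k-mer is taken here as its lexicographically smallest substring of length m. This is an illustrative assumption only; practical implementations typically order m-mers by a hash value, and the abstract does not say which statistical properties the dictionary exploits. The function names are hypothetical and unrelated to the sshash code base.

```python
def minimizer(kmer, m):
    """Return the lexicographically smallest m-mer contained in `kmer`."""
    if not 0 < m <= len(kmer):
        raise ValueError("m must be between 1 and len(kmer)")
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def bucket_by_minimizer(sequence, k, m):
    """Group every k-mer of `sequence` under its minimizer."""
    buckets = {}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        buckets.setdefault(minimizer(kmer, m), set()).add(kmer)
    return buckets
```

In this toy setting, neighbouring k-mers of a sequence often fall into the same bucket, because overlapping k-mers frequently share their smallest m-mer.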

Journal ArticleDOI
TL;DR: A contact-map predictor utilizing the output of a pre-trained language model ESM-1b as an input along with a large training set and an ensemble of residual neural networks is developed, providing a much faster and reasonably accurate alternative to evolution-based methods, useful for large-scale prediction.
Abstract: Abstract Motivation Accurate prediction of protein contact-map is essential for accurate protein structure and function prediction. As a result, many methods have been developed for protein contact map prediction. However, most methods rely on protein-sequence-evolutionary information, which may not exist for many proteins due to lack of naturally occurring homologous sequences. Moreover, generating evolutionary profiles is computationally intensive. Here, we developed a contact-map predictor utilizing the output of a pre-trained language model ESM-1b as an input along with a large training set and an ensemble of residual neural networks. Results We showed that the proposed method makes a significant improvement over a single-sequence-based predictor SSCpred with 15% improvement in the F1-score for the independent CASP14-FM test set. It also outperforms evolutionary-profile-based methods trRosetta and SPOT-Contact with 48.7% and 48.5% respective improvement in the F1-score on the proteins without homologs (Neff = 1) in the independent SPOT-2018 set. The new method provides a much faster and reasonably accurate alternative to evolution-based methods, useful for large-scale prediction. Availability and implementation Stand-alone-version of SPOT-Contact-LM is available at https://github.com/jas-preet/SPOT-Contact-Single. Direct prediction can also be made at https://sparks-lab.org/server/spot-contact-single. The datasets used in this research can also be downloaded from the GitHub. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a graph Markov neural network (GMNN) algorithm to predict unknown circRNA-disease associations by calculating semantic similarity and Gaussian interactive profile kernel similarity (GIPs) of the disease and the GIPs of circRNA and then merge them to form a unified descriptor.
Abstract: Motivation With the analysis of the characteristics and functions of circular RNAs (circRNAs), people have realized that they play a critical role in diseases. Exploring the relationship between circRNAs and diseases is of far-reaching significance for searching the etiopathogenesis and treatment of diseases. Nevertheless, it is inefficient to learn new associations only through biotechnology. Results Consequently, we present a computational method, GMNN2CD, which employs a graph Markov neural network (GMNN) algorithm to predict unknown circRNA–disease associations. First, using verified associations, we calculate semantic similarity and Gaussian interactive profile kernel similarity (GIPs) of the disease and the GIPs of circRNA and then merge them to form a unified descriptor. After that, GMNN2CD uses a fusion feature variational map autoencoder to learn deep features and uses a label propagation map autoencoder to propagate tags based on known associations. Based on variational inference, GMNN alternate training enhances the ability of GMNN2CD to obtain high-efficiency high-dimensional features from low-dimensional representations. Finally, 5-fold cross-validation of five benchmark datasets shows that GMNN2CD is superior to the state-of-the-art methods. Furthermore, case studies have shown that GMNN2CD can detect potential associations. Availability and implementation The source code and data are available at https://github.com/nmt315320/GMNN2CD.git.
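
The Gaussian profile kernel similarity mentioned in this abstract is commonly computed from the rows (or columns) of a binary association matrix. The sketch below uses the widely used form K(i, j) = exp(-γ · ||p_i - p_j||²), with γ set to a bandwidth parameter divided by the mean squared norm of the profiles. The abstract does not state the exact parameterization used by GMNN2CD, so the function name `gip_kernel`, the default `gamma_prime=1.0`, and the toy matrix are assumptions made for illustration only.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian profile kernel between the rows of a 0/1 association matrix.

    profiles[i] is the association profile of entity i (for example, one
    disease against all circRNAs). gamma_prime is an assumed bandwidth.
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(x * x for x in p) for p in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all association profiles are empty")
    gamma = gamma_prime / mean_sq_norm  # normalise by the average profile norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-gamma * sq_dist)
    return kernel


# Toy matrix: three diseases (rows) against three circRNAs (columns).
toy = [[1, 0, 1],
       [1, 0, 1],
       [0, 1, 0]]
for row in gip_kernel(toy):
    print([round(v, 3) for v in row])
```

Identical profiles receive similarity 1, and the similarity decays as the profiles diverge.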

Journal ArticleDOI
TL;DR: This article presents BERN2 (Advanced Biomedical Entity Recognition and Normalization), a tool that improves the previous neural network-based NER tool by employing a multi-task NER model and neural network-based NEN models to achieve much faster and more accurate inference.
Abstract: In biomedical natural language processing, named entity recognition (NER) and named entity normalization (NEN) are key tasks that enable the automatic extraction of biomedical entities (e.g. diseases and drugs) from the ever-growing biomedical literature. In this article, we present BERN2 (Advanced Biomedical Entity Recognition and Normalization), a tool that improves the previous neural network-based NER tool by employing a multi-task NER model and neural network-based NEN models to achieve much faster and more accurate inference. We hope that our tool can help annotate large-scale biomedical texts for various tasks such as biomedical knowledge graph construction. Availability and implementation Web service of BERN2 is publicly available at http://bern2.korea.ac.kr. We also provide local installation of BERN2 at https://github.com/dmis-lab/BERN2. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: NerLTR-DTA is a powerful tool for predicting drug-target associations that can contribute to new drug discovery and drug repurposing; by setting different learning-to-rank queries, it supports multiple application scenarios, including the discovery of new drugs and new targets.
Abstract: MOTIVATION Drug-target interaction prediction plays an important role in new drug discovery and drug repurposing. Binding affinity indicates the strength of drug-target interactions. Predicting drug-target binding affinity is expected to provide promising candidates for biologists, which can effectively reduce the workload of wet laboratory experiments and speed up the entire process of drug research. Given that numerous new proteins are sequenced and compounds are synthesized, several improved computational methods have been proposed for such predictions, but there are still some challenges: (i) many methods discuss and implement only one application scenario, focusing on drug repurposing and ignoring the discovery of new drugs and targets; (ii) many methods do not consider the priority order of proteins (or drugs) related to each target drug (or protein). Therefore, it is necessary to develop a comprehensive method that can be used in multiple scenarios and focuses on candidate order. RESULTS In this study, we propose a method called NerLTR-DTA that uses the neighbor relationship of similarity and sharing to extract features, and applies a ranking framework with regression attributes to predict affinity values and priority order of query drug (or query target) and its related proteins (or compounds). It is worth noting that using the characteristics of learning to rank to set different queries can smartly realize the multi-scenario application of the method, including the discovery of new drugs and new targets. Experimental results on two commonly used datasets show that NerLTR-DTA outperforms some state-of-the-art competing methods. NerLTR-DTA achieves excellent performance in all application scenarios mentioned in this study, and the r2m(test) values guarantee such excellent performance is not obtained by chance. Moreover, it can be concluded that NerLTR-DTA can provide accurate ranking lists for the relevant results of most queries through the statistics of the association relationship of each query drug (or query protein). In general, NerLTR-DTA is a powerful tool for predicting drug-target associations and can contribute to new drug discovery and drug repurposing. AVAILABILITY The proposed method is implemented in Python and Java. Source codes and datasets are available at https://github.com/RUXIAOQING964914140/NerLTR-DTA. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: Topsy-Turvy as mentioned in this paper synthesizes both views in a sequence-based, multi-scale, deep-learning model for protein-protein interaction prediction and achieves state-of-the-art performance for both well-and sparsely-characterized proteins.
Abstract: Summary Computational methods to predict protein–protein interaction (PPI) typically segregate into sequence-based ‘bottom-up’ methods that infer properties from the characteristics of the individual protein sequences, or global ‘top-down’ methods that infer properties from the pattern of already known PPIs in the species of interest. However, a way to incorporate top-down insights into sequence-based bottom-up PPI prediction methods has been elusive. We thus introduce Topsy-Turvy, a method that newly synthesizes both views in a sequence-based, multi-scale, deep-learning model for PPI prediction. While Topsy-Turvy makes predictions using only sequence data, during the training phase it takes a transfer-learning approach by incorporating patterns from both global and molecular-level views of protein interaction. In a cross-species context, we show it achieves state-of-the-art performance, offering the ability to perform genome-scale, interpretable PPI prediction for non-model organisms with no existing experimental PPI data. In species with available experimental PPI data, we further present a Topsy-Turvy hybrid (TT-Hybrid) model which integrates Topsy-Turvy with a purely network-based model for link prediction that provides information about species-specific network rewiring. TT-Hybrid makes accurate predictions for both well- and sparsely-characterized proteins, outperforming both its constituent components as well as other state-of-the-art PPI prediction methods. Furthermore, running Topsy-Turvy and TT-Hybrid screens is feasible for whole genomes, and thus these methods scale to settings where other methods (e.g. AlphaFold-Multimer) might be infeasible. The generalizability, accuracy and genome-level scalability of Topsy-Turvy and TT-Hybrid unlock a more comprehensive map of protein interaction and organization in both model and non-model organisms. Availability and implementation https://topsyturvy.csail.mit.edu. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: StainedGlass as discussed by the authors is a tool that can generate publication-quality figures and interactive visualizations that depict the identity and orientation of multi-megabase tandem repeat structures at a genome-wide scale.
Abstract: Summary The visualization and analysis of genomic repeats is typically accomplished using dot plots; however, the emergence of telomere-to-telomere assemblies with multi-megabase repeats requires new visualization strategies. Here, we introduce StainedGlass, which can generate publication-quality figures and interactive visualizations that depict the identity and orientation of multi-megabase tandem repeat structures at a genome-wide scale. The tool can rapidly reveal higher-order structures and improve the inference of evolutionary history for some of the most complex regions of genomes. Availability and implementation StainedGlass is implemented using Snakemake and available open source under the MIT license at https://mrvollger.github.io/StainedGlass/. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: Improved search and query facilities for cognate ligands in the UniProt website, REST API and SPARQL endpoint that leverage the chemical structure data, nomenclature, and classification that ChEBI provides are developed.
Abstract: Motivation To provide high quality, computationally tractable annotation of binding sites for biologically relevant (cognate) ligands in UniProtKB using the chemical ontology ChEBI (Chemical Entities of Biological Interest), to better support efforts to study and predict functionally relevant interactions between proteins and small molecule ligands. Results We structured the data model for cognate ligand binding site annotations in UniProtKB and performed a complete reannotation of all cognate ligand binding sites using stable unique identifiers from ChEBI, which we now use as the reference vocabulary for all such annotations. We developed improved search and query facilities for cognate ligands in the UniProt website, REST API and SPARQL endpoint that leverage the chemical structure data, nomenclature, and classification that ChEBI provides. Availability Binding site annotations for cognate ligands described using ChEBI are available for UniProtKB protein sequence records in several formats (text, XML, and RDF), and are freely available to query and download through the UniProt website (www.uniprot.org), REST API (www.uniprot.org/help/api), SPARQL endpoint (sparql.uniprot.org/), and FTP site (https://ftp.uniprot.org/pub/databases/uniprot/). Contact alan.bridge@sib.swiss Supplementary information Supplementary Table 1.

Journal ArticleDOI
TL;DR: The seqwish algorithm is designed, which builds a variation graph from a set of sequences and alignments between them, and it is demonstrated that the method scales to very large graph induction problems by applying it to build pangenome graphs for several species.
Abstract: Motivation Pangenome variation graphs model the mutual alignment of collections of DNA sequences. A set of pairwise alignments implies a variation graph, but there are no scalable methods to generate such a graph from these alignments. Existing related approaches depend on a single reference, a specific ordering of genomes, or a de Bruijn model based on a fixed k-mer length. A scalable, self-contained method to build pangenome graphs without such limitations would be a key step in pangenome construction and manipulation pipelines. Results We design the seqwish algorithm, which builds a variation graph from a set of sequences and alignments between them. We first transform the alignment set into an implicit interval tree. To build up the variation graph, we query this tree-based representation of the alignments to reduce transitive matches into single DNA segments in a sequence graph. By recording the mapping from input sequence to output graph, we can trace the original paths through this graph, yielding a pangenome variation graph. We present an implementation that operates in external memory, using disk-backed data structures and lock-free parallel methods to drive the core graph induction step. We demonstrate that our method scales to very large graph induction problems by applying it to build pangenome graphs for several species. Availability seqwish is published as free software under the MIT open source license. Source code and documentation are available at https://github.com/ekg/seqwish. seqwish can be installed via Bioconda https://bioconda.github.io/recipes/seqwish/README.html or GNU Guix https://github.com/ekg/guix-genomics/blob/master/seqwish.scm. Contact egarris5@uthsc.edu
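
The graph induction step described in this abstract can be illustrated with a small in-memory Python sketch. It deliberately swaps in a different technique: a union-find (disjoint-set) structure over individual bases replaces the implicit interval tree and the disk-backed, lock-free machinery of seqwish. The function name `induce_graph` and the match tuple layout `(seq_a, start_a, seq_b, start_b, length)` are hypothetical; reverse-complement matches and compaction of single-base nodes into longer segments are not handled.

```python
def induce_graph(seqs, matches):
    """Collapse transitively matched bases into shared nodes.

    seqs    : list of DNA strings.
    matches : iterable of (seq_a, start_a, seq_b, start_b, length) exact
              forward-strand matches (a hypothetical input format).
    Returns (labels, paths): labels[n] is the base of node n, and paths[i]
    is the list of node ids that spells out seqs[i].
    """
    # Global offset of each sequence in the concatenation of all inputs.
    offsets, total = [], 0
    for s in seqs:
        offsets.append(total)
        total += len(s)

    parent = list(range(total))  # union-find forest over all bases

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, start_a, b, start_b, length in matches:
        for i in range(length):
            if seqs[a][start_a + i] != seqs[b][start_b + i]:
                raise ValueError("match pairs two different bases")
            ra = find(offsets[a] + start_a + i)
            rb = find(offsets[b] + start_b + i)
            if ra != rb:
                parent[rb] = ra  # merge the two equivalence classes

    # Each equivalence class becomes one node; record every input as a path.
    node_of_root, labels, paths = {}, [], []
    for idx, s in enumerate(seqs):
        path = []
        for i, base in enumerate(s):
            root = find(offsets[idx] + i)
            if root not in node_of_root:
                node_of_root[root] = len(labels)
                labels.append(base)
            path.append(node_of_root[root])
        paths.append(path)
    return labels, paths


labels, paths = induce_graph(["ACGT", "AGGT"], [(0, 0, 1, 0, 1), (0, 2, 1, 2, 2)])
print(labels, paths)
```

Because the mapping from input positions to nodes is recorded, walking each path and concatenating the node labels reproduces the original sequence, which mirrors the path-tracing property the abstract describes.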

Journal ArticleDOI
TL;DR: SimPlot++ as discussed by the authors is an open-source multiplatform application implemented in Python, which can be used to produce publication-quality sequence similarity plots using 63 nucleotide and 20 amino acid distance models, to detect intergenic and intragenic recombination events using Φ, Max-χ2, NSS or proportion tests, and to generate and analyze interactive sequence similarity networks.
Abstract: Motivation Accurate detection of sequence similarity and homologous recombination is an essential part of many evolutionary analyses. Results We have developed SimPlot++, an open-source multiplatform application implemented in Python, which can be used to produce publication-quality sequence similarity plots using 63 nucleotide and 20 amino acid distance models, to detect intergenic and intragenic recombination events using Φ, Max-χ2, NSS or proportion tests, and to generate and analyze interactive sequence similarity networks. SimPlot++ supports multicore data processing and provides useful distance calculability diagnostics. Availability and implementation SimPlot++ is freely available on GitHub at: https://github.com/Stephane-S/Simplot_PlusPlus, as both an executable file (for Windows) and Python scripts (for Windows/Linux/MacOS).
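
A sequence similarity plot of the kind this abstract refers to is usually built by sliding a window along an alignment and computing a similarity value per window. The sketch below is a simplified illustration and not SimPlot++ code: it uses plain percent identity rather than any of the 63 nucleotide distance models, skips alignment columns that contain a gap, and the function name `window_similarity` and its parameters are hypothetical.

```python
def window_similarity(query, reference, window, step):
    """Return (window midpoint, identity) pairs for two aligned sequences.

    Columns where either sequence has a gap ('-') are ignored. A window
    with no comparable columns yields None instead of a number.
    """
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    points = []
    for start in range(0, len(query) - window + 1, step):
        compared = matches = 0
        for q, r in zip(query[start:start + window],
                        reference[start:start + window]):
            if q == "-" or r == "-":
                continue
            compared += 1
            matches += q == r
        similarity = matches / compared if compared else None
        points.append((start + window // 2, similarity))
    return points


print(window_similarity("AAAAAAAA", "AAAATTTT", window=4, step=2))
```

A sharp drop in similarity along the alignment, as in this toy pair, is the kind of pattern that prompts a formal recombination test.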