Showing papers in "Bioinformatics in 2021"


Journal ArticleDOI
TL;DR: clinker is a Python-based tool that, together with the companion JavaScript library clustermap.js, automatically generates accurate, interactive, publication-quality gene cluster comparison figures directly from sequence files.
Abstract: Summary Genes involved in biological pathways are often colocalised in gene clusters, the comparison of which can give valuable insights into their function and evolutionary history. However, comparison and visualisation of gene cluster similarity is a tedious process, particularly when many clusters are being compared. Here, we present clinker, a Python-based tool, and clustermap.js, a companion JavaScript visualisation library, which used together can automatically generate accurate, interactive, publication-quality gene cluster comparison figures directly from sequence files. Availability and implementation Source code and documentation for clinker and clustermap.js are available on GitHub (github.com/gamcil/clinker and github.com/gamcil/clustermap.js, respectively) under the MIT license. clinker can be installed directly from the Python Package Index via pip. Supplementary information Supplementary data are available at Bioinformatics online.

336 citations


Journal ArticleDOI
TL;DR: Liftoff is a genome annotation lift-over tool that maps genes between two assemblies of the same or closely related species, finding the mapping that maximizes sequence identity while preserving the structure of each exon, transcript, and gene.
Abstract: Motivation Improvements in DNA sequencing technology and computational methods have led to a substantial increase in the creation of high-quality genome assemblies of many species. To understand the biology of these genomes, annotation of gene features and other functional elements is essential; however for most species, only the reference genome is well-annotated. Results One strategy to annotate new or improved genome assemblies is to map or 'lift over' the genes from a previously-annotated reference genome. Here we describe Liftoff, a new genome annotation lift-over tool capable of mapping genes between two assemblies of the same or closely-related species. Liftoff aligns genes from a reference genome to a target genome and finds the mapping that maximizes sequence identity while preserving the structure of each exon, transcript, and gene. We show that Liftoff can accurately map 99.9% of genes between two versions of the human reference genome with an average sequence identity >99.9%. We also show that Liftoff can map genes across species by successfully lifting over 98.3% of human protein-coding genes to a chimpanzee genome assembly with 98.2% sequence identity. Availability and implementation Liftoff can be installed via bioconda and PyPI. Additionally, the source code for Liftoff is available at https://github.com/agshumate/Liftoff. Supplementary information Supplementary data are available at Bioinformatics online.

212 citations


Journal ArticleDOI
TL;DR: GraphDTA represents drugs as graphs and uses graph neural networks to predict drug-target affinity; it outperforms both non-deep learning models and competing deep learning methods.
Abstract: The development of new drugs is costly, time consuming, and often accompanied with safety issues. Drug repurposing can avoid the expensive and lengthy process of drug development by finding new uses for already approved drugs. In order to repurpose drugs effectively, it is useful to know which proteins are targeted by which drugs. Computational models that estimate the interaction strength of new drug–target pairs have the potential to expedite drug repurposing. Several models have been proposed for this task. However, these models represent the drugs as strings, which is not a natural way to represent molecules. We propose a new model called GraphDTA that represents drugs as graphs and uses graph neural networks to predict drug–target affinity. We show that graph neural networks not only predict drug–target affinity better than non-deep learning models, but also outperform competing deep learning methods. Our results confirm that deep learning models are appropriate for drug–target binding affinity prediction, and that representing drugs as graphs can lead to further improvements. Availability of data and materials The proposed models are implemented in Python. Related data, pre-trained models, and source code are publicly available at https://github.com/thinng/GraphDTA. All scripts and data needed to reproduce the post-hoc statistical analysis are available from https://doi.org/10.5281/zenodo.3603523.

210 citations


Journal ArticleDOI
TL;DR: CoV-AbDab is the first consolidation of antibodies known to bind SARS-CoV-2 as well as other betacoronaviruses such as SARS-CoV-1 and MERS-CoV, and contains relevant metadata including evidence of cross-neutralisation, antibody/nanobody origin, full variable domain sequence, and germline assignments.
Abstract: Motivation The emergence of a novel strain of betacoronavirus, SARS-CoV-2, has led to a pandemic that has been associated with over 700 000 deaths as of August 5, 2020. Research is ongoing around the world to create vaccines and therapies to minimize rates of disease spread and mortality. Crucial to these efforts are molecular characterizations of neutralizing antibodies to SARS-CoV-2. Such antibodies would be valuable for measuring vaccine efficacy, diagnosing exposure and developing effective biotherapeutics. Here, we describe our new database, CoV-AbDab, which already contains data on over 1400 published/patented antibodies and nanobodies known to bind to at least one betacoronavirus. This database is the first consolidation of antibodies known to bind SARS-CoV-2 as well as other betacoronaviruses such as SARS-CoV-1 and MERS-CoV. It contains relevant metadata including evidence of cross-neutralization, antibody/nanobody origin, full variable domain sequence (where available) and germline assignments, epitope region, links to relevant PDB entries, homology models and source literature. Results On August 5, 2020, CoV-AbDab referenced sequence information on 1402 anti-coronavirus antibodies and nanobodies, spanning 66 papers and 21 patents. Of these, 1131 bind to SARS-CoV-2. Availability and implementation CoV-AbDab is free to access and download without registration at http://opig.stats.ox.ac.uk/webapps/coronavirus. Community submissions are encouraged. Supplementary information Supplementary data are available at Bioinformatics online.

202 citations


Journal ArticleDOI
TL;DR: LDpred2 is a new version of LDpred that addresses limitations that may reduce its predictive performance; it outperforms other recently developed polygenic score methods, with a mean AUC of 65.1% over the 8 real traits analyzed.
Abstract: Motivation Polygenic scores have become a central tool in human genetics research. LDpred is a popular method for deriving polygenic scores based on summary statistics and a matrix of correlation between genetic variants. However, LDpred has limitations that may reduce its predictive performance. Results Here we present LDpred2, a new version of LDpred that addresses these issues. We also provide two new options in LDpred2: a "sparse" option that can learn effects that are exactly 0, and an "auto" option that directly learns the two LDpred parameters from data. We benchmark predictive performance of LDpred2 against the previous version on simulated and real data, demonstrating substantial improvements in robustness and predictive accuracy compared to LDpred1. We then show that LDpred2 also outperforms other polygenic score methods recently developed, with a mean AUC over the 8 real traits analyzed here of 65.1%, compared to 63.8% for lassosum, 62.9% for PRS-CS and 61.5% for SBayesR. Note that LDpred2 provides more accurate polygenic scores when run genome-wide, instead of per chromosome. Availability LDpred2 is implemented in R package bigsnpr. Supplementary information Supplementary data are available at Bioinformatics online.

187 citations


Journal ArticleDOI
TL;DR: DeepPurpose is a comprehensive and easy-to-use DL library for drug-target interaction (DTI) prediction that supports training of customized DTI prediction models by implementing 15 compound and protein encoders and over 50 neural architectures.
Abstract: Summary Accurate prediction of drug-target interactions (DTI) is crucial for drug discovery. Recently, deep learning (DL) models have shown promising performance for DTI prediction. However, these models can be difficult to use for both computer scientists entering the biomedical field and bioinformaticians with limited DL experience. We present DeepPurpose, a comprehensive and easy-to-use DL library for DTI prediction. DeepPurpose supports training of customized DTI prediction models by implementing 15 compound and protein encoders and over 50 neural architectures, along with providing many other useful features. We demonstrate state-of-the-art performance of DeepPurpose on several benchmark datasets. Availability and implementation https://github.com/kexinhuang12345/DeepPurpose. Supplementary information Supplementary data are available at Bioinformatics online.

164 citations


Journal ArticleDOI
TL;DR: STREME advances the state of the art in ab initio motif discovery in terms of both accuracy and versatility, identifying novel sequence patterns that perform biological functions in DNA, RNA and protein sequences, such as the binding site motifs of DNA- and RNA-binding proteins.
Abstract: Motivation Sequence motif discovery algorithms can identify novel sequence patterns that perform biological functions in DNA, RNA and protein sequences, for example, the binding site motifs of DNA- and RNA-binding proteins. Results The STREME algorithm presented here advances the state-of-the-art in ab initio motif discovery in terms of both accuracy and versatility. Using in vivo DNA (ChIP-seq) and RNA (CLIP-seq) data, and validating motifs with reference motifs derived from in vitro data, we show that STREME is more accurate, sensitive and thorough than several widely used algorithms (DREME, HOMER, MEME, Peak-motifs) and two other representative algorithms (ProSampler and Weeder). STREME's capabilities include the ability to find motifs in datasets with hundreds of thousands of sequences, to find both short and long motifs (from 3 to 30 positions), to perform differential motif discovery in pairs of sequence datasets, and to find motifs in sequences over virtually any alphabet (DNA, RNA, protein and user-defined alphabets). Unlike most motif discovery algorithms, STREME reports a useful estimate of the statistical significance of each motif it discovers. STREME is easy to use individually via its web server or via the command line, and is completely integrated with the widely-used MEME Suite of sequence analysis tools. The name STREME stands for "Simple, Thorough, Rapid, Enriched Motif Elicitation". Availability The STREME web server and source code are provided freely for non-commercial use at http://meme-suite.org.

158 citations


Journal ArticleDOI
TL;DR: DNABERT is a pre-trained bidirectional encoder representation that captures a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts.
Abstract: Motivation Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationship, which previous informatics methods often fail to capture especially in data-scarce scenarios. Results To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrate its ease of use, accuracy, and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites, and transcription factor binding sites, after easy fine-tuning using small task-specific labelled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationship within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that pre-trained DNABERT with human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks. Availability The source code, pretrained and finetuned model for DNABERT are available at GitHub https://github.com/jerryji1993/DNABERT. Supplementary information Supplementary data are available at Bioinformatics online.

158 citations


Journal ArticleDOI
TL;DR: This work presents CAFE 5, a completely re-written software package with numerous performance and user-interface enhancements over previous versions, including improved support for multithreading, the explicit modelling of rate variation among families using gamma-distributed rate categories, and command-line arguments that preclude the use of accessory scripts.
Abstract: Motivation Genome sequencing projects have revealed frequent gains and losses of genes between species. Previous versions of our software, CAFE (Computational Analysis of gene Family Evolution), have allowed researchers to estimate parameters of gene gain and loss across a phylogenetic tree. However, the underlying model assumed that all gene families had the same rate of evolution, despite evidence suggesting a large amount of variation in rates among families. Results Here we present CAFE 5, a completely re-written software package with numerous performance and user-interface enhancements over previous versions. These include improved support for multithreading, the explicit modelling of rate variation among families using gamma-distributed rate categories, and command-line arguments that preclude the use of accessory scripts. Availability CAFE 5 source code, documentation, test data, and a detailed manual with examples are freely available at https://github.com/hahnlab/CAFE5/releases.

139 citations


Journal ArticleDOI
Heng Li
TL;DR: minimap2 v2.22 can more accurately map long reads to highly repetitive regions and align through insertions or deletions up to 100kb by default, addressing major weaknesses in minimap2 v2.18 and earlier.
Abstract: SUMMARY We present several recent improvements to minimap2, a versatile pairwise aligner for nucleotide sequences. Now minimap2 v2.22 can more accurately map long reads to highly repetitive regions and align through insertions or deletions up to 100kb by default, addressing major weakness in minimap2 v2.18 or earlier. AVAILABILITY AND IMPLEMENTATION https://github.com/lh3/minimap2.

130 citations


Journal ArticleDOI
TL;DR: dream is a statistical software package that increases power, controls the false positive rate, enables multiple types of hypothesis tests, and integrates with standard workflows; it yields biological insight not found with existing software while addressing the issue of reproducible false-positive findings.
Abstract: Summary Large-scale transcriptome studies with multiple samples per individual are widely used to study disease biology. Yet, current methods for differential expression are inadequate for cross-individual testing for these repeated measures designs. Most problematic, we observe across multiple datasets that current methods can give reproducible false-positive findings that are driven by genetic regulation of gene expression, yet are unrelated to the trait of interest. Here, we introduce a statistical software package, dream, that increases power, controls the false positive rate, enables multiple types of hypothesis tests, and integrates with standard workflows. In 12 analyses in 6 independent datasets, dream yields biological insight not found with existing software while addressing the issue of reproducible false-positive findings. Availability and implementation Dream is available within the variancePartition Bioconductor package at http://bioconductor.org/packages/variancePartition. Contact gabriel.hoffman@mssm.edu. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: The Molecular Interaction Transformer (MolTrans) combines a knowledge-inspired sub-structural pattern mining algorithm and interaction modeling module with an augmented transformer encoder that extracts and captures the semantic relations among sub-structures from massive unlabeled biomedical data.
Abstract: Motivation Drug-target interaction (DTI) prediction is a foundational task for in-silico drug discovery, which is costly and time-consuming due to the need of experimental search over large drug compound space. Recent years have witnessed promising progress for deep learning in DTI predictions. However, the following challenges are still open: (i) existing molecular representation learning approaches ignore the sub-structural nature of DTI, thus produce results that are less accurate and difficult to explain and (ii) existing methods focus on limited labeled data while ignoring the value of massive unlabeled molecular data. Results We propose a Molecular Interaction Transformer (MolTrans) to address these limitations via: (i) knowledge inspired sub-structural pattern mining algorithm and interaction modeling module for more accurate and interpretable DTI prediction and (ii) an augmented transformer encoder to better extract and capture the semantic relations among sub-structures extracted from massive unlabeled biomedical data. We evaluate MolTrans on real-world data and show it improved DTI prediction performance compared to state-of-the-art baselines. Availability and implementation The model scripts are available at https://github.com/kexinhuang12345/moltrans. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: LocusZoom.js is a JavaScript library for creating interactive web-based visualizations of genetic association study results, which can display one or more traits in the context of relevant biological data (such as gene models and other genomic annotation), and allows interactive refinement of analysis models (by selecting linkage disequilibrium reference panels, identifying sets of likely causal variants, or comparisons to the GWAS catalog).
Abstract: LocusZoom.js is a JavaScript library for creating interactive web-based visualizations of genetic association study results. It can display one or more traits in the context of relevant biological data (such as gene models and other genomic annotation), and allows interactive refinement of analysis models (by selecting linkage disequilibrium reference panels, identifying sets of likely causal variants, or comparisons to the GWAS catalog). It can be embedded in web pages to enable data sharing and exploration. Views can be customized and extended to display other data types such as phenome-wide association study (PheWAS) results, chromatin co-accessibility, or eQTL measurements. A new web upload service harmonizes datasets, adds annotations, and makes it easy to explore user-provided result sets. Availability LocusZoom.js is open-source software under a permissive MIT license. Code and documentation are available at: https://github.com/statgen/locuszoom/. Installable packages for all versions are also distributed via NPM. Additional features are provided as standalone libraries to promote reuse. Use with your own GWAS results at https://my.locuszoom.org/. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: Nebulosa is an R package that uses weighted kernel density estimation to recover signals lost through drop-out or low expression in single-cell experiments.
Abstract: SUMMARY Data sparsity in single-cell experiments prevents an accurate assessment of gene expression when visualised in a low-dimensional space. Here, we introduce Nebulosa, an R package that uses weighted kernel density estimation to recover signals lost through drop-out or low expression. AVAILABILITY AND IMPLEMENTATION Nebulosa can be easily installed from www.github.com/powellgenomicslab/Nebulosa. SUPPLEMENTARY INFORMATION Supplementary data are available online.
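
The idea of weighted kernel density estimation named in the abstract above can be illustrated with a minimal sketch. This is not Nebulosa's implementation (Nebulosa is an R package); the function name `weighted_kde`, the Gaussian kernel, and the fixed `bandwidth` are assumptions chosen for illustration. Each cell's position in a 2D embedding contributes to the density in proportion to its expression value, so a cell with a drop-out (zero count) still receives a non-zero estimate when its neighbours express the gene.

```python
import math


def weighted_kde(points, weights, query, bandwidth=1.0):
    """Weighted 2D Gaussian kernel density at `query`.

    points  : list of (x, y) embedding coordinates, one per cell
    weights : expression value of the gene in each cell (hypothetical input)
    query   : (x, y) position at which to evaluate the density
    """
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0  # gene not detected in any cell
    # Normalising constant of an isotropic 2D Gaussian kernel.
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2)
    density = 0.0
    for (x, y), w in zip(points, weights):
        sq_dist = (x - query[0]) ** 2 + (y - query[1]) ** 2
        density += w * norm * math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return density / total_weight


# Three neighbouring cells; the middle one has a drop-out (weight 0).
cells = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
expression = [3.0, 0.0, 2.0]
smoothed_at_dropout = weighted_kde(cells, expression, query=(0.5, 0.0))
```

Here `smoothed_at_dropout` is positive even though the cell at that position has a zero count, which is the sense in which a density estimate can recover a signal lost to drop-out.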

Journal ArticleDOI
TL;DR: fastsimcoal2 extends fastsimcoal, a continuous time coalescent-based genetic simulation program, by enabling the estimation of demographic parameters under very complex scenarios from the site frequency spectrum under a maximum-likelihood framework.
Abstract: Motivation: fastsimcoal2 extends fastsimcoal, a continuous time coalescent-based genetic simulation program, by enabling the estimation of demographic parameters under very complex scenarios from the site frequency spectrum under a maximum-likelihood framework. Results: Other improvements include multi-threading, handling of population inbreeding, extended input file syntax facilitating the description of complex demographic scenarios, and more efficient simulations of sparsely structured populations and of large chromosomes. Availability and implementation: fastsimcoal2 is freely available on http://cmpg.unibe.ch/software/fastsimcoal2/. It includes console versions for Linux, Windows and MacOS, additional scripts for the analysis and visualization of simulated and estimated scenarios, as well as a detailed documentation and ready-to-use examples.

Journal ArticleDOI
TL;DR: Colour deconvolution is presented in two open-source forms, a MATLAB program/function and an ImageJ plugin written in Java, both of which run in Windows, Macintosh, and UNIX-based systems under the respective platforms.
Abstract: Motivation Microscopy images of stained cells and tissues play a central role in most biomedical experiments and routine histopathology. Storing colour histological images digitally opens the possibility to process numerically colour distribution and intensity to extract quantitative data. Among those numerical procedures is colour deconvolution, which enables decomposing an RGB image into channels representing the optical absorbance and transmittance of the dyes when their RGB representation is known. Consequently, a range of new applications become possible for morphological and histochemical segmentation, automated marker localisation and image enhancement. Availability and implementation Colour deconvolution is presented here in two open-source forms: a MATLAB program/function and an ImageJ plugin written in Java. Both versions run in Windows, Macintosh, and UNIX-based systems under the respective platforms. Source code and further documentation are available at: https://blog.bham.ac.uk/intellimic/g-landini-software/colour-deconvolution-2/. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
Taedong Yun, Helen Li, Pi-Chuan Chang, Michael F. Lin, Andrew Carroll, Cory Y. McLean
TL;DR: This work introduces an open-source cohort-calling method that uses the highly accurate caller DeepVariant and the scalable merging tool GLnexus, optimized using callset quality metrics based on variant recall and precision in benchmark samples and Mendelian consistency in father-mother-child trios.
Abstract: Motivation Population-scale sequenced cohorts are foundational resources for genetic analyses, but processing raw reads into analysis-ready cohort-level variants remains challenging. Results We introduce an open-source cohort-calling method that uses the highly-accurate caller DeepVariant and scalable merging tool GLnexus. Using callset quality metrics based on variant recall and precision in benchmark samples and Mendelian consistency in father-mother-child trios, we optimized the method across a range of cohort sizes, sequencing methods, and sequencing depths. The resulting callsets show consistent quality improvements over those generated using existing best practices with reduced cost. We further evaluate our pipeline in the deeply sequenced 1000 Genomes Project (1KGP) samples and show superior callset quality metrics and imputation reference panel performance compared to an independently-generated GATK Best Practices pipeline. Availability and implementation We publicly release the 1KGP individual-level variant calls and cohort callset (https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP) to foster additional development and evaluation of cohort merging methods as well as broad studies of genetic variation. Both DeepVariant (https://github.com/google/deepvariant) and GLnexus (https://github.com/dnanexus-rnd/GLnexus) are open-sourced, and the optimized GLnexus setup discovered in this study is also integrated into GLnexus public releases v1.2.2 and later. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: MMseqs2 taxonomy is a tool to assign taxonomic labels to metagenomic contigs; it extracts all possible protein fragments from each contig, quickly retains those that can contribute to taxonomic annotation, assigns them robust labels and determines the contig's taxonomic identity by weighted voting.
Abstract: Summary MMseqs2 taxonomy is a new tool to assign taxonomic labels to metagenomic contigs. It extracts all possible protein fragments from each contig, quickly retains those that can contribute to taxonomic annotation, assigns them with robust labels and determines the contig's taxonomic identity by weighted voting. Its fragment extraction step is suitable for the analysis of all domains of life. MMseqs2 taxonomy is 2-18x faster than state-of-the-art tools and also contains new modules for creating and manipulating taxonomic reference databases as well as reporting and visualizing taxonomic assignments. Availability MMseqs2 taxonomy is part of the MMseqs2 free open-source software package available for Linux, macOS and Windows at https://mmseqs.com. Supplementary information Supplementary data is available at Bioinformatics online.

Journal ArticleDOI
TL;DR: Sequence to Sequence Learning (Seq2Seq), derived from natural language processing (NLP), is applied to map protein sequences to a 'semantic space' that reflects structure patterns with the help of predicted residue-residue contacts and other sequence-based features; the fused predictor, IDP-Seq2Seq, outperforms other competing methods.
Abstract: Motivation Related to many important biological functions, intrinsically disordered regions (IDRs) are widely distributed in proteins. Accurate prediction of IDRs is critical for the protein structure and function analysis. However, the existing computational methods construct the predictive models solely in the sequence space, failing to convert the sequence space into the 'semantic space' to reflect the structure characteristics of proteins. Furthermore, although the length-dependent predictors showed promising results, new fusion strategies should be explored to improve their predictive performance and generalization. Results In this study, we applied the Sequence to Sequence Learning (Seq2Seq) derived from natural language processing (NLP) to map protein sequences to 'semantic space' to reflect the structure patterns with the help of predicted residue-residue contacts (CCMs) and other sequence-based features. Furthermore, the Attention mechanism was used to capture the global associations between all residue pairs in the proteins. Three length-dependent predictors were constructed: IDP-Seq2Seq-L for long disordered region prediction, IDP-Seq2Seq-S for short disordered region prediction and IDP-Seq2Seq-G for both long and short disordered region predictions. Finally, these three predictors were fused into one predictor called IDP-Seq2Seq to improve the discriminative power and generalization. Experimental results on four independent test datasets and the CASP test dataset showed that IDP-Seq2Seq is insensitive to the ratios of long and short disordered regions and outperforms other competing methods. Availability and implementation For the convenience of most experimental scientists, a user-friendly and publicly accessible web-server for the powerful new predictor has been established at http://bliulab.net/IDP-Seq2Seq/. It is anticipated that IDP-Seq2Seq will become a very useful tool for identification of IDRs. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: BERT4Bitter is a bidirectional encoder representation from transformers (BERT)-based model for predicting bitter peptides directly from their amino acid sequence without using any structural information.
Abstract: Motivation The identification of bitter peptides through experimental approaches is an expensive and time-consuming endeavor. Due to the huge number of newly available peptide sequences in the post-genomic era, the development of automated computational models for the identification of novel bitter peptides is highly desirable. Results In this work, we present BERT4Bitter, a bidirectional encoder representation from transformers (BERT)-based model for predicting bitter peptides directly from their amino acid sequence without using any structural information. To the best of our knowledge, this is the first time a BERT-based model has been employed to identify bitter peptides. Compared to widely used machine learning models, BERT4Bitter achieved the best performance with accuracy of 0.861 and 0.922 for cross-validation and independent tests, respectively. Furthermore, extensive empirical benchmarking experiments on the independent dataset demonstrated that BERT4Bitter clearly outperformed the existing method with improvements of >8% accuracy and >16% Matthews correlation coefficient, highlighting the effectiveness and robustness of BERT4Bitter. We believe that the BERT4Bitter method proposed herein will be a useful tool for rapidly screening and identifying novel bitter peptides for drug development and nutritional research. Availability The user-friendly web server of the proposed BERT4Bitter is freely accessible at: http://pmlab.pythonanywhere.com/BERT4Bitter. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: A new R package for fitting the Gamma-Poisson distribution to data with the characteristics of modern single cell datasets more quickly and more accurately than existing methods is presented.
Abstract: Motivation The Gamma-Poisson distribution is a theoretically and empirically motivated model for the sampling variability of single cell RNA-sequencing counts and an essential building block for analysis approaches including differential expression analysis, principal component analysis and factor analysis. Existing implementations for inferring its parameters from data often struggle with the size of single cell datasets, which can comprise millions of cells; at the same time, they do not take full advantage of the fact that zero and other small numbers are frequent in the data. These limitations have hampered uptake of the model, leaving room for statistically inferior approaches such as logarithm(-like) transformation. Results We present a new R package for fitting the Gamma-Poisson distribution to data with the characteristics of modern single cell datasets more quickly and more accurately than existing methods. The software can work with data on disk without having to load them into RAM simultaneously. Availability and implementation The package glmGamPoi is available from Bioconductor for Windows, macOS and Linux, and source code is available on github.com/const-ae/glmGamPoi under a GPL-3 license. The scripts to reproduce the results of this paper are available on github.com/const-ae/glmGamPoi-Paper. Supplementary information Supplementary data are available at Bioinformatics online.
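
For readers unfamiliar with the model, a Gamma-Poisson (negative binomial) count with mean μ and overdispersion α has variance μ + αμ². The sketch below estimates α from a gene's counts by the method of moments. This is a deliberately simple stand-in and not the fitting procedure used by glmGamPoi, which the abstract does not detail; the function name `moment_overdispersion` and the example counts are hypothetical.

```python
from statistics import mean, variance


def moment_overdispersion(counts):
    """Method-of-moments estimate of the Gamma-Poisson overdispersion alpha.

    Solves  variance = mu + alpha * mu**2  for alpha and clamps the result at
    zero, since counts no more variable than Poisson imply alpha = 0.
    Requires at least two observations (sample variance).
    """
    mu = mean(counts)
    if mu == 0:
        return 0.0  # all-zero gene: overdispersion is not identifiable
    var = variance(counts)
    return max((var - mu) / (mu * mu), 0.0)


# Hypothetical counts of one gene across five cells, with many zeros.
alpha = moment_overdispersion([0, 0, 0, 4, 6])  # mean 2, sample variance 8
```

For these counts the estimate is (8 - 2) / 2² = 1.5, meaning the variance is well above the Poisson value of 2.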

Journal ArticleDOI
TL;DR: It is concluded that omitting the mtDNA% QC filter or adopting a suboptimal mtDNA% threshold may lead to erroneous biological interpretations of scRNA-seq data.
Abstract: MOTIVATION Quality control (QC) is a critical step in single-cell RNA-seq (scRNA-seq) data analysis. Low-quality cells are removed from the analysis during the QC process to avoid misinterpretation of the data. An important QC metric is the mitochondrial proportion (mtDNA%), which is used as a threshold to filter out low-quality cells. Early publications in the field established a threshold of 5% and since then, it has been used as a default in several software packages for scRNA-seq data analysis, and adopted as a standard in many scRNA-seq studies. However, the validity of using a uniform threshold across different species, single-cell technologies, tissues and cell types has not been adequately assessed. RESULTS We systematically analyzed 5 530 106 cells reported in 1349 annotated datasets available in the PanglaoDB database and found that the average mtDNA% in scRNA-seq data across human tissues is significantly higher than in mouse tissues. This difference is not confounded by the platform used to generate the data. Based on this finding, we propose new reference values of the mtDNA% for 121 tissues of mouse and 44 tissues of humans. In general, for mouse tissues, the 5% threshold performs well to distinguish between healthy and low-quality cells. However, for human tissues, the 5% threshold should be reconsidered as it fails to accurately discriminate between healthy and low-quality cells in 29.5% (13 of 44) tissues analyzed. We conclude that omitting the mtDNA% QC filter or adopting a suboptimal mtDNA% threshold may lead to erroneous biological interpretations of scRNA-seq data. AVAILABILITY AND IMPLEMENTATION The code used to download datasets, perform the analyses and produce the figures is available at https://github.com/dosorio/mtProportion. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
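
The mtDNA% filter described in this abstract can be illustrated with a small sketch. This is not the authors' code: the function names, the toy count dictionaries, and the use of an `MT-`/`mt-` gene-symbol prefix to recognize mitochondrial genes are assumptions made for the example. Only the 5% default comes from the abstract.

```python
def mt_percent(counts: dict[str, int], mt_prefix: str = "MT-") -> float:
    """Percentage of a cell's counts assigned to mitochondrial genes.

    Genes are treated as mitochondrial when their symbol starts with
    `mt_prefix`, compared case-insensitively so that both "MT-" and
    "mt-" style symbols match (an assumption of this sketch).
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    mt = sum(c for gene, c in counts.items()
             if gene.upper().startswith(mt_prefix.upper()))
    return 100.0 * mt / total


def filter_cells(cells: dict[str, dict[str, int]],
                 threshold: float = 5.0) -> dict[str, dict[str, int]]:
    """Keep cells whose mtDNA% does not exceed `threshold` (in percent)."""
    return {cell_id: counts for cell_id, counts in cells.items()
            if mt_percent(counts) <= threshold}


# Toy data: per-cell gene counts (hypothetical values).
cells = {
    "cell_a": {"MT-CO1": 3, "ACTB": 97},    # 3% mitochondrial
    "cell_b": {"mt-Nd1": 20, "Gapdh": 80},  # 20% mitochondrial
}
kept = filter_cells(cells, threshold=5.0)
print(sorted(kept))
```

Exposing `threshold` as a parameter reflects the abstract's point that a single fixed cutoff may not suit every species and tissue.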

Journal ArticleDOI
TL;DR: An overview of the latest release of the VEGA suite of programs, which is primarily developed for drug design studies and includes cheminformatics and modeling features that can be fruitfully utilized in various contexts of computational chemistry.
Abstract: The purpose of the article is to offer an overview of the latest release of the VEGA suite of programs. This software has been constantly developed and freely released during the last 20 years and has now reached a significant diffusion and technology level, as confirmed by its roughly 22 500 registered users. While being primarily developed for drug design studies, the VEGA package includes cheminformatics and modeling features, which can be fruitfully utilized in various contexts of computational chemistry. To offer a glimpse of the remarkable potential of the software, some examples of the implemented features in the cheminformatics field and for structure-based studies are discussed. Finally, the flexible architecture of the VEGA program, which can be expanded and customized by plug-in technology or scripting languages, will be described, focusing attention on the HyperDrive library including highly optimized functions. Availability and implementation: The VEGA suite of programs and the source code of the VEGA command-line version are available free of charge for non-profit organizations at http://www.vegazz.net. Contact: alessandro.pedretti@unimi.it.

Journal ArticleDOI
TL;DR: The UCSC Cell Browser as discussed by the authors is a tool that allows scientists to visualize gene expression and metadata annotation distribution throughout a single-cell dataset or multiple datasets, allowing them to explore a growing collection of single-cell datasets and a freely available python package for scientists to create stable, self-contained visualizations for their own single-cell datasets.
Abstract: Summary As the use of single-cell technologies has grown, so has the need for tools to explore these large, complicated datasets. The UCSC Cell Browser is a tool that allows scientists to visualize gene expression and metadata annotation distribution throughout a single-cell dataset or multiple datasets. Availability and implementation We provide the UCSC Cell Browser as a free website where scientists can explore a growing collection of single-cell datasets and a freely available python package for scientists to create stable, self-contained visualizations for their own single-cell datasets. Learn more at https://cells.ucsc.edu. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: A unified Geometric Deep Neural Network, "PInet" (Protein Interface Network), based on joint segmentation of a representation of protein surfaces, which predicts the interface regions on both interacting proteins, achieving performance equivalent to or much better than the state-of-the-art predictor for each dataset.
Abstract: Motivation Protein-protein interactions drive wide-ranging molecular processes, and characterizing at the atomic level how proteins interact (beyond just the fact that they interact) can provide key insights into understanding and controlling this machinery. Unfortunately, experimental determination of three-dimensional protein complex structures remains difficult and does not scale to the increasingly large sets of proteins whose interactions are of interest. Computational methods are thus required to meet the demands of large-scale, high-throughput prediction of how proteins interact, but unfortunately both physical modeling and machine learning methods suffer from poor precision and/or recall. Results In order to improve performance in predicting protein interaction interfaces, we leverage the best properties of both data- and physics-driven methods to develop a unified Geometric Deep Neural Network, "PInet" (Protein Interface Network). PInet consumes pairs of point clouds encoding the structures of two partner proteins, in order to predict their structural regions mediating interaction. To make such predictions, PInet learns and utilizes models capturing both geometrical and physicochemical molecular surface complementarity. In application to a set of benchmarks, PInet simultaneously predicts the interface regions on both interacting proteins, achieving performance equivalent to or even much better than the state-of-the-art predictor for each dataset. Furthermore, since PInet is based on joint segmentation of a representation of protein surfaces, its predictions are meaningful in terms of the underlying physical complementarity driving molecular recognition. Availability PInet scripts and models are available at https://github.com/FTD007/PInet. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: The R package BP4RNAseq is developed, which integrates the state-of-the-art tools from both alignment-based and alignment-free quantification workflows to improve the sensitivity and accuracy of RNA-seq analyses.
Abstract: Summary Processing raw reads of RNA-sequencing (RNA-seq) data, whether public or newly sequenced, involves many specialized tools and technical configurations that are often unfamiliar and time-consuming to learn for non-bioinformatics researchers. Here, we develop the R package BP4RNAseq, which integrates the state-of-the-art tools from both alignment-based and alignment-free quantification workflows. The BP4RNAseq package is a highly automated tool using an optimized pipeline to improve the sensitivity and accuracy of RNA-seq analyses. It can take only two non-technical parameters and output six formatted gene expression quantifications at gene and transcript levels. The package applies to both retrospective and newly generated bulk RNA-seq data analyses and is also applicable for single-cell RNA-seq analyses. It, therefore, greatly facilitates the application of RNA-seq. Availability and implementation The BP4RNAseq package for R and its documentation are freely available at https://github.com/sunshanwen/BP4RNAseq. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: A generative model for quality scores is introduced, in which a hidden Markov model with a recent model selection method, called factorized information criteria, is utilized, and this simulator successfully simulates reads that are consistent with real reads.
Abstract: Motivation Recent advances in high-throughput long-read sequencers, such as PacBio and Oxford Nanopore sequencers, produce longer reads with more errors than short-read sequencers. In addition to the high error rates of reads, non-uniformity of errors leads to difficulties in various downstream analyses using long reads. Many useful simulators, which characterize long-read error patterns and simulate them, have been developed. However, there is still room for improvement in the simulation of the non-uniformity of errors. Results To capture characteristics of errors in reads for long-read sequencers, here, we introduce a generative model for quality scores, in which a hidden Markov model with a recent model selection method, called factorized information criteria, is utilized. We evaluated our developed simulator from various points of view, indicating that our simulator successfully simulates reads that are consistent with real reads. Availability and implementation The source code of PBSIM2 is freely available from https://github.com/yukiteruono/pbsim2. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: An existing deep sequence model that had been pretrained in an unsupervised setting is applied on the supervised task of protein molecular function prediction, and it is found that this complex feature representation is effective for this task, outperforming hand-crafted features.
Abstract: Motivation: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available. Results: We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining.

Journal ArticleDOI
Yujie Chen, Tengfei Ma, Xixi Yang, Jianmin Wang, Bosheng Song, Xiangxiang Zeng
TL;DR: Zeng et al. as discussed by the authors proposed a multi-scale feature fusion deep learning model named MUFFIN, which can jointly learn the drug representation based on both the drug self structure information and the KG with rich bio-medical information.
Abstract: Motivation Adverse drug-drug interactions (DDIs) are crucial for drug research and mainly cause morbidity and mortality. Thus, the identification of potential DDIs is essential for doctors, patients, and society. Existing traditional machine learning models rely heavily on handcrafted features and lack generalization. Recently, deep learning approaches that can automatically learn drug features from the molecular graph or drug-related network have improved the ability of computational models to predict unknown DDIs. However, previous works utilized large labeled data and merely considered the structure or sequence information of drugs without considering the relations or topological information between drugs and other biomedical objects (e.g., gene, disease, and pathway), or considered the knowledge graph (KG) without considering the information from the drug molecular structure. Results Accordingly, to effectively explore the joint effect of drug molecular structure and semantic information of drugs in the knowledge graph for DDI prediction, we propose a multi-scale feature fusion deep learning model named MUFFIN. MUFFIN can jointly learn the drug representation based on both the drug-self structure information and the KG with rich bio-medical information. In MUFFIN, we designed a bi-level cross strategy that includes cross- and scalar-level components to fuse multi-modal features well. MUFFIN can alleviate the restriction of limited labeled data on deep learning models by crossing the features learned from large-scale KG and drug molecular graph. We evaluated our approach on three datasets and three different tasks including binary-class, multi-class, and multi-label DDI prediction tasks. The results showed that MUFFIN outperformed other state-of-the-art baselines. Availability The source code and data are available at https://github.com/xzenglab/MUFFIN. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: A new method SumGNN is proposed: knowledge summarization graph neural network, which is enabled by a subgraph extraction module that can efficiently anchor on relevant subgraphs from a KG, a self-attention based subgraph summarization scheme to generate reasoning path within the subgraph, and a multi-channel knowledge and data integration module that utilizes massive external biomedical knowledge for significantly improved multi-typed DDI predictions.
Abstract: MOTIVATION Thanks to the increasing availability of drug-drug interactions (DDI) datasets and large biomedical knowledge graphs (KGs), accurate detection of adverse DDI using machine learning models becomes possible. However, it remains largely an open problem how to effectively utilize large and noisy biomedical KGs for DDI detection. Due to the sheer size and amount of noise in KGs, it is often less beneficial to directly integrate KGs with other smaller but higher quality data (e.g., experimental data). Most existing approaches ignore KGs altogether. Some try to directly integrate KGs with other data via graph neural networks with limited success. Furthermore, most previous works focus on binary DDI prediction whereas the multi-typed DDI pharmacological effect prediction is a more meaningful but harder task. RESULTS To fill the gaps, we propose a new method SumGNN: knowledge summarization graph neural network, which is enabled by a subgraph extraction module that can efficiently anchor on relevant subgraphs from a KG, a self-attention based subgraph summarization scheme to generate reasoning path within the subgraph, and a multi-channel knowledge and data integration module that utilizes massive external biomedical knowledge for significantly improved multi-typed DDI predictions. SumGNN outperforms the best baseline by up to 5.54%, and performance gain is particularly significant in low data relation types. In addition, SumGNN provides interpretable prediction via the generated reasoning paths for each prediction. AVAILABILITY The code is available in the supplementary. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.