
Showing papers in "Bioinformatics in 2020"


Journal ArticleDOI
TL;DR: ShinyGO is an intuitive, graphical web application that can help researchers gain actionable insights from gene lists, based on a large annotation database derived from Ensembl and STRING-db for 59 plant, 256 animal, 115 archaeal, and 1678 bacterial species.
Abstract: Motivation Gene lists are routinely produced from various omic studies. Enrichment analysis can link these gene lists with underlying molecular pathways and functional categories such as gene ontology (GO) and other databases. Results To complement existing tools, we developed ShinyGO based on a large annotation database derived from Ensembl and STRING-db for 59 plant, 256 animal, 115 archaeal and 1678 bacterial species. ShinyGO's novel features include graphical visualization of enrichment results and gene characteristics, and application program interface access to KEGG and STRING for the retrieval of pathway diagrams and protein–protein interaction networks. ShinyGO is an intuitive, graphical web application that can help researchers gain actionable insights from gene sets. Availability and implementation http://ge-lab.org/go/. Supplementary information Supplementary data are available at Bioinformatics online.
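The enrichment test behind tools like ShinyGO is typically an upper-tail hypergeometric (one-tailed Fisher) test on the overlap between a gene list and an annotation category. A minimal stdlib sketch of that calculation (toy numbers, not ShinyGO's code):

```python
from math import comb

def enrichment_p(N, K, n, k):
    """Upper-tail hypergeometric P(X >= k): probability of seeing at least
    k annotated genes in a list of n, drawn from a universe of N genes of
    which K carry the annotation."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# Toy example: universe of 20 genes, 5 in a pathway, list of 5 genes, 3 hits.
p = enrichment_p(20, 5, 5, 3)
```

In practice the p-values are then corrected for multiple testing across all categories (e.g. Benjamini-Hochberg FDR).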

1,174 citations


Journal ArticleDOI
TL;DR: A novel tool, purge_dups, is presented that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps; it can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly.
Abstract: Motivation Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either focus only on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors. Results Here we present a novel tool, purge_dups, that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can easily be integrated into assembly pipelines. Availability and implementation The source code is written in C and is available at https://github.com/dfguan/purge_dups. Supplementary information Supplementary data are available at Bioinformatics online.
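Read depth is the key signal here: a contig assembled from only one haplotype attracts reads from only that haplotype and so sits near half the expected diploid coverage. A toy illustration of that cutoff logic (the 0.75 cutoff is an assumption for illustration; purge_dups derives its cutoffs from the depth histogram and also uses sequence similarity):

```python
def flag_haplotigs(contig_depths, diploid_depth):
    """Classify contigs by average read depth relative to expected diploid
    coverage: contigs near half depth are candidate haplotigs (one haplotype
    only); contigs near full depth are kept as primary."""
    flags = {}
    for name, depth in contig_depths.items():
        if depth < 0.75 * diploid_depth:
            flags[name] = "haplotig?"
        else:
            flags[name] = "primary"
    return flags

# ctg2 sits near half coverage and is flagged as a candidate haplotig.
flags = flag_haplotigs({"ctg1": 60, "ctg2": 31, "ctg3": 58}, diploid_depth=60)
```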

728 citations


Journal ArticleDOI
TL;DR: KofamKOALA is a web server to assign KEGG Orthologs (KOs) to protein sequences by homology search against a database of profile hidden Markov models (KOfam) with pre-computed adaptive score thresholds.
Abstract: SUMMARY KofamKOALA is a web server to assign KEGG Orthologs (KOs) to protein sequences by homology search against a database of profile hidden Markov models (KOfam) with pre-computed adaptive score thresholds. KofamKOALA is faster than existing KO assignment tools, with accuracy comparable to the best-performing tools. Function annotation by KofamKOALA helps link genes to KEGG resources such as the KEGG pathway maps and facilitates molecular network reconstruction. AVAILABILITY AND IMPLEMENTATION KofamKOALA, KofamScan and KOfam are freely available from GenomeNet (https://www.genome.jp/tools/kofamkoala/). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
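The adaptive-threshold idea fits in a few lines: an HMM hit counts only if its bit score clears the per-family cutoff distributed with KOfam, and the best surviving hit wins. A schematic stdlib sketch (data and names are illustrative, not KofamKOALA's code):

```python
def assign_ko(hits, thresholds):
    """Keep only hits whose bit score meets the family-specific (adaptive)
    threshold, then report the best surviving KO per query.
    `hits` maps query -> list of (KO, score); `thresholds` maps KO -> cutoff."""
    out = {}
    for query, scored in hits.items():
        passing = [(ko, s) for ko, s in scored
                   if s >= thresholds.get(ko, float("inf"))]
        if passing:
            out[query] = max(passing, key=lambda x: x[1])[0]
    return out

# K00002 scores 90.0 but its cutoff is 95.0, so only K00001 survives.
ko = assign_ko({"geneA": [("K00001", 120.0), ("K00002", 90.0)]},
               {"K00001": 100.0, "K00002": 95.0})
```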

607 citations


Journal ArticleDOI
TL;DR: The single model composite score QMEAN is extended by introducing a consensus-based distance constraint (DisCo) score, which combines the accuracy of consensus methods with the broad applicability of single model approaches and demonstrates that CASP models are not the ideal data source to train predictive methods for model quality estimation.
Abstract: Motivation Methods that estimate the quality of a 3D protein structure model in the absence of an experimental reference structure are crucial to determine a model's utility and potential applications. Single model methods assess individual models whereas consensus methods require an ensemble of models as input. In this work, we extend the single model composite score QMEAN that employs statistical potentials of mean force and agreement terms by introducing a consensus-based distance constraint (DisCo) score. Results DisCo exploits distance distributions from experimentally determined protein structures that are homologous to the model being assessed. Feed-forward neural networks are trained to adaptively weigh contributions by the multi-template DisCo score and classical single model QMEAN parameters. The result is the composite score QMEANDisCo, which combines the accuracy of consensus methods with the broad applicability of single model approaches. We also demonstrate that, despite being the de facto standard for structure prediction benchmarking, CASP models are not the ideal data source to train predictive methods for model quality estimation. For performance assessment, QMEANDisCo is continuously benchmarked within the CAMEO project and participated in CASP13. For both, it ranks among the top performers and excels with low response times. Availability and implementation QMEANDisCo is available as a web server at https://swissmodel.expasy.org/qmean. The source code can be downloaded from https://git.scicore.unibas.ch/schwede/QMEAN. Supplementary information Supplementary data are available at Bioinformatics online.

433 citations


Journal ArticleDOI
TL;DR: ipyrad is a free and open-source tool for assembling and analyzing restriction site-associated DNA sequence (RADseq) datasets using de novo and/or reference-based approaches. It is designed to be massively scalable to hundreds of taxa and thousands of samples and can be efficiently parallelized on high-performance computing clusters.
Abstract: SUMMARY ipyrad is a free and open source tool for assembling and analyzing restriction site-associated DNA sequence datasets using de novo and/or reference-based approaches. It is designed to be massively scalable to hundreds of taxa and thousands of samples, and can be efficiently parallelized on high performance computing clusters. It is available both as a command line interface and as a Python package with an application programming interface, the latter of which can be used interactively to write complex, reproducible scripts and implement a suite of downstream analysis tools. AVAILABILITY AND IMPLEMENTATION ipyrad is a free and open source program written in Python. Source code is available from the GitHub repository (https://github.com/dereneaton/ipyrad/), and Linux and MacOS installs are distributed through the conda package manager. Complete documentation, including numerous tutorials, and Jupyter notebooks demonstrating example assemblies and applications of downstream analysis tools are available online: https://ipyrad.readthedocs.io/.

424 citations


Journal ArticleDOI
TL;DR: NextPolish is a tool that efficiently corrects sequence errors in genomes assembled with long reads. It consists of two interlinked modules that score and count k-mers from high-quality short reads and polish genome assemblies containing large numbers of base errors.
Abstract: MOTIVATION Although long-read sequencing technologies can produce genomes with long contiguity, they suffer from high error rates. Thus, we developed NextPolish, a tool that efficiently corrects sequence errors in genomes assembled with long reads. This new tool consists of two interlinked modules that are designed to score and count k-mers from high-quality short reads, and to polish genome assemblies containing large numbers of base errors. RESULTS When evaluated for speed and efficiency using human and Arabidopsis thaliana genomes, NextPolish outperformed Pilon by correcting sequence errors faster and with higher correction accuracy. AVAILABILITY AND IMPLEMENTATION NextPolish is implemented in C and Python. The source code is available from https://github.com/Nextomics/NextPolish. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
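The core intuition of k-mer based polishing can be sketched in miniature: count k-mers in the accurate short reads, then accept a base substitution in the draft only when it strictly increases the number of read-supported k-mers. This is an illustration of the idea only; NextPolish's actual scoring model is more sophisticated:

```python
from collections import Counter

def polish(draft, reads, k=3):
    """Toy k-mer polishing: substitutions are accepted greedily when they
    strictly increase support from the short-read k-mer spectrum."""
    kmers = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            kmers[r[i:i + k]] += 1

    def support(s):
        # Total read support of all k-mers covering the sequence.
        return sum(kmers[s[i:i + k]] for i in range(len(s) - k + 1))

    seq = list(draft)
    cur = support("".join(seq))
    for i in range(len(seq)):
        for b in "ACGT":
            cand = seq[:i] + [b] + seq[i + 1:]
            s = support("".join(cand))
            if s > cur:
                seq, cur = cand, s
    return "".join(seq)

# A single base error (ACGTA with G->T) is repaired from the read spectrum.
fixed = polish("ACTTA", ["ACGTA", "ACGTA", "ACGTA"])
```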

383 citations


Journal ArticleDOI
TL;DR: A file format called cooler, based on a sparse data model, that can support genomically-labeled matrices at any resolution and has the flexibility to accommodate various descriptions of the data axes, resolutions, data density patterns, and metadata is developed.
Abstract: Motivation Most existing coverage-based (epi)genomic datasets are one-dimensional, but newer technologies probing interactions (physical, genetic, etc.) produce quantitative maps with two-dimensional genomic coordinate systems. Storage and computational costs mount sharply with data resolution when such maps are stored in dense form. Hence, there is a pressing need to develop data storage strategies that handle the full range of useful resolutions in multidimensional genomic datasets by taking advantage of their sparse nature, while supporting efficient compression and providing fast random access to facilitate development of scalable algorithms for data analysis. Results We developed a file format called cooler, based on a sparse data model, that can support genomically labeled matrices at any resolution. It has the flexibility to accommodate various descriptions of the data axes (genomic coordinates, tracks and bin annotations), resolutions, data density patterns and metadata. Cooler is based on HDF5 and is supported by a Python library and command line suite to create, read, inspect and manipulate cooler data collections. The format has been adopted as a standard by the NIH 4D Nucleome Consortium. Availability and implementation Cooler is cross-platform, BSD-licensed and can be installed from the Python package index or the bioconda repository. The source code is maintained on Github at https://github.com/mirnylab/cooler. Supplementary information Supplementary data are available at Bioinformatics online.
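The sparse data model can be illustrated with the two core tables of the cooler schema: a `bins` table (chrom, start, end) labeling the genomic axes, and a `pixels` table (bin1_id, bin2_id, count) holding only the nonzero upper-triangle entries. Plain dicts stand in for HDF5 groups in this sketch:

```python
# Genomically labeled matrix in sparse (COO) form, mirroring cooler's
# bins/pixels tables; values are toy numbers.
bins = {
    "chrom": ["chr1", "chr1", "chr1"],
    "start": [0, 1000, 2000],
    "end":   [1000, 2000, 3000],
}
pixels = {  # upper-triangle nonzero entries only
    "bin1_id": [0, 0, 1],
    "bin2_id": [0, 2, 1],
    "count":   [10, 3, 7],
}

def query(i, j):
    """Fetch one element of the symmetric matrix from the sparse table."""
    i, j = min(i, j), max(i, j)
    for b1, b2, c in zip(pixels["bin1_id"], pixels["bin2_id"], pixels["count"]):
        if (b1, b2) == (i, j):
            return c
    return 0  # absent pixels are implicit zeros
```

Storing only nonzero pixels is what keeps storage costs flat as resolution increases, since contact maps become ever sparser at finer bin sizes.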

340 citations


Journal ArticleDOI
TL;DR: Two new features implemented in TPOT, Feature Set Selector (FSS) and Template, help increase the system's scalability, reduce TPOT computation time and may yield more interpretable results.
Abstract: Motivation Automated machine learning (AutoML) systems are helpful data science assistants designed to scan data for novel features, select appropriate supervised learning models and optimize their parameters. For this purpose, Tree-based Pipeline Optimization Tool (TPOT) was developed using strongly typed genetic programming (GP) to recommend an optimized analysis pipeline for the data scientist's prediction problem. However, like other AutoML systems, TPOT may reach computational resource limits when working on big data such as whole-genome expression data. Results We introduce two new features implemented in TPOT that help increase the system's scalability: Feature Set Selector (FSS) and Template. FSS provides the option to specify subsets of the features as separate datasets, assuming the signals come from one or more of these specific data subsets. FSS increases TPOT's efficiency in application on big data by slicing the entire dataset into smaller sets of features and allowing GP to select the best subset in the final pipeline. Template enforces type constraints with strongly typed GP and enables the incorporation of FSS at the beginning of each pipeline. Consequently, FSS and Template help reduce TPOT computation time and may provide more interpretable results. Our simulations show TPOT-FSS significantly outperforms a tuned XGBoost model and standard TPOT implementation. We apply TPOT-FSS to real RNA-Seq data from a study of major depressive disorder. Independent of the previous study that identified significant association with depression severity of two modules, TPOT-FSS corroborates that one of the modules is largely predictive of the clinical diagnosis of each individual. Availability and implementation Detailed simulation and analysis code needed to reproduce the results in this study is available at https://github.com/lelaboratoire/tpot-fss. Implementation of the new TPOT operators is available at https://github.com/EpistasisLab/tpot.
Supplementary information Supplementary data are available at Bioinformatics online.
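The Feature Set Selector can be sketched as picking, among predefined feature subsets, the one a scoring function likes best, so downstream pipeline search sees only that slice of the data. Everything here (names, the toy score) is illustrative, not TPOT's implementation:

```python
def feature_set_selector(dataset, feature_sets, score_fn):
    """Evaluate a scoring function on each predefined feature subset and
    return the best subset's name and score. `score_fn` is a stand-in for
    fitting and cross-validating an actual pipeline on that slice."""
    best_name, best_score = None, float("-inf")
    for name, cols in feature_sets.items():
        subset = {c: dataset[c] for c in cols}
        s = score_fn(subset)
        if s > best_score:
            best_name, best_score = name, s
    return best_name, best_score

data = {"g1": [1, 0], "g2": [0, 1], "g3": [1, 1]}
sets = {"module_A": ["g1", "g2"], "module_B": ["g3"]}
# Toy score: prefer the subset with more features.
best, score = feature_set_selector(data, sets, lambda d: len(d))
```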

229 citations


Journal ArticleDOI
TL;DR: Logomaker is a Python API for creating publication-quality sequence logos that can produce both standard and highly customized logos from either a matrix-like array of numbers or a multiple-sequence alignment.
Abstract: Summary Sequence logos are visually compelling ways of illustrating the biological properties of DNA, RNA and protein sequences, yet it is currently difficult to generate and customize such logos within the Python programming environment. Here we introduce Logomaker, a Python API for creating publication-quality sequence logos. Logomaker can produce both standard and highly customized logos from either a matrix-like array of numbers or a multiple-sequence alignment. Logos are rendered as native matplotlib objects that are easy to stylize and incorporate into multi-panel figures. Availability and implementation Logomaker can be installed using the pip package manager and is compatible with both Python 2.7 and Python 3.6. Documentation is provided at http://logomaker.readthedocs.io; source code is available at http://github.com/jbkinney/logomaker.
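The matrix a logo tool consumes is just a per-position letter-height table; Logomaker accepts such a matrix as a pandas DataFrame via `logomaker.Logo`. A stdlib sketch of computing information-content heights from a DNA alignment (no small-sample correction; the helper name is hypothetical):

```python
from math import log2
from collections import Counter

def logo_matrix(alignment, alphabet="ACGT"):
    """Per-position information-content heights for a DNA logo: column
    height = 2 - entropy (bits); each letter's height is its frequency
    times the column information."""
    cols = []
    for pos in zip(*alignment):
        counts = Counter(pos)
        n = len(pos)
        freqs = {a: counts.get(a, 0) / n for a in alphabet}
        entropy = -sum(f * log2(f) for f in freqs.values() if f > 0)
        info = 2 - entropy
        cols.append({a: f * info for a, f in freqs.items()})
    return cols

# Fully conserved columns reach 2 bits; the mixed last column is shorter.
m = logo_matrix(["ACGT", "ACGA", "ACGT"])
```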

206 citations


Journal ArticleDOI
TL;DR: RICOPILI, an open-source Perl-based pipeline developed to address the challenges of rapidly processing large-scale multi-cohort GWAS studies, including quality control, imputation and downstream analyses, is computationally efficient and portable to a wide range of high-performance computing environments.
Abstract: SUMMARY: Genome-wide association study (GWAS) analyses, at sufficient sample sizes and power, have successfully revealed biological insights for several complex traits. RICOPILI, an open-source Perl-based pipeline, was developed to address the challenges of rapidly processing large-scale multi-cohort GWAS studies including quality control (QC), imputation and downstream analyses. The pipeline is computationally efficient with portability to a wide range of high-performance computing environments. RICOPILI was created as the Psychiatric Genomics Consortium pipeline for GWAS and adopted by other users. The pipeline features (i) technical and genomic QC in case-control and trio cohorts, (ii) genome-wide phasing and imputation, (iii) association analysis, (iv) meta-analysis, (v) polygenic risk scoring and (vi) replication analysis. Notably, as a major differentiator from other GWAS pipelines, RICOPILI leverages automated parallelization and cluster job management approaches for rapid production of imputed genome-wide data. A comprehensive meta-analysis of simulated GWAS data has been incorporated demonstrating each step of the pipeline. This includes all the associated visualization plots, to allow ease of data interpretation and manuscript preparation. Simulated GWAS datasets are also packaged with the pipeline for user training tutorials and developer work. AVAILABILITY AND IMPLEMENTATION: RICOPILI has a flexible architecture to allow for ongoing development and incorporation of newer available algorithms and is adaptable to various HPC environments (QSUB, BSUB, SLURM and others). Specific links for genomic resources are either directly provided in this paper or via tutorials and external links. The central location hosting scripts and tutorials is found at this URL: https://sites.google.com/a/broadinstitute.org/RICOPILI/home. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

182 citations


Journal ArticleDOI
TL;DR: Pavian is a web application for exploring classification results from metagenomics experiments using interactive data tables, heatmaps and Sankey flow diagrams, which can help in the validation of matches to a particular genome.
Abstract: Summary Pavian is a web application for exploring classification results from metagenomics experiments. With Pavian, researchers can analyze, visualize and transform results from various classifiers, such as Kraken, Centrifuge and MetaPhlAn, using interactive data tables, heatmaps and Sankey flow diagrams. An interactive alignment coverage viewer can help in the validation of matches to a particular genome, which can be crucial when using metagenomics experiments for pathogen detection. Availability and implementation Pavian is implemented in the R language as a modular Shiny web app and is freely available under GPL-3 from http://github.com/fbreitwieser/pavian.
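A Sankey view of classification results is driven by parent-to-child flow counts aggregated over per-read taxonomic lineages. A minimal sketch of that aggregation (lineages are illustrative):

```python
from collections import Counter

def sankey_flows(classifications):
    """Aggregate per-read lineage tuples (as a classifier like Kraken might
    report them) into (parent, child) -> read-count flows, the data behind
    a Sankey diagram."""
    flows = Counter()
    for lineage in classifications:
        for parent, child in zip(lineage, lineage[1:]):
            flows[(parent, child)] += 1
    return flows

reads = [
    ("Bacteria", "Proteobacteria", "Escherichia"),
    ("Bacteria", "Proteobacteria", "Salmonella"),
    ("Bacteria", "Firmicutes"),
]
flows = sankey_flows(reads)
```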

Journal ArticleDOI
TL;DR: pyGenomeTracks (PGT) is a modular plotting tool that easily combines multiple genomic tracks and enables a reproducible and standardized generation of highly customizable and publication-ready images; it is available through a graphical interface on usegalaxy.eu and through the command line.
Abstract: Motivation Generating publication-ready plots to display multiple genomic tracks can pose a serious challenge. Making desirable and accurate figures requires considerable effort. This is usually done by hand or using vector graphics software. Results pyGenomeTracks (PGT) is a modular plotting tool that easily combines multiple tracks. It enables a reproducible and standardized generation of highly customizable and publication-ready images. Availability and implementation PGT is available through a graphical interface on https://usegalaxy.eu and through the command line. It is provided on conda via the bioconda channel and on pip, and it is openly developed on GitHub: https://github.com/deeptools/pyGenomeTracks. Supplementary information Supplementary data are available at Bioinformatics online.
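A PGT figure is driven by an ini file in which each section describes one track. A sketch that writes a minimal configuration programmatically; file names are placeholders, and the exact track options should be checked against the PGT documentation:

```python
from configparser import ConfigParser

# Each ini section is one track; `file`, `height` and `title` are among
# the documented track options.
cfg = ConfigParser()
cfg["x-axis"] = {"where": "top"}
cfg["coverage"] = {"file": "coverage.bw", "height": "4", "title": "coverage"}
cfg["genes"] = {"file": "genes.bed", "height": "3", "title": "genes"}

with open("tracks.ini", "w") as fh:
    cfg.write(fh)

# The figure would then be rendered on the command line with something like:
#   pyGenomeTracks --tracks tracks.ini --region chr1:1000000-2000000 \
#                  --outFileName plot.png
```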

Journal ArticleDOI
Jin Li, Sai Zhang, Tao Liu, Chenxi Ning, Zhuoxuan Zhang, Zhou Wei
TL;DR: A novel method of neural inductive matrix completion with a graph convolutional network (NIMCGCN) is presented for predicting miRNA-disease associations; comparison with other state-of-the-art methods showed that it is significantly superior to existing methods.
Abstract: Motivation Predicting the association between microRNAs (miRNAs) and diseases plays an important role in identifying human disease-related miRNAs. As identification of miRNA-disease associations via biological experiments is time-consuming and expensive, computational methods are currently used as effective complements to determine the potential associations between disease and miRNA. Results We present a novel method of neural inductive matrix completion with graph convolutional network (NIMCGCN) for predicting miRNA-disease association. NIMCGCN first uses graph convolutional networks to learn miRNA and disease latent feature representations from the miRNA and disease similarity networks. Then, the learned features are input into a novel neural inductive matrix completion (NIMC) model to generate an association matrix completion. The parameters of NIMCGCN were learned based on the known miRNA-disease association data in a supervised end-to-end way. We compared the proposed method with other state-of-the-art methods. The area under the receiver operating characteristic curve results showed that our method is significantly superior to existing methods. Furthermore, 50, 47 and 48 of the top 50 predicted miRNAs for three high-risk human diseases, namely, colon cancer, lymphoma and kidney cancer, were verified using experimental literature. Finally, 100% prediction accuracy was achieved when breast cancer was used as a case study to evaluate the ability of NIMCGCN for predicting a new disease without any known related miRNAs. Availability and implementation https://github.com/ljatynu/NIMCGCN/. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: A multimodal deep learning framework named DDIMDL is proposed that combines diverse drug features with deep learning to build a model for predicting drug-drug interaction-associated events and outperforms state-of-the-art DDI event prediction methods and baseline methods.
Abstract: Motivation Drug-drug interactions (DDIs) are one of the major concerns in pharmaceutical research. Many machine learning based methods have been proposed for the DDI prediction, but most of them predict whether two drugs interact or not. The studies revealed that DDIs could cause different subsequent events, and predicting DDI-associated events is more useful for investigating the mechanism hidden behind the combined drug usage or adverse reactions. Results In this article, we collect DDIs from DrugBank database, and extract 65 categories of DDI events by dependency analysis and events trimming. We propose a multimodal deep learning framework named DDIMDL that combines diverse drug features with deep learning to build a model for predicting DDI-associated events. DDIMDL first constructs deep neural network (DNN)-based sub-models, respectively, using four types of drug features: chemical substructures, targets, enzymes and pathways, and then adopts a joint DNN framework to combine the sub-models to learn cross-modality representations of drug-drug pairs and predict DDI events. In computational experiments, DDIMDL produces high-accuracy performances and has high efficiency. Moreover, DDIMDL outperforms state-of-the-art DDI event prediction methods and baseline methods. Among all the features of drugs, the chemical substructures seem to be the most informative. With the combination of substructures, targets and enzymes, DDIMDL achieves an accuracy of 0.8852 and an area under the precision-recall curve of 0.9208. Availability and implementation The source code and data are available at https://github.com/YifanDengWHU/DDIMDL. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: New datasets specific for CPI prediction are constructed, a novel transformer neural network named TransformerCPI is proposed, and a more rigorous label reversal experiment is introduced to test whether a model learns true interaction features.
Abstract: Motivation Identifying compound-protein interaction (CPI) is a crucial task in drug discovery and chemogenomics studies, and proteins without three-dimensional structure account for a large part of potential biological targets, which requires developing methods using only protein sequence information to predict CPI. However, sequence-based CPI models may face some specific pitfalls, including using inappropriate datasets, hidden ligand bias and splitting datasets inappropriately, resulting in overestimation of their prediction performance. Results To address these issues, we here constructed new datasets specific for CPI prediction, proposed a novel transformer neural network named TransformerCPI, and introduced a more rigorous label reversal experiment to test whether a model learns true interaction features. TransformerCPI achieved much improved performance on the new experiments, and it can be deconvolved to highlight important interacting regions of protein sequences and compound atoms, which may contribute chemical biology studies with useful guidance for further ligand structural optimization. Availability and implementation https://github.com/lifanchen-simm/transformerCPI.
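The label reversal experiment can be sketched as a data split in which each ligand enters the training set with only one label, and its opposite-label pairs go to the test set; a model that merely memorizes per-ligand label bias then cannot score well. A schematic sketch, not the authors' exact protocol:

```python
def label_reversal_split(pairs):
    """Split (ligand, protein, label) triples so that each ligand appears
    in training with a single label; pairs carrying the opposite label for
    that ligand are routed to the test set."""
    train_label = {}
    train, test = [], []
    for ligand, protein, label in pairs:
        if ligand not in train_label:
            train_label[ligand] = label
        if label == train_label[ligand]:
            train.append((ligand, protein, label))
        else:
            test.append((ligand, protein, label))
    return train, test

pairs = [("lig1", "p1", 1), ("lig1", "p2", 0),
         ("lig2", "p1", 0), ("lig2", "p3", 1)]
train, test = label_reversal_split(pairs)
```

On such a test set, a model predicting each ligand's majority training label scores 0% by construction, so any performance above chance must come from genuine interaction features.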

Journal ArticleDOI
TL;DR: This work proposes a novel method based on variational auto-encoders (VAEs) for analysis of single-cell RNA sequencing (scRNA-seq) data that avoids data preprocessing by using raw count data as input and can robustly estimate the expected gene expression levels and a latent representation for each cell.
Abstract: Motivation Models for analysing and making relevant biological inferences from massive amounts of complex single-cell transcriptomic data typically require several individual data-processing steps, each with their own set of hyperparameter choices. With deep generative models one can work directly with count data, make likelihood-based model comparison, learn a latent representation of the cells and capture more of the variability in different cell populations. Results We propose a novel method based on variational auto-encoders (VAEs) for analysis of single-cell RNA sequencing (scRNA-seq) data. It avoids data preprocessing by using raw count data as input and can robustly estimate the expected gene expression levels and a latent representation for each cell. We tested several count likelihood functions and a variant of the VAE that has a priori clustering in the latent space. We show for several scRNA-seq datasets that our method outperforms recently proposed scRNA-seq methods in clustering cells and that the resulting clusters reflect cell types. Availability and implementation Our method, called scVAE, is implemented in Python using the TensorFlow machine-learning library, and it is freely available at https://github.com/scvae/scvae. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: The Genome Detective Coronavirus Typing Tool that can accurately identify the novel severe acute respiratory syndrome-related coronavirus sequences isolated in China and around the world may help to accelerate the development of novel diagnostics, drugs and vaccines to stop the COVID-19 disease.
Abstract: Summary Genome detective is a web-based, user-friendly software application to quickly and accurately assemble all known virus genomes from next-generation sequencing datasets. This application allows the identification of phylogenetic clusters and genotypes from assembled genomes in FASTA format. Since its release in 2019, we have produced a number of typing tools for emergent viruses that have caused large outbreaks, such as Zika and Yellow Fever Virus in Brazil. Here, we present the Genome Detective Coronavirus Typing Tool that can accurately identify the novel severe acute respiratory syndrome (SARS)-related coronavirus (SARS-CoV-2) sequences isolated in China and around the world. The tool can accept up to 2000 sequences per submission and the analysis of a new whole-genome sequence will take approximately 1 min. The tool has been tested and validated with hundreds of whole genomes from 10 coronavirus species, and correctly classified all of the SARS-related coronavirus (SARSr-CoV) and all of the available public data for SARS-CoV-2. The tool also allows tracking of new viral mutations as the outbreak expands globally, which may help to accelerate the development of novel diagnostics, drugs and vaccines to stop the COVID-19 disease. Availability and implementation https://www.genomedetective.com/app/typingtool/cov. Contact koen@emweb.be or deoliveira@ukzn.ac.za. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: DeepMSA, a new open-source method for sensitive MSA construction, is developed; it creates homologous sequences and alignments from multiple sources of whole-genome and metagenome databases through complementary hidden Markov model (HMM) algorithms.
Abstract: Motivation The success of genome sequencing techniques has resulted in a rapid explosion of protein sequences. Collections of multiple homologous sequences can provide critical information for the modeling of structure and function of unknown proteins. There is, however, no standard, efficient pipeline available for sensitive multiple sequence alignment (MSA) collection. This is particularly challenging when large whole-genome and metagenome databases are involved. Results We developed DeepMSA, a new open-source method for sensitive MSA construction, which creates homologous sequences and alignments from multiple sources of whole-genome and metagenome databases through complementary hidden Markov model algorithms. The practical usefulness of the pipeline was examined in three large-scale benchmark experiments based on 614 non-redundant proteins. First, DeepMSA was utilized to generate MSAs for residue-level contact prediction by six coevolution and deep learning-based programs, which resulted in an accuracy increase in long-range contacts by up to 24.4% compared to the default programs. Next, multiple threading programs were run for homologous structure identification, where the average TM-score of the template alignments increases by over 7.5% with the use of the new DeepMSA profiles. Finally, DeepMSA was used for secondary structure prediction and resulted in statistically significant improvements in the Q3 accuracy. It is noted that all these improvements were achieved without re-training the parameters and neural-network models, demonstrating the robustness and general usefulness of DeepMSA in protein structural bioinformatics applications, especially for targets without homologous templates in the PDB library. Availability and implementation https://zhanglab.ccmb.med.umich.edu/DeepMSA/. Supplementary information Supplementary data are available at Bioinformatics online.
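The staged-search idea behind such pipelines can be sketched as trying progressively larger databases and stopping once enough homologs have accumulated, so cheap searches run before expensive metagenome-scale ones. Database names, the search callables and the target count below are illustrative:

```python
def staged_homolog_search(query, databases, target=128):
    """Search databases in order (smallest/cheapest first), accumulating
    deduplicated hits, and stop as soon as `target` homologs are found.
    `databases` is an ordered list of (name, search_fn) pairs where each
    search_fn returns hit identifiers for the query."""
    hits, seen = [], set()
    for name, search in databases:
        for h in search(query):
            if h not in seen:
                seen.add(h)
                hits.append(h)
        if len(hits) >= target:
            break  # enough homologs; skip the remaining (larger) databases
    return hits

dbs = [
    ("small_db", lambda q: ["h1", "h2"]),
    ("metagenome_db", lambda q: ["h2", "h3", "h4"]),
]
hits = staged_homolog_search("QUERY", dbs, target=3)
```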

Journal ArticleDOI
TL;DR: COVID-19 Docking Server is introduced, a web server that predicts the binding modes between COVID-19 targets and ligands including small molecules, peptides and antibodies; the meta-platform provides a free and interactive tool for the prediction of COVID-19 target-ligand interactions and subsequent drug discovery for COVID-19.
Abstract: MOTIVATION The coronavirus disease 2019 (COVID-19), caused by a new type of coronavirus, emerged from China and has led to thousands of deaths globally since December 2019. Although many groups have engaged in studying the newly emerged virus and searching for treatments for COVID-19, the understanding of COVID-19 target-ligand interactions represents a key challenge. Herein, we introduce COVID-19 Docking Server, a web server that predicts the binding modes between COVID-19 targets and ligands including small molecules, peptides and antibodies. RESULTS Structures of proteins involved in the virus life cycle were collected or constructed based on the homologs of coronavirus, and prepared ready for docking. The meta-platform provides a free and interactive tool for the prediction of COVID-19 target-ligand interactions and subsequent drug discovery for COVID-19. AVAILABILITY AND IMPLEMENTATION http://ncov.schanglab.org.cn. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: The tools target phylogenetic trees and phylogenetic placements, sequences, taxonomies and other relevant data types, offer high-level simplicity as well as low-level customizability, and are computationally efficient, well-tested and field-proven.
Abstract: Summary We present genesis, a library for working with phylogenetic data, and gappa, an accompanying command-line tool for conducting typical analyses on such data. The tools target phylogenetic trees and phylogenetic placements, sequences, taxonomies and other relevant data types, offer high-level simplicity as well as low-level customizability, and are computationally efficient, well-tested and field-proven. Availability and implementation Both genesis and gappa are written in modern C++11, and are freely available under GPLv3 at http://github.com/lczech/genesis and http://github.com/lczech/gappa. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: SubMito-XGBoost obtained satisfactory prediction results under leave-one-out cross-validation (LOOCV) compared with existing methods and achieves satisfactory predictive performance on plant and non-plant protein submitochondrial datasets.
Abstract: Motivation Mitochondria are an essential organelle in most eukaryotes. They not only play an important role in energy metabolism but also take part in many critical cytopathological processes. Abnormal mitochondria can trigger a series of human diseases, such as Parkinson's disease, multifactorial disorders and Type-II diabetes. Protein submitochondrial localization enables the understanding of protein function in studying disease pathogenesis and drug design. Results We propose a new method, SubMito-XGBoost, for protein submitochondrial localization prediction. It comprises three steps: (i) the g-gap dipeptide composition (g-gap DC), pseudo-amino acid composition (PseAAC), auto-correlation function (ACF) and bi-gram position-specific scoring matrix (Bi-gram PSSM) are employed to extract protein sequence features; (ii) the Synthetic Minority Oversampling Technique (SMOTE) is used to balance samples, and the ReliefF algorithm is applied for feature selection; and (iii) the obtained feature vectors are fed into XGBoost to predict protein submitochondrial locations. SubMito-XGBoost achieved satisfactory prediction results under leave-one-out cross-validation (LOOCV) compared with existing methods. The prediction accuracies of SubMito-XGBoost on the two training datasets M317 and M983 were 97.7% and 98.9%, which are 2.8-12.5% and 3.8-9.9% higher than other methods, respectively. The prediction accuracy on the independent test set M495 was 94.8%, which is significantly better than in existing studies. The proposed method also achieves satisfactory predictive performance on plant and non-plant protein submitochondrial datasets, and may aid new drug design for the treatment of related diseases. Availability and implementation The source codes and data are publicly available at https://github.com/QUST-AIBBDRC/SubMito-XGBoost/. Supplementary information Supplementary data are available at Bioinformatics online.
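Of the encodings listed in step (i), the g-gap dipeptide composition is the simplest to make concrete: it counts ordered residue pairs separated by g intervening positions and normalizes them into a 400-dimensional frequency vector. A minimal, dependency-free sketch (the function name and normalization choice are illustrative, not taken from the SubMito-XGBoost code):

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def g_gap_dipeptide_composition(seq, g=1):
    """Frequency of each ordered residue pair (a, b) with g residues between them."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    n_pairs = max(len(seq) - g - 1, 0)
    for i in range(n_pairs):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:  # skip non-standard residues
            counts[pair] += 1
    # normalize counts to frequencies; yields a 400-dimensional feature vector
    return [counts[p] / n_pairs if n_pairs else 0.0 for p in pairs]
```

Feature vectors like this one would then be balanced with SMOTE, filtered with ReliefF and passed to the XGBoost classifier.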

Journal ArticleDOI
TL;DR: Proline is presented, a robust software suite for analysis of MS-based proteomics data, which collects, processes and allows visualization and publication of proteomics datasets, and its ease of use for various steps in the validation and quantification workflow, its data curation capabilities and its computational efficiency are illustrated.
Abstract: Motivation The proteomics field requires the production and publication of reliable mass spectrometry-based identification and quantification results. Although many tools and algorithms exist, very few consider the importance of combining, in a single software environment, efficient processing algorithms and a data management system to process and curate the hundreds of datasets associated with a single proteomics study. Results Here, we present Proline, a robust software suite for the analysis of MS-based proteomics data, which collects, processes and allows visualization and publication of proteomics datasets. We illustrate its ease of use for various steps in the validation and quantification workflow, its data curation capabilities and its computational efficiency. The efficiency of the DDA label-free quantification workflow was assessed by comparing results obtained with Proline to those obtained with a widely used software package using a spiked-in sample. This assessment demonstrated Proline's ability to provide high quantification accuracy in a user-friendly interface for datasets of any size. Availability and implementation Proline is available for Windows and Linux under the CeCILL open-source license. It can be deployed in client-server mode or in standalone mode at http://proline.profiproteomics.fr/#downloads. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: Performance comparisons over empirical cross-validation analysis, independent test, and case study against state-of-the-art methods demonstrate that HLPpred-Fuse consistently outperformed these methods in the identification of hemolytic activity.
Abstract: Motivation Therapeutic peptides failing at clinical trials can often be attributed to toxicity profiles such as hemolytic activity, which hamper the further progress of peptides as drug candidates. Accurately predicting hemolytic peptides (HLPs) and their activity from given peptide sequences is one of the challenging tasks in immunoinformatics and is essential for drug development and basic research. Although a few computational methods have been proposed for this task, none of them can identify HLPs and their activities simultaneously. Results In this study, we propose a two-layer prediction framework, called HLPpred-Fuse, that can accurately and automatically predict both hemolytic peptides (HLPs or non-HLPs) and HLP activity (high or low). More specifically, a feature representation learning scheme was used to generate 54 probabilistic features by integrating six different machine learning classifiers and nine different sequence-based encodings. The 54 probabilistic features were then fused to provide sufficiently converged sequence information, which was used as input to an extremely randomized trees classifier for the development of two final prediction models that independently identify HLPs and their activity. Performance comparisons based on empirical cross-validation analysis, an independent test and a case study against state-of-the-art methods demonstrate that HLPpred-Fuse consistently outperforms these methods in the identification of hemolytic activity. Availability and implementation For the convenience of experimental scientists, a web-based tool has been established at http://thegleelab.org/HLPpred-Fuse. Contact glee@ajou.ac.kr or watshara.sho@mahidol.ac.th or bala@ajou.ac.kr. Supplementary information Supplementary data are available at Bioinformatics online.
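The two ideas in this abstract, the 6 classifiers x 9 encodings = 54 probabilistic-feature fusion and the two-layer cascade (screen for HLPs first, grade activity only for predicted HLPs), can be sketched with stand-in callables; none of the names below come from the HLPpred-Fuse implementation:

```python
def fuse_probabilistic_features(peptide, classifiers, encodings):
    """One probability per (classifier, encoding) pair; 6 x 9 gives 54 features."""
    return [clf(enc(peptide)) for clf in classifiers for enc in encodings]

def two_layer_predict(peptide, hlp_model, activity_model, threshold=0.5):
    """Cascade: layer 1 screens HLP vs non-HLP; layer 2 grades activity.

    hlp_model / activity_model are any callables mapping a sequence to a
    probability of the positive class (placeholders for the fused models).
    """
    if hlp_model(peptide) < threshold:
        return "non-HLP"
    if activity_model(peptide) >= threshold:
        return "HLP, high activity"
    return "HLP, low activity"
```

With dummy scoring functions plugged in, the cascade returns one of three labels, mirroring the framework's two independent final models.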

Journal ArticleDOI
TL;DR: A new predictor called iMRM is developed, which is able to simultaneously identify m6A, m5C, m1A, ψ and A-to-I modifications in Homo sapiens, Mus musculus and Saccharomyces cerevisiae.
Abstract: Motivation RNA modifications play critical roles in a series of cellular and developmental processes. Knowledge about the distribution of RNA modifications across transcriptomes will provide clues to revealing their functions. Since experimental methods for detecting RNA modifications are time-consuming and laborious, computational methods have been proposed for this purpose over the past five years. However, both experimental and computational methods have drawbacks when it comes to simultaneously identifying modifications occurring on different nucleotides. Results To address this challenge, in this article we developed a new predictor called iMRM, which is able to simultaneously identify m6A, m5C, m1A, ψ and A-to-I modifications in Homo sapiens, Mus musculus and Saccharomyces cerevisiae. In iMRM, a feature selection technique was used to pick out the optimal features. The results from both 10-fold cross-validation and the jackknife test demonstrate that the performance of iMRM is superior to that of existing methods for identifying RNA modifications. Availability and implementation A user-friendly web server for iMRM was established at http://www.bioml.cn/XG_iRNA/home. The off-line command-line version is available at https://github.com/liukeweiaway/iMRM. Contact greatchen@ncst.edu.cn. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: metaviralSPAdes, a tool for identifying viral genomes in metagenomic assembly graphs based on analyzing variations in coverage depth between viruses and bacterial chromosomes, is described.
Abstract: Motivation Although the set of currently known viruses has been steadily expanding, only a tiny fraction of the Earth's virome has been sequenced so far. Shotgun metagenomic sequencing provides an opportunity to reveal novel viruses but faces the computational challenge of identifying viral genomes, which are often difficult to detect in metagenomic assemblies. Results We describe metaviralSPAdes, a tool for identifying viral genomes in metagenomic assembly graphs that is based on analyzing variations in coverage depth between viruses and bacterial chromosomes. We benchmarked metaviralSPAdes on diverse metagenomic datasets, verified our predictions using a set of virus-specific Hidden Markov Models and demonstrated that it improves on state-of-the-art viral identification pipelines. Availability and implementation metaviralSPAdes includes the ViralAssembly, ViralVerify and ViralComplete modules, which are available as standalone packages: https://github.com/ablab/spades/tree/metaviral_publication, https://github.com/ablab/viralVerify/ and https://github.com/ablab/viralComplete/. Contact d.antipov@spbu.ru. Supplementary information Supplementary data are available at Bioinformatics online.
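As a caricature of the coverage-depth idea, one can flag assembly components whose read depth deviates strongly from the assembly-wide median; the threshold, dictionary field names and circularity test below are invented for illustration and do not reproduce metaviralSPAdes' actual assembly-graph analysis:

```python
from statistics import median

def flag_viral_candidates(contigs, min_ratio=2.0):
    """Flag circular contigs whose depth differs markedly from the median.

    contigs: list of dicts with 'name', 'coverage' and 'circular' keys
    (a crude stand-in for components of a metagenomic assembly graph).
    """
    med = median(c["coverage"] for c in contigs)
    flagged = []
    for c in contigs:
        ratio = c["coverage"] / med if med else 0.0
        # viral elements often sit at depths well above or below chromosomes
        if c["circular"] and (ratio >= min_ratio or ratio <= 1.0 / min_ratio):
            flagged.append(c["name"])
    return flagged
```

The real pipeline additionally verifies candidates with virus-specific HMMs (the ViralVerify module) rather than relying on depth alone.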

Journal ArticleDOI
TL;DR: The Protein Imager is a lightweight, powerful and easy-to-use next-gen online molecular viewer, linked to an automated server-side rendering system able to generate publication-quality molecular illustrations.
Abstract: Summary The long learning curve of molecular viewers hinders researchers approaching the field of structural biology for the first time. Herein, we present 'The Protein Imager', a lightweight, powerful and easy-to-use next-gen online molecular viewer. The interface is linked to an automated server-side rendering system able to generate publication-quality molecular illustrations. The Protein Imager interface has been designed to be easy to use for beginners and experts in the field alike. It allows the preparation of very complex molecular views while maintaining a high level of responsiveness, even on mobile devices. Availability and implementation The Protein Imager interface is freely available online at https://3dproteinimaging.com/protein-imager. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: In this article, the authors proposed two new approaches for in silico doublet identification: co-expression based doublet scoring (cxds) and binary-classification based doublet scoring (bcds).
Abstract: MOTIVATION Single-cell RNA sequencing (scRNA-seq) technologies enable the study of transcriptional heterogeneity at the resolution of individual cells and have an increasing impact on biomedical research. However, it is known that these methods sometimes wrongly consider two or more cells as single cells, and that a number of so-called doublets are present in the output of such experiments. Treating doublets as single cells in downstream analyses can severely bias a study's conclusions, and therefore computational strategies for the identification of doublets are needed. RESULTS With scds, we propose two new approaches for in silico doublet identification: co-expression based doublet scoring (cxds) and binary-classification based doublet scoring (bcds). The co-expression based approach, cxds, uses binarized (absence/presence) gene expression data and, employing a binomial model for the co-expression of pairs of genes, yields interpretable doublet annotations. bcds, on the other hand, uses a binary classification approach to discriminate artificial doublets from the original data. We apply our methods and existing computational doublet identification approaches to four datasets with experimental doublet annotations and find that our methods perform at least as well as the state of the art, at comparatively little computational cost. We observe appreciable differences between methods and across datasets, and that no approach dominates all others. In summary, scds presents a scalable, competitive approach that allows for doublet annotation of datasets with thousands of cells in a matter of seconds. AVAILABILITY AND IMPLEMENTATION scds is implemented as a Bioconductor R package (doi: 10.18129/B9.bioc.scds). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
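The intuition behind cxds (gene pairs that are normally mutually exclusive but co-occur in a cell hint at a doublet) can be caricatured in a few lines. This sketch replaces the paper's binomial model with a simple observed-versus-expected deficit weight and is not the scds implementation:

```python
def coexpression_doublet_scores(B):
    """Score cells for doublet-like co-expression.

    B: cells x genes 0/1 matrix (binarized expression). Gene pairs that are
    co-expressed less often than expected under independence get a positive
    weight; a cell's score sums the weights of the pairs it co-expresses.
    """
    n_cells, n_genes = len(B), len(B[0])
    prevalence = [sum(row[j] for row in B) / n_cells for j in range(n_genes)]
    pair_weight = {}
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            observed = sum(1 for row in B if row[i] and row[j])
            expected = n_cells * prevalence[i] * prevalence[j]
            pair_weight[(i, j)] = max(expected - observed, 0.0)
    return [sum(w for (i, j), w in pair_weight.items() if row[i] and row[j])
            for row in B]
```

On a toy matrix where two genes are expressed in disjoint sets of cells except one, that one cell receives the highest score, which is the signal cxds formalizes with a binomial test.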

Journal ArticleDOI
TL;DR: The proposed algorithm, Winnowmap, improves mapping accuracy within repeats and achieves these results with sparser sampling, leading to better index compression and competitive runtimes.
Abstract: MOTIVATION In this era of exponential data growth, minimizer sampling has become a standard algorithmic technique for rapid genome sequence comparison. This technique yields a sub-linear representation of sequences, enabling their comparison in reduced space and time. A key property of the minimizer technique is that if two sequences share a substring of a specified length, then they are guaranteed to have a matching minimizer. However, because the k-mer distribution in eukaryotic genomes is highly uneven, minimizer-based tools (e.g. Minimap2, Mashmap) opt to discard the most frequently occurring minimizers from the genome to avoid excessive false positives. By doing so, the underlying guarantee is lost and accuracy is reduced in repetitive genomic regions. RESULTS We introduce a novel weighted-minimizer sampling algorithm. A unique feature of the proposed algorithm is that it performs minimizer sampling while considering a weight for each k-mer; i.e. the higher the weight of a k-mer, the more likely it is to be selected. By down-weighting frequently occurring k-mers, we are able to meet both objectives: (i) avoid excessive false-positive matches and (ii) maintain the minimizer match guarantee. We tested our algorithm, Winnowmap, using both simulated and real long-read data and compared it to a state-of-the-art long-read mapper, Minimap2. Our results demonstrate a reduction in the mapping error rate from 0.14% to 0.06% in the recently finished human X chromosome (154.3 Mbp), and from 3.6% to 0% within the highly repetitive X centromere (3.1 Mbp). Winnowmap improves mapping accuracy within repeats and achieves these results with sparser sampling, leading to better index compression and competitive runtimes. AVAILABILITY AND IMPLEMENTATION Winnowmap is built on top of the Minimap2 codebase and is available at https://github.com/marbl/winnowmap.
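The weighted-minimizer idea can be sketched in pure Python: each window of w consecutive k-mers contributes its minimum-order k-mer, and k-mers flagged as frequent have their order inflated so they lose to rarer k-mers whenever a rarer alternative exists in the window, while the one-minimizer-per-window guarantee is preserved. The hashing and down-weighting scheme here is illustrative, not Winnowmap's actual implementation:

```python
import hashlib

def kmer_order(kmer, frequent):
    """Stable order value; frequent k-mers are pushed above all others."""
    h = int(hashlib.md5(kmer.encode()).hexdigest(), 16)  # < 2**128
    return h + (1 << 128) if kmer in frequent else h

def weighted_minimizers(seq, k, w, frequent=frozenset()):
    """(position, k-mer) pairs selected over all windows of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: kmer_order(kmers[i], frequent))
        selected.add((best, kmers[best]))
    return sorted(selected)
```

Because every window still yields a minimizer, two sequences sharing a long enough substring are still guaranteed a matching minimizer, which is the property lost when frequent minimizers are simply discarded.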

Journal ArticleDOI
TL;DR: This work presents an approach based on a modification of a recently published highly scalable variational autoencoder framework that provides interpretability without sacrificing much accuracy, and demonstrates that this approach enables identification of gene programs in massive datasets.
Abstract: Motivation Single-cell RNA-seq makes possible the investigation of variability in gene expression among cells, and dependence of variation on cell type. Statistical inference methods for such analyses must be scalable, and ideally interpretable. Results We present an approach based on a modification of a recently published highly scalable variational autoencoder framework that provides interpretability without sacrificing much accuracy. We demonstrate that our approach enables identification of gene programs in massive datasets. Our strategy, namely the learning of factor models with the auto-encoding variational Bayes framework, is not domain specific and may be useful for other applications. Availability and implementation The factor model is available in the scVI package hosted at https://github.com/YosefLab/scVI/. Contact v@nxn.se. Supplementary information Supplementary data are available at Bioinformatics online.
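The interpretability argument above (replacing the neural decoder of the variational autoencoder with a linear map, so that each latent factor's gene loadings can be read directly off a weight matrix) is easy to state in code. A conceptual, dependency-free sketch, not the scVI implementation:

```python
def linear_decode(Z, W):
    """Reconstruct expression means as Z @ W.

    Z: cells x factors latent matrix; W: factors x genes loading matrix.
    Because the decoder is linear, W[f][g] directly scores how strongly
    factor f (a candidate gene program) drives gene g -- unlike a deep decoder.
    """
    n_genes = len(W[0])
    return [[sum(z_f * W[f][g] for f, z_f in enumerate(cell))
             for g in range(n_genes)] for cell in Z]

def top_genes_for_factor(W, factor, n=3):
    """Gene indices with the largest absolute loading on a given factor."""
    loadings = W[factor]
    return sorted(range(len(loadings)), key=lambda g: -abs(loadings[g]))[:n]
```

Inspecting the loading matrix in this way is what makes the factor model's gene programs identifiable, whereas a nonlinear decoder offers no such direct readout.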

Journal ArticleDOI
TL;DR: A graph convolutional network (GCN) based method, named DeepLGP, is presented for prioritizing target PCGs of lncRNAs and found that lncRNA pairs with high similarity had more overlapped target genes.
Abstract: Motivation Although long non-coding RNAs (lncRNAs) have limited capacity for encoding proteins, they have been verified as biomarkers in the occurrence and development of complex diseases. Recent wet-lab experiments have shown that lncRNAs function by regulating the expression of protein-coding genes (PCGs), which could also be the mechanism responsible for causing diseases. lncRNA-related biological data are currently increasing rapidly, yet no computational methods have been designed for predicting novel target genes of lncRNAs. Results In this study, we present a graph convolutional network (GCN) based method, named DeepLGP, for prioritizing target PCGs of lncRNAs. First, gene and lncRNA features were selected; these included their location in the genome, expression in 13 tissues and miRNA-mediated lncRNA-gene pairs. Next, a GCN was applied to convolve a gene interaction network to encode the features of genes and lncRNAs. These features were then used by a convolutional neural network to prioritize the target genes of lncRNAs. In 10-fold cross-validations on two independent datasets, DeepLGP obtained high areas under the ROC curve (0.90-0.98) and areas under the precision-recall curve (0.91-0.98). We found that lncRNA pairs with high similarity had more overlapping target genes. Further experiments showed that genes targeted by the same lncRNA sets had a strong likelihood of causing the same diseases, which could help in identifying disease-causing PCGs. Availability and implementation https://github.com/zty2009/LncRNA-target-gene. Supplementary information Supplementary data are available at Bioinformatics online.
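The graph-convolution step at the heart of such a model propagates node features over the gene interaction network. Below is a dependency-free sketch of one layer using the standard Kipf-Welling propagation rule, relu(D^-1/2 (A + I) D^-1/2 H W); DeepLGP's exact architecture and hyperparameters are not reproduced here:

```python
import math

def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def gcn_layer(adj, H, W):
    """One graph-convolution layer over an n-node gene network.

    adj: n x n 0/1 adjacency; H: n x f node features; W: f x f' weights.
    Self-loops are added and the adjacency is symmetrically normalized, so
    each node's new features mix its own features with its neighbors'.
    """
    n = len(adj)
    A = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A]  # >= 1 thanks to self-loops
    A_hat = [[A[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
             for i in range(n)]
    Z = matmul(matmul(A_hat, H), W)
    return [[max(0.0, v) for v in row] for row in Z]  # ReLU
```

Stacking a few such layers yields node embeddings that a downstream classifier (a CNN in DeepLGP's case) can score for lncRNA-target prioritization.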