Showing papers in "Bioinformatics in 2018"


Journal ArticleDOI
TL;DR: Fastp is developed as an ultra‐fast FASTQ preprocessor with useful quality control and data‐filtering features that can perform quality control, adapter trimming, quality filtering, per‐read quality pruning and many other operations with a single scan of the FASTQ data.
Abstract: Motivation Quality control and preprocessing of FASTQ files are essential to providing clean data for downstream analysis. Traditionally, a different tool is used for each operation, such as quality control, adapter trimming and quality filtering. These tools are often insufficiently fast as most are developed using high-level programming languages (e.g. Python and Java) and provide limited multi-threading support. Reading and loading data multiple times also renders preprocessing slow and I/O inefficient. Results We developed fastp as an ultra-fast FASTQ preprocessor with useful quality control and data-filtering features. It can perform quality control, adapter trimming, quality filtering, per-read quality pruning and many other operations with a single scan of the FASTQ data. This tool is developed in C++ and has multi-threading support. Based on our evaluation, fastp is 2-5 times faster than other FASTQ preprocessing tools such as Trimmomatic or Cutadapt despite performing far more operations than similar tools. Availability and implementation The open-source code and corresponding instructions are available at https://github.com/OpenGene/fastp.

7,461 citations
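
The "single scan" idea in the fastp abstract above can be illustrated with a minimal sketch. This is not fastp's implementation (fastp is written in C++); the function names, the Phred+33 encoding, and the thresholds `min_q` and `min_len` are assumptions chosen for illustration. The point is only that trimming, filtering and QC counters can all be updated while each record is read once.

```python
from io import StringIO


def phred(qual):
    """Decode a Phred+33 quality string into integer scores (encoding assumed)."""
    return [ord(c) - 33 for c in qual]


def single_pass(handle, min_q=20, min_len=5):
    """Trim, filter and collect QC statistics in one scan of a FASTQ stream.

    min_q and min_len are illustrative thresholds, not fastp defaults.
    """
    stats = {"reads_in": 0, "reads_out": 0, "bases_in": 0, "bases_out": 0}
    kept = []
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of stream
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")

        # QC counters are updated in the same pass as trimming and filtering.
        stats["reads_in"] += 1
        stats["bases_in"] += len(seq)

        # Per-read pruning: drop low-quality bases from the 3' end.
        scores = phred(qual)
        end = len(scores)
        while end > 0 and scores[end - 1] < min_q:
            end -= 1
        seq, qual = seq[:end], qual[:end]

        # Length filter applied after trimming.
        if len(seq) < min_len:
            continue
        stats["reads_out"] += 1
        stats["bases_out"] += len(seq)
        kept.append((header, seq, qual))
    return kept, stats


example = StringIO("@r1\nACGTACGT\n+\nIIIIII##\n@r2\nACGT\n+\n####\n")
reads, summary = single_pass(example)
```

Because every record is read exactly once, the cost of adding another statistic or filter is a few extra operations per read rather than another pass over the file.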


Journal ArticleDOI
Heng Li
TL;DR: Minimap2 is a general-purpose alignment program to map DNA or long mRNA sequences against a large reference database and is 3-4 times as fast as mainstream short-read mappers at comparable accuracy, and is ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment.
Abstract: Motivation Recent advances in sequencing technologies promise ultra-long reads of ∼100 kb in average, full-length mRNA or cDNA reads in high throughput and genomic contigs over 100 Mb in length. Existing alignment programs are unable or inefficient to process such data at scale, which presses for the development of new alignment algorithms. Results Minimap2 is a general-purpose alignment program to map DNA or long mRNA sequences against a large reference database. It works with accurate short reads of ≥100 bp in length, ≥1 kb genomic reads at error rate ∼15%, full-length noisy Direct RNA or cDNA reads and assembly contigs or closely related full chromosomes of hundreds of megabases in length. Minimap2 does split-read alignment, employs concave gap cost for long insertions and deletions and introduces new heuristics to reduce spurious alignments. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy, and is ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment. Availability and implementation https://github.com/lh3/minimap2. Supplementary information Supplementary data are available at Bioinformatics online.

6,264 citations
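
The minimap2 abstract above mentions a concave gap cost for long insertions and deletions. One common way to obtain such a cost is a two-piece affine function, sketched below. The abstract gives no formula, so both the two-piece form and the parameter values (`q`, `e`, `q2`, `e2`) are assumptions for illustration only.

```python
def two_piece_gap_cost(length, q=4, e=2, q2=24, e2=1):
    """Cost of a gap of `length` bases under a two-piece affine model.

    The cheaper of two affine costs is charged: (q, e) has a low opening
    penalty and is cheaper for short gaps, (q2, e2) has a low extension
    penalty and is cheaper for long gaps. All values are illustrative.
    """
    if length <= 0:
        return 0
    return min(q + length * e, q2 + length * e2)


# Marginal cost of extending a gap by one base, for gaps of 1..40 bases.
marginal = [two_piece_gap_cost(k + 1) - two_piece_gap_cost(k) for k in range(1, 40)]
```

With these values the two pieces cross at a gap length of 20: below that each extra base costs 2, above it each extra base costs 1, so long indels are penalized less steeply than a single affine cost would allow.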


Journal ArticleDOI
TL;DR: Nextstrain consists of a database of viral genomes, a bioinformatics pipeline for phylodynamics analysis, and an interactive visualization platform that presents a real-time view into the evolution and spread of a range of viral pathogens of high public health importance.
Abstract: Summary Understanding the spread and evolution of pathogens is important for effective public health measures and surveillance. Nextstrain consists of a database of viral genomes, a bioinformatics pipeline for phylodynamics analysis, and an interactive visualization platform. Together these present a real-time view into the evolution and spread of a range of viral pathogens of high public health importance. The visualization integrates sequence data with other data types such as geographic information, serology, or host species. Nextstrain compiles our current understanding into a single accessible location, open to health professionals, epidemiologists, virologists and the public alike. Availability and implementation All code (predominantly JavaScript and Python) is freely available from github.com/nextstrain and the web-application is available at nextstrain.org.

2,305 citations


Journal ArticleDOI
TL;DR: NanoPack, a set of tools developed for visualization and processing of long‐read sequencing data from Oxford Nanopore Technologies and Pacific Biosciences, is described.
Abstract: Summary Here we describe NanoPack, a set of tools developed for visualization and processing of long-read sequencing data from Oxford Nanopore Technologies and Pacific Biosciences. Availability and implementation The NanoPack tools are written in Python3 and released under the GNU GPL3.0 License. The source code can be found at https://github.com/wdecoster/nanopack, together with links to separate scripts and their documentation. The scripts are compatible with Linux, Mac OS and the MS Windows 10 subsystem for Linux and are available as a graphical user interface, a web service at http://nanoplot.bioinf.be and command line tools. Supplementary information Supplementary data are available at Bioinformatics online.

1,296 citations


Journal ArticleDOI
TL;DR: Decagon is presented, an approach for modeling polypharmacy side effects that develops a new graph convolutional neural network for multirelational link prediction in multimodal networks and can predict the exact side effect, if any, through which a given drug combination manifests clinically.
Abstract: Motivation The use of drug combinations, termed polypharmacy, is common to treat patients with complex diseases or co-existing conditions. However, a major consequence of polypharmacy is a much higher risk of adverse side effects for the patient. Polypharmacy side effects emerge because of drug-drug interactions, in which activity of one drug may change, favorably or unfavorably, if taken with another drug. The knowledge of drug interactions is often limited because these complex relationships are rare, and are usually not observed in relatively small clinical testing. Discovering polypharmacy side effects thus remains an important challenge with significant implications for patient mortality and morbidity. Results Here, we present Decagon, an approach for modeling polypharmacy side effects. The approach constructs a multimodal graph of protein-protein interactions, drug-protein target interactions and the polypharmacy side effects, which are represented as drug-drug interactions, where each side effect is an edge of a different type. Decagon is developed specifically to handle such multimodal graphs with a large number of edge types. Our approach develops a new graph convolutional neural network for multirelational link prediction in multimodal networks. Unlike approaches limited to predicting simple drug-drug interaction values, Decagon can predict the exact side effect, if any, through which a given drug combination manifests clinically. Decagon accurately predicts polypharmacy side effects, outperforming baselines by up to 69%. We find that it automatically learns representations of side effects indicative of co-occurrence of polypharmacy in patients. Furthermore, Decagon models particularly well polypharmacy side effects that have a strong molecular basis, while on predominantly non-molecular side effects, it achieves good performance because of effective sharing of model parameters across edge types. Decagon opens up opportunities to use large pharmacogenomic and patient population data to flag and prioritize polypharmacy side effects for follow-up analysis via formal pharmacological studies. Availability and implementation Source code and preprocessed datasets are at: http://snap.stanford.edu/decagon.

850 citations


Journal ArticleDOI
TL;DR: A deep learning based model that uses only sequence information of both targets and drugs to predict DT interaction binding affinities is proposed, outperforming the KronRLS algorithm and SimBoost, a state‐of‐the‐art method for DT binding affinity prediction.
Abstract: Motivation The identification of novel drug-target (DT) interactions is a substantial part of the drug discovery process. Most of the computational methods that have been proposed to predict DT interactions have focused on binary classification, where the goal is to determine whether a DT pair interacts or not. However, protein-ligand interactions assume a continuum of binding strength values, also called binding affinity and predicting this value still remains a challenge. The increase in the affinity data available in DT knowledge-bases allows the use of advanced learning techniques such as deep learning architectures in the prediction of binding affinities. In this study, we propose a deep-learning based model that uses only sequence information of both targets and drugs to predict DT interaction binding affinities. The few studies that focus on DT binding affinity prediction use either 3D structures of protein-ligand complexes or 2D features of compounds. One novel approach used in this work is the modeling of protein sequences and compound 1D representations with convolutional neural networks (CNNs). Results The results show that the proposed deep learning based model that uses the 1D representations of targets and drugs is an effective approach for drug target binding affinity prediction. The model in which high-level representations of a drug and a target are constructed via CNNs achieved the best Concordance Index (CI) performance in one of our larger benchmark datasets, outperforming the KronRLS algorithm and SimBoost, a state-of-the-art method for DT binding affinity prediction. Availability and implementation https://github.com/hkmztrk/DeepDTA. Supplementary information Supplementary data are available at Bioinformatics online.

634 citations
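
The DeepDTA abstract above reports results as a Concordance Index (CI). A minimal sketch of the usual pairwise definition follows: over all pairs whose true affinities differ, it is the fraction in which the predictions are ordered the same way, with prediction ties counting one half. The function name and the toy values are assumptions for illustration and are not taken from the DeepDTA code.

```python
def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predictions are ordered correctly.

    A pair (i, j) is comparable when y_true[i] > y_true[j]. It scores 1 if
    y_pred[i] > y_pred[j], 0.5 if the predictions tie, and 0 otherwise.
    """
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(n):
            if y_true[i] > y_true[j]:
                comparable += 1
                diff = y_pred[i] - y_pred[j]
                if diff > 0:
                    concordant += 1.0
                elif diff == 0:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no pairs with different true values")
    return concordant / comparable


affinities = [1.0, 2.0, 3.0]  # toy measured affinities
ci_perfect = concordance_index(affinities, [0.1, 0.2, 0.3])
ci_one_swap = concordance_index(affinities, [0.1, 0.3, 0.2])
ci_constant = concordance_index(affinities, [0.5, 0.5, 0.5])
```

A CI of 1.0 means every pair is ranked correctly and 0.5 is what a constant or random predictor achieves, which is why the metric suits affinity prediction, where ranking compounds matters more than a binary interaction label.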


Journal ArticleDOI
TL;DR: A newly implemented background annotation engine for DFAST is presented, which can annotate a typical‐sized bacterial genome within 10 min, with rich information such as pseudogenes, translation exceptions and orthologous gene assignment between given reference genomes.
Abstract: Summary We developed a prokaryotic genome annotation pipeline, DFAST, that also supports genome submission to public sequence databases. DFAST was originally started as an on-line annotation server, and to date, over 7000 jobs have been processed since its first launch in 2016. Here, we present a newly implemented background annotation engine for DFAST, which is also available as a standalone command-line program. The new engine can annotate a typical-sized bacterial genome within 10 min, with rich information such as pseudogenes, translation exceptions and orthologous gene assignment between given reference genomes. In addition, the modular framework of DFAST allows users to customize the annotation workflow easily and will also facilitate extensions for new functions and incorporation of new tools in the future. Availability and implementation The software is implemented in Python 3 and runs in both Python 2.7 and 3.4 on Macintosh and Linux systems. It is freely available at https://github.com/nigyta/dfast_core/ under the GPLv3 license with external binaries bundled in the software distribution. An on-line version is also available at https://dfast.nig.ac.jp/. Contact yn@nig.ac.jp. Supplementary information Supplementary data are available at Bioinformatics online.

603 citations


Journal ArticleDOI
TL;DR: An update for the MAFFT multiple sequence alignment program is reported to enable parallel calculation of large numbers of sequences, and introduces a scalable variant, G-large-INS-1, which has equivalent accuracy to G-INS-1 and is applicable to 50 000 or more sequences.
Abstract: Summary We report an update for the MAFFT multiple sequence alignment program to enable parallel calculation of large numbers of sequences. The G-INS-1 option of MAFFT was recently reported to have higher accuracy than other methods for large data, but this method has been impractical for most large-scale analyses, due to the requirement of large computational resources. We introduce a scalable variant, G-large-INS-1, which has equivalent accuracy to G-INS-1 and is applicable to 50 000 or more sequences. Availability and implementation This feature is available in MAFFT versions 7.355 or later at https://mafft.cbrc.jp/alignment/software/mpi.html. Supplementary information Supplementary data are available at Bioinformatics online.

587 citations


Journal ArticleDOI
TL;DR: This manuscript demonstrates performance of the state‐of‐the‐art genome assembly software on six eukaryotic datasets sequenced using different technologies and introduces a concept of upper bound assembly for a given genome and set of reads, and compute theoretical limits on assembly correctness and completeness.
Abstract: Motivation The emergence of high-throughput sequencing technologies revolutionized genomics in the early 2000s. The next revolution came with the era of long-read sequencing. These technological advances, along with novel computational approaches, became the next step towards automatic pipelines capable of assembling nearly complete mammalian-size genomes. Results In this manuscript, we demonstrate performance of the state-of-the-art genome assembly software on six eukaryotic datasets sequenced using different technologies. To evaluate the results, we developed QUAST-LG, a tool that compares large genomic de novo assemblies against reference sequences and computes relevant quality metrics. Since genomes generally cannot be reconstructed completely due to complex repeat patterns and low coverage regions, we introduce a concept of upper bound assembly for a given genome and set of reads, and compute theoretical limits on assembly correctness and completeness. Using QUAST-LG, we show how close the assemblies are to the theoretical optimum, and how far this optimum is from the finished reference. Availability and implementation http://cab.spbu.ru/software/quast-lg. Supplementary information Supplementary data are available at Bioinformatics online.

562 citations


Journal ArticleDOI
TL;DR: GSCALite is a user-friendly web server for dynamic analysis and visualization of gene set in cancer and drug sensitivity correlation, which will be of broad utilities to cancer researchers.
Abstract: Summary The availability of cancer genomic data makes it possible to analyze genes related to cancer. Cancer is usually the result of a set of genes and the signal of a single gene could be covered by background noise. Here, we present a web server named Gene Set Cancer Analysis (GSCALite) to analyze a set of genes in cancers with the following functional modules. (i) Differential expression in tumor versus normal, and the survival analysis; (ii) Genomic variations and their survival analysis; (iii) Gene expression associated cancer pathway activity; (iv) miRNA regulatory network for genes; (v) Drug sensitivity for genes; (vi) Normal tissue expression and eQTL for genes. GSCALite is a user-friendly web server for dynamic analysis and visualization of gene set in cancer and drug sensitivity correlation, which will be of broad utilities to cancer researchers. Availability and implementation GSCALite is available on http://bioinfo.life.hust.edu.cn/web/GSCALite/. Supplementary information Supplementary data are available at Bioinformatics online.

554 citations


Journal ArticleDOI
TL;DR: This Galaxy‐supported pipeline, called FROGS, is designed to analyze large sets of amplicon sequences and produce abundance tables of Operational Taxonomic Units (OTUs) and their taxonomic affiliation to highlight databases conflicts and uncertainties.
Abstract: Motivation Metagenomics leads to major advances in microbial ecology and biologists need user friendly tools to analyze their data on their own. Results This Galaxy-supported pipeline, called FROGS, is designed to analyze large sets of amplicon sequences and produce abundance tables of Operational Taxonomic Units (OTUs) and their taxonomic affiliation. The clustering uses Swarm. The chimera removal uses VSEARCH, combined with original cross-sample validation. The taxonomic affiliation returns an innovative multi-affiliation output to highlight databases conflicts and uncertainties. Statistical results and numerous graphical illustrations are produced along the way to monitor the pipeline. FROGS was tested for the detection and quantification of OTUs on real and in silico datasets and proved to be rapid, robust and highly sensitive. It compares favorably with the widespread mothur, UPARSE and QIIME. Availability and implementation Source code and instructions for installation: https://github.com/geraldinepascal/FROGS.git. A companion website: http://frogs.toulouse.inra.fr. Contact geraldine.pascal@inra.fr. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: Mosdepth is a new command‐line tool for rapidly calculating genome‐wide sequencing coverage that uses a simple algorithm that is computationally efficient and enables it to quickly produce coverage summaries.
Abstract: Summary Mosdepth is a new command-line tool for rapidly calculating genome-wide sequencing coverage. It measures depth from BAM or CRAM files at either each nucleotide position in a genome or for sets of genomic regions. Genomic regions may be specified as either a BED file to evaluate coverage across capture regions, or as a fixed-size window as required for copy-number calling. Mosdepth uses a simple algorithm that is computationally efficient and enables it to quickly produce coverage summaries. We demonstrate that mosdepth is faster than existing tools and provides flexibility in the types of coverage profiles produced. Availability and implementation mosdepth is available from https://github.com/brentp/mosdepth under the MIT license. Contact bpederse@gmail.com. Supplementary information Supplementary data are available at Bioinformatics online.
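
The mosdepth abstract above refers to "a simple algorithm that is computationally efficient" without describing it. One simple approach that fits that description is a difference array followed by a cumulative sum, sketched below. This is an assumption about the general technique, not a transcription of mosdepth's code, and the function names and toy coordinates are hypothetical.

```python
from itertools import accumulate


def per_base_depth(chrom_len, alignments):
    """Per-base depth from 0-based, half-open (start, end) alignment intervals.

    Each alignment costs two array updates: +1 at its start and -1 at its end.
    A single cumulative sum then turns those increments into depth.
    """
    delta = [0] * (chrom_len + 1)
    for start, end in alignments:
        delta[start] += 1
        delta[end] -= 1
    return list(accumulate(delta[:-1]))


def window_means(depth, size):
    """Mean depth in consecutive fixed-size windows (the last may be shorter)."""
    means = []
    for i in range(0, len(depth), size):
        window = depth[i:i + size]
        means.append(sum(window) / len(window))
    return means


depth = per_base_depth(10, [(0, 5), (3, 8)])
windows = window_means(depth, 5)
```

The work per alignment is constant regardless of read length, so the total cost is proportional to the number of alignments plus the chromosome length. Region and window summaries, such as the fixed-size windows the abstract mentions for copy-number calling, can then be derived from the same depth array.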

Journal ArticleDOI
TL;DR: A new visualization tool that is specifically designed for chloroplast genomes is announced that allows the users to depict the genetic architecture of up to ten chloroplast genomes in the vicinity of the sites connecting the inverted repeats to the short and long single copy regions.
Abstract: Motivation Genome plotting is performed using a wide range of visualization tools, each with emphasis on a different informative dimension of the genome. These tools can provide a deeper insight into the genomic structure of the organism. Results Here, we announce a new visualization tool that is specifically designed for chloroplast genomes. It allows the users to depict the genetic architecture of up to ten chloroplast genomes in the vicinity of the sites connecting the inverted repeats to the short and long single copy regions. The software and its dependent libraries are fully coded in R and the reflected plot is scaled up to realistic size of nucleotide base pairs in the vicinity of the junction sites. We introduce a website for easier use of the program and R source code of the software to be used in case of preferences to be changed and integrated into personal pipelines. The input of the program is an annotation GenBank (.gb) file, the accession or GI number of the sequence or a DOGMA output file. The software was tested using over 100 embryophyte chloroplast genomes and in all cases a reliable output was obtained. Availability and implementation Source codes and the online suite are available at https://irscope.shinyapps.io/irapp/ or https://github.com/Limpfrog/irscope.

Journal ArticleDOI
TL;DR: Using a large set of high‐quality 16S rRNA sequences from finished genomes, the correspondence of OTUs to species is assessed for five representative clustering algorithms using four accuracy metrics and all algorithms had comparable accuracy when tuned to a given metric.
Abstract: Motivation The 16S ribosomal RNA (rRNA) gene is widely used to survey microbial communities. Sequences are often clustered into Operational Taxonomic Units (OTUs) as proxies for species. The canonical clustering threshold is 97% identity, which was proposed in 1994 when few 16S rRNA sequences were available, motivating a reassessment on current data. Results Using a large set of high-quality 16S rRNA sequences from finished genomes, I assessed the correspondence of OTUs to species for five representative clustering algorithms using four accuracy metrics. All algorithms had comparable accuracy when tuned to a given metric. Optimal identity thresholds were ∼99% for full-length sequences and ∼100% for the V4 hypervariable region. Availability and implementation Reference sequences and source code are provided in the Supplementary Material. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: A companion R package based on the R code base of the MetaboAnalyst web server to facilitate transparent, flexible and reproducible analysis of metabolomics data.
Abstract: Summary The MetaboAnalyst web application has been widely used for metabolomics data analysis and interpretation. Despite its user-friendliness, the web interface has presented its inherent limitations (especially for advanced users) with regard to flexibility in creating customized workflow, support for reproducible analysis, and capacity in dealing with large data. To address these limitations, we have developed a companion R package (MetaboAnalystR) based on the R code base of the web server. The package has been thoroughly tested to ensure that the same R commands will produce identical results from both interfaces. MetaboAnalystR complements the MetaboAnalyst web server to facilitate transparent, flexible and reproducible analysis of metabolomics data. Availability and implementation MetaboAnalystR is freely available from https://github.com/xia-lab/MetaboAnalystR.

Journal ArticleDOI
TL;DR: iFeature is a versatile Python‐based toolkit for generating various numerical feature representation schemes for both protein and peptide sequences, capable of calculating and extracting a comprehensive spectrum of 18 major sequence encoding schemes that encompass 53 different types of feature descriptors.
Abstract: Summary Structural and physiochemical descriptors extracted from sequence data have been widely used to represent sequences and predict structural, functional, expression and interaction profiles of proteins and peptides as well as DNAs/RNAs. Here, we present iFeature, a versatile Python-based toolkit for generating various numerical feature representation schemes for both protein and peptide sequences. iFeature is capable of calculating and extracting a comprehensive spectrum of 18 major sequence encoding schemes that encompass 53 different types of feature descriptors. It also allows users to extract specific amino acid properties from the AAindex database. Furthermore, iFeature integrates 12 different types of commonly used feature clustering, selection and dimensionality reduction algorithms, greatly facilitating training, analysis and benchmarking of machine-learning models. The functionality of iFeature is made freely available via an online web server and a stand-alone toolkit. Availability and implementation http://iFeature.erc.monash.edu/; https://github.com/Superzchen/iFeature/. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: Phandango is an interactive application running in a web browser allowing fast exploration of large-scale population genomics datasets combining the output from multiple genomic analysis methods in an intuitive and interactive manner.
Abstract: Summary Fully exploiting the wealth of data in current bacterial population genomics datasets requires synthesizing and integrating different types of analysis across millions of base pairs in hundreds or thousands of isolates. Current approaches often use static representations of phylogenetic, epidemiological, statistical and evolutionary analysis results that are difficult to relate to one another. Phandango is an interactive application running in a web browser allowing fast exploration of large-scale population genomics datasets combining the output from multiple genomic analysis methods in an intuitive and interactive manner. Availability and implementation Phandango is a web application freely available for use at www.phandango.net and includes a diverse collection of datasets as examples. Source code together with a detailed wiki page is available on GitHub at https://github.com/jameshadfield/phandango.

Journal ArticleDOI
TL;DR: This work has developed highly memory‐efficient and scalable extensions for the NGL WebGL‐based molecular viewer and uses the Macromolecular Transmission Format (MMTF), a binary and compressed format, which together enable NGL to download and render molecular complexes with millions of atoms interactively on desktop computers and smartphones alike.
Abstract: Motivation The interactive visualization of very large macromolecular complexes on the web is becoming a challenging problem as experimental techniques advance at an unprecedented rate and deliver structures of increasing size. Results We have tackled this problem by developing highly memory-efficient and scalable extensions for the NGL WebGL-based molecular viewer and by using the Macromolecular Transmission Format (MMTF), a binary and compressed format. These enable NGL to download and render molecular complexes with millions of atoms interactively on desktop computers and smartphones alike, making it a tool of choice for web-based molecular visualization in research and education. Availability and implementation The source code is freely available under the MIT license at github.com/arose/ngl and distributed on NPM (npmjs.com/package/ngl). MMTF-JavaScript encoders and decoders are available at github.com/rcsb/mmtf-javascript.

Journal ArticleDOI
TL;DR: A fast approach to debias impurity‐based variable importance measures for classification, regression and survival forests is set up, showing that it creates a variable importance measure which is unbiased with regard to the number of categories and minor allele frequency and almost as fast as the standard impurity importance.
Abstract: Motivation Random forests are fast, flexible and represent a robust approach to analyze high dimensional data. A key advantage over alternative machine learning algorithms are variable importance measures, which can be used to identify relevant features or perform variable selection. Measures based on the impurity reduction of splits, such as the Gini importance, are popular because they are simple and fast to compute. However, they are biased in favor of variables with many possible split points and high minor allele frequency. Results We set up a fast approach to debias impurity-based variable importance measures for classification, regression and survival forests. We show that it creates a variable importance measure which is unbiased with regard to the number of categories and minor allele frequency and almost as fast as the standard impurity importance. As a result, it is now possible to compute reliable importance estimates without the extra computing cost of permutations. Further, we combine the importance measure with a fast testing procedure, producing p-values for variable importance with almost no computational overhead to the creation of the random forest. Applications to gene expression and genome-wide association data show that the proposed method is powerful and computationally efficient. Availability and implementation The procedure is included in the ranger package, available at https://cran.r-project.org/package=ranger and https://github.com/imbs-hl/ranger. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: A novel model of Inductive Matrix Completion for MiRNA‐Disease Association prediction (IMCMDA) to complete the missing miRNA‐disease association based on the known associations and the integrated miRNA similarity and disease similarity.
Abstract: Motivation It has been shown that microRNAs (miRNAs) play key roles in a variety of biological processes associated with human diseases. In consideration of the cost and complexity of biological experiments, computational methods for predicting potential associations between miRNAs and diseases would be an effective complement. Results This paper presents a novel model of Inductive Matrix Completion for MiRNA-Disease Association prediction (IMCMDA). The integrated miRNA similarity and disease similarity are calculated based on miRNA functional similarity, disease semantic similarity and Gaussian interaction profile kernel similarity. The main idea is to complete the missing miRNA-disease association based on the known associations and the integrated miRNA similarity and disease similarity. IMCMDA achieves an AUC of 0.8034 based on leave-one-out cross-validation and improves on previous models. In addition, IMCMDA was applied to five common human diseases in three types of case studies. In the first type, 42, 44 and 45 out of the top 50 predicted miRNAs of Colon Neoplasms, Kidney Neoplasms and Lymphoma, respectively, were confirmed by experimental reports. In the second type of case study, for new diseases without any known miRNAs, we chose Breast Neoplasms as the test example by hiding the association information between the miRNAs and Breast Neoplasms. As a result, 50 out of the top 50 predicted Breast Neoplasms-related miRNAs are verified. In the third type of case study, IMCMDA was tested on HMDD V1.0 to assess the robustness of IMCMDA; 49 out of the top 50 predicted Esophageal Neoplasms-related miRNAs are verified. Availability and implementation The code and dataset of IMCMDA are freely available at https://github.com/IMCMDAsourcecode/IMCMDA. Supplementary information Supplementary data are available at Bioinformatics online.
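
The IMCMDA abstract above lists Gaussian interaction profile (GIP) kernel similarity among its inputs. A minimal sketch of how such a kernel is commonly computed from a binary association matrix follows. The bandwidth normalization (dividing `gamma_prime` by the mean squared profile norm), the default `gamma_prime=1.0`, and the toy matrix are assumptions for illustration, not details taken from the IMCMDA code.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    Each row is the 0/1 association profile of one entity (for example one
    miRNA across all diseases). The similarity is exp(-gamma * squared
    Euclidean distance), where gamma is gamma_prime divided by the mean
    squared norm of the profiles (a commonly used normalization, assumed here).
    """
    sq_norms = [sum(x * x for x in row) for row in profiles]
    mean_sq = sum(sq_norms) / len(profiles)
    if mean_sq == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq

    n = len(profiles)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist_sq = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-gamma * dist_sq)
    return kernel


# Toy association matrix: 3 miRNAs (rows) by 3 diseases (columns).
associations = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
similarity = gip_kernel(associations)
```

Entities with identical association profiles get similarity 1, and the value decays toward 0 as their known associations diverge.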

Journal ArticleDOI
TL;DR: heatmaply is an R package for easily creating interactive cluster heatmaps that can be shared online as a stand-alone HTML file; interactivity includes a tooltip display of values when hovering over cells, as well as the ability to zoom in to specific sections of the figure from the data matrix, the side dendrograms, or annotated labels.
Abstract: Summary heatmaply is an R package for easily creating interactive cluster heatmaps that can be shared online as a stand-alone HTML file. Interactivity includes a tooltip display of values when hovering over cells, as well as the ability to zoom in to specific sections of the figure from the data matrix, the side dendrograms, or annotated labels. Thanks to the synergistic relationship between heatmaply and other R packages, the user is empowered by a refined control over the statistical and visual aspects of the heatmap layout. Availability and implementation The heatmaply package is available under the GPL-2 Open Source license. It comes with a detailed vignette, and is freely available from: http://cran.r-project.org/package=heatmaply. Contact tal.galili@math.tau.ac.il. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: The new release of the ARGs‐OAP database, termed SARG version 2.0, contains sequences not only from CARD and ARDB databases, but also carefully selected and curated sequences from the latest protein collection of the NCBI‐NR database, to keep up to date with the increasing number of ARG deposited sequences.
Abstract: Motivation Much global attention has been paid to antibiotic resistance in monitoring its emergence, accumulation and dissemination. For rapid characterization and quantification of antibiotic resistance genes (ARGs) in metagenomic datasets, an online analysis pipeline, ARGs-OAP has been developed consisting of a database termed Structured Antibiotic Resistance Genes (the SARG) with a hierarchical structure (ARGs type-subtype-reference sequence). Results The new release of the database, termed SARG version 2.0, contains sequences not only from CARD and ARDB databases, but also carefully selected and curated sequences from the latest protein collection of the NCBI-NR database, to keep up to date with the increasing number of ARG deposited sequences. SARG v2.0 has tripled the sequences of the first version and demonstrated improved coverage of ARGs detection in metagenomes from various environmental samples. In addition to annotation of high-throughput raw reads using a similarity search strategy, ARGs-OAP v2.0 now provides model-based identification of assembled sequences using SARGfam, a high-quality profile Hidden Markov Model (HMM), containing profiles of ARG subtypes. Additionally, ARGs-OAP v2.0 improves cell number quantification by using the average coverage of essential single copy marker genes, as an option in addition to the previous method based on the 16S rRNA gene. Availability and implementation ARGs-OAP can be accessed through http://smile.hku.hk/SARGs. The database could be downloaded from the same site. Source codes for this study can be downloaded from https://github.com/xiaole99/ARGs-OAP-v2.0. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: DeepSynergy uses chemical and genomic information as input information, a normalization strategy to account for input data heterogeneity, and conical layers to model drug synergies and could be a valuable tool for selecting novel synergistic drug combinations.
Abstract: Motivation While drug combination therapies are a well-established concept in cancer treatment, identifying novel synergistic combinations is challenging due to the size of combinatorial space. However, computational approaches have emerged as a time- and cost-efficient way to prioritize combinations to test, based on recently available large-scale combination screening data. Recently, Deep Learning has had an impact in many research areas by achieving new state-of-the-art model performance. However, Deep Learning has not yet been applied to drug synergy prediction, which is the approach we present here, termed DeepSynergy. DeepSynergy uses chemical and genomic information as input information, a normalization strategy to account for input data heterogeneity, and conical layers to model drug synergies. Results DeepSynergy was compared to other machine learning methods such as Gradient Boosting Machines, Random Forests, Support Vector Machines and Elastic Nets on the largest publicly available synergy dataset with respect to mean squared error. DeepSynergy significantly outperformed the other methods with an improvement of 7.2% over the second best method at the prediction of novel drug combinations within the space of explored drugs and cell lines. At this task, the mean Pearson correlation coefficient between the measured and the predicted values of DeepSynergy was 0.73. Applying DeepSynergy for classification of these novel drug combinations resulted in a high predictive performance of an AUC of 0.90. Furthermore, we found that all compared methods exhibit low predictive performance when extrapolating to unexplored drugs or cell lines, which we suggest is due to limitations in the size and diversity of the dataset. We envision that DeepSynergy could be a valuable tool for selecting novel synergistic drug combinations. Availability and implementation DeepSynergy is available via www.bioinf.jku.at/software/DeepSynergy. Contact klambauer@bioinf.jku.at. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: This work has developed a novel method to predict protein function from sequence that uses deep learning to learn features from protein sequences as well as a cross-species protein–protein interaction network.
Abstract: Motivation A large number of protein sequences are becoming available through the application of novel high-throughput sequencing technologies. Experimental functional characterization of these proteins is time-consuming and expensive, and is often only done rigorously for a few selected model organisms. Computational function prediction approaches have been suggested to fill this gap. The functions of proteins are classified using the Gene Ontology (GO), which contains over 40 000 classes. Additionally, proteins have multiple functions, making function prediction a large-scale, multi-class, multi-label problem. Results We have developed a novel method to predict protein function from sequence. We use deep learning to learn features from protein sequences as well as a cross-species protein-protein interaction network. Our approach specifically outputs information in the structure of the GO and utilizes the dependencies between GO classes as background information to construct a deep learning model. We evaluate our method using the standards established by the Computational Assessment of Function Annotation (CAFA) and demonstrate a significant improvement over baseline methods such as BLAST, in particular for predicting cellular locations. Availability and implementation Web server: http://deepgo.bio2vec.net, Source code: https://github.com/bio-ontology-research-group/deepgo. Contact robert.hoehndorf@kaust.edu.sa. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: A novel deep neural network is developed estimating the binding affinity of ligand‐receptor complexes by utilizing a 3D convolution to produce a feature map of this representation, treating the atoms of both proteins and ligands in the same manner.
Abstract: Motivation Structure based ligand discovery is one of the most successful approaches for augmenting the drug discovery process. Currently, there is a notable shift towards machine learning (ML) methodologies to aid such procedures. Deep learning has recently gained considerable attention as it allows the model to 'learn' to extract features that are relevant for the task at hand. Results We have developed a novel deep neural network estimating the binding affinity of ligand-receptor complexes. The complex is represented with a 3D grid, and the model utilizes a 3D convolution to produce a feature map of this representation, treating the atoms of both proteins and ligands in the same manner. Our network was tested on the CASF-2013 'scoring power' benchmark and Astex Diverse Set and outperformed classical scoring functions. Availability and implementation The model, together with usage instructions and examples, is available as a git repository at http://gitlab.com/cheminfIBB/pafnucy. Supplementary information Supplementary data are available at Bioinformatics online.
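The 3D grid representation described in the abstract can be sketched as follows. This is an illustrative sketch, not the published model's code: the function name `voxelize`, the 20 Å box, the 1 Å resolution and the three-element feature vectors are all assumptions. Protein and ligand atoms go through the same code path and differ only in one feature flag. The 3D convolution itself is not shown.

```python
def voxelize(atoms, center, box_size=20.0, resolution=1.0):
    """Map atoms onto a sparse 3D grid centred on `center`.

    atoms: iterable of (x, y, z, features), features being a list of floats.
    Returns {(i, j, k): summed feature vector}; atoms outside the box are dropped.
    Box size and resolution are illustrative defaults, not values from the abstract.
    """
    n = int(box_size / resolution)  # number of cells along each axis
    half = box_size / 2.0
    grid = {}
    for x, y, z, feats in atoms:
        # Shift coordinates so the box spans [0, box_size), then bin them.
        idx = tuple(
            int((c - c0 + half) // resolution)
            for c, c0 in zip((x, y, z), center)
        )
        if all(0 <= i < n for i in idx):
            cell = grid.setdefault(idx, [0.0] * len(feats))
            for k, value in enumerate(feats):
                cell[k] += value  # atoms sharing a cell have their features summed
    return grid


# Hypothetical features: [is_carbon, is_nitrogen, is_ligand]
atoms = [
    (0.2, 0.2, 0.2, [1.0, 0.0, 0.0]),   # protein carbon
    (0.7, 0.4, 0.9, [0.0, 1.0, 1.0]),   # ligand nitrogen, same cell as above
    (-3.5, 2.0, 0.0, [1.0, 0.0, 1.0]),  # ligand carbon
    (50.0, 0.0, 0.0, [1.0, 0.0, 0.0]),  # outside the box, ignored
]
grid = voxelize(atoms, center=(0.0, 0.0, 0.0))
print(sorted(grid.items()))
```

A sparse dictionary keeps the sketch dependency-free. A real pipeline would more likely fill a dense 4D array (three spatial axes plus a feature axis) so that it can be passed directly to a convolutional layer.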

Journal ArticleDOI
TL;DR: A new R package, named ‘castor’, is presented, for comparative phylogenetics on large trees comprising millions of tips, which is often 100‐1000 times faster than existing tools.
Abstract: Motivation Biodiversity databases now comprise hundreds of thousands of sequences and trait records. For example, the Open Tree of Life includes over 1 491 000 metazoan and over 300 000 bacterial taxa. These data provide unique opportunities for analysis of phylogenetic trait distribution and reconstruction of ancestral biodiversity. However, existing tools for comparative phylogenetics scale poorly to such large trees, to the point of being almost unusable. Results Here we present a new R package, named 'castor', for comparative phylogenetics on large trees comprising millions of tips. On large trees castor is often 100-1000 times faster than existing tools. Availability and implementation The castor source code, compiled binaries, documentation and usage examples are freely available at the Comprehensive R Archive Network (CRAN). Contact louca.research@gmail.com. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: Drawing on an extensive feature set, FATHMM‐XF outperforms competitors on benchmark tests, particularly in non‐coding regions where the majority of pathogenic mutations are likely to be found.
Abstract: Summary: We present FATHMM-XF, a method for predicting pathogenic point mutations in the human genome. Drawing on an extensive feature set, FATHMM-XF outperforms competitors on benchmark tests, particularly in non-coding regions where the majority of pathogenic mutations are likely to be found. Availability and implementation: The FATHMM-XF web server is available at http://fathmm.biocompute.org.uk/fathmm-xf/, and as tracks on the Genome Tolerance Browser: http://gtb.biocompute.org.uk. Predictions are provided for human genome version GRCh37/hg19. The data used for this project can be downloaded from: http://fathmm.biocompute.org.uk/fathmm-xf/

Journal ArticleDOI
TL;DR: An effective feature representation learning model is developed that can extract and learn a set of informative features from a pool of support vector machine-based models trained using sequence-based feature descriptors and provide the most discriminative power for identifying ACPs.
Abstract: Motivation Anti-cancer peptides (ACPs) have recently emerged as promising therapeutic agents for cancer treatment. Due to the avalanche of protein sequence data in the post-genomic era, there is an urgent need to develop automated computational methods to enable fast and accurate identification of novel ACPs within the vast number of candidate proteins and peptides. Results To address this, we propose a novel predictor named Anti-Cancer peptide Predictor with Feature representation Learning (ACPred-FL) for accurate prediction of ACPs based on sequence information. More specifically, we develop an effective feature representation learning model, with which we can extract and learn a set of informative features from a pool of support vector machine-based models trained using sequence-based feature descriptors. By doing so, the class label information of data samples is fully utilized. To improve the feature representation, we further employ a two-step feature selection technique, resulting in a highly informative five-dimensional feature vector for the final peptide representation. Experimental results show that these five features provide more discriminative power for identifying ACPs than currently available feature descriptors, highlighting the effectiveness of the proposed feature representation learning approach. The developed ACPred-FL method significantly outperforms state-of-the-art methods. Availability and implementation The web-server of ACPred-FL is available at http://server.malab.cn/ACPred-FL. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: Novel selection strategies to identify highly tissue‐specific CpG sites are introduced and the random forest approach is used to construct the classifiers that can efficiently predict the origin of tumors.
Abstract: MOTIVATION: A clear identification of the primary site of a tumor is of great importance for subsequent site-specific targeted treatment and could efficiently improve a patient's overall survival. Even though many classifiers based on gene expression have been proposed to predict the tumor primary, only a few studies focus on using DNA methylation (DNAm) profiles to develop classifiers, and none of them compares the performance of classifiers based on different profiles. RESULTS: We introduced novel selection strategies to identify highly tissue-specific CpG sites and then used the random forest approach to construct the classifiers to predict the origin of tumors. We also compared the prediction performance by applying a similar strategy to miRNA expression profiles. Our analysis indicated that these classifiers had an accuracy of 96.05% (Maximum–Relevance–Maximum–Distance: 90.02–99.99%) or 95.31% (principal component analysis: 79.82–99.91%) on independent DNAm datasets, and an overall accuracy of 91.30% (range 79.33–98.74%) on independent miRNA test sets for predicting tumor origin. This suggests that our feature selection methods are very effective at identifying tissue-specific biomarkers and that the classifiers we developed can efficiently predict the origin of tumors. We also developed a user-friendly webserver that helps users to predict the tumor origin by uploading the miRNA expression or DNAm profile of interest. AVAILABILITY AND IMPLEMENTATION: The webserver, related data and code are accessible at http://server.malab.cn/MMCOP/.

Journal ArticleDOI
TL;DR: A neural network approach, i.e. attention‐based bidirectional Long Short‐Term Memory with a conditional random field layer (Att‐BiLSTM‐CRF), to document‐level chemical NER that achieves better performances with little feature engineering than other state‐of‐the‐art methods.
Abstract: Motivation In biomedical research, chemicals are an important class of entities, and chemical named entity recognition (NER) is an important task in the field of biomedical information extraction. However, most popular chemical NER methods are based on traditional machine learning and their performance depends heavily on feature engineering. Moreover, these methods operate at the sentence level and therefore suffer from the tagging inconsistency problem. Results In this paper, we propose a neural network approach, i.e. attention-based bidirectional Long Short-Term Memory with a conditional random field layer (Att-BiLSTM-CRF), to document-level chemical NER. The approach leverages document-level global information obtained by an attention mechanism to enforce tagging consistency across multiple instances of the same token in a document. With little feature engineering, it achieves better performance than other state-of-the-art methods on the BioCreative IV chemical compound and drug name recognition (CHEMDNER) corpus and the BioCreative V chemical-disease relation (CDR) task corpus (F-scores of 91.14% and 92.57%, respectively). Availability and implementation Data and code are available at https://github.com/lingluodlut/Att-ChemdNER. Contact yangzh@dlut.edu.cn or wangleibihami@gmail.com. Supplementary information Supplementary data are available at Bioinformatics online.