Showing papers in "Bioinformatics in 2019"

PDF

Open Access

Journal Article•DOI•

ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R.

[...]

Emmanuel Paradis¹, Klaus Schliep²•Institutions (2)

Centre national de la recherche scientifique¹, University of Massachusetts Boston²

01 Feb 2019-Bioinformatics

TL;DR: Efforts have been put to improve efficiency, flexibility, support for 'big data' (R's long vectors), ease of use and quality check before a new release of ape.

...read moreread less

Abstract: Summary After more than fifteen years of existence, the R package ape has continuously grown its contents, and has been used by a growing community of users The release of version 50 has marked a leap towards a modern software for evolutionary analyses Efforts have been put to improve efficiency, flexibility, support for 'big data' (R's long vectors), ease of use and quality check before a new release These changes will hopefully make ape a useful software for the study of biodiversity and evolution in a context of increasing data quantity Availability and implementation ape is distributed through the Comprehensive R Archive Network: http://cranr-projectorg/package=ape Further information may be found at http://ape-packageirdfr/

...read moreread less

4,303 citations

Journal Article•DOI•

BioBERT: a pre-trained biomedical language representation model for biomedical text mining.

[...]

Jinhyuk Lee¹, Wonjin Yoon¹, Sungdong Kim², Donghyeon Kim¹, Sunkyu Kim¹, Chan Ho So¹, Jaewoo Kang¹ - Show less +3 more•Institutions (2)

Korea University¹, Naver Corporation²

25 Jan 2019-Bioinformatics

TL;DR: This article proposed BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora.

...read moreread less

Abstract: Motivation Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora. Results We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts. Availability and implementation We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and the source code for fine-tuning BioBERT available at https://github.com/dmis-lab/biobert.

...read moreread less

2,680 citations

Journal Article•DOI•

GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database

[...]

Pierre-Alain Chaumeil¹, Aaron J. Mussig¹, Philip Hugenholtz¹, Donovan H. Parks¹•Institutions (1)

University of Queensland¹

15 Nov 2019-Bioinformatics

TL;DR: The accuracy of the GTDB-Tk taxonomic assignments is demonstrated by evaluating its performance on a phylogenetically diverse set of 10 156 bacterial and archaeal metagenome-assembled genomes.

...read moreread less

Abstract: A Summary: The Genome Taxonomy Database Toolkit (GTDB-Tk) provides objective taxonomic assignments for bacterial and archaeal genomes based on the GTDB. GTDB-Tk is computationally efficient and able to classify thousands of draft genomes in parallel. Here we demonstrate the accuracy of the GTDB-Tk taxonomic assignments by evaluating its performance on a phylogenetically diverse set of 10 156 bacterial and archaeal metagenome-assembled genomes.

...read moreread less

2,053 citations

Journal Article•DOI•

RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference.

[...]

Alexey M. Kozlov¹, Diego Darriba¹, Tomas Flouri¹, Benoit Morel¹, Alexandros Stamatakis¹, Alexandros Stamatakis² - Show less +2 more•Institutions (2)

Heidelberg Institute for Theoretical Studies¹, Karlsruhe Institute of Technology²

01 Nov 2019-Bioinformatics

TL;DR: RAxML-NG is presented, a from-scratch re-implementation of the established greedy tree search algorithm of RAxML/ExaML, which offers improved accuracy, flexibility, speed, scalability, and usability compared with RAx ML/ exaML.

...read moreread less

Abstract: MOTIVATION Phylogenies are important for fundamental biological research, but also have numerous applications in biotechnology, agriculture and medicine. Finding the optimal tree under the popular maximum likelihood (ML) criterion is known to be NP-hard. Thus, highly optimized and scalable codes are needed to analyze constantly growing empirical datasets. RESULTS We present RAxML-NG, a from-scratch re-implementation of the established greedy tree search algorithm of RAxML/ExaML. RAxML-NG offers improved accuracy, flexibility, speed, scalability, and usability compared with RAxML/ExaML. On taxon-rich datasets, RAxML-NG typically finds higher-scoring trees than IQTree, an increasingly popular recent tool for ML-based phylogenetic inference (although IQ-Tree shows better stability). Finally, RAxML-NG introduces several new features, such as the detection of terraces in tree space and the recently introduced transfer bootstrap support metric. AVAILABILITY AND IMPLEMENTATION The code is available under GNU GPL at https://github.com/amkozlov/raxml-ng. RAxML-NG web service (maintained by Vital-IT) is available at https://raxml-ng.vital-it.ch/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

...read moreread less

1,765 citations

Journal Article•DOI•

TISIDB: an integrated repository portal for tumor-immune system interactions

[...]

Beibei Ru¹, CN Wong¹, Yin Tong¹, Jia Yi Zhong¹, Sophia Shek Wa Zhong¹, Wai Chung Wu¹, Ka Chi Chu¹, Choi Yiu Wong¹, Chit Ying Lau¹, Ian Chen¹, Nam Wai Chan¹, Jiangwen Zhang¹ - Show less +8 more•Institutions (1)

University of Hong Kong¹

15 Oct 2019-Bioinformatics

TL;DR: A user-friendly web portal TISIDB is designed, which integrated multiple types of data resources in oncoimmunology, and biologists can cross-check a gene of interest about its role in tumor-immune interactions through literature mining and high-throughput data analysis.

...read moreread less

Abstract: Summary The interaction between tumor and immune system plays a crucial role in both cancer development and treatment response. To facilitate comprehensive investigation of tumor-immune interactions, we have designed a user-friendly web portal TISIDB, which integrated multiple types of data resources in oncoimmunology. First, we manually curated 4176 records from 2530 publications, which reported 988 genes related to anti-tumor immunity. Second, genes associated with the resistance or sensitivity of tumor cells to T cell-mediated killing and immunotherapy were identified by analyzing high-throughput screening and genomic profiling data. Third, associations between any gene and immune features, such as lymphocytes, immunomodulators and chemokines, were pre-calculated for 30 TCGA cancer types. In TISIDB, biologists can cross-check a gene of interest about its role in tumor-immune interactions through literature mining and high-throughput data analysis, and generate testable hypotheses and high quality figures for publication. Availability and implementation http://cis.hku.hk/TISIDB. Supplementary information Supplementary data are available at Bioinformatics online.

...read moreread less

991 citations

Journal Article•DOI•

VarSome: the human genomic variant search engine

[...]

Christos Kopanos¹, Vasilis Tsiolkas¹, Alexandros Kouris¹, Charles E. Chapple¹, Monica Albarca Aguilera¹, Richard Meyer¹, Andreas Massouras¹ - Show less +3 more•Institutions (1)

École Polytechnique Fédérale de Lausanne¹

01 Jun 2019-Bioinformatics

TL;DR: VarSome.com is a search engine, aggregator and impact analysis tool for human genetic variation and a community-driven project aiming at sharing global expertise on human variants.

...read moreread less

Abstract: Summary VarSome.com is a search engine, aggregator and impact analysis tool for human genetic variation and a community-driven project aiming at sharing global expertise on human variants. Availability and implementation VarSome is freely available at http://varsome.com. Supplementary information Supplementary data are available at Bioinformatics online.

...read moreread less

913 citations

Journal Article•DOI•

Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences.

[...]

Anqi Zhu¹, Joseph G. Ibrahim¹, Michael I. Love¹•Institutions (1)

University of North Carolina at Chapel Hill¹

01 Jun 2019-Bioinformatics

TL;DR: The proposed method, Approximate Posterior Estimation for generalized linear model, apeglm, has lower bias than previously proposed shrinkage estimators, while still reducing variance for those genes with little information for statistical inference.

...read moreread less

Abstract: MOTIVATION In RNA-seq differential expression analysis, investigators aim to detect those genes with changes in expression level across conditions, despite technical and biological variability in the observations. A common task is to accurately estimate the effect size, often in terms of a logarithmic fold change (LFC). RESULTS When the read counts are low or highly variable, the maximum likelihood estimates for the LFCs has high variance, leading to large estimates not representative of true differences, and poor ranking of genes by effect size. One approach is to introduce filtering thresholds and pseudocounts to exclude or moderate estimated LFCs. Filtering may result in a loss of genes from the analysis with true differences in expression, while pseudocounts provide a limited solution that must be adapted per dataset. Here, we propose the use of a heavy-tailed Cauchy prior distribution for effect sizes, which avoids the use of filter thresholds or pseudocounts. The proposed method, Approximate Posterior Estimation for generalized linear model, apeglm, has lower bias than previously proposed shrinkage estimators, while still reducing variance for those genes with little information for statistical inference. AVAILABILITY AND IMPLEMENTATION The apeglm package is available as an R/Bioconductor package at https://bioconductor.org/packages/apeglm, and the methods can be called from within the DESeq2 software. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

...read moreread less

828 citations

Journal Article•DOI•

PhenoScanner V2: an expanded tool for searching human genotype-phenotype associations

[...]

Mihir A Kamat¹, James A. Blackshaw¹, Robin Young¹, Praveen Surendran¹, Stephen Burgess², John Danesh¹, Adam S. Butterworth¹, James R Staley² - Show less +4 more•Institutions (2)

University of Cambridge¹, University of Bristol²

01 Nov 2019-Bioinformatics

TL;DR: A major update of PhenoScanner is presented, including over 150 million genetic variants and more than 65 billion associations with diseases and traits, gene expression, metabolite and protein levels, and epigenetic markers.

...read moreread less

Abstract: SUMMARY PhenoScanner is a curated database of publicly available results from large-scale genetic association studies in humans. This online tool facilitates 'phenome scans', where genetic variants are cross-referenced for association with many phenotypes of different types. Here we present a major update of PhenoScanner ('PhenoScanner V2'), including over 150 million genetic variants and more than 65 billion associations (compared to 350 million associations in PhenoScanner V1) with diseases and traits, gene expression, metabolite and protein levels, and epigenetic markers. The query options have been extended to include searches by genes, genomic regions and phenotypes, as well as for genetic variants. All variants are positionally annotated using the Variant Effect Predictor and the phenotypes are mapped to Experimental Factor Ontology terms. Linkage disequilibrium statistics from the 1000 Genomes project can be used to search for phenotype associations with proxy variants. AVAILABILITY AND IMPLEMENTATION PhenoScanner V2 is available at www.phenoscanner.medschl.cam.ac.uk.

...read moreread less

643 citations

Journal Article•DOI•

PopLDdecay: a fast and effective tool for linkage disequilibrium decay analysis based on variant call format files.

[...]

Chi Zhang¹, Chi Zhang², Shan-Shan Dong¹, Jun-Yang Xu², Weiming He¹, Weiming He², Tie-Lin Yang¹ - Show less +3 more•Institutions (2)

Xi'an Jiaotong University¹, Beijing Genomics Institute²

15 May 2019-Bioinformatics

TL;DR: PopLDdecay, an open source software, for LD decay analysis from VCF files is fast and is able to handle large number of variants from sequencing data and is also storage saving by avoiding exporting pair-wise results of LD measurements.

...read moreread less

Abstract: Motivation Linkage disequilibrium (LD) decay is of great interest in population genetic studies. However, no tool is available now to do LD decay analysis from variant call format (VCF) files directly. In addition, generation of pair-wise LD measurements for whole genome SNPs usually resulting in large storage wasting files. Results We developed PopLDdecay, an open source software, for LD decay analysis from VCF files. It is fast and is able to handle large number of variants from sequencing data. It is also storage saving by avoiding exporting pair-wise results of LD measurements. Subgroup analyses are also supported. Availability and implementation PopLDdecay is freely available at https://github.com/BGI-shenzhen/PopLDdecay.

...read moreread less

613 citations

Journal Article•DOI•

admetSAR 2.0: web-service for prediction and optimization of chemical ADMET properties

[...]

Hongbin Yang¹, Chaofeng Lou¹, Lixia Sun¹, Jie Li¹, Yingchun Cai¹, Zhuang Wang¹, Weihua Li¹, Guixia Liu¹, Yun Tang¹ - Show less +5 more•Institutions (1)

East China University of Science and Technology¹

15 Mar 2019-Bioinformatics

TL;DR: This update of admetSAR, developed as a comprehensive source and free tool for the prediction of chemical ADMET properties, focuses on extension and optimization of existing models with significant quantity and quality improvement on training data.

...read moreread less

Abstract: Summary admetSAR was developed as a comprehensive source and free tool for the prediction of chemical ADMET properties. Since its first release in 2012 containing 27 predictive models, admetSAR has been widely used in chemical and pharmaceutical fields. This update, admetSAR 2.0, focuses on extension and optimization of existing models with significant quantity and quality improvement on training data. Now 47 models are available for either drug discovery or environmental risk assessment. In addition, we added a new module named ADMETopt for lead optimization based on predicted ADMET properties. Availability and implementation Free available on the web at http://lmmd.ecust.edu.cn/admetsar2/. Supplementary information Supplementary data are available at Bioinformatics online.

...read moreread less

606 citations

Journal Article•DOI•

Comprehensive evaluation of transcriptome-based cell-type quantification methods for immuno-oncology

[...]

Gregor Sturm¹, Francesca Finotello², Florent Petitprez³, Jitao David Zhang⁴, Jan Baumbach¹, Wolf H. Fridman³, Markus List¹, Tatsiana Aneichyk - Show less +4 more•Institutions (4)

Technische Universität München¹, Innsbruck Medical University², University of Paris³, Hoffmann-La Roche⁴

15 Jul 2019-Bioinformatics

TL;DR: It is demonstrated that computational deconvolution performs at high accuracy for well-defined cell- type signatures and proposed how fuzzy cell-type signatures can be improved, and suggested that future efforts should be dedicated to refining cell population definitions and finding reliable signatures.

...read moreread less

Abstract: Motivation The composition and density of immune cells in the tumor microenvironment (TME) profoundly influence tumor progression and success of anti-cancer therapies. Flow cytometry, immunohistochemistry staining or single-cell sequencing are often unavailable such that we rely on computational methods to estimate the immune-cell composition from bulk RNA-sequencing (RNA-seq) data. Various methods have been proposed recently, yet their capabilities and limitations have not been evaluated systematically. A general guideline leading the research community through cell type deconvolution is missing. Results We developed a systematic approach for benchmarking such computational methods and assessed the accuracy of tools at estimating nine different immune- and stromal cells from bulk RNA-seq samples. We used a single-cell RNA-seq dataset of ∼11 000 cells from the TME to simulate bulk samples of known cell type proportions, and validated the results using independent, publicly available gold-standard estimates. This allowed us to analyze and condense the results of more than a hundred thousand predictions to provide an exhaustive evaluation across seven computational methods over nine cell types and ∼1800 samples from five simulated and real-world datasets. We demonstrate that computational deconvolution performs at high accuracy for well-defined cell-type signatures and propose how fuzzy cell-type signatures can be improved. We suggest that future efforts should be dedicated to refining cell population definitions and finding reliable signatures. Availability and implementation A snakemake pipeline to reproduce the benchmark is available at https://github.com/grst/immune_deconvolution_benchmark. An R package allows the community to perform integrated deconvolution using different methods (https://grst.github.io/immunedeconv). Supplementary information Supplementary data are available at Bioinformatics online.

...read moreread less

Journal Article•DOI•

BBKNN: fast batch alignment of single cell transcriptomes.

[...]

Krzysztof Polanski¹, Matthew D. Young¹, Zhichao Miao², Zhichao Miao¹, Kerstin B. Meyer¹, Sarah A. Teichmann³, Sarah A. Teichmann¹, Jong-Eun Park¹ - Show less +4 more•Institutions (3)

Wellcome Trust Sanger Institute¹, European Bioinformatics Institute², University of Cambridge³

10 Aug 2019-Bioinformatics

TL;DR: BBKNN, an extremely fast graph-based data integration algorithm, is developed and illustrated the power of BBKNN on large scale mouse atlasing data, and can be installed from pip.

...read moreread less

Abstract: Motivation Increasing numbers of large scale single cell RNA-Seq projects are leading to a data explosion, which can only be fully exploited through data integration. A number of methods have been developed to combine diverse datasets by removing technical batch effects, but most are computationally intensive. To overcome the challenge of enormous datasets, we have developed BBKNN, an extremely fast graph-based data integration algorithm. We illustrate the power of BBKNN on large scale mouse atlasing data, and favourably benchmark its run time against a number of competing methods. Availability and implementation BBKNN is available at https://github.com/Teichlab/bbknn, along with documentation and multiple example notebooks, and can be installed from pip. Supplementary information Supplementary data are available at Bioinformatics online.

...read moreread less

Journal Article•DOI•

Compound-protein interaction prediction with end-to-end learning of neural networks for graphs and sequences.

[...]

Masashi Tsubaki¹, Kentaro Tomii¹, Kentaro Tomii², Jun Sese², Jun Sese¹ - Show less +1 more•Institutions (2)

National Institute of Advanced Industrial Science and Technology¹, Tokyo Institute of Technology²

15 Jan 2019-Bioinformatics

TL;DR: This study investigates the use of end‐to‐end representation learning for compounds and proteins, integrates the representations, and develops a new CPI prediction approach by combining a graph neural network (GNN) for compound and a convolutional Neural Network (CNN) for proteins.

...read moreread less

Abstract: Motivation In bioinformatics, machine learning-based methods that predict the compound-protein interactions (CPIs) play an important role in the virtual screening for drug discovery. Recently, end-to-end representation learning for discrete symbolic data (e.g. words in natural language processing) using deep neural networks has demonstrated excellent performance on various difficult problems. For the CPI problem, data are provided as discrete symbolic data, i.e. compounds are represented as graphs where the vertices are atoms, the edges are chemical bonds, and proteins are sequences in which the characters are amino acids. In this study, we investigate the use of end-to-end representation learning for compounds and proteins, integrate the representations, and develop a new CPI prediction approach by combining a graph neural network (GNN) for compounds and a convolutional neural network (CNN) for proteins. Results Our experiments using three CPI datasets demonstrated that the proposed end-to-end approach achieves competitive or higher performance as compared to various existing CPI prediction methods. In addition, the proposed approach significantly outperformed existing methods on an unbalanced dataset. This suggests that data-driven representations of compounds and proteins obtained by end-to-end GNNs and CNNs are more robust than traditional chemical and biological features obtained from databases. Although analyzing deep learning models is difficult due to their black-box nature, we address this issue using a neural attention mechanism, which allows us to consider which subsequences in a protein are more important for a drug compound when predicting its interaction. The neural attention mechanism also provides effective visualization, which makes it easier to analyze a model even when modeling is performed using real-valued representations instead of discrete features. Availability and implementation https://github.com/masashitsubaki. Supplementary information Supplementary data are available at Bioinformatics online.

...read moreread less

Journal Article•DOI•

Scaling read aligners to hundreds of threads on general-purpose processors.

[...]

Ben Langmead¹, Christopher Wilks¹, Valentin Antonescu¹, Rone Charles¹•Institutions (1)

Johns Hopkins University¹

01 Feb 2019-Bioinformatics

TL;DR: This work greatly improve thread scaling in many scenarios, including on the recent Intel Xeon Phi architecture, and highlights how bottlenecks are exacerbated by variable‐record‐length file formats like FASTQ and suggest changes that enable superior scaling.

...read moreread less

Abstract: Motivation General-purpose processors can now contain many dozens of processor cores and support hundreds of simultaneous threads of execution. To make best use of these threads, genomics software must contend with new and subtle computer architecture issues. We discuss some of these and propose methods for improving thread scaling in tools that analyze each read independently, such as read aligners. Results We implement these methods in new versions of Bowtie, Bowtie 2 and HISAT. We greatly improve thread scaling in many scenarios, including on the recent Intel Xeon Phi architecture. We also highlight how bottlenecks are exacerbated by variable-record-length file formats like FASTQ and suggest changes that enable superior scaling. Availability and implementation Experiments for this study: https://github.com/BenLangmead/bowtie-scaling. Bowtie http://bowtie-bio.sourceforge.net. Bowtie 2 http://bowtie-bio.sourceforge.net/bowtie2. Hisat http://www.ccb.jhu.edu/software/hisat. Supplementary information Supplementary data are available at Bioinformatics online.

...read moreread less

Journal Article•DOI•

DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays

[...]

Amrit Singh¹, Casey P. Shannon¹, Benoit Gautier², Florian Rohart², Michael Vacher³, Scott J. Tebbutt¹, Kim-Anh Lê Cao⁴ - Show less +3 more•Institutions (4)

University of British Columbia¹, University of Queensland², Commonwealth Scientific and Industrial Research Organisation³, University of Melbourne⁴

01 Sep 2019-Bioinformatics

TL;DR: DIABLO is a multi-omics integrative method that seeks for common information across different data types through the selection of a subset of molecular features, while discriminating between multiple phenotypic groups, while achieving predictive performance comparable to state-of-the-art supervised approaches.

...read moreread less

Abstract: In the continuously expanding omics era, novel computational and statistical strategies are needed for data integration and identification of biomarkers and molecular signatures. We present DIABLO, a multi-omics integrative method that seeks for common information across different data types through the selection of a subset of molecular features, while discriminating between multiple phenotypic groups. Using simulations and benchmark multi-omics studies, we show that DIABLO identifies features with superior biological relevance compared to existing unsupervised integrative methods, while achieving predictive performance comparable to state-of-the-art supervised approaches. DIABLO is versatile, allowing for modular-based analyses and cross-over study designs. In two case studies, DIABLO identified both known and novel multi-omics biomarkers consisting of mRNAs, miRNAs, CpGs, proteins and metabolites. DIABLO is implemented in the mixOmics R Bioconductor package with functions for parameters choise and visualisation to assist in the interpretation of the integrative analyses, along with tutorials on http://mixomics.org and in our Bioconductor vignette. Supplementary information is available at Bioinformatics online.

...read moreread less

Journal Article•DOI•

deepDR: a network-based deep learning approach to in silico drug repositioning.

[...]

Xiangxiang Zeng¹, Siyi Zhu¹, Xiangrong Liu¹, Yadi Zhou², Ruth Nussinov³, Feixiong Cheng⁴, Feixiong Cheng², Feixiong Cheng⁵ - Show less +4 more•Institutions (5)

Xiamen University¹, Cleveland Clinic Lerner Research Institute², Tel Aviv University³, Cleveland Clinic Lerner College of Medicine⁴, Case Western Reserve University⁵

15 Dec 2019-Bioinformatics

TL;DR: A network-based deep-learning approach for in silico drug repurposing by integrating 10 networks, termed deepDR, which learns high-level features of drugs from the heterogeneous networks by a multimodal deep autoencoder and infer candidates for approved drugs for which they were not originally approved.

...read moreread less

Abstract: Motivation Traditional drug discovery and development are often time-consuming and high risk. Repurposing/repositioning of approved drugs offers a relatively low-cost and high-efficiency approach toward rapid development of efficacious treatments. The emergence of large-scale, heterogeneous biological networks has offered unprecedented opportunities for developing in silico drug repositioning approaches. However, capturing highly non-linear, heterogeneous network structures by most existing approaches for drug repositioning has been challenging. Results In this study, we developed a network-based deep-learning approach, termed deepDR, for in silico drug repurposing by integrating 10 networks: one drug-disease, one drug-side-effect, one drug-target and seven drug-drug networks. Specifically, deepDR learns high-level features of drugs from the heterogeneous networks by a multi-modal deep autoencoder. Then the learned low-dimensional representation of drugs together with clinically reported drug-disease pairs are encoded and decoded collectively via a variational autoencoder to infer candidates for approved drugs for which they were not originally approved. We found that deepDR revealed high performance [the area under receiver operating characteristic curve (AUROC) = 0.908], outperforming conventional network-based or machine learning-based approaches. Importantly, deepDR-predicted drug-disease associations were validated by the ClinicalTrials.gov database (AUROC = 0.826) and we showcased several novel deepDR-predicted approved drugs for Alzheimer's disease (e.g. risperidone and aripiprazole) and Parkinson's disease (e.g. methylphenidate and pergolide). Availability and implementation Source code and data can be downloaded from https://github.com/ChengF-Lab/deepDR. Supplementary information Supplementary data are available online at Bioinformatics.

...read moreread less

Journal Article•DOI•

Benchmarking fold detection by DaliLite v.5.

[...]

Liisa Holm¹•Institutions (1)

University of Helsinki¹

15 Dec 2019-Bioinformatics

TL;DR: A new version of the DaliLite standalone software is released, with novelties are hierarchical search of the structure database organized into sequence based clusters, and remote access to the knowledge base of structural neighbors.

...read moreread less

Abstract: Motivation Protein structure comparison plays a fundamental role in understanding the evolutionary relationships between proteins. Here, we release a new version of the DaliLite standalone software. The novelties are hierarchical search of the structure database organized into sequence based clusters, and remote access to our knowledge base of structural neighbors. The detection of fold, superfamily and family level similarities by DaliLite and state-of-the-art competitors was benchmarked against a manually curated structural classification. Results Database search strategies were evaluated using Fmax with query-specific thresholds. DaliLite and DeepAlign outperformed TM-score based methods at all levels of the benchmark, and DaliLite outperformed DeepAlign at fold level. Hierarchical and knowledge-based searches got close to the performance of systematic pairwise comparison. The knowledge-based search was four times as efficient as the hierarchical search. The knowledge-based search dynamically adjusts the depth of the search, enabling a trade-off between speed and recall. Availability and implementation http://ekhidna2.biocenter.helsinki.fi/dali/README.v5.html. Supplementary information Supplementary data are available at Bioinformatics online.

...read moreread less

Journal Article•DOI•

GRNBoost2 and Arboreto: efficient and scalable inference of gene regulatory networks.

[...]

Thomas Moerman¹, Thomas Moerman², Sara Aibar Santos², Carmen Bravo González-Blas², Jaak Simm², Jaak Simm¹, Yves Moreau¹, Yves Moreau², Jan Aerts², Jan Aerts¹, Stein Aerts² - Show less +7 more•Institutions (2)

IMEC¹, Katholieke Universiteit Leuven²

01 Jun 2019-Bioinformatics

TL;DR: Arboreto as discussed by the authors is a computational framework that scales up GRN inference algorithms complying with the GENIE3 architecture and includes both GRNBoost2 and an improved implementation of GENIE-3, as a user-friendly open source Python package.

...read moreread less

Abstract: Summary Inferring a Gene Regulatory Network (GRN) from gene expression data is a computationally expensive task, exacerbated by increasing data sizes due to advances in high-throughput gene profiling technology, such as single-cell RNA-seq. To equip researchers with a toolset to infer GRNs from large expression datasets, we propose GRNBoost2 and the Arboreto framework. GRNBoost2 is an efficient algorithm for regulatory network inference using gradient boosting, based on the GENIE3 architecture. Arboreto is a computational framework that scales up GRN inference algorithms complying with this architecture. Arboreto includes both GRNBoost2 and an improved implementation of GENIE3, as a user-friendly open source Python package. Availability and implementation Arboreto is available under the 3-Clause BSD license at http://arboreto.readthedocs.io. Supplementary information Supplementary data are available at Bioinformatics online.

...read moreread less

Journal Article•DOI•

Genome Detective: an automated system for virus identification from high-throughput sequencing data.

[...]

Michael Vilsker, Yumna Moosa¹, Sam Nooij, Vagner Fonseca², Vagner Fonseca¹, Vagner Fonseca³, Yoika Ghysens, Korneel Dumon, Raf Pauwels, Luiz Carlos Junior Alcantara², Luiz Carlos Junior Alcantara³, Ewout Vanden Eynden⁴, Anne-Mieke Vandamme⁴, Anne-Mieke Vandamme⁵, Koen Deforche, Tulio de Oliveira¹ - Show less +12 more•Institutions (5)

University of KwaZulu-Natal¹, Oswaldo Cruz Foundation², Universidade Federal de Minas Gerais³, Rega Institute for Medical Research⁴, Universidade Nova de Lisboa⁵

01 Mar 2019-Bioinformatics

TL;DR: An easy to use web-based software application that assembles the genomes of viruses quickly and accurately using a novel alignment method that constructs genomes by reference-based linking of de novo contigs by combining amino-acids and nucleotide scores.

...read moreread less

Abstract: SUMMARY Genome Detective is an easy to use web-based software application that assembles the genomes of viruses quickly and accurately. The application uses a novel alignment method that constructs genomes by reference-based linking of de novo contigs by combining amino-acids and nucleotide scores. The software was optimized using synthetic datasets to represent the great diversity of virus genomes. The application was then validated with next generation sequencing data of hundreds of viruses. User time is minimal and it is limited to the time required to upload the data. AVAILABILITY AND IMPLEMENTATION Available online: http://www.genomedetective.com/app/typingtool/virus/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

...read moreread less

Journal Article•DOI•

AlphaFold at CASP13

[...]

Mohammed AlQuraishi¹•Institutions (1)

Harvard University¹

01 Nov 2019-Bioinformatics

TL;DR: The significance of DeepMind's entry within the broader history of CASP, relate AlphaFold's methodological advances to prior work, and speculate on the future of this important problem are contextualized.

...read moreread less

Abstract: Summary: Computational prediction of protein structure from sequence is broadly viewed as a foundational problem of biochemistry and one of the most difficult challenges in bioinformatics. Once every two years the Critical Assessment of protein Structure Prediction (CASP) experiments are held to assess the state of the art in the field in a blind fashion, by presenting predictor groups with protein sequences whose structures have been solved but have not yet been made publicly available. The first CASP was organized in 1994, and the latest, CASP13, took place last December, when for the first time the industrial laboratory DeepMind entered the competition. DeepMind's entry, AlphaFold, placed first in the Free Modeling (FM) category, which assesses methods on their ability to predict novel protein folds (the Zhang group placed first in the Template-Based Modeling (TBM) category, which assess methods on predicting proteins whose folds are related to ones already in the Protein Data Bank.) DeepMind's success generated significant public interest. Their approach builds on two ideas developed in the academic community during the preceding decade: (i) the use of co-evolutionary analysis to map residue co-variation in protein sequence to physical contact in protein structure, and (ii) the application of deep neural networks to robustly identify patterns in protein sequence and co-evolutionary couplings and convert them into contact maps. In this Letter, we contextualize the significance of DeepMind's entry within the broader history of CASP, relate AlphaFold's methodological advances to prior work, and speculate on the future of this important problem.

...read moreread less

Journal Article•DOI•

DeepAffinity: interpretable deep learning of compound-protein affinity through unified recurrent and convolutional neural networks.

[...]

Mostafa Karimi, Di Wu, Zhangyang Wang¹, Yang Shen•Institutions (1)

Texas A&M University¹

15 Sep 2019-Bioinformatics

TL;DR: A semi-supervised deep learning model that unifies recurrent and convolutional neural networks has been proposed to exploit both unlabeled and labeled data, for jointly encoding molecular representations and predicting affinities.

...read moreread less

Abstract: Motivation Drug discovery demands rapid quantification of compound-protein interaction (CPI). However, there is a lack of methods that can predict compound-protein affinity from sequences alone with high applicability, accuracy and interpretability. Results We present a seamless integration of domain knowledges and learning-based approaches. Under novel representations of structurally annotated protein sequences, a semi-supervised deep learning model that unifies recurrent and convolutional neural networks has been proposed to exploit both unlabeled and labeled data, for jointly encoding molecular representations and predicting affinities. Our representations and models outperform conventional options in achieving relative error in IC50 within 5-fold for test cases and 20-fold for protein classes not included for training. Performances for new protein classes with few labeled data are further improved by transfer learning. Furthermore, separate and joint attention mechanisms are developed and embedded to our model to add to its interpretability, as illustrated in case studies for predicting and explaining selective drug-target interactions. Lastly, alternative representations using protein sequences or compound graphs and a unified RNN/GCNN-CNN model using graph CNN (GCNN) are also explored to reveal algorithmic challenges ahead. Availability and implementation Data and source codes are available at https://github.com/Shen-Lab/DeepAffinity. Supplementary information Supplementary data are available at Bioinformatics online.

...read moreread less

Journal Article•DOI•

Graph embedding on biomedical networks: methods, applications and evaluations.

[...]

Xiang Yue, Zhen Wang, Jingong Huang¹, Srinivasan Parthasarathy, Soheil Moosavinasab², Yungui Huang², Simon Lin², Wen Zhang³, Ping Zhang¹, Huan Sun - Show less +6 more•Institutions (3)

Ohio State University¹, The Research Institute at Nationwide Children's Hospital², Huazhong Agricultural University³

04 Oct 2019-Bioinformatics

TL;DR: Xiang et al. as mentioned in this paper evaluated graph embedding methods on biomedical networks and found that the learned embeddings can be treated as complementary representations for the biological features, and provided general guidelines for properly selecting graph embeding methods and setting their hyper-parameters for different biomedical tasks.

...read moreread less

Abstract: Motivation Graph embedding learning that aims to automatically learn low-dimensional node representations, has drawn increasing attention in recent years. To date, most recent graph embedding methods are evaluated on social and information networks and are not comprehensively studied on biomedical networks under systematic experiments and analyses. On the other hand, for a variety of biomedical network analysis tasks, traditional techniques such as matrix factorization (which can be seen as a type of graph embedding methods) have shown promising results, and hence there is a need to systematically evaluate the more recent graph embedding methods (e.g. random walk-based and neural network-based) in terms of their usability and potential to further the state-of-the-art. Results We select 11 representative graph embedding methods and conduct a systematic comparison on 3 important biomedical link prediction tasks: drug-disease association (DDA) prediction, drug-drug interaction (DDI) prediction, protein-protein interaction (PPI) prediction; and 2 node classification tasks: medical term semantic type classification, protein function prediction. Our experimental results demonstrate that the recent graph embedding methods achieve promising results and deserve more attention in the future biomedical graph analysis. Compared with three state-of-the-art methods for DDAs, DDIs and protein function predictions, the recent graph embedding methods achieve competitive performance without using any biological features and the learned embeddings can be treated as complementary representations for the biological features. By summarizing the experimental results, we provide general guidelines for properly selecting graph embedding methods and setting their hyper-parameters for different biomedical tasks. Availability and implementation As part of our contributions in the paper, we develop an easy-to-use Python package with detailed instructions, BioNEV, available at: https://github.com/xiangyue9607/BioNEV, including all source code and datasets, to facilitate studying various graph embedding methods on biomedical tasks. Supplementary information Supplementary data are available at Bioinformatics online.

...read moreread less

Journal Article•DOI•

Genetic association testing using the GENESIS R/Bioconductor package.

[...]

Stephanie M. Gogarten¹, Tamar Sofer², Tamar Sofer³, Han Chen⁴, Chaoyu Yu¹, Jennifer A. Brody¹, Timothy A. Thornton¹, Kenneth Rice¹, Matthew P. Conomos¹ - Show less +5 more•Institutions (4)

University of Washington¹, Brigham and Women's Hospital², Harvard University³, University of Texas Health Science Center at Houston⁴

15 Dec 2019-Bioinformatics

TL;DR: GDS format provides efficient storage and retrieval of genotypes measured by microarrays and sequencing, and GENESIS implements highly flexible mixed models, allowing for different link functions, multiple variance components, and phenotypic heteroskedasticity.

...read moreread less

Abstract: Summary The Genomic Data Storage (GDS) format provides efficient storage and retrieval of genotypes measured by microarrays and sequencing. We developed GENESIS to perform various single- and aggregate-variant association tests using genotype data stored in GDS format. GENESIS implements highly flexible mixed models, allowing for different link functions, multiple variance components and phenotypic heteroskedasticity. GENESIS integrates cohesively with other R/Bioconductor packages to build a complete genomic analysis workflow entirely within the R environment. Availability and implementation https://bioconductor.org/packages/GENESIS; vignettes included. Supplementary information Supplementary data are available at Bioinformatics online.

...read moreread less

Journal Article•DOI•

GToTree: a user-friendly workflow for phylogenomics.

[...]

Michael D. Lee¹•Institutions (1)

Ames Research Center¹

15 Oct 2019-Bioinformatics

TL;DR: GToTree is a command-line tool that can take any combination of fasta files, GenBank files and/or NCBI assembly accessions as input and outputs an alignment file, estimates of genome completeness and redundancy, and a phylogenomic tree based on a specified single-copy gene (SCG) set.

...read moreread less

Abstract: Summary Genome-level evolutionary inference (i.e. phylogenomics) is becoming an increasingly essential step in many biologists' work. Accordingly, there are several tools available for the major steps in a phylogenomics workflow. But for the biologist whose main focus is not bioinformatics, much of the computational work required-such as accessing genomic data on large scales, integrating genomes from different file formats, performing required filtering, stitching different tools together etc.-can be prohibitive. Here I introduce GToTree, a command-line tool that can take any combination of fasta files, GenBank files and/or NCBI assembly accessions as input and outputs an alignment file, estimates of genome completeness and redundancy, and a phylogenomic tree based on a specified single-copy gene (SCG) set. Although GToTree can work with any custom hidden Markov Models (HMMs), also included are 13 newly generated SCG-set HMMs for different lineages and levels of resolution, built based on searches of ∼12 000 bacterial and archaeal high-quality genomes. GToTree aims to give more researchers the capability to make phylogenomic trees. Availability and implementation GToTree is open-source and freely available for download from: github.com/AstrobioMike/GToTree. It is implemented primarily in bash with helper scripts written in python. Supplementary information Supplementary data are available at Bioinformatics online.

...read moreread less

Journal Article•DOI•

BulkVis: a graphical viewer for Oxford nanopore bulk FAST5 files.

[...]

Alexander Payne¹, Nadine Holmes¹, Vardhman K. Rakyan², Matthew Loose¹•Institutions (2)

University of Nottingham¹, Queen Mary University of London²

01 Jul 2019-Bioinformatics

TL;DR: Long reads can be incorrectly divided by MinKNOW resulting in single DNA molecules being split into two or more reads, and helper scripts are provided that identify and reconstruct split reads given a sequencing summary file and alignment to a reference.

...read moreread less

Abstract: MOTIVATION The Oxford Nanopore Technologies (ONT) MinION is used for sequencing a wide variety of sample types with diverse methods of sample extraction. Nanopore sequencers output FAST5 files containing signal data subsequently base called to FASTQ format. Optionally, ONT devices can collect data from all sequencing channels simultaneously in a bulk FAST5 file enabling inspection of signal in any channel at any point. We sought to visualize this signal to inspect challenging or difficult to sequence samples. RESULTS The BulkVis tool can load a bulk FAST5 file and overlays MinKNOW (the software that controls ONT sequencers) classifications on the signal trace and can show mappings to a reference. Users can navigate to a channel and time or, given a FASTQ header from a read, jump to its specific position. BulkVis can export regions as Nanopore base caller compatible reads. Using BulkVis, we find long reads can be incorrectly divided by MinKNOW resulting in single DNA molecules being split into two or more reads. The longest seen to date is 2 272 580 bases in length and reported in eleven consecutive reads. We provide helper scripts that identify and reconstruct split reads given a sequencing summary file and alignment to a reference. We note that incorrect read splitting appears to vary according to input sample type and is more common in 'ultra-long' read preparations. AVAILABILITY AND IMPLEMENTATION The software is available freely under an MIT license at https://github.com/LooseLab/bulkvis. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

...read moreread less

Journal Article•DOI•

MMseqs2 desktop and local web server app for fast, interactive sequence searches

[...]

Milot Mirdita¹, Martin Steinegger¹, Martin Steinegger², Johannes Söding¹•Institutions (2)

Max Planck Society¹, Seoul National University²

15 Aug 2019-Bioinformatics

TL;DR: The MMseqs2 desktop and web server app facilitates interactive sequence searches through custom protein sequence and profile databases on personal workstations by eliminating its runtime overhead by reducing response times to a few seconds at sensitivities close to BLAST.

...read moreread less

Abstract: Summary The MMseqs2 desktop and web server app facilitates interactive sequence searches through custom protein sequence and profile databases on personal workstations. By eliminating MMseqs2's runtime overhead, we reduced response times to a few seconds at sensitivities close to BLAST. Availability and implementation The app is easy to install for non-experts. GPLv3-licensed code, pre-built desktop app packages for Windows, MacOS and Linux, Docker images for the web server application and a demo web server are available at https://search.mmseqs.com. Supplementary information Supplementary data are available at Bioinformatics online.

...read moreread less

Journal Article•DOI•

MOLI: multi-omics late integration with deep neural networks for drug response prediction.

[...]

Hossein Sharifi-Noghabi¹, Hossein Sharifi-Noghabi², Olga Zolotareva³, Colin Collins², Colin Collins⁴, Martin Ester¹, Martin Ester² - Show less +3 more•Institutions (4)

Simon Fraser University¹, Vancouver Prostate Centre², Bielefeld University³, University of British Columbia⁴

15 Jul 2019-Bioinformatics

TL;DR: MOLI, a multi-omics late integration method based on deep neural networks, achieves higher prediction accuracy in external validations and its high predictive power suggests it may have utility in precision oncology.

...read moreread less

Abstract: Motivation Historically, gene expression has been shown to be the most informative data for drug response prediction. Recent evidence suggests that integrating additional omics can improve the prediction accuracy which raises the question of how to integrate the additional omics. Regardless of the integration strategy, clinical utility and translatability are crucial. Thus, we reasoned a multi-omics approach combined with clinical datasets would improve drug response prediction and clinical relevance. Results We propose MOLI, a multi-omics late integration method based on deep neural networks. MOLI takes somatic mutation, copy number aberration and gene expression data as input, and integrates them for drug response prediction. MOLI uses type-specific encoding sub-networks to learn features for each omics type, concatenates them into one representation and optimizes this representation via a combined cost function consisting of a triplet loss and a binary cross-entropy loss. The former makes the representations of responder samples more similar to each other and different from the non-responders, and the latter makes this representation predictive of the response values. We validate MOLI on in vitro and in vivo datasets for five chemotherapy agents and two targeted therapeutics. Compared to state-of-the-art single-omics and early integration multi-omics methods, MOLI achieves higher prediction accuracy in external validations. Moreover, a significant improvement in MOLI's performance is observed for targeted drugs when training on a pan-drug input, i.e. using all the drugs with the same target compared to training only on drug-specific inputs. MOLI's high predictive power suggests it may have utility in precision oncology. Availability and implementation https://github.com/hosseinshn/MOLI. Supplementary information Supplementary data are available at Bioinformatics online.

...read moreread less

Journal Article•DOI•

M3Drop: dropout-based feature selection for scRNASeq.

[...]

Tallulah S. Andrews¹, Martin Hemberg¹•Institutions (1)

Wellcome Trust Sanger Institute¹

15 Aug 2019-Bioinformatics

TL;DR: M3Drop is an R package that implements popular existing feature selection methods and two novel methods which take advantage of the prevalence of zeros (dropouts) in scRNASeq data to identify features and outperform existing methods on simulated and real datasets.

...read moreread less

Abstract: Motivation Most genomes contain thousands of genes, but for most functional responses, only a subset of those genes are relevant. To facilitate many single-cell RNASeq (scRNASeq) analyses the set of genes is often reduced through feature selection, i.e. by removing genes only subject to technical noise. Results We present M3Drop, an R package that implements popular existing feature selection methods and two novel methods which take advantage of the prevalence of zeros (dropouts) in scRNASeq data to identify features. We show these new methods outperform existing methods on simulated and real datasets. Availability and implementation M3Drop is freely available on github as an R package and is compatible with other popular scRNASeq tools: https://github.com/tallulandrews/M3Drop. Supplementary information Supplementary data are available at Bioinformatics online.

...read moreread less

Journal Article•DOI•

Biomedical image augmentation using Augmentor.

[...]

Marcus Daniel Bloice¹, Peter M. Roth², Andreas Holzinger², Andreas Holzinger¹•Institutions (2)

University of Graz¹, Graz University of Technology²

01 Nov 2019-Bioinformatics

TL;DR: The Augmentor software package provides a stochastic, pipeline-based approach to image augmentation with a number of features that are relevant to biomedical imaging, such as z-stack augmentation and randomised elastic distortions.

...read moreread less

Abstract: Motivation Image augmentation is a frequently used technique in computer vision and has been seeing increased interest since the popularity of deep learning. Its usefulness is becoming more and more recognized due to deep neural networks requiring larger amounts of data to train, and because in certain fields, such as biomedical imaging, large amounts of labelled data are difficult to come by or expensive to produce. In biomedical imaging, features specific to this domain need to be addressed. Results Here we present the Augmentor software package for image augmentation. It provides a stochastic, pipeline-based approach to image augmentation with a number of features that are relevant to biomedical imaging, such as z-stack augmentation and randomized elastic distortions. The software has been designed to be highly extensible meaning an operation that might be specific to a highly specialized task can easily be added to the library, even at runtime. Although it has been designed as a general software library, it has features that are particularly relevant to biomedical imaging and the techniques required for this domain. Availability and implementation Augmentor is a Python package made available under the terms of the MIT licence. Source code can be found on GitHub under https://github.com/mdbloice/Augmentor and installation is via the pip package manager (A Julia version of the package, developed in parallel by Christof Stocker, is also available under https://github.com/Evizero/Augmentor.jl).

...read moreread less

Journal Article•DOI•

NeoDTI: neural integration of neighbor information from a heterogeneous network for discovering new drug-target interactions.

[...]

Fangping Wan¹, Lixiang Hong¹, An Xiao¹, Tao Jiang¹, Jianyang Zeng¹ - Show less +1 more•Institutions (1)

Tsinghua University¹

01 Jan 2019-Bioinformatics

TL;DR: A new nonlinear end‐to‐end learning model that integrates diverse information from heterogeneous network data and automatically learns topology‐preserving representations of drugs and targets to facilitate DTI prediction is developed, suggesting that NeoDTI can offer a powerful and robust tool for drug development and drug repositioning.

...read moreread less

Abstract: Motivation Accurately predicting drug-target interactions (DTIs) in silico can guide the drug discovery process and thus facilitate drug development. Computational approaches for DTI prediction that adopt the systems biology perspective generally exploit the rationale that the properties of drugs and targets can be characterized by their functional roles in biological networks. Results Inspired by recent advance of information passing and aggregation techniques that generalize the convolution neural networks to mine large-scale graph data and greatly improve the performance of many network-related prediction tasks, we develop a new nonlinear end-to-end learning model, called NeoDTI, that integrates diverse information from heterogeneous network data and automatically learns topology-preserving representations of drugs and targets to facilitate DTI prediction. The substantial prediction performance improvement over other state-of-the-art DTI prediction methods as well as several novel predicted DTIs with evidence supports from previous studies have demonstrated the superior predictive power of NeoDTI. In addition, NeoDTI is robust against a wide range of choices of hyperparameters and is ready to integrate more drug and target related information (e.g. compound-protein binding affinity data). All these results suggest that NeoDTI can offer a powerful and robust tool for drug development and drug repositioning. Availability and implementation The source code and data used in NeoDTI are available at: https://github.com/FangpingWan/NeoDTI. Supplementary information Supplementary data are available at Bioinformatics online.

...read moreread less

Collapse