Open access - Journal Article - DOI: 10.1093/BIOINFORMATICS/BTAB147

DeepViral: prediction of novel virus-host interactions from protein sequences and infectious disease phenotypes.

02 Mar 2021 - Bioinformatics (Oxford University Press (OUP)) - Vol. 37, Iss. 17, pp. 2722-2729
Abstract: Motivation: Infectious diseases caused by novel viruses have become a major public health concern. Rapid identification of virus-host interactions can reveal mechanistic insights into infectious diseases and shed light on potential treatments. Current computational prediction methods for novel viruses are based mainly on protein sequences. However, it is not clear to what extent other important features, such as the symptoms caused by the viruses, could contribute to a predictor. Disease phenotypes (i.e., signs and symptoms) are readily accessible from clinical diagnosis and we hypothesize that they may act as a potential proxy and an additional source of information for the underlying molecular interactions between the pathogens and hosts. Results: We developed DeepViral, a deep learning based method that predicts protein-protein interactions (PPI) between humans and viruses. Motivated by the potential utility of infectious disease phenotypes, we first embedded human proteins and viruses in a shared space using their associated phenotypes and functions, supported by formalized background knowledge from biomedical ontologies. By jointly learning from protein sequences and phenotype features, DeepViral significantly improves over existing sequence-based methods for intra- and inter-species PPI prediction. Availability: Code and datasets for reproduction and customization are available at Prediction results for 14 virus families are available at


8 results found

Open access - Posted Content - DOI: 10.1101/2021.03.25.437037
Ngan Thi Dong, Megha Khosla
26 Mar 2021 - bioRxiv
Abstract: Understanding the interaction patterns between a particular virus and human proteins plays a crucial role in unveiling the underlying mechanism of viral infection. This could further help in developing treatments for viral diseases. The main issues in tackling it as a machine learning problem are the scarcity of training data as well as of input information about the viral proteins. We overcome these limitations by exploiting powerful statistical protein representations derived from a corpus of around 24 million protein sequences in a multi-task framework. Our experiments on 7 varied benchmark datasets support the superiority of our approach.

4 Citations

Open access - Journal Article - DOI: 10.1098/RSTB.2020.0358
Colin J. Carlson, Maxwell J. Farrell, Zoe Grange, Barbara A. Han, +29 more
Abstract: In the light of the urgency raised by the COVID-19 pandemic, global investment in wildlife virology is likely to increase, and new surveillance programmes will identify hundreds of novel viruses that might someday pose a threat to humans. To support the extensive task of laboratory characterization, scientists may increasingly rely on data-driven rubrics or machine learning models that learn from known zoonoses to identify which animal pathogens could someday pose a threat to global health. We synthesize the findings of an interdisciplinary workshop on zoonotic risk technologies to answer the following questions. What are the prerequisites, in terms of open data, equity and interdisciplinary collaboration, to the development and application of those tools? What effect could the technology have on global health? Who would control that technology, who would have access to it and who would benefit from it? Would it improve pandemic prevention? Could it create new challenges? This article is part of the theme issue 'Infectious disease macroecology: parasite diversity and dynamics across the globe'.

Topics: Disease reservoir (51%), Airborne transmission (50%), Global health (50%)

2 Citations

Open access - Journal Article - DOI: 10.1093/BIOINFORMATICS/BTAB533
Xiaodi Yang, Shiping Yang, Xianyi Lian, Stefan Wuchty, +1 more
17 Jul 2021 - Bioinformatics
Abstract: Motivation: To complement experimental efforts, machine learning-based computational methods are playing an increasingly important role to predict human-virus protein-protein interactions (PPIs). Furthermore, transfer learning can effectively apply prior knowledge obtained from a large source dataset/task to a small target dataset/task, improving prediction performance. Results: To predict interactions between human and viral proteins, we combine evolutionary sequence profile features with a Siamese convolutional neural network (CNN) architecture and a multi-layer perceptron. Our architecture outperforms various feature encodings-based machine learning and state-of-the-art prediction methods. As our main contribution, we introduce two transfer learning methods (i.e., 'frozen' type and 'fine-tuning' type) that reliably predict interactions in a target human-virus domain based on training in a source human-virus domain, by retraining CNN layers. Finally, we utilize the 'frozen' type transfer learning approach to predict human-SARS-CoV-2 PPIs, indicating that our predictions are topologically and functionally similar to experimentally known interactions. Supplementary information: Supplementary data are available at Bioinformatics online.

Topics: Transfer of learning (56%), Convolutional neural network (54%), Perceptron (54%)

Journal Article - DOI: 10.1093/BIOINFORMATICS/BTAB737
Xiaotian Hu, Cong Feng, Yincong Zhou, Andrew Harrison, +1 more
25 Oct 2021 - Bioinformatics
Abstract: Motivation: Protein-protein interaction (PPI), as a relative property, is determined by two binding proteins, which brings a great challenge to design an expert model with an unbiased learning architecture and a superior generalization performance. Additionally, few efforts have been made to allow PPI predictors to discriminate between relative properties and intrinsic properties. Results: We present a sequence-based approach, DeepTrio, for PPI prediction using mask multiple parallel convolutional neural networks. Experimental evaluations show that DeepTrio achieves a better performance over several state-of-the-art methods in terms of various quality metrics. Besides, DeepTrio is extended to provide additional insights into the contribution of each input neuron to the prediction results. Availability: We provide an online application at The DeepTrio models and training data are deposited at Supplementary information: Supplementary data are available at Bioinformatics online.

Open access - Posted Content - DOI: 10.1101/2021.06.25.449930
Wadie B, Kleshchevnikov, Sandaltzopoulou E, Caroline Benz, +1 more
26 Jun 2021 - bioRxiv
Abstract: Linear motifs have an integral role in dynamic cell functions including cell signalling, the cell cycle and others. However, due to their small size, low complexity, degenerate nature, and frequent mutations, identifying novel functional motifs is a challenging task. Viral proteins rely extensively on the molecular mimicry of cellular linear motifs for modifying cell signalling and other processes in ways that favour viral infection. This study aims to discover human linear motifs convergently evolved also in disordered regions of viral proteins, under the hypothesis that these will result in enrichment in functional motif instances. We systematically apply computational motif prediction, combined with implementation of several functional and structural filters to the most recent publicly available human-viral and human-human protein interaction network. By limiting the search space to the sequences of viral proteins, we observed an increase in the sensitivity of motif prediction, as well as improved enrichment in known instances compared to the same analysis using only human protein interactions. We identified > 8,400 motif instances at various confidence levels, 105 of which were supported by all functional and structural filters applied. Overall, we provide a pipeline to improve the identification of functional linear motifs from interactomics datasets and a comprehensive catalogue of putative human motifs that can contribute to our understanding of the human domain-linear motif code and the mechanisms of viral interference with this.

Topics: Short linear motif (60%)


67 results found

Open access - Journal Article - DOI: 10.1038/75556
01 May 2000 - Nature Genetics
Abstract: Genomic sequencing has made it clear that a large fraction of the genes specifying the core biological functions are shared by all eukaryotes. Knowledge of the biological role of such shared proteins in one organism can often be transferred to other organisms. The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing. To this end, three independent ontologies accessible on the World-Wide Web are being constructed: biological process, molecular function and cellular component.

30,473 Citations

Open access - Journal Article
Abstract: Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.

Topics: Overfitting (66%), Deep learning (62%), Convolutional neural network (61%)

27,534 Citations
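
The dropout procedure summarized in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `dropout_train` and `dropout_test`, the retention probability `keep_prob`, and the toy activation values are all assumptions made for illustration. Scaling activations by `keep_prob` at test time is used here as a stand-in for the "smaller weights" of the single unthinned network.

```python
import random


def dropout_train(activations, keep_prob, rng):
    """Training pass: keep each unit with probability keep_prob, else zero it.

    Each call samples one "thinned" network.
    """
    return [a if rng.random() < keep_prob else 0.0 for a in activations]


def dropout_test(activations, keep_prob):
    """Test pass: keep every unit, scaled by keep_prob.

    Scaling the outputs is equivalent to shrinking the outgoing weights,
    and approximates averaging over the thinned networks.
    """
    return [a * keep_prob for a in activations]


# Hypothetical toy values, chosen only for illustration.
rng = random.Random(0)
acts = [1.0, 2.0, 3.0, 4.0]
keep_prob = 0.5
n_samples = 20000

# Average many thinned networks to compare against the test-time rule.
mean_train = [0.0] * len(acts)
for _ in range(n_samples):
    for i, v in enumerate(dropout_train(acts, keep_prob, rng)):
        mean_train[i] += v / n_samples

print("average over thinned networks:", mean_train)
print("single scaled network:        ", dropout_test(acts, keep_prob))
```

In this sketch the average over sampled thinned networks should land close to the output of the single scaled network, which is the approximation the abstract describes. Many current libraries instead rescale during training ("inverted" dropout); that variant is not shown here.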

Open access - Journal Article - DOI: 10.1016/J.PATREC.2005.10.010
Abstract: Receiver operating characteristics (ROC) graphs are useful for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years have been used increasingly in machine learning and data mining research. Although ROC graphs are apparently simple, there are some common misconceptions and pitfalls when using them in practice. The purpose of this article is to serve as an introduction to ROC graphs and as a guide for using them in research.

14,304 Citations

Open access - Journal Article - DOI: 10.1093/NAR/GKH073
David L. Wheeler, Deanna M. Church, Ron Edgar, Scott Federhen, +9 more
Abstract: In addition to maintaining the GenBank(R) nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides data analysis and retrieval resources for the data in GenBank and other biological data made available through NCBI's website. NCBI resources include Entrez, PubMed, PubMed Central, LocusLink, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Electronic PCR, OrfFinder, Spidey, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosome Aberration Project (CCAP), Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Clusters of Orthologous Groups (COGs) database, Retroviral Genotyping Tools, SARS Coronavirus Resource, SAGEmap, Gene Expression Omnibus (GEO), Online Mendelian Inheritance in Man (OMIM), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD) and the Conserved Domain Architecture Retrieval Tool (CDART). Augmenting many of the web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of the resources can be accessed through the NCBI home page at:

Topics: Entrez Gene (71%), Molecular Modeling Database (64%), Sequence profiling tool (64%)

8,599 Citations

Open access - Journal Article - DOI: 10.1093/NAR/GKY1131
Damian Szklarczyk, Annika L. Gable, David Lyon, Alexander Junge, +8 more
Abstract: Proteins and their functional interactions form the backbone of the cellular machinery. Their connectivity network needs to be considered for the full understanding of biological phenomena, but the available information on protein-protein associations is incomplete and exhibits varying levels of annotation granularity and reliability. The STRING database aims to collect, score and integrate all publicly available sources of protein-protein interaction information, and to complement these with computational predictions. Its goal is to achieve a comprehensive and objective global network, including direct (physical) as well as indirect (functional) interactions. The latest version of STRING (11.0) more than doubles the number of organisms it covers, to 5090. The most important new feature is an option to upload entire, genome-wide datasets as input, allowing users to visualize subsets as interaction networks and to perform gene-set enrichment analysis on the entire input. For the enrichment analysis, STRING implements well-known classification systems such as Gene Ontology and KEGG, but also offers additional, new classification systems based on high-throughput text-mining as well as on a hierarchical clustering of the association network itself. The STRING resource is available online at

5,475 Citations
