(Open Access) A multitask transfer learning framework for novel virus-human protein interactions (2021) | Ngan Thi Dong

To be presented at the ICLR Workshop on AI for Public Health 2021

A MULTITASK TRANSFER LEARNING FRAMEWORK FOR NOVEL VIRUS-HUMAN PROTEIN INTERACTIONS

Ngan Thi Dong & Megha Khosla

L3S Research Center, Leibniz University Hannover, Germany

ABSTRACT

Understanding the interaction patterns between a particular virus and human proteins plays a crucial role in unveiling the underlying mechanism of viral infection. This could further help in developing treatments of viral diseases. The main issues in tackling it as a machine learning problem are the scarcity of training data as well as of input information on the viral proteins. We overcome these limitations by exploiting powerful statistical protein representations derived from a corpus of around 24 million protein sequences in a multitask framework. Our experiments on 7 varied benchmark datasets support the superiority of our approach.

1 INTRODUCTION

Viral infections have been increasingly burdening healthcare systems. Biologically, a viral infection involves many protein-protein interactions (PPIs) between the virus and its host. These interactions range from the initial binding of viral coat proteins to the host membrane receptor to the hijacking of the host transcription machinery by viral proteins. In this work we develop a deep learning based computational model for predicting interactions between a novel (completely new) virus and human proteins.

One of the key challenges in tackling the current learning task with novel unseen viruses is the limited training data. Often, some known interactions of related viruses are used to train supervised models. These data are usually collected by wet lab experiments and are usually too scarce to ensure generalizability of the trained models. In effect, the trained models might overfit the training data and give inaccurate predictions for the novel virus.

Moreover, viral proteins are substantially different from human or bacterial proteins. They are structurally dynamic, so they cannot be easily detected by common sequence-structure comparison (Requião et al., 2020). Virus protein sequences of different species have little in common (Eid et al., 2016). Therefore, models trained for human PPI (Li & Ilie, 2020; Sun et al., 2017; Li, 2020; Chen et al., 2019; Sarkar & Saha, 2019) or for other pathogen-human PPI (Sudhakar et al., 2020; Mei & Zhang, 2020; Dick et al., 2020; Li et al., 2014; Guven-Maiorov et al., 2019; Basit et al., 2018), for which more data might be available, cannot be directly used for predicting novel viral-human protein interactions.

While for human proteins, features related to their function, semantic annotation, domain, structure, pathway, etc. can be extracted from public databases, such information is not readily available for viral proteins. The only reliable source of viral protein information is its amino acid sequence. Learning effective representations of the viral proteins is thus an important step towards building the prediction model. Heuristics such as K-mer composition usually used for protein representations are bound to fail as it is known that viral proteins with completely different sequences might show similar interaction patterns.
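To make the limitation of K-mer composition concrete, the following minimal sketch computes normalized K-mer frequency profiles for two sequences and compares them. The two toy sequences, the choice k = 3, and the function names are assumptions made purely for illustration; they do not come from any of the benchmark datasets.

```python
from collections import Counter
import math


def kmer_composition(seq, k=3):
    """Return the normalized frequency of every overlapping k-mer in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:  # sequence shorter than k
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse k-mer profiles."""
    dot = sum(value * q.get(kmer, 0.0) for kmer, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


# Hypothetical toy sequences that share no 3-mer.
profile_a = kmer_composition("MKTAYIAKQR")
profile_b = kmer_composition("GSHMLEDPVV")
similarity = cosine(profile_a, profile_b)
print(similarity)
```

Two proteins without any shared K-mer are orthogonal in this feature space, so a model that relies on such features alone has no signal linking them even if they interact with the same human proteins.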

Other existing works also employed additional features to represent viral proteins, such as protein functional information or GO annotation (Wang, 2020), protein domain-domain association information (Barman et al., 2014), protein structure information (Lasso et al., 2019; Guven-Maiorov et al., 2019), and the disease phenotype of clinical symptoms (Wang, 2020). A major limitation of these approaches is that they cannot generalize to novel viruses for which such information is not available or lacks experimentally supported evidence.

bioRxiv preprint doi: https://doi.org/10.1101/2021.03.25.437037; this version posted March 26, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Figure 1: MULTITASK TRANSFER (MTT) model for pathogen-human PPI.

In this work we tackle the above limitations by exploiting powerful statistical protein representations derived from a corpus of around 24 million protein sequences in a multitask framework. Noting that viruses tend to mimic human proteins when building interactions with host proteins, we use the prediction of human PPI as a side task to further regularize our model and improve generalization. Our large-scale experiments on a number of datasets showcase the superiority of our approach.

2 OUR APPROACH

The schematic diagram of our proposed model is presented in Figure 1. We use human and virus raw protein sequences as input. As side or domain information we use the human protein-protein interaction network of around 20,000 proteins and over 22M interactions from the STRING (Szklarczyk et al., 2015) database.

We note that the protein sequence determines the protein's structural conformation (fold), which further determines its function and its interaction pattern with other proteins. However, the underlying mechanism of the sequence-to-structure matching process is very complex and cannot be easily specified by hand-crafted rules. Therefore, rather than using hand-crafted features extracted from amino acid sequences, we employ the pre-trained UNIREP model (Alley et al., 2019) to generate latent representations or protein embeddings. The protein representations extracted from the UNIREP model are empirically shown to preserve fundamental properties of the proteins and are hypothesized to be statistically more powerful and generalizable than hand-crafted sequence features.

We further fine-tune these representations by training 2 simple neural networks (single-layer MLPs with ReLU activation) using an additional objective of predicting human PPI in addition to the main task. We use logistic regression networks to predict the likelihood of an interaction between virus-human proteins or human-human proteins. The two networks' parameters are not shared among tasks, allowing them to extract more task-specific representations.

The rationale behind using the human PPI task is that viruses have been shown to mimic and compete with human proteins in their binding and interaction patterns with other human proteins (Mei & Zhang, 2020). Therefore, we believe that the patterns learned from the human interactome (or human PPI network) should be a rich source of knowledge to guide our virus-human PPI task and further help to regularize our model.

Let $\Theta$, $\Phi$ denote the sets of learnable parameters corresponding to the representation tuning components, i.e., the Multilayer Perceptrons (MLP) corresponding to the virus and human proteins, respectively. Let $W_{vh}$, $W_{hh'}$ denote the two learnable weight matrices (parameters) for the logistic regression modules for the virus-human and human-human PPI prediction tasks. We use $VH$ and $HH$ to denote the training sets of virus-human and human-human PPI, respectively.

We use binary cross entropy loss for virus-human PPI predictions as given below

$$\mathcal{L}_{vh} = \sum_{(v,h) \in VH} -z_{vh} \log y_{vh}(\Theta, \Phi, W_{vh}) - (1 - z_{vh}) \log\bigl(1 - y_{vh}(\Theta, \Phi, W_{vh})\bigr), \qquad (1)$$

where $z_{vh}$ is the corresponding binary target variable and $y_{vh}$ is the predicted probability of a virus-human PPI, i.e., the output of the logistic regression (LR) module. The input to the LR module is the element-wise product of the fine-tuned representations (outputs of the MLPs) of the virus and human proteins.
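The prediction path described above (MLP fine-tuning, element-wise product, LR module) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy dimensions, the random initialization, and all function names are hypothetical, and random vectors stand in for the UNIREP embeddings.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def mlp(params, x):
    """Single-layer MLP with ReLU activation: relu(W x + b)."""
    weights, biases = params
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def predict_interaction(virus_mlp, human_mlp, lr_params, virus_emb, human_emb):
    """Return y_vh, the predicted probability of a virus-human interaction."""
    v = mlp(virus_mlp, virus_emb)         # fine-tuned virus representation
    h = mlp(human_mlp, human_emb)         # fine-tuned human representation
    pair = [a * b for a, b in zip(v, h)]  # element-wise product
    w, bias = lr_params
    return sigmoid(sum(wi * pi for wi, pi in zip(w, pair)) + bias)


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (an arbitrary choice)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
EMB_DIM, HIDDEN = 8, 4                    # toy sizes, assumed for illustration
theta = init_layer(rng, EMB_DIM, HIDDEN)  # virus MLP parameters (Theta)
phi = init_layer(rng, EMB_DIM, HIDDEN)    # human MLP parameters (Phi)
w_vh = ([rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)], 0.0)

virus_emb = [rng.random() for _ in range(EMB_DIM)]  # stand-in for an embedding
human_emb = [rng.random() for _ in range(EMB_DIM)]
y_vh = predict_interaction(theta, phi, w_vh, virus_emb, human_emb)
print(f"y_vh = {y_vh:.4f}")
```

Because the pair representation is a product rather than a concatenation, a dimension contributes to the LR input only when it is active in both fine-tuned representations.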

For human PPI, the target variables ($z_{hh'}$) are the normalized confidence scores, which can be interpreted as the probability of observing an interaction. We use binary cross entropy loss as below, where $y_{hh'}$ is the output of the LR module applied to the element-wise product of the fine-tuned representations (output of the second MLP) of the two human proteins.

$$\mathcal{L}_{hh'} = \sum_{(h,h') \in HH} -z_{hh'} \log y_{hh'}(\Phi, W_{hh'}) - (1 - z_{hh'}) \log\bigl(1 - y_{hh'}(\Phi, W_{hh'})\bigr) \qquad (2)$$

We use a linear combination of the two loss functions to train our model, i.e., $\mathcal{L} = \mathcal{L}_{vh} + \alpha \cdot \mathcal{L}_{hh'}$, where $\alpha$ is the human PPI weight factor. We set it to $10^{-3}$ in our experiments.
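As a worked illustration of the combined objective, the sketch below evaluates both binary cross entropy terms on a few made-up (target, prediction) pairs and combines them with the weight factor. The numeric values and function names are assumptions chosen for illustration; only the loss form and the default weight of 10^-3 follow the text.

```python
import math


def bce(target, prob, eps=1e-12):
    """Binary cross entropy for one pair; target may be soft, i.e. in [0, 1]."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -target * math.log(p) - (1.0 - target) * math.log(1.0 - p)


def multitask_loss(vh_pairs, hh_pairs, alpha=1e-3):
    """Virus-human loss plus alpha times the human-human loss."""
    loss_vh = sum(bce(z, y) for z, y in vh_pairs)
    loss_hh = sum(bce(z, y) for z, y in hh_pairs)
    return loss_vh + alpha * loss_hh


# Made-up (target, predicted probability) pairs.
vh_pairs = [(1.0, 0.9), (0.0, 0.2)]    # binary targets
hh_pairs = [(0.7, 0.6), (0.15, 0.3)]   # normalized confidence scores as soft targets
total = multitask_loss(vh_pairs, hh_pairs)
print(total)
```

With a weight this small, the human PPI term shifts the total only slightly for a comparable number of pairs, which matches its role as a regularizing side task rather than a second main objective.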

3 EXPERIMENTAL EVALUATION

We compare our method with the following six baseline methods and two simpler variants of our model.

(1) GENERALIZED (Zhou et al., 2018): It is a generalized SVM model trained on hand-crafted features extracted from protein sequences for the novel virus-human PPI task.

(2) HYBRID (Deng et al., 2020): It is a complex deep model with convolutional and LSTM layers for extracting latent representations of virus and human proteins from their input sequence features and is trained using L1-regularized logistic regression.

(3) DOC2VEC (Yang et al., 2020): It employs the doc2vec (Le & Mikolov, 2014) approach to generate protein embeddings from the corpus of protein sequences. A random forest model is then trained for the PPI prediction.

(4) MOTIFTRANSFORMER (Lanchantin et al., 2020): It first generates protein embeddings using supervised protein structure and function prediction tasks. Those embeddings are then passed as input to an order-independent classifier for the PPI prediction task.

(5) DENOVO (Eid et al., 2016): It trains an SVM classifier on a hand-crafted feature set extracted from the K-mer amino acid composition information using a novel negative sampling strategy.

(6) BARMAN (Barman et al., 2014): It uses an SVM model trained on a feature set consisting of the protein domain-domain association and the methionine, serine, and valine amino acid composition of viral proteins.

(7) 2 simpler variants of MTT: As an ablation study we evaluate two simpler variants: (i) SINGLETASK TRANSFER (STT), which is trained on the single objective of predicting pathogen-human PPI, and (ii) NAIVE BASELINE, which is a logistic regression model using concatenated human and pathogen protein UNIREP representations as input.

3.1 BENCHMARK DATASETS AND RESULTS

We evaluate our approach on 7 benchmark datasets. As several of our competitors do not release their code, we use the performance scores reported in the original papers (using the same evaluation metrics), giving them full advantage. Besides, many of the methods use hand-crafted features which might not be available for benchmark datasets that were not evaluated in their original papers. Detailed data statistics can be found in Appendix A.1.

Novel Viral-Human PPI. We use the benchmark datasets for human H1N1 and human Ebola viruses as released by Zhou et al. (2018). The datasets are prepared for testing predictions for a novel virus. The known PPIs between virus and human were retrieved from four databases: APID, IntAct, Mentha, and UniProt. The training data for the human-H1N1 dataset includes PPIs between human and all viruses except H1N1. Similarly, the training data for the human-Ebola dataset includes PPIs between human and all viruses except Ebola. The statistics for both datasets are presented in Table 4 in the Appendix. The results (Area under curve (AUC) and Area under Precision Recall curve (AUPR) scores) are given in Table 1.
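The novel-virus setting described above amounts to holding out every interaction of the target virus from training. A minimal sketch follows, assuming each record is a (virus, viral protein, human protein, label) tuple; the record layout, the identifiers, and the function name are hypothetical and are not taken from the released datasets.

```python
def leave_one_virus_out(records, held_out_virus):
    """Train on all viruses except the held-out one; test on the held-out one."""
    train = [r for r in records if r[0] != held_out_virus]
    test = [r for r in records if r[0] == held_out_virus]
    return train, test


# Toy records with hypothetical protein identifiers.
records = [
    ("H1N1", "vp1", "hpA", 1),
    ("Ebola", "vp2", "hpB", 1),
    ("HIV-1", "vp3", "hpA", 0),
    ("H1N1", "vp4", "hpC", 0),
]
train, test = leave_one_virus_out(records, "H1N1")
print(len(train), len(test))
```

No protein of the held-out virus appears in the training portion, which is what separates this setting from a random split over interaction pairs.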

                                     H1N1             EBOLA
MODEL                                AUC     AUPR     AUC     AUPR
GENERALIZED                          0.886   -        0.867   -
HYBRID                               0.937   -        -       -
MOTIFTRANSFORMER                     0.945   0.948    0.968   0.974
DOC2VEC                              0.817   0.542    -       -
MULTITASK TRANSFER (MTT) (our)       0.957   0.966    0.976   0.981
SINGLETASK TRANSFER (STT) (our)      0.950   0.962    0.963   0.974
NAIVE BASELINE (our)                 0.834   0.806    0.893   0.870

Table 1: Comparison on novel virus-human PPI prediction task. "-" denotes that the corresponding score is not reported in the original paper.

Viral-Human PPI prediction on Datasets with Rich Viral information. We use the datasets from the DeNovo (Eid et al., 2016) and Barman (Barman et al., 2014) studies. DeNovo's SLIM dataset encapsulated viral proteins based on the presence of Short Linear Motifs (SLiMs), i.e., short recurring protein sequences with specific biological function. Barman's dataset was retrieved from the Virus-MINT database by removing interacting protein pairs that did not have any "InterPro" domain hit. The Barman dataset is evaluated using 5-fold cross validation with the original data splits.

                             DENOVO SLIM                       BARMAN'S DATASET
MODEL                        SN      SP      ACC     AUC       SN      SP      ACC     AUC
GENERALIZED                  80.00   88.94   84.47   0.897     76.14   83.77   79.95   0.858
DENOVO (Eid et al., 2016)    82.59   81.65   83.53   -         -       -       -       -
MTT (our)                    88.00   87.76   87.88   0.955     90.05   89.57   89.81   0.958
STT (our)                    86.12   85.88   86.00   0.941     90.14   89.66   89.9    0.957
NAIVE BASELINE (our)         84.00   83.76   83.88   0.885     74.20   73.72   73.96   0.809

Table 2: Comparison on datasets with rich feature information. SN, SP, ACC refer to Sensitivity, Specificity, and Accuracy, respectively. "-" denotes that the corresponding score is not reported in the original paper.
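For reference, the three threshold-based metrics reported here follow from the confusion matrix in the standard way. The sketch below uses made-up labels and predictions; the function name and the toy data are assumptions, and the values are returned as fractions, whereas the tables report percentages.

```python
def sn_sp_acc(y_true, y_pred):
    """Return (sensitivity, specificity, accuracy) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    sensitivity = tp / (tp + fn)   # fraction of true interactions recovered
    specificity = tn / (tn + fp)   # fraction of non-interactions rejected
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy


# Made-up labels and thresholded predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = sn_sp_acc(y_true, y_pred)
print(metrics)
```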

Additional results on novel bacteria-human PPI prediction. We further demonstrate our model's effectiveness on the novel bacteria-human PPI prediction task. We compare our method with DeNovo on the three datasets for three human pathogenic bacteria: BACILLUS ANTHRACIS (B1), YERSINIA PESTIS (B2), and FRANCISELLA TULARENSIS (B3), obtained from Eid et al. (2016). The results are shown in Table 3. MTT outperforms the baseline method (DeNovo) on every reported metric except sensitivity on B1.

               BACILLUS ANTHRACIS         YERSINIA PESTIS            FRANCISELLA TULARENSIS
MODEL          SN      SP      ACC        SN      SP      ACC        SN      SP      ACC
DENOVO         94      97.2    96.42      94.8    98.3    97.47      94.9    98.3    97.32
MTT (our)      93.46   97.83   96.74      96.93   98.99   98.49      98.22   99.27   98.98

Table 3: Comparison for the novel bacteria-human PPI prediction task. SN, SP, ACC refer to Sensitivity, Specificity, and Accuracy, respectively.

3.2 DISCUSSION AND FUTURE WORK

Our method shows superior performance on a wide range of tested datasets. Note that this is despite the fact that each of our baselines has been proposed to exploit a specific kind of information which was used to construct the dataset in the first place. MTT also outperforms its simpler variants developed with a single-task objective. Note that our naive baseline, which directly trains a logistic regression classifier with pretrained embeddings, already outperforms several methods. This points to the superiority of these representations as compared to hand-crafted features. As future work, we will enhance our multitask approach by incorporating more domain information as well as exploiting more sophisticated multitask model architectures.

REFERENCES

Ethan C Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, and George M Church. Unified rational protein engineering with sequence-based deep representation learning. Nature methods, 16(12):1315–1322, 2019.

Mais G Ammari, Cathy R Gresham, Fiona M McCarthy, and Bindu Nanduri. Hpidb 2.0: a curated database for host–pathogen interactions. Database, 2016, 2016.

Ranjan Kumar Barman, Sudipto Saha, and Santasabuj Das. Prediction of interactions between viral and host proteins using supervised machine learning methods. PloS one, 9(11):e112034, 2014.

Abdul Hannan Basit, Wajid Arshad Abbasi, Amina Asif, Sadaf Gull, and Fayyaz Ul Amir Afsar Minhas. Training host-pathogen protein–protein interaction predictors. Journal of bioinformatics and computational biology, 16(04):1850014, 2018.

Alberto Calderone, Luana Licata, and Gianni Cesareni. Virusmentha: a new resource for virus-host protein interactions. Nucleic acids research, 43(D1):D588–D592, 2015.

Andrew Chatr-Aryamontri, Arnaud Ceol, Daniele Peluso, Aurelio Nardozza, Simona Panni, Francesca Sacco, Michele Tinti, Alex Smolyar, Luisa Castagnoli, Marc Vidal, et al. Virusmint: a viral protein interaction database. Nucleic acids research, 37(suppl 1):D669–D673, 2009.

Kuan-Hsi Chen, Tsai-Feng Wang, and Yuh-Jyh Hu. Protein-protein interaction prediction using a hybrid feature representation and a stacked generalization scheme. BMC bioinformatics, 20(1):1–17, 2019.

Lei Deng, Jiaojiao Zhao, and Jingpu Zhang. Predict the protein-protein interaction between virus and host through hybrid deep neural network. In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 11–16. IEEE, 2020.

Kevin Dick, Bahram Samanfar, Bradley Barnes, Elroy R Cober, Benjamin Mimee, Stephen J Molnar, Kyle K Biggar, Ashkan Golshani, Frank Dehne, James R Green, et al. Pipe4: Fast ppi predictor for comprehensive inter-and cross-species interactomes. Scientific reports, 10(1):1–15, 2020.

Francesca Diella, Niall Haslam, Claudia Chica, Aidan Budd, Sushama Michael, Nigel P Brown, Gilles Travé, and Toby J Gibson. Understanding eukaryotic linear motifs and their role in cell signaling and regulation. Front Biosci, 13(6580):603, 2008.

Fatma-Elzahraa Eid, Mahmoud ElHefnawi, and Lenwood S Heath. Denovo: virus-host sequence-based protein–protein interaction prediction. Bioinformatics, 32(8):1144–1150, 2016.

Emine Guven-Maiorov, Chung-Jung Tsai, Buyong Ma, and Ruth Nussinov. Interface-based structural prediction of novel host-pathogen interactions. In Computational Methods in Protein Evolution, pp. 317–335. Springer, 2019.

Jack Lanchantin, Arshdeep Sekhon, Clint Miller, and Yanjun Qi. Transfer learning with motiftransformers for predicting protein-protein interactions between a novel virus and humans. bioRxiv, 2020.

Gorka Lasso, Sandra V Mayer, Evandro R Winkelmann, Tim Chu, Oliver Elliot, Juan Angel Patino-Galindo, Kernyu Park, Raul Rabadan, Barry Honig, and Sagi D Shapira. A structure-informed atlas of human-virus interactions. Cell, 178(6):1526–1541, 2019.

Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In International conference on machine learning, pp. 1188–1196. PMLR, 2014.

Benjamin Yee Shing Li, Lam Fat Yeung, and Genke Yang. Pathogen host interaction prediction via matrix factorization. In 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 357–362. IEEE, 2014.

Yiwei Li. Computational methods for predicting protein-protein interactions and binding sites. 2020.
