Revealing new therapeutic opportunities through drug target prediction via class imbalance-tolerant machine learning

doi:10.1101/572420

Siqi Liang1,2 and Haiyuan Yu1,2,*

1 Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, 14853, USA

2 Weill Institute for Cell and Molecular Biology, Cornell University, New York, 14853, USA

* To whom correspondence should be addressed. Tel: 607-255-0259; Fax: 607-255-5961; Email: haiyuan.yu@cornell.edu

bioRxiv preprint doi: https://doi.org/10.1101/572420; this version posted March 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

ABSTRACT

In silico drug target prediction provides valuable information for drug repurposing, understanding of side

effects as well as expansion of the druggable genome. In particular, discovery of actionable drug targets is

critical to developing targeted therapies for diseases. Here, we develop a robust method for drug target

prediction by leveraging a class imbalance-tolerant machine learning framework with a novel training

scheme. We incorporate novel features, including drug-gene phenotype similarity and gene expression

profile similarity, that capture information orthogonal to other features. We show that our classifier

achieves robust performance and is able to predict gene targets for new drugs as well as drugs that target

unexplored genes. By providing newly predicted drug-target associations, we uncover novel opportunities

for drug repurposing that may benefit cancer treatment through action on either known drug targets or

currently undrugged genes.

INTRODUCTION

Target identification is a crucial step during drug development. As the cost of bringing a single new drug

to market skyrockets to over 2.7 billion dollars on average [1], alternative approaches, such as drug

repurposing, have been pursued with increasing efforts. For example, the drug aspirin, commonly used for

treating fever and acute pain, has been found in recent years to show anti-cancer activities through

attenuation of EGFR expression [2], inhibition of COX-2 [3] and suppression of NF-κB activation by

TNF [4]. As a result, the efficacy of aspirin in treating multiple types of cancers, including breast cancer,

prostate cancer and colorectal cancer, is being actively evaluated in clinical trials. By repurposing

approved drugs for new indications through novel target discovery, the cost of drug development can be

substantially reduced, especially in the preclinical and earlier clinical phases where the toxicity and

dosage of the drug are assessed [5]. In addition to benefiting drug repurposing efforts, identifying unknown

targets of drugs can facilitate understanding of their side effects, which are often caused by drugs binding

to unintended targets. The serotonin receptor agonist cisapride, as an example, is a gastroprokinetic agent

used for treating gastric reflux, but it can cause serious cardiac events including arrhythmia and even lead

to death. The mechanism behind the cardiac effects of cisapride was discovered in 1997 to be its

high-affinity blocking of the human cardiac potassium channel [6], which resulted in its withdrawal from the

US market three years later. Furthermore, of the more than 4,400 genes estimated to be druggable in the human

genome [7], fewer than half are currently targeted by approved drugs. Therefore, identification

of novel gene targets can help with expanding the druggable genome, opening up new avenues for drug

development.

Experimental methods for determining drug-target associations provide direct evidence and

information on the mode of action of drugs. However, their high cost and long timeframe have prohibited

them from large-scale application. As an alternative, computational approaches, including docking-based

methods and machine learning-based methods, have been developed to predict new drug-target

associations [8]. In particular, machine learning-based methods that exploit the chemogenomic space have

yielded considerable success in drug target prediction without requiring three-dimensional protein

structures of the targets [9-13]. Various features, including chemical similarity [14, 15] and side effect

similarity [16, 17], have proved valuable in identifying new associations between drugs and targets.

Nevertheless, two pitfalls are commonly overlooked. First, conventional train-test splitting and

cross-validation schemes are flawed for pair-input prediction tasks [18]. Second, extreme class imbalance in drug target

datasets is not satisfactorily addressed by commonly used methods such as sampling from the majority

class [19]. Moreover, most methods lack the ability to predict drug-target interactions for genes that are

not yet known to be druggable.

To address these challenges, in this study, we design a novel training scheme that prevents

possible overfitting caused by overlapping drugs or targets in the training and test sets and at the same

time solves the class imbalance problem with an ensemble method. Additionally, we exploit two new

types of features, namely the phenotype similarity between a drug and a gene, and the expression profile

similarity between two genes across different tissues. We show that they confer considerable predictive

power and provide orthogonal information that is not captured by other features. Incorporating these

features, we build a classifier and demonstrate that it achieves robust performance. Further, our classifier

is able to make predictions for drugs without known targets and for genes that are not yet known to be

druggable. By predicting new potential drug-target associations, we reveal unexplored opportunities of

drug discovery and repurposing for cancer treatment.

RESULTS

Drug-gene phenotype similarity and gene expression profile similarity provide complementary

information for identifying drug targets

Similarity-based features have been widely used for drug target prediction [20]. Behind them is a simple

motivating hypothesis: similar drugs tend to have the same gene targets, and correspondingly, similar

genes tend to be targeted by the same drugs. Among various drug-drug similarity metrics, chemical

similarity and side effect similarity have been most extensively employed [14-17]. We obtained a

comprehensive dataset of known drug-target associations by extracting relevant information for all drugs

with human gene targets from a recent version of the Probes & Drugs database [21]. Further filtering (see

Methods) resulted in a total of 1,262 drugs with 11,556 drug-target associations involving 1,062 human

genes.

To calculate drug-drug similarity features, for each drug-gene instance, we considered all drugs

that are known to target the gene in question and measured their resemblance to the drug in question in

terms of chemical similarity and side effect similarity. Since a gene could have multiple known targeters,

different aggregation functions were applied to obtain real-valued features for each drug-gene pair (Fig.

1a). 2D chemical similarity between two compounds was calculated by taking the Jaccard index of their

Morgan fingerprints, which represent planar chemical substructures in the form of bit vectors [22]. Not

surprisingly, when taking maximum, mean or median as the aggregation function, drugs are significantly

more chemically similar to known targeters of their gene target than to known targeters of other genes

(Fig. 1b). Similarly, computing drug-drug similarity as the Jaccard index of the drugs' side effects (see

Methods) gave the same trends (Fig. 1c). Interestingly, however, when aggregating similarity scores by the

minimum, drug-gene pairs that are known to be associated had significantly lower scores than those that

are not known to be associated, regardless of the type of similarity metric used (Fig. 1b, 1c). This can be

explained by the fact that genes in associated drug-gene pairs have a significantly higher number of

known targeters in a broader chemogenomic space than genes in other drug-gene pairs (Supplementary

Fig. 1a). Recently, a method for encoding the 3D structure of molecules has been developed and has been

shown to enhance the performance of conventional 2D fingerprinting methods in binding prediction [23].

Using the Jaccard index of the 3D molecular fingerprints as the chemical similarity metric, we observed

trends similar to those obtained with 2D chemical similarity and side effect similarity (Fig. 1d). Notably, 3D chemical

similarity features are only weakly correlated with 2D chemical similarity and side effect similarity

features (Supplementary Fig. 1b), providing new information about the relatedness of two drugs.
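The aggregation step described above can be illustrated with a minimal sketch. It assumes that each fingerprint is represented as a set of on-bit positions and that the drug in question has already been excluded from the gene's list of known targeters; the function names and the toy bit sets are hypothetical and are not taken from the authors' code.

```python
from statistics import mean, median

def jaccard(a, b):
    """Jaccard index of two sets, e.g. the on-bit positions of two fingerprints."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def aggregated_similarity(drug_bits, targeter_bits):
    """Similarity of one drug to all known targeters of a gene,
    summarised by the four aggregation functions named in the text."""
    scores = [jaccard(drug_bits, t) for t in targeter_bits]
    if not scores:
        return None  # the gene has no known targeters
    return {
        "max": max(scores),
        "mean": mean(scores),
        "median": median(scores),
        "min": min(scores),
    }

# Toy example: one query drug and three known targeters of a gene.
query = {1, 4, 7, 9}
targeters = [{1, 4, 7}, {4, 9, 12, 15}, {2, 3}]
features = aggregated_similarity(query, targeters)
```

The same function would apply unchanged to side effect similarity if each drug is represented by its set of side effect terms instead of fingerprint bits.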

In addition to the aforementioned feature groups, which have already been incorporated in previous

drug-target prediction methods, here we introduce two novel types of features: drug-gene phenotype

similarity and expression profile similarity between two genes. Drugs that act directly on a protein and

alter its activity may lead to the same phenotypic changes as mutations on the corresponding gene. On

this account, we designed a drug-gene phenotype similarity metric by taking the Jaccard index of the side

effects of the drug and the disease phenotypes of the gene (Fig. 2a). As expected, drug-gene pairs that are

known to be associated have significantly higher phenotype similarity scores than drug-gene pairs that are

not known to be associated (Fig. 2b). On top of drug-drug and drug-gene similarity features, we

calculated similarity between two genes as their correlation coefficient in expression levels across

different tissues using gene expression data from GTEx [24]. To obtain scalar features, we considered the

similarity between the gene in question and known targets of the drug in question and applied the same

four aggregation functions as for the drug-drug similarity features (Fig. 2c). We found that when taking

maximum, mean or median as the aggregation function, genes have significantly more similar expression

profiles to known targets of their targeters than to known targets of other drugs (Fig. 2d). Using minimum

as the aggregation function rendered the opposite trend, which could be explained by drugs in drug-gene

pairs that are known to be associated having a significantly more diverse target set than drugs in other

drug-gene pairs (Supplementary Fig. 1c). Intriguingly, expression profile features, especially when

aggregated with maximum, mean or median, exhibit almost no correlation with other groups of features,

bringing in complementary information that other features do not capture (Fig. 2e).
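As a rough illustration of the two new feature types, the sketch below computes a drug-gene phenotype similarity and aggregated gene-gene expression similarities. It assumes that side effects and disease phenotypes have already been mapped to a shared vocabulary, and it uses the Pearson coefficient as the correlation measure, which the text does not specify. All names and values are hypothetical.

```python
from math import sqrt
from statistics import mean, median

def jaccard(a, b):
    """Jaccard index of two sets of terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pearson(x, y):
    """Pearson correlation of two equally long expression profiles (assumed measure)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / sqrt(vx * vy)

def expression_features(gene_profile, target_profiles):
    """Correlation of one gene with each known target of a drug,
    aggregated by maximum, mean, median and minimum."""
    scores = [pearson(gene_profile, p) for p in target_profiles]
    if not scores:
        return None  # the drug has no known targets
    return {"max": max(scores), "mean": mean(scores),
            "median": median(scores), "min": min(scores)}

# Drug-gene phenotype similarity on toy term sets.
side_effects = {"nausea", "arrhythmia", "headache"}
gene_phenotypes = {"arrhythmia", "long QT syndrome"}
phenotype_similarity = jaccard(side_effects, gene_phenotypes)

# Expression similarity across four hypothetical tissues.
gene = [1.0, 2.0, 3.0, 4.0]
known_targets = [[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
expr = expression_features(gene, known_targets)
```

Neither feature requires the gene to have a known targeter: the phenotype similarity needs only the gene's own phenotypes, and the expression features need only the known targets of the drug.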

It is worth noting that drug-gene phenotype similarity and gene expression profile similarity

features can be calculated even if the gene in question has no known drugs that target it. This

enables us to make predictions for currently “undrugged” genes, thereby expanding the druggable

genome. To extend this advantage to drug-drug similarity features, we considered targeters of protein-

protein interaction partners of the gene in question for both chemical similarity and side effect similarity

(Supplementary Fig. 2a). We also considered protein-protein interactors of the gene in question for drug-

gene phenotype similarity (Supplementary Fig. 2b). Four new groups of features were thus generated, and

we showed that they possess distinguishing power in separating drug-target pairs and other drug-gene

pairs (Supplementary Fig. 2c-2f).
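One way to picture the interaction-partner extension is the following sketch, which collects the drugs that target any protein-protein interaction partner of a gene. The resulting drug set could then stand in for the gene's own targeters when computing drug-drug similarity features. The dictionary layout and the identifiers are assumptions made for illustration only.

```python
def partner_targeters(gene, ppi, gene_to_drugs):
    """Drugs known to target any interaction partner of `gene`.

    ppi           maps a gene to the set of its interaction partners.
    gene_to_drugs maps a gene to the set of drugs known to target it.
    """
    drugs = set()
    for partner in ppi.get(gene, ()):
        drugs |= gene_to_drugs.get(partner, set())
    return drugs

# Toy data: GENE_A has no known targeter of its own, but its partners do.
ppi = {"GENE_A": {"GENE_B", "GENE_C"}}
gene_to_drugs = {"GENE_B": {"drug1"}, "GENE_C": {"drug1", "drug2"}}
borrowed = partner_targeters("GENE_A", ppi, gene_to_drugs)
```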

A novel training scheme prevents overfitting and solves the class imbalance problem

In order to build a machine learning model for drug-target prediction, we divided all drug-gene pairs into

a training set and a test set. If the split is random, the machine learning algorithm might pick up

characteristics of single drugs or genes that appear in both the training set and the test set, causing a

problem called overfitting. To avoid this, we applied a splitting scheme where the drugs were first

randomly divided into “train drugs” and “test drugs”, and the genes were split into “train targets” and

“test targets” (Fig. 3a), so that there is no overlap between the training set and the test set in terms of

either drugs or genes. Since there was no gold-standard dataset of non-associated drug-gene pairs, all

drug-gene pairs not known to be associated were considered as non-associated. This resulted in an

extreme class imbalance, with negative instances outnumbering positive instances by more than 100-fold.

To address this problem, we divided the negative instances (non-associated pairs) in the training

set into a number of subsets, and each subset was combined with all the positive instances (associated

pairs) in the training set to obtain a training subset (Fig. 3b). For every training subset we trained a

classifier, and we took the average prediction score of the ensemble of classifiers as the

final prediction score.
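The splitting and subsetting steps above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the test fraction, the number of negative subsets, and the choice to form the training set from all pairs of train drugs and train targets are hypothetical, and the identifiers are toy values.

```python
import random
from itertools import product

def split_entities(items, test_fraction, rng):
    """Randomly split drugs (or genes) into disjoint (train, test) sets."""
    items = sorted(items)  # sort first so the shuffle is reproducible
    rng.shuffle(items)
    n_test = round(len(items) * test_fraction)
    return set(items[n_test:]), set(items[:n_test])

def make_training_subsets(positives, negatives, n_subsets, rng):
    """Partition the negatives into n_subsets chunks and pair each chunk
    with all positives, giving one training subset per classifier."""
    negatives = list(negatives)
    rng.shuffle(negatives)
    chunks = [negatives[i::n_subsets] for i in range(n_subsets)]
    return [list(positives) + chunk for chunk in chunks]

def ensemble_score(scores):
    """Final score of a pair: the average over the ensemble's predictions."""
    return sum(scores) / len(scores)

rng = random.Random(0)
drugs = [f"drug{i}" for i in range(10)]
genes = [f"gene{i}" for i in range(10)]
known = {(f"drug{i}", f"gene{i}") for i in range(10)}  # toy associations

train_drugs, test_drugs = split_entities(drugs, 0.3, rng)
train_genes, test_genes = split_entities(genes, 0.3, rng)

# Training pairs use train drugs and train targets only, so the test set
# shares neither drugs nor genes with the training set.
train_pairs = list(product(sorted(train_drugs), sorted(train_genes)))
positives = [p for p in train_pairs if p in known]
negatives = [p for p in train_pairs if p not in known]
subsets = make_training_subsets(positives, negatives, n_subsets=4, rng=rng)
```

Because every negative instance lands in exactly one subset, the ensemble as a whole sees all training instances while each individual classifier is trained on a far less imbalanced set.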

The use of typical cross-validation could prevent classifiers from achieving robust predictive

performance for pair-input data [18]. Here, we designed a novel training scheme where the hold-out

validation set has no overlap with data used for fitting the classifier in terms of either drugs or genes by

adopting the same splitting method as the train-test split (Fig. 3c). For each classifier and each set of

hyperparameters, this splitting was done 15 times, and each time the drug-gene pairs used for model

fitting were intersected with the corresponding training subset, while all the drug-gene pairs held out for

validation were used for evaluating model performance (Fig. 3d). This training scheme solved the class

imbalance problem while utilizing all training instances.
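A minimal sketch of the repeated hold-out validation described above is given below. It assumes that each round re-splits the training drugs and genes with the same disjoint scheme as the train-test split; the hold-out fraction, the seed and all identifiers are hypothetical, and only the 15 repetitions come from the text.

```python
import random
from itertools import product

def split_entities(items, holdout_fraction, rng):
    """Randomly split drugs (or genes) into disjoint (fit, validation) sets."""
    items = sorted(items)
    rng.shuffle(items)
    n_val = round(len(items) * holdout_fraction)
    return set(items[n_val:]), set(items[:n_val])

def validation_rounds(train_drugs, train_genes, training_subset,
                      n_rounds=15, holdout_fraction=0.2, seed=0):
    """Yield (fit_pairs, val_pairs) for each repeated split.

    fit_pairs: pairs of fit drugs and fit genes, intersected with the
               training subset assigned to one classifier.
    val_pairs: all pairs of validation drugs and validation genes.
    """
    rng = random.Random(seed)
    for _ in range(n_rounds):
        fit_drugs, val_drugs = split_entities(train_drugs, holdout_fraction, rng)
        fit_genes, val_genes = split_entities(train_genes, holdout_fraction, rng)
        fit_pairs = [(d, g) for d, g in training_subset
                     if d in fit_drugs and g in fit_genes]
        val_pairs = list(product(sorted(val_drugs), sorted(val_genes)))
        yield fit_pairs, val_pairs

# Toy training data: 10 drugs, 10 genes, and one training subset of pairs.
train_drugs = [f"drug{i}" for i in range(10)]
train_genes = [f"gene{i}" for i in range(10)]
training_subset = list(product(train_drugs, train_genes))[::3]
rounds = list(validation_rounds(train_drugs, train_genes, training_subset))
```

In each round, a classifier with a given set of hyperparameters would be fitted on `fit_pairs` and scored on `val_pairs`, and the scores would be combined across the 15 rounds to compare hyperparameter settings.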

References

XGBoost: A Scalable Tree Boosting System

The Genotype-Tissue Expression (GTEx) project

Extended-Connectivity Fingerprints

The Unified Medical Language System (UMLS): integrating biomedical terminology

Algorithms for Hyper-Parameter Optimization
