
Revealing new therapeutic opportunities through drug target prediction via class imbalance-tolerant machine learning

Siqi Liang, +1 more
- 09 Mar 2019 - 
- pp 572420

Revealing new therapeutic opportunities through drug target prediction via
class imbalance-tolerant machine learning
Siqi Liang 1,2 and Haiyuan Yu 1,2,*
1 Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York,
14853, USA
2 Weill Institute for Cell and Molecular Biology, Cornell University, New York, 14853, USA
* To whom correspondence should be addressed. Tel: 607-255-0259; Fax: 607-255-5961; Email:
haiyuan.yu@cornell.edu
bioRxiv preprint doi: https://doi.org/10.1101/572420; this version posted March 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

ABSTRACT
In silico drug target prediction provides valuable information for drug repurposing, understanding of side
effects as well as expansion of the druggable genome. In particular, discovery of actionable drug targets is
critical to developing targeted therapies for diseases. Here, we develop a robust method for drug target
prediction by leveraging a class imbalance-tolerant machine learning framework with a novel training
scheme. We incorporate novel features, including drug-gene phenotype similarity and gene expression
profile similarity, that capture information orthogonal to other features. We show that our classifier
achieves robust performance and is able to predict gene targets for new drugs as well as drugs that target
unexplored genes. By providing newly predicted drug-target associations, we uncover novel opportunities
for drug repurposing that may benefit cancer treatment through action on either known drug targets or
currently undrugged genes.
INTRODUCTION
Target identification is a crucial step during drug development. As the average cost of bringing a single
new drug to market has risen to over 2.7 billion dollars [1], alternative approaches, such as drug
repurposing, have been pursued with increasing effort. For example, the drug aspirin, commonly used for
treating fever and acute pain, has been found in recent years to show anti-cancer activities through
attenuation of EGFR expression [2], inhibition of COX-2 [3] and suppression of NF-κB activation by
TNF [4]. As a result, the efficacy of aspirin in treating multiple types of cancers, including breast cancer,
prostate cancer and colorectal cancer, is being actively evaluated in clinical trials. By repurposing
approved drugs for new indications through novel target discovery, the cost of drug development can be
substantially reduced, especially in the preclinical and earlier clinical phases where the toxicity and
dosage of the drug are assessed [5]. In addition to benefiting drug repurposing efforts, identifying unknown
targets of drugs can facilitate understanding of their side effects, which are often caused by drugs binding
to unintended targets. The serotonin receptor agonist cisapride, as an example, is a gastroprokinetic agent
used for treating gastric reflux, but it can cause serious cardiac events including arrhythmia and even lead
to death. The mechanism behind the cardiac effects of cisapride was discovered in 1997 to be its high-
affinity blocking of the human cardiac potassium channel [6], which resulted in its withdrawal from the
US market three years later. Furthermore, of the more than 4,400 genes estimated to be druggable in the
human genome [7], fewer than half are currently targeted by approved drugs. Therefore, identification
of novel gene targets can help with expanding the druggable genome, opening up new avenues for drug
development.
Experimental methods for determining drug-target associations provide direct evidence and
information on the mode of action of drugs. However, their high cost and long timeframe have precluded
their large-scale application. As an alternative, computational approaches, including docking-based
methods and machine learning-based methods, have been developed to predict new drug-target
associations [8]. In particular, machine learning-based methods that exploit the chemogenomic space have
yielded considerable success in drug target prediction without requiring three-dimensional protein
structures of the targets [9-13]. Various features, including chemical similarity [14, 15] and side effect
similarity [16, 17], have proved valuable in identifying new associations between drugs and targets.
Nevertheless, two pitfalls are commonly overlooked. First, conventional train-test splitting and cross-
validation schemes are flawed for pair-input prediction tasks [18]. Second, extreme class imbalance in drug
target datasets is not satisfactorily addressed by commonly used methods such as sampling from the majority
class [19]. Moreover, most methods lack the ability to predict drug-target interactions for genes that are
not yet known to be druggable.
To address these challenges, in this study, we design a novel training scheme that prevents
possible overfitting caused by overlapping drugs or targets in the training and test sets and at the same
time solves the class imbalance problem with an ensemble method. Additionally, we exploit two new
types of features, namely the phenotype similarity between a drug and a gene, and the expression profile
similarity between two genes across different tissues. We show that they confer considerable predictive
power and provide orthogonal information that is not captured by other features. Incorporating these
features, we build a classifier and demonstrate that it achieves robust performance. Further, our classifier
is able to make predictions for drugs without known targets and for genes that are not yet known to be
druggable. By predicting new potential drug-target associations, we reveal unexplored opportunities for
drug discovery and repurposing in cancer treatment.
RESULTS
Drug-gene phenotype similarity and gene expression profile similarity provide complementary
information for identifying drug targets
Similarity-based features have been widely used for drug target prediction [20]. Behind them is a simple
motivating hypothesis: similar drugs tend to have the same gene targets, and correspondingly, similar
genes tend to be targeted by the same drugs. Among various drug-drug similarity metrics, chemical
similarity and side effect similarity have been most extensively employed [14-17]. We obtained a
comprehensive dataset of known drug-target associations by extracting relevant information for all drugs
with human gene targets from a recent version of the Probes & Drugs database [21]. Further filtering (see
Methods) resulted in a total of 1,262 drugs with 11,556 drug-target associations involving 1,062 human
genes.
To calculate drug-drug similarity features, for each drug-gene instance, we considered all drugs
that are known to target the gene in question and measured their resemblance to the drug in question in
terms of chemical similarity and side effect similarity. Since a gene could have multiple known targeters,
different aggregation functions were applied to obtain real-valued features for each drug-gene pair (Fig.
1a). 2D chemical similarity between two compounds was calculated by taking the Jaccard index of their
Morgan fingerprints, which represent planar chemical substructures in the form of bit vectors [22]. Not
surprisingly, when taking maximum, mean or median as the aggregation function, drugs are significantly
more chemically similar to known targeters of their gene target than to known targeters of other genes
(Fig. 1b). Similarly, computing similarity as the Jaccard index of the two drugs' side effects (see
Methods) gave the same trends (Fig. 1c). Interestingly, however, when similarity scores were aggregated by
the minimum, drug-gene pairs that are known to be associated had significantly lower scores than those that
are not known to be associated, regardless of the type of similarity metric used (Fig. 1b, 1c). This can be
explained by the fact that genes in associated drug-gene pairs have a significantly higher number of
known targeters in a broader chemogenomic space than genes in other drug-gene pairs (Supplementary
Fig. 1a). Recently, a method for encoding the 3D structure of molecules has been developed and has been
shown to enhance the performance of conventional 2D fingerprinting methods in binding prediction [23].
Using the Jaccard index of the 3D molecular fingerprints as the chemical similarity metric, we discovered
similar trends as using 2D chemical similarity and side effect similarity (Fig. 1d). Notably, 3D chemical
similarity features are only weakly correlated with 2D chemical similarity and side effect similarity
features (Supplementary Fig. 1b), providing new information about the relatedness of two drugs.
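The drug-drug similarity features described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: fingerprints are represented as plain Python sets of on-bit indices (in practice they would come from a cheminformatics toolkit), the names `jaccard`, `AGGREGATORS` and `drug_drug_features` are hypothetical, and excluding the query drug from the list of known targeters is an assumption.

```python
from statistics import mean, median


def jaccard(a, b):
    """Jaccard index of two sets; defined here as 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# The four aggregation functions named in the text.
AGGREGATORS = {"max": max, "mean": mean, "median": median, "min": min}


def drug_drug_features(drug_bits, targeter_bits):
    """Aggregate the similarity between one drug and each known targeter of a gene.

    drug_bits: set of on-bit indices of the query drug's fingerprint.
    targeter_bits: list of such sets, one per known targeter of the gene
        (assumed to exclude the query drug itself).
    Returns None for every feature when the gene has no other known targeter.
    """
    sims = [jaccard(drug_bits, bits) for bits in targeter_bits]
    if not sims:
        return {name: None for name in AGGREGATORS}
    return {name: agg(sims) for name, agg in AGGREGATORS.items()}


# Toy example: one query drug and three known targeters of the gene.
query = {1, 4, 7, 9}
targeters = [{1, 4, 7}, {4, 9, 12, 15}, {20, 21}]
features = drug_drug_features(query, targeters)
```

The same function applies unchanged to side effect similarity if the sets hold side effect terms instead of fingerprint bits. Note how a single dissimilar targeter pulls the minimum to zero, which is in line with the observation that genes with many, chemically diverse targeters tend to have low minimum scores.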
In addition to the aforementioned feature groups, which have already been incorporated in previous
drug-target prediction methods, here we introduce two novel types of features: drug-gene phenotype
similarity and expression profile similarity between two genes. Drugs that act directly on a protein and
alter its activity may lead to the same phenotypic changes as mutations in the corresponding gene. On
this account, we designed a drug-gene phenotype similarity metric by taking the Jaccard index of the side
effects of the drug and the disease phenotypes of the gene (Fig. 2a). As expected, drug-gene pairs that are
known to be associated have significantly higher phenotype similarity scores than drug-gene pairs that are
not known to be associated (Fig. 2b). On top of drug-drug and drug-gene similarity features, we
calculated similarity between two genes as their correlation coefficient in expression levels across
different tissues using gene expression data from GTEx [24]. To obtain scalar features, we considered the
similarity between the gene in question and known targets of the drug in question and applied the same
four aggregation functions as for the drug-drug similarity features (Fig. 2c). We discovered that when taking
maximum, mean or median as the aggregation function, genes have significantly more similar expression
profiles to known targets of their targeters than to known targets of other drugs (Fig. 2d). Using minimum
as the aggregation function yielded the opposite trend, which could be explained by drugs in drug-gene
pairs that are known to be associated having a significantly more diverse target set than drugs in other
drug-gene pairs (Supplementary Fig. 1c). Intriguingly, expression profile features, especially when
aggregated with maximum, mean or median, exhibit almost no correlation with other groups of features,
bringing in complementary information that other features do not capture (Fig. 2e).
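The two new feature types can be illustrated with a short sketch. Several details are assumptions made for the example rather than facts from the text: the correlation coefficient is taken to be Pearson's, the side effect and phenotype terms are assumed to be already mapped to a shared vocabulary, and the gene names, expression values and function names are invented.

```python
from math import sqrt
from statistics import mean, median


def jaccard(a, b):
    """Jaccard index of two sets; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(x, y):
    """Pearson correlation of two equal-length profiles; 0.0 if either is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def expression_features(gene, known_targets, expression):
    """Aggregate expression similarity between a gene and the drug's known targets."""
    sims = [pearson(expression[gene], expression[t]) for t in known_targets if t != gene]
    if not sims:
        return None
    return {"max": max(sims), "mean": mean(sims), "median": median(sims), "min": min(sims)}


# Drug-gene phenotype similarity: drug side effects vs. gene disease phenotypes.
drug_side_effects = {"arrhythmia", "nausea", "headache"}
gene_phenotypes = {"arrhythmia", "long QT syndrome"}
phenotype_similarity = jaccard(drug_side_effects, gene_phenotypes)

# Gene-gene expression similarity across four hypothetical tissues.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0],
    "GENE_C": [4.0, 3.0, 2.0, 1.0],
}
expr_features = expression_features("GENE_A", ["GENE_B", "GENE_C"], expression)
```

Neither feature requires the gene in question to have a known targeter: the phenotype score needs only the gene's disease phenotypes, and the expression features need only the drug's other known targets.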
It is worth noting that drug-gene phenotype similarity and gene expression profile similarity
features can be calculated even if the gene in question has no known drugs that target it. This
enables us to make predictions for currently “undrugged” genes, thereby expanding the druggable
genome. To extend this advantage to drug-drug similarity features, we considered targeters of protein-
protein interaction partners of the gene in question for both chemical similarity and side effect similarity
(Supplementary Fig. 2a). We also considered protein-protein interactors of the gene in question for drug-
gene phenotype similarity (Supplementary Fig. 2b). Four new groups of features were thus generated, and
we showed that they possess distinguishing power in separating drug-target pairs and other drug-gene
pairs (Supplementary Fig. 2c-2f).
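One way to picture the extension to interaction partners is a lookup that replaces "drugs known to target the gene" with "drugs known to target any protein-protein interaction partner of the gene". The sketch below is illustrative only: the dictionaries, identifiers and the function `partner_targeters` are hypothetical, and dropping the query drug from the result is an assumption.

```python
def partner_targeters(gene, ppi, gene_to_drugs, exclude_drug=None):
    """Collect drugs that target any interaction partner of `gene`.

    ppi: dict mapping a gene to the set of its interaction partners.
    gene_to_drugs: dict mapping a gene to the set of drugs known to target it.
    exclude_drug: optional query drug to leave out of the result (assumption).
    """
    drugs = set()
    for partner in ppi.get(gene, ()):
        drugs |= gene_to_drugs.get(partner, set())
    drugs.discard(exclude_drug)
    return drugs


# GENE_X has no known targeter of its own, but its partners do.
ppi = {"GENE_X": {"GENE_A", "GENE_B"}}
gene_to_drugs = {"GENE_A": {"drug1", "drug2"}, "GENE_B": {"drug2", "drug3"}}
proxy_targeters = partner_targeters("GENE_X", ppi, gene_to_drugs, exclude_drug="drug3")
```

The resulting drug set can then be fed to the same similarity and aggregation steps used for direct targeters, which is what allows drug-drug similarity features to be computed for an undrugged gene.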
A novel training scheme prevents overfitting and solves the class imbalance problem
In order to build a machine learning model for drug-target prediction, we divided all drug-gene pairs into
a training set and a test set. If the split is random, the machine learning algorithm might pick up
characteristics of single drugs or genes that appear in both the training set and the test set, causing a
problem called overfitting. To avoid this, we applied a splitting scheme where the drugs were first
randomly divided into “train drugs” and “test drugs”, and the genes were split into “train targets” and
“test targets” (Fig. 3a), so that there is no overlap between the training set and the test set in terms of
either drugs or genes. Since there was no gold-standard dataset of non-associated drug-gene pairs, all
drug-gene pairs not known to be associated were considered as non-associated. This resulted in an
extreme class imbalance, with negative instances outnumbering positive instances by more than
100-fold. To address this problem, we divided the negative instances (non-associated pairs) in the training
set into a number of subsets, and each subset was combined with all the positive instances (associated
pairs) in the training set to obtain a training subset (Fig. 3b). For every training subset we trained a
classifier, and we took the average prediction score of the ensemble of classifiers as our
final prediction score.
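The splitting and ensembling steps in this section can be sketched in a few lines. This is a hedged illustration under stated assumptions: the test fraction, the number of negative subsets and the random seeds are placeholders not given in the text, pairs that mix a train drug with a test gene (or vice versa) are assumed to be discarded, and the "classifiers" are stand-in callables rather than trained models.

```python
import random


def disjoint_split(drugs, genes, test_fraction=0.2, seed=0):
    """Split pairs so that train and test share neither drugs nor genes."""
    rng = random.Random(seed)
    drugs, genes = sorted(drugs), sorted(genes)
    rng.shuffle(drugs)
    rng.shuffle(genes)
    n_d = max(1, round(len(drugs) * test_fraction))
    n_g = max(1, round(len(genes) * test_fraction))
    test_drugs, train_drugs = drugs[:n_d], drugs[n_d:]
    test_genes, train_genes = genes[:n_g], genes[n_g:]
    # Pairs mixing a train drug with a test gene (or vice versa) are dropped.
    train = {(d, g) for d in train_drugs for g in train_genes}
    test = {(d, g) for d in test_drugs for g in test_genes}
    return train, test


def training_subsets(positives, negatives, n_subsets, seed=0):
    """Partition the negatives; pair each part with all of the positives."""
    rng = random.Random(seed)
    negatives = sorted(negatives)
    rng.shuffle(negatives)
    chunks = [negatives[i::n_subsets] for i in range(n_subsets)]
    return [sorted(positives) + chunk for chunk in chunks]


def ensemble_score(classifiers, x):
    """Average the prediction scores of all classifiers in the ensemble."""
    return sum(clf(x) for clf in classifiers) / len(classifiers)


drugs = [f"d{i}" for i in range(10)]
genes = [f"g{j}" for j in range(10)]
train, test = disjoint_split(drugs, genes)

positives = set(sorted(train)[:4])   # pretend these are known associations
negatives = train - positives        # everything else is treated as negative
subsets = training_subsets(positives, negatives, n_subsets=3)

# Stand-ins for one trained classifier per training subset.
final_score = ensemble_score([lambda x: 0.2, lambda x: 0.4, lambda x: 0.9], x=None)
```

For scale, enumerating every pair among the 1,262 drugs and 1,062 genes reported above gives 1,340,244 pairs, of which 11,556 are known associations, or roughly 115 negatives per positive. This is consistent with the more than 100-fold imbalance, although the exact ratio within the training set may differ.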
The use of typical cross-validation could prevent classifiers from achieving robust predictive
performance for pair-input data [18]. Here, we designed a novel training scheme where the hold-out
validation set has no overlap with data used for fitting the classifier in terms of either drugs or genes by
adopting the same splitting method as the train-test split (Fig. 3c). For each classifier and each set of
hyperparameters, this splitting was done 15 times, and each time the drug-gene pairs used for model
fitting were intersected with the corresponding training subset while all the drug-gene pairs used for
validation were used for evaluating model performance (Fig. 3d). This training scheme solved the class
imbalance problem while utilizing all training instances.
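A minimal sketch of the validation scheme follows. It reflects one reading of the text: in each of the 15 rounds the training drugs and genes are re-split into disjoint fitting and validation groups, the fitting pairs are restricted to those present in the classifier's training subset, and the validation pairs are left unrestricted. The validation fraction, the seed and all function names are assumptions.

```python
import random


def split_ids(ids, fraction, rng):
    """Shuffle ids and return (kept, held_out) as two disjoint sets."""
    ids = sorted(ids)
    rng.shuffle(ids)
    n_held = max(1, round(len(ids) * fraction))
    return set(ids[n_held:]), set(ids[:n_held])


def validation_rounds(train_drugs, train_genes, training_subset,
                      n_rounds=15, val_fraction=0.2, seed=0):
    """Build (fit_pairs, val_pairs) for each round of hold-out validation."""
    rng = random.Random(seed)
    subset = set(training_subset)
    rounds = []
    for _ in range(n_rounds):
        fit_drugs, val_drugs = split_ids(train_drugs, val_fraction, rng)
        fit_genes, val_genes = split_ids(train_genes, val_fraction, rng)
        # Fitting pairs: intersect the fit block with this classifier's subset.
        fit_pairs = {(d, g) for d, g in subset if d in fit_drugs and g in fit_genes}
        # Validation pairs: every pair in the held-out block, not subsampled.
        val_pairs = {(d, g) for d in val_drugs for g in val_genes}
        rounds.append((fit_pairs, val_pairs))
    return rounds


train_drugs = [f"d{i}" for i in range(8)]
train_genes = [f"g{j}" for j in range(8)]
# Toy training subset: an arbitrary half of all training pairs.
training_subset = {
    (d, g)
    for i, d in enumerate(train_drugs)
    for j, g in enumerate(train_genes)
    if (i + j) % 2 == 0
}
rounds = validation_rounds(train_drugs, train_genes, training_subset)
```

Under this reading, each set of hyperparameters would be scored by averaging performance on `val_pairs` over the rounds, so the evaluation never involves a drug or gene seen during fitting.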
References
XGBoost: A Scalable Tree Boosting System
The Genotype-Tissue Expression (GTEx) project. John T. Lonsdale, +129 more. 29 May 2013.
Extended-Connectivity Fingerprints
The Unified Medical Language System (UMLS): integrating biomedical terminology
Algorithms for Hyper-Parameter Optimization