
A Novel Logic‐Based Approach for Quantitative Toxicology Prediction.

14 Aug 2007-ChemInform (John Wiley & Sons, Ltd)-Vol. 38, Iss: 33
TL;DR: Support vector inductive logic programming (SVILP) as mentioned in this paper is a general approach, which extends the essentially qualitative ILP-based structure activity relationship (SAR) to quantitative modeling, and is used to learn rules, the predictions of which are then used within a novel kernel to derive a support vector generalization model.
Abstract: There is a pressing need for accurate in silico methods to predict the toxicity of molecules that are being introduced into the environment or are being developed into new pharmaceuticals. Predictive toxicology is in the realm of structure activity relationships (SAR), and many approaches have been used to derive such SAR. Previous work has shown that inductive logic programming (ILP) is a powerful approach that circumvents several major difficulties, such as molecular superposition, faced by some other SAR methods. The ILP approach reasons with chemical substructures within a relational framework and yields chemically understandable rules. Here, we report a general new approach, support vector inductive logic programming (SVILP), which extends the essentially qualitative ILP-based SAR to quantitative modeling. First, ILP is used to learn rules, the predictions of which are then used within a novel kernel to derive a support-vector generalization model. For a highly heterogeneous dataset of 576 molecules ...

Summary (1 min read)


INTRODUCTION

  • With more than 70 000 chemicals in use today and many more being synthesized, it is vital that there are effective methods to assess the effect of these compounds on the environment and on human health.
  • Using a recently available dataset of toxicity, DSSTox,[10] which provides the toxicities of 576 chemicals for fathead minnow, the authors show that SVILP yields significantly better accuracies than ILP, regression from chemical descriptors, and an industry standard method, TOPKAT.
  • Importantly, the learned logic rules are readily amenable to interpretation as chemical substructures related to activity and thereby provide extensive chemical insights.

METHODS

  • The SVILP approach [9] uses ILP for learning logic rules, followed by quantitative modeling based on support vector technology as shown in Figure 1.
  • The logic relations identify the chemical fragments according to the atom and bond details of the MOL2 structures.
  • These learned rules form the input for quantitative prediction using the newly developed method SVILP.
  • One fold is used as a testing set, and the four other folds are for training.
  • All molecules above the mean value of toxicities in the training set are considered to be positive (more toxic), and the remaining are considered to be negative (less toxic).

RESULTS

  • The average accuracies of predictions over five folds using the chemical descriptor method (CHEM), ILP rules in combination with PLS, and SVILP are given in Table 2a.
  • In the second part of this study, the molecules were classified into two groups based on their toxicities: that is, toxic (pLC50 ≥ mean) and nontoxic (pLC50 < mean), where "mean" is the average of toxicities of molecules in the training set.
  • For the majority of rules, the distances between the chemical fragments are also defined. For each rule, C is the compression; p and n are the numbers of positives and negatives covered by the rule, respectively.
  • Such chlorinated compounds show toxicity, particularly in aromatic compounds.
  • In the previous sections, the authors compared the SVILP with four methods: that is, ILP, CHEM, PLS, and TOPKAT.

DISCUSSION

  • The authors introduced a new quantitative logic-based method, support vector inductive logic programming , which uses the logic-based technology to learn logic rules followed by regression.
  • The results of this study on a large, public, and diverse dataset show that SVILP predicts the toxicities with higher accuracy than other tested models.
  • One could interpret the higher accuracy of the SVILP and PLS as a consequence of using more features.
  • The rules are chemically understandable and describe the chemical alerts which are the cause of activity/toxicity.
  • The program automatically and consistently detects chemical substructures and properties by construction of rules which are general.


A Novel Logic-Based Approach for Quantitative Toxicology Prediction
Ata Amini, Stephen H. Muggleton, Huma Lodhi, and Michael J. E. Sternberg*,†
Structural Bioinformatics Group, Centre for Bioinformatics, Division of Molecular Biosciences, and
Computational Bioinformatics Laboratory, Department of Computing, Imperial College London,
London SW7 2AZ, U.K.
Received June 1, 2006
There is a pressing need for accurate in silico methods to predict the toxicity of molecules that are being introduced into the environment or are being developed into new pharmaceuticals. Predictive toxicology is in the realm of structure activity relationships (SAR), and many approaches have been used to derive such SAR. Previous work has shown that inductive logic programming (ILP) is a powerful approach that circumvents several major difficulties, such as molecular superposition, faced by some other SAR methods. The ILP approach reasons with chemical substructures within a relational framework and yields chemically understandable rules. Here, we report a general new approach, support vector inductive logic programming (SVILP), which extends the essentially qualitative ILP-based SAR to quantitative modeling. First, ILP is used to learn rules, the predictions of which are then used within a novel kernel to derive a support-vector generalization model. For a highly heterogeneous dataset of 576 molecules with known fathead minnow fish toxicity, the cross-validated correlation coefficients ($R^2_{CV}$) from a chemical descriptor method (CHEM) and SVILP are 0.52 and 0.66, respectively. The ILP, CHEM, and SVILP approaches correctly predict 55, 58, and 73%, respectively, of toxic molecules. In a set of 165 unseen molecules, the $R^2$ values from the commercial software TOPKAT and SVILP are 0.26 and 0.57, respectively. In all calculations, SVILP showed significant improvements in comparison with the other methods. The SVILP approach has a major advantage in that it uses ILP automatically and consistently to derive rules, mostly novel, describing fragments that are toxicity alerts. The SVILP is a general machine-learning approach and has the potential of tackling many problems relevant to chemoinformatics including in silico drug design.
INTRODUCTION
With more than 70 000 chemicals in use today and many more being synthesized, it is vital that there are effective methods to assess the effect of these compounds on the environment and on human health.[1,2] In the development of pharmaceuticals, many potential leads are dropped due to their toxicity after millions of dollars have been invested.[3] Experimental testing is both time-consuming and expensive, and accordingly, there is a pressing requirement for accurate in silico methods to provide an initial screen that generates alerts for toxicity.[4] Often the strategy to develop these predictors follows a more general approach to derive qualitative/quantitative structure activity relationships (SAR).[5] Thus many of the toxicity prediction methods are based on regression from chemical properties, advanced machine learning, or expert-derived rule-based systems.[6,7] One particular machine-learning approach that has successfully been used for toxicity prediction is based on inductive logic programming (ILP).[8] However, a major limitation of ILP is that the resultant logic rules yield yes or no predictions, and thus the capacity for quantitative prediction is limited. Here we report the development of a new quantitative SAR method (SVILP) [9] which combines ILP with support vector (SV) programming. Using a recently available dataset of toxicity, DSSTox,[10] which provides the toxicities of 576 chemicals for fathead minnow, we show that SVILP yields significantly better accuracies than ILP, regression from chemical descriptors, and an industry standard method, TOPKAT.[11] We also compare the SVILP results with two more studies which have investigated the prediction of toxicity for fathead minnow.
The toxicities of several classes of chemicals can, as a first approximation, be estimated from their hydrophobicities (using the logarithm of the octanol/water partition coefficient, LOGP) and their electrophilicities (using the lowest-unoccupied molecular orbital, LUMO, energies and dipole moments).[12-14] There are also expert systems available which are used for prediction of toxicities predominantly in pharmaceutical companies, universities, and government agencies:[15] for example, HazardExpert, which predicts the toxicity according to the dissociation constant (pKa), LOGP, and structure of compounds, and DEREK (deductive estimation of risk from existing knowledge),[16] which is based on toxicophores (chemical substructures with known toxic effect) and physicochemical properties. TOPKAT (toxicity prediction by computer-assisted technology) is a well-known software package for toxicity prediction that uses a quantitative SAR model derived using statistical methods such as linear regression of structural descriptors.[11]
* To whom correspondence should be addressed. Phone: +44 (0)20 7594 5212. Fax: +44 (0)20 7594 5264. E-mail: m.sternberg@imperial.ac.uk.
Structural Bioinformatics Group.
Computational Bioinformatics Laboratory.
998 J. Chem. Inf. Model. 2007, 47, 998-1006
10.1021/ci600223d CCC: $37.00 © 2007 American Chemical Society
Published on Web 04/24/2007
Downloaded by IMPERIAL COLLEGE LONDON on September 4, 2015 | http://pubs.acs.org
Publication Date (Web): April 24, 2007 | doi: 10.1021/ci600223d

ILP has been used to derive SAR in several systems,[17-20] and it has been shown that the logic-based approach remedies many of the problems associated with other SAR methods, including the need for molecular superposition as in CoMFA, problems of handling diverse data sets, and the lack of chemical insights in learned SAR. In the ILP approach, there is no need to define all of the chemical features as in many other methods, since new features can be learned by the use of logic rules. For example, one encodes that atom A is bonded to atom B, and atom B is bonded to atom C. The program can infer that atom A is connected to atom C via atom B without this having to be explicitly encoded. Importantly, the learned logic rules are readily amenable to interpretation as chemical substructures related to activity and thereby provide extensive chemical insights.
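The inference step described above (A bonded to B, B bonded to C, hence A connected to C via B) can be illustrated with a small Python sketch. This is only an analogy to what a logic program derives from bond facts, not the authors' PROLOG encoding; the function name `connected_via` and the atom labels are hypothetical.

```python
from collections import defaultdict

def connected_via(bonds):
    """Map each pair (a, c) to a middle atom b such that a-b and b-c are bonded.

    Only the explicit bond facts are given as input; the two-step
    connections are derived, mirroring the inference described in the text.
    """
    neighbours = defaultdict(set)
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    paths = {}
    for middle, ends in neighbours.items():
        for a in ends:
            for c in ends:
                if a != c:
                    paths[(a, c)] = middle
    return paths

# Explicitly encoded facts: A is bonded to B, and B is bonded to C.
bonds = [("A", "B"), ("B", "C")]
paths = connected_via(bonds)
print(paths[("A", "C")])  # the derived middle atom
```

The relation between A and C is never stated in `bonds`; it appears only in the derived `paths`, which is the point the paragraph makes about not having to encode every feature by hand.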
ILP is essentially a qualitative method, yet the aim is to generate a quantitative toxicity predictor and, more generally, to generate quantitative SAR. A step toward generating quantitative SAR using ILP-based SAR was introduced by King and co-workers.[21] By inspection, they identified a few ILP-derived rules which were used as features in a linear regression to obtain a quantitative SAR.[21] However, their approach did not encapsulate the diversity of logic-learned rules and therefore would be difficult to apply to a diverse dataset, such as the one required for toxicity prediction. In this study, we report a novel approach which quantifies logic rules learned by the ILP method using the SVILP method. This approach benefits from the often-observed improved accuracy of the support vector methodology over linear regression. Moreover, it can include a large number of logic-derived features and therefore is applicable to diverse datasets.
METHODS
The SVILP approach [9] uses ILP for learning logic rules, followed by quantitative modeling based on support vector technology as shown in Figure 1. The first step is to prepare the background knowledge, namely, the chemical fragments in the form of logic relations. The logic relations identify the chemical fragments according to the atom and bond details of the MOL2 structures. For example, we define a hydroxyl group, in the form of a logic relation, as a combination of an oxygen atom and a hydrogen atom that are single-bonded. In addition to background knowledge, the chemicals in the training set are classified into more toxic (positives) and less toxic (negatives) according to the observed toxicities (see below for more details). ILP learning can now be conducted using the background knowledge and the observations. The software CProgol automatically learns logic rules using background knowledge and experimental observations. These learned rules form the input for quantitative prediction using the newly developed method SVILP. Furthermore, ILP provides extensive chemical insight into the cause of toxicity for each chemical. More details about each step are outlined in the following sections.
Dataset. The Distributed Structure-Searchable Toxicity (DSSTox) database from the U.S. Environmental Protection Agency (www.epa.gov; accessed Jan 30 2007) provides various databases including the EPA Fathead Minnow Aquatic Toxicity Database (EPAFHM), which currently contains structures of 613 chemicals, of which 576 have designated toxicity and therefore were used in this study. The set is described as highly diverse since it covers chemicals of various organic classes, that is, hydrocarbons, alcohols, aldehydes, esters, acids, etc., and furthermore, the molecules show many modes of action such as narcotics, oxidative phosphorylation uncouplers, respiratory inhibitors, electrophiles/proelectrophiles, acetylcholinesterase inhibitors, or central nervous system seizure agents. Supporting Information Table SI-1 provides the structures of all of the chemicals. The toxicity end-points are based on the 96 h LC50 (mmol/L) value for the fathead minnow.[10] The LC50 is the aqueous concentration associated with 50% individual survival of a test population within a specified period.[13] The concept of toxicity in this manuscript is not general and is based entirely on the fathead minnow end-point.
The structures of chemicals in DSSTox are stored as SMILES strings, which we converted to 3D structures using CONCORD [22] via its implementation in the TRIPOS software. The 3D structures were then reoptimized using the PM3 semiempirical method,[23] and the structures were then saved in SYBYL Mol2 format, which provides all the necessary information for ILP (including atom types, atomic hybridizations, partial charges, x, y, and z coordinates of all of the atoms, and types of chemical bonds). The program BioMedCache [24] was used to derive the three chemical descriptors used, that is, LUMO, LOGP,[25] and dipole moment. LOGP reflects the hydrophobicity of compounds, and the mechanism of toxicities of highly hydrophobic molecules is often based on their accumulation in the nonpolar lipid phase of the biomembranes. LUMO and dipole moment describe electrophilicities of compounds and are two of the common chemical descriptors used in the literature.[26,27]
In this study, we used only a single conformation of the structures since most of the structures in the set are rigid and therefore are not critically dependent on conformational changes. Furthermore, the logic-based SAR calculates the distance between fragments considering ±1.0 Å as the range, which reduces the dependency of the model on conformational changes.
Figure 1. Process of construction and selection of logic rules by CProgol using the ILP system, followed by quantification of the logic rules using SVILP and testing on the unseen molecules.
Background Knowledge. The first step for ILP learning is to provide generic logical relations that define the chemical fragments of the molecules: the background knowledge. The structures of molecules are prepared in Mol2 format and thus provide all the knowledge we need for a logic-based SAR, that is, details of atoms and bonds. The formats of atoms and bonds as transformed into PROLOG format (the logic programming language) are atom(M,Label,Atom_type,Hybrid,X,Y,Z,Charge) and bond(M,Label1,Label2,Bond_type), where M represents molecules (e.g., m1, m2, ...); Label represents the unique label for each atom in each molecule (e.g., a1, a2, ...) via an atom number; Atom_type represents the type of atoms (e.g., c, h, cl, ...); Hybrid represents the hybridization state of each atom (e.g., 3 for sp3, 2 for sp2, ...); X, Y, Z represent the x, y, z coordinates of each atom; Charge represents the partial charge of each atom; and Bond_type represents the type of bond (“1” for single bonds, “2” for double bonds, “3” for triple bonds, and “ar” for aromatic bonds).
The atom-bond information provides elementary knowledge about the chemicals involved in the set. We then define chemical fragments (e.g., phenyl ring, aldehyde, carboxylic acids) to use as the main features for the ILP calculations. These chemical substructures are defined as relations in the PROLOG language using the background knowledge (atom-bond data). For example, the hydroxyl group can be defined as the following logic relation: hydroxyl(M,A) :- atom(M,A,o,o3,3.21,0.01,1.21,-0.32), atom(M,B,h,h,4.01,0.01,1.02,0.22), bond(M,A,B,1).
This means that any molecule (M) could have an OH group if it has an sp3 oxygen and a hydrogen that are single bonded. Table 1 lists the names of the chemical fragments used in this study.
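To make the atom/bond representation concrete, the following Python sketch mirrors the atom(M,Label,Atom_type,Hybrid,X,Y,Z,Charge) and bond(M,Label1,Label2,Bond_type) facts as tuples and checks the hydroxyl condition stated above (an sp3 oxygen single-bonded to a hydrogen). It is an illustration under assumptions, not the PROLOG background knowledge used in the study: the class names, the function `hydroxyl_oxygens`, the use of "3" in the hybridization field, and the example coordinates and charges are all hypothetical.

```python
from typing import NamedTuple

class Atom(NamedTuple):
    mol: str        # molecule identifier, e.g. "m1"
    label: str      # atom label within the molecule, e.g. "a1"
    element: str    # atom type, e.g. "c", "h", "o"
    hybrid: str     # hybridization, assumed here to be "3" for sp3
    x: float
    y: float
    z: float
    charge: float   # partial charge

class Bond(NamedTuple):
    mol: str
    a: str
    b: str
    kind: str       # "1", "2", "3" or "ar"

def hydroxyl_oxygens(atoms, bonds):
    """Return (molecule, oxygen label) pairs for sp3 O single-bonded to H."""
    by_label = {(at.mol, at.label): at for at in atoms}
    found = []
    for bd in bonds:
        if bd.kind != "1":
            continue
        # A bond fact may list the oxygen first or second, so try both orders.
        for o_label, h_label in ((bd.a, bd.b), (bd.b, bd.a)):
            o = by_label.get((bd.mol, o_label))
            h = by_label.get((bd.mol, h_label))
            if o and h and o.element == "o" and o.hybrid == "3" and h.element == "h":
                found.append((bd.mol, o.label))
    return found

# Hypothetical fragment of a molecule: C-O-H.
atoms = [
    Atom("m1", "a1", "o", "3", 3.21, 0.01, 1.21, -0.32),
    Atom("m1", "a2", "h", "h", 4.01, 0.01, 1.02, 0.22),
    Atom("m1", "a3", "c", "3", 2.10, 0.00, 0.50, 0.10),
]
bonds = [Bond("m1", "a1", "a2", "1"), Bond("m1", "a3", "a1", "1")]
print(hydroxyl_oxygens(atoms, bonds))
```

Unlike the sketch, a logic relation would normally leave coordinates and charges as variables, so the check above deliberately ignores those fields.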
Classification of Molecules. The dataset of 576 molecules was randomly divided into five folds with 115-116 molecules in each fold. One fold is used as a testing set, and the four other folds are for training. Calculations were repeated five times for a 5-fold cross-validation. The dataset was then partitioned into positive and negative examples based on their toxicities. Molecules are distributed in a range of pLC50 [-log(LC50)] from -3 (LC50 = 918 mmol/L) to 6.4 (LC50 = $4.2 \times 10^{-7}$ mmol/L). All molecules above the mean value of toxicities in the training set are considered to be positive (more toxic), and the remaining are considered to be negative (less toxic). Therefore, the cutoff value is different for each fold.
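The fold construction and the per-fold cutoff can be sketched as follows. This is a minimal illustration, assuming a base-10 logarithm for pLC50 and a simple shuffled round-robin split; the function names, the seed, and the toy pLC50 values are hypothetical and the authors' actual partitioning procedure is not described beyond "randomly divided".

```python
import math
import random

def plc50(lc50_mmol_per_l):
    """pLC50 = -log10(LC50), with LC50 in mmol/L as in the text."""
    return -math.log10(lc50_mmol_per_l)

def five_folds(ids, seed=0):
    """Shuffle the molecule ids and deal them into five folds."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::5] for i in range(5)]

def label_by_training_mean(train_plc50):
    """Label training molecules positive (more toxic) if above the training mean."""
    cutoff = sum(train_plc50.values()) / len(train_plc50)
    labels = {mol: value > cutoff for mol, value in train_plc50.items()}
    return cutoff, labels

folds = five_folds(range(576))
print(sorted(len(f) for f in folds))

# The two extremes quoted in the text, rounded to one decimal place.
print(round(plc50(918), 1), round(plc50(4.2e-7), 1))

cutoff, labels = label_by_training_mean({"m1": 1.0, "m2": 3.0, "m3": 5.0})
```

Because the cutoff is recomputed from each training set, the same molecule can fall on different sides of the threshold in different folds, which is why the text notes that the cutoff differs per fold.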
Learning Theories Using PROGOL. ILP learns from known examples or observations (i.e., it employs the reasoning known as induction).[28] The observations, the background knowledge, and the resultant rules are expressed as first-order logic programs, such as “compound m21 contains oxygen atom”. CProgol [29,30] is a state-of-the-art ILP system. CProgol’s input consists of positive and negative examples, which belong to more toxic and less toxic molecules, respectively, together with background knowledge. The output of CProgol is a set of logic rules which describe the positive and negative examples using the information provided in the background knowledge. In CProgol, the first positive example is randomly selected, and on the basis of the background knowledge, hypotheses are constructed; then the hypothesis with maximum compression is selected as the result of the search. Compression, C, for each clause is defined as
$$C = P\,[p - (n + l)]/p$$
where C, P, p, n, and l are compression, total number of positive examples, number of positive examples covered by the clause, number of negative examples covered by the clause, and length of the clause (the number of features in each rule), respectively. Compression is a suitable measure for finding those rules which have predictive power, and it avoids overly specific rules (i.e., long clauses). The calculation is continued on the next positive example, but the redundant examples relative to the previously learned rules are removed. One of the advantages of the logic-based method is that it both constructs and selects the hypotheses. The selection is based on the value of compression that is defined automatically for each rule. At the end of the ILP calculation, all rules with positive compression are used for regression.
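As a worked illustration of the compression measure C = P[p - (n + l)]/p and of keeping only rules with positive compression, consider the following Python sketch. The rule names and counts are invented for illustration and are not taken from the study.

```python
def compression(P, p, n, l):
    """Compression C = P * (p - (n + l)) / p for one clause.

    P: total number of positive examples
    p: positives covered by the clause
    n: negatives covered by the clause
    l: clause length (number of features in the rule)
    """
    if p <= 0:
        raise ValueError("a clause must cover at least one positive example")
    return P * (p - (n + l)) / p

# Hypothetical candidate rules as (name, p, n, l), with P = 200 positives.
P = 200
candidates = [("rule_a", 40, 5, 3), ("rule_b", 4, 3, 2), ("rule_c", 10, 2, 8)]

scores = {name: compression(P, p, n, l) for name, p, n, l in candidates}
kept = [name for name, score in scores.items() if score > 0]
print(scores)
print(kept)
```

Here `rule_a` covers many positives with few negatives and a short body, so its compression is large; `rule_b` covers too few positives relative to its errors and length, and `rule_c` only breaks even, so neither would be passed on to the regression step under the positive-compression criterion.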
Table 1. Chemical Fragments Used in This Study to Construct the Logic Rules
atoms^a: O(SP3), N(SP3), N(tertiary), N(quaternary), N(ar), F(ar), F(nar), F, Cl(ar), Cl(nar), Cl, Br(ar), Br(nar), Br, I(ar), I(nar), I, hydrophobic_hydrogen, S, P, Sn, ic
rings^b: phenyl, hetar6ring, hetnar6ring, hetar5ring, hetnar5ring, cyclohexane
alkyl groups: methyl, ethyl, propyl, butyl, big_alkyl, alkyl, tert-butyl, iso-butyl, iso-propyl
functional groups^c: aldehyde, ether, thioether, ester, carboxylic acid, NH2, amide, ketone, alcohol, nitro, alkene, alkyne, conjugated alkene, cyanide, CCl3, CCl2, NH, N=N, C=N, distance, edg, ewg, positive charge, negative charge, polar, hydrophobic
^a “ar” stands for aromatic (for example, N(ar) means an aromatic nitrogen, but Cl(ar) means a chlorine atom connected to an aromatic atom), and “nar” stands for nonaromatic; ic (isolating or hydrophobic carbons) are carbon atoms which are not double- or triple-bonded to a heteroatom; hydrophobic_hydrogen is a hydrogen which is bonded to an ic.
^b hetnar6ring stands for hetero-nonaromatic-6-membered-ring, and similar definitions are used for other rings.
^c edg and ewg stand for electron-donating groups and electron-withdrawing groups, respectively.
Compression of a clause: $C = P\,[p - (n + l)]/p$
SVILP. Support vector inductive logic programming (SVILP) is at the intersection of two areas of machine learning, namely, support vector machines (SVMs) and inductive logic programming (ILP). It is a novel machine-learning approach which combines the dimensionality-independence advantages of SVMs with the expressive power and flexibility of ILP. In particular, we proposed a kernel [9] which is an inner product between two mapped examples. As with normal ILP, background knowledge and hypothesized clauses are encoded as logic programs. The approach we suggest differs from the existing relational kernels suggested in ref 31 by our use of logical background knowledge. The SVILP approach is a form of generalization relative to background knowledge, although the final combining function for the ILP-learned clauses is an SVM rather than a logical conjunction. Figure 2 outlines the SVILP method. CProgol learns rules as described in the previous section. The learned logic rules are converted into a binary matrix (Figure 2). Each rule is tested for each molecule. If the rule covers the molecule, a “1” is assigned; otherwise a “0” is given. The whole matrix is multiplied by a factor k, defined as $k = 1/\sqrt{m}$, where m is the number of rules used in the matrix. The training matrix is made by adding the pLC50 of the molecules as the dependent variable in the first column of the matrix shown in Figure 2. The support vector machine method provided by SVMTorch (http://www.idiap.ch/machine_learning.php?content=Torch/en_SVMTorch.txt; accessed Jan 30 2007) is then used to make the model. The testing matrix is made using the same procedure, and the model is tested on this matrix for prediction. The results are presented using the squared linear correlation coefficient ($R^2_{CV}$) and the mean squared error (MSE) on the cross-validated data. The method has been reported in separate publications.[9] We chose the widely used partial least squares (PLS) approach for comparison with SVILP. PLS is used when the number of variables exceeds the number of observations.
In SVILP and PLS, we require that the training set is further divided into a smaller training set and a validation set. We use 25% of the training set for validation and the remaining 75% for making the validation model. For PLS calculations, we encoded a program using the algorithm described by Geladi and Kowalski.[32]
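The construction of the rule-coverage matrix described in the SVILP section can be sketched in Python as follows, assuming the scaling factor is k = 1/√m and that a plain inner product between rows is an acceptable stand-in for the kernel. The kernel actually used in ref 9 and the SVMTorch settings are not specified here, so this is an illustration only; the fragment names, rules, pLC50 values, and function names are hypothetical.

```python
import math

def scaled_coverage_matrix(molecules, rules):
    """One row per molecule, one column per rule: k if the rule covers it, else 0."""
    k = 1.0 / math.sqrt(len(rules))
    return [[k if rule(mol) else 0.0 for rule in rules] for mol in molecules]

def inner_product(row_a, row_b):
    """Inner product between two mapped examples (rows of the matrix)."""
    return sum(a * b for a, b in zip(row_a, row_b))

def training_rows(plc50_values, matrix):
    """Prepend the dependent variable (pLC50) as the first column."""
    return [[y] + row for y, row in zip(plc50_values, matrix)]

# Hypothetical molecules described by the fragments they contain,
# and hypothetical learned rules expressed as simple predicates.
molecules = [{"phenyl", "cl_ar"}, {"cl_ar", "nitro", "alkyl"}]
rules = [
    lambda frags: "phenyl" in frags,
    lambda frags: "cl_ar" in frags,
    lambda frags: "nitro" in frags,
    lambda frags: "alkyl" in frags,
]

matrix = scaled_coverage_matrix(molecules, rules)
shared = inner_product(matrix[0], matrix[1])
train = training_rows([2.5, 4.1], matrix)
print(matrix, shared)
```

With this scaling, the inner product of two rows equals the number of rules covering both molecules divided by m, so the similarity stays between 0 and 1 regardless of how many rules were learned in a given fold.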
Chemical Descriptors. The toxicity of compounds can be modeled using various chemical descriptors such as LOGP, LUMO, and dipole moments.[11,26,27] To compare with this approach, we derived a (cross-validated) multilinear regression of pLC50 [-log(LC50 (mol/L))] with the above descriptors, which we termed CHEM because these are among the chemical descriptors used in other studies.[26,27] These chemical features were calculated using the methods described previously.
TOPKAT. TOPKAT (toxicity prediction by computer-assisted technology), developed by Enslein et al.,[11] uses the quantitative SAR model of prediction including linear regression using the structural descriptors. The software accepts the structures of the molecules as SMILES strings, automatically splits the molecule into different fragments, and uses these fragments, as well as some chemical descriptors such as LOGP and shape index, molecular weight, and symmetry, for predictions. The program uses the above descriptors for quantitative toxicology modeling for over 18 endpoints. The Fathead minnow LC50 (version 3.2) was used as the model in this study, and the submodels were chosen by the program. TOPKAT validates its assessment by univariate analysis of the descriptors, multivariate analysis of the query structure in optimum prediction space, and finally similarity searching. To make a fair comparison of the above methods with the commercial software TOPKAT, we must ensure that we only consider predicted accuracies for molecules that were not included in the training data of either method. We therefore excluded any of the DSSTox molecules that TOPKAT had in its database, leaving 165 unseen molecules.
Sign Test. The sign test compares the success of two
methods under the null hypothesis that method 1 has the
same chance of success as method 2. Random distributions
of successes of a method follow a binomial distribution and
the one tail provides the measure of significance.
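A one-tailed sign test of this kind can be sketched in Python as below. The text does not spell out what counts as a "success" for each molecule, so the sketch simply assumes that each molecule yields a win for one method or the other (ties dropped); the function name and the counts are hypothetical.

```python
from math import comb

def sign_test_one_tailed(wins, losses):
    """One-tailed binomial probability of at least `wins` successes.

    Under the null hypothesis each comparison is a fair coin flip, so the
    number of wins out of wins + losses follows Binomial(n, 0.5).
    """
    n = wins + losses
    return sum(comb(n, i) for i in range(wins, n + 1)) / 2 ** n

# Hypothetical comparison: method 1 does better on 9 molecules, worse on 1.
p_value = sign_test_one_tailed(9, 1)
print(p_value)
```

A small value indicates that so many wins for method 1 would be unlikely if both methods had the same chance of success.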
McNemar Test. The McNemar test [33] was used to find the reliability of the classification methods. The McNemar test is a simple and standard approach for finding the statistical significance by evaluation of the probability of $\chi^2 = (b - c)^2/(b + c)$, where b is the number of times that the prediction of the first method is wrong and the prediction of the second method (the case method) is correct and c is the number of times that the prediction of the first method is correct and the prediction of the second method is wrong. A prediction is significant if the probability of $\chi^2$ is < 0.05.
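The McNemar statistic χ² = (b - c)²/(b + c) and its probability can be computed with the standard library alone, as in the sketch below. It assumes the statistic is referred to a chi-square distribution with one degree of freedom and applies no continuity correction, since none appears in the formula given in the text; the counts are hypothetical.

```python
import math

def mcnemar(b, c):
    """Return (chi2, p_value) for discordant counts b and c.

    b: first method wrong, second method correct
    c: first method correct, second method wrong
    For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    """
    if b + c == 0:
        raise ValueError("no discordant predictions to compare")
    chi2 = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: the second method fixes 25 errors and introduces 10.
chi2, p_value = mcnemar(25, 10)
print(round(chi2, 2), p_value < 0.05)
```

Only the discordant pairs enter the statistic; molecules that both methods classify correctly, or both misclassify, carry no information about which method is better.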
RESULTS
The average accuracies of predictions over five folds using the chemical descriptor method (CHEM), ILP rules in combination with PLS, and SVILP are given in Table 2a. The cross-validated squares of the correlation coefficients for CHEM, PLS, and SVILP are 0.52, 0.59, and 0.66, respectively. On the basis of the statistical sign test method, this improvement for SVILP and PLS is highly significant with respect to the CHEM method. The SVILP also shows significant improvement in comparison with the PLS method. The numbers of features that are learned by ILP and used by SVILP and PLS are 1526, 1883, 1802, 1996, and 2095 for calculations 1-5, respectively.
Figure 2. Support vector inductive logic programming (SVILP) for a system of n molecules and m learned rules: $M_1, M_2, \ldots, M_n$ are the list of molecules; $R_1, R_2, \ldots, R_m$ are the logic rules; the initial matrix is binary, “1” when the rule covers the molecule and “0” otherwise. The whole table is multiplied by a k factor.
$k = 1/\sqrt{m}$
$\chi^2 = (b - c)^2/(b + c)$


In the second part of this study, the molecules were classified into two groups based on their toxicities: that is, toxic (pLC50 ≥ mean) and nontoxic (pLC50 < mean), where “mean” is the average of toxicities of molecules in the training set. This study assesses a qualitative prediction of whether a molecule is toxic and also enables us to compare the accuracy of SVILP with ILP. The numbers of correctly and incorrectly predicted samples were found using the three above-described methods, as well as the original ILP approach. The details of predicted values are given in Tables 2b and 3. According to Table 3, the SVILP has the largest recall, which means that more positives have been predicted correctly in comparison with other methods. The CHEM method is the best method regarding the specificity, which means that fewer negative examples have been predicted as toxic by this approach, although the difference is not significant. The total accuracy, however, is in favor of SVILP based on the results of Table 3. Table 2b compares the recalls and shows significant improvements of the SVILP and PLS with respect to the CHEM method.
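The three measures discussed above (recall, specificity, and total accuracy) follow from the four confusion counts, as in this short Python sketch. The counts are invented for illustration and do not reproduce Table 3.

```python
def classification_summary(tp, fn, tn, fp):
    """Recall, specificity and accuracy for a toxic/nontoxic classifier.

    tp: toxic molecules predicted toxic      fn: toxic predicted nontoxic
    tn: nontoxic predicted nontoxic          fp: nontoxic predicted toxic
    """
    return {
        "recall": tp / (tp + fn),            # fraction of toxic molecules found
        "specificity": tn / (tn + fp),       # fraction of nontoxic molecules kept
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

# Hypothetical counts for one method.
summary = classification_summary(tp=30, fn=10, tn=45, fp=15)
print(summary)
```

A method can have the highest recall without the highest specificity, which is the situation the paragraph describes for SVILP and CHEM.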
In the next study, the quality of the above methods was evaluated by comparison of the results with an extensively used software package, TOPKAT. The SVILP, PLS, and ILP procedures were retrained using the 411 molecules of the DSSTox database that are present in the databases of TOPKAT, and the remaining 165 molecules were used for testing. TOPKAT gives an error in prediction of toxicities of 33 molecules for various reasons such as absence of fragments and being outside of the optimum prediction space. Exclusion of these molecules improves the $R^2$ value for TOPKAT to 0.31. Because the change of accuracy resulting from the exclusion of 33 molecules is not significant, we have reported and compared the results based on all of the 165 molecules in Table 2c. According to the results of Table 2c, the SVILP and TOPKAT methods have the highest and lowest accuracy of predictions, respectively. According to the sign test, SVILP shows highly significant improvement in comparison with all of the other approaches. The improvements for PLS in comparison with TOPKAT and CHEM and for CHEM in comparison with TOPKAT are significant. According to the results of Figure 3, TOPKAT is unable to discriminate between the toxic and nontoxic molecules; furthermore, some nontoxic molecules in the top left corner of Figure 3a have been predicted as very toxic, while in the other two methods, we do not see the same problem. The advantage of SVILP in comparison with CHEM is in better prediction of more toxic molecules (top right corner of Figure 3b).
Chemical Insights. A major advantage of the logic-based
approach is that the ILP-phase yields rules that provide
chemical insights. The power of these rules is quantified by
compression. The rules with high compression are expected
to have a greater contribution to the predicted toxicity of
compounds. Tables 4 and 5 show a few sample rules with
high compression, and in Figures 4 and 5, these rules are
shown on appropriate chemical structures. We have examined
these rules and via a literature search assessed whether these
features have previously been identified. Several of the rules
had been previously identified as chemical alerts, but it is
important that these are now discovered automatically and
consistently using ILP. In addition, several new features have
been identified. We now report some of these chemical alerts.
Hydrophobic Features. In our first series of rules, the
molecule is toxic if it has hydrophobic features; however,
this does not mean that all molecules with hydrophobic
features are toxic. These features appear in several different
rules, and Table 4 summarizes a few of the most important
ones. Some of the obvious forms of rules found
in this study are a high value of LOGP, a large number of
hydrophobic hydrogen and hydrophobic carbon atoms, and
two hydrophobic elements (carbon or hydrogen) 8-12 Å
from each other. In Figure 4a, we show two
highly hydrophobic molecules with a large number of
hydrophobic elements. These rules emphasize the importance
of hydrophobic groups in increasing toxicity,
which is a well-known phenomenon in the field of toxicol-
Table 2. (a) Accuracies of Quantitative Predictions by CHEM, PLS, and SVILP, (b) Results of Qualitative Classification of Molecules as Toxic or Nontoxic Using Three above Methods Plus ILP, and (c) Comparison of Results from Various Methods Introduced in This Study with TOPKAT

(a) regression (N = 576)^a

| method  | accuracy: R^2_CV | accuracy: MSE | significant improvement^g over CHEM | over PLS |
|---------|------------------|---------------|-------------------------------------|----------|
| CHEM^d  | 0.52             | 0.81          |                                     |          |
| PLS^e   | 0.59             | 0.67          | 0.005                               |          |
| SVILP^f | 0.66             | 0.57          | 0.000001                            | 0.02     |

(b) classification (N = 220)^b

| method | recall (%) | significant improvement^h over ILP | over CHEM | over PLS |
|--------|------------|------------------------------------|-----------|----------|
| ILP    | 55         |                                    |           |          |
| CHEM   | 58         | 0.41                               |           |          |
| PLS    | 71         | 0.00005                            | 0.00005   |          |
| SVILP  | 73         | 0.00005                            | 0.00005   | 0.28     |

(c) comparison with TOPKAT (N = 165)^c

| method | accuracy: R^2 | accuracy: MSE | significant improvement^g over TOPKAT | over CHEM | over PLS |
|--------|---------------|---------------|---------------------------------------|-----------|----------|
| TOPKAT | 0.26          | 2.2           |                                       |           |          |
| CHEM   | 0.48          | 1.04          | 0.01                                  |           |          |
| PLS    | 0.47          | 1.03          | 0.001                                 | 0.02      |          |
| SVILP  | 0.57          | 0.8           | 0.0001                                | 0.0005    | 0.0001   |

^a The average of correlation coefficients on the five folds plus the mean square error (MSE) values for three methods of calculations. The significant (P < 0.05) improvements are shown in bold. The bold italic values are highly significant (P < 0.01) improvements. The numbers are the one-tail probabilities. N is the number of samples used for cross-validation.
^b The results of classification for the toxic class of molecules (pLC50 > mean) for four methods described in the text.
^c Comparison of toxicities of 165 molecules predicted by the different methods including the commercial software TOPKAT.
^d CHEM stands for chemical descriptors (LOGP, LUMO, and dipole moment).
^e Partial least square (PLS).
^f Support vector inductive logic programming (SVILP).
^g Using sign test.
^h Using McNemar method.
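The McNemar method named in footnote h compares two classifiers on the same molecules through their discordant pairs. The sketch below uses the common continuity-corrected chi-square form. Whether the study used this form or an exact variant is not stated, so that choice is an assumption, and the discordant counts are hypothetical.

```python
from math import erfc, sqrt


def mcnemar_corrected(b, c):
    """Continuity-corrected McNemar test.

    b: molecules classified correctly by method A only.
    c: molecules classified correctly by method B only.
    Returns (statistic, two-sided p-value) using a chi-square
    distribution with one degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs, no evidence of a difference
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # For one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return statistic, erfc(sqrt(statistic / 2))


# Hypothetical discordant counts (illustration only).
statistic, p_two_sided = mcnemar_corrected(b=25, c=10)
print(f"chi2 = {statistic:.2f}, two-sided p = {p_two_sided:.4f}")
```

Only the discordant pairs enter the statistic; molecules that both methods classify correctly, or both misclassify, do not affect the result.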
Table 3. Prediction of Toxic and Nontoxic Classes of Molecules by Methods Described in the Text

| method | TP^a | FN^b | TN^c | FP^d | recall^e | specificity^f | accuracy^g |
|--------|------|------|------|------|----------|---------------|------------|
| CHEM   | 128  | 92   | 324  | 32   | 0.58     | 0.91          | 0.78       |
| PLS    | 156  | 64   | 303  | 53   | 0.71     | 0.85          | 0.80       |
| SVILP  | 161  | 59   | 310  | 46   | 0.73     | 0.87          | 0.82       |
| ILP    | 121  | 99   | 313  | 43   | 0.55     | 0.88          | 0.75       |

^a True positives.
^b False negatives.
^c True negative.
^d False positive.
^e TP/(TP + FN).
^f TN/(TN + FP).
^g (TP + TN)/(TP + TN + FP + FN).
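The recall, specificity, and accuracy columns of Table 3 follow directly from the footnote formulas. The short sketch below recomputes them from the TP, FN, TN, and FP counts in the table; the class and variable names are illustrative choices, not part of the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # true positives
    fn: int  # false negatives
    tn: int  # true negatives
    fp: int  # false positives

    def recall(self):
        return self.tp / (self.tp + self.fn)

    def specificity(self):
        return self.tn / (self.tn + self.fp)

    def accuracy(self):
        return (self.tp + self.tn) / (self.tp + self.tn + self.fp + self.fn)


# Counts copied from Table 3.
table3 = {
    "CHEM": Confusion(tp=128, fn=92, tn=324, fp=32),
    "PLS": Confusion(tp=156, fn=64, tn=303, fp=53),
    "SVILP": Confusion(tp=161, fn=59, tn=310, fp=46),
    "ILP": Confusion(tp=121, fn=99, tn=313, fp=43),
}

metrics = {
    name: (round(c.recall(), 2), round(c.specificity(), 2), round(c.accuracy(), 2))
    for name, c in table3.items()
}
for name, (rec, spec, acc) in metrics.items():
    print(f"{name:6s} recall={rec:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```

Each row sums to 576 molecules, with 220 in the toxic class, which matches the sample sizes given in Table 2.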
1002 J. Chem. Inf. Model., Vol. 47, No. 3, 2007 AMINI ET AL.
Publication Date (Web): April 24, 2007 | doi: 10.1021/ci600223d

Citations
Journal ArticleDOI
TL;DR: Early retrieved compounds showed high topological differences to molecules used as training data, showing the strength of this method for scaffold hopping, and the method was benchmarked on the Directory of Useful Decoys datasets.
Abstract: Investigational Novel Drug Discovery by Example (INDDEx™) is a technology developed to guide hit to lead discovery by learning rules from existing active compounds that link activity to chemical substructure. INDDEx is based on Inductive Logic Programming [1], which learns easily interpretable qualitative logic rules from active ligands that give an insight into chemistry, relate molecular substructure to activity, and can be used to guide the next steps of drug design chemistry. Support Vector Machines weight the rules to produce a quantitative model of structure-activity relationships. Whereas earlier testing [2,3] was performed on single dataset examples, this talk presents the largest and fullest test of the method. The method was benchmarked on the Directory of Useful Decoys (DUD) datasets [4], using the same methodology described in the paper on the assessment of LASSO [5] and DOCK. For each of the DUD datasets, the known active ligands were mixed with all the decoy compounds in DUD, and the retrieval rates of INDDEx and DUD were measured when they were trained on 2, 4, and 8 of the known active ligands (Figure 2). Early retrieved compounds showed high topological differences to molecules used as training data, showing the strength of this method for scaffold hopping. This work was supported by a BBSRC case studentship with Equinox Pharma Ltd (http://www.equinoxpharma.com). Figure 1 Recovery of actives in each of the DUD datasets from all decoys in the DUD, averaged across all 40 datasets.

2 citations


Cites methods from "A Novel Logic‐Based Approach for Qu..."

  • ...Whereas earlier testing [2,3] was performed on single dataset examples, this talk presents the largest and fullest test of the method....

