A Novel Logic‐Based Approach for Quantitative Toxicology Prediction.

doi:10.1002/CHIN.200733223

Ata Amini,† Stephen H. Muggleton,‡ Huma Lodhi,‡ and Michael J. E. Sternberg*,†

Structural Bioinformatics Group, Centre for Bioinformatics, Division of Molecular Biosciences, and Computational Bioinformatics Laboratory, Department of Computing, Imperial College London, London SW7 2AZ, U.K.

Received June 1, 2006

There is a pressing need for accurate in silico methods to predict the toxicity of molecules that are being introduced into the environment or are being developed into new pharmaceuticals. Predictive toxicology is in the realm of structure activity relationships (SAR), and many approaches have been used to derive such SAR. Previous work has shown that inductive logic programming (ILP) is a powerful approach that circumvents several major difficulties, such as molecular superposition, faced by some other SAR methods. The ILP approach reasons with chemical substructures within a relational framework and yields chemically understandable rules. Here, we report a general new approach, support vector inductive logic programming (SVILP), which extends the essentially qualitative ILP-based SAR to quantitative modeling. First, ILP is used to learn rules, the predictions of which are then used within a novel kernel to derive a support-vector generalization model. For a highly heterogeneous dataset of 576 molecules with known fathead minnow fish toxicity, the cross-validated correlation coefficients ($R^2_{\mathrm{CV}}$) from a chemical descriptor method (CHEM) and SVILP are 0.52 and 0.66, respectively. The ILP, CHEM, and SVILP approaches correctly predict 55, 58, and 73%, respectively, of toxic molecules. In a set of 165 unseen molecules, the $R^2$ values from the commercial software TOPKAT and SVILP are 0.26 and 0.57, respectively. In all calculations, SVILP showed significant improvements in comparison with the other methods. The SVILP approach has a major advantage in that it uses ILP automatically and consistently to derive rules, mostly novel, describing fragments that are toxicity alerts. SVILP is a general machine-learning approach and has the potential of tackling many problems relevant to chemoinformatics, including in silico drug design.

INTRODUCTION

With more than 70 000 chemicals in use today and many more being synthesized, it is vital that there are effective methods to assess the effect of these compounds on the environment and on human health.[1,2] In the development of pharmaceuticals, many potential leads are dropped due to their toxicity after millions of dollars have been invested.[3] Experimental testing is both time-consuming and expensive, and accordingly, there is a pressing requirement for accurate in silico methods to provide an initial screen that generates alerts for toxicity.[4] Often the strategy to develop these predictors follows a more general approach to derive qualitative/quantitative structure activity relationships (SAR).[5] Thus many of the toxicity prediction methods are based on regression from chemical properties, advanced machine learning, or expert-derived rule-based systems.[6,7] One particular machine-learning approach that has successfully been used for toxicity prediction is based on inductive logic programming (ILP).[8] However, a major limitation of ILP is that the resultant logic rules yield yes or no predictions, and thus the capacity for quantitative prediction is limited. Here we report the development of a new quantitative SAR method (SVILP)[9] which combines ILP with support vector (SV) programming. Using a recently available toxicity dataset, DSSTox,[10] which provides the toxicities of 576 chemicals for fathead minnow, we show that SVILP yields significantly better accuracies than ILP, regression from chemical descriptors, and an industry standard method, TOPKAT.[11] We also compare the SVILP results with two more studies which have investigated the prediction of toxicity for fathead minnow.

The toxicities of several classes of chemicals can, as a first approximation, be estimated from their hydrophobicities (using the logarithm of the octanol/water partition coefficient, LOGP) and their electrophilicities (using the lowest-unoccupied molecular orbital, LUMO, energies and dipole moments).[12-14] There are also expert systems available which are used for prediction of toxicities, predominantly in pharmaceutical companies, universities, and government agencies:[15] for example, HazardExpert, which predicts the toxicity according to the dissociation constant (p$K_\mathrm{a}$), LOGP, and structure of compounds, and DEREK (deductive estimation of risk from existing knowledge),[16] which is based on toxicophores (chemical substructures with known toxic effect) and physicochemical properties. TOPKAT (toxicity prediction by computer-assisted technology) is a well-known software package for toxicity prediction that uses a quantitative SAR model derived using statistical methods such as linear regression of structural descriptors.[11]

11

* To whom correspondence should be addressed. Phone: +44 (0)20 7594 5212. Fax: +44 (0)20 7594 5264. E-mail: m.sternberg@imperial.ac.uk.

† Structural Bioinformatics Group.

‡ Computational Bioinformatics Laboratory.

998 J. Chem. Inf. Model. 2007, 47, 998-1006

Published on Web 04/24/2007

Downloaded by IMPERIAL COLLEGE LONDON on September 4, 2015 | http://pubs.acs.org

Publication Date (Web): April 24, 2007 | doi: 10.1021/ci600223d

ILP has been used to derive SAR in several systems,[17-20] and it has been shown that the logic-based approach remedies many of the problems associated with other SAR methods, including the need for molecular superposition as in CoMFA, problems of handling diverse data sets, and the lack of chemical insights in learned SAR. In the ILP approach, there is no need to define all of the chemical features as in many other methods, since new features can be learned by the use of logic rules. For example, one encodes that atom A is bonded to atom B, and atom B is bonded to atom C. The program can infer that atom A is connected to atom C via atom B without this having to be explicitly encoded. Importantly, the learned logic rules are readily amenable to interpretation as chemical substructures related to activity and thereby provide extensive chemical insights.
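The A, B, C example above can be illustrated with a small Python sketch. It is not how CProgol represents or derives relations; the function name `connected_via` and the tuple encoding of bonds are assumptions made purely for illustration. Only the two bond facts are stated, and the "connected via" relation is derived from them.

```python
from collections import defaultdict

def connected_via(bonds):
    """Derive (a, c, b) triples: atom a reaches atom c through shared neighbour b."""
    neighbours = defaultdict(set)
    for x, y in bonds:              # bonds are treated as undirected
        neighbours[x].add(y)
        neighbours[y].add(x)
    triples = set()
    for b, ends in neighbours.items():
        for a in ends:
            for c in ends:
                if a != c:
                    triples.add((a, c, b))
    return triples

# Only the two bonds are encoded explicitly (hypothetical atom labels).
bonds = [("A", "B"), ("B", "C")]
paths = connected_via(bonds)
print(sorted(paths))
```

The derived set contains the A to C connection through B in both directions, although no fact about A and C was supplied.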

ILP is essentially a qualitative method, yet the aim is to generate a quantitative toxicity predictor and, more generally, to generate quantitative SAR. A step toward generating quantitative SAR using ILP-based SAR was introduced by King and co-workers.[21] By inspection, they identified a few ILP-derived rules which were used as features in a linear regression to obtain a quantitative SAR.[21] However, their approach did not encapsulate the diversity of logic-learned rules and therefore would be difficult to apply to a diverse dataset, such as the one required for toxicity prediction. In this study, we report a novel approach, SVILP, which quantifies the logic rules learned by the ILP method. This approach benefits from the often-observed improved accuracy of the support vector methodology over linear regression. Moreover, it can include a large number of logic-derived features and therefore is applicable to diverse datasets.

METHODS

The SVILP approach[9] uses ILP for learning logic rules, followed by quantitative modeling based on support vector technology, as shown in Figure 1. The first step is to prepare the background knowledge, namely, the chemical fragments in the form of logic relations. The logic relations identify the chemical fragments according to the atom and bond details of the MOL2 structures. For example, we define a hydroxyl group, in the form of a logic relation, as a combination of an oxygen atom and a hydrogen atom that are single-bonded. In addition to background knowledge, the chemicals in the training set are classified into more toxic (positives) and less toxic (negatives) according to the observed toxicities (see below for more details). ILP learning can now be conducted using the background knowledge and the observations. The software CProgol automatically learns logic rules using background knowledge and experimental observations. These learned rules form the input for quantitative prediction using the newly developed method SVILP. Furthermore, ILP provides extensive chemical insight into the cause of toxicity for each chemical. More details about each step are outlined in the following sections.

Dataset. The Distributed Structure-Searchable Toxicity (DSSTox) database from the U.S. Environmental Protection Agency (www.epa.gov; accessed Jan 30 2007) provides various databases including the EPA Fathead Minnow Aquatic Toxicity Database (EPAFHM), which currently contains structures of 613 chemicals, of which 576 have designated toxicity and therefore were used in this study. The set is described as highly diverse since it covers chemicals of various organic classes, that is, hydrocarbons, alcohols, aldehydes, esters, acids, etc., and furthermore, the molecules show many modes of action, such as narcotics, oxidative phosphorylation uncouplers, respiratory inhibitors, electrophiles/proelectrophiles, acetylcholinesterase inhibitors, or central nervous system seizure agents. Supporting Information Table SI-1 provides the structures of all of the chemicals. The toxicity end-points are based on the 96 h LC50 (mmol/L) value for the fathead minnow.[10] The LC50 is the aqueous concentration associated with 50% individual survival of a test population within a specified period.[13] The concept of toxicity in this manuscript is not general and is based entirely on the fathead minnow end-point.

The structures of chemicals in DSSTox are stored as SMILES strings, which we converted to 3D structures using CONCORD[22] via its implementation in the TRIPOS software. The 3D structures were then reoptimized using the PM3 semiempirical method,[23] and the structures were then saved in SYBYL Mol2 format, which provides all the necessary information for ILP (including atom types, atomic hybridizations, partial charges, x, y, and z coordinates of all of the atoms, and types of chemical bonds). The program BioMedCache[24] was used to derive the three chemical descriptors used, that is, LUMO, LOGP,[25] and dipole moment. LOGP reflects the hydrophobicity of compounds, and the mechanism of toxicity of highly hydrophobic molecules is often based on their accumulation in the nonpolar lipid phase of the biomembranes. LUMO and dipole moment describe electrophilicities of compounds and are two of the common chemical descriptors used in the literature.[26,27]

In this study, we used only a single conformation of the structures since most of the structures in the set are rigid and therefore are not critically dependent on conformational changes. Furthermore, the logic-based SAR calculates the distance between fragments considering ±1.0 Å as the range, which reduces the dependency of the model on conformational changes.

Figure 1. Process of construction and selection of logic rules by CProgol using the ILP system, followed by quantification of the logic rules using SVILP and testing on the unseen molecules.

Background Knowledge. The first step for ILP learning is to provide generic logical relations that define the chemical fragments of the molecules: the background knowledge. The structures of molecules are prepared in Mol2 format and thus provide all the knowledge we need for a logic-based SAR, that is, details of atoms and bonds. The formats of atoms and bonds as transformed into PROLOG format (the logic programming language) are atom(M,Label,Atom_type,Hybrid,X,Y,Z,Charge) and bond(M,Label1,Label2,Bond_type), where M represents molecules (e.g., m1, m2, ...); Label represents the unique label for each atom in each molecule (e.g., a1, a2, ...) via an atom number; Atom_type represents the type of atoms (e.g., c, h, cl, ...); Hybrid represents the hybridization state of each atom (e.g., 3 for sp3, 2 for sp2, ...); X, Y, Z represent the x, y, z coordinates of each atom; Charge represents the partial charge of each atom; and Bond_type represents the type of bond (“1” for single bonds, “2” for double bonds, “3” for triple bonds, and “ar” for aromatic bonds).

The atom-bond information provides elementary knowledge about the chemicals involved in the set. We then define chemical fragments (e.g., phenyl ring, aldehyde, carboxylic acids) to use as the main features for the ILP calculations. These chemical substructures are defined as relations in the PROLOG language using the background knowledge (atom-bond data). For example, the hydroxyl group can be defined as the following logic relation: hydroxyl(M,A) = atom(M,A,o,o3,3.21,0.01,1.21,-0.32), atom(M,B,h,h,4.01,0.01,1.02,0.22), bond(M,A,B,1).

This means that any molecule (M) has an OH group if it has an sp3 oxygen and a hydrogen that are single bonded. Table 1 lists the names of the chemical fragments used in this study.
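As a rough illustration of how such a fragment relation picks out atoms, the following Python sketch matches a hydroxyl group over atom and bond facts laid out in the order of the atom(...) and bond(...) formats above. The facts themselves, the hybridization code "c3" for the carbon, and the function name `hydroxyl` are hypothetical. The sketch also assumes that coordinates and charges play no role in the match, which may differ from the authors' PROLOG definition.

```python
# Hypothetical facts: (M, Label, Atom_type, Hybrid, X, Y, Z, Charge)
ATOMS = [
    ("m1", "a1", "c", "c3", 2.00, 0.00, 1.00, 0.05),
    ("m1", "a2", "o", "o3", 3.21, 0.01, 1.21, -0.32),
    ("m1", "a3", "h", "h", 4.01, 0.01, 1.02, 0.22),
]
# Hypothetical facts: (M, Label1, Label2, Bond_type)
BONDS = [("m1", "a1", "a2", "1"), ("m1", "a2", "a3", "1")]

def hydroxyl(atoms, bonds, mol):
    """Return labels of sp3 oxygens in `mol` that are single-bonded to a hydrogen."""
    kind = {(m, label): (elem, hyb) for m, label, elem, hyb, *_ in atoms}
    found = set()
    for m, first, second, order in bonds:
        if m != mol or order != "1":
            continue
        # A bond may list the oxygen first or second, so try both orientations.
        for o_label, h_label in ((first, second), (second, first)):
            is_sp3_oxygen = kind.get((m, o_label)) == ("o", "o3")
            is_hydrogen = kind.get((m, h_label), ("", ""))[0] == "h"
            if is_sp3_oxygen and is_hydrogen:
                found.add(o_label)
    return sorted(found)

print(hydroxyl(ATOMS, BONDS, "m1"))
```

For the three facts above, only the oxygen labelled a2 satisfies the relation.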

Classification of Molecules. The dataset of 576 molecules was randomly divided into five folds with 115-116 molecules in each fold. One fold is used as a testing set, and the four other folds are used for training. Calculations were repeated five times for a 5-fold cross-validation. The dataset was then partitioned into positive and negative examples based on their toxicities. Molecules are distributed over a range of pLC50 [-log(LC50)] from -3 (LC50 = 918 mmol/L) to 6.4 (LC50 = $4.2 \times 10^{-7}$ mmol/L). All molecules above the mean value of toxicities in the training set are considered to be positive (more toxic), and the remaining are considered to be negative (less toxic). Therefore, the cutoff value is different for each fold.
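The partitioning just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the pLC50 values are synthetic random numbers spanning the quoted range, and the seed, the round-robin fold assignment, and the names `p_lc50` and `five_fold_labels` are all hypothetical.

```python
import math
import random

def p_lc50(lc50_mmol_per_l):
    """pLC50 = -log10(LC50), with LC50 in the units quoted in the text."""
    return -math.log10(lc50_mmol_per_l)

def five_fold_labels(plc50, n_folds=5, seed=0):
    """Split indices into folds; label molecules using each training-set mean."""
    idx = list(range(len(plc50)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    splits = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        cutoff = sum(plc50[i] for i in train) / len(train)   # mean of training fold only
        labels = {i: plc50[i] > cutoff for i in idx}          # True means "more toxic"
        splits.append((train, test, cutoff, labels))
    return splits

# Synthetic stand-in for the 576 measured toxicities (not real data).
rng = random.Random(1)
values = [rng.uniform(-3.0, 6.4) for _ in range(576)]
splits = five_fold_labels(values)
cutoffs = [cutoff for _, _, cutoff, _ in splits]
print([len(test) for _, test, _, _ in splits], cutoffs)
```

Because the cutoff is recomputed from each training set, a molecule close to the mean may be labelled positive in one fold and negative in another.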

Learning theories using PROGOL. ILP learns from known examples or observations (i.e., it employs the reasoning known as induction).[28] The observations, the background knowledge, and the resultant rules are expressed as first-order logic programs, such as “compound m21 contains oxygen atom”. CProgol[29,30] is a state-of-the-art ILP system. CProgol’s input consists of positive and negative examples, which belong to more toxic and less toxic molecules, respectively, together with background knowledge. The output of CProgol is a set of logic rules which describe the positive and negative examples using the information provided in the background knowledge. In CProgol, the first positive example is randomly selected, and on the basis of the background knowledge, hypotheses are constructed; then the hypothesis with maximum compression is selected as the result of the search. Compression, C, for each clause is defined as

$$C = \frac{P\,[p - (n + l)]}{p}$$

where C, P, p, n, and l are the compression, the total number of positive examples, the number of positive examples covered by the clause, the number of negative examples covered by the clause, and the length of the clause (the number of features in each rule), respectively. Compression is a suitable measure for finding those rules which have predictive power, and it avoids overly specific rules (i.e., long clauses). The calculation is continued on the next positive example, but the examples that are redundant relative to the previously learned rules are removed. One of the advantages of the logic-based method is that it both constructs and selects the hypotheses. The selection is based on the value of compression that is defined automatically for each rule. At the end of the ILP calculation, all of the rules with positive compression are used for regression.

Table 1. Chemical Fragments Used in This Study to Construct the Logic Rules

atomsᵃ: O(SP3), N(SP3), N(tertiary), N(quaternary), N(ar), F(ar), F(nar), F, Cl(ar), Cl(nar), Cl, Br(ar), Br(nar), Br, I(ar), I(nar), I, hydrophobic_hydrogen, S, P, Sn, ic

ringsᵇ: phenyl, hetar6ring, hetnar6ring, hetar5ring, hetnar5ring, cyclohexane

alkyl groups: methyl, ethyl, propyl, butyl, big_alkyl, alkyl, tert-butyl, iso-butyl, iso-propyl

functional groupsᶜ: aldehyde, ether, thioether, ester, carboxylic acid, NH2, amide, ketone, alcohol, nitro, alkene, alkyne, conjugated alkene, cyanide, CCl3, CCl2, NH, N=N, C=N, distance, edg, ewg, positive charge, negative charge, polar, hydrophobic

ᵃ ar stands for aromatic (for example, N(ar) means an aromatic nitrogen, but Cl(ar) means a chlorine atom connected to an aromatic atom), and “nar” stands for nonaromatic; ic (isolating or hydrophobic carbons) are carbon atoms which are not double- or triple-bonded to a heteroatom; hydrophobic_hydrogen is a hydrogen which is bonded to an ic. ᵇ hetnar6ring stands for hetero-nonaromatic-6-membered-ring, and similar definitions are used for other rings. ᶜ edg and ewg stand for e-donating-groups and e-withdrawing-groups, respectively.

SVILP. Support vector inductive logic programming (SVILP) is at the intersection of two areas of machine learning, namely, support vector machines (SVMs) and inductive logic programming (ILP). It is a novel machine-learning approach which combines the dimensionality-independence advantages of SVMs with the expressive power and flexibility of ILP. In particular, we proposed a kernel[9] which is an inner product between two mapped examples. As with normal ILP, background knowledge and hypothesized clauses are encoded as logic programs. The approach we suggest differs from the existing relational kernels suggested in ref 31 by our use of logical background knowledge. The SVILP approach is a form of generalization relative to background knowledge, although the final combining function for the ILP-learned clauses is an SVM rather than a logical conjunction. Figure 2 outlines the SVILP method. CProgol learns rules as described in the previous section. The learned logic rules are converted into a binary matrix (Figure 2). Each rule is tested for each molecule. If the rule covers the molecule, a “1” is assigned; otherwise, a “0” is given. The whole matrix is multiplied by a k factor which is defined as

$$k = \frac{1}{\sqrt{m}}$$

where m is the number of rules used in the matrix. The training matrix is made by addition of the pLC50 of the molecules as the dependent variable to the first column of the matrix shown in Figure 2. The support vector machine method provided by SVMTorch (http://www.idiap.ch/machine_learning.php?content=Torch/en_SVMTorch.txt; accessed Jan 30 2007) is then used to make the model. The testing matrix is made using the same procedure, and the model is tested on this matrix for prediction. The results are presented using the squared linear correlation coefficient ($R^2_{\mathrm{CV}}$) and the mean squared error (MSE) on the cross-validated data. The method has been reported in separate publications.[9] We chose the widely used partial least squares (PLS) approach for comparison with SVILP. PLS is used when the number of variables exceeds the number of observations.
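The construction of the scaled rule-coverage matrix in the SVILP section can be sketched in Python. This is a minimal illustration, assuming the k factor is 1/√m and using a plain inner product between rows; the kernel actually used in ref 9 may differ, and the coverage values, `scaled_features`, and `linear_kernel` are hypothetical names and data. Training the support vector regressor itself (SVMTorch in the paper) is not shown.

```python
import math

def scaled_features(coverage):
    """Multiply a binary molecule-by-rule matrix by k = 1/sqrt(m), m = number of rules."""
    m = len(coverage[0])
    k = 1.0 / math.sqrt(m)
    return [[k * bit for bit in row] for row in coverage]

def linear_kernel(u, v):
    """Inner product between two mapped molecules."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical coverage: rows are molecules, columns are learned rules.
coverage = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
]
X = scaled_features(coverage)
gram = [[linear_kernel(u, v) for v in X] for u in X]
for row in gram:
    print(row)
```

Under this scaling, each kernel entry equals the number of rules that cover both molecules divided by m, so the values stay in [0, 1] however many rules are learned.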

In SVILP and PLS, we require that the training set is further divided into a smaller training and a validating set. We use 25% of the training set for validation and the remaining 75% for making the validation model. For PLS calculations, we wrote a program using the algorithm described by Geladi and Kowalski.[32]

Chemical Descriptors. The toxicity of compounds can be modeled using various chemical descriptors such as LOGP, LUMO, and dipole moments.[11,26,27] To compare with this approach, we derived a (cross-validated) multilinear regression of pLC50 [-log(LC50 (mol/L))] with the above descriptors, which we termed CHEM because these are among the chemical descriptors used in other studies.[26,27] These chemical features were calculated using the methods described previously.

TOPKAT. TOPKAT (toxicity prediction by computer-assisted technology), developed by Enslein et al.,[11] uses a quantitative SAR model of prediction, including linear regression using structural descriptors. The software accepts the structures of the molecules as SMILES strings, automatically splits each molecule into different fragments, and uses these fragments, as well as some chemical descriptors such as LOGP, shape index, molecular weight, and symmetry, for predictions. The program uses the above descriptors for quantitative toxicology modeling for over 18 endpoints. The Fathead minnow LC50 (version 3.2) was used as the model in this study, and the submodels were chosen by the program. TOPKAT validates its assessment by univariate analysis of the descriptors, multivariate analysis of the query structure in optimum prediction space, and finally similarity searching. To make a fair comparison of the above methods with the commercial software TOPKAT, we must ensure that we only consider predicted accuracies for molecules that were not included in the training data of either method. We therefore excluded any of the DSSTox molecules that TOPKAT had in its database, leaving 165 unseen molecules.

Sign Test. The sign test compares the success of two

methods under the null hypothesis that method 1 has the

same chance of success as method 2. Random distributions

of successes of a method follow a binomial distribution and

the one tail provides the measure of significance.
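A one-tailed sign test of this kind can be computed directly from the binomial distribution. The sketch below is an illustration only: how the authors counted a "success" (for instance, which method has the smaller error on a molecule) and how ties were handled are not stated here, so the counts passed in are hypothetical and ties are assumed to have been dropped beforehand.

```python
from math import comb

def sign_test_one_tail(wins, trials):
    """P(X >= wins) for X ~ Binomial(trials, 0.5), the null of equal success chance."""
    return sum(comb(trials, i) for i in range(wins, trials + 1)) / 2 ** trials

# Hypothetical counts: method 1 does better on 8 of 10 untied molecules.
print(sign_test_one_tail(8, 10))
```

A small one-tail probability indicates that so many successes would be unlikely if both methods were equally good.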

McNemar Test. The McNemar test[33] was used to find the reliability of the classification methods. The McNemar test is a simple and standard approach for finding the statistical significance by evaluation of the probability of $\chi^2$

$$\chi^2 = \frac{(b - c)^2}{b + c}$$

where b is the number of times that the prediction of the first method is wrong and the prediction of the second method (the case method) is correct and c is the number of times that the prediction of the first method is correct and the prediction of the second method is wrong. A prediction is significant if the probability associated with $\chi^2$ is below 0.05.

RESULTS

The average accuracies of predictions over five folds using the chemical descriptor method (CHEM), ILP rules in combination with PLS, and SVILP are given in Table 2a. The cross-validated squares of the correlation coefficients for CHEM, PLS, and SVILP are 0.52, 0.59, and 0.66, respectively. On the basis of the statistical sign test, this improvement for SVILP and PLS is highly significant with respect to the CHEM method. SVILP also shows significant improvement in comparison with the PLS method. The numbers of features that are learned by ILP and used by SVILP and PLS are 1526, 1883, 1802, 1996, and 2095 for calculations 1-5, respectively.

Figure 2. Support vector inductive logic programming (SVILP) for a system of n molecules and m learned rules: $M_1, M_2, \ldots, M_n$ are the list of molecules; $R_1, R_2, \ldots, R_m$ are the logic rules; the initial matrix is binary, “1” when a rule covers the molecule and “0” otherwise. The whole table is multiplied by a k factor, $k = 1/\sqrt{m}$.

In the second part of this study, the molecules were classified into two groups based on their toxicities: that is, toxic (pLC50 ≥ mean) and nontoxic (pLC50 < mean), where “mean” is the average of the toxicities of the molecules in the training set. This study assesses a qualitative prediction of whether a molecule is toxic and also enables us to compare the accuracy of SVILP with ILP. The numbers of correctly and incorrectly predicted samples were found using the three methods described above, as well as the original ILP approach. The details of the predicted values are given in Tables 2b and 3. According to Table 3, SVILP has the largest recall, which means that more positives have been predicted correctly in comparison with the other methods. The CHEM method is the best among the methods regarding specificity, which means that fewer negative examples have been predicted as toxic by this approach, although the difference is not significant. The total accuracy, however, is in favor of SVILP based on the results of Table 3. Table 2b compares the recalls and shows significant improvements of SVILP and PLS with respect to the CHEM method.
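The recall, specificity, and accuracy figures follow from the confusion counts using the definitions given in the Table 3 footnotes, and the McNemar statistic follows the formula in Methods. The Python sketch below recomputes the rates from the SVILP and CHEM counts reported in Table 3. The McNemar inputs b and c are hypothetical, since the paired disagreement counts are not reported, and converting the statistic to a probability through a chi-square distribution with one degree of freedom is an assumption about how the probability was obtained.

```python
import math

def rates(tp, fn, tn, fp):
    """Recall, specificity, and accuracy as defined in the Table 3 footnotes."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return recall, specificity, accuracy

def mcnemar(b, c):
    """chi2 = (b - c)^2 / (b + c) and its upper-tail probability (1 degree of freedom)."""
    if b + c == 0:
        return 0.0, 1.0                       # no disagreements between the methods
    chi2 = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # survival function of chi-square, 1 dof
    return chi2, p_value

svilp = rates(tp=161, fn=59, tn=310, fp=46)   # counts from Table 3
chem = rates(tp=128, fn=92, tn=324, fp=32)    # counts from Table 3
print([round(x, 2) for x in svilp], [round(x, 2) for x in chem])
print(mcnemar(b=30, c=10))                    # hypothetical disagreement counts
```

Rounded to two decimals, the recomputed rates agree with the recall, specificity, and accuracy columns of Table 3 for both methods.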

In the next study, the quality of the above methods was evaluated by comparison of the results with an extensively used software package, TOPKAT. The SVILP, PLS, and ILP procedures were retrained using the 411 molecules of the DSSTox database that are present in the databases of TOPKAT, and the remaining 165 molecules were used for testing. TOPKAT gives an error in the prediction of toxicities of 33 molecules for various reasons, such as absence of fragments and being outside of the optimum prediction space. Exclusion of these molecules improves the $R^2$ value for TOPKAT to 0.31. Because the change of accuracy resulting from the exclusion of 33 molecules is not significant, we have reported and compared the results based on all 165 molecules in Table 2c. According to the results of Table 2c, the SVILP and TOPKAT methods have the highest and lowest accuracy of predictions, respectively. According to the sign test, SVILP shows highly significant improvement in comparison with all of the other approaches. The improvements for PLS in comparison with TOPKAT and CHEM and for CHEM in comparison with TOPKAT are significant. According to the results of Figure 3, TOPKAT is unable to discriminate between the toxic and nontoxic molecules; furthermore, some nontoxic molecules in the top left corner of Figure 3a have been predicted as very toxic, while in the other two methods, we do not see the same problem. The advantage of SVILP in comparison with CHEM is in better prediction of more toxic molecules (top right corner of Figure 3b).

Chemical Insights. A major advantage of the logic-based

approach is that the ILP-phase yields rules that provide

chemical insights. The power of these rules is quantified by

compression. The rules with high compression are expected

to have a greater contribution to the predicted toxicity of

compounds. Tables 4 and 5 show a few sample rules with

high compression, and in Figures 4 and 5, these rules are

shown on appropriate chemical structures. We have examined

these rules and via a literature search assessed whether these

features have previously been identified. Several of the rules

had been previously identified as chemical alerts, but it is

important that these are now discovered automatically and

consistently using ILP. In addition, several new features have

been identified. We now report some of these chemical alerts.
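Since the rules are ranked by compression, a small worked example of the measure from Methods, C = P[p − (n + l)]/p, may help. The Python sketch below applies it to hypothetical rule statistics; the rule names and counts are invented for illustration and are not taken from Tables 4 and 5.

```python
def compression(P, p, n, l):
    """C = P * (p - (n + l)) / p for a clause covering p positives and n negatives.

    P is the total number of positive examples and l is the clause length.
    """
    if p == 0:
        raise ValueError("a clause must cover at least one positive example")
    return P * (p - (n + l)) / p

# Hypothetical rules: (name, positives covered, negatives covered, clause length)
P_TOTAL = 100
rules = [("rule_a", 20, 2, 3), ("rule_b", 4, 3, 2), ("rule_c", 10, 0, 2)]
scores = {name: compression(P_TOTAL, p, n, l) for name, p, n, l in rules}
kept = [name for name, score in scores.items() if score > 0]
print(scores, kept)
```

Here rule_b covers few positives relative to its length and the negatives it covers, so its compression is negative and it would not be passed on to the regression step, which uses only rules with positive compression.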

Hydrophobic Features. In our first series of rules, the molecule is toxic if it has hydrophobic features; however, this does not mean that all molecules with hydrophobic features are toxic. These features have been found in different rules, and here, we summarize a few of the most important ones in Table 4. Some of the obvious forms of rules found in this study are a high value of LOGP, a large number of hydrophobic hydrogen and hydrophobic carbon atoms, and two hydrophobic elements (carbon or hydrogen) in a range of 8-12 Å from each other. In Figure 4a, we show two highly hydrophobic molecules with a large number of hydrophobic elements. These rules emphasize the importance of hydrophobic groups in increasing toxicity,

which is a well-known phenomenon in the field of toxicol-

Table 2. (a) Accuracies of Quantitative Predictions by CHEM, PLS, and SVILP, (b) Results of Qualitative Classification of Molecules as Toxic or Nontoxic Using Three above Methods Plus ILP, and (c) Comparison of Results from Various Methods Introduced in This Study with TOPKAT

(a) regression (N = 576)ᵃ

              accuracy           significant improvementᵍ
method     R²(CV)    MSE         CHEM        PLS
CHEMᵈ      0.52      0.81
PLSᵉ       0.59      0.67        0.005
SVILPᶠ     0.66      0.57        0.000001    0.02

(b) classification (N = 220)ᵇ

           recall                significant improvementʰ
method     %                     ILP         CHEM        PLS
ILP        55
CHEM       58                    0.41
PLS        71                    0.00005     0.00005
SVILP      73                    0.00005     0.00005     0.28

(c) comparison with TOPKAT (N = 165)ᶜ

              accuracy           significant improvementᵍ
method     R²        MSE         TOPKAT      CHEM        PLS
TOPKAT     0.26      2.2
CHEM       0.48      1.04        0.01
PLS        0.47      1.03        0.001       0.02
SVILP      0.57      0.8         0.0001      0.0005      0.0001

ᵃ The average of correlation coefficients on the five folds plus the mean square error (MSE) values for three methods of calculations. The significant (P < 0.05) improvements are shown in bold. The bold italic values are highly significant (P < 0.01) improvements. The numbers are the one-tail probabilities. N is the number of samples used for cross-validation. ᵇ The results of classification for the toxic class of molecules (pLC50 > mean) for four methods described in the text. ᶜ Comparison of toxicities of 165 molecules predicted by the different methods including the commercial software TOPKAT. ᵈ CHEM stands for chemical descriptors (LOGP, LUMO, and dipole moment). ᵉ Partial least square (PLS). ᶠ Support vector inductive logic programming (SVILP). ᵍ Using sign test. ʰ Using McNemar method.

Table 3. Prediction of Toxic and Nontoxic Classes of Molecules by Methods Described in the Text

method     TPᵃ     FNᵇ     TNᶜ     FPᵈ     recallᵉ    specificityᶠ    accuracyᵍ
CHEM       128     92      324     32      0.58       0.91            0.78
PLS        156     64      303     53      0.71       0.85            0.80
SVILP      161     59      310     46      0.73       0.87            0.82
ILP        121     99      313     43      0.55       0.88            0.75

ᵃ True positives. ᵇ False negatives. ᶜ True negatives. ᵈ False positives. ᵉ TP/(TP + FN). ᶠ TN/(TN + FP). ᵍ (TP + TN)/(TP + TN + FP + FN).
