
Showing papers in "Journal of Chemical Information and Computer Sciences in 2001"


Journal ArticleDOI
TL;DR: The hypothesis that less complex molecules are more common starting points for the discovery of drugs is supported by the changes observed for these properties during the drug optimization phase.
Abstract: Using a simple model of ligand−receptor interactions, the interactions between ligands and receptors of varying complexities are studied and the probabilities of binding calculated. It is observed that as the systems become more complex the chance of observing a useful interaction for a randomly chosen ligand falls dramatically. The implications of this for the design of combinatorial libraries are explored. A large set of drug leads and optimized compounds is profiled using several different properties relevant to molecular recognition. The changes observed for these properties during the drug optimization phase support the hypothesis that less complex molecules are more common starting points for the discovery of drugs. An extreme example of the use of simple molecules for directed screening against thrombin is provided.

808 citations


Journal ArticleDOI
TL;DR: Lead structures exhibit, on the average, less molecular complexity, are less hydrophobic, and less druglike (lower druglike scores), and this information should be used in the design of novel combinatorial libraries that are aimed at lead discovery.
Abstract: To be considered for further development, lead structures should display the following properties: (1) simple chemical features, amenable for chemistry optimization; (2) membership in an established SAR series; (3) favorable patent situation; and (4) good absorption, distribution, metabolism, and excretion (ADME) properties. There are two distinct categories of leads: those that lack any therapeutic use (i.e., “pure” leads), and those that are marketed drugs themselves but have been altered to yield novel drugs. We have previously analyzed the design of leadlike combinatorial libraries starting from 18 lead and drug pairs of structures (S. J. Teague et al. Angew. Chem., Int. Ed. Engl. 1999, 38, 3743−3748). Here, we report results based on an extended dataset of 96 lead-drug pairs, of which 62 are lead structures that are not marketed as drugs, and 75 are drugs that are presumably not used as leads. We examined the following properties: MW (molecular weight), CMR (the calculated molecular refractivity),...

704 citations


Journal ArticleDOI
TL;DR: It is shown that the diversity of the training sets rather than the design of the methods is the main factor determining their prediction ability for new data, and the ALOGPS method provided better prediction ability than the other tested methods.
Abstract: A new method, ALOGPS v 2.0 (http://www.lnh.unil.ch/~itetko/logp/), for the assessment of n-octanol/water partition coefficient, log P, was developed on the basis of neural network ensemble analysis of 12 908 organic compounds available from PHYSPROP database of Syracuse Research Corporation. The atom and bond-type E-state indices as well as the number of hydrogen and non-hydrogen atoms were used to represent the molecular structures. A preliminary selection of indices was performed by multiple linear regression analysis, and 75 input parameters were chosen. Some of the parameters combined several atom-type or bond-type indices with similar physicochemical properties. The neural network ensemble training was performed by efficient partition algorithm developed by the authors. The ensemble contained 50 neural networks, and each neural network had 10 neurons in one hidden layer. The prediction ability of the developed approach was estimated using both leave-one-out (LOO) technique and training/test protocol. In case of interseries predictions, i.e., when molecules in the test and in the training subsets were selected by chance from the same set of compounds, both approaches provided similar results. ALOGPS performance was significantly better than the results obtained by other tested methods. For a subset of 12 777 molecules the LOO results, namely correlation coefficient r² = 0.95, root mean squared error, RMSE = 0.39, and an absolute mean error, MAE = 0.29, were calculated. For two cross-series predictions, i.e., when molecules in the training and in the test sets belong to different series of compounds, all analyzed methods performed less efficiently. The decrease in the performance could be explained by a different diversity of molecules in the training and in the test sets. However, even for such difficult cases the ALOGPS method provided better prediction ability than the other tested methods. We have shown that the diversity of the training sets rather than the design of the methods is the main factor determining their prediction ability for new data. A comparative performance of the methods as well as a dependence on the number of non-hydrogen atoms in a molecule is also presented.

358 citations


Journal ArticleDOI
TL;DR: If used in conjunction with phases I and II, which reduced the size of the data set dramatically by eliminating most inactive chemicals, the current CoMFA model can be used to predict the RBA of chemicals with sufficient accuracy and to provide quantitative information for priority setting.
Abstract: Endocrine disruptors (EDs) have a variety of adverse effects in humans and animals. About 58,000 chemicals, most having little safety data, must be tested in a group of tiered assays. As assays will take years, it is important to develop rapid methods to help in priority setting. For application to large data sets, we have developed an integrated system that contains four sequential phases to predict the ability of chemicals to bind to the estrogen receptor (ER), a prevalent mechanism for estrogenic EDs. Here we report the results of evaluating two types of QSAR models for inclusion in phase III to quantitatively predict chemical binding to the ER. Our data set for the relative binding affinities (RBAs) to the ER consists of 130 chemicals covering a wide range of structural diversity and a 6 orders of magnitude spread of RBAs. CoMFA and HQSAR models were constructed and compared for performance. The CoMFA model had an r² = 0.91 and a q²LOO = 0.66. HQSAR showed reduced performance compared to CoMFA with r² = 0.76 and q²LOO = 0.59. A number of parameters were examined to improve the CoMFA model. Of these, a phenol indicator increased the q²LOO to 0.71. When up to 50% of the chemicals were left out in the leave-N-out cross-validation, the q² remained significant. Finally, the models were tested by using two test sets; the q²pred for these were 0.71 and 0.62, a significant result which demonstrates the utility of the CoMFA model for predicting the RBAs of chemicals not included in the training set. If used in conjunction with phases I and II, which reduced the size of the data set dramatically by eliminating most inactive chemicals, the current CoMFA model (phase III) can be used to predict the RBA of chemicals with sufficient accuracy and to provide quantitative information for priority setting.

317 citations
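
The leave-one-out statistic quoted in the abstract above (q²LOO) can be illustrated with a small sketch. This is not the authors' CoMFA/PLS workflow: a one-descriptor least-squares line is swapped in as the model, the data are made up, and the function names `fit_line` and `q2_loo` are hypothetical. It uses the common definition q² = 1 − PRESS / SS_total, where SS_total is taken about the mean of all observations; other conventions exist.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def q2_loo(xs, ys):
    """Leave-one-out cross-validated q2 = 1 - PRESS / SS_total."""
    mean_y = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        # Refit the model without observation i, then predict it.
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_total


if __name__ == "__main__":
    # Made-up descriptor values and activities, for illustration only.
    print(q2_loo([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 8.0]))
```

Because each prediction is made by a model that never saw the left-out point, q² is at most 1 and is typically lower than the fitted r², which matches the pattern of the values reported in the abstract.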


Journal ArticleDOI
TL;DR: Smaller neural networks and use of one homogeneous set of parameters provide a more robust model for prediction of aqueous solubility of chemical compounds.
Abstract: The molecular weight and electrotopological E-state indices were used to estimate by Artificial Neural Networks aqueous solubility for a diverse set of 1291 organic compounds. The neural network with 33-4-1 neurons provided highly predictive results with r² = 0.91 and RMS = 0.62. The used parameters included several combinations of E-state indices with similar properties. The calculated results were similar to those published for these data by Huuskonen (2000). However, in the current study only E-state indices were used without need of additional indices (the molecular connectivity, shape, flexibility and indicator indices) also considered in the previous study. In addition, the present neural network contained three times fewer hidden neurons. Smaller neural networks and use of one homogeneous set of parameters provide a more robust model for prediction of aqueous solubility of chemical compounds. Limitations of the developed method for prediction of large compounds are discussed. The developed approach...

305 citations


Journal ArticleDOI
TL;DR: An idealized computer experiment to explore how consensus scoring works and demonstrates that consensus scoring outperforms any single scoring for a simple statistical reason: the mean value of repeated samplings tends to be closer to the true value.
Abstract: It has been reported recently that consensus scoring, which combines multiple scoring functions in binding affinity estimation, leads to higher hit-rates in virtual library screening studies. This method seems quite independent of the target receptor, the docking program, or even the scoring functions under investigation. Here we present an idealized computer experiment to explore how consensus scoring works. A hypothetical set of 5000 compounds is used to represent a chemical library under screening. The binding affinities of all its member compounds are assigned by mimicking a real situation. Based on the assumption that the error of a scoring function is a random number in a normal distribution, the predicted binding affinities were generated by adding such a random number to the "observed" binding affinities. The relationship between the hit-rates and the number of scoring functions employed in scoring was then investigated. The performance of several typical ranking strategies for a consensus scoring procedure was also explored. Our results demonstrate that consensus scoring outperforms any single scoring for a simple statistical reason: the mean value of repeated samplings tends to be closer to the true value. Our results also suggest that a moderate number of scoring functions, three or four, are sufficient for the purpose of consensus scoring. As for the ranking strategy, both the rank-by-number and the rank-by-rank strategy work more effectively than the rank-by-vote strategy.

285 citations
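
The idealized experiment described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' code. Only the library size of 5000 and the normally distributed scoring error come from the abstract; the affinity distribution, the error standard deviation, the top-2% hit definition, and the function name `hit_rate` are assumptions made for the example. Averaging the scores corresponds most closely to a rank-by-number style of consensus.

```python
import random


def hit_rate(n_functions, n_compounds=5000, top_fraction=0.02,
             error_sd=1.0, seed=0):
    """Fraction of the true top compounds recovered by an averaged score."""
    rng = random.Random(seed)
    # Assumed "observed" affinities: standard normal, higher is better.
    true_affinity = [rng.gauss(0.0, 1.0) for _ in range(n_compounds)]
    # Each scoring function returns the true value plus independent
    # normal error; the consensus score is the mean over all functions.
    consensus = [
        sum(a + rng.gauss(0.0, error_sd) for _ in range(n_functions))
        / n_functions
        for a in true_affinity
    ]
    n_top = int(n_compounds * top_fraction)
    indices = range(n_compounds)
    by_truth = sorted(indices, key=true_affinity.__getitem__, reverse=True)
    by_score = sorted(indices, key=consensus.__getitem__, reverse=True)
    true_hits = set(by_truth[:n_top])
    return len(true_hits.intersection(by_score[:n_top])) / n_top


if __name__ == "__main__":
    for k in (1, 2, 3, 4, 8):
        print(k, round(hit_rate(k), 3))
```

Under these assumptions the error of the averaged score shrinks roughly as 1/sqrt(k) for k scoring functions, so the gain from each additional function diminishes, which is in line with the abstract's remark that three or four functions are sufficient.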


Journal ArticleDOI
TL;DR: The NCI database has by far the highest number of compounds that are unique to it, and each appears to have its own niche and thus "raison d'être".
Abstract: Eight large chemical databases have been analyzed and compared to each other. Central to this comparison is the open National Cancer Institute (NCI) database, consisting of approximately 250 000 structures. The other databases analyzed are the Available Chemicals Directory ("ACD," from MDL, release 1.99, 3D-version); the ChemACX ("ACX," from CamSoft, Version 4.5); the Maybridge Catalog and the Asinex database (both as distributed by CamSoft as part of ChemInfo 4.5); the Sigma-Aldrich Catalog (CD-ROM, 1999 Version); the World Drug Index ("WDI," Derwent, version 1999.03); and the organic part of the Cambridge Crystallographic Database ("CSD," from Cambridge Crystallographic Data Center, 1999 Version 5.18). The database properties analyzed are internal duplication rates; compounds unique to each database; cumulative occurrence of compounds in an increasing number of databases; overlap of identical compounds between two databases; similarity overlap; diversity; and others. The crystallographic database CSD and the WDI show somewhat less overlap with the other databases than those with each other. In particular the collections of commercial compounds and compilations of vendor catalogs have a substantial degree of overlap among each other. Still, no database is completely a subset of any other, and each appears to have its own niche and thus "raison d'être". The NCI database has by far the highest number of compounds that are unique to it. Approximately 200 000 of the NCI structures were not found in any of the other analyzed databases.

203 citations


Journal ArticleDOI
TL;DR: The aim of this contribution is to review and comment on some major developments in compound classification and molecular similarity research, reflect their diversity, and highlight some of the questions that remain unanswered.
Abstract: Compound classification and virtual screening methods are capable of exploring and exploiting molecular similarity beyond chemistry, in accordance with the similar property principle. They can be used to analyze and predict biologically active compounds and correlate structural features and chemical properties of molecules with specific activities. This explains why such approaches are highly attractive tools in pharmaceutical research, although a number of the underlying scientific concepts have originally been developed for different purposes. Since it is increasingly recognized that simply synthesizing and screening more and more compounds does not necessarily provide a sufficiently large number of high-quality leads and, ultimately, clinical candidates, much effort is spent in developing and implementing computational concepts that help to identify and refine leads. Typical applications include the identification of compounds with desired activity by database searching, derivation of predictive models of activity for database mining, selection of representative subsets from large compound libraries, or analysis of druglike properties. The aim of this contribution is to review and comment on some major developments in compound classification and molecular similarity research, reflect their diversity, and highlight some of the questions that remain unanswered. In a single contribution, it is difficult, if not impossible, to provide a complete account of, and give full credit to, all methods and developments relevant to compound classification and virtual screening. Therefore, some areas have been, rather subjectively, more emphasized than others or even omitted. For example, the discussion of virtual screening approaches is limited to those that focus on the small molecular level, as opposed to target structure-based design or docking methods, which have been reviewed elsewhere.

195 citations


Journal ArticleDOI
TL;DR: The comprehensive studies show that the proposed PI index correlates highly with W and Sz as well as with physicochemical properties and biological activities of a large number of diversified and complex compounds.
Abstract: A novel topological index, PI (Padmakar−Ivan index), is derived in this paper. The index is very simple to calculate and has discriminating power similar to that of the Wiener (W) and the Szeged (Sz) indices. The comprehensive studies show that the proposed PI index correlates highly with W and Sz as well as with physicochemical properties and biological activities of a large number of diversified and complex compounds. The proposed PI index promises to be a useful parameter in the QSPR/QSAR studies. The stability of each model is demonstrated by applying a cross-validation test. Furthermore, a favorable comparison with other representative indices such as the Randić index is also made in order to establish the predictive ability of the PI index. The results have shown that in several cases the PI index gave better results.

193 citations
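
To make the two kinds of index mentioned in the abstract above concrete, here is a small sketch that computes the Wiener index W and the PI index for a hydrogen-suppressed molecular graph. The abstract does not state the PI definition; the code assumes the edge-counting definition commonly given in the literature: for every bond uv, count the other bonds that lie strictly closer to one end than to the other, and sum over all bonds. The function names and the edge-list input format are hypothetical, and the graph is assumed to be connected.

```python
from collections import deque
from itertools import combinations


def build_graph(edges):
    """Adjacency sets for an undirected graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, start):
    """Topological (bond-count) distances from start to every atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def wiener_index(edges):
    """Sum of shortest-path distances over all unordered atom pairs."""
    adj = build_graph(edges)
    dist = {v: bfs_distances(adj, v) for v in adj}
    return sum(dist[u][v] for u, v in combinations(adj, 2))


def pi_index(edges):
    """Assumed PI definition: for each bond uv, count the bonds that are
    not equidistant from u and v; sum the counts over all bonds."""
    adj = build_graph(edges)
    dist = {v: bfs_distances(adj, v) for v in adj}
    total = 0
    for u, v in edges:
        for x, y in edges:
            # Distance from bond xy to an atom = distance of its nearer end.
            d_u = min(dist[x][u], dist[y][u])
            d_v = min(dist[x][v], dist[y][v])
            if d_u != d_v:
                total += 1
    return total


if __name__ == "__main__":
    n_butane = [(1, 2), (2, 3), (3, 4)]
    cyclohexane = [(i, (i + 1) % 6) for i in range(6)]
    print(wiener_index(n_butane), pi_index(n_butane))
    print(wiener_index(cyclohexane), pi_index(cyclohexane))
```

Under this definition every acyclic graph with the same number of atoms gets the same PI value, so the index only starts to discriminate between isomers once rings are present.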


Journal ArticleDOI
TL;DR: The revised GSE proposed by Jain and Yalkowsky is simpler than Jorgensen and Duffy's computational method and estimates the aqueous solubility of the same set of organic compounds more accurately; it is also more accurate than the original version of the GSE.
Abstract: The revised general solubility equation (GSE) proposed by Jain and Yalkowsky is used to estimate the aqueous solubility of a set of organic nonelectrolytes studied by Jorgensen and Duffy. The only inputs used in the GSE are the Celsius melting point (MP) and the octanol water partition coefficient (Kow). These are generally known, easily measured, or easily calculated. The GSE does not utilize any fitted parameters. The average absolute error for the 150 compounds is 0.43 compared to 0.56 with Jorgensen and Duffy's computational method, which utilizes five fitted parameters. Thus, the revised GSE is simpler and provides a more accurate estimation of aqueous solubility of the same set of organic compounds. It is also more accurate than the original version of the GSE.

172 citations
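
The abstract above names the two inputs of the revised GSE but not the equation itself. As usually quoted, it reads log S = 0.5 − 0.01 (MP − 25) − log Kow, with S in mol/L and the melting-point term set to zero for compounds that are liquid at 25 °C. The sketch below encodes that form; the function name and the example values are hypothetical, not data from the paper.

```python
def gse_log_solubility(melting_point_c, log_kow):
    """Revised general solubility equation, as commonly quoted:
    log S (mol/L) = 0.5 - 0.01 * (MP - 25) - log Kow.
    The melting-point term is dropped for liquids (MP <= 25 C)."""
    melting_term = 0.01 * max(melting_point_c - 25.0, 0.0)
    return 0.5 - melting_term - log_kow


if __name__ == "__main__":
    # Hypothetical solid: MP = 125 C, log Kow = 2.0
    print(gse_log_solubility(125.0, 2.0))
    # Hypothetical liquid: MP = -10 C, log Kow = 1.0
    print(gse_log_solubility(-10.0, 1.0))
```

The constants 0.5 and 0.01 are fixed rather than fitted to a training set, which is the sense in which the abstract says the GSE uses no fitted parameters.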


Journal ArticleDOI
TL;DR: The approach of using partially ordered sets is described by applying it to a battery of tests and it is found that, at least in the case of the 55 analyzed samples and the evaluation by the scores, there is a considerable redundancy with respect to ranking.
Abstract: When a ranking of some objects (chemicals, geographical sites, river sections, etc.) by a multicriteria analysis is of concern, then it is often difficult to find a common scale among the criteria, and therefore even the simple sorting process is performed by applying additional constraints, just to get a ranking index. However such additional constraints, often arising from normative considerations, are controversially discussed. The theory of partially ordered sets and its graphical representation (Hasse diagrams) does not need such additional information just to sort the objects. Here, the approach of using partially ordered sets is described by applying it to a battery of tests, developed by Dutka et al. In our analysis we found the following: (1) The dimension analysis of partially ordered sets suggests that, at least in the case of the 55 analyzed samples and the evaluation by the scores, developed by Dutka et al., there is a considerable redundancy with respect to ranking. The visualization of the...
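
The partial-order idea in the abstract above can be sketched as follows. Each object carries a tuple of scores, one object ranks above another only if it is at least as high on every criterion, and the Hasse diagram keeps only the cover relations (pairs with nothing in between). The sample names and scores are made up, the function names are hypothetical, and objects with identical score tuples are simply left unlinked here rather than merged into one class.

```python
def dominates(upper, lower):
    """True if upper >= lower on every criterion and the tuples differ."""
    return upper != lower and all(u >= l for u, l in zip(upper, lower))


def hasse_edges(objects):
    """Cover relations (lower, upper) of the product order on score tuples."""
    names = list(objects)
    edges = []
    for lo in names:
        for hi in names:
            if not dominates(objects[hi], objects[lo]):
                continue
            # Skip the pair if some third object sits strictly between.
            between = any(
                dominates(objects[mid], objects[lo])
                and dominates(objects[hi], objects[mid])
                for mid in names
            )
            if not between:
                edges.append((lo, hi))
    return edges


def incomparable_pairs(objects):
    """Pairs that no ordering of the criteria alone can rank."""
    names = list(objects)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if objects[a] != objects[b]
        and not dominates(objects[a], objects[b])
        and not dominates(objects[b], objects[a])
    ]


if __name__ == "__main__":
    # Made-up samples scored on two criteria.
    samples = {"A": (1, 1), "B": (2, 1), "C": (1, 3), "D": (3, 3)}
    print(hasse_edges(samples))
    print(incomparable_pairs(samples))
```

No weights or common scale are needed to build the diagram; incomparable pairs such as B and C in the example are exactly the cases where a single ranking index would have to impose an extra, normative choice.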

Journal ArticleDOI
TL;DR: It is shown that this novel QSAR technique can be used to build both classification and regression models and outperforms simpler variable selection techniques mainly for nonlinear data sets.
Abstract: In this work, we report the development of a novel QSAR technique combining genetic algorithms and neural networks for selecting a subset of relevant descriptors and building the optimal neural network architecture for QSAR studies. This technique uses a neural network to map the dependent property of interest with the descriptors preselected by the genetic algorithm. This technique differs from other variable selection techniques combining genetic algorithms to neural networks by two main features: (1) The variable selection search performed by the genetic algorithm is not constrained to a defined number of descriptors. (2) The optimal neural network architecture is explored in parallel with the variable selection by dynamically modifying the size of the hidden layer. By using both artificial data and real biological data, we show that this technique can be used to build both classification and regression models and outperforms simpler variable selection techniques mainly for nonlinear data sets. The re...

Journal ArticleDOI
TL;DR: The best group contribution method, based on a new fragment atom scheme, leads to a squared correlation coefficient of 0.95 and an average absolute calculation error of 0.50 log unit, which is superior to other group contribution methods currently available.
Abstract: Several group contribution methods to estimate the aqueous solubility of organic molecules are proposed and evaluated for their ability to predict the water solubility of new molecules. The learning set consisted of 1168 organic compounds with experimental data taken from the literature after critical evaluation. The best method, based on a new fragment atom scheme, leads to a squared correlation coefficient of 0.95 and an average absolute calculation error of 0.50 log unit, which is superior to other group contribution methods currently available. One of the advantages of this model is that it has upper and lower limits so that the predicted solubilities cannot be unrealistically high or low.

Journal ArticleDOI
TL;DR: A quantitative structure property relationship study of the flash point of a diverse set of 271 compounds provided a general three-parameter QSPR model (R2 = 0.9020); using the experimental boiling point as a descriptor gives a three-descriptor equation with R2 = 0.9529.
Abstract: A quantitative structure property relationship study of the flash point of a diverse set of 271 compounds provided a general three-parameter QSPR model (R2 = 0.9020, R2cv = 0.8985, s = 16.1). Use of the experimental boiling point as a descriptor gives a three-descriptor equation with R2 = 0.9529. Use of the boiling point predicted by a four-parameter reported relationship gives a three-parameter flash point equation with a R2 value of 0.9247.

Journal ArticleDOI
TL;DR: This paper presents a new approach to near-infrared spectral (NIR) data analysis that is based on independent component analysis (ICA) and is able to separate the spectra of the constituent components from the spectra of their mixtures.
Abstract: This paper presents a new approach to near-infrared spectral (NIR) data analysis that is based on independent component analysis (ICA). The main advantage of the new method is that it is able to separate the spectra of the constituent components from the spectra of their mixtures. The separation is a blind operation, since the constituent components of mixtures can be unknown. The ICA based method is therefore particularly useful in identifying the unknown components in a mixture as well as in estimating their concentrations. The approach is introduced by reference to case studies and compared to other techniques for NIR analysis including principal component regression (PCR), multiple linear regression (MLR), and partial least squares (PLS) as well as Fourier and wavelet transforms.

Journal ArticleDOI
TL;DR: Evidence is provided for the GSE being a convenient and reliable method to predict aqueous solubilities of organic compounds and for it being more accurate than MLR methods.
Abstract: The revised general solubility equation (GSE) is used along with four different methods including Huuskonen's artificial neural network (ANN) and three multiple linear regression (MLR) methods to estimate the aqueous solubility of a test set of the 21 pharmaceutically and environmentally interesting compounds. For the selected test sets, it is clear that the GSE and ANN predictions are more accurate than MLR methods. The GSE has the advantages of being simple and thermodynamically sound. The only two inputs used in the GSE are the Celsius melting point (MP) and the octanol water partition coefficient (Kow). No fitted parameters and no training data are used in the GSE, whereas other methods utilize a large number of parameters and require a training set. The GSE is also applied to a test set of 413 organic nonelectrolytes that were studied by Huuskonen. Although the GSE uses only two parameters and no training set, its average absolute error is only 0.1 log units larger than that of the ANN, which requir...

Journal ArticleDOI
TL;DR: Experimental IC50 data for 314 selective cyclooxygenase-2 (COX-2) inhibitors are used to develop quantitation and classification models as a potential screening mechanism for larger libraries of target compounds.
Abstract: Experimental IC50 data for 314 selective cyclooxygenase-2 (COX-2) inhibitors are used to develop quantitation and classification models as a potential screening mechanism for larger libraries of target compounds. Experimental log(IC50) values ranged from 0.23 to ≥ 5.00. Numerical descriptors encoding solely topological information are calculated for all structures and are used as inputs for linear regression, computational neural network, and classification analysis routines. Evolutionary optimization algorithms are then used to search the descriptor space for information-rich subsets which minimize the rms error of a diverse training set of compounds. An eight-descriptor model was identified as a robust predictor of experimental log(IC50) values, producing a root-mean-square error of 0.625 log units for an external prediction set of inhibitors which took no part in model development. A k-nearest neighbor classification study of the data set discriminating between active and inactive members produced a ni...

Journal ArticleDOI
TL;DR: Some simple protocols are shown that, if used with a standard topological similarity search method, are sufficient to select nonpeptide actives given a peptide probe.
Abstract: Similarity searches based on chemical descriptors have proven extremely useful in aiding large-scale drug screening. Typically an investigator starts with a “probe”, a drug-like molecule with an interesting biological activity, and searches a database to find similar compounds. In some projects, however, the only known actives are peptides, and the investigator needs to identify drug-like actives. 3D similarity methods are able to help in this endeavor but suffer from the necessity of having to specify the active conformation of the probe, something that is not always possible at the beginning of a project. Also, 3D methods are slow and are complicated by the need to generate low-energy conformations. In contrast, topological methods are relatively rapid and do not depend on conformation. However, unmodified topological similarity methods, given a peptide probe, will preferentially select other peptides from a database. In this paper we show some simple protocols that, if used with a standard topological ...

Journal ArticleDOI
TL;DR: The novel chirality descriptors of molecular structure should find applications in QSAR studies and related investigations of molecular datasets, as well as in comparisons with comparative molecular field analysis applied to the same dataset.
Abstract: Several series of novel chirality descriptors of chemical organic molecules have been introduced. The descriptors have been developed on the basis of conventional topological descriptors of molecular graphs. They include modified molecular connectivity indices, Zagreb group indices, extended connectivity, overall connectivity, and topological charge indices. These modified descriptors make use of an additional term called chirality correction, which is added to the vertex degrees of asymmetric atoms in a molecular graph. Chirality descriptors can be real or complex numbers. Advantages and drawbacks of different series of chirality descriptors are discussed. These descriptors circumvent the inability of conventional topological indices to distinguish chiral or enantiomeric isomers, which so far has been the major drawback of 2D descriptors as compared to true 3D descriptors (e.g., shape, molecular fields) of molecular structure. These novel chirality descriptors have been implemented in a quantitative stru...

Journal ArticleDOI
TL;DR: A useful classification of topological indices based on the relative magnitudes of the contributions of terminal and interior bonds is suggested; to extend such considerations to other indices, one needs to consider partitioning of global molecular indices into bond additive terms.
Abstract: Many topological indices lack an interpretation in terms of simple physicochemical quantities. We have reexamined the structural interpretation of well-known topological indices: the connectivity index 1χ, the Wiener index W, and the Hosoya topological index Z. We relate the success of various indices in structure−property studies to the degree to which they differentiate contributions from more exposed terminal bonds and more buried interior bonds. When considering bond additive properties of alkanes we find better regressions when greater weights are assigned to terminal CC bonds and lesser weights to internal CC. We suggest here that topological indices be discussed in terms of their partitioning into bond contributions, which for different indices and different bonds will assume different values. With this insight we modified the Wiener index W into a new index W*, in which bond contributions are determined using the reciprocal of the product of the number of atoms on each side of a bond. Similarly w...

Journal ArticleDOI
Pierre Bruneau
TL;DR: The results show that it is possible to obtain a generic predictive model from database I but that the diversity of database II is too restricted to give a model with good generalization ability and that the ARD method applied to the mixed database III gives the best predictive model.
Abstract: Several predictive models of aqueous solubility have been published. They have good performances on the data sets which have been used for training the models, but usually these data sets do not contain many structures similar to the structures of interest to the drug research and their applicability in drug hunting is questionable. A very diverse data set has been gathered with compounds issued from literature reports and proprietary compounds. These compounds have been grouped in a so-called literature data set I, a proprietary data set II, and a mixed data set III formed by I and II. About 100 descriptors emphasizing surface properties were calculated for every compound. Bayesian learning of neural nets which cumulates the advantages of neural nets without having their weaknesses was used to select the most parsimonious models and train them, from I, II, and III. The models were established by either selecting the most efficient descriptors one by one using a modified Gram-Schmidt procedure (GS) or by ...

Journal ArticleDOI
TL;DR: Predictive models for the surface tension, viscosity, and thermal conductivity of 213 common organic solvents are reported and compare favorably to previously reported prediction methods for these three properties.
Abstract: Predictive models for the surface tension, viscosity, and thermal conductivity of 213 common organic solvents are reported. The models are derived from numerical descriptors which encode information about the topology, geometry, and electronics of each compound in the data set. Multiple linear regression and computational neural networks are used to train and evaluate models based on statistical indices and overall root-mean-square error. Eight-descriptor models were developed for both surface tension and viscosity, while a nine-descriptor model was developed for thermal conductivity. In addition, a single nine-descriptor model was developed for prediction of all three properties. The results of this study compare favorably to previously reported prediction methods for these three properties.

Journal ArticleDOI
TL;DR: The potential utility of data reduction methods for the analysis of matrices assembled from the related properties of large sets of compounds is discussed by reference to results obtained from solvent polarity scales, ongoing work on solubilities and sweetness properties, and proposed general treatments of toxicities and gas chromatographic retention indices.
Abstract: The potential utility of data reduction methods (e.g. principal component analysis) for the analysis of matrices assembled from the related properties of large sets of compounds is discussed by reference to results obtained from solvent polarity scales, ongoing work on solubilities and sweetness properties, and proposed general treatments of toxicities and gas chromatographic retention indices.

Journal ArticleDOI
TL;DR: In this article, a set of smaller 4 × 4 matrices was constructed to represent DNA primary sequences, based on enumeration of all 64 triplets of nucleic acid bases.
Abstract: We consider construction of a set of smaller 4 × 4 matrices to represent DNA primary sequences which are based on enumeration of all 64 triplets of nucleic acid bases. The leading eigenvalue from ...

Journal ArticleDOI
TL;DR: It is shown how a document object model (DOM) for chemistry can be constructed using as its basis Chemical Markup Language (CML).
Abstract: We describe the development of a structured method of representing chemistry on the World-Wide Web using an object-oriented approach to information objects. We show how a document object model (DOM) for chemistry can be constructed using as its basis Chemical Markup Language (CML). Application of the CMLDOM to the development of chemical tools is described.

Journal ArticleDOI
TL;DR: A new molecular similarity method based on the topology of the electron density, called quantum topological molecular similarity (QTMS), is presented; it has been tested on five sets of carboxylic systems and is shown to avoid certain challenges of traditional Carbó-like similarity indices.
Abstract: Building on the ideas of a previous paper [part 1, J. Phys. Chem. A 1999, 103, 2883] we present a new molecular similarity method based on the topology of the electron density. This method is directly applicable to QSARs and is called quantum topological molecular similarity (QTMS). It has been tested for five sets of carboxylic systems including para- and meta-benzoic acid, para-phenylacetic acid, 4-X-bicyclo[2.2.2]octane-1-carboxylic acids, and polysubstituted benzoic acids. In combination with the partial least squares (PLS) procedure QTMS is able to produce excellent and statistically valid regressions. It is shown that QTMS avoids certain challenges of traditional Carbó-like similarity indices. Finally, QTMS is able to suggest a molecular fragment that contains the active center or the part of the molecule that is responsible for the QSAR.

Journal ArticleDOI
TL;DR: Results demonstrate that carefully chosen physically meaningful 1D and 2D descriptors encode sufficient molecular information for fast and reasonably reliable prediction of aqueous solubility with a simple neural network.
Abstract: A simple QSPR model, based on seven 1D and 2D descriptors and an artificial neural network, was developed for fast evaluation of aqueous solubility. The model was able to predict the molar solubility of a diverse set of 1312 organic compounds with an overall correlation coefficient of 0.92 and a standard deviation of 0.72 log unit between the calculated and experimental data. Considering the fact that the estimated uncertainty of the experimental data is no less than 0.5 log unit, the results demonstrate that carefully chosen physically meaningful 1D and 2D descriptors encode sufficient molecular information for fast and reasonably reliable prediction of aqueous solubility with a simple neural network. As a comparison, we calculated the solubility of a test set of 258 compounds, ranging from simple hydrocarbons to more complex multifunctional organic molecules, with a commercial program (QMPR+ version 2.0.1 of SimulationPlus Inc.) and compared the results with predictions from our model. Statistical paramete...
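The two summary statistics quoted in this abstract, a correlation coefficient and a deviation between calculated and experimental values, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say exactly how its "standard deviation" was computed, so the root-mean-square deviation used here is an assumption, and the log-solubility values are invented toy numbers, not data from the paper.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rms_deviation(x, y):
    """Root-mean-square deviation between paired values (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Hypothetical log-solubility values, for illustration only.
experimental = [-1.0, -2.0, -3.0, -4.0]
calculated = [-1.5, -1.5, -3.5, -3.5]

r = pearson_r(experimental, calculated)
rmsd = rms_deviation(experimental, calculated)
```

Reporting both numbers is useful because a high correlation alone does not rule out a systematic offset between predicted and measured values.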

Journal ArticleDOI
TL;DR: Binary kernel discrimination is shown to perform robustly with varying quantities of training data and also in the presence of noisy data, highlighting the importance of the judicious use of general pattern recognition techniques for compound selection.
Abstract: High-throughput screening has made a significant impact on drug discovery, but there is an acknowledged need for quantitative methods to analyze screening results and predict the activity of further compounds. In this paper we introduce one such method, binary kernel discrimination, and investigate its performance on two datasets; the first is a set of 1650 monoamine oxidase inhibitors, and the second a set of 101 437 compounds from an in-house enzyme assay. We compare the performance of binary kernel discrimination with a simple procedure which we call “merged similarity search”, and also with a feedforward neural network. Binary kernel discrimination is shown to perform robustly with varying quantities of training data and also in the presence of noisy data. We conclude by highlighting the importance of the judicious use of general pattern recognition techniques for compound selection.

Journal ArticleDOI
TL;DR: The results indicate that the lipoaffinity descriptor defined in this paper may be a significant descriptor for molecular transport properties across lipid bilayers.
Abstract: A new molecular lipoaffinity descriptor was introduced in this paper to account for the effect of molecular hydrophobicity on blood-brain barrier penetration. The descriptor was defined based on Kier and Hall's atom-type electrotopological state indices. Its evaluation requires 2-D molecular bonding information only. A multiple linear regression equation using this descriptor and molecular weight reproduces the experimental logBB values of 55 training set compounds and 11 test set compounds satisfactorily with statistical parameters nearly identical to the best models based on polar surface area and ClogP. The results indicate that the lipoaffinity descriptor defined in this paper may be a significant descriptor for molecular transport properties across lipid bilayers.

Journal ArticleDOI
Milan Randić
TL;DR: New shape indices are reported for smaller alkanes and several cyclic structures; they offer regressions of high quality for diverse physicochemical properties of octanes and have led to a novel classification of physicochemical properties of alkanes.
Abstract: We report on novel graph theoretical indices which are sensitive to the shapes of molecular graphs. In contrast to Kier's kappa shape indices, which were based on a comparison of a molecular graph with graphs representing the extreme shapes, the linear graph and the “star” graph, the new shape indices are obtained by considering for all atoms the number of paths and the number of walks within a graph and then taking the quotients of the number of paths and the number of walks of the same length. The new shape indices show much higher discrimination among isomers when compared to the kappa shape indices. We report the new shape indices for smaller alkanes and several cyclic structures and illustrate their use in structure−property correlations. The new indices offer regressions of high quality for diverse physicochemical properties of octanes. They have also led to a novel classification of physicochemical properties of alkanes.
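The path/walk counting described in this abstract can be sketched for a small hydrogen-suppressed graph as below. This is a minimal sketch under assumptions: the adjacency-dictionary representation and function names are hypothetical, the graph is assumed to be connected with at least two atoms, and the abstract does not state how the per-atom quotients are combined into a final index, so the code stops at the quotients themselves.

```python
from fractions import Fraction

def count_walks(adj, start, length):
    """Number of walks with `length` edges starting at `start`
    (atoms and bonds may be revisited)."""
    counts = {start: 1}
    for _ in range(length):
        nxt = {}
        for node, c in counts.items():
            for nb in adj[node]:
                nxt[nb] = nxt.get(nb, 0) + c
        counts = nxt
    return sum(counts.values())

def count_paths(adj, start, length):
    """Number of paths with `length` edges starting at `start`
    (no atom visited twice)."""
    def extend(node, remaining, visited):
        if remaining == 0:
            return 1
        return sum(extend(nb, remaining - 1, visited | {nb})
                   for nb in adj[node] if nb not in visited)
    return extend(start, length, frozenset([start]))

def path_walk_quotients(adj, max_length):
    """For each atom, the quotients paths/walks for lengths 1..max_length."""
    return {
        atom: [Fraction(count_paths(adj, atom, k), count_walks(adj, atom, k))
               for k in range(1, max_length + 1)]
        for atom in adj
    }

# Carbon skeletons of n-butane (chain) and isobutane ("star"), as toy inputs.
butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

butane_q = path_walk_quotients(butane, 3)
isobutane_q = path_walk_quotients(isobutane, 2)
```

Because walks may backtrack while paths may not, the quotient for a given length is at most 1 and falls off differently for the chain and the star, which is the sense in which such quotients are sensitive to shape.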