Showing papers in "Journal of Chemical Information and Computer Sciences in 1998"


Journal ArticleDOI
TL;DR: The concept of similarity searching is introduced and differentiated from the more common substructure searching, and the current generation of fragment-based measures used for searching chemical structure databases is discussed.
Abstract: This paper reviews the use of similarity searching in chemical databases. It begins by introducing the concept of similarity searching, differentiating it from the more common substructure searching, and then discusses the current generation of fragment-based measures that are used for searching chemical structure databases. The next sections focus upon two of the principal characteristics of a similarity measure: the coefficient that is used to quantify the degree of structural resemblance between pairs of molecules and the structural representations that are used to characterize molecules that are being compared in a similarity calculation. New types of similarity measure are then compared with current approaches, and examples are given of several applications that are related to similarity searching.

1,662 citations
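
As an illustration of the two ingredients the review above names, a structural representation and a similarity coefficient, the following is a minimal sketch. It assumes each molecule is already encoded as a set of "on" fragment-bit positions and uses the Tanimoto coefficient, which is one common choice and not necessarily the only coefficient the review covers. The function names and the toy fingerprints are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of 'on' bit positions.

    Returns 0.0 when both sets are empty (a convention chosen for this sketch).
    """
    common = len(a & b)
    total = len(a) + len(b) - common
    return common / total if total else 0.0


def rank_by_similarity(query, database):
    """Return (score, name) pairs sorted from most to least similar."""
    scored = [(tanimoto(query, bits), name) for name, bits in database.items()]
    return sorted(scored, reverse=True)


# Toy fingerprints: the integers stand for hypothetical fragment bits.
database = {
    "mol_a": {1, 2, 3},
    "mol_b": {2, 3, 4},
    "mol_c": {7, 8},
}
ranking = rank_by_similarity({1, 2, 3}, database)
```

Unlike a substructure search, which returns only molecules containing the query pattern, this ranks every database entry by its degree of resemblance to the query.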


Journal ArticleDOI
TL;DR: "RECAP" (Retrosynthetic Combinatorial Analysis Procedure), a new computational technique designed to address the design and availability of high quality building blocks which are likely to afford hits from the libraries that they generate is described.
Abstract: The use of combinatorial chemistry for the generation of new lead molecules is now a well established strategy in the drug discovery process. Central to the use of combinatorial chemistry is the design and availability of high quality building blocks which are likely to afford hits from the libraries that they generate. Herein we describe “RECAP” (Retrosynthetic Combinatorial Analysis Procedure), a new computational technique designed to address this building block issue. RECAP electronically fragments molecules based on chemical knowledge. When applied to databases of biologically active molecules this allows the identification of building block fragments rich in biologically recognized elements and privileged motifs and structures. This allows the design of building blocks and the synthesis of libraries rich in biological motifs. Application of RECAP to the Derwent World Drug Index (WDI) and the molecular fragments/building blocks that this generates are discussed. We also describe a WDI fragment knowle...

568 citations


Journal ArticleDOI
TL;DR: Empirical results suggest that bit strings provide a nonintuitive encoding of molecular size, shape, and global similarity and suggest that there are instances when they may not be the most appropriate tool for searching or segregating chemical structures.
Abstract: With the growth of interest in database searching and compound selection, the quantification of chemical similarity has become an area of intense practical and theoretical interest. One of the most widely used methods of measuring chemical similarity is based on mapping fragments within a molecule as bits within a binary string. We present empirical results which suggest that bit strings provide a nonintuitive encoding of molecular size, shape, and global similarity. Other results, this time statistical in nature, suggest that the observed behavior of bit string-based searches has a large nonspecific component. On this basis, we question whether bit string-based similarity methods possess all the features desirable in a quantitative chemical distance measure or metric and suggest that there are instances when they may not be the most appropriate tool for searching or segregating chemical structures.

331 citations


Journal ArticleDOI
TL;DR: A nonlinear computational neural network model developed by using the genetic algorithm with a neural network fitness evaluator to estimate percent human intestinal absorption (%HIA) is an attractive alternative to experimental measurements.
Abstract: Prediction of human intestinal absorption (HIA) is a major goal in the development of oral drugs. The application of combinatorial chemistry methods to drug discovery has dramatically increased the demand for rapid and efficient models for estimating HIA and other biopharmaceutical properties. While experimental methods for measurement of intestinal absorption have been developed and are used widely, computational approaches provide an attractive alternative.

317 citations



Journal ArticleDOI
TL;DR: Among the QSAR methods considered, HQSAR appears to offer many attractive features, such as speed, reproducibility and ease of use, which portend its utility for prioritizing large numbers of potential EDCs for subsequent toxicological testing and risk assessment.
Abstract: Three different QSAR methods, Comparative Molecular Field Analysis (CoMFA), classical QSAR (utilizing the CODESSA program), and Hologram QSAR (HQSAR), are compared in terms of their potential for screening large data sets of chemicals as endocrine disrupting compounds (EDCs). While CoMFA and CODESSA (Comprehensive Descriptors for Structural and Statistical Analysis) have been commercially available for some time, HQSAR is a novel QSAR technique. HQSAR attempts to correlate molecular structure with biological activity for a series of compounds using molecular holograms constructed from counts of sub-structural molecular fragments. In addition to using r2 and q2 (cross-validated r2) in assessing the statistical quality of QSAR models, another statistical parameter was defined to be the ratio of the standard error to the activity range. The statistical quality of the QSAR models constructed using CoMFA and HQSAR techniques was comparable and was generally better than that of the models produced with CODESSA. It is nota...

196 citations


Journal ArticleDOI
TL;DR: A search for optimum molecular descriptors based on the connectivity index found that in most cases the optimum value of the exponent is indeed different from -0.5, and suggests that a modified version of the (valence) vertex-connectivity index should be routinely employed in the structure-property modeling instead of the standard versions of the index.
Abstract: We report a search for optimum molecular descriptors based on the connectivity index. A suggestion made by several authors that the exponent -0.5 used in the standard formula for computing the connectivity index may not be the optimum for modeling some molecular properties was reexamined. We considered several molecular properties and found that in most cases the optimum value of the exponent is indeed different from -0.5. We suggest that a modified version of the (valence) vertex-connectivity index should be routinely employed in the structure-property modeling instead of the standard version of the index.

193 citations
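
The role of the exponent discussed above can be sketched in a few lines. This is a simplified illustration, assuming a hydrogen-suppressed molecular graph given as an edge list and plain vertex degrees rather than the valence degrees the abstract mentions. The function name and the `exponent` parameter are hypothetical.

```python
def connectivity_index(edges, exponent=-0.5):
    """Sum of (deg(u) * deg(v)) ** exponent over all bonds (u, v).

    exponent=-0.5 gives the standard connectivity index; other values
    give the variable-exponent variants examined in the abstract.
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sum((degree[u] * degree[v]) ** exponent for u, v in edges)


# Carbon skeleton of n-butane: C0-C1-C2-C3
butane = [(0, 1), (1, 2), (2, 3)]
standard = connectivity_index(butane)          # about 1.914
modified = connectivity_index(butane, -1.0)    # 1.25
```

In a structure-property study the exponent would be varied over a range and the value giving the best regression for the property of interest retained.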


Journal ArticleDOI
TL;DR: A substructural analysis approach is used to calculate biological activity profiles, which contain weights that describe the differential occurrences of generic features in active molecules taken from the World Drug Index and in (presumed) inactive molecules taken from the SPRESI database.
Abstract: A substructural analysis approach is used to calculate biological activity profiles, which contain weights that describe the differential occurrences of generic features (specifically, the numbers of hydrogen-bond donors and acceptors, the numbers of rotatable bonds and aromatic rings, the molecular weights, and the 2κα shape descriptors) in active molecules taken from the World Drug Index and in (presumed) inactive molecules taken from the SPRESI database. Even with such simple structural descriptors, the profiles discriminate effectively between active and inactive compounds. The effectiveness of the approach is further increased by using a genetic algorithm for the calculation of the weights comprising a profile. The methods have been successfully applied to a number of different data sets.

162 citations


Journal ArticleDOI
TL;DR: A general QSPR model was developed for the prediction of the refractive index for a diverse set of amorphous homopolymers with the CODESSA program and the average prediction error by this model is 0.9%.
Abstract: A general QSPR model (R2 = 0.940, s = 0.018) was developed for the prediction of the refractive index for a diverse set of amorphous homopolymers with the CODESSA program. The five descriptors, involved in the model, are calculated from the structure of the repeating unit of the polymer. The average prediction error by this model is 0.9%.

156 citations


Journal ArticleDOI
TL;DR: The vapor pressures and the aqueous solubilities of 411 compounds with a large structural diversity were investigated using a quantitative structure−property relationship (QSPR) approach to allow the reliable prediction of water−air partition coefficients.
Abstract: The vapor pressures and the aqueous solubilities of 411 compounds with a large structural diversity were investigated using a quantitative structure−property relationship (QSPR) approach. A five-descriptor equation with the squared correlation coefficient (R2) of 0.949 for vapor pressure and a six-descriptor equation with R2 of 0.879 for aqueous solubility were obtained. All descriptors were derived solely from the chemical structure of the compounds. The QSPR correlation equations for vapor pressure and aqueous solubility allow the reliable prediction of water−air partition coefficients.

152 citations


Journal ArticleDOI
TL;DR: A method for predicting the aqueous solubility of drug compounds was developed based on topological indices and artificial neural network (ANN) modeling, which yielded positive results for acidic, neutral, and basic drugs of different structural classes.
Abstract: A method for predicting the aqueous solubility of drug compounds was developed based on topological indices and artificial neural network (ANN) modeling. The aqueous solubility values for 211 drugs and related compounds representing acidic, neutral, and basic drugs of different structural classes were collected from the literature. The data set was divided into a training set (n = 160) and a randomly chosen test set (n = 51). Structural parameters used as inputs in a 23-5-1 artificial neural network included 14 atom-type electrotopological indices and nine other topological indices. For the test set, a predictive r2 = 0.86 and s = 0.53 (log units) were achieved.

Journal ArticleDOI
TL;DR: Genetic algorithm and simulated annealing routines, in conjunction with MLR and CNN, are used to select subsets of descriptors that accurately relate to aqueous solubility.
Abstract: Multiple linear regression (MLR) and computational neural networks (CNN) are utilized to develop mathematical models to relate the structures of a diverse set of 332 organic compounds to their aqueous solubilities. Topological, geometric, and electronic descriptors are used to numerically represent structural features of the data set compounds. Genetic algorithm and simulated annealing routines, in conjunction with MLR and CNN, are used to select subsets of descriptors that accurately relate to aqueous solubility. Nonlinear models with nine calculated structural descriptors are developed that have a training set root-mean-square error of 0.394 log units for compounds which span a −log(molarity) range from −2 to +12 log units.

Journal ArticleDOI
TL;DR: A new quantitative structure−property relationship (QSPR) five-parameter correlation of molar glass transition temperatures (Tg/M) for a diverse set of 88 polymers is developed with the Comprehensive Descriptors for Structural and Statistical Analysis (CODESSA) program.
Abstract: A new quantitative structure−property relationship (QSPR) five-parameter correlation (R2 = 0.946) of molar glass transition temperatures (Tg/M) for a diverse set of 88 polymers is developed with the Comprehensive Descriptors for Structural and Statistical Analysis (CODESSA) program. The descriptors are all calculated directly from the molecular structure, and the approach given is applicable, in principle, to all linear polymers of regular structure.

Journal ArticleDOI
TL;DR: This study analyzes the antitumor activity patterns of 112 ellipticine analogues and investigates the quantitative structure-activity relationships (QSAR) of these compounds, in particular with respect to the influence of p53-status and the CNS cell selectivity of the activity patterns.
Abstract: The U.S. National Cancer Institute (NCI) conducts a drug discovery program in which ∼10 000 compounds are screened every year in vitro against a panel of 60 human cancer cell lines from different organs of origin. Since 1990, ∼63 000 compounds have been tested, and their patterns of activity profiled. Recently, we analyzed the antitumor activity patterns of 112 ellipticine analogues using a hierarchical clustering algorithm. Dramatic coherence between molecular structures and activity patterns was observed qualitatively from the cluster tree. In the present study, we further investigate the quantitative structure−activity relationships (QSAR) of these compounds, in particular with respect to the influence of p53-status and the CNS cell selectivity of the activity patterns. Independent variables (i.e., chemical structural descriptors of the ellipticine analogues) were calculated from the Cerius2 molecular modeling package. Important structural descriptors, including partial atomic charges on the ellipticin...

Journal ArticleDOI
TL;DR: A predictive model has been developed by using 125 isomers in alkanes as the training set, and its performance was certified by employing 25 alkanes chosen randomly as the test set from a total of 150 alkane compounds; excellent predicted results were obtained.
Abstract: Models that estimate and predict the normal boiling point (NBP) of alkanes based on a molecular distance-edge (MDE) vector, λ, have been developed by using multiple linear regression (MLR) methods. The structures of the examined compounds are selectively described by an MDE vector structure descriptor, a novel molecular distance-edge vector recently developed in our laboratory. MLR was used to develop a linear model containing ten variables with a high precision root mean squares error (RMS = 4.985K) and a good correlation with the correlation coefficient (R = 0.9948). In addition, a predictive model has been developed by using 125 isomers in alkanes as the training set, and its performance was certified by employing 25 alkanes chosen randomly as the test set from a total of 150 alkane compounds; excellent predicted results were obtained with the RMS and R values found between the calculated value and observed NBP being RMS = 4.486K and R = 0.9945.

Journal ArticleDOI
TL;DR: A substructural approach to quantitative structure−property relationships based on the spectral moments of the edge adjacency matrix is extended to molecules containing cycles to describe the boiling points of a series of 80 cycloalkanes.
Abstract: A substructural approach to quantitative structure−property relationships based on the spectral moments of the edge adjacency matrix is extended to molecules containing cycles. Spectral moments are expressed as linear combinations of structural fragments of any kind of nonweighted graphs. The boiling points of a series of 80 cycloalkanes were well described by the present approach. The predictive power of the model was proved by using a test set of another 26 compounds. An equation that expresses the contribution of the different fragments of the molecules to the boiling point was obtained.
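
For readers unfamiliar with the descriptor, the sketch below computes spectral moments of an edge adjacency matrix under the usual definitions, which are assumed here rather than taken from the abstract: two bonds are adjacent when they share an atom, and the k-th moment is the trace of the k-th matrix power. The function names are hypothetical and the matrix arithmetic is written out in plain Python to stay self-contained.

```python
def edge_adjacency_matrix(edges):
    """E[i][j] = 1 if distinct bonds i and j share an atom, else 0."""
    n = len(edges)
    return [[1 if i != j and set(edges[i]) & set(edges[j]) else 0
             for j in range(n)] for i in range(n)]


def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def spectral_moments(edges, max_order):
    """Return [tr(E^0), tr(E^1), ..., tr(E^max_order)]."""
    e = edge_adjacency_matrix(edges)
    n = len(e)
    power = [[int(i == j) for j in range(n)] for i in range(n)]  # E^0
    moments = []
    for _ in range(max_order + 1):
        moments.append(sum(power[i][i] for i in range(n)))
        power = mat_mul(power, e)
    return moments


cyclopropane = [(0, 1), (1, 2), (2, 0)]
cyclobutane = [(0, 1), (1, 2), (2, 3), (3, 0)]
m3 = spectral_moments(cyclopropane, 3)   # [3, 0, 6, 6]
m4 = spectral_moments(cyclobutane, 4)    # [4, 0, 8, 0, 32]
```

The moments differ between ring sizes (the odd moments vanish for cyclobutane but not for cyclopropane), which is the kind of cycle information the extension to cyclic molecules has to capture.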

Journal ArticleDOI
TL;DR: Frequency analysis of building block composition of selected virtual compounds identifies building blocks that can be used in combinatorial synthesis of chemical libraries with high similarity to the lead molecules.
Abstract: We describe a new computational approach, called Focus-2D, to the rational design of targeted combinatorial chemical libraries. This approach is based on the hypothesis that structurally similar compounds display similar biological activity profiles. Building blocks that are used in a combinatorial chemical synthesis are randomly assembled to produce virtual library compounds. Individual library compounds are represented by Kier−Hall topological descriptors. Molecular similarities between compounds are evaluated quantitatively by modified pairwise Euclidean distances in multidimensional descriptor space. Simulated annealing is used to search the potentially large structural space of virtual chemical libraries to identify compounds similar to lead molecules. Frequency analysis of building block composition of selected virtual compounds identifies building blocks that can be used in combinatorial synthesis of chemical libraries with high similarity to the lead molecules. We show that this method correctly i...

Journal ArticleDOI
TL;DR: Issues to be considered include fundamental data structures, neighborhood searching principles, useful searching approaches and techniques, library definition and construction, algorithmic details of library comparison, and user interfaces.
Abstract: Virtual compound libraries, descriptions of all of the structures that might be produced by specified transformations involving specified reagents, are especially useful in molecular discovery when suitably fast and relevant searching techniques are available. Issues to be considered include fundamental data structures, neighborhood searching principles, useful searching approaches and techniques, library definition and construction, algorithmic details of library comparison, and user interfaces.

Journal ArticleDOI
TL;DR: The superposition method described here combines a genetic algorithm with a numerical optimization method to adequately address the conformational flexibility of ligand molecules.
Abstract: The superposition of three-dimensional structures is the first task in the evaluation of the largest common three-dimensional substructure of a set of molecules. This is an important step in the identification of a pharmacophoric pattern for molecules that bind to the same receptor. The superposition method described here combines a genetic algorithm with a numerical optimization method. A major goal is to adequately address the conformational flexibility of ligand molecules. The genetic algorithm optimizes in a nondeterministic process the size and the geometric fit of the substructures. The geometric fit is further improved by changing torsional angles combining the genetic algorithm and the directed tweak method. This directed tweak method is based on a numerical quasi-Newton optimization method. Only one starting conformation per molecule is necessary. Molecules having several rotatable bonds and quite different initial conformations are modified to find large structural similarities. A set of angiote...

Journal ArticleDOI
TL;DR: A novel strategy for the rational design of targeted peptide libraries is developed, which selects a subset of natural amino acids that are most likely to be present in active peptides for synthesis of the library.
Abstract: We have developed a novel strategy for rational design of targeted peptide libraries. The goal of this method is to select a subset of natural amino acids that are most likely to be present in active peptides for synthesis of the library. Two different protocols are employed where chemical structures of peptides are described either by topological indices or by a combination of physicochemical descriptors for individual amino acids. The selection of a peptide as a candidate for the targeted library is based either on its chemical similarity to a biologically active probe or on its biological activity predicted from a preconstructed quantitative structure−activity (QSAR) equation. The optimization of the library is achieved by means of genetic algorithms (GA). This method was tested by rational design of the library with bradykinin-potentiating activity. Twenty-eight bradykinin-potentiating pentapeptides were used as a training set for the development of a QSAR equation, and, alternatively, two active pent...

Journal ArticleDOI
TL;DR: The idea is to encode the three-dimensional features of chemical compounds into bit strings and use RP to determine the important features that statistically correlate to the biological activities of these compounds.
Abstract: Large chemical data sets are becoming available from high throughput screening of corporate collections and chemical libraries. There is a growing need to develop three-dimensional pharmacophores from these large data sets to guide database screening, chemical library design, and lead optimization. Recursive partitioning (RP) is a statistical method that can be used to analyze very large data sets; data sets of over 100 000 observations and over 2 000 000 descriptors pose no computational problems. Our idea is to encode the three-dimensional features of chemical compounds into bit strings and use RP to determine the important features that statistically correlate to the biological activities of these compounds. This kind of structure−activity relationship analysis (SAR) can be considered as the first step to the goal of pharmacophore identification for large chemical data sets. We report here our RP work that for the first time successfully retrieved 3D SARs from a large, heterogeneous data set of 1650 mo...

Journal ArticleDOI
TL;DR: The applicability of these two descriptors for the prediction of boiling points for various other classes of organic compounds was investigated by employing a diverse data set of 612 organic compounds containing C, H, N, O, S, F, Cl, Br, and I.
Abstract: We recently reported a successful correlation of the normal boiling points of 298 organic compounds containing O, N, Cl, and Br with two molecular descriptors.1 In the present study the applicabili...

Journal ArticleDOI
TL;DR: A model to predict vapor pressure from only computationally derived molecular descriptors, allowing study of hypothetical structures, is described here and proves to be more accurate and works over a wider range of compound classes than most previously reported models.
Abstract: To date, most reported quantitative structure−property relationship (QSPR) methods to predict vapor pressure rely on, at least, some empirical data, such as boiling points, critical pressures, and critical temperatures. This limits their usefulness to available chemicals and incurs the time and expense of experimentation. A model to predict vapor pressure from only computationally derived molecular descriptors, allowing study of hypothetical structures, is described here. Several multilinear regressions and artificial neural network analyses were tested with a range of descriptors (e.g., topological and quantum mechanical) derived solely from computations on molecular structure data. From a set of 479 compounds, a linear regression with an r2 of 0.960 was achieved using polarizability and polar functional group counts as descriptors. This new computationally based model also proves to be more accurate and works over a wider range of compound classes than most previously reported models.

Journal ArticleDOI
TL;DR: Although there exists an infinite variety of wavelet transformations, 22 orthonormal wavelet transforms that are typically used, which include Haar, 9 daublets, 5 coiflets, and 7 symmlets, were evaluated and four threshold selection methods have been studied.
Abstract: Discrete wavelet transform (DWT) denoising contains three steps: forward transformation of the signal to the wavelet domain, reduction of the wavelet coefficients, and inverse transformation to the native domain. Three aspects that should be considered for DWT denoising include selecting the wavelet type, selecting the threshold, and applying the threshold to the wavelet coefficients. Although there exists an infinite variety of wavelet transformations, 22 orthonormal wavelet transforms that are typically used, which include Haar, 9 daublets, 5 coiflets, and 7 symmlets, were evaluated. Four threshold selection methods have been studied: universal, minimax, Stein's unbiased estimate of risk (SURE), and minimum description length (MDL) criteria. The application of the threshold to the wavelet coefficients includes global (hard, soft, garrote, and firm), level-dependent, data-dependent, translation invariant (TI), and wavelet packet transform (WPT) thresholding methods. The different DWT-based denoising m...
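
The three steps listed in the abstract can be sketched with the simplest of the wavelets it names. The code below is a minimal illustration, not the authors' procedure: it assumes a Haar transform, a signal whose length is a power of two, global soft thresholding, the universal threshold in its usual form sigma * sqrt(2 ln n), and a noise level `sigma` that is already known. All function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(signal):
    """Step 1: full Haar decomposition (length must be a power of two)."""
    approx = list(signal)
    details = []  # finest level first
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / SQRT2 for a, b in pairs])
        approx = [(a + b) / SQRT2 for a, b in pairs]
    return approx, details


def soft_threshold(value, t):
    """Step 2: shrink a coefficient toward zero by t (soft rule)."""
    return math.copysign(max(abs(value) - t, 0.0), value)


def haar_inverse(approx, details):
    """Step 3: rebuild the signal from coarsest to finest level."""
    approx = list(approx)
    for detail in reversed(details):
        rebuilt = []
        for s, d in zip(approx, detail):
            rebuilt.append((s + d) / SQRT2)
            rebuilt.append((s - d) / SQRT2)
        approx = rebuilt
    return approx


def denoise(signal, sigma):
    """Apply the universal threshold to every detail coefficient."""
    t = sigma * math.sqrt(2.0 * math.log(len(signal)))
    approx, details = haar_forward(signal)
    shrunk = [[soft_threshold(d, t) for d in level] for level in details]
    return haar_inverse(approx, shrunk)
```

With `sigma=0` the threshold is zero and `denoise` returns the input unchanged (up to rounding), which is a quick check that the forward and inverse transforms are consistent. Swapping `soft_threshold` for a hard rule, or computing a separate threshold per level, gives some of the other variants the abstract lists.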

Journal ArticleDOI
TL;DR: The use of descriptors calculated only from molecular structure eliminates the need for experimental determination of properties for use in the correlation and allows for the estimation of aqueous solubility for molecules not yet synthesized or isolated.
Abstract: The aqueous solubilities of a set of 109 hydrocarbons and 132 halogenated hydrocarbons (total 241) are correlated by a three term equation using descriptors calculated solely from molecular structure, with a correlation coefficient (R) of 0.979 and a standard error (s) of 0.386 log units. This equation allows the estimation of aqueous solubilities of hydrocarbons and halogenated hydrocarbons (including polychlorinated biphenyls). The key descriptor is the molecular volume, modified by topological and electrostatic terms. The use of descriptors calculated only from molecular structure eliminates the need for experimental determination of properties for use in the correlation and allows for the estimation of aqueous solubility for molecules not yet synthesized or isolated.

Journal ArticleDOI
TL;DR: This paper presents the theoretical results that for all molecules, the problems of isomorphism, automorphism partitioning, and canonical labeling are polynomial-time problems.
Abstract: The graph isomorphism problem belongs to the class of NP problems, and has been conjectured intractable, although probably not NP-complete. However, in the context of chemistry, because molecules are a restricted class of graphs, the problem of graph isomorphism can be solved efficiently (i.e., in polynomial-time). This paper presents the theoretical results that for all molecules, the problems of isomorphism, automorphism partitioning, and canonical labeling are polynomial-time problems. Simple polynomial-time algorithms are also given for planar molecular graphs and used for automorphism partitioning of paraffins, polycyclic aromatic hydrocarbons (PAHs), fullerenes, and nanotubes.

Journal ArticleDOI
TL;DR: It is demonstrated how affinity fingerprints may be used in conjunction with simple algorithms to select active-enriched diverse training sets and to efficiently extract the most active compounds from a large library.
Abstract: The Similarity Principle provides the conceptual framework behind most modern approaches to library sampling and design. However, it is often the case that compounds which appear to be very similar structurally may in fact exhibit quite different activities toward a given target. Conversely, some targets recognize a wide variety of molecules and thus bind compounds that have markedly different structures. Affinity fingerprints largely overcome the difficulties associated with selecting compounds on the basis of structure alone. By describing each compound in terms of its binding affinity to a set of functionally dissimilar proteins, fundamental factors relevant to binding and biological activity are automatically encoded. We demonstrate how affinity fingerprints may be used in conjunction with simple algorithms to select active-enriched diverse training sets and to efficiently extract the most active compounds from a large library.


Journal ArticleDOI
TL;DR: A quantitative structure−property relationship study is reported for boiling points of 185 acyclic compounds with one or two oxygen or sulfur atoms (devoid of hydrogen bonding), in terms of four or five molecular descriptors.
Abstract: Two new approaches are presented for the calculation of atom and bond parameters for heteroatom-containing molecules used in computing graph theoretic invariants. In the first approach, the atom and bond weights are computed on the basis of relative atomic electronegativity, using carbon as standard. In the second system, the relative covalent radii are used to compute atom and bond weights, again with the carbon atom as standard. The new definition of the atom and bond parameters leads to a periodic variation versus the atomic number Z, with a more natural variation when compared with the parameters defined only by Z. The two approaches are used to define and compute topological indices based on graph distance. A quantitative structure−property relationship study is reported for boiling points of 185 acyclic compounds with one or two oxygen or sulfur atoms (devoid of hydrogen bonding), in terms of four or five molecular descriptors.

Journal ArticleDOI
TL;DR: QSARs were developed for the acute toxicity of narcotic pollutants to the water flea, the guppy, and the pond snail using hydrophobicity (log KOW) and hydrogen bonding capacity descriptors (Q-, Q+, eHOMO, eLUMO).
Abstract: QSARs were developed for the acute toxicity of narcotic pollutants (nonpolar and polar) to the water flea (Daphnia magna), the guppy (Poecilia reticulata), and the pond snail (Lymnaea stagnalis) using hydrophobicity (log KOW) and hydrogen bonding capacity descriptors (Q-, Q+, eHOMO, eLUMO). Toxicity increases with increasing hydrophobicity and to a minor extent with decreasing LUMO energies and increasing absolute charges in the molecule. The models are rationalized by taking into account the composition of biomembranes, into which chemicals must partition for displaying narcosis. The similarity of these results with models for the membrane/water partition coefficients supports the hypothesis that the toxicity of narcotics is directly related to the accumulation in biological membranes. The results indicate that baseline toxicity based on log KOW should be redefined for chemicals for which log KOW is not a good surrogate for partitioning into biological membranes.