
Showing papers in "QSAR & Combinatorial Science in 2009"


Journal Article
TL;DR: In this paper, the physicochemical properties of polybrominated diphenyl ethers (PBDEs) were investigated through a modelling approach based on quantitative structure-property relationships (QSPR).
Abstract: Polybrominated diphenyl ethers (PBDEs) are a group of brominated flame retardants (BFRs) that were widely used in a variety of consumer products. Because of evidence of toxic effects on different organisms and humans, as well as the ubiquitous occurrence of these compounds, PBDEs are considered an emerging group of toxic and persistent organic pollutants. However, because few experimental data are available, little is still known about the properties of most of these chemicals. In this study, several physicochemical properties, experimentally available for a few PBDE congeners and hexabromobenzene (HBB), were investigated through a modelling approach based on quantitative structure-property relationships (QSPR). The OLS regression models, based

85 citations


Journal Article
TL;DR: The entropy-based information metric takes into account the frequency distribution of the different scaffolds and is a complementary measure of scaffold diversity enabling a more comprehensive analysis.
Abstract: Scaffold diversity analysis of compound databases has multiple applications in medicinal chemistry and drug discovery including library design, compounds acquisition, virtual screening and assessment of structure-activity-relationships. The scaffold diversity is commonly measured based on frequency counts. Further information can be obtained by considering the specific distribution of the molecules in those scaffolds. To this end, we introduce in this work the use of an entropy-based information metric to assess the scaffold diversity of compound data sets. As a test case we analyzed the scaffold diversity of 16 data sets of active compounds comparable in size targeting five protein classes of interest in drug design. The diversity was also assessed in terms of frequency counts and scaffold retrieval curves. The entropy-based information metric takes into account the frequency distribution of the different scaffolds and is a complementary measure of scaffold diversity enabling a more comprehensive analysis.

75 citations
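
The entropy-based metric described in this abstract can be illustrated with a short sketch. The abstract does not give the formula, so the code below assumes plain Shannon entropy over scaffold frequencies and, as a second assumption, a version scaled by its maximum value; the function names and the two toy data sets are hypothetical.

```python
import math
from collections import Counter

def scaffold_entropy(scaffolds):
    """Shannon entropy (in bits) of the scaffold frequency distribution."""
    counts = Counter(scaffolds)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def scaled_scaffold_entropy(scaffolds):
    """Entropy divided by its maximum, log2(number of distinct scaffolds)."""
    n_distinct = len(set(scaffolds))
    if n_distinct < 2:
        return 0.0
    return scaffold_entropy(scaffolds) / math.log2(n_distinct)

# Two hypothetical data sets: same size, same number of scaffolds (4),
# but very different distributions of molecules over those scaffolds.
even = ["A", "B", "C", "D"] * 25         # 25 molecules per scaffold
skewed = ["A"] * 97 + ["B", "C", "D"]    # one dominant scaffold
```

A plain frequency count would report four scaffolds for both sets, whereas the entropy separates them: the evenly populated set reaches the maximum of 2 bits and the skewed set falls well below it.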


Journal Article
TL;DR: The sparse feature selection algorithm proved to be an excellent, robust method for selecting descriptors for QSAR models, as it is supervised (descriptors chosen in a context-dependent manner), parsimonious (models not overly complex), and inherently interpretable.
Abstract: Choosing a set of molecular descriptors (features) that is most relevant to a given biological response variable is a very important problem in QSAR that has not been solved in an optimal, robust way. It is an interesting and important class of mathematical problems, where the number of variables greatly outweighs the number of observations (grossly underdetermined systems). We have used two Bayesian approaches to carry out this task using a suite of QSAR data sets. We employed a specialized sparse Bayesian feature reduction method based on an EM algorithm with a Laplacian prior to select a small set of the most relevant descriptors for modeling the response variables from a much larger pool of possibilities. Having chosen the optimum descriptors in a supervised manner, we used a Bayesian regularized neural network to carry out nonlinear regression and derive robust parsimonious QSAR models for five drug data sets. Models were validated using independent test sets, and results were compared with other contemporary descriptor selection methods. Issues around validating small QSAR data sets were also discussed in detail. The sparse feature selection algorithm proved to be an excellent, robust method for selecting descriptors for QSAR models, as it is supervised (descriptors chosen in a context-dependent manner), parsimonious (models not overly complex), and inherently interpretable. Coupled to a robust parsimonious nonlinear modeling method such as the Bayesian regularized neural net, the combination provides a means of optimally modeling the data and allows interpretation of the model in terms of the most relevant descriptors.

70 citations


Journal Article
TL;DR: A new Group-Based QSAR (G-QSAR) method is proposed which uses descriptors evaluated for the fragments of the molecules generated using specific fragmentation rules defined for a given dataset.
Abstract: Several approaches are widely used as important tools for drug discovery, including the Hansch method, the Free-Wilson method, and conventional 2D/3D QSAR methods. The Hansch analysis assumes that substituents are independent of each other and does not include explicit interactions of groups. In conventional QSAR methods, interpreting the generated model is rather difficult, as one does not get a clear direction about the site for improvement. A new Group-Based QSAR (G-QSAR) method is proposed which uses descriptors evaluated for the fragments of the molecules generated using specific fragmentation rules defined for a given dataset. Herein, we describe the application of the G-QSAR method to two different datasets belonging to a simple congeneric series and a complex noncongeneric series. This method provides models with predictive ability similar to or better than that of conventional methods and, in addition, provides hints for sites of improvement in the molecules.

69 citations


Journal Article
TL;DR: A Bayesian regularized neural network with a sparse Laplacian prior is employed as an efficient method for supervised feature selection, and robust parsimonious nonlinear QSAR modeling.
Abstract: Feature selection is an important but still poorly solved problem in QSAR modeling. We employ a Bayesian regularized neural network with a sparse Laplacian prior as an efficient method for supervised feature selection and robust, parsimonious nonlinear QSAR modeling. The method simultaneously selects the most relevant descriptors for the model and automatically prunes the neural network to the architecture with optimum prediction ability. We illustrate the advantages of the method using a suite of diverse data sets, and compare the results obtained by the new method against those obtained by alternative contemporary methods.

57 citations


Journal Article
TL;DR: In this article, the authors performed a quantitative analysis of 2,4-diphenyl-1,3-oxazoline analogues against two-spotted spider mite Tetranychus urticae, which causes serious damage to agricultural products.
Abstract: Quantitative Structure–Activity Relationship (QSAR) studies have been carried out for ovicidal activity of 2,4-diphenyl-1,3-oxazoline analogues against two-spotted spider mite Tetranychus urticae, which causes serious damage to agricultural products. The studies have been performed with 2D (physicochemical, structural, and topological) and 3D (shape, spatial, electronic, and molecular field) descriptors. The chemometrics tools used for the analyses are Genetic Function Approximation (GFA) and Genetic Partial Least Squares (G/PLS). The whole dataset (n=90) was divided into a training set (75% of the dataset) and a test set (remaining 25%) on the basis of K-means clustering technique of standardized topological and structural descriptor matrix. Models developed from the training set were used to predict the activity of the test set compounds. All the models have been validated internally, externally, and by Y-randomization technique. However, different models emerged as the best ones according to different validation criteria. We have tried a consensus model, which is based on the results obtained by all predictive models and this may provide the most stable solution. Models obtained by using 2D parameters revealed that the chain length of the substituent at para position of the 4-phenyl ring is a critical factor. Lipophilicity of the molecule also reflects a dominant role for the ovicidal activity. Models generated from 3D descriptors suggest that the shape of the substituents should be optimum and the lipophilic substituents having electronegative atoms with distributed positive charge over a surface may enhance the ovicidal activity. The model obtained from Molecular Field Analysis (MFA) suggests that bulky substituents with optimally distributed charge may increase the ovicidal activity.

49 citations


Journal Article
TL;DR: The predictive performance of five different pKa prediction tools was investigated on the 248-membered Gold Standard dataset, and it was concluded that ACD and Marvin are in fact the methods of choice for medicinal chemistry applications.
Abstract: The predictive performance of five different pKa prediction tools (ACDpKa, Epik, Marvin pKa, Pallas pKa, and VCCpKa) was investigated on the 248-membered Gold Standard dataset. We found VCC to be the most predictive high-throughput pKa predictor. However, since VCC calculates pKa for the most acidic or basic group only, we concluded that ACD and Marvin are in fact the methods of choice for medicinal chemistry applications. Analyzing the common outliers, we identified guanidines, enolic hydroxyl groups, and weakly acidic NHs as the most problematic moieties from a prediction point of view. Our results obtained on the high-quality, homogeneous Gold Standard dataset could be useful for end-users selecting a suitable solution for pKa prediction.

49 citations


Journal Article
TL;DR: 2D similarity methods offer a useful method for building chemical categories for teratogenicity in which a-priori mechanistic knowledge is limited, and are investigated within the freely downloadable Toxmatch software.
Abstract: Reproductive toxicity is a key endpoint under REACH, as it is costly both financially and in the number of animals used. These factors make the development of alternative methods for the identification and assessment of chemicals potentially able to cause reproductive toxicity very important. Category formation and read-across have been suggested to be powerful methods that can utilize existing toxicological data in a transparent and interpretable manner, allowing chemical risk assessments to be carried out with minimal animal usage. Category formation relies on chemical similarity and the ability to select chemicals that act via a single mechanism of toxic action. This study therefore investigated the use of 2D similarity indices available within the freely downloadable Toxmatch software to form mechanistically transparent categories for 57 query chemicals from a database of 233 chemicals for which teratogenicity (an important endpoint within reproductive toxicity) had been previously assessed. The hypothesis was that chemicals selected as being similar should act via a single mechanism of action, even if that mechanism is unknown. Read-across predictions were then performed for the query chemicals for which a category could be formed. The results showed that mechanistic categories could be formed for 17 of the 57 query chemicals; within these categories, read-across predictions enabled the teratogenicity of the query chemicals to be correctly predicted. It was concluded that 2D similarity methods offer a useful means of building chemical categories for teratogenicity when a priori mechanistic knowledge is limited.

44 citations
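
Similarity-based category formation of the kind this abstract describes can be sketched as follows. The abstract does not say which 2D similarity indices or cut-offs were used, so the Tanimoto coefficient, the 0.6 threshold, the function names, and the toy fingerprints are all assumptions made purely for illustration; this is not the Toxmatch implementation.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0
    return len(a & b) / union

def form_category(query_fp, database, threshold=0.6):
    """Return (name, similarity) pairs at or above the threshold, most similar first."""
    hits = [(name, tanimoto(query_fp, fp)) for name, fp in database.items()]
    hits = [hit for hit in hits if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical fingerprints: sets of structural-feature bit indices
database = {
    "chem_1": {1, 2, 3, 4, 5},
    "chem_2": {1, 2, 3, 9},
    "chem_3": {7, 8, 9},
}
query = {1, 2, 3, 4}
category = form_category(query, database)
```

A read-across prediction would then assign the query chemical the endpoint value shared by the members of `category`; when the list is empty, no category can be formed and no prediction is made.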


Journal Article
TL;DR: It was revealed that atom individuality and the stereochemistry of the chiral surroundings of the asymmetric phosphorus atom play a vital role in AChE inhibition.
Abstract: In this article, the Hierarchical QSAR technology (HiT QSAR) has been used for consensus QSAR modeling of Acetylcholinesterase (AChE) inhibition by various organophosphate compounds. Simplex representation of molecular structure (SiRMS) and Lattice model (LM) QSAR approaches have been used for descriptor generation. Statistical models have been obtained by the partial least squares (PLS) method. Various chiral organophosphates represented by their (R)- and (S)-isomers, racemic mixtures, and achiral structures have been investigated. A successful consensus model (R2=0.978), based on the fourteen best QSAR models obtained using different QSAR approaches and training sets for several levels and methods of molecular structure representation (2.5D, double 2.5D, and 3D), has been used for the prediction of AChE inhibition of new compounds. In order to avoid chance correlations, 1000 rounds of Y-scrambling were performed for each selected model. Leverage and ellipsoid applicability domain (DA) approaches have been used for additional estimation of the quality of the predictions. Molecular fragments enhancing and interfering with inhibitory activity have been determined. It was revealed that atom individuality and the stereochemistry of the chiral surroundings of the asymmetric phosphorus atom play a vital role in AChE inhibition. Thus, (R)-isomers are always less active than the (S)-isomers and the racemate. HiT QSAR proves to be a powerful tool for investigating stereoselective interactions and can be used in subsequent studies of this kind.

43 citations
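
The Y-scrambling check mentioned in this abstract can be sketched in a few lines. The study applied it to PLS models; the code below substitutes a one-descriptor least-squares fit as a stand-in so that it stays self-contained, and the data, function names, and fixed seed are hypothetical.

```python
import random

def r_squared(x, y):
    """Squared Pearson correlation, i.e. R2 of a one-descriptor least-squares fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

def y_scrambling(x, y, rounds=1000, seed=0):
    """Refit on shuffled activities; return the true R2 and the fraction of
    scrambled fits whose R2 matches or exceeds it."""
    rng = random.Random(seed)
    true_r2 = r_squared(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(shuffled)
        if r_squared(x, shuffled) >= true_r2:
            hits += 1
    return true_r2, hits / rounds

# Hypothetical descriptor values and activities
x = [0.5, 1.1, 1.9, 3.2, 4.0, 5.1, 6.3, 7.0]
y = [1.2, 1.9, 3.1, 4.8, 6.1, 7.4, 9.0, 10.2]
true_r2, chance_fraction = y_scrambling(x, y)
```

If the scrambled fits rarely come close to the true R2, a chance correlation is unlikely; a large `chance_fraction` would suggest the model is fitting noise.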


Journal Article
TL;DR: Taking into account the prediction results of two computer programs for rodent carcinogenicity the consensus model increases the accuracy of prediction.
Abstract: Computer-aided prediction of rodent carcinogenicity for an external test set consisting of 293 chemicals was performed by PASS (Prediction of Activity Spectra for Substances) and by CISOC-PSCT. The set included 64 carcinogens from the ISS Carcinogens Data Bank and 229 noncarcinogens from the Prestwick Chemical Library. We calculated the accuracy of carcinogenicity prediction by PASS and CISOC-PSCT separately, and by the two programs together (the consensus model). Sensitivity, specificity, and accuracy (concordance) were calculated for the external test set by PASS (0.81, 0.74, 0.76), by CISOC-PSCT (0.36, 0.89, 0.77), and by the consensus model (0.69, 0.86, 0.83). Thus, by taking into account the prediction results of two computer programs for rodent carcinogenicity, the consensus model increases the accuracy of prediction.

43 citations
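
The three statistics quoted in this abstract are linked: overall accuracy is the class-size-weighted mean of sensitivity and specificity. The sketch below recomputes accuracy from the reported sensitivity and specificity and the stated test-set composition (64 carcinogens, 229 noncarcinogens). The function name is hypothetical, and because the published inputs are rounded to two decimals, the recomputed values can only be expected to agree to within about 0.01.

```python
def accuracy_from_rates(sensitivity, specificity, n_positive, n_negative):
    """Overall accuracy as the class-size-weighted mean of sensitivity and specificity."""
    correct = sensitivity * n_positive + specificity * n_negative
    return correct / (n_positive + n_negative)

# Test-set composition and (sensitivity, specificity) pairs quoted in the abstract
N_CARCINOGENS, N_NONCARCINOGENS = 64, 229
reported = {
    "PASS": (0.81, 0.74),
    "CISOC-PSCT": (0.36, 0.89),
    "consensus": (0.69, 0.86),
}
recomputed = {
    name: accuracy_from_rates(sens, spec, N_CARCINOGENS, N_NONCARCINOGENS)
    for name, (sens, spec) in reported.items()
}
```

Because noncarcinogens make up most of the test set, specificity dominates the accuracy figure, which is why CISOC-PSCT reaches an accuracy close to that of PASS despite its much lower sensitivity.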


Journal Article
TL;DR: No summary is available for this article by Soto, Axel Juan, et al.; the record holds only author affiliation data.
Abstract: Affiliation: Soto, Axel Juan. Consejo Nacional de Investigaciones Cientificas y Tecnicas. Centro Cientifico Tecnologico Conicet - Bahia Blanca. Planta Piloto de Ingenieria Quimica. Universidad Nacional del Sur. Planta Piloto de Ingenieria Quimica; Argentina. Universidad Nacional del Sur. Departamento de Ciencias e Ingenieria de la Computacion. Laboratorio de Investigacion y Desarrollo en Computacion Cientifica; Argentina

Journal Article
TL;DR: A geometry analysis of Cl–π interactions in protein–ligand complex crystal structures showed two distinct geometries: "edge-on" approach of a Cl atom to a ring atom or CC bond and "face-on" approach towards the ring centroid, with an average interatomic distance of 3.6 Å.
Abstract: A geometry analysis of Cl–π interactions in protein–ligand complex crystal structures showed two distinct geometries: “edge-on” approach of a Cl atom to a ring atom or CC bond and “face-on” approach towards the ring centroid, with an average interatomic distance of 3.6 Å. The interaction energies were estimated as the sum of the CCSD(T) correlation contribution and the Hartree–Fock energy at the complete basis set limit, for the geometries of the benzene–chlorohydrocarbon model structures at the energy minimum obtained by potential energy surface scans using RMP2(FC)/cc-pVTZ. The calculated Cl–π interaction energy was −2.01 kcal/mol, and the dispersion force was found to be the major source of attraction. We also discuss the geometric flexibility of Cl–π interactions.

Journal Article
TL;DR: How understanding the limitations of methods, their applicability domains and their prediction accuracies, as well as the use of local models can help to establish accurate and meaningful in silico predictions is discussed.
Abstract: The prediction accuracy of in silico methods for physicochemical and ADMET properties of drugs is currently a matter of controversy. With a particular focus on log P prediction methods, we discuss how understanding the limitations of methods, their applicability domains, and their prediction accuracies, as well as the use of local models, can help to establish accurate and meaningful in silico predictions.

Journal Article
TL;DR: A quantitative structure activity relationship (QSAR) model was developed for the aqueous phase hydroxyl radical reaction rate constants (kOH) employing quantum chemical descriptors and multiple linear regressions (MLR) as mentioned in this paper.
Abstract: A quantitative structure-activity relationship (QSAR) model was developed for the aqueous-phase hydroxyl radical reaction rate constants (kOH), employing quantum chemical descriptors and multiple linear regression (MLR). The QSAR development followed the OECD guidelines, with special attention to validation, applicability domain (AD), and mechanistic interpretation. The established model yielded satisfactory performance: the squared correlation coefficient (R2) was 0.905, the root mean squared error (RMSE) was 0.139, the leave-many-out cross-validated Q2LMO was 0.806, and the externally validated Q2EXT was 0.922 log units. The AD of the model, covering phenols, alkanes, and alcohols, was analyzed by Williams plot. The main molecular structural factors governing kOH are the energy of the highest occupied molecular orbital (EHOMO), the average net atomic charge on hydrogen atoms, molecular surface area (MSA), and dipole moment (μ). It was concluded that kOH increased with increasing EHOMO and MSA, and decreased with increasing average net atomic charge on hydrogen atoms and μ.
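
The leverage axis of a Williams plot, used in this abstract to describe the applicability domain, can be illustrated with a minimal sketch. The published model has four descriptors; the code below is simplified to an intercept plus one descriptor so that no matrix algebra is needed. The cut-off h* = 3(p + 1)/n is a commonly used convention and is assumed here, since the abstract does not state which threshold was applied; the descriptor values and function names are hypothetical.

```python
def leverages(x):
    """Leverage h_i of each compound for a model with an intercept and one descriptor."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((v - mean) ** 2 for v in x)
    return [1.0 / n + (v - mean) ** 2 / sxx for v in x]

def warning_leverage(n_compounds, n_descriptors):
    """Assumed cut-off h* = 3(p + 1)/n for the leverage axis of a Williams plot."""
    return 3.0 * (n_descriptors + 1) / n_compounds

# Hypothetical descriptor values; the last compound lies far from the others
descriptor = [-9.1, -9.0, -8.9, -8.8, -8.7, -8.6, -8.5, -8.4, -8.3, -6.0]
h = leverages(descriptor)
h_star = warning_leverage(len(descriptor), n_descriptors=1)
outside_domain = [i for i, value in enumerate(h) if value > h_star]
```

In a full Williams plot these leverages are plotted against standardized residuals, so that structurally unusual compounds (high leverage) and poorly predicted compounds (large residual) can be told apart.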

Journal Article
Mao Shu, Hu Mei, ShanBin Yang, Limin Liao, Zhiliang Li
TL;DR: In this paper, a new set of descriptors, Hydrophobic, Electronic, Steric, and Hydrogen (HESH) were derived from Principal Component Analysis (PCA) on the collected 171 physicochemical properties of 20 coded amino acids.
Abstract: In this paper, a new set of descriptors, Hydrophobic, Electronic, Steric, and Hydrogen (HESH) (principal component score vectors of the HESH bond contribution properties), was derived from Principal Component Analysis (PCA) of 171 collected physicochemical properties of the 20 coded amino acids. By applying HESH descriptors to a Quantitative Structure–Activity Relationship (QSAR) study of three peptide data sets, comprising 58 Angiotensin-Converting Enzyme (ACE) inhibitors, 48 bitter-tasting dipeptides, and 20 thromboplastin inhibitors, we obtained three excellent Partial Least Squares (PLS) models, with squared multiple correlation coefficient (R2), cross-validated R2, and Root Mean Square Error (RMSE) values of 0.877, 0.838, and 0.361 for ACE inhibitors; 0.926, 0.865, and 0.172 for bitter-tasting dipeptides; and 0.996, 0.865, and 0.115 for thromboplastin inhibitors. These results were superior to many previously reported results. This shows that HESH may be a useful structural representation for QSAR studies of peptides.

Journal Article
TL;DR: In this paper, the COSMO-RS method was applied to predict a total of 4155 experimentally available values of 12 different hydrocarbon-water (logPalk) and 1-n-octanol-water (logPow) partition coefficients.
Abstract: The COSMO-RS method was applied to predict a total of 4155 experimentally available values of 12 different hydrocarbon-water (logPalk) and 1-n-octanol-water (logPow) partition coefficients. The overall results of the correlation between the predicted and experimental partition coefficients demonstrate that COSMO-RS is able to predict a priori the values of all systems under investigation for almost all organic compounds. As a showcase, the outliers of the hexadecane-water system were critically assessed. In two cases the prediction could be improved by correcting the underlying chemical structure. The remaining three outliers are due to known problems with rarely encountered elements (germanium compounds) or with insufficient conformational representation of the molecule (crown ethers). Finally, ΔlogP, defined as the difference logPow − logPalk, is correlated with the blood-brain partition coefficient (logBB). On an external test set of 199 compounds, the logBB prediction based on ΔlogP is slightly inferior to our existing QSPR model (RMSE: 0.52 vs 0.46).

Journal Article
TL;DR: In this paper, the Abraham model correlation for water-to-micellar Sodium Dodecylsulfate (SDS) was updated to include MEKC retention factor data from the five recent experimental studies.
Abstract: Data have been assembled from the published literature on the water-to-micellar Cetyltrimethylammonium Bromide (CTAB) partition coefficients for more than 60 compounds, and on the Micellar Electrokinetic Chromatographic (MEKC) retention factors for more than 50 compounds measured on a CTAB pseudostationary phase. In total, more than 200 experimental data points have been compiled. The water-to-micellar CTAB partition coefficients and MEKC retention factor data from three separate studies have been combined into a single database and correlated with the Abraham model. The derived correlation described the 200 experimental values to within a standard deviation of 0.175 log units. Also, the previously derived Abraham model correlation for water-to-micellar Sodium Dodecylsulfate (SDS) was updated to include MEKC retention factor data from five recent experimental studies. The new regression correlation for SDS is based on 706 experimental values. Principal component analysis showed that water-to-micellar surfactant partition systems are good models for non-polar and polar narcotic toxicities of organic compounds toward fish, water fleas, and other aquatic organisms.
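
For orientation, the Abraham model for solute transfer between two condensed phases is usually written as the linear free-energy relationship below. The abstract does not reproduce the equation or the fitted coefficients for CTAB or SDS, so only the general form is shown, with its customary symbols.

```latex
\log SP = c + e\,E + s\,S + a\,A + b\,B + v\,V
```

Here SP is the solute property being correlated (a partition coefficient or an MEKC retention factor), E is the excess molar refraction, S the dipolarity/polarizability, A and B the overall hydrogen-bond acidity and basicity, and V the McGowan characteristic volume of the solute; c, e, s, a, b, and v are the regression coefficients that characterize the partition system.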

Journal Article
TL;DR: In this article, a least squares-support vector machine (LS-SVM) was used to derive a quantitative structure-activity relationship (QSAR) model for predicting the soil sorption coefficient normalized to organic carbon, Koc, from 24 fragment-specific increments and four further molecular descriptors, employing a training set of 571 organic compounds and three external validation sets.
Abstract: Least squares-support vector machine (LS-SVM) was used to derive a quantitative structure-activity relationship (QSAR) model for predicting the soil sorption coefficient normalized to organic carbon, Koc, from 24 fragment-specific increments and four further molecular descriptors, employing a training set of 571 organic compounds and three external validation sets. The combinational parameters of LS-SVM were optimized by adaptive random search technique (ARST). ARST could search the optimal combinational parameters of LS-SVM from the solution space in a simple and quick way. The developed LS-SVM model was compared with the model established by multiple linear regression (MLR) analysis using the same data sets. Generally, the LS-SVM model performed slightly better than the MLR model with respect to goodness-of-fit, predictivity, and applicability domain (AD). The ADs of the LS-SVM and MLR models were described on the basis of leverages and standardized residuals. Both the LS-SVM and MLR models had wide ADs within a given reliability (standardized residual<3 SE units), but the LS-SVM model was superior for compounds with high leverages.

Journal Article
TL;DR: Examples of distributed solutions for docking and virtual screening applications in moving the computational paradigms towards collaborative research and grid computing environments are given.
Abstract: Distributed grid technologies are gradually realizing their potential to provide innovative infrastructures for complex scientific and industrial applications in the field of computational chemistry and related application areas. The current paper gives examples of distributed solutions for docking and virtual screening applications in moving the computational paradigms towards collaborative research and grid computing environments. The Chemomentum collaborative computing environment, including both hardware and software infrastructure, is described. Examples of applications are given i) for docking and virtual screening on multiprotein and multilibrary cases for H5N1 avian influenza and HIV-1 viruses; and ii) QSAR model building related to HIV-1 protease activity and aquatic toxicity.

Journal Article
TL;DR: Prior feature selection is not essential for ANN and it is a desirable option for meaningful outputs in terms of the rationale behind the inputs, as investigated with a variety of objectively selected and arbitrarily chosen variables from chemical databases.
Abstract: In modeling approaches, artificial neural networks (ANNs) have a special place to address the nonlinear phenomena or curved manifold. Often one or other feature selection approach is used prior to ANN to feed the input variables for its models. The function of ‘selected’ versus ‘arbitrary’ features on the outcome of ANN models is investigated with a variety of objectively selected and arbitrarily chosen variables from chemical databases namely thiazolidinones, anilinoquinolines and piperazinoquinolines. For each database, its biological activity is considered as the dependent variable and the molecular descriptors from DRAGON software are used as explanatory variables. The selection sets are obtained from feature selection approaches namely, combinatorial protocol in multiple linear regression, stepwise regression and genetic algorithm. Apart from these, a large number of arbitrary sets have been created by randomly picking the descriptors from corresponding databases. The features of all sets have shown a variety of inter- and intra- set diversities. A three-layer back propagation ANN with Levenberg-Marquardt optimization algorithm has been used for modeling the phenomena. Regardless of the origin of the feature sets, the ANN models from a very large number of sets have well explained the activity and qualified themselves to be predictive models. Also, no specific pattern is apparent between the quality of ANN model and the origin of its input feature set. Since these results are unusual, the study is extended to a few more databases. All the results emphasized the innate ability of ANN in developing complex network of relations among features to estimate the target variable. This has prompted us to suggest that prior feature selection is not essential for ANN and it is a desirable option for meaningful outputs in terms of the rationale behind the inputs.

Journal Article
TL;DR: A lead hopping application is deployed that uses belief theory to combine the results of ROCS, Daylight, and ECFP_6 similarities to detect lead hops that are reported to result from pharmacophore searching.
Abstract: We investigated the ability of several computer programs to detect lead hops that had been reported to result from pharmacophore searching. None of the methods identified all of these lead hops and some were not found by any program. The methods that performed the best identified different lead hops. Hence, we have deployed a lead hopping application that uses belief theory to combine the results of ROCS, Daylight, and ECFP_6 similarities.

Journal Article
TL;DR: The aim of this work is to adapt a methodology (previously developed for the analysis of DNA minor groove binders) for the analysis of the NCI ACAM database, using Principal Component Analysis (PCA) and QSAR/QSPR for the prediction of the mechanism of action of anti-cancer drugs.
Abstract: Over the years, the National Cancer Institute (NCI) has accumulated an enormous amount of information through the application of a complex drug screening protocol involving several tumor cell lines, grouped into panels according to disease class. The Anti-cancer Agent Mechanism (ACAM) database is a set of 122 compounds with anti-cancer activity and a reasonably well-known mechanism of action, for which drug screening data are available that measure their ability to inhibit the growth of a panel of 60 human tumor lines; it was explicitly designed as a training set for neural network and multivariate analysis. The aim of this work is to adapt a methodology (previously developed for the analysis of DNA minor groove binders) for the analysis of the NCI ACAM database, using Principal Component Analysis (PCA) and QSAR/QSPR for the prediction of the mechanism of action of anti-cancer drugs. The entire database was split into a training set of 60 structures and a test set of 48, and each set was expressed in the form of a matrix on which further procedures were performed. Three statistical parameters were calculated: First Attempt of Prediction (FAP) expresses the percentage of correct predictions at the first attempt; Total Attempt of Prediction (TAP) expresses the total percentage of correct predictions across all three attempts; and Non-Classified (NC) expresses the percentage of compounds whose mechanism of action could not be predicted. The predictive ability of this approach is variable, but the results obtained are generally good; using 50% Growth Inhibiting concentration (GI50) values as training data, we were able to assign a correct mechanism of action with a good degree of reliability (more than 79%).

Journal Article
Shan-sheng Yang, Wencong Lu, Tianhong Gu, Liuming Yan, Guo-Zheng Li
TL;DR: In this article, a quantitative structure-property relationship (QSPR) model was developed to correlate structures of aromatic compounds with their n-octanol-water partition coefficient (logKow).
Abstract: A Quantitative Structure–Property Relationship (QSPR) model was developed to correlate the structures of aromatic compounds with their n-octanol–water partition coefficient (logKow). The 68 molecular descriptors, derived solely from the structures of the aromatic compounds, were calculated using Gaussian 03, HyperChem 7.5, and TSAR V3.3. The descriptors were screened by the minimum Redundancy Maximum Relevance (mRMR)-Genetic Algorithm (GA)-Support Vector Regression (SVR) method. The parameters of the SVR model were optimized using five-fold cross-validation. The QSPR model was developed from a training set of 300 compounds using the SVR method, with a good determination coefficient (R2=0.85). The QSPR model was then tested using an external test set of 50 compounds, with satisfactory external predictive ability (q2=0.84). The results show that the mRMR-GA-SVR feature selection method and the SVR method can be used to model logKow for a diverse set of aromatic compounds and could be promising tools in the field of QSPR research.
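
The five-fold cross-validation used here for parameter optimization relies on splitting the training compounds into five folds. A minimal sketch of such a split is given below, using the 300 training compounds stated in the abstract; the shuffling, the seed, and the function name are assumptions, as the abstract does not describe how the folds were drawn.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for k shuffled, near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# 300 training compounds, as stated in the abstract
splits = list(k_fold_indices(300, k=5))
```

Each candidate parameter setting would be fitted on the 240 training indices of every split and scored on the remaining 60, and the setting with the best average score kept; the 50 external test compounds stay out of this loop entirely.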

Journal Article
Yueying Ren, Jin Qin, Huanxiang Liu, Xiaojun Yao, Mancang Liu
TL;DR: In this article, a Quantitative Structure-Property Relationship (QSPR) study was carried out to model the melting points for a diverse set of 288 potential Ionic Liquids (ILs) including pyridinium bromides, imidazolium bromides, benzimidazolium bromides, and 1-substituted 4-amino-1,2,4-triazolium bromides.
Abstract: A Quantitative Structure–Property Relationship (QSPR) study was carried out to model the melting points for a diverse set of 288 potential Ionic Liquids (ILs) including pyridinium bromides, imidazolium bromides, benzimidazolium bromides, and 1-substituted 4-amino-1,2,4-triazolium bromides. Based on the calculated descriptors by CODESSA program, a Principal Component Analysis (PCA) was performed on the whole data to detect the homogeneities in the dataset and to assist the separation of the data into representative training and test sets. Heuristic Method (HM) and Projection Pursuit Regression (PPR) were used to develop linear and nonlinear models between the descriptors and the melting points. The PPR model gave a high predictive correlation coefficient (R2) of 0.810 and an Average of Absolute Relative Deviation (AARD) of 17.75%, which are better than those by HM model (R2=0.712, AARD=24.33%) indicating that PPR is better for the prediction of the melting points. In addition, the descriptors selected by HM can give some insight into factors that can affect the melting points, i.e., benzene ring structure, rotatable bonds, branching, symmetry, and intramolecular electronic effects. This information would be very useful in the design of the potential ILs with desired melting points.
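
The Average of Absolute Relative Deviation (AARD) reported in this abstract is not defined there. The sketch below assumes the usual definition, the mean of |predicted − experimental| / |experimental| expressed in percent; the function name and the melting-point values are hypothetical and are not data from the paper.

```python
def aard_percent(experimental, predicted):
    """Average of absolute relative deviation, in percent (assumed definition)."""
    deviations = [abs(p - e) / abs(e) for e, p in zip(experimental, predicted)]
    return 100.0 * sum(deviations) / len(deviations)

# Hypothetical melting points (both lists in the same unit); not data from the paper
experimental = [100.0, 200.0, 50.0, 80.0]
predicted = [110.0, 180.0, 55.0, 72.0]
aard = aard_percent(experimental, predicted)
```

Because the deviation is relative, the value depends on the temperature scale used (the same errors give a smaller AARD in kelvin than in degrees Celsius), and the measure is undefined for an experimental value of zero.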

Journal ArticleDOI
TL;DR: Developing a quantitative structure-activity relationship (QSAR) model for describing and predicting the inhibition activity of 1-(3,3-diphenylpropyl)-piperidinyl derivatives as CCR5 modulators reveals that MARS can describe and predict the inhibition activity of these modulators and is as robust as ANN.
Abstract: This study deals with developing a quantitative structure-activity relationship (QSAR) model for describing and predicting the inhibition activity of 1-(3,3-diphenylpropyl)-piperidinyl derivatives as CCR5 modulators. Multiple linear regression (MLR) was unable to predict the inhibition behavior, which showed that the interaction is not linear in character. To assess the nonlinear characteristics of the inhibition activity, artificial neural networks (ANNs) were used for data modeling. In order to select the variables needed for developing the ANNs, three variable selection algorithms were used: Stepwise-MLR, genetic algorithm-partial least squares (GA-PLS), and Bayesian regularized genetic neural networks (BRGNNs). R2 and root mean square error (RMSE) values for the training (t) and leave-one-out (LOO) procedures revealed that BRGNN is a robust algorithm for simultaneous variable selection and regression. Due to the ‘black box’ limitation of neural networks, the multivariate adaptive regression spline (MARS) technique was also used for modeling. A prominent advantage of MARS over ANN is that the results of the model can be interpreted. Q2LOO and Rt2 (0.982 and 0.947) reveal that MARS can describe and predict the inhibition activity of these modulators and is as robust as ANN. Because the MARS model can explain the activity of molecules, it is a useful model for designing novel CCR5 inhibitors.

Journal ArticleDOI
TL;DR: The analysis demonstrates that RF is a powerful tool capable of building models for the data and should be valuable for virtual screening of androgen receptor-binding ligands.
Abstract: The purpose of the present study was to develop in silico models allowing for a reliable prediction of androgenic and nonandrogenic compounds based on a large, diverse dataset of 205 compounds. Random Forest (RF) was applied as a new classification method; its performance in classifying these compounds in terms of their Quantitative Structure–Activity Relationships (QSAR) was evaluated and compared with the widely used Partial Least Squares (PLS) analysis on the same dataset. The predictive power of these methods was verified with five-fold cross-validation and an independent test set. For the RF model, the prediction accuracies of the androgenic and nonandrogenic compounds are 81.0 and 77.0% for cross-validation, respectively, averaging 87.3% of correctly classified compounds in the external tests. PLS is slightly weaker, showing an average prediction accuracy of 75 and 74.7% for the cross-validation and external validation, respectively. Our analysis demonstrates that RF is a powerful tool capable of building models for the data and should be valuable for virtual screening of androgen receptor-binding ligands.
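The per-class prediction accuracies reported in this abstract can be computed as in the minimal sketch below. The helper name `per_class_accuracy` and the label lists are hypothetical and only illustrate the arithmetic; they are not data from the study, and no Random Forest or PLS model is fitted here.

```python
def per_class_accuracy(y_true, y_pred):
    """Return the fraction of correctly predicted samples for each true class."""
    totals, correct = {}, {}
    for truth, prediction in zip(y_true, y_pred):
        totals[truth] = totals.get(truth, 0) + 1
        if truth == prediction:
            correct[truth] = correct.get(truth, 0) + 1
    return {label: correct.get(label, 0) / totals[label] for label in totals}


# Made-up labels for eight compounds (illustrative only).
y_true = ["androgenic"] * 4 + ["nonandrogenic"] * 4
y_pred = [
    "androgenic", "androgenic", "androgenic", "nonandrogenic",
    "nonandrogenic", "nonandrogenic", "androgenic", "androgenic",
]

class_accuracy = per_class_accuracy(y_true, y_pred)
overall_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(class_accuracy, overall_accuracy)
```

Reporting the accuracy for each class separately, as the abstract does, shows whether a model favors one class, which a single overall accuracy can hide.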

Journal ArticleDOI
TL;DR: In this article, the authors developed robust, interpretable structure-activity relationship (SAR) models for assessing the aquatic toxicity of pesticides using two variable selection techniques, i.e., the stepwise procedure and the GA coupled with the linear discriminant analysis (LDA) to obtain stable and thoroughly validated QSARs.
Abstract: The purpose of this work is to develop robust, interpretable structure-activity relationship (SAR) models for assessing the aquatic toxicity of pesticides. A data set of 1600 chemicals involving 533 nontoxic (C0), 287 slightly toxic (C1), 329 moderately toxic (C2), 231 highly toxic (C3), and 220 very highly toxic compounds (C4) to aquatic organisms was collected in this work. Their chemical structures were encoded into 196 molecular descriptors including the 2D topological and electrotopological state variables as well as the MlogP and AlogP parameters. Two variable selection techniques, i.e., the Stepwise procedure and the Genetic Algorithms (GA), coupled with linear discriminant analysis (LDA), were used to obtain stable and thoroughly validated QSARs. Our results reveal that AlogP is capable of classifying the C0 versus C4 compounds with an accuracy rate of 70.4%, but performs poorly between other groups, while MlogP does not show any pronounced correlation with aquatic toxicity for any of the groups. By using all the theoretical descriptors, the GA-LDA models for the C(0,4), C(1,3), C(1,4), and C(2,4) classifications are acceptable, with external prediction accuracies ranging from 66.3% to 80.6%. The selected descriptors, accounting for molecular size, electrotopological state, and hydrophobicity, were found to be crucial to modeling the aquatic toxicity. The robustness and the predictive performance of the proposed models were verified using both internal (leave-one-out cross-validation, Y-scrambling) and external (randomly selected) statistical validations. Our results demonstrate that the Genetic Algorithms have a huge advantage over the Stepwise procedure, generating more reliable models while using far fewer descriptors for all the data sets.
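The Y-scrambling check named in this abstract can be sketched as follows. Note the swap: the study used LDA classifiers, whereas this sketch uses the R2 of a one-descriptor linear fit purely to keep the example short. The function names, the synthetic data, the number of rounds, and the seed are all assumptions made for illustration.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation, i.e. R^2 of a one-descriptor linear fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def y_scrambling(x, y, n_rounds=100, seed=0):
    """Refit after shuffling the responses; return the R^2 of each round."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = y[:]
        rng.shuffle(shuffled)  # break the descriptor-response link
        scores.append(r_squared(x, shuffled))
    return scores


# Synthetic data: a linear trend plus a small alternating perturbation.
x = [float(i) for i in range(20)]
y = [2.0 * v + 1.0 + 0.5 * (-1) ** i for i, v in enumerate(x)]

true_r2 = r_squared(x, y)
scrambled = y_scrambling(x, y)
mean_scrambled = sum(scrambled) / len(scrambled)
print(round(true_r2, 3), round(mean_scrambled, 3))
```

If a model fitted to the real responses scores far above the models fitted to scrambled responses, the original correlation is unlikely to be a chance result.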

Journal ArticleDOI
TL;DR: It is described how different potential errors in QSAR model generation and prediction were averted and how the method was applied in an industrial environment.
Abstract: We recently developed a global-local fusion model for CYP450 predictions. This model has the advantages of both global and local models. The expected error of the model is also estimated, which helps to qualify the reliability of the predictions. It is described how different potential errors in QSAR model generation and prediction were averted and how the method was applied in an industrial environment.

Journal ArticleDOI
Min Sun, Youguang Zheng, Hongtao Wei, Junqing Chen, Jin Cai, Min Ji
TL;DR: The resulting models could act as an efficient strategy for estimating the Src-inhibiting activity of novel 4-anilino-3-quinolinecarbonitriles and provide some insight into the structural features related to the biological activity of these compounds.
Abstract: Quantitative Structure–Activity Relationship (QSAR) analyses have been carried out for a set of 4-anilino-3-quinolinecarbonitriles. For simplicity and interpretability, the Src kinase-inhibiting activity of these compounds, expressed in log units, has been modeled by Multiple Linear Regression (MLR) analysis combined with various variable selection approaches, including Forward Selection (FS), Genetic Algorithm (GA), Simulated Annealing (SA), and Enhanced Replacement Method (ERM), based on descriptors generated by E-Dragon software. The performance of these models is rigorously validated by Leave-One-Out Cross-Validation (LOOCV), five-fold Cross-Validation (5-CV), and external validation. The ERM–MLR model is much better than the other models, with R2=0.854 and =0.840. The robustness and predictive ability of this model are prudently evaluated. Moreover, a classification analysis using Fisher Linear Discriminant Analysis (FLDA) and Support Vector Machine (SVM) is also developed with the aim of dissecting the most significant factors that lead to the activity difference between highly active compounds and those not so active. The 5-CV and external validation prediction accuracy reached 95.00 and 93.75% for the SVM-based model, respectively. The resulting models could act as an efficient strategy for estimating the Src-inhibiting activity of novel 4-anilino-3-quinolinecarbonitriles and provide some insight into the structural features related to the biological activity of these compounds.

Journal ArticleDOI
TL;DR: A specialized fragment-based method is employed to develop robust quantitative structure–activity relationship models for a series of synthetic discodermolide analogs, generating molecular recognition patterns that were combined with three-dimensional molecular modeling studies as a fundamental step on the path to understanding the molecular basis of drug–receptor interactions within this important series of potent antitumoral agents.
Abstract: Inhibition of microtubule function is an attractive rational approach to anticancer therapy. Although taxanes are the most prominent among the microtubule stabilizers, their clinical toxicity, poor pharmacokinetic properties, and resistance have stimulated the search for new antitumor agents having the same mechanism of action. Discodermolide is an example of a nontaxane natural product that has the same mechanism of action, demonstrating superior antitumor efficacy and therapeutic index. Its extraordinary chemical and biological properties have qualified discodermolide as a lead structure for the design of novel anticancer agents with optimized therapeutic properties. In the present work, we have employed a specialized fragment-based method to develop robust quantitative structure–activity relationship models for a series of synthetic discodermolide analogs. The generated molecular recognition patterns were combined with three-dimensional molecular modeling studies as a fundamental step on the path to understanding the molecular basis of drug–receptor interactions within this important series of potent antitumoral agents.