Journal ISSN: 0095-2338

Journal of Chemical Information and Computer Sciences 

American Chemical Society
About: Journal of Chemical Information and Computer Sciences is an academic journal. It publishes mainly in the areas of quantitative structure–activity relationships and topological indices. Its ISSN is 0095-2338. Over its lifetime, the journal has published 3105 papers, which have received 127615 citations.


Papers
Journal Article
TL;DR: This paper concerns the computer-aided enumeration and classification of benzenoid and coronoid hydrocarbons and their topological properties.
Abstract:
(1) Klarner, A. D. “Some Results Concerning Polyominoes”. Fibonacci Q. 1965, 3(1), 9-20.
(2) Golomb, S. W. Polyominoes; Scribner: New York, 1965.
(3) Harary, F.; Read, R. C. “The Enumeration of Tree-like Polyhexes”. Proc. Edinburgh Math. Soc. 1970, 17, 1-14.
(4) Lunnon, W. F. “Counting Polyominoes” in Computers in Number Theory; Academic: London, 1971; pp 347-372.
(5) Lunnon, W. F. “Counting Hexagonal and Triangular Polyominoes”. Graph Theory Comput. 1972, 87-100.
(6) Brunvoll, J.; Cyvin, S. J.; Cyvin, B. N. “Enumeration and Classification of Benzenoid Hydrocarbons”. J. Comput. Chem. 1987, 8, 189-197.
(7) Balaban, A. T., et al. “Enumeration of Benzenoid and Coronoid Hydrocarbons”. Z. Naturforsch., A: Phys., Phys. Chem., Kosmophys. 1987, 42A, 863-870.
(8) Gutman, I. “Topological Properties of Benzenoid Systems”. Bull. Soc. Chim., Beograd 1982, 47, 453-471.
(9) Gutman, I.; Polansky, O. E. Mathematical Concepts in Organic Chemistry; Springer: Berlin, 1986.
(10) Tošić, R.; Doroslovački, R.; Gutman, I. “Topological Properties of Benzenoid Systems: The Boundary Code”. MATCH 1986, No. 19, 219-228.
(11) Doroslovački, R.; Tošić, R. “A Characterization of Hexagonal Systems”. Rev. Res. Fac. Sci.-Univ. Novi Sad, Math. Ser. 1984, 14(2), 201-209.
(12) Knop, J. V.; Szymanski, K.; Trinajstić, N. “Computer Enumeration of Substituted Polyhexes”. Comput. Chem. 1984, 8(2), 107-115.
(13) Stojmenović, I.; Tošić, R.; Doroslovački, R. “Generating and Counting Hexagonal Systems”. Proc. Yugosl. Semin. Graph Theory, 6th, Dubrovnik 1985; pp 189-198.
(14) Doroslovački, R.; Stojmenović, I.; Tošić, R. “Generating and Counting Triangular Systems”. BIT 1987, 27, 18-24.
(15) Knop, J. V.; Müller, W. R.; Szymanski, K.; Trinajstić, N. Computer Generation of Certain Classes of Molecules; Association of Chemists and Technologists of Croatia: Zagreb, 1985.

4,541 citations

Journal Article
TL;DR: It is the combination of relatively high prediction accuracy and its collection of desired features that makes Random Forest uniquely suited for modeling in cheminformatics.
Abstract: A new classification and regression tool, Random Forest, is introduced and investigated for predicting a compound's quantitative or categorical biological activity based on a quantitative description of the compound's molecular structure. Random Forest is an ensemble of unpruned classification or regression trees created by using bootstrap samples of the training data and random feature selection in tree induction. Prediction is made by aggregating (majority vote or averaging) the predictions of the ensemble. We built predictive models for six cheminformatics data sets. Our analysis demonstrates that Random Forest is a powerful tool capable of delivering performance that is among the most accurate methods to date. We also present three additional features of Random Forest: built-in performance assessment, a measure of relative importance of descriptors, and a measure of compound similarity that is weighted by the relative importance of descriptors. It is the combination of relatively high prediction accu...
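The procedure described in this abstract (bootstrap samples, random feature selection at each split, unpruned trees, majority vote) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the Gini impurity criterion, the `n_try` parameter, and the toy data are all assumptions made for illustration, and only the classification case is shown.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (assumed split criterion)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(X, y, n_try, rng):
    """Grow an unpruned tree; labels must not be tuples (tuples mark split nodes)."""
    if len(set(y)) == 1:  # pure node: stop, no pruning is applied afterwards
        return y[0]
    best = None
    # Random feature selection: only n_try randomly chosen descriptors per node.
    for f in rng.sample(range(len(X[0])), n_try):
        for t in sorted(set(row[f] for row in X))[:-1]:
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # chosen features are constant here: fall back to majority
        return Counter(y).most_common(1)[0][0]
    _, f, t, left, right = best
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left], n_try, rng),
            build_tree([X[i] for i in right], [y[i] for i in right], n_try, rng))

def predict_tree(node, x):
    """Follow split nodes (f, t, low, high) until a leaf label is reached."""
    while isinstance(node, tuple):
        f, t, low, high = node
        node = low if x[f] <= t else high
    return node

def fit_forest(X, y, n_trees=25, n_try=1, seed=0):
    """Fit each tree on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], n_try, rng))
    return forest

def predict_forest(forest, x):
    """Aggregate the ensemble by majority vote."""
    return Counter(predict_tree(tree, x) for tree in forest).most_common(1)[0][0]

# Toy data (hypothetical): two descriptors, two activity classes.
X_toy = [[0, 1], [1, 0], [1, 2], [2, 1], [9, 10], [10, 9], [10, 11], [11, 10]]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]
toy_forest = fit_forest(X_toy, y_toy)
```

Calling `predict_forest(toy_forest, [0, 0])` returns the majority-vote class for a new compound. For regression, the vote would be replaced by an average of the tree predictions, as the abstract notes.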

2,634 citations

Journal Article
TL;DR: The focus is on regression problems, in which one of the measures, the dependent variable, is of special interest and the authors wish to explore its relationship with the other variables.
Abstract: Model fitting is an important part of all sciences that use quantitative measurements. Experimenters often explore the relationships between measures. Two subclasses of relationship problems are as follows:
• Correlation problems: those in which we have a collection of measures, all of interest in their own right, and wish to see how and how strongly they are related.
• Regression problems: those in which one of the measures, the dependent variable, is of special interest, and we wish to explore its relationship with the other variables. These other variables may be called the independent variables, the predictor variables, or the covariates. The dependent variable may be a continuous numeric measure such as a boiling point or a categorical measure such as a classification into mutagenic and nonmutagenic.
We should emphasize that using the words ‘correlation problem’ and ‘regression problem’ is not meant to tie these problems to any particular statistical methodology. Having a ‘correlation problem’ does not limit us to conventional Pearson correlation coefficients. Log-linear models, for example, measure the relationship between categorical variables in multiway contingency tables. Similarly, multiple linear regression is a methodology useful for regression problems, but so also are nonlinear regression, neural nets, recursive partitioning and k-nearest neighbors, logistic regression, support vector machines and discriminant analysis, to mention a few. All of these methods aim to quantify the relationship between the predictors and the dependent variable. We will use the term ‘regression problem’ in this conceptual form and, when we want to specialize to multiple linear regression using ordinary least squares, will describe it as ‘OLS regression’. Our focus is on regression problems. We will use y as shorthand for the dependent variable and x for the collection of predictors available.
There are two distinct primary settings in which we might want to do a regression study:
• Prediction problems: We may want to make predictions of y for future cases where we know x but do not know y. This for example is the problem faced with the Toxic Substances Control Act (TSCA) list. This list contains many tens of thousands of compounds, and there is a need to identify those on the list that are potentially harmful. Only a small fraction of the list however has any measured biological properties, but all of them can be characterized by chemical descriptors with relative ease. Using quantitative structure-activity relationships (QSARs) fitted to this small fraction to predict the toxicities of the much larger collection is a potentially cost-effective way to try to sort the TSCA compounds by their potential for harm. Later, we will use a data set for predicting the boiling point of a set of compounds on the TSCA list from some molecular descriptors.
• Effect quantification: We may want to gain an understanding of how the predictors enter into the relationship that predicts y. We do not necessarily have candidate future unknowns that we want to predict, we simply want to know how each predictor drives the distribution of y. This is the setting seen in drug discovery, where the biological activity y of each in a collection of compounds is measured, along with molecular descriptors x. Finding out which descriptors x are associated with high and which with low biological activity leads to a recipe for new compounds which are high in the features associated positively with activity and low in those associated with inactivity or with adverse side effects.
These two objectives are not always best served by the same approaches. ‘Feature selection’ (keeping those features associated with y and ignoring those not associated with y) is very commonly a part of an analysis meant for effect quantification but is not necessarily helpful if the objective is prediction of future unknowns. For prediction, methods such as partial least squares (PLS) and ridge regression (RR) that retain all features but rein in their contributions are often found to be more effective than those relying on feature selection.
What Is Overfitting?
Occam’s Razor, or the principle of parsimony, calls for using models and procedures that contain all that is necessary for the modeling but nothing more. For example, if a regression model with 2 predictors is enough to explain y, then no more than these two predictors should be used. Going further, if the relationship can be captured by a linear function in these two predictors (which is described by 3 numbers: the intercept and two slopes), then using a quadratic violates parsimony. Overfitting is the use of models or procedures that violate parsimony, that is, that include more terms than are necessary or use more complicated approaches than are necessary. It is helpful to distinguish two types of overfitting:
• Using a model that is more flexible than it needs to be. For example, a neural net is able to accommodate some curvilinear relationships and so is more flexible than a simple linear regression. But if it is used on a data set that conforms to the linear model, it will add a level of complexity without...
* Corresponding author e-mail: doug@stat.umn.edu. J. Chem. Inf. Comput. Sci. 2004, 44, 1-12.
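The parameter count in the parsimony example can be written out explicitly. The linear form follows the abstract (an intercept and two slopes); the quadratic form below assumes a full second-order model with a cross term, and the coefficient symbols are chosen here for illustration.

```latex
% Linear model in two predictors: 3 parameters (intercept and two slopes)
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2

% Full quadratic model in the same predictors (assumed form): 6 parameters
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2
    + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2
```

If the data conform to the first model, the three extra coefficients of the second are unnecessary terms in the sense of the overfitting definition given in the abstract.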

1,931 citations

Journal Article
TL;DR: The concept of similarity searching is introduced, differentiating it from the more common substructure searching, and the current generation of fragment-based measures that are used for searching chemical structure databases are discussed.
Abstract: This paper reviews the use of similarity searching in chemical databases. It begins by introducing the concept of similarity searching, differentiating it from the more common substructure searching, and then discusses the current generation of fragment-based measures that are used for searching chemical structure databases. The next sections focus upon two of the principal characteristics of a similarity measure: the coefficient that is used to quantify the degree of structural resemblance between pairs of molecules and the structural representations that are used to characterize molecules that are being compared in a similarity calculation. New types of similarity measure are then compared with current approaches, and examples are given of several applications that are related to similarity searching.
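As a rough illustration of the two ingredients named in this abstract (a structural representation and a similarity coefficient), the sketch below represents each molecule as a set of fragment identifiers and ranks a database against a query. The Tanimoto coefficient is used here as one widely used choice of coefficient; the abstract does not name a specific one, so that choice, the function names, and the convention of returning 0.0 for two empty fragment sets are assumptions.

```python
def tanimoto(frags_a, frags_b):
    """Tanimoto coefficient: shared fragments divided by all distinct fragments."""
    a, b = set(frags_a), set(frags_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention when neither molecule has fragments
    return len(a & b) / len(union)

def similarity_search(query, database, top_k=3):
    """Rank database molecules by decreasing similarity to the query.

    `database` maps a molecule name to its collection of fragment identifiers.
    Ties are broken alphabetically by name so the ranking is deterministic.
    """
    scored = [(tanimoto(query, frags), name) for name, frags in database.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:top_k]]
```

Unlike a substructure search, which returns only molecules containing the query pattern, this kind of search returns a ranked list, so partially matching molecules still appear with lower scores.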

1,662 citations

Journal Article
TL;DR: The CSD acts as a computerized depository for large-volume numerical results for some 30 journals; the information recorded for each entry may be categorized according to its "dimensionality".
Abstract: ...ed, together with any associated supplementary (deposited) data. The CSD itself acts as a computerized depository for large-volume numerical results for some 30 journals. A total of 584 primary sources are now referenced in the CSD, of which 74 are regularly scanned in-house to provide ca. 80% of current input. Remaining references are located via a scan of secondary sources, particularly Chemical Abstracts. Each entry in the CSD relates to a specific crystal structure determination of a specific chemical compound. Each entry is identified by a CSD reference code (REFCODE). This consists of eight characters: the first six are alphabetic and identify the chemical compound (initially assigned as a mnemonic of the compound name, now generated automatically for new compounds); the last two characters are digits which trace the publication history and define (a) whether the paper is a republication by the same authors (perhaps reporting an improved coordinate set) or (b) whether the paper is a redetermination by a different set of authors. The information recorded for each entry may conveniently be categorized according to its "dimensionality", as described below and illustrated in Figure 1. 1D information consists of bibliographic and chemical text strings, together with certain individual numeric items: com...
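The REFCODE layout described in this abstract (six alphabetic characters followed by two digits) can be checked and split with a short sketch. The function names and the example codes in the usage note are hypothetical, and the pattern encodes only what the abstract states, not the full rules of any CSD software.

```python
import re

# Six letters identifying the compound, then two digits tracing publication history.
REFCODE_PATTERN = re.compile(r"^([A-Z]{6})([0-9]{2})$")

def parse_refcode(refcode):
    """Split an eight-character REFCODE into (compound_code, history_number)."""
    match = REFCODE_PATTERN.match(refcode.strip().upper())
    if match is None:
        raise ValueError(f"not an eight-character REFCODE: {refcode!r}")
    return match.group(1), int(match.group(2))

def group_by_compound(refcodes):
    """Collect the history numbers recorded for each compound code."""
    groups = {}
    for code in refcodes:
        compound, history = parse_refcode(code)
        groups.setdefault(compound, []).append(history)
    return {compound: sorted(nums) for compound, nums in groups.items()}
```

For example, `group_by_compound(["ABCDEF01", "ABCDEF02"])` groups two determinations of the same hypothetical compound under `"ABCDEF"`.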

1,205 citations

Network Information
Related Journals (5)
Journal of Chemical Information and Modeling: 6.1K papers, 231.1K citations, 87% related
Journal of Computational Chemistry: 8.7K papers, 719.4K citations, 79% related
International Journal of Quantum Chemistry: 15.1K papers, 244.2K citations, 78% related
Theoretical Chemistry Accounts: 7.3K papers, 231.5K citations, 78% related
Journal of Medicinal Chemistry: 33.3K papers, 1.6M citations, 76% related
Performance
Metrics
No. of papers from the Journal in previous years
Year    Papers
2004    251
2003    255
2002    178
2001    200
2000    178
1999    157