Institution

Chemical Abstracts Service

About: Chemical Abstracts Service is known for research contributions in the topics: Ring (chemistry) & Information system. The organization has 137 authors who have published 146 publications receiving 2083 citations. The organization is also known as: CAS.


Papers
Journal Article
TL;DR: By analyzing the scaffold content of the CAS Registry, this work finds that the framework distribution follows a power law, which the authors take as evidence that the minimization of synthetic cost has been a key factor in shaping the known universe of organic chemistry.
Abstract: By analyzing the scaffold content of the CAS Registry, we attempt to characterize in a comprehensive way the structural diversity of organic chemistry. The scaffold of a molecule is taken to be its framework, defined as all its ring systems and all the linkers that connect them. Framework data from more than 24 million organic compounds is analyzed. The distribution of frameworks among compounds is found to be top-heavy, i.e., a small percentage of frameworks occur in a large percentage of compounds. When frameworks are analyzed at the graph level, an even more top-heavy distribution is found: half of the compounds can be described by only 143 framework shapes. The most significant finding is that the framework distribution conforms almost exactly to a power law. This suggests that the more often a framework has been used as the basis for a compound, the more likely it is to be used in another compound. This may be explained by the cost of synthesis: making a new derivative of a framework is probably less...
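The "top-heavy" claim in the abstract above (a small number of frameworks accounting for a large share of compounds) can be illustrated with a minimal sketch. The function name `frameworks_to_cover` and the toy counts are assumptions made for illustration; they are not CAS Registry data and this is not the authors' analysis code.

```python
def frameworks_to_cover(counts, fraction=0.5):
    """Return how many of the most common frameworks are needed to
    account for at least `fraction` of all compounds.

    `counts` maps a framework label to the number of compounds using it.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(counts.values(), reverse=True)  # most common first
    target = fraction * sum(ordered)
    covered = 0
    for rank, n in enumerate(ordered, start=1):
        covered += n
        if covered >= target:
            return rank
    return len(ordered)  # only reached when `counts` is empty


# Toy, invented counts (hypothetical; not taken from the CAS Registry).
toy_counts = {
    "benzene": 50,
    "pyridine": 20,
    "naphthalene": 15,
    "indole": 10,
    "rare_a": 3,
    "rare_b": 2,
}
half = frameworks_to_cover(toy_counts, 0.5)
most = frameworks_to_cover(toy_counts, 0.8)
```

In a top-heavy distribution, the returned rank stays small relative to the total number of frameworks even for large coverage fractions.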

263 citations

Journal Article
TL;DR: This paper describes the correction program developed as part of SPEEDCOP (Spelling Error Detection/Correction Project), a Chemical Abstracts Service project.
Abstract: The study of computerized correction of spelling errors has a relatively long history and remains of considerable current interest if regularly appearing papers on the topic are any gauge. Whereas early papers focused on the correction of output from optical character recognition (OCR), voice recognition, Morse code, or on spelling errors in program code, the application of most interest today is probably the correction of machine-readable text. Also, the techniques involved in spelling error correction have other important applications, for example, measuring the similarity of two strings of symbols to determine the evolutionary distance of proteins. The specialized subtopics are covered extensively in recent bibliographies by Peterson [7] and Pollock [8] and are not discussed further here. The most suitable correction strategy for text often depends on both its nature and its source. Correcting source code for a procedural language with a small vocabulary of short words (e.g., a typical programming language or the command language for a bibliographic search system) presents quite different problems than scientific text. Similarly, OCR output contains almost exclusively substitution errors, which ordinarily account for less than 20 percent of keyboarded misspellings [9, 10]. This paper describes the correction program developed as part of SPEEDCOP (Spelling Error Detection/Correction Project), a Chemical Abstracts Service (CAS) project supported by the National Science Foundation. The program is intended not as a theoretical construct but as a useful tool for text editing. Under SPEEDCOP, approximately 25,000,000 words from seven scientific and scholarly textual databases were processed to extract over 50,000 misspellings using a dictionary equivalent to 40,000 words. This contrasts sharply with most work on spelling correction that tends to feature small dictionaries and either few or artificial misspellings.
In our case, the use of real data gave credibility to the proposed solution, whereas the large dictionary revealed problems such as ambiguity that would not otherwise have been discovered. Internal reports [10-13] and papers [9, 19] describe in considerable detail how the misspellings were gathered and analyzed. The core of the SPEEDCOP program is an algorithm for correcting only isolated misspellings that contain a single error and whose correct forms are in a dictionary. This is not as drastic a restriction as it may seem as 90-95 percent of misspellings in raw keyboarding typically contain only one error [9, 10]. The SPEEDCOP program also incorporates a common misspelling dictionary and a function word routine and is therefore not rigidly restricted to the above class of misspellings. In practice, the program corrected 85-95 percent of the misspellings for which it was designed, 75-90 percent of those whose corresponding words were in the dic-

231 citations

Journal Article
TL;DR: A distance can be defined based on the Tanimoto coefficient, and a new, simple proof that this distance satisfies the triangle inequality is presented.
Abstract: A distance, or dissimilarity measure, can be defined based on the Tanimoto coefficient, a similarity measure widely applied to chemical structures. A new, simple proof that this distance satisfies the triangle inequality is presented.
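As a rough illustration of the quantities named in the abstract above, the sketch below computes the Tanimoto coefficient for binary fingerprints and checks the triangle inequality numerically on a few toy examples. It assumes fingerprints are represented as sets of "on" bit indices and that the distance is taken as one minus the coefficient; the abstract does not give the exact definition used in the paper, so both choices are assumptions, as are the function names. A numerical check on examples is not a proof.

```python
from itertools import permutations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention assumed here: two empty fingerprints are identical
    return len(a & b) / len(union)


def tanimoto_distance(a, b):
    """Dissimilarity assumed for this sketch: 1 - Tanimoto coefficient."""
    return 1.0 - tanimoto(a, b)


def triangle_holds(x, y, z, tol=1e-12):
    """Check d(x, z) <= d(x, y) + d(y, z), allowing for float rounding."""
    return tanimoto_distance(x, z) <= (
        tanimoto_distance(x, y) + tanimoto_distance(y, z) + tol
    )


# Toy fingerprints (hypothetical bit indices, not real chemical structures).
fingerprints = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5, 6}, {1, 6}, set()]
all_ok = all(triangle_holds(x, y, z) for x, y, z in permutations(fingerprints, 3))
```

Here `all_ok` summarizes whether every ordered triple of the toy fingerprints satisfies the inequality.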

213 citations

Journal Article
TL;DR: The speciation of contaminants after electrokinetic treatment showed that significant change in exchangeable and soluble fractions occurred, and low migration rates occurred as a result of contaminants existing as immobile complexes and precipitates.

173 citations

Journal Article
TL;DR: The trigram analysis technique developed determined the error site within a misspelling accurately, but did not distinguish effectively between different error types or between valid words and misspellings.
Abstract: Work performed under the SPElling Error Detection COrrection Project (SPEEDCOP) supported by the National Science Foundation (NSF) at Chemical Abstracts Service (CAS) to devise effective automatic methods of detecting and correcting misspellings in scholarly and scientific text is described. The investigation was applied to 50,000 word/misspelling pairs collected from six datasets: Chemical Industry Notes (CIN), Biological Abstracts (BA), Chemical Abstracts (CA), American Chemical Society primary journal keyboarding (ACS), Information Science Abstracts (ISA), and Distributed On-Line Editing (DOLE), a CAS internal dataset especially suited to spelling error studies. The purpose of this study was to determine the utility of trigram analysis in the automatic detection and/or correction of misspellings. Computer programs were developed to collect data on trigram distribution in each dataset and to explore the potential of trigram analysis for detecting spelling errors, verifying correctly-spelled words, locating the error site within a misspelling, and distinguishing between the basic kinds of spelling errors. The results of the trigram analysis were largely independent of the dataset to which it was applied but trigram compositions varied with the dataset. The trigram analysis technique developed determined the error site within a misspelling accurately, but did not distinguish effectively between different error types or between valid words and misspellings. However, methods for increasing its accuracy are suggested.
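The general idea of using trigram counts to locate an error site can be sketched as follows. This is only an illustration under stated assumptions, not the SPEEDCOP procedure, whose details the abstract does not give: the `#` boundary marker, the `min_count` threshold, the function names, and the tiny vocabulary are all hypothetical.

```python
from collections import Counter


def trigrams(word):
    """Character trigrams of a word, padded with '#' as a boundary marker."""
    padded = f"#{word.lower()}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_trigram_counts(words):
    """Count how often each trigram occurs across a list of valid words."""
    counts = Counter()
    for w in words:
        counts.update(trigrams(w))
    return counts


def suspect_trigrams(word, counts, min_count=1):
    """Return (offset, trigram) pairs whose count falls below `min_count`.

    The offset is the trigram's start index in the padded word, so a run
    of adjacent suspect offsets points at the likely error site.
    """
    return [
        (i, tri)
        for i, tri in enumerate(trigrams(word))
        if counts[tri] < min_count
    ]


# Toy vocabulary (hypothetical; far smaller than any real dictionary).
vocabulary = ["chemical", "chemistry", "abstract", "service"]
counts = build_trigram_counts(vocabulary)
suspects = suspect_trigrams("chemcial", counts)
```

With a vocabulary this small, many valid words would also contain unseen trigrams, which is consistent with the abstract's finding that trigram analysis alone did not separate valid words from misspellings effectively.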

133 citations


Network Information
Related Institutions (5)
Eindhoven University of Technology
52.9K papers, 1.5M citations

70% related

KAIST
77.6K papers, 1.8M citations

69% related

Vienna University of Technology
49.3K papers, 1.3M citations

69% related

Syracuse University
47.5K papers, 1.6M citations

68% related

University of Massachusetts Amherst
83.9K papers, 3.8M citations

68% related

Performance
Metrics
No. of papers from the Institution in previous years
Year  Papers
2021  1
2019  2
2018  3
2017  7
2016  4
2015  1