
Showing papers in "Journal of the Association for Information Science and Technology in 1971"


Journal ArticleDOI
TL;DR: The automatic abstracting system developed consists basically of a dictionary, called the Word Control List, and of a set of rules for implementing certain functions specified for each WCL entry, which include contextual inference, intersentence reference, frequency criteria, and coherency considerations.
Abstract: Together with the increasing shortage of qualified abstractors, the factors of time, cost and value have lent impetus to a trend toward the automatic generation of abstracts and indexes. This trend has caused increased emphasis to be placed on the abstract as the locus of data for automatic retrieval systems. This necessitates the creation of high quality abstracts. It is the purpose of this paper to report on the development of techniques for the automatic production of high quality abstracts from the full text of the original document. It is necessary to analyze the conditions under which various methods of sentence selection are successful, in order to develop criteria for selecting sentences to form an abstract. But clearly, an abstract can also be produced by rejecting sentences of the original which are irrelevant to the abstract. As will be seen, it is this point which is perhaps the most significant contribution of this paper. Methods of sentence selection and rejection are discussed. These include contextual inference, intersentence reference, frequency criteria, and coherency considerations. The automatic abstracting system we have developed consists basically of a dictionary, called the Word Control List, and of a set of rules for implementing certain functions specified for each WCL entry. The abstracts we have obtained so far are of sufficiently good quality to indicate that large-scale testing of the methods of the automatic abstracting system is warranted.

97 citations



Journal ArticleDOI
TL;DR: Using sociometric techniques, an informal communication network was identified which included 73% of the scientists; within it, a core group of scientists was the focus of a disproportionately large number of contacts and was differentiated from others by greater productivity, higher citation record and wider readership.
Abstract: At the frontiers of an active area of science, social structure based upon communication is demonstrated. Using sociometric techniques, an informal communication network was identified which included 73% of the scientists. Within the network was a core group of scientists who were the focus of a disproportionately large number of contacts and who were differentiated from others by greater productivity, higher citation record and wider readership. Information transferred to these scientists is so situated that it could be transmitted to 95% of the network scientists through one intermediary scientist or less.

70 citations
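
The reachability claim in the abstract above (transmission to most of the network "through one intermediary scientist or less") can be illustrated with a minimal sketch. The contact data, the names, and the function `reach_with_one_intermediary` are hypothetical and are not taken from the paper; the sketch only assumes that contacts are symmetric and stored as a mapping from each scientist to the set of people they communicate with.

```python
def reach_with_one_intermediary(contacts, core):
    """Return everyone reachable from `core` directly or via one intermediary."""
    # Core members plus their direct contacts (zero intermediaries).
    direct = set(core)
    for scientist in core:
        direct |= contacts.get(scientist, set())
    # Contacts of those direct contacts (exactly one intermediary).
    reached = set(direct)
    for scientist in direct:
        reached |= contacts.get(scientist, set())
    return reached


# Hypothetical six-person network forming a chain with one branch.
contacts = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A"},
    "D": {"B", "E"},
    "E": {"D", "F"},
    "F": {"E"},
}
reached = reach_with_one_intermediary(contacts, core={"A"})
share = len(reached) / len(contacts)  # fraction of the network reached
```

In this toy network, information given to "A" reaches four of the six members under the one-intermediary limit; the 95% figure in the abstract is the analogous share for the studied network.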


Journal ArticleDOI
TL;DR: These tests demonstrate that further improvements in performance over that for unclassified keywords can be obtained, and that definite conclusions can be drawn about the correct approach to classification for collections like the test one.
Abstract: Though the idea of constructing a keyword classification for retrieval purposes automatically is not a new one, comparatively few systematic experiments have been carried out in this area; and while many suggestions have been put forward, not enough is known about the behaviour of automatic keyword classifications, and hence about the properties such classifications should have and the ways they should be used. In previous experiments we showed that some forms of classification could give good results, and this paper describes a further series of tests designed to examine this sort of classification in more detail, with a view to establishing the optimum forms of classification and procedures for using them in different retrieval situations. These tests demonstrate that further improvements in performance over that for unclassified keywords can be obtained, and that definite conclusions can be drawn about the correct approach to classification for collections like the test one: the best results are given when grouping is confined to strongly connected, non-frequent keywords, when the classification is used to provide additional rather than alternative indexing terms, particularly for requests, and when matching is controlled by keyword collection frequency.

49 citations


Journal ArticleDOI
TL;DR: A mathematical model is presented which explains the observed exponential growth rates of citations and references in a scientific discipline and how the parameters of the model can be estimated.
Abstract: A mathematical model is presented which explains the observed exponential growth rates of citations and references in a scientific discipline. The independent variables are the growth rate of the number of articles published and the decay rate of citation of old literature. It is shown how the parameters of the model can be estimated.

35 citations


Journal ArticleDOI
TL;DR: This article focuses on the human interaction characteristics of an information retrieval system, suggests some design considerations to improve man-machine cooperation, and describes a research system at Stanford that is exploring some of these techniques.
Abstract: This article focuses on the human interaction characteristics of an information retrieval system, suggests some design considerations to improve man-machine cooperation, and describes a research system at Stanford that is exploring some of these techniques. Librarians can only be of limited assistance in helping the naive user formulate an unstructured feeling in his mind into an appropriate search query that maps into the retrieval system. Consequently, the process of query formulation by the user, interactively with the information available in the system, remains one of the principal problems in information retrieval today. In an attempt to solve this problem by improving the interface communication between man and the computer, we have pursued the objective of displaying hierarchically structured index trees on a CRT in a decision tree format permitting the user merely to point (with a light pen) at alternatives which seem most appropriate to him. Using his passive rather than his active vocabulary expands his interaction vocabulary by at least an order of magnitude. Moreover, a hierarchically displayed index is a modified thesaurus, and may be augmented by adding lateral links to provide semantic assistance to the user. A hierarchical structure was chosen because it seems to replicate the structure of cognitive thought processes most closely, thus allowing the simplest, most direct transfer of the man's problem into the structure and vocabulary of the system.

35 citations


Journal ArticleDOI
TL;DR: Various aspects of system operation that are susceptible to cost-effectiveness analysis are discussed, including system coverage, indexing policies and procedures, system vocabulary, searching procedures, and mode of interaction between system and user.
Abstract: A distinction is made between cost-effectiveness analysis and cost-benefits analysis as applied to information systems; and the relationship between costs, performance, and benefits is discussed. Some factors influencing the cost-effectiveness of retrieval and dissemination systems are identified. Various aspects of system operation that are susceptible to cost-effectiveness analysis are discussed, including system coverage, indexing policies and procedures, system vocabulary, searching procedures, and mode of interaction between system and user. Possible tradeoffs between input and output costs, and the effects of these tradeoffs on cost-effectiveness are presented.

32 citations


Journal ArticleDOI
TL;DR: Mathematical evaluation measures to characterize the effect of known erroneous performance by stemming routines are presented, and an expanded probabilistic model is introduced to handle a more general case in which any element need not belong unambiguously to a single cluster.
Abstract: This paper presents mathematical evaluation measures to characterize the effect of known erroneous performance by stemming routines, and generalizes these procedures to other types of nonstatistical clustering algorithms. When clusters, or groups of intrinsically related elements, are split into smaller groups (by under-matching the elements), there is a loss in recall in information retrieval; larger groups (caused by over-matching) induce a loss in precision or relevance. The magnitude of error is taken to be a function of frequencies of cluster elements. When these are words in a subject-term index generated by a stemming algorithm, retrieval capability is also affected by the strength of the algorithm, the size and content of the stemmed index, and the number of words in a query. The present Project Intrex stemming algorithm has estimated stemming-error losses of 4% in recall and 1% in relevance on one-word queries; the former could be reduced to almost zero by straightforward corrections of known errors in the algorithm. An expanded probabilistic model is introduced to handle a more general case in which any element need not belong unambiguously to a single cluster. Error evaluation in document classification and thesauri also is discussed in broad terms.

27 citations


Journal ArticleDOI
TL;DR: A simple general method for determining labor costs, random time sampling with self-observation, is described, and under comparable conditions, the cost of providing a photocopy did not exceed the cost of lending an original document.
Abstract: A simple general method for determining labor costs, random time sampling with self-observation, is described. Unit costs of providing interlibrary loans and photocopies were determined by this method. The working time of all appropriate library personnel was sampled using Random Alarm Mechanisms and a structured checklist of mutually exclusive tasks. The workers' actual wage rates were applied to the resulting percentages. The total lender's unit cost per request received, including direct labor, materials, fringe benefits, and overhead, was $1.526 for originals mailed postpaid by lender and $1.534 for photocopies mailed. Corresponding unit costs per request filled were: originals $1.932 and photocopies $1.763. Labor costs included the costs of verifying, paging, copying, packaging & mailing, record keeping, and reshelving, based on wage rates in effect February 1969. This practical, objective method of work sampling causes minimal interference with service operations, and does not distort the data being collected. Acceptable reliability can be achieved at low cost. Under comparable conditions, the cost of providing a photocopy did not exceed the cost of lending an original document.

25 citations
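
The costing step described in the abstract above (wage rates applied to the percentages obtained from random time samples) can be sketched as follows. This is an illustrative reading only, not the authors' procedure: the function `labor_cost_by_task`, the worker names, wages, hours, and task labels are all hypothetical, and the sketch assumes each worker's payroll is split across tasks in proportion to that worker's sampled observations.

```python
from collections import Counter, defaultdict


def labor_cost_by_task(observations, hourly_wage, hours_worked):
    """Allocate each worker's payroll to tasks by share of sampled observations."""
    per_worker = defaultdict(Counter)
    for worker, task in observations:
        per_worker[worker][task] += 1

    costs = Counter()
    for worker, tasks in per_worker.items():
        sample_size = sum(tasks.values())
        payroll = hourly_wage[worker] * hours_worked[worker]
        for task, count in tasks.items():
            # Share of this worker's samples spent on the task, times payroll.
            costs[task] += payroll * count / sample_size
    return dict(costs)


# Hypothetical random-alarm observations: (worker, task checked on the list).
observations = [
    ("ann", "verifying"), ("ann", "copying"), ("ann", "copying"), ("ann", "other"),
    ("bob", "packaging"), ("bob", "copying"),
]
costs = labor_cost_by_task(
    observations,
    hourly_wage={"ann": 4.0, "bob": 3.0},
    hours_worked={"ann": 40, "bob": 20},
)
```

Dividing a task's cost by the number of requests received or filled would give a unit labor cost; materials, fringe benefits, and overhead, which the abstract includes in its totals, are left out of this sketch.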


Journal ArticleDOI
TL;DR: The human factor appears to be the main variable in all components of an IR system; length of indexes affects performance considerably more than indexing languages; question analyses and search strategies affect performance to a great extent, as much as, if not more than, indexing.
Abstract: A variety of aspects related to testing of retrieval systems were examined. A model of a retrieval system, together with a set of measures and a methodology for performance testing were developed. In the main experiment the effect on performance of the following variables was tested: sources of indexing, indexing languages, coding schemes, question analyses, search strategies and formats of output. In addition, a series of separate experiments was carried out to investigate the problems of controls in experimentation with IR systems. The main conclusions: the human factor appears to be the main variable in all components of an IR system; length of indexes affects performance considerably more than indexing languages; question analyses and search strategies affect performance to a great extent—as much, if not more than indexing. Retrieval systems seem to be able to perform at present only on a general level, failing to be at the same time comprehensive and specific. It seems that testing of total IR systems controlling and monitoring all factors (environmental and systems-related) is not possible at present.

23 citations



Journal ArticleDOI
TL;DR: The emergence and development of information science within its wider disciplinary framework is interpreted and possible relationships and roles of information science within a potentially emergent suprasystem of knowledge are discussed.
Abstract: The emergence and development of information science within its wider disciplinary framework is interpreted. Information science is approached as one of a modern generation of communication or behavioral disciplines which emerged almost simultaneously around World War II. Consequently, an attempt is made to discern the evolution of relationships between information science and other modern generation disciplines. The internal development of information science is first sketched. Second, possible relationships and roles of information science within a potentially emergent suprasystem of knowledge are discussed.

Journal ArticleDOI
TL;DR: The major national problem is to avoid or limit wasteful and expensive duplication in providing nationwide search access to the hundreds of public and private data bases that will be readily available during the next few years.
Abstract: Interactive systems, in existence for nearly 15 years, are becoming increasingly important, both for information retrieval and library support operations. The virtues of these systems are speed, intimacy, and—if time-sharing is involved—economy. The major problems are the cost of the large computers and files necessary for bibliographic data, the still-high cost of communications, and the generally poor design of the user-system interfaces. The desirable features of online retrieval interfaces are only now being defined and tested in a systematic way, e.g., by the National Library of Medicine in its AIM-TWX nationwide experimental retrieval service. System implementers must, in addition to engineering the right capabilities into online systems, also make a careful, concerted effort to engineer user acceptance. Common pitfalls here include overselling system capabilities and failure to take into account the social context around the user terminal. The major national problem is to avoid or limit wasteful and expensive duplication in providing nationwide search access to the hundreds of public and private data bases that will be readily available during the next few years. We do not need technological breakthroughs to exploit the potential of online systems, but we do need breakthroughs in organizing for technological change.

Journal ArticleDOI
TL;DR: The Curriculum Committee of the Special Interest Group/Education Information Science of ASIS is charged with the responsibility for determining the scope and characteristics of information science programs in the U.S. and Canada in terms of curriculum developments and course offerings.
Abstract: The Curriculum Committee of the Special Interest Group/Education Information Science of ASIS is charged with the responsibility for determining the scope and characteristics of information science programs in the U.S. & Canada in terms of curriculum developments and course offerings. To fulfil this responsibility, questionnaires were developed to elicit reliable information concerning courses being offered relating to information storage and retrieval, information science and/or documentation. The data requested included course levels, pre- and post-requisite courses, textbooks used, topics covered, frequency with which offered, etc. Responses were received from 45 schools, providing information about 185 courses and 242 topics. Using several methods of clustering the data, it was difficult to arrive at firm results, because of the diversity and scatter of the topics included in this field. It was therefore decided to hold a workshop of experts which would examine the validity of the questionnaire results. This workshop, using the Delphi technique to arrive at consensus, was held at the University of Pittsburgh on September 21–23, 1970. Sixteen specialists in the field representing universities, industry and government were brought together to participate. Consensus was reached in identifying nine factors which contribute to the curriculum in information science and seven courses which constitute the core for the Master's program. The topics to be included in each of these courses were also isolated. The 9 factors are: Psychology/Behavioral Science, Language/Linguistics, Management, Statistics, Library Science, Systems, Mathematics, Information and Communication Theory, and Computer Science/Automata. 
The 7 courses are: Introduction to Information Science, Systems Theory and Applications, Mathematical Methods in Information Science, Computer Organization and Programming Systems, Abstracting/Indexing/Cataloging, Information and Communication Theory, and Research Methods. The topics relating to these courses are given in Appendix III. Not all the objectives have been attained. The “meat” surrounding the core has not yet been supplied; the core for a Doctoral program must also be determined. The committee feels that some conventions for evaluating the levels of professionalism reached at the completion of such programs could result as a byproduct of this study.

Journal ArticleDOI
TL;DR: Operations Research models of the acquisition and storage functions of a library are developed and a generalized model of library costs and benefits is proposed.
Abstract: Operations Research models of the acquisition and storage functions of a library are developed. Rules for selection of materials for a depository are analyzed and models of circulation interference and usage are explored. A generalized model of library costs and benefits is proposed.

Journal ArticleDOI
TL;DR: A title searching technique is described which allows the number of references retrieved to be fixed before a search commences, and the relative retrieval efficiency of Titles and Index terms is so close that the choice of one method or the other must be primarily on economic grounds.
Abstract: Previous research has indicated that the titles rather than index terms would, in the standard MEDLARS system, give lower Recall but higher Precision. A title searching technique is described which allows the number of references retrieved to be fixed before a search commences. With this technique the greater applicability of title-terms offsets their relative paucity. The title-searching technique is tested using queries put to MEDLARS. These queries were not specially solicited for the test. Title searching is compared with the standard MEDLARS index term search and with an index term search with fixed output size. For equal output sizes, Title searching retrieves 4 relevant references for every 5 retrieved by index term searching. Thus the relative retrieval efficiency of Titles and Index terms is so close that the choice of one method or the other must be primarily on economic grounds.
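
A fixed-output-size title search of the kind described above can be sketched as a ranked cut-off. The abstract does not state how references are ranked, so the scoring rule here (number of distinct query terms appearing in the title) is an assumption made for illustration, and the function `title_search` and the sample titles are hypothetical.

```python
def title_search(query_terms, titles, output_size):
    """Return at most `output_size` titles, ranked by query-term overlap."""
    query = {term.lower() for term in query_terms}
    scored = []
    for position, title in enumerate(titles):
        words = set(title.lower().split())
        overlap = len(query & words)
        if overlap:
            # Negative overlap sorts best matches first; position breaks ties.
            scored.append((-overlap, position, title))
    scored.sort()
    return [title for _, _, title in scored[:output_size]]


# Hypothetical titles; the output size is fixed before the search runs.
titles = [
    "Renal failure in children",
    "Acute renal failure therapy",
    "Cardiac therapy in adults",
    "Therapy of acute renal failure in adults",
]
results = title_search(["acute", "renal", "failure"], titles, output_size=2)
```

Because the cut-off is applied after ranking, the number of references returned never exceeds the size chosen in advance, which is the property the abstract emphasizes.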

Journal ArticleDOI
TL;DR: The background and rationale of the “REFSEARCH” system, its current use in the School of Librarianship, University of California, and its potential for direct service to library patrons are discussed.
Abstract: A collection of 144 general reference works was analyzed and encoded according to 254 identifiable characteristics of services and contained data, comprising an “approach language” expressing search parameters. In response to a request submitted at an on-line terminal, the “REFSEARCH” system retrieves the names of those works whose profiles meet or exceed the specification. The background and rationale of the system are discussed, along with its current use in the School of Librarianship, University of California, and its potential for direct service to library patrons.
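
The matching rule stated in the abstract above (retrieve works whose profiles meet or exceed the specification) can be read as a subset test, sketched below. The function `refsearch`, the work names, and the characteristic labels are hypothetical stand-ins; the actual 254 characteristics and the encoding used by the system are not given in the abstract.

```python
def refsearch(request, profiles):
    """Return names of works whose profile contains every requested characteristic."""
    wanted = set(request)
    return sorted(
        name for name, traits in profiles.items() if wanted <= set(traits)
    )


# Hypothetical profiles: each work is encoded as a set of characteristics.
profiles = {
    "Biographical Dictionary X": {"persons", "birth_dates", "portraits"},
    "Almanac Y": {"persons", "statistics", "birth_dates"},
    "Gazetteer Z": {"places", "maps"},
}
matches = refsearch({"persons", "birth_dates"}, profiles)
```

A work with more characteristics than requested still matches, which corresponds to a profile that "exceeds" the specification.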

Journal ArticleDOI
TL;DR: The authors advocate a systematic procedure involving six steps and logical analysis of the picture thus presented to determine the optimum sequence in which decisions should be made during the design process and the nature of the decision process itself.
Abstract: Systems design consists of a tremendously complex series of choices in which no decision point is completely independent of other decisions which have already been made or have yet to be made. A systems approach to the design of document-handling information systems would require a detailed examination of the choices to be made in the design process and the ramifications of possible choices in terms of the capabilities, performance, cost, and other characteristics of the system. The authors advocate a systematic procedure involving six steps: 1) identification of fixed parameters, 2) identification of variable parameters, 3) identification of available options for each variable parameter, 4) identification of factors affecting a choice among available options, 5) identification of factors affected by a choice among available options, and 6) logical analysis of the picture thus presented to determine the optimum sequence in which decisions should be made during the design process and the nature of the decision process itself.

Journal ArticleDOI
Caryl McAllister, John M. Bell
TL;DR: This paper discusses ELMS features that facilitate user interaction, and may prove useful in similar systems: techniques for tutoring the user (display format, one‐question, one-answer displays, and KWIC indexing); adaptability for the experienced user (command chains and a standard set of four‐letter mnemonic codes for higher‐level control).
Abstract: ELMS (Experimental Library Management System) is an experimental system for total library management, operating on-line with an IBM 360 through IBM 2260 and 2741 terminals. The system is designed to handle large amounts of highly variable information which it processes on command, giving on-line computer service for all library operations. At the same time, it must accommodate the different needs and skills of a broad range of library users, from new patrons to well-trained librarians. Such a system presents programming problems that will be typical of large, interactive computer systems in the seventies. This paper discusses ELMS features that facilitate user interaction, and may prove useful in similar systems: techniques for tutoring the user (display format, one-question, one-answer displays, and KWIC indexing); adaptability for the experienced user (command chains and a standard set of four-letter mnemonic codes for higher-level control); minimization of keying (line numbers, one-character mnemonic codes used with procedures, and use of default conditions); performance of clerical tasks by exception notification; and collection of operational statistics to help improve the system.

Journal ArticleDOI
TL;DR: Based on a review of variations in national usage, the practices followed by 19 English-language abstracting and indexing services, and typical problems encountered by an indexer in the entry of foreign personal names, it is concluded that entry of all prefixes regardless of nationality may be the wisest procedure for the average author index.
Abstract: Based on a review of variations in national usage, the practices followed by 19 English-language abstracting and indexing services, and typical problems encountered by an indexer in the entry of foreign personal names, it is concluded that the entry of all prefixes regardless of nationality may be the wisest procedure for the average author index. A strong plea is made for increased standardization in the transliteration of Slavic, Greek, and Oriental names and for “correction” of the transliterated names of authors publishing in languages other than their own. The determination of the correct entry element for the compound surnames found in many nationalities is felt to be an almost insoluble problem unless the authors and publishers cooperate in indicating the desired indexing format. A general consideration of these problems and of indexing theory, as well as of the principles of transliteration and transcription, is followed by guidelines designed to help the practicing indexer improve the consistency of personal name entries; these guidelines are arranged by language family on the basis of whether the author's native language and the language of the article are written in Roman script.



Journal ArticleDOI
TL;DR: A statistical measure is developed for predicting the terms from a restricted vocabulary that will be used to index a document, given that one of the index terms is known, and the results indicate that a large proportion of terms can be predicted using co-occurrence data.
Abstract: A statistical measure is developed for predicting the terms from a restricted vocabulary that will be used to index a document, given that one of the index terms is known. The results indicate that a large proportion of terms can be predicted using co-occurrence data and that the best method of ordering the terms ranks them first in descending order of co-occurrence frequency and then breaks ties in descending order of total frequency. The central assumption is that data from a previously indexed collection can be useful in predicting the terms to be assigned to a new document. The procedure for presenting an indexer with computer-produced ordered lists of suggested index terms in response to an initial term choice could be implemented in an interactive man-machine environment.
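
The ordering rule reported in the abstract above (descending co-occurrence frequency, with ties broken by descending total frequency) can be sketched as follows. Only the ordering is illustrated; the paper's statistical measure itself is not given in the abstract. The functions `build_counts` and `suggest_terms` and the sample index terms are hypothetical, and the final alphabetical tie-break is an added assumption to keep the output deterministic.

```python
from collections import Counter
from itertools import permutations


def build_counts(indexed_docs):
    """Count term totals and ordered-pair co-occurrences over indexed documents."""
    total = Counter()
    cooc = Counter()
    for terms in indexed_docs:
        unique = set(terms)
        total.update(unique)
        for first, second in permutations(unique, 2):
            cooc[(first, second)] += 1
    return total, cooc


def suggest_terms(known_term, total, cooc):
    """Rank candidate terms for a document already indexed with `known_term`."""
    candidates = [(b, n) for (a, b), n in cooc.items() if a == known_term]
    # Descending co-occurrence, then descending total frequency, then name.
    candidates.sort(key=lambda item: (-item[1], -total[item[0]], item[0]))
    return [term for term, _ in candidates]


# Hypothetical previously indexed collection.
docs = [
    ["rubber", "polymer", "elasticity"],
    ["rubber", "polymer"],
    ["rubber", "latex"],
    ["polymer", "latex"],
    ["latex", "elasticity"],
]
total, cooc = build_counts(docs)
suggestions = suggest_terms("rubber", total, cooc)
```

Here "latex" and "elasticity" each co-occur once with "rubber", so the tie is broken by total frequency, which places "latex" ahead of "elasticity".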

Journal ArticleDOI
TL;DR: Given a transformation algorithm and data to be transformed, it is possible to characterize certain qualities of the algorithm that relate to retrieval problems.
Abstract: Frequently it is useful to abbreviate or otherwise transform keys used for the retrieval of information. These transformations include the compression of long keys into a fixed field length by operations on characters or groups of characters, hash or random transformations in order to obtain a direct address, or phonetic coding in order to group together keys that are in some way similar. The various transformations have differing effects on file retrieval schemes. Given a transformation algorithm and data to be transformed, it is possible to characterize certain qualities of the algorithm that relate to retrieval problems. This paper is concerned with some measures of effectiveness of such transformation algorithms.

Journal ArticleDOI
TL;DR: These syntactic influences, together with some of the philosophy from earlier studies, have been combined to produce a set of rules which have greatly eased decision making and have enabled the thesaural vocabulary to be made more consistent.
Abstract: Compound words cause some difficulty in post-coordinate indexing systems: if too many are fractured, or the wrong categories are selected for fracturing, noise will be produced at unacceptable levels on retrieval. Various prior suggestions for handling compound terms are examined, including those for precoordinated or rotated indexes. The syntactic origins are also explored and it is found that many compound words hinge on a prepositional relationship between the components, and that this relationship can be applied to decision making. Other compound words are in effect abbreviated statements from longer phrases, while some are influenced by the presence of a verb-like form. These syntactic influences, together with some of the philosophy from earlier studies, especially that of the ‘force’ required to fracture a term, have been combined to produce a set of rules which have been employed at the Natural Rubber Producers' Research Association (NRPRA) for over two years. These have greatly eased decision making and have enabled the thesaural vocabulary to be made more consistent. It is also suggested that the rules have some bearing on the application of roles, especially if these are employed on a pre-coordinate basis.

Journal ArticleDOI
TL;DR: The results of an experiment simulating a library catalog search indicated that a relation exists between the lexical content load of the text, used to define the search topic, and the number of query terms generated by a group of experimental subjects (college students).
Abstract: The set of different terms used to search a directory for information on a given topic varies considerably from one searcher to another. If the topic is represented by a written text, the characteristics of the text are among the variables which influence the formulation of search terms. The results of an experiment simulating a library catalog search indicated that a relation exists between the lexical content load of the text, used to define the search topic, and the number of query terms generated by a group of experimental subjects (college students). No connection was found between the content load of the text and the commonality of query term choices. Some of the properties of “popular” as opposed to “unique,” query terms were examined. Only short terms enjoyed a high level of popularity. The repeated appearance of a term in the text favored the selection of that term by a large number of searchers.

Journal ArticleDOI
TL;DR: An experiment using a computer to assign content designators to unedited machine readable bibliographic data to create MARC records is described, and the results compared favorably with the current MARC input system in use at the Library of Congress.
Abstract: An experiment using a computer to assign content designators to unedited machine readable bibliographic data to create MARC records is described. Input typing conventions are briefly discussed. A computer program (Assembler Language for DOS) is being developed which analyzes unedited data according to predefined algorithms and builds MARC records. A manual simulation using 150 catalog records was done to test the computer algorithms. The results of the test in terms of accuracy and throughput compared favorably with the current MARC input system in use at the Library of Congress.

Journal ArticleDOI
TL;DR: It is concluded that holography is one of the most promising methods now under research for the achievement of high bulk storage with fast random access at a reasonable cost.
Abstract: An attempt was made to assess the potential use of holography in information storage and retrieval systems. Various scientific (physics), engineering, and business indexes were searched as well as those dealing with information science and libraries. A great many articles were found in the first three, all of which indicate the vast amount of interest, research, and development in the field, and lead to the conclusion that holography is one of the most promising methods now under research for the achievement of high bulk storage with fast random access at a reasonable cost. Because such a method implies a far greater than present ability to store, access, and transmit information on the part of information centers and libraries, and because little has been written in their literature, this paper has been prepared as a “state of the art” to provide background knowledge which is expected to be useful to many in the future.


Journal ArticleDOI
TL;DR: This paper presents a detailed analysis of the content and format of seven machine-readable bibliographic data bases: Chemical Abstracts Service Condensates, Chemical and Biological Activities, and Polymer Science and Technology; Biosciences Information Service's BA Previews, including Biological Abstracts and BioResearch Index; Institute for Scientific Information Source Tape; and Engineering Index COMPENDEX.
Abstract: This paper presents a detailed analysis of the content and format of seven machine-readable bibliographic data bases: Chemical Abstracts Service Condensates, Chemical and Biological Activities, and Polymer Science and Technology; Biosciences Information Service's BA Previews, including Biological Abstracts and BioResearch Index; Institute for Scientific Information Source Tape; and Engineering Index COMPENDEX. Selected issue test tapes of each data base were printed and checked for the types of data that were contained in the issue and the methods in which the data were formatted. This paper compares the physical formats of the tapes and describes the varied treatments given to such data elements as authors, titles, abstracts, etc. Comparison of data bases requires common use of terms. All terms are defined at the beginning of the paper. The authors found great discrepancies in the presentation of essentially similar bibliographic data, and they offer some suggestions for mitigating the discrepancies by use of standards.