
Showing papers in "Information Processing and Management in 1977"


Journal ArticleDOI
TL;DR: Techniques for processing simple fuzzy queries expressed in the relational query language SEQUEL are introduced and the feasibility of implementing such techniques in a real environment is studied.
Abstract: This paper is concerned with techniques for fuzzy query processing in a database system. By a fuzzy query we mean a query which uses imprecise or fuzzy predicates (e.g. AGE = “VERY YOUNG”, SALARY = “MORE OR LESS HIGH”, YEAR-OF-EMPLOYMENT = “RECENT”, SALARY ⪢ 20,000, etc.). As a basis for fuzzy query processing, a fuzzy retrieval system based on the theory of fuzzy sets and linguistic variables is introduced. In our system model, the first step in processing fuzzy queries consists of assigning meaning to fuzzy terms (linguistic values), of a term-set, used for the formulation of a query. The meaning of a fuzzy term is defined as a fuzzy set in a universe of discourse which contains the numerical values of a domain of a relation in the system database. The fuzzy retrieval system developed is a high level model for the techniques which may be used in a database system. The feasibility of implementing such techniques in a real environment is studied. Specifically, within this context, techniques for processing simple fuzzy queries expressed in the relational query language SEQUEL are introduced.
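As a concrete illustration of the first processing step, the following Python sketch (a toy under invented assumptions, not the paper's SEQUEL implementation; the membership function and data are made up) gives meaning to a fuzzy term such as AGE = "VERY YOUNG" as a fuzzy set over the numeric AGE domain and grades tuples by membership:

# Minimal sketch: a linguistic value is a membership function over the domain,
# and the hedge VERY is modelled by squaring the membership grade (a common convention).
def young(age):                      # invented membership function for YOUNG
    if age <= 25:
        return 1.0
    if age >= 45:
        return 0.0
    return (45 - age) / 20.0

def very(mu):                        # linguistic hedge VERY
    return mu ** 2

employees = [("Ann", 23), ("Bob", 31), ("Eva", 52)]   # toy relation

# Fuzzy query: retrieve NAME where AGE = "VERY YOUNG", ranked by membership grade.
for grade, name in sorted(((very(young(age)), name) for name, age in employees), reverse=True):
    print(f"{name}: membership {grade:.2f}")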

189 citations


Journal ArticleDOI
TL;DR: An algorithm is described which accomplishes journal classification using the single-link clustering technique and a novel application of the method of bibliographic coupling, which consists in the use of two-step bibliographical coupling linkages, rather than the usual one-step linkages.
Abstract: The classification of journal titles into fields or specialties is a problem of practical importance in library and information science. An algorithm is described which accomplishes such a classification using the single-link clustering technique and a novel application of the method of bibliographic coupling. The novelty consists in the use of two-step bibliographic coupling linkages, rather than the usual one-step linkages. This modification of the similarity measure leads to a marked improvement in the performance of single-link clustering in the formation of field or specialty clusters of journals. Results of an experiment using this algorithm are reported which grouped 890 journals into 168 clusters. This scope is an improvement of nearly an order of magnitude over previous journal clustering experiments. The results are evaluated by comparison with an independently derived manual classification of the same journal set. The generally good agreement indicates that this method of journal clustering will have significant practical utility for journal classification.
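The two-step coupling idea can be sketched as follows (a Python toy under simplifying assumptions; the paper's actual coupling weights and thresholds are not reproduced): journals are one-step coupled when they share cited references, the two-step similarity of a pair counts the journals coupled to both, and the single-link clusters at a given threshold are the connected components of the thresholded similarity graph.

# Toy data: journal -> set of cited references (invented for illustration).
refs = {
    "J1": {"r1", "r2", "r3"},
    "J2": {"r2", "r3"},
    "J3": {"r3", "r4"},
    "J4": {"r9"},
}

def coupled(a, b, min_shared=1):                 # one-step bibliographic coupling
    return len(refs[a] & refs[b]) >= min_shared

neighbours = {j: {k for k in refs if k != j and coupled(j, k)} for j in refs}

def two_step_similarity(a, b):                   # journals coupled to both a and b
    return len(neighbours[a] & neighbours[b])

def single_link_clusters(threshold):
    # Single-link clusters at a threshold = connected components of the graph
    # whose edges join pairs with similarity >= threshold.
    clusters, seen = [], set()
    for start in refs:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            j = stack.pop()
            if j in component:
                continue
            component.add(j)
            stack.extend(k for k in refs
                         if k not in component and two_step_similarity(j, k) >= threshold)
        clusters.append(component)
        seen |= component
    return clusters

print(single_link_clusters(threshold=1))         # e.g. [{'J1', 'J2', 'J3'}, {'J4'}]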

65 citations


Journal ArticleDOI
TL;DR: Estimates were made of the spelling error frequencies of each of these data bases, as well as the frequency of posting to misspelled terms, using a composite sample of over 3600 index terms drawn from 11 different machine-readable bibliographic data bases.
Abstract: Using a composite sample of over 3600 index terms drawn from 11 different machine-readable bibliographic data bases, estimates were made of the spelling error frequencies of each of these data bases, as well as the frequency of posting to misspelled terms. The terms studied included assigned index terms as well as some terms from titles and abstracts. The frequency of index term misspellings ranged from a high of almost 23% for one data base to a low of less than ½% for another data base. The frequency of posting to misspelled terms ranged from about one posting in 8000 citations for one data base, to about one posting in 160 citations in another data base. The impact of these error rates is discussed for the tape supplier, tape user and end user. Some suggestions are given regarding search strategy.

48 citations


Journal ArticleDOI
TL;DR: Considerable evidence exists to show that the use of term relevance weights is beneficial in interactive information retrieval, and various relevance ranking systems are evaluated, including fully automatic systems based on inverse document frequency parameters, and human rankings performed by the user population.
Abstract: Considerable evidence exists to show that the use of term relevance weights is beneficial in interactive information retrieval. Various term weighting systems are reviewed. An experiment is then described in which information retrieval users are asked to rank query terms in decreasing order of presumed importance prior to actual search and retrieval. The experimental design is examined, and various relevance ranking systems are evaluated, including fully automatic systems based on inverse document frequency parameters, human rankings performed by the user population, and combinations of the two.
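One of the fully automatic weighting schemes mentioned, inverse document frequency, can be sketched as follows (a minimal Python illustration with invented data; the exact formula and normalisation used in the experiment are not reproduced):

import math

# Toy collection: document -> set of index terms (invented for illustration).
docs = {
    "d1": {"retrieval", "weighting", "terms"},
    "d2": {"retrieval", "users"},
    "d3": {"weighting", "experiment"},
}
N = len(docs)

def idf(term):
    # Rare terms receive high weights, frequent terms low weights.
    df = sum(1 for words in docs.values() if term in words)
    return math.log(N / df) if df else 0.0

def score(query, words):
    return sum(idf(t) for t in query if t in words)

query = {"retrieval", "weighting"}
for doc, words in sorted(docs.items(), key=lambda kv: -score(query, kv[1])):
    print(doc, round(score(query, words), 3))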

47 citations


Journal ArticleDOI
TL;DR: An alternative model is proposed, which identifies two different types of “error” or probabilistic variation between relevance judgements, and the problems of quantifying the model, and of assessing its implications for retrieval testing.
Abstract: Gebhardt's[1] probabilistic model of relevance is examined and found not to represent adequately some characteristics of the relevance judgement process. An alternative model is proposed, which identifies two different types of “error” or probabilistic variation between relevance judgements. The two types arise from, first, the definition of the boundaries of the relevance classes, and secondly the actual assessment of an individual document on the underlying scale (which is assumed to be a continuum). The problems of quantifying the model, and of assessing its implications for retrieval testing, are discussed.
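The two types of variation can be made concrete with a small simulation (a hedged Python sketch; the distributions and parameter values are illustrative assumptions, not the paper's):

import random

random.seed(0)
positions = [0.2, 0.45, 0.55, 0.8]      # documents on the underlying relevance continuum

def judge(positions, boundary_sd=0.05, assessment_sd=0.1):
    # Type 1 variation: where this judge places the boundary of the relevance class.
    boundary = random.gauss(0.5, boundary_sd)
    # Type 2 variation: noisy assessment of each document's position on the continuum.
    return [int(random.gauss(p, assessment_sd) >= boundary) for p in positions]

for _ in range(5):
    print(judge(positions))             # disagreement arises from both sources of variation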

45 citations


Journal ArticleDOI
TL;DR: It was found to be very difficult to develop a good strategy for searching a catalog using LC subject headings, and the overriding conclusion was that the LC subject cataloging approach is badly in need of rationalization.
Abstract: Sixty-one undergraduate and graduate students in psychology, economics, and librarianship provided the subject terms they would use to search an academic library catalog in 30 hypothetical search instances. The subject indexing tested was that of the Library of Congress, which is used in most large libraries in the United States. The large number of responses on each search instance enabled an unusually detailed, systematic evaluation of various aspects of the LC approach. Results (including evidence of many inadequacies) were produced on see references, subject/place order, noun/adjective order, specific entry, direct entry, and a priori probability of subject term matching. It was found to be very difficult to develop a good strategy for searching a catalog using LC subject headings. The overriding conclusion was that the LC subject cataloging approach is badly in need of rationalization.

39 citations


Journal ArticleDOI
TL;DR: The computerized correcting process is presented as a heuristic tree search and has the highest error correction accuracy to date.
Abstract: An automatic method for correcting spelling and typing errors from teletypewriter keyboard input is proposed. The computerized correcting process is presented as a heuristic tree search. The correct spellings are stored character-by-character in a pseudo-binary tree. The search examines a small subset of the database (selected branches of the tree) while checking for insertion, substitution, deletion and transposition errors. The correction procedure utilizes the inherent redundancy of natural language. Multiple errors can be handled if at least two correct characters appear between errors. Test results indicate that this approach has the highest error correction accuracy to date.
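The flavour of the search can be conveyed by the following hedged Python sketch (a simplified stand-in, not the paper's heuristics or data structure): correct spellings are held character-by-character in a tree, and only branches reachable within a small budget of insertion, substitution, deletion and transposition errors are explored.

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True                            # end-of-word marker
    return root

def correct(trie, word, budget=1, prefix=""):
    """Dictionary words reachable from the typed `word` within `budget` errors."""
    if budget < 0:
        return set()
    results = set()
    if not word and "$" in trie:
        results.add(prefix)
    if word:                                        # insertion error: skip an extra typed character
        results |= correct(trie, word[1:], budget - 1, prefix)
    for ch, child in trie.items():
        if ch == "$":
            continue
        if word and word[0] == ch:                  # character matches: no cost
            results |= correct(child, word[1:], budget, prefix + ch)
        else:
            if word:                                # substitution error
                results |= correct(child, word[1:], budget - 1, prefix + ch)
            results |= correct(child, word, budget - 1, prefix + ch)          # deletion error
        if len(word) >= 2 and word[1] == ch and word[0] in child:             # transposition error
            results |= correct(child[word[0]], word[2:], budget - 1, prefix + ch + word[0])
    return results

trie = build_trie(["retrieval", "relevance", "recall"])
print(correct(trie, "retreival"))                   # -> {'retrieval'}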

37 citations


Journal ArticleDOI
TL;DR: The automatic procedure is superior to traditional searching procedures in terms of both recall and precision and probably for more than 80% of the inquiries the need for a documentalist as an intermediary between the user and the system can be avoided.
Abstract: A system is described for the automatic adjustment of queries addressed to information retrieval systems employing a structured thesaurus for the coordinate indexing of an average of at least five or six descriptors per document. Starting with at least two documents considered by the user as relevant to his inquiry, the system formulates different queries using descriptors occurring in the relevant documents. Results from these queries are presented to the user for relevance assessment, as a result of which the most efficient queries are automatically selected and loosened (broadened). The new documents retrieved are again checked for relevance by the user; and with new relevant documents the loop starts again. The result of the automatic procedure is independent of the point of departure. The automatic procedure is superior to traditional searching procedures in terms of both recall and precision. The automatic procedure requires more computing, but probably for more than 80% of the inquiries the need for a documentalist as an intermediary between the user and the system can be avoided.
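A very reduced sketch of such a feedback loop is given below (Python, with an invented coordinate index; the paper's query selection and "loosening" rules are more elaborate than the simple heuristics shown here):

from itertools import combinations

index = {                                  # toy coordinate index: document -> descriptors
    "d1": {"solar", "energy", "storage", "batteries", "economics"},
    "d2": {"solar", "energy", "cells", "silicon", "efficiency"},
    "d3": {"solar", "energy", "storage", "thermal", "design"},
    "d4": {"wind", "turbines", "design", "materials", "noise"},
}

def run(query):                            # conjunctive (coordinate) search
    return {d for d, descriptors in index.items() if query <= descriptors}

relevant = {"d1", "d2"}                    # documents the user has already judged relevant
shared = set.intersection(*(index[d] for d in relevant))

# Formulate candidate queries from descriptors of the relevant documents,
# and keep the one retrieving the most new material (a stand-in for the
# user's relevance assessment of the retrieved results).
queries = [set(q) for q in combinations(shared, 2)]
best = max(queries, key=lambda q: len(run(q) - relevant))
print("selected query:", best, "-> new documents to judge:", run(best) - relevant)

# Loosen (broaden) the selected query by dropping one descriptor at a time.
for q in (best - {t} for t in best):
    print("broadened:", q, "->", run(q) - relevant)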

30 citations


Journal ArticleDOI
TL;DR: The organization of the document search pattern set proposed in the paper ensures that, when retrieving a response to a given information request, searching is limited to one (or several) of previously determined subsets, which makes the information system response time acceptable.
Abstract: Search patterns of documents and information requests are only better or worse representatives of them, so it is important to examine the possibilities of designing self-learning information retrieval systems. Another important question is how to organize the set of document search patterns so as to obtain an acceptable response time of the information system to a given information request. The self-learning process of the proposed information system consists in determining, on a set of document and information request search patterns, a similarity relation in the sense of L. A. Zadeh. The organization of the document search pattern set proposed in the paper ensures that, when retrieving a response to a given information request, searching is limited to one (or several) of previously determined subsets. This makes the information system response time acceptable. The proposed information retrieval strategy is discussed in terms of fuzzy sets.
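Under simplifying assumptions, the organization can be sketched in Python as follows (the similarity grade, the alpha threshold and the data are invented; the paper's construction of the Zadeh similarity relation is not reproduced): search patterns whose pairwise similarity reaches the threshold are grouped into subsets, and a request is matched only against the subset containing its most similar patterns.

patterns = {                                     # toy document search patterns
    "d1": {"fuzzy", "sets", "retrieval"},
    "d2": {"fuzzy", "retrieval", "query"},
    "d3": {"database", "relational", "query"},
    "d4": {"database", "relational", "schema"},
}

def sim(a, b):                                   # graded similarity in [0, 1]
    return len(a & b) / len(a | b)

def subsets(alpha):
    # Group patterns linked (directly or transitively) at similarity >= alpha.
    blocks = []
    for d in patterns:
        merged = [b for b in blocks
                  if any(sim(patterns[d], patterns[e]) >= alpha for e in b)]
        block = {d}.union(*merged) if merged else {d}
        blocks = [b for b in blocks if b not in merged] + [block]
    return blocks

blocks = subsets(alpha=0.4)
request = {"fuzzy", "query"}
best = max(blocks, key=lambda b: max(sim(request, patterns[d]) for d in b))
print("subsets:", blocks)
print("search only:", best)                      # the response is limited to one subset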

29 citations


Journal ArticleDOI
TL;DR: It is shown that accepting the assumptions made by Swets would result in the possible rejection of the most pertinent documents in favor of those that are less pertinent, a consequence of the normality assumptions of the model, whereas other distributions, such as the Poisson distribution, are consistent with the standard procedure.
Abstract: Most automated information retrieval systems operate by relating a document to a request by means of a measure of pertinence, and then retrieving the most pertinent documents for their patrons. In this paper the consistency of this operating procedure with the well known Swets Model is examined. It is shown that accepting the assumptions made by Swets would result in the possible rejection of the most pertinent documents in favor of those that are less pertinent. This conclusion is a consequence of the normality assumptions of the model, whereas other distributions, such as the Poisson distribution, are consistent with the standard procedure. In the course of the development, the fundamentals of decision theory and signal detection theory are reviewed.
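The point can be illustrated numerically (a hedged Python sketch with arbitrary parameters, not taken from the paper): with unequal-variance normal distributions for the scores of relevant and non-relevant documents, the likelihood ratio is not monotone in the score, so retrieving the highest-scoring documents can conflict with the optimal decision rule, whereas with Poisson distributions the ratio is monotone.

import math

def normal_pdf(x, mu, sd):
    return math.exp(-(x - mu) ** 2 / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Likelihood ratios (relevant / non-relevant) at increasing score values.
print("score   LR normal (unequal variances)   LR Poisson")
for s in range(0, 11):
    lr_normal = normal_pdf(s, 4.0, 1.0) / normal_pdf(s, 2.0, 3.0)   # illustrative parameters
    lr_poisson = poisson_pmf(s, 4.0) / poisson_pmf(s, 2.0)
    print(f"{s:5d}   {lr_normal:30.3f}   {lr_poisson:10.3f}")

# The normal-based ratio rises and then falls, so the very highest scores are
# not the most indicative of relevance; the Poisson-based ratio only increases.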

27 citations


Journal ArticleDOI
TL;DR: Indexing theories formulated by Jonker, Heilprin, Landry and Salton are described, which need to be tested experimentally and eventually combined into a unified comprehensive theory of indexing.
Abstract: A theory of indexing helps explain the nature of indexing, the structure of the vocabulary, and the quality of the index. Indexing theories formulated by Jonker, Heilprin, Landry and Salton are described. Each formulation has a different focus. Jonker, by means of the Terminological and Connective Continua, provided a basis for understanding the relationships between the size of the vocabulary, the hierarchical organization, and the specificity by which concepts can be described. Heilprin introduced the idea of a search path which leads from query to document. He also added a third dimension to Jonker's model; the three variables are diffuseness, permutivity and hierarchical connectedness. Landry made an ambitious and well conceived attempt to build a comprehensive theory of indexing predicated upon sets of documents, sets of attributes, and sets of relationships between the two. It is expressed in theorems and by formal notation. Salton provided both a notational definition of indexing and procedures for improving the ability of index terms to discriminate between relevant and nonrelevant documents. These separate theories need to be tested experimentally and eventually combined into a unified comprehensive theory of indexing.

Journal ArticleDOI
TL;DR: MARIS is described, a conversational system of the latter category, designed to provide relatively powerful consulting services for the management of patients in internal medicine.
Abstract: Computer systems for clinical consulting on patient management operate on descriptions of medical expertise derived from repositories of systematized knowledge of medicine (textbooks and/or panels of experts) or from empirical situations embedded in medical records. The paper describes MARIS, a conversational system of the latter category, designed to provide relatively powerful consulting services for the management of patients in internal medicine.

Journal ArticleDOI
TL;DR: The growth, origins, technological development, and current activities of bibliographic data bases are explored and the NCLIS National Program relative to these aspects are examined and its ability to promote and provide a framework for the coordination of data base-related activities and research in response to national needs is examined.
Abstract: Library and information services already feel the impact of the burgeoning development in the field of bibliographic data bases, and this effect will increase in the future. This article explores the growth, origins, technological development, and current activities of bibliographic data bases and examines the NCLIS National Program relative to these aspects. The relationships between data base function, funding, and use are set forth in a discussion of data base producers. Discussions of data formats, data elements and file structure provide the groundwork for a closer look at the methods and purposes underlying the retrospective and current awareness search capabilities of existent data bases. A review of related data base and data base center characteristics highlights the discussion of retrospective and current awareness search functions and intermediary search services. In all data base activity the prime objective of making information easily accessible to all who need it emerges as no small task, especially in light of the realities of scattered resources and unsteady funding. Data base networking and resource sharing constitute one means to the achievement of this ideal. The greatest potential of the NCLIS National Program lies in this direction, in its ability to promote and provide a framework for the coordination of data base-related activities and research in response to national needs.

Journal ArticleDOI
TL;DR: This paper challenges the meaningfulness of precision and recall values as a measure of performance of a retrieval system by advocating the use of a normalised form of Shannon's functions (entropy and mutual information).
Abstract: This paper challenges the meaningfulness of precision and recall values as a measure of performance of a retrieval system. Instead, it advocates the use of a normalised form of Shannon's functions (entropy and mutual information). Shannon's four axioms are replaced by an equivalent set of five axioms which are more readily shown to be pertinent to document retrieval. The applicability of these axioms and the conceptual and operational advantages of Shannon's functions are the central points of the work. The applicability of the results to any automatic classification is also outlined.
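A hedged sketch of the comparison (the paper's axioms and exact normalisation are not reproduced; the counts are invented): from the 2 x 2 retrieval table one can compute precision and recall, and also the mutual information between the "retrieved" and "relevant" events, normalised here by the relevance entropy.

import math

a, b = 30, 20          # retrieved & relevant, retrieved & non-relevant
c, d = 10, 940         # not retrieved & relevant, not retrieved & non-relevant
N = a + b + c + d

precision, recall = a / (a + b), a / (a + c)

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_rel, p_ret = (a + c) / N, (a + b) / N
mutual_info = (entropy([p_rel, 1 - p_rel]) + entropy([p_ret, 1 - p_ret])
               - entropy([a / N, b / N, c / N, d / N]))
normalised = mutual_info / entropy([p_rel, 1 - p_rel])   # one possible normalisation

print(f"precision={precision:.2f}  recall={recall:.2f}  normalised mutual information={normalised:.2f}")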

Journal ArticleDOI
TL;DR: This paper attempts to show how a relationally organised data base is well suited to bibliographic data management, and how, given such a relational organisation, it is possible to construct an interface which separates the query language from the physical representation of the data base.
Abstract: Among the problems associated with modern information retrieval systems is the lack of any systematic approach to the design of query language interfaces. In this paper we attempt to show how a relationally organised data base is well suited to bibliographic data management, and how, given such a relational organisation, it is possible to construct an interface which separates the query language from the physical representation of the data base. It is also shown how such a query language organisation may be usefully interfaced to existing retrieval systems. Finally a query language for retrieval applications is proposed.
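The separation argued for can be sketched with a relational toy (Python with sqlite3; the schema, data and query are invented for illustration and are not the paper's proposal): the bibliographic data are held as relations, and queries are phrased against the relations only, never against a physical file layout.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE document(doc_id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
    CREATE TABLE authorship(doc_id INTEGER, author TEXT);
    CREATE TABLE indexing(doc_id INTEGER, term TEXT);
""")
db.executemany("INSERT INTO document VALUES (?, ?, ?)",
               [(1, "Paper on fuzzy queries", 1977), (2, "Paper on journal clustering", 1977)])
db.executemany("INSERT INTO authorship VALUES (?, ?)", [(1, "Author A"), (2, "Author B")])
db.executemany("INSERT INTO indexing VALUES (?, ?)",
               [(1, "fuzzy sets"), (1, "query languages"), (2, "clustering")])

# The query mentions only relations and attributes; the storage structure is hidden.
rows = db.execute("""
    SELECT d.title, a.author
    FROM document d
    JOIN authorship a USING (doc_id)
    JOIN indexing i USING (doc_id)
    WHERE i.term = 'fuzzy sets' AND d.year = 1977
""").fetchall()
print(rows)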

Journal ArticleDOI
TL;DR: By ordering the sources in a more relevant manner, the users' search and retrieval costs are reduced, which leads to an increase in the value and amount of information processed.
Abstract: Mathematical models are developed for describing and optimizing the way libraries organize information sources. By ordering the sources in a more relevant manner, the users' search and retrieval costs are reduced, which leads to an increase in the value and amount of information processed. The cost of organizing collections so as to minimize user effort is dependent on the scatter of relevant sources in the literature and the specification of core classes.
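The effect of ordering on user effort can be shown with a toy calculation (a hedged Python sketch; the paper's cost models are considerably richer): if the classes richest in relevant sources are scanned first, the expected number of items a user must examine falls.

classes = [                 # (class, number of items, number of relevant items) - invented
    ("A", 200, 2),
    ("B", 50, 10),
    ("C", 400, 1),
]

def expected_examinations(order):
    # Scan classes in the given order; assume a relevant item sits, on average,
    # halfway through its class.
    cost, scanned = 0.0, 0
    for _name, size, relevant in order:
        cost += relevant * (scanned + size / 2)
        scanned += size
    return cost

print("original order      :", expected_examinations(classes))
print("ordered by relevance:", expected_examinations(
    sorted(classes, key=lambda c: c[2] / c[1], reverse=True)))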

Journal ArticleDOI
TL;DR: The finding that the two strategies failed to generate significantly different lists of terms challenges the validity of the assumption and raises several important questions for the theorists who write the guidelines for thesaurus design and for those who must put the guidelines into practice in the design of a thesaurus.
Abstract: The present-day guidelines for thesaurus design recommend two different strategies—the committee and empirical approaches—for identifying candidate terms. An argument is made that the basis for the recommendation is the assumption that the knowledge based on the consensus of experts in a field is different from the knowledge expressed in the literature of that field. An experiment was conducted to test the validity of this assumption. The finding that the two strategies failed to generate significantly different lists of terms challenges the validity of the assumption and raises several important questions for the theorists who write the guidelines for thesaurus design and for those who must put the guidelines into practice in the design of a thesaurus.

Journal ArticleDOI
TL;DR: It has been found that the systems examined—ASCA, Ringdoc, Drugdoc, BIOSIS, Chemical Abstracts and Medlars/Index Medicus—differ considerably in their coverage of conference abstracts and their indexing of general articles, although they all retrieve most of the main pharmacological papers.
Abstract: The main indexes to recent drug literature have been studied using a file of nearly 400 references to a new class of drugs, the histamine H2-receptor antagonists. The results are analysed with reference to the inclusion of references in the data bases, retrieval by searches for non-proprietary names, and timeliness. It has been found that the systems examined—ASCA, Ringdoc, Drugdoc, BIOSIS, Chemical Abstracts and Medlars/Index Medicus—differ considerably in their coverage of conference abstracts and their indexing of general articles, although they all retrieve most of the main pharmacological papers. Attention is drawn to the high proportion of conference abstracts in the literature of this expanding field.

Journal ArticleDOI
TL;DR: A formal standardized procedure for the decision-making process for the purchase or rejection of an information storage and retrieval system, where either purchaser or users find a particular system unacceptable.
Abstract: This paper describes a formal standardized procedure for the decision-making process for the purchase or rejection of an information storage and retrieval system. The interaction of both the purchaser of the system and its potential users with the various models of the system (such as cost-time-volume models and performance evaluation models) ensures that the purchase decision for a given system is affected by all possible constraints and is universally acceptable. If either purchaser or users find a particular system unacceptable, the procedure either rejects it or institutes modifications (within given constraints) until a generally acceptable system is determined, if one exists.

Journal ArticleDOI
TL;DR: A comparative evaluation has been carried out on the Philips “DIRECT” and the British “INSPEC” retrieval system.
Abstract: A comparative evaluation has been carried out on the Philips “DIRECT” and the British “INSPEC” retrieval systems. DIRECT is based on automatic indexing whereas INSPEC uses manual subject indexing. Two queries were submitted to both systems, using the same data base. The results are expressed in terms of recall and precision. Both recall and precision of INSPEC were found to be higher than those of DIRECT by 20%. It is concluded that this difference is mainly a result of the query formulation; the effectiveness obtained with automatic indexing of documents is otherwise equivalent to that of the manual procedure.

Journal ArticleDOI
TL;DR: Recent conceptual and empirical work related to the development of decision-oriented frameworks for management information systems design is surveyed, particularly as related to improving the management of organizations.
Abstract: Despite the rapid growth in the use of computers in organizations, few of the resulting systems have had a significant impact on the way in which management makes decisions. Frameworks are needed which aid in understanding the structure of management information systems, toward providing focus and improving the effectiveness of systems efforts. This paper surveys recent conceptual and empirical work related to the development of decision-oriented frameworks for management information systems design, particularly as related to improving the management of organizations.

Journal ArticleDOI
TL;DR: A method is introduced to recognize the part-of-speech for English texts using knowledge of linguistic regularities rather than voluminous dictionaries and may be a valuable tool aiding automatic indexing of documents and automatic thesaurus construction as well as other kinds of natural language processing.
Abstract: A method is introduced to recognize the part-of-speech for English texts using knowledge of linguistic regularities rather than voluminous dictionaries. The algorithm proceeds in two steps; in the first step information concerning the part-of-speech is extracted from each word of the text in isolation using morphological analysis as well as the fact that in English there are a reasonable number of word endings which are characteristic of the part-of-speech. The second step is to look at a whole sentence and, using syntactic criteria, to assign the part-of-speech to a single word according to the parts-of-speech and other features of the surrounding words. In particular, those parts-of-speech which are relevant for automatic indexing of documents, i.e. nouns, adjectives, and verbs, are recognized. An application of this method to a large corpus of scientific text showed the result that for 84% of the words the part-of-speech was identified correctly and only for 2% definitely wrong; for the rest of the words ambiguous assignments were made. Using only word lists of a limited extent, the technique thus may be a valuable tool aiding automatic indexing of documents and automatic thesaurus construction as well as other kinds of natural language processing.
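A deliberately tiny Python sketch of the two-step approach (the suffix and context rules below are simplified examples chosen for illustration, not the rule set of the paper):

SUFFIX_RULES = [                                   # step 1: word-in-isolation evidence
    (("tion", "ment", "ness", "ity"), "NOUN"),
    (("ous", "ive", "able", "ical", "ful"), "ADJ"),
    (("ize", "ise", "ated"), "VERB"),
    (("ly",), "ADV"),
]

def tag_in_isolation(word):
    for endings, tag in SUFFIX_RULES:
        if word.lower().endswith(endings):
            return tag
    return "UNKNOWN"

def tag_sentence(words):
    tags = [tag_in_isolation(w) for w in words]
    for i in range(1, len(words)):
        # step 2: one contextual rule - an unresolved word following an adjective
        # or a determiner is taken to be a noun.
        if tags[i] == "UNKNOWN" and (tags[i - 1] == "ADJ" or words[i - 1].lower() in ("the", "a", "an")):
            tags[i] = "NOUN"
    return list(zip(words, tags))

print(tag_sentence("the effective procedure produces a useful classification quickly".split()))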

Journal ArticleDOI
TL;DR: The concept of information language (IL), its vocabulary and syntax and the notion of the “semantic power” of an information language are defined and the concept of ideally functioning information retrieval systems (IRS) is suggested.
Abstract: The primary aim of this study is to suggest a formalized definition (“explication”) of “relevance relationship” between texts, including the explication of the concept of “degree of relevance”. The concept of information language (IL), its vocabulary and syntax, and the notion of the “semantic power” of an information language are defined. The concept of ideally functioning information retrieval systems (IRS) is suggested and different kinds of deviations from such IRS are considered.

Journal ArticleDOI
TL;DR: The aims of this project were to develop methods for measuring the efficiency of information and documentation systems and services and to test such methods in actual practice.
Abstract: Significant results of a project carried out by the Studiengruppe für Systemforschung e.V. (Heidelberg, W. Germany) on the “Economy of Information and Documentation Systems” are described. The aims of this project in the framework of the German national “Program of the Federal Government for the Promotion of Information and Documentation (I&D-Program) 1974–1977” were to develop methods for measuring the efficiency of information and documentation systems and services and to test such methods in actual practice.

Journal ArticleDOI
C.D. Hurt
TL;DR: Testing the hypothesis that environmental scientists tend to publish in multiple subject areas against the observed data produced a significantly negative correlation.
Abstract: This article examines the publication pattern of environmental scientists in terms of a tendency to remain within one subject area or to scatter their production across a broad subject range. Testing the hypothesis that environmental scientists tend to publish in multiple subject areas against the observed data produced a significantly negative correlation. Possible reasons for this production behavior are discussed.

Journal ArticleDOI
TL;DR: A model of the retrieval process, based on continuous variables, is described, and the effectiveness of each method is predicted, both in terms of the Precision-Recall graph and language measures.
Abstract: A full treatment of the significance of a document for an enquirer should include a joint description of the similarity between the document and the enquiry in a linguistic sense, and the age of the document at the time of the enquiry. The basic variables are identified in terms of a signal detection model. The age variable is related to the phenomenon of obsolescence, which is treated as a perceived, signed attribute of relevant documents. Two retrieval methods that use both index terms and document age are described: one in which a set of documents, first selected by a term-intersection process, is reduced by applying a date of publication criterion (the “subset method”); and one in which a bivariate function attaches a single number to each document, and a retrieved set is defined by a single threshold value (the “bivariate weight method”). In the latter method, discriminant analysis is a useful aid. A model of the retrieval process, based on continuous variables, is described, and the effectiveness of each method is predicted, both in terms of the Precision-Recall graph and language measures. The model suggests that either method can improve retrieval performance but incorrect usage will depress it. The better choice of method will depend on the Recall/Precision mix required by the user, as well as the actual parameters of the distributions. A relationship is hypothesised between the growth rate of a data base and the underlying distributions defined by relevance judgements.
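The two methods can be contrasted in a simplified Python sketch (the weights, thresholds and data below are arbitrary assumptions; the paper derives such quantities from the underlying distributions):

docs = [                      # (document, number of matching query terms, age in years)
    ("d1", 3, 1), ("d2", 3, 12), ("d3", 1, 2), ("d4", 2, 6),
]

def subset_method(min_terms=2, max_age=8):
    # Stage 1: term-intersection selection; stage 2: date-of-publication criterion.
    stage1 = [d for d in docs if d[1] >= min_terms]
    return [name for name, _terms, age in stage1 if age <= max_age]

def bivariate_weight_method(w_terms=1.0, w_age=-0.2, threshold=1.5):
    # A bivariate function attaches one number per document; one threshold defines the set.
    return [name for name, terms, age in docs if w_terms * terms + w_age * age >= threshold]

print("subset method          :", subset_method())
print("bivariate weight method:", bivariate_weight_method())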

Journal ArticleDOI
TL;DR: The historical development of integrated database management systems is reviewed and competing approaches are examined, including management and utilization perspectives, implementation and design issues, query languages, security, integrity, privacy and concurrency.
Abstract: This paper reviews the historical development of integrated database management systems and examines the competing approaches. Topics covered include management and utilization perspectives, implementation and design issues, query languages, security, integrity, privacy and concurrency. Extensive references are provided.


Journal ArticleDOI
TL;DR: The use of design equations in citation retrieval system research is explored and examples are given to illustrate how design equations could be employed in experimental design and data analysis.
Abstract: This paper explores the use of design equations in citation retrieval system research. Design equations are models which relate the performance of a system to its design parameters. A sample is developed; then, the potential of design equations for description, prediction, and prescription is generally explored. Finally, examples are given to illustrate how design equations could be employed in experimental design and data analysis.