
Showing papers on "Document retrieval" published in 1979


Journal ArticleDOI
TL;DR: In this paper, the authors consider the situation where no relevance information is available, that is, at the start of the search, and propose strategies based on a probabilistic model for the initial search and an intermediate search.
Abstract: Most probabilistic retrieval models incorporate information about the occurrence of index terms in relevant and non‐relevant documents. In this paper we consider the situation where no relevance information is available, that is, at the start of the search. Based on a probabilistic model, strategies are proposed for the initial search and an intermediate search. Retrieval experiments with the Cranfield collection of 1,400 documents show that this initial search strategy is better than conventional search strategies both in terms of retrieval effectiveness and in terms of the number of queries that retrieve relevant documents. The intermediate search is shown to be a useful substitute for a relevance feedback search. Experiments with queries that do not retrieve relevant documents at high rank positions indicate that a cluster search would be an effective alternative strategy.

399 citations
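
The abstract above does not state its weighting formula, so the following is only a minimal sketch of one common way to rank documents when no relevance information is available: each matching query term contributes an inverse-document-frequency weight log(N / n). The weight, the function names `idf_weights` and `initial_search`, and the toy documents are all assumptions for illustration, not the authors' method.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Map each term to log(N / n), where n is its document frequency."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / n) for term, n in doc_freq.items()}


def initial_search(query, docs):
    """Rank document indices by the summed weight of matching query terms."""
    weights = idf_weights(docs)
    scores = []
    for i, doc in enumerate(docs):
        present = set(doc)
        score = sum(weights.get(t, 0.0) for t in set(query) if t in present)
        scores.append((score, i))
    # Highest score first; ties are broken by document index.
    return [i for score, i in sorted(scores, key=lambda s: (-s[0], s[1]))]


# Hypothetical three-document collection.
docs = [
    ["boundary", "layer", "flow"],
    ["supersonic", "flow", "wing"],
    ["heat", "transfer", "flow"],
]
ranking = initial_search(["supersonic", "flow"], docs)
```

In this toy collection "flow" occurs in every document and so gets weight zero, which means the ranking is decided entirely by the rarer term "supersonic".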


Journal ArticleDOI
TL;DR: The use of weights in query representation and document indexing is analysed as a generalization of a Boolean retrieval system, and criteria, including self-consistency, are given for the functions used to evaluate the relevance of records to a specific query.
Abstract: The use of weights to denote a query representation and/or the indexing of a document is analysed as a generalization of a Boolean retrieval system. Criteria are given for the functions used to evaluate the relevance of the records to a specific query, including self-consistency. Various mechanisms suggested in the literature for evaluating the relevance of records with regard to a given query are tested and found to be less than satisfactory. A new approach is suggested to avoid some of the perils of a weighted Boolean retrieval system.

161 citations


Journal ArticleDOI
TL;DR: A new method of document retrieval based on the fundamental operations of fuzzy set theory is presented: basic notions are introduced, the syntax and semantics of the proposed retrieval language are given, and an algorithm allocating documents to particular queries is described and its properties discussed.
Abstract: The aim of a document retrieval system is to issue documents which contain the information needed by a given user of an information system. The process of retrieving documents in response to a given query is carried out by means of the search patterns of these documents and the query. It is thus clear that the quality of this process, i.e. the pertinence of the information system response to the information need of a given user, depends on the degree of accuracy in which document and query contents are represented by their search patterns. It seems obvious that the weighting of descriptors entering document search patterns improves the quality of the document retrieval process. A mathematical apparatus which takes into consideration, in a natural manner, the fact that the grades of importance of the descriptors in document search patterns are of the continuum type, that is, an apparatus adequate to the description of a retrieval system of documents indexed by weighted descriptors, is (among known mathematical methods) the theory of fuzzy sets, formulated by L. A. Zadeh. It is the aim of this paper to present a new method of document retrieval based on the fundamental operations of the fuzzy set theory. We start by introducing basic notions, then the syntax and semantics of the proposed language for document retrieval will be given, and an algorithm allocating documents to particular queries will be described and its properties discussed. The basic advantage of the use of the fuzzy set theory for document retrieval system description is that it takes into consideration, in a simple way, the differentiation of the importance of descriptors in document search patterns and the differentiation of the formal relevance grades of particular documents of an information system to a given query. Documents of the highest grades (in the given information system) of formal relevance to the given query may be retrieved by means of the application of simple operations of the fuzzy set theory.

154 citations
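
The "fundamental operations of the fuzzy set theory" mentioned in the abstract above are usually taken to be Zadeh's minimum (intersection), maximum (union), and complement. The sketch below applies them to documents indexed by weighted descriptors. The nested-tuple query syntax, the function name `grade`, and the example weights are assumptions made for illustration; the paper's own query language is not reproduced here.

```python
def grade(query, doc):
    """Return the grade of formal relevance of one document to a query.

    `doc` maps descriptors to weights in [0, 1]; `query` is either a
    descriptor string or a tuple ("and" | "or" | "not", operand, ...).
    """
    if isinstance(query, str):
        return doc.get(query, 0.0)  # an absent descriptor has weight 0
    op, *operands = query
    grades = [grade(operand, doc) for operand in operands]
    if op == "and":
        return min(grades)       # fuzzy intersection
    if op == "or":
        return max(grades)       # fuzzy union
    if op == "not":
        return 1.0 - grades[0]   # fuzzy complement
    raise ValueError(f"unknown operator: {op}")


# Hypothetical weighted search patterns for two documents.
docs = {
    "d1": {"fuzzy": 0.9, "retrieval": 0.6},
    "d2": {"fuzzy": 0.3, "retrieval": 0.8, "hardware": 0.7},
}
query = ("and", "fuzzy", ("or", "retrieval", "hardware"))
ranked = sorted(docs, key=lambda d: grade(query, docs[d]), reverse=True)
```

Sorting by grade gives the ranked output the abstract describes: the documents with the highest grades of formal relevance come first.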



Journal ArticleDOI
TL;DR: The hardware required for efficient text retrieval differs from that required for retrieval of formatted data; such hardware, particularly term comparators, is examined.
Abstract: The hardware required for efficient text retrieval differs from that required for retrieval of formatted data. Here is an examination of such hardware, particularly term comparators.

87 citations


Journal ArticleDOI
TL;DR: Experiments with the Cranfield test collection show that trigram encoding of words performs noticeably better than the use of digrams; however, use of the least frequent digram in each term produces more acceptable results.
Abstract: This paper describes the use of fixed‐length character strings for controlling the size of indexing vocabularies in reference retrieval systems. Experiments with the Cranfield test collection show that trigram encoding of words performs noticeably better than the use of digrams; however, use of the least frequent digram in each term produces more acceptable results. Hashing of terms gives a better performance than that obtained from a vocabulary of comparable size produced by right‐hand truncation. The application of small indexing vocabularies to the sequential searching of large document files is discussed.

54 citations
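
To make the fixed-length encodings in the abstract above concrete, here is a minimal sketch that splits words into overlapping character n-grams and picks the least frequent digram of a term. How the paper tokenizes words, handles word boundaries, and counts digram frequencies is not stated in the abstract, so the helper names `ngrams` and `least_frequent_digram`, the alphabetical tie-break, and the three-word vocabulary are assumptions.

```python
from collections import Counter


def ngrams(word, n):
    """Return the overlapping character n-grams of a word.

    A word shorter than n is returned whole as its only n-gram.
    """
    return [word[i:i + n] for i in range(len(word) - n + 1)] or [word]


def least_frequent_digram(word, digram_counts):
    """Pick the digram of `word` that is rarest in the vocabulary.

    Ties are broken alphabetically so the choice is deterministic.
    """
    return min(ngrams(word, 2), key=lambda g: (digram_counts[g], g))


# Hypothetical vocabulary used only to count digram frequencies.
vocabulary = ["retrieval", "relevance", "ranking"]
digram_counts = Counter(g for w in vocabulary for g in ngrams(w, 2))

trigrams = ngrams("retrieval", 3)
rarest = least_frequent_digram("ranking", digram_counts)
```

Because there are at most 26 × 26 digrams of lowercase letters, representing each term by a single digram caps the indexing vocabulary at a small, fixed size.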


Journal ArticleDOI
TL;DR: A study has been made of the effect of controlled variations in indexing vocabulary size on retrieval performance using the Cranfield 200 and 1400 test collections.
Abstract: A study has been made of the effect of controlled variations in indexing vocabulary size on retrieval performance using the Cranfield 200 and 1400 test collections. The vocabularies considered are sets of variable‐length character strings chosen from the fronts of document and query terms so as to occur with approximate equifrequency. Sets containing between 120 and 720 members were tested both using an application of the Cluster Hypothesis and in a series of linear associative retrieval experiments. The effectiveness of the smaller sets is low but the larger ones exhibit retrieval characteristics comparable to those of words.

25 citations


Journal ArticleDOI
TL;DR: It is shown that the issue of depth of indexing is, in fact, not a central issue in the design of effective document retrieval systems, and that the answer to the question of optimal depth is a logical consequence of answers to more fundamental questions about indexing and retrieval.
Abstract: For many years it has been believed that in order to design optimal document retrieval systems one must assign index terms to documents at their optimal depth; therefore, it was of primary importance to answer the following question: “What is the optimal depth of indexing?” This article offers an analysis and answer to this question. We show that the issue of depth of indexing is, in fact, not a central issue in the design of effective document retrieval systems. It turns out that the answer to the question about optimal depth is a logical consequence of answers (which this article provides) to more fundamental questions about indexing and retrieval.

16 citations


Journal ArticleDOI
TL;DR: A model of an information retrieval system based on a thesaurus with weights is described, and inclusiveness and two other fundamental properties of the system are given.
Abstract: This paper describes a model of information retrieval system based on thesaurus with weights. Definitions of the following terms: thesaurus, document description, information query, similarity of queries and descriptions of documents, similarity measure and accuracy of response are given. Inclusiveness and two other fundamental properties of the considered system are given.

8 citations


Journal ArticleDOI
TL;DR: Properties of and operations on inverted files, as used in systems based on a thesaurus with weights, are studied in this paper.
Abstract: The inverted file structure is often used to organize data in an information retrieval system. When the hierarchy relation on the set of descriptors and the weights of descriptors in document descriptions are taken into account, the conventional concept of the inverted file may be extended. Properties of and operations on inverted files, as used in systems based on a thesaurus with weights, are studied in this paper.

6 citations
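
One way to picture the extension described in the abstract above is an inverted file whose postings carry descriptor weights, searched together with a thesaurus hierarchy. The sketch below is only a guess at the general shape: the structure, the names `build_inverted_file` and `lookup`, the `narrower` mapping, and the rule of keeping the maximum weight per document are all assumptions, not the operations defined in the paper.

```python
from collections import defaultdict


def build_inverted_file(descriptions):
    """Map each descriptor to a {document id: weight} posting dict."""
    inverted = defaultdict(dict)
    for doc_id, weighted_terms in descriptions.items():
        for term, weight in weighted_terms.items():
            inverted[term][doc_id] = weight
    return dict(inverted)


def lookup(inverted, term, narrower, min_weight=0.0):
    """Collect postings for a descriptor and all descriptors below it.

    `narrower` maps a descriptor to its narrower descriptors in the
    thesaurus hierarchy. A document reached through several descriptors
    keeps its largest weight.
    """
    result = {}
    stack, seen = [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for doc_id, weight in inverted.get(current, {}).items():
            if weight >= min_weight:
                result[doc_id] = max(weight, result.get(doc_id, 0.0))
        stack.extend(narrower.get(current, []))
    return result


# Hypothetical document descriptions and a one-level hierarchy.
descriptions = {
    "d1": {"retrieval": 0.8},
    "d2": {"document retrieval": 0.5, "indexing": 0.9},
}
narrower = {"retrieval": ["document retrieval"]}
inverted = build_inverted_file(descriptions)
hits = lookup(inverted, "retrieval", narrower)
strong_hits = lookup(inverted, "retrieval", narrower, min_weight=0.6)
```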


Journal ArticleDOI
TL;DR: For a certain class of information systems, the normal multiplication table method yields far more rapid retrieval with a more economical space requirement than conventional systems, and incorporates an improved modification of the inverted file technique.
Abstract: This paper describes a method for the organization and retrieval of attribute based information systems, using the normal multiplication table as a directory for the information system. Algorithms for the organization and retrieval of information are described. This method is particularly suitable for queries requesting a group of information items, all of which possess a particular set of attributes (and possibly some other attributes as well). Several examples are given; the results with respect to the number of disk accesses and disk space are compared to other common approaches. Algorithms evaluating the appropriateness of the above approach to a given information system are described. For a certain class of information systems, the normal multiplication table method yields far more rapid retrieval with a more economical space requirement than conventional systems. Moreover this method incorporates an improved modification of the inverted file technique.

Journal ArticleDOI
TL;DR: The conclusion is that, with a combination of advances in communications technology and sophisticated indexing input from librarians and information scientists, the new generation of automated micrographic devices may constitute the on-line document retrieval systems of the future.
Abstract: This paper notes the benefits accruing from interaction between computerized retrieval systems and micrographic retrieval systems. It reviews the current state of automated micrographic retrieval technology. The conclusion is that, with a combination of advances in communications technology and sophisticated indexing input from librarians and information scientists, the new generation of automated micrographic devices may constitute the on-line document retrieval systems of the future.

Proceedings ArticleDOI
01 Jan 1979
TL;DR: The paper describes a project-oriented course in information retrieval that stresses query languages, and highlights the use and impact of interactive computing, the choice of a project implementation language, and the relationship of the course to an individual's transition from student to professional; it concludes with a comparison of project grade and the associated computer development time.
Abstract: In the decade since Curriculum '68 [1], the suggested structure of courses related to data management has evolved, as evidenced by the report of the ACM committee on curriculum in 1977 [2] and also noted by Dale [3]. A course in Curriculum '68 entitled “Information Organization and Retrieval” [IOR] does not appear in the 1977 report, while a new course in “File Processing” [FP] is included. Influenced by Curriculum '68, N.C. State in 1970 instituted a senior-level course entitled “Information Retrieval” to correspond essentially to the IOR course. Over the years that course in information retrieval has changed gradually, as material related to document retrieval has been supplanted by material related to file organization. Although the title has remained constant, the content is now more similar to FP than IOR. This paper describes the current project-oriented course in information retrieval which stresses the importance of query languages in an information retrieval system. In addition, the paper highlights the use and impact of interactive computing, the choice of a project implementation language, and the relationship of the course to an individual's transition from student to professional. The paper concludes with a comparison of project grade and the associated computer development time.

Journal ArticleDOI
01 Sep 1979
TL;DR: Document retrieval system models are presented and measures to rank the closeness of documents to a query are given.
Abstract: Document retrieval system models are presented. Measures to rank the closeness of documents to a query are given. Algorithms to calculate the measures for graph and partition models are provided.

Journal ArticleDOI
01 Sep 1979
TL;DR: The main conclusion is that models which concentrate on improving the effectiveness of the search process are not rendered redundant by the availability of new hardware; however, the efficiency of their implementation would be improved.
Abstract: Recently several models of the search process in a document retrieval system have been proposed and retrieval experiments have shown that they will improve system performance. These include models which use relevance judgements to rank documents in order of probability of relevance and models of retrieval from clusters of documents. In this paper various models are compared in terms of the ease with which they could be implemented. An important consideration is how this implementation would be affected by the introduction of new hardware such as content-addressable memories. The main conclusion is that models which concentrate on improving the effectiveness of the search process are not rendered redundant by the availability of new hardware. However, the efficiency of their implementation would be improved.

Journal ArticleDOI
08 Jan 1979
TL;DR: It is argued that, although modern on-line systems have more than achieved the technological aims of the original workers in the field, there is still a need for research in automatic information retrieval.
Abstract: Automatic information retrieval, that is document retrieval, was an early concern in computing. It might, however, be thought that since modern on-line systems have more than achieved the technological aims of the original workers in the field, there is no further need for research. I shall argue that this is not the case.

01 Jan 1979
TL;DR: A standard approach is introduced for the representation of information content in data base and document retrieval environments and the use of composite concept vectors representing individual information items leads to a uniform system in different retrieval situations.
Abstract: A standard approach is introduced for the representation of information content in data base and document retrieval environments. The use of composite concept vectors representing individual information items leads to a uniform system in different retrieval situations for the identification of answers in response to incoming information requests.


Journal ArticleDOI
Robert T. Dattola
01 Sep 1979
TL;DR: It is shown that regular discrimination values are too costly to compute after every update to the data base and dynamic discrimination values that are easy to update are defined for use as approximations to regular values.
Abstract: The use of discrimination values as a term weighting function in document retrieval systems is examined. It is shown that regular discrimination values are too costly to compute after every update to the data base. Dynamic discrimination values that are easy to update are defined for use as approximations to regular values. Experiments are performed comparing regular vs. dynamic discrimination values. Actual user queries from an operational data base are used to evaluate dynamic discrimination values in a production environment. Generalized forms of normalized recall and precision are used as evaluation measures. Retrieval results indicate statistically significant improvements using dynamic discrimination weighting.