
Showing papers on "Document retrieval" published in 1973


Journal Article
Gerard Salton
TL;DR: An attempt is made to identify those automatic procedures which appear most effective as a replacement for the missing language analysis procedures, and it is shown that the fully automatic methodology is superior in effectiveness to the conventional procedures in normal use.
Abstract: Many experts in mechanized text processing now agree that useful automatic language analysis procedures are largely unavailable and that the existing linguistic methodologies generally produce disappointing results. An attempt is made in the present study to identify those automatic procedures which appear most effective as a replacement for the missing language analysis. A series of computer experiments is described, designed to simulate a conventional document retrieval environment. It is found that a simple duplication, by automatic means, of the standard, manual document indexing and retrieval operations will not produce acceptable output results. New mechanized approaches to document handling are proposed, including document ranking methods, automatic dictionary and word list generation, and user feedback searches. It is shown that the fully automatic methodology is superior in effectiveness to the conventional procedures in normal use.

50 citations


Journal Article
TL;DR: This study tends to support the conclusion of Sparck-Jones that weighted index terms provide better retrieval performance than unweighted terms, and concludes that the results are highly dependent upon the document collection, and the technique should be employed with caution.
Abstract: The objectives of this paper are to describe the effect of using weighted index terms in a document retrieval system, and to evaluate retrieval performance when queries are expanded by terms occurring in clusters with the query terms. Three data collections, each indexed by several methods, two of which were studied and reported on in previous work, are used to develop explicit results. The study both expands upon and extends previous work at the University of Maryland. The effect of weighting index terms in the document collection, the queries and the formation of clusters is analyzed. Eight cases are investigated in which index terms are weighted and unweighted. The best results are obtained when weighted index terms are used in forming clusters, in queries, and in documents. In this case, the results on the new collection demonstrate a significant improvement in retrieval performance relative to the performance with the unmodified data base, when clustered terms are added to queries. The improvement is in contrast to the results in the previous study, where a degradation in performance, or at best an insignificant improvement, was obtained. Comparisons are made to related work by Sparck-Jones and her colleagues. This study tends to support the conclusion of Sparck-Jones that weighted index terms provide better retrieval performance than unweighted terms. The cluster addition of index terms to queries yields unpredictable results. Some collections show an improvement in retrieval performance, others a degradation or no change in performance. Sparck-Jones obtained an improvement in retrieval performance for her document collection. We conclude that the results are highly dependent upon the document collection, and the technique should be employed with caution.
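
The two ideas in this abstract, weighted term matching and the addition of clustered terms to queries, can be illustrated with a minimal sketch. The inner-product scoring, the `added_weight` parameter, and the sample terms are assumptions made for illustration; the abstract does not state which matching function or expansion weights the study used.

```python
def match_score(query, document):
    """Inner product of query and document term weights (assumed matching function)."""
    return sum(weight * document.get(term, 0.0) for term, weight in query.items())


def expand_query(query, clusters, added_weight=0.5):
    """Add every term that shares a cluster with at least one query term.

    `added_weight` is a hypothetical weight given to the added terms;
    original query terms keep their own weights.
    """
    expanded = dict(query)
    for cluster in clusters:
        if any(term in query for term in cluster):
            for term in cluster:
                expanded.setdefault(term, added_weight)
    return expanded


# Hypothetical data: one query term, two term clusters, one weighted document.
query = {"retrieval": 1.0}
clusters = [{"retrieval", "search"}, {"file", "storage"}]
document = {"search": 2.0, "file": 1.0}

expanded = expand_query(query, clusters)
original_score = match_score(query, document)
expanded_score = match_score(expanded, document)
```

Setting every weight to 1.0 reduces the same functions to unweighted matching, which makes it easy to compare the weighted and unweighted cases on one collection.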

14 citations


Proceedings Article
04 Nov 1973

11 citations


01 Jan 1973
TL;DR: A formal model for keyword based file structures is proposed by which the concept of storage cell is defined and from which not only the frequently-used structures such as indexed sequential, multilist, and inverted files, but also the more recent cellular multilists can be derived.
Abstract: A formal model for keyword based file structures is proposed by which the concept of storage cell is defined and from which not only the frequently-used structures such as indexed sequential, multilist, and inverted files, but also the more recent cellular multilist files can be derived. The cellular multilist file enables the user to have an effective control over the storage medium in terms of storage utilization and record retrieval strategy. An algorithm is provided for retrieving records from file structures derivable from the model. The access algorithm is characterized by the following: (1) It retrieves all records satisfying a query from one storage cell before it retrieves records from other storage cells for the same query. (2) It selects, for each storage cell, the smallest set of records which could possibly satisfy a given query for retrieval. (3) It determines, for inverted files, exactly those records which satisfy a given query prior to record retrieval. (Author)
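
Property (3) can be illustrated with a small sketch of an inverted file in which the set of matching records is computed from the keyword index alone, before any record is fetched. This is not the paper's algorithm: the conjunctive (all keywords required) query form, the function names, and the sample records are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_file(records):
    """Map each keyword to the set of ids of the records that carry it."""
    index = defaultdict(set)
    for record_id, keywords in records.items():
        for keyword in keywords:
            index[keyword].add(record_id)
    return index


def matching_ids(index, query_keywords):
    """Return ids of records carrying every query keyword, using only the index."""
    postings = [index.get(keyword, set()) for keyword in query_keywords]
    if not postings:
        return set()
    # Start from the shortest posting set so intermediate results stay small.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


# Hypothetical records, each described by a set of keywords.
records = {
    1: {"retrieval", "index"},
    2: {"retrieval", "file"},
    3: {"file", "index", "retrieval"},
}
index = build_inverted_file(records)
hits = matching_ids(index, ["retrieval", "index"])
```

Only the ids in `hits` would then need to be read from storage, which is the sense in which the answer is known prior to record retrieval.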

4 citations


Journal Article
Gerard Salton
TL;DR: A comparison is made between the automatic text processing methods incorporated into the SMART system and a manual search using the classified index to Time, indicating that equivalent retrieval results are obtainable when both the manual and the automatic searches are carried out in a feedback mode.

2 citations


Journal Article
TL;DR: The system to be described involves a relatively modest capital expenditure for equipment, provides a facility which is immediately accessible even for a small department, and involves insignificant operating costs as far as material is concerned.
Abstract: Beginnings. The Programs in Occupational Therapy at the University of Western Ontario began setting up a Journal Library for the use of students and faculty in August, 1971. This procedure entailed obtaining back issues of Journals of Occupational Therapy for review and cataloguing of articles which they contained. It was also necessary to read and catalogue articles in current Journals of Occupational Therapy, Medicine, Surgery, Orthopaedics, Developmental Pediatrics, Psychiatry, Neurology, Rheumatology, and many others. (Please see Appendix i, for list of journals presently in library.) Recognition of a Problem. During the cataloguing stage, it became evident that most articles reviewed would be cross-referenced at least twice, and a relatively large number would be cross-referenced into as many as six or more categories. To pursue the establishment of an adequate cataloguing system using the traditional "one card for each cross reference category" would have been impractical for our purposes in terms of space required to store the card catalogue system. Solving the Problem. A general solution is to classify all relevant subject areas and sub-areas by a detailed coding system designed to fit a specified retrieval system where articles can simply be added in order of accession without the necessity of filing and cross-filing by subject classification. Examples of such retrieval systems are edge-punched cards and the use of large computers, but any system has disadvantages as well as advantages. For example, an edge-punched card system can become unwieldy to handle when large numbers are involved, and one runs into the problem of the physical deterioration of the cards with extensive use. Large computers have the undeniable advantages of a very large storage capacity and extremely rapid retrieval, but there may be difficulties and delays in access, and not inconsiderable costs involved even when use is made of already existing computer facilities.
The system to be described involves a relatively modest capital expenditure for equipment, provides a facility which is immediately accessible even for a small department, and involves insignificant operating costs as far as material is concerned. Recent years have seen the development of programmable desk-top electronic calculators which in their more sophisticated forms function as mini-computers.

1 citation


Proceedings Article
27 Aug 1973
TL;DR: It is shown that for the set of parameters and indexed document collection used, two of the techniques performed worse than the initial queries and that the greatest gains in precision for the others occur at low levels of recall.
Abstract: This paper describes an experiment in which seven relevance feedback document retrieval techniques are compared. It is shown that for the set of parameters and indexed document collection used, two of the techniques performed worse than the initial queries and that the greatest gains in precision for the others occur at low levels of recall. At the extreme recall levels no significant differences were found among any of the techniques while over all recall levels no significant differences were found among five of the techniques.
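
The comparison in this abstract is stated in terms of precision at different levels of recall. As a reminder of how such points are obtained from a ranked output, here is a minimal sketch using the standard definitions (precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that are retrieved). The function name and the sample ranking are hypothetical, and the paper's own evaluation procedure may differ.

```python
def precision_recall_points(ranking, relevant):
    """Return one (recall, precision) pair at each rank where a relevant document appears."""
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


# Hypothetical ranked output and relevance judgments for one query.
ranking = ["d3", "d7", "d1", "d9"]
relevant = {"d3", "d1"}
points = precision_recall_points(ranking, relevant)
```

A feedback technique that moves relevant documents toward the top of the ranking raises the precision recorded at the low-recall points first, which is where the abstract reports the greatest gains.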

1 citation


Journal Article
TL;DR: The theoretical model of a document retrieval system is described by giving logical meaning to the "vagueness" and the "similarity", with the frequency of occurrences of keywords and the relations among them taken into consideration.

1 citation



Book Chapter
08 Oct 1973
TL;DR: The objective of on-line information retrieval systems is the retrieval of information residing in the system’s data base which is relevant to a query of the system.
Abstract: The objective of on-line information retrieval systems is the retrieval of information residing in the system’s data base which is relevant to a query of the system. A query is specified as a set of attributes and the system is expected to retrieve only the information described by (or relevant to) the query. Such systems are being utilized in a variety of applications including medical information systems, automatic document retrieval, computer aided instruction, management information systems, banking systems, and inventory control.