
Showing papers on "Document retrieval" published in 1980


Journal ArticleDOI
TL;DR: A probabilistic model of cluster searching based on query classification is described and it is tested with retrieval experiments which indicate that it can be more effective than heuristic cluster searches and cluster searches based on other models.

164 citations


Journal ArticleDOI
TL;DR: A linkage similarity measure which takes into account both the bibliographic coupling of documents and their cocitations produced improved document retrieval over a measure based only on bibliographic coupling.
Abstract: A linkage similarity measure which takes into account both the bibliographic coupling of documents and their cocitations (both cited and citing papers) produced improved document retrieval over a measure based only on bibliographic coupling. The test collection consisted of 1712 papers whose relevance to specific queries had been judged by users. To evaluate the effect of using cocitation data, we calculated for each query two measures of similarity between each relevant paper and every other paper retrieved. Papers were then sorted by the similarity measures, producing two ordered lists. We then compared the resulting predictions of relevance, partial relevance, and non-relevance to the user's evaluations of the same papers. Overall, the change from the bibliographic coupling measure to the linkage similarity measure, representing the introduction of cocitation data, resulted in better retrieval performance.

67 citations
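
The two quantities named in the abstract above can be illustrated with a minimal sketch. The abstract does not give the formula of the linkage similarity measure, so the unweighted sum used here is an assumption made purely for illustration, and the toy citation graph and all function names are hypothetical.

```python
def bibliographic_coupling(refs, a, b):
    """Number of references that papers a and b have in common."""
    return len(refs.get(a, set()) & refs.get(b, set()))


def cocitation(refs, a, b):
    """Number of papers whose reference lists contain both a and b."""
    return sum(1 for cited in refs.values() if a in cited and b in cited)


def linkage_similarity(refs, a, b):
    # Assumption: an unweighted sum of the two counts. The abstract only
    # says that both kinds of link are taken into account.
    return bibliographic_coupling(refs, a, b) + cocitation(refs, a, b)


def rank_by(measure, refs, seed, candidates):
    """Sort candidates by decreasing similarity to a known relevant paper."""
    return sorted(candidates, key=lambda p: measure(refs, seed, p), reverse=True)


# Hypothetical toy data: paper -> set of papers it cites.
refs = {
    "p1": {"r1", "r2", "r3"},
    "p2": {"r2", "r3"},
    "p3": {"r4"},
    "c1": {"p1", "p2"},
    "c2": {"p1", "p2", "p3"},
    "c3": {"p1", "p3"},
}

coupling_order = rank_by(bibliographic_coupling, refs, "p1", ["p3", "p2"])
linkage_order = rank_by(linkage_similarity, refs, "p1", ["p3", "p2"])
```

In this toy graph, p3 shares no references with p1 and so scores zero under coupling alone, but it still receives a non-zero linkage score because two later papers cite both.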


Journal ArticleDOI
TL;DR: A searching algorithm is suggested that helps the inquirer searching for documents on a large interactive system to avoid the systematic biases that cause queries to be constructed and modified inefficiently.
Abstract: The way that individuals construct and modify search queries on a large interactive document retrieval system is subject to systematic biases similar to those that have been demonstrated in experiments on judgments under uncertainty. These biases are shared by both naive and sophisticated subjects and cause the inquirer searching for documents on a large interactive system to construct and modify queries inefficiently. A searching algorithm is suggested that helps the inquirer to avoid the effect of these biases.

53 citations


Journal ArticleDOI
John O'Connor
TL;DR: The present passage-retrieval experiment involved a greater variety of forms of retrieval question, with search words selected independently by two different people for each question; average recall ratios were 72% and 67% for the two selectors.
Abstract: Passage retrieval (already operational for lawyers) has advantages in output form over reference retrieval and is economically feasible. Previous experiments in passage retrieval for scientists have demonstrated recall and false retrieval rates as good as or better than those of present reference retrieval services. The present experiment involved a greater variety of forms of retrieval question. In addition, search words were selected independently by two different people for each retrieval question. The search words selected, in combination with the computer procedures used for passage retrieval, produced average recall ratios of 72 and 67%, respectively, for the two selectors. The false retrieval rates were (except for one predictably difficult question) respectively 13 and 10 falsely retrieved sentences per answer-paper retrieved.

52 citations


Journal ArticleDOI
TL;DR: An automatic document clustering procedure is described which does not require the use of an inter-document similarity matrix and which is independent of the order in which the documents are processed.
Abstract: An automatic document clustering procedure is described which does not require the use of an inter-document similarity matrix and which is independent of the order in which the documents are processed. The procedure makes use of an initial set of clusters which is derived from certain of the terms in the indexing vocabulary used to characterise the documents in the file. The retrieval effectiveness obtained using the clustered file is compared with that obtained from serial searching and from use of the single-linkage clustering method.

50 citations


Journal ArticleDOI
Salton
TL;DR: Advances such as specialized parallel hardware and new algorithms for text searching will improve the effectiveness of information retrieval systems.
Abstract: Advances such as specialized parallel hardware and new algorithms for text searching will improve the effectiveness of information retrieval systems.

41 citations


Proceedings ArticleDOI
23 Jun 1980
TL;DR: The research described here explores the use of subject area knowledge in an 'expert' document retrieval system, and the goals of the research are to characterise the semantics of information retrieval requests, and to develop methods for representing and using subject area knowledge in computer retrieval systems.
Abstract: Recent interest in computer representation of knowledge has led to the development of 'expert' computer assistants for tasks such as medical diagnosis, technical instruction, and problem solving in restricted domains (Brown and Burton, 1975; Davis, Buchanan and Shortliffe, 1977; Goldstein and Roberts, 1977). Each of these systems is based on a semantic model of a specific subject area, along with some general methods for using subject area knowledge to understand and respond to users' requests. The emphasis of these systems is not on 'solving' problems by computer, but rather on helping a human problem solver organise and apply a complex body of knowledge. The research described here explores the use of subject area knowledge in an 'expert' document retrieval system. The goals of the research are to characterise the semantics of information retrieval requests, and to develop methods for representing and using subject area knowledge in computer retrieval systems. The Legal Research System (LRS) is a knowledge-based computer retrieval system, intended to be used by lawyers and legal assistants to retrieve information about court decisions (cases) and laws passed by legislatures (statutes). The subject of its knowledge is Negotiable Instruments Law, an area of Commercial Law that deals with cheques and promissory notes (White and Summers, 1972; Speidel, Summers and White, 1974). The current implementation of the system (Hafner, 1978) has a database of about 200 statutes from the Uniform Commercial Code (American Law Institute, 1972) and 200 related cases. In LRS four kinds of knowledge about legal concepts and relationships are represented: functional knowledge, structural knowledge, semantic knowledge and factual knowledge. In this chapter the motivation for including each kind of knowledge is discussed, the computer representation of each kind of knowledge is described and examples of the use of each kind of knowledge in LRS are presented.
The next section gives a very brief overview of current legal retrieval systems, both manual and automated. Subsequent sections describe the representation of knowledge in LRS, and the use of this knowledge to understand and interpret user queries.

25 citations


Journal ArticleDOI
TL;DR: A method for measuring the retrieval performance of book indexes is described; a preliminary application to the subject indexing of two major encyclopedias showed one encyclopedia apparently superior in both the finding and discrimination abilities of retrieval performance.
Abstract: The retrieval performance of book indexes can be measured in terms of their ability to direct a user selectively to text material whose identity but not location is known. The method requires human searchers to base their searching strategies on actual passages from the book rather than on test queries, natural or contrived. It circumvents the need for relevance judgment, but still yields performance indicators that correspond approximately to the recall and precision ratios of large document retrieval system evaluation. A preliminary application of the method to the subject indexing of two major encyclopedias showed one encyclopedia apparently superior in both the finding and discrimination abilities of retrieval performance. The method is presently best suited for comparative testing since its ability to yield absolute or reproducible measures is as yet not established.

9 citations


Proceedings ArticleDOI
23 Jun 1980
TL;DR: In the subsequent sections the clustering of document representations and search request formulations will be identified with the clustering of documents and queries, respectively.
Abstract: One of the most essential parameters of any information retrieval system is the time taken to retrieve answers to particular queries submitted to it. This quantity is especially important for information systems with large-sized document collections and/or in cases when immediate response to the user's query is required (for example, in on-line information retrieval systems). As a result of investigations aimed at shortening the retrieval time, a number of information retrieval methods have been developed (see, for example, Salton, 1968, 1971, 1975; van Rijsbergen, 1979). Among them one can distinguish a class of numerous information retrieval methods based on the clustering of document representations. In these cases the mutual similarity between document representations, determined in a direct way, is used to cluster the document representations. An alternative competitive class utilises previously created clusters of search request formulations for clustering the document representations. For simplicity, in the subsequent sections the clustering of document representations and search request formulations will be identified with the clustering of documents and queries, respectively. Information retrieval methods of the latter type developed some years ago (Lesser, 1966; Salton, 1968, 1975; Worona, 1971; Yu, 1974) can be applied only to those information systems in which both search request formulations and document representations are sets of

8 citations


Journal ArticleDOI
TL;DR: It is shown that as a control system, BC is subject to the laws of cybernetics and only the descriptive, transcriptive, and ordering functions of a BC system can be subjected to full control governed by generally applicable rules.
Abstract: The concept of bibliographic control (BC) is explored from its origin to its development into Universal Bibliographic Control (UBC). It is analyzed as to its functions and operations, namely (a) the form-oriented or descriptive function, (b) the transcription of descriptive data onto a document surrogate, (c) the sequential ordering of these surrogates, and (d) the content-oriented or exploitative function. It is shown that as a control system, BC is subject to the laws of cybernetics. Only the descriptive, transcriptive, and ordering functions of a BC system can be subjected to full control governed by generally applicable rules, while the content-oriented retrieval function, being based on subjective judgments of relevance by indexers and ultimate users, is not completely controllable. The attainable limits of BC and UBC can thus be established.

7 citations



Proceedings ArticleDOI
19 Jun 1980



Proceedings ArticleDOI
23 Jun 1980
TL;DR: This chapter is a preliminary study on an ongoing research effort devoted to the development of backend machine architecture for large textual databases or information retrieval systems.
Abstract: A textual database can be loosely defined to be a group of related documents, each containing an essentially unstructured string of characters and symbols, which describe some information in English or any other high-level natural language on a specific subject matter by use of a set of words, phrases and sentences that depend very much on the subject matter and the intended use of the document. Such databases cover a wide range of applications, viz. libraries, newspapers, medical diagnostics, abstracts of papers and dissertations, legal case reports, military and intelligence reports, etc. With the advent of high-density memory technology and the availability of computerised typesetting and machine reading technology, an explosion in the growth of such databases for a variety of applications may be anticipated. This chapter is a preliminary study on an ongoing research effort devoted to the development of backend machine architecture for large textual databases or information retrieval systems. Search and retrieval operations on textual databases are complicated by the very nature of information that they contain. The text databases show a wide variation in formatting, and have a large number of oddities, redundancies, non-informational words and context-dependent as well as spelling ambiguities; the range of potential queries is unrestricted. There is no data model that is applicable to develop a structured approach to search and retrieval, as is the case for conventional databases (viz. relational or hierarchical). Conventional machine architectures and software systems perform search and retrieval operations on such databases using a combination of inversion of text to produce an inverted list and sequential scan on the secondary storage media. These systems are inherently slow, because the machines do not have built-in hardware to do high-speed pattern matching, searching, sorting or retrieval operations.
Furthermore, the phenomenon of a 'von Neumann bottleneck' between the CPU and the main memory and the data transportation problem over a bandwidth-limited channel which uses complicated navigational procedures to locate data on serial-access bulk storage add to this slow performance and inefficiency. Furthermore, the inverted file system may add as much as 300 per cent storage overhead and needs rather time-

Journal ArticleDOI
01 Aug 1980
TL;DR: A model is proposed for estimating the total number of relevant documents in a collection for a given query: candidate curves are fit to precision as a function of document rank, and the equation with the best fit satisfying certain constraints is used for the estimate.
Abstract: A model is proposed for estimating the total number of relevant documents in a collection for a given query. The total number of relevant documents is needed in order to compute recall values for use in evaluating document retrieval systems. If x represents document rank and y represents precision, then one of the following functions is fit to the points obtained by plotting precision vs. document rank after each retrieved document:
1. $y = A e^{Bx}$ (exponential)
2. $y = A x^{B}$ (power)
3. $y = A - B/x$ (hyperbolic)
4. $y = 1/(A + Bx)$ (hyperbolic)
5. $y = x/(A + Bx)$ (hyperbolic)
That equation with the best fit satisfying certain constraints is used to estimate the total number of relevant documents for any given query. Experimental comparisons of this best fit are made with random sampling methods.
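
As a rough illustration of fitting the five candidate forms listed in the abstract, the sketch below linearises each form and applies ordinary least squares, then keeps the form with the smallest squared error in the original precision scale. The abstract states neither the fitting method, nor the constraints, nor how the fitted curve is turned into an estimate of the number of relevant documents, so linearised least squares is an assumption and those later steps are left out. All function names are hypothetical, and the sketch assumes every precision value is strictly positive.

```python
import math


def linear_fit(u, v):
    """Ordinary least squares for v = a + b*u; returns (a, b)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    sxy = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    sxx = sum((ui - mu) ** 2 for ui in u)
    b = sxy / sxx
    return mv - b * mu, b


def fit_candidates(x, y):
    """Fit each of the five forms; returns {name: (A, B, predictor)}."""
    fits = {}
    # 1. y = A*exp(B*x)   ->  ln y = ln A + B*x
    a, b = linear_fit(x, [math.log(yi) for yi in y])
    A, B = math.exp(a), b
    fits["exponential"] = (A, B, lambda t, A=A, B=B: A * math.exp(B * t))
    # 2. y = A*x**B       ->  ln y = ln A + B*ln x
    a, b = linear_fit([math.log(xi) for xi in x], [math.log(yi) for yi in y])
    A, B = math.exp(a), b
    fits["power"] = (A, B, lambda t, A=A, B=B: A * t ** B)
    # 3. y = A - B/x      ->  y is linear in 1/x with slope -B
    a, b = linear_fit([1 / xi for xi in x], y)
    fits["hyperbolic_3"] = (a, -b, lambda t, A=a, B=-b: A - B / t)
    # 4. y = 1/(A + B*x)  ->  1/y = A + B*x
    a, b = linear_fit(x, [1 / yi for yi in y])
    fits["hyperbolic_4"] = (a, b, lambda t, A=a, B=b: 1 / (A + B * t))
    # 5. y = x/(A + B*x)  ->  x/y = A + B*x
    a, b = linear_fit(x, [xi / yi for xi, yi in zip(x, y)])
    fits["hyperbolic_5"] = (a, b, lambda t, A=a, B=b: t / (A + B * t))
    return fits


def best_fit(x, y):
    """Return (name, A, B) of the form with the smallest squared error."""
    fits = fit_candidates(x, y)

    def sse(name):
        predict = fits[name][2]
        return sum((predict(xi) - yi) ** 2 for xi, yi in zip(x, y))

    name = min(fits, key=sse)
    return name, fits[name][0], fits[name][1]
```

Calling `best_fit(ranks, precisions)` with the precision observed after each retrieved document returns the name of the selected form and its two parameters. A fuller treatment would fit directly in the original scale and apply the constraints the abstract mentions.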

Journal ArticleDOI
TL;DR: A corequisite course ensures student background in the design requirements of deterministic data retrieval systems and provides a useful framework in which to explore index language limitations and design features of document retrieval systems which must provide multiple access point...
Abstract: A course in basic information retrieval principles and use of online document retrieval systems is a curriculum requirement for undergraduate computer science students at Stockton State College in New Jersey. A combination of theory‐oriented lectures and online search sessions using DIALOG enables students to observe course principles in action. An undergraduate course of this type differs significantly in content from those normally offered in graduate library schools in one primary area; while most students in graduate library schools are presumably aware of the functions and issues of index languages and library operations, undergraduate computer science students need to be taught the basics of these subjects. A corequisite course ensures student background in the design requirements of deterministic data retrieval systems. This background provides a useful framework in which to explore index language limitations and design features of document retrieval systems which must provide multiple access point...