
Showing papers on "Document retrieval" published in 1977


Journal ArticleDOI
TL;DR: Local clustering is practical also for large databases and appears to improve overall performance, especially if metrical constraints and weighting by proximity are embedded in the local feedback.
Abstract: In a full-text natural-language retrieval system, local feedback is the process of formulating a new improved search based on clustering terms from the documents returned in a previous search of any given query. Experiments were run on a database of US patents. It is concluded that, in contrast to global clustering, where the size of matrices limits applications to small databases and improvements are doubtful, local clustering is practical also for large databases and appears to improve overall performance, especially if metrical constraints and weighting by proximity are embedded in the local feedback. The local methods adapt themselves to each individual search and produce useful searchonyms: terms which are "synonymous" in the context of one query. Searchonyms lead to new improved search formulations both via manual and via automatic feedback.

266 citations
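
The local feedback idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the actual clustering procedure, so everything here is an assumption made for illustration: the function name `local_searchonyms`, the fixed `window` used as the metrical constraint, and the `1 / distance` proximity weight.

```python
from collections import defaultdict


def local_searchonyms(docs, query_term, window=5, top_k=3):
    """Rank terms that occur near query_term in the retrieved documents.

    docs is a list of token lists from a previous search (hypothetical input).
    window is the assumed metrical constraint: only terms within this many
    positions of the query term are counted. Closer terms receive a higher
    weight (1 / distance), an assumed form of proximity weighting.
    """
    scores = defaultdict(float)
    for tokens in docs:
        positions = [i for i, t in enumerate(tokens) if t == query_term]
        for p in positions:
            lo = max(0, p - window)
            hi = min(len(tokens), p + window + 1)
            for j in range(lo, hi):
                # Skipping the query term itself also skips j == p.
                if tokens[j] != query_term:
                    scores[tokens[j]] += 1.0 / abs(j - p)
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


retrieved = [
    ["engine", "motor", "valve"],
    ["motor", "engine", "piston"],
]
candidates = local_searchonyms(retrieved, "engine", top_k=2)
```

Because the scores are computed only from the documents returned for one query, the candidate terms adapt to that search, which is the property the abstract contrasts with global clustering.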


Journal ArticleDOI
TL;DR: This paper shows how aboutness is related to probability of satisfaction and shows that about is, in fact, not the central concept in a theory of document retrieval.
Abstract: The primary objective of this paper is to examine the concept of about as it is used in its information retrieval sense when, for example, an indexer judges that a document is (or is not) about some given subject. The problem with about is that it is a very complex notion and we are unable to say precisely what it is we do when we make judgment of aboutness. Since about is at the heart of indexing, how are we to formulate any proper theory of indexing if we cannot explicate precisely the key concept of about? In this paper we look at this concept of about and offer a solution to the problem mentioned; it consists of an operational definition of about which interprets about in terms of search behavior. A second objective of this paper is to show that about is, in fact, not the central concept in a theory of document retrieval. A document retrieval system ought to provide a ranked output (in response to a search query) not according to the degree that they are about the topic sought by the inquiring patron, but rather according to the probability that they will satisfy that person's information need. This paper shows how aboutness is related to probability of satisfaction.

170 citations


Journal ArticleDOI
TL;DR: This paper examines three important and well-known information retrieval experiments, with a focus on certain internal inconsistencies and on the high variability of search results.
Abstract: Recognition of the essential role of trial and error in access to scientific literature may point the way toward improved information services and may illuminate inconsistencies that have beset many retrieval experiments. This paper examines three important and well-known information retrieval experiments, with a focus on certain internal inconsistencies and on the high variability of search results. In these experiments, retrieval systems are evaluated in terms of their ability to select relevant documents and reject those that are irrelevant. It is suggested that this criterion is inadequate because of ambiguities inherent in the concept of relevance and that closer attention to trial-and-error processes may be helpful in developing better criteria. Specific examples of how one might improve document retrieval, library use, and citation indexing are offered.

135 citations


Journal ArticleDOI
TL;DR: The study found that relevant documents were ranked significantly higher than nonrelevant documents in the set of documents retrieved in response to a Boolean query.
Abstract: This study examined the effectiveness and efficiency of employing a fully automatic algorithm for ranking the results of Boolean searches of an inverted file design document retrieval system. The study indicated that with minor modification of file designs, such as those implemented in the Syracuse Information Retrieval Experiment (SIRE), document retrieval systems could efficiently provide users with output lists on which the rank order of a document is a good indicator of its probable relevance to the user's information need. The study found that relevant documents were ranked significantly higher than nonrelevant documents in the set of documents retrieved in response to a Boolean query. By utilizing an augmented inverted file design the variable incremental cost for ranked output was only ten cents per query. There was no increased user effort.

92 citations


Journal ArticleDOI
TL;DR: Considerable evidence exists to show that the use of term relevance weights is beneficial in interactive information retrieval, and various relevance ranking systems are evaluated, including fully automatic systems based on inverse document frequency parameters, and human rankings performed by the user population.
Abstract: Considerable evidence exists to show that the use of term relevance weights is beneficial in interactive information retrieval. Various term weighting systems are reviewed. An experiment is then described in which information retrieval users are asked to rank query terms in decreasing order of presumed importance prior to actual search and retrieval. The experimental design is examined, and various relevance ranking systems are evaluated, including fully automatic systems based on inverse document frequency parameters, human rankings performed by the user population, and combinations of the two.

47 citations
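
As a rough illustration of fully automatic weighting based on inverse document frequency, the sketch below scores documents by summing the weights of the query terms they contain. The abstract does not state a formula, so the common `log(N / n)` form is assumed here, and the function names `idf_weights` and `rank` are hypothetical.

```python
import math


def idf_weights(docs):
    """Return an assumed IDF weight log(N / n) for every term in the collection."""
    n_docs = len(docs)
    doc_freq = {}
    for terms in docs:
        for t in set(terms):
            doc_freq[t] = doc_freq.get(t, 0) + 1
    return {t: math.log(n_docs / n) for t, n in doc_freq.items()}


def rank(docs, query, weights):
    """Order document ids by the summed weight of matching query terms."""
    scored = []
    for doc_id, terms in enumerate(docs):
        present = set(terms)
        score = sum(weights.get(t, 0.0) for t in query if t in present)
        scored.append((score, doc_id))
    # Highest score first; ties broken by document id.
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


collection = [
    ["patent", "search"],
    ["patent", "claim"],
    ["patent", "search", "feedback"],
]
weights = idf_weights(collection)
order = rank(collection, ["patent", "feedback", "search"], weights)
```

A term that appears in every document, such as `patent` in this toy collection, gets weight zero and so does not affect the ranking.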


Journal ArticleDOI
TL;DR: The earlier model is extended to include interactions among terms, which allows one to decide whether to retrieve a document by taking into consideration occurrences of all the words in the text.
Abstract: This paper begins with a review of earlier work in which a model of word occurrence formed the basis of a decision-making procedure for indexing or, more generally, retrieving documents in response to a request. In the earlier work words were considered individually. This paper extends the earlier model to include interactions among terms. The elaborated model allows one to decide whether to retrieve a document by taking into consideration occurrences of all the words in the text. Retrieval in response to Boolean expressions is also considered, as are procedures for ranking documents in accordance with their assessed relevance to a request. The discussion is within the framework of Bayesian decision theory.

45 citations


Proceedings ArticleDOI
01 Jan 1977
TL;DR: The system, called the Associative File Processor (AFP), utilizes a conventional minicomputer for control, off-the-shelf high density disks for storage, a special purpose parallel search module as a text term detector, and query and retrieval software.
Abstract: This paper describes an approach to solving a major problem in the information processing sciences: that of searching very large (5-50 billion characters) data bases of unstructured free-text for random queries within a reasonable time and at an affordable price. The need by information specialists and knowledge workers for large, fast, low-cost text and document retrieval systems is growing rapidly. Conventional approaches to the problem have usually depended upon expensive, general purpose computers, upon special pre-preprocessing of the textual data (e.g. file inverting, indexing, abstracting, etc.), and upon elaborate, costly software. The resulting retrieval systems often cost hundreds of dollars per query and the full scanning of an uninverted, unstructured billion byte textual data base could take hours of computer services. However, in spite of these restrictions, such full text search systems have proved useful and even indispensable for many applications. Computer technology of the late 1960's and the 1970's, in both hardware and software (e.g., minicomputers, low-cost, high density disk storage, "chip" electronics, natural language query systems, etc.), has made it practical to build special purpose, low-cost text retrieval systems. Such a system has been built, tested, and is now in a production stage. The system, called the Associative File Processor (AFP), utilizes a conventional minicomputer (DEC's PDP-11/45) for control, off-the-shelf high density disks for storage, a special purpose parallel search module as a text term detector, and query and retrieval software. The AFP is currently being field tested at two sites. Full text, parallel searches on un-preprocessed textual data bases are being performed at the effective matching rates of 4 billion bytes per second (8K byte key memory times 500 Kbyte/second data stream). Estimated costs are 10 to 25 cents per query for a one billion byte data base. The costs per query and the time for searching increase in a linear fashion as the data base size increases. A basic architecture for the AFP is described and an implemented version is discussed. A more powerful term detector module is also under development. This system is designed around a finite state automaton algorithm.

43 citations
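
The arithmetic behind the quoted matching rate and the linear cost scaling can be checked with a short sketch. Two points are assumptions rather than statements from the abstract: that "8K" means 8 × 1024 bytes while "500 Kbyte" means 500,000 bytes, and that "linear" means directly proportional to data base size. The function names are hypothetical.

```python
# Figures quoted in the abstract; the unit interpretations are assumptions.
KEY_MEMORY_BYTES = 8 * 1024      # "8K byte key memory", assuming K = 1024
STREAM_BYTES_PER_SEC = 500_000   # "500 Kbyte/second", assuming K = 1000


def effective_match_rate(key_bytes, stream_rate):
    """Byte comparisons per second when every key byte is matched in parallel."""
    return key_bytes * stream_rate


def cost_per_query_cents(db_bytes, cents_per_billion_bytes):
    """Per-query cost, assuming it scales in direct proportion to size."""
    return cents_per_billion_bytes * db_bytes / 1e9


rate = effective_match_rate(KEY_MEMORY_BYTES, STREAM_BYTES_PER_SEC)
low = cost_per_query_cents(5e9, 10)    # lower estimate for 5 billion bytes
high = cost_per_query_cents(5e9, 25)   # upper estimate for 5 billion bytes
```

Under these assumptions the product is 4,096,000,000 byte comparisons per second, consistent with the "4 billion bytes per second" figure, and a 5 billion byte data base would cost roughly 50 to 125 cents per query.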


Journal ArticleDOI
TL;DR: The automatic procedure is superior to traditional searching procedures in terms of both recall and precision and probably for more than 80% of the inquiries the need for a documentalist as an intermediary between the user and the system can be avoided.
Abstract: A system is described for the automatic adjustment of queries addressed to information retrieval systems employing a structurised thesaurus for the coordinate indexing of an average of at least five or six descriptors per document. Starting with at least two documents considered by the user as relevant to his inquiry, the system formulates different queries using descriptors occurring in the relevant documents. Results from these queries are presented to the user for relevance assessment as a result of which the most efficient queries are automatically selected and loosened (broadened). The new documents retrieved are again checked for relevance by the user; and with new relevant documents the loop starts again. The result of the automatic procedure is independent of the point of departure. The automatic procedure is superior to traditional searching procedures in terms of both recall and precision. The automatic procedure requires more computing, but probably for more than 80% of the inquiries the need for a documentalist as an intermediary between the user and the system can be avoided.

30 citations


Journal ArticleDOI
TL;DR: The organization of the set of document search patterns proposed in the paper limits the search of that set, when retrieving a response to a given information request, to one (or several) of the previously determined subsets, which makes the information system response time acceptable.
Abstract: Search patterns of documents and information requests represent them only more or less well, so it is important to examine the possibilities of designing self-learning information retrieval systems. Another important question is how to organize the set of document search patterns so as to obtain an acceptable response time of the information system to a given information request. The self-learning process of the proposed information system consists in determining, on a set of document and information request search patterns, the similarity relation according to L. A. Zadeh. The organization of the set of document search patterns proposed in the paper limits the search of that set, when retrieving a response to a given information request, to one (or several) of the previously determined subsets. This makes the information system response time acceptable. The proposed information retrieval strategy is discussed in terms of fuzzy sets.

29 citations


Journal ArticleDOI
TL;DR: Using this equipment, a complicated sample search involving 70 terms and over 67 000 document references can be performed from 13 to 60 times faster than with a conventional machine.
Abstract: Response time in large, inverted file document retrieval systems is determined primarily by the time required to access files of document identifiers on disk and perform the processing associated with a Boolean search request. This paper describes a specialized computer system capable of performing these functions in hardware. Using this equipment, a complicated sample search involving 70 terms and over 67 000 document references can be performed from 13 to 60 times faster than with a conventional machine. Alternatively, many small searches can be processed concurrently with little effect upon system performance. Similar configurations can be applied to standard merging and sorting problems.

25 citations
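
The processing that this paper moves into hardware, combining lists of document identifiers according to a Boolean request, can be sketched in software as linear merges over sorted lists. This is a generic illustration, not the paper's design; the function names and the toy index are hypothetical.

```python
def intersect(a, b):
    """AND: identifiers present in both sorted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """OR: identifiers present in either sorted list, without duplicates."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


# Toy inverted file: term -> sorted list of document identifiers.
index = {
    "retrieval": [1, 3, 5, 8],
    "boolean": [3, 4, 8],
    "hardware": [2, 8],
}
# (retrieval AND boolean) OR hardware
hits = union(intersect(index["retrieval"], index["boolean"]), index["hardware"])
```

Each operation reads its inputs once in order, which is why the same kind of configuration also suits standard merging and sorting problems, as the abstract notes.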


Journal ArticleDOI
01 Sep 1977
TL;DR: Information Storage and Retrieval encompasses a broad scope of topics ranging from basic techniques for accessing data to sophisticated approaches for the analysis of natural language text and the deduction of information.
Abstract: Information Storage and Retrieval (IS&R) encompasses a broad scope of topics ranging from basic techniques for accessing data to sophisticated approaches for the analysis of natural language text and the deduction of information. Within the field, three general areas of investigation can be distinguished not only by their subject matter but also by the types of individuals presently interested in them: (1) Document retrieval, (2) Generalized data management, and (3) Question-answering. A functional description which applies to each of the three areas is presented together with a survey of work being conducted. The similarities and differences of the three areas of IS&R are described. Typical systems which incorporate many of the functions and techniques are described in the appendix.



Journal ArticleDOI
TL;DR: This paper challenges the meaningfulness of precision and recall values as a measure of performance of a retrieval system by advocating the use of a normalised form of Shannon's functions (entropy and mutual information).
Abstract: This paper challenges the meaningfulness of precision and recall values as a measure of performance of a retrieval system. Instead, it advocates the use of a normalised form of Shannon's functions (entropy and mutual information). Shannon's four axioms are replaced by an equivalent set of five axioms which are more readily shown to be pertinent to document retrieval. The applicability of these axioms and the conceptual and operational advantages of Shannon's functions are the central points of the work. The applicability of the results to any automatic classification is also outlined.
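
To make the idea of an information-theoretic performance measure concrete, the sketch below computes entropy and mutual information for a 2×2 retrieved/relevant table. The abstract does not specify the normalisation or the five axioms, so dividing the mutual information by the entropy of the relevance variable is an assumption made here for illustration, as are the function names.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def normalised_mutual_information(table):
    """Mutual information between retrieval and relevance, scaled to [0, 1].

    table[r][c] holds document counts, with rows = (retrieved, not retrieved)
    and columns = (relevant, not relevant). Dividing by the entropy of
    relevance is an assumed normalisation, not taken from the paper.
    """
    total = sum(sum(row) for row in table)
    joint = [[count / total for count in row] for row in table]
    p_retrieved = [sum(row) for row in joint]
    p_relevant = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for r in range(2):
        for c in range(2):
            if joint[r][c] > 0:
                ratio = joint[r][c] / (p_retrieved[r] * p_relevant[c])
                mi += joint[r][c] * math.log2(ratio)
    h_relevant = entropy(p_relevant)
    return mi / h_relevant if h_relevant > 0 else 0.0


perfect = normalised_mutual_information([[5, 0], [0, 5]])
useless = normalised_mutual_information([[2, 2], [2, 2]])
```

In this sketch a system that retrieves exactly the relevant documents scores 1, and one whose output is statistically independent of relevance scores 0.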

Proceedings ArticleDOI
13 Jun 1977
TL;DR: Characteristics which distinguish text retrieval from retrieval of formatted files are discussed, and a computer configuration employing for special purpose processors is described.
Abstract: Retrieval of information from the complete text of large document collections cannot be performed efficiently or rapidly by current general purpose digital computers or by most special purpose rotating memory associative processors frequently proposed for efficient processing of relational databases. Characteristics which distinguish text retrieval from retrieval of formatted files are discussed, and a computer configuration employing for special purpose processors is described.

Journal ArticleDOI
TL;DR: Sixty percent of the data papers in an experiment were retrieved by human-computer text searching, in which the human contribution consisted of selection of search words for input to the computer search.
Abstract: Sixty percent of the data papers in an experiment were retrieved by human-computer text searching, in which the human contribution consisted of selection of search words for input to the computer search. Most of the successful retrieval consisted of identifying within papers those figures containing data asked for by the retrieval questions, and automatically labeling those data within the figures. The retrieval procedures are economically feasible now because they primarily require only that words from figures be in computer-readable form.



Journal ArticleDOI
TL;DR: A system to help computer-naive people deal with their Intelligent Terminal (IT), an envisioned personal mini-computer of great sophistication and power, which will be capable of mediating intelligently between these users and the tools that will be available to them in the Intelligent Terminal.
Abstract: Our objective is to develop a system to help computer-naive people deal with their Intelligent Terminal (IT), an envisioned personal mini-computer of great sophistication and power. Our system will be capable of mediating intelligently between these users and the tools that will be available to them in the Intelligent Terminal: it will provide streamlined instructions on how to use these tools, and it will answer questions about them posed in relatively unconstrained English. The system's teachings will rely heavily on letting people 'learn by doing', by setting up practice sessions under tutorial supervision. Our initial plan includes developing one such Intelligent oN-Line Assistant and Tutor (INLAT) system that will "know" about an editing system, a mail system, and a document retrieval system.

Proceedings Article
06 Oct 1977
TL;DR: This paper will describe implementations of three combinatorial file organization schemes, viz., an inverted filing scheme of order 1 (IFS1), a generalized Hiroshima University balanced filing scheme of order 2 (GHUBFS2), and a filing scheme having consecutive retrieval property with redundancy (CRWR), as a document retrieval system.
Abstract: This paper will describe implementations of three combinatorial file organization schemes, viz., an inverted filing scheme of order 1 (IFS1), a generalized Hiroshima University balanced filing scheme of order 2 (GHUBFS2), and a filing scheme having consecutive retrieval property with redundancy (CRWR), as a document retrieval system. The results of an experiment for evaluating the efficiency of those storage and retrieval schemes will be presented. The characteristic features of those schemes as the number of data grows will also be discussed.