
Showing papers on "Document retrieval" published in 1987


Journal ArticleDOI
TL;DR: A system that provides a number of facilities and search strategies, with an emphasis on domain knowledge used for refining the model of the information need and a browsing mechanism that allows the user to navigate through the knowledge base.
Abstract: The most effective method of improving the retrieval performance of a document retrieval system is to acquire a detailed specification of the user's information need. The system described in this paper, I³R, provides a number of facilities and search strategies based on this approach. The system uses a novel architecture to allow more than one system facility to be used at a given stage of a search session. Users influence the system actions by stating goals they wish to achieve, by evaluating system output, and by choosing particular facilities directly. The other main features of I³R are an emphasis on domain knowledge used for refining the model of the information need, and the provision of a browsing mechanism that allows the user to navigate through the knowledge base.

323 citations


Book
01 Sep 1987
TL;DR: In this article, the authors examined the hypothesis that better representations of document content can be constructed if the content analysis method takes into consideration the syntactic structure of document and query texts, and implemented two methods of automatically generating phrases for use as content indicators.
Abstract: In order for an automatic information retrieval system to effectively retrieve documents related to a given subject area, the content of each document in the system's database must be represented accurately. This study examines the hypothesis that better representations of document content can be constructed if the content analysis method takes into consideration the syntactic structure of document and query texts. Two methods of automatically generating phrases for use as content indicators have been implemented and tested experimentally. The non-syntactic (or statistical) method is based on simple text characteristics such as word frequency and the proximity of words in text. The syntactic method uses augmented phrase structure rules (production rules) to selectively extract phrases from parse trees generated by an automatic syntactic analyzer. Experimental results show that the effect of non-syntactic phrase indexing is inconsistent. For the five collections tested, increases in average precision ranged from 22.7% to 2.2% over simple, single term indexing. The syntactic phrase indexing method was tested on two collections. Precision figures averaged over all test queries indicate that non-syntactic phrase indexing performs significantly better than syntactic phrase indexing for one collection, but that the difference is insignificant for the other collection. More detailed analysis of individual queries, however, indicates that the performance of both methods is highly variable, and that there is evidence that syntax-based indexing has certain benefits not available with the non-syntactic approach. Possible improvements of both methods of phrase indexing are considered. It is concluded that the prospects for improving the syntax-based approach to document indexing are better than for the non-syntactic approach. The PLNLP system was used for syntactic analysis of document and query texts, and for implementing the syntax-based phrase construction rules. The SMART information retrieval system was used for retrieval experimentation.
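
The non-syntactic method in this abstract is described only as relying on word frequency and word proximity. A minimal sketch of that general idea follows; the window size, the pair-count threshold, the stopword list, and the function name `candidate_phrases` are all illustrative assumptions, not the procedure used in the study.

```python
from collections import Counter

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = frozenset({"the", "of", "a", "and", "in", "to"})

def candidate_phrases(text, window=2, min_pair_count=2):
    """Return word pairs that co-occur within `window` positions often enough.

    Assumption: a pair qualifies as a phrase when it is seen at least
    `min_pair_count` times; the abstract does not state the real criteria.
    """
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    pairs = Counter()
    for i, head in enumerate(words):
        # Look only at the next `window` words to enforce proximity.
        for other in words[i + 1 : i + 1 + window]:
            if other != head:
                pairs[tuple(sorted((head, other)))] += 1
    return sorted(p for p, n in pairs.items() if n >= min_pair_count)
```

Pairs are stored in sorted order so that word order inside the window does not matter, which is one simple way to stay independent of syntax.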

230 citations


Journal ArticleDOI
TL;DR: The use of discourse analysis and observation to acquire knowledge about expert problem solving in an information provision environment and an intelligent document retrieval system based on a distributed expert, blackboard architecture are described.
Abstract: This paper is concerned with the use of discourse analysis and observation to elicit expert knowledge. In particular, we describe the use of these techniques to acquire knowledge about expert problem solving in an information provision environment. Our method of analysis has been to make audio-recordings of real-life information interactions between users (the clients) and human intermediaries (the experts) in document retrieval situations. These tapes have then been transcribed and analysed utterance-by-utterance in the following ways: assigning utterances to one of the prespecified functional categories; identifying the specific purposes of each utterance; determining the knowledge required to perform each utterance; grouping utterances into functional and focus-based sequences. The long-term goal of the project is to develop an intelligent document retrieval system based on a distributed expert, blackboard architecture.

190 citations


Proceedings ArticleDOI
Joel L. Fagan
01 Nov 1987
TL;DR: An automatic phrase indexing method based on the term discrimination model is described, and the results of retrieval experiments on five document collections are presented.
Abstract: An automatic phrase indexing method based on the term discrimination model is described, and the results of retrieval experiments on five document collections are presented. Problems related to this non-syntactic phrase construction method are discussed, and some possible solutions are proposed that make use of information about the syntactic structure of document and query texts.

130 citations


Proceedings ArticleDOI
01 Nov 1987
TL;DR: This approach responds to a query by initially treating each hypertext card as a full-text document, which utilizes information about document structure to propagate weights to neighboring cards and produces a ranked list of potential starting points for graphical browsing.
Abstract: Effective information retrieval from large medical hypertext systems will require a combination of browsing and full-text document retrieval techniques. Using a prototype hypertext medical therapeutics handbook, I discuss one approach to information retrieval problems in hypertext. This approach responds to a query by initially treating each hypertext card as a full-text document. It then utilizes information about document structure to propagate weights to neighboring cards and produces a ranked list of potential starting points for graphical browsing.
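
The two-stage idea in this abstract (score each card as a full-text document, then propagate weights to neighbouring cards) can be sketched as below. The term-count scoring, the `damping` factor, and the names `text_score` and `rank_cards` are assumptions made for illustration; the abstract does not specify the actual weighting or propagation rule.

```python
from collections import Counter

def text_score(query, text):
    """Count how many times the query terms occur in a card's text."""
    words = Counter(text.lower().split())
    return float(sum(words[t] for t in query.lower().split()))

def rank_cards(query, cards, links, damping=0.5):
    """Rank cards by own score plus a damped share of linked cards' scores.

    cards: {card_id: text}; links: {card_id: [neighbour ids]}.
    Assumption: one propagation step with a fixed damping factor.
    """
    base = {cid: text_score(query, text) for cid, text in cards.items()}
    final = {
        cid: base[cid] + damping * sum(base[n] for n in links.get(cid, []))
        for cid in cards
    }
    # Highest score first; ties broken by card id for a stable ordering.
    return sorted(final, key=lambda c: (-final[c], c))
```

With this rule a card that matches nothing itself can still rank well when it links to strong matches, which suits its role as a starting point for browsing.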

110 citations


Journal ArticleDOI
Edward A. Fox
TL;DR: It appears that a number of artificial intelligence techniques are needed to best handle such common but complex document analysis and retrieval tasks.
Abstract: The CODER (COmposite Document Expert/Extended/Effective Retrieval) system is a testbed for investigating the application of artificial intelligence methods to increase the effectiveness of information retrieval systems. Particular attention is being given to analysis and representation of heterogeneous documents, such as electronic mail digests or messages, which vary widely in style, length, topic, and structure. Since handling passages of various types in these collections is difficult even for experimental systems like SMART, it is necessary to turn to other techniques being explored by information retrieval and artificial intelligence researchers. The CODER system architecture involves communities of experts around active blackboards, accessing knowledge bases that describe users, documents, and lexical items of various types. The initial lexical knowledge base construction work is now complete, and experts for search and time/date handling can perform a variety of processing tasks. User information and queries are being gathered, and a simple distributed skeletal system is operational. It appears that a number of artificial intelligence techniques are needed to best handle such common but complex document analysis and retrieval tasks.

104 citations


Journal ArticleDOI
TL;DR: In this paper, a computerized intermediary system is proposed to facilitate online document retrieval from large-scale data bases directly by users of the retrieved information, which does not require the user to be knowledgeable or undergo any training in the use of the underlying retrieval system.
Abstract: This paper concerns the provision of a computerized intermediary system to facilitate online document retrieval from large-scale data bases directly by users of the retrieved information. The system does not require the user to be knowledgeable or undergo any training in the use of the underlying retrieval system. The scope for a novel intermediary system relating to recent developments in expert systems has been identified and a system entitled CANSEARCH designed to enable doctors to specify queries to retrieve cancer-therapy-related documents stored in the MEDLINE data base. The design of the intermediary system uses the principle of search space abstraction, employing menu selection from a touch terminal and encapsulating the necessary intermediary expertise using rule-based techniques programmed in PROLOG. CANSEARCH performed well enough to justify the approach taken, suggesting that further development of CANSEARCH and of intermediary systems for document retrieval in other subject areas should be undertaken.

70 citations


Journal ArticleDOI
TL;DR: This experimental system features flexible document retrieval, a distributed architecture, and the capacity to store many very large documents.
Abstract: New technology is changing the way we store documents. This experimental system features flexible document retrieval, a distributed architecture, and the capacity to store many very large documents.

64 citations


Journal ArticleDOI
TL;DR: What an intelligent information retrieval system involves and why expert system techniques might be of interest to the designers of such systems are examined, and the extent to which it is feasible to think of applying Expert System techniques to intelligent retrieval is explored.
Abstract: Researchers have begun to investigate whether “intelligent” information retrieval systems can be built using expert system techniques. This paper will explore what an intelligent information retrieval system involves and examine why expert system techniques might be of interest to the designers of such systems. A brief review will be presented of expert systems research, describing what an expert system is, what it can do (and cannot do), and how this performance is achieved. The emphasis will be on components, architecture, and human-system interaction rather than on specific applications or individual systems. The paper will then explore the extent to which it is feasible to think of applying expert system techniques to intelligent retrieval.

61 citations


Journal ArticleDOI
TL;DR: IOTA is the name of the resulting prototype presented here, which is the first step toward what the authors call an intelligent system for information retrieval, and is based on a procedural expert system acting as the general scheduler of the entire query processing.
Abstract: Recent results in artificial intelligence research are of prime interest in various fields of computer science; in particular we think information retrieval may benefit from significant advances in this approach. Expert systems seem to be valuable tools for components of information retrieval systems related to semantic inference. The query component is the one we consider in this paper. IOTA is the name of the resulting prototype presented here, which is our first step toward what we call an intelligent system for information retrieval. After explaining what we mean by this concept and presenting current studies in the field, the presentation of IOTA begins with the architecture problem, that is, how to put together a declarative component, such as an expert system, and a procedural component, such as an information retrieval system. Then we detail our proposed solution, which is based on a procedural expert system acting as the general scheduler of the entire query processing. The main steps of natural language query processing are then described according to the order in which they are processed, from the initial parsing of the query to the evaluation of the answer. The distinction between expert tasks and nonexpert tasks is emphasized. The paper ends with experimental results obtained from a technical corpus, and a conclusion about current and future developments.

60 citations



Proceedings ArticleDOI
J. Bing
01 Dec 1987
TL;DR: Computerized systems for legal information are traditionally based on text retrieval, which permits documents to be retrieved in authentic form. If a document is enriched by editorial material, the system treats the words in the additional material as the words in the authentic text.
Abstract: Computerized systems for legal information are traditionally based on text retrieval. The retrieval presumes the creation of a search file, indexing in principle all words occurring in the documents, with pointers to the text file, where the documents are stored. The searching is traditionally based on Boolean arguments, which are matched to the search file. Documents are read by accessing the text file through the pointers associated with the indexing terms of the search file. Text retrieval permits the documents to be retrieved in authentic form, without any added editorial material like headnotes, citations, indexing terms etc. In the US, Canada, and Northern Europe the tendency has been to use documents with little or no additional editorial material. When such material has been included, it generally is based on the editorial effort going into making a parallel paper-based version of the same documents. In Latin-speaking countries, there has been a tendency to emphasize intellectual indexing, headnotes or abstracts of cases, and bibliographical information, often described as a “documentary superstructure” of the document. If a document is enriched by editorial material, the traditional retrieval strategies may still be employed. The system treats the words in the additional material as the words in the authentic text.
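
The search file described in this abstract (every word indexed, with pointers back to the stored documents, queried with Boolean arguments) corresponds to what is usually called an inverted index. A minimal sketch, assuming whitespace tokenisation and integer document ids as the pointers; the function names are hypothetical.

```python
def build_search_file(documents):
    """Map each word to the set of document ids (pointers) containing it."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def boolean_and(index, *terms):
    """Documents containing every term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, *terms):
    """Documents containing at least one term."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))
```

Editorial material such as headnotes would simply be appended to a document's text before indexing, which is how its words come to be treated like words of the authentic text.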

Journal ArticleDOI
TL;DR: It is shown that the three tests are not in complete agreement with each other in their evaluation of the degree of clustering tendency present in seven document test collections, and it is suggested that the density test gives the most useful results.
Abstract: The use of automatic classification techniques has been suggested as a means of increasing the effectiveness of document retrieval systems; however, the automatic generation of a classification requires a large amount of computation, and it is thus of importance to know whether this computation will result in material increases in retrieval performance. This paper describes three methods - the overlap test, the nearest neighbour test and the density test - which can be used to measure the degree of clustering tendency in a set of documents. It is shown that the three tests are not in complete agreement with each other in their evaluation of the degree of clustering tendency present in seven document test collections. A comparison of the predicted degree of clustering tendency with the relative effectiveness of cluster and non-cluster searches suggests that the density test gives the most useful results; it also has the advantage that it does not require query and relevance data and can thus be used in a...


Journal ArticleDOI
01 Jun 1987
TL;DR: Text retrieval experiments using three large collections of documents and queries demonstrate the efficiency of the suggested approach to text signatures, fixed-length bit string representations of document content.
Abstract: This paper considers the use of text signatures, fixed-length bit string representations of document content, in an experimental information retrieval system: such signatures may be generated from the list of keywords characterising a document or a query. A file of documents may be searched in a bit-serial parallel computer, such as the ICL Distributed Array Processor, using a two-level retrieval strategy in which a comparison of a query signature with the file of document signatures provides a simple and efficient means of identifying those few documents that need to undergo a computationally demanding, character matching search. Text retrieval experiments using three large collections of documents and queries demonstrate the efficiency of the suggested approach.
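
The two-level strategy in this abstract can be sketched serially as follows: a cheap bitwise signature comparison discards most documents, and only the survivors undergo the exact keyword check. The 64-bit width, the MD5-based bit selection, and the function names are assumptions for illustration; the paper's own signature scheme and its parallel execution on the Distributed Array Processor are not reproduced here.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed width; the abstract only says "fixed-length"

def signature(keywords, bits_per_word=3):
    """Superimpose a few hashed bit positions per keyword into one integer."""
    sig = 0
    for word in keywords:
        for i in range(bits_per_word):
            digest = hashlib.md5(f"{i}:{word.lower()}".encode()).digest()
            sig |= 1 << (int.from_bytes(digest[:4], "big") % SIGNATURE_BITS)
    return sig

def two_level_search(query_keywords, documents):
    """Level 1: signature filter. Level 2: exact check on the survivors."""
    q_sig = signature(query_keywords)
    wanted = {w.lower() for w in query_keywords}
    hits = []
    for doc_id, keywords in documents.items():
        # A document can match only if it has every bit set in the query signature.
        if (signature(keywords) & q_sig) != q_sig:
            continue
        # Signatures can give false positives, so confirm on the real keywords.
        if wanted <= {k.lower() for k in keywords}:
            hits.append(doc_id)
    return hits
```

The filter never rejects a true match, because a document's signature contains all the bits of each of its keywords; it can only let through some documents that the second level then rejects.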

Journal ArticleDOI
TL;DR: The study reveals the extent of subject searching activity, and suggests that this may have been underestimated in previous studies, and proposes that a future online searching environment will encourage a more truly interactive approach to subject searching.
Abstract: Searching behaviour in a university library is studied using a wholistic approach, encompassing the use of bibliographic tools and shelf browsing. The present study is designed as the first half of a ‘before and after’ study to permit the evaluation of the impact of a future online catalogue on users' searching behaviour. A combined methodology was devised: searchers were encouraged to talk aloud during their search, and this information, together with some probing and real time expert interpretation, enabled the experimenter to record the searching activity on a highly structured observation form. The study reveals the extent of subject searching activity, and suggests that this may have been underestimated in previous studies. The analysis of expressed topics, search formulation strategy and documents retrieved reveals the adaptive nature of the subject searching process, whereby the user adapts to the structure of the available tools. The information retrieval task in a traditional library system is tailored by the system to a single, one dimensional, sequential process. It is suggested that a major obstacle to subject searching effectiveness may lie in the lack of interaction between the different possible approaches in the searching process: the indexing language, the classification, and the titles. It is to be hoped that a future online searching environment will encourage a more truly interactive approach to subject searching.

Proceedings ArticleDOI
J. P. Dick
01 Dec 1987
TL;DR: The main body of the paper gives a rundown of the research undertaken for the doctoral dissertation: the design of a model for the retrieval of law cases, with emphasis on the development of a knowledge representation.
Abstract: The main body of the paper gives a rundown of the research undertaken for my doctoral dissertation: the design of a model for the retrieval of law cases, with emphasis on the development of a knowledge representation. The project intersects a number of distinct interest areas: information retrieval, text processing, artificial intelligence, and legal reasoning. In section 2, the areas of intersection are defined.

Journal Article
TL;DR: A comparative study of the advantages and disadvantages of natural language and controlled vocabulary for automated document retrieval.
Abstract: A comparative study of the advantages and disadvantages of natural language and controlled vocabulary for automated document retrieval. The author also analyses the relevance of each method in the context of expert systems or full-text databases.

Patent
30 Dec 1987
TL;DR: In this article, a document storage and retrieval system for storing a document body in the form of image, means for storing text information in a form of a character code string for retrieval, apparatus for executing a retrieval with reference to the text information, and apparatus for displaying a document image relating thereto on a retrieval terminal according to the retrieval result.
Abstract: A document storage and retrieval system for storing a document body in the form of image, means for storing text information in the form of a character code string for retrieval, apparatus for executing a retrieval with reference to the text information, and apparatus for displaying a document image relating thereto on a retrieval terminal according to the retrieval result. Such a form of the system is available for retrieving the full contents of a document and also for displaying the document body printed in a format easy to read straight in the form of image. Users are capable of retrieving documents with arbitrary words and also capable of reading even such a document as is complicated to include mathematical expressions and charts through a terminal in the form of image, the same as on paper. A system is provided wherein the text information for retrieval is extracted automatically from the document image through character recognition. Since a precision of the character recognition has not been satisfactory hitherto, a visual retrieval and correction have been carried out without fail by operators. However, there is no necessity for the operators to attend therefor.

Journal ArticleDOI
TL;DR: The issues involved in the construction of an expert system for retrieval and the solutions adopted by the prototype expert system PLEXUS are described, with particular reference to the semantic processing that takes place.
Abstract: The issues involved in the construction of an expert system for retrieval are described, together with some of the techniques that have been used in artificial intelligence and information science to tackle them. The solutions adopted by the prototype expert system PLEXUS are described, with particular reference to the semantic processing that takes place. The paper concludes with a discussion of continuing issues on which work is currently proceeding.

Journal ArticleDOI
TL;DR: This article presents a conceptual model of the retrieval process of a document-retrieval system, which has been prototypically implemented in modular form to test system response to changes in model parameters.
Abstract: This article presents our conceptual model of the retrieval process of a document-retrieval system. The retrieval mechanism input is an unambiguous intermediate form of a user query generated by the language processor using the method described previously. Our retrieval mechanism uses a two-step procedure. In the first step a list of documents pertinent to the query are obtained from the document database, and then an evidence-combination scheme is used to compute the degree of support between the query and individual documents. The second step uses a ranking procedure to obtain a final degree of support for each document chosen, as a function of individual degrees of support associated with one or more parts of the query. The end result is a set of document citations presented to the user in ranked order in response to the information request. Numerical examples are given to illustrate various facets of the overall system, which has been prototypically implemented in modular form to test system response to changes in model parameters. © 1987 John Wiley & Sons, Inc.

Proceedings ArticleDOI
01 Nov 1987
TL;DR: The proposed NLP techniques are used to develop a request model based on “conceptual case frames” and to compare this model with the texts of candidate documents and statistical searches carried out using dependency and relative importance information derived from the request models indicate that performance benefits can be obtained.
Abstract: Document retrieval systems have been restricted, by the nature of the task, to techniques that can be used with large numbers of documents and broad domains. The most effective techniques that have been developed are based on the statistics of word occurrences in text. In this paper, we describe an approach to using natural language processing (NLP) techniques for what is essentially a natural language problem - the comparison of a request text with the text of document titles and abstracts. The proposed NLP techniques are used to develop a request model based on “conceptual case frames” and to compare this model with the texts of candidate documents. The request model is also used to provide information to statistical search techniques that identify the candidate documents. As part of a preliminary evaluation of this approach, case frame representations of a set of requests from the CACM collection were constructed. Statistical searches carried out using dependency and relative importance information derived from the request models indicate that performance benefits can be obtained.

Journal ArticleDOI
TL;DR: A prototype implementation of a distributed information retrieval system with a number of possible configurations is described together with the experiences gained from this implementation.
Abstract: In this article we discuss the need for distributed information retrieval systems. A number of possible configurations are presented. A general approach to the design of such systems is discussed. A prototype implementation is described together with the experiences gained from this implementation.

Journal ArticleDOI
TL;DR: This paper proposes the use of non-first normal form universal relations to simplify the user interface in DBMIRS.

Proceedings ArticleDOI
01 Nov 1987
TL;DR: Models of document retrieval systems assuming random selection and best-first selection are developed and compared under binary independence and two Poisson independence feature distribution models.
Abstract: Most document retrieval systems based on probabilistic models of feature distributions assume random selection of documents for retrieval. The assumptions of these models are met when documents are randomly selected from the database or when retrieving all available documents. A more suitable model for retrieval of a single document assumes that the best document available is to be retrieved first. Models of document retrieval systems assuming random selection and best-first selection are developed and compared under binary independence and two Poisson independence feature distribution models. Under the best-first model, feature discrimination varies with the number of documents in each relevance class in the database. A weight similar to the Inverse Document Frequency weight and consistent with the best-first model is suggested which does not depend on knowledge of the characteristics of relevant documents.


Journal ArticleDOI
TL;DR: Novel software for displaying documents on graphics screens is described, which provides a fast and simple way for users to peruse documents and to examine the parts that interest them and to provide feedback between the user and the underlying retrieval system.
Abstract: Computers with graphics screens and pointing devices such as the ‘mouse’ provide the opportunity for highly interactive user interfaces. This paper describes some novel software for displaying documents on graphics screens: the software provides a fast and simple way for users to peruse documents and to examine the parts that interest them. One application of the software is as a front end to a document retrieval system, since it provides a way for users to identify quickly the records that are of relevance to them and to provide feedback between the user and the underlying retrieval system.


Journal ArticleDOI
TL;DR: This paper describes the simulation of a nearest neighbour searching algorithm for document retrieval using a pool of microprocessors, and the results support the use of pooled microprocessor systems for searching applications in information retrieval.
Abstract: This paper describes the simulation of a nearest neighbour searching algorithm for document retrieval using a pool of microprocessors. The documents in a database are organised in a multi‐dimensional binary search tree, and the algorithm identifies the nearest neighbour for a query by a backtracking search of this tree. Three techniques are described which allow parallel searching of the tree. A PASCAL‐based, general purpose simulation system is used to simulate these techniques, using a pool of Transputer‐like microprocessors with three standard document test collections. The degree of speed‐up and processor utilisation obtained is shown to be strongly dependent upon the characteristics of the documents and queries used. The results support the use of pooled microprocessor systems for searching applications in information retrieval.
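
The serial core of the search described in this abstract can be sketched as a multi-dimensional binary search tree (k-d tree) with a backtracking nearest-neighbour query. Representing documents as numeric tuples, using squared Euclidean distance, and the names `build_tree` and `nearest` are assumptions for illustration; the three parallel techniques from the paper are not shown.

```python
def build_tree(points, depth=0):
    """Split on one coordinate per level, cycling through the dimensions."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_tree(points[:mid], depth + 1),
        "right": build_tree(points[mid + 1:], depth + 1),
    }

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, best=None):
    """Descend towards the query, then backtrack into subtrees that may still win."""
    if node is None:
        return best
    if best is None or sq_dist(query, node["point"]) < sq_dist(query, best):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Backtrack only if the splitting plane is closer than the best match so far.
    if diff ** 2 < sq_dist(query, best):
        best = nearest(far, query, best)
    return best
```

The backtracking test is where parallelism can help: the near and far subtrees are independent searches, so separate processors could explore them at the same time.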

Journal ArticleDOI
Padmini Das-Gupta
TL;DR: If the two conjuncts are semantically similar then the conjunction is best interpreted as a Boolean OR, otherwise as an AND, which resulted in an algorithm which utilizes semantic information and some syntactic information to obtain the appropriate Boolean interpretation.
Abstract: It is generally recognized that the conjunction “and” plays an ambiguous role in natural language. When considered within the domain of Boolean document retrieval, this ambiguity makes the automatic Boolean interpretation of statements representing information needs a difficult task. The human analyst is able to resolve this ambiguity with relative ease. However, the processes employed appear complex and are not well understood. This article examines a semantic property of the conjunction, i.e., the semantic similarity between the conjuncts with a view to automatically resolving this ambiguity. Specifically, the idea examined is that if the two conjuncts are semantically similar then the conjunction is best interpreted as a Boolean OR, otherwise as an AND. The study resulted in an algorithm which utilizes semantic information and some syntactic information (both of which are derivable from a standard dictionary) to obtain the appropriate Boolean interpretation. The algorithm was successful when evaluated against human decisions. In addition to contributing the algorithm, this article draws attention to the effects of this ambiguity on the derivation of appropriate Boolean search specifications from natural‐language statements representing information needs. © 1987 John Wiley & Sons, Inc.
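
The decision rule stated in this abstract (similar conjuncts suggest OR, dissimilar ones suggest AND) can be sketched as below. Measuring similarity as the Jaccard overlap of dictionary definition words, the 0.25 threshold, and the names `jaccard` and `interpret_and` are illustrative assumptions; the article's actual algorithm and its use of syntactic information are not reproduced.

```python
def jaccard(a, b):
    """Share of distinct words that two word lists have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def interpret_and(left, right, definitions, threshold=0.25):
    """Return "OR" when the two conjuncts look semantically similar, else "AND".

    definitions: {word: [words from its dictionary definition]} (hypothetical data).
    """
    similarity = jaccard(definitions.get(left, []), definitions.get(right, []))
    return "OR" if similarity >= threshold else "AND"
```

For a request such as "cars and trucks", overlapping definitions would yield OR, while "cars and pollution" would yield AND because the definitions share no words.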