
Showing papers on "Document retrieval published in 1988"


Journal ArticleDOI
TL;DR: This article reviews algorithms that allow hierarchic agglomerative clustering methods to be implemented for document retrieval; experimental evidence suggests that nearest neighbor clusters provide a reasonably efficient and effective means of including interdocument similarity information in document retrieval systems.
Abstract: This article reviews recent research into the use of hierarchic agglomerative clustering methods for document retrieval. After an introduction to the calculation of interdocument similarities and to clustering methods that are appropriate for document clustering, the article discusses algorithms that can be used to allow the implementation of these methods on databases of nontrivial size. The validation of document hierarchies is described using tests based on the theory of random graphs and on empirical characteristics of document collections that are to be clustered. A range of search strategies is available for retrieval from document hierarchies and the results are presented of a series of research projects that have used these strategies to search the clusters resulting from several different types of hierarchic agglomerative clustering method. It is suggested that the complete linkage method is probably the most effective method in terms of retrieval performance; however, it is also difficult to implement in an efficient manner. Other applications of document clustering techniques are discussed briefly; experimental evidence suggests that nearest neighbor clusters, possibly represented as a network model, provide a reasonably efficient and effective means of including interdocument similarity information in document retrieval systems.
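The complete linkage method the article singles out can be sketched in a few lines. This is an illustrative pure-Python version, not the efficient implementations the article surveys: representing documents as sparse term-weight dicts and stopping at a similarity threshold are my assumptions, and the pairwise search is quadratic in the number of clusters.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    num = sum(w * b.get(t, 0.0) for t, w in a.items())
    den = (math.sqrt(sum(w * w for w in a.values()))
           * math.sqrt(sum(w * w for w in b.values())))
    return num / den if den else 0.0

def complete_linkage(docs, threshold):
    """Repeatedly merge the two clusters whose *least* similar member
    pair is highest, stopping when no pair exceeds the threshold."""
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # complete linkage: similarity of the worst pair across clusters
                sim = min(cosine(docs[a], docs[b])
                          for a in clusters[i] for b in clusters[j])
                if sim > best:
                    best, pair = sim, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```

For example, with `docs = [{"cat": 1.0, "dog": 1.0}, {"cat": 1.0}, {"car": 1.0}]` and a threshold of 0.5, the first two documents merge (cosine ≈ 0.71) and the third stays a singleton.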

842 citations


Book
01 Dec 1988
TL;DR: This paper considers the situation where no relevance information is available, that is, at the start of the search; based on a probabilistic model, strategies are proposed for the initial search and an intermediate search.
Abstract: Most probabilistic retrieval models incorporate information about the occurrence of index terms in relevant and non‐relevant documents. In this paper we consider the situation where no relevance information is available, that is, at the start of the search. Based on a probabilistic model, strategies are proposed for the initial search and an intermediate search. Retrieval experiments with the Cranfield collection of 1,400 documents show that this initial search strategy is better than conventional search strategies both in terms of retrieval effectiveness and in terms of the number of queries that retrieve relevant documents. The intermediate search is shown to be a useful substitute for a relevance feedback search. Experiments with queries that do not retrieve relevant documents at high rank positions indicate that a cluster search would be an effective alternative strategy.
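The no-relevance-information setting can be illustrated with a Croft/Harper-style initial weight: a constant plus an idf-like component derived only from document frequencies. The exact formula and the 0.5 smoothing below are a common textbook form, assumed here for illustration rather than taken from this paper.

```python
import math

def initial_weights(doc_freqs, N, C=1.0):
    """Initial-search term weights usable before any relevance feedback.

    doc_freqs: term -> number of documents containing the term.
    N: collection size. Rare terms get high weights; a term occurring
    in half the collection gets weight C."""
    return {t: C + math.log((N - n + 0.5) / (n + 0.5))
            for t, n in doc_freqs.items()}

def score(query_terms, doc_terms, weights):
    """Simple best-match score: sum the weights of matching query terms."""
    return sum(weights.get(t, 0.0) for t in query_terms if t in doc_terms)
```

On a Cranfield-sized collection (N = 1400), a term in 10 documents outweighs a term in 700, so documents matching the rarer term rank first.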

453 citations


Journal ArticleDOI
TL;DR: Implementing a popular medical handbook in hypertext underscores the need to study hypertext in the context of full-text document retrieval, machine learning, and user interface issues.
Abstract: Medicine is an ideal domain for hypertext applications and research. Implementing a popular medical handbook in hypertext underscores the need to study hypertext in the context of full-text document retrieval, machine learning, and user interface issues.

395 citations


Journal ArticleDOI
TL;DR: Competing document descriptions are associated with a document and altered over time by a genetic algorithm according to the queries used and relevance judgments made during retrieval.
Abstract: Document retrieval systems are built to provide inquirers with computerized access to relevant documents. Such systems often miss many relevant documents while falsely identifying many non-relevant documents. Here, competing document descriptions are associated with a document and altered over time by a genetic algorithm according to the queries used and relevance judgments made during retrieval.
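The idea of evolving competing document descriptions under relevance judgments can be sketched as below. The fitness function, the crossover and mutation operators, and all rates are illustrative choices of mine, not the paper's design; real adaptive-indexing fitness would come from retrieval behavior over many queries.

```python
import random

def fitness(description, relevant_queries):
    """Illustrative fitness: fraction of relevant-query terms covered."""
    hits = sum(len(description & q) for q in relevant_queries)
    total = sum(len(q) for q in relevant_queries)
    return hits / total if total else 0.0

def evolve(descriptions, relevant_queries, vocab, generations=20, seed=0):
    """Keep the fitter half as parents, recombine, occasionally mutate."""
    rng = random.Random(seed)
    pop = [set(d) for d in descriptions]
    for _ in range(generations):
        pop.sort(key=lambda d: fitness(d, relevant_queries), reverse=True)
        parents = pop[: max(2, len(pop) // 2)]
        children = []
        while len(parents) + len(children) < len(pop):
            a, b = rng.sample(parents, 2)
            # uniform crossover over the union of both parents' terms
            child = {t for t in a | b if rng.random() < 0.5} or set(a)
            if rng.random() < 0.1:  # mutation: toggle one random vocab term
                child ^= {rng.choice(sorted(vocab))}
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda d: fitness(d, relevant_queries))
```

Because the fittest descriptions survive unchanged each generation, the best fitness in the population never decreases.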

252 citations


Book
01 Jan 1988

80 citations


Journal ArticleDOI
TL;DR: A document retrieval system is presented, based upon the vector processing model, which employs an automatic indexing procedure with a weighting scheme to reflect term importance and an emphasis on nearest neighbour searching to locate documents closest to a given query.
Abstract: Document filing and retrieval systems can be designed using advanced techniques resulting from recent research in information retrieval. In this paper, a document retrieval system is presented, based upon the vector processing model. The system employs an automatic indexing procedure with a weighting scheme to reflect term importance. Documents are stored using an inverted file organization. Natural language queries are supported with a retrieval strategy based on best match techniques and relevance feedback. The emphasis is on nearest neighbour searching to locate documents closest to a given query. That means, after having defined a similarity function, the identification of those documents in the collection which exhibit a higher degree of resemblance to the query. The problem is introduced with reference to a straightforward search procedure that returns the nearest neighbour set by manipulating the inverted file entries. Then, an improved algorithm is presented which optimizes both the number of documen...
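The pipeline described above — weighted automatic indexing, an inverted file, and nearest neighbour scoring that touches only documents sharing a query term — can be sketched as follows. The generic tf·idf weighting is a stand-in assumption, not necessarily the paper's scheme.

```python
import math
from collections import defaultdict

def build_index(docs):
    """Inverted file: term -> list of (doc_id, tf-idf weight).

    docs is a list of token lists; also returns each document's
    vector length for cosine normalisation."""
    N = len(docs)
    df = defaultdict(int)
    for d in docs:
        for t in set(d):
            df[t] += 1
    index, norms = defaultdict(list), [0.0] * N
    for i, d in enumerate(docs):
        for t in set(d):
            w = d.count(t) * math.log(N / df[t])
            index[t].append((i, w))
            norms[i] += w * w
    return index, [math.sqrt(x) for x in norms]

def nearest_neighbours(query, index, norms, k=3):
    """Accumulate weights from the query terms' postings only, then
    rank by the document-normalised score (query norm is constant)."""
    scores = defaultdict(float)
    for t in query:
        for doc_id, w in index.get(t, []):
            scores[doc_id] += w
    ranked = sorted(scores.items(),
                    key=lambda kv: kv[1] / norms[kv[0]], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

Documents with no query term in common are never scored, which is exactly the efficiency the inverted file buys over a full scan.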

71 citations


Journal ArticleDOI
01 Sep 1988
TL;DR: A theoretical model of a knowledge-based information retrieval system developed in this thesis specifies the requirements and properties of such a system; in particular, a novel term-similarity function could be defined.
Abstract: Information retrieval can be defined as the extraction of specific information out of a great number of stored information items. Information retrieval systems, used for the retrieval of documents, try to answer more or less precise questions about interesting topics with a number of suitable documents or references to documents. Such systems should contain 'knowledge' about the meaning of questions, about the content of the stored information, and about the particular user's needs for information. Knowledge-based systems claim to be able to store knowledge and draw conclusions from it. The goal of this thesis is to investigate the use of knowledge-based methods and technologies for information retrieval. A knowledge-based information retrieval system should represent its information structures, as well as its knowledge, in a common knowledge representation formalism. The retrieval process of the system should employ the inferential methods of the chosen knowledge representation formalism. A subset of first order logic is chosen for this thesis to represent knowledge. Specially designed retrieval rules represent knowledge for the purpose of retrieval; they capture knowledge about the user's vocabulary, working domain, and way of performing document retrieval. The problem of the recall and precision of the answers of an information retrieval system is approached by an explicit representation of control knowledge. A theoretical model of a knowledge-based information retrieval system developed in this thesis specifies the requirements and properties of such a system. In particular, a novel term-similarity function could be defined. Properties like completeness and termination could be derived, and bounds on the amount of overhead caused by false control strategies could be investigated. The proposed model is implemented in a prototype of a knowledge-based information retrieval system, called KIR.
KIR is a single-user system for personal document and knowledge retrieval running on computer workstations. It is implemented using Prolog and Modula-2.

68 citations


Proceedings ArticleDOI
01 May 1988
TL;DR: Investigating whether linguistic processing can be used as part of a document retrieval strategy, by predefining a level of syntactic analysis of user queries only, suggests that the approach of using linguistic processing in retrieval is valid.
Abstract: Traditional information retrieval has relied on the extensive use of statistical parameters in the implementation of retrieval strategies. This paper sets out to investigate whether linguistic processes can be used as part of a document retrieval strategy. This is done by predefining a level of syntactic analysis of user queries only, to be used as part of the retrieval process. A large series of experiments on an experimental test collection is reported, using a parser for noun phrases as part of the retrieval strategy. The results obtained do yield improvements in the level of retrieval effectiveness; given the crude linguistic process used, and the fact that it was applied to queries and not to document texts, this suggests that the approach of using linguistic processing in retrieval is valid.

65 citations


Proceedings ArticleDOI
01 May 1988
TL;DR: Experiments indicate that regression methods can help predict relevance, given query-document similarity values for each concept type, and the role of links is shown to be especially beneficial.
Abstract: This report considers combining information to improve retrieval. The vector space model has been extended so different classes of data are associated with distinct concept types and their respective subvectors. Two collections with multiple concept types are described, ISI-1460 and CACM-3204. Experiments indicate that regression methods can help predict relevance, given query-document similarity values for each concept type. After sampling and transformation of data, the coefficient of determination for the best model was .48 (.66) for ISI (CACM). Average precision for the two collections was 11% (31%) better for probabilistic feedback with all types versus with terms only. These findings may be of particular interest to designers of document retrieval or hypertext systems since the role of links is shown to be especially beneficial.

51 citations


Book
01 Dec 1988
TL;DR: The origins of information retrieval research: the first information retrieval system tests; testing indexing systems (Cranfield I); research on relevance judgement; relevance as a performance criterion; the Cranfield tradition and the information retrieval model.
Abstract: Part 1 Introduction - the origins of information retrieval research: the first information retrieval system tests; testing indexing systems - Cranfield I; testing indexing devices - Cranfield II; relevance judgement and retrieval system tests; research on relevance judgement; relevance as a performance criterion; the Cranfield tradition and the information retrieval model.
Part 2 Statistical and probabilistic retrieval: automatic indexing, classification and searching - SMART; document clustering; probabilistic models of relevance and relevance feedback; achievements and limitations of the statistical approach.
Part 3 Cognitive user modelling: information retrieval through man-machine dialogue - the THOMAS program; anomalous states of knowledge; ASK-based retrieval; stereotype-based fiction retrieval - the GRUNDY program; cognitive models for retrieval.
Part 4 Expert intermediary systems: expert intermediary systems - CONIT, CANSEARCH and PLEXUS; distributed expert-based intermediary system - MONSTRAT; intelligent intermediary for information retrieval - I3R; COmposite Document Expert/extended/effective Retrieval - CODER; expert systems for information retrieval.
Part 5 Associations, relations and hypertext: 'as we may think' - MEMEX; database browsing and navigation - TINman; the origins of hypertext - Xanadu and NLS/Augment; card-based hypertext systems - NoteCards and HyperCard; transforming text to hypertext - Guide; hypercatalog; information retrieval by association; potential and problems of hypertext.

44 citations


Journal ArticleDOI
TL;DR: This article will present a relational logical model to support a sophisticated document retrieval system in which flexible forms of inferential and associative searching can be performed and several problems of particular importance to document retrieval will be discussed.
Abstract: Relational Data Base Management Systems offer a commercially available tool with which to build effective document retrieval systems. The full potential of the relational model for supporting the kind of ad hoc inquiry characteristic of document retrieval has only recently been explored. In addition, commercially available relational DBMS's also provide effective tools for managing document data bases by providing facilities for, inter alia , concurrency control, data migration and reorganization routines, authorization mechanisms, enforcement of integrity constraints, dynamic data definition, etc. This article will present a relational logical model to support a sophisticated document retrieval system in which flexible forms of inferential and associative searching can be performed. Examples of ad hoc inquiry will be presented in SQL. Several problems of particular importance to document retrieval will be discussed, including the importance of Conjunctive Normal Form in query formulation, unique aspects of document retrieval storage and processing overhead, and techniques for reducing the size of storage without severely impacting retrieval effectiveness.

Journal ArticleDOI
TL;DR: An experimental office system currently being developed at Olivetti research integrates two major requirements of office work, content-based document retrieval and mail distribution, and closes the gap between electronic document entry systems and processing of (semi-)structured document content.
Abstract: An experimental office system currently being developed at Olivetti research integrates two major requirements of office work: content-based document retrieval and mail distribution. In this system, documents are described and classified by their semantic structure, which provides access to abstract concepts contained in the document. The derivation of the semantic structure of a document supports both efficient retrieval by content and intelligent mail filtering through document semantics. A knowledge-based classification system automatically generates the conceptual description of a document to be inserted into the system by means of content analysis, and associates the document with an appropriate predefined type. The classification system closes the gap between electronic document entry systems and processing of (semi-)structured document content.

Proceedings ArticleDOI
01 May 1988
TL;DR: The approach to plausible inference for retrieval is explained and some preliminary experiments designed to test this approach using a spreading activation search to implement the plausible inference process show that significant effectiveness improvements are possible.
Abstract: Choosing an appropriate document representation and search strategy for document retrieval has been largely guided by achieving good average performance instead of optimizing the results for each individual query. A model of retrieval based on plausible inference gives us a different perspective and suggests that techniques should be found for combining multiple sources of evidence (or search strategies) into an overall assessment of a document's relevance, rather than attempting to pick a single strategy. In this paper, we explain our approach to plausible inference for retrieval and describe some preliminary experiments designed to test this approach. The experiments use a spreading activation search to implement the plausible inference process. The results show that significant effectiveness improvements are possible using this approach.
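The spreading activation search used in these experiments can be sketched over a weighted graph linking query, term, and document nodes. The decay factor, iteration count, and graph shape below are illustrative assumptions, not the paper's configuration.

```python
def spread_activation(graph, seeds, decay=0.5, iterations=2):
    """Propagate activation from query-matched seed nodes along weighted links.

    graph: node -> list of (neighbour, link_weight).
    seeds: node -> initial activation level.
    Returns (node, activation) pairs sorted by activation, so documents
    reachable through more / stronger evidence paths rank higher."""
    activation = dict(seeds)
    for _ in range(iterations):
        nxt = dict(activation)
        for node, level in activation.items():
            for neighbour, weight in graph.get(node, []):
                nxt[neighbour] = nxt.get(neighbour, 0.0) + decay * weight * level
        activation = nxt
    return sorted(activation.items(), key=lambda kv: kv[1], reverse=True)
```

A node two hops from the query still receives some activation, which is how indirect evidence (e.g. related terms) contributes to a document's score.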

Journal ArticleDOI
TL;DR: The role of the bibliographical agency in the browsing situation is changed from one which derives concepts to one which maximizes a user's physical inspection capabilities.

Journal ArticleDOI
01 Nov 1988-Online
TL;DR: The authors present the different types of controlled vocabularies: post-controlled vocabularies, subject headings and descriptors, category codes, and hierarchical and faceted classifications.
Abstract: A presentation of the different types of controlled vocabularies: post-controlled vocabularies, subject headings and descriptors, category codes, and hierarchical and faceted classifications. Discussion of the best ways to use these different types of vocabulary in online document retrieval.

Journal ArticleDOI
TL;DR: Experimental results compare the performance of a sequential learning probabilistic retrieval model with both the proposed integrated Boolean-probabilistic model and with a fuzzy-set model.
Abstract: Most commercial document retrieval systems require queries to be valid Boolean expressions that may be used to split the set of available documents into a subset consisting of documents to be retrieved and a subset of documents not to be retrieved. Research has suggested that the ranking of documents and use of relevance feedback may significantly improve retrieval performance. We suggest that by placing Boolean database queries into Conjunctive Normal Form, a conjunction of disjunctions, and by making the assumption that the disjunctions represent a hyperfeature, documents to be retrieved can be probabilistically ranked and relevance feedback incorporated, improving retrieval performance. Experimental results compare the performance of a sequential learning probabilistic retrieval model with both the proposed integrated Boolean-probabilistic model and with a fuzzy-set model.
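The hyperfeature idea can be sketched directly: once a Boolean query is in Conjunctive Normal Form, each disjunctive clause is treated as a single feature that a document either matches or not, which yields a graded ranking instead of a retrieved/not-retrieved split. The uniform clause weights below are an illustrative default; in the paper's setting they would be learned from relevance feedback.

```python
def rank_cnf(cnf_query, docs, clause_weights=None):
    """Rank documents against a CNF query treated as hyperfeatures.

    cnf_query: list of clauses, each a set of alternative terms
               (e.g. [{"cat", "feline"}, {"dog"}] for
                (cat OR feline) AND dog).
    docs: list of term sets. A clause 'fires' on a document if any
    of its terms occurs; documents are ranked by summed clause weights,
    so partial matches rank below full matches instead of vanishing."""
    if clause_weights is None:
        clause_weights = [1.0] * len(cnf_query)
    scores = []
    for doc_id, terms in enumerate(docs):
        s = sum(w for clause, w in zip(cnf_query, clause_weights)
                if clause & terms)
        scores.append((doc_id, s))
    return sorted(scores, key=lambda kv: kv[1], reverse=True)
```

A strict Boolean evaluation would return only documents satisfying every clause; here a document matching one clause of two still appears, ranked lower.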

Journal ArticleDOI
TL;DR: It is proposed that an appropriate metric for gauging the performances of information retrieval systems is a measure of the (relative) total relevance that a user can obtain from a set of documents sequentially scanned and evaluated in an information retrieval environment.
Abstract: The article presents a model based on the notion of the total relevance of a set of documents. The concept of a total relevance function is subsequently derived from the notion of cumulated relevance implied in the traditional summation of relevance ratings over the documents in a collection or in retrieved sets of documents. The model is intended to make explicit the perceptual underpinnings of relevance assessments while allowing for the consideration of interdocument dependencies as perceived by the user. Within this framework, it is proposed that an appropriate metric for gauging the performances of information retrieval systems is a measure of the (relative) total relevance that a user can obtain from a set of documents sequentially scanned and evaluated in an information retrieval environment. Some implications of the model are noted.
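The cumulated-relevance idea behind the proposed metric is simple to state in code: sum the user's relevance ratings over the documents scanned so far. The "relative" normalisation against the best possible ordering is my reading of the "(relative) total relevance" phrase, included as an assumption.

```python
def total_relevance(ratings, depth=None):
    """Cumulated relevance obtained by scanning documents in ranked order.

    ratings: relevance rating of each document, in the order scanned."""
    if depth is None:
        depth = len(ratings)
    return sum(ratings[:depth])

def relative_total_relevance(ratings, depth):
    """Fraction of the best achievable cumulated relevance at this depth,
    i.e. against the same ratings sorted into the ideal order."""
    ideal = sum(sorted(ratings, reverse=True)[:depth])
    return total_relevance(ratings, depth) / ideal if ideal else 0.0
```

For a scan [1, 0, 1, 1] cut off at depth 2, the user obtained relevance 1 out of an ideal 2, so the relative total relevance is 0.5. (Note this simple sum treats documents independently; the article's model additionally allows for perceived interdocument dependencies.)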

Journal ArticleDOI
03 Jan 1988
TL;DR: INSTRUCT as discussed by the authors is a multi-user text retrieval system which was developed as an interactive teaching package for demonstrating modern information retrieval techniques, these including natural language query processing, best match searching and automatic relevance feedback based on probabilistic term weighting.
Abstract: INSTRUCT is a multi‐user, text retrieval system which was developed as an interactive teaching package for demonstrating modern information retrieval techniques, these including natural language query processing, best match searching and automatic relevance feedback based on probabilistic term weighting. INSTRUCT has recently been extended and now additionally has facilities for query expansion using both relevance and term co‐occurrence data, for cluster‐based searching, and for two browsing search strategies. These retrieval mechanisms are used to search a file of 26,280 titles and abstracts from the Library and Information Science Abstracts database; both menu‐based and command‐based searching are allowed.

Journal ArticleDOI
TL;DR: The touch screen was found to be the fastest and the least accurate, but the overall favorite of the participants.
Abstract: This study measured the speed, error rates, and subjective evaluation of arrow-jump keys, a jump-mouse, number keys, and a touch screen in an interactive encyclopedia. A summary of previous studies comparing selection devices and strategies is presented to provide the background for this study. We found the touch screen to be the fastest and the least accurate, but the overall favorite of the participants. The results are discussed and improvements are suggested accordingly.

Journal ArticleDOI
TL;DR: This article reviews recent advances in information retrieval research and examines their practical potential for overcoming the deficiencies of present-day operational retrieval systems; earlier results published elsewhere have also been considered.
Abstract: Operational retrieval systems are firmly embedded within the pure Boolean framework, and the theoretical model underlying these systems is based on the implicit assumption that documents and user information needs can be precisely and completely characterized by sets of index terms and Boolean search request formulations, respectively. However, this assumption must be considered grossly inaccurate since uncertainty is intrinsic to the document retrieval process. The inability of the standard Boolean model to deal effectively with the inherent fallibility of retrieval decisions is the main reason for a number of serious deficiencies exhibited by present-day operational retrieval systems. This article reviews recent advances in information retrieval research and examines their practical potential for overcoming these deficiencies. The primary source for this review is the subsequent articles that comprise this special issue of Information Processing & Management, although earlier results published elsewhere have also been considered.

Book
01 Dec 1988
TL;DR: A Navigator of Natural Language Organized Data (ANNOD) as discussed by the authors is a retrieval system which combines use of probabilistic, linguistic, and empirical means to rank individual paragraphs of full text for their similarity to natural language queries proposed by users.
Abstract: “A Navigator of Natural Language Organized Data” (ANNOD) is a retrieval system which combines use of probabilistic, linguistic, and empirical means to rank individual paragraphs of full text for their similarity to natural language queries proposed by users. ANNOD includes common word deletion, word root isolation, query expansion by a thesaurus, and application of a complex empirical matching (ranking) algorithm. The Hepatitis Knowledge Base, the text of a prototype information system, was the file used for testing ANNOD. Responses to a series of users' unrestricted natural language queries were evaluated by three testers. Information needed to answer 85 to 95% of the queries was located and displayed in the first few selected paragraphs. It was successful in locating information in both the classified (listed in Table of Contents) and unclassified portions of text. Development of this retrieval system resulted from the complementarity of and interaction between computer science and medical domain expert knowledge. Extension of these techniques to larger knowledge bases is needed to clarify their proper role.

Journal ArticleDOI
TL;DR: That the presented theory is useful for the retrieval of information in natural language information systems is shown by the results of the prototype TRIGIR, based on trigrams.

Journal ArticleDOI
TL;DR: OAKDEC is a program that uses expert system techniques to assess the status of a database search done through the intermediary program, OAK, and to provide a recommendation on how to proceed, based on decision-making logic in the original OAK system.
Abstract: OAKDEC is a program that uses expert system techniques to assess the status of a database search done through the intermediary program, OAK, and to provide a recommendation to the user on how to proceed. Based on decision-making logic in the original OAK system, OAKDEC works at a far greater degree of detail, or finer grain size, in resolving the situation and in making the decision on the next step to be recommended. OAKDEC is intended as a research tool for studying user behavior and, in particular, for studying the effect of decision detail on user behavior and search outcome.

Proceedings ArticleDOI
Nick Belkin1
01 May 1988
TL;DR: A general model of clarity in human-computer systems, of which explanation is one component, is proposed, and a model for explanation by the computer intermediary in information retrieval is proposed.
Abstract: We discuss the complexity of explanation activity in human-human goal-directed dialogue, and suggest that this complexity ought to be taken account of in the design of explanation in human-computer interaction. We propose a general model of clarity in human-computer systems, of which explanation is one component. On the bases of: this model; of a model of human-intermediary interaction in the document retrieval situation as one of cooperative model-building for the purpose of developing an appropriate search formulation; and, on the results of empirical observation of human user-human intermediary interaction in information systems, we propose a model for explanation by the computer intermediary in information retrieval.

Journal ArticleDOI
TL;DR: An efficient algorithm for the calculation of term discrimination values that may be used when the interdocument similarity measure used is the cosine coefficient and when the document representatives have been weighted using one particular term-weighting scheme is described.
Abstract: The term discrimination model provides a means of evaluating indexing terms in automatic document retrieval systems. This article describes an efficient algorithm for the calculation of term discrimination values that may be used when the interdocument similarity measure used is the cosine coefficient and when the document representatives have been weighted using one particular term-weighting scheme. The algorithm has an expected running time proportional to Nn2 for a collection of N documents, each of which has been assigned an average of n terms.
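The quantity being computed can be shown with a naive baseline: a term's discrimination value is the change in the collection's "space density" (average pairwise cosine similarity) when the term is removed. This O(N²) sketch is the brute-force version the article's Nn² algorithm improves on; the unit term weights in the example are an assumption.

```python
import math

def density(docs):
    """Average pairwise cosine similarity of the collection."""
    def cos(a, b):
        num = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return num / (na * nb) if na and nb else 0.0
    n = len(docs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cos(docs[i], docs[j]) for i, j in pairs) / len(pairs)

def discrimination_value(docs, term):
    """Density without the term minus density with it.

    Positive: removing the term packs documents closer together, so the
    term was spreading them apart -- a good discriminator. Negative:
    the term (typically a very frequent one) made documents look alike."""
    without = [{t: w for t, w in d.items() if t != term} for d in docs]
    return density(without) - density(docs)
```

With two documents sharing only the term "x", removing "x" drops their similarity to zero, so "x" gets a negative value, while a term unique to one document gets a positive one.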



Journal ArticleDOI
TL;DR: It turns out that a front end designed to permit searchers to attach probabilistically interpreted weights to their query terms could be adapted for conventional IR systems, and such an enhancement could lead to improved performance.
Abstract: In order for conventionally designed commercial document retrieval systems to perform perfectly, the following two (logical) conditions must be satisfied for every search: (1) There exists a document property (or combination of properties) that belongs to those (and only those) documents that are relevant. (2) That property (or combination of properties) can be correctly guessed by the searcher. In general, the first assumption is false, and the second is impossible to satisfy; hence no conventional IR system can perform at a maximum level of effectiveness. (We are painfully aware of the current poor performance values for Recall and Precision. Furthermore, Recall deteriorates rapidly as document corpora continue to grow in size.) However, different design principles can lead to improved performance. This article presents a view of the document retrieval problem that shows that since the relationship between document properties (whether they be humanly assigned index terms or words that occur in the running text) and relevance is at best probabilistic, one should approach the design problem using probabilistic principles. It turns out that a front end designed to permit searchers to attach probabilistically interpreted weights to their query terms could be adapted for conventional IR systems. Such an enhancement could lead to improved performance.
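The proposed front end can be sketched as follows: the searcher attaches a probability estimate to each query term, the estimate is converted to a log-odds weight, and documents are ranked by summed weights. The log-odds transform is a standard probabilistic-indexing form assumed here for illustration; the clamping guard is mine.

```python
import math

def query_weight(p):
    """Convert a searcher's probability estimate for a query term into
    a log-odds weight; p near 0.5 contributes almost nothing."""
    p = min(max(p, 0.01), 0.99)  # guard against log(0) at the extremes
    return math.log(p / (1.0 - p))

def rank(query, docs):
    """query: term -> searcher's probability estimate.
    docs: list of term sets. Returns (doc_id, score) best first."""
    scores = [(i, sum(query_weight(p) for t, p in query.items() if t in d))
              for i, d in enumerate(docs)]
    return sorted(scores, key=lambda kv: kv[1], reverse=True)
```

A term the searcher is confident about (p = 0.9) dominates the ranking, while an uncertain term (p = 0.5) has no effect — the graded judgment a pure Boolean query cannot express.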

Journal Article
01 May 1988-Online
TL;DR: A comparison of the effectiveness of document retrieval in ERIC, CD-ROM version versus printed version.
Abstract: A comparison of the effectiveness of document retrieval in ERIC, CD-ROM version versus printed version.

Book
01 Dec 1988
TL;DR: In this article, the use of inverted files for the calculation of similarity coefficients and other types of matching function is discussed in the context of mechanised document retrieval systems and a critical evaluation is presented of a range of algorithms which have been described for the matching of documents with queries.
Abstract: The use of inverted files for the calculation of similarity coefficients and other types of matching function is discussed in the context of mechanised document retrieval systems. A critical evaluation is presented of a range of algorithms which have been described for the matching of documents with queries. Particular attention is paid to the computational efficiency of the various procedures, and improved search heuristics are given in some cases. It is suggested that the algorithms could be implemented sufficiently efficiently to permit the provision of nearest neighbour searching as a standard retrieval option.