
Showing papers presented at the "International ACM SIGIR Conference on Research and Development in Information Retrieval" in 1987


Proceedings ArticleDOI
Joel L. Fagan1
01 Nov 1987
TL;DR: An automatic phrase indexing method based on the term discrimination model is described, and the results of retrieval experiments on five document collections are presented.
Abstract: An automatic phrase indexing method based on the term discrimination model is described, and the results of retrieval experiments on five document collections are presented. Problems related to this non-syntactic phrase construction method are discussed, and some possible solutions are proposed that make use of information about the syntactic structure of document and query texts.

130 citations
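
For illustration, here is a minimal Python sketch of non-syntactic phrase construction. It is not Fagan's actual procedure, which selects phrases via the term discrimination model; it simply promotes frequent adjacent word pairs to indexing phrases, with the stopword list and frequency threshold chosen arbitrarily.

    from collections import Counter

    STOPWORDS = {"the", "of", "a", "and", "in", "to", "is"}  # toy list

    def candidate_phrases(docs, min_freq=2):
        """Collect adjacent non-stopword pairs occurring at least
        min_freq times across the collection (a crude stand-in for
        the term discrimination criterion)."""
        pairs = Counter()
        for text in docs:
            words = [w for w in text.lower().split() if w not in STOPWORDS]
            pairs.update(zip(words, words[1:]))
        return {p for p, n in pairs.items() if n >= min_freq}

    docs = ["information retrieval systems for text databases",
            "evaluation of information retrieval experiments"]
    print(candidate_phrases(docs))  # {('information', 'retrieval')}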



Proceedings ArticleDOI
01 Nov 1987
TL;DR: The problem of retrieving information from large full-text databases is ubiquitous and increasing in importance as low-cost optical storage media become available.
Abstract: The problem of retrieving information from large full-text databases is ubiquitous and increasing in importance as low-cost optical storage media become available. In many cases, simple keyword-based retrieval systems have shown themselves inadequate for the task, and a number of more or less sophisticated alternatives have been proposed. Of particular interest are those that derive from efforts in natural language understanding and which advocate a conceptually oriented approach (Schank et al., 1981; DeJong, 1982; Kolodner, 1983). These efforts emphasize semantically driven text parsing with the goal of understanding only so much of the text as is necessary to perform satisfactory retrieval.

56 citations



Proceedings ArticleDOI
01 Nov 1987
TL;DR: This work proposes to combine the concordance and bit-map approaches, and shows how this can speed up the processing of queries: fast ANDing and ORing of the maps in a preprocessing stage, lead to large I/O savings in collating coordinates of keywords needed to satisfy the metrical and Boolean constraints.
Abstract: In static full-text retrieval systems, which accommodate metrical as well as Boolean operators, the traditional approach to query processing uses a “concordance”, from which large sets of coordinates are retrieved and then merged and/or collated. Alternatively, in a system with l documents, the concordance can be replaced by a set of bit-maps of fixed length l, which are constructed for every different word of the database and serve as occurrence maps. We propose to combine the concordance and bit-map approaches, and show how this can speed up the processing of queries: fast ANDing and ORing of the maps in a preprocessing stage, lead to large I/O savings in collating coordinates of keywords needed to satisfy the metrical and Boolean constraints. Moreover, the bit-maps give partial information on the distribution of the coordinates of the keywords, which can be used when queries must be processed by stages, due to their complexity and the sizes of the involved sets of coordinates. The new techniques are partially implemented at the Responsa Retrieval Project.

29 citations
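
A small Python sketch of the bit-map preprocessing idea, with one bit per document per word (Python integers stand in for the fixed-length maps; the Responsa implementation details are not reproduced here):

    def build_bitmaps(docs):
        """One bit per document for every distinct word (bit i is set
        if the word occurs in document i)."""
        maps = {}
        for i, text in enumerate(docs):
            for w in set(text.lower().split()):
                maps[w] = maps.get(w, 0) | (1 << i)
        return maps

    def and_filter(maps, terms):
        """Preprocessing step: AND the occurrence maps so coordinates
        are fetched only for documents that can possibly satisfy the
        conjunctive query."""
        result = ~0  # all ones
        for t in terms:
            result &= maps.get(t, 0)
        return result

    docs = ["sigir retrieval systems", "text retrieval", "sigir retrieval text"]
    maps = build_bitmaps(docs)
    m = and_filter(maps, ["sigir", "text"])
    print([i for i in range(len(docs)) if m >> i & 1])  # [2]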


Proceedings ArticleDOI
01 Nov 1987
TL;DR: The interaction of suffixing algorithms and ranking techniques in retrieval performance, particularly in an online environment, was investigated and two modifications to ranking techniques were suggested: variable weighting of word variants and selective stemming depending on query length.
Abstract: The interaction of suffixing algorithms and ranking techniques in retrieval performance, particularly in an online environment, was investigated. Three general-purpose suffixing algorithms were used for retrieval on the Cranfield 1400, Medlars, and CACM collections, and the results analysed with several standard evaluation measures. An examination of the retrieval performance using suffixing suggested two modifications to ranking techniques: variable weighting of word variants and selective stemming depending on query length. The experimental data are presented, and the limitations of suffixing in an online environment are discussed.

28 citations
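
A hypothetical sketch of the two suggested modifications, with a toy stemmer and arbitrary weights and length threshold standing in for the paper's tuned choices:

    def expand_query(terms, stem, max_len_for_stemming=3, variant_weight=0.5):
        """Selective stemming: only short queries are expanded, and
        word variants found via the stemmer get a reduced weight
        relative to the original terms."""
        weighted = {t: 1.0 for t in terms}
        if len(terms) <= max_len_for_stemming:
            for t in terms:
                s = stem(t)
                if s != t and s not in weighted:
                    weighted[s] = variant_weight
        return weighted

    toy_stem = lambda w: w[:-1] if w.endswith("s") else w
    print(expand_query(["rankings", "methods"], toy_stem))
    # {'rankings': 1.0, 'methods': 1.0, 'ranking': 0.5, 'method': 0.5}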


Proceedings ArticleDOI
01 Nov 1987
TL;DR: A new cluster maintenance strategy is proposed and its similarity/stability characteristics, cost analysis, and retrieval behavior in comparison with unclustered and completely reclustered database environments have been examined by means of a series of experiments.
Abstract: Partitioning very large databases by clustering is necessary to reduce the space/time complexity of retrieval operations. However, contemporary retrieval environments demand dynamic maintenance of clusters. A new cluster maintenance strategy is proposed, and its similarity/stability characteristics, cost analysis, and retrieval behavior in comparison with unclustered and completely reclustered database environments have been examined by means of a series of experiments.

28 citations
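
The abstract does not spell out the new strategy; the generic sketch below shows only the shape of dynamic maintenance (assign an incoming document to its most similar cluster or open a new one), with cosine similarity and an arbitrary threshold as assumptions:

    import math

    def cosine(u, v):
        """Cosine similarity between sparse term-weight dicts."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def centroid(docs):
        """Mean term-weight vector of a cluster."""
        c = {}
        for d in docs:
            for t, w in d.items():
                c[t] = c.get(t, 0.0) + w / len(docs)
        return c

    def add_document(clusters, doc, threshold=0.3):
        """One maintenance step: join the most similar cluster if it
        is similar enough, otherwise open a new singleton cluster."""
        scored = [(cosine(doc, centroid(c)), c) for c in clusters]
        best_sim, best = max(scored, key=lambda s: s[0], default=(0.0, None))
        if best is not None and best_sim >= threshold:
            best.append(doc)
        else:
            clusters.append([doc])

    clusters = []
    add_document(clusters, {"sigir": 1.0, "retrieval": 2.0})
    add_document(clusters, {"retrieval": 1.0, "clustering": 1.0})
    print(len(clusters))  # 1: the second document joined the first cluster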



Proceedings ArticleDOI
01 Nov 1987
TL;DR: This paper proposes a method and gives the precise semantics of the retrieval operations in a system where imprecision is allowed and suggests a way to handle the uncertainty introduced by imprecise data values.
Abstract: Missing, non-applicable and imprecise values arise frequently in Office Information Systems. There is a need to treat them in a consistent and useful manner. This paper proposes a method and gives the precise semantics of the retrieval operations in a system where imprecision is allowed. It also suggests a way to handle the uncertainty introduced by imprecise data values.

25 citations
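
As an illustration of retrieval semantics under imprecision (a sketch, not the paper's formalism, which also covers missing and non-applicable values), an imprecise value can be stored as an interval and a selection can separate certain answers from merely possible ones:

    MAYBE = "maybe"

    def less_than(value, bound):
        """Three-valued comparison for an imprecise value stored as a
        (low, high) interval; an exact value is (v, v)."""
        low, high = value
        if high < bound:
            return True
        if low >= bound:
            return False
        return MAYBE

    def select(records, field, bound):
        sure = [r for r in records if less_than(r[field], bound) is True]
        possible = [r for r in records if less_than(r[field], bound) == MAYBE]
        return sure, possible

    people = [{"name": "a", "age": (25, 25)},
              {"name": "b", "age": (30, 40)},
              {"name": "c", "age": (45, 50)}]
    print(select(people, "age", 35))  # 'a' is certain, 'b' only possible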


Proceedings ArticleDOI
B.J. Oommen1, D. Ma1
01 Nov 1987
TL;DR: The first solution is relatively fast, but its accuracy is unremarkable in some environments; the second solution, which uses a new variable-structure stochastic automaton, demonstrates an excellent partitioning capability.
Abstract: Let O = {A1, …, AW} be a set of W objects to be partitioned into R classes {P1, …, PR}. The objects are accessed in groups of unknown size, and the sizes of these groups need not be equal. Additionally, the joint access probabilities of the objects are unknown. The intention is that objects accessed together more frequently are located in the same class. This problem has been shown to be NP-hard [15, 16]. In this paper, we propose two stochastic learning automata solutions to the problem. Although the first one is relatively fast, its accuracy is unremarkable in some environments. The second solution, which uses a new variable-structure stochastic automaton, demonstrates an excellent partitioning capability. Experimentally, this solution converges an order of magnitude faster than the best known algorithm in the literature [15, 16].

25 citations
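
A heavily simplified sketch, loosely inspired by object-migrating automata: each object carries a conviction state within its class; co-accessed objects already sharing a class are rewarded, objects in different classes are penalized and migrate once their conviction reaches the boundary. The paper's variable-structure solution also keeps class sizes balanced, which this toy omits.

    import itertools

    DEPTH = 3  # conviction memory of each object's automaton

    def process_access(group, cls, state):
        for a, b in itertools.combinations(group, 2):
            if cls[a] == cls[b]:                      # reward: deepen conviction
                state[a] = min(state[a] + 1, DEPTH)
                state[b] = min(state[b] + 1, DEPTH)
            else:                                     # penalize both objects
                for x, other in ((a, b), (b, a)):
                    if state[x] > 1:
                        state[x] -= 1
                    else:
                        cls[x] = cls[other]           # migrate at the boundary
                        state[x] = 1

    objects = ["A", "B", "C", "D"]
    cls = {o: i for i, o in enumerate(objects)}       # start in singleton classes
    state = {o: 1 for o in objects}
    for _ in range(20):
        process_access(["A", "B"], cls, state)        # A, B are co-accessed
        process_access(["C", "D"], cls, state)        # C, D are co-accessed
    print(cls)  # {'A': 1, 'B': 1, 'C': 3, 'D': 3}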


Proceedings ArticleDOI
01 Nov 1987
TL;DR: The proposed NLP techniques are used to develop a request model based on “conceptual case frames” and to compare this model with the texts of candidate documents; statistical searches carried out using dependency and relative importance information derived from the request models indicate that performance benefits can be obtained.
Abstract: Document retrieval systems have been restricted, by the nature of the task, to techniques that can be used with large numbers of documents and broad domains. The most effective techniques that have been developed are based on the statistics of word occurrences in text. In this paper, we describe an approach to using natural language processing (NLP) techniques for what is essentially a natural language problem - the comparison of a request text with the text of document titles and abstracts. The proposed NLP techniques are used to develop a request model based on “conceptual case frames” and to compare this model with the texts of candidate documents. The request model is also used to provide information to statistical search techniques that identify the candidate documents. As part of a preliminary evaluation of this approach, case frame representations of a set of requests from the CACM collection were constructed. Statistical searches carried out using dependency and relative importance information derived from the request models indicate that performance benefits can be obtained.
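
For orientation, a toy sketch of matching a request case frame against document terms; the slot names, weights, and scoring are illustrative assumptions, not the paper's representation.

    # Slot weights: head slots count more than modifiers (the "relative
    # importance" information mentioned in the abstract).
    WEIGHTS = {"action": 2.0, "object": 2.0, "modifiers": 1.0}

    request_frame = {
        "action": "parsing",
        "object": "queries",
        "modifiers": {"natural", "language"},
    }

    def frame_match(frame, doc_terms):
        """Score a document by how well its terms fill the request-frame
        slots, weighting each slot by its relative importance."""
        terms = set(doc_terms)
        score = 0.0
        for slot, value in frame.items():
            targets = value if isinstance(value, set) else {value}
            score += WEIGHTS[slot] * len(targets & terms) / len(targets)
        return score

    doc = ["parsing", "strategies", "for", "natural", "language", "text"]
    print(frame_match(request_frame, doc))  # 3.0: action + both modifiers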

Proceedings ArticleDOI
01 Nov 1987
TL;DR: Models of document retrieval systems assuming random selection and best-first selection are developed and compared under binary independence and two Poisson independence feature distribution models.
Abstract: Most document retrieval systems based on probabilistic models of feature distributions assume random selection of documents for retrieval. The assumptions of these models are met when documents are randomly selected from the database or when retrieving all available documents. A more suitable model for retrieval of a single document assumes that the best document available is to be retrieved first. Models of document retrieval systems assuming random selection and best-first selection are developed and compared under binary independence and two Poisson independence feature distribution models. Under the best-first model, feature discrimination varies with the number of documents in each relevance class in the database. A weight similar to the Inverse Document Frequency weight and consistent with the best-first model is suggested which does not depend on knowledge of the characteristics of relevant documents.
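
The exact weight derived in the paper is not reproduced in the abstract; for orientation, the classical Inverse Document Frequency weight it is said to resemble is simply:

    import math

    def idf(N, n_t):
        """Classical IDF: N documents in the collection, n_t of them
        containing term t. The paper's best-first weight is of a
        similar flavour but derives from its selection model."""
        return math.log(N / n_t)

    print(round(idf(1400, 70), 3))  # 2.996 for a rare term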

Proceedings ArticleDOI
01 Nov 1987
TL;DR: A family of compression methods using a hash table for searching the prediction information is described; the one-pass versions are especially apt for “on-the-fly” compression of transmitted data and could be a basis for specialized hardware.
Abstract: Knowledge of a short substring constitutes a good basis for guessing the next character in a natural-language text. This observation is fundamental to predictive text compression, i.e. repeated guessing and encoding of subsequent characters. The paper describes a family of such compression methods, using a hash table for searching the prediction information. The experiments show that the methods produce good compression gains and, moreover, are very fast. The one-pass versions are especially apt for “on-the-fly” compression of transmitted data, and could be a basis for specialized hardware.
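
A minimal one-pass sketch of the guess-and-encode idea, assuming a fixed-length character context as the prediction key and a "last seen wins" table update; the paper's family of methods and its hashing details are not reproduced.

    def compress(text, k=3):
        """Predict the next character from the k preceding ones: a
        correct guess costs one flag bit, a miss costs the flag plus
        an 8-bit literal."""
        table, out = {}, []
        for i, ch in enumerate(text):
            ctx = text[max(0, i - k):i]
            if table.get(ctx) == ch:
                out.append((1, None))        # hit: one bit on the wire
            else:
                out.append((0, ch))          # miss: flag bit + literal
            table[ctx] = ch
        return out

    def decompress(stream, k=3):
        table, text = {}, ""
        for hit, ch in stream:
            ctx = text[max(0, len(text) - k):]
            ch = table[ctx] if hit else ch
            text += ch
            table[ctx] = ch
        return text

    s = "abcabcabcabc"
    enc = compress(s)
    assert decompress(enc) == s
    bits = sum(1 if hit else 9 for hit, _ in enc)
    print(bits, "bits vs", 8 * len(s))  # 60 bits vs 96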

Proceedings ArticleDOI
01 Nov 1987
TL;DR: An enhancement of such a clustering scheme is presented by the formulation of the user-oriented clustering as a function-optimization problem, termed the Boundary Selection Problem (BSP).
Abstract: User-oriented clustering schemes enable the classification of documents based upon the user's perception of the similarity between documents, rather than on some similarity function presumed by the designer to represent the user's criteria. In this paper, an enhancement of such a clustering scheme is presented. This is accomplished by formulating user-oriented clustering as a function-optimization problem, termed the Boundary Selection Problem (BSP). Heuristic approaches to solve the BSP are proposed, and a preliminary evaluation of these approaches is provided.


Proceedings ArticleDOI
01 Nov 1987
TL;DR: This approach allows a limited automatic analysis of images belonging to a domain described in advance to the system using a formalism based on fuzzy sets; query processing is based on special access structures generated from the image analysis process.
Abstract: In this paper we address the problem of retrieving images from large image databases, given a partial description of the image content. This approach allows a limited automatic analysis of images belonging to a domain described in advance to the system, using a formalism based on fuzzy sets. The image query processing is based on special access structures generated from the image analysis process.

Proceedings ArticleDOI
01 Nov 1987
TL;DR: An interaction model referring to a knowledge-based model of document description is discussed; it employs the feature "informational zooming" to investigate informational entities on an adequate level of abstraction.
Abstract: User interfaces to information systems can be modelled by providing generalized descriptions of the contributions to the dialog from both partners: user and system. In this paper, we refer to such descriptions as "interaction models". Due to the probable integration of heterogeneous types of information in future information systems, we discuss an interaction model which refers to a knowledge-based model of document description (cf. HAHN/REIMER 86). Using interactive graphics, the model employs the feature "informational zooming" to investigate informational entities on an adequate level of abstraction. The knowledge-based full-text information system TOPIC/TOPOGRAPHIC integrates the presentation of various types of information (topical, factual and textual) into a comprehensive interaction model based on informational objects. Only three operators suffice for accessing the information structures at all levels. This is accomplished by context-dependent menus that are generated dynamically during the dialog if a further specification of the command is needed. Thus a user-friendly access to several layers of information about texts is possible: (1) topical structures of relevant texts at different levels of generality (cascaded abstracts); (2) facts from those texts, automatically extracted during the text analysis; (3) passages from the original text, presented according to the user's zooming operations. A survey of the functionality of the system is given in the appendix.

Proceedings ArticleDOI
01 Nov 1987
TL;DR: A sample interaction with EP-X is discussed, the knowledge representations necessary to support this semantically-based interaction are discussed, preliminary results of empirical studies to evaluate the interface, and recommendations for future directions are made.
Abstract: EP-X (Environmental Pollution eXpert) is a prototype knowledge-based system that assists users in conducting bibliographic searches of the environmental pollution literature. This system combines artificial intelligence and human factors engineering techniques, allowing us to redesign traditional bibliographic information retrieval interfaces. The result supports semantically-based search as opposed to the typical character-string matching approach. This paper discusses a sample interaction with EP-X, the knowledge representations necessary to support this semantically-based interaction, preliminary results of empirical studies to evaluate the interface, and recommendations for future directions.

Proceedings ArticleDOI
P. Schauble1
01 Nov 1987


Proceedings ArticleDOI
01 Nov 1987
TL;DR: Within the framework of the vector space models, a statistical similarity measure between document and query is proposed, which provides a natural and consistent interpretation of term occurrence frequencies obtained from autoindexing.
Abstract: Within the framework of the vector space models, a statistical similarity measure between document and query is proposed. In this approach the assumption that term (or atomic) vectors are pairwise orthogonal is not required. In addition, it provides a natural and consistent interpretation of term occurrence frequencies obtained from autoindexing.
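
A sketch of the general form, under the assumption that term correlations are captured in a Gram matrix G of the term vectors (G = I recovers the usual orthogonal-term cosine); the paper's statistical construction of the measure is not reproduced here.

    import numpy as np

    # Hypothetical term-term correlation matrix G: the off-diagonal 0.6
    # says terms 0 and 1 are related.
    G = np.array([[1.0, 0.6, 0.0],
                  [0.6, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

    def sim(d, q):
        """Similarity without pairwise-orthogonal term vectors: all
        inner products go through the Gram matrix G."""
        num = d @ G @ q
        return num / (np.sqrt(d @ G @ d) * np.sqrt(q @ G @ q))

    d = np.array([2.0, 0.0, 1.0])   # term frequencies from autoindexing
    q = np.array([0.0, 1.0, 0.0])
    print(round(float(sim(d, q)), 3))  # 0.537; the orthogonal model gives 0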

Proceedings ArticleDOI
01 Nov 1987
TL;DR: This paper outlines a method for the automatic construction of a knowledge base, proposing some methods and a domain knowledge model.
Abstract: We attempt in this paper to outline a method for the automatic construction of a knowledge base. We propose some methods and a domain knowledge model. A new idea is to conceive a system that is able, at each phase of its construction, to acquire domain knowledge from all the new information it builds, in particular the indexing terms; the last section is an attempt in this direction.

Proceedings ArticleDOI
01 Nov 1987
TL;DR: The experimental results show that in this case no improvement over a simple coordination match function can be achieved, and models based on probabilistic indexing outperform the ranking procedures using search term weights.
Abstract: The effect of probabilistic search term weighting on the improvement of retrieval quality has been demonstrated in various experiments described in the literature. In this paper, we investigate the feasibility of this method for boolean retrieval with terms from a prescribed indexing vocabulary. This is a quite different test setting in comparison to other experiments where linear retrieval with free text terms was used. The experimental results show that in our case no improvement over a simple coordination match function can be achieved. On the other hand, models based on probabilistic indexing outperform the ranking procedures using search term weights.
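
The baseline the weighted procedures failed to beat is the simple coordination match, i.e. ranking documents by the number of query terms they are indexed with:

    def coordination_match(query_terms, doc_terms):
        """Count how many query terms the document is indexed with."""
        return len(set(query_terms) & set(doc_terms))

    docs = {"d1": ["indexing", "vocabulary"], "d2": ["indexing"]}
    q = ["indexing", "vocabulary", "boolean"]
    print(sorted(docs, key=lambda d: -coordination_match(q, docs[d])))
    # ['d1', 'd2']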

Proceedings ArticleDOI
01 Nov 1987
TL;DR: The results suggest that the parallel architecture of the DAP is not well suited to the variable-length records which characterise bibliographic data.
Abstract: This paper considers the suitability and efficiency of a highly parallel computer, the ICL Distributed Array Processor (DAP), for document clustering. Algorithms are described for the implementation of the single-pass and reallocation clustering methods on the DAP and on a conventional mainframe computer. These methods are used to classify the Cranfield, Vaswani and UKCIS document test collections. The results suggest that the parallel architecture of the DAP is not well suited to the variable-length records which characterise bibliographic data.

Journal ArticleDOI
Gerard Salton1
01 Mar 1987
TL;DR: The conclusion is reached that expert systems are unlikely to provide much relief in ordinary retrieval environments, and that simpler and more effective retrieval systems can be implemented by falling back on methodologies proposed and evaluated over twenty years ago that operate without expert system intervention.
Abstract: The existing bibliographic retrieval systems are too complex to permit direct on-line access by untrained end users. Expert system approaches have been introduced in the hope of simplifying the document indexing, search and retrieval operations and rendering these operations accessible to end users. The expert system approach is examined briefly in this note and the conclusion is reached that expert systems are unlikely to provide much relief in ordinary retrieval environments. Simpler and more effective retrieval systems than those currently in use can be implemented by falling back on methodologies proposed and evaluated over twenty years ago that operate without expert system intervention.

Journal ArticleDOI
01 Mar 1987
TL;DR: Information retrieval has all the elements of a classical decision problem: a set of possible actions, a set of potential states, and a reward or utility attached to each combination of action and state.
Abstract: Information retrieval has all the elements of a classical decision problem: a set of possible actions, a set of potential states, and a reward or utility attached to each combination of action and state. How the actions, states, and utilities are described, however, is variable, and depends very much on the describer's point of view.
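
In the classical decision-theoretic formulation the abstract alludes to, retrieving or rejecting a document d is the action that maximizes expected utility over the unknown relevance state:

    \[
      a^{*} = \arg\max_{a \in \{\text{retrieve},\,\text{reject}\}}
      \sum_{s \in \{\text{rel},\,\text{nonrel}\}} P(s \mid d)\, u(a, s)
    \]

Different descriptions of the actions, states, and utilities then yield different retrieval rules from this same template.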

Proceedings ArticleDOI
01 Nov 1987
TL;DR: The Indexing Aid System is described and illustrated using an extended example, highlighting the knowledge-based capabilities of the system, namely, inheritance and internal retrieval, enforcement of restrictions, and other functions implemented by procedural attachments, which are characteristic of frame-based knowledge representation languages.
Abstract: This report discusses the Indexing Aid Project for conducting research in interactive knowledge-based indexing of the medical literature. After providing an overview and background, we describe and illustrate the Indexing Aid System using an extended example, highlighting the knowledge-based capabilities of the system, namely, inheritance and internal retrieval, enforcement of restrictions, and other functions implemented by procedural attachments, which are characteristic of frame-based knowledge representation languages. A feature which generates reports for evaluating the system is also shown. The paper concludes with discussion of the research plan. The project is part of the Automated Classification and Retrieval Program at the Lister Hill National Center for Biomedical Communications, the research and development arm of the National Library of Medicine.
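
As a reminder of what slot inheritance in a frame language buys, here is a toy sketch (not the Indexing Aid System's actual representation language; the example frames are invented):

    class Frame:
        """Minimal frame with slot inheritance from a parent frame."""
        def __init__(self, name, parent=None, **slots):
            self.name, self.parent, self.slots = name, parent, slots

        def get(self, slot):
            if slot in self.slots:
                return self.slots[slot]
            if self.parent is not None:
                return self.parent.get(slot)   # inherit from the parent
            raise KeyError(slot)

    disease = Frame("Disease", site="unspecified")
    hepatitis = Frame("Hepatitis", parent=disease, site="liver")
    hepatitis_a = Frame("HepatitisA", parent=hepatitis)
    print(hepatitis_a.get("site"))  # 'liver', inherited from Hepatitis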

Proceedings ArticleDOI
01 Nov 1987
TL;DR: It is found that engineering majors exhibit academic background and personality characteristics most like those of skilled searchers and programmers, with contrasting patterns or no discernible patterns in English and psychology majors.
Abstract: The population using information retrieval systems is becoming increasingly diverse. We find a wide range of skills in ability to use these systems; this diverse population must be accommodated by the next generation of systems. This paper reports on a study to identify variables related to information retrieval aptitude, based on results from earlier studies of searchers and programmers. A sample of undergraduate subjects from English, psychology, and engineering majors was given a series of psychometric tests and compared to known populations. We find that engineering majors exhibit academic background and personality characteristics most like those of skilled searchers and programmers, with contrasting patterns or no discernible patterns in English and psychology majors. The strength of most associations increases when restricted to subjects who have either stayed in one major or who have changed major only within one disciplinary area. About half the variance in choice of major can be explained by scores on the tests administered, and a comparable amount of variance in test scores can be explained by the academic background variables.

Proceedings ArticleDOI
01 Nov 1987
TL;DR: Methods of integrating personal computers (PCs) into large information systems, with emphasis on effective use of the storage and processing capabilities of these computers, are outlined, noting that caching in this environment poses unique problems.
Abstract: Information retrieval (IR) systems provide individual remote access to centrally managed data. The current proliferation of personal computer systems, as well as advances in storage and communication technology, have created new possibilities for designing information systems which are easily accessible, economical, and responsive to user needs. This paper outlines methods of integrating personal computers (PCs) into large information systems, with emphasis on effective use of the storage and processing capabilities of these computers. In particular we discuss means for caching retrieved data at PC-equipped user sites, noting that caching in this environment poses unique problems. An event-driven simulation program is described which models information system operation. This simulator is being used to examine caching strategies. Some results of these studies are presented.
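
The paper evaluates caching strategies by simulation; as a point of reference, the bookkeeping of one common policy (least recently used) looks like this (a toy sketch, not the simulator's design):

    from collections import OrderedDict

    class LRUCache:
        """Toy cache for records retrieved to a PC-equipped site."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.items = OrderedDict()

        def get(self, key):
            if key not in self.items:
                return None                      # miss: ask the central system
            self.items.move_to_end(key)          # refresh recency
            return self.items[key]

        def put(self, key, value):
            self.items[key] = value
            self.items.move_to_end(key)
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)   # evict least recently used

    cache = LRUCache(2)
    cache.put("doc1", "...")
    cache.put("doc2", "...")
    cache.get("doc1")
    cache.put("doc3", "...")                     # evicts doc2, not doc1
    print(list(cache.items))                     # ['doc1', 'doc3']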