
Showing papers presented at the "International ACM SIGIR Conference on Research and Development in Information Retrieval" in 1983


Journal ArticleDOI
01 Jun 1983
TL;DR: Results indicate that for any two representations considered, performance values differed slightly while overlap scores were low, thus supporting the evidence that recall and precision as performance measures mask differences between the sets of retrieved documents.
Abstract: Most previous investigations comparing the performance of different representations have used recall and precision as performance measures. However, there is evidence to show that these measures are insensitive to an important difference between representations. To explain, two representations may perform similarly on these measures, while retrieving very different sets of documents. Equivalence of representations should be decided on the basis of similarity in performance and similarity in the documents retrieved. This study compared the performance of four representations in the PsycAbs database. In addition, overlap between retrieved sets was also computed, where overlap is the proportion of retrieved documents that are the same for pairs of document representations. Results indicate that for any two representations considered, performance values differed only slightly while overlap scores were low, thus supporting the evidence that recall and precision as performance measures mask differences between the sets of retrieved documents. Results are interpreted to propose an optimal ordering of the representations and to examine the contribution of each representation given this combination.
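The overlap measure described above lends itself to a direct computation. The sketch below is an illustration only; the function name and the normalization by the smaller retrieved set are assumptions, since the abstract does not pin down the exact formula.

```python
def overlap(retrieved_a, retrieved_b):
    """Proportion of retrieved documents shared by two representations.

    Normalizing the intersection by the smaller retrieved set is an
    assumption; the paper does not state its exact normalization.
    """
    a, b = set(retrieved_a), set(retrieved_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

# Two representations with similar precision can still retrieve
# mostly different documents, i.e. low overlap:
print(overlap(["d1", "d2", "d3"], ["d3", "d4", "d5"]))  # 0.33...
```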

110 citations


Journal ArticleDOI
Karen Kukich
01 Jun 1983
TL;DR: Examples drawn from the implementation of the stock report generator are used to describe the components of a knowledge-based report generator.
Abstract: Knowledge-Based Report Generation is a technique for automatically generating natural language summaries from databases. It is so named because it applies the tools of knowledge-based expert systems design to the problem of text generation. The technique is currently being applied to the design of an automatic natural language stock report generator. Examples drawn from the implementation of the stock report generator are used to describe the components of a knowledge-based report generator.

57 citations


Journal ArticleDOI
01 Jun 1983
TL;DR: A variety of different organizations has been proposed to enhance processing of text retrieval operations, and the advantages and disadvantages inherent in each of these approaches are discussed, along with a number of proposed implementations.
Abstract: As databases become very large, conventional digital computers cannot provide satisfactory response time. This is particularly true for text databases, which must often be several orders of magnitude larger than formatted databases to store a useful amount of information. Even the standard techniques for improving system performance (such as inverted files) may not be sufficient to give the desired performance, and the use of an unconventional hardware organization may become necessary. A variety of different organizations has been proposed to enhance processing of text retrieval operations. Most of these have concentrated on the design of fast, efficient search engines. These can be divided into three classes: associative memories, cellular pattern matchers, and finite state automata. The advantages and disadvantages inherent in each of these approaches are discussed, along with a number of proposed implementations. Finally, the text retrieval system under development at the University of Utah is discussed in more detail.
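To make the finite state automaton class concrete, the sketch below shows a software analogue of FSA-style exact pattern matching: a precomputed transition table (the classic Knuth-Morris-Pratt construction) that scans text one character at a time, much as the proposed hardware matchers stream text past a fixed automaton. This is a generic illustration, not one of the surveyed designs.

```python
def build_dfa(pattern, alphabet):
    """Transition table for exact matching of `pattern`.

    dfa[state][ch] is the next state after reading `ch`; reaching
    state len(pattern) signals a match (KMP-style construction).
    """
    m = len(pattern)
    dfa = [{ch: 0 for ch in alphabet} for _ in range(m)]
    dfa[0][pattern[0]] = 1
    restart = 0  # state reached by the same input minus its first character
    for state in range(1, m):
        for ch in alphabet:
            dfa[state][ch] = dfa[restart][ch]   # mismatch: fall back
        dfa[state][pattern[state]] = state + 1  # match: advance
        restart = dfa[restart][pattern[state]]
    return dfa

def search(text, pattern):
    """Return the offset of the first match of `pattern`, or -1."""
    dfa = build_dfa(pattern, set(text) | set(pattern))
    state = 0
    for i, ch in enumerate(text):
        state = dfa[state][ch]
        if state == len(pattern):
            return i - len(pattern) + 1
    return -1

print(search("text retrieval hardware", "retrieval"))  # 5
```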

52 citations


Journal ArticleDOI
01 Jun 1983
TL;DR: An overview of the IRUS system, emphasizing transportability; the current implementation, which uses the System 1022 dbms on the DEC KL-2060, is considered only one of a set of possible implementations, so IRUS is not constrained by 1022's strengths and weaknesses.
Abstract: This paper describes work in progress to develop a facility for natural language access to a variety of computer databases and database systems. This facility, called IRUS for Information Retrieval using the RUS parsing system, allows users who are unfamiliar with the technical characteristics of the underlying database system to query databases using typed English input. This system can be thought of as a stand-alone query system or as part of a management information system (MIS) or a decision support system (DSS). Many systems boast of having a "user-friendly" or "English-like" or even "English" interface so that users require a minimum of special training to use the system, but most such systems use shallow, relatively ad hoc techniques that are not robust or linguistically sound. We are using a large, well-tested, theoretically-based, general parser of English that has been developed and extended in a variety of research projects for over a decade. One of the primary emphases of IRUS is transportability, which includes three types of changes: (1) changing the domain, (2) changing data bases within the same domain, and (3) changing data base systems. The use of a general parser for English is an important part of the solution to the transportability problem, but there are other parts as well, since portions of the system beyond the parser must know the conceptual content of the domain, the way in which this is reflected in a collection of datasets, and the operating characteristics of the dbms being used to access these datasets. Other researchers have investigated similar issues [8, 5, 6, 12]. We have attacked this problem by building a knowledge-based system, with procedural components independent of domain and data base structure, directed by domain and database dependent knowledge structures. We are also building tools for conveniently creating and maintaining these knowledge structures, with an eventual goal of allowing end-users to extend and modify these knowledge structures to suit their own needs. Given this set of goals, and these tools, we consider the current implementation, which uses the System 1022 dbms on the DEC KL-2060, to be only one of a set of possible implementations, and are not constraining IRUS on the basis of 1022's strengths and weaknesses. This paper presents an overview of the IRUS system, emphasizing those aspects of the design that are critical to transportability. We describe the parsing system, which is a completely independent module that has been interfaced to a variety of different applications, and then discuss the other modules which bridge the gap between the parser and the dbms.

31 citations



Journal ArticleDOI
Christine L. Borgman
01 Jun 1983
TL;DR: This is the first monitoring study of an online catalog performed without system-defined user sessions, and preliminary results suggest that users have much shorter sessions than on other types of retrieval systems.
Abstract: We report on a computer monitoring study of users of the Ohio State University Libraries' online catalog, an established and heavily used information retrieval system. To our knowledge, this is the first monitoring study of an online catalog performed without system-defined user sessions. Online catalogs represent a class of retrieval systems which are designed for end users, require little or no formal training, and replace an existing manual system. The study characterizes user behavior in terms of types of searches done, patterns of use, time spent on searching, errors, and system problems. Preliminary results suggest that users have much shorter sessions than on other types of retrieval systems. Patterns of use vary across campus libraries and academic quarters, and between short and long sessions. Results of the study will be applied to improving the user interface and other system features.

21 citations


Journal ArticleDOI
V. J. Geller, Michael Lesk
01 Jun 1983
TL;DR: The difference in preference is explained by the degree of user foreknowledge of the data base and its organization: library users clearly preferred keyword search, while news readers chose keyword retrieval 50% less often than menu choice.
Abstract: Do users prefer selection from a menu or specification of keywords to retrieve documents? We tried two experiments, one using an on-line library catalog and the other an on-line news wire. In the first, library users could either issue keyword commands to see book catalog entries, or choose categories from a menu following the Dewey Decimal classification of the books. In the second, news wire users could read Associated Press news stories either by posting a keyword profile against which all stories were matched, or by selecting them from a menu of current news items. For the library users, keyword searches were clearly preferred, by votes of 3 and 4 to 1; for the news stories, retrieval by keyword search is 50% less common than menu choice. We suggest that the difference is based on the degree of user foreknowledge of the data base and its organization. Menu-type interfaces tell the user what is available. If the user already knows, as in the library where a majority of the users have a particular book in mind, then the menu is merely time-consuming. But when the user does not know what is available (almost the definition of "news" is that it is new, and unpredictable), the menu is valuable because it displays the choice.

19 citations


Journal ArticleDOI
John E. Tolle
01 Jun 1983
TL;DR: The methodology employed is to obtain machine-readable transaction logs, on tape, from the online catalogs and then analyze these transactions through stochastic search pattern development and mathematical models using Markov chain analysis and transition probability matrices.
Abstract: From November 1981 to April 1983, OCLC's Office of Research has been conducting research into online public access catalogs (OPACs). This project has been funded in part by the Council on Library Resources, Inc. as an attempt to provide new insight into the use of online catalogs by obtaining information which may serve as input for better system design of OPACs, utilizing not only desired user features but also more effective searching. The overall study is concerned with the patron and the system and consists of three major parts. The first is the study of current use of online catalogs, i.e., the actual use - what is really happening. The second element is concerned with the perceived patron use of the catalogs and involves the use of questionnaires and focus group interviews at the participating institutions. The third part is an application of the findings from the first two parts. This paper focuses on the current utilization of OPACs. The methodology chosen is to obtain machine-readable transaction logs, via tapes, from the online catalogs and subsequently to analyze these transactions through stochastic search pattern development and mathematical models utilizing Markov chain analysis and the development of transition probability matrices.
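As a sketch of how logged transactions feed Markov chain analysis: each session becomes a sequence of search states, and pairwise transitions are tallied into a transition probability matrix. The state labels below are invented for illustration; the study's own state coding is not reproduced here.

```python
from collections import Counter, defaultdict

def transition_matrix(sessions):
    """Estimate transition probabilities between search states.

    `sessions` holds state sequences reconstructed from transaction
    logs.  State labels such as "title_search" are hypothetical.
    """
    counts = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    return {state: {nxt: n / sum(following.values())
                    for nxt, n in following.items()}
            for state, following in counts.items()}

logs = [
    ["title_search", "display_record", "end"],
    ["title_search", "subject_search", "display_record", "end"],
]
print(transition_matrix(logs)["title_search"])
# {'display_record': 0.5, 'subject_search': 0.5}
```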

18 citations


Journal ArticleDOI
01 Jun 1983
TL;DR: A network organization for implementing a document retrieval system that has significant advantages in terms of the range of searches that can be used when compared to either inverted or clustered file organizations is proposed.
Abstract: A network organization for implementing a document retrieval system is proposed. This organization has significant advantages in terms of the range of searches that can be used when compared to either inverted or clustered file organizations. Algorithms for generating and maintaining the network are described together with experiments designed to test their efficiency and effectiveness.

17 citations


Journal ArticleDOI
01 Jun 1983
TL;DR: A new clustering algorithm is described that determines both the number of clusters in a collection and the number of elements in each cluster before beginning the final clustering process.
Abstract: In this paper, a new clustering algorithm has been described. The algorithm proposed determines both the number of clusters in a collection, and the number of elements in each cluster before beginning the final clustering process. The complexity assessment of the algorithm and the implementation issues are also emphasized.

14 citations


Journal ArticleDOI
01 Jun 1983
TL;DR: It is shown that the normalized recall is closely related to other measures such as the CRE-measure and the expected search length, and some implications are analysed.
Abstract: The normalized recall is one of the most popular evaluation measures for information retrieval systems. In this paper an overview of its development is given. It is then shown that the normalized recall is closely related to other measures such as the CRE-measure and the expected search length. Some implications are analysed.
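For reference, the standard (Rocchio) formulation of the normalized recall compares the ranks at which the n relevant documents are actually retrieved against the ideal ranks 1 through n; the sketch below uses that textbook form, which may differ in notation from the paper's.

```python
def normalized_recall(relevant_ranks, collection_size):
    """Rocchio's normalized recall.

    `relevant_ranks`: 1-based ranks of the relevant documents in the
    system's output ordering; `collection_size`: N, the number of
    documents.  Equals 1 for a perfect ranking, 0 for the worst.
    """
    n = len(relevant_ranks)
    ideal = sum(range(1, n + 1))
    return 1 - (sum(relevant_ranks) - ideal) / (n * (collection_size - n))

print(normalized_recall([1, 2, 3], 100))      # 1.0 (best possible)
print(normalized_recall([98, 99, 100], 100))  # 0.0 (worst possible)
```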

Journal ArticleDOI
01 Jun 1983
TL;DR: It is suggested that this approach to building an expert system for searching the cancer therapy literature on MEDLINE is superior in terms of retrieval performance compared with alternative approaches to end-user searching which fail to exhibit detailed knowledge regarding the subject matter of the search.
Abstract: This paper reviews work towards building an expert system for searching the cancer therapy literature on MEDLINE. A modified subset of the Medical Subject Headings (MeSH) has been stored on a micro-computer and accessed via a touch terminal. Searches, previously requested of the Oncology Information Service at the University of Leeds, have been used to test out the principle of end user searching and the results compared with the searching expertise of a MEDLARS indexer. Original program development was in PASCAL, but a rule-based approach, which is independent of a particular programming language, has been developed for search term and frame selection adopting a 'blackboard' philosophy in tracing the process of selection. Work is progressing on an implementation using the expert systems programming language PROLOG, which has been found a very suitable language for representing rules and provides a ready made rule interpreter. It is suggested that this approach is superior in terms of retrieval performance compared with alternative approaches to end-user searching which fail to exhibit detailed knowledge regarding the subject matter of the search.
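As a loose sketch of the rule-based term selection described above, rules can map phrases in a free-text request to candidate MeSH-style headings. Everything below - the rules, the headings, and the matching scheme - is an invented placeholder, not the system's actual knowledge base (which was developed in PASCAL and PROLOG, not Python).

```python
# Each rule pairs a condition over the request with headings to select.
# The conditions and MeSH-style headings are illustrative inventions.
RULES = [
    (lambda req: "chemotherapy" in req, ["Antineoplastic Agents", "Drug Therapy"]),
    (lambda req: "radiotherapy" in req, ["Radiotherapy"]),
    (lambda req: "lung" in req, ["Lung Neoplasms"]),
]

def select_terms(request):
    """Fire every rule whose condition matches the lowercased request."""
    request = request.lower()
    terms = []
    for condition, headings in RULES:
        if condition(request):
            terms.extend(headings)
    return terms

print(select_terms("Chemotherapy for lung cancer"))
# ['Antineoplastic Agents', 'Drug Therapy', 'Lung Neoplasms']
```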

Journal ArticleDOI
01 Jun 1983
TL;DR: It is shown that search effectiveness, when no relevance information is assumed, can be further enhanced by using the 2-Poisson model; when the term weights proposed in this work are used in conjunction with weights known as term significance weights, the results are very encouraging.
Abstract: The early work on the probabilistic models of retrieval assumed that the document representation is binary, indicating only the presence or absence of index terms. The 2-Poisson (TP) model, which was proposed as a model of how the occurrence frequency of specialty words in a collection is distributed, has since been used to develop retrieval strategies that incorporate term frequency information. This work further investigates the use of the TP model in this context. It is shown that the search effectiveness, when no relevance information is assumed, can be further enhanced by using this model. Furthermore, when the term weights proposed in this work are used in conjunction with weights known as term significance weights, the results are very encouraging.
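For background, the 2-Poisson model treats a term's within-document frequency as a mixture of two Poisson distributions, one for the documents "elite" for the term and one for the rest. A minimal sketch follows; the parameter values are invented for illustration.

```python
from math import exp, factorial

def two_poisson(k, p, lam_elite, lam_rest):
    """P(term occurs k times in a document) under the 2-Poisson model.

    p: probability that a document is 'elite' for the term;
    lam_elite, lam_rest: mean occurrence rates inside and outside the
    elite set.  All parameter values below are invented.
    """
    poisson = lambda lam: exp(-lam) * lam ** k / factorial(k)
    return p * poisson(lam_elite) + (1 - p) * poisson(lam_rest)

# A specialty word: rare in most documents, frequent in its elite set.
for k in range(4):
    print(k, round(two_poisson(k, p=0.1, lam_elite=3.0, lam_rest=0.2), 4))
```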

Journal ArticleDOI
01 Jan 1983
TL;DR: By using overlapping word fragments to index text, this work can combine the best features of the keyword and the full text approaches to document retrieval so as to facilitate searches on any content word.
Abstract: By using overlapping word fragments to index text, we can combine the best features of the keyword and the full text approaches to document retrieval so as to facilitate searches on any content word. The characteristics of a retrieval system based on word fragment indexing can be precisely predicted from a multinomial model of text. Controlled experiments with two different text collections indicate that such a system can be highly effective under quite general conditions.
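A sketch of word fragment indexing as described: each content word is decomposed into overlapping fixed-length fragments, and the inverted index is built over fragments rather than whole keywords. The fragment length of 3 and the '#' boundary markers are assumptions for illustration; the paper's fragment inventory may differ.

```python
def fragments(word, n=3):
    """Overlapping n-letter fragments of a word, with boundary markers."""
    padded = f"#{word.lower()}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def index_documents(docs):
    """Invert documents on word fragments instead of whole keywords."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            for frag in fragments(word):
                index.setdefault(frag, set()).add(doc_id)
    return index

index = index_documents({1: "document retrieval", 2: "text retrieval"})
print(fragments("ret"))  # ['#re', 'ret', 'et#']
print(index["ret"])      # {1, 2} - both documents contain 'retrieval'
```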


Journal ArticleDOI
01 Jun 1983
TL;DR: Some of the ways artificial intelligence might influence the field of information retrieval are pointed out and one application is examined in more detail to discover the kind of technical problems involved in its fruitful exploitation.
Abstract: Overall, the field of information retrieval is already more aware than many other fields of the relevance of artificial intelligence (AI) [1-6]. Nonetheless there remain exciting applications of artificial intelligence that have been so far overlooked. In this paper we will point out some of the ways artificial intelligence might influence the field of information retrieval. We will then examine one application in more detail to discover the kind of technical problems involved in its fruitful exploitation.Before proceeding, it is important to interject a note of caution. While the promise of artificial intelligence is indeed bright, the time of complete fulfillment of its promise is a long way off. Of course, some of the expected contributions are shorter term than others. However, the more difficult problems will fall only after a good deal of basic research is accomplished. Artificial intelligence researchers have, in the past, been culpable of what can most charitably be described as over-optimism [7,8]. This naivete on the part of even the most respected of researchers stemmed from the profound subtleties underlying intelligent behavior. The problem is compounded by the fact that some of the most difficult of intelligent behavior (i.e. common sense) seems intuitively easy.

Journal ArticleDOI
Gerard Salton
01 Jun 1983
TL;DR: Certain recent advances in information retrieval research are mentioned, including the formulation of new probabilistic retrieval models, and the development of automatic document analysis and Boolean query processing techniques.
Abstract: Information retrieval components are currently incorporated in several types of information systems, including bibliographic retrieval systems, data base management systems and question-answering systems. Some of the problems arising in the real-time environment in which these systems operate are briefly discussed. Certain recent advances in information retrieval research are then mentioned, including the formulation of new probabilistic retrieval models, and the development of automatic document analysis and Boolean query processing techniques.


Journal ArticleDOI
01 Jun 1983
TL;DR: The problem of the choice of search terms and how this choice may be affected by an independence assumption is addressed.
Abstract: Underlying many of the probabilistic models for information retrieval are assumptions of stochastic dependence or independence of varying degrees of severity for the index terms describing the documents. These models generally specify a matching function, that is a function which compares a query with each document. The form of that function is to a large extent determined by the particular dependence/independence assumption. For example, if the index terms are assumed to be independently distributed over both the set of relevant and non-relevant documents then the matching function will in general be linear, whereas an assumption of dependence will lead to a non-linear function.Irrespective of the form that the matching function may take it is always assumed that the search terms in the query are known. In this paper I wish to address the problem of the choice of search terms and how this choice may be affected by an independence assumption.
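To make the linearity claim concrete: under term independence, the classic binary independence matching function reduces to a sum of per-term log-odds weights over the terms a document and query share. The sketch below uses that standard textbook form with invented weights; it is not necessarily the paper's notation.

```python
from math import log

def term_weight(p, q):
    """Log-odds weight: p (q) is the probability the term occurs in
    relevant (non-relevant) documents."""
    return log(p * (1 - q) / (q * (1 - p)))

def bim_score(doc_terms, query_weights):
    """Linear matching function: independence makes the score a sum."""
    return sum(w for term, w in query_weights.items() if term in doc_terms)

weights = {"retrieval": term_weight(0.8, 0.2),
           "boolean": term_weight(0.5, 0.3)}
print(bim_score({"retrieval", "model"}, weights))  # only 'retrieval' fires
```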

Journal ArticleDOI
01 Jun 1983
TL;DR: The evaluation of multikey search algorithms, which use more than a single key to locate a record for retrieval or update, is discussed, along with the requirements such evaluations should meet.
Abstract: File structures and algorithms for multikey searching allow more than a single key to be used in locating a record for use in retrieval or update. Such algorithms are of use in many different kinds of information systems, including database systems, information retrieval systems, and pattern recognition and image processing systems. Such algorithms have received increased attention in recent years. However, they are not as well understood as those which handle single keys. Multikey algorithms are more difficult to evaluate than those based on the use of single keys. There are simply more factors to be considered. The evaluations performed for such algorithms should allow comparisons in order to be useful to a community of researchers and users. Theoretical analyses should be based on reasonable and clearly stated assumptions. Experiments should be repeatable and statistically valid whether they are based on "real" data or on randomly generated data.
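As one concrete instance of a multikey file structure, the sketch below gives a minimal k-d tree supporting exact-match search on a tuple of keys. It is a generic textbook structure chosen for illustration, not one taken from the paper.

```python
class KDNode:
    """Node of a k-d tree over fixed-length key tuples."""
    def __init__(self, keys, record):
        self.keys, self.record = keys, record
        self.left = self.right = None

def insert(root, keys, record, depth=0):
    """Insert by cycling through the key dimensions level by level."""
    if root is None:
        return KDNode(keys, record)
    axis = depth % len(keys)
    if keys[axis] < root.keys[axis]:
        root.left = insert(root.left, keys, record, depth + 1)
    else:
        root.right = insert(root.right, keys, record, depth + 1)
    return root

def search(root, keys, depth=0):
    """Exact-match lookup on the full key tuple."""
    if root is None:
        return None
    if root.keys == keys:
        return root.record
    axis = depth % len(keys)
    child = root.left if keys[axis] < root.keys[axis] else root.right
    return search(child, keys, depth + 1)

root = None
for keys, record in [((1983, 42), "record A"), ((1981, 7), "record B")]:
    root = insert(root, keys, record)
print(search(root, (1981, 7)))  # record B
```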

Journal ArticleDOI
01 Jun 1983
TL;DR: A method is proposed, based on transforming words into lexicographically ordered strings of distinct letters together with permutation indexes, for partitioning the dictionary so that the "information bearing fraction" is stored in fast memory and the bulk in auxiliary memory.
Abstract: A method for compressing large dictionaries is proposed, based on transforming words into lexicographically ordered strings of distinct letters, together with permutation indexes. Algorithms to generate such strings are described. Results of applying the method to the dictionaries of two databases, in Hebrew and English, are presented in detail. The main message is a method of partitioning the dictionary such that the "information bearing fraction" is stored in fast memory, and the bulk in auxiliary memory.
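A sketch of the word transformation described above: each word maps to the lexicographically ordered string of its distinct letters, and a small index distinguishes the words that share a canonical string. Here the index is simply a position within the sorted group; the paper's permutation indexes are a more refined encoding.

```python
from collections import defaultdict

def canonical(word):
    """Lexicographically ordered string of the word's distinct letters."""
    return "".join(sorted(set(word)))

def compress(dictionary):
    """Map each word to its canonical string plus a small index.

    The positional index used here is a simplification of the paper's
    permutation indexes.
    """
    groups = defaultdict(list)
    for word in sorted(dictionary):
        groups[canonical(word)].append(word)
    return {word: (key, i)
            for key, words in groups.items()
            for i, word in enumerate(words)}

table = compress(["dear", "dare", "read", "red"])
print(canonical("dare"))  # 'ader'
print(table["read"])      # ('ader', 2): third word sharing 'ader'
```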

Journal ArticleDOI
01 Jun 1983
TL;DR: The grammar developed for Spanish is presented and compared with the German grammar which was previously implemented and upon which it is based; the main differences are pointed out, and the generality of the system in dealing with other natural languages is shown.
Abstract: The User Specialty Languages (USL) System is an applications-independent natural language interface to a Relational Database System. It provides non-DP-trained people with a tool to introduce, query, manipulate and analyse the data stored in a Relational Database via natural language. USL interfaces with different languages; in the present paper the grammar developed for Spanish is presented and compared with the German grammar which was previously implemented and upon which it is based. Their main differences are pointed out, and the generality of the system in dealing with other natural languages is shown.

Journal ArticleDOI
01 Jun 1983
TL;DR: The principal emphasis was to provide users with search formulation aids applicable to specific document files searchable using differing IR systems; aids for thesaurus construction and for file construction and indexing have not yet been tested.
Abstract: Project Information Bridge is a West German Federal Ministry for Research and Technology supported project. It has been conducted in cooperation with the Gesellschaft für Information und Dokumentation (Society for Information and Documentation) (GID) and the various data base hosts in West Germany. Its goal was to develop and test a working prototype of an add-on package for existing IR systems. This report covers the time period from October 1981 through March 1983 during which the prototype was successfully developed and tested. The principal emphasis was to provide users with search formulation aids applicable to specific document files searchable using differing IR systems. Aids for thesaurus construction for natural language words and for file construction and indexing have not at this date been tested. Should tests be completed prior to the conference date, information concerning these topics will be presented at the conference.

Journal ArticleDOI
01 Jun 1983
TL;DR: The RESEDA project is concerned with the construction of AI Information Retrieval systems working on databases containing biographical data; an inference procedure of the type "hypothesis" can establish a new causal relationship, within the base, between an "episode" provided explicitly by the user and one or more "episodes" that the system retrieves.
Abstract: The RESEDA project is concerned with the construction of AI Information Retrieval systems working on databases containing biographical data. There exist in RESEDA two fundamental ways of retrieving information requested by a user. In the first case, the information we wish to obtain is data which already exists in the base. This data can be obtained by direct match with the "search model" corresponding to the user's question. If this is not possible, we can still try to get an answer by using the inference procedures of the "transformation" type. The second method retrieves information which, in contrast, is created ex nihilo by the search procedure itself. It expresses, in fact, the possibility of a new causal relationship, within the base, between an "episode" provided explicitly by the user and one or more "episodes" that the system retrieves by applying an inference procedure of the type "hypothesis".


Journal ArticleDOI
01 Sep 1983
TL;DR: The research and ideas being discussed at this conference will have a profound influence on our society during the coming decades.
Abstract: It's a real pleasure to have the opportunity to speak at this important conference. The research and the ideas you are discussing this week will have a profound influence on our society during the coming decades. The societal and economic changes, both now and in the future, brought about by information systems and computers are attracting the attention of the media, the general public, and even the Congress. As you may know, "High Technology" is one of the most frequently heard buzzwords around Washington these days. If you could retrieve, from a data base of this year's congressional speeches, those speeches containing the key words "High Tech," I'd bet you'd find at least one speech by each member of Congress. In my own case, you'd probably get a list of every speech I made.

Journal ArticleDOI
01 Jun 1983
TL;DR: This study examined structures created by UNIX users to organize their files within a hierarchical directory scheme, and examined the relation between structure and command usage.
Abstract: The structures in which users store their files facilitate retrieval by enabling users to deduce a file's contents from its place in the organization. This study examined structures created by UNIX users to organize their files within a hierarchical directory scheme, and examined the relation between structure and command usage. Users' difficulties in managing the complexity of a hierarchical structure limited the amount of information about files that these structures contained. Tree complexity increased in a negatively accelerating function with the number of files. Users who grouped their files into few directories arranged in shallow trees could navigate through the tree easily, but they sacrificed information: directory names were less specific, and users made more command errors. More sophisticated users created deeper trees. They were able to manage more files but also made extensive use of navigation aids.

Journal ArticleDOI
01 Jun 1983
TL;DR: A number of experiments are presented which use the maximum-entropy principle to construct fixed-length keys from pertinent fields in order to locate and retrieve unique records as well as clusters with lexically homogeneous information.
Abstract: A principle of information science states that the entropy of a set of symbols is maximised when the probability of occurrence of each becomes the same. This paper presents the results of a number of experiments which utilise this principle to construct fixed length keys from pertinent fields in order to locate and retrieve unique records as well as clusters with lexically homogeneous information. Each key incorporates codes derived by various positional selection methods and their discriminating strength proves to be well over 95%.
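Restating the principle numerically: the Shannon entropy of a symbol set attains its maximum, the base-2 logarithm of the alphabet size, exactly when all symbols are equiprobable, which is why key-construction codes aim at near-uniform symbol usage. A quick check (probability values invented):

```python
from math import log2

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Four symbols: a skewed distribution carries less information per
# symbol than the uniform one, which attains the maximum log2(4) = 2.
print(round(entropy([0.7, 0.1, 0.1, 0.1]), 3))  # 1.357
print(entropy([0.25] * 4))                      # 2.0
```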

Journal ArticleDOI
01 Jun 1983
TL;DR: Approaches to both concept learning, in the form of Generalization-Based Memory, and powerful, robust text processing achieved by Memory-Based Understanding are discussed.
Abstract: Natural language processing techniques developed for Artificial Intelligence programs can aid in constructing powerful information retrieval systems in at least two areas. Automatic construction of new concepts allows a large body of information to be organized compactly and in a manner that allows a wide range of queries to be answered. Also, using natural language processing techniques to conceptually analyze the documents being stored in a system greatly expands the effectiveness of queries about given pieces of text. However, only robust conceptual analysis methods are adequate for such systems. This paper will discuss approaches to both concept learning, in the form of Generalization-Based Memory, and powerful, robust text processing achieved by Memory-Based Understanding. These techniques have been implemented in the computer systems IPP, a program that reads, remembers and generalizes from news stories about terrorism, and RESEARCHER, currently in the prototype stage, that operates in a very different domain (technical texts, patent abstracts in particular).

Journal ArticleDOI
01 Jun 1983
TL;DR: This paper deals with statistical databases that are generated from statistical surveys and that reside in organizations which perform a large number of surveys--some of which are repetitive.
Abstract: This paper deals with statistical databases that are generated from statistical surveys and that reside in organizations which perform a large number of surveys--some of which are repetitive. Examples of such organizations are Federal statistical agencies such as the Energy Information Administration, Bureau of Labor Statistics, National Center for Educational Statistics, National Center for Health Statistics, etc; state governments that have bureaus or departments that collect such data; and marketing research departments of most large consumer-oriented companies. Computer processing has provided a powerful tool for storing, manipulating, and analyzing statistical survey data. However, in addition to these advantages, computing has created a major problem in that most data analysts and users have lost touch with the data and their generation. They no longer have the feel and sense for the data that once was possible. In this paper we present an approach to database design that will directly attack this problem and enhance the usefulness of such databases as well.