Showing papers presented at "International ACM SIGIR Conference on Research and Development in Information Retrieval in 1980"

PDF

Open Access

Proceedings Article•DOI•

Probabilistic models of indexing and searching

[...]

Stephen Robertson, C. J. van Rijsbergen, M. F. Porter

23 Jun 1980

TL;DR: There is a considerable body of related work by Salton, Yu and associates on automatic indexing using within-document frequencies of terms.

...read moreread less

Abstract: for example, Robertson and Sparck Jones, 1976; van Rijsbergen, 1977; Harper and van Rijsbergen, 1978), and the work done in the USA on automatic indexing using within-document frequencies of terms (notably by Bookstein and Swanson, 1974, 1975; Harter, 1975a, b; Bookstein and Kraft, 1977). (There is a considerable body of related work by Salton, Yu and associates

...read moreread less

366 citations

Proceedings Article•DOI•

The automatic generation of literature abstracts: an approach based on the identification of self-indicating phrases

[...]

Chris D. Paice

23 Jun 1980

TL;DR: The production of well-constructed abstracts is an artificial intelligence problem, and therefore unlikely to be either feasible or worthwhile until well into the future: the alternative of picking sentences from here and there in a document is a rather unattractive proposition.

...read moreread less

Abstract: Considering the important part played by abstracts in the traditional information services, the possibility of producing abstracts by computer has not received very much attention. There are perhaps two main reasons for this. First, it appears tha t the production of well-constructed abstracts is an artificial intelligence problem, and therefore unlikely to be either feasible or worthwhile until well into the future: the alternative of picking sentences from here and there in a document is a rather unattractive proposition. Second, the cost of key-punching complete texts for input to an abstracting program can hardly be justified especially since the program will then in effect discard most of the text which has been so laboriously prepared. It now appears that the first of these objections is exaggerated --reasonable-looking abstracts can often be produced by quite 'unintelligent' programs -whi le with advances in technology the second problem should soon disappear. We should be ready to take advantage of this when it happens. Early work in this field concentrated on the extracting problem: that is to say, with finding sentences which could be extracted from a text to convey a good idea of its subject matter. Luhn (1958) write a program which looked for sentences containing clusters of 'key words': that is, the most frequent noncommonplace words in the text. The clusters were weighted according to their size and density, and those sentences containing the most highly weighted clusters were selected. At about the same time Baxendale (1958) drew attention to the fact that the position of a sentence within a text has a bearing on its importance: for instance, she showed that in 85 per cent of a sample of 200 paragraphs the 'topic' sentence was the first, while in another 7 per cent it was the last. Extending this idea, we can understand that the first few and the last few paragraphs of a document are likely to give a strong indication of its overall subject: the pages in between usually contain a lot of detail, which is not of much value taken out of its context. During the 1960s the most important work was carried out by Edmundson (1969), who studied four extracting methods, both individually and in all possible combinations. All four methods involved the assignment of weights to sentences, and the subsequent selection of sentences with the highest weights. The location method weighted sentences if they occurred in preferred positions

...read moreread less

119 citations

Proceedings Article•DOI•

A performance evaluation of similarity measures, document term weighting schemes and representations in a Boolean environment

[...]

Terry Noreault, Michael J. McGill, Matthew B. Koll

23 Jun 1980

TL;DR: It is felt, but does need further examination, that while the absolute effectiveness of any ranking algorithms may vary with the environment, the relative effectiveness of the ranking algorithms will be invariant.

...read moreread less

Abstract: This chapter reports on the results of a study of the effectiveness of ranking algorithms. Also reported here will be some unexpected findings relating to the performance of document representations and searcher differences. These findings are a by-product of the evaluation of the ranking algorithm. The goal of the study was to evaluate ranking algorithms so that generalisations about their effectiveness could be made. Many different ranking algorithms have been suggested (Sager and Lockemann, 1976). Evaluation of the effectiveness of these algorithms has been conducted under differing experimental conditions. These differences in the evaluation conditions have made comparisons of ranking algorithms uncertain. This study evaluated the effectiveness of the ranking algorithms using a single database, common user population, and common sets of queries and relevance judgements. This approach allowed the relative effectiveness of the ranking algorithms to be determined. It is felt, but does need further examination, that while the absolute effectiveness of any ranking algorithms may vary with the environment, the relative effectiveness of the ranking algorithms will be invariant.

...read moreread less

80 citations

Journal Article•DOI•

Lexical relations: enhancing effectiveness of information retrieval systems

[...]

Edward A. Fox¹•Institutions (1)

Cornell University¹

01 Dec 1980

TL;DR: The SMART type of information retrieval system is described and the applicability of the above-mentioned lexicon to such a system is discussed, and the list of lexical relations included in the ECD is expanded and organized to be more effective for retrieval, partially along the lines suggested by Evens and Smith.

...read moreread less

Abstract: One of the essential features of the "Meaning Text" model (MTM) developed by I. A. Mel'chuk et. al. is the special lexicon or ECD ('explanatory and combinatory' dictionary). This component can be thought of as a collection independent thesaurus, and can be applied to improve the effectiveness of an information retrieval system.After outlining the MTM and related work, this paper briefly describes the SMART type of information retrieval system. Applicability of the above-mentioned lexicon to such a system is discussed. In particular, the list of lexical relations included in the ECD is expanded and organized to be more effective for retrieval, partially along the lines suggested by Evens and Smith.Finally, an experimental analysis of the utility of lexical relations in an information retrieval system is discussed. It is shown that lexical relations generally enhance system performance. When all lexical relations are considered in the comparison, the resulting performance is shown, by statistical methods, to make a significant improvement (up to 16.5% at a single recall level); when all lexical relations except for antonyms are considered, the improvement is even greater (up to 20.2% at a single recall level).

...read moreread less

68 citations

Proceedings Article•DOI•

Conceptual information retrieval

[...]

Roger C. Schank, Janet L. Kolodner, Gerald DeJong

23 Jun 1980

TL;DR: The problems involved in building an intelligent information retrieval system based on a model of human understanding and memory retrieval are described, and the CyFr system, which implements those solutions are proposed and implemented.

...read moreread less

Abstract: : If we want to build intelligent information retrieval systems, we will have to give them the capabilities of understanding Natural language, automatically organizing and reorganizing their memories, and using intelligent heuristics for searching their memories. These systems will have to analyze and understand both new text and Natural Language queries. In answering questions, they will have to direct memory search to reasonable places. This requires good organization of both the conceptual content of text and knowledge necessary for understanding those texts and accessing memory. The CYRUS and FRUMP systems (Kolodner (1978), Schank and Kolodner (1979), Dejong (1979)) comprise an information retrieval system called CyFr. Together, they have the analysis and retrieval capabilities mentioned above. FRUMP analysis news stories from the UPI wire for their conceptual content, and produces summaries of those stories. It sends summaries of stories about important people to CYRUS automatically adds those stories to its memory, and can then retrieve that information to answer question posed to it in natural language. This paper describes the problems involved in building such an intelligent system. It proposes solutions to some of those problems bases on recent research in Artificial Intelligence and Natural Language processing, and describes the CyFr system, which implements those solutions. The solutions we propose and implement are based on a model of human understanding and memory retrieval. (Author)

...read moreread less

53 citations

Proceedings Article•DOI•

Measurement-theoretical investigation of the MZ-metric

[...]

P. Bollmann, V. S. Cherniavsky

23 Jun 1980

TL;DR: There is no sound basis for comparing different measures or for selecting in some given applicational environment one of the many measures suggested to date, and leaning upon measurement theory, without unambiguous emphasis on the intuitive viewpoint about quality, amounts to justifying one formalism by another.

...read moreread less

Abstract: Information retrieval has a long and rich tradition of measuring and evaluating. Many measures have been suggested for evaluating performance and quality of information retrieval systems. There are many examples where large-scale experiments have been carried out aiming at evaluating and comparing different systems. At the same time, it is well known that different formal measures reflect different intuitive ideas of quality and have different properties. They often contradict each other and they are often incompatible. However, there is as yet no systematic way to describe and investigate the relation between intuitive ideas of quality, on the one hand, and measures representing formally those ideas of quality, on the other. Without a clear understanding of this relation there is no sound basis for comparing different measures or for selecting in some given applicational environment one of the many measures suggested to date. Having no justifiable criteria for comparing different measures and for selecting one of them, users apply formal criteria such as the measure having one number as value, or having a maximum or minimum (Swets, 1969). Such decisions, as well as judgements about relative quality of information retrieval systems based on those decisions, are not convincing:, one system cannot be declared to be better than another one just because some formal mechanism assigns to the first system a symbol which in a formal ordering lies higher than the symbol assigned to the second system. Information systems are conceived for practical use and their quality is a practical matter rather than a formal one. This situation has given and still gives rise to protests. Users try to overcome it in two ways. The first way leads to requiring the measure applied to have certain measurement-theoretic properties (van Rijsbergen, 1974). The second way tries to justify measures by connecting them to the practical application of the information system evaluated by means of this measure (Lancaster, 1968). Both ideas seem sound. However, leaning upon measurement theory, without unambiguous emphasis on the intuitive viewpoint about quality, amounts to justifying one formalism by another. This way the problem of justification is postponed rather than solved. The other idea, trying to connect measures with practical applications, of the system measured, would be precisely what is

...read moreread less

42 citations

Proceedings Article•DOI•

Problems in the simulation of bibliographic retrieval systems

[...]

Jean Tague, Michael L. Nelson, Harry Wu

23 Jun 1980

TL;DR: This chapter is concerned with the second type of simulation, which is based on a probabilistic model of the document representatives, search statements, the search process and the characteristics of the search output of a bibliographic retrieval system.

...read moreread less

Abstract: The difficulty and expense of achieving valid and reliable tests of bibliographic retrieval systems has been attested to by many investigators. If tests are carried out with large operational systems, there are difficulties in experimentally controlling and modifying the variables. On the other hand, results from small experimental systems tend to be unreliable. An alternative approach to determining general relationships among the variables of a bibliographic or document retrieval system is computer simulation, both of the database and of the search process. Such a simulation is based on a probabilistic model of the document representatives, search statements, the search process and the characteristics of the search output. Before such models are used to determine actual relationships, however, they must be validated by comparison with existing databases, query sets and search outputs. Two types of investigations may be aided by bibliographic retrieval system simulation: (1) studies of efficient ways to store and retrieve bibliographic data, and (2) studies of effective ways to represent bibliographic items (documents) and queries (users' information needs). The former is concerned with analysing and comparing different data structures and access algorithms, the latter with analysing and comparing different document indexing and search statement formulation methods. The bibliographic retrieval systems models for the two types of investigations will not, in general, be the same. The two model types might be categorised as physical models and logical models, respectively. This chapter is concerned with the second type of simulation. Its purpose is:

...read moreread less

39 citations

Journal Article•DOI•

Review of "Information Retrieval: Computational and Theoretical Aspects, by H. S. Heaps", Academic Press Inc.

[...]

Richard K. Wiersba¹•Institutions (1)

Cleveland State University¹

01 Jan 1980

TL;DR: The author of this book has done an excellent job of achieving one portion of his two-fold purpose: to introduce the computer science student to some of the basic problems of information retrieval and to describe the techniques required to develop suitable computer programs.

...read moreread less

Abstract: The author of this book has done an excellent job of achieving at least one portion of his two-fold purpose: "...to introduce the computer science student to some of the basic problems of information retrieval and to describe the techniques required to develop suitable computer programs, ... (and) ... to describe the general structure of the relevant computer programs so that basic design considerations may be understood by information officers and librarians will find this text palatable."

...read moreread less

35 citations

Proceedings Article•DOI•

A term weighting model based on utility theory

[...]

Gerard Salton, Harry Wu

23 Jun 1980

TL;DR: The easiest way of introducing distinctions among classes of retrieved items is to use weighted instead of binary index terms to identify queries and documents.

...read moreread less

Abstract: 2.1 Binary and weighted retrieval In information retrieval it is customary to represent each stored record and each information request by sets of content identifiers, or terms. The terms attached to the items may be assigned automatically or chosen manually; in either case, the terms used for a given item collectively represent the information content of the item. In conventional retrieval systems it is not customary to assign weights to the terms to designate term importance. Instead a term is either assigned to an item or it is not: when assigned, the term may be assumed to carry a weight of 1; otherwise, it carries a weight of 0. In standard retrieval a document is retrieved when it contains all the terms specified in the query. The use of unweighted terms is advantageous in the sense that the indexing operation-that is, the assignment of content identifiers to the items of a collection-is relatively simple. In that case, it is not necessary to consider the degree to which a given term may be useful to represent the content of an item: any term that appears at least marginally relevant is assigned to the corresponding item; the term is rejected when it is clearly extraneous. This kind of binary indexing simplifies the input processing; the retrieval operations, on the other hand, may become complicated by the fact that in a binary indexing system the documents retrieved in response to a given query are indistinguishable from one another. All retrieved items are treated as equally 'close' to the query, because the number of terms assigned jointly to the query and the retrieved items is the same for all items. This leads to the retrieval of potentially large classes of items that are difficult to deal with by the system user. In an interactive retrieval environment where the previously retrieved items are often used to generate improved query formulations, it is particularly important to introduce distinctions among various classes of retrieved items-f o r example, by first bringing to the users' attention those items that appear most 'relevant' to a given query. The easiest way of introducing distinctions among classes of retrieved items is to use weighted instead of binary index terms to identify queries and documents. ~In such a situation it becomes possible to compute a similarity measure between a given query and each stored record as a function of the 1980 TOC …

...read moreread less

27 citations

Proceedings Article•DOI•

Representation of knowledge in a legal information retrieval system

[...]

Carole D. Hafner

23 Jun 1980

TL;DR: The research described here explores the use of subject area knowledge in an 'expert' document retrieval system, and the goals of the research are to characterise the semantics of information retrieval requests, and to develop methods for representing and using subject areaknowledge in computer retrieval systems.

...read moreread less

Abstract: Recent interest in computer representation of knowledge has led to the development of 'expert' computer assistants for tasks such as medicaldiagnosis, technical instruction, and problem solving in restricted domains (Brown and Burton, 1975; Davis, Buchanan and Shortliffe, 1977; Goldstein and Roberts, 1977). Each of these systems is based on a semantic model of a specific subject area, along with some general methods for using subject area knowledge to understand and respond to users' requests. The emphasis of these systems is not on 'solving' problems by computer, but rather on helping a human problem solver organise and apply a complex body of knowledge. The research described here explores the use of subject area knowledge in an 'expert' document retrieval system. The goals of the research are to characterise the semantics of information retrieval requests, and to develop methods for representing and using subject area knowledge in computer retrieval systems. The Legal Research System (LRS) is a knowledge-based computer retrieval system, intended to be used by lawyers and legal assistants to retrieve information about court decisions (cases) and laws passed by legislatures (statutes). The subject of its knowledge is Negotiable Instruments Law, an area of Commerical Law that deals with cheques and promissory notes (White and Summers, 1972; Speidel, Summers and White, 1974). The current implementation of the system (Hafner, 1978) has a database of about 200 statutes from the Uniform Commerical Code (American Law Institute, 1972) and 200 related cases. In LRS four kinds of knowledge about legal concepts and relationships are represented: functional knowledge, structural knowledge, semantic knowledge and factual knowledge. In this chapter the motivation for including each kind of knowledge is discussed, the computer representation of each kind of knowledge is described and examples of the use of each kind of knowledge in LRS are presented. The next section gives a very brief overview of current legal retrieval systems, both manual and automated. Subsequent sections describe the representation of knowledge in LRS, and the use of this knowledge to understand and interpret user queries.

...read moreread less

25 citations

Proceedings Article•DOI•

Browsing through databases

[...]

Andrew J. Palay, Mark S. Fox

23 Jun 1980

TL;DR: This work demonstrates how the use of do.~in dependent knowledge can reduce the combinatorios of learning structural descriptions, usimg as an example the creation of alternative pronunciations from examples of spoken words.

...read moreread less

Abstract: SYMPOSIa33 Knowledge-guided learning o r structural descriptions A. Fox, M.S. B, Reddy, D. R. We demonstrate how the use of do.~in dependent knowledge can reduce the combinatorios of learning structural descriptions, usimg as an example the creation of alternative pronunciations from examples of spoken words. Briefly, certain learning problems (Winston9 1970; Fox & Hayes-Roth~ 1976) can be solved by presenting to a learning program exemplars (tralni~ data) representative of a class° The program constructs a characteristic representation (CR) of the class that best fits the training data. Learning can be viewed as search in the space of representations. Applied complex domalns the search is highly combinatorial due to the: i) Number of alternative CRSo 2) Size of training set. 3) Size of the exemplars. help hack next mark return top display comment goto find

...read moreread less

Proceedings Article•DOI•

Retrieving time information from natural-language texts

[...]

Lynette Hirschman

23 Jun 1980

TL;DR: This paper presents a 'time specialist' program which accepts input about time relations in a LISP-like form; from this input, the 'specialist' comp~tes time relations, checks for inconsistencies in the input and answers questic~ns about time Relations between the events given in theinput.

...read moreread less

Abstract: An understanding of time relations is central to processing information contained in a narrative. A typical narrative is concerned with the relative ordering or progression of events over time, and the information that one would like to retrieve is often of the type 'What happened after event x?' or 'Did event x precede event y?' Determination of causality also requires a knowledge of time relations, since an event x can cause an event y only if it precedes event y in time. There has been considerable interest in the processing of time information among researchers in artificial intelligence. Several systems have been designed which compute time relations from a structured input of time specifications. Kahn and Gorry (1977) present a 'time specialist' program which accepts input about time relations in a LISP-like form; from this input, the 'specialist' comp~tes time relations, checks for inconsistencies in the input and answers questic~ns about time relations between the events given in the input. The speciali~' provides several different ways of organising the time

...read moreread less

Proceedings Article•DOI•

Methods for the administration of textual data in database systems

[...]

H.-J. Schek

23 Jun 1980

TL;DR: Observations indicate that, nowadays, there is a trend towards an integrated database management and information retrieval system (DBMIRS): interactive usage of a DBMS grows and query modification-often with a thesaurus-and feedback evaluation are classical fields in IR.

...read moreread less

Abstract: Research and development for textual data administration and for formatted database systems have traditionally been done separately. As a consequence, several features have been investigated and realised independently. For example, concurrent access to and update of information by many users is not provided in an information retrieval system (IRS). In fact, update of information during regular operating has long been considered unnecessary in an IRS. On the other hand, a database management system (DBMS) provides a 'multi-user environment', but in essence only for the administration of keyed or numerical (i.e. formatted) data and not for textual data. Textual data fields contain objects such as names, addresses, descriptive information, chemical formulas, etc. Such fields occur very often in database records, where they are administered in the same manner as numerical data. Common functions in an IRS such as retrieving a text via a number of character strings contained in it are supported neither by the available database query languages nor by the internal indexing techniques of a DBMS, and notions such as contents clustering or relevance, which are very important in an IRS, are disregarded in a DBMS. The following observations indicate that, nowadays, there is a trend towards an integrated database management and information retrieval system (DBMIRS): (1) Interactive usage of a DBMS grows. In addition to the regularly running transactions of well-defined application programs we have the spontaneously generated ad hoe transactions generated by interactive users in a problem solving model depending on the analysis of the results they have obtained so far (Blaser and Schauer, 1978; Erbe et al., 1980). This is exactly what we observe as common practice in IR. Query modification-often with a thesaurus-and feedback evaluation are classical fields in IR. In order to support interactive users of a DBMS, one tries to offer multi-attribute retrieval as well as partial range and nearest-neighbour searches for multi-dimensional numerical data. These functions, whose 218

...read moreread less

Proceedings Article•DOI•

Information technology and the science of information

[...]

B.C. Brookes

23 Jun 1980

TL;DR: This chapter questions the continued usefulness of what I call the 'Cranfield paradigm' and suggests that it has now served its purpose and that it is time to move on.

...read moreread less

Abstract: The importance of these meeting s of computer and information scientists stems from the fact that, over the years, our common topic oflR has attracted the most analytically gifted and theoretically inclined minds in the whole information business. It was from this particular meeting ground that new and fruitful theories of information seemed most likely to arise. Yet, speaking as an information scientist interested in the development of theory, and while fully acknowledging the advances made by members of this group to information technology, I have to say that I have been disappointed in the theoretical contributions to information science that have emerged from the study of IR. So, in this chapter, I question the continued usefulness of what I call the 'Cranfield paradigm'. I suggest that it has now served its purpose and that it is time we moved on. As briefly as I can, I will try to explain why I think so. But I also outline a possible alternative paradigm to take its place. I would like to think that, in discussions of IR, computer and information scientists meet on level terms. I would like to think that together we were building a bridge between two strong bastions of theory, using IR as a stepping stone, with equal contributions from each side. But all I can see is a one-sided building effort, a cantilever reaching out from CS towards IS but finding little response. That would not matter were it not for my doubts that the cantilever will ever reach firm ground. There is no doubt about the strength of the computer scientists' position. They stand firmly and confidently on a highly sophisticated technology which, in turn, is based on a powerful physical science, while both the technology and the science are closely allied with versatile and highly productive analytical techniques. This powerful combination of forces has already achieved great successes in the world. On the other hand, information science has not yet established for itself any theoretical coherence. It remains largely a com-monsense practical activity heavily dependent on the use of the computer. And many of those who profess information science have in fact migrated, in spirit if not in name, to where the action is-to computer science. So I am critical of both groups-of computer scientists for continuing too long in one direction and of information scientists for not helping them enough. I have to …

...read moreread less

Proceedings Article•DOI•

'Memex' as an image of potentiality in information retrieval research and development

[...]

Linda C. Smith

23 Jun 1980

TL;DR: The issue of non-use is taken up again as part of the discussion of personal information systems under category (5) because few bench researchers have bothered to investigate them.

...read moreread less

Abstract: s, papers, conference proceedings all the bibliographic apparatus that he knew in the 1940s. If he stumbles upon the Journal of the American Society for Information Science or other journals of the same genre, he will read that computers permit rapid identification of desired documents and that microforms pack entire libraries into briefcases. In conversation with his old colleagues, however, he will learn that these wonders have been announced for years and that few bench researchers (or anyone else for that matter) have bothered to investigate them. The issue of non-use is taken up again as part of the discussion of personal information systems under category (5).

...read moreread less

Proceedings Article•DOI•

Information retrieval theory and design based on a model of the user's concept relations

[...]

Matthew B. Koll

23 Jun 1980

TL;DR: This chapter reports the development and testing of a theory of IR based on this system-as· model (SAM) view, an expansion of the present model used in IR research.

...read moreread less

Abstract: Viewing information retrieval (IR) systems as models of human assessment of the similarity between requests and documents (SRD)t contributes to development of theory for IR and can aid in development of IR systems. This chapter reports the development and testing of a theory of IR based on this system-as· model (SAM) view;:. Implications for IR research and development are then considered. The SAM theory of IR is an expansion of the present model used in IR research. That model, and its expansion into a theory of IR, are described in the two sections below§.

...read moreread less

Proceedings Article•DOI•

A probabilistic algorithm for nearest neighbour searching

[...]

Stephen F. Weiss

23 Jun 1980

TL;DR: A deterministic nearest-neighbour search algorithm is presented that is faster than the O(N) search and achieves the same results as a full search, but is not crippled in high-dimensional spaces.

...read moreread less

Abstract: for nearest 2~1.~ 1[rntroducfio~ In this chapter we examine a probabilistic approach to information retrieval. We assume that documents and queries are represented as binary vectors. The value in a particular position of a vector indicates the presence (1) or absence (0) of the concept associated with that position. For a given query q, documents are ranked for retrieval on the basis of a similarity or distance measure calculated from the document and query vectors ~. This retrieval model is really just a special case of the nearest-neighbour problem. That is, given a set of N points in n space, and a distinguished point q, find the m points that lie nearest q according to some distance measure. In our retrieval model the documents are the points, the query is q, n is the total number of concepts in the system and 'nearest' is synonymous with 'greatest similarity'. In the discussion that follows we shall concentrate on the case where rn = 1. Cases where m > 1 are straightforward generalisations. The standard way of doing a nearest-neighbour search is to examine all the documents, calculate the similarity measure for each and then select the m best. This requires O(N) time for a collection of N documents, which may be prohibitively expensive and time-consuming for large N, especially when interactive response is required. The optimal nearest-neighbour algorithm (Friedman, Bentley and Finkel, 1977) requires only O(logN) time but is unusable if the dimensionality of the space is high. Specifically, the optimal algorithm has a multiplicative constant of approximately 1.6 n, where n is the dimension of the space. Information retrieval systems typically have hundreds or even thousands of concepts, and in such situations 1.6 n log N is much larger than N, even when N is very large. In this chapter we begin by presenting briefly a deterministic nearest-neighbour search algorithm that is faster than the O(N) search and achieves the same results as a full search, but is not crippled in high-dimensional spaces. We then present a modification to the basic algorithm that allows the user to specify a maximum tolerable level of error (which may be zero). This tolerance * We shall assume throughout this chapter that the similarity measure used has range [0,1], where 1 indicates maximal similarity and 0 indicates minimal similarity.

...read moreread less

Proceedings Article•DOI•

An associative file store using fragments for run-time indexing and compression

[...]

R. M. Lea, E. J. Schuegraf

23 Jun 1980

TL;DR: It is clear that information retrieval is a form of set processing in which a 'query' usually comprises a set of search terms, often called a 'profile', which is designed to retrieve relevant records from the data file.

...read moreread less

Abstract: and citation, whereas records describing office documents mainly comprise descriptors of their contents. A 'query' usually comprises a set of search terms, often called a 'profile', which is designed to retrieve relevant records from the data file. Search terms specify key values for comparison with the key words or descriptors of the stored records. Multi-key profiles allow queries to be expressed in combinational logic. Depending on the sophistication of the retrieval facility, search keys may comprise complete words, word-stems or word fragments (viz. leftand/or right-truncated words) or text fragments (viz. character substrings crossing word boundaries). Fragments offer maximum flexibility by supporting unrestricted search keys in 'free-text' retrieval systems. Further sophistications include the use of 'character masking' and allow 'weights' to be ascribed to search keys so that only those records scoring above a defined 'threshold' are retrieved. It is clear that information retrieval is a form of set processing in which a

...read moreread less

Proceedings Article•DOI•

A comparison of two weighting schemes for Boolean retrieval

[...]

Abraham Bookstein

23 Jun 1980

TL;DR: Boolean retrieval logic is the basis of most operating information retrieval systems (IRSs) and, given software that is widely available, such systems can be easily implemented to permit of an efficient search.

...read moreread less

Abstract: Boolean retrieval logic is the basis of most operating information retrieval systems (IRSs). There are many reasons why this type of system has been so attractive. For example, it allows users to issue requests in which the topics Of interest and the relations between them are clearly and precisely stated; the user has considerable flexibility in formulating his request; and the request can be reformulated, as convenient, to equivalent requests that will retrieve the identical set of documents (Bookstein and Cooper, 1976). Further, it is relatively easy to learn how to use such a system, and, given software that is widely available, such systems can be easily implemented to permit of an efficient search, even of rather large files. For these reasons, a number of intrinsic weaknesses inherent in these systems are often overlooked. A very serious constraint of Boolean systems is the necessity of associating a number of index terms with each document. The problem is that it is often unclear whether a given index term is appropriate for a document b o t h the decision to include the term and the decision to omit it might result in retrieval errors: false drops in the first case, lost relevant documents in the second. The issuer of a request is similarly constrained either to include a term in his request or to leave it out. It is not possible for a patron to include two terms, while indicating that one is more important than the other. The above weaknesses have encouraged the development of alternative approaches, such as the use of vector models (Salton, 1968), which permit the patron and the indexer to differentiate index terms by weight. A user of such a system, however, cannot indicate how the terms logically relate to one another. Others have created multi-stage systems, in which a standard Boolean retrieval process first retrieves a set of documents; these documents are then processed by an independent weighting mechanism that assigns to each retrieved document a value representing the importance of the terms by which the document is indexed (Noreault, Koll and McGill, 1977). Unfortunately, such hybrid methods are subject to inconsistencies, in that two logically equivalent requests can retrieve different sets of documents (Bookstein, 1978).

...read moreread less

Proceedings Article•DOI•

The fact database: a system based on inferential methods

[...]

D. R. McGregor, J. R. Malone

23 Jun 1980

TL;DR: The 'natural-language' systems present a better interface but so far have to be specifically tuned for each application area (Waltz, 1979) and it seems to us that a better 'standard interface' is needed.

...read moreread less

Abstract: Computer systems frequently display a lack of flexibility and common sense a deficiency which has been noted in the popular mythology of humorous 'computer stories'. Surely research should be directed towards overcoming this basic failing? A system should be usable by ordinary people, without their requiring a lengthy training or familiarisation period. Other pieces of everyday technology telephones, televisions, cameras, motor-cars have succeeded in this respect. The condition is a hard one for computer systems to meet without losing flexibility, as it implies an ability to converse in either a natural language or a powerful yet simple formal language. Present-day practical systems have tended to attempt the latter. Query by Example (Zloof, 1977) and Prestel (Viewdata) are two outstanding rival contenders. Query by Example has the greater power but requires greater expertise from the user, who must master the fundamentals of the relational calculus. In addition, he must somehow know the names and meanings of the files and of the fields in each file. In any practical system there may be a great number of these, and getting the system to present the right names (or getting the user to remember them) may be difficult. Viewdata is simpler, but simplicity is gained at the cost of flexibility. Information is presented in a tree structure; dialogue is computer-driven with the use of 'menus' which are presented to the user. Successful interaction depends totally on the user's understanding the restricted vocabulary presented in the menus and adapting his interactions to use it exclusively. Nonetheless, the system does appear to have broken the barrier to usage by the 'man in the street'. The 'natural-language' systems present a better interface but so far have to be specifically tuned for each application area (Waltz, 1979). It seems to us that a better 'standard interface' is needed. Databases must do some of the necessary adaptation if information systems are to be linked up on a world-wide basis, and many of the 'users' are to be other computers not human beings. Such systems ought to be multilingual, able to retrieve immediately in one language what has just been inserted in another. Semantic

...read moreread less

Proceedings Article•DOI•

Message extraction through estimation of relevance

[...]

Christopher Landauer, Clinton P. Mah

23 Jun 1980

TL;DR: Experiments with the METER system have shown that associative retrieval offers a unique capability for obtaining information in response to queries produced by the analyst, and this method offers the most assistance precisely in the cases that are difficult to handle in any other way.

...read moreread less

Abstract: In both the fast-access and the large-volume ends of the information processing spectrum the end user may be called an information analyst. His task as part of an information processing system is to provide the insights and to ask the right questiofis of the system. The computer should perform all of the routine analysis and comparison of documents. The interaction between the analyst and the computer must make it easy for the analyst to ask the necessary questions and to interpret the computer 's response as an answer. Experiments with the METER system have shown that associative retrieval offers a unique capability for obtaining information in response to queries produced by the analyst. In fact, associative methods offer the most assistance precisely in the cases that are difficult to handle in any other way namely, when there is a large amount of unformatted English text. Associative methods are als0 useful when the volume or volatility of the data precludes any detailed knowledge of its contents. In this case the analyst will have to rely on the responses to questions in order to gain any knowledge of specific events. Associative retrieval methods allow an analyst to obtain that kind of information even without knowledge of specifics. With any Boolean or keyword system, an analyst must have a more detailed knowledge of vocabulary in order to obtain a comparable response. The METER system was designed with several goals in mind. Specifically, the system was to exploit the methods of associative retrieval in an effective and inexpensive fashion, and to allow a naive user (that is, someone unfamiliar with the exact content of the database) to access useful information with minimal effort. Our success with these particular goals far exceeded our expectations in the light of the huge amount of research work already completed in the area. The particular implementation we built was expected to keep pace with a database of up to 20 000 messages that arrive continuously at a maximum rate of 4000-5000 per day. The system must have decent response times (one or two minutes) with five simultaneous users, and almost 24 hour access. The system was required to run on a DEC PDP11/45 or 11/70 without special hardware. As a tool for information analysis, the METER system was designed in a

...read moreread less

Proceedings Article•DOI•

A model of a document-clustering-based information retrieval system with a Boolean search request formulation

[...]

Tadeusz Radecki

23 Jun 1980

TL;DR: In the subsequent sections the clustering of document representations and search request formulations will be identified with the clustered of documents and queries, respectively.

...read moreread less

Abstract: One of the most essential parameters of any information retrieval system is the time taken to retrieve answers to particular queries submitted to it. This quantity is especially important for information systems with large-sized document collections and/or in cases when immediate response to the user's query is required (for example, in on-line information retrieval systems). As a result of investigations aimed at shortening the retrieval time, a number of information retrieval methods have been developed (see, for example, Salton, 1968, 1971, 1975; van Rijsbergen, 1979). Among them one can distinguish a class of numerous information retrieval methods based on the clustering of document representations. In these cases the mutual similarity between document representations, determined in a direct way, is used to cluster the document representations. An alternative competitive class utilises previously created clusters of search request formulations for clustering the document representations. For simplicity, in the subsequent sections the clustering of document representations and search request formulations will be identified with the clustering of documents and queries, respectively. Information retrieval methods of the latter type developed some years ago (Lesser, 1966; Salton, 1968, 1975; Worona, 1971; Yu, 1974) can be applied only to those information systems in which both search request formulations and document representations are sets of

...read moreread less

Proceedings Article•DOI•

Comparative analysis of hardware versus software text search

[...]

Peter Kracsony, Gerald Kowalski, Arnold Meltzer

23 Jun 1980

TL;DR: The ideal way to search the data is to find all documents containing userdefined character strings, so users need to increase the accuracy of their searches.

...read moreread less

Abstract: Current technology has made it economical to create large textual databases for a variety of specialised applications (for example, newspapers, legal databases, medical databases, etc.). These databases contain a wide variety of information, some of which gains importance long after the arrival of the data. The ideal way to search the data is to find all documents containing userdefined character strings. Specific capabilities users need to increase the accuracy of their searches are:

...read moreread less

Proceedings Article•DOI•

Establishing a basis for mapping natural-language statements onto a database query language

[...]

Lawrence J. Mazlack, Richard A. Feinauer

23 Jun 1980

TL;DR: This project investigated an aspect of natural-language processing which can be applied to the analysis of database queries by developing a basis for processing English-language requests for the retrieval of information from an existing DBMS package.

...read moreread less

Abstract: The rapid decrease in the cost of computer storage has made the cost of storing on-line information progressively cheaper. Database management systems (DBMS) have been developed to increase the ease and effectiveness of information storage and retrieval. These DBMS usually use semi-formal language statements to store and retrieve information. The untrained person cannot easily access machine-stored information, because access requires a specialised knowledge of the methodology and semi-formal language used to store the information. Both high-level programming languages and DBMS query languages lay claim to be similar to human-(that is, natural-) language statements and thus easy to use. However, the reality is that they are only usable by highly trained people. Thus, access to machine-stored information is limited by a communication barrier. A compounding difficulty is that the ultimate users of the computer-stored information may not reasonably be expected to become computer-language-proficient in addition to their other duties. What is needed is an interface between a person's normal written communication medium-a natural language-a n d the more formal language normally used when accessing information from a computer. This project investigated an aspect of natural-language processing which can be applied to the analysis of database queries. In particular, it focused on issues involved in informal/formal language mapping by developing a basis for processing English-language requests for the retrieval of information from an existing DBMS package. The investigation concentrated on forming a mechanistic basis to be used to locate semantic constituents of a request which would ultimately support the eventual matching of these constituents against templates and larger templates called scripts or concept case frames. The concept case frames are to be filled in with information from the user's request. Eventually, a data query language request is to be generated from the collected information. The theoretical issues focus on the analysis of query content in a topic-constrained environment. The principal question is the role of non-parsing techniques in the recognition of semantic components of the query. It is

...read moreread less

Journal Article•DOI•

Abstracts: Chosen by G. Salton from current issues of journals in the retrieval area

[...]

G. Salton

01 Aug 1980

TL;DR: The invention provides a method and a device for mass-producing a magnetic recording medium which has excellent wear and corrosion resistance and uniform quality along the longitudinal and transverse directions of an elongated base body.

...read moreread less

Abstract: The invention provides a method and a device for mass-producing a magnetic recording medium which has excellent wear and corrosion resistance and uniform quality along the longitudinal and transverse directions of an elongated base body, wherein the elongated base body in a vacuum chamber is fed to continuously deposit a ferromagnetic material thereon, while a gas is sprayed in the vicinity of an incident angle control section of a mask for controlling an angle of incidence of a vapor flow from a vapor source to the elongated base body. The angle of incidence is kept constant to prevent deposition of the vapor material on the incident angle control section, and simultaneously, a uniform oxide film is formed on the ferromagnetic layer using a gas containing oxygen.

...read moreread less

Journal Article•DOI•

Information retrieval activities at the School of Information Studies Syracuse University

[...]

Michael J. McGill

01 Apr 1980

TL;DR: The School of Information Studies of Syracuse University was known as the School of Library Science until 1974, when its name change signalled a shift in emphasis from the traditional study of the library and library activities to a concern with the needs, uses, acquisition, organization, storage and retrieval of information.

...read moreread less

Abstract: The School of Information Studies of Syracuse University was known as the School of Library Science until 1974. This name change signalled a shift in emphasis from the traditional study of the library and library activities to a concern with the needs, uses, acquisition, organization, storage and retrieval of information. It also indicated a shift toward a research environment. Since 1974 the School has moved from essentially no outside research income to a fairly stable base of approximately $500, 000 per year. Research activities now include topics such as the study of ranking algorithms in information retrieval systems; the development and study of a health information sharing project; a study of the user-search intermediary question negotiation process; and a study of the impact of the representation of documents in an information retrieval system.

...read moreread less

Journal Article•DOI•

MEDUS/A: a user-oriented database management system for medical research

[...]

Chester King¹, Larry B. Goldstein¹, Gail Stocker¹•Institutions (1)

Harvard University¹

01 Dec 1980

TL;DR: TheMEDUS/A, a general-purpose DBMS, enables clinical and public health researchers and their staff to define a database and enter, query, and retrieve their data for analysis without a programmer.

...read moreread less

Abstract: MEDUS/A was designed by members of the Health Systems Project at the Harvard School of Public Health. As a general-purpose DBMS, it enables clinical and public health researchers and their staff to define a database and enter, query, and retrieve their data for analysis without a programmer. The system has simultaneously supported a variety of projects in the Harvard Medical Area since 1976 and is ready for release outside of Harvard in a new version written in Standard MUMPS. Though designed for medicine, the system is sufficiently general for other fields.

...read moreread less

Proceedings Article•DOI•

A backend machine architecture for information retrieval

[...]

Amar Mukhopadhyay

23 Jun 1980

TL;DR: This chapter is a preliminary study on an ongoing research effort devoted to the development of backend machine architecture for large textual databases or information retrieval systems.

...read moreread less

Abstract: A textual database can be loosely defined to be a group of related documents, each containing an essentially unstructured string of characters and symbols, which describe some information in English or any other high-level natural language on a specific subject matter by use of a set of words, phrases and sentences that depend very much on the subject matter and the intended use of the document. Such databases cover a wide range of applications, viz. libraries, newspapers, medical diagnostics, abstracts of papers and dissertations, legal case reports, military and intelligence reports, etc. With the advent of highdensity memory technology and the availability of computerised typesetting and machine reading technology, an explosion in the growth of such databases for a variety of applications may be anticipated. This chapter is a preliminary study on an ongoing research effort devoted to the development of backend machine architecture for large textual databases or information retrieval systems. Search and retrieval operations on textual databases are complicated by the very nature of information that they contain. The text databases show a wide variation in formatting, and have a large number of oddities, redundancies, non-informational words and context-dependent as well as spelling ambiguities; the range of potential query is unrestricted. There is no data model that is applicable to develop a structured approach to search and retrieval, as is the case for conventional databases (viz. relational or hierarchical). Conventional machine architectures and software systems perform search and retrieval operations on such databases using a combination of inversion of text to produce an inverted list and sequential scan on the secondary storage media. These systems are inherently slow, because the machines do not have build-in hardware to do high-speed pattern matching, searching, sorting or retrieval operations. Furthermore, the phenomenon of a 'yon Neuman bottleneck' between the CPU and the main memory and the data transportation problem over a bandwidth-limited channel which uses complicated navigational procedures to locate data on serial-access bulk storage add to this slow performance and inefficiency. Furthermore, the inverted file system may add as much as 300 per cent storage overhead and needs rather time-

...read moreread less

Journal Article•DOI•

Some ideas for estimating the number of relevant documents

[...]

Robert T. Dattola

01 Aug 1980

TL;DR: A model is proposed for estimating the total number of relevant documents in a collection for a given query, where x represents document rank and y represents precision, and the equation with the best fit satisfying certain constraints is used.

...read moreread less

Abstract: A model is proposed for estimating the total number of relevant documents in a collection for a given query The total number of relevant documents is needed in order to compute recall values for use in evaluating document retrieval systems If x represents document rank and y represents precision, then one of the following functions is fit to the points obtained by plotting precision vs document rank after each retrieved document:1 y = AeBx exponential2 y = AxB power3 y = A - B/x hyperbolic4 y = 1/(A + Bx) hyperbolic5 y = x/(A + Bx) hyperbolicThat equation with the best fit satisfying certain constraints is used to estimate the total number of relevant documents for any given query Experimental comparisons of this best fit are made with random sampling methods

...read moreread less