SciSpace (formerly Typeset)

Showing papers on "Document retrieval published in 1991"


Journal ArticleDOI
Gerard Salton
30 Aug 1991-Science
TL;DR: The text analysis problem is examined, and modern approaches leading to the identification and retrieval of selected text items in response to search requests are discussed.
Abstract: Recent developments in the storage, retrieval, and manipulation of large text files are described. The text analysis problem is examined, and modern approaches leading to the identification and retrieval of selected text items in response to search requests are discussed.

661 citations


Journal ArticleDOI
01 Jul 1991
TL;DR: Network representations show promise as mechanisms for inferring probable relationships between documents and queries and have been used in information retrieval since at least the early 1960s.
Abstract: Network representations have been used in information retrieval since at least the early 1960s. Networks have been used to support diverse retrieval functions, including browsing [38], document clustering [7], spreading activation search [4], support for multiple search strategies [11], and representation of user knowledge [27] or document content [40]. Recent work suggests that significant improvements in retrieval performance will require techniques that, in some sense, “understand” the content of documents and queries [9, 43] and can be used to infer probable relationships between documents and queries. In this view, information retrieval is an inference or evidential reasoning process in which we estimate the probability that a user’s information need, expressed as one or more queries, is met given a document as “evidence.” Network representations show promise as mechanisms for inferring these kinds of relationships [4, 12].

653 citations


01 Feb 1991
TL;DR: A new theoretical model for text classification systems, including systems for document retrieval, automated indexing, electronic mail filtering, and similar tasks, is introduced, suggesting that the poor statistical characteristics of a syntactic indexing phrase representation negate its desirable semantic characteristics.
Abstract: This dissertation introduces a new theoretical model for text classification systems, including systems for document retrieval, automated indexing, electronic mail filtering, and similar tasks. The Concept Learning model emphasizes the role of manual and automated feature selection and classifier formation in text classification. It enables drawing on results from statistics and machine learning in explaining the effectiveness of alternate representations of text, and specifies desirable characteristics of text representations. The use of syntactic parsing to produce indexing phrases has been widely investigated as a possible route to better text representations. Experiments with syntactic phrase indexing, however, have never yielded significant improvements in text retrieval performance. The Concept Learning model suggests that the poor statistical characteristics of a syntactic indexing phrase representation negate its desirable semantic characteristics. The application of term clustering to this representation to improve its statistical properties while retaining its desirable meaning properties is proposed. Standard term clustering strategies from information retrieval (IR), based on cooccurrence of indexing terms in documents or groups of documents, were tested on a syntactic indexing phrase representation. In experiments using a standard text retrieval test collection, small effectiveness improvements were obtained. As a means of evaluating representation quality, a text retrieval test collection introduces a number of confounding factors. In contrast, the text categorization task allows much cleaner determination of text representation properties. In preparation for the use of text categorization to study text representation, a more effective and theoretically well-founded probabilistic text categorization algorithm was developed, building on work by Maron, Fuhr, and others. 
Text categorization experiments supported a number of predictions of the Concept Learning model about properties of phrasal representations, including dimensionality properties not previously measured for text representations. However, in carefully controlled experiments using syntactic phrases produced by Church's stochastic bracketer, in conjunction with reciprocal nearest neighbor clustering, term clustering was found to produce essentially no improvement in the properties of the phrasal representation. New cluster analysis approaches are proposed to remedy the problems found in traditional term clustering methods.

428 citations


Patent
17 Jul 1991
TL;DR: In this article, a methodology for retrieving textual data objects in a multiplicity of languages is disclosed, where data objects are treated in the statistical domain by presuming that there is an underlying, latent semantic structure in the usage of words in each language under consideration.
Abstract: A methodology for retrieving textual data objects in a multiplicity of languages is disclosed. The data objects are treated in the statistical domain by presuming that there is an underlying, latent semantic structure in the usage of words in each language under consideration. Estimates to this latent structure are utilized to represent and retrieve objects. A user query is recouched in the new statistical domain and then processed in the computer system to extract the underlying meaning to respond to the query.
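The “latent semantic structure” this patent relies on is estimated statistically from word usage. Below is a minimal sketch in the spirit of latent semantic indexing, using a truncated SVD of a toy term-document matrix; the matrix, vocabulary, and choice of k = 2 dimensions are invented for illustration and are not taken from the patent itself.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
# Counts are invented for illustration.
A = np.array([
    [2, 0, 1, 0],   # "retrieval"
    [1, 0, 2, 0],   # "query"
    [0, 3, 0, 1],   # "genetic"
    [0, 1, 0, 2],   # "algorithm"
], dtype=float)

# Estimate the latent structure with a truncated SVD, keeping k dimensions.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# Represent the documents in the latent space.
doc_vecs = (np.diag(sk) @ Vtk).T

# "Recouch" a user query in the same latent space: q_hat = q U_k S_k^{-1}.
q = np.array([1, 1, 0, 0], dtype=float)   # query uses "retrieval", "query"
q_hat = q @ Uk @ np.linalg.inv(np.diag(sk))

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Rank documents by cosine similarity to the projected query.
scores = [cos(q_hat, d) for d in doc_vecs]
best = int(np.argmax(scores))
```

Documents 0 and 2 share the query's vocabulary and score near 1, while documents 1 and 3, whose vocabulary is disjoint, score near 0.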

393 citations


Proceedings ArticleDOI
01 Sep 1991
TL;DR: The results show that using phrases in this way can improve performance, and that phrases that are automatically extracted from a natural language query perform nearly as well as manually selected phrases.
Abstract: Both phrases and Boolean queries have a long history in information retrieval, particularly in commercial systems. In previous work, Boolean queries have been used as a source of phrases for a statistical retrieval model. This work, like the majority of research on phrases, resulted in little improvement in retrieval effectiveness. In this paper, we describe an approach where phrases identified in natural language queries are used to build structured queries for a probabilistic retrieval model. Our results show that using phrases in this way can improve performance, and that phrases that are automatically extracted from a natural language query perform nearly as well as manually selected phrases.

306 citations


Journal ArticleDOI
TL;DR: This article demonstrates that the similar terms identified by cooccurrence data in a query expansion system tend to occur very frequently in the database that is being searched.
Abstract: Term cooccurrence data has been extensively used in document retrieval systems for the identification of indexing terms that are similar to those that have been specified in a user query: these similar terms can then be used to augment the original query statement. Despite the plausibility of this approach to query expansion, the retrieval effectiveness of the expanded queries is often no greater than, or even less than, the effectiveness of the unexpanded queries. This article demonstrates that the similar terms identified by cooccurrence data in a query expansion system tend to occur very frequently in the database that is being searched. Unfortunately, frequent terms tend to discriminate poorly between relevant and nonrelevant documents, and the general effect of query expansion is thus to add terms that do little or nothing to improve the discriminatory power of the original query.
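A minimal sketch of the kind of expansion procedure the article analyzes: document-level cooccurrence counts feed a term-term similarity (the Dice coefficient here), and the most similar non-query terms are added to the query. The toy collection and the choice of Dice are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy indexed collection: each document is its set of index terms.
collection = [
    {"system", "retrieval", "query"},
    {"system", "retrieval", "index"},
    {"system", "query", "expansion"},
    {"system", "index", "cluster"},
    {"retrieval", "query", "feedback"},
]

# Document frequencies and pairwise cooccurrence counts.
df = Counter(t for d in collection for t in d)
co = Counter()
for d in collection:
    for a, b in combinations(sorted(d), 2):
        co[(a, b)] += 1

def dice(a, b):
    """Dice similarity between two index terms."""
    pair = tuple(sorted((a, b)))
    return 2 * co[pair] / (df[a] + df[b])

def expand(query, n=1):
    """Add the n non-query terms most similar to any query term."""
    cands = {t: max(dice(t, q) for q in query) for t in df if t not in query}
    extra = sorted(cands, key=cands.get, reverse=True)[:n]
    return set(query) | set(extra)

expanded = expand({"retrieval", "query"})
```

On this toy collection the term added is "system", which also has the highest document frequency of any candidate, illustrating the article's point: cooccurrence similarity tends to select frequent terms that discriminate poorly.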

261 citations


Journal ArticleDOI
TL;DR: In the new method, compound queries composed of keywords connected by and, or, and not are processed, and the learning method has been modified to allow fuzzy judgements as well as compound queries.

168 citations


Journal ArticleDOI
TL;DR: The development of five computational models of online document retrieval led to the design of an “intelligent” document-based retrieval system, and the broader implications of this system are discussed.
Abstract: Two studies were conducted to investigate the cognitive processes involved in online document-based information retrieval. These studies led to the development of five computational models of online document retrieval. These models were then incorporated into the design of an “intelligent” document-based retrieval system. Following a discussion of this system, we discuss the broader implications of our research for the design of information retrieval systems.

111 citations


01 Jan 1991
TL;DR: An overview of a retrieval model based on probabilistic inference networks is given, and simplifications are described that allow networks to be built and evaluated efficiently, even with very large collections.
Abstract: Probabilistic inference techniques have been shown to significantly improve retrieval performance when compared to conventional retrieval models, but their use can be prohibitively expensive for large collections. We give an overview of a retrieval model that is based on probabilistic inference networks and describe simplifications that allow networks to be built and evaluated efficiently, even with very large collections.
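A drastically simplified sketch of how such a network can be evaluated cheaply: if the query node combines its parent term beliefs with a weighted-sum link matrix, each document can be scored with a single pass over the query terms instead of full probabilistic propagation. The weights and belief values below are invented for illustration.

```python
def eval_query(term_beliefs, query_weights, default=0.05):
    """Belief that the information need is met, given one document.

    term_beliefs: term -> P(term node true | this document observed)
    query_weights: term -> query-node link weight (assumed to sum to 1)
    Terms absent from the document get a small default belief.
    """
    return sum(w * term_beliefs.get(t, default)
               for t, w in query_weights.items())

query = {"network": 0.5, "retrieval": 0.3, "inference": 0.2}
d1 = {"network": 0.9, "retrieval": 0.8, "inference": 0.7}
d2 = {"network": 0.4}

# Rank the two documents by their query belief.
ranking = sorted([("d1", eval_query(d1, query)),
                  ("d2", eval_query(d2, query))],
                 key=lambda pair: pair[1], reverse=True)
```

The weighted-sum link matrix is one of the standard closed forms that make network evaluation tractable; richer link matrices trade this efficiency for expressiveness.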

97 citations


Journal ArticleDOI
TL;DR: To improve retrieval effectiveness it is suggested that online catalogues should cater for both matching and contextual approaches to searching, and a holistic approach is adopted encompassing both catalogue use and browsing at the shelves for catalogue users and non‐users.
Abstract: The second half of a ‘before and after’ study to evaluate the impact of an online catalogue on subject searching behaviour is reported. A holistic approach is adopted, encompassing both catalogue use and browsing at the shelves for catalogue users and non‐users. Verbal and non‐verbal data were elicited from searchers using a combined methodology including talk‐aloud technique, observation and a screen logging facility. An extensive qualitative analysis was carried out correlating expressed topics, search formulation strategies and documents retrieved at the shelves. The online catalogue environment does not appear to have increased the extent of subject searching nor the use of the bibliographic tool. The manual PRECIS index supported a contextual approach for broad and more interactive search formulations, whereas the OPAC encouraged a matching approach and narrow formulations, with fewer but user-generated formulations. The success rate of the online catalogue was slightly better than that of the manual tools, but fewer items were retrieved at the shelves. Non‐users of the bibliographic tools seemed to be just as successful. To improve retrieval effectiveness it is suggested that online catalogues should cater for both matching and contextual approaches to searching. Recent research indicates that a more interactive process could be promoted by providing query expansion through a combination of searching aids for matching, for search formulation assistance and for structured contextual retrieval.

Journal ArticleDOI
TL;DR: It is reported that clusters of co-relevant documents obtain increasingly similar descriptions when a genetic algorithm is used to adapt subject descriptions so that documents become more effective in matching relevant queries and failing to match nonrelevant queries.
Abstract: Information retrieval systems have used clustering of documents and queries to improve both retrieval efficiency and retrieval effectiveness. Normally, clustering involves grouping together static descriptions of documents by their similarity to each other, though user-based clustering suggests that usage patterns concerning co-relevance can form a basis for clustering. This article reports that clusters of co-relevant documents obtain increasingly similar descriptions when a genetic algorithm is used to adapt subject descriptions so that documents become more effective in matching relevant queries and failing to match nonrelevant queries. As a result of the increased similarity, clustering algorithms can more accurately group documents into useful clusters. The findings of this work were reached through simulation experiments. © 1991 John Wiley & Sons, Inc.
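A toy sketch of the general idea: represent subject descriptions as bit vectors and evolve them so that documents match a relevant query and fail to match a nonrelevant one. The representation, operators, and parameters here are illustrative assumptions, not the article's experimental setup.

```python
import random

random.seed(0)
TERMS = 12  # size of the indexing vocabulary (invented)

relevant_q    = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
nonrelevant_q = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0]

def overlap(desc, query):
    return sum(d & q for d, q in zip(desc, query))

def fitness(desc):
    # Reward matching the relevant query; penalize matching the other.
    return overlap(desc, relevant_q) - overlap(desc, nonrelevant_q)

def mutate(desc, rate=0.1):
    return [b ^ 1 if random.random() < rate else b for b in desc]

def crossover(a, b):
    cut = random.randrange(1, TERMS)
    return a[:cut] + b[cut:]

# Evolve a population of candidate subject descriptions, keeping the
# fittest half each generation (simple elitist selection).
pop = [[random.randint(0, 1) for _ in range(TERMS)] for _ in range(20)]
for _ in range(40):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(10)]
    pop = parents + children

best = max(pop, key=fitness)
```

Descriptions evolved this way drift toward the relevant query's terms, which is the mechanism behind the article's observation that co-relevant documents acquire increasingly similar descriptions.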

Proceedings ArticleDOI
01 Sep 1991
TL;DR: Comparison of FERRET’s retrieval performance on a collection of 1065 astronomy texts using 22 sample user queries with a standard boolean keyword query system showed that precision increased from 35 to 48 percent, and recall more than doubled, from 19.4 to 52.4 percent.
Abstract: FERRET is a full text, conceptual information retrieval system that uses a partial understanding of its texts to provide greater precision and recall performance than keyword search techniques. It uses a machine-readable dictionary to augment its lexical knowledge and a variant of genetic learning to extend its script database. Comparison of FERRET’s retrieval performance on a collection of 1065 astronomy texts using 22 sample user queries with a standard boolean keyword query system showed that precision increased from 35 to 48 percent, and recall more than doubled, from 19.4 to 52.4 percent. This paper describes the FERRET system’s architecture, parsing and matching abilities, and focuses on the use of the Webster’s Seventh dictionary to increase the system’s lexical coverage.

Journal ArticleDOI
30 Aug 1991-Science
TL;DR: An approach is outlined for the retrieval of natural language texts in response to available search requests and for the recognition of content similarities between text excerpts that appears to outperform other currently available methods.
Abstract: An approach is outlined for the retrieval of natural language texts in response to available search requests and for the recognition of content similarities between text excerpts. The proposed retrieval process is based on flexible text matching procedures carried out in a number of different text environments and is applicable to large text collections covering unrestricted subject matter. For unrestricted text environments this system appears to outperform other currently available methods.

Journal ArticleDOI
TL;DR: An approach to the document-retrieval problem that aims to increase the efficiency and effectiveness of document- retrieval systems by exploiting the semantic contents of the documents is presented.
Abstract: An approach to the document-retrieval problem that aims to increase the efficiency and effectiveness of document-retrieval systems by exploiting the semantic contents of the documents is presented. The document retrieval problem is delineated, and conceptual document modeling basics and requirements are discussed. An experimental system, the Multimedia Office Server (Multos), which implements some of the document-model concepts described, is presented.


Proceedings ArticleDOI
Gerard Salton
01 Sep 1991
TL;DR: The discussion covers aspects of the Smart system design and examines the past and future significance of some of the research conducted in the Smart environment.
Abstract: The Smart project in automatic text retrieval was started in 1961. It is the oldest, continuously running research project in information retrieval. The panel members are all major contributors to the Smart system work. The discussion covers aspects of the Smart system design and examines the past and future significance of some of the research conducted in the Smart environment.

Proceedings ArticleDOI
21 May 1991
TL;DR: The Naval Ocean Systems Center (NOSC) has conducted the third in a series of evaluations of English text analysis systems, intended to advance understanding of the merits of current text analysis techniques, as applied to the performance of a realistic information extraction task.
Abstract: The Naval Ocean Systems Center (NOSC) has conducted the third in a series of evaluations of English text analysis systems. These evaluations are intended to advance our understanding of the merits of current text analysis techniques, as applied to the performance of a realistic information extraction task. The latest one is also intended to provide insight into information retrieval technology (document retrieval and categorization) used instead of or in concert with language understanding technology. The inputs to the analysis/extraction process consist of naturally-occurring texts that were obtained in the form of electronic messages. The outputs of the process are a set of templates or semantic frames resembling the contents of a partially formatted database.

Proceedings ArticleDOI
01 May 1991
TL;DR: There is a story of a Vermont justice of the peace before whom a suit was brought by one farmer against another for breaking a churn, and the justice took time to consider, and then said that he had looked through the statutes and could find nothing about churns, and gave judgment for the defendant.
Abstract: There is a story of a Vermont justice of the peace before whom a suit was brought by one farmer against another for breaking a churn. The justice took time to consider, and then said that he had looked through the statutes and could find nothing about churns, and gave judgment for the defendant. The same state of mind is shown in all our common digests and textbooks. Applications of rudimentary rules of contract or tort are tucked away under the head of Railroads or Telegraphs or go to swell treatises on historical subdivisions, such as Shipping or Equity, or are gathered under the arbitrary title which is thought likely to appeal to the practical mind, such as Mercantile law. (Holmes 1897, p. 59.)

Patent
08 Nov 1991
TL;DR: In this article, a classification tree data storing means is described in which the data of plural classification trees 12 and 13, whose respective nodes 11 are linked to corresponding documents and connected with each other in tree-like manner for document classification and management, are stored.
Abstract: PURPOSE: To correct the lack of a collective retrieval function for data that satisfies a certain condition, which is a defect of a hypertext in which plural document data are stored and, by specifying a keyword in a displayed document, related documents are retrieved and the retrieval result is displayed. CONSTITUTION: A classification tree data storing means is provided in which the data of plural classification trees 12 and 13, whose respective nodes 11 linked to corresponding documents are connected with each other in tree-like manner for document classification and management, are stored. In addition, a retrieval processing means is provided which, with a specified arbitrary node 11 as a starting point for retrieval, retrieves the documents belonging to the tree below that node for the specified classification tree 12 or 13. So, at retrieval using a hypertext function, collective retrieval within a proper range can easily be made for the data satisfying certain conditions.

Patent
Akio Okazaki
19 Feb 1991
TL;DR: In this paper, an interface adapted for receiving information composed of symbols, characters, and diagrams representing a spatial positional relationship and used as a retrieval request is presented, which comprises an entry device for entering a two-dimensional image, and a retrieval requests interpreting device for recognizing elements forming the entered image, such as symbols and characters.
Abstract: An interface, adapted for receiving information composed of symbols, characters, and diagrams representing a spatial positional relationship and used as a retrieval request, comprises an entry device for entering a two-dimensional image, and a retrieval request interpreting device for recognizing elements forming the entered image, such as symbols, characters, and diagrams, and determining a two-dimensional spatial relationship between elements so as to determine retrieval conditions. Information retrieval is executed on the basis of the retrieval conditions determined by the retrieval request interpreting device. Thus, a user is allowed to enter a retrieval request in two-dimensional image form, namely, in an easily intelligible form and in a timesaving manner. This also makes it possible to retrieve sophisticated information easily.

01 Jan 1991
TL;DR: A software implementation architecture for text retrieval systems is presented that facilitates functional modularization, a mix-and-match combination of module implementations and a definition of inter-module protocols.
Abstract: For almost all aspects of information access systems it is still the case that their optimal composition and functionality is hotly debated. Moreover, different application scenarios put different demands on individual components. It is therefore of the essence to be able to quickly build systems that permit exploration of different designs and implementation strategies. This paper presents a software implementation architecture for text retrieval systems that facilitates (a) functional modularization, (b) mix-and-match combination of module implementations and (c) definition of inter-module protocols. We show how an object-oriented approach easily accommodates this type of architecture. The design principles are exemplified by code examples in Common Lisp. Taken together these code examples constitute an operational retrieval system. The design principles and protocols implemented have also been instantiated in a large scale retrieval prototype in our research laboratory.

Journal ArticleDOI
Joost Kircz
TL;DR: The extent to which modern indexing and information retrieval research meets the needs and requirements of different types of readers is criticised and a plea is made to start an argumentational analysis of the corpus of scientific articles.
Abstract: In this paper the extent to which modern indexing and information retrieval research meets the needs and requirements of different types of readers is criticised. A review of the stagnation in this field gives evidence for the need for a radically different approach. The main problem is identified as the assumption that knowledge contained in a scientific article can be represented by a semantic network only, and therefore can be manipulated by formal logic approaches. Complementary to this, a plea is made to start an argumentational analysis of the (highly structured) corpus of scientific articles (mainly in physics). Such an analysis might lead to an argumentational syntax which will also enable the non‐expert to browse through large quantities of electronically stored articles. A first attempt at such an approach is given. Furthermore the possible use of the Standard Generalized Markup Language (SGML) approach in relation to a hypertext environment for a possible application is discussed.

Proceedings ArticleDOI
01 Sep 1991
TL;DR: This work extends the concept of query-document similarity by recognizing basic entity properties (attributes) which appear in text and extends query- document similarity using the linguistic concept of thematic roles, based on the database concept of semantic modeling.
Abstract: Current information retrieval systems focus on the use of keywords to respond to user queries. We propose the additional use of surface level knowledge in order to improve the accuracy of information retrieval. Our approach is based on the database concept of semantic modeling (particularly entities and relationships among entities). We extend the concept of query-document similarity by recognizing basic entity properties (attributes) which appear in text. We also extend query-document similarity using the linguistic concept of thematic roles. Thematic roles allow us to recognize relationship properties which appear in text. We include several examples to illustrate our approach. Test results which support our approach are reported. The test results concern searching documents and using their contents to perform the intelligent task of answering a question.

Book ChapterDOI
01 Jun 1991
TL;DR: Previous research on machine learning in IR systems is surveyed and promising areas for future research at the intersection of these two fields are discussed.
Abstract: Information retrieval (IR) systems are used for finding, within a large text database, those documents containing information needed by a user. The complex and poorly understood semantics of documents and user queries has made feedback and adaptation important characteristics of IR systems. In this paper we briefly survey previous research on machine learning in IR systems and discuss promising areas for future research at the intersection of these two fields.

Proceedings ArticleDOI
01 Sep 1991
TL;DR: A genetic algorithm for MDAP is developed and the effects of varying the communication cost matrix representing the interprocessor communication topology and the uniformity of the distribution of documents to the clusters are studied.
Abstract: Information retrieval is the selection of documents that are potentially relevant to a user’s information need. Given the vast volume of data stored in modern information retrieval systems, searching the document database requires vast computational resources. To meet these computational demands, various researchers have developed parallel information retrieval systems. As efficient exploitation of parallelism demands fast access to the documents, data organization and placement significantly affect the total processing time. We describe and evaluate a data placement strategy for distributed memory, distributed I/O multicomputers. Initially, a formal description of the Multiprocessor Document Allocation Problem (MDAP) and a proof that MDAP is NP Complete are presented. A document allocation algorithm for MDAP based on Genetic Algorithms is developed. This algorithm assumes that the documents are clustered using any one of the many clustering algorithms. We define a cost function for the derived allocation and evaluate the performance of our algorithm using this function. As part of the experimental analysis, the effects of varying the number of documents and their distribution across the clusters as well as the exploitation of various differing architectural interconnection topologies are studied. We also experiment with several parameters common to Genetic Algorithms, e.g., the probability of mutation and the population size.

1.0 Introduction. An efficient multiprocessor information retrieval system must maintain a low system response time and require relatively little storage overhead. As the volume of stored data continues to increase daily, the multiprocessor engines must likewise scale to a large number of processors. This demand for system scalability necessitates a distributed memory architecture, as a large number of processors is not currently possible in a shared-memory configuration. A distributed memory system, however, introduces the problem associated with the proper placement of data onto the given architecture. We refer to this problem as the Multiprocessor Document Allocation Problem (MDAP), a derivative of the Mapping Problem originally described by Bokhari [Bok81]. We assume a clustered document database. A clustered approach is taken since an index file organization can introduce vast storage overhead (up to roughly 300% according to Haskin [Has81]) and a full-text or signature analysis technique results in lengthy search times. In this context, a proper solution to MDAP is any mapping of the documents onto the processors such that the average cluster diameter is kept to a minimum while still providing for an even document distribution across the nodes. To achieve a significant reduction in the total query processing time using parallelism, the allocation of data among the processors should be distributed as evenly as possible and the interprocessor communication among the nodes should be minimized. Achieving such an allocation is NP Complete. Thus, it is necessary to use heuristics to obtain satisfactory mappings, which may indeed be suboptimal. Genetic Algorithms [DeJ89, Gol89, Gre85, Gre87, Hol87, Rag87] approximate optimal solutions to computationally intractable problems.
We develop a genetic algorithm for MDAP and examine the effects of varying the communication cost matrix representing the interprocessor communication topology and the uniformity of the distribution of documents to the clusters. 1.1 Mapping Problem Approximations. As the Mapping Problem and some of its derivatives are NP complete, heuristic algorithms are commonly employed to approximate the optimal solutions. Some of these approaches [Bok81, Bol88, Lee87] deal, in some manner, …

(Co-author: Hava Tova Siegelmann, Dept. of Computer Science, Rutgers University, New Brunswick, NJ 08903. This work was partially supported by grants from DCS, Inc. under contract number 5-35071 and the Center for Innovative Technology under contract number 5-34042.)
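One plausible form for an MDAP cost function of the kind described above: charge communication cost whenever two co-clustered documents land on different processors, plus a penalty for uneven document distribution. The instance, topology, and weights are invented, and this tiny instance is solved exhaustively; the paper resorts to a genetic algorithm precisely because the general problem is NP-complete.

```python
import itertools

# Toy instance: 6 clustered documents, 3 processors in a line topology.
cluster_of = [0, 0, 1, 1, 2, 2]          # cluster of each document
comm = [[0, 1, 2],                        # hop counts between processors
        [1, 0, 1],
        [2, 1, 0]]

def cost(alloc, balance_weight=1.0):
    """Communication cost of separating co-clustered documents,
    plus a penalty for uneven load."""
    c = 0
    for a, b in itertools.combinations(range(len(alloc)), 2):
        if cluster_of[a] == cluster_of[b]:
            c += comm[alloc[a]][alloc[b]]
    load = [alloc.count(p) for p in range(len(comm))]
    return c + balance_weight * (max(load) - min(load))

# Exhaustive search over all 3**6 allocations of documents to processors.
best = min(itertools.product(range(3), repeat=6), key=cost)
```

The optimum keeps each cluster on a single processor with a perfectly even load, so its cost is zero; a genetic algorithm over allocation strings approximates this once the instance is too large to enumerate.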

Patent
Toshiaki Mori
27 Sep 1991
TL;DR: In this paper, the authors proposed a method to easily classify a structured document in accordance with the purpose of a user by using structure element information as information on the classification of the document.
Abstract: PURPOSE: To easily classify a structured document in accordance with the purpose of a user. CONSTITUTION: A retrieval condition designation means 120 designates structure element information as information on the retrieval of a document, and a classified attribute designation means 140 designates structure element information as information on the classification of the document. A structured document retrieval means 130 retrieves a structured document group, based on structure element information as information on the retrieval of the document, from plural structured documents stored in a structured document storage means 110 and transfers it to a structured document classification means 150. It classifies the transferred structured document group based on structure element information as information on the classification of the document. The classified result is displayed on a document display means 160.

Journal ArticleDOI
TL;DR: A method for estimating precision or retrieval quality without examining individual database documents is provided, and the conditions are examined under which the user should stop searching when either using or not using relevance feedback.
Abstract: The performance of information retrieval, hypertext linkage, and text filtering systems may be measured by using historical data or by estimating performance using Bayesian probabilistic or artificial intelligence methods. The measurement of performance is necessary to evaluate document retrieval systems, electronic mail filters, office information systems, and, in general, retrieval from databases when the searcher has incomplete information about the characteristics of the records to be retrieved. We provide a method for estimating precision or retrieval quality without examining individual database documents. This method requires knowledge of only the query or expressed information need and a set of database parameters constant for all queries. The concepts of historic and expected precision are examined. The analytic expected precision measure is used to examine the performance of a system using relevance feedback to increase the accuracy of parameter estimates. Use of a precision-document graph instead of the commonly used precision-recall graph is examined, and several uses of the precision-document graph for a computer-human interface are suggested, including its use as a graphic aid assisting users in deciding when to stop searching. An economic model of information retrieval and the stopping problem is provided, and the conditions are examined under which the user should stop searching when either using or not using relevance feedback.

Journal ArticleDOI
TL;DR: A prototype distributed information retrieval system was designed and built using a distributed architecture and using statistical ranking techniques to help provide better service for the end user.
Abstract: Centralized systems continue to dominate the information retrieval market, with increased competition from CD-ROM based systems. As more large organizations begin to implement office automation systems, however, many will find that neither of these types of retrieval systems will satisfy their requirements, especially those requirements involving easy integration into other systems and heavy usage by casual end users. A prototype distributed information retrieval system was designed and built using a distributed architecture and using statistical ranking techniques to help provide better service for the end user. The distributed architecture was shown to be a feasible alternative to centralized or CD-ROM information retrieval, and user testing of the ranking methodology showed both widespread user enthusiasm for this retrieval technique and very fast response times (on the order of one second for 300 megabytes of data).

Proceedings ArticleDOI
01 Sep 1991
TL;DR: In the early days of computer-based retrieval, information retrieval systems worked with document representations consisting of simple lists of keywords, but today the information retrieved may be from full-text or hypertext or even hypermedia systems.
Abstract: What should the components of a model of an information retrieval system be? The answer to this question is not easy, because the term ‘information retrieval system’ has had a changing referent. In the early days of computer-based retrieval, information retrieval systems worked with document representations consisting of simple lists of keywords. Today, the information retrieved may be from full-text or hypertext or even hypermedia systems.