Showing papers on "Document retrieval" published in 1990


Journal ArticleDOI
TL;DR: A new method for automatic indexing and retrieval that takes advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) to improve the detection of relevant documents on the basis of terms found in queries.
Abstract: A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
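
The decomposition described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy term-by-document matrix, the choice of 2 factors instead of ca. 100, the power-iteration routine standing in for a full singular-value decomposition, and all function names are assumptions made for illustration. The query is folded in as a pseudo-document using one common convention (term vector times the term factors, divided by the singular values).

```python
import math

TERMS = ["car", "auto", "engine", "flower", "petal"]
# Toy term-by-document matrix (rows: terms, columns: documents d0..d3).
A = [
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # auto
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 0],  # petal
]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_eigenpairs(sym, k, iters=500):
    """Top-k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(sym)
    work = [row[:] for row in sym]
    pairs = []
    for _ in range(k):
        v = [1.0] * n
        for _ in range(iters):
            w = matvec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, matvec(work, v)))
        pairs.append((lam, v))
        # Remove the factor just found so the next pass finds the next one.
        for i in range(n):
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return pairs

def build_lsi(a, k):
    """Return singular values, term factors, and k-dimensional document vectors."""
    n_docs = len(a[0])
    cols = [[row[j] for row in a] for j in range(n_docs)]
    gram = [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]
    pairs = top_eigenpairs(gram, k)
    sigmas = [math.sqrt(lam) for lam, _ in pairs]
    vs = [v for _, v in pairs]
    us = [[x / s for x in matvec(a, v)] for s, v in zip(sigmas, vs)]
    doc_vecs = [[v[d] for v in vs] for d in range(n_docs)]
    return sigmas, us, doc_vecs

def cosine(x, y):
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return sum(a * b for a, b in zip(x, y)) / (nx * ny) if nx and ny else 0.0

def query_scores(query_terms, k=2):
    """Fold the query in as a pseudo-document and score every document by cosine."""
    sigmas, us, doc_vecs = build_lsi(A, k)
    q = [1.0 if t in query_terms else 0.0 for t in TERMS]
    q_hat = [sum(a * b for a, b in zip(q, u)) / s for u, s in zip(us, sigmas)]
    return [cosine(q_hat, d) for d in doc_vecs]

scores = query_scores({"car"})
```

In this toy collection, document d1 ("auto", "engine") shares no term with the query "car" but still receives a high cosine, because both co-occur with "engine"; the flower documents score near zero.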

12,443 citations


Journal ArticleDOI
TL;DR: It is argued here that the advanced information retrieval research community is missing an opportunity to design systems that are in better harmony with the actual preferences of many users—sophisticated systems that provide an optimal combination of searcher control and system retrieval power.
Abstract: Many users of online and other automated information systems want to take advantage of the speed and power of automated retrieval, while still controlling and directing the steps of the search themselves. They do not want the system to take over and carry out the search entirely for them. Yet the objective of much of current theory and experimentation in information retrieval systems and interfaces is to design systems in which the user has either no or only reactive involvement with the search process. It is argued here that the advanced information retrieval research community is missing an opportunity to design systems that are in better harmony with the actual preferences of many users—sophisticated systems that provide an optimal combination of searcher control and system retrieval power. The user may be provided effective means of directing the search if capabilities specific to the information retrieval process, that is, strategic behaviors normally associated with information searching, are incorporated into the interface. There are many questions concerning (1) the degree of user vs. system involvement in the search, and (2) the size, or chunking, of activities; that is, how much and what type of activity the user should be able to direct the system to do at once. These two dimensions are analyzed and a number of configurations of system capability that combine user and system control are presented and discussed. In the process, the concept of the information search stratagem is introduced, and particular attention is paid to the provision of strategic, as opposed to purely procedural capabilities for the searcher. Finally, certain types of user-system relationship are selected as deserving particular attention in future information retrieval system design, and arguments are made to support the recommendations.

383 citations


Patent
07 Nov 1990
TL;DR: A dictionary of context vectors provides a context vector for each word stem in the dictionary, and a normalized summary vector is stored for each document by combining the context vectors of the words remaining in the document after uninteresting words are removed.
Abstract: A method for storing and searching documents also useful in disambiguating word senses and a method for generating a dictionary of context vectors. The dictionary of context vectors provides a context vector for each word stem in the dictionary. A context vector is a fixed length list of component values corresponding to a list of word-based features, the component values being an approximate measure of the conceptual relationship between the word stem and the word-based feature. Documents are stored by combining the context vectors of the words remaining in the document after uninteresting words are removed. The summary vector obtained by adding all of the context vectors of the remaining words is normalized. The normalized summary vector is stored for each document. The data base of normalized summary vectors is searched using a query vector and identifying the document whose vector is closest to that query vector. The normalized summary vectors of each document can be stored using cluster trees according to a centroid consistent algorithm to accelerate the searching process. Said searching process also gives an efficient way of finding nearest neighbor vectors in high-dimensional spaces.
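
The storage and search steps in this abstract can be sketched as follows. This is a minimal sketch under stated assumptions: the three-component context vectors, the stop-word list, and the function names are hypothetical, stemming is omitted, and "closest" is taken to mean the largest dot product between normalized vectors. The cluster-tree acceleration is not shown.

```python
import math

# Hypothetical context vectors (one fixed-length list per word; values are made up).
CONTEXT = {
    "bank": [0.9, 0.1, 0.0],
    "loan": [0.8, 0.0, 0.2],
    "river": [0.0, 0.9, 0.1],
    "water": [0.1, 0.8, 0.1],
}
STOP_WORDS = {"the", "a", "of", "by"}  # assumed list of "uninteresting" words

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def summary_vector(text):
    """Add the context vectors of the remaining words, then normalize the sum."""
    dims = len(next(iter(CONTEXT.values())))
    total = [0.0] * dims
    for word in text.lower().split():
        if word in STOP_WORDS or word not in CONTEXT:
            continue
        total = [t + c for t, c in zip(total, CONTEXT[word])]
    return normalize(total)

def closest(query, store):
    """Return the name of the stored document whose vector is closest to the query."""
    q = summary_vector(query)
    return max(store, key=lambda name: sum(a * b for a, b in zip(q, store[name])))

STORE = {
    "finance": summary_vector("the bank loan"),
    "nature": summary_vector("water of the river"),
}
best = closest("loan", STORE)
```

Because the summary vectors are normalized, the dot product used here equals the cosine of the angle between query and document vectors.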

361 citations


Book
01 Apr 1990
TL;DR: The book covers the nature of information retrieval, its principal formal models, the evaluation of information retrieval systems, language and representation as the central problem in information retrieval, communication and adaptation, and the relation of information retrieval to the nature of scientific theory.
Abstract: The Nature of Information Retrieval. The Principal Formal Models of Information Retrieval. The Evaluation of Information Retrieval Systems. Language and Representation: The Central Problem in Information Retrieval. Communication and Adaptation. Information Retrieval and the Nature of Scientific Theory. Conclusion. Index.

277 citations


Patent
30 Jul 1990
TL;DR: A document storage and retrieval system stores a document body in the form of an image, stores text information in the form of a character code string for retrieval, executes a retrieval with reference to the text information, and then displays the related document image on a retrieval terminal according to the retrieval result.
Abstract: A document storage and retrieval system stores a document body in the form of an image, storing text information in the form of a character code string for retrieval, and executing a retrieval with reference to the text information, followed by displaying a document image relating thereto on a retrieval terminal according to the retrieval result. Such a form of the system is available for retrieving the full contents of a document and also for displaying the document body printed in a format easy to read straight in the form of an image.

160 citations


Journal ArticleDOI
TL;DR: To show the feasibility of statistically based ranked retrieval of records using keywords, research was done to produce very fast search techniques using these ranking algorithms, and to test the results against large databases with many end users.
Abstract: Statistically based ranked retrieval of records using keywords provides many advantages over traditional Boolean retrieval methods, especially for end users. This approach to retrieval, however, has not seen widespread use in large operational retrieval systems. To show the feasibility of this retrieval methodology, research was done to produce very fast search techniques using these ranking algorithms, and then to test the results against large databases with many end users. The results show not only response times on the order of 1 and 1/2 seconds for 806 megabytes of text, but also very favorable user reaction. Novice users were able to consistently obtain good search results after 5 minutes of training. Additional work was done to devise new indexing techniques to create inverted files for large databases using a minicomputer. These techniques use no sorting, require a working space of only about 20% of the size of the input text, and produce indices that are about 14% of the input text size. © 1990 John Wiley & Sons, Inc.
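
As a rough illustration of ranked (rather than Boolean) keyword retrieval over an inverted file, the sketch below scores documents by term frequency times inverse document frequency. The abstract does not give the ranking formula or the indexing technique actually used, so the weighting, the toy records, and the function names are all assumptions.

```python
import math
from collections import defaultdict

def build_index(docs):
    """Inverted file: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def rank(query, index, n_docs):
    """Accumulate tf * idf per document (an assumed weighting), best score first."""
    scores = defaultdict(float)
    for term in set(query.lower().split()):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

DOCS = {
    1: "boolean retrieval systems",
    2: "ranked retrieval of records",
    3: "ranked keyword retrieval ranked output",
}
INDEX = build_index(DOCS)
ranking = rank("ranked keyword", INDEX, len(DOCS))
```

Unlike a Boolean AND, record 2 is still returned even though it lacks "keyword"; it simply ranks below record 3.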

142 citations


Patent
Yasushi Ogawa
25 May 1990
TL;DR: A document retrieval system which includes a keyword connection table making section, a document accuracy calculating section, a document sorting section, and a learning control section is described.
Abstract: A document retrieval system which includes a keyword connection table making section, a document accuracy calculating section, a document sorting section and a learning control section. The document accuracy calculating section calculates a document accuracy for each of the output documents in a prescribed manner by reference to a keyword connection table file. The document sorting section sorts the output documents in descending order of the document accuracy. The learning control section serves to modify the weight of each keyword connection in a prescribed manner after the sorted output documents are given responsive to a query by a user, allowing the user to make an evaluation on whether each document accuracy of the output documents is in conformity with the query. The document retrieval system allows the user to choose a real number between 0 and 1 when evaluating whether each document accuracy of the output documents actually conforms to the query.

114 citations


Journal ArticleDOI
TL;DR: A rule‐governed derivation of an indexing phrase from the text of a document is, in Wittgenstein's sense, a practice, rather than a mental operation explained by reference to internally represented and tacitly known rules.
Abstract: A rule‐governed derivation of an indexing phrase from the text of a document is, in Wittgenstein's sense, a practice, rather than a mental operation explained by reference to internally represented and tacitly known rules. Some mentalistic proposals for theory in information retrieval are criticised in light of Wittgenstein's remarks on following a rule. The conception of rules as practices shifts the theoretical significance of the social role of retrieval practices from the margins to the centre of enquiry into foundations of information retrieval. The abstracted notion of a cognitive act of ‘information processing’ deflects attention from fruitful directions of research.

106 citations


Patent
Morita Tetsuya
05 Oct 1990
TL;DR: A document retrieval system includes an inputting unit for inputting a retrieval condition including one or a plurality of keywords and a weight value for each keyword, and an operating unit having first factors corresponding to relationship values, each relationship value being defined as a degree of the relationship between two keywords out of keywords which are predetermined in the document retrieval system, and second factors corresponding to importance values.
Abstract: A document retrieval system includes an inputting unit for inputting a retrieval condition including one or a plurality of keywords and a weight value for each keyword, an operating unit having first factors corresponding to relationship values, each relationship value being defined as a degree of the relationship between two keywords out of keywords which are predetermined in the document retrieval system, and second factors corresponding to importance values, each importance value being defined as a degree of importance of a keyword in each one of a plurality of documents which are predetermined in the document retrieval system, the operating unit generating a relevance value, which represents a degree of relevance in satisfying a user's requirement, for each of the documents on the basis of the retrieval condition input from the inputting unit, the first factors and the second factors, and an outputting unit for outputting the relevance value for each of the documents as a retrieval result.
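
One way to picture how the three inputs could combine is a weighted sum of products, sketched below. The abstract does not state the actual operation performed by the operating unit, so the formula (query weight times keyword relationship value times keyword importance, summed over keyword pairs), the sample values, and all names are assumptions.

```python
# Hypothetical relationship values between predetermined keywords (symmetric).
RELATION = {("retrieval", "search"): 0.5}

# Hypothetical importance values: document -> {keyword: importance}.
IMPORTANCE = {
    "A": {"search": 0.8},
    "B": {"image": 0.9, "retrieval": 0.2},
}

def relation(a, b):
    """Relationship value of two keywords; a keyword is fully related to itself."""
    if a == b:
        return 1.0
    return RELATION.get((a, b), RELATION.get((b, a), 0.0))

def relevance(condition):
    """condition: {keyword: weight}. Returns an assumed relevance value per document."""
    scores = {}
    for doc, importance in IMPORTANCE.items():
        scores[doc] = sum(
            weight * relation(query_kw, doc_kw) * value
            for query_kw, weight in condition.items()
            for doc_kw, value in importance.items()
        )
    return scores

result = relevance({"retrieval": 1.0})
```

With these made-up values, document A outranks document B for the keyword "retrieval" even though A does not contain it, because "search" is strongly related to "retrieval" and is important in A.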

102 citations


Journal ArticleDOI
Young Whan Kim, Jin H. Kim
TL;DR: The proposed model computes the conceptual distance between a query and an object, both indexed with weighted terms from a hierarchical thesaurus, and extends the model of Rada et al. by allowing the index terms and the edges of the hierarchical-concept graph (HCG) to be weighted.
Abstract: This paper discusses a knowledge based information retrieval model with hierarchical thesaurus. The model computes the conceptual distance between a query and an object and both are indexed with weighted terms from a hierarchical thesaurus. The hierarchical thesaurus is represented by a hierarchical‐concept graph (HCG) in which nodes represent concepts and directed edges represent generalisation relationships. Rada et al. have developed a similar model. However, their model considered only a binary indexing scheme and revealed some counter‐intuitive results. Our proposed model extends theirs by allowing the index term and the edge of the HCG to be weighted. A new concept mapping method is devised to overcome Rada's counter‐intuitive results. In addition, a scheme for allowing Boolean operators in user queries is provided with a formula for computing conceptual distance from negated index terms. Experimental results have shown that our model simulates human performance more closely than Rada's model.
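
To make the idea of conceptual distance over a weighted hierarchical-concept graph concrete, here is a minimal sketch. It is not the authors' formula or concept mapping method: the distance is assumed to be a weight-averaged shortest-path length between query and object index terms, the generalisation edges are traversed in both directions, and the graph, weights, and names are hypothetical.

```python
import heapq
from collections import defaultdict

# Hypothetical weighted generalisation edges: (concept, more general concept, weight).
EDGES = [
    ("dog", "mammal", 1.0),
    ("cat", "mammal", 1.0),
    ("mammal", "animal", 2.0),
    ("bird", "animal", 1.5),
]

def build_graph(edges):
    graph = defaultdict(list)
    for child, parent, weight in edges:
        graph[child].append((parent, weight))
        graph[parent].append((child, weight))
    return graph

def shortest_path(graph, src, dst):
    """Dijkstra's algorithm; returns the weighted path length from src to dst."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            return d
        if d > dist.get(node, float("inf")):
            continue
        for nxt, weight in graph[node]:
            nd = d + weight
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return float("inf")

def conceptual_distance(graph, query, obj):
    """Average pairwise path length, weighted by the index-term weights of both sides."""
    num = den = 0.0
    for q_term, q_weight in query.items():
        for o_term, o_weight in obj.items():
            num += q_weight * o_weight * shortest_path(graph, q_term, o_term)
            den += q_weight * o_weight
    return num / den if den else float("inf")

GRAPH = build_graph(EDGES)
distance = conceptual_distance(GRAPH, {"dog": 1.0}, {"cat": 0.5, "bird": 0.5})
```

Here "dog" is 2.0 from "cat" and 4.5 from "bird", so the weighted average for the object is 3.25; a smaller value means the object is conceptually closer to the query.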

77 citations


Patent
01 Oct 1990
TL;DR: An apparatus for document browsing, specifically for document retrieval systems, is described; it enables users to see multiple document pages on the same screen at the same time in a first mode and to see a bundle of pages on a screen in a second mode.
Abstract: An apparatus for document browsing, specifically for document retrieval systems. The browsing apparatus enables users to see multiple document pages on the same screen at the same time in a first mode and to see a bundle of pages on a screen in a second mode. The images shown on the screen are produced internally according to the user's commands. The pages may be flipped in either direction and selected pages may be marked for later printing instructions.

Journal ArticleDOI
Udo Hahn
TL;DR: This paper introduces a parser which is based on the conceptual knowledge of its domain and is organized as a collection of distributed lexicalized grammar modules (word experts) which communicate through message-passing.
Abstract: The rapid proliferation of full-text databases poses serious problems to the natural language processing components of information retrieval systems. Not taking text-level phenomena of written natural language discourse into account causes a marked decrease of performance for many text information system applications. Consequently, appropriate text parsing facilities must be capable of recognizing the rich internal structure of full-texts on lower levels of text connectivity as well as on the global organizational level of text coherence. This paper introduces such a parser which is based on the conceptual knowledge of its domain and is organized as a collection of distributed lexicalized grammar modules (word experts) which communicate through message-passing. Emphasis is put on text grammatical specifications which state formal conditions for recognizing higher-order text constituents and their coherent configuration on the global level of textual macro organization.


Journal ArticleDOI
TL;DR: The results obtained show improvements in the level of retrieval effectiveness, which demonstrate that the approach of using a semantic theory of natural language and logic in document retrieval systems is a valid one.
Abstract: This paper introduces a logical-linguistic model of document retrieval systems and describes an implementation of a system called SILOL which is based on this model. SILOL uses a shallow semantic translation of natural language texts into a first order predicate representation in performing a document indexing and retrieval process. Some preliminary experiments have been carried out to test the retrieval effectiveness of this system. The results obtained show improvements in the level of retrieval effectiveness, which demonstrate that the approach of using a semantic theory of natural language and logic in document retrieval systems is a valid one.

Journal ArticleDOI
TL;DR: A taxonomy is developed that characterizes a range of misconceptions users have when performing subject-based search in an online catalog system, and hypotheses about the causes of the misconceptions are suggested.
Abstract: We report results of an investigation where thirty subjects were observed performing subject-based search in an online catalog system. The observations have revealed a range of misconceptions users have when performing subject-based search. We have developed a taxonomy that characterizes these misconceptions and a knowledge representation which explains these misconceptions. Directions for improving search performance are also suggested.

Patent
Morita Tetsuya
06 Jun 1990
TL;DR: A document retrieval system includes a storage for storing keyword relationships which indicate relationship values of keywords and relations of the keywords and registered documents, and an input part for designating a retrieval condition including one or a plurality of designated keywords, where the retrieval condition determines a registered document which is to be retrieved from the storage.
Abstract: A document retrieval system retrieves a registered document from a document database responsive to a designated retrieval condition including one or a plurality of designated keywords. The document retrieval system includes a storage for storing keyword relationships which indicate relationship values of keywords and relations of the keywords and registered documents, an input part for designating a retrieval condition including one or a plurality of designated keywords, where the retrieval condition determines a registered document which is to be retrieved from the storage, a selector for selecting a plurality of keyword relationships based on the retrieval condition and for converting the selected keyword relationships into analog signals, an analog operation circuit for calculating a relevance of document based on the analog signals, and a converter for converting the calculated relevance of document into a digital value.

Journal ArticleDOI
TL;DR: The paper describes the logic engineering techniques which are being used, provides a progress report on the design of the OSM, and outlines some current research and expected future developments.

Journal ArticleDOI
TL;DR: This study first identifies and extends various query/profile interaction models to provide a ground upon which the investigation of the roles of user profiles can be undertaken.
Abstract: One difficult problem in information retrieval (IR) is the proper interpretation of user queries. It is extremely hard for users to express their information needs in a specific yet exhaustive way. In an effort to alleviate this problem, two theoretical models have been proposed to utilize user characteristics maintained in the form of a user profile. Although the idea of integrating user profiles into an IR system is intuitively appealing, and the models seem viable, no research to date has established a foundation for the roles of user profiles in such a system. Aiming at the investigation of the roles of user profiles, therefore, this study first identifies and extends various query/profile interaction models to provide a ground upon which the investigation can be undertaken. From a continuum of models characterized on the basis of interaction types, metrics, and parameters, nearly 400 models are chosen to investigate the “model space.” New measures are developed based on the notion of user satisfaction/frustration. In addition, three different criteria are used to guide users in making judgments on the quality of retrieved items. Analysis of the data obtained from the experiments shows that, for a wide variety of criteria and metrics, there are always some query/profile interaction models that outperform the query alone model. In addition, preferable characteristics for different criteria are identified in terms of interaction types, parameters, and metrics.


01 Jan 1990
TL;DR: The efficiency of a p-norm retrieval is significantly improved with a new p-norm retrieval algorithm which evaluates the entire document collection in one recursive traversal of the query tree, and list pruning methods for further efficiency improvements are introduced.
Abstract: A practical information retrieval system must be easy to use by untrained users, and it must provide prompt responses to a user's search requests. In this thesis, these practical aspects of the p-norm model of information retrieval are explored. In addition, a study of theoretical properties of the p-norm model is presented. A syntactic method for generating p-norm queries from parse trees generated by the PLNLP syntactic analyzer is presented. The effectiveness of the syntactically generated queries is shown to be comparable to the effectiveness of manually constructed queries, and much better than that of statistically generated queries. The efficiency of a p-norm retrieval is significantly improved with a new p-norm retrieval algorithm which evaluates the entire document collection in one recursive traversal of the query tree. This algorithm is compared against the straightforward algorithm, which requires a traversal of the query tree for each document that is evaluated. The new algorithm is shown to be better both asymptotically and experimentally. The infinity-one model is introduced as a means of approximating the p-norm model without requiring exponentiation. Experimental results show that infinity-one retrieval is essentially as effective as p-norm retrieval, but much faster. List pruning methods for further efficiency improvements are also introduced and are shown to reduce retrieval time significantly without affecting the precision of top-ranked documents. The retrieval time of the infinity-one model with list pruning is shown to be comparable to that of pure Boolean retrieval. A theoretical study is also presented in which certain Boolean algebra properties, such as associativity, are shown to be unsatisfiable by any extended Boolean system with weak operators. The p-norm model is shown to satisfy all those properties that can be satisfied. 
In addition, the p-norm model is evaluated with respect to the Waller-Kraft wish list for extended Boolean systems.
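
For readers unfamiliar with the p-norm model, the sketch below evaluates a small query tree against a whole collection in one recursive traversal, returning one score per document at each node. It uses the standard unweighted p-norm operators (OR as the p-th root of the mean of p-th powers, AND as one minus the same quantity computed on the complements). It is not the thesis's algorithm: the tuple-based query representation, the sample documents, and the names are assumptions, and neither query-term weights, the infinity-one approximation, nor list pruning is shown.

```python
def evaluate(node, docs):
    """Score every document for this query node in a single traversal of the tree.

    A leaf is a term string; an inner node is a tuple (operator, p, children).
    """
    if isinstance(node, str):
        return [doc.get(node, 0.0) for doc in docs]
    op, p, children = node
    child_scores = [evaluate(child, docs) for child in children]
    n = len(children)
    scores = []
    for i in range(len(docs)):
        values = [s[i] for s in child_scores]
        if op == "or":
            scores.append((sum(a ** p for a in values) / n) ** (1.0 / p))
        elif op == "and":
            scores.append(1.0 - (sum((1.0 - a) ** p for a in values) / n) ** (1.0 / p))
        else:
            raise ValueError("unknown operator: " + op)
    return scores

# Hypothetical documents as term -> weight in [0, 1].
DOCS = [{"cat": 1.0}, {"cat": 0.5, "dog": 0.5}]

or_scores = evaluate(("or", 2, ["cat", "dog"]), DOCS)
and_scores = evaluate(("and", 2, ["cat", "dog"]), DOCS)
avg_scores = evaluate(("and", 1, ["cat", "dog"]), DOCS)
```

With p = 1 both operators reduce to a plain average of the term weights, and as p grows the operators approach strict Boolean OR (maximum) and AND (minimum), which is why p is described as controlling the strictness of the query.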

Journal ArticleDOI
TL;DR: Two programs are described, INDEX and INDEXD, which locate repeated phrases in a document, gather statistical information about them, and rank them according to their value as index phrases, showing promise as the basis for a sophisticated conceptual indexing system.
Abstract: In recent years researchers have become increasingly convinced that the performance of information retrieval systems can be greatly enhanced by the use of key phrases for automatic conceptual document indexing and retrieval. In this article we describe two programs, INDEX and INDEXD, which locate repeated phrases in a document, gather statistical information about them, and rank them according to their value as index phrases. The programs show promise as the basis for a sophisticated conceptual indexing system. The simpler program, INDEX, ranks phrases in such a way that frequently occurring phrases which contain several frequently occurring words are given a high ranking. INDEXD is an extension of INDEX which incorporates a dictionary for stemming, weighting of words and validation of syntax of output phrases. Sample output of both programs is included, and we discuss plans to combine INDEXD with linguistic and artificial intelligence techniques to provide a general conceptual phrase-indexing system that can incorporate expert knowledge about a given application area. © 1990 John Wiley & Sons, Inc.
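
The behaviour attributed to the simpler program can be approximated with a short sketch: find phrases that repeat, then rank them so that frequent phrases made of frequent words come first. This is not the INDEX program itself. The scoring rule (phrase count times the sum of its word counts), the phrase lengths considered, and the names are assumptions, and none of INDEXD's dictionary, stemming, or syntax validation is modelled.

```python
import re
from collections import Counter

def rank_phrases(text, max_len=3):
    """Return repeated phrases as (phrase, score) pairs, highest score first."""
    words = re.findall(r"[a-z]+", text.lower())
    word_counts = Counter(words)
    phrase_counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(words) - n + 1):
            phrase_counts[tuple(words[i:i + n])] += 1
    ranked = []
    for phrase, count in phrase_counts.items():
        if count < 2:
            continue  # keep only phrases that repeat in the document
        score = count * sum(word_counts[w] for w in phrase)  # assumed scoring rule
        ranked.append((" ".join(phrase), score))
    return sorted(ranked, key=lambda item: (-item[1], item[0]))

SAMPLE = ("information retrieval systems use information retrieval methods "
          "and information retrieval models")
ranked = rank_phrases(SAMPLE)
```

On the sample sentence the only repeated phrase is "information retrieval", which occurs three times and is built from two words that each occur three times.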

Journal ArticleDOI
TL;DR: This study was designed to test hypotheses derived from a psychological theory of remembering known as retrieval by reformulation, and to observe behavioral differences while searching the two catalogs.
Abstract: Twenty subjects were assigned information problems to solve through searching a university card catalog and twenty were assigned the same problems to solve in a comparable online catalog. The study was designed to test hypotheses derived from a psychological theory of remembering known as retrieval by reformulation, and to observe behavioral differences while searching the two catalogs. Verbal protocols were used to identify reformulations and to operationalize further the theoretical construct “reformulation.” Greater perseverance and more frequent search reformulations were associated with the online catalog, while larger retrieval sets and more favorable search assessments were associated with the card catalog. No significant differences were found on most attitudinal measures. Post hoc analyses examined overlap of sets of retrieved items and variance associated with the use of test questions. © 1990 John Wiley & Sons, Inc.

Journal ArticleDOI
Christoph Schwarz
TL;DR: The COPSY (context operator syntax) system, which uses natural language processing techniques during fully automatic syntactic analysis of free text documents, is described; it is being tested by the U.S. Department of Commerce for patent search and indexing.
Abstract: Problems encountered during the syntactic analysis of free text documents are discussed. The system called COPSY (context operator syntax), which uses natural language processing techniques during fully automatic syntactic analysis (indexing and search) of free text documents, is described. Applications under real world conditions are mentioned as well as evaluation and technical aspects. Further developments in the field of thesaurus building and full-text analysis using the linguistic algorithms of the syntactic retrieval system are outlined. COPSY was developed as part of a text processing project at Siemens, called TINA (Text-Inhalts-Analyse: text content analysis).

Journal ArticleDOI
TL;DR: Both the graph and the testing result suggest that on this small database the proposed model tends to improve retrieval effectiveness; however, the structural retrieval model needs to be refined and more elaborate experiments are required in order to further confirm the findings.
Abstract: This paper describes a structural document retrieval model which has been designed based on lexical-semantic relationships between index terms and an algorithm for measuring tree-to-tree distance. In this model, documents and query statements are structurally coded in order to take into account any hierarchy or ordering among the conceptual coordinates and are structurally matched by using the algorithm, which cannot be expressed in the form of an equation. The proposed model has been compared to the vector retrieval model using a small database and the results have been analyzed using a precision-recall graph and a statistical test. Both the graph and the testing result suggest that on this small database the proposed model tends to improve retrieval effectiveness. However, the structural retrieval model needs to be refined and more elaborate experiments are required in order to further confirm the findings.

Journal ArticleDOI
TL;DR: A year-long project to study two groups of children aged between nine and eleven years old was undertaken in which the children compiled substantial databases of the 1881 Census material and subsequently interrogated it to gain insight into the methods young children employed to understand database information in terms of their own experience.
Abstract: A year-long project to study two groups of children aged between nine and eleven years old was undertaken in which the children compiled substantial databases of the 1881 Census material and subsequently interrogated it. The main aims of the study were to obtain information to provide guidance for teachers on the introduction of databases with young children, and to gain insight into the methods young children employed to understand database information in terms of their own experience. The children's reaction to menus and commands, and their ability to navigate around the database were noted. Attention was paid to their mental mapping and to the effectiveness with which they used the system.

Journal ArticleDOI
TL;DR: The use of various forms of bitmaps as a basic tool for improving the search algorithms in medium sized information retrieval systems is described; the approach is flexible, efficient, and, relative to the customary concordance approach, inexpensive in storage costs.
Abstract: We describe the use of various forms of bitmaps as a basic tool for improving the search algorithms in medium sized information retrieval systems. The bitmaps considered include and extend known techniques using occurrence maps and signatures. Such an approach to text retrieval is flexible, efficient, and, relative to the customary concordance approach, inexpensive in storage costs.
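
The simplest bitmap of the kind mentioned here, an occurrence map, can be sketched in a few lines: each term gets a bit string with one bit per document, and a conjunctive query becomes a bitwise AND. This is an illustrative assumption about the basic idea only; the paper's extended bitmap forms and signatures are not reproduced, and the sample documents and names are hypothetical.

```python
def build_bitmaps(docs):
    """Occurrence maps: term -> integer whose bit i is set if document i contains the term."""
    maps = {}
    for i, text in enumerate(docs):
        for term in set(text.lower().split()):
            maps[term] = maps.get(term, 0) | (1 << i)
    return maps

def search_all(maps, terms, n_docs):
    """Documents containing every query term, found by ANDing the occurrence maps."""
    result = (1 << n_docs) - 1  # start with all documents selected
    for term in terms:
        result &= maps.get(term, 0)
    return [i for i in range(n_docs) if (result >> i) & 1]

DOCS = [
    "bitmap index for text retrieval",
    "signature files",
    "retrieval with bitmap signatures",
]
MAPS = build_bitmaps(DOCS)
hits = search_all(MAPS, ["bitmap", "retrieval"], len(DOCS))
```

Each map costs one bit per document regardless of how often the term occurs, which is the storage trade-off against a concordance that records every occurrence.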

Journal ArticleDOI
TL;DR: A fuzzy retrieval system based on citations, which defines graded relations among documents through fuzzy graph theory, and discusses mathematical properties and their meaning in practical retrieval.

Journal ArticleDOI
TL;DR: The nature of information retrieval applications, the Z39.50 protocol, and its relationship to other OSI protocols are described; Z39.50 allows a client to build queries in terms of logical information elements supported by the server.
Abstract: The nature of information retrieval applications, the Z39.50 protocol, and its relationship to other OSI protocols are described. Through Z39.50 a client system views a remote server's database as an information resource, not merely a collection of data. Z39.50 allows a client to build queries in terms of logical information elements supported by the server. It also provides a framework for transmitting queries, managing results, and controlling resources. Sidebars describe the Z39.50 Implementors Group, the Z39.50 Maintenance Agency, and international standards for OSI library application protocols.

Journal ArticleDOI
TL;DR: The approach to image description is presented and a control structure to index textual and pictorial data is discussed and a model for indexing pictorial parts of multimedia documents is proposed.
Abstract: Information retrieval (IR) systems were originally developed to process, store, search and retrieve narrative information. Structured databases are therefore created to store the items containing enough information to identify and retrieve the original documents. This paper proposes an extension of classification and indexing to pictorial data, specifically to pictorial parts of multimedia documents. This goal is attained by adopting structural techniques for digital image description. The descriptions of the objects (digital structures) present in the image are proposed as pictorial index terms. Such a generalization has led to a uniform management of the non-homogeneous kinds of data composing the document and has allowed the outlining of a multimedia IR system raised from interdisciplinary experiences. In the paper the approach to image description is presented and a control structure to index textual and pictorial data is discussed.

Journal ArticleDOI
TL;DR: Suggestions are made for enhancing retrieval performance on UDC numbers in simple systems, and for ways in which the classification might be developed to improve automated searching.
Abstract: The Universal Decimal Classification (UDC) is able to provide a detailed description of the subject content of a document in any area. Its hierarchical and synthetic structure, which is generally reflected in its notation, should enable computer searching for hierarchically‐related subjects and for the individual facets of a complex subject. The possibilities of using these features in automated retrieval are discussed, and attention is drawn to places where the UDC falls short. A number of online catalogues, databases, and information retrieval packages are discussed in terms of their ability to allow searching on UDC numbers. The most sophisticated ones, such as ETHICS at the ETH Library, use a separate file of verbal descriptors linked to the document file through UDC numbers. Suggestions are made for enhancing retrieval performance on UDC numbers in simple systems, and for ways in which the classification might be developed to improve automated searching.