
Showing papers on "Document retrieval" published in 1993


Proceedings ArticleDOI
01 Jul 1993
TL;DR: New approaches are described in this study for implementing selective passage retrieval systems, and identifying text passages responsive to particular user needs.
Abstract: Large collections of full-text documents are now commonly used in automated information retrieval. When the stored document texts are long, the retrieval of complete documents may not be in the users' best interest. In such circumstances, efficient and effective retrieval results may be obtained by using passage retrieval strategies designed to retrieve text excerpts of varying size in response to statements of user interest. New approaches are described in this study for implementing selective passage retrieval systems, and identifying text passages responsive to particular user needs. An automated encyclopedia search system is used to evaluate the usefulness of the proposed methods.

452 citations


Journal ArticleDOI
TL;DR: The idea of using visualization for document retrieval is introduced through a new paradigm for query response handling based on parallel queries or points of interest, which has been implemented through a visualization system called VIBE.
Abstract: The idea of using visualization for document retrieval is introduced through a new paradigm for query response handling. The paradigm is based on parallel queries or points of interest. Each point of interest is defined by a number of key terms and a display position. Documents, represented by icons, are positioned in the display based on the frequency count of word matches in the document to key terms in the points of interest. This visualization method has been implemented through a visualization system called VIBE.

290 citations
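
The positioning idea in the VIBE abstract above can be illustrated with a minimal sketch. The abstract only says that icons are positioned from the frequency count of key-term matches; placing each icon at the score-weighted average of the point-of-interest positions is an assumption made here for illustration, and the names `poi_scores`, `icon_position`, and the sample data are hypothetical.

```python
from collections import Counter

def poi_scores(doc_text, pois):
    """Count how often each point of interest's key terms occur in the document."""
    counts = Counter(doc_text.lower().split())
    return {name: sum(counts[t] for t in poi["terms"]) for name, poi in pois.items()}

def icon_position(doc_text, pois):
    """Assumed rule: icon sits at the score-weighted average of the POI positions."""
    scores = poi_scores(doc_text, pois)
    total = sum(scores.values())
    if total == 0:
        return None  # no key term matched, so no position is defined
    x = sum(scores[n] * pois[n]["pos"][0] for n in pois) / total
    y = sum(scores[n] * pois[n]["pos"][1] for n in pois) / total
    return (x, y)

# Two hypothetical points of interest, each with key terms and a display position.
pois = {
    "retrieval": {"terms": ["retrieval", "query"], "pos": (0.0, 0.0)},
    "visual": {"terms": ["display", "icon"], "pos": (10.0, 0.0)},
}
position = icon_position("query retrieval query display", pois)
print(position)
```

Under this assumed rule, a document that matches one point of interest more often is drawn closer to it, so clusters of icons hint at which combinations of interests the documents satisfy.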


Journal ArticleDOI
TL;DR: A fuzzy linguistic model is defined, starting from an existing weighted Boolean retrieval model, a linguistic extension, formalized within fuzzy set theory, in which numeric query weights are replaced by linguistic descriptors which specify the degree of importance of the terms.
Abstract: The generalization of Boolean Information Retrieval Systems (IRS) is still an open research field; in fact, though such systems are diffused on the market, they present some limitations; one of the main features lacking in these systems is the ability to deal with the “imprecision” and “subjectivity” characterizing retrieval activity. However, the replacement of such systems would be much more costly than their evolution through the incorporation of new features to enhance their efficiency and effectiveness. Previous efforts in this area have led to the introduction of numeric weights to improve both document representation and query language. By attaching a numeric weight to a term in a query, a user can provide a quantitative description of the “importance” of that term in the documents he or she is looking for. However, the use of weights requires a clear knowledge of their semantics for translating a fuzzy concept into a precise numeric value. Our acquaintance with these problems led us to define, starting from an existing weighted Boolean retrieval model, a linguistic extension, formalized within fuzzy set theory, in which numeric query weights are replaced by linguistic descriptors which specify the degree of importance of the terms. This fuzzy linguistic model is defined and an evaluation is made of its implementation on a Boolean IRS. © 1993 John Wiley & Sons, Inc.

285 citations


Journal ArticleDOI
TL;DR: A knowledge-based extended Boolean model (kb-ebm) is proposed that evaluates weighted queries and documents effectively and avoids the problems of previous methods.
Abstract: There have been several document ranking methods to calculate the conceptual distance or closeness between a Boolean query and a document. Though they provide good retrieval effectiveness in many cases, they do not support effective weighting schemes for queries and documents and also have several problems resulting from inappropriate evaluation of Boolean operators. We propose a new method called Knowledge-Based Extended Boolean Model (kb-ebm) in which Salton's extended Boolean model is incorporated. kb-ebm evaluates weighted queries and documents effectively, and avoids the problems of the previous methods. kb-ebm provides high quality document rankings by using term dependence information from is-a hierarchies. The performance experiments show that the proposed method closely simulates human behaviour.

266 citations
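
The kb-ebm abstract above builds on Salton's extended Boolean model. As background only, here is a minimal sketch of the p-norm operators from that model in their unweighted-query form, with document term weights assumed to lie in [0, 1]. It is not the kb-ebm method itself, and the function names are illustrative.

```python
def pnorm_or(weights, p=2.0):
    """p-norm OR: grows when any query term has a high weight in the document."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)

def pnorm_and(weights, p=2.0):
    """p-norm AND: shrinks when any query term has a low weight in the document."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)

# A document containing only the first of two query terms.
doc_weights = [1.0, 0.0]
print(pnorm_or(doc_weights), pnorm_and(doc_weights))
print(pnorm_or(doc_weights, p=1.0), pnorm_and(doc_weights, p=1.0))
```

With p = 1 the two operators coincide (a plain average of the weights), and as p grows they approach strict Boolean OR and AND, which is why the model can rank documents that satisfy a Boolean query only partially.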


Proceedings Article
01 Jan 1993
TL;DR: LSI is an extension of the vector retrieval method in which the dependencies between terms are explicitly taken into account in the representation and exploited in retrieval by simultaneously modeling all the interrelationships among terms and documents.
Abstract: Latent Semantic Indexing (LSI) is an extension of the vector retrieval method (e.g., Salton & McGill, 1983) in which the dependencies between terms are explicitly taken into account in the representation and exploited in retrieval. This is done by simultaneously modeling all the interrelationships among terms and documents. We assume that there is some underlying or "latent" structure in the pattern of word usage across documents, and use statistical techniques to estimate this latent structure. A description of terms, documents and user queries based on the underlying, "latent semantic", structure (rather than surface level word choice) is used for representing and retrieving information. One advantage of the LSI representation is that a query can be very similar to a document even when they share no words.

261 citations


Journal Article
TL;DR: It is proposed that information retrieval is most properly considered as information-seeking behavior, that the central process of information retrieval is user interaction with text, and that the user is the central component of the information retrieval system.
Abstract: We present an analysis of information retrieval as an information-seeking activity, supporting people's interactions with text. This analysis suggests that some assumptions underlying the standard model of information retrieval are inappropriate, and we suggest alternative assumptions and discuss their implications for information retrieval system design. It is proposed that information retrieval is most properly considered as information-seeking behavior, that the central process of information retrieval is user interaction with text, and that the user is the central component of the information retrieval system. Possible ways to incorporate this view in the design of information retrieval systems are discussed.

191 citations


Journal ArticleDOI
TL;DR: Development of the Envision database, system software, and protocol for client-server communication builds upon work to identify and represent “objects” that will facilitate reuse and high-level communication of information from author to reader (user).
Abstract: Project Envision aims to build a “user-centered database from the computer science literature,” initially using the publications of the Association for Computing Machinery (ACM). Accordingly, we have interviewed potential users, as well as experts in library, information, and computer science, to understand their needs, to become aware of their perception of existing information systems, and to collect their recommendations. Design and formative usability evaluation of our interface have been based on those interviews, leading to innovative query formulation and search results screens that work well according to our usability testing. Our development of the Envision database, system software, and protocol for client-server communication builds upon work to identify and represent “objects” that will facilitate reuse and high-level communication of information from author to reader (user). All these efforts are leading not only to a usable prototype digital library but also to a set of nine principles for digital libraries, which we have tried to follow, covering issues of representation, architecture, and interfacing. © 1993 John Wiley & Sons, Inc.

157 citations


Patent
29 Jan 1993
TL;DR: A computer system and process, with special application as a computer-assisted new drug application, in which a bar code reader is used to read command bar codes to manipulate user interface software and document retrieval bar codes to retrieve electronic documents.
Abstract: A computer system and process, with special application as a computer-assisted new drug application, in which a bar code reader is used to read command bar codes to manipulate user interface software and document retrieval bar codes to retrieve electronic documents.

153 citations


Journal ArticleDOI
TL;DR: This paper explored children's information retrieval behavior using an online public access catalog (OPAC) in an elementary school library and reported the overall patterns of children's behavior that influence success and breakdown in information retrieval as well as findings about the intentions, moves, plans, strategies, and search terms of children in grades one through six.
Abstract: This article reports research that explored children's information retrieval behavior using an online public access catalog (OPAC) in an elementary school library. The study considers the impact of a variety of factors including user characteristics, the school setting, interface usability, and information access features on children's information retrieval success and breakdown. The study reports the overall patterns of children's behavior that influence success and breakdown in information retrieval as well as findings about the intentions, moves, plans, strategies, and search terms of children in grades one through six. © 1993 John Wiley & Sons, Inc.

146 citations


Patent
Ogawa Yasutsugu1
17 Nov 1993
TL;DR: A document retrieval system includes a query converter for converting the retrieval condition designated by the user into a query which has a predetermined normal form in which keywords and at least one type of logical operation out of logical operations AND, OR and NOT are connected, a bibliographical information indicator for indicating a relation between each of said registered documents and keywords, and a keyword connection table having relationship values, each of the relationship values representing the degree of relationship between each two keywords.
Abstract: A document retrieval system retrieves one or a plurality of registered documents from a document database responsive to retrieval conditions designated by a user. The document retrieval system includes a query converter for converting the retrieval condition designated by the user into a query which has a predetermined normal form in which keywords and at least one type of logical operation out of logical operations AND, OR and NOT are connected, a bibliographical information indicator for indicating a relation between each of said registered documents and keywords and a keyword connection table having relationship values, each of the relationship values representing the degree of relationship between each two keywords. The document retrieval system also includes a selector for referring the inverted file and the keyword connection so as to select one or a plurality of registered documents which satisfy the query, and an outputting circuit for outputting one or a plurality of registered documents selected by the selecting means.

137 citations


Journal Article
TL;DR: A series of studies explored the effects of domain expertise and search expertise in hypertext or full-text CD-ROM databases to investigate how highly interactive electronic access to primary information affects information seeking.

Proceedings ArticleDOI
01 Jan 1993
TL;DR: A probabilistic model of the database and queries is introduced, leading to a set of design tradeoffs over a range of hardware configurations and to new parallel query processing strategies.
Abstract: The impact on query processing performance of various physical organizations for inverted lists is compared. A probabilistic model of the database and queries is introduced. Simulation experiments determine which variables most strongly influence response time and throughput. This leads to a set of design tradeoffs over a range of hardware configurations and new parallel query processing strategies.

Journal ArticleDOI
TL;DR: A blackboard-based document management system that uses a neural network spreading-activation algorithm which lets users traverse multiple thesauri is discussed, and the system's query formation; the retrieving, ranking and selection of documents; and thesaurus activation are described.
Abstract: A blackboard-based document management system that uses a neural network spreading-activation algorithm which lets users traverse multiple thesauri is discussed. Guided by heuristics, the algorithm activates related terms in the thesauri and converges on the most pertinent concepts. The system provides two control modes: a browsing module and an activation module that determine the sequence of operations. With the browsing module, users have full control over which knowledge sources to browse and what terms to select. The system's query formation; the retrieving, ranking and selection of documents; and thesaurus activation are described.

Journal Article
TL;DR: The focus of the effort is the development of SPECIALIST, an experimental natural language processing system for the biomedical domain that includes a broad coverage parser supported by a large lexicon, modules that provide access to the extensive Unified Medical Language System Knowledge Sources, and a retrieval module that permits experiments in information retrieval.
Abstract: This paper describes efforts to provide access to the free text in biomedical databases. The focus of the effort is the development of SPECIALIST, an experimental natural language processing system for the biomedical domain. The system includes a broad coverage parser supported by a large lexicon, modules that provide access to the extensive Unified Medical Language System (UMLS) Knowledge Sources, and a retrieval module that permits experiments in information retrieval. The UMLS Metathesaurus and Semantic Network provide a rich source of biomedical concepts and their interrelationships. Investigations have been conducted to determine the type of information required to effect a map between the language of queries and the language of relevant documents. Mappings are never straightforward and often involve multiple inferences.

Proceedings ArticleDOI
01 Jul 1993
TL;DR: Using structured queries, character-based indexing performed retrieval as well as, or slightly better than, the word-based system, which has practical significance since character-based indexing is considerably faster than traditional word-based indexing.
Abstract: A series of Japanese full-text retrieval experiments were conducted using an inference network document retrieval model. The retrieval performance of two major indexing methods, character-based and word-based, was evaluated. Using structured queries, the character-based indexing performed retrieval as well as, or slightly better than, the word-based system. This result has practical significance since the character-based indexing speed is considerably faster than the traditional word-based indexing. All the queries in this experiment were automatically formulated from natural language input.

Journal ArticleDOI
TL;DR: This study tests the effectiveness of a thesaurus as a search-aid in free text searching of a full text database of newspaper articles and finds that it helps to have a switching tool connecting the different names of one concept.
Abstract: Authors and searchers usually express the same things in many different ways, which causes problems in free text searching of text databases. Thus, a switching tool connecting the different names of one concept is needed. This study tests the effectiveness of a thesaurus as a search-aid in free text searching of a full text database. A set of queries was searched against a large full text database of newspaper articles. The search-aid thesaurus constructed for the test contains the usual relationships of a thesaurus, namely equivalence, hierarchical, and associative relationships. Each query was searched in five distinct modes: basic search, synonym search, narrower term search, related term search, and union of all previous searches. The basic searches contained only terms included in the original query statements. In the synonym searches, the terms of the basic search were extended by disjunction of the synonyms given by the search-aid thesaurus without modifying the overall logic of the basic search. Likewise, the basic search was extended in turn with the narrower terms and with the related terms given by the search-aid thesaurus. The last search mode included the basic terms and all the terms used in the previous searches. The searches were analyzed in terms of relative recall and precision; relative recall was estimated by setting the recall of the union search to 100%. On the average the value of relative recall was 47.2% in the basic search, compared with 100% in the union search; the average value of precision decreased only from 62.5% in the basic search to 51.2% in the union search.
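
The search modes and the relative-recall measure described in the thesaurus abstract above can be sketched as follows. This is an illustration under stated assumptions: the query is modeled as a conjunction of disjunctions, the tiny thesaurus and documents are invented, every retrieved document is treated as relevant, and all function names are hypothetical.

```python
def expand_query(query, thesaurus, relation):
    """Extend each term by disjunction with its thesaurus entries for one relation.

    `query` is a conjunction (outer list) of disjunctions (inner lists), so the
    overall Boolean logic of the basic search is left unchanged.
    """
    expanded = []
    for group in query:
        terms = list(group)
        for term in group:
            for extra in thesaurus.get(term, {}).get(relation, []):
                if extra not in terms:
                    terms.append(extra)
        expanded.append(terms)
    return expanded

def search(query, docs):
    """Return ids of documents that contain at least one term from every group."""
    return {doc_id for doc_id, words in docs.items()
            if all(any(t in words for t in group) for group in query)}

def relative_recall(relevant_found, relevant_in_union):
    """Recall relative to the union search, whose recall is set to 100%."""
    return len(relevant_found & relevant_in_union) / len(relevant_in_union)

# Invented data for illustration only.
thesaurus = {"car": {"synonym": ["automobile"], "narrower": ["taxi"]},
             "tax": {"synonym": ["levy"]}}
docs = {1: {"car", "tax"}, 2: {"automobile", "tax"}, 3: {"taxi", "levy"}}

basic = [["car"], ["tax"]]
synonym_query = expand_query(basic, thesaurus, "synonym")
union_query = expand_query(synonym_query, thesaurus, "narrower")

basic_hits = search(basic, docs)
union_hits = search(union_query, docs)
print(basic_hits, union_hits, relative_recall(basic_hits, union_hits))
```

Because expansion only adds alternatives inside each disjunction, every document found by the basic search is also found by the expanded searches, which is what makes the union search a sensible 100% baseline.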

Patent
27 May 1993
TL;DR: A method of registering document information in a document information retrieval system that stores document information consisting of a large number of characters for later retrieval.
Abstract: A document information compression and retrieval system which reduces the document data amount and shortens the retrieval time when mass document information is registered and retrieved. A method of registering document information in a document information retrieval system which stores document information consisting of a large number of characters for retrieval of the stored document information. Entered document information is separated into words. Whether or not each of the words is a word to which a compressed code is assigned is determined. If not already assigned, a compressed code is assigned to the word. The words are converted into the assigned compressed codes for storing a compressed text. At output, retrieval information is accepted and converted into compressed code and stored compressed texts are searched for the compressed text matching the compressed code of the retrieval information, then the words corresponding to the compressed codes are used to expand the compressed text into original document information.
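
The registration and retrieval steps in the patent abstract above can be sketched as a small word-to-code dictionary. This is only an illustration: splitting on whitespace, using consecutive integers as compressed codes, and the class name `CompressedStore` are assumptions, not details taken from the patent.

```python
class CompressedStore:
    """Toy store that keeps documents as lists of per-word compressed codes."""

    def __init__(self):
        self.code_of = {}  # word -> compressed code
        self.word_of = []  # compressed code -> word
        self.texts = []    # each stored text is a list of codes

    def register(self, document):
        """Separate the document into words and store it as compressed codes."""
        codes = []
        for word in document.split():
            if word not in self.code_of:  # assign a code only if none exists yet
                self.code_of[word] = len(self.word_of)
                self.word_of.append(word)
            codes.append(self.code_of[word])
        self.texts.append(codes)
        return len(self.texts) - 1

    def expand(self, doc_id):
        """Turn a stored compressed text back into words."""
        return " ".join(self.word_of[c] for c in self.texts[doc_id])

    def search(self, word):
        """Convert the search word to its code and scan the compressed texts."""
        code = self.code_of.get(word)
        if code is None:
            return []  # a word that was never registered cannot match
        return [self.expand(i) for i, codes in enumerate(self.texts) if code in codes]

store = CompressedStore()
store.register("fast document retrieval")
store.register("document compression")
print(store.search("document"))
```

Searching compares small integer codes rather than character strings, which is the source of the shorter retrieval time the abstract claims; this sketch does not preserve the original whitespace exactly.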

Journal ArticleDOI
TL;DR: This work confirms earlier indications from other researchers that citation searching complements searching by terms and seeks to determine the magnitude of incremental contribution of citation retrieval to MEDLINE searching.
Abstract: or partially relevant as opposed to being not relevant, and 8.4 times more likely for definitely relevant retrievals. In the field setting, citation searching was able to add an average of 24% recall to traditional subject retrieval. Term or citation searching from the open literature produced lower precision results. Attempts to identify distinguishing characteristics in queries which might benefit most from additional citation searches proved to be inconclusive. In spite of the obvious gain shown by citation searching, online access of citation databases has been hampered by their relatively high cost. The reported work is an extension of a pilot study of the characteristics and retrieval effectiveness of two subject searching modes. The two search approaches available on commercial bibliographic databases are semantic retrieval based on text words, assigned keywords and descriptors, and pragmatic retrieval based on citations. The earlier study was an experiment performed on a data file with narrow subject focus (Pao & Worthen, 1989). The database was constructed such that its documents' content was retrievable by descriptors and text words as well as by cited references. Thus, direct comparison of the retrieval results was possible from parallel searches using appropriate terms and citations on identical queries. Citation searching in the control setting was found to add an average of 14% of relevant documents to a search. This confirms earlier indications from other researchers that citation searching complements searching by terms (McCain, 1989; Pao, 1986; Salton, 1971). A logical follow-up question is whether these results could be of practical use to the online searcher. Can the searcher expect similar results when searches are done on commercially available databases? For example, what is the percentage of MEDLINE® search topics which could benefit from a citation search?
One also wonders how generalizable and how stable the findings are if a sample of real searches was collected from libraries where a wide variety of topics was searched. While only a few overlap items were found, these common documents tended to be highly relevant to the search topic. What are the odds that retrieved items derived from both types of search methods are relevant? Is a higher yield from a citation search related to specific types of topics? Obviously, knowledge of this type could be useful to the online searcher. This second study is framed around the following aims: 1. to seek confirmation of the earlier findings by testing the two retrieval modes with real searches processed on commercially available databases and by evaluating search results by requestors who posed the queries; 2. to determine the magnitude of incremental contribution of citation retrieval to MEDLINE searching;

Journal Article
TL;DR: A model for automated information retrieval in which questions posed by clinical users are analyzed to establish common syntactic and semantic patterns that are used to develop a set of general-purpose questions called generic queries is described.
Abstract: This paper describes a model for automated information retrieval in which questions posed by clinical users are analyzed to establish common syntactic and semantic patterns. The patterns are used to develop a set of general-purpose questions called generic queries. These generic queries are used in responding to specific clinical information needs. Users select generic queries in one of two ways. The user may type in questions, which are then analyzed, using natural language processing techniques, to identify the most relevant generic query; or the user may indicate patient data of interest and then pick one of several potentially relevant questions. Once the query and medical concepts have been determined, an information source is selected automatically, a retrieval strategy is composed and executed, and the results are sorted and filtered for presentation to the user. This work makes extensive use of the National Library of Medicine's Unified Medical Language System (UMLS): medical concepts are derived from the Metathesaurus, medical queries are based on semantic relations drawn from the UMLS Semantic Network, and automated source selection makes use of the Information Sources Map. The paper describes research currently under way to implement this model and reports on experience and results to date.

Journal ArticleDOI
TL;DR: An expert system for online search assistance automatically reformulates queries to improve the search results, and ranks the retrieved passages to speed the identification of relevant information.
Abstract: Unfamiliarity with search tactics creates difficulties for many users of online retrieval systems. User observations indicate that even experienced searchers use vocabulary incorrectly and rarely reformulate their queries. To address these problems, an expert system for online search assistance was developed. This prototype automatically reformulates queries to improve the search results, and ranks the retrieved passages to speed the identification of relevant information. Users' search performance using the expert system was compared with their search performance on their own, and their search performance using an online thesaurus. The following conclusions were reached: (1) The expert system significantly reduced the number of queries necessary to find relevant passages compared with the user searching alone or with the thesaurus. (2) The expert system produced marginally significant improvements in precision compared with the user searching on their own. There was no significant difference in the recall achieved by the three system configurations. (3) Overall, the expert system ranked relevant passages above irrelevant passages.

Journal ArticleDOI
01 Jan 1993
TL;DR: A fuzzy-set-based scheme is developed for constructing efficient problem-solving systems for two kinds of problems, object-querying and class-querying, as exemplified by information retrieval systems and expert systems, respectively.
Abstract: The problem-solving strategy applied in knowledge-based systems may often be characterized as classification. Central to classification is computation of the degree to which an object is an instance of a given class (concept, category). Two kinds of problems, namely object-querying and class-querying, as exemplified by, respectively, information retrieval systems and expert systems, are distinguished. In the first kind, the problem is to identify the objects (e.g. documents) to which a given concept (the query) applies. In the second kind, the problem is to identify the concepts (categories) that apply to a given object (the observation). A fuzzy-set-based scheme for construction of efficient problem solving systems of the two kinds is developed. The problem of vocabulary mismatch in information retrieval is considered, and the scheme is proposed as a solution to this problem. The knowledge base applies a term-centered representation form called a fuzzy relational thesaurus. To avoid recomputation of deductive information in problem-solving tasks, the deductive closure of the knowledge base is derived at the outset. This closure is computed in O(n^3) time.
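
The abstract above states only that the deductive closure of the fuzzy relational thesaurus is computed in cubic time. One common way to obtain a closure of a fuzzy relation in that time is a Warshall-style max-min transitive closure, sketched below. Treating the closure as max-min is an assumption, and the three-term relation is invented for illustration.

```python
def max_min_closure(relation):
    """Max-min transitive closure of a square fuzzy relation (three nested loops)."""
    n = len(relation)
    closure = [row[:] for row in relation]  # leave the input matrix untouched
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Strength of the chain i -> k -> j is its weakest link.
                via_k = min(closure[i][k], closure[k][j])
                if via_k > closure[i][j]:
                    closure[i][j] = via_k
    return closure

# Invented thesaurus: index 0 = "vehicle", 1 = "car", 2 = "taxi".
# relation[i][j] is the degree to which term i implies term j.
relation = [
    [1.0, 0.0, 0.0],
    [0.8, 1.0, 0.0],  # car -> vehicle with degree 0.8
    [0.0, 0.9, 1.0],  # taxi -> car with degree 0.9
]
closure = max_min_closure(relation)
print(closure[2][0])  # derived degree for taxi -> vehicle
```

Precomputing the closure once means a query on "vehicle" can reach documents indexed under "taxi" with a single lookup, instead of chaining thesaurus links at query time.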

Proceedings ArticleDOI
01 Jul 1993
TL;DR: It is discovered that the knowledge about relevance among queries and documents can be used to obtain empirical connections between query terms and the canonical concepts which are used for indexing the content of documents.
Abstract: This paper describes a unique example-based mapping method for document retrieval. We discovered that the knowledge about relevance among queries and documents can be used to obtain empirical connections between query terms and the canonical concepts which are used for indexing the content of documents. These connections do not depend on whether there are shared terms among the queries and documents; therefore, they are especially effective for a mapping from queries to the documents where the concepts are relevant but the terms used by article authors happen to be different from the terms of database users. We employ a Linear Least Squares Fit (LLSF) technique to compute such connections from a collection of queries and documents where the relevance is assigned by humans, and then use these connections in the retrieval of documents where the relevance is unknown. We tested this method on both retrieval and indexing with a set of MEDLINE documents which has been used by other information retrieval systems for evaluations. The effectiveness of the LLSF mapping and the significant improvement over alternative approaches was evident in the tests.

Patent
24 Aug 1993
TL;DR: In this paper, a document storage and retrieval system is provided with means for storing a document body in the form of image, and storing text information in a character code string for retrieval, means for executing a retrieval with reference to the text information, and means for displaying a document image relating thereto on a retrieval terminal according to the retrieval result.
Abstract: A document storage and retrieval system is provided with means for storing a document body in the form of image, means for storing text information in the form of a character code string for retrieval, means for executing a retrieval with reference to the text information, and means for displaying a document image relating thereto on a retrieval terminal according to the retrieval result. Such a form of the system is available for retrieving the full contents of a document and also for displaying the document body printed in a format easy to read straight in the form of image. Accordingly, users are capable of retrieving documents with arbitrary words and also capable of reading even such a document as is complicated to include mathematical expressions and charts through a terminal in the form of image, the same as on paper. Further, the invention provides a system wherein the text information for retrieval is extracted automatically from the document image through character recognition. Since a precision of the character recognition has not been satisfactory hitherto, a visual retrieval and correction have been carried out without fail by operators. However, there is no necessity for the operators to attend therefor according to the invention. Thus, the text information for retrieval can be generated at the cost of practical time and money even in case of volumes of documents.


Proceedings Article
01 Jan 1993
TL;DR: Results indicate that progressive combination of queries leads to progressively improving retrieval performance, significantly better than that of single queries, and at least as good as the best individual query formulations.
Abstract: This study investigates the effect on retrieval performance of two methods of combination of multiple representations of TREC topics. Five separate Boolean queries for each of the 50 TREC routing topics and 25 of the TREC ad hoc topics were generated by 75 experienced online searchers. Using the INQUERY retrieval system, these queries were both combined into single queries, and used to produce five separate retrieval results. In the former case, results indicate that progressive combination of queries leads to progressively improving retrieval performance, significantly better than that of single queries, and at least as good as the best individual query formulations. In the latter case, data fusion of the ranked lists also led to performance better than that of any single list.
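
The second combination method in the abstract above, data fusion of ranked lists, can be illustrated with a minimal sketch. The abstract does not say which fusion rule was used; summing min-max normalized scores (often called CombSUM) is an assumption chosen here for illustration, and the document ids and scores are invented.

```python
def comb_sum(ranked_lists):
    """Fuse result lists by summing each document's min-max normalized scores."""
    fused = {}
    for results in ranked_lists:
        if not results:
            continue
        scores = [score for _, score in results]
        low, high = min(scores), max(scores)
        span = (high - low) or 1.0  # avoid dividing by zero when all scores tie
        for doc_id, score in results:
            fused[doc_id] = fused.get(doc_id, 0.0) + (score - low) / span
    # Highest fused score first; ties are broken by document id.
    return sorted(fused, key=lambda d: (-fused[d], d))

# Invented results from two query formulations of the same topic.
list_a = [("d1", 3.0), ("d2", 2.0), ("d3", 1.0)]
list_b = [("d2", 0.9), ("d4", 0.5), ("d1", 0.1)]
fused_ranking = comb_sum([list_a, list_b])
print(fused_ranking)
```

Normalizing before summing keeps one list with a larger score range from dominating, and documents retrieved by several formulations accumulate evidence, which is the intuition behind fusing the lists.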

Proceedings Article
20 Jan 1993
TL;DR: A new probabilistic model of the database and queries is presented that leads to a set of design trade-offs over a wide range of hardware configurations and new parallel query processing strategies.
Abstract: The performance of distributed text document retrieval systems is strongly influenced by the organization of the inverted index. This paper compares the performance impact on query processing of various physical organizations for inverted lists. We present a new probabilistic model of the database and queries. Simulation experiments determine which variables most strongly influence response time and throughput. This leads to a set of design trade-offs over a wide range of hardware configurations and new parallel query processing strategies.

Journal ArticleDOI
01 Jul 1993
TL;DR: A new probabilistic model of the database and queries is presented that leads to a set of design trade-offs over a wide range of hardware configurations and new parallel query processing strategies.
Abstract: The performance of distributed text document retrieval systems is strongly influenced by the organization of the inverted text. This article compares the performance impact on query processing of various physical organizations for inverted lists. We present a new probabilistic model of the database and queries. Simulation experiments determine those variables that most strongly influence response time and throughput. This leads to a set of design trade-offs over a wide range of hardware configurations and new parallel query processing strategies.

Proceedings Article
01 Jan 1993
TL;DR: An experimental text filtering system that uses N-gram-based matching for document retrieval and routing tasks, pointing the way for several types of enhancements, both for speed and effectiveness.
Abstract: Most text retrieval and filtering systems depend heavily on the accuracy of the text they process. In other words, the various mechanisms that they use depend on every word in the queries being correctly and completely spelled. To get around this limitation, our experimental text filtering system uses N-gram-based matching for document retrieval and routing tasks. The system's first application was for the TREC-2 retrieval and routing task. Its performance on this task was promising, pointing the way for several types of enhancements, both for speed and effectiveness.
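
To show why N-gram matching tolerates misspellings, as the abstract above describes, here is a minimal sketch. The abstract does not give the N-gram length or the scoring rule; character trigrams and the fraction of query trigrams found in the document are assumptions made for illustration.

```python
def ngrams(text, n=3):
    """Set of character n-grams of the lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(query, document, n=3):
    """Fraction of the query's n-grams that also occur in the document."""
    q, d = ngrams(query, n), ngrams(document, n)
    if not q:
        return 0.0
    return len(q & d) / len(q)

misspelled_query = "retreival"  # deliberate misspelling of "retrieval"
doc_a = "document retrieval systems"
doc_b = "parallel query processing"
print(ngram_similarity(misspelled_query, doc_a))
print(ngram_similarity(misspelled_query, doc_b))
```

An exact word match would miss `doc_a` entirely because "retreival" never appears in it, while most of the query's trigrams survive the transposed letters and still overlap with the document.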

Journal ArticleDOI
01 Jun 1993
TL;DR: A comprehensive model of information retrieval from hypertext is proposed that combines browsing and querying in a unified framework, and a prototype system is developed to evaluate the approach in terms of retrieval effectiveness and search efficiency.
Abstract: This paper approaches the problem of information retrieval from hypertext. In this context, the retrieval process is regarded as a process of inference that can be carried out either by the user exploring the hypertext network, browsing, or by having the system exploit the hypertext network as a knowledge base, searching. In the following, a comprehensive model is proposed taking into account both of the perspectives, and combining effectively browsing and querying into a unified framework. Hypertext nodes are regarded as facts, links as rules, and the connected hypertext structure as an inference network that can be used to prove the query inferentially. Next, design and implementation issues are discussed concerning a prototype system developed to evaluate the proposed approach in terms of retrieval effectiveness and search efficiency.

Journal ArticleDOI
01 Jun 1993
TL;DR: Two different approaches to the concept-based indexing of hypermedia information are examined which were developed during the Active Library™ on Corrosion project: term indexing with three-dimensional index navigation and semantic hyperindexing with broad-button link navigation.
Abstract: The key to unlocking the information retrieval potential of hypertext and hypermedia systems lies in a more semantics-aware indexing of the information in the hypernetwork, and in the effective visualization and navigation of this hypermedia index structure. We briefly highlight the information retrieval issues specific to hypertext and hypermedia systems, and discuss our concept-based information retrieval model for hypertext. We then examine in detail two different approaches to the concept-based indexing of hypermedia information which we developed during the Active Library™ on Corrosion project: term indexing with three-dimensional index navigation and semantic hyperindexing with broad-button link navigation. Finally, we discuss how a concept-based indexing technique could represent a significant step towards a more intelligent retrieval of hypermedia information.