
Showing papers on "Document retrieval published in 1992"


Book ChapterDOI
01 Jan 1992
TL;DR: A retrieval system (INQUERY) that is based on a probabilistic retrieval model and provides support for sophisticated indexing and complex query formulation is described.
Abstract: As larger and more heterogeneous text databases become available, information retrieval research will depend on the development of powerful, efficient and flexible retrieval engines. In this paper, we describe a retrieval system (INQUERY) that is based on a probabilistic retrieval model and provides support for sophisticated indexing and complex query formulation. INQUERY has been used successfully with databases containing nearly 400,000 documents.

629 citations


Journal ArticleDOI
TL;DR: The proposed method was designed to disambiguate senses that are usually associated with different topics using a Bayesian argument that has been applied successfully in related tasks such as author identification and information retrieval.
Abstract: Word sense disambiguation has been recognized as a major problem in natural language processing research for over forty years. Both quantitative and qualitative methods have been tried, but much of this work has been stymied by difficulties in acquiring appropriate lexical resources. The availability of suitable testing and training material has enabled us to develop quantitative disambiguation methods that achieve 92% accuracy in discriminating between two very distinct senses of a noun. In the training phase, we collect a number of instances of each sense of the polysemous noun. Then, in the testing phase, we are given a new instance of the noun and are asked to assign it to one of the senses. We attempt to answer this question by comparing the context of the unknown instance with contexts of known instances, using a Bayesian argument that has been applied successfully in related tasks such as author identification and information retrieval. The proposed method is probably most appropriate for those aspects of sense disambiguation that are closest to the information retrieval task. In particular, the proposed method was designed to disambiguate senses that are usually associated with different topics.

614 citations
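The Bayesian argument described above can be sketched as a tiny naive-Bayes sense classifier: pick the sense whose training contexts make the new context most probable. The training data, words, and add-one smoothing below are invented for illustration; the paper's actual corpus and estimation details differ.

```python
import math
from collections import Counter

# Toy training contexts for two senses of "bank" (hypothetical data).
train = {
    "finance": ["deposit money account loan interest",
                "loan credit account money"],
    "river":   ["water shore fishing mud",
                "shore water flood mud grass"],
}

def fit(train):
    """Count context words per sense; return counts, totals, vocabulary."""
    counts, totals, vocab = {}, {}, set()
    for sense, docs in train.items():
        c = Counter(w for d in docs for w in d.split())
        counts[sense], totals[sense] = c, sum(c.values())
        vocab |= set(c)
    return counts, totals, vocab

def classify(context, counts, totals, vocab):
    """Pick the sense maximizing the sum of log P(w|sense), add-one smoothed."""
    V = len(vocab)
    def score(sense):
        return sum(math.log((counts[sense][w] + 1) / (totals[sense] + V))
                   for w in context.split())
    return max(counts, key=score)

counts, totals, vocab = fit(train)
print(classify("money loan interest", counts, totals, vocab))  # finance
```

The same comparison of an unknown context against known contexts underlies the author-identification and retrieval applications the abstract mentions.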


Journal ArticleDOI
TL;DR: In this paper, the authors summarized the theory of psychological relevance proposed by Dan Sperber and Deirdre Wilson (1986) to explicate the relevance of speech utterances to hearers in everyday conversation.
Abstract: This article summarizes the theory of psychological relevance proposed by Dan Sperber and Deirdre Wilson (1986), to explicate the relevance of speech utterances to hearers in everyday conversation. The theory is then interpreted as the concept of relevance in information retrieval, and an extended example is presented. Implications of psychological relevance for research in information retrieval; evaluation of information retrieval systems; and the concepts of information, information need, and the information-seeking process are explored. Connections of the theory to ideas in bibliometrics are also suggested. © 1992 John Wiley & Sons, Inc.

390 citations


Journal ArticleDOI
TL;DR: It is shown how the concept of relevance may be replaced by the condition of being highly rated by a similarity measure, making it possible to identify the stop words in a collection by automated statistical testing.
Abstract: A stop word may be identified as a word that has the same likelihood of occurring in those documents not relevant to a query as in those documents relevant to the query. In this paper we show how the concept of relevance may be replaced by the condition of being highly rated by a similarity measure. Thus it becomes possible to identify the stop words in a collection by automated statistical testing. We describe the nature of the statistical test as it is realized with a vector retrieval methodology based on the cosine coefficient of document-document similarity. As an example, this technique is then applied to a large MEDLINE subset in the area of biotechnology. The initial processing of this database involves a 310-word stop list of common non-content terms. Our technique is then applied and 75% of the remaining terms are identified as stop words. We compare retrieval with and without the removal of these stop words and find that of the top twenty documents retrieved in response to a random query docume...

281 citations
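The idea can be approximated in a few lines: rank documents by cosine similarity to a query, treat the top-ranked set as a stand-in for the relevant set, and test whether a term occurs at the same rate in both groups. The toy corpus and the two-proportion z statistic here are illustrative, not the paper's exact procedure (which uses document-document similarity over a MEDLINE subset).

```python
import math
from collections import Counter

docs = [
    "the gene cloning the vector",
    "the protein expression the assay",
    "the cell culture the medium",
    "the music concert the stage",
]
query = "gene cloning protein"

def tf(text):  # term-frequency vector
    return Counter(text.split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    return num / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))

q = tf(query)
ranked = sorted(docs, key=lambda d: cosine(tf(d), q), reverse=True)
top, rest = ranked[:2], ranked[2:]  # "highly rated" stand-in for "relevant"

def z_stat(term):
    """Two-proportion z statistic for document frequency in top vs rest;
    a value near zero marks a stop-word candidate."""
    p1 = sum(term in d.split() for d in top) / len(top)
    p2 = sum(term in d.split() for d in rest) / len(rest)
    p = (p1 * len(top) + p2 * len(rest)) / len(docs)
    se = math.sqrt(p * (1 - p) * (1 / len(top) + 1 / len(rest))) or 1e-9
    return (p1 - p2) / se

# "the" occurs everywhere, so its rate is identical in both groups.
print(z_stat("the"), z_stat("gene"))
```

A content term like "gene" concentrates in the highly rated group and gets a larger statistic, while "the" scores zero.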


Journal ArticleDOI
TL;DR: The use of stemming on Slovene-language documents and queries is reported, and it is demonstrated that the use of an appropriate stemming algorithm results in a large, and statistically significant, increase in retrieval effectiveness when compared with nonconflated processing.
Abstract: There have been several studies of the use of stemming algorithms for conflating morphological variants in free-text retrieval systems. Comparison of stemmed and nonconflated searches suggests that there are no significant increases in the effectiveness of retrieval when stemming is applied to English-language documents and queries. This article reports the use of stemming on Slovene-language documents and queries, and demonstrates that the use of an appropriate stemming algorithm results in a large, and statistically significant, increase in retrieval effectiveness when compared with nonconflated processing; similar comments apply to the use of manual, right-hand truncation. A comparison is made with stemming of English versions of the same documents and queries and it is concluded that the effectiveness of a stemming algorithm is determined by the morphological complexity of the language that it is designed to process. © 1992 John Wiley & Sons, Inc.

163 citations
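Conflation by stemming, as compared in the study above, can be sketched with a minimal suffix stripper. This is far simpler than a real stemmer (such as Porter's for English or the Slovene algorithm the paper evaluates); the suffix list and minimum-stem length are illustrative.

```python
# Longest suffixes first, so "ation" is tried before "s".
SUFFIXES = ["ation", "ing", "ed", "es", "s"]

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

# Morphological variants conflate to one indexing term.
print(stem("connected"), stem("connecting"), stem("connects"))
```

The paper's finding is that the payoff from this kind of conflation grows with the morphological complexity of the language: modest for English, large for Slovene.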


Proceedings Article
01 Jan 1992
TL;DR: The Smart project at Cornell University, using a completely automatic approach for both routing and ad-hoc experiments, performed extremely well in the first Text Retrieval Conference.
Abstract: The Smart project at Cornell University, using a completely automatic approach for both routing and ad-hoc experiments, performed extremely well in the first Text Retrieval Conference. The basic ad-hoc approach uses local/global matching to achieve its results. A global match ensures that each retrieved document uses the same vocabulary as the query; a local match then attempts to guarantee that some local part of the document (e.g., a paragraph or sentence) focuses on the query. A modified version of this approach is used for the routing experiments.

154 citations
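The local/global idea can be sketched as follows: score a document by blending a whole-document (global) cosine match with the best sentence-level (local) match, so a high score needs both shared vocabulary and a focused passage. The scoring function, weights, and sentence splitting below are illustrative, not Smart's implementation.

```python
import math
from collections import Counter

def tokens(text):
    return Counter(text.lower().replace(".", " ").split())

def cos(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def local_global_score(doc, query, alpha=0.5):
    """Blend a global whole-document match with the best local
    (sentence-level) match."""
    q = tokens(query)
    global_sim = cos(tokens(doc), q)
    local_sim = max(cos(tokens(s), q)
                    for s in doc.split(".") if s.strip())
    return alpha * global_sim + (1 - alpha) * local_sim

doc = ("Retrieval systems rank documents. "
       "The ranking uses vector similarity. "
       "Weather was nice today.")
print(local_global_score(doc, "vector similarity ranking"))
```

A document whose query terms are scattered thinly across many sentences would get a decent global score but a weak local one, and so rank below a document with a focused passage.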


Journal ArticleDOI
Gerard Salton1
TL;DR: An attempt is made to review the state of retrieval evaluation and to separate certain misgivings about the design of retrieval tests from conclusions that can legitimately be drawn from the evaluation results.
Abstract: Substantial misgivings have been voiced over the years about the methodologies used to evaluate information retrieval procedures, and about the credibility of many of the available test results. In this note, an attempt is made to review the state of retrieval evaluation and to separate certain misgivings about the design of retrieval tests from conclusions that can legitimately be drawn from the evaluation results.

151 citations


Proceedings Article
01 Jan 1992
TL;DR: Describes the Latent Semantic Indexing approach, an extension of the vector retrieval method and the use of singular-value decomposition applied to the TREC collection.
Abstract: Describes the Latent Semantic Indexing (LSI) approach, an extension of the vector retrieval method (e.g., Salton & McGill, 1983), and the use of singular-value decomposition (SVD) applied to the TREC collection. The existing LSI/SVD software was used for analyzing the training and test collections, and for query processing and retrieval.

127 citations
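The core of LSI can be sketched in a few lines of NumPy: take a truncated SVD of the term-document matrix, fold the query into the reduced space, and rank documents by cosine similarity there. The toy matrix and k=2 are illustrative; the TREC-scale system uses far larger matrices and specialized sparse SVD code.

```python
import numpy as np

# Toy term-document matrix (rows = terms, cols = docs); hypothetical counts.
terms = ["car", "auto", "engine", "flower", "petal"]
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 1]], dtype=float)

# Truncated SVD: keep k latent dimensions.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T   # Vk rows = docs in LSI space

# Fold a query into the latent space: q_hat = q^T U_k S_k^{-1}
q = np.array([1, 1, 0, 0, 0], dtype=float)  # query: "car auto"
q_hat = q @ Uk @ np.diag(1 / sk)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(q_hat, d) for d in Vk]
print(np.argsort(scores)[::-1])  # docs ranked by latent similarity
```

Note that document 2 shares no term with the query ("engine", "auto", "petal" vs. "car auto") yet still scores above the flower document, because the latent dimensions capture term co-occurrence.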


Proceedings ArticleDOI
IJsbrand Jan Aalbersberg1
01 Jun 1992
TL;DR: This paper focuses on a relevance feedback technique that allows easily understandable and manageable user interfaces, and at the same time provides high-quality retrieval results.
Abstract: Although relevance feedback techniques have been investigated for more than 20 years, hardly any of these techniques has been implemented in a commercial full-text document retrieval system. In addition to pure performance problems, this is due to the fact that the application of relevance feedback techniques increases the complexity of the user interface and thus also the use of a document retrieval system. In this paper we concentrate on a relevance feedback technique that allows easily understandable and manageable user interfaces, and at the same time provides high-quality retrieval results. Moreover, the relevance feedback technique introduced unifies as well as improves other well-known relevance feedback techniques.

124 citations


Journal ArticleDOI
Gerda Ruge1
TL;DR: A description of the hyperterm system REALIST (REtrieval Aids by LInguistics and STatistics) and in more detail a description of its semantic component is given.
Abstract: A description of the hyperterm system REALIST (REtrieval Aids by LInguistics and STatistics) and, in more detail, a description of its semantic component is given. We call a hyperterm system a system that contains different kinds of term relations. The semantic component of REALIST generates semantic term relations such as synonyms. It takes as input a free-text database and generates as output term pairs that are semantically related with respect to their meanings in the database. This is done in two steps. In the first step an automatic syntactic analysis provides linguistic knowledge about the terms of the database. In the second step this knowledge is compared by statistical similarity computation. Various experiments with different similarity measures are described. These experiments are not standard recall and precision examinations, but direct evaluations of the term pairs. Beyond the new linguistic term association method and its good results, another important point of this paper is to show the value of direct term pair evaluation.

123 citations


Journal ArticleDOI
TL;DR: Personal name‐matching techniques may be included in name authority work, information retrieval, or duplicate detection, with some applications matching on name only, and others combining personal names with other data elements in record linkage techniques.
Abstract: The study reported in this article was commissioned by the Getty Art History Information Program (AHIP) as a background investigation of personal name-matching programs in fields other than art history, for purposes of comparing them and their approaches with AHIP's Synoname™ project. We review techniques employed in a variety of applications, including art history, bibliography, genealogy, commerce, and government, providing a framework of personal name characteristics, factors in selecting matching techniques, and types of applications. Personal names, as data elements in information systems, vary for a wide range of legitimate reasons, including cultural and historical traditions, translation and transliteration, reporting and recording variations, as well as typographical and phonetic errors. Some matching applications seek to link variants, while others seek to correct errors. The choice of matching techniques will vary in the amount of domain knowledge about the names that is incorporated, the sources of data, and the human and computing resources required. Personal name-matching techniques may be included in name authority work, information retrieval, or duplicate detection, with some applications matching on name only, and others combining personal names with other data elements in record linkage techniques. We discuss both phonetic- and pattern-matching techniques, reviewing a range of implemented and proposed name-matching techniques in the context of these factors. © 1992 John Wiley & Sons, Inc.
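Soundex, one of the phonetic matching techniques this survey reviews, can be implemented compactly. The sketch below follows the common American Soundex rules (first letter kept, consonants mapped to six digit classes, adjacent duplicates collapsed, h and w not resetting the previous code); variants used in specific applications differ in details.

```python
def soundex(name):
    """Standard American Soundex: first letter + three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    out, prev = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":        # h and w do not reset the previous code
            prev = code
    return (out + "000")[:4]      # pad or truncate to four characters

print(soundex("Robert"), soundex("Rupert"))  # both R163
```

Spelling variants of the same spoken name collapse to one code, which is exactly the behavior wanted for linking name variants (and exactly why Soundex cannot correct a variant that changes the first letter).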

Patent
28 Feb 1992
TL;DR: In this paper, a component character table is created in which characters occurring in each of the condensed texts are registered without duplication, and a text body search is executed for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted through the component characters table search and the condensed text search.
Abstract: High-speed full document retrieval method and system capable of providing retrieval results within a practically acceptable search time. Upon registration of documents in a document database, condensed texts are created by decomposing each of the textual character strings of the documents to be registered into fragmental character strings in dependence on character species and by checking mutual inclusion relations existing among the fragmental character strings. A component character table is created in which characters occurring in each of the condensed texts are registered without duplication. The condensed texts and the component character table are registered in the database together with the texts of the documents to be registered. Upon retrieval of a document containing a search term designated by a user, a component character table search is first executed to extract those documents which contain all species of characters constituting the search term by consulting the component character table, and subsequently a condensed text search is executed by consulting the condensed texts of the documents. Finally, a text body search is executed for extracting a document which satisfies the query condition imposed on the search term by consulting the texts of the documents extracted through the component character table search and the condensed text search.
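The coarse-to-fine idea behind the patent's component character table can be sketched as follows: record each document's set of distinct characters at registration time, use it as a cheap filter (a document lacking any character of the search term cannot contain the term), and only then scan the surviving texts. The data and two-stage structure below are illustrative and omit the patent's intermediate condensed-text stage.

```python
docs = {
    1: "full text search systems",
    2: "character table methods",
    3: "fast retrieval of documents",
}

# Registration: each document's characters, recorded without duplication.
char_table = {doc_id: set(text) for doc_id, text in docs.items()}

def search(term):
    needed = set(term)
    # Stage 1: cheap filter on the component character table.
    candidates = [d for d, chars in char_table.items() if needed <= chars]
    # Stage 2: confirm with a real scan of the text body.
    return [d for d in candidates if term in docs[d]]

print(search("table"))   # only doc 2 contains the substring
```

On a large collection, stage 1 discards most documents with a set comparison instead of a full-text scan, which is where the claimed speedup comes from.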

Journal ArticleDOI
15 Oct 1992
TL;DR: This note is the first of four papers in this issue describing the ongoing work connected with the DARPA TIPSTER Project, and the next papers by three of the contractors involved in the project provide some details on the systems involved, and some of the initial results.
Abstract: This note is the first of four papers in this issue describing the ongoing work connected with the DARPA TIPSTER Project. The note provides an overview of the project, and the next papers by three of the contractors involved in the project provide some details on the systems involved, and some of the initial results. The TIPSTER project is sponsored by the Software and Intelligent Systems Technology Office of the Defense Advanced Research Projects Agency (DARPA/SISTO) in an effort to significantly advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections. The first two-year phase of the program is concerned with the development of algorithms for document retrieval, document routing, and data extraction that are both domain and language independent. A call for proposals was made in June of 1990, and contracts for the six participating groups were let in the fall of 1991. Three meetings have been held so far, with the first results presented in September of 1992. There are two separate, but connected, parts of TIPSTER. The first part of the project, document detection, is concerned with retrieving relevant documents from very large (3 gigabyte) collections of documents, both in a routing environment and in an ad hoc retrieval environment. The routing environment is similar to the document filtering or profile searches currently done in libraries, where a query topic is constant, and the documents are viewed as the incoming stream of publications. The ad hoc part of the project is similar to the standard search done against static collections. The second part of the TIPSTER project is concerned with data extraction. Here it is assumed that there is a much smaller set of documents, presumed to be mostly relevant to a topic, and the goal is to extract information to fill a database.
This database could then be used for many applications, such as question-answering systems, report writing, or data analysis. The data extraction part of TIPSTER is being done by groups using natural language understanding techniques, and this part will not be described in this issue.

BookDOI
Paul S. Jacobs1
01 Jul 1992
TL;DR: This chapter discusses Text Representation for Intelligent Text Retrieval: A Classification-Oriented View, and Intelligent High-Volume Text Processing Using Shallow, Domain-Specific Techniques.
Abstract: Contents: P.S. Jacobs, Introduction: Text Power and Intelligent Systems. Part I:Broad-Scale NLP. J.R. Hobbs, D.E. Appelt, J. Bear, M. Tyson, D. Magerman, Robust Processing of Real-World Natural-Language Texts. Y. Wilks, L. Guthrie, J. Guthrie, J. Cowie, Combining Weak Methods in Large-Scale Text Processing. G. Hirst, M. Ryan, Mixed-Depth Representations for Natural Language Text. D.D. McDonald, Robust Partial-Parsing Through Incremental, Multi-Algorithm Processing. Corpus-Based Thematic Analysis. Part II:"Traditional" Information Retrieval. W.B. Croft, H.R. Turtle, Text Retrieval and Inference. K.S. Jones, Assumptions and Issues in Text-Based Retrieval. D.D. Lewis, Text Representation for Intelligent Text Retrieval: A Classification-Oriented View. G. Salton, C. Buckley, Automatic Text Structuring Experiments. Part III:Emerging Applications. C. Stanfill, D.L. Waltz, Statistical Methods, Artificial Intelligence, and Information Retrieval. P.J. Hayes, Intelligent High-Volume Text Processing Using Shallow, Domain-Specific Techniques. Y.S. Maarek, Automatically Constructing Simple Help Systems from Natural Language Documentation. M.A. Hearst, Direction-Based Text Interpretation as an Information Access Refinement.

Journal ArticleDOI
TL;DR: Research into the automatic selection of Library of Congress Classification numbers based on the titles and subject headings in MARC records indicates that if the best method for a particular case can be determined, then up to 88% of the new records may be correctly classified.
Abstract: This article presents the results of research into the automatic selection of Library of Congress Classification numbers based on the titles and subject headings in MARC records. The method used in this study was based on partial match retrieval techniques using various elements of new records (i.e., those to be classified) as “queries,” and a test database of classification clusters generated from previously classified MARC records. Sixty individual methods for automatic classification were tested on a set of 283 new records, using all combinations of four different partial match methods, five query types, and three representations of search terms. The results indicate that if the best method for a particular case can be determined, then up to 88% of the new records may be correctly classified. The single method with the best accuracy was able to select the correct classification for about 46% of the new records.

Proceedings ArticleDOI
28 Jun 1992
TL;DR: A prototype information retrieval system which uses advanced natural language processing techniques to enhance the effectiveness of traditional key-word based document retrieval and has displayed capabilities that appear to make it superior to the purely statistical base.
Abstract: We developed a prototype information retrieval system which uses advanced natural language processing techniques to enhance the effectiveness of traditional key-word based document retrieval. The backbone of our system is a statistical retrieval engine which performs automated indexing of documents, then search and ranking in response to user queries. This core architecture is augmented with advanced natural language processing tools which are both robust and efficient. In early experiments, the augmented system has displayed capabilities that appear to make it superior to the purely statistical base.

Journal ArticleDOI
TL;DR: Investigation of the degree to which variations in relevance judgments affect the evaluation of retrieval performance found that in no case was there a noticeable or material difference in retrieval performance due to variation in relevance judgment.
Abstract: The relevance judgments used to evaluate the performance of information retrieval systems are known to vary among judges and to vary under certain conditions extraneous to the relevance relationship between queries and documents. The study reported here investigated the degree to which variations in relevance judgments affect the evaluation of retrieval performance. Four sets of relevance judgments were used to test the retrieval effectiveness of six document representations. In no case was there a noticeable or material difference in retrieval performance due to variations in relevance judgment. Additionally, for each set of relevance judgments, the relative performance of the six different document representations was the same. Reasons why variations in relevance judgments may not affect recall and precision results were examined in further detail.

Proceedings ArticleDOI
Xia Lin1
19 Oct 1992
TL;DR: An information retrieval framework that promotes graphical displays, making documents in the computer visualizable to the searcher, is described, and two simulation results of using a Kohonen feature map to generate map displays for information retrieval are presented.
Abstract: Visualization for the document space is an important issue for future information retrieval systems. This article describes an information retrieval framework that promotes graphical displays which will make documents in the computer visualizable to the searcher. As examples of such graphical displays, two simulation results of using Kohonen's feature map to generate map displays for information retrieval are presented and discussed. The map displays are a mapping from a high-dimensional document space to a two-dimensional space. They show document relationships by various visual cues such as dots, links, clusters and areas as well as their measurement and spatial arrangement. Using the map displays as an interface for document retrieval systems, we will provide the user with richer visual information to support browsing and searching.
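A Kohonen feature map of the kind used for these displays can be sketched in a few lines: repeatedly pick a document vector, find its best-matching map cell, and pull that cell and its grid neighbors toward the vector, so similar documents end up on nearby cells. The toy vectors, grid size, and decay schedules below are illustrative, not the article's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy document vectors: two tight clusters (all values illustrative).
docs = np.array([[1.0, 1.0, 0.0, 0.0],
                 [1.0, 0.9, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 1.0],
                 [0.0, 0.0, 0.9, 1.0]])

grid = 3                                       # 3x3 map
w = rng.random((grid * grid, docs.shape[1]))   # one weight vector per cell
xy = np.array([(i, j) for i in range(grid) for j in range(grid)], float)

steps = 200
for t in range(steps):
    lr = 0.5 * (1 - t / steps)                 # decaying learning rate
    sigma = 1.5 * (1 - t / steps) + 0.3        # decaying neighborhood width
    x = docs[t % len(docs)]
    bmu = int(np.argmin(((w - x) ** 2).sum(axis=1)))  # best-matching unit
    d2 = ((xy - xy[bmu]) ** 2).sum(axis=1)
    h = np.exp(-d2 / (2 * sigma ** 2))         # Gaussian grid neighborhood
    w += lr * h[:, None] * (x - w)             # pull cells toward the input

cells = [int(np.argmin(((w - d) ** 2).sum(axis=1))) for d in docs]
print(cells)
```

The resulting cell assignments are the mapping from the high-dimensional document space to the two-dimensional display that the article's map interfaces are built on.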

Journal ArticleDOI
TL;DR: A design and implementation project based on a two-level conceptual architecture for the construction of a hypertext environment for interacting with large textual databases and an outline is presented of the characteristics of a prototype, named HYPERLINE, of thehypertext environment.
Abstract: This paper presents a design and implementation project based on a two-level conceptual architecture for the construction of a hypertext environment for interacting with large textual databases. The conceptual architecture has been proposed to be used for a semantic representation of the informative content of a collection of documents and for the organisation of the document collection itself. The hypertext environment is based on a set of functions that permits one to exploit the potential capabilities of the two-level architecture. Those functions are presented in detail. The paper reports some results of a more general project whose final goal is the definition of a new model for information retrieval: a model with information retrieval capabilities embedded within a hypertext environment. Finally, an outline is presented of the characteristics of a prototype, named HYPERLINE, of the hypertext environment. This prototype has been developed by the Information Retrieval Service of the European Space Agency (ESA-IRS).


Proceedings Article
01 Jan 1992
TL;DR: In this paper, an adaptive method using genetic algorithms to modify user queries, based on relevance judgments, was adapted for the Text Retrieval Conference (TREC) and shown to be applicable to large text collections, where more relevant documents are presented to users in the genetic modification.
Abstract: We have been developing an adaptive method using genetic algorithms to modify user queries, based on relevance judgments. This algorithm was adapted for the Text Retrieval Conference (TREC). The method is shown to be applicable to large text collections, where more relevant documents are presented to users in the genetic modification. The algorithm also shows some interesting phenomena, such as parallel searching. Further studies are planned to adjust the system parameters to improve its effectiveness
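A genetic query-modification loop of the general kind described can be sketched as follows: represent queries as binary term vectors, score them against user-judged relevant documents, and evolve the population with selection, crossover, and mutation. The fitness function, data, and parameters below are invented for illustration and are not the TREC system's.

```python
import random

random.seed(42)

# Toy setup: a query is a binary vector over an 8-term vocabulary
# (hypothetical relevance judgments).
VOCAB = 8
relevant_docs = [
    [1, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 1, 0],
]

def fitness(query):
    """Mean term overlap with the relevant documents, minus a small
    penalty for each term the query turns on."""
    overlap = sum(sum(q & d for q, d in zip(query, doc))
                  for doc in relevant_docs) / len(relevant_docs)
    return overlap - 0.1 * sum(query)

def evolve(generations=30, pop_size=12, p_mut=0.05):
    pop = [[random.randint(0, 1) for _ in range(VOCAB)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]               # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, VOCAB)          # one-point crossover
            child = [bit ^ (random.random() < p_mut)  # bit-flip mutation
                     for bit in a[:cut] + b[cut:]]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(best, fitness(best))
```

Because the whole population is evaluated each generation, many query variants are explored at once, which is the "parallel searching" phenomenon the abstract notes.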

Proceedings ArticleDOI
01 Jun 1992
TL;DR: This paper describes an approach to complex object retrieval using a probabilistic inference net model and an implementation of this approach using a loose coupling of an object-oriented database system (IRIS) and a text retrieval system based on inference nets (INQUERY).
Abstract: Document management systems are needed for many business applications. This type of system would combine the functionality of a database system, (for describing, storing and maintaining documents with complex structure and relationships) with a text retrieval system (for effective retrieval based on full text). The retrieval model for a document management system is complicated by the variety and complexity of the objects that are represented. In this paper, we describe an approach to complex object retrieval using a probabilistic inference net model, and an implementation of this approach using a loose coupling of an object-oriented database system (IRIS) and a text retrieval system based on inference nets (INQUERY). The resulting system is used to store long, structured documents and can retrieve document components (sections, figures, etc.) based on their contents or the contents of related components. The lessons learnt from the implementation are discussed.


Journal Article
TL;DR: Results supported the hypothesis that there is a relationship between the completeness of the end user's mental model and both error behavior and total number of successful searches.

Proceedings Article
01 Jan 1992
TL;DR: This paper investigates whether a completely automatic, statistical expansion technique that uses a general-purpose thesaurus as a source of related concepts is viable for large collections; the results indicate that the particular expansion technique used here improves the performance of some queries, but degrades the performance of other queries.
Abstract: This paper investigates whether a completely automatic, statistical expansion technique that uses a general-purpose thesaurus as a source of related concepts is viable for large collections. The retrieval results indicate that the particular expansion technique used here improves the performance of some queries, but degrades the performance of other queries. The variability of the method is attributable to two main factors: the choice of concepts that are expanded and the confounding effects expansion has on concept weights. Addressing these problems will require both a better method for determining the important concepts of a text and a better method for determining the correct sense of an ambiguous word.

Journal ArticleDOI
TL;DR: An evaluation of three methods for the expansion of natural language queries in ranked-output retrieval systems, based on term co-occurrence data, on Soundex codes, and on a string similarity measure, suggests there is no significant difference in retrieval effectiveness between any of these methods and unexpanded searches.
Abstract: This paper reports an evaluation of three methods for the expansion of natural language queries in ranked-output retrieval systems. The methods are based on term co-occurrence data, on Soundex codes, and on a string similarity measure. Searches for 110 queries in a database of 26,280 titles and abstracts suggest that there is no significant difference in retrieval effectiveness between any of these methods and unexpanded searches.
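Of the three expansion methods, the co-occurrence-based one can be sketched simply: count how often term pairs appear in the same document, then expand a query term with its strongest co-occurring neighbors. The corpus and scoring below are illustrative stand-ins for the paper's 26,280-record database.

```python
from collections import Counter
from itertools import combinations

docs = ["gene cloning vector", "gene expression vector",
        "protein expression assay", "music concert stage"]

# Document-level term co-occurrence counts.
cooc = Counter()
for d in docs:
    for a, b in combinations(sorted(set(d.split())), 2):
        cooc[(a, b)] += 1

def expand(term, k=2):
    """Add the k terms that co-occur most often with `term`."""
    scores = Counter()
    for (a, b), n in cooc.items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    return [term] + [t for t, _ in scores.most_common(k)]

print(expand("gene"))
```

The paper's negative result is worth keeping in mind: mechanically adding such neighbors did not significantly improve ranked-output retrieval over the unexpanded queries.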

Journal ArticleDOI
TL;DR: In this article, the results of retrieval tests using a variety of these search methods in the CHESHIRE experimental online catalog system were described and compared with the results obtained using the traditional Boolean search methods of conventional online catalog systems.
Abstract: Research on the use and users of online catalogs conducted in the early 1980s found that subject searches were the most common form of online catalog search. At the same time, many of the problems experienced by online catalog users have been traced to difficulties with the subject access mechanisms of the online catalog. A stream of research has concentrated on applying retrieval techniques derived from information retrieval (IR) research to replace the Boolean search methods of conventional online catalog systems. This study describes the results of retrieval tests using a variety of these search methods in the CHESHIRE experimental online catalog system.

Journal ArticleDOI
TL;DR: This paper describes a process whereby a morpho-syntactic analysis of phrases or user queries is used to generate a structured representation of text to evaluate the effectiveness or quality of the matching and scoring of phrases.
Abstract: The application of automatic natural language processing techniques to the indexing and the retrieval of text information has been a target of information retrieval researchers for some time. Incorporating semantic-level processing of language into retrieval has led to conceptual information retrieval, which is effective but usually restricted in its domain. Using syntactic-level analysis is domain-independent, but has not yet yielded significant improvements in retrieval quality. This paper describes a process whereby a morpho-syntactic analysis of phrases or user queries is used to generate a structured representation of text. A process of matching these structured representations is then described that generates a metric value or score indicating the degree of match between phrases. This scoring can then be used for ranking the phrases. In order to evaluate the effectiveness or quality of the matching and scoring of phrases, some experiments are described that indicate the method to be quite useful. Ultimately the phrase-matching technique described here would be used as part of an overall document retrieval strategy, and some future work towards this direction is outlined.


Journal Article
TL;DR: This study examines the experiences of novice end users with a multifile, full-text information retrieval system and develops a regression model of the relative contribution of computer, system, and subject area knowledge to search success.