
Showing papers on "Document retrieval published in 1995"


Journal ArticleDOI
TL;DR: Most current approaches to retrieving textual materials from scientific databases depend on a lexical match between words in users' requests and those in or assigned to documents in the database.
Abstract: Currently, most approaches to retrieving textual materials from scientific databases depend on a lexical match between words in users’ requests and those in or assigned to documents in a database. ...

1,630 citations


Journal ArticleDOI
10 Feb 1995-Science
TL;DR: A language-independent means of gauging topical similarity in unrestricted text by combining information derived from n-grams with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents.
Abstract: A language-independent means of gauging topical similarity in unrestricted text is described. The method combines information derived from n-grams (consecutive sequences of n characters) with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents. No prior information about document content or language is required. Context, as it applies to document similarity, can be accommodated by a well-defined procedure. When an existing document is used as an exemplar, the completeness and accuracy with which topically related documents are retrieved is comparable to that of the best existing systems. The results of a formal evaluation are discussed, and examples are given using documents in English and Japanese.
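
The core of the method is straightforward to prototype: represent each document by the relative frequencies of its overlapping character n-grams and compare the resulting sparse vectors. The sketch below is a minimal illustration of that idea, not the paper's exact procedure (which, for instance, also centres the vectors by subtracting mean n-gram frequencies); names and parameter values are illustrative.

```python
from collections import Counter
from math import sqrt

def ngram_profile(text, n=5):
    """Relative frequencies of overlapping character n-grams (whitespace collapsed)."""
    text = " ".join(text.lower().split())
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def cosine(p, q):
    """Cosine similarity between two sparse n-gram profiles."""
    dot = sum(w * q.get(g, 0.0) for g, w in p.items())
    norm = sqrt(sum(w * w for w in p.values())) * sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0

doc_a = "Document retrieval in large multilingual collections of text."
doc_b = "Retrieving documents from a large multilingual text collection."
print(cosine(ngram_profile(doc_a), ngram_profile(doc_b)))
```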

630 citations


Patent
17 May 1995
TL;DR: In this article, Butler et al. describe a document search method that uses a plurality of databases available from one or more servers and one or more search engines, ranking documents with a local computation from uniform data so that all documents are ranked consistently as if coming from a single database.
Abstract: A document search method using a plurality of databases available from one or more servers using one or more search engines. For each database, the number of records is determined and reported, as well as the frequency of search query term occurrences, or hits, together with identification of the database records corresponding to the hits. Reports from a plurality of databases are furnished to a user terminal, a client, where client software computes a relevance score for each record based upon the number of records in the database, the number of records having at least one hit, and the number of hits for each record. This local computation from uniform data allows all documents to be ranked consistently as if coming from a single database.
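
The abstract names the statistics the score is computed from but not the scoring formula itself. The sketch below is a hypothetical stand-in showing how a client could rank records from several databases on a single scale using only those reported numbers; the TF-IDF-style weighting and all values are assumptions, not the patent's formula.

```python
from math import log

def score_record(hits_in_record, records_with_hits, total_records):
    """
    Hypothetical client-side relevance score built only from the per-database
    statistics the servers report: hit counts per record, the number of records
    with at least one hit, and the database size.
    """
    if hits_in_record == 0 or records_with_hits == 0:
        return 0.0
    idf = log(1 + total_records / records_with_hits)
    return hits_in_record * idf

# Records from different databases can be merged into one consistent ranking
# because every score is computed from the same kind of uniform data.
results = [
    ("db1/rec7", score_record(hits_in_record=3, records_with_hits=40, total_records=10_000)),
    ("db2/rec2", score_record(hits_in_record=5, records_with_hits=900, total_records=50_000)),
]
print(sorted(results, key=lambda r: r[1], reverse=True))
```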

342 citations


Journal ArticleDOI
01 May 1995
TL;DR: The second Text Retrieval Conference (TREC-2) was held in August 1993 and was attended by about 150 people involved in 31 participating groups as discussed by the authors, with a wide variety of retrieval techniques reported on, including methods using automatic thesauri, sophisticated term weighting, natural language techniques, relevance feedback, and advanced pattern matching.
Abstract: The second Text Retrieval Conference (TREC-2) was held in August, 1993, and was attended by about 150 people involved in 31 participating groups. The goal of the conference was to bring research groups together to discuss their work on a new large test collection. A wide variety of retrieval techniques were reported on, including methods using automatic thesauri, sophisticated term weighting, natural language techniques, relevance feedback, and advanced pattern matching. As results had been run through a common evaluation package, groups were able to compare the effectiveness of different techniques, and discuss how differences between the systems affected performance.

318 citations


Patent
Gregory J. Wolff
13 Jan 1995
TL;DR: A document retrieval and accessing system in which documents are provided with links to other documents is described in this paper, where the selection of one or more of the links causes the corresponding documents to be retrieved and sent to the requesting party.
Abstract: A document retrieval and accessing system in which documents are provided with links to other documents. Selection of one or more of the links causes the corresponding documents to be retrieved and sent to the requesting party. The retrieved documents may also include links to yet more documents.

317 citations


Journal ArticleDOI
01 May 1995
TL;DR: Both projects found that the best method of combination often led to results that were better than the best performing single query, and the combined results from the two projects have also been combined by data fusion.
Abstract: We report on two studies in the TREC-2 program that investigated the effect on retrieval performance of combination of multiple representations of TREC topics. In one of the projects, five separate Boolean queries for each of the 50 TREC routing topics and 25 of the TREC ad hoc topics were generated by 75 experienced online searchers. Using the INQUERY retrieval system, these queries were both combined into single queries, and used to produce five separate retrieval results for each topic. In the former case, progressive combination of queries led to progressively improving retrieval performance, significantly better than that of single queries, and at least as good as the best individual single-query formulations. In the latter case, data fusion of the ranked lists also led to performance better than that of any single list. In the second project, two automatically produced vector queries and three versions of a manually produced P-norm extended Boolean query for each routing and ad hoc topic were compared and combined. This project investigated six different methods of combination of queries, and the combination of the same queries on different databases. As in the first project, progressive combination led to progressively improving results, with the best results, on average, being achieved by combination through summing of retrieval status values. Both projects found that the best method of combination often led to results that were better than the best performing single query. The combined results from the two projects have also been combined by data fusion. The results of this procedure show that combining evidence from completely different systems also leads to performance improvement.
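
As a concrete illustration of the fusion step the abstract reports as working best on average (summing retrieval status values across query formulations), the sketch below merges several retrieval runs by adding each document's scores and re-ranking. It assumes the runs' scores are on comparable scales; normalization, and the progressive query combination also studied, are omitted.

```python
from collections import defaultdict

def combine_by_sum(runs):
    """
    Fuse several retrieval runs by summing retrieval status values (scores)
    per document. Each run maps document id -> score for one query formulation.
    """
    fused = defaultdict(float)
    for run in runs:
        for doc_id, score in run.items():
            fused[doc_id] += score
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

run_a = {"d1": 0.9, "d2": 0.4, "d5": 0.2}
run_b = {"d2": 0.8, "d3": 0.6, "d1": 0.1}
print(combine_by_sum([run_a, run_b]))   # d2 (1.2) and d1 (1.0) head the fused list
```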

296 citations


Journal ArticleDOI
TL;DR: This article presents three popular methods: the connectionist Hopfield network, the symbolic ID3/ID5R, and evolution-based genetic algorithms; all three are promising in their ability to analyze user queries, identify users' information needs, and suggest alternatives for search.
Abstract: Information retrieval using probabilistic techniques has attracted significant attention on the part of researchers in information and computer science over the past few decades. In the 1980s, knowledge-based techniques also made an impressive contribution to “intelligent” information retrieval and indexing. More recently, information science researchers have turned to other newer artificial-intelligence-based inductive learning techniques including neural networks, symbolic learning, and genetic algorithms. These newer techniques, which are grounded on diverse paradigms, have provided great opportunities for researchers to enhance the information processing and retrieval capabilities of current information storage and retrieval systems. In this article, we first provide an overview of these newer techniques and their use in information science research. To familiarize readers with these techniques, we present three popular methods: the connectionist Hopfield network; the symbolic ID3/ID5R; and evolution-based genetic algorithms. We discuss their knowledge representations and algorithms in the context of information retrieval. Sample implementation and testing results from our own research are also provided for each technique. We believe these techniques are promising in their ability to analyze user queries, identify users' information needs, and suggest alternatives for search. With proper user-system interactions, these methods can greatly complement the prevailing full-text, keyword-based, probabilistic, and knowledge-based techniques. © 1995 John Wiley & Sons, Inc.

283 citations


Proceedings Article
11 Sep 1995
TL;DR: gGlOSS, a generalized Glossary-Of-Servers Server, keeps statistics on the available databases to estimate which databases are potentially the most useful for a given query.
Abstract: As large numbers of text databases have become available on the Internet, it is getting harder to locate the right sources for given queries. In this paper we present gGlOSS, a generalized Glossary-Of-Servers Server, that keeps statistics on the available databases to estimate which databases are the potentially most useful for a given query. gGlOSS extends our previous work, which focused on databases using the boolean model of document retrieval, to cover databases using the more sophisticated vector-space retrieval model. We evaluate our new techniques using real-user queries and 53 databases. Finally, we further generalize our approach by showing how to build a hierarchy of gGlOSS brokers. The top level of the hierarchy is so small it could be widely replicated, even at end-user workstations.
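
As a rough illustration of how a broker can rank databases without seeing their documents, the sketch below scores each database for a query using only compact per-term statistics (document frequency and total term weight). The additive estimator and all names and numbers are simplifications of the gGlOSS estimators, assumed here only for illustration.

```python
def estimate_usefulness(query_terms, db_stats):
    """
    Estimate how useful a database is for a query from per-term summary
    statistics: db_stats maps term -> (document frequency, total term weight).
    """
    score = 0.0
    for term in query_terms:
        df, total_weight = db_stats.get(term, (0, 0.0))
        if df:
            score += total_weight   # expected total contribution of this term
    return score

databases = {
    "newswire": {"retrieval": (1200, 310.5), "fuzzy": (15, 3.2)},
    "patents":  {"retrieval": (400, 95.0),  "fuzzy": (220, 60.1)},
}
query = ["fuzzy", "retrieval"]
ranked = sorted(databases, key=lambda db: estimate_usefulness(query, databases[db]), reverse=True)
print(ranked)   # databases ordered by estimated usefulness for the query
```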

275 citations


Patent
15 Sep 1995
TL;DR: In this article, a method and apparatus for identifying textual documents and multi-media files corresponding to a search topic is presented, where a single search query corresponding to the search topic was received.
Abstract: A method and apparatus for identifying textual documents and multi-media files corresponding to a search topic. A plurality of document records, each of which is representative of at least one textual document, are stored, and a plurality of multi-media records, each of which is representative of at least one multi-media file, are also stored. The document records have text information fields associated therewith, each of the text information fields representing text from one of the plurality of textual documents. The multi-media records have multi-media information fields for representing only digital video or audio information and associated text fields, each of the associated text fields representing text associated with one of the multi-media information fields. A single search query corresponding to the search topic is received. The single search query is preferably in a natural language format. An index database is searched in accordance with the single search query to simultaneously identify document records and multi-media records related to the single search query. The index database has a plurality of search terms corresponding to terms represented by the text information fields and the associated text fields. The index database also includes a table for associating each of the document and multi-media records with one or more of the search terms. A search result list having entries representative of both textual documents and multi-media files related to the single search query is generated in accordance with the document records and the multi-media records identified by the index database search. Text corresponding to the search topic is retrieved by selecting entries from the search result list representing document records to be retrieved, and then retrieving text represented by the text information fields associated with the selected document records. Digital video or audio information corresponding to the search topic is retrieved by selecting entries from the search result list representing selected multi-media records to be retrieved, and then retrieving digital video or audio information represented by multi-media information fields associated with the selected multi-media records.

256 citations


Journal ArticleDOI
TL;DR: A modified technique is presented that attempts to match the likelihood of retrieving a document of a certain length to the likelihood of documents of that length being judged relevant, and it is shown that this technique yields significant improvements in retrieval effectiveness.
Abstract: In the TREC collection (a large full-text experimental text collection with widely varying document lengths) we observe that the likelihood of a document being judged relevant by a user increases with the document length. We show that a retrieval strategy, such as the vector-space cosine match, that retrieves documents of different lengths with roughly equal probability, will not optimally retrieve useful documents from such a collection. We present a modified technique that attempts to match the likelihood of retrieving a document of a certain length to the likelihood of documents of that length being judged relevant, and show that this technique yields significant improvements in retrieval effectiveness.
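
The correction can be sketched as follows: rather than dividing a document's raw similarity by its own normalization factor, divide by a factor "pivoted" around a typical value, so documents longer than the pivot are penalized less and shorter ones slightly more. This is a minimal rendering of that idea; the slope value and the choice of pivot (e.g., the average normalization factor) are illustrative tuning decisions, not prescriptions from the abstract.

```python
def pivoted_normalization(doc_norm, pivot, slope=0.75):
    """Tilt the normalization factor around a pivot (e.g. the collection average)."""
    return (1.0 - slope) * pivot + slope * doc_norm

def normalized_score(raw_similarity, doc_norm, pivot, slope=0.75):
    """Score with pivoted rather than plain length normalization."""
    return raw_similarity / pivoted_normalization(doc_norm, pivot, slope)

# A long document (norm 40) is divided by 35 instead of 40, so it is penalized
# less than under plain normalization; a short one (norm 5) slightly more.
print(normalized_score(12.0, doc_norm=40.0, pivot=20.0))
print(normalized_score(12.0, doc_norm=5.0, pivot=20.0))
```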

215 citations


Journal ArticleDOI
01 May 1995
TL;DR: Improvements to INQUERY, a probabilistic information retrieval system based upon a Bayesian inference network model, are described, including transforming forms-based specifications of information needs into complex structured queries, automatic query expansion, automatic recognition of features in documents, relevance feedback, and simulated document routing.
Abstract: INQUERY is a probabilistic information retrieval system based upon a Bayesian inference network model. This paper describes recent improvements to the system as a result of participation in the TIPSTER project and the TREC-2 conference. Improvements include transforming forms-based specifications of information needs into complex structured queries, automatic query expansion, automatic recognition of features in documents, relevance feedback, and simulated document routing. Experiments with one- and two-gigabyte document collections are also described.

Proceedings ArticleDOI
01 Jan 1995
TL;DR: Quantitative experiments demonstrate that Information Retrieval methods developed for searching text archives can accurately retrieve multimedia data, given suitable subtitle transcriptions, and can be used to rapidly locate interesting areas within an individual news broadcast.
Abstract: Recent years have seen a rapid increase in the availability and use of multimedia applications. These systems can generate large amounts of audio and video data which can be expensive to store and unwieldy to access. The Video Mail Retrieval (VMR) project at Cambridge University and Olivetti Research Limited (ORL), Cambridge, UK, is addressing these problems by developing systems to retrieve stored video material using the spoken audio soundtrack [1, 16]. Specifically, the project focuses on the content-based location, retrieval, and playback of potentially relevant data. The primary goal of the VMR project is to develop a video mail retrieval application for the Medusa multimedia environment developed at ORL. Previous work on the VMR project demonstrated practical retrieval of audio messages using speech recognition for content identification [8, 4]. Because of the limited number of available audio messages, a much larger archive of television news broadcasts (along with accompanying subtitle transcriptions) is currently being collected. This will serve as a testbed for new methods of storing and accessing large amounts of audio/video data. The enormous potential size of the news broadcast archive dramatically illustrates the need for ways of automatically finding and retrieving information from the archive. Quantitative experiments demonstrate that Information Retrieval (IR) methods developed for searching text archives can accurately retrieve multimedia data, given suitable subtitle transcriptions. In addition, the same techniques can be used to rapidly locate interesting areas within an individual news broadcast. Although large multimedia archives will be more common in the future, today they require a specialised and high-performance hardware infrastructure. The work presented here relies on the Medusa system developed at ORL, which includes distributed, high-capacity multimedia repositories. This paper begins with an overview of the ORL Medusa technology. Subsequent sections describe the collection and storage of a BBC television broadcast news archive, a retrieval methodology for locating potentially relevant sections in response to users' requests, and a graphical user interface for content-based retrieval and browsing of news broadcasts.

Journal ArticleDOI
Yiyu Yao
TL;DR: A new measure of system performance is suggested based on the distance between user ranking and system ranking that only uses the relative order of documents and therefore conforms to the valid use of an ordinal scale measuring relevance.
Abstract: The notion of user preference is adopted for the representation, interpretation, and measurement of the relevance or usefulness of documents. User judgments on documents may be formally described by a weak order (i.e., user ranking) and measured using an ordinal scale. Within this framework, a new measure of system performance is suggested based on the distance between user ranking and system ranking. It only uses the relative order of documents and therefore conforms to the valid use of an ordinal scale measuring relevance. It is also applicable to multilevel relevance judgments and ranked system output. The appropriateness of the proposed measure is demonstrated through an axiomatic approach. The inherent relationships between the new measure and many existing measures provide further supporting evidence.
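
To make the idea of an order-based measure concrete, the sketch below computes a simple distance between a user's weak order and a system ranking by counting document pairs the two orderings place in opposite order, normalized to [0, 1]. It uses only relative order, as the abstract requires, but it is an illustrative instance rather than the paper's exact measure.

```python
from itertools import combinations

def ranking_distance(user_rank, system_rank):
    """
    Fraction of comparable document pairs that the user ranking and the system
    ranking place in opposite order. Rankings map document id -> rank position;
    ties express a weak order.
    """
    discordant, comparable = 0, 0
    for a, b in combinations(list(user_rank), 2):
        u = user_rank[a] - user_rank[b]
        s = system_rank[a] - system_rank[b]
        if u == 0 and s == 0:       # tied in both orderings: not comparable
            continue
        comparable += 1
        if u * s < 0:               # strictly opposite order: discordant pair
            discordant += 1
    return discordant / comparable if comparable else 0.0

user = {"d1": 1, "d2": 2, "d3": 2, "d4": 3}      # weak order: d2 and d3 tied
system = {"d1": 2, "d2": 1, "d3": 3, "d4": 4}
print(ranking_distance(user, system))            # 1 discordant pair out of 6
```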

Patent
IJsbrand Jan Aalbersberg
11 Jan 1995
TL;DR: In this article, a user interface for a full-text document retrieval computerized system comprises a display with a words window in which each query word is displayed by means of a distinctive representation uniquely associated with each displayed word.
Abstract: A user interface for a full-text document retrieval computerized system comprises a display with a words window in which each query word is displayed by means of a distinctive representation uniquely associated with each displayed word. In a subsequent results window, each document header or title or representation is accompanied by an indicator which employs the same distinctive representation to directly indicate to the user the relative contributions of the individual query words to each listed document. In a preferred embodiment, the distinctive representation is integrated with an associated weight first indicator in a words window, and in the results window the distinctive representations are also integrated with an associated weight second indicator. The distinctive representation can take several forms, such as by a different color or by means of hatching or shading or by displayed icons.

Journal ArticleDOI
TL;DR: The authors introduce a classification tree to manage the relationships among different classes of layout structures and propose a method to recognize the layout structures of multiple kinds of table-form document images.
Abstract: Many approaches have reported that knowledge-based layout recognition methods are very successful in classifying the meaningful data from document images automatically. However, these approaches are applicable to only the same kind of documents because they are based on the paradigm that specifies the structure definition information in advance so as to be able to analyze a particular class of documents intelligently. In this paper, the authors propose a method to recognize the layout structures of multiple kinds of table-form document images. For this purpose, the authors introduce a classification tree to manage the relationships among different classes of layout structures. The authors' recognition system has two modes: layout knowledge acquisition and layout structure recognition. In the layout knowledge acquisition mode, table-form document images are distinguished according to this classification tree, and then the structure description trees which specify the logical structures of table-form documents are generated automatically. In the layout structure recognition mode, individual item fields in the table-form document images are extracted and classified successfully by searching the classification tree and interpreting the structure description tree.

Proceedings ArticleDOI
Azer Bestavros
25 Oct 1995
TL;DR: This work proposes a hierarchical demand-based replication strategy that optimally disseminates information from its producer to servers that are closer to its consumers, and shows that by disseminating the most popular documents on servers closer to clients, network traffic could be reduced considerably, while servers are load-balanced.
Abstract: Research on replication techniques to reduce traffic and minimize the latency of information retrieval in a distributed system has concentrated on client-based caching, whereby recently/frequently accessed information is cached at a client (or at a proxy thereof) in anticipation of future accesses. We believe that such myopic solutions, which focus exclusively on a particular client or set of clients, are likely to have a limited impact. Instead, we offer a solution that allows the replication of information to be done on a global supply/demand basis. We propose a hierarchical demand-based replication strategy that optimally disseminates information from its producer to servers that are closer to its consumers. The level of dissemination depends on the relative popularity of documents, and on the expected reduction in traffic that results from such dissemination. We used extensive HTTP logs to validate an analytical model of server popularity and file access profiles. Using that model we show that by disseminating the most popular documents on servers closer to clients, network traffic could be reduced considerably, while servers are load-balanced. We argue that this process could be generalized to provide for an automated server-based information dissemination protocol that will be more effective in reducing both network bandwidth and document retrieval times than client-based caching protocols.
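
As a toy illustration of demand-based dissemination, the sketch below assigns each document a replication depth in a server hierarchy from the share of requests it receives; the thresholds are arbitrary placeholders, whereas the paper derives dissemination levels from expected traffic reduction under its popularity model.

```python
def dissemination_level(request_share, thresholds=(0.01, 0.05, 0.10)):
    """
    Replication depth for a document given the share of requests it receives:
    level 0 keeps it only at the producing server, higher levels push copies
    further down the hierarchy, closer to clients.
    """
    level = 0
    for cutoff in sorted(thresholds):
        if request_share >= cutoff:
            level += 1
    return level

for share in (0.20, 0.06, 0.02, 0.001):
    print(share, "->", dissemination_level(share))   # 3, 2, 1, 0
```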

Journal ArticleDOI
TL;DR: The proposed algorithmic approach presents a viable option for efficiently traversing large‐scale, multiple thesauri (knowledge network) and can be adopted for automatic, multiple‐thesauri consultation.
Abstract: This paper presents a framework for knowledge discovery and concept exploration. In order to enhance the concept exploration capability of knowledge-based systems and to alleviate the limitations of the manual browsing approach, we have developed two spreading activation-based algorithms for concept exploration in large, heterogeneous networks of concepts (e.g., multiple thesauri). One algorithm, which is based on the symbolic AI paradigm, performs a conventional branch-and-bound search on a semantic net representation to identify other highly relevant concepts (a serial, optimal search process). The second algorithm, which is based on the neural network approach, executes the Hopfield net parallel relaxation and convergence process to identify “convergent” concepts for some initial queries (a parallel, heuristic search process). Both algorithms can be adopted for automatic, multiple-thesauri consultation. We tested these two algorithms on a large text-based knowledge network of about 13,000 nodes (terms) and 80,000 directed links in the area of computing technologies. This knowledge network was created from two external thesauri and one automatically generated thesaurus. We conducted experiments to compare the behaviors and performances of the two algorithms with the hypertext-like browsing process. Our experiment revealed that manual browsing achieved higher term recall but lower term precision in comparison to the algorithmic systems. However, it was also a much more laborious and cognitively demanding process. In document retrieval, there were no statistically significant differences in document recall and precision between the algorithms and the manual browsing process. In light of the effort required by the manual browsing process, our proposed algorithmic approach presents a viable option for efficiently traversing large-scale, multiple thesauri (knowledge network). © 1995 John Wiley & Sons, Inc.
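
The sketch below gives a simplified, decay-based version of spreading activation over a small weighted concept network: seed concepts start fully activated, activation propagates along association links for a few iterations, and the most activated non-seed concepts are returned. The paper's Hopfield-net variant additionally uses a sigmoidal transfer function and an explicit convergence test; the weights, decay factor, and toy thesaurus here are illustrative.

```python
def spread_activation(links, seeds, decay=0.5, iters=3, top_k=5):
    """Propagate activation from seed concepts through weighted association links."""
    activation = {c: (1.0 if c in seeds else 0.0) for c in links}
    for _ in range(iters):
        nxt = dict(activation)
        for src, nbrs in links.items():
            for dst, w in nbrs.items():
                # Activation reaching dst is damped by the link weight and a decay factor.
                nxt[dst] = max(nxt.get(dst, 0.0), decay * w * activation[src])
        activation = nxt
    related = [(c, a) for c, a in activation.items() if c not in seeds and a > 0.0]
    return sorted(related, key=lambda ca: ca[1], reverse=True)[:top_k]

thesaurus = {
    "information retrieval": {"document retrieval": 0.9, "indexing": 0.7},
    "document retrieval": {"relevance feedback": 0.8},
    "indexing": {"thesaurus": 0.6},
    "relevance feedback": {},
    "thesaurus": {},
}
print(spread_activation(thesaurus, seeds={"information retrieval"}))
```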

Proceedings Article
01 Apr 1995
TL;DR: A statistical analysis of the TREC-3 data shows that performance differences across queries are greater than performance differences across participants' runs.
Abstract: A statistical analysis of the TREC-3 data shows that performance differences across queries are greater than performance differences across participants' runs. Generally, groups of runs which do not differ significantly are large, sometimes accounting for over half the runs. Correlation among the various performance measures is high.

Patent
24 Mar 1995
TL;DR: In this article, a system for retrieval of documents in a client-server environment is described, which provides compatibility between an attribute-based document display system and diverse query languages within remote document repositories.
Abstract: A system for retrieval of documents in a client-server environment is disclosed. The system provides compatibility between an attribute based document display system and diverse query languages within remote document repositories. The system includes a local process running on a client module, and a remote process running within each document repository. Each remote process is designed for the particular model of computer used for the server. Each remote process executes a System Query Language (SQL) used by a particular database program running on the server. A particular server may have several database programs implemented thereon, and each database program has a dedicated remote process, where the remote process is matched to the particular database program. The local process on the user's workstation launches inquiries in a first format on the network. Each remote process receiving an inquiry translates the received inquiry into the System Query Language required by its server and its database program. When the database program returns a response to the System Query language inquiry, the remote process translates the response into the first format, and returns the response to the local process by transmitting a reply over the network.

Journal ArticleDOI
Howard R. Turtle
TL;DR: An introduction to text retrieval is provided and the main research related to the retrieval of legal materials is surveyed.
Abstract: The ability to find relevant materials in large document collections is a fundamental component of legal research. The emergence of large machine-readable collections of legal materials has stimulated research aimed at improving the quality of the tools used to access these collections. Important research has been conducted within the traditional information retrieval, the artificial intelligence, and the legal communities with varying degrees of interaction between these groups. This article provides an introduction to text retrieval and surveys the main research related to the retrieval of legal materials.

Journal ArticleDOI
TL;DR: This work proposes an approach based on a completely different assumption: ‘a term is a possible world’ which enables the exploitation of term‐term relationships which are estimated using an information theoretic measure.
Abstract: The evaluation of an implication by Imaging is a logical technique developed in the framework of modal logic. Its interpretation in the context of a ‘possible worlds’ semantics is very appealing for IR. In 1989, Van Rijsbergen suggested its use for solving one of the fundamental problems of logical models of IR: the evaluation of the implication d → q (where d and q are respectively a document and a query representation). Since then, others have tried to follow that suggestion, proposing models and applications, though without much success. Most of these approaches had as their basic assumption the consideration that ‘a document is a possible world’. We propose instead an approach based on a completely different assumption: ‘a term is a possible world’. This approach enables the exploitation of term-term relationships which are estimated using an information theoretic measure.
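
Under the "a term is a possible world" reading, the implication can be evaluated by imaging on the document: probability mass held by terms not occurring in d is transferred to the most similar term that does occur in d, and the query is then scored in the revised distribution. A common way of writing this, sketched here under those assumptions rather than as the paper's exact notation:

```latex
% Imaging on d: each term t (a possible world) transfers its prior probability
% P(t) to t_d, the term occurring in d that is most similar to t under the
% term-term relationship; the query is then evaluated in the revised distribution.
P(d \to q) \;=\; \sum_{t} P(t)\, q(t_d),
\qquad
q(t') =
\begin{cases}
  1 & \text{if } t' \text{ occurs in } q,\\
  0 & \text{otherwise.}
\end{cases}
```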

Proceedings ArticleDOI
TL;DR: This research explores the interaction of linguistic and photographic information in an integrated text/image database by utilizing linguistic descriptions of a picture coordinated with pointing references to the picture to extract information useful in two aspects: image interpretation and image retrieval.
Abstract: This research explores the interaction of linguistic and photographic information in an integrated text/image database. By utilizing linguistic descriptions of a picture (speech and text input) coordinated with pointing references to the picture, we extract information useful in two aspects: image interpretation and image retrieval. In the image interpretation phase, objects and regions mentioned in the text are identified; the annotated image is stored in a database for future use. We incorporate techniques from our previous research on photo understanding using accompanying text: a system, PICTION, which identifies human faces in a newspaper photograph based on the caption. In the image retrieval phase, images matching natural language queries are presented to a user in a ranked order. This phase combines the output of (1) the image interpretation/annotation phase, (2) statistical text retrieval methods, and (3) image retrieval methods (e.g., color indexing). The system allows both point and click querying on a given image as well as intelligent querying across the entire text/image database.

Journal ArticleDOI
TL;DR: A series of experiments conducted using a specific implementation of an inference network based probabilistic retrieval model to study the retrieval effectiveness of combining manual and automatic index representations in queries and documents indicates that significant benefits in retrieval effectiveness can be obtained through combined representations.
Abstract: Results from research in information retrieval suggest that significant improvements in retrieval effectiveness could be obtained by combining results from multiple index representations and query strategies. Recently, an inference network based probabilistic retrieval model has been proposed, which views information retrieval as an evidential reasoning process in which multiple sources of evidence about document and query content are combined to estimate the relevance probabilities. In this paper we report a series of experiments we conducted using a specific implementation of this model to study the retrieval effectiveness of combining manual and automatic index representations in queries and documents. The results indicate that significant benefits in retrieval effectiveness can be obtained through combined representations.

Journal ArticleDOI
01 May 1995
TL;DR: A knowledge-based approach for fuzzy information retrieval is proposed, where interval queries and weighted-interval queries are allowed for document retrieval, and knowledge is represented by a concept matrix.
Abstract: A knowledge-based approach for fuzzy information retrieval is proposed, where interval queries and weighted-interval queries are allowed for document retrieval. In this paper, knowledge is represented by a concept matrix, where the elements in a concept matrix represent relevant values between concepts. The implicit relevant values between concepts are inferred by the transitive closure of the concept matrix based on fuzzy logic. The proposed method is more flexible than previous methods because it can deal with interval queries and weighted-interval queries.
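
The inference step the abstract describes, deriving implicit concept-to-concept relevance values, corresponds to taking the fuzzy (max-min) transitive closure of the concept matrix. A minimal sketch of that computation, with an illustrative three-concept matrix:

```python
def fuzzy_transitive_closure(m):
    """
    Max-min transitive closure of a fuzzy concept matrix: m[i][j] in [0, 1] is the
    relevance of concept i to concept j. Repeats max-min composition until no
    entry can be increased further (a fixed point).
    """
    n = len(m)
    closure = [row[:] for row in m]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for j in range(n):
                best = max(min(closure[i][k], closure[k][j]) for k in range(n))
                if best > closure[i][j]:
                    closure[i][j] = best
                    changed = True
    return closure

concepts = ["database", "retrieval", "fuzzy logic"]
relevance = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.6],
    [0.0, 0.6, 1.0],
]
# The implicit database <-> fuzzy logic relevance becomes min(0.8, 0.6) = 0.6.
print(fuzzy_transitive_closure(relevance))
```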

Journal ArticleDOI
TL;DR: Research in the probabilistic theory of information retrieval involves the construction of mathematical models based on statistical assumptions, including the so-called Binary Independence model, which has been seriously misapprehended.
Abstract: Research in the probabilistic theory of information retrieval involves the construction of mathematical models based on statistical assumptions. One of the hazards inherent in this kind of theory construction is that the assumptions laid down may be inconsistent in unanticipated ways with the data to which they are applied. Another hazard is that the stated assumptions may not be those on which the derived modeling equations or resulting experiments are actually based. Both kinds of mistakes have been made in past research on probabilistic information retrieval. One consequence of these errors is that the statistical character of certain probabilistic IR models, including the so-called Binary Independence model, has been seriously misapprehended.

Proceedings ArticleDOI
01 Jul 1995
TL;DR: A new structured query optimization technique, implemented in an inference network-based information retrieval system, is presented; experimental results show that query evaluation time can be reduced by more than half with little impact on retrieval effectiveness.
Abstract: Information retrieval systems are being challenged to manage larger and larger document collections. In an effort to provide better retrieval performance on large collections, more sophisticated retrieval techniques have been developed that support rich, structured queries. Structured queries are not amenable to previously proposed optimization techniques. Optimizing execution, however, is even more important in the context of large document collections. We present a new structured query optimization technique which we have implemented in an inference network-based information retrieval system. Experimental results show that query evaluation time can be reduced by more than half with little impact on retrieval effectiveness.

Journal ArticleDOI
01 May 1995
TL;DR: The TREC programme is reviewed as an evaluation exercise; the methods of indexing and retrieval being tested within it are characterised in terms of the approaches to system performance factors they represent; and the test results are analysed.
Abstract: This paper discusses the Text REtrieval Conferences (TREC) programme as a major enterprise in information retrieval research. It reviews its structure as an evaluation exercise; characterises the methods of indexing and retrieval being tested within it in terms of the approaches to system performance factors these represent; analyses the test results for solid, overall conclusions that can be drawn from them; and, in the light of the particular features of the test data, assesses TREC both for generally applicable findings that emerge from it and for directions it offers for future research.

Journal ArticleDOI
01 May 1995
TL;DR: In this paper, the authors consider division of documents into parts as a solution to the problem of the range of document sizes and show that, for databases with long documents, use of document parts can improve the quality of the information presented to the user.
Abstract: Management and retrieval of large volumes of text can be expensive in both space and time. Moreover, the range of document sizes in a large collection such as TREC presents difficulties for both the retrieval mechanism and the user. We consider division of documents into parts as a solution to the problem of the range of document sizes, and show that, for databases with long documents, use of document parts can improve the quality of the information presented to the user. We also describe the compressed text database system we use to manage the TREC collection; the compressed inverted files with which it is indexed; and the techniques we use to evaluate the TREC queries, both on whole documents and on document parts.
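
A minimal sketch of the kind of document splitting the paper studies: break a long document into parts of bounded size at paragraph boundaries, so each part can be indexed, retrieved, and shown to the user on its own. The word limit and the paragraph-based splitting rule are illustrative choices, not the paper's exact definition of a part.

```python
def split_into_parts(text, max_words=200):
    """Divide a document into parts of roughly bounded size at paragraph boundaries."""
    parts, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = len(para.split())
        if current and count + words > max_words:
            parts.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        parts.append("\n\n".join(current))
    return parts

# Five ~92-word paragraphs are grouped into parts of at most 200 words each.
doc = "\n\n".join(f"Paragraph {i} " + "word " * 90 for i in range(5))
print([len(p.split()) for p in split_into_parts(doc)])
```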

Journal ArticleDOI
TL;DR: This article summarizes the evaluation studies that have been done with SAPHIRE, highlighting the lessons learned and laying out the challenges ahead to all medical information retrieval efforts.
Abstract: Information retrieval systems are being used increasingly in biomedical settings, but many problems still exist in indexing, retrieval, and evaluation. The SAPHIRE Project was undertaken to seek solutions for these problems. This article summarizes the evaluation studies that have been done with SAPHIRE, highlighting the lessons learned and laying out the challenges ahead to all medical information retrieval efforts. © 1995 John Wiley & Sons, Inc.

Proceedings ArticleDOI
01 Jul 1995
TL;DR: An approach called cooperative indexing is developed that provides a framework to achieve both scalability and full integration of IR and RDBMS technology; experimental findings validate the scheme and suggest alternatives to further improve performance.
Abstract: The full integration of information retrieval (IR) features into a database management system (DBMS) has long been recognized as both a significant goal and a challenging undertaking. By full integration we mean: i) support for document storage, indexing, retrieval, and update, ii) transaction semantics, thus all database operations on documents have the ACID properties of atomicity, consistency, isolation, and durability, iii) concurrent addition, update, and retrieval of documents, and iv) database query language extensions to provide ranking for document retrieval operations. It is also necessary for the integrated offering to exhibit scalable performance for document indexing and retrieval processes. To identify the implementation requirements imposed by the desired level of integration, we layered a representative IR application on Oracle Rdb and then conducted a number of database load and document retrieval experiments. The results of these experiments suggest that infrastructural extensions are necessary to obtain both the desired level of IR integration and scalable performance. With the insight gained from our initial experiments, we developed an approach, called cooperative indexing, that provides a framework to achieve both scalability and full integration of IR and RDBMS technology. Prototype implementations of system-level extensions to support cooperative indexing were evaluated with a modified version of Oracle Rdb. Our experimental findings validate the cooperative indexing scheme and suggest alternatives to further improve performance.