Topic
Document retrieval
About: Document retrieval is a research topic. Over its lifetime, 6,821 publications have been published on this topic, receiving 214,383 citations.
Papers
TL;DR: This paper introduces a factorized model for this new task that optimizes the top-ranked items returned for the given query and user and reports empirical results where it outperforms several baselines.
Abstract: Retrieval tasks typically require a ranking of items given a query. Collaborative filtering tasks, on the other hand, learn to model users' preferences over items. In this paper we study the joint problem of recommending items to a user with respect to a given query, which is a surprisingly common task. This setup differs from the standard collaborative filtering one in that we are given a query × user × item tensor for training instead of the more traditional user × item matrix. Compared to document retrieval we do have a query, but we may or may not have content features (we will consider both cases) and we can also take account of the user's profile. We introduce a factorized model for this new task that optimizes the top-ranked items returned for the given query and user. We report empirical results where it outperforms several baselines.
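As a rough illustration of what a factorized model over a query × user × item tensor can look like, the sketch below scores each item by its inner product with a query embedding plus its inner product with a user embedding, then returns the top-ranked items. The scoring form, the function names (score_items, top_k) and the toy embedding values are assumptions made for this example; the abstract does not specify the paper's actual model or training objective.

```python
import numpy as np

def score_items(query_vec, user_vec, item_mat):
    """Score every item for one (query, user) pair.

    Assumed form (not taken from the paper):
        score(q, u, i) = <q, item_i> + <u, item_i>
    which equals item_mat @ (q + u).
    """
    return item_mat @ (query_vec + user_vec)

def top_k(query_vec, user_vec, item_mat, k):
    """Return the indices of the k highest-scoring items, best first."""
    scores = score_items(query_vec, user_vec, item_mat)
    # Negating the scores turns an ascending sort into a descending one;
    # a stable sort keeps tied items in index order.
    order = np.argsort(-scores, kind="stable")
    return order[:k].tolist()

# Toy 2-dimensional embeddings (made-up values for illustration only).
items = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.0])
user = np.array([0.0, 0.5])
print(top_k(query, user, items, k=2))
```

In a learned model the embeddings would be fitted to the training tensor with a ranking loss that emphasizes the top of the list; here they are fixed by hand only to show the scoring and ranking step.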
40 citations
TL;DR: The present report deals with the evaluation of a variety of automatic indexing and retrieval procedures incorporated into the SMART automatic document retrieval system.
Abstract: The generation of effective methods for the evaluation of information retrieval systems and techniques is becoming increasingly important as more and more systems are designed and implemented. The present report deals with the evaluation of a variety of automatic indexing and retrieval procedures incorporated into the SMART automatic document retrieval system. The design of the SMART system is first briefly reviewed. The document file, search requests, and other parameters affecting the evaluation system are then examined in detail, and the measures used to assess the effectiveness of the retrieval performance are described. The main test results are given and tentative conclusions are reached concerning the design of fully automatic information systems.
40 citations
05 Jul 2012
TL;DR: A novel framework is introduced in which evaluation is done in an extrinsic and query-dependent manner, but without depending on relevance judgments; it is expected to help optimize the configuration of ASR systems for transcribing (large) speech collections for use in Spoken Document Retrieval.
Abstract: Spoken Document Retrieval (SDR) is usually implemented by using an Information Retrieval (IR) engine on speech transcripts that are produced by an Automatic Speech Recognition (ASR) system. These transcripts generally contain a substantial amount of transcription errors (noise) and are mostly unstructured. This thesis addresses two challenges that arise when doing IR on this type of source material: i. segmentation of speech transcripts into suitable retrieval units, and ii. evaluation of the impact of transcript noise on the results of an IR task.
It is shown that intrinsic evaluation results in different conclusions with regard to the quality of automatic story boundaries than when (extrinsic) Mean Average Precision (MAP) is used. This indicates that for automatic story segmentation for search applications, the traditionally used (intrinsic) segmentation cost may not be a good performance target. The best performance in an SDR context was achieved using lexical cohesion-based approaches, rather than the statistical approaches that were most popular in story segmentation benchmarks.
For the evaluation of speech transcript noise in an SDR context a novel framework is introduced, in which evaluation is done in an extrinsic, and query-dependent manner but without depending on relevance judgments. This is achieved by making a direct comparison between the ranked results lists of IR tasks on a reference and an ASR-derived transcription. The resulting measures are highly correlated with MAP, making it possible to do extrinsic evaluation of ASR transcripts for ad-hoc collections, while using a similar amount of reference material as the popular intrinsic metric Word Error Rate.
The proposed evaluation methods are expected to be helpful for the task of optimizing the configuration of ASR systems for the transcription of (large) speech collections for use in Spoken Document Retrieval, rather than the more traditional dictation tasks.
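The idea of comparing the ranked result lists from a reference transcript and an ASR-derived transcript, without relevance judgments, can be sketched as follows. The abstract does not name the thesis's actual measures, so this example substitutes a simple stand-in: the overlap of the two top-k lists, averaged over all depths up to a cutoff. The function names and document identifiers are hypothetical.

```python
def overlap_at_k(reference_ranking, asr_ranking, k):
    """Fraction of the top-k documents shared by both rankings."""
    ref_top = set(reference_ranking[:k])
    asr_top = set(asr_ranking[:k])
    return len(ref_top & asr_top) / k

def average_overlap(reference_ranking, asr_ranking, depth):
    """Mean of overlap_at_k for k = 1..depth (a stand-in measure,
    not the one defined in the thesis)."""
    total = sum(
        overlap_at_k(reference_ranking, asr_ranking, k)
        for k in range(1, depth + 1)
    )
    return total / depth

# Results of the same query run against both transcriptions
# (hypothetical document identifiers).
reference_run = ["d1", "d2", "d3", "d4"]
asr_run = ["d2", "d1", "d4", "d5"]
print(average_overlap(reference_run, asr_run, depth=4))
```

A value of 1.0 means the ASR transcripts produced the same ranking prefix sets as the reference at every depth; lower values indicate that transcript noise changed which documents are retrieved near the top.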
40 citations
01 Jan 1996
TL;DR: Experiments for TREC-5 using the Parallel Document Retrieval Engine (PADRE) show that distance-based relevance scoring (spans) is capable of very good precision-recall performance on the ad hoc retrieval task, and that lightweight probe queries are an effective method for identifying promising information servers in a database merging task.
Abstract: A number of experiments conducted within the framework of the TREC-5 conference and using the Parallel Document Retrieval Engine (PADRE) are reported. Several of the experiments involve the use of distance-based relevance scoring (spans). This scoring method is shown to be capable of very good precision-recall performance, provided that good queries are described and evaluated in the context of the ad hoc retrieval task. Span queries are also applied to processing a larger (4.5 gigabytes) collection, to retrieval over OCR-corrupted data and to a database merging task. Lightweight probe queries are shown to be an effective method for identifying promising information servers in the context of the latter task. New techniques for automatically generating more conventional weighted-terms queries from short topic descriptions have also been devised and are evaluated.
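One simple way to picture distance-based (span) scoring is to find the shortest window of text that contains every query term and to score a document higher the shorter that window is. The sketch below does this with a sliding window. It is only an illustration under that assumption: the abstract does not describe PADRE's actual scoring formula, and the names shortest_span and span_score and the 1/length score are hypothetical.

```python
from collections import Counter

def shortest_span(tokens, query_terms):
    """Length in tokens of the shortest window containing every
    query term, or None if some term never occurs."""
    needed = set(query_terms)
    if not needed:
        return None
    counts = Counter()   # occurrences of each needed term in the window
    covered = 0          # number of distinct needed terms in the window
    best = None
    left = 0
    for right, tok in enumerate(tokens):
        if tok in needed:
            counts[tok] += 1
            if counts[tok] == 1:
                covered += 1
        # Shrink the window from the left while it still covers all terms.
        while covered == len(needed):
            length = right - left + 1
            if best is None or length < best:
                best = length
            left_tok = tokens[left]
            if left_tok in needed:
                counts[left_tok] -= 1
                if counts[left_tok] == 0:
                    covered -= 1
            left += 1
    return best

def span_score(tokens, query_terms):
    """Assumed scoring rule: inverse of the shortest span length."""
    span = shortest_span(tokens, query_terms)
    return 0.0 if span is None else 1.0 / span

doc = "parallel document retrieval engine for document retrieval".split()
print(span_score(doc, ["document", "retrieval"]))
```

Under this rule, a document in which the query terms appear side by side outranks one in which they are scattered, which is the intuition behind proximity-based relevance scoring.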
40 citations
TL;DR: A comparative study of the advantages and disadvantages of natural language and controlled vocabulary for automated document retrieval.
Abstract: A comparative study of the advantages and disadvantages of natural language and controlled vocabulary for automated document retrieval. The author also analyzes the relevance of each method in the context of expert systems or full-text databases.
40 citations