Open Access Dissertation

Optimizing the construction of information retrieval test collections

TLDR
A probabilistic model is developed that provides accurate relevance judgments with a smaller number of labels collected per document, and should assist research institutes and commercial search engines to construct test collections where there are large document collections and large query logs, but where economic constraints prohibit gathering comprehensive relevance judgments.
Abstract
We consider the problem of optimally allocating a limited budget to acquire relevance judgments when constructing an information retrieval test collection. We assume that there is a large set of test queries, for each of which a large number of documents need to be judged, but that the available budget only permits judging a subset of them. We begin by developing a mathematical framework for query selection as a mechanism for reducing the cost of constructing information retrieval test collections. The framework provides valuable insights into the properties of the optimal subset of queries: the selected queries should be minimally correlated with one another, yet strongly correlated with the remaining queries. In contrast to previous work, which is mostly retrospective, our framework does not assume that relevance judgments are available a priori, and hence is designed to work in practice. The framework is then extended to accommodate both query selection and document selection, yielding a unified budget allocation method that prioritizes query-document pairs and selects the subset with the highest priority scores to be judged. The unified budget allocation is formulated as a convex optimization problem, permitting efficient solution and providing a flexible framework for incorporating various constraints. Once a subset of query-document pairs is selected, crowdsourcing can be used to collect the associated relevance judgments. While crowdsourced labels are relatively inexpensive, they vary in quality, introducing noise into the relevance judgments. To deal with this noise, multiple labels are collected for each document from different assessors. It is common practice in information retrieval to aggregate multiple labels by majority voting; in contrast, we develop a probabilistic model that provides accurate relevance judgments with a smaller number of labels collected per document. We demonstrate the effectiveness of our cost optimization approach on three experimental datasets: (i) various TREC tracks, (ii) a web test collection of an online search engine, and (iii) crowdsourced data collected for the INEX 2010 Book Search track. Our approach should assist research institutes, e.g. the National Institute of Standards and Technology (NIST), and commercial search engines, e.g. Google and Bing, in constructing test collections where there are large document collections and large query logs, but where economic constraints prohibit gathering comprehensive relevance judgments.
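
The label-aggregation step can be illustrated with a small sketch. The dissertation's own probabilistic model is not reproduced here, so the example below contrasts plain majority voting with a Dawid-Skene-style expectation-maximisation aggregator, a common probabilistic baseline for noisy crowd labels; the data structure and all names (majority_vote, dawid_skene, num_iters) are invented for the illustration.

```python
# Illustrative sketch only (not the dissertation's model): majority voting vs.
# a Dawid-Skene-style EM aggregator for binary crowd labels.
# labels[doc][assessor] = 1 (relevant) or 0 (non-relevant).
import math

def majority_vote(labels):
    # Ties are broken towards "relevant".
    return {d: int(2 * sum(v.values()) >= len(v)) for d, v in labels.items()}

def dawid_skene(labels, num_iters=50):
    docs = list(labels)
    assessors = {a for d in docs for a in labels[d]}
    # Initialise each document's relevance probability from a smoothed vote.
    p_rel = {d: (sum(labels[d].values()) + 0.5) / (len(labels[d]) + 1.0) for d in docs}
    for _ in range(num_iters):
        # M-step: smoothed accuracy of each assessor on relevant / non-relevant docs.
        acc = {}
        for a in assessors:
            hit_r = tot_r = hit_n = tot_n = 0.0
            for d in docs:
                if a in labels[d]:
                    y = labels[d][a]
                    tot_r += p_rel[d]
                    hit_r += p_rel[d] * y
                    tot_n += 1 - p_rel[d]
                    hit_n += (1 - p_rel[d]) * (1 - y)
            acc[a] = ((hit_r + 1) / (tot_r + 2), (hit_n + 1) / (tot_n + 2))
        prior = sum(p_rel.values()) / len(docs)
        # E-step: update each document's posterior probability of relevance.
        for d in docs:
            log_r, log_n = math.log(prior), math.log(1 - prior)
            for a, y in labels[d].items():
                sens, spec = acc[a]
                log_r += math.log(sens if y else 1 - sens)
                log_n += math.log(1 - spec if y else spec)
            p_rel[d] = 1.0 / (1.0 + math.exp(log_n - log_r))
    return {d: int(p_rel[d] >= 0.5) for d in docs}
```

For instance, with labels = {"d1": {"w1": 1, "w2": 1, "w3": 0}, "d2": {"w1": 0, "w3": 0}}, both aggregators are called as majority_vote(labels) and dawid_skene(labels); unlike the vote, the probabilistic aggregator also learns a per-assessor accuracy, which is what lets such models get by with fewer labels per document in favourable cases.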


Citations

Comparing Metrics across TREC and NTCIR: The Robustness to System Bias

Tetsuya Sakai
TL;DR: Nine metrics are examined in a more realistic setting, by reducing the number of pooled systems, to show that, when relevance data is heavily biased towards a single team or a few teams, the condensed-list versions of Average Precision, Q-measure and normalised Discounted Cumulative Gain are not necessarily superior to the original metrics in terms of discriminative power.
References
Proceedings Article

A study of cross-validation and bootstrap for accuracy estimation and model selection

TL;DR: The results indicate that for real-world datasets similar to the authors', the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.
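
As a pointer to how the recommended procedure is typically run, a minimal scikit-learn sketch follows; the dataset and classifier are placeholders, not taken from the paper.

```python
# Minimal sketch of ten-fold stratified cross-validation for model selection;
# the dataset and classifier are placeholders, not from the cited study.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)
print(f"mean accuracy {scores.mean():.3f} +/- {scores.std():.3f}")
```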
Journal Article

Cumulated gain-based evaluation of IR techniques

TL;DR: This article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position, and test results indicate that the proposed measures credit IR methods for their ability to retrieve highly relevant documents and allow testing of statistical significance of effectiveness differences.
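
The cumulated-gain family of measures can be sketched as follows; the logarithmic discount used here is one common convention, not necessarily the article's exact formulation.

```python
# Sketch of discounted cumulative gain (DCG) and its normalised form (nDCG)
# for a ranked list of graded relevance values; the log2 discount is one
# common convention rather than the article's exact definition.
import math

def dcg(gains, k=None):
    gains = gains[:k] if k is not None else gains
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(gains, k=None):
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1, 2], k=5))  # graded gains in ranked order for one query
```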
Journal Article

A generalization of sampling without replacement from a finite universe.

TL;DR: Two sampling schemes are discussed in connection with the problem of determining optimum selection probabilities according to the information available in a supplementary variable, yielding a general technique for treating samples drawn without replacement from finite universes when unequal selection probabilities are used.
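
The estimator introduced in this paper, now commonly called the Horvitz-Thompson estimator, weights each sampled value by the inverse of its inclusion probability; a minimal sketch with invented numbers:

```python
# Inverse-inclusion-probability (Horvitz-Thompson) estimate of a population
# total from a sample drawn with unequal selection probabilities; the values
# and probabilities below are invented for illustration.
def ht_total(sampled_values, inclusion_probs):
    return sum(y / p for y, p in zip(sampled_values, inclusion_probs))

# Three sampled documents with relevance gains 1, 0, 1 whose inclusion
# probabilities were 0.5, 0.2 and 0.1: estimated total relevance in the pool.
print(ht_total([1, 0, 1], [0.5, 0.2, 0.1]))  # 2.0 + 0.0 + 10.0 = 12.0
```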
Book

Learning to Rank for Information Retrieval

TL;DR: Three major approaches to learning to rank are introduced, i.e., the pointwise, pairwise, and listwise approaches; the relationship between the loss functions used in these approaches and widely used IR evaluation measures is analyzed; and the performance of these approaches on the LETOR benchmark datasets is evaluated.
Proceedings Article

Cheap and Fast -- But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks

TL;DR: This work explores the use of Amazon's Mechanical Turk system, a significantly cheaper and faster method for collecting annotations from a broad base of paid non-expert contributors over the Web, and proposes a technique for bias correction that significantly improves annotation quality on two tasks.