Journal ArticleDOI

Cumulated gain-based evaluation of IR techniques

01 Oct 2002-ACM Transactions on Information Systems (ACM)-Vol. 20, Iss: 4, pp 422-446
TL;DR: This article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position, and test results indicate that the proposed measures credit IR methods for their ability to retrieve highly relevant documents and allow testing of statistical significance of effectiveness differences.
Abstract: Modern large retrieval environments tend to overwhelm their users with their large output. Since not all documents are of equal relevance to their users, highly relevant documents should be identified and ranked first for presentation. In order to develop IR techniques in this direction, it is necessary to develop evaluation approaches and methods that credit IR methods for their ability to retrieve highly relevant documents. This can be done by extending traditional evaluation methods, that is, recall and precision based on binary relevance judgments, to graded relevance judgments. Alternatively, novel measures based on graded relevance judgments may be developed. This article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position. The first one accumulates the relevance scores of retrieved documents along the ranked result list. The second one is similar but applies a discount factor to the relevance scores in order to devaluate late-retrieved documents. The third one computes the relative-to-the-ideal performance of IR techniques, based on the cumulative gain they are able to yield. These novel measures are defined and discussed and their use is demonstrated in a case study using TREC data: sample system run results for 20 queries in TREC-7. As a relevance base we used novel graded relevance judgments on a four-point scale. The test results indicate that the proposed measures credit IR methods for their ability to retrieve highly relevant documents and allow testing of statistical significance of effectiveness differences. The graphs based on the measures also provide insight into the performance of IR techniques and allow interpretation, for example, from the user point of view.
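To make the three measures concrete, here is a minimal Python sketch of cumulated gain (CG), discounted cumulated gain (DCG), and normalized DCG following the definitions summarized in the abstract (graded relevance scores as gains, discount base b = 2, no discount for ranks below b). The function names and the toy gain vector are illustrative, not taken from the article.

```python
import math

def cg(gains):
    """Cumulated gain: running sum of the graded relevance scores
    along the ranked result list."""
    total, out = 0.0, []
    for g in gains:
        total += g
        out.append(total)
    return out

def dcg(gains, b=2):
    """Discounted cumulated gain: ranks below b are not discounted;
    from rank b onward each gain is divided by log_b(rank)."""
    total, out = 0.0, []
    for rank, g in enumerate(gains, start=1):
        total += g if rank < b else g / math.log(rank, b)
        out.append(total)
    return out

def ndcg(gains, b=2):
    """Normalized DCG: the DCG vector divided position-by-position by
    the DCG of the ideal (descending) ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True), b)
    return [d / i if i > 0 else 0.0 for d, i in zip(dcg(gains, b), ideal)]

# Toy run: graded relevance judgments on a four-point scale (0-3).
run = [3, 2, 3, 0, 1, 2]
print(ndcg(run))  # relative-to-the-ideal performance at each rank
```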
Citations
Book
Tie-Yan Liu
27 Jun 2009
TL;DR: Three major approaches to learning to rank are introduced (the pointwise, pairwise, and listwise approaches); the relationship between the loss functions used in these approaches and widely-used IR evaluation measures is analyzed; and the performance of these approaches is evaluated on the LETOR benchmark datasets.
Abstract: This tutorial is concerned with a comprehensive introduction to the research area of learning to rank for information retrieval. In the first part of the tutorial, we will introduce three major approaches to learning to rank, i.e., the pointwise, pairwise, and listwise approaches, analyze the relationship between the loss functions used in these approaches and the widely-used IR evaluation measures, evaluate the performance of these approaches on the LETOR benchmark datasets, and demonstrate how to use these approaches to solve real ranking applications. In the second part of the tutorial, we will discuss some advanced topics regarding learning to rank, such as relational ranking, diverse ranking, semi-supervised ranking, transfer ranking, query-dependent ranking, and training data preprocessing. In the third part, we will briefly mention the recent advances on statistical learning theory for ranking, which explain the generalization ability and statistical consistency of different ranking methods. In the last part, we will conclude the tutorial and show several future research directions.
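As a concrete illustration of the pairwise approach named above, the sketch below computes a RankNet-style logistic loss over all preference pairs induced by graded labels. It is a minimal example under assumed names and toy data, not code from the tutorial or the LETOR benchmarks.

```python
import math

def pairwise_logistic_loss(scores, labels):
    """RankNet-style pairwise loss: for every pair (i, j) where item i
    carries a higher graded label than item j, add the softplus penalty
    log(1 + e^{-(s_i - s_j)}), which is small when the model already
    scores i above j and large when it inverts the pair."""
    loss, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                loss += math.log1p(math.exp(-(scores[i] - scores[j])))
                pairs += 1
    return loss / max(pairs, 1)

# Toy example: model scores for one query, graded relevance labels (0-2).
print(pairwise_logistic_loss([2.0, 0.5, 1.2], [2, 0, 1]))
```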

2,515 citations


Cites background from "Cumulated gain-based evaluation of ..."

  • ...Discounted cumulative gain (DCG): While the aforementioned measures are mainly designed for binary judgments, DCG [65, 66] can leverage the relevance judgment in terms of multiple ordered categories, and has an explicit position discount factor in its definition....


Journal ArticleDOI
01 Aug 2011
TL;DR: Under the meta path framework, a novel similarity measure called PathSim is defined that is able to find peer objects in the network (e.g., finding authors in a similar field and with similar reputation), which turns out to be more meaningful in many scenarios compared with random-walk based similarity measures.
Abstract: Similarity search is a primitive operation in database and Web search engines. With the advent of large-scale heterogeneous information networks that consist of multi-typed, interconnected objects, such as the bibliographic networks and social media networks, it is important to study similarity search in such networks. Intuitively, two objects are similar if they are linked by many paths in the network. However, most existing similarity measures are defined for homogeneous networks. Different semantic meanings behind paths are not taken into consideration. Thus they cannot be directly applied to heterogeneous networks. In this paper, we study similarity search that is defined among the same type of objects in heterogeneous networks. Moreover, by considering different linkage paths in a network, one could derive various similarity semantics. Therefore, we introduce the concept of meta path-based similarity, where a meta path is a path consisting of a sequence of relations defined between different object types (i.e., structural paths at the meta level). No matter whether a user would like to explicitly specify a path combination given sufficient domain knowledge, or choose the best path by experimental trials, or simply provide training examples to learn it, meta path forms a common base for a network-based similarity search engine. In particular, under the meta path framework we define a novel similarity measure called PathSim that is able to find peer objects in the network (e.g., finding authors in a similar field and with similar reputation), which turns out to be more meaningful in many scenarios compared with random-walk based similarity measures. In order to support fast online query processing for PathSim queries, we develop an efficient solution that partially materializes short meta paths and then concatenates them online to compute top-k results. Experiments on real data sets demonstrate the effectiveness and efficiency of our proposed paradigm.
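As a concrete illustration, the sketch below computes PathSim in its standard matrix form, s(x, y) = 2·M[x, y] / (M[x, x] + M[y, y]), where M is the commuting matrix of a symmetric meta path. The toy matrix and names are illustrative assumptions, not data or code from the paper.

```python
import numpy as np

def pathsim(adj):
    """PathSim for a symmetric meta path P = (P_l, P_l^{-1}).
    `adj` is the adjacency matrix of the half path (e.g. author-venue);
    M = adj @ adj.T counts meta path instances between same-typed objects."""
    M = adj @ adj.T
    diag = np.diag(M).astype(float)
    denom = diag[:, None] + diag[None, :]
    # s(x, y) = 2 * M[x, y] / (M[x, x] + M[y, y]); defined as 0 when
    # neither object has any path instance at all
    return np.divide(2.0 * M, denom, out=np.zeros_like(denom), where=denom > 0)

# Toy bibliographic data: rows are authors, columns are venues.
apv = np.array([[3, 1, 0],
                [2, 1, 0],
                [0, 0, 5]])
print(pathsim(apv))  # peers must share both connectivity and visibility
```

Unlike random-walk scores, the normalization by the diagonal keeps a highly connected hub from dominating every similarity ranking, which is why PathSim surfaces peers rather than merely popular objects.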

1,583 citations


Cites result from "Cumulated gain-based evaluation of ..."

  • ...Then we use the measure nDCG (Normalized Discounted Cumulative Gain, with the value between 0 and 1, the higher the better) [9] to evaluate the quality of a ranking algorithm by comparing its output ranking results with the labeled ones (Table 5)....


Book ChapterDOI
Cyril Goutte, Eric Gaussier
21 Mar 2005
TL;DR: A probabilistic setting is used to obtain posterior distributions on precision, recall, and F-score, rather than point estimates; it is applied to the case where different methods are run on different datasets from the same source.
Abstract: We address the problems of (1) assessing the confidence of the standard point estimates, precision, recall and F-score, and (2) comparing the results, in terms of precision, recall and F-score, obtained using two different methods. To do so, we use a probabilistic setting which allows us to obtain posterior distributions on these performance indicators, rather than point estimates. This framework is applied to the case where different methods are run on different datasets from the same source, as well as the standard situation where competing results are obtained on the same data.
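To illustrate the idea, here is a simplified Monte Carlo sketch that puts Beta posteriors on the precision and recall proportions and propagates samples through the F-score. It treats the two posteriors as independent for brevity, whereas the paper derives them from a joint model over the confusion counts, so this is an approximation under assumed names and toy counts, not the paper's exact model.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_score_posterior(tp, fp, fn, n_samples=100_000, prior=0.5):
    """Posterior samples for precision, recall, and F1. Each proportion
    gets a Beta posterior (Jeffreys prior by default); F1 is computed
    per sample, yielding a distribution rather than a point estimate."""
    precision = rng.beta(tp + prior, fp + prior, n_samples)
    recall = rng.beta(tp + prior, fn + prior, n_samples)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = f_score_posterior(tp=45, fp=5, fn=15)
print(f1.mean())                        # posterior mean of F1
print(np.quantile(f1, [0.025, 0.975]))  # 95% credible interval
```

Comparing two methods then amounts to estimating P(F1_A > F1_B) from paired samples, instead of reporting a single difference between two point estimates.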

1,402 citations

Book ChapterDOI
Guy Shani, Asela Gunawardana
01 Jan 2011
TL;DR: This paper discusses how to compare recommenders based on a set of properties that are relevant for the application, and focuses on comparative studies, where a few algorithms are compared using some evaluation metric, rather than absolute benchmarking of algorithms.
Abstract: Recommender systems are now popular both commercially and in the research community, where many approaches have been suggested for providing recommendations. In many cases a system designer that wishes to employ a recommendation system must choose between a set of candidate approaches. A first step towards selecting an appropriate algorithm is to decide which properties of the application to focus upon when making this choice. Indeed, recommendation systems have a variety of properties that may affect user experience, such as accuracy, robustness, scalability, and so forth. In this paper we discuss how to compare recommenders based on a set of properties that are relevant for the application. We focus on comparative studies, where a few algorithms are compared using some evaluation metric, rather than absolute benchmarking of algorithms. We describe experimental settings appropriate for making choices between algorithms. We review three types of experiments, starting with an offline setting, where recommendation approaches are compared without user interaction, then reviewing user studies, where a small group of subjects experiment with the system and report on the experience, and finally describing large scale online experiments, where real user populations interact with the system. In each of these cases we describe the types of questions that can be answered, and suggest protocols for experimentation. We also discuss how to draw trustworthy conclusions from the conducted experiments. We then review a large set of properties, and explain how to evaluate systems given relevant properties. We also survey a large set of evaluation metrics in the context of the properties that they evaluate.

1,238 citations


Cites background from "Cumulated gain-based evaluation of ..."

  • ...Normalized Cumulative Discounted Gain (NDCG) [27] is a measure from information retrieval, where positions are discounted logarithmically....


  • ...Thus, ARHR decays more slowly than R-score but faster than NDCG. [Online evaluation of ranking] In an online experiment designed to evaluate the ranking of the recommendation list, we can look at the interactions of users with the system....


  • ...A measure closely related to R-score and NDCG is Average Reciprocal Hit Rank (ARHR) [14] which is an un-normalized measure that assigns a utility 1/k to a successful recommendation at position k....


  • ...NDCG is the normalized version of DCG, given by NDCG = DCG / DCG∗ (8.18), where DCG∗ is the ideal DCG....

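To make the quoted ARHR definition concrete, here is a minimal sketch; the function signature and toy data are illustrative assumptions, not code from the chapter.

```python
def arhr(hit_ranks, n_users):
    """Average Reciprocal Hit Rank: a successful recommendation at rank k
    earns utility 1/k; users with no hit contribute 0. It is un-normalized,
    and its 1/k discount decays faster than NDCG's logarithmic one."""
    return sum(1.0 / k for k in hit_ranks) / n_users

# Toy offline test: hits at ranks 1, 3, and 4 among five test users.
print(arhr([1, 3, 4], n_users=5))  # (1 + 1/3 + 1/4) / 5 ≈ 0.317
```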

Book
16 Feb 2009
TL;DR: This text provides the background and tools needed to evaluate, compare and modify search engines; numerous programming exercises make extensive use of Galago, a Java-based open source search engine.
Abstract: KEY BENEFIT: Written by a leader in the field of information retrieval, this text provides the background and tools needed to evaluate, compare and modify search engines. KEY TOPICS: Coverage of the underlying IR and mathematical models reinforce key concepts. Numerous programming exercises make extensive use of Galago, a Java-based open source search engine. MARKET: A valuable tool for search engine and information retrieval professionals.

1,050 citations

References
Book
01 Jan 1983
TL;DR: A foundational textbook on information retrieval, covering automatic indexing, retrieval models, and the evaluation of IR systems.
Abstract: Introduction to Modern Information Retrieval (Salton and McGill, 1983) is the classic textbook treatment of automatic text analysis and indexing, Boolean and vector space retrieval models, and retrieval evaluation, including measures such as normalized recall.

12,059 citations


"Cumulated gain-based evaluation of ..." refers background in this paper


  • ...These novel measures are akin to the average search length [briefly, ASL; Losee 1998], sliding ratio [Korfhage 1997], and normalized recall [Pollack 1968; Salton and McGill 1983; Korfhage 1997] measures....


  • ...They are related to some traditional measures such as average search length [Losee 1998], expected search length [Cooper 1968], normalized recall [Rocchio 1966; Salton and McGill 1983], sliding ratio [Pollack 1968; Korfhage 1997], satisfaction-frustration-total measure [Myaeng and Korfhage 1990], and ranked half-life [Borlund and Ingwersen 1998]....


Book
01 Jan 1971
TL;DR: Conover's Practical Nonparametric Statistics, a standard reference covering probability theory, statistical inference, tests based on the binomial distribution, contingency tables, rank-based methods, and statistics of the Kolmogorov-Smirnov type.
Abstract: Probability Theory. Statistical Inference. Some Tests Based on the Binomial Distribution. Contingency Tables. Some Methods Based on Ranks. Statistics of the Kolmogorov-Smirnov Type. References. Appendix Tables. Answers to Odd-Numbered Exercises. Index.

10,382 citations

Journal ArticleDOI
01 Jul 2000
TL;DR: The novel evaluation methods and the case demonstrate that non-dichotomous relevance assessments are applicable in IR experiments, may reveal interesting phenomena, and allow harder testing of IR methods.
Abstract: This paper proposes evaluation methods based on the use of non-dichotomous relevance judgements in IR experiments. It is argued that evaluation methods should credit IR methods for their ability to retrieve highly relevant documents. This is desirable from the user point of view in modern large IR environments. The proposed methods are (1) a novel application of P-R curves and average precision computations based on separate recall bases for documents of different degrees of relevance, and (2) two novel measures computing the cumulative gain the user obtains by examining the retrieval result up to a given ranked position. We then demonstrate the use of these evaluation methods in a case study on the effectiveness of query types, based on combinations of query structures and expansion, in retrieving documents of various degrees of relevance. The test was run with a best match retrieval system (InQuery) in a text database consisting of newspaper articles. The results indicate that the tested strong query structures are most effective in retrieving highly relevant documents. The differences between the query types are practically essential and statistically significant. More generally, the novel evaluation methods and the case demonstrate that non-dichotomous relevance assessments are applicable in IR experiments, may reveal interesting phenomena, and allow harder testing of IR methods.
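As a concrete illustration of method (1), the sketch below computes average precision against a separate recall base: only documents at or above a chosen relevance grade count as relevant, and the denominator is the number of such documents in the collection. The function name, thresholds, and toy data are illustrative assumptions, not the paper's code.

```python
def average_precision(ranked_grades, threshold, recall_base):
    """Average precision with a separate recall base: a document counts
    as relevant only if its graded judgment >= threshold, and the sum of
    precision values at hit ranks is divided by the recall base (the number
    of documents of that grade or higher in the whole collection)."""
    hits, total = 0, 0.0
    for rank, grade in enumerate(ranked_grades, start=1):
        if grade >= threshold:
            hits += 1
            total += hits / rank
    return total / recall_base if recall_base else 0.0

# The same run evaluated liberally (any relevant grade counts) and
# stringently (highly relevant only), each against its own recall base.
run = [3, 0, 2, 3, 1, 0, 2]
print(average_precision(run, threshold=1, recall_base=10))
print(average_precision(run, threshold=3, recall_base=3))
```

Evaluating the same run under both thresholds is what lets the study show that some query types win specifically on highly relevant documents while looking average under binary judgments.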

1,461 citations

Journal ArticleDOI
TL;DR: An evaluation of a large, operational full-text document-retrieval system shows the system to be retrieving less than 20 percent of the documents relevant to a particular search.
Abstract: An evaluation of a large, operational full-text document-retrieval system (containing roughly 350,000 pages of text) shows the system to be retrieving less than 20 percent of the documents relevant to a particular search. The findings are discussed in terms of the theory and practice of full-text document retrieval.

871 citations