Learning to rank using gradient descent

doi:10.1145/1102351.1102363

Home
/
Papers
/
Learning to rank using gradient descent

Proceedings Article•DOI•

Learning to rank using gradient descent

Chris J.C. Burges¹, Tal Shaked¹, Erin L. Renshaw¹, Ari Lazier¹, Matt Deeds¹, Nicole A. Hamilton¹, Greg Hullender¹ - Show less +3 more•Institutions (1)

Microsoft¹

07 Aug 2005-pp 89-96

TL;DR: RankNet is introduced, an implementation of these ideas using a neural network to model the underlying ranking function, and test results on toy data and on data from a commercial internet search engine are presented.

read less

Abstract: We investigate using gradient descent methods for learning ranking functions; we propose a simple probabilistic cost function, and we introduce RankNet, an implementation of these ideas using a neural network to model the underlying ranking function. We present test results on toy data and on data from a commercial internet search engine.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Journal Article•DOI•

Machine learning

[...]

Thomas G. Dietterich¹•Institutions (1)

Oregon State University¹

01 Dec 1996-ACM Computing Surveys

TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.

...read moreread less

Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

...read moreread less

13,246 citations

Book•

Machine Learning : A Probabilistic Perspective

[...]

Kevin P. Murphy

24 Aug 2012

TL;DR: This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach, and is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.

...read moreread less

Abstract: Today's Web-enabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach. The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudo-code for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic methods, the book stresses a principled model-based approach, often using the language of graphical models to specify models in a concise and intuitive way. Almost all the models described have been implemented in a MATLAB software package--PMTK (probabilistic modeling toolkit)--that is freely available online. The book is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.

...read moreread less

8,059 citations

Cites background from "Learning to rank using gradient des..."

...This is a special kind of neural network known as RankNet (Burges et al. 2005) (see Section 16....
[...]

Proceedings Article•

BPR: Bayesian personalized ranking from implicit feedback

[...]

Steffen Rendle¹, Christoph Freudenthaler¹, Zeno Gantner¹, Lars Schmidt-Thieme¹•Institutions (1)

University of Hildesheim¹

18 Jun 2009

TL;DR: In this article, the authors proposed a generic optimization criterion BPR-Opt for personalized ranking that is the maximum posterior estimator derived from a Bayesian analysis of the problem, which is based on stochastic gradient descent with bootstrap sampling.

...read moreread less

Abstract: Item recommendation is the task of predicting a personalized ranking on a set of items (e.g. websites, movies, products). In this paper, we investigate the most common scenario with implicit feedback (e.g. clicks, purchases). There are many methods for item recommendation from implicit feedback like matrix factorization (MF) or adaptive k-nearest-neighbor (kNN). Even though these methods are designed for the item prediction task of personalized ranking, none of them is directly optimized for ranking. In this paper we present a generic optimization criterion BPR-Opt for personalized ranking that is the maximum posterior estimator derived from a Bayesian analysis of the problem. We also provide a generic learning algorithm for optimizing models with respect to BPR-Opt. The learning method is based on stochastic gradient descent with bootstrap sampling. We show how to apply our method to two state-of-the-art recommender models: matrix factorization and adaptive kNN. Our experiments indicate that for the task of personalized ranking our optimization method outperforms the standard learning techniques for MF and kNN. The results show the importance of optimizing models for the right criterion.

...read moreread less

3,429 citations

Book•

Learning to Rank for Information Retrieval

[...]

Tie-Yan Liu¹•Institutions (1)

Microsoft¹

27 Jun 2009

TL;DR: Three major approaches to learning to rank are introduced, i.e., the pointwise, pairwise, and listwise approaches, the relationship between the loss functions used in these approaches and the widely-used IR evaluation measures are analyzed, and the performance of these approaches on the LETOR benchmark datasets is evaluated.

...read moreread less

Abstract: This tutorial is concerned with a comprehensive introduction to the research area of learning to rank for information retrieval. In the first part of the tutorial, we will introduce three major approaches to learning to rank, i.e., the pointwise, pairwise, and listwise approaches, analyze the relationship between the loss functions used in these approaches and the widely-used IR evaluation measures, evaluate the performance of these approaches on the LETOR benchmark datasets, and demonstrate how to use these approaches to solve real ranking applications. In the second part of the tutorial, we will discuss some advanced topics regarding learning to rank, such as relational ranking, diverse ranking, semi-supervised ranking, transfer ranking, query-dependent ranking, and training data preprocessing. In the third part, we will briefly mention the recent advances on statistical learning theory for ranking, which explain the generalization ability and statistical consistency of different ranking methods. In the last part, we will conclude the tutorial and show several future research directions.

...read moreread less

2,515 citations

Cites background or methods from "Learning to rank using gradient des..."

...In recent years, learning to rank has become a very hot research direction in IR, and a large number of learning-to-rank algorithms have been proposed, such as [9, 13, 14, 16, 17, 26, 29, 33, 34, 47, 49, 59, 63, 73, 78, 90, 97, 99, 102, 114, 119, 122, 129, 134, 136]....
[...]
...Example algorithms belonging to the pairwise approach include [9, 14, 16, 29, 47, 63, 97, 122]....
[...]

Book•

The Probabilistic Relevance Framework

[...]

Stephen Robertson¹, Hugo Zaragoza²•Institutions (2)

Microsoft¹, Yahoo!²

17 Dec 2009

TL;DR: This work presents the PRF from a conceptual point of view, describing the probabilistic modelling assumptions behind the framework and the different ranking algorithms that result from its application: the binary independence model, relevance feedback models, BM25 and BM25F.

...read moreread less

Abstract: The Probabilistic Relevance Framework (PRF) is a formal framework for document retrieval, grounded in work done in the 1970—1980s, which led to the development of one of the most successful text-retrieval algorithms, BM25. In recent years, research in the PRF has yielded new retrieval models capable of taking into account document meta-data (especially structure and link-graph information). Again, this has led to one of the most successful Web-search and corporate-search algorithms, BM25F. This work presents the PRF from a conceptual point of view, describing the probabilistic modelling assumptions behind the framework and the different ranking algorithms that result from its application: the binary independence model, relevance feedback models, BM25 and BM25F. It also discusses the relation between the PRF and other statistical models for IR, and covers some related topics, such as the use of non-textual features, and parameter optimisation for models with free parameters.

...read moreread less

2,037 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

References

PDF

Open Access

More filters

Journal Article•DOI•

Machine learning

[...]

Thomas G. Dietterich¹•Institutions (1)

Oregon State University¹

01 Dec 1996-ACM Computing Surveys

TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.

...read moreread less

13,246 citations

"Learning to rank using gradient des..." refers background in this paper

...two layer neural nets can approximate any bounded continuous function (Mitchell, 1997)), and since they are often faster in test phase than competing kernel methods (and test speed is critical for this application); however our cost function could equally well be applied to a variety of machine learning algorithms....
[...]
...…to explore these ideas using neural networks, since they are flexible (e.g. two layer neural nets can approximate any bounded continuous function (Mitchell, 1997)), and since they are often faster in test phase than competing kernel methods (and test speed is critical for this application);…...
[...]

Proceedings Article•

Learning with Kernels

[...]

Bernhard Schölkopf¹•Institutions (1)

Max Planck Society¹

01 Aug 2002

3,669 citations

Proceedings Article•

Signature Verification using a "Siamese" Time Delay Neural Network

[...]

Jane Bromley¹, Isabelle Guyon¹, Yann LeCun¹, E. Sackinger¹, Roopak Shah¹ - Show less +1 more•Institutions (1)

Bell Labs¹

29 Nov 1993

TL;DR: An algorithm for verification of signatures written on a pen-input tablet based on a novel, artificial neural network called a "Siamese" neural network, which consists of two identical sub-networks joined at their outputs.

...read moreread less

Abstract: This paper describes an algorithm for verification of signatures written on a pen-input tablet. The algorithm is based on a novel, artificial neural network, called a "Siamese" neural network. This network consists of two identical sub-networks joined at their outputs. During training the two sub-networks extract features from two signatures, while the joining neuron measures the distance between the two feature vectors. Verification consists of comparing an extracted feature vector with a stored feature vector for the signer. Signatures closer to this stored representation than a chosen threshold are accepted, all other signatures are rejected as forgeries.

...read moreread less

2,980 citations

"Learning to rank using gradient des..." refers background or methods in this paper

...4One can also view this as a weight sharing update for a Siamese-like net(Bromley et al., 1993)....
[...]
...Table 2 compares results, for the polynomial ranking function, of training on ties, assigning P = 1 for nonties and P = 0.5 for ties, using a two layer net with 10 hidden units....
[...]

Journal Article•DOI•

Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons

[...]

Ralph A. Bradley, Milton E. Terry

01 Dec 1952-Biometrika

1,992 citations

"Learning to rank using gradient des..." refers background in this paper

...In (Hastie & Tibshirani, 1998) and (Bradley & Terry, 1952), the authors consider models of the following form: for some fixed set of events A1, . . . , Ak, pairwise probabilities P (Ai|Ai or Aj) are given, and it is assumed that there is a set of probabilities P̂i such that P (Ai|Ai or Aj) =…...
[...]
...In (Hastie & Tibshirani, 1998) and ( Bradley & Terry, 1952 ), the authors consider models of the following form: for some xed set of events A1; : : : ; Ak, pairwise probabilities P (AijAi or Aj) are given, and it is assumed that there is a set of probabilities ^...
[...]

Journal Article•DOI•

An efficient boosting algorithm for combining preferences

[...]

Yoav Freund¹, Raj D. Iyer, Robert E. Schapire², Yoram Singer³•Institutions (3)

Columbia University¹, Princeton University², Hebrew University of Jerusalem³

01 Dec 2003-Journal of Machine Learning Research

TL;DR: This work describes and analyze an efficient algorithm called RankBoost for combining preferences based on the boosting approach to machine learning, and gives theoretical results describing the algorithm's behavior both on the training data, and on new test data not seen during training.

...read moreread less

Abstract: We study the problem of learning to accurately rank a set of objects by combining a given collection of ranking or preference functions. This problem of combining preferences arises in several applications, such as that of combining the results of different search engines, or the "collaborative-filtering" problem of ranking movies for a user based on the movie rankings provided by other users. In this work, we begin by presenting a formal framework for this general problem. We then describe and analyze an efficient algorithm called RankBoost for combining preferences based on the boosting approach to machine learning. We give theoretical results describing the algorithm's behavior both on the training data, and on new test data not seen during training. We also describe an efficient implementation of the algorithm for a particular restricted but common case. We next discuss two experiments we carried out to assess the performance of RankBoost. In the first experiment, we used the algorithm to combine different web search strategies, each of which is a query expansion for a given domain. The second experiment is a collaborative-filtering task for making movie recommendations.

...read moreread less

1,889 citations

"Learning to rank using gradient des..." refers background or methods in this paper

...PRank learns using one example at a time, which is held as an advantage over pair-based methods (e.g. (Freund et al., 2003)), since the latter must learn using O(m2) pairs rather than m examples....
[...]
...In (Freund et al., 2003), results are given using decision stumps as the weak learners....
[...]
...RankBoost (Freund et al., 2003) is another ranking algorithm that is trained on pairs, and which is closer in spirit to our work since it attempts to solve the preference learning problem directly, rather than solving an ordinal regression problem....
[...]
...(Freund et al., 2003)), since the latter must learn using O(m(2)) pairs rather than m examples....
[...]