Deep Neural Networks for YouTube Recommendations

doi:10.1145/2959100.2959190

Home
/
Papers
/
Deep Neural Networks for YouTube Recommendations

Proceedings Article•DOI•

Deep Neural Networks for YouTube Recommendations

Paul Covington¹, Jay Adams¹, Emre Sargin¹•Institutions (1)

Google¹

07 Sep 2016-pp 191-198

TL;DR: This paper details a deep candidate generation model and then describes a separate deep ranking model and provides practical lessons and insights derived from designing, iterating and maintaining a massive recommendation system with enormous user-facing impact.

read less

Abstract: YouTube represents one of the largest scale and most sophisticated industrial recommendation systems in existence. In this paper, we describe the system at a high level and focus on the dramatic performance improvements brought by deep learning. The paper is split according to the classic two-stage information retrieval dichotomy: first, we detail a deep candidate generation model and then describe a separate deep ranking model. We also provide practical lessons and insights derived from designing, iterating and maintaining a massive recommendation system with enormous user-facing impact.

...read moreread less

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts

[...]

Jiaqi Ma¹, Zhe Zhao², Xinyang Yi², Jilin Chen², Lichan Hong², Ed H. Chi² - Show less +2 more•Institutions (2)

University of Michigan¹, Google²

19 Jul 2018

TL;DR: This work proposes a novel multi-task learning approach, Multi-gate Mixture-of-Experts (MMoE), which explicitly learns to model task relationships from data and demonstrates the performance improvements by MMoE on real tasks including a binary classification benchmark, and a large-scale content recommendation system at Google.

...read moreread less

Abstract: Neural-based multi-task learning has been successfully used in many real-world large-scale applications such as recommendation systems. For example, in movie recommendations, beyond providing users movies which they tend to purchase and watch, the system might also optimize for users liking the movies afterwards. With multi-task learning, we aim to build a single model that learns these multiple goals and tasks simultaneously. However, the prediction quality of commonly used multi-task models is often sensitive to the relationships between tasks. It is therefore important to study the modeling tradeoffs between task-specific objectives and inter-task relationships. In this work, we propose a novel multi-task learning approach, Multi-gate Mixture-of-Experts (MMoE), which explicitly learns to model task relationships from data. We adapt the Mixture-of-Experts (MoE) structure to multi-task learning by sharing the expert submodels across all tasks, while also having a gating network trained to optimize each task. To validate our approach on data with different levels of task relatedness, we first apply it to a synthetic dataset where we control the task relatedness. We show that the proposed approach performs better than baseline methods when the tasks are less related. We also show that the MMoE structure results in an additional trainability benefit, depending on different levels of randomness in the training data and model initialization. Furthermore, we demonstrate the performance improvements by MMoE on real tasks including a binary classification benchmark, and a large-scale content recommendation system at Google.

...read moreread less

545 citations

Posted Content•

DKN: Deep Knowledge-Aware Network for News Recommendation

[...]

Hongwei Wang¹, Fuzheng Zhang², Xing Xie², Minyi Guo¹•Institutions (2)

Shanghai Jiao Tong University¹, Microsoft²

25 Jan 2018-arXiv: Machine Learning

TL;DR: A deep knowledge-aware network (DKN) that incorporates knowledge graph representation into news recommendation and achieves substantial gains over state-of-the-art deep recommendation models is proposed.

...read moreread less

Abstract: Online news recommender systems aim to address the information explosion of news and make personalized recommendation for users. In general, news language is highly condensed, full of knowledge entities and common sense. However, existing methods are unaware of such external knowledge and cannot fully discover latent knowledge-level connections among news. The recommended results for a user are consequently limited to simple patterns and cannot be extended reasonably. Moreover, news recommendation also faces the challenges of high time-sensitivity of news and dynamic diversity of users' interests. To solve the above problems, in this paper, we propose a deep knowledge-aware network (DKN) that incorporates knowledge graph representation into news recommendation. DKN is a content-based deep recommendation framework for click-through rate prediction. The key component of DKN is a multi-channel and word-entity-aligned knowledge-aware convolutional neural network (KCNN) that fuses semantic-level and knowledge-level representations of news. KCNN treats words and entities as multiple channels, and explicitly keeps their alignment relationship during convolution. In addition, to address users' diverse interests, we also design an attention module in DKN to dynamically aggregate a user's history with respect to current candidate news. Through extensive experiments on a real online news platform, we demonstrate that DKN achieves substantial gains over state-of-the-art deep recommendation models. We also validate the efficacy of the usage of knowledge in DKN.

...read moreread less

510 citations

Cites methods from "Deep Neural Networks for YouTube Re..."

...• The architecture of DeepWide and YouTubeNet is similar in the news recommendation scenario, thus we can observe comparable performance of the two methods....
[...]
...For YouTubeNet, the dimension of final layer is set as 100....
[...]
...DSSM outperforms DeepWide and YouTubeNet, the reason for which might be that DSSM models raw texts directly with word hashing....
[...]
...For example, the AUC of KPCNN, DeepWide, and YouTubeNet increases by 1.1%, 1.8% and 1.1%, respectively....
[...]
...• YouTubeNet [8] is proposed to recommend videos from a large-scale candidate set in YouTube using a deep candidate generation network and a deep ranking network....
[...]

Proceedings Article•DOI•

AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks

[...]

Weiping Song¹, Chence Shi¹, Zhiping Xiao², Zhijian Duan¹, Yewen Xu¹, Ming Zhang¹, Jian Tang³ - Show less +3 more•Institutions (3)

Peking University¹, University of California, Los Angeles², HEC Montréal³

03 Nov 2019

TL;DR: An effective and efficient method called the AutoInt to automatically learn the high-order feature interactions of input features and map both the numerical and categorical features into the same low-dimensional space is proposed.

...read moreread less

Abstract: Click-through rate (CTR) prediction, which aims to predict the probability of a user clicking on an ad or an item, is critical to many online applications such as online advertising and recommender systems. The problem is very challenging since (1) the input features (e.g., the user id, user age, item id, item category) are usually sparse and high-dimensional, and (2) an effective prediction relies on high-order combinatorial features (a.k.a. cross features), which are very time-consuming to hand-craft by domain experts and are impossible to be enumerated. Therefore, there have been efforts in finding low-dimensional representations of the sparse and high-dimensional raw features and their meaningful combinations. In this paper, we propose an effective and efficient method called the AutoInt to automatically learn the high-order feature interactions of input features. Our proposed algorithm is very general, which can be applied to both numerical and categorical input features. Specifically, we map both the numerical and categorical features into the same low-dimensional space. Afterwards, a multi-head self-attentive neural network with residual connections is proposed to explicitly model the feature interactions in the low-dimensional space. With different layers of the multi-head self-attentive neural networks, different orders of feature combinations of input features can be modeled. The whole model can be efficiently fit on large-scale raw data in an end-to-end fashion. Experimental results on four real-world datasets show that our proposed approach not only outperforms existing state-of-the-art approaches for prediction but also offers good explainability. Code is available at: \urlhttps://github.com/DeepGraphLearning/RecommenderSystems.

...read moreread less

463 citations

Cites background from "Deep Neural Networks for YouTube Re..."

...Predicting click-through rates is important to many Internet companies, and various systems have been developed by different companies [8–10, 15, 21, 29, 43]....
[...]

Proceedings Article•DOI•

IRGAN: A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models

[...]

Jun Wang¹, Lantao Yu², Weinan Zhang², Yu Gong³, Yinghui Xu³, Benyou Wang⁴, Peng Zhang⁴, Dell Zhang⁵ - Show less +4 more•Institutions (5)

University College London¹, Shanghai Jiao Tong University², Alibaba Group³, Tianjin University⁴, Birkbeck, University of London⁵

30 May 2017-arXiv: Information Retrieval

TL;DR: A unified framework takes advantage of both schools of thinking in information retrieval modelling and shows that the generative model learns to fit the relevance distribution over documents via the signals from the discriminative model to achieve a better estimation for document ranking.

...read moreread less

Abstract: This paper provides a unified account of two schools of thinking in information retrieval modelling: the generative retrieval focusing on predicting relevant documents given a query, and the discriminative retrieval focusing on predicting relevancy given a query-document pair. We propose a game theoretical minimax game to iteratively optimise both models. On one hand, the discriminative model, aiming to mine signals from labelled and unlabelled data, provides guidance to train the generative model towards fitting the underlying relevance distribution over documents given the query. On the other hand, the generative model, acting as an attacker to the current discriminative model, generates difficult examples for the discriminative model in an adversarial way by minimising its discrimination objective. With the competition between these two models, we show that the unified framework takes advantage of both schools of thinking: (i) the generative model learns to fit the relevance distribution over documents via the signals from the discriminative model, and (ii) the discriminative model is able to exploit the unlabelled data selected by the generative model to achieve a better estimation for document ranking. Our experimental results have demonstrated significant performance gains as much as 23.96% on Precision@5 and 15.50% on MAP over strong baselines in a variety of applications including web search, item recommendation, and question answering.

...read moreread less

416 citations

Cites methods from "Deep Neural Networks for YouTube Re..."

...To keep our discussion uncluered, we have chosen a basic matrix factorisation model to implement, and it would be straightforward to replace it with more sophisticated models such as factorisation machines [33] or neural networks [8], whenever needed....
[...]

Proceedings Article•DOI•

IRGAN: A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models

[...]

Jun Wang¹, Lantao Yu², Weinan Zhang², Yu Gong³, Yinghui Xu³, Benyou Wang⁴, Peng Zhang⁴, Dell Zhang⁵ - Show less +4 more•Institutions (5)

University College London¹, Shanghai Jiao Tong University², Alibaba Group³, Tianjin University⁴, Birkbeck, University of London⁵

07 Aug 2017

TL;DR: In this paper, a game theoretical minimax game is proposed to iteratively optimise both generative and discriminative models for document ranking, and the generative model is trained to fit the relevance distribution over documents via the signals from the discriminator.

...read moreread less

413 citations

1
2
3
4
5
6
…
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

References

PDF

Open Access

More filters

Book Chapter•DOI•

I and J

[...]

William Marsden

01 Jan 2012

139,059 citations

"Deep Neural Networks for YouTube Re..." refers background in this paper

...We observe that the most important signals are those that describe a user’s previous interaction with the item itself and other similar items, matching others’ experience in ranking ads [7]....
[...]

Proceedings Article•

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

[...]

Sergey Ioffe¹, Christian Szegedy¹•Institutions (1)

Google¹

06 Jul 2015

TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.

...read moreread less

Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.

...read moreread less

30,843 citations

Proceedings Article•

Distributed Representations of Words and Phrases and their Compositionality

[...]

Tomas Mikolov¹, Ilya Sutskever¹, Kai Chen¹, Greg S. Corrado¹, Jeffrey Dean¹ - Show less +1 more•Institutions (1)

Google¹

05 Dec 2013

TL;DR: This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.

...read moreread less

Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.

...read moreread less

24,012 citations

"Deep Neural Networks for YouTube Re..." refers background in this paper

...A key advantage of using deep neural networks as a generalization of matrix factorization is that arbitrary continuous and categorical features can be easily added to the model....
[...]

Posted Content•

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

[...]

Sergey Ioffe¹, Christian Szegedy¹•Institutions (1)

Google¹

11 Feb 2015-arXiv: Learning

TL;DR: Batch Normalization as mentioned in this paper normalizes layer inputs for each training mini-batch to reduce the internal covariate shift in deep neural networks, and achieves state-of-the-art performance on ImageNet.

...read moreread less

Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.

...read moreread less

17,184 citations

Posted Content•

Distributed Representations of Words and Phrases and their Compositionality

[...]

Tomas Mikolov¹, Ilya Sutskever¹, Kai Chen¹, Greg S. Corrado¹, Jeffrey Dean¹ - Show less +1 more•Institutions (1)

Google¹

16 Oct 2013-arXiv: Computation and Language

TL;DR: In this paper, the Skip-gram model is used to learn high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships and improve both the quality of the vectors and the training speed.

...read moreread less

11,343 citations