Book ChapterDOI

Using Multi-armed Bandit to Solve Cold-Start Problems in Recommender Systems at Telco

TL;DR: This work proposes a new approach based on multi-armed bandit algorithms to automatically recommend rate-plans for new users in the Telco industry, showing promising results.
Abstract: Recommending best-fit rate-plans for new users is a challenge for the Telco industry. Rate-plans differ from most traditional products in that a user normally has only one product at any given time. This, combined with no background knowledge on new users, hinders traditional recommender systems. Many Telcos today use trivial approaches, such as picking a random plan or the most common plan in use. The work presented here shows that these methods perform poorly. We propose a new approach based on multi-armed bandit algorithms to automatically recommend rate-plans for new users. An experiment conducted on two different real-world datasets from two brands of a major international Telco operator shows promising results.

Summary (2 min read)

1 Introduction

  • Telcos do not commonly supply a lot of services; most generally, they supply subscriptions, or rate-plans, either pre-paid or post-paid.
  • Compared with a traditional recommender problem, where the user-item matrix might merely be sparse, here the matrix is completely empty.
  • To solve this cold-start problem, given the fact that no prior information on the new user exists, one might think of a random recommendation of rate-plans.
  • The authors approach this by applying multi-armed bandit algorithms.
  • Section 5 presents some experimental results and discussions.

3 Problem Definition

  • Recommending a rate-plan for a new mobile telephony user differs from traditional recommender systems.
  • Traditionally, recommender systems are in a context where users can purchase and own several products, such as books.
  • Finally, no explicit rating for the rate-plans exist.
  • Among k (k ≥ 1) suggested rate-plans, the new user can only select one plan at any given time.
  • Below, the authors suggest taking into account the two most popular measurements: i) the indicator function and ii) the correlation value.

4 Bandit Algorithms for the CSAR Problem

  • The game of the recommender system is to repeatedly pick one of the rate-plans and suggest it to a new user whenever she enters the system.
  • The ultimate goal is to maximize the cumulative reward.
  • Note that the setting in present context is slightly different from traditional MABs.
  • In fact, in the case of using the indicator function, the rate-plans not selected by users get a zero reward.
  • Since the distributions of the rate-plans being selected are still unknown, the idea of using the MAB algorithms for the CSAR problem is still valid.

5 Experiments and Results

  • This section details the datasets used in the experiments and the experimental settings, in which the detailed implementation of the proposed methods and of the competing algorithms is provided; it then presents an analysis and discussion of the experimental results.
  • The authors use two different real-world client datasets from two brands of a major international Telco operator.
  • These two datasets were collected during the first quarter of 2013.
  • The first brand's dataset contains the descriptive features of 16 rate-plans, as well as information about the plans used by 3066 users.
  • Each user has chosen a rate-plan that fits their needs.

5.2 Experimental Settings

  • The first and most naïve approach for cold-start recommendation at Telcos is to randomly choose a rate-plan to recommend to a new user.
  • This algorithm is very efficient and seems reasonable when no description of the users is available.
  • The second trivial approach is to recommend the most popular rate-plan (Most common) to the new user.
  • This is a sensible approach in terms of efficiency, and many operators apply it.
  • The UCB algorithm estimates the value UCB_tj for each plan j at time t.
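The UCB value mentioned here can be sketched with the standard UCB1 score, µ̂_j + √(2 ln t / t_j), where µ̂_j is the empirical mean reward of plan j and t_j the number of times it has been recommended. The state values below are hypothetical, not taken from the paper's datasets:

```python
import math

def ucb_scores(avg_reward, pull_counts, t):
    # UCB1 value per plan: empirical mean plus an exploration bonus
    # that shrinks as a plan is recommended more often.
    return [mu + math.sqrt(2 * math.log(t) / n) if n > 0 else float("inf")
            for mu, n in zip(avg_reward, pull_counts)]

# hypothetical state after t = 100 recommendations over 3 plans
scores = ucb_scores([0.30, 0.55, 0.40], [50, 30, 20], t=100)
best = max(range(3), key=lambda j: scores[j])
print(best)
```

A plan that has rarely been recommended keeps a large bonus, so it is still tried occasionally even when its current average reward is low.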

5.3 Results and Analysis

  • Table 2 shows the performances of the six different approaches for the cold-start problem on the two different real-world client datasets DS1 and DS2.
  • It can be seen from the table that the random approach provided very poor results in both datasets.
  • This forces the ε-greedy algorithm to follow the best rate-plan (i.e. the rate-plan with the maximal average reward) all the time.
  • The UCB algorithm gave surprisingly good precision and prediction results.
  • The reason is that the UCB approach has a good strategy for balancing the exploitation of the current best rate-plan with the exploration of other rate-plans that are also of interest to new users.

6 Conclusions and Future Research

  • This work approaches recommending rate-plans to completely new users at Telco, without any prior information on them.
  • An experiment was conducted on two different real-world client datasets from two brands of a major international Telco operator.
  • From the experimental results, the authors observed that the UCB algorithm clearly outperforms traditional naïve approaches, as well as other classical multi-armed bandit algorithms.
  • Improving the precision and AFP further remains desirable.


Using Multi-armed Bandit to Solve Cold-start
Problems in Recommender Systems at Telco
Hai Thanh Nguyen¹ and Anders Kofod-Petersen¹,²
¹ Telenor Research, 7052 Trondheim, Norway
{HaiThanh.Nguyen|Anders.Kofod-Petersen}@telenor.com
² Department of Computer and Information Science (IDI),
Norwegian University of Science and Technology (NTNU), 7491 Trondheim, Norway
anderpe@idi.ntnu.no
Abstract. Recommending best-fit rate-plans for new users is a challenge for the Telco industry. Rate-plans differ from most traditional products in that a user normally has only one product at any given time. This, combined with no background knowledge on new users, hinders traditional recommender systems. Many Telcos today use trivial approaches, such as picking a random plan or the most common plan in use. The work presented here shows that these methods perform poorly. We propose a new approach based on multi-armed bandit algorithms to automatically recommend rate-plans for new users. An experiment conducted on two different real-world datasets from two brands of a major international Telco operator shows promising results.

Key words: multi-armed bandit, cold-start, recommender systems, telecom, rate-plan.
1 Introduction
The Telco industry does not at first glance appear to be of particular interest from a recommender system perspective. Telcos do not commonly supply a lot of services; most generally, they supply subscriptions, or rate-plans, either pre-paid or post-paid. However, recommending the optimal rate-plans for users in general, and new users in particular, can be challenging.
Suggesting a rate-plan for a new user is a typical cold-start user problem
(following the separation suggested by Park et al., [1]). This problem has also
been identified under slightly different names, such as: the new user problem [2],
the cold start problem [3] or new-user ramp-up problem [4]. However, the fact
that a customer traditionally only has one rate-plan at any given time increases
the difficulty of this problem. Compared with a more traditional recommender problem, where the user-item matrix might merely be sparse, here the matrix is completely empty.
To solve this cold-start problem, given the fact that no prior information on
the new user exists, one might think of a random recommendation of rate-plans.

However, the chance that the recommended plan will be accepted by the new user is small. In fact, given n available rate-plans, the probability that a randomly picked plan is accepted is only 1/n. We say this approach has too much randomness in its recommendations.
Another possibility for solving this problem is to use the distribution of se-
lected plans from existing users. Concretely, it is sensible to recommend the most
popular rate-plan to the new user. By doing this, we assume that there is a fixed
distribution behind the choice of rate-plans by the new users. However, in reality
and also in the experiment below, we observe that this is not the case. We say this method exploits the most popular rate-plan too heavily.
The idea now is to have a better solution to control the randomness in the
exploration of different rate-plans while keeping the exploitation of the most
popular rate-plan at any given time. This is the usual dilemma between Exploitation (of already available knowledge) and Exploration (of uncertainty) encountered in sequential decision-making under uncertainty. It has been studied
for decades in the multi-armed bandit framework. The work presented here,
attempts to tackle the cold-start user problem by recommending a plan that
will appeal to the user in question, rather than the best plan. We approach this
by applying the multi-armed bandit algorithms.
The multi-armed bandit (MAB) is a classical problem in decision theory
[5,6,7]. It models a machine with K arms, each of which has a different and
unknown distribution of rewards. The goal of the player is to repeatedly pull the
arms to maximise the expected total reward. However, since the player does not
know the distribution of rewards, he needs to explore different arms and at the
same time exploit the current optimal arm (i.e. the arm with the current highest
cumulative reward).
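The explore/exploit trade-off described above can be illustrated with a minimal simulated bandit. This is a sketch, not the paper's experiment: the three Bernoulli arm probabilities are hypothetical, and ε here denotes the exploration probability (the common convention):

```python
import random

def pull(arm_probs, arm):
    # Bernoulli reward drawn from the arm's unknown success probability.
    return 1 if random.random() < arm_probs[arm] else 0

def epsilon_greedy(arm_probs, rounds=10000, eps=0.1):
    k = len(arm_probs)
    counts = [0] * k       # times each arm was pulled
    values = [0.0] * k     # empirical mean reward per arm
    total = 0
    for _ in range(rounds):
        if random.random() < eps:
            arm = random.randrange(k)                         # explore
        else:
            arm = max(range(k), key=lambda j: values[j])      # exploit
        r = pull(arm_probs, arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]        # incremental mean
        total += r
    return total

random.seed(0)
probs = [0.2, 0.5, 0.8]    # hypothetical arm distributions, unknown to the player
print(epsilon_greedy(probs))
```

Because the player does not know `probs`, it must occasionally sample poor arms; the cumulative reward nevertheless approaches what pulling the best arm alone would give.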
To evaluate our MAB approach in solving the cold-start user problem at
Telco, we conduct experiments on real-world datasets and compare it with trivial
approaches, which include the random and most popular method. Experimental
results show that our proposed approach improves upon the trivial ones.
The paper is organised as follows: Section 2 gives an overview of related
works; Section 3 provides a formal definition of the cold-start problem in the
rate-plan recommender system at Telco. We describe our proposed approaches
in Section 4. Section 5 presents some experimental results and discussions. The
paper ends with a summary of our findings and a discussion on future work.
2 Related Work
Unfortunately, there are very few examples of research regarding rate-plan rec-
ommender systems for Telco, in particular with respect to the cold-start prob-
lem. Examples include, Thomas et al., who describe how to recommend best-fit
recharges for pre-paid users [8]. Soonsiripanichkul et al., employes a na¨ıve Bayes
classifier to infer which rate-plan to suggest to existing users [9]. Both use existing
data on customers’ usage patterns and do not address the cold-start problem.

In general, one common strategy for mitigating the cold-start user problem
is to gather demographic data. It is assumed that users who share a common
background also share a common taste in products. Examples include Lekakos
and Giaglis [10], where lifestyle information is employed. This includes age, mar-
ital status and education, as well as preferences on eight television genres. The
authors report that this approach is the most effective way of dealing with the
cold-start user problem in sparse environments.
A similar thought underlies the work by Lam et al., [11] where an aspect
model (see e.g. [12]) including age, gender and job is used. This information is
used to calculate a probability model that classifies users into user groups and estimates how well liked an item is by each user group.
Other examples of applying demographic information for mitigating the cold-
start user problem exists, e.g. [13,14,15]. All the solutions above use similar
demographic information; most commonly age, occupation and gender. Most of
the solutions ask for less than five pieces of information. Even though five is a
comparatively small number, the user must still answer these questions. Users do
generally not like to answer a lot of questions, yet expect reasonable performance
from the first interaction with the system [16].
Zigoris and Zhang [16] suggest using a two-part Bayesian model, where the prior probability is based on the existing user population and the data likelihood is based on the data supplied by the user. Thus, when a new user enters the system, little is known about that user and the prior distribution is the main contributor. As the user interacts with the system, the data likelihood becomes more and more important. This approach performs well for cold-start users. Other similar approaches can be found in [17], suggesting a Markov mixture model, and [18], which suggests a statistical user language model that integrates an individual model, a group model and a global model.
Our study differs from previous research on the cold-start problem, as no de-
mographic information is taken into account. Only the information on selected
plans of previous users is available to the recommender engine. This assumption
makes the cold-start problem even harder to solve. However, we leave the issue of collecting more information from users, and how to use it in cold-start recommender systems, for future work.
3 Problem Definition
Recommending a rate-plan for a new mobile telephony user differs from tradi-
tional recommender systems. Traditionally, recommender systems are in a con-
text where users can purchase and own several products, such as books; Rate-
plans are different in the sense that one user can have any number of rate-plans,
but typically only one plan at any given time. Further, the user will typically
have the same product for an extended time period. Finally, no explicit rating
for the rate-plans exist. We call this problem the Cold Start Alternative Recom-
mendation (CSAR) problem and below is its formal definition.

Let U = {u_1, . . . , u_T} be the set of T new users. Assume that we have a set P of n rate-plans to recommend to a new user: P = {p_1, . . . , p_n}, where each plan p_i (i = 1, . . . , n) is described by m features {f_1, . . . , f_m}, such as price, number of included SMS, number of voice minutes included and so on. Among k (k ≥ 1) suggested rate-plans, the new user can only select one plan at any given time.
Assume that at a given time t a new user u_t arrives and the system recommends a rate-plan p_t without any knowledge of the new user. Depending on her needs, she will accept the offer or select another rate-plan. We want to design an algorithm that can find a best-fit rate-plan for the new user. Let need_t be a vector describing the user's demand: need_t = (need_t1, . . . , need_tm), where each component need_tj corresponds to feature f_j of the rate-plans. If we denote the similarity value between the recommended plan p_t and the actual demand of the new user u_t by similarity(need_t, p_t), then the objective when solving the CSAR problem is to select the rate-plans p_t that maximize the following so-called "cumulative reward" (Reward) over all T new users:

Reward_T = Σ_{t=1..T} similarity(need_t, p_t)
The CSAR problem would be easy to solve if we knew the user's needs need_t. The task then becomes straightforward: select the rate-plan that provides the maximal value of similarity(need_t, p_t) over all available plans. As mentioned, it is not possible to calculate similarity(need_t, p_t), since need_t is not available. We suggest studying an approximation of the CSAR problem, where we consider the similarity value between the recommended plan p_t and the actual selection p*_t of the new user. By doing this, we wish to achieve recommendations as close as possible to the actual choice made by the user. The actual choice is also considered her temporary best-fit plan. Formally, we want to maximize the following so-called "reward":

Reward_T = Σ_{t=1..T} similarity(p_t, p*_t)
There are many ways to define the similarity value between the two vectors p_t and p*_t. Below, we take into account the two most popular measurements, which are i) the indicator function and ii) the correlation value.

Indicator function If we use the indicator function as the similarity measurement, then the problem becomes to design an algorithm that predicts the rate-plan p*_t chosen by the new user. The cumulative reward is now the following: Reward^(1)_T = Σ_{t=1..T} 1I(p_t = p*_t), where 1I(p_t = p*_t) is an indicator function which is equal to 0 if p_t ≠ p*_t and to 1 otherwise. To evaluate any algorithm solving this problem, we can use the classical precision measurement:

Precision_T = (1/T) · Reward^(1)_T
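The indicator reward and the precision measurement can be sketched directly; the plan identifiers below are hypothetical:

```python
def indicator_reward(recommended, selected):
    # Reward^(1)_T: counts the rounds where the recommended plan
    # matches the plan the user actually selected.
    return sum(1 if p == p_star else 0
               for p, p_star in zip(recommended, selected))

recommended = ["planA", "planB", "planA", "planC"]  # hypothetical recommendations
selected    = ["planA", "planA", "planA", "planC"]  # hypothetical user choices
T = len(recommended)
reward = indicator_reward(recommended, selected)
precision = reward / T
print(reward, precision)  # 3 0.75
```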

Correlation value In the second case, we study how similar the recommended rate-plan p_t is to the actual selection p*_t of the new user in terms of the features. Generally, when a new user purchases a rate-plan, she looks at the features describing the different rate-plans, including the recommended rate-plan p_t. Finally she picks a plan p*_t that we can assume is perceived as the temporary best fit for her. Therefore, it is sensible to choose the correlation coefficient as the similarity measurement between plans, and the task is to maximize the following so-called "cumulative reward": Reward^(2)_T = Σ_{t=1..T} Corr(p_t, p*_t), where Corr(p_t, p*_t) is the correlation value between the two vectors p_t and p*_t.

Possible correlation measures include the Pearson correlation and the Kendall correlation. Since the actual demand of new users is not available at the time they enter, it is fair to treat all the features equally in the correlation calculation. While solving this problem, we try to recommend a rate-plan that is sufficiently good in terms of the features and that the user will accept. Thus, classic precision measurements are not applicable. We therefore define the Average-Feature Prediction (AFP) as a new evaluation measurement of how much of the features of the rate-plans chosen by T new users are predictable on average:

AFP_T = (1/T) · Reward^(2)_T
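A minimal sketch of the correlation reward and the AFP measurement, using the Pearson correlation over plan feature vectors. The 3-feature plan vectors (price, SMS included, voice minutes) are hypothetical, not values from the paper's datasets:

```python
from statistics import mean

def pearson(x, y):
    # Pearson correlation between two equal-length feature vectors.
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# hypothetical (price, SMS, voice minutes) vectors per round:
recommended = [(199, 100, 300), (99, 50, 100)]   # plans the system suggested
selected    = [(249, 120, 400), (149, 40, 150)]  # plans the users actually chose
T = len(recommended)
afp = sum(pearson(p, s) for p, s in zip(recommended, selected)) / T
print(round(afp, 3))
```

Unlike the indicator reward, a near-miss recommendation whose features resemble the chosen plan still earns a high per-round reward.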
4 Bandit Algorithms for the CSAR Problem
Based on the idea of the multi-armed bandit [5,6,7], in the following we translate the new CSAR problem into a bandit problem.

Let us consider a set P of n available rate-plans to recommend to T completely new users. Each plan is associated with an unknown distribution of being selected by users. The game of the recommender system is to repeatedly pick one of the rate-plans and suggest it to a new user whenever she enters the system. The ultimate goal is to maximize the cumulative reward. As defined in the previous section, the reward for our recommender system is the similarity value similarity(p_t, p*_t). Note that the setting in the present context is slightly different from traditional MABs. In a traditional MAB only the reward of the selected arm is revealed. In our case all the non-selected arms also get rewards after the recommendation is made. In fact, in the case of using the indicator function, the rate-plans not selected by users get a zero reward. In the case of using the correlation value, the rewards of the non-selected rate-plans are the correlation values between the two vectors p and p*. However, since the distributions of the rate-plans being selected are still unknown, the idea of using MAB algorithms for the CSAR problem is still valid. The following three MAB algorithms are used:

ε-greedy [7] aims at picking the rate-plan that is currently considered the best (i.e. the rate-plan that has the maximal average reward) with probability ε (exploit current knowledge), and picks a plan uniformly at random with probability 1 − ε (explore to improve knowledge). Typically, ε is varied over time so that the algorithm gets greedier and greedier as knowledge is gathered.
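As a sketch of how ε-greedy could drive the CSAR recommendation loop with indicator rewards: this is a simplified simulation over a hypothetical stream of user plan selections, and, unlike the full-feedback setting described above, it updates only the recommended plan. Here ε denotes the exploitation probability:

```python
import random

def recommend_eps_greedy(plan_choices, n_plans, eps=0.9):
    # Sequentially recommend a plan to each arriving user; the reward is
    # the indicator of whether the recommendation matches the user's choice.
    counts = [0] * n_plans      # times each plan was recommended
    rewards = [0.0] * n_plans   # accumulated indicator rewards per plan
    total = 0
    for chosen in plan_choices:           # chosen = plan the user actually picks
        if random.random() < eps and any(counts):
            avg = [rewards[j] / counts[j] if counts[j] else 0.0
                   for j in range(n_plans)]
            rec = max(range(n_plans), key=lambda j: avg[j])   # exploit
        else:
            rec = random.randrange(n_plans)                   # explore
        r = 1 if rec == chosen else 0
        counts[rec] += 1
        rewards[rec] += r
        total += r
    return total

random.seed(1)
# hypothetical stream of user selections over 5 plans; plan 2 is most popular
users = random.choices(range(5), weights=[1, 1, 5, 1, 2], k=2000)
hits = recommend_eps_greedy(users, n_plans=5)
print(hits / len(users))   # achieved precision
```

In this sketch the algorithm converges toward recommending the most frequently chosen plan, while the random exploration keeps sampling the alternatives.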

Citations
Journal ArticleDOI
TL;DR: In this article, the authors perform a systematic literature review (SLR) to shed light on the topic of Multi-Armed Bandits (MAB) in the recommendation field.
Abstract: Recommender Systems (RSs) have assumed a crucial role in several digital companies by directly affecting their key performance indicators. Nowadays, in this era of big data, the information available about users and items has been continually updated and the application of traditional batch learning paradigms has become more restricted. In this sense, the current efforts in the recommendation field have concerned about this online environment and modeled their systems as a Multi-Armed Bandit (MAB) problem. Nevertheless, there is not a consensus about the best practices to design, perform, and evaluate the MAB implementations in the recommendation field. Thus, this work performs a systematic literature review (SLR) to shed light on this new topic. By inspecting 1327 articles published from the last twenty years (2000–2020), this work: (1) consolidates an updated picture of the main research conducted in this area so far; (2) highlights the most used concepts and methods, their core characteristics, and main limitations; and (3) evaluates the applicability of MAB-based recommendation approaches in some traditional RSs’ challenges, such as data sparsity, scalability, cold-start, and explainability. These discussions and analyzes also allow us to identify several gaps in the current literature, providing a strong guideline for future research.

16 citations

Book ChapterDOI
11 May 2021
TL;DR: T-LinUCB takes advantage of prior recommendation observations from multiple domains to initialize the new arms' parameters so as to circumvent the lack of data arising from the cold-start problem.
Abstract: Cross-domain recommendations have long been studied in traditional recommender systems, especially to solve the cold-start problem. Although recent approaches to dynamic personalized recommendation have leveraged the power of contextual bandits to benefit from the exploitation-exploration paradigm, very few works have been conducted on cross-domain recommendation in this setting. We propose a novel approach to solve the cold-start problem under the contextual bandit setting through the cross-domain approach. Our developed algorithm, T-LinUCB, takes advantage of prior recommendation observations from multiple domains to initialize the new arms’ parameters so as to circumvent the lack of data arising from the cold-start problem. Our bandits therefore possess knowledge upon starting which yields better recommendation and faster convergence. We provide both a regret analysis and an experimental evaluation. Our approach outperforms the baseline, LinUCB, and experiment results demonstrate the benefits of our model.

4 citations

Proceedings ArticleDOI
14 Aug 2022
TL;DR: A tailored Dual Alignment User Clustering (DAUC) model is proposed, which applies a sample-wise contrastive alignment and a distribution-wise adversarial alignment to eliminate the gap between active users' and cold-start users' app usage behavior.
Abstract: This paper reports our recent practice of recommending articles to cold-start users at Tencent. Transferring knowledge from information-rich domains to help user modeling is an effective way to address the user-side cold-start problem. Our previous work demonstrated that general-purpose user embeddings based on mobile app usage helped article recommendations. However, high-dimensional embeddings are cumbersome for online usage, thus limiting the adoption. On the other hand, user clustering, which partitions users into several groups, can provide a lightweight, online-friendly, and explainable way to help recommendations. Effective user clustering for article recommendations based on mobile app usage faces unique challenges, including (1) the gap between an active user's behavior of mobile app usage and article reading, and (2) the gap between mobile app usage patterns of active and cold-start users. To address the challenges, we propose a tailored Dual Alignment User Clustering (DAUC) model, which applies a sample-wise contrastive alignment to eliminate the gap between active users' mobile app usage and article reading behavior, and a distribution-wise adversarial alignment to eliminate the gap between active users' and cold-start users' app usage behavior. With DAUC, cold-start recommendation-optimized user clustering based on mobile app usage can be achieved. On top of the user clusters, we further build candidate generation strategies, real-time features, and corresponding ranking models without much engineering difficulty. Both online and offline experiments demonstrate the effectiveness of our work.
Journal ArticleDOI
TL;DR: This research evaluates how multi-armed bandit strategies optimize the bid size in a commercial demand-side platform (DSP) that buys inventory through ad exchanges and shows a clear and substantial economic benefit for ad buyers using DSPs.
Abstract: Online advertisements are bought through a mechanism called real-time bidding (RTB). In RTB, the ads are auctioned in real-time on every webpage load. The ad auctions can be of two types: second-price or first-price auctions. In second-price auctions, the bidder with the highest bid wins the auction, but they only pay the second-highest bid. This paper focuses on first-price auctions, where the buyer pays the amount that they bid. This research evaluates how multi-armed bandit strategies optimize the bid size in a commercial demand-side platform (DSP) that buys inventory through ad exchanges. First, we analyze seven multi-armed bandit algorithms on two different offline real datasets gathered from real second-price auctions. Then, we test and compare the performance of three algorithms in a production environment. Our results show that real data from second-price auctions can be used successfully to model first-price auctions. Moreover, we found that the trained multi-armed bandit algorithms reduce the bidding costs considerably compared to the baseline (naïve approach) on average 29%and optimize the whole budget by slightly reducing the win rate (on average 7.7%). Our findings, tested in a real scenario, show a clear and substantial economic benefit for ad buyers using DSPs.
References
Journal ArticleDOI
TL;DR: This paper presents an overview of the field of recommender systems and describes the current generation of recommendation methods that are usually classified into the following three main categories: content-based, collaborative, and hybrid recommendation approaches.
Abstract: This paper presents an overview of the field of recommender systems and describes the current generation of recommendation methods that are usually classified into the following three main categories: content-based, collaborative, and hybrid recommendation approaches. This paper also describes various limitations of current recommendation methods and discusses possible extensions that can improve recommendation capabilities and make recommender systems applicable to an even broader range of applications. These extensions include, among others, an improvement of understanding of users and items, incorporation of the contextual information into the recommendation process, support for multicriteria ratings, and a provision of more flexible and less intrusive types of recommendations.

9,873 citations


"Using Multi-armed Bandit to Solve C..." refers background in this paper

  • ...This problem has also been identified under slightly different names, such as: the new user problem [2], the cold start problem [3] or new-user ramp-up problem [4]....


Journal ArticleDOI
TL;DR: This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
Abstract: Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions while taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is the loss due to the fact that the globally optimal policy is not followed all the times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first ones to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.

6,361 citations


"Using Multi-armed Bandit to Solve C..." refers background or methods in this paper

  • ...The UCB algorithm estimates the value UCB_tj for each plan....

  • ...The UCB gave us a surprisingly good precision and prediction results....

  • ...UCB [7] consists of selecting the rate-plan that maximises the following function:...

  • ...The following three MAB algorithms are being used: ε-greedy [7] aims at picking up the rate-plan that is currently considered the best (i....

  • ...The multi-armed bandit (MAB) is a classical problem in decision theory [5,6,7]....

Journal ArticleDOI
TL;DR: This paper surveys the landscape of actual and possible hybrid recommenders, and introduces a novel hybrid, EntreeC, a system that combines knowledge-based recommendation and collaborative filtering to recommend restaurants, and shows that semantic ratings obtained from the knowledge- based part of the system enhance the effectiveness of collaborative filtering.
Abstract: Recommender systems represent user preferences for the purpose of suggesting items to purchase or examine They have become fundamental applications in electronic commerce and information access, providing suggestions that effectively prune large information spaces so that users are directed toward those items that best meet their needs and preferences A variety of techniques have been proposed for performing recommendation, including content-based, collaborative, knowledge-based and other techniques To improve performance, these methods have sometimes been combined in hybrid recommenders This paper surveys the landscape of actual and possible hybrid recommenders, and introduces a novel hybrid, EntreeC, a system that combines knowledge-based recommendation and collaborative filtering to recommend restaurants Further, we show that semantic ratings obtained from the knowledge-based part of the system enhance the effectiveness of collaborative filtering

3,883 citations


"Using Multi-armed Bandit to Solve C..." refers background in this paper

  • ...This problem has also been identified under slightly different names, such as: the new user problem [2], the cold start problem [3] or new-user ramp-up problem [4]....


Journal ArticleDOI
TL;DR: A solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs.
Abstract: In the multiarmed bandit problem, a gambler must decide which arm of K nonidentical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines. In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of T plays, we prove that the per-round payoff of our algorithm approaches that of the best arm at the rate O(T-1/2). We show by a matching lower bound that this is the best possible. We also prove that our algorithm approaches the per-round payoff of any set of strategies at a similar rate: if the best strategy is chosen from a pool of N strategies, then our algorithm approaches the per-round payoff of the strategy at the rate O((log N1/2 T-1/2). Finally, we apply our results to the problem of playing an unknown repeated matrix game. We show that our algorithm approaches the minimax payoff of the unknown game at the rate O(T-1/2).

2,370 citations


"Using Multi-armed Bandit to Solve C..." refers background or methods in this paper

  • ...The case of EXP3 shows even worse performance than the ε-greedy....

    [...]

  • ...EXP3 [19] selects a rate-plan according to a distribution, which is a mixture of the uniform distribution and a distribution that assigns each plan a probability mass exponential in the estimated cumulative rewards for that plan....

    [...]

  • ...Finally, EXP3 selects a plan according to a give distribution, as described in [19]....

    [...]

  • ...In this equation, µ̂_j favours a greedy selection (exploitation), while the second term √(2 ln t / t_j) favours exploration driven by uncertainty; it is a confidence interval on the true value of the expected reward for plan j. EXP3 [19] selects a rate-plan according to a distribution, which is a mixture of the uniform distribution and a distribution that assigns each plan a probability mass exponential in the estimated cumulative rewards for that plan....

    [...]
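The UCB score quoted in the citation context above, µ̂_j + √(2 ln t / t_j), can be sketched as a selection function. This is an illustrative sketch under the paper's description; the names `ucb1_select`, `avg_reward`, and `counts` are assumptions:

```python
import math

def ucb1_select(avg_reward, counts, t):
    """Return the index j maximizing avg_reward[j] + sqrt(2*ln(t)/counts[j]).

    The first term exploits the current average reward; the second is a
    confidence bound that shrinks as a plan is tried more often, so
    rarely recommended plans keep getting explored.
    """
    def score(j):
        if counts[j] == 0:
            return float("inf")  # try every plan at least once
        return avg_reward[j] + math.sqrt(2 * math.log(t) / counts[j])
    return max(range(len(counts)), key=score)
```

Note that a plan recommended only a few times keeps a wide confidence interval, which is exactly the trade-off the paper's UCB discussion refers to.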

Frequently Asked Questions (9)
Q1. What have the authors contributed in "Using multi-armed bandit to solve cold-start problems in recommender systems at telco" ?

The authors propose a new approach based on the multi-armed bandit algorithms to automatically recommend rate-plans for new users. An experiment is conducted on two different real-world datasets from two brands of a major international Telco operator showing promising results. 

The authors would like to extend their gratitude to Professor Helge Langseth at the Department of Computer and Information Science, at the Norwegian University of Science and Technology (NTNU), and Dr. Humberto N. Castejón Martínez and Dr. Kenth Engø-Monsen at Telenor Research; without whom this work would not have been possible. 

If the authors use the indicator function as the similarity measurement, then the problem becomes to design an algorithm that predicts the rate-plan p∗_t chosen by the new user. 

In the case of using the correlation value, the rewards of the non-selected rate-plans will be the correlation value between the two vectors p and p∗. 

The game of the recommender system is to repeatedly pick one of the rate-plans and suggest it to a new user whenever she enters the system. 

If the authors denote the similarity value between the recommended plan p_t and the actual demand of the new user u_t by similarity(need_t, p_t), then the objective when solving the CSAR problem is to select the rate-plans p_t that maximize the following so-called "cumulative reward" over all T new users: Reward_T = ∑_{t=1}^{T} similarity(need_t, p_t). The CSAR problem would be easy to solve if the authors knew about the user's needs need_t. 
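The cumulative reward and the precision measurement described in these answers can be sketched in a few lines; the function names and the indicator similarity below are illustrative assumptions, not code from the paper:

```python
def cumulative_reward(recommended, actual, similarity):
    """Reward_T = sum over t of similarity(need_t, p_t)."""
    return sum(similarity(a, p) for p, a in zip(recommended, actual))

def precision(recommended, actual, similarity):
    """Precision_T = Reward_T / T, the paper's Eq. (1)."""
    return cumulative_reward(recommended, actual, similarity) / len(recommended)

# With the indicator function as similarity, the reward simply counts
# how often the recommended plan matches the plan the user chose.
def indicator(need, plan):
    return 1.0 if need == plan else 0.0
```

With the indicator similarity, precision reduces to the fraction of new users who received exactly the plan they ended up choosing.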

To explain this better: looking at the UCB algorithm as described in the previous section, the authors see that the recommendation of a rate-plan is the result of solving the trade-off between the average reward and the number of times the plan has been selected so far by users. 

To evaluate any algorithm solving this problem, the authors can use the classical precision measurement: Precision_T = Reward_T / T (1) 

This is also true for the second dataset, where a randomly recommended rate-plan only has a 1/13 = 7.69% probability of being correct.