
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. IT-16, NO. 2, MARCH 1970
The Two-Armed-Bandit Problem With
Time-Invariant Finite Memory
THOMAS M. COVER AND MARTIN E. HELLMAN
Abstract: This paper solves the classical two-armed-bandit problem under the finite-memory constraint described below. Given are probability densities p_0 and p_1, and two experiments A and B. It is not known which density is associated with which experiment. Thus the experimental outcome Y of experiment A is as likely to be distributed according to p_0 as it is to be distributed according to p_1. It is desired to sequentially choose an experiment to be performed on the basis of past observations according to the algorithm T_n = f(T_{n-1}, e_n, Y_n), e_n = e(T_{n-1}), where T_n \in \{1, 2, \cdots, m\} is the state of memory at time n, e_n \in \{A, B\} is the choice of experiment, and Y_n is the random variable observation. The goal is to maximize the asymptotic proportion r of uses of the experiment associated with density p_0.

Let l(y) = p_0(y)/p_1(y), and let \underline{l} and \bar{l} denote the almost everywhere greatest lower bound and least upper bound on l(y). Let l = \max\{\bar{l}, 1/\underline{l}\}. Then the optimal value of r, over all m-state algorithms (f, e), will be shown to be l^{m-1}/(l^{m-1} + 1). An \epsilon-optimal family of m-state algorithms will be demonstrated. In general, optimal algorithms do not exist, and \epsilon-optimal algorithms require artificial randomization.
I. INTRODUCTION

Suppose one is given two coins, labeled A and B. Suppose also that it is known that one of the coins has bias p_1 towards heads and the other has bias p_2 towards heads, but it is not known which coin has which bias. At each trial a coin is to be selected and tossed, and it is desired to maximize the proportion of heads (successes) achieved in the limit as the number of trials tends to infinity. An equivalent objective is to maximize the proportion of tosses using the coin with the larger bias. How should the choice of coin at trial n depend on the previous outcomes, in order to achieve this goal? This problem is commonly referred to as the sequential design of experiments or the two-armed-bandit problem (TABP) [1]-[3].
Note that this problem combines hypothesis testing
(which coin has which bias?) with the added degree of
freedom that the experimenter may select his experiment
(A or B) at each toss. The experimenter must utilize
his information to maximize the proportion of successes.
This paper will be concerned with a generalized TABP
in which the coins may have an infinite number of
sides. A further generalization of the TABP to an infinite
number of coins will be provided in Section VI. These
problems will be solved under a finite-memory constraint,
i.e., the experimenter is not allowed to remember the outcomes of all previous trials, but only a finite-valued statistic. On the basis of this statistic, the next coin must be chosen.

Manuscript received May 19, 1969; revised September 29, 1969. This work was supported in part under Contract AF 49(638)-1517 and under the NSF Graduate Fellowship Program.
T. M. Cover is with the Department of Electrical Engineering, Stanford University, Stanford, Calif. 94305.
M. E. Hellman was with the IBM Watson Research Center, Yorktown Heights, N. Y. He is now with the Massachusetts Institute of Technology, Cambridge, Mass. 02139.
Stated more precisely, the experimenter is provided two experiments, A and B. Also given are two probability measures \mathcal{P}_0 and \mathcal{P}_1, defined on the arbitrary probability space (\mathcal{Y}, \mathcal{B}), where \mathcal{Y} is the experimental outcome space and \mathcal{B} is a \sigma-field of subsets of \mathcal{Y}. There are two hypotheses concerning the probability distribution of the experimental outcome Y:

H_0: Y \sim \mathcal{P}_0 under experiment A,  Y \sim \mathcal{P}_1 under experiment B
H_1: Y \sim \mathcal{P}_1 under experiment A,  Y \sim \mathcal{P}_0 under experiment B.   (1)

Let the a priori probabilities of H_0 and H_1 be \pi_0 and \pi_1 respectively, where \pi_0 + \pi_1 = 1. This seemingly Bayesian formulation, in which the priors are specified, is not restrictive, since the set of all admissible algorithms (or the set of all optimal algorithms with respect to the Neyman-Pearson formulation) may be generated by letting \pi_0 take on all values in the unit interval.

Let e_i \in \{A, B\} denote the ith experiment performed and Y_i \in \mathcal{Y} denote the ith experimental outcome. It is assumed that the experimental outcomes are independent in the sense that

P(Y_i, Y_j \mid e_i, e_j, H) = P(Y_i \mid e_i, H) P(Y_j \mid e_j, H),  i \neq j,

where H is the true hypothesis.
A success is said to occur if the experiment associated with \mathcal{P}_0 is performed. At times n = 1, 2, 3, \cdots a choice of experiment e_n is made. Letting

S_n = 1 if success occurs at time n, and S_n = 0 if failure occurs at time n,   (2)

the objective is to maximize

r = \lim_{n \to \infty} E\{(1/n) \sum_{i=1}^{n} S_i\},   (3)

where the expectation is taken with respect to the distribution on the two hypotheses and the distribution on \{S_i\} induced by the experiment selection algorithm. Therefore, r is the expected long-run proportion of successes.

Let the data be summarized by an m-valued statistic T that is updated according to the rule

T_n = f(T_{n-1}, X_n),  T_n \in \{1, 2, \cdots, m\}   (4)

where T_n is the value of T after n observations, X_n = (e_n, Y_n) is the nth observation (note the difference between an observation X = (e, Y) and an experimental outcome Y), and f is a stochastic function. Further, let e_n be constrained to depend on the past outcomes X_1, X_2, \cdots, X_{n-1} only through T_{n-1}, according to the function

e_n = e(T_{n-1}),  n = 1, 2, \cdots,   (5)

where e: \{1, 2, \cdots, m\} \to \{A, B\} is again allowed to be a stochastic function. (The randomization in the functions f and e must, to avoid cheating, be independent of the data.) The size of memory is defined to be m.

The objective is now to find the pair (f, e) that maximizes r for given m, \pi_0, \mathcal{P}_0, and \mathcal{P}_1. For a reformulation in terms of optimal finite-state machines see Section III.
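To make the finite-memory constraint concrete, the following minimal sketch simulates an arbitrary m-state pair (f, e) on binary coins. Everything in it is illustrative scaffolding rather than the paper's construction: the function signatures, the explicit rng argument carrying the data-independent randomization, and the deliberately naive two-state rule at the end are all assumptions of this sketch.

```python
import random

def run_tabp(m, f, e, p0=0.9, p1=0.8, trials=200_000, seed=1):
    """Simulate a time-invariant m-state algorithm (f, e) on the TABP.

    e(t, rng)    -> 'A' or 'B': the experiment chosen in state t      (cf. (5))
    f(t, x, rng) -> next state in {1, ..., m}, x = (experiment, outcome)
                                                                       (cf. (4))
    rng carries the artificial randomization, independent of the data.
    Returns the empirical proportion of uses of the bias-p0 coin.
    """
    rng = random.Random(seed)
    h0 = rng.random() < 0.5                  # nature picks H0 or H1 equiprobably
    bias = {'A': p0 if h0 else p1, 'B': p1 if h0 else p0}
    t, successes = 1, 0
    for _ in range(trials):
        exp = e(t, rng)                      # depends on the past only through t
        y = rng.random() < bias[exp]         # toss the chosen coin
        successes += ((exp == 'A') == h0)    # success: used the p0-experiment
        t = f(t, (exp, y), rng)
    return successes / trials

# A deliberately naive 2-state rule, just to exercise the interface:
f = lambda t, x, rng: 1 if x[1] else 2       # remember whether the last toss won
e = lambda t, rng: 'A' if t == 1 else 'B'
print(run_tabp(2, f, e))
```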
As was previously mentioned, it is not only necessary to test H_0 versus H_1, but also to use the result of the test in an attempt to obtain successes. This produces a conflict. The experimenter may believe H_0 (in which case he should perform A) and yet he may wish to perform B if it would yield more information, thereby increasing the probability of success on future trials. The conflict is between a desire for immediate success and a desire to gather information.

Another conflict exists. A good test requires large memory, but, as mentioned, hypothesis testing may not yield a high proportion of successes. Thus, once the test is completed, a large number of experiments that use the result of the test is desired. However, an m-valued statistic can only count to m. There is a problem in deciding how much memory to allocate to testing and how much to allocate to using the information gathered by testing. Fortunately, the optimal solution that we shall present suggests an interpretation answering this question. The surprising answer is that all of the states of memory may be devoted to hypothesis testing, and the information so gathered may be used to gain successes in a manner that does not interfere with the hypothesis testing.
II. HISTORY OF THE PROBLEM

The TABP was introduced by Robbins [1] in 1952. In that paper there was no constraint on memory and the experiments were restricted to be binary-valued (coin tosses). Robbins argued that a scheme that sampled the inferior coin infinitely often, but with density of sampling tending to zero, yielded r = 1. Here, at a particular time, the inferior coin is defined to be the coin yielding the lower cumulative proportion of heads. Subsequently, Bradt, Johnson, and Karlin [2] and Bradt and Karlin [3] examined generalizations of the TABP in which it was desired to maximize the number of successes in a finite number of trials. This problem remains open in the case where the coin biases (p_1, p_2) have an arbitrary known joint distribution. However, Feldman [4] has solved the generalized version of the TABP corresponding to (1) (with known a priori probabilities) in the infinite-memory case.
The idea of adding a finite-memory constraint is due to Robbins [5]. Robbins defines memory to be of length k if the choice of coin at each trial is allowed to depend only on the outcomes of the k previous trials. Letting X = \{A, B\} \times \mathcal{Y} denote the observation space, the problem becomes one of determining the function e: X^k \to \{A, B\} for which the algorithm

e_{n+1} = e(X_n, X_{n-1}, \cdots, X_{n-k+1})   (6)

X_n = (e_n, Y_n)   (7)

maximizes r. Since X has but four members in Robbins's problem, memory is still finite according to the definition of Section I, with m = 4^k. However, if the experimental outcome space \mathcal{Y} is infinite, an infinite-state memory is needed to recall the last k experimental outcomes.

Although Robbins's original algorithm has been successively improved by Isbell [6], Smith and Pyke [7], and Samuels [8], an optimal scheme has not been established. However, if the choice of coin may also depend on time, the problem has been solved by Cover [9]. A memory k = 2 is sufficient, i.e., there exists an algorithm e for which the scheme
e_{n+1} = e(X_n, X_{n-1}, n)   (8)

achieves an asymptotic proportion of successes r = 1. The algorithm is independent of the biases p_1 and p_2 on the two coins, and thus is optimal (achieves r = 1) for the more general problem of maximizing the asymptotic proportion of heads with two coins having arbitrary unknown biases. This work also implies that, with the definition of memory given in Section I, a memory of m = 4 states is sufficient [10] for a time-varying algorithm to achieve r = 1.
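For comparison with the m-state definition of Section I, here is a brief sketch of Robbins's length-k memory. The stay-on-win, switch-on-loss rule below is only a stand-in for the improved rules of [6]-[8]; the point is the bookkeeping: the decision depends on the last k observation pairs, and a time-invariant automaton realizing an arbitrary such rule needs m = 4^k states when outcomes are binary.

```python
from collections import deque

k = 3
window = deque(maxlen=k)      # holds the last k observations x = (coin, outcome)

def e_next(window):
    """Any map from the last k observations to the next coin is a length-k
    memory rule; this stay-on-win / switch-on-loss choice is illustrative."""
    if not window:
        return 'A'
    coin, heads = window[-1]
    return coin if heads else ('B' if coin == 'A' else 'A')

# Each observation takes |{A, B} x {0, 1}| = 4 values, so the window ranges
# over 4**k tuples: the number of states a time-invariant automaton needs to
# realize an arbitrary length-k rule with binary outcomes.
print(4 ** k)                 # 64 for k = 3
```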
A series of publications following the work of Tsetlin has appeared in the Russian literature [11]-[21] on the behavior of automata in random media, in an attempt to model adaptive or learning systems. In many cases the algorithms considered are similar to the TABP with finite memory of the type defined in Section I. A series of ad hoc expedient automata (i.e., automata that perform better than simply alternating coins at each trial) is examined, but no optimal automata are found. Subsequent work by Fu and Li [22], [23] and Chandrasekaran and Shen [24]-[27] has enlarged the set of algorithms for which the asymptotic behavior has been found. The fundamental problem implicit in [11]-[27] is presented in Section I and solved in this paper. It should be mentioned that the motivation of the previous papers is different from ours in the respect that previous work centered on modeling learning processes by finite-state automata. For this reason, the number of states m was frequently allowed to tend to infinity in the analysis, and the emphasis on optimal m-state automata was lost.

Note this one word of caution. Memory size has been defined to be the number of states of the automaton. This seems to us to be natural. However, we have not included any measure of the complexity of the computation of the
state transition function f and the choice of experiment function e. Fortunately, the optimal function f is rather simple to implement, as can be seen from the example at the end of Section IV. Moreover, if an auxiliary stream of random variables is available, the calculation of f and e may be performed by hard-wired circuitry without memory elements.

Fig. 1. Decision process viewed as a finite-state automaton. (Block diagram: the observation X_n = (e_n, Y_n) drives the memory update T_n = f(T_{n-1}, X_n).)

III. FINITE-STATE MACHINES

The two-armed-bandit problem that will be solved in this section has the form

H_0: Y \sim \mathcal{P}_0 under experiment A,  Y \sim \mathcal{P}_1 under experiment B
H_1: Y \sim \mathcal{P}_1 under experiment A,  Y \sim \mathcal{P}_0 under experiment B   (9)

where \mathcal{P}_0 and \mathcal{P}_1 are arbitrary known probability measures. Thus Y is not restricted to be a binary-valued random variable as in previous work [1], [5]. In Section VI, the solution will be generalized to the form

H_0: Y \sim \mathcal{P}_0 under experiment A,  Y \sim \mathcal{P}_1 under experiment B
H_1: Y \sim \mathcal{P}_2 under experiment A,  Y \sim \mathcal{P}_3 under experiment B.   (10)

Attention will be restricted to the algorithm

T_n = f(T_{n-1}, X_n),  T_n \in \{1, 2, \cdots, m\}   (11)

e_n = e(T_{n-1}),  e_n \in \{A, B\}   (12)

X_n = (e_n, Y_n)   (13)

where T is the state of memory, e the choice of experiment, and Y the resulting observation. A reformulation of this algorithm in the terminology of finite-state machines will be convenient. X and Y will denote random variables, and x and y their outcomes.

Consider a finite-state stochastic sequential machine with state space \mathcal{T} = \{1, 2, \cdots, m\}, input space X = \{A, B\} \times \mathcal{Y}, and output space \{A, B\}. Let the state transition behavior of this machine be specified by a family of m \times m stochastic matrices [p_{ij}(x)] defined for x = (e, y), e \in \{A, B\}, y \in \mathcal{Y}, and i, j \in \{1, 2, \cdots, m\}. Then

p_{ij}(e, y) = \Pr\{T_n = j \mid T_{n-1} = i, X_n = (e, y)\}   (14)

is the conditional probability of transition from memory state i to state j under the observation of experiment e with outcome y.

Let the output function be described by the sequence \alpha_i, 0 \leq \alpha_i \leq 1, i = 1, 2, \cdots, m, with the interpretation that

\alpha_i = \Pr\{e_{n+1} = A \mid T_n = i\}.   (15)

Thus, the next experiment chosen is a random variable depending solely on the past experience as summarized in the current state of memory T_n. The automaton is depicted in Fig. 1.

The state transition matrices conditioned on H_0 and H_1 are given by

p_{ij}^0 = E\{p_{ij}(X) \mid H_0\}   (16)

and

p_{ij}^1 = E\{p_{ij}(X) \mid H_1\}.   (17)

As will be shown in the proof of Theorem 1, these expectations may be explicitly expressed as follows:

p_{ij}^0 = \int [p_{ij}(A, y) \alpha_i f_0(y) + p_{ij}(B, y)(1 - \alpha_i) f_1(y)] \, d\nu(y)   (18)

p_{ij}^1 = \int [p_{ij}(A, y) \alpha_i f_1(y) + p_{ij}(B, y)(1 - \alpha_i) f_0(y)] \, d\nu(y)   (19)

where f_0 and f_1 are the Radon-Nikodym derivatives (densities) of \mathcal{P}_0 and \mathcal{P}_1 with respect to some dominating measure \nu. Define the m \times m matrices P^0 = [p_{ij}^0] and P^1 = [p_{ij}^1], and let \mu^0 and \mu^1 be the stationary probability distributions on the state space \mathcal{T} under H_0 and H_1. The stationary probability distributions are solutions of the matrix equations

\mu^0 = \mu^0 P^0   (20)

\mu^1 = \mu^1 P^1.   (21)

Note that if P^k is irreducible, \mu_i^k is the asymptotic proportion of time spent in state i, conditioned on H_k. Parallel work on hypothesis testing with finite memory [28] establishes that irreducible automata are at least one state better than reducible automata. The same argument applies to the current formulation of the TABP so that, here too, attention will be restricted to irreducible automata.

Letting r_0 and r_1 be the asymptotic proportions of successes under H_0 and H_1, it is seen that

r_0 = \sum_{i=1}^{m} \mu_i^0 \alpha_i   (22)

r_1 = \sum_{i=1}^{m} \mu_i^1 (1 - \alpha_i)   (23)

where the \alpha_i are defined by (15).

If a Bayesian approach is taken and a priori probabilities \pi_0 and \pi_1 (\pi_0 + \pi_1 = 1) are assigned to H_0 and H_1, then

r = \pi_0 r_0 + \pi_1 r_1.   (24)

Although the Bayesian approach will be taken, the results will apply to the Neyman-Pearson formulation as well. In the Neyman-Pearson formulation the problem is to maximize r_1 subject to the constraint r_0 \geq \alpha, for a given level \alpha.
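The quantities (14)-(24) are straightforward to compute numerically when outcomes are binary. The sketch below, which assumes numpy, evaluates r for a small placeholder automaton on the 0.9/0.8 coins; the matrices `up` and `down` and the output vector `alpha` are arbitrary choices for illustration, not the optimal design of the following sections.

```python
import numpy as np

def stationary(P):
    """Left eigenvector of P for the Perron eigenvalue 1: mu = mu P."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()

# Densities on {tails, heads} w.r.t. counting measure: under H0 experiment A
# has density f0 (bias 0.9) and B has f1 (bias 0.8); under H1 the roles swap.
f0 = np.array([0.1, 0.9])
f1 = np.array([0.2, 0.8])

m = 2
alpha = np.array([0.9, 0.1])     # alpha_i = Pr{choose A | state i}   (15)

# p[(e, y)][i, j] = p_ij(e, y)   (14); a placeholder two-state design that
# drifts toward state 1 on evidence for H0 and toward state 2 otherwise.
up, down = np.array([[1., 0.], [1., 0.]]), np.array([[0., 1.], [0., 1.]])
p = {('A', 1): up, ('A', 0): down, ('B', 1): down, ('B', 0): up}

def transition(dA, dB):
    """Average over experiment choice and outcome: equations (18)-(19)."""
    return sum(alpha[:, None] * p[('A', y)] * dA[y]
               + (1 - alpha)[:, None] * p[('B', y)] * dB[y] for y in (0, 1))

P0, P1 = transition(f0, f1), transition(f1, f0)        # (16)-(19)
mu0, mu1 = stationary(P0), stationary(P1)              # (20)-(21)
r0, r1 = mu0 @ alpha, mu1 @ (1 - alpha)                # (22)-(23)
print(0.5 * r0 + 0.5 * r1)                             # (24), pi0 = pi1 = 1/2
```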

Returning to the Bayesian formulation, the goal is to maximize r over all p_{ij}(x) and \alpha_i, i, j = 1, 2, \cdots, m. Designate this maximum value of r by r*. In order to place an upper bound on r it is necessary to relate the parameters of the automaton to the statistics of the problem. The following definitions and theorems will prove useful.

Definitions

Let the measure \nu = \mathcal{P}_0 + \mathcal{P}_1. Thus \mathcal{P}_0 and \mathcal{P}_1 are both absolutely continuous with respect to \nu. Define f_0(y) and f_1(y) to be the Radon-Nikodym derivatives of \mathcal{P}_0 and \mathcal{P}_1 with respect to \nu (f_0 and f_1 are the probability density functions of \mathcal{P}_0 and \mathcal{P}_1 with respect to \nu). Let

l_A(y) = f_0(y)/f_1(y)
l_B(y) = f_1(y)/f_0(y) = 1/l_A(y).   (25)

It is seen that l_A and l_B are the likelihood ratios for an experimental outcome y that results from experiments A and B, respectively. Further define (for C \subseteq \mathcal{Y})

\underline{l}_A = \inf_{\nu(C) > 0} \mathcal{P}_0(C)/\mathcal{P}_1(C)   (26)

\underline{l}_B = \inf_{\nu(C) > 0} \mathcal{P}_1(C)/\mathcal{P}_0(C).   (27)

Thus \underline{l}_A is the almost everywhere (a.e.) minimum likelihood ratio (l.r.) for experiment A, and \bar{l}_A = 1/\underline{l}_B is the a.e. maximum l.r. for experiment A. Similarly, \underline{l}_B and \bar{l}_B = 1/\underline{l}_A are the a.e. minimum and maximum l.r.'s for experiment B. Thus defining

\bar{l} = \max\{\bar{l}_A, \bar{l}_B\},  \underline{l} = \min\{\underline{l}_A, \underline{l}_B\},   (28)

it is seen that

\bar{l} = \max\{\bar{l}_A, 1/\underline{l}_A\}   (29)

\underline{l} = \min\{\underline{l}_A, 1/\bar{l}_A\}   (30)

\bar{l} = 1/\underline{l}.   (31)

The likelihood ratio l(x) of an observation x = (e, y), e \in \{A, B\}, is defined to be

l(x) = l_e(y).   (32)

Obviously, \underline{l} \leq l(x) \leq \bar{l}.

For example, if two unlabeled coins of biases p_1 = 0.7 and p_2 = 0.8 are given, the possible events C are heads and tails, and

\bar{l}_A = \max\{0.7/0.8, 0.3/0.2\} = 3/2   (33a)

\bar{l}_B = \max\{0.8/0.7, 0.2/0.3\} = 8/7   (33b)

\bar{l} = \max\{\bar{l}_A, \bar{l}_B\} = 3/2.   (33c)

Since l_A(tails) = \bar{l} = 3/2 and l_B(tails) = \underline{l} = 2/3, the maximum and minimum likelihood ratio events are given by tails on coin A and tails on coin B, respectively.
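A few lines suffice to check (33a)-(33c) numerically (a sketch; the outcome space is just heads and tails):

```python
f0 = {'H': 0.7, 'T': 0.3}    # density of the bias-0.7 coin
f1 = {'H': 0.8, 'T': 0.2}    # density of the bias-0.8 coin

lA = {y: f0[y] / f1[y] for y in f0}    # l_A(y) = f0(y)/f1(y)   (25)
lB = {y: f1[y] / f0[y] for y in f0}    # l_B(y) = 1/l_A(y)

lbar_A = max(lA.values())              # (33a): 3/2, attained at tails
lbar_B = max(lB.values())              # (33b): 8/7, attained at heads
lbar = max(lbar_A, lbar_B)             # (33c): 3/2
lmin = min(min(lA.values()), min(lB.values()))   # underline-l = 2/3

print(lbar_A, lbar_B, lbar, lmin)      # 1.5 1.1428... 1.5 0.6666...
# extreme events: tails on A attains l-bar, tails on B attains underline-l
assert max(lA, key=lA.get) == 'T' and min(lB, key=lB.get) == 'T'
```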
Theorem 1

For all i, j \in \{1, 2, \cdots, m\},

1/\bar{l} \leq p_{ij}^0 / p_{ij}^1 \leq \bar{l}.   (34)

Proof: From (16) it is seen that

p_{ij}^0 = \Pr\{T_n = j \mid T_{n-1} = i, H_0\}.   (35)

Equivalently,

p_{ij}^0 = \Pr\{T_n = j \mid T_{n-1} = i, H_0, e_n = A\} \Pr\{e_n = A \mid T_{n-1} = i, H_0\}
         + \Pr\{T_n = j \mid T_{n-1} = i, H_0, e_n = B\} \Pr\{e_n = B \mid T_{n-1} = i, H_0\}.   (36)

But, since the choice of e_n is a (randomized) function of T_{n-1} alone,

\Pr\{e_n = A \mid T_{n-1} = i, H_0\} = \Pr\{e_n = A \mid T_{n-1} = i\} = \alpha_i.   (37)

Similarly,

\Pr\{e_n = B \mid T_{n-1} = i, H_0\} = 1 - \alpha_i.   (38)

From (14),

\Pr\{T_n = j \mid T_{n-1} = i, H_0, e_n = A\} = \int p_{ij}(A, y) f_0(y) \, d\nu(y),   (39)

since under H_0 the experimental outcome Y has f_0 as its density function when A is performed. Similarly,

\Pr\{T_n = j \mid T_{n-1} = i, H_0, e_n = B\} = \int p_{ij}(B, y) f_1(y) \, d\nu(y).   (40)

Then (36) becomes

p_{ij}^0 = \alpha_i \int p_{ij}(A, y) f_0(y) \, d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) f_1(y) \, d\nu(y).   (41)

By definition,

f_0(y) = l_A(y) f_1(y)   (42)

f_1(y) = l_B(y) f_0(y)   (43)

so that

p_{ij}^0 = \alpha_i \int p_{ij}(A, y) l_A(y) f_1(y) \, d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) l_B(y) f_0(y) \, d\nu(y).   (44)

Furthermore, l_A(y) \leq \bar{l} and l_B(y) \leq \bar{l} a.e. \nu, since by (28) \bar{l} = \max\{\bar{l}_A, \bar{l}_B\}. Hence

p_{ij}^0 \leq \bar{l} \left[ \alpha_i \int p_{ij}(A, y) f_1(y) \, d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) f_0(y) \, d\nu(y) \right].   (45)

Proceeding similarly it is found that

p_{ij}^1 = \alpha_i \int p_{ij}(A, y) f_1(y) \, d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) f_0(y) \, d\nu(y).   (46)

Combining (45) and (46) yields

p_{ij}^0 / p_{ij}^1 \leq \bar{l},   (47)

thus proving half of the theorem. The other half follows in an analogous manner.

Definition

The state likelihood ratio vector \lambda = (\lambda_1, \cdots, \lambda_m) is defined by

\lambda_i = \mu_i^0 / \mu_i^1,  i = 1, 2, \cdots, m.   (48)
Theorem 2

For an irreducible automaton in which the \lambda_i are arranged in nondecreasing order, the following relation holds:

\lambda_{i+1} / \lambda_i \leq \bar{l}^2.   (49)

Remark: Since it has been noted that irreducible automata can do at least as well as reducible ones, the irreducibility restriction is of no consequence.

Proof: The proof of this theorem follows from Theorem 1, using arguments contained in Lemma 2 of [28]. The reader is referred there for details.

Theorem 3

For an m-state automaton, r is bounded above by r*, where

r* = \max\{\pi_0, \pi_1, (\bar{l}^{2(m-1)} - 2(\pi_0 \pi_1)^{1/2} \bar{l}^{m-1}) / (\bar{l}^{2(m-1)} - 1)\}.   (50)

In the special case \pi_0 = \pi_1 = 1/2,

r* = \bar{l}^{m-1} / (\bar{l}^{m-1} + 1).   (51)

Remark 1: If r* = \pi_0 (or \pi_1), a degenerate situation exists in which the machine that always chooses experiment A (or B) is optimal. In this case memory is not large enough to gather sufficient information to offset the a priori bias [28].

Remark 2: The larger the value of \bar{l}, the larger the resultant proportion of successes r*. Thus, \bar{l} is a measure of the separation between H_0 and H_1.

Example

Before proceeding with the proof of the theorem, an example will be given. Consider the coin-flipping TABP

H_0: p_A = 0.9, p_B = 0.8, with \pi_0 = 1/2
H_1: p_A = 0.8, p_B = 0.9, with \pi_1 = 1/2

where p_A and p_B are the probabilities of the event heads (H) for coins A and B under the appropriate hypothesis. Thus, for example, if coin A is flipped and H_0 is true, then \Pr\{heads\} = 0.9. Calculation shows that

\bar{l} = \max\{\bar{l}_A, \bar{l}_B\} = \max\{9/8, 2\} = 2.   (52)

Thus, for an m-state memory the best possible limiting proportion of uses of the best coin (in this case, coin A) is given by

r* = 2^{m-1} / (2^{m-1} + 1).   (53)

In the next section an automaton will be exhibited that achieves r* arbitrarily closely.

Example

If the coin biases are instead 0.5 and 0.501, the situation is quite different. Here \bar{l} \approx 1.002 and

r* \approx (1.002)^{m-1} / ((1.002)^{m-1} + 1).   (54)

Thus, even m = 500 states yields only a proportion of successes r* \approx e/(e + 1). No 500-state machine can do better.
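Both examples are easy to reproduce (a sketch; the choice m = 5 in the first print is arbitrary):

```python
import math

def r_star(lbar, m):
    """Equation (51): the optimal proportion for pi0 = pi1 = 1/2."""
    L = lbar ** (m - 1)
    return L / (L + 1)

# 0.9 / 0.8 coins: lbar = max{9/8, 2} = 2, so r* = 2^(m-1)/(2^(m-1)+1)   (53)
print(r_star(2.0, 5))                             # 16/17 ~ 0.941 for m = 5

# 0.5 / 0.501 coins: lbar = 0.5/0.499 ~ 1.002                            (54)
lbar = 0.5 / 0.499
print(r_star(lbar, 500), math.e / (math.e + 1))   # both ~ 0.731
```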
Proof: By Theorem 2,

\lambda_2 \leq \lambda_1 \bar{l}^2,  \lambda_3 \leq \lambda_1 \bar{l}^4,  \cdots,  \lambda_m \leq \lambda_1 \bar{l}^{2(m-1)}.

Thus, for all i \in \mathcal{T} = \{1, 2, \cdots, m\},

\lambda_i \leq \lambda_1 \bar{l}^{2(m-1)}.   (55)

Hence

r_0 = \sum_{i=1}^{m} \mu_i^0 \alpha_i = \sum_{i=1}^{m} \lambda_i \mu_i^1 \alpha_i \leq \lambda_1 \bar{l}^{2(m-1)} \sum_{i=1}^{m} \mu_i^1 \alpha_i.   (56)

But

\sum_{i=1}^{m} \mu_i^1 \alpha_i = 1 - r_1,   (57)

so

r_0 \leq \lambda_1 \bar{l}^{2(m-1)} (1 - r_1).   (58)
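As a closing sanity check on Theorems 1 and 3, one can draw random irreducible automata for the 0.9/0.8 example and confirm that the transition-ratio bound (34) holds and that the achieved r never exceeds the r* of (51). This numerical experiment is illustrative only and assumes numpy.

```python
import numpy as np

rng = np.random.default_rng(0)
m, lbar = 3, 2.0                        # 0.9/0.8 coins: lbar = 2   (52)
f0, f1 = np.array([0.1, 0.9]), np.array([0.2, 0.8])

def stationary(P):
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()

worst = 0.0
for _ in range(200):
    # random strictly positive stochastic matrices p_ij(e, y), random alpha_i
    p = {(e, y): rng.dirichlet(np.ones(m), size=m) for e in 'AB' for y in (0, 1)}
    alpha = rng.random(m)
    def trans(dA, dB):
        return sum(alpha[:, None] * p[('A', y)] * dA[y]
                   + (1 - alpha)[:, None] * p[('B', y)] * dB[y] for y in (0, 1))
    P0, P1 = trans(f0, f1), trans(f1, f0)
    # Theorem 1, eq. (34): entries of P0 and P1 differ by at most a factor lbar
    assert np.all(P0 <= lbar * P1 + 1e-12) and np.all(P1 <= lbar * P0 + 1e-12)
    r = 0.5 * (stationary(P0) @ alpha) + 0.5 * (stationary(P1) @ (1 - alpha))
    worst = max(worst, r)

print(worst, lbar ** (m - 1) / (lbar ** (m - 1) + 1))  # r stays below r* = 4/5
```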

References

H. Robbins, "Some aspects of the sequential design of experiments," Bull. Amer. Math. Soc., vol. 58, 1952.
M. E. Hellman and T. M. Cover, "Learning with finite memory," Ann. Math. Statist., vol. 41, 1970.
T. M. Cover, "Hypothesis testing with finite statistics," Ann. Math. Statist., vol. 40, 1969.