The two-armed-bandit problem with time-invariant finite memory

TL;DR: This paper solves the classical two-armed-bandit problem under the finite-memory constraint described below and shows that the optimal value of r , over all m -state algorithms (f, e) , is l^{m-1} / (l^{m-1} + 1) .
Abstract: This paper solves the classical two-armed-bandit problem under the finite-memory constraint described below. Given are probability densities p_0 and p_1 , and two experiments A and B . It is not known which density is associated with which experiment. Thus the experimental outcome Y of experiment A is as likely to be distributed according to p_0 as it is to be distributed according to p_1 . It is desired to sequentially choose an experiment to be performed on the basis of past observations according to the algorithm T_n = f(T_{n-1}, e_n, Y_n), e_n = e(T_{n-1}) , where T_n \in \{1, 2, \cdots, m\} is the state of memory at time n , e_n \in \{A, B\} is the choice of experiment, and Y_n is the random variable observation. The goal is to maximize the asymptotic proportion r of uses of the experiment associated with density p_0 . Let l(y) = p_0 (y) / p_1 (y) , and let \bar{l} and \bar{\bar{l}} denote the almost everywhere greatest lower bound and least upper bound on l(y) . Let l = \max \{\bar{\bar{l}}, 1/\bar{l}\} . Then the optimal value of r , over all m -state algorithms (f, e) , will be shown to be l^{m-1} / (l^{m-1} + 1) . An \epsilon -optimal family of m -state algorithms will be demonstrated. In general, optimal algorithms do not exist, and \epsilon -optimal algorithms require artificial randomization.
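As a quick numerical illustration of the closed form stated in the abstract, the following minimal sketch evaluates l^{m-1} / (l^{m-1} + 1) for a given likelihood-ratio bound l and memory size m. The function name `r_star` and the sample values are assumptions chosen for illustration; the expression is computed in the equivalent form 1 / (1 + l^{-(m-1)}) only to avoid floating-point overflow for large m.

```python
import math


def r_star(l, m):
    """Evaluate l^(m-1) / (l^(m-1) + 1), the optimal proportion from the abstract.

    l is the likelihood-ratio bound (l >= 1 by its definition as a maximum of
    a ratio and its reciprocal) and m is the number of memory states.
    """
    if l < 1.0:
        raise ValueError("l must be at least 1")
    if m < 1:
        raise ValueError("m must be a positive integer")
    # Algebraically identical to l^(m-1) / (l^(m-1) + 1), but the exponent
    # passed to exp() is never positive, so the call cannot overflow.
    return 1.0 / (1.0 + math.exp(-(m - 1) * math.log(l)))


# Illustrative values (assumed): well-separated and nearly identical densities.
well_separated = r_star(2.0, 2)       # 2 / 3
nearly_identical = r_star(1.002, 500)  # close to e / (e + 1)
```

With a single state ( m = 1 ) the expression reduces to 1/2, which matches the intuition that a memoryless rule cannot favor either experiment.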

Summary (2 min read)


  • Suppose one is given two coins, labeled A and B. Suppose also that it is known that one of the coins has bias p_1 towards heads and the other has bias p_2 towards heads, but it is not known which coin has which bias.
  • A further generalization of the TABP to an infinite number of coins will be provided in Section VI.
  • A good test requires large memory, but, as mentioned, hypothesis testing may not yield a high proportion of successes.


  • In that paper there was no constraint on memory and the experiments were restricted to be binary-valued (coin tosses).
  • Feldman [4] has solved the generalized version of the TABP corresponding to (1) (with known a priori probabilities) in the infinite-memory case.
  • The algorithm is independent of the biases p_1 and p_2 on the two coins, and thus is optimal (achieves r = 1) for the more general problem of maximizing the asymptotic proportion of heads with two coins having arbitrary unknown biases.
  • It should be mentioned that the motivation of the previous papers is different from ours in the respect that previous work centered on modeling learning processes by finite-state automata.
  • For this reason, the number of states m was frequently allowed to tend to infinity in the analysis, and the emphasis on optimal m-state automata was lost.

p^0_{ij} = \alpha_i \int p_{ij}(A, y) f_0(y) d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) f_1(y) d\nu(y)

  • Here f_0 and f_1 are the Radon-Nikodym derivatives (densities) of \mathcal{P}_0 and \mathcal{P}_1 with respect to some dominating measure \nu . Define the m \times m matrices P^0 = [p^0_{ij}] and P^1 = [p^1_{ij}] , and let \mu^0 and \mu^1 be the stationary probability distributions on the state space \mathcal{S} under H_0 and H_1 .
  • The stationary probability distributions are solutions of the matrix equations \mu^0 = \mu^0 P^0 and \mu^1 = \mu^1 P^1 .
  • Attention will be restricted to the algorithm T_n = f(T_{n-1}, X_n) , e_n = e(T_{n-1}) , e_n \in \{A, B\} , where T is the state of memory, e the choice of experiment, and Y the resulting observation.
  • A reformulation of this algorithm in the terminology of finite-state machines will be convenient.
  • Parallel work on hypothesis testing with finite memory [28] establishes that irreducible automata are at least "one state better" than reducible automata.
  • The following definitions and theorems will prove useful.


  • For the problem in which two coins of unknown bias are presented, is almost independent of the exact biases of the coins.
  • Two hypotheses exist concerning p_A and p_B , the respective biases of the coins toward heads: H: p_A < p_B followed by the utilization of the coin deemed to have the highest probability of heads.
  • It is readily verified by indexing the states in reverse order that the e-optimal m-state machine for (pl, p2) E Q., is identical to that for (pl, p2) E %.
  • This is the conservatism that the finite-memory constraint demands.
  • The TABP with finite memory given a finite sequence of observations has not been solved.


  • Inspection of the solution of the TABP indicates that optimal finite-memory learning is reasonably far from human intuition and practice.
  • Some interesting properties of the solution are the following.
  • The state transition function f is deterministic.
  • This differs from what the authors might call the one-armed-bandit problem [29] in which the experiment to be performed at each stage is unchanged, but for which the \epsilon -optimal state transition function f involves randomization.
  • This conflict does not generally disappear in the infinite-.

Suppose one is given two coins, labeled A and B.
Suppose also that it is known that one of the coins
has bias p_1 towards heads and the other has bias
p_2 towards heads, but it is not known which coin has
which bias. At each trial a coin is to be selected and tossed,
and it is desired to maximize the proportion of heads
(successes) achieved in the limit as the number of trials
tends to infinity. An equivalent objective is to maximize
the proportion of tosses using the coin with the larger
bias. How should the choice of coin at trial n depend on
the previous outcomes, in order to achieve this goal?
This problem is commonly referred to as the sequential
design of experiments or the two-armed-bandit problem
(TABP) [1]-[3].
Note that this problem combines hypothesis testing
(which coin has which bias?) with the added degree of
freedom that the experimenter may select his experiment
(A or B) at each toss. The experimenter must utilize
his information to maximize the proportion of successes.
This paper will be concerned with a generalized TABP
in which the coins may have an infinite number of
sides. A further generalization of the TABP to an infinite
number of coins will be provided in Section VI. These
problems will be solved under a finite-memory constraint,
i.e., the experimenter is not allowed to remember the
outcomes of all previous trials, but only a finite-valued
statistic. On the basis of this statistic, the next coin must
be chosen.

Manuscript received May 19, 1969; revised September 29, 1969.
This work was supported in part under Contract AF 49(638) 1517
and under the NSF Graduate Fellowship Program.
T. M. Cover is with the Department of Electrical Engineering,
Stanford University, Stanford, Calif. 94305.
M. E. Hellman was with the Watson IBM Research Center,
Yorktown Heights, N. Y. He is now with the Massachusetts Institute
of Technology, Cambridge, Mass. 02139.

Stated more precisely, the experimenter is provided two experiments, A and B. Also given are two probability measures \mathcal{P}_0 and \mathcal{P}_1, defined on the arbitrary probability space (\mathcal{Y}, \mathcal{B}), where \mathcal{Y} is the experimental outcome space and \mathcal{B} is a \sigma-field of subsets over \mathcal{Y}. There are two hypotheses concerning the probability distribution of the experimental outcome Y:
    H_0: Y \sim \mathcal{P}_0 under experiment A,  Y \sim \mathcal{P}_1 under experiment B
    H_1: Y \sim \mathcal{P}_1 under experiment A,  Y \sim \mathcal{P}_0 under experiment B.    (1)
Let the a priori probabilities of H_0 and H_1 be \pi_0 and \pi_1, respectively, where \pi_0 + \pi_1 = 1. This seemingly Bayesian formulation, in which the priors are specified, is not restrictive since the set of all admissible algorithms (or the set of all optimal algorithms with respect to the Neyman-Pearson formulation) may be generated by letting \pi_0 take on all values in the unit interval.
Let e_i \in \{A, B\} denote the ith experiment performed and Y_i \in \mathcal{Y} denote the ith experimental outcome. It is assumed that the experimental outcomes are independent in the sense that
    P(Y_i, Y_j | e_i, e_j, H) = P(Y_i | e_i, H) P(Y_j | e_j, H),  i \ne j,
where H is the true hypothesis.
A success is said to occur if the experiment associated with \mathcal{P}_0 is performed. At times n = 1, 2, 3, \cdots a choice of experiment e_n is made. Letting
    s_n = 1 if success occurs at time n,
    s_n = 0 if failure occurs at time n,
the objective is to maximize
    r = E\{\lim_{n \to \infty} (1/n) \sum_{i=1}^{n} s_i\}
where the expectation is taken with respect to the distribution on the two hypotheses and the distribution on \{s_i\} induced by the experiment selection algorithm. Therefore, r is the expected long run proportion of successes.
Let the data be summarized by an m-valued statistic T that is updated according to the rule
    T_n = f(T_{n-1}, X_n),  T_n \in \{1, 2, \cdots, m\}
where T_n is the value of T after n observations, X_n = (e_n, Y_n) is the nth observation (note the difference between an observation X = (e, Y) and an experimental outcome Y), and f is a stochastic function. Further, let e_n be constrained to depend on the past outcomes X_1, X_2, \cdots, X_{n-1} only through T_{n-1} according to the function
    e_n = e(T_{n-1}),  n = 1, 2, \cdots
where e: \{1, 2, \cdots, m\} \to \{A, B\} is again allowed to be a stochastic function. (The randomization in the functions f and e must, to avoid cheating, be independent of the data.) The size of memory is defined to be m.
The objective is now to find the pair (f, e) that maximizes r for given m, \pi_0, \mathcal{P}_0, and \mathcal{P}_1. For a reformulation in terms of optimal finite-state machines see Section III.
As was previously mentioned, it is not only necessary to test H_0 versus H_1, but also to use the result of the test in an attempt to obtain successes. This produces a conflict. The experimenter may believe H_0 (in which case he should perform A) and yet he may wish to perform B if it would yield more information, thereby increasing the probability of success on future trials. The conflict is between a desire for immediate success and a desire to gather information.
Another conflict exists. A good test requires large memory, but, as mentioned, hypothesis testing may not yield a high proportion of successes. Thus, once the test is completed, a large number of experiments that use the result of the test is desired. However, an m-valued statistic can only count to m. There is a problem in deciding how much memory to allocate to testing and how much to allocate to using the information gathered by testing. Fortunately, the optimal solution that we shall present suggests an interpretation answering this question. The surprising answer is that all of the states of memory may be devoted to hypothesis testing, and the information so gathered may be used to gain successes in a manner that does not interfere with the hypothesis testing.

The TABP was introduced by Robbins [1] in 1952. In that paper there was no constraint on memory and the experiments were restricted to be binary-valued (coin tosses). Robbins argued that a scheme that sampled the inferior coin infinitely often, but with density of sampling tending to zero, yielded r = 1. Here, at a particular time, the inferior coin is defined to be the coin yielding the lower cumulative proportion of heads. Subsequently, Bradt, Johnson, and Karlin [2] and Bradt and Karlin [3] examined generalizations of the TABP in which it was desired to maximize the number of successes in a finite number of trials. This problem remains open in the case where the coin biases (p_1, p_2) have an arbitrary known joint distribution. However, Feldman [4] has solved the generalized version of the TABP corresponding to (1) (with known a priori probabilities) in the infinite-memory case.
The idea of adding a finite-memory constraint is due to Robbins [5]. Robbins defines memory to be of length k if the choice of coin at each trial is allowed to depend only on the outcomes of the k previous trials. Letting \mathcal{X} = \{A, B\} \times \mathcal{Y} denote the observation space, the problem becomes one of determining the function e: \mathcal{X}^k \to \{A, B\} for which the algorithm
    e_{n+1} = e(X_n, X_{n-1}, \cdots, X_{n-k+1}),  X_n = (e_n, Y_n)
maximizes r. Since \mathcal{X} has but four members in Robbins's problem, memory is still finite according to the definition of Section I, with m = 4^k. However, if the experimental outcome space \mathcal{Y} is infinite, an infinite-state memory is needed to recall the last k experimental outcomes.
Although Robbins's original algorithm has been successively improved by Isbell [6], Smith and Pyke [7], and Samuels [8], an optimal scheme has not been established. However, if the choice of coin may also depend on time, the problem has been solved by Cover [9]. A memory k = 2 is sufficient, i.e., there exists an algorithm e for which the scheme
    e_{n+1} = e(X_n, X_{n-1}, n)
achieves an asymptotic proportion of successes r = 1. The algorithm is independent of the biases p_1 and p_2 on the two coins, and thus is optimal (achieves r = 1) for the more general problem of maximizing the asymptotic proportion of heads with two coins having arbitrary unknown biases. This work also implies that, with the definition of memory given in Section I, a memory of m = 4 states is sufficient [10] for a time-varying algorithm to achieve r = 1.
A series of publications following the work of Tsetlin has appeared in the Russian literature [11]-[21] on the behavior of automata in random media in an attempt to model adaptive or learning systems. In many cases the algorithms considered are similar to the TABP with finite memory of the type defined in Section I. A series of ad hoc expedient automata (i.e., automata that perform better than simply alternating coins at each trial) is examined, but no optimal automata are found. Subsequent work by Fu and Li [22], [23] and Chandrasekaran and Shen [24]-[27] has enlarged the set of algorithms for which the asymptotic behavior has been found. The fundamental problem implicit in [11]-[27] is presented in Section I and solved in this paper. It should be mentioned that the motivation of the previous papers differs from ours in that previous work centered on modeling learning processes by finite-state automata. For this reason, the number of states m was frequently allowed to tend to infinity in the analysis, and the emphasis on optimal m-state automata was lost.
Note this one word of caution. Memory size has been defined to be the number of states of the automaton. This seems to us to be natural. However, we have not included any measure of the complexity of the computation of the state transition function f and the choice of experiment function e. Fortunately, the optimal function f is rather simple to implement, as will be seen from the example at the end of Section IV. Moreover, if an auxiliary stream of random variables is available, the calculation of f and e may be performed by hard-wired circuitry without memory elements.

The two-armed-bandit problem that will be solved in this section has the form
    H_0: Y \sim \mathcal{P}_0 (A),  Y \sim \mathcal{P}_1 (B)
    H_1: Y \sim \mathcal{P}_1 (A),  Y \sim \mathcal{P}_0 (B)    (9)
where \mathcal{P}_0 and \mathcal{P}_1 are arbitrary known probability measures. Thus Y is not restricted to be a binary-valued random variable as in previous work [1], [5]. In Section VI, the solution will be generalized to the form
    H_0: Y \sim \mathcal{P}_0 (A),  Y \sim \mathcal{P}_1 (B)
    H_1: Y \sim \mathcal{P}_2 (A),  Y \sim \mathcal{P}_3 (B).
Attention will be restricted to the algorithm
    T_n = f(T_{n-1}, X_n),  T_n \in \{1, 2, \cdots, m\}
    e_n = e(T_{n-1}),  e_n \in \{A, B\}    (12)
where T is the state of memory, e the choice of experiment, and Y the resulting observation. A reformulation of this algorithm in the terminology of finite-state machines will be convenient. X and Y will denote random variables, and x and y their outcomes.
Consider a finite-state stochastic sequential machine with state space \mathcal{S} = \{1, 2, \cdots, m\}, input space \mathcal{X} = \{A, B\} \times \mathcal{Y}, and output space \{A, B\}. Let the state transition behavior of this machine be specified by a family of m \times m stochastic matrices [p_{ij}(x)] defined for x = (e, y), e \in \{A, B\}, y \in \mathcal{Y}, and i, j \in \mathcal{S}, where
    p_{ij}(e, y) = Pr\{T_n = j | T_{n-1} = i, X_n = (e, y)\}    (14)
is the conditional probability of transition from memory state i to j under the observation of experiment e with outcome y.
Let the output function be described by the sequence \alpha_i, 0 \le \alpha_i \le 1, i = 1, 2, \cdots, m, with the interpretation
    \alpha_i = Pr\{e_{n+1} = A | T_n = i\}.    (15)
Thus, the next experiment chosen is a random variable depending solely on the past experience as summarized in the current state of memory T. The automaton is depicted in Fig. 1.
Fig. 1. Decision process viewed as a finite state automaton.
The state transition matrices conditioned on H_0 and H_1 are given by
    p^0_{ij} = E\{p_{ij}(X) | H_0\}
    p^1_{ij} = E\{p_{ij}(X) | H_1\}.    (16)
As will be shown in the proof of Theorem 1, these expectations may be explicitly expressed as follows:
    p^0_{ij} = \alpha_i \int p_{ij}(A, y) f_0(y) d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) f_1(y) d\nu(y)
    p^1_{ij} = \alpha_i \int p_{ij}(A, y) f_1(y) d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) f_0(y) d\nu(y)
where f_0 and f_1 are the Radon-Nikodym derivatives (densities) of \mathcal{P}_0 and \mathcal{P}_1 with respect to some dominating measure \nu. Define the m \times m matrices P^0 = [p^0_{ij}] and P^1 = [p^1_{ij}] and let \mu^0 and \mu^1 be the stationary probability distributions on the state space \mathcal{S} under H_0 and H_1. The stationary probability distributions are solutions of the matrix equations
    \mu^0 = \mu^0 P^0
    \mu^1 = \mu^1 P^1.
Note that if P^k is irreducible, \mu_i^k is the asymptotic proportion of time spent in state i, conditioned on H_k. Parallel work on hypothesis testing with finite memory [28] establishes that irreducible automata are at least "one state better" than reducible automata. The same argument applies to the current formulation of the TABP so that, here too, attention will be restricted to irreducible automata.
Letting r_0 and r_1 be the asymptotic proportion of successes under H_0 and H_1, it is seen that
    r_0 = \sum_{i=1}^{m} \mu_i^0 \alpha_i
    r_1 = \sum_{i=1}^{m} \mu_i^1 (1 - \alpha_i)
where the \alpha_i are defined by (15).
If a Bayesian approach is taken and a priori probabilities \pi_0 and \pi_1 (\pi_0 + \pi_1 = 1) are assigned to H_0 and H_1,
    r = \pi_0 r_0 + \pi_1 r_1.
Although the Bayesian approach will be taken, the results will apply to the Neyman-Pearson formulation as well. In the Neyman-Pearson formulation the problem is to maximize r_1 subject to the constraint r_0 \ge \alpha, for a given level \alpha.
Returning to the Bayesian formulation, the goal is to maximize r over all p_{ij}(x) and \alpha_i; i, j = 1, 2, \cdots, m. Designate this maximum value of r by r^*.
In order to place an upper bound on r it is necessary to relate the parameters of the automaton to the statistics of the problem. The following definitions and theorems will prove useful.
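The finite-state formulation above can be made concrete for a coin-tossing problem. The sketch below is illustrative only: it assumes a two-outcome space (heads "H", tails "T"), builds the conditional transition matrices as mixtures over the choice of experiment, finds the stationary distributions by power iteration, and evaluates r_0 = \sum_i \mu_i^0 \alpha_i, r_1 = \sum_i \mu_i^1 (1 - \alpha_i) and r = \pi_0 r_0 + \pi_1 r_1. The two-state "stay on heads, switch on tails" machine, the biases 0.9 and 0.8, and all function names are assumptions introduced for the example; this is not the paper's \epsilon-optimal automaton.

```python
def transition_matrix(alpha, p, heads_A, heads_B):
    """Build P = [p_ij] by mixing over the experiment choice and coin outcome.

    alpha[i]        : probability of choosing experiment A in state i
    p[e][y][i][j]   : probability of moving i -> j after experiment e, outcome y
    heads_A/heads_B : probability of heads for A and B under the hypothesis
    """
    m = len(alpha)
    P = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            for e, prob_e, h in (("A", alpha[i], heads_A),
                                 ("B", 1.0 - alpha[i], heads_B)):
                P[i][j] += prob_e * (h * p[e]["H"][i][j]
                                     + (1.0 - h) * p[e]["T"][i][j])
    return P


def stationary(P, iters=5000):
    """Approximate the stationary row vector mu = mu P by power iteration.

    Assumes the chain is irreducible and aperiodic so the iteration converges.
    """
    m = len(P)
    mu = [1.0 / m] * m
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(m)) for j in range(m)]
    return mu


# Hypothetical 2-state machine: state 0 always plays A, state 1 always plays B;
# the state is kept on heads and switched on tails.
alpha = [1.0, 0.0]
stay = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
p = {"A": {"H": stay, "T": swap}, "B": {"H": stay, "T": swap}}

# Under H_0 coin A has bias 0.9 and coin B has bias 0.8; under H_1 they swap.
P_H0 = transition_matrix(alpha, p, heads_A=0.9, heads_B=0.8)
P_H1 = transition_matrix(alpha, p, heads_A=0.8, heads_B=0.9)
mu0, mu1 = stationary(P_H0), stationary(P_H1)

r0 = sum(mu0[i] * alpha[i] for i in range(2))          # success = playing A
r1 = sum(mu1[i] * (1.0 - alpha[i]) for i in range(2))  # success = playing B
r = 0.5 * r0 + 0.5 * r1                                # equal priors assumed
```

For this particular machine the stationary distributions are (2/3, 1/3) under H_0 and (1/3, 2/3) under H_1, so r_0 = r_1 = r = 2/3.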
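Let the measure \nu = \mathcal{P}_0 + \mathcal{P}_1. Thus \mathcal{P}_0 and \mathcal{P}_1 are both absolutely continuous with respect to \nu. Define f_0(y) and f_1(y) to be the Radon-Nikodym derivatives of \mathcal{P}_0 and \mathcal{P}_1 with respect to \nu (f_0 and f_1 are the probability density functions of \mathcal{P}_0 and \mathcal{P}_1 with respect to \nu). Let
    l_A(y) = f_0(y) / f_1(y)
    l_B(y) = f_1(y) / f_0(y) = 1 / l_A(y).
It is seen that l_A and l_B are the likelihood ratios for an experimental outcome y that results from experiments A and B, respectively. Further define (for C \subset \mathcal{Y})
    \bar{l}_A = \sup_{\nu(C)>0} \mathcal{P}_0(C) / \mathcal{P}_1(C),  \underline{l}_A = \inf_{\nu(C)>0} \mathcal{P}_0(C) / \mathcal{P}_1(C)
    \bar{l}_B = \sup_{\nu(C)>0} \mathcal{P}_1(C) / \mathcal{P}_0(C),  \underline{l}_B = \inf_{\nu(C)>0} \mathcal{P}_1(C) / \mathcal{P}_0(C).
Thus \bar{l}_A is the almost everywhere (a.e.) maximum likelihood ratio (l.r.) for experiment A, and \underline{l}_A is the a.e. minimum l.r. for experiment A. Similarly, \bar{l}_B and \underline{l}_B are the a.e. maximum and minimum l.r.'s for experiment B. Clearly, from the definitions, \bar{l}_B = 1/\underline{l}_A and \underline{l}_B = 1/\bar{l}_A. Thus defining
    \bar{l} = \max\{\bar{l}_A, \bar{l}_B\}
    \underline{l} = \min\{\underline{l}_A, \underline{l}_B\}    (29)
it is seen that
    \bar{l} = \max\{\bar{l}_A, 1/\underline{l}_A\}
    \underline{l} = \min\{\underline{l}_A, 1/\bar{l}_A\} = 1/\bar{l}.
The likelihood ratio l(x) of an observation x = (e, y), e \in \{A, B\}, is defined to be
    l(x) = l_A(y) if e = A,
    l(x) = l_B(y) if e = B,
so that \underline{l} \le l(x) \le \bar{l} a.e.
For example, if two unlabeled coins of biases p_1 = 0.7 and p_2 = 0.8 are given, the possible events C are heads and tails, and
    \bar{l} = \max\{8/7, 3/2\} = 3/2;  \underline{l} = 1/\bar{l} = 2/3;
the maximum and minimum likelihood ratio events are given by tails on one coin and tails on the other.
Theorem 1
For all i, j \in \{1, 2, \cdots, m\},
    1/\bar{l} \le p^0_{ij} / p^1_{ij} \le \bar{l}.
Proof: From (16) it is seen that
    p^0_{ij} = Pr\{T_n = j | T_{n-1} = i, H_0\}.
Conditioning on the choice of experiment,
    p^0_{ij} = Pr\{T_n = j | T_{n-1} = i, H_0, e_n = A\} \cdot Pr\{e_n = A | T_{n-1} = i, H_0\}
             + Pr\{T_n = j | T_{n-1} = i, H_0, e_n = B\} \cdot Pr\{e_n = B | T_{n-1} = i, H_0\}.    (36)
But, since the choice of e_n is a (randomized) function of T_{n-1} only,
    Pr\{e_n = A | T_{n-1} = i, H_0\} = Pr\{e_n = A | T_{n-1} = i\} = \alpha_i
    Pr\{e_n = B | T_{n-1} = i, H_0\} = 1 - \alpha_i.
From (14),
    Pr\{T_n = j | T_{n-1} = i, H_0, e_n = A\} = \int p_{ij}(A, y) f_0(y) d\nu(y),    (39)
since under H_0 the experimental outcome Y has f_0 as its density function when A is performed. Similarly,
    Pr\{T_n = j | T_{n-1} = i, H_0, e_n = B\} = \int p_{ij}(B, y) f_1(y) d\nu(y).
Then (36) becomes
    p^0_{ij} = \alpha_i \int p_{ij}(A, y) f_0(y) d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) f_1(y) d\nu(y).
By definition
    f_0(y) = l_A(y) f_1(y),  f_1(y) = l_B(y) f_0(y)
so that
    p^0_{ij} = \alpha_i \int p_{ij}(A, y) l_A(y) f_1(y) d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) l_B(y) f_0(y) d\nu(y).
Furthermore, l_A(y) \le \bar{l}_A and l_B(y) \le \bar{l}_B a.e. \nu, and by (29) \bar{l} = \max\{\bar{l}_A, \bar{l}_B\}. Hence
    p^0_{ij} \le \bar{l} [\alpha_i \int p_{ij}(A, y) f_1(y) d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) f_0(y) d\nu(y)].    (45)
Proceeding similarly it is found that
    p^1_{ij} = \alpha_i \int p_{ij}(A, y) f_1(y) d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) f_0(y) d\nu(y).    (46)
Combining (45) and (46) yields
    p^0_{ij} / p^1_{ij} \le \bar{l}
thus proving half of the theorem. The other half follows in an analogous manner.
The state likelihood ratio vector \lambda = (\lambda_1, \cdots, \lambda_m) is defined by
    \lambda_i = \mu_i^0 / \mu_i^1,  i = 1, 2, \cdots, m.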
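The likelihood-ratio bounds and the inequality of Theorem 1 can be checked numerically on a small example. The sketch below is a hedged illustration, assuming a finite outcome space with strictly positive probabilities, in which case the supremum and infimum over events C are attained on single outcomes (a ratio of sums always lies between the smallest and largest of the individual ratios). The coin biases 0.7 and 0.8, their assignment to \mathcal{P}_0 and \mathcal{P}_1, the two-state "stay on heads, switch on tails" machine, and the name `lr_bounds` are assumptions made for the example.

```python
from fractions import Fraction as F


def lr_bounds(P0, P1):
    """Return (l_bar, l_under) for finite distributions given as dicts.

    P0 and P1 map each outcome to a strictly positive probability.
    """
    l_A = [P0[y] / P1[y] for y in P0]  # likelihood ratios under experiment A
    l_B = [P1[y] / P0[y] for y in P0]  # likelihood ratios under experiment B
    return max(max(l_A), max(l_B)), min(min(l_A), min(l_B))


# Assumed assignment: P0 is the coin with bias 0.7, P1 the coin with bias 0.8.
P0 = {"H": F(7, 10), "T": F(3, 10)}
P1 = {"H": F(8, 10), "T": F(2, 10)}
l_bar, l_under = lr_bounds(P0, P1)  # 3/2 and 2/3

# Hypothetical machine: state 0 plays A, state 1 plays B, heads keeps the
# state and tails switches it. Under H_0, A ~ P0 and B ~ P1; under H_1 the
# roles of P0 and P1 are exchanged.
P_H0 = [[P0["H"], P0["T"]], [P1["T"], P1["H"]]]
P_H1 = [[P1["H"], P1["T"]], [P0["T"], P0["H"]]]

# Theorem 1: every ratio p0_ij / p1_ij lies between 1 / l_bar and l_bar.
ratios = [P_H0[i][j] / P_H1[i][j] for i in range(2) for j in range(2)]
theorem1_holds = all(1 / l_bar <= q <= l_bar for q in ratios)
```

Exact rational arithmetic is used so that the boundary cases, where a ratio equals 3/2 or 2/3 exactly, are not disturbed by rounding.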
Theorem 2
For an irreducible automaton in which the \lambda_i's are arranged in nondecreasing order the following relation holds:
    \lambda_{i+1} / \lambda_i \le (\bar{l})^2.
Since it has been noted that irreducible automata can do at least as well as reducible ones, the irreducibility restriction is of no consequence.
The proof of this theorem follows from Theorem 1 using arguments contained in Lemma 2 of [28]. The reader is referred there for details.
Theorem 3
For an m-state automaton r is bounded above by
    r^* = \max\{ [(\bar{l})^{2(m-1)} - 2 (\bar{l})^{m-1} \sqrt{\pi_0 \pi_1}] / [(\bar{l})^{2(m-1)} - 1],  \pi_0,  \pi_1 \}.
In the special case \pi_0 = \pi_1 = 1/2,
    r^* = (\bar{l})^{m-1} / ((\bar{l})^{m-1} + 1).
Remark 1
If r^* = \pi_0 (or \pi_1), a degenerate situation exists in which the machine that always chooses experiment A (or B) is optimal. In this case memory is not large enough to gather sufficient information to offset the a priori bias [28].
Before proceeding with the proof of the theorem, an example will be given. Consider the coin-flipping TABP
    \pi_0 = 1/2  H_0: p_A = 0.9,  p_B = 0.8
    \pi_1 = 1/2  H_1: p_A = 0.8,  p_B = 0.9
where p_A and p_B are the probabilities of the event heads (H) under the appropriate conditions. Thus, for example, if coin A is flipped and H_0 is true, then Pr\{heads\} = p_A = 0.9. Calculation shows that
    \bar{l} = \max\{\bar{l}_A, \bar{l}_B\} = \max\{9/8, 8/9, 2, 1/2\} = 2.
Thus, for an m-state memory the best possible limiting proportion of uses of the best coin (in this case, the coin with bias 0.9) is given by
    r^* = 2^{m-1} / (2^{m-1} + 1).
In the next section an automaton will be exhibited that achieves r^* arbitrarily closely.
If p_0 = 0.5, p_1 = 0.501, the situation is quite different. Here \bar{l} \approx 1.002 and
    r^* \approx (1.002)^{m-1} / ((1.002)^{m-1} + 1).
Thus, even m = 500 states yields only a proportion of successes r^* \approx e/(e + 1). No 500-state machine can do better.
Proof: By Theorem 2, \lambda_2 \le \lambda_1 (\bar{l})^2, \lambda_3 \le \lambda_1 (\bar{l})^4, \cdots, \lambda_m \le \lambda_1 (\bar{l})^{2(m-1)}. Thus, for all i \in \mathcal{S} = \{1, 2, \cdots, m\}, \lambda_i \le \lambda_1 (\bar{l})^{2(m-1)}, i.e., \mu_i^0 \le \lambda_1 (\bar{l})^{2(m-1)} \mu_i^1. Since r_0 = \sum_i \mu_i^0 \alpha_i and 1 - r_1 = \sum_i \mu_i^1 \alpha_i, multiplying by \alpha_i and summing over i gives
    r_0 \le \lambda_1 (\bar{l})^{2(m-1)} (1 - r_1).
Remark 2
The larger the value of \bar{l}, the larger the resultant proportion of successes r^*. Thus, \bar{l} is a measure of the separation

