
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. IT-16, NO. 2, MARCH 1970
The Two-Armed-Bandit Problem With
Time-Invariant Finite Memory
THOMAS M. COVER AND MARTIN E. HELLMAN
Abstract: This paper solves the classical two-armed-bandit problem under the finite-memory constraint described below. Given are probability densities p_0 and p_1, and two experiments A and B. It is not known which density is associated with which experiment. Thus the experimental outcome Y of experiment A is as likely to be distributed according to p_0 as it is to be distributed according to p_1. It is desired to sequentially choose an experiment to be performed on the basis of past observations according to the algorithm T_n = f(T_{n-1}, e_n, Y_n), e_n = e(T_{n-1}), where T_n \in \{1, 2, \cdots, m\} is the state of memory at time n, e_n \in \{A, B\} is the choice of experiment, and Y_n is the random variable observation. The goal is to maximize the asymptotic proportion r of uses of the experiment associated with density p_0.

Let l(y) = p_0(y)/p_1(y), and let \underline{l} and \bar{l} denote the almost everywhere greatest lower bound and least upper bound on l(y). Let l = \max\{\bar{l}, 1/\underline{l}\}. Then the optimal value of r, over all m-state algorithms (f, e), will be shown to be l^{m-1}/(l^{m-1} + 1). An \epsilon-optimal family of m-state algorithms will be demonstrated. In general, optimal algorithms do not exist, and \epsilon-optimal algorithms require artificial randomization.
I. INTRODUCTION

Suppose one is given two coins, labeled A and B. Suppose also that it is known that one of the coins has bias p_1 towards heads and the other has bias p_2 towards heads, but it is not known which coin has which bias. At each trial a coin is to be selected and tossed, and it is desired to maximize the proportion of heads (successes) achieved in the limit as the number of trials tends to infinity. An equivalent objective is to maximize the proportion of tosses using the coin with the larger bias. How should the choice of coin at trial n depend on the previous outcomes, in order to achieve this goal? This problem is commonly referred to as the sequential design of experiments or the two-armed-bandit problem (TABP) [1]-[3].
Note that this problem combines hypothesis testing
(which coin has which bias?) with the added degree of
freedom that the experimenter may select his experiment
(A or B) at each toss. The experimenter must utilize
his information to maximize the proportion of successes.
This paper will be concerned with a generalized TABP
in which the coins may have an infinite number of
sides. A further generalization of the TABP to an infinite
number of coins will be provided in Section VI. These
problems will be solved under a finite-memory constraint,
i.e., the experimenter is not allowed to remember the outcomes of all previous trials, but only a finite-valued statistic. On the basis of this statistic, the next coin must be chosen.

Manuscript received May 19, 1969; revised September 29, 1969. This work was supported in part under Contract AF 49(638)-1517 and under the NSF Graduate Fellowship Program.
T. M. Cover is with the Department of Electrical Engineering, Stanford University, Stanford, Calif. 94305.
M. E. Hellman was with the IBM Watson Research Center, Yorktown Heights, N. Y. He is now with the Massachusetts Institute of Technology, Cambridge, Mass. 02139.
Stated more precisely, the experimenter is provided two experiments, A and B. Also given are two probability measures \mathcal{P}_0 and \mathcal{P}_1, defined on the arbitrary probability space (\mathcal{Y}, \mathcal{B}), where \mathcal{Y} is the experimental outcome space and \mathcal{B} is a \sigma-field of subsets of \mathcal{Y}. There are two hypotheses concerning the probability distribution of the experimental outcome Y:

H_0: Y \sim \mathcal{P}_0 under experiment A,  Y \sim \mathcal{P}_1 under experiment B
H_1: Y \sim \mathcal{P}_1 under experiment A,  Y \sim \mathcal{P}_0 under experiment B.   (1)

Let the a priori probabilities of H_0 and H_1 be \pi_0 and \pi_1 respectively, where \pi_0 + \pi_1 = 1. This seemingly Bayesian formulation, in which the priors are specified, is not restrictive, since the set of all admissible algorithms (or the set of all optimal algorithms with respect to the Neyman-Pearson formulation) may be generated by letting \pi_0 take on all values in the unit interval.

Let e_i \in \{A, B\} denote the ith experiment performed and Y_i \in \mathcal{Y} denote the ith experimental outcome. It is assumed that the experimental outcomes are independent in the sense that

P(Y_i, Y_j \mid e_i, e_j, H) = P(Y_i \mid e_i, H) P(Y_j \mid e_j, H),  i \neq j,

where H is the true hypothesis.
A success is said to occur if the experiment associated with \mathcal{P}_0 is performed. At times n = 1, 2, 3, \cdots a choice of experiment e_n is made. Letting

S_n = 1 if success occurs at time n, and S_n = 0 if failure occurs at time n,   (2)

the objective is to maximize

r = \lim_{n \to \infty} E\{(1/n) \sum_{i=1}^{n} S_i\},   (3)

where the expectation is taken with respect to the distribution on the two hypotheses and the distribution on \{S_i\} induced by the experiment selection algorithm. Therefore, r is the expected long-run proportion of successes.

Let the data be summarized by an m-valued statistic T that is updated according to the rule

T_n = f(T_{n-1}, X_n),  T_n \in \{1, 2, \cdots, m\}   (4)

where T_n is the value of T after n observations, X_n = (e_n, Y_n) is the nth observation (note the difference between an observation X = (e, Y) and an experimental outcome Y), and f is a stochastic function. Further, let e_n be constrained to depend on the past outcomes X_1, X_2, \cdots, X_{n-1} only through T_{n-1}, according to the function

e_n = e(T_{n-1}),  n = 1, 2, \cdots,   (5)

where e: \{1, 2, \cdots, m\} \to \{A, B\} is again allowed to be a stochastic function. (The randomization in the functions f and e must, to avoid cheating, be independent of the data.) The size of memory is defined to be m.

The objective is now to find the pair (f, e) that maximizes r for given m, \pi_0, \mathcal{P}_0, and \mathcal{P}_1. For a reformulation in terms of optimal finite-state machines see Section III.
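To make the finite-memory constraint concrete, the following minimal sketch simulates an arbitrary m-state pair (f, e) on binary coins. Everything in it is illustrative scaffolding rather than the paper's construction: the function signatures, the explicit rng argument carrying the data-independent randomization, and the deliberately naive two-state rule at the end are all assumptions of this sketch.

```python
import random

def run_tabp(m, f, e, p0=0.9, p1=0.8, trials=200_000, seed=1):
    """Simulate a time-invariant m-state algorithm (f, e) on the TABP.

    e(t, rng)    -> 'A' or 'B': the experiment chosen in state t      (cf. (5))
    f(t, x, rng) -> next state in {1, ..., m}, x = (experiment, outcome)
                                                                       (cf. (4))
    rng carries the artificial randomization, independent of the data.
    Returns the empirical proportion of uses of the bias-p0 coin.
    """
    rng = random.Random(seed)
    h0 = rng.random() < 0.5                  # nature picks H0 or H1 equiprobably
    bias = {'A': p0 if h0 else p1, 'B': p1 if h0 else p0}
    t, successes = 1, 0
    for _ in range(trials):
        exp = e(t, rng)                      # depends on the past only through t
        y = rng.random() < bias[exp]         # toss the chosen coin
        successes += ((exp == 'A') == h0)    # success: used the p0-experiment
        t = f(t, (exp, y), rng)
    return successes / trials

# A deliberately naive 2-state rule, just to exercise the interface:
f = lambda t, x, rng: 1 if x[1] else 2       # remember whether the last toss won
e = lambda t, rng: 'A' if t == 1 else 'B'
print(run_tabp(2, f, e))
```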
As was previously mentioned, it is not only necessary to test H_0 versus H_1, but also to use the result of the test in an attempt to obtain successes. This produces a conflict. The experimenter may believe H_0 (in which case he should perform A) and yet he may wish to perform B if it would yield more information, thereby increasing the probability of success on future trials. The conflict is between a desire for immediate success and a desire to gather information.

Another conflict exists. A good test requires large memory, but, as mentioned, hypothesis testing may not yield a high proportion of successes. Thus, once the test is completed, a large number of experiments that use the result of the test is desired. However, an m-valued statistic can only count to m. There is a problem in deciding how much memory to allocate to testing and how much to allocate to using the information gathered by testing. Fortunately, the optimal solution that we shall present suggests an interpretation answering this question. The surprising answer is that all of the states of memory may be devoted to hypothesis testing, and the information so gathered may be used to gain successes in a manner that does not interfere with the hypothesis testing.
II. HISTORY OF THE PROBLEM

The TABP was introduced by Robbins [1] in 1952. In that paper there was no constraint on memory and the experiments were restricted to be binary-valued (coin tosses). Robbins argued that a scheme that sampled the inferior coin infinitely often, but with density of sampling tending to zero, yielded r = 1. Here, at a particular time, the inferior coin is defined to be the coin yielding the lower cumulative proportion of heads. Subsequently, Bradt, Johnson, and Karlin [2] and Bradt and Karlin [3] examined generalizations of the TABP in which it was desired to maximize the number of successes in a finite number of trials. This problem remains open in the case where the coin biases (p_1, p_2) have an arbitrary known joint distribution. However, Feldman [4] has solved the generalized version of the TABP corresponding to (1) (with known a priori probabilities) in the infinite-memory case.
The idea of adding a finite-memory constraint is due to Robbins [5]. Robbins defines memory to be of length k if the choice of coin at each trial is allowed to depend only on the outcomes of the k previous trials. Letting X = \{A, B\} \times \mathcal{Y} denote the observation space, the problem becomes one of determining the function e: X^k \to \{A, B\} for which the algorithm

e_{n+1} = e(X_n, X_{n-1}, \cdots, X_{n-k+1})   (6)

X_n = (e_n, Y_n)   (7)

maximizes r. Since X has but four members in Robbins's problem, memory is still finite according to the definition of Section I, with m = 4^k. However, if the experimental outcome space \mathcal{Y} is infinite, an infinite-state memory is needed to recall the last k experimental outcomes.

Although Robbins's original algorithm has been successively improved by Isbell [6], Smith and Pyke [7], and Samuels [8], an optimal scheme has not been established. However, if the choice of coin may also depend on time, the problem has been solved by Cover [9]. A memory k = 2 is sufficient, i.e., there exists an algorithm e for which the scheme
e_{n+1} = e(X_n, X_{n-1}, n)   (8)

achieves an asymptotic proportion of successes r = 1. The algorithm is independent of the biases p_1 and p_2 on the two coins, and thus is optimal (achieves r = 1) for the more general problem of maximizing the asymptotic proportion of heads with two coins having arbitrary unknown biases. This work also implies that, with the definition of memory given in Section I, a memory of m = 4 states is sufficient [10] for a time-varying algorithm to achieve r = 1.
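For comparison with the m-state definition of Section I, here is a brief sketch of Robbins's length-k memory. The stay-on-win, switch-on-loss rule below is only a stand-in for the improved rules of [6]-[8]; the point is the bookkeeping: the decision depends on the last k observation pairs, and a time-invariant automaton realizing an arbitrary such rule needs m = 4^k states when outcomes are binary.

```python
from collections import deque

k = 3
window = deque(maxlen=k)      # holds the last k observations x = (coin, outcome)

def e_next(window):
    """Any map from the last k observations to the next coin is a length-k
    memory rule; this stay-on-win / switch-on-loss choice is illustrative."""
    if not window:
        return 'A'
    coin, heads = window[-1]
    return coin if heads else ('B' if coin == 'A' else 'A')

# Each observation takes |{A, B} x {0, 1}| = 4 values, so the window ranges
# over 4**k tuples: the number of states a time-invariant automaton needs to
# realize an arbitrary length-k rule with binary outcomes.
print(4 ** k)                 # 64 for k = 3
```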
A series of publications following the work of Tsetlin has appeared in the Russian literature [11]-[21] on the behavior of automata in random media, in an attempt to model adaptive or learning systems. In many cases the algorithms considered are similar to the TABP with finite memory of the type defined in Section I. A series of ad hoc expedient automata (i.e., automata that perform better than simply alternating coins at each trial) is examined, but no optimal automata are found. Subsequent work by Fu and Li [22], [23] and Chandrasekaran and Shen [24]-[27] has enlarged the set of algorithms for which the asymptotic behavior has been found. The fundamental problem implicit in [11]-[27] is presented in Section I and solved in this paper. It should be mentioned that the motivation of the previous papers is different from ours in the respect that previous work centered on modeling learning processes by finite-state automata. For this reason, the number of states m was frequently allowed to tend to infinity in the analysis, and the emphasis on optimal m-state automata was lost.

Note this one word of caution. Memory size has been defined to be the number of states of the automaton. This seems to us to be natural. However, we have not included any measure of the complexity of the computation of the
state transition function f and the choice of experiment function e. Fortunately, the optimal function f is rather simple to implement, as can be seen from the example at the end of Section IV. Moreover, if an auxiliary stream of random variables is available, the calculation of f and e may be performed by hard-wired circuitry without memory elements.

Fig. 1. Decision process viewed as a finite-state automaton. (Block diagram: the observation X_n = (e_n, Y_n) drives the memory update T_n = f(T_{n-1}, X_n).)

III. FINITE-STATE MACHINES

The two-armed-bandit problem that will be solved in this section has the form

H_0: Y \sim \mathcal{P}_0 under experiment A,  Y \sim \mathcal{P}_1 under experiment B
H_1: Y \sim \mathcal{P}_1 under experiment A,  Y \sim \mathcal{P}_0 under experiment B   (9)

where \mathcal{P}_0 and \mathcal{P}_1 are arbitrary known probability measures. Thus Y is not restricted to be a binary-valued random variable as in previous work [1], [5]. In Section VI, the solution will be generalized to the form

H_0: Y \sim \mathcal{P}_0 under experiment A,  Y \sim \mathcal{P}_1 under experiment B
H_1: Y \sim \mathcal{P}_2 under experiment A,  Y \sim \mathcal{P}_3 under experiment B.   (10)

Attention will be restricted to the algorithm

T_n = f(T_{n-1}, X_n),  T_n \in \{1, 2, \cdots, m\}   (11)

e_n = e(T_{n-1}),  e_n \in \{A, B\}   (12)

X_n = (e_n, Y_n)   (13)

where T is the state of memory, e the choice of experiment, and Y the resulting observation. A reformulation of this algorithm in the terminology of finite-state machines will be convenient. X and Y will denote random variables, and x and y their outcomes.

Consider a finite-state stochastic sequential machine with state space \mathcal{T} = \{1, 2, \cdots, m\}, input space X = \{A, B\} \times \mathcal{Y}, and output space \{A, B\}. Let the state transition behavior of this machine be specified by a family of m \times m stochastic matrices [p_{ij}(x)] defined for x = (e, y), e \in \{A, B\}, y \in \mathcal{Y}, and i, j \in \{1, 2, \cdots, m\}. Then

p_{ij}(e, y) = \Pr\{T_n = j \mid T_{n-1} = i, X_n = (e, y)\}   (14)

is the conditional probability of transition from memory state i to state j under the observation of experiment e with outcome y.

Let the output function be described by the sequence \alpha_i, 0 \leq \alpha_i \leq 1, i = 1, 2, \cdots, m, with the interpretation that

\alpha_i = \Pr\{e_{n+1} = A \mid T_n = i\}.   (15)

Thus, the next experiment chosen is a random variable depending solely on the past experience as summarized in the current state of memory T_n. The automaton is depicted in Fig. 1.

The state transition matrices conditioned on H_0 and H_1 are given by

p_{ij}^0 = E\{p_{ij}(X) \mid H_0\}   (16)

and

p_{ij}^1 = E\{p_{ij}(X) \mid H_1\}.   (17)

As will be shown in the proof of Theorem 1, these expectations may be explicitly expressed as follows:

p_{ij}^0 = \int [p_{ij}(A, y) \alpha_i f_0(y) + p_{ij}(B, y)(1 - \alpha_i) f_1(y)] \, d\nu(y)   (18)

p_{ij}^1 = \int [p_{ij}(A, y) \alpha_i f_1(y) + p_{ij}(B, y)(1 - \alpha_i) f_0(y)] \, d\nu(y)   (19)

where f_0 and f_1 are the Radon-Nikodym derivatives (densities) of \mathcal{P}_0 and \mathcal{P}_1 with respect to some dominating measure \nu. Define the m \times m matrices P^0 = [p_{ij}^0] and P^1 = [p_{ij}^1], and let \mu^0 and \mu^1 be the stationary probability distributions on the state space \mathcal{T} under H_0 and H_1. The stationary probability distributions are solutions of the matrix equations

\mu^0 = \mu^0 P^0   (20)

\mu^1 = \mu^1 P^1.   (21)

Note that if P^k is irreducible, \mu_i^k is the asymptotic proportion of time spent in state i, conditioned on H_k. Parallel work on hypothesis testing with finite memory [28] establishes that irreducible automata are at least one state better than reducible automata. The same argument applies to the current formulation of the TABP so that, here too, attention will be restricted to irreducible automata.

Letting r_0 and r_1 be the asymptotic proportions of successes under H_0 and H_1, it is seen that

r_0 = \sum_{i=1}^{m} \mu_i^0 \alpha_i   (22)

r_1 = \sum_{i=1}^{m} \mu_i^1 (1 - \alpha_i)   (23)

where the \alpha_i are defined by (15).

If a Bayesian approach is taken and a priori probabilities \pi_0 and \pi_1 (\pi_0 + \pi_1 = 1) are assigned to H_0 and H_1, then

r = \pi_0 r_0 + \pi_1 r_1.   (24)

Although the Bayesian approach will be taken, the results will apply to the Neyman-Pearson formulation as well. In the Neyman-Pearson formulation the problem is to maximize r_1 subject to the constraint r_0 \geq \alpha, for a given level \alpha.
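The quantities (14)-(24) are straightforward to compute numerically when outcomes are binary. The sketch below, which assumes numpy, evaluates r for a small placeholder automaton on the 0.9/0.8 coins; the matrices `up` and `down` and the output vector `alpha` are arbitrary choices for illustration, not the optimal design of the following sections.

```python
import numpy as np

def stationary(P):
    """Left eigenvector of P for the Perron eigenvalue 1: mu = mu P."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()

# Densities on {tails, heads} w.r.t. counting measure: under H0 experiment A
# has density f0 (bias 0.9) and B has f1 (bias 0.8); under H1 the roles swap.
f0 = np.array([0.1, 0.9])
f1 = np.array([0.2, 0.8])

m = 2
alpha = np.array([0.9, 0.1])     # alpha_i = Pr{choose A | state i}   (15)

# p[(e, y)][i, j] = p_ij(e, y)   (14); a placeholder two-state design that
# drifts toward state 1 on evidence for H0 and toward state 2 otherwise.
up, down = np.array([[1., 0.], [1., 0.]]), np.array([[0., 1.], [0., 1.]])
p = {('A', 1): up, ('A', 0): down, ('B', 1): down, ('B', 0): up}

def transition(dA, dB):
    """Average over experiment choice and outcome: equations (18)-(19)."""
    return sum(alpha[:, None] * p[('A', y)] * dA[y]
               + (1 - alpha)[:, None] * p[('B', y)] * dB[y] for y in (0, 1))

P0, P1 = transition(f0, f1), transition(f1, f0)        # (16)-(19)
mu0, mu1 = stationary(P0), stationary(P1)              # (20)-(21)
r0, r1 = mu0 @ alpha, mu1 @ (1 - alpha)                # (22)-(23)
print(0.5 * r0 + 0.5 * r1)                             # (24), pi0 = pi1 = 1/2
```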

Returning to the Bayesian formulation, the goal is to maximize r over all p_{ij}(x) and \alpha_i, i, j = 1, 2, \cdots, m. Designate this maximum value of r by r*. In order to place an upper bound on r it is necessary to relate the parameters of the automaton to the statistics of the problem. The following definitions and theorems will prove useful.

Definitions

Let the measure \nu = \mathcal{P}_0 + \mathcal{P}_1. Thus \mathcal{P}_0 and \mathcal{P}_1 are both absolutely continuous with respect to \nu. Define f_0(y) and f_1(y) to be the Radon-Nikodym derivatives of \mathcal{P}_0 and \mathcal{P}_1 with respect to \nu (f_0 and f_1 are the probability density functions of \mathcal{P}_0 and \mathcal{P}_1 with respect to \nu). Let

l_A(y) = f_0(y)/f_1(y)
l_B(y) = f_1(y)/f_0(y) = 1/l_A(y).   (25)

It is seen that l_A and l_B are the likelihood ratios for an experimental outcome y that results from experiments A and B, respectively. Further define (for C \subseteq \mathcal{Y})

\underline{l}_A = \inf_{\nu(C) > 0} \mathcal{P}_0(C)/\mathcal{P}_1(C)   (26)

\underline{l}_B = \inf_{\nu(C) > 0} \mathcal{P}_1(C)/\mathcal{P}_0(C).   (27)

Thus \underline{l}_A is the almost everywhere (a.e.) minimum likelihood ratio (l.r.) for experiment A, and \bar{l}_A = 1/\underline{l}_B is the a.e. maximum l.r. for experiment A. Similarly, \underline{l}_B and \bar{l}_B = 1/\underline{l}_A are the a.e. minimum and maximum l.r.'s for experiment B. Thus defining

\bar{l} = \max\{\bar{l}_A, \bar{l}_B\},  \underline{l} = \min\{\underline{l}_A, \underline{l}_B\},   (28)

it is seen that

\bar{l} = \max\{\bar{l}_A, 1/\underline{l}_A\}   (29)

\underline{l} = \min\{\underline{l}_A, 1/\bar{l}_A\}   (30)

\bar{l} = 1/\underline{l}.   (31)

The likelihood ratio l(x) of an observation x = (e, y), e \in \{A, B\}, is defined to be

l(x) = l_e(y).   (32)

Obviously, \underline{l} \leq l(x) \leq \bar{l}.

For example, if two unlabeled coins of biases p_1 = 0.7 and p_2 = 0.8 are given, the possible events C are heads and tails, and

\bar{l}_A = \max\{0.7/0.8, 0.3/0.2\} = 3/2   (33a)

\bar{l}_B = \max\{0.8/0.7, 0.2/0.3\} = 8/7   (33b)

\bar{l} = \max\{\bar{l}_A, \bar{l}_B\} = 3/2.   (33c)

Since l_A(tails) = \bar{l} = 3/2 and l_B(tails) = \underline{l} = 2/3, the maximum and minimum likelihood ratio events are given by tails on coin A and tails on coin B, respectively.
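A few lines suffice to check (33a)-(33c) numerically (a sketch; the outcome space is just heads and tails):

```python
f0 = {'H': 0.7, 'T': 0.3}    # density of the bias-0.7 coin
f1 = {'H': 0.8, 'T': 0.2}    # density of the bias-0.8 coin

lA = {y: f0[y] / f1[y] for y in f0}    # l_A(y) = f0(y)/f1(y)   (25)
lB = {y: f1[y] / f0[y] for y in f0}    # l_B(y) = 1/l_A(y)

lbar_A = max(lA.values())              # (33a): 3/2, attained at tails
lbar_B = max(lB.values())              # (33b): 8/7, attained at heads
lbar = max(lbar_A, lbar_B)             # (33c): 3/2
lmin = min(min(lA.values()), min(lB.values()))   # underline-l = 2/3

print(lbar_A, lbar_B, lbar, lmin)      # 1.5 1.1428... 1.5 0.6666...
# extreme events: tails on A attains l-bar, tails on B attains underline-l
assert max(lA, key=lA.get) == 'T' and min(lB, key=lB.get) == 'T'
```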
Theorem 1

For all i, j \in \{1, 2, \cdots, m\},

1/\bar{l} \leq p_{ij}^0 / p_{ij}^1 \leq \bar{l}.   (34)

Proof: From (16) it is seen that

p_{ij}^0 = \Pr\{T_n = j \mid T_{n-1} = i, H_0\}.   (35)

Equivalently,

p_{ij}^0 = \Pr\{T_n = j \mid T_{n-1} = i, H_0, e_n = A\} \Pr\{e_n = A \mid T_{n-1} = i, H_0\}
         + \Pr\{T_n = j \mid T_{n-1} = i, H_0, e_n = B\} \Pr\{e_n = B \mid T_{n-1} = i, H_0\}.   (36)

But, since the choice of e_n is a (randomized) function of T_{n-1} alone,

\Pr\{e_n = A \mid T_{n-1} = i, H_0\} = \Pr\{e_n = A \mid T_{n-1} = i\} = \alpha_i.   (37)

Similarly,

\Pr\{e_n = B \mid T_{n-1} = i, H_0\} = 1 - \alpha_i.   (38)

From (14),

\Pr\{T_n = j \mid T_{n-1} = i, H_0, e_n = A\} = \int p_{ij}(A, y) f_0(y) \, d\nu(y),   (39)

since under H_0 the experimental outcome Y has f_0 as its density function when A is performed. Similarly,

\Pr\{T_n = j \mid T_{n-1} = i, H_0, e_n = B\} = \int p_{ij}(B, y) f_1(y) \, d\nu(y).   (40)

Then (36) becomes

p_{ij}^0 = \alpha_i \int p_{ij}(A, y) f_0(y) \, d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) f_1(y) \, d\nu(y).   (41)

By definition,

f_0(y) = l_A(y) f_1(y)   (42)

f_1(y) = l_B(y) f_0(y)   (43)

so that

p_{ij}^0 = \alpha_i \int p_{ij}(A, y) l_A(y) f_1(y) \, d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) l_B(y) f_0(y) \, d\nu(y).   (44)

Furthermore, l_A(y) \leq \bar{l} and l_B(y) \leq \bar{l} a.e. \nu, since by (28) \bar{l} = \max\{\bar{l}_A, \bar{l}_B\}. Hence

p_{ij}^0 \leq \bar{l} \left[ \alpha_i \int p_{ij}(A, y) f_1(y) \, d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) f_0(y) \, d\nu(y) \right].   (45)

Proceeding similarly it is found that

p_{ij}^1 = \alpha_i \int p_{ij}(A, y) f_1(y) \, d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) f_0(y) \, d\nu(y).   (46)

Combining (45) and (46) yields

p_{ij}^0 / p_{ij}^1 \leq \bar{l},   (47)

thus proving half of the theorem. The other half follows in an analogous manner.

Definition

The state likelihood ratio vector \lambda = (\lambda_1, \cdots, \lambda_m) is defined by

\lambda_i = \mu_i^0 / \mu_i^1,  i = 1, 2, \cdots, m.   (48)
Theorem 2

For an irreducible automaton in which the \lambda_i are arranged in nondecreasing order, the following relation holds:

\lambda_{i+1} / \lambda_i \leq \bar{l}^2.   (49)

Remark: Since it has been noted that irreducible automata can do at least as well as reducible ones, the irreducibility restriction is of no consequence.

Proof: The proof of this theorem follows from Theorem 1, using arguments contained in Lemma 2 of [28]. The reader is referred there for details.

Theorem 3

For an m-state automaton, r is bounded above by r*, where

r* = \max\{\pi_0, \pi_1, (\bar{l}^{2(m-1)} - 2(\pi_0 \pi_1)^{1/2} \bar{l}^{m-1}) / (\bar{l}^{2(m-1)} - 1)\}.   (50)

In the special case \pi_0 = \pi_1 = 1/2,

r* = \bar{l}^{m-1} / (\bar{l}^{m-1} + 1).   (51)

Remark 1: If r* = \pi_0 (or \pi_1), a degenerate situation exists in which the machine that always chooses experiment A (or B) is optimal. In this case memory is not large enough to gather sufficient information to offset the a priori bias [28].

Remark 2: The larger the value of \bar{l}, the larger the resultant proportion of successes r*. Thus, \bar{l} is a measure of the separation between H_0 and H_1.

Example

Before proceeding with the proof of the theorem, an example will be given. Consider the coin-flipping TABP

H_0: p_A = 0.9, p_B = 0.8, with \pi_0 = 1/2
H_1: p_A = 0.8, p_B = 0.9, with \pi_1 = 1/2

where p_A and p_B are the probabilities of the event heads (H) for coins A and B under the appropriate hypothesis. Thus, for example, if coin A is flipped and H_0 is true, then \Pr\{heads\} = 0.9. Calculation shows that

\bar{l} = \max\{\bar{l}_A, \bar{l}_B\} = \max\{9/8, 2\} = 2.   (52)

Thus, for an m-state memory the best possible limiting proportion of uses of the best coin (in this case, coin A) is given by

r* = 2^{m-1} / (2^{m-1} + 1).   (53)

In the next section an automaton will be exhibited that achieves r* arbitrarily closely.

Example

If the coin biases are instead 0.5 and 0.501, the situation is quite different. Here \bar{l} \approx 1.002 and

r* \approx (1.002)^{m-1} / ((1.002)^{m-1} + 1).   (54)

Thus, even m = 500 states yields only a proportion of successes r* \approx e/(e + 1). No 500-state machine can do better.
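Both examples are easy to reproduce (a sketch; the choice m = 5 in the first print is arbitrary):

```python
import math

def r_star(lbar, m):
    """Equation (51): the optimal proportion for pi0 = pi1 = 1/2."""
    L = lbar ** (m - 1)
    return L / (L + 1)

# 0.9 / 0.8 coins: lbar = max{9/8, 2} = 2, so r* = 2^(m-1)/(2^(m-1)+1)   (53)
print(r_star(2.0, 5))                             # 16/17 ~ 0.941 for m = 5

# 0.5 / 0.501 coins: lbar = 0.5/0.499 ~ 1.002                            (54)
lbar = 0.5 / 0.499
print(r_star(lbar, 500), math.e / (math.e + 1))   # both ~ 0.731
```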
Proof: By Theorem 2,

\lambda_2 \leq \lambda_1 \bar{l}^2,  \lambda_3 \leq \lambda_1 \bar{l}^4,  \cdots,  \lambda_m \leq \lambda_1 \bar{l}^{2(m-1)}.

Thus, for all i \in \mathcal{T} = \{1, 2, \cdots, m\},

\lambda_i \leq \lambda_1 \bar{l}^{2(m-1)}.   (55)

Hence

r_0 = \sum_{i=1}^{m} \mu_i^0 \alpha_i = \sum_{i=1}^{m} \lambda_i \mu_i^1 \alpha_i \leq \lambda_1 \bar{l}^{2(m-1)} \sum_{i=1}^{m} \mu_i^1 \alpha_i.   (56)

But

\sum_{i=1}^{m} \mu_i^1 \alpha_i = 1 - r_1,   (57)

so

r_0 \leq \lambda_1 \bar{l}^{2(m-1)} (1 - r_1).   (58)
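As a closing sanity check on Theorems 1 and 3, one can draw random irreducible automata for the 0.9/0.8 example and confirm that the transition-ratio bound (34) holds and that the achieved r never exceeds the r* of (51). This numerical experiment is illustrative only and assumes numpy.

```python
import numpy as np

rng = np.random.default_rng(0)
m, lbar = 3, 2.0                        # 0.9/0.8 coins: lbar = 2   (52)
f0, f1 = np.array([0.1, 0.9]), np.array([0.2, 0.8])

def stationary(P):
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()

worst = 0.0
for _ in range(200):
    # random strictly positive stochastic matrices p_ij(e, y), random alpha_i
    p = {(e, y): rng.dirichlet(np.ones(m), size=m) for e in 'AB' for y in (0, 1)}
    alpha = rng.random(m)
    def trans(dA, dB):
        return sum(alpha[:, None] * p[('A', y)] * dA[y]
                   + (1 - alpha)[:, None] * p[('B', y)] * dB[y] for y in (0, 1))
    P0, P1 = trans(f0, f1), trans(f1, f0)
    # Theorem 1, eq. (34): entries of P0 and P1 differ by at most a factor lbar
    assert np.all(P0 <= lbar * P1 + 1e-12) and np.all(P1 <= lbar * P0 + 1e-12)
    r = 0.5 * (stationary(P0) @ alpha) + 0.5 * (stationary(P1) @ (1 - alpha))
    worst = max(worst, r)

print(worst, lbar ** (m - 1) / (lbar ** (m - 1) + 1))  # r stays below r* = 4/5
```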

References

H. Robbins, "Some aspects of the sequential design of experiments," Bull. Amer. Math. Soc., vol. 58, 1952.
M. E. Hellman and T. M. Cover, "Learning with finite memory," Ann. Math. Statist., vol. 41, 1970.
T. M. Cover, "Hypothesis testing with finite statistics," Ann. Math. Statist., vol. 40, 1969.