
The two-armed-bandit problem with time-invariant finite memory

TL;DR: This paper solves the classical two-armed-bandit problem under the finite-memory constraint described below and shows that the optimal value of r, over all m-state algorithms (f, e), is l^{m-1}/(l^{m-1} + 1).
Abstract: This paper solves the classical two-armed-bandit problem under the finite-memory constraint described below. Given are probability densities p_0 and p_1, and two experiments A and B. It is not known which density is associated with which experiment. Thus the experimental outcome Y of experiment A is as likely to be distributed according to p_0 as it is to be distributed according to p_1. It is desired to sequentially choose an experiment to be performed on the basis of past observations according to the algorithm T_n = f(T_{n-1}, e_n, Y_n), e_n = e(T_{n-1}), where T_n \in \{1, 2, \cdots, m\} is the state of memory at time n, e_n \in \{A, B\} is the choice of experiment, and Y_n is the random variable observation. The goal is to maximize the asymptotic proportion r of uses of the experiment associated with density p_0. Let l(y) = p_0(y)/p_1(y), and let \underline{l} and \bar{l} denote the almost everywhere greatest lower bound and least upper bound on l(y). Let l = \max\{\bar{l}, 1/\underline{l}\}. Then the optimal value of r, over all m-state algorithms (f, e), will be shown to be l^{m-1}/(l^{m-1} + 1). An \epsilon-optimal family of m-state algorithms will be demonstrated. In general, optimal algorithms do not exist, and \epsilon-optimal algorithms require artificial randomization.
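To make the headline formula concrete, here is a minimal sketch (not code from the paper) that computes the optimal limiting proportion l^{m-1}/(l^{m-1}+1) for two discrete densities given as dictionaries; the helper name `optimal_proportion` and the dictionary representation are illustrative assumptions.

```python
# Sketch: optimal limiting proportion for an m-state memory, where
# l = max{ sup_y p0(y)/p1(y), sup_y p1(y)/p0(y) } for discrete densities.
def optimal_proportion(p0, p1, m):
    """p0, p1: dicts mapping outcome -> probability; m: number of memory states."""
    ratios = [p0[y] / p1[y] for y in p0 if p0[y] > 0 and p1[y] > 0]
    l_bar = max(ratios)            # a.e. least upper bound of l(y) = p0(y)/p1(y)
    l_under = min(ratios)          # a.e. greatest lower bound of l(y)
    l = max(l_bar, 1.0 / l_under)  # the l of the abstract
    return l ** (m - 1) / (l ** (m - 1) + 1)

# The 0.9 / 0.8 coin pair used as an example later in the paper:
p0 = {"H": 0.9, "T": 0.1}
p1 = {"H": 0.8, "T": 0.2}
print(optimal_proportion(p0, p1, m=3))   # 0.8, since l = 2 and 2^2/(2^2 + 1) = 4/5
```

For that coin pair l = 2, so a 3-state memory can use the better coin at most 80 percent of the time in the limit.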

Summary (2 min read)

I. INTRODUCTION

  • Suppose one is given two coins, labeled A and B. Suppose also that it is known that one of the coins has bias p_1 towards heads and the other has bias p_2 towards heads, but it is not known which coin has which bias.
  • A further generalization of the TABP to an infinite number of coins will be provided in Section VI.
  • A good test requires large memory, but, as mentioned, hypothesis testing may not yield a high proportion of successes.

II. HISTORY OF THE PROBLEM

  • In that paper there was no constraint on memory and the experiments were restricted to be binary-valued (coin tosses).
  • Feldman [4] has solved the generalized version of the TABP corresponding to (1) (with known a priori probabilities) in the infinite-memory case.
  • The algorithm is independent of the biases p_1 and p_2 on the two coins, and thus is optimal (achieves r = 1) for the more general problem of maximizing the asymptotic proportion of heads with two coins having arbitrary unknown biases.
  • It should be mentioned that the motivation of the previous papers is different from ours in the respect that previous work centered on modeling learning processes by finite-state automata.
  • For this reason, the number of states m was frequently allowed to tend to infinity in the analysis, and the emphasis on optimal m-state automata was lost.

III. FINITE-STATE MACHINES

  • Attention is restricted to the algorithm T_n = f(T_{n-1}, X_n), e_n = e(T_{n-1}), where f_0 and f_1 are the Radon-Nikodym derivatives of the measures P_0 and P_1 with respect to some dominating measure ν. Define the m × m matrices P^0 = [p^0_ij] and P^1 = [p^1_ij], and let μ^0 and μ^1 be the stationary probability distributions on the state space under H_0 and H_1.
  • The stationary probability distributions are solutions of the matrix equations μ^0 = μ^0 P^0 and μ^1 = μ^1 P^1, where T is the state of memory, e the choice of experiment, and Y the resulting observation (a numerical sketch follows this list).
  • A reformulation of this algorithm in the terminology of finite-state machines will be convenient.
  • Parallel work on hypothesis testing with finite memory [28] establishes that irreducible automata are at least "one state better" than reducible automata.
  • The following definitions and theorems will prove useful.
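The stationary equations referred to above can be solved numerically. A minimal sketch, assuming numpy and an illustrative 3-state matrix (not taken from the paper):

```python
import numpy as np

def stationary(P):
    """Return mu with mu P = mu and sum(mu) = 1, via a least-squares solve."""
    m = P.shape[0]
    A = np.vstack([P.T - np.eye(m), np.ones((1, m))])  # rows: (P^T - I) mu = 0 and 1^T mu = 1
    b = np.concatenate([np.zeros(m), [1.0]])
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

P0 = np.array([[0.9, 0.1, 0.0],
               [0.2, 0.6, 0.2],
               [0.0, 0.1, 0.9]])   # an illustrative 3-state P^0 (rows sum to 1)
print(stationary(P0))              # stationary distribution mu^0
```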

V. COMPOSITE HYPOTHESES

  • For the problem in which two coins of unknown bias are presented, the solution is almost independent of the exact biases of the coins.
  • Two hypotheses exist concerning p_A and p_B, the respective biases of the coins toward heads: H_0: p_A > p_B versus H_1: p_A < p_B, followed by the utilization of the coin deemed to have the higher probability of heads.
  • It is readily verified by indexing the states in reverse order that the ε-optimal m-state machine for (p_1, p_2) in one hypothesis region is identical to that for (p_1, p_2) in the complementary region.
  • This is the conservatism that the finite-memory constraint demands.
  • The TABP with finite memory given a finite sequence of observations has not been solved.

VII. CONCLUSIONS

  • Inspection of the solution of the TABP indicates that optimal finite-memory learning is reasonably far from human intuition and practice.
  • Some interesting properties of the solution are the following.
  • The state transition function f is deterministic.
  • This differs from what the authors might call the one-armed-bandit problem [29], in which the experiment to be performed at each stage is unchanged, but for which the ε-optimal state transition function f involves randomization.
  • This conflict does not generally disappear in the infinite-memory case.


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. IT-16, NO. 2, MARCH 1970
The Two-Armed-Bandit Problem With
Time-Invariant Finite Memory
THOMAS M. COVER AND MARTIN E. HELLMAN
Abstract—This paper solves the classical two-armed-bandit problem under the finite-memory constraint described below. Given are probability densities $p_0$ and $p_1$, and two experiments $A$ and $B$. It is not known which density is associated with which experiment. Thus the experimental outcome $Y$ of experiment $A$ is as likely to be distributed according to $p_0$ as it is to be distributed according to $p_1$. It is desired to sequentially choose an experiment to be performed on the basis of past observations according to the algorithm $T_n = f(T_{n-1}, e_n, Y_n)$, $e_n = e(T_{n-1})$, where $T_n \in \{1, 2, \cdots, m\}$ is the state of memory at time $n$, $e_n \in \{A, B\}$ is the choice of experiment, and $Y_n$ is the random variable observation. The goal is to maximize the asymptotic proportion $r$ of uses of the experiment associated with density $p_0$.

Let $l(y) = p_0(y)/p_1(y)$, and let $\underline{l}$ and $\bar{l}$ denote the almost everywhere greatest lower bound and least upper bound on $l(y)$. Let $l = \max\{\bar{l}, 1/\underline{l}\}$. Then the optimal value of $r$, over all $m$-state algorithms $(f, e)$, will be shown to be $l^{m-1}/(l^{m-1} + 1)$. An $\epsilon$-optimal family of $m$-state algorithms will be demonstrated. In general, optimal algorithms do not exist, and $\epsilon$-optimal algorithms require artificial randomization.
I. INTRODUCTION

SUPPOSE one is given two coins, labeled $A$ and $B$. Suppose also that it is known that one of the coins has bias $p_1$ towards heads and the other has bias $p_2$ towards heads, but it is not known which coin has which bias. At each trial a coin is to be selected and tossed, and it is desired to maximize the proportion of heads (successes) achieved in the limit as the number of trials tends to infinity. An equivalent objective is to maximize the proportion of tosses using the coin with the larger bias. How should the choice of coin at trial $n$ depend on the previous outcomes, in order to achieve this goal? This problem is commonly referred to as the sequential design of experiments or the two-armed-bandit problem (TABP) [1]-[3].
Note that this problem combines hypothesis testing
(which coin has which bias?) with the added degree of
freedom that the experimenter may select his experiment
(A or B) at each toss. The experimenter must utilize
his information to maximize the proportion of successes.
This paper will be concerned with a generalized TABP
in which the coins may have an infinite number of
sides. A further generalization of the TABP to an infinite
number of coins will be provided in Section VI. These
problems will be solved under a finite-memory constraint, i.e., the experimenter is not allowed to remember the outcomes of all previous trials, but only a finite-valued statistic. On the basis of this statistic, the next coin must be chosen.

Manuscript received May 19, 1969; revised September 29, 1969. This work was supported in part under Contract AF 49(638)-1517 and under the NSF Graduate Fellowship Program.
T. M. Cover is with the Department of Electrical Engineering, Stanford University, Stanford, Calif. 94305.
M. E. Hellman was with the IBM Watson Research Center, Yorktown Heights, N.Y. He is now with the Massachusetts Institute of Technology, Cambridge, Mass. 02139.
Stated more precisely, the experimenter is provided two experiments, $A$ and $B$. Also given are two probability measures $\mathscr{P}_0$ and $\mathscr{P}_1$ defined on the arbitrary probability space $(\mathcal{Y}, \mathcal{B})$, where $\mathcal{Y}$ is the experimental outcome space and $\mathcal{B}$ is a $\sigma$-field of subsets of $\mathcal{Y}$. There are two hypotheses concerning the probability distribution of the experimental outcome $Y$:
$$\begin{array}{ll} H_0: & Y \sim \mathscr{P}_0 \text{ under experiment } A, \quad Y \sim \mathscr{P}_1 \text{ under experiment } B \\ H_1: & Y \sim \mathscr{P}_1 \text{ under experiment } A, \quad Y \sim \mathscr{P}_0 \text{ under experiment } B. \end{array} \tag{1}$$
Let the a priori probabilities of $H_0$ and $H_1$ be $\pi_0$ and $\pi_1$ respectively, where $\pi_0 + \pi_1 = 1$. This seemingly Bayesian formulation, in which the priors are specified, is not restrictive since the set of all admissible algorithms (or the set of all optimal algorithms with respect to the Neyman-Pearson formulation) may be generated by letting $\pi_0$ take on all values in the unit interval.
Let $e_i \in \{A, B\}$ denote the $i$th experiment performed and $Y_i \in \mathcal{Y}$ denote the $i$th experimental outcome. It is assumed that the experimental outcomes are independent in the sense that
$$P(Y_i, Y_j \mid e_i, e_j, H) = P(Y_i \mid e_i, H)\,P(Y_j \mid e_j, H), \qquad i \neq j,$$
where $H$ is the true hypothesis.
A success is said to occur if the experiment associated with $\mathscr{P}_0$ is performed. At times $n = 1, 2, 3, \cdots$ a choice of experiment $e_n$ is made. Letting
$$s_n = \begin{cases} 1, & \text{if success occurs at time } n, \\ 0, & \text{if failure occurs at time } n, \end{cases} \tag{2}$$
the objective is to maximize
$$r = \lim_{n \to \infty} E\left\{\frac{1}{n}\sum_{i=1}^{n} s_i\right\} \tag{3}$$
where the expectation is taken with respect to the distribution on the two hypotheses and the distribution on $\{s_i\}$ induced by the experiment selection algorithm. Therefore, $r$ is the expected long run proportion of successes.

Let the data be summarized by an $m$-valued statistic $T$ that is updated according to the rule
$$T_n = f(T_{n-1}, X_n), \qquad T_n \in \{1, 2, \cdots, m\} \tag{4}$$

where $T_n$ is the value of $T$ after $n$ observations, $X_n = (e_n, Y_n)$ is the $n$th observation (note the difference between an observation $X = (e, Y)$ and an experimental outcome $Y$), and $f$ is a stochastic function. Further, let $e_n$ be constrained to depend on the past outcomes $X_1, X_2, \cdots, X_{n-1}$ only through $T_{n-1}$, according to the function
$$e_n = e(T_{n-1}), \qquad n = 1, 2, \cdots \tag{5}$$
where $e: \{1, 2, \cdots, m\} \to \{A, B\}$ is again allowed to be a stochastic function. (The randomization in the functions $f$ and $e$ must, to avoid cheating, be independent of the data.) The size of memory is defined to be $m$.

The objective is now to find the pair $(f, e)$ that maximizes $r$ for given $m$, $\pi_0$, $\mathscr{P}_0$, and $\mathscr{P}_1$. For a reformulation in terms of optimal finite-state machines see Section III.

As was previously mentioned, it is not only necessary to test $H_0$ versus $H_1$, but also to use the result of the test in an attempt to obtain successes. This produces a conflict. The experimenter may believe $H_0$ (in which case he should perform $A$) and yet he may wish to perform $B$ if it would yield more information, thereby increasing the probability of success on future trials. The conflict is between a desire for immediate success and a desire to gather information.

Another conflict exists. A good test requires large memory, but, as mentioned, hypothesis testing may not yield a high proportion of successes. Thus, once the test is completed, a large number of experiments that use the result of the test is desired. However, an $m$-valued statistic can only count to $m$. There is a problem in deciding how much memory to allocate to testing and how much to allocate to using the information gathered by testing. Fortunately, the optimal solution that we shall present suggests an interpretation answering this question. The surprising answer is that all of the states of memory may be devoted to hypothesis testing, and the information so gathered may be used to gain successes in a manner that does not interfere with the hypothesis testing.
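As a rough illustration of the setup in (2)-(5), the following sketch estimates the conditional success proportion of a candidate m-state pair (f, e) on the two-coin problem by simulation. It is not the paper's construction; the particular rule, the coin biases, and the function names are assumptions chosen for the example, and r itself is obtained by averaging such runs over hypotheses drawn with the prior probabilities.

```python
import random

def estimate_r_given_h(f, e, h0, p_good=0.9, p_bad=0.8, n_trials=200_000):
    """Long-run proportion of successes (choices of the better experiment),
    conditioned on hypothesis h0 (True means H0, i.e. A is the better coin)."""
    bias = {"A": p_good if h0 else p_bad, "B": p_bad if h0 else p_good}
    good = "A" if h0 else "B"
    state, successes = 1, 0
    for _ in range(n_trials):
        exp = e(state)                                   # e_n = e(T_{n-1}), as in (5)
        y = "H" if random.random() < bias[exp] else "T"  # outcome of the chosen coin
        successes += (exp == good)
        state = f(state, exp, y)                         # T_n = f(T_{n-1}, X_n), as in (4)
    return successes / n_trials

# An illustrative (far from optimal) 3-state rule: drift up on evidence for H0
# (heads from A or tails from B), drift down otherwise; choose A in upper states.
def f(t, exp, y):
    up = (exp == "A" and y == "H") or (exp == "B" and y == "T")
    return min(t + 1, 3) if up else max(t - 1, 1)

def e(t):
    return "A" if t >= 2 else "B"

r0 = estimate_r_given_h(f, e, h0=True)    # estimate of r_0
r1 = estimate_r_given_h(f, e, h0=False)   # estimate of r_1
print(0.5 * r0 + 0.5 * r1)                # r for equal priors
```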
II. HISTORY OF THE PROBLEM

The TABP was introduced by Robbins [1] in 1952. In that paper there was no constraint on memory and the experiments were restricted to be binary-valued (coin tosses). Robbins argued that a scheme that sampled the inferior coin infinitely often, but with density of sampling tending to zero, yielded $r = 1$. Here, at a particular time, the inferior coin is defined to be the coin yielding the lower cumulative proportion of heads. Subsequently, Bradt, Johnson, and Karlin [2] and Bradt and Karlin [3] examined generalizations of the TABP in which it was desired to maximize the number of successes in a finite number of trials. This problem remains open in the case where the coin biases $(p_1, p_2)$ have an arbitrary known joint distribution. However, Feldman [4] has solved the generalized version of the TABP corresponding to (1) (with known a priori probabilities) in the infinite-memory case.

The idea of adding a finite-memory constraint is due to Robbins [5]. Robbins defines memory to be of length $k$ if the choice of coin at each trial is allowed to depend only on the outcomes of the $k$ previous trials. Letting $\mathcal{X} = \{A, B\} \times \mathcal{Y}$ denote the observation space, the problem becomes one of determining the function $e: \mathcal{X}^k \to \{A, B\}$ for which the algorithm
$$e_{n+1} = e(X_n, X_{n-1}, \cdots, X_{n-k+1}) \tag{6}$$
$$X_n = (e_n, Y_n) \tag{7}$$
maximizes $r$. Since $\mathcal{X}$ has but four members in Robbins's problem, memory is still finite according to the definition of Section I, with $m = 4^k$. However, if the experimental outcome space $\mathcal{Y}$ is infinite, an infinite-state memory is needed to recall the last $k$ experimental outcomes.

Although Robbins's original algorithm has been successively improved by Isbell [6], Smith and Pyke [7], and Samuels [8], an optimal scheme has not been established. However, if the choice of coin may also depend on time, the problem has been solved by Cover [9]. A memory $k = 2$ is sufficient, i.e., there exists an algorithm $e$ for which the scheme
$$e_{n+1} = e(X_n, X_{n-1}, n) \tag{8}$$
achieves an asymptotic proportion of successes $r = 1$. The algorithm is independent of the biases $p_1$ and $p_2$ on the two coins, and thus is optimal (achieves $r = 1$) for the more general problem of maximizing the asymptotic proportion of heads with two coins having arbitrary unknown biases. This work also implies that, with the definition of memory given in Section I, a memory of $m = 4$ states is sufficient [10] for a time-varying algorithm to achieve $r = 1$.
A series of publications following the work of Tsetlin
has appeared in the Russian literature [11]-[21] on the
behavior of automata in random media in an attempt to
model adaptive or learning systems. In many cases the
algorithms considered are similar to the TABP with finite
memory of the type defined in Section I. A series of ad hoc expedient automata (i.e., automata that perform better than simply alternating coins at each trial) is examined,
but no optimal automata are found. Subsequent work by
Fu and Li [22], [23] and Chandrasekaran and Shen [24]-[27]
has enlarged the set of algorithms for which the asymptotic
behavior has been found. The fundamental problem
implicit in [11]-[27] is presented in Section I and solved
in this paper. It should be mentioned that the motivation
of the previous papers is different from ours in the respect
that previous work centered on modeling learning proc-
esses by finite-state automata. For this reason, the
number of states m was frequently allowed to tend to
infinity in the analysis, and the emphasis on optimal
m-state automata was lost.
Note this one word of caution. Memory size has been
defined to be the number of states of the automaton. This
seems to us to be natural. However, we have not included
any measure of the complexity of the computation of the

state transition function $f$ and the choice of experiment function $e$. Fortunately, the optimal function $f$ is rather simple to implement, as can be seen from the example at the end of Section IV. Moreover, if an auxiliary stream of random variables is available, the calculation of $f$ and $e$ may be performed by hard-wired circuitry without memory elements.

Fig. 1. Decision process viewed as a finite-state automaton.

III. FINITE-STATE MACHINES

The two-armed-bandit problem that will be solved in this section has the form
$$\begin{array}{lcc} & \text{experiment } A & \text{experiment } B \\ H_0: & Y \sim \mathscr{P}_0 & Y \sim \mathscr{P}_1 \\ H_1: & Y \sim \mathscr{P}_1 & Y \sim \mathscr{P}_0 \end{array} \tag{9}$$
where $\mathscr{P}_0$ and $\mathscr{P}_1$ are arbitrary known probability measures. Thus $Y$ is not restricted to be a binary-valued random variable as in previous work [1], [5]. In Section VI, the solution will be generalized to the form
$$\begin{array}{lcc} & A & B \\ H_0: & Y \sim \mathscr{P}_0 & Y \sim \mathscr{P}_1 \\ H_1: & Y \sim \mathscr{P}_2 & Y \sim \mathscr{P}_3. \end{array} \tag{10}$$

Attention will be restricted to the algorithm
$$T_n = f(T_{n-1}, X_n), \qquad T_n \in \{1, 2, \cdots, m\} \tag{11}$$
$$e_n = e(T_{n-1}), \qquad e_n \in \{A, B\} \tag{12}$$
$$X_n = (e_n, Y_n) \tag{13}$$
where $T$ is the state of memory, $e$ the choice of experiment, and $Y$ the resulting observation. A reformulation of this algorithm in the terminology of finite-state machines will be convenient. $X$ and $Y$ will denote random variables, and $x$ and $y$ their outcomes.

Consider a finite-state stochastic sequential machine with state space $\mathcal{S} = \{1, 2, \cdots, m\}$, input space $\mathcal{X} = \{A, B\} \times \mathcal{Y}$, and output space $\{A, B\}$. Let the state transition behavior of this machine be specified by a family of $m \times m$ stochastic matrices $[p_{ij}(x)]$ defined for $x = (e, y)$, $e \in \{A, B\}$, $y \in \mathcal{Y}$, and $i, j \in \{1, 2, \cdots, m\}$. Then
$$p_{ij}(e, y) = \Pr\{T_n = j \mid T_{n-1} = i,\; X_n = (e, y)\} \tag{14}$$
is the conditional probability of transition from memory state $i$ to $j$ under the observation of experiment $e$ with outcome $y$.

Let the output function be described by the sequence $\alpha_i$, $0 \le \alpha_i \le 1$, $i = 1, 2, \cdots, m$, with the interpretation that
$$\alpha_i = \Pr\{e_{n+1} = A \mid T_n = i\}. \tag{15}$$
Thus, the next experiment chosen is a random variable depending solely on the past experience as summarized in the current state of memory $T_n$. The automaton is depicted in Fig. 1.

The state transition matrices conditioned on $H_0$ and $H_1$ are given by
$$p^0_{ij} = E\{p_{ij}(X) \mid H_0\} \tag{16}$$
and
$$p^1_{ij} = E\{p_{ij}(X) \mid H_1\}. \tag{17}$$
As will be shown in the proof of Theorem 1, these expectations may be explicitly expressed as follows:
$$p^0_{ij} = \alpha_i \int p_{ij}(A, y) f_0(y)\, d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) f_1(y)\, d\nu(y) \tag{18}$$
$$p^1_{ij} = \alpha_i \int p_{ij}(A, y) f_1(y)\, d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) f_0(y)\, d\nu(y) \tag{19}$$
where $f_0$ and $f_1$ are the Radon-Nikodym derivatives (densities) of $\mathscr{P}_0$ and $\mathscr{P}_1$ with respect to some dominating measure $\nu$. Define the $m \times m$ matrices $P^0 = [p^0_{ij}]$ and $P^1 = [p^1_{ij}]$, and let $\mu^0$ and $\mu^1$ be the stationary probability distributions on the state space $\mathcal{S}$ under $H_0$ and $H_1$. The stationary probability distributions are solutions of the matrix equations
$$\mu^0 = \mu^0 P^0 \tag{20}$$
$$\mu^1 = \mu^1 P^1. \tag{21}$$
Note that if $P^k$ is irreducible, $\mu^k_i$ is the asymptotic proportion of time spent in state $i$, conditioned on $H_k$. Parallel work on hypothesis testing with finite memory [28] establishes that irreducible automata are at least "one state better" than reducible automata. The same argument applies to the current formulation of the TABP so that, here too, attention will be restricted to irreducible automata.

Letting $r_0$ and $r_1$ be the asymptotic proportion of successes under $H_0$ and $H_1$, it is seen that
$$r_0 = \sum_{i=1}^{m} \alpha_i \mu^0_i \tag{22}$$
$$r_1 = \sum_{i=1}^{m} (1 - \alpha_i) \mu^1_i \tag{23}$$
where the $\alpha_i$ are defined by (15).

If a Bayesian approach is taken and a priori probabilities $\pi_0$ and $\pi_1$ ($\pi_0 + \pi_1 = 1$) are assigned to $H_0$ and $H_1$, then
$$r = \pi_0 r_0 + \pi_1 r_1. \tag{24}$$
Although the Bayesian approach will be taken, the results will apply to the Neyman-Pearson formulation as well. In the Neyman-Pearson formulation the problem is to maximize $r_1$ subject to the constraint $r_0 \ge \alpha$, for a given level $\alpha$.
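The quantities (16)-(24) can be evaluated directly when the outcome space is finite, so the integrals in (18)-(19) reduce to sums. The sketch below, which assumes numpy and a dictionary representation of the machine, is illustrative rather than the paper's procedure.

```python
import numpy as np

def conditioned_matrices(p_trans, alpha, f0, f1):
    """p_trans[(e, y)]: m x m matrix of (14); alpha: length-m numpy array of (15);
    f0, f1: dicts of outcome probabilities (densities w.r.t. counting measure)."""
    Da, Db = np.diag(alpha), np.diag(1.0 - alpha)
    P0 = sum((Da @ p_trans[("A", y)]) * f0[y] + (Db @ p_trans[("B", y)]) * f1[y] for y in f0)
    P1 = sum((Da @ p_trans[("A", y)]) * f1[y] + (Db @ p_trans[("B", y)]) * f0[y] for y in f0)
    return P0, P1                               # the matrices of (18) and (19)

def stationary(P):
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()                          # solves (20) / (21)

def bayes_r(p_trans, alpha, f0, f1, pi0=0.5):
    P0, P1 = conditioned_matrices(p_trans, alpha, f0, f1)
    mu0, mu1 = stationary(P0), stationary(P1)
    r0 = float(alpha @ mu0)                     # (22): under H0 a success is choosing A
    r1 = float((1.0 - alpha) @ mu1)             # (23): under H1 a success is choosing B
    return pi0 * r0 + (1.0 - pi0) * r1          # (24)

# Tiny 2-state illustration: move to state 2 on evidence for H0 (heads from A or
# tails from B), move to state 1 otherwise; mostly choose A in state 2.
J = np.array([[0.0, 1.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [1.0, 0.0]])
p_trans = {("A", "H"): J, ("A", "T"): K, ("B", "H"): K, ("B", "T"): J}
alpha = np.array([0.1, 0.9])
print(bayes_r(p_trans, alpha, f0={"H": 0.9, "T": 0.1}, f1={"H": 0.8, "T": 0.2}))
```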

Returning to the Bayesian formulation, the goal is to maximize $r$ over all $p_{ij}(x)$ and $\alpha_i$; $i, j = 1, 2, \cdots, m$. Designate this maximum value of $r$ by $r^*$.

In order to place an upper bound on $r$ it is necessary to relate the parameters of the automaton to the statistics of the problem. The following definitions and theorems will prove useful.

Definitions

Let the measure $\nu = \mathscr{P}_0 + \mathscr{P}_1$. Thus $\mathscr{P}_0$ and $\mathscr{P}_1$ are both absolutely continuous with respect to $\nu$. Define $f_0(y)$ and $f_1(y)$ to be the Radon-Nikodym derivatives of $\mathscr{P}_0$ and $\mathscr{P}_1$ with respect to $\nu$ ($f_0$ and $f_1$ are the probability density functions of $\mathscr{P}_0$ and $\mathscr{P}_1$ with respect to $\nu$). Let
$$l_A(y) = f_0(y)/f_1(y), \qquad l_B(y) = f_1(y)/f_0(y) = 1/l_A(y). \tag{25}$$
It is seen that $l_A$ and $l_B$ are the likelihood ratios for an experimental outcome $y$ that results from experiments $A$ and $B$, respectively.

Further define (for $C \subset \mathcal{Y}$)
$$\bar{l}_A = \sup_{\nu(C) > 0} \frac{\mathscr{P}_0(C)}{\mathscr{P}_1(C)}, \qquad \underline{l}_A = \inf_{\nu(C) > 0} \frac{\mathscr{P}_0(C)}{\mathscr{P}_1(C)} \tag{26}$$
$$\bar{l}_B = \sup_{\nu(C) > 0} \frac{\mathscr{P}_1(C)}{\mathscr{P}_0(C)}, \qquad \underline{l}_B = \inf_{\nu(C) > 0} \frac{\mathscr{P}_1(C)}{\mathscr{P}_0(C)}. \tag{27}$$
Thus $\bar{l}_A$ is the almost everywhere (a.e.) maximum likelihood ratio (l.r.) for experiment $A$, and $\underline{l}_A$ is the a.e. minimum l.r. for experiment $A$. Similarly, $\bar{l}_B$ and $\underline{l}_B$ are the a.e. maximum and minimum l.r.'s for experiment $B$. Clearly, from the definitions,
$$\bar{l}_B = 1/\underline{l}_A, \qquad \underline{l}_B = 1/\bar{l}_A. \tag{28}$$
Thus defining
$$\bar{l} = \max\{\bar{l}_A, \bar{l}_B\} = \max\{\bar{l}_A, 1/\underline{l}_A\} \tag{29}$$
$$\underline{l} = \min\{\underline{l}_A, \underline{l}_B\} = \min\{\underline{l}_A, 1/\bar{l}_A\}, \tag{30}$$
it is seen that
$$\bar{l} = 1/\underline{l}. \tag{31}$$

Definition

The likelihood ratio $l(x)$ of an observation $x = (e, y)$, $e \in \{A, B\}$, is defined to be
$$l(x) = l_e(y). \tag{32}$$
Obviously, $\underline{l} \le l(x) \le \bar{l}$.

For example, if two unlabeled coins of biases $p_1 = 0.7$ and $p_2 = 0.8$ are given, the possible events $C$ are heads and tails, and
$$\bar{l}_A = 3/2, \qquad \underline{l}_A = 7/8 \tag{33a}$$
$$\bar{l}_B = 8/7, \qquad \underline{l}_B = 2/3 \tag{33b}$$
$$\bar{l} = \max\{3/2,\, 8/7\} = 3/2. \tag{33c}$$
Since $l_A(T) = \bar{l} = 3/2$ and $l_B(T) = \underline{l} = 2/3$, the maximum and minimum likelihood ratio events are given by tails on coin $A$ and tails on coin $B$, respectively.
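A quick numerical check of the definitions (25)-(31) on the 0.7 / 0.8 coin example above; the helper `lr_bounds` is an illustrative assumption, not code from the paper.

```python
def lr_bounds(f0, f1):
    """f0, f1: dicts of outcome probabilities. Returns (l_bar, l_under) of (29)-(30)."""
    lA = [f0[y] / f1[y] for y in f0]          # values of l_A(y) = f0(y)/f1(y), per (25)
    lA_bar, lA_under = max(lA), min(lA)
    l_bar = max(lA_bar, 1.0 / lA_under)       # (29), since l_bar_B = 1/l_under_A by (28)
    l_under = min(lA_under, 1.0 / lA_bar)     # (30)
    return l_bar, l_under

print(lr_bounds({"H": 0.7, "T": 0.3}, {"H": 0.8, "T": 0.2}))   # (1.5, 0.666...): 3/2 and 2/3
```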

Theorem 1

For all $i, j \in \{1, 2, \cdots, m\}$,
$$\underline{l} \le p^0_{ij}/p^1_{ij} \le \bar{l}. \tag{34}$$

Proof: From (16) it is seen that
$$p^0_{ij} = \Pr\{T_n = j \mid T_{n-1} = i, H_0\}. \tag{35}$$
Equivalently,
$$p^0_{ij} = \Pr\{T_n = j \mid T_{n-1} = i, H_0, e_n = A\}\,\Pr\{e_n = A \mid T_{n-1} = i, H_0\} + \Pr\{T_n = j \mid T_{n-1} = i, H_0, e_n = B\}\,\Pr\{e_n = B \mid T_{n-1} = i, H_0\}. \tag{36}$$
But, since the choice of $e_n$ is a (randomized) function of $T_{n-1}$ alone,
$$\Pr\{e_n = A \mid T_{n-1} = i, H_0\} = \Pr\{e_n = A \mid T_{n-1} = i\} = \alpha_i. \tag{37}$$
Similarly,
$$\Pr\{e_n = B \mid T_{n-1} = i, H_0\} = 1 - \alpha_i. \tag{38}$$
From (14),
$$\Pr\{T_n = j \mid T_{n-1} = i, H_0, e_n = A\} = \int p_{ij}(A, y) f_0(y)\, d\nu(y), \tag{39}$$
since under $H_0$ the experimental outcome $Y$ has $f_0$ as its density function when $A$ is performed. Similarly,
$$\Pr\{T_n = j \mid T_{n-1} = i, H_0, e_n = B\} = \int p_{ij}(B, y) f_1(y)\, d\nu(y). \tag{40}$$
Then (36) becomes
$$p^0_{ij} = \alpha_i \int p_{ij}(A, y) f_0(y)\, d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) f_1(y)\, d\nu(y). \tag{41}$$
By definition,
$$f_0(y) = l_A(y) f_1(y) \tag{42}$$
$$f_1(y) = l_B(y) f_0(y), \tag{43}$$
so that
$$p^0_{ij} = \alpha_i \int p_{ij}(A, y) l_A(y) f_1(y)\, d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) l_B(y) f_0(y)\, d\nu(y). \tag{44}$$
Furthermore, $l_A(y) \le \bar{l}$ and $l_B(y) \le \bar{l}$, a.e. $\nu$, since by (29) $\bar{l} = \max\{\bar{l}_A, \bar{l}_B\}$. Hence
$$p^0_{ij} \le \bar{l}\left[\alpha_i \int p_{ij}(A, y) f_1(y)\, d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) f_0(y)\, d\nu(y)\right]. \tag{45}$$
Proceeding similarly it is found that
$$p^1_{ij} = \alpha_i \int p_{ij}(A, y) f_1(y)\, d\nu(y) + (1 - \alpha_i) \int p_{ij}(B, y) f_0(y)\, d\nu(y). \tag{46}$$
Combining (45) and (46) yields
$$p^0_{ij}/p^1_{ij} \le \bar{l}, \tag{47}$$
thus proving half of the theorem. The other half follows in an analogous manner.

Definition

The state likelihood ratio vector $\lambda = (\lambda_1, \cdots, \lambda_m)$ is defined by
$$\lambda_i = \mu^0_i/\mu^1_i, \qquad i = 1, 2, \cdots, m. \tag{48}$$

Theorem 2

For an irreducible automaton in which the $\lambda_i$'s are arranged in nondecreasing order the following relation holds:
$$\lambda_{i+1}/\lambda_i \le (\bar{l})^2. \tag{49}$$

Remark

Since it has been noted that irreducible automata can do at least as well as reducible ones, the irreducibility restriction is of no consequence.

Proof: The proof of this theorem follows from Theorem 1 using arguments contained in Lemma 2 of [28]. The reader is referred there for details.

Theorem 3

For an $m$-state automaton $r$ is bounded above by $r^*$, where
$$r^* = \max\left\{\pi_0,\; \pi_1,\; \frac{(\bar{l})^{2(m-1)} - 2(\pi_0 \pi_1)^{1/2}(\bar{l})^{m-1}}{(\bar{l})^{2(m-1)} - 1}\right\}. \tag{50}$$
In the special case $\pi_0 = \pi_1 = \tfrac{1}{2}$,
$$r^* = (\bar{l})^{m-1} / \left((\bar{l})^{m-1} + 1\right). \tag{51}$$

Remark 1

If $r^* = \pi_0$ (or $\pi_1$), a degenerate situation exists in which the machine that always chooses experiment $A$ (or $B$) is optimal. In this case memory is not large enough to gather sufficient information to offset the a priori bias [28].

Remark 2

The larger the value of $\bar{l}$, the larger the resultant proportion of successes $r^*$. Thus, $\bar{l}$ is a measure of the separation between $H_0$ and $H_1$.

Example

Before proceeding with the proof of the theorem, an example will be given. Consider the coin-flipping TABP
$$\pi_0 = \tfrac{1}{2}: \quad H_0: p_A = 0.9,\; p_B = 0.8; \qquad \pi_1 = \tfrac{1}{2}: \quad H_1: p_A = 0.8,\; p_B = 0.9,$$
where $p_A$ and $p_B$ are the probabilities of the event heads (H) under the appropriate conditions. Thus, for example, if coin $A$ is flipped and $H_0$ is true, then $\Pr\{\text{heads}\} = p_A = 0.9$. Calculation shows that
$$\bar{l} = \max\{l_A(H),\, l_A(T),\, l_B(H),\, l_B(T)\} = \max\{9/8,\, 1/2,\, 8/9,\, 2\} = 2. \tag{52}$$
Thus, for an $m$-state memory the best possible limiting proportion of uses of the best coin (in this case, coin $A$) is given by
$$r^* = 2^{m-1}/(2^{m-1} + 1). \tag{53}$$
In the next section an automaton will be exhibited that achieves $r^*$ arbitrarily closely.

Example

If $p_0 = 0.5$, $p_1 = 0.501$, the situation is quite different. Here $\bar{l} \cong 1.002$ and
$$r^* \cong (1.002)^{m-1}/\left((1.002)^{m-1} + 1\right). \tag{54}$$
Thus, even $m = 500$ states yields only a proportion of successes $r^* \approx e/(e + 1)$. No 500-state machine can do better.

Proof: By Theorem 2, $\lambda_2 \le \lambda_1 (\bar{l})^2$, $\lambda_3 \le \lambda_1 (\bar{l})^4, \cdots, \lambda_m \le \lambda_1 (\bar{l})^{2(m-1)}$. Thus, for all $i \in \mathcal{S} = \{1, 2, \cdots, m\}$,
$$\mu^0_i = \lambda_i \mu^1_i \le \lambda_1 (\bar{l})^{2(m-1)} \mu^1_i. \tag{55}$$
Hence
$$r_0 = \sum_{i=1}^{m} \alpha_i \mu^0_i \le \lambda_1 (\bar{l})^{2(m-1)} \sum_{i=1}^{m} \alpha_i \mu^1_i. \tag{56}$$
But
$$\sum_{i=1}^{m} \alpha_i \mu^1_i = 1 - r_1, \tag{57}$$
so
$$r_0 \le \lambda_1 (\bar{l})^{2(m-1)} (1 - r_1). \tag{58}$$

Citations
Journal ArticleDOI
01 Jul 1974
TL;DR: Attention has been focused on the norms of behavior of learning automata, issues in the design of updating schemes, convergence of the action probabilities, and interaction of several automata.
Abstract: Stochastic automata operating in an unknown random environment have been proposed earlier as models of learning. These automata update their action probabilities in accordance with the inputs received from the environment and can improve their own performance during operation. In this context they are referred to as learning automata. A survey of the available results in the area of learning automata has been attempted in this paper. Attention has been focused on the norms of behavior of learning automata, issues in the design of updating schemes, convergence of the action probabilities, and interaction of several automata. Utilization of learning automata in parameter optimization and hypothesis testing is discussed, and potential areas of application are suggested.

688 citations

Journal ArticleDOI
TL;DR: If the key can take K values, then an optimal strategy for B secures him a probability of an undetected substitution ≪ K−1, and several encoding functions Φ(.,.) are given, some of which achieve this bound.
Abstract: We consider a new kind of coding problem, which has applications in a variety of situations. A message x is to be encoded using a key m to form an encrypted message y = Φ(x, m), which is then supplied to a user G. G knows m and so can calculate x. It is desired to choose Φ(.,.) so as to protect G against B, who knows x, y, and Φ(.,.) (but not m); B may substitute a false message y' for y. It is shown that if the key can take K values, then an optimal strategy for B secures him a probability of an undetected substitution ≪ K−1. Several encoding functions Φ(.,.) are given, some of which achieve this bound.

396 citations

Journal ArticleDOI
TL;DR: The proposed RNN-FLCS preserves the advantages of the original NN-FLC, such as the ability to find proper network structure and parameters simultaneously and dynamically and to avoid the rule-matching time of the inference engine.
Abstract: This paper proposes a reinforcement neural-network-based fuzzy logic control system (RNN-FLCS) for solving various reinforcement learning problems. The proposed RNN-FLCS is constructed by integrating two neural-network-based fuzzy logic controllers (NN-FLC's), each of which is a connectionist model with a feedforward multilayered network developed for the realization of a fuzzy logic controller. One NN-FLC performs as a fuzzy predictor, and the other as a fuzzy controller. Using the temporal difference prediction method, the fuzzy predictor can predict the external reinforcement signal and provide a more informative internal reinforcement signal to the fuzzy controller. The fuzzy controller performs a stochastic exploratory algorithm to adapt itself according to the internal reinforcement signal. During the learning process, both structure learning and parameter learning are performed simultaneously in the two NN-FLC's using the fuzzy similarity measure. The proposed RNN-FLCS can construct a fuzzy logic control and decision-making system automatically and dynamically through a reward/penalty signal or through very simple fuzzy information feedback such as "high," "too high," "low," and "too low." The proposed RNN-FLCS is best applied to the learning environment, where obtaining exact training data is expensive. It also preserves the advantages of the original NN-FLC, such as the ability to find proper network structure and parameters simultaneously and dynamically and to avoid the rule-matching time of the inference engine. Computer simulations were conducted to illustrate its performance and applicability. >

331 citations

Journal ArticleDOI
01 May 1985
TL;DR: A class of learning tasks is described that combines aspects of learning automation tasks and supervised learning pattern-classification tasks, called associative reinforcement learning tasks, and an algorithm is presented, called the associative reward-penalty, or AR-P algorithm, for which a form of optimal performance is proved.
Abstract: A class of learning tasks is described that combines aspects of learning automation tasks and supervised learning pattern-classification tasks. These tasks are called associative reinforcement learning tasks. An algorithm is presented, called the associative reward-penalty, or AR-P algorithm for which a form of optimal performance is proved. This algorithm simultaneously generalizes a class of stochastic learning automata and a class of supervised learning pattern-classification methods related to the Robbins-Monro stochastic approximation procedure. The relevance of this hybrid algorithm is discussed with respect to the collective behaviour of learning automata and the behaviour of networks of pattern-classifying adaptive elements. Simulation results are presented that illustrate the associative reinforcement learning task and the performance of the AR-P algorithm as compared with that of several existing algorithms.

319 citations

Journal ArticleDOI
TL;DR: It is shown that, under certain conditions, the adaptive controller's actions eventually become optimal for the particular control task with which it is faced, in the sense that they maximize the expected reward obtained in the future.
Abstract: This paper describes an adaptive controller for discrete-time stochastic environments. The controller receives the environment's current state and a reward signal which indicates the desirability of that state. In response, it selects an appropriate control action and notes its effect. The cycle repeats indefinitely. The control environments to be tackled include the well-known n -armed bandit problem, and the adaptive controller comprises an ensemble of n -armed bandit controllers, suitably interconnected. The design of these constituent elements is not discussed. It is shown that, under certain conditions, the controller's actions eventually become optimal for the particular control task with which it is faced, in the sense that they maximize the expected reward obtained in the future.

199 citations


Cites background from "The two-armed-bandit problem with t..."

  • ...…of exclusively selecting actions k* for which gk* = Max {gk). k This is the m-armed bandit problem, an obvious generalization of the familier two-armed bandit problem which has been discussed extensively in the literature (Cover and Hellman, 1970; Shapiro and Narendra, 1969; Witten, 1973, 1974)....


References
Journal ArticleDOI
TL;DR: The authors proposed a theory of sequential design of experiments, in which the size and composition of the samples are not fixed in advance but are functions of the observations themselves, which is a major advance.
Abstract: Until recently, statistical theory has been restricted to the design and analysis of sampling experiments in which the size and composition of the samples are completely determined before the experimentation begins. The reasons for this are partly historical, dating back to the time when the statistician was consulted, if at all, only after the experiment was over, and partly intrinsic in the mathematical difficulty of working with anything but a fixed number of independent random variables. A major advance now appears to be in the making with the creation of a theory of the sequential design of experiments, in which the size and composition of the samples are not fixed in advance but are functions of the observations themselves.

2,034 citations

Journal ArticleDOI
TL;DR: In this paper, the design and performance of optimal finite-memory systems for the two-hypothesis testing problem with probability of error loss criterion was studied. But the problem was not studied in this paper.
Abstract: This paper develops the theory of the design and performance of optimal finite-memory systems for the two-hypothesis testing problem. Let $X_1, X_2, \cdots$ be a sequence of independent identically distributed random variables drawn according to a probability measure $\mathscr{P}$. Consider the standard two-hypothesis testing problem with probability of error loss criterion in which $\mathscr{P} = \mathscr{P}_0$ with probability $\pi_0$; and $\mathscr{P} = \mathscr{P}_1$ with probability $\pi_1$. Let the data be summarized after each new observation by an $m$-valued statistic $T\in\{ 1, 2, \cdots, m\}$ which is updated according to the rule $T_n = f(T_{n-1}, X_n),$ where $f$ is a (perhaps randomized) time-invariant function. Let $d:\{ 1, 2,\cdots, m\} \rightarrow\{ H_0, H_1\}$ be a fixed decision function taking action $d(T_n)$ at time $n$, and let $P_e(f,d)$ be the long-run probability of error of the algorithm $(f, d)$ as the number of trials $n\rightarrow\infty$. Define $P^\ast = \inf_{(f,d)}P_e(f, d)$. Let the a.e. maximum and minimum likelihood ratios be defined by $\bar{l} = \sup(\mathscr{P}_0(A)/\mathscr{P}_1(A))$ and $\underline{l} = \inf(\mathscr{P}_0(A)/\mathscr{P}_1(A))$ where the supremum and infimum are taken over all measurable sets $A$ for which $\mathscr{P}_0(A) + \mathscr{P}_1(A) > 0$. Define $\gamma = \bar{l}/\underline{l}$. It will be shown that $P^\ast = \lbrack 2(\pi_0\pi_1\gamma^{m-1})^{\frac{1}{2}} - 1\rbrack/(\gamma^{m-1} - 1)$, under the nondegeneracy condition $\gamma^{m-1} \geqq \max\{\pi_0/\pi_1, \pi_1/\pi_0\}$; and a simple family of $\varepsilon$-optimal $(f, d)$'s will be exhibited. In general, an optimal $(f, d)$ does not exist; and $\varepsilon$-optimal algorithms involve randomization in $f$.

167 citations

Journal ArticleDOI
TL;DR: In this paper, it was shown that a four-valued statistic is sufficient to solve the two-hypothesis testing problem with a limiting probability of error zero under either hypothesis.
Abstract: Let $X_1, X_2, \cdots$ be a sequence of independent identically distributed random variables drawn according to a probability measure $\mathscr{P}$. The two-hypothesis testing problem $H_0: \mathscr{P} = \mathscr{P}_0 \operatorname{vs.} H_1: \mathscr{P} = \mathscr{P}_1$ is investigated under the constraint that the data must be summarized after each observation by an $m$-valued statistic $T_n\varepsilon \{1, 2, \cdots, m\}$, where $T_n$ is updated according to the rule $T_{n+1} = f_n(T_n, X_{n+1})$. An algorithm with a four-valued statistic is described which achieves a limiting probability of error zero under either hypothesis. It is also demonstrated that a four-valued statistic is sufficient to resolve composite hypothesis testing problems which may be reduced to the form $H_0:p > p_0 \operatorname{vs.} H_1:p < p_0$ where $X_1, X_2, \cdots$ is a Bernoulli sequence with bias $p$.

135 citations

Frequently Asked Questions (1)
Q1. What have the authors contributed in "The two-armed-bandit problem with time-invariant finite memory"?

Abstract—This paper solves the classical two-armed-bandit problem under the finite-memory constraint described below.