
Learning Automata - A Survey

Abstract
Stochastic automata operating in an unknown random environment have been proposed earlier as models of learning. These automata update their action probabilities in accordance with the inputs received from the environment and can improve their own performance during operation. In this context they are referred to as learning automata. A survey of the available results in the area of learning automata has been attempted in this paper. Attention has been focused on the norms of behavior of learning automata, issues in the design of updating schemes, convergence of the action probabilities, and interaction of several automata. Utilization of learning automata in parameter optimization and hypothesis testing is discussed, and potential areas of application are suggested.


IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, VOL. SMC-4, NO. 4, JULY 1974

KUMPATI S. NARENDRA, SENIOR MEMBER, IEEE, AND M. A. L. THATHACHAR

Manuscript received January 15, 1974; revised February 13, 1974. This work was supported by the National Science Foundation under Grant GK-20580. K. S. Narendra is with the Becton Center, Yale University, New Haven, Conn. M. A. L. Thathachar is with the Becton Center, Yale University, New Haven, Conn., on leave from the Indian Institute of Science, Bangalore, India.
I. INTRODUCTION

IN CLASSICAL deterministic control theory, the control of a process is always preceded by complete knowledge of the process; the mathematical description of the process is assumed to be known, and the inputs to the process are deterministic functions of time. Later developments in stochastic control theory took into account uncertainties that might be present in the process; stochastic control was effected by assuming that the probabilistic characteristics of the uncertainties are known. Frequently, the uncertainties are of a higher order, and even the probabilistic characteristics such as the distribution functions may not be completely known. It is then necessary to make observations on the process as it is in operation and gain further knowledge of the process. In other words, a distinctive feature of such problems is that there is little a priori information, and additional information is to be acquired on line. One viewpoint is to regard these as problems in learning.

Learning is defined as any relatively permanent change in behavior resulting from past experience, and a learning system is characterized by its ability to improve its behavior with time, in some sense tending towards an ultimate goal. In mathematical psychology, models of learning systems [GB1], [GL1] have been developed to explain behavior patterns among living organisms. These models in turn have lately been adapted to synthesize engineering systems, which can be considered to show learning behavior. Tsypkin [GT1] has recently argued that seemingly diverse problems in pattern recognition, identification, and learning can be treated in a unified manner as problems in learning using probabilistic iterative methods.

Viewed in a purely mathematical context, the goal of a learning system is the optimization of a functional not known explicitly, e.g., the mathematical expectation of a random functional with a probability distribution function not known in advance. An approach that has been used in the past is to reduce the problem to the determination of an optimal set of parameters and then apply stochastic hillclimbing techniques [GT1]. An alternative approach gaining attention recently is to regard the problem as one of finding an optimal action out of a set of allowable actions and to achieve this using stochastic automata [LN2]. The following example of the learning process of a student with a probabilistic teacher illustrates the automaton approach.

Consider a situation in which a question is posed to the student and a set of alternative answers is provided, following which the teacher responds in a random manner indicating whether the selected answer is right or wrong. The teacher is, however, probabilistic; there is a nonzero probability of either response for each of the answers selected by the student. The saving feature of the situation is that the teacher's negative responses have the least probability for the correct answer. Under these circumstances the interest is in finding the manner in which the student should plan a choice of a sequence of alternatives and process the information obtained from the teacher so that he learns the correct answer.

In stochastic automata models the stochastic automaton corresponds to the student, and the random environment in which it operates represents the probabilistic teacher. The actions (or states) of the stochastic automaton are the various alternative answers that are provided. The responses of the environment for a particular action of the stochastic automaton are the teacher's probabilistic responses. The problem is to obtain the optimal action that corresponds to the correct answer.

The stochastic automaton attempts a solution of this problem as follows. To start with, no information as to which one is the optimal action is assumed, and equal probabilities are attached to all the actions. One action is selected at random, the response of the environment to this action is observed, and based on this response the action probabilities are changed. Now a new action is selected according to the updated action probabilities, and the procedure is repeated. A stochastic automaton acting in this manner to improve its performance is referred to as a learning automaton in this paper.
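This operating cycle is easy to state procedurally. The following Python fragment is only an illustrative sketch and is not part of the original paper: it assumes a stationary P-model environment specified by penalty probabilities c_i, and it leaves the reinforcement scheme as a user-supplied update function of the kind discussed in Section IV.

    import random

    def run_learning_automaton(c, update, n_steps=1000, seed=0):
        # c: penalty probabilities c_i of a stationary P-model environment (known to the
        #    simulation only, not to the automaton)
        # update: any reinforcement scheme mapping (p, chosen action i, response x) to p(n+1)
        rng = random.Random(seed)
        r = len(c)
        p = [1.0 / r] * r                            # equal probabilities attached to all actions
        for _ in range(n_steps):
            i = rng.choices(range(r), weights=p)[0]  # select an action according to p(n)
            x = 1 if rng.random() < c[i] else 0      # response: penalty (1) with probability c_i
            p = update(p, i, x)                      # change the action probabilities
        return p

Any of the reinforcement schemes discussed later can be passed as the update argument; the loop itself does not change.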
Stochastic hillclimbing methods (such as stochastic approximation) and stochastic automata methods represent two distinct approaches to the learning problem. Though both approaches involve iterative procedures, updating at every stage is done in the parameter space in the first method and in the probability space in the second. It is, of course, possible that they lead to equivalent descriptions in some examples. The automata methods have two distinct advantages over stochastic hillclimbing methods in that the action space need not be a metric space (i.e., no concept of neighborhood is needed), and since at every stage any element of the action set can be chosen, a global rather than a local optimum can be obtained.

Experimental simulation of automata methods carried out during the last few years has indicated the feasibility of the automaton approach in the solution of interesting examples in parameter optimization, hypothesis testing, and game theory. The automaton approach also appears appropriate in the study of hierarchical systems and in tackling certain nonstationary optimization problems. Furthermore, several other avenues to learning can be interpreted as iterative procedures in the probability space, and the learning automaton provides a natural mathematical model for such situations and serves as a unifying theme among diverse techniques [GM3].

Previous studies on learning automata have led to a certain understanding of the basic issues involved and have provided guidelines for the design of algorithms. An appreciation of the fundamental problems in the field has also taken place. It appears that research in this area has reached a stage where the power and applicability of the approach needs to be made widely known in order that it can be fully exploited in solving problems in relevant areas. In this paper we review recent results in the area of learning automata, reexamine some of the theoretical questions that arise, and suggest potential areas where the available results may find application.

Historically, the first learning automata models were developed in mathematical psychology. Early work in this area has been well documented in the book by Bush and Mosteller [GB1]. More recent results can be found in Atkinson et al. [GA1]. A rigorous mathematical framework has been developed for the study of learning problems by Iosifescu and Theodorescu [GI1] as well as by Norman [GN1].

Tsetlin [DT1] introduced the concept of using deterministic automata operating in random environments as models of learning. A great deal of work in the Soviet Union and elsewhere has followed the trend set by his source paper. No attempt, however, has been made in this paper to review all these studies.

Varshavskii and Vorontsova [LV1] observed that the use of stochastic automata with updating of action probabilities could reduce the number of states in comparison with deterministic automata. This idea has proved to be very fruitful and has been exploited in a series of investigations, the results of which form the subject of this paper. Fu and his associates [LF1]-[LF6] were among the first to introduce stochastic automata into the control literature. A variety of applications to parameter optimization, pattern recognition, and game theory were considered by this school. McLaren [LM1] explored the properties of linear updating schemes and suggested the concept of a "growing" automaton [LM2]. Chandrasekaran and Shen [LC1]-[LC3] made useful studies of nonlinear updating schemes, nonstationary environments, and games of automata. Tsypkin and Poznyak [LT1] attempted to unify the updating schemes by focusing attention on an inverse optimization problem. The present authors and their associates [LS1], [LS2], [LV3]-[LV10], [LN1], [LN2], [LL1]-[LL5] have studied the theory and applications of learning automata and also carried out simulation studies in the area.

The survey papers on learning control systems by Sklansky [GS1] and Fu [GF1] have devoted part of their attention to learning automata. The topic also finds a place in some books and collections of articles on learning systems [GM2], [GF2], [LF6]. The literature on the two-armed bandit problem is relevant in the present context but is not referred to in detail as the approach taken is rather different [LC5], [LW2]. References to other contributions will be made at appropriate points in the body of the paper.

Organization

This paper has been divided into nine sections. Following the introduction, the basic concepts and definitions of stochastic automata and random environments are given in Section II. The possible ways in which the behavior of learning automata can be judged are defined in Section III. Section IV deals with reinforcement schemes (or updating algorithms) and their properties and includes a discussion of convergence. Section V describes collective behavior of automata in terms of games between automata and multilevel structures of automata. Nonstationary environments are briefly considered in Section VI. Possible uses of learning automata in optimization and hypothesis testing form the subject matter of Section VII. A short description of the fields of application of learning automata is given in Section VIII. A comprehensive bibliography is provided in the Reference section and is divided into three subsections dealing with 1) general references in the literature pertinent to the topic considered, 2) some important papers on deterministic automata that provided the impetus for stochastic automata models, and 3) publications wholly devoted to learning automata.

Fig. 1. Stochastic automaton.
Fig. 2. Environment.
Fig. 3. Learning automaton.

II. STOCHASTIC AUTOMATA AND RANDOM ENVIRONMENTS

Stochastic Automaton

A stochastic automaton is a sextuple {x, Φ, α, p, A, G} where x is the input set, Φ = {φ1, φ2, ..., φs} is the set of internal states, α = {α1, α2, ..., αr} with r ≤ s is the output or action set, p is the state probability vector governing the choice of the state at each stage (i.e., at each stage n, p(n) = (p_1(n), p_2(n), ..., p_s(n))), A is an algorithm (also called an updating scheme or reinforcement scheme) which generates p(n + 1) from p(n), and G: Φ → α is the output function. G could be a stochastic function, but there is no loss of generality in assuming it to be deterministic [GP1]. In this paper G is taken to be deterministic and one-to-one (i.e., r = s, and states and actions are regarded as synonymous) and s < ∞. Fig. 1 shows a stochastic automaton with its inputs and actions.

It may be noted that the states of a stochastic automaton correspond to the states of a discrete-state discrete-parameter Markov process. Occasionally, it may be convenient to regard the p_i(n) themselves as states of a continuous-state Markov process.
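Written out as a data structure, the sextuple takes the following shape. This is an informal sketch rather than anything from the paper; the field names are hypothetical, the input set is the P-model set {0, 1}, and G is the identity map in line with the one-to-one assumption above.

    from dataclasses import dataclass
    from typing import Callable, List, Sequence

    @dataclass
    class StochasticAutomaton:
        # Sketch of the sextuple {x, phi, alpha, p, A, G} under the assumptions above.
        x: Sequence[int]                                    # input set, here {0, 1} (P-model)
        phi: Sequence[str]                                  # internal states phi_1, ..., phi_s
        alpha: Sequence[str]                                # actions alpha_1, ..., alpha_r (r = s)
        p: List[float]                                      # state probability vector p(n)
        A: Callable[[List[float], int, int], List[float]]   # reinforcement scheme: p(n+1) from p(n), action, input
        G: Callable[[int], int] = lambda i: i               # output function; identity since states = actions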
Environment

Only an environment (also called a medium) with random response characteristics is of interest in the problems considered. The environment (shown in Fig. 2) has inputs α(n) ∈ {α1, ..., αr} and outputs (responses) belonging to a set x. Frequently the responses are binary {0, 1}, with zero being called the nonpenalty response and one the penalty response. The probability of emitting a particular output symbol (say, 1) depends on the input and is denoted by c_i (i = 1, ..., r). The c_i are called the penalty probabilities. If the c_i do not depend on n, the environment is said to be stationary; otherwise it is nonstationary. It is assumed that the c_i are unknown initially; the problem would be trivial if they are known a priori.
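A stationary environment of the kind just described is simple to simulate. The class below is an illustrative sketch only (its name and interface are not from the paper); it implements a P-model with fixed penalty probabilities.

    import random

    class StationaryEnvironment:
        # P-model environment: responds 1 (penalty) with probability c_i, else 0 (nonpenalty).
        def __init__(self, c, seed=None):
            self.c = list(c)                 # fixed penalty probabilities c_1, ..., c_r, hence "stationary"
            self.rng = random.Random(seed)

        def respond(self, i):
            return 1 if self.rng.random() < self.c[i] else 0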
Learning Automaton (Stochastic Automaton in a Random Environment)

Fig. 3 represents a feedback connection of a stochastic automaton and an environment. The actions of the automaton in this case form the inputs to the environment. The responses of the environment in turn are the inputs to the automaton and influence the updating of the action probabilities. As these responses are random, the action probability vector p(n) is also random.

In psychological learning experiments the organism under study is said to learn when it improves the probability of correct response as a result of interaction with its environment. Since the stochastic automaton being considered in this paper behaves in a similar fashion, it appears proper to refer to it as a learning automaton. Thus a learning automaton is a stochastic automaton that operates in a random environment and updates its action probabilities in accordance with the inputs received from the environment so as to improve its performance in some specified sense.

In the context of psychology, a learning automaton may be regarded as a model of the learning behavior of the organism under study, with the environment as controlled by the experimenter. In an engineering application such as the control of a process, the controller corresponds to the learning automaton, while the rest of the system with all its uncertainties constitutes the environment.

It is useful to note the distinction between several models based on the nature of the input to the learning automaton. If the input set is binary, e.g., {0, 1}, the model is known as a P-model. On the other hand it is called a Q-model if the input set is a finite collection of distinct symbols as, for example, obtained by quantization, and an S-model if the input set is an interval [0, 1]. Each of these models appears appropriate in certain situations.

A remark on the terminology is relevant here. Following Tsetlin [DT1], deterministic automata operating in random environments have been proposed as models of learning behavior. Thus they are also contenders for the term "learning automata." However, in the view of the present authors the stochastic automaton with updating of action probabilities is a general model from which the deterministic automaton can be obtained as a special case having a 0-1 state transition matrix, and it appears reasonable to apply the term learning automaton to the more general model. In cases where it is felt necessary to emphasize the learning properties of a deterministic automaton one can use a qualifying term such as "deterministic learning automaton." It may also be noted that the learning automata of this paper have been referred to as "variable-structure stochastic automata" in earlier literature [LV1].

III. NORMS OF BEHAVIOR OF LEARNING AUTOMATA

The basic operation carried out by a learning automaton is the updating of the action probabilities on the basis of the responses of the environment. A natural question here is to examine whether the updating is done in such a manner as to result in a performance compatible with intuitive notions of learning.

One quantity useful in judging the behavior of a learning automaton is the average penalty received by the automaton. At a certain stage n, if the action α_i is selected with probability p_i(n), the average penalty conditioned on p(n) is

    M(n) = E{x(n) | p(n)} = Σ_{i=1}^{r} p_i(n) c_i.                    (1)

If no a priori information is available, and the actions are chosen with equal probability (i.e., at random), the value of the average penalty is denoted by M_0 and is given by

    M_0 = (c_1 + c_2 + ... + c_r) / r.                                 (2)

The learning automaton may be said to do better than pure chance if the average penalty is made less than M_0, at least asymptotically. Such a behavior is called expediency and is defined as follows [DT1], [LC1].

Definition 1: A learning automaton is called expedient¹ if

    lim_{n→∞} E[M(n)] < M_0.                                           (3)

¹ Since p_i(n), lim_{n→∞} p_i(n), and consequently M(n) are, in general, random variables, the expectation operator is needed in the definition to represent the average penalty.

When a learning automaton is expedient it only does better than one which chooses actions in a purely random manner. It would be desirable if the average penalty could be minimized by a proper selection of the actions. In such a case the learning automaton is called optimal. From (1) it can be seen that the minimum value of M(n) is min_i {c_i}.

Definition 2: A learning automaton is called optimal if

    lim_{n→∞} E[M(n)] = c_l                                            (4)

where c_l = min_i {c_i}.

Optimality implies that asymptotically the action associated with the minimum penalty probability is chosen with probability one. While optimality appears a very desirable property, certain conditions in a given situation may preclude its achievement. In such a case one would aim at a suboptimal performance. One such property is given by ε-optimality [LV4].

Definition 3: A learning automaton is called ε-optimal if

    lim_{n→∞} E[M(n)] < c_l + ε                                        (5)

can be obtained for any arbitrary ε > 0 by a suitable choice of the parameters of the reinforcement scheme. ε-optimality implies that the performance of the automaton can be made as close to the optimal as desired.

It is possible that the preceding properties hold only when the penalty probabilities c_i satisfy certain restrictions, for example, that they should lie in certain intervals. In such cases the properties are said to be conditional.

In practice, the penalty probabilities are often completely unknown, and it would be necessary to have desirable performance whatever be the values of c_i, that is, in all stationary random media. The performance would also be superior if the decrease of E[M(n)] is monotonic. Both these requirements are considered in the following definition [LL3].

Definition 4: A learning automaton is said to be absolutely expedient if

    E[M(n + 1) | p(n)] < M(n)                                          (6)

for all n, all p_k(n) ∈ (0, 1) (k = 1, ..., r), and all possible values² of c_i (i = 1, ..., r).

² It is usually assumed that the set {c_i} has unique maximum and minimum elements.

Absolute expediency implies that M(n) is a supermartingale and that E[M(n)] is strictly monotonically decreasing with n in all stationary random environments. If M(n) < M_0 initially, absolute expediency implies expediency. It is thus a stronger requirement on the learning automaton. Furthermore, it can be shown that absolute expediency implies ε-optimality in all stationary random environments [LL4]. It is not at present known whether the reverse implication is true. However, every learning automaton presently known to be ε-optimal in all stationary media is also absolutely expedient. Hence ε-optimality and absolute expediency will be treated as synonymous in the sequel.

The definitions in this section have been given with reference to a P-model but can be applied with minor changes to Q- and S-models [LV3], [LV8], [LC1].

IV. REINFORCEMENT SCHEMES

Having decided on the norms of behavior of learning automata, we can now focus attention on the means of achieving the desired performance. It is evident from the description of the learning automaton that the crucial factor that affects the performance is the reinforcement scheme for the updating of the action probabilities. It thus becomes necessary to relate the structure of a reinforcement scheme and the performance of the automaton using the scheme.

A reinforcement scheme in its general form can be represented by

    p(n + 1) = T[p(n), α(n), x(n)]                                     (7)

where T is an operator, and α(n) and x(n) represent the action of the automaton and the input to the automaton at instant n, respectively. One can classify the reinforcement schemes either on the basis of the property exhibited by a learning automaton using the scheme (as, for example, the automaton being expedient or optimal) or on the basis of the nature of the functions appearing in the scheme (as, for example, linear, nonlinear, or hybrid). If p(n + 1) is a linear function of the components of p(n), the reinforcement scheme is said to be linear; otherwise it is nonlinear. Sometimes it is advantageous to update p(n) according to different schemes depending on the intervals in which the value of p(n) lies.

In such a case the combined reinforcement scheme is called a hybrid scheme.

The basic idea behind any reinforcement scheme is rather simple. If the learning automaton selects an action α_i at instant n and a nonpenalty input occurs, the action probability p_i(n) is increased, and all the other components of p(n) are decreased. For a penalty input, p_i(n) is decreased, and the other components are increased. These changes in p_i(n) are known as reward and penalty, respectively. Occasionally the action probabilities may be retained at the previous values, in which case the status quo is known as "inaction."

In general, when the action at n is α_i,

    p_j(n + 1) = p_j(n) - f_j(p(n)),   for x(n) = 0
    p_j(n + 1) = p_j(n) + g_j(p(n)),   for x(n) = 1        (j ≠ i)     (8a)

The algorithm for p_i(n + 1) is to be fixed so that the p_k(n + 1) (k = 1, ..., r) add to unity. Thus

    p_i(n + 1) = p_i(n) + Σ_{j≠i} f_j(p(n)),   for x(n) = 0
    p_i(n + 1) = p_i(n) - Σ_{j≠i} g_j(p(n)),   for x(n) = 1            (8b)

where the nonnegative³ continuous functions f_j(·) and g_j(·) are such that p_k(n + 1) ∈ (0, 1) for all k = 1, ..., r whenever every p_k(n) ∈ (0, 1). The latter requirement is necessary to prevent the automaton from getting trapped prematurely in an absorbing barrier.

³ The nonnegativity condition need be imposed only if the "reward" character of f_j(·) and the "penalty" character of g_j(·) are to be preserved.
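In procedural form, the general scheme (8a)-(8b) is a single update step parameterized by the families f_j and g_j. The fragment below is an illustrative sketch under the conventions used here (f and g supplied as callables of the whole vector p and an index j); it is not taken from the paper.

    def reinforce(p, i, x, f, g):
        # One step of the general scheme (8): action alpha_i was chosen, response x observed.
        # f(p, j) and g(p, j) play the roles of f_j(p) and g_j(p).
        r = len(p)
        q = list(p)
        if x == 0:                                                  # nonpenalty response
            for j in range(r):
                if j != i:
                    q[j] = p[j] - f(p, j)                           # (8a) with x(n) = 0
            q[i] = p[i] + sum(f(p, j) for j in range(r) if j != i)  # (8b) keeps the sum at unity
        else:                                                       # penalty response
            for j in range(r):
                if j != i:
                    q[j] = p[j] + g(p, j)                           # (8a) with x(n) = 1
            q[i] = p[i] - sum(g(p, j) for j in range(r) if j != i)  # (8b)
        return q

Choosing f and g as in (9) or (10) below recovers the linear schemes; nonlinear choices such as (11) and (12) fit the same mold.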
Varshavskii and Vorontsova [LV1] were the first to suggest such reinforcement schemes for two-state automata and thus set the trend for later developments. They considered two schemes, one linear and the other nonlinear, in terms of updating of the state-transition probabilities. Fu, McLaren, and McMurtry [LF1], [LF2] simplified the procedure by considering updating of the total action choice probabilities as dealt with here.

Linear Schemes

The earliest known scheme can be obtained by setting

    f_j(p) = a p_j,    g_j(p) = -b p_j + b/(r - 1),    for all j = 1, ..., r     (9)

where 0 < a, b < 1.⁴ This is known as a linear reward-penalty (denoted LR-P) scheme. Early studies of the scheme, principally dealing with the two-state case, were made by Bush and Mosteller [GB1] and Varshavskii and Vorontsova [LV1]. McLaren [LM1] made a detailed investigation of the multistate case, and this work was continued by Chandrasekaran and Shen [LC1] as well as by Viswanathan and Narendra [LV9]. Norman [LN4] established several results pertaining to the ergodic character of the scheme.

⁴ g_j(·) for this scheme is not nonnegative for all values of p_j.

It is known that an automaton using the LR-P scheme is expedient in all stationary random environments. Expressions for the rate of learning and the variance of the action probabilities are also available.
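Substituting (9) into the general step gives the familiar multiplicative form of the LR-P update. The following fragment is an illustrative sketch, with hypothetical default values for a and b:

    def lr_p_update(p, i, x, a=0.1, b=0.1):
        # Linear reward-penalty (LR-P) step: action alpha_i taken, response x in {0, 1}.
        r = len(p)
        q = list(p)
        if x == 0:                                    # reward
            for j in range(r):
                q[j] = (1 - a) * p[j]                 # p_j - a*p_j for j != i
            q[i] = p[i] + a * (1 - p[i])              # keeps the components summing to one
        else:                                         # penalty
            for j in range(r):
                q[j] = (1 - b) * p[j] + b / (r - 1)   # p_j + g_j(p), with g_j(p) = -b*p_j + b/(r-1)
            q[i] = (1 - b) * p[i]                     # p_i minus the sum of g_j over j != i
        return q

Passed as the update argument of the interaction-loop sketch in Section I, such a rule exhibits the expedient behavior noted above.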
By setting

    f_j(p) = a p_j,    g_j(p) = 0,    for all j                                  (10)

we get the linear reward-inaction (LR-I) scheme. This scheme was considered first in mathematical psychology [GB1] but was later independently conceived and introduced into the engineering literature by Shapiro and Narendra [LS1], [LS2]. The characteristic of the scheme is that it ignores penalty inputs from the environment so that the action probabilities remain unchanged under these inputs. Because of this property a learning automaton using the scheme has been called a "benevolent automaton" by Tsypkin and Poznyak [LT1].

The LR-I scheme was originally reported to be optimal in all stationary random environments, but it is now known that it is only ε-optimal [LV4], [LL4]. It is significant, however, that replacing the penalty by inaction in the LR-P scheme totally changes the performance from expediency to ε-optimality. Other possible combinations such as the linear reward-reward, penalty-penalty, and inaction-penalty schemes have been considered in [LV9], but these are, in general, inferior to the LR-I and LR-P schemes. The effect of varying the parameters a and b with n has also been studied in [LV9].
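Because (10) sets every g_j to zero, the LR-I step is simply the reward branch of the preceding rule with the penalty branch replaced by inaction. Again a sketch, not the paper's own code:

    def lr_i_update(p, i, x, a=0.1):
        # Linear reward-inaction (LR-I) step: penalties leave p(n) unchanged.
        if x == 1:                          # penalty: "inaction", keep the previous values
            return list(p)
        q = [(1 - a) * pj for pj in p]      # reward: shrink every component by (1 - a) ...
        q[i] = p[i] + a * (1 - p[i])        # ... and move the freed mass onto action alpha_i
        return q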
Nonlinear Schemes

As mentioned earlier, the first nonlinear scheme for a two-state automaton was proposed by Varshavskii and Vorontsova [LV1] in terms of transition probabilities. The total-probability version of the scheme corresponds to

    g_j(p) = f_j(p) = a p_j (1 - p_j),    j = 1, 2.                              (11)

This scheme is ε-optimal in a restricted random environment satisfying either c_1 < 1/2 < c_2 or c_2 < 1/2 < c_1. Chandrasekaran and Shen [LC1] have studied nonlinear schemes with power-law nonlinearities. Several nonlinear schemes, which are ε-optimal in all stationary random environments, have been suggested by Viswanathan and Narendra [LV9] as well as by Lakshmivarahan and Thathachar [LL1], [LL3]. A simple scheme of this type for the two-state case is

    f_j(p) = a p_j^2 (1 - p_j),    g_j(p) = b p_j (1 - p_j),    j = 1, 2         (12)

where 0 < a < 4 and 0 < b < 1.
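For the two-action case, (12) can be plugged into the same update pattern. The fragment below is an illustrative sketch with hypothetical parameter values chosen inside the stated ranges:

    def nonlinear_update(p, i, x, a=1.0, b=0.5):
        # Two-action nonlinear scheme (12): f_j(p) = a*p_j^2*(1-p_j), g_j(p) = b*p_j*(1-p_j).
        f = lambda pj: a * pj * pj * (1 - pj)
        g = lambda pj: b * pj * (1 - pj)
        j = 1 - i                                   # the other action (indices 0 and 1)
        q = [0.0, 0.0]
        if x == 0:                                  # reward
            q[j] = p[j] - f(p[j])
            q[i] = p[i] + f(p[j])
        else:                                       # penalty
            q[j] = p[j] + g(p[j])
            q[i] = p[i] - g(p[j])
        return q

Equivalently, the general reinforce step given after (8) can be called with f(p, j) = a*p[j]**2*(1 - p[j]) and g(p, j) = b*p[j]*(1 - p[j]).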
A combination of linear and nonlinear terms often appears advantageous [LL3]. Extensive simulation results on a variety of schemes utilizing several possible combinations of reward, penalty, and inaction are available in [LV10]. A result that unifies most of the preceding reinforcement schemes has been reported in [LL3] and is given by the following.
