
Learning Automata - A Survey

Abstract
Stochastic automata operating in an unknown random environment have been proposed earlier as models of learning. These automata update their action probabilities in accordance with the inputs received from the environment and can improve their own performance during operation. In this context they are referred to as learning automata. A survey of the available results in the area of learning automata has been attempted in this paper. Attention has been focused on the norms of behavior of learning automata, issues in the design of updating schemes, convergence of the action probabilities, and interaction of several automata. Utilization of learning automata in parameter optimization and hypothesis testing is discussed, and potential areas of application are suggested.


IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, VOL. SMC-4, NO. 4, JULY 1974

KUMPATI S. NARENDRA, SENIOR MEMBER, IEEE, AND M. A. L. THATHACHAR

Manuscript received January 15, 1974; revised February 13, 1974. This work was supported by the National Science Foundation under Grant GK-20580. K. S. Narendra is with the Becton Center, Yale University, New Haven, Conn. M. A. L. Thathachar is with the Becton Center, Yale University, New Haven, Conn., on leave from the Indian Institute of Science, Bangalore, India.
I. INTRODUCTION

IN CLASSICAL deterministic control theory, the control of a process is always preceded by complete knowledge of the process; the mathematical description of the process is assumed to be known, and the inputs to the process are deterministic functions of time. Later developments in stochastic control theory took into account uncertainties that might be present in the process; stochastic control was effected by assuming that the probabilistic characteristics of the uncertainties are known. Frequently, the uncertainties are of a higher order, and even the probabilistic characteristics such as the distribution functions may not be completely known. It is then necessary to make observations on the process as it is in operation and gain further knowledge of the process. In other words, a distinctive feature of such problems is that there is little a priori information, and additional information is to be acquired on line. One viewpoint is to regard these as problems in learning.

Learning is defined as any relatively permanent change in behavior resulting from past experience, and a learning system is characterized by its ability to improve its behavior with time, in some sense tending towards an ultimate goal. In mathematical psychology, models of learning systems [GB1], [GL1] have been developed to explain behavior patterns among living organisms. These models in turn have lately been adapted to synthesize engineering systems, which can be considered to show learning behavior. Tsypkin [GT1] has recently argued that seemingly diverse problems in pattern recognition, identification, and learning can be treated in a unified manner as problems in learning using probabilistic iterative methods.

Viewed in a purely mathematical context, the goal of a learning system is the optimization of a functional not known explicitly, e.g., the mathematical expectation of a random functional with a probability distribution function not known in advance. An approach that has been used in the past is to reduce the problem to the determination of an optimal set of parameters and then apply stochastic hillclimbing techniques [GT1]. An alternative approach gaining attention recently is to regard the problem as one of finding an optimal action out of a set of allowable actions and to achieve this using stochastic automata [LN2]. The following example of the learning process of a student with a probabilistic teacher illustrates the automaton approach.

Consider a situation in which a question is posed to the student and a set of alternative answers is provided, following which the teacher responds in a random manner indicating whether the selected answer is right or wrong. The teacher is, however, probabilistic; there is a nonzero probability of either response for each of the answers selected by the student. The saving feature of the situation is that the teacher's negative responses have the least probability for the correct answer. Under these circumstances the interest is in finding the manner in which the student should plan a choice of a sequence of alternatives and process the information obtained from the teacher so that he learns the correct answer.

In stochastic automata models the stochastic automaton corresponds to the student, and the random environment in which it operates represents the probabilistic teacher. The actions (or states) of the stochastic automaton are the various alternative answers that are provided. The responses of the environment for a particular action of the stochastic automaton are the teacher's probabilistic responses. The problem is to obtain the optimal action that corresponds to the correct answer.

The stochastic automaton attempts a solution of this problem as follows. To start with, no information as to which one is the optimal action is assumed, and equal probabilities are attached to all the actions. One action is selected at random, the response of the environment to this action is observed, and based on this response the action probabilities are changed. Now a new action is selected according to the updated action probabilities, and the procedure is repeated. A stochastic automaton acting in this manner to improve its performance is referred to as a learning automaton in this paper.
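This operating cycle is easy to state procedurally. The following Python fragment is only an illustrative sketch and is not part of the original paper: it assumes a stationary P-model environment specified by penalty probabilities c_i, and it leaves the reinforcement scheme as a user-supplied update function of the kind discussed in Section IV.

    import random

    def run_learning_automaton(c, update, n_steps=1000, seed=0):
        # c: penalty probabilities c_i of a stationary P-model environment (known to the
        #    simulation only, not to the automaton)
        # update: any reinforcement scheme mapping (p, chosen action i, response x) to p(n+1)
        rng = random.Random(seed)
        r = len(c)
        p = [1.0 / r] * r                            # equal probabilities attached to all actions
        for _ in range(n_steps):
            i = rng.choices(range(r), weights=p)[0]  # select an action according to p(n)
            x = 1 if rng.random() < c[i] else 0      # response: penalty (1) with probability c_i
            p = update(p, i, x)                      # change the action probabilities
        return p

Any of the reinforcement schemes discussed later can be passed as the update argument; the loop itself does not change.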
Stochastic hillclimbing methods (such as stochastic approximation) and stochastic automata methods represent two distinct approaches to the learning problem. Though both approaches involve iterative procedures, updating at every stage is done in the parameter space in the first method and in the probability space in the second. It is, of course, possible that they lead to equivalent descriptions in some examples. The automata methods have two distinct advantages over stochastic hillclimbing methods in that the action space need not be a metric space (i.e., no concept of neighborhood is needed), and since at every stage any element of the action set can be chosen, a global rather than a local optimum can be obtained.

Experimental simulation of automata methods carried out during the last few years has indicated the feasibility of the automaton approach in the solution of interesting examples in parameter optimization, hypothesis testing, and game theory. The automaton approach also appears appropriate in the study of hierarchical systems and in tackling certain nonstationary optimization problems. Furthermore, several other avenues to learning can be interpreted as iterative procedures in the probability space, and the learning automaton provides a natural mathematical model for such situations and serves as a unifying theme among diverse techniques [GM3].

Previous studies on learning automata have led to a certain understanding of the basic issues involved and have provided guidelines for the design of algorithms. An appreciation of the fundamental problems in the field has also taken place. It appears that research in this area has reached a stage where the power and applicability of the approach needs to be made widely known in order that it can be fully exploited in solving problems in relevant areas. In this paper we review recent results in the area of learning automata, reexamine some of the theoretical questions that arise, and suggest potential areas where the available results may find application.

Historically, the first learning automata models were developed in mathematical psychology. Early work in this area has been well documented in the book by Bush and Mosteller [GB1]. More recent results can be found in Atkinson et al. [GA1]. A rigorous mathematical framework has been developed for the study of learning problems by Iosifescu and Theodorescu [GI1] as well as by Norman [GN1].

Tsetlin [DT1] introduced the concept of using deterministic automata operating in random environments as models of learning. A great deal of work in the Soviet Union and elsewhere has followed the trend set by his source paper. No attempt, however, has been made in this paper to review all these studies.

Varshavskii and Vorontsova [LV1] observed that the use of stochastic automata with updating of action probabilities could reduce the number of states in comparison with deterministic automata. This idea has proved to be very fruitful and has been exploited in a series of investigations, the results of which form the subject of this paper. Fu and his associates [LF1]-[LF6] were among the first to introduce stochastic automata into the control literature. A variety of applications to parameter optimization, pattern recognition, and game theory were considered by this school. McLaren [LM1] explored the properties of linear updating schemes and suggested the concept of a "growing" automaton [LM2]. Chandrasekaran and Shen [LC1]-[LC3] made useful studies of nonlinear updating schemes, nonstationary environments, and games of automata. Tsypkin and Poznyak [LT1] attempted to unify the updating schemes by focusing attention on an inverse optimization problem. The present authors and their associates [LS1], [LS2], [LV3]-[LV10], [LN1], [LN2], [LL1]-[LL5] have studied the theory and applications of learning automata and also carried out simulation studies in the area.

The survey papers on learning control systems by Sklansky [GS1] and Fu [GF1] have devoted part of their attention to learning automata. The topic also finds a place in some books and collections of articles on learning systems [GM2], [GF2], [LF6]. The literature on the two-armed bandit problem is relevant in the present context but is not referred to in detail as the approach taken is rather different [LC5], [LW2]. References to other contributions will be made at appropriate points in the body of the paper.

Organization

This paper has been divided into nine sections. Following the introduction, the basic concepts and definitions of stochastic automata and random environments are given in Section II. The possible ways in which the behavior of learning automata can be judged are defined in Section III. Section IV deals with reinforcement schemes (or updating algorithms) and their properties and includes a discussion of convergence. Section V describes collective behavior of automata in terms of games between automata and multilevel structures of automata. Nonstationary environments are briefly considered in Section VI. Possible uses of learning automata in optimization and hypothesis testing form the subject matter of Section VII. A short description of the fields of application of learning automata is given in Section VIII. A comprehensive bibliography is provided in the Reference section and is divided into three subsections dealing with 1) general references in the literature pertinent to the topic considered, 2) some important papers on deterministic automata that provided the impetus for stochastic automata models, and 3) publications wholly devoted to learning automata.

Fig. 1. Stochastic automaton.
Fig. 2. Environment.
Fig. 3. Learning automaton.

II. STOCHASTIC AUTOMATA AND RANDOM ENVIRONMENTS

Stochastic Automaton

A stochastic automaton is a sextuple {x, Φ, α, p, A, G} where x is the input set, Φ = {φ1, φ2, ..., φs} is the set of internal states, α = {α1, α2, ..., αr} with r ≤ s is the output or action set, p is the state probability vector governing the choice of the state at each stage (i.e., at each stage n, p(n) = (p_1(n), p_2(n), ..., p_s(n))), A is an algorithm (also called an updating scheme or reinforcement scheme) which generates p(n + 1) from p(n), and G: Φ → α is the output function. G could be a stochastic function, but there is no loss of generality in assuming it to be deterministic [GP1]. In this paper G is taken to be deterministic and one-to-one (i.e., r = s, and states and actions are regarded as synonymous) and s < ∞. Fig. 1 shows a stochastic automaton with its inputs and actions.

It may be noted that the states of a stochastic automaton correspond to the states of a discrete-state discrete-parameter Markov process. Occasionally, it may be convenient to regard the p_i(n) themselves as states of a continuous-state Markov process.
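Written out as a data structure, the sextuple takes the following shape. This is an informal sketch rather than anything from the paper; the field names are hypothetical, the input set is the P-model set {0, 1}, and G is the identity map in line with the one-to-one assumption above.

    from dataclasses import dataclass
    from typing import Callable, List, Sequence

    @dataclass
    class StochasticAutomaton:
        # Sketch of the sextuple {x, phi, alpha, p, A, G} under the assumptions above.
        x: Sequence[int]                                    # input set, here {0, 1} (P-model)
        phi: Sequence[str]                                  # internal states phi_1, ..., phi_s
        alpha: Sequence[str]                                # actions alpha_1, ..., alpha_r (r = s)
        p: List[float]                                      # state probability vector p(n)
        A: Callable[[List[float], int, int], List[float]]   # reinforcement scheme: p(n+1) from p(n), action, input
        G: Callable[[int], int] = lambda i: i               # output function; identity since states = actions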
Environment

Only an environment (also called a medium) with random response characteristics is of interest in the problems considered. The environment (shown in Fig. 2) has inputs α(n) ∈ {α1, ..., αr} and outputs (responses) belonging to a set x. Frequently the responses are binary {0, 1}, with zero being called the nonpenalty response and one the penalty response. The probability of emitting a particular output symbol (say, 1) depends on the input and is denoted by c_i (i = 1, ..., r). The c_i are called the penalty probabilities. If the c_i do not depend on n, the environment is said to be stationary; otherwise it is nonstationary. It is assumed that the c_i are unknown initially; the problem would be trivial if they are known a priori.
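A stationary environment of the kind just described is simple to simulate. The class below is an illustrative sketch only (its name and interface are not from the paper); it implements a P-model with fixed penalty probabilities.

    import random

    class StationaryEnvironment:
        # P-model environment: responds 1 (penalty) with probability c_i, else 0 (nonpenalty).
        def __init__(self, c, seed=None):
            self.c = list(c)                 # fixed penalty probabilities c_1, ..., c_r, hence "stationary"
            self.rng = random.Random(seed)

        def respond(self, i):
            return 1 if self.rng.random() < self.c[i] else 0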
Learning Automaton (Stochastic Automaton in a Random Environment)

Fig. 3 represents a feedback connection of a stochastic automaton and an environment. The actions of the automaton in this case form the inputs to the environment. The responses of the environment in turn are the inputs to the automaton and influence the updating of the action probabilities. As these responses are random, the action probability vector p(n) is also random.

In psychological learning experiments the organism under study is said to learn when it improves the probability of correct response as a result of interaction with its environment. Since the stochastic automaton being considered in this paper behaves in a similar fashion, it appears proper to refer to it as a learning automaton. Thus a learning automaton is a stochastic automaton that operates in a random environment and updates its action probabilities in accordance with the inputs received from the environment so as to improve its performance in some specified sense.

In the context of psychology, a learning automaton may be regarded as a model of the learning behavior of the organism under study, with the environment as controlled by the experimenter. In an engineering application such as the control of a process, the controller corresponds to the learning automaton, while the rest of the system with all its uncertainties constitutes the environment.

It is useful to note the distinction between several models based on the nature of the input to the learning automaton. If the input set is binary, e.g., {0, 1}, the model is known as a P-model. On the other hand it is called a Q-model if the input set is a finite collection of distinct symbols as, for example, obtained by quantization, and an S-model if the input set is an interval [0, 1]. Each of these models appears appropriate in certain situations.

A remark on the terminology is relevant here. Following Tsetlin [DT1], deterministic automata operating in random environments have been proposed as models of learning behavior. Thus they are also contenders for the term "learning automata." However, in the view of the present authors the stochastic automaton with updating of action probabilities is a general model from which the deterministic automaton can be obtained as a special case having a 0-1 state transition matrix, and it appears reasonable to apply the term learning automaton to the more general model. In cases where it is felt necessary to emphasize the learning properties of a deterministic automaton one can use a qualifying term such as "deterministic learning automaton." It may also be noted that the learning automata of this paper have been referred to as "variable-structure stochastic automata" in earlier literature [LV1].

III. NORMS OF BEHAVIOR OF LEARNING AUTOMATA

The basic operation carried out by a learning automaton is the updating of the action probabilities on the basis of the responses of the environment. A natural question here is to examine whether the updating is done in such a manner as to result in a performance compatible with intuitive notions of learning.

One quantity useful in judging the behavior of a learning automaton is the average penalty received by the automaton. At a certain stage n, if the action α_i is selected with probability p_i(n), the average penalty conditioned on p(n) is

    M(n) = E{x(n) | p(n)} = Σ_{i=1}^{r} p_i(n) c_i.                    (1)

If no a priori information is available, and the actions are chosen with equal probability (i.e., at random), the value of the average penalty is denoted by M_0 and is given by

    M_0 = (c_1 + c_2 + ... + c_r) / r.                                 (2)

The learning automaton may be said to do better than pure chance if the average penalty is made less than M_0, at least asymptotically. Such a behavior is called expediency and is defined as follows [DT1], [LC1].

Definition 1: A learning automaton is called expedient¹ if

    lim_{n→∞} E[M(n)] < M_0.                                           (3)

¹ Since p_i(n), lim_{n→∞} p_i(n), and consequently M(n) are, in general, random variables, the expectation operator is needed in the definition to represent the average penalty.

When a learning automaton is expedient it only does better than one which chooses actions in a purely random manner. It would be desirable if the average penalty could be minimized by a proper selection of the actions. In such a case the learning automaton is called optimal. From (1) it can be seen that the minimum value of M(n) is min_i {c_i}.

Definition 2: A learning automaton is called optimal if

    lim_{n→∞} E[M(n)] = c_l                                            (4)

where c_l = min_i {c_i}.

Optimality implies that asymptotically the action associated with the minimum penalty probability is chosen with probability one. While optimality appears a very desirable property, certain conditions in a given situation may preclude its achievement. In such a case one would aim at a suboptimal performance. One such property is given by ε-optimality [LV4].

Definition 3: A learning automaton is called ε-optimal if

    lim_{n→∞} E[M(n)] < c_l + ε                                        (5)

can be obtained for any arbitrary ε > 0 by a suitable choice of the parameters of the reinforcement scheme. ε-optimality implies that the performance of the automaton can be made as close to the optimal as desired.

It is possible that the preceding properties hold only when the penalty probabilities c_i satisfy certain restrictions, for example, that they should lie in certain intervals. In such cases the properties are said to be conditional.

In practice, the penalty probabilities are often completely unknown, and it would be necessary to have desirable performance whatever be the values of c_i, that is, in all stationary random media. The performance would also be superior if the decrease of E[M(n)] is monotonic. Both these requirements are considered in the following definition [LL3].

Definition 4: A learning automaton is said to be absolutely expedient if

    E[M(n + 1) | p(n)] < M(n)                                          (6)

for all n, all p_k(n) ∈ (0, 1) (k = 1, ..., r), and all possible values² of c_i (i = 1, ..., r).

² It is usually assumed that the set {c_i} has unique maximum and minimum elements.

Absolute expediency implies that M(n) is a supermartingale and that E[M(n)] is strictly monotonically decreasing with n in all stationary random environments. If M(n) < M_0 initially, absolute expediency implies expediency. It is thus a stronger requirement on the learning automaton. Furthermore, it can be shown that absolute expediency implies ε-optimality in all stationary random environments [LL4]. It is not at present known whether the reverse implication is true. However, every learning automaton presently known to be ε-optimal in all stationary media is also absolutely expedient. Hence ε-optimality and absolute expediency will be treated as synonymous in the sequel.

The definitions in this section have been given with reference to a P-model but can be applied with minor changes to Q- and S-models [LV3], [LV8], [LC1].

IV. REINFORCEMENT SCHEMES

Having decided on the norms of behavior of learning automata, we can now focus attention on the means of achieving the desired performance. It is evident from the description of the learning automaton that the crucial factor that affects the performance is the reinforcement scheme for the updating of the action probabilities. It thus becomes necessary to relate the structure of a reinforcement scheme and the performance of the automaton using the scheme.

A reinforcement scheme in its general form can be represented by

    p(n + 1) = T[p(n), α(n), x(n)]                                     (7)

where T is an operator, and α(n) and x(n) represent the action of the automaton and the input to the automaton at instant n, respectively. One can classify the reinforcement schemes either on the basis of the property exhibited by a learning automaton using the scheme (as, for example, the automaton being expedient or optimal) or on the basis of the nature of the functions appearing in the scheme (as, for example, linear, nonlinear, or hybrid). If p(n + 1) is a linear function of the components of p(n), the reinforcement scheme is said to be linear; otherwise it is nonlinear. Sometimes it is advantageous to update p(n) according to different schemes depending on the intervals in which the value of p(n) lies.

In such a case the combined reinforcement scheme is called a hybrid scheme.

The basic idea behind any reinforcement scheme is rather simple. If the learning automaton selects an action α_i at instant n and a nonpenalty input occurs, the action probability p_i(n) is increased, and all the other components of p(n) are decreased. For a penalty input, p_i(n) is decreased, and the other components are increased. These changes in p_i(n) are known as reward and penalty, respectively. Occasionally the action probabilities may be retained at the previous values, in which case the status quo is known as "inaction."

In general, when the action at n is α_i,

    p_j(n + 1) = p_j(n) - f_j(p(n)),   for x(n) = 0
    p_j(n + 1) = p_j(n) + g_j(p(n)),   for x(n) = 1        (j ≠ i)     (8a)

The algorithm for p_i(n + 1) is to be fixed so that the p_k(n + 1) (k = 1, ..., r) add to unity. Thus

    p_i(n + 1) = p_i(n) + Σ_{j≠i} f_j(p(n)),   for x(n) = 0
    p_i(n + 1) = p_i(n) - Σ_{j≠i} g_j(p(n)),   for x(n) = 1            (8b)

where the nonnegative³ continuous functions f_j(·) and g_j(·) are such that p_k(n + 1) ∈ (0, 1) for all k = 1, ..., r whenever every p_k(n) ∈ (0, 1). The latter requirement is necessary to prevent the automaton from getting trapped prematurely in an absorbing barrier.

³ The nonnegativity condition need be imposed only if the "reward" character of f_j(·) and the "penalty" character of g_j(·) are to be preserved.
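In procedural form, the general scheme (8a)-(8b) is a single update step parameterized by the families f_j and g_j. The fragment below is an illustrative sketch under the conventions used here (f and g supplied as callables of the whole vector p and an index j); it is not taken from the paper.

    def reinforce(p, i, x, f, g):
        # One step of the general scheme (8): action alpha_i was chosen, response x observed.
        # f(p, j) and g(p, j) play the roles of f_j(p) and g_j(p).
        r = len(p)
        q = list(p)
        if x == 0:                                                  # nonpenalty response
            for j in range(r):
                if j != i:
                    q[j] = p[j] - f(p, j)                           # (8a) with x(n) = 0
            q[i] = p[i] + sum(f(p, j) for j in range(r) if j != i)  # (8b) keeps the sum at unity
        else:                                                       # penalty response
            for j in range(r):
                if j != i:
                    q[j] = p[j] + g(p, j)                           # (8a) with x(n) = 1
            q[i] = p[i] - sum(g(p, j) for j in range(r) if j != i)  # (8b)
        return q

Choosing f and g as in (9) or (10) below recovers the linear schemes; nonlinear choices such as (11) and (12) fit the same mold.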
Varshavskii and Vorontsova [LV1] were the first to suggest such reinforcement schemes for two-state automata and thus set the trend for later developments. They considered two schemes, one linear and the other nonlinear, in terms of updating of the state-transition probabilities. Fu, McLaren, and McMurtry [LF1], [LF2] simplified the procedure by considering updating of the total action choice probabilities as dealt with here.

Linear Schemes

The earliest known scheme can be obtained by setting

    f_j(p) = a p_j,    g_j(p) = -b p_j + b/(r - 1),    for all j = 1, ..., r     (9)

where 0 < a, b < 1.⁴ This is known as a linear reward-penalty (denoted LR-P) scheme. Early studies of the scheme, principally dealing with the two-state case, were made by Bush and Mosteller [GB1] and Varshavskii and Vorontsova [LV1]. McLaren [LM1] made a detailed investigation of the multistate case, and this work was continued by Chandrasekaran and Shen [LC1] as well as by Viswanathan and Narendra [LV9]. Norman [LN4] established several results pertaining to the ergodic character of the scheme.

⁴ g_j(·) for this scheme is not nonnegative for all values of p_j.

It is known that an automaton using the LR-P scheme is expedient in all stationary random environments. Expressions for the rate of learning and the variance of the action probabilities are also available.
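Substituting (9) into the general step gives the familiar multiplicative form of the LR-P update. The following fragment is an illustrative sketch, with hypothetical default values for a and b:

    def lr_p_update(p, i, x, a=0.1, b=0.1):
        # Linear reward-penalty (LR-P) step: action alpha_i taken, response x in {0, 1}.
        r = len(p)
        q = list(p)
        if x == 0:                                    # reward
            for j in range(r):
                q[j] = (1 - a) * p[j]                 # p_j - a*p_j for j != i
            q[i] = p[i] + a * (1 - p[i])              # keeps the components summing to one
        else:                                         # penalty
            for j in range(r):
                q[j] = (1 - b) * p[j] + b / (r - 1)   # p_j + g_j(p), with g_j(p) = -b*p_j + b/(r-1)
            q[i] = (1 - b) * p[i]                     # p_i minus the sum of g_j over j != i
        return q

Passed as the update argument of the interaction-loop sketch in Section I, such a rule exhibits the expedient behavior noted above.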
By setting

    f_j(p) = a p_j,    g_j(p) = 0,    for all j                                  (10)

we get the linear reward-inaction (LR-I) scheme. This scheme was considered first in mathematical psychology [GB1] but was later independently conceived and introduced into the engineering literature by Shapiro and Narendra [LS1], [LS2]. The characteristic of the scheme is that it ignores penalty inputs from the environment so that the action probabilities remain unchanged under these inputs. Because of this property a learning automaton using the scheme has been called a "benevolent automaton" by Tsypkin and Poznyak [LT1].

The LR-I scheme was originally reported to be optimal in all stationary random environments, but it is now known that it is only ε-optimal [LV4], [LL4]. It is significant, however, that replacing the penalty by inaction in the LR-P scheme totally changes the performance from expediency to ε-optimality. Other possible combinations such as the linear reward-reward, penalty-penalty, and inaction-penalty schemes have been considered in [LV9], but these are, in general, inferior to the LR-I and LR-P schemes. The effect of varying the parameters a and b with n has also been studied in [LV9].
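Because (10) sets every g_j to zero, the LR-I step is simply the reward branch of the preceding rule with the penalty branch replaced by inaction. Again a sketch, not the paper's own code:

    def lr_i_update(p, i, x, a=0.1):
        # Linear reward-inaction (LR-I) step: penalties leave p(n) unchanged.
        if x == 1:                          # penalty: "inaction", keep the previous values
            return list(p)
        q = [(1 - a) * pj for pj in p]      # reward: shrink every component by (1 - a) ...
        q[i] = p[i] + a * (1 - p[i])        # ... and move the freed mass onto action alpha_i
        return q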
Nonlinear Schemes

As mentioned earlier, the first nonlinear scheme for a two-state automaton was proposed by Varshavskii and Vorontsova [LV1] in terms of transition probabilities. The total-probability version of the scheme corresponds to

    g_j(p) = f_j(p) = a p_j (1 - p_j),    j = 1, 2.                              (11)

This scheme is ε-optimal in a restricted random environment satisfying either c_1 < 1/2 < c_2 or c_2 < 1/2 < c_1. Chandrasekaran and Shen [LC1] have studied nonlinear schemes with power-law nonlinearities. Several nonlinear schemes, which are ε-optimal in all stationary random environments, have been suggested by Viswanathan and Narendra [LV9] as well as by Lakshmivarahan and Thathachar [LL1], [LL3]. A simple scheme of this type for the two-state case is

    f_j(p) = a p_j^2 (1 - p_j),    g_j(p) = b p_j (1 - p_j),    j = 1, 2         (12)

where 0 < a < 4 and 0 < b < 1.
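For the two-action case, (12) can be plugged into the same update pattern. The fragment below is an illustrative sketch with hypothetical parameter values chosen inside the stated ranges:

    def nonlinear_update(p, i, x, a=1.0, b=0.5):
        # Two-action nonlinear scheme (12): f_j(p) = a*p_j^2*(1-p_j), g_j(p) = b*p_j*(1-p_j).
        f = lambda pj: a * pj * pj * (1 - pj)
        g = lambda pj: b * pj * (1 - pj)
        j = 1 - i                                   # the other action (indices 0 and 1)
        q = [0.0, 0.0]
        if x == 0:                                  # reward
            q[j] = p[j] - f(p[j])
            q[i] = p[i] + f(p[j])
        else:                                       # penalty
            q[j] = p[j] + g(p[j])
            q[i] = p[i] - g(p[j])
        return q

Equivalently, the general reinforce step given after (8) can be called with f(p, j) = a*p[j]**2*(1 - p[j]) and g(p, j) = b*p[j]*(1 - p[j]).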
A combination of linear and nonlinear terms often appears advantageous [LL3]. Extensive simulation results on a variety of schemes utilizing several possible combinations of reward, penalty, and inaction are available in [LV10]. A result that unifies most of the preceding reinforcement schemes has been reported in [LL3] and is given by the following.
