Carnegie Mellon University
Research Showcase @ CMU
Computer Science Department School of Computer Science
1988
Learning state space trajectories in recurrent neural
networks
Barak Pearlmutter
Carnegie Mellon University
Learning State Space Trajectories
in Recurrent Neural Networks
Barak A. Pearlmutter
December 31, 1988
CMU-CS-88-191
Abstract
We describe a number of procedures for finding $\partial E/\partial w_{ij}$, where $E$ is an error functional of the temporal trajectory of the states of a continuous recurrent network and $w_{ij}$ are the weights of that network. Computing these quantities allows one to perform gradient descent in the weights to minimize $E$, so these procedures form the kernels of connectionist learning algorithms. Simulations in which networks are taught to move through limit cycles are shown. We also describe a number of elaborations of the basic idea, such as mutable time delays and teacher forcing, and conclude with a complexity analysis. This type of network seems particularly suited for temporally continuous domains, such as signal processing, control, and speech.
This research was sponsored in part by National Science Foundation grant EET-8716324 and by the Office of Naval
Research under contract number N00014-86-K-0678. Barak Pearlmutter is a Fannie and John Hertz Foundation fellow.
The views and conclusions contained in this document are those of the author and should not be interpreted as representing
the official policies, either expressed or implied, of NSF, ONR, the Fannie and John Hertz Foundation or the U.S.
Government.
1 Introduction
Note: this is an expanded version of an earlier paper of the
same title [9].
Pineda [11] has shown how to train the fixpoints of a recurrent temporally continuous generalization of backpropagation networks [8,12,14]. Such networks are governed by the coupled differential equations

$$T_i \frac{dy_i}{dt} = -y_i + \sigma(x_i) + I_i \qquad (1)$$

where

$$x_i = \sum_j w_{ji}\, y_j \qquad (2)$$

is the total input to unit $i$, $y_i$ is the state of unit $i$, $T_i$ is the time constant of unit $i$, $\sigma$ is an arbitrary differentiable function¹, $w_{ij}$ are the weights, and the initial conditions $y_i(t_0)$ and driving functions $I_i(t)$ are the inputs to the system.
Consider minimizing $E(\mathbf{y})$, some functional of the trajectory taken by $\mathbf{y}$ between $t_0$ and $t_1$. For instance, $E = \int_{t_0}^{t_1} (y_0(t) - f(t))^2\, dt$ measures the deviation of $y_0$ from the function $f$, and minimizing this $E$ would teach the network to have $y_0$ imitate $f$. Below, we develop a technique for computing $\partial E(\mathbf{y})/\partial w_{ij}$ and $\partial E(\mathbf{y})/\partial T_i$, thus allowing us to do gradient descent in the weights and time constants so as to minimize $E$. The computation of $\partial E/\partial w_{ij}$ seems to require a phase in which the network is run backwards in time, and tricks for avoiding this are also described.
2 A Forward/Backward Technique
We can approximate the derivative in (1) with

$$\frac{dy_i}{dt} \approx \frac{\tilde y_i(t + \Delta t) - \tilde y_i(t)}{\Delta t} \qquad (3)$$

which yields a first order difference approximation to (1),

$$\tilde y_i(t + \Delta t) = \left(1 - \frac{\Delta t}{T_i}\right)\tilde y_i(t) + \frac{\Delta t}{T_i}\,\sigma(\tilde x_i(t)) + \frac{\Delta t}{T_i}\,\tilde I_i(t). \qquad (4)$$

We use tildes to indicate temporally discretized versions of continuous functions. The notation $\tilde y_i(t)$ is being used as shorthand for the particular variable representing the discrete version of $y_i(t_0 + n\Delta t)$, where $n$ is an integer and $t = t_0 + n\Delta t$.
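As a concrete illustration, the forward pass defined by (2) and (4) can be sketched in NumPy. The function name `forward` and its array conventions are ours, not the paper's; `w[i, j]` is taken to be the weight from unit $i$ to unit $j$:

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))         # the usual squashing function

def forward(w, T, I, y0, dt, n_steps):
    """Integrate the dynamics (1) via the first-order approximation (4).
    w[i, j] is the weight from unit i to unit j, so x_j = sum_i w[i, j] y_i."""
    y = y0.copy()
    ys, xs = [y.copy()], []
    for n in range(n_steps):
        x = w.T @ y                          # total input to each unit, eq. (2)
        # eq. (4): y(t + dt) = (1 - dt/T) y(t) + (dt/T) (sigma(x) + I(t))
        y = (1 - dt / T) * y + (dt / T) * (sigma(x) + I(n * dt))
        ys.append(y.copy())
        xs.append(x)
    return np.array(ys), np.array(xs)
```

With zero weights and no driving input, each unit simply relaxes toward $\sigma(0) = 0.5$ at a rate set by its time constant, which gives a quick sanity check of the integrator.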
Let us define

$$e_i(t) = \frac{\partial E}{\partial y_i(t)}. \qquad (5)$$

In the usual case $E$ is of the form $\int_{t_0}^{t_1} f(\mathbf{y}(t), t)\, dt$, so $e_i(t) = \partial f(\mathbf{y}(t), t)/\partial y_i(t)$. Intuitively, $e_i(t)$ measures how much a small change to $y_i$ at time $t$ affects $E$ if everything else is left unchanged.

¹Typically $\sigma(\xi) = (1 + e^{-\xi})^{-1}$, in which case $\sigma'(\xi) = \sigma(\xi)(1 - \sigma(\xi))$.

Figure 1: The infinitesimal changes to $y$ considered in $e_i(t)$ (left) and $z_i(t)$ (right).
Let us define

$$\tilde z_i(t) = \frac{\partial^+ E}{\partial \tilde y_i(t)} \qquad (6)$$

where the $\partial^+$ denotes an ordered derivative [15], with variables ordered by time. Intuitively, $\tilde z_i(t)$ measures how much a small change to $\tilde y_i$ at time $t$ affects $E$ when this change is propagated forward through time and influences the remainder of the trajectory, as in figure 1. Of course, $z_i$ is the limit of $\tilde z_i$ as $\Delta t \to 0$.
We can use the chain rule for ordered derivatives to calculate $\tilde z_i(t)$ in terms of the $\tilde z_j(t + \Delta t)$. According to the chain rule, we add all the separate influences that varying $\tilde y_i(t)$ has on $E$. It has a direct contribution of $\Delta t\,\tilde e_i(t)$, which comprises the first term of our equation for $\tilde z_i(t)$. Varying $\tilde y_i(t)$ by $\epsilon$ has an effect on $\tilde y_i(t + \Delta t)$ of $\epsilon(1 - \Delta t/T_i)$, giving us a second term, namely $(1 - \Delta t/T_i)\,\tilde z_i(t + \Delta t)$.

Each weight $w_{ij}$ allows $\tilde y_i(t)$ to influence $\tilde y_j(t + \Delta t)$. Let us compute this influence in stages: varying $\tilde y_i(t)$ by $\epsilon$ varies $\tilde x_j(t)$ by $\epsilon w_{ij}$, which varies $\sigma(\tilde x_j(t))$ by $\epsilon w_{ij}\,\sigma'(\tilde x_j(t))$, which varies $\tilde y_j(t + \Delta t)$ by $\epsilon w_{ij}\,\sigma'(\tilde x_j(t))\,\Delta t/T_j$. This gives us our third and final term, $\sum_j w_{ij}\,\sigma'(\tilde x_j(t))\,\Delta t\,\tilde z_j(t + \Delta t)/T_j$.
Combining these,

$$\tilde z_i(t) = \Delta t\,\tilde e_i(t) + \left(1 - \frac{\Delta t}{T_i}\right)\tilde z_i(t + \Delta t) + \sum_j w_{ij}\,\sigma'(\tilde x_j(t))\,\frac{\Delta t}{T_j}\,\tilde z_j(t + \Delta t). \qquad (7)$$
If we put this in the form of (3) and take the limit as $\Delta t \to 0$ we obtain the differential equation

$$\frac{dz_i}{dt} = \frac{1}{T_i}\, z_i - e_i - \sum_j \frac{1}{T_j}\, w_{ij}\,\sigma'(x_j)\, z_j. \qquad (8)$$

For boundary conditions note that by (5) and (6) $\tilde z_i(t_1) = \Delta t\,\tilde e_i(t_1)$, so in the limit as $\Delta t \to 0$ we have $z_i(t_1) = 0$.
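The recursion (7) runs backwards in time from the boundary condition. A minimal sketch under our own array conventions (here `sigma_prime` is the derivative of the squashing function, and we start from the continuous-limit boundary condition $z_i(t_1) = 0$):

```python
import numpy as np

def backward(w, T, xs, e, dt, sigma_prime):
    """Run eq. (7) from t1 down to t0, starting from z(t1) = 0.
    xs[n] and e[n] are the stored x(t) and e(t) at step n."""
    n_steps, n_units = xs.shape
    z = np.zeros(n_units)                    # boundary condition z_i(t1) = 0
    zs = [z.copy()]
    for n in reversed(range(n_steps)):
        # eq. (7): direct term, decay term, and the term through the weights
        z = dt * e[n] + (1 - dt / T) * z + w @ (sigma_prime(xs[n]) * (dt / T) * z)
        zs.append(z.copy())
    return np.array(zs[::-1])                # zs[0] approximates z(t0)
```

With $w = 0$ and $T_i = 1$, (8) reduces to $dz_i/dt = z_i - e_i$, whose backward solution from $z_i(t_1) = 0$ gives an analytic check of the recursion.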
Consider making an infinitesimal change $dw_{ij}$ to $w_{ij}$ for a period $\Delta t$ starting at $t$. This will cause a corresponding infinitesimal change in $E$ of

$$y_i(t)\,\sigma'(x_j(t))\,\frac{\Delta t}{T_j}\, z_j(t)\, dw_{ij}.$$
Figure 2: A lattice representation of (4).
Since we wish to know the effect of making this infinitesimal change to $w_{ij}$ throughout time, we integrate over the entire interval, yielding

$$\frac{\partial E}{\partial w_{ij}} = \frac{1}{T_j}\int_{t_0}^{t_1} y_i\,\sigma'(x_j)\, z_j\, dt. \qquad (9)$$
If we substitute $p_i = T_i^{-1}$ into (4), find $\partial E/\partial p_i$ by proceeding analogously, and substitute $T_i$ back in, we get

$$\frac{\partial E}{\partial T_i} = -\frac{1}{T_i}\int_{t_0}^{t_1} z_i\,\frac{dy_i}{dt}\, dt. \qquad (10)$$
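Given stored trajectories, (9) and (10) are plain quadratures. A rectangle-rule sketch under our own (hypothetical) array conventions, where `ys` holds one more time step than `xs` and `zs` so that a finite difference approximates $dy_i/dt$:

```python
import numpy as np

def gradients(T, ys, xs, zs, dt, sigma_prime):
    """Accumulate dE/dw_ij (eq. 9) and dE/dT_i (eq. 10) by the rectangle rule."""
    n_steps, n_units = xs.shape
    dE_dw = np.zeros((n_units, n_units))
    dE_dT = np.zeros(n_units)
    for n in range(n_steps):
        sz = sigma_prime(xs[n]) * zs[n] / T       # sigma'(x_j) z_j / T_j
        dE_dw += dt * np.outer(ys[n], sz)         # eq. (9): (1/T_j) y_i sigma'(x_j) z_j
        dydt = (ys[n + 1] - ys[n]) / dt           # finite-difference dy_i/dt
        dE_dT += -(dt / T) * zs[n] * dydt         # eq. (10)
    return dE_dw, dE_dT
```

Both integrands use only quantities already stored during the forward and backward passes, which is why replaying $y$ and $\sigma'(x)$ suffices.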
One can also derive (8), (9) and (10) using the calculus
of variations and Lagrange multipliers (William Skaggs,
personal communication), or from the continuous form of
dynamic programming [5].
3 Simulation Results
Using first order finite difference approximations, we integrated the system $\mathbf{y}$ forward from $t_0$ to $t_1$, set the boundary conditions $z_i(t_1) = 0$, and integrated the system $\mathbf{z}$ backwards from $t_1$ to $t_0$ while numerically integrating $z_j\,\sigma'(x_j)\,y_i$ and $z_i\, dy_i/dt$, thus computing $\partial E/\partial w_{ij}$ and $\partial E/\partial T_i$. Since computing $dz/dt$ requires knowing $\sigma'(x_j)$, we stored it and replayed it backwards as well. We also stored and replayed $y_i$, as it is used in expressions being numerically integrated.
We used the error functional

$$E = \frac{1}{2}\sum_i \int_{t_0}^{t_1} s_i\,(y_i - d_i)^2\, dt \qquad (11)$$

where $d_i(t)$ is the desired state of unit $i$ at time $t$ and $s_i(t)$ is the importance of unit $i$ achieving that state at that time.
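For the functional (11), the error density defined in (5) comes out directly as $e_i(t) = s_i(t)\,(y_i(t) - d_i(t))$. As a one-line sketch (the function name is ours):

```python
import numpy as np

def e_from_targets(ys, ds, ss):
    """e_i(t) = s_i(t) (y_i(t) - d_i(t)): the derivative of the
    integrand of (11) with respect to y_i(t)."""
    return ss * (ys - ds)
```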
Throughout, we used $\sigma(\xi) = (1 + e^{-\xi})^{-1}$. Time constants were initialized to 1, weights were initialized to uniformly distributed random values between 1 and $-1$, and the initial values $y_i(t_0)$ were set to $I_i(t_0) + \sigma(0)$. The simulator used the approximations (4) and (7) with $\Delta t = 0.1$.
All of these networks have an extra unit which has no
incoming connections, an external input of 0.5, and out-
going connections to all other units. This unit provides a
bias,
which is equivalent to the negative of a threshold.
This detail is suppressed below.
3.1 Exclusive Or
The network of figure 3 was trained to solve the xor prob-
lem. Aside from the addition of time constants, the net-
work topology was that used by Pineda in [11]. We defined
$E = \sum_k \int_2^3 \left(y_0^{(k)} - d^{(k)}\right)^2 dt$ where $k$ ranges over the four
cases, $d$ is the correct output, and $y_0$ is the state of the
output unit. The two inputs to the net range over
the four possible boolean combinations in the four different
cases. With suitable choice of step size and momentum
training time was comparable to standard backpropagation,
averaging about one hundred epochs.
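The "step size and momentum" schedule mentioned above is ordinary gradient descent with a velocity term; a generic sketch (the learning rate and momentum values here are illustrative, not the paper's):

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.1, mu=0.9):
    """One gradient-descent step with momentum: v <- mu v - lr grad; w <- w + v."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity
```

The same update applies unchanged to the time constants once $\partial E/\partial T_i$ is available.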
For this task it is to the network's benefit for units to
attain their final values as quickly as possible, so there
was a tendency to lower the time constants towards 0. In
an effort to avoid small time constants, which degrade the
numerical accuracy of the simulation, we introduced a term
to decay the time constants towards 1. This decay factor
was not used in the other simulations described below, and
was not really necessary in this task if a suitably small At
was used in the simulation.
It is interesting that even for this binary task, the network
made use of dynamical behavior. After extensive training
the network behaved as expected, saturating the output unit
to the correct value. Earlier in training, however, we oc-
casionally (about one out of every ten training sessions)
observed the output unit at nearly the correct value be-
tween t = 2 and t = 3, but then saw it move in the wrong
direction at t = 3 and end up stabilizing at a wildly incor-
rect value. Another dynamic effect, which was present in
almost every run, is shown in figure 4. Here, the output
unit heads in the wrong direction initially and then corrects
itself before the error window. A very minor case of diving
towards the correct value and then moving away is seen in
the lower left hand corner of figure 4.
Figure 3: The XOR network (input, hidden, and output units).