Carnegie Mellon University
Research Showcase @ CMU
Computer Science Department School of Computer Science
1988
Learning state space trajectories in recurrent neural
networks
Barak Pearlmutter
Carnegie Mellon University
Learning State Space Trajectories
in Recurrent Neural Networks
Barak A. Pearlmutter
December 31, 1988
CMU-CS-88-191
Abstract
We describe a number of procedures for finding $\partial E/\partial w_{ij}$, where $E$ is an error functional of the temporal trajectory of the states of a continuous recurrent network and $w_{ij}$ are the weights of that network. Computing these quantities allows one to perform gradient descent in the weights to minimize $E$, so these procedures form the kernels of connectionist learning algorithms. Simulations in which networks are taught to move through limit cycles are shown. We also describe a number of elaborations of the basic idea, such as mutable time delays and teacher forcing, and conclude with a complexity analysis. This type of network seems particularly suited for temporally continuous domains, such as signal processing, control, and speech.
This research was sponsored in part by National Science Foundation grant EET-8716324 and by the Office of Naval
Research under contract number N00014-86-K-0678. Barak Pearlmutter is a Fannie and John Hertz Foundation fellow.
The views and conclusions contained in this document are those of the author and should not be interpreted as representing
the official policies, either expressed or implied, of NSF, ONR, the Fannie and John Hertz Foundation or the U.S.
Government.
1 Introduction
Note: this is an expanded version of an earlier paper of the
same title [9].
Pineda [11] has shown how to train the fixpoints of a recurrent temporally continuous generalization of backpropagation networks [8,12,14]. Such networks are governed by the coupled differential equations

$$T_i \frac{dy_i}{dt} = -y_i + \sigma(x_i) + I_i \qquad (1)$$

where

$$x_i = \sum_j w_{ji}\, y_j \qquad (2)$$

is the total input to unit $i$, $y_i$ is the state of unit $i$, $T_i$ is the time constant of unit $i$, $\sigma$ is an arbitrary differentiable function¹, $w_{ij}$ are the weights, and the initial conditions $y_i(t_0)$ and driving functions $I_i(t)$ are the inputs to the system.
Consider minimizing $E(\mathbf{y})$, some functional of the trajectory taken by $\mathbf{y}$ between $t_0$ and $t_1$. For instance, $E = \int_{t_0}^{t_1} (y_0(t) - f(t))^2\, dt$ measures the deviation of $y_0$ from the function $f$, and minimizing this $E$ would teach the network to have $y_0$ imitate $f$. Below, we develop a technique for computing $\partial E(\mathbf{y})/\partial w_{ij}$ and $\partial E(\mathbf{y})/\partial T_i$, thus allowing us to do gradient descent in the weights and time constants so as to minimize $E$. The computation of $\partial E/\partial w_{ij}$ seems to require a phase in which the network is run backwards in time, and tricks for avoiding this are also described.
2 A Forward/Backward Technique
We can approximate the derivative in (1) with

$$\frac{dy_i}{dt} \approx \frac{\tilde y_i(t + \Delta t) - \tilde y_i(t)}{\Delta t} \qquad (3)$$

which yields a first order difference approximation to (1),

$$\tilde y_i(t + \Delta t) = \left(1 - \frac{\Delta t}{T_i}\right)\tilde y_i(t) + \frac{\Delta t}{T_i}\,\sigma(\tilde x_i(t)) + \frac{\Delta t}{T_i}\,\tilde I_i(t). \qquad (4)$$

We use tildes to indicate temporally discretized versions of continuous functions. The notation $\tilde y_i(t)$ is being used as shorthand for the particular variable representing the discrete version of $y_i(t_0 + n\Delta t)$, where $n$ is an integer and $t = t_0 + n\Delta t$.
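As a concrete illustration, the forward pass defined by (2) and (4) can be sketched in NumPy. The function name `forward` and its array conventions are ours, not the paper's; `w[i, j]` is taken to be the weight from unit $i$ to unit $j$:

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))         # the usual squashing function

def forward(w, T, I, y0, dt, n_steps):
    """Integrate the dynamics (1) via the first-order approximation (4).
    w[i, j] is the weight from unit i to unit j, so x_j = sum_i w[i, j] y_i."""
    y = y0.copy()
    ys, xs = [y.copy()], []
    for n in range(n_steps):
        x = w.T @ y                          # total input to each unit, eq. (2)
        # eq. (4): y(t + dt) = (1 - dt/T) y(t) + (dt/T) (sigma(x) + I(t))
        y = (1 - dt / T) * y + (dt / T) * (sigma(x) + I(n * dt))
        ys.append(y.copy())
        xs.append(x)
    return np.array(ys), np.array(xs)
```

With zero weights and no driving input, each unit simply relaxes toward $\sigma(0) = 0.5$ at a rate set by its time constant, which gives a quick sanity check of the integrator.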
Let us define

$$e_i(t) = \frac{\partial E}{\partial y_i(t)}. \qquad (5)$$

In the usual case $E$ is of the form $\int_{t_0}^{t_1} f(\mathbf{y}(t), t)\, dt$, so $e_i(t) = \partial f(\mathbf{y}(t), t)/\partial y_i(t)$. Intuitively, $e_i(t)$ measures how much a small change to $y_i$ at time $t$ affects $E$ if everything else is left unchanged.

¹Typically $\sigma(\xi) = (1 + e^{-\xi})^{-1}$, in which case $\sigma'(\xi) = \sigma(\xi)(1 - \sigma(\xi))$.

Figure 1: The infinitesimal changes to $y$ considered in $e_i(t)$ (left) and $z_i(t)$ (right).
Let us define

$$\tilde z_i(t) = \frac{\partial^+ E}{\partial \tilde y_i(t)} \qquad (6)$$

where the $\partial^+$ denotes an ordered derivative [15], with variables ordered by time. Intuitively, $\tilde z_i(t)$ measures how much a small change to $\tilde y_i$ at time $t$ affects $E$ when this change is propagated forward through time and influences the remainder of the trajectory, as in figure 1. Of course, $z_i$ is the limit of $\tilde z_i$ as $\Delta t \to 0$.
We can use the chain rule for ordered derivatives to calculate $\tilde z_i(t)$ in terms of the $\tilde z_j(t + \Delta t)$. According to the chain rule, we add all the separate influences that varying $\tilde y_i(t)$ has on $E$. It has a direct contribution of $\Delta t\,\tilde e_i(t)$, which comprises the first term of our equation for $\tilde z_i(t)$. Varying $\tilde y_i(t)$ by $\epsilon$ has an effect on $\tilde y_i(t + \Delta t)$ of $\epsilon(1 - \Delta t/T_i)$, giving us a second term, namely $(1 - \Delta t/T_i)\,\tilde z_i(t + \Delta t)$.

Each weight $w_{ij}$ allows $\tilde y_i(t)$ to influence $\tilde y_j(t + \Delta t)$. Let us compute this influence in stages: varying $\tilde y_i(t)$ by $\epsilon$ varies $\tilde x_j(t)$ by $\epsilon w_{ij}$, which varies $\sigma(\tilde x_j(t))$ by $\epsilon w_{ij}\,\sigma'(\tilde x_j(t))$, which varies $\tilde y_j(t + \Delta t)$ by $\epsilon w_{ij}\,\sigma'(\tilde x_j(t))\,\Delta t/T_j$. This gives us our third and final term, $\sum_j w_{ij}\,\sigma'(\tilde x_j(t))\,\Delta t\,\tilde z_j(t + \Delta t)/T_j$.
Combining these,

$$\tilde z_i(t) = \Delta t\,\tilde e_i(t) + \left(1 - \frac{\Delta t}{T_i}\right)\tilde z_i(t + \Delta t) + \sum_j w_{ij}\,\sigma'(\tilde x_j(t))\,\frac{\Delta t}{T_j}\,\tilde z_j(t + \Delta t). \qquad (7)$$
If we put this in the form of (3) and take the limit as $\Delta t \to 0$ we obtain the differential equation

$$\frac{dz_i}{dt} = \frac{1}{T_i}\, z_i - e_i - \sum_j \frac{1}{T_j}\, w_{ij}\,\sigma'(x_j)\, z_j. \qquad (8)$$

For boundary conditions note that by (5) and (6) $\tilde z_i(t_1) = \Delta t\,\tilde e_i(t_1)$, so in the limit as $\Delta t \to 0$ we have $z_i(t_1) = 0$.
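The recursion (7) runs backwards in time from the boundary condition. A minimal sketch under our own array conventions (here `sigma_prime` is the derivative of the squashing function, and we start from the continuous-limit boundary condition $z_i(t_1) = 0$):

```python
import numpy as np

def backward(w, T, xs, e, dt, sigma_prime):
    """Run eq. (7) from t1 down to t0, starting from z(t1) = 0.
    xs[n] and e[n] are the stored x(t) and e(t) at step n."""
    n_steps, n_units = xs.shape
    z = np.zeros(n_units)                    # boundary condition z_i(t1) = 0
    zs = [z.copy()]
    for n in reversed(range(n_steps)):
        # eq. (7): direct term, decay term, and the term through the weights
        z = dt * e[n] + (1 - dt / T) * z + w @ (sigma_prime(xs[n]) * (dt / T) * z)
        zs.append(z.copy())
    return np.array(zs[::-1])                # zs[0] approximates z(t0)
```

With $w = 0$ and $T_i = 1$, (8) reduces to $dz_i/dt = z_i - e_i$, whose backward solution from $z_i(t_1) = 0$ gives an analytic check of the recursion.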
Consider making an infinitesimal change $dw_{ij}$ to $w_{ij}$ for a period $\Delta t$ starting at $t$. This will cause a corresponding infinitesimal change in $E$ of

$$y_i(t)\,\sigma'(x_j(t))\,\frac{\Delta t}{T_j}\, z_j(t)\, dw_{ij}.$$
Figure 2: A lattice representation of (4).
Since we wish to know the effect of making this infinitesimal change to $w_{ij}$ throughout time, we integrate over the entire interval, yielding

$$\frac{\partial E}{\partial w_{ij}} = \frac{1}{T_j}\int_{t_0}^{t_1} y_i\,\sigma'(x_j)\, z_j\, dt. \qquad (9)$$
If we substitute $p_i = T_i^{-1}$ into (4), find $\partial E/\partial p_i$ by proceeding analogously, and substitute $T_i$ back in, we get

$$\frac{\partial E}{\partial T_i} = -\frac{1}{T_i}\int_{t_0}^{t_1} z_i\,\frac{dy_i}{dt}\, dt. \qquad (10)$$
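Given stored trajectories, (9) and (10) are plain quadratures. A rectangle-rule sketch under our own (hypothetical) array conventions, where `ys` holds one more time step than `xs` and `zs` so that a finite difference approximates $dy_i/dt$:

```python
import numpy as np

def gradients(T, ys, xs, zs, dt, sigma_prime):
    """Accumulate dE/dw_ij (eq. 9) and dE/dT_i (eq. 10) by the rectangle rule."""
    n_steps, n_units = xs.shape
    dE_dw = np.zeros((n_units, n_units))
    dE_dT = np.zeros(n_units)
    for n in range(n_steps):
        sz = sigma_prime(xs[n]) * zs[n] / T       # sigma'(x_j) z_j / T_j
        dE_dw += dt * np.outer(ys[n], sz)         # eq. (9): (1/T_j) y_i sigma'(x_j) z_j
        dydt = (ys[n + 1] - ys[n]) / dt           # finite-difference dy_i/dt
        dE_dT += -(dt / T) * zs[n] * dydt         # eq. (10)
    return dE_dw, dE_dT
```

Both integrands use only quantities already stored during the forward and backward passes, which is why replaying $y$ and $\sigma'(x)$ suffices.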
One can also derive (8), (9) and (10) using the calculus
of variations and Lagrange multipliers (William Skaggs,
personal communication), or from the continuous form of
dynamic programming [5].
3 Simulation Results
Using first order finite difference approximations, we integrated the system $\mathbf{y}$ forward from $t_0$ to $t_1$, set the boundary conditions $z_i(t_1) = 0$, and integrated the system $\mathbf{z}$ backwards from $t_1$ to $t_0$ while numerically integrating $z_j\,\sigma'(x_j)\,y_i$ and $z_i\, dy_i/dt$, thus computing $\partial E/\partial w_{ij}$ and $\partial E/\partial T_i$. Since computing $dz/dt$ requires knowing $\sigma'(x_j)$, we stored it and replayed it backwards as well. We also stored and replayed $y_i$, as it is used in expressions being numerically integrated.
We used the error functional

$$E = \frac{1}{2}\sum_i \int_{t_0}^{t_1} s_i\,(y_i - d_i)^2\, dt \qquad (11)$$

where $d_i(t)$ is the desired state of unit $i$ at time $t$ and $s_i(t)$ is the importance of unit $i$ achieving that state at that time.
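For the functional (11), the error density defined in (5) comes out directly as $e_i(t) = s_i(t)\,(y_i(t) - d_i(t))$. As a one-line sketch (the function name is ours):

```python
import numpy as np

def e_from_targets(ys, ds, ss):
    """e_i(t) = s_i(t) (y_i(t) - d_i(t)): the derivative of the
    integrand of (11) with respect to y_i(t)."""
    return ss * (ys - ds)
```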
Throughout, we used $\sigma(\xi) = (1 + e^{-\xi})^{-1}$. Time constants were initialized to 1, weights were initialized to uniformly distributed random values between 1 and $-1$, and the initial values $y_i(t_0)$ were set to $I_i(t_0) + \sigma(0)$. The simulator used the approximations (4) and (7) with $\Delta t = 0.1$.
All of these networks have an extra unit which has no
incoming connections, an external input of 0.5, and out-
going connections to all other units. This unit provides a
bias,
which is equivalent to the negative of a threshold.
This detail is suppressed below.
3.1 Exclusive Or
The network of figure 3 was trained to solve the xor prob-
lem. Aside from the addition of time constants, the net-
work topology was that used by Pineda in [11]. We defined
$E = \sum_k \int_2^3 \left(y_0^{(k)} - d^{(k)}\right)^2 dt$ where $k$ ranges over the four
cases, $d$ is the correct output, and $y_0$ is the state of the
output unit. The two inputs to the net range over
the four possible boolean combinations in the four different
cases. With suitable choice of step size and momentum
training time was comparable to standard backpropagation,
averaging about one hundred epochs.
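The "step size and momentum" schedule mentioned above is ordinary gradient descent with a velocity term; a generic sketch (the learning rate and momentum values here are illustrative, not the paper's):

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.1, mu=0.9):
    """One gradient-descent step with momentum: v <- mu v - lr grad; w <- w + v."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity
```

The same update applies unchanged to the time constants once $\partial E/\partial T_i$ is available.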
For this task it is to the network's benefit for units to
attain their final values as quickly as possible, so there
was a tendency to lower the time constants towards 0. In
an effort to avoid small time constants, which degrade the
numerical accuracy of the simulation, we introduced a term
to decay the time constants towards 1. This decay factor
was not used in the other simulations described below, and
was not really necessary in this task if a suitably small At
was used in the simulation.
It is interesting that even for this binary task, the network
made use of dynamical behavior. After extensive training
the network behaved as expected, saturating the output unit
to the correct value. Earlier in training, however, we oc-
casionally (about one out of every ten training sessions)
observed the output unit at nearly the correct value be-
tween t = 2 and t = 3, but then saw it move in the wrong
direction at t = 3 and end up stabilizing at a wildly incor-
rect value. Another dynamic effect, which was present in
almost every run, is shown in figure 4. Here, the output
unit heads in the wrong direction initially and then corrects
itself before the error window. A very minor case of diving
towards the correct value and then moving away is seen in
the lower left hand corner of figure 4.
Figure 3: The XOR network (input, hidden, and output units).