
Alopex: A correlation-based learning algorithm for feedforward and recurrent neural networks

K. P. Unnikrishnan, +1 more
01 May 1994, Vol. 6, Iss. 3, pp. 469-490
Abstract
We present a learning algorithm for neural networks, called Alopex. Instead of error gradient, Alopex uses local correlations between changes in individual weights and changes in the global error measure. The algorithm does not make any assumptions about transfer functions of individual neurons, and does not explicitly depend on the functional form of the error measure. Hence, it can be used in networks with arbitrary transfer functions and for minimizing a large class of error measures. The learning algorithm is the same for feedforward and recurrent networks. All the weights in a network are updated simultaneously, using only local computations. This allows complete parallelization of the algorithm. The algorithm is stochastic and it uses a “temperature” parameter in a manner similar to that in simulated annealing. A heuristic “annealing schedule” is presented that is effective in finding global minima of error surfaces. In this paper, we report extensive simulation studies illustrating these advantages and show that learning times are comparable to those for standard gradient descent methods. Feedforward networks trained with Alopex are used to solve the MONK's problems and symmetry problems. Recurrent networks trained with the same algorithm are used for solving temporal XOR problems. Scaling properties of the algorithm are demonstrated using encoder problems of different sizes and advantages of appropriate error measures are illustrated using a variety of problems.


Alopex: A Correlation-Based Learning Algorithm for
Feed-Forward and Recurrent Neural Networks
K. P. Unnikrishnan, and K. P. Venugopal

K. P. Unnikrishnan is with the Computer Science Department, GM Research Laboratories, Warren, MI 48090; and with the Artificial Intelligence Laboratory, University of Michigan, Ann Arbor, MI 48109.
K. P. Venugopal is with the Medical Image Processing Group, University of Pennsylvania, Philadelphia, PA 19104.

1. Introduction
Artificial neural networks are very useful because they can represent complex classification functions and can discover these representations using powerful learning algorithms. Multi-layer perceptrons using sigmoidal non-linearities at their computing nodes can represent large classes of functions (Hornik, Stinchcombe, and White, 1989). In general, an optimum set of weights in these networks is learned by minimizing an error functional. But many of these functions (that give error as a function of weights) contain local minima, making the task of learning in these networks difficult (Hinton, 1989). This problem can be mitigated by (i) choosing appropriate transfer functions at individual neurons and appropriate error functionals for minimization and (ii) by using powerful learning algorithms.
Learning algorithms for neural networks can be categorized into two classes.[1] The popular back-propagation (BP) and other related algorithms calculate explicit gradients of the error with respect to the weights. These require detailed knowledge of the network architecture and involve calculating derivatives of transfer functions. This limits the original version of BP (Rumelhart, Hinton, and Williams, 1986) to feed-forward networks with neurons containing smooth, differentiable and non-saturating transfer functions. Some variations of this algorithm (Williams and Zipser, 1989, for example) have been used in networks with feedback; but these algorithms need non-local information, and are computationally expensive.
A general purpose learning algorithm, without these limitations, can be very useful for neural networks. Such an algorithm, ideally, should use only locally available information; impose no restrictions on the network architecture, error measures or transfer functions of individual neurons; and should be able to find global minima of error surfaces. It should also allow simultaneous updating of weights and hence reduce the overhead on hardware implementations.

Learning algorithms that do not require explicit gradient calculations may offer a better choice in this respect. These algorithms usually estimate the gradient of the error by local measurements. One method is to systematically change the parameters (weights) to be optimized and measure the effect of these changes (perturbations) on the error to be minimized. Parameter perturbation methods have a long history in adaptive control, where they were commonly known as the "MIT rule" (Draper and Li, 1951; Whitaker, 1959). Many others have recently used perturbations of single weights (Jabri and Flower, 1991), multiple weights (Dembo and Kailath, 1990; Alspector et al., 1993), or single neurons (Widrow and Lehr, 1990).

[1] Methods that are not explicitly based on gradient concepts have also been used for training layered networks (Minsky, 1954; Rosenblatt, 1962). These methods are limited in their performance and applicability and hence are not considered in our discussions.
A set of closely related techniques in machine learning are Learning Automata (Narendra and Thathachar, 1989) and Reinforcement Learning (Barto, Sutton, and Brouwer, 1981). In this paper we present an algorithm called 'Alopex' that is in this general category.[2] Alopex has one of the longest histories among such methods, ever since its introduction for mapping visual receptive fields (Harth and Tzanakou, 1974). It has subsequently been modified and used in models of visual perception (Harth and Unnikrishnan, 1985; Harth, Unnikrishnan, and Pandya, 1987; Harth, Pandya, and Unnikrishnan, 1990), visual development (Nine and Unnikrishnan, 1993; Unnikrishnan and Nine, 1993), for solving combinatorial optimization problems (Harth, Pandya, and Unnikrishnan, 1986), for pattern classification (Venugopal, Pandya, and Sudhakar, 1991 & 1992b), and for control (Venugopal, Pandya, and Sudhakar, 1992b). In this paper we present a very brief description of the algorithm and show results of computer simulations where it has been used for training feed-forward and recurrent networks. Detailed theoretical analysis of the algorithm and comparisons with other closely related algorithms such as reinforcement learning will appear elsewhere (Sastry and Unnikrishnan, 1993).

2. The Alopex Algorithm
Learning in a neural network is treated as an optimization problem. The objective is to minimize an error measure, E, with respect to the network weights w, for a given set of training samples.[3] The algorithm can be described as follows: consider a neuron i with an interconnection strength w_ij from neuron j. During the n-th iteration, the weight w_ij is updated according to the rule,[4]

[2] Alopex is an acronym for Algorithm for pattern extraction, and refers to the alopecic performance of the algorithm.
[3] Earlier versions of this have been presented at conferences (Unnikrishnan and Pandit, 1991; Unnikrishnan and Venugopal, 1992).
[4] For the first two iterations, weights are chosen randomly.

    w_ij(n) = w_ij(n-1) + δ_ij(n)                                        (1)

where δ_ij(n) is a small positive or negative step of size δ with the following probabilities:[5]

    δ_ij(n) = -δ   with probability p_ij(n)
            = +δ   with probability 1 - p_ij(n)                          (2)

The probability p_ij(n) for a negative step is given by the Boltzmann distribution:

    p_ij(n) = 1 / (1 + e^(-C_ij(n)/T(n)))                                (3)

where C_ij(n) is given by the correlation:

    C_ij(n) = Δw_ij(n) · ΔE(n)                                           (4)

and T(n) is a positive 'temperature'. Δw_ij(n) and ΔE(n) are the changes in the weight w_ij and the error measure E over the previous two iterations (Eqs. 5a and 5b):

    Δw_ij(n) = w_ij(n-1) - w_ij(n-2)                                     (5a)
    ΔE(n)    = E(n-1) - E(n-2)                                           (5b)

The 'temperature' T in Eq. (3) is updated every N iterations using the following 'annealing schedule':

    T(n) = (1/(M·N)) Σ_i Σ_j Σ_{n'=n-N}^{n-1} |C_ij(n')|   if n is a multiple of N   (6a)
    T(n) = T(n-1)                                          otherwise                 (6b)

M in the above equation is the total number of connections. Since the magnitude of Δw is the same for all weights, Eq. (6a) reduces to:

    T(n) = (δ/N) Σ_{n'=n-N}^{n-1} |ΔE(n')|                               (6c)
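As an illustrative sketch (not the authors' code), a single Alopex iteration following Eqs. (1)-(5) can be written as a few array operations. The error values E and E_prev are assumed to be supplied by the caller from whatever error measure is being minimized:

```python
import numpy as np

def alopex_step(w, w_prev, E, E_prev, delta, T, rng):
    """One Alopex iteration over all weights (Eqs. 1-5).

    w, w_prev : weight arrays from the previous two iterations
    E, E_prev : error measures from the previous two iterations
    delta, T  : step size and current 'temperature'
    Returns the updated weight array.
    """
    dw = w - w_prev                                   # Eq. (5a)
    dE = E - E_prev                                   # Eq. (5b)
    C = dw * dE                                       # Eq. (4): per-weight correlation
    p = 1.0 / (1.0 + np.exp(-C / T))                  # Eq. (3): prob. of a negative step
    # Eq. (2): draw -delta with probability p, +delta otherwise
    steps = np.where(rng.random(w.shape) < p, -delta, +delta)
    return w + steps                                  # Eq. (1)
```

Note that all weights are updated simultaneously, using only the locally available correlation C_ij and the globally broadcast change in error ΔE.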
2.1 Behavior of the Algorithm
Equations (1) - (5) can be rewritten to make the essential computations clearer.

[5] In simulations, this is done by generating a uniform random number between 0 and 1 and comparing it with p_ij(n).

    w_ij(n) = w_ij(n-1) + δ · x_ij(n-1)                                  (7)

δ is the step size and x_ij is either +1 or -1 (randomly assigned for the first two iterations).

    x_ij(n-1) = +x_ij(n-2)   with probability 1 - p(n)
              = -x_ij(n-2)   with probability p(n)                       (8)

where

    p(n) = 1 / (1 + e^(-δ·ΔE(n)/T(n)))                                   (9)

From Eqs. (7) - (9) we can see that if ΔE is negative, the probability of moving each weight in the same direction is greater than 0.5. If ΔE is positive, the probability of moving each weight in the opposite direction is greater than 0.5. In other words, the algorithm favors weight changes that will decrease the error E.
The temperature T in Eq. (3) determines the stochasticity of the algorithm. With a non-zero value for T, the algorithm takes biased random walks in the weight space towards decreasing E. If T is too large, the probabilities are close to 0.5 and the algorithm does not settle into the global minimum of E. If T is too small, it gets trapped in local minima of E. Hence the value of T for each iteration is chosen very carefully. We have successfully used the heuristic 'annealing schedule' shown in Eq. (6). We start the simulations with a large T, and at regular intervals, set it equal to the average absolute value of the correlation C_ij over that interval. This method automatically reduces T when the correlations are small (which is likely to be near minima of error surfaces) and increases T in regions of large correlations. The correlations need to be averaged over a sufficiently large number of iterations so that the annealing does not freeze the algorithm at local minima. Towards the end, the step size δ can also be reduced for precise convergence.
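The loop described above, combining Eqs. (7)-(9) with the annealing schedule of Eq. (6c), can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy quadratic error, the parameter values, and the initialization details are our own choices.

```python
import numpy as np

def train_alopex(err, w0, delta=0.01, T0=1000.0, N=10, n_iters=2000, seed=0):
    """Minimize err(w) with the Alopex rule (Eqs. 7-9) and annealing (Eq. 6c).

    err : scalar error function of the weight vector (no gradients needed)
    """
    rng = np.random.default_rng(seed)
    w = w0.astype(float).copy()
    x = rng.choice([-1.0, 1.0], size=w.shape)   # random directions to start
    E_prev = err(w)
    w = w + delta * x                           # first step, taken blindly
    E = err(w)
    T = T0                                      # start with a large temperature
    abs_dE = []                                 # |ΔE| history for Eq. (6c)
    for n in range(n_iters):
        dE = E - E_prev                         # change in error, Eq. (5b)
        # Eq. (9): probability of reversing direction (clipped to avoid overflow)
        p = 1.0 / (1.0 + np.exp(np.clip(-delta * dE / T, -500.0, 500.0)))
        flip = rng.random(w.shape) < p          # Eq. (8)
        x = np.where(flip, -x, x)
        w = w + delta * x                       # Eq. (7): all weights move at once
        E_prev, E = E, err(w)
        abs_dE.append(abs(dE))
        if (n + 1) % N == 0:                    # Eq. (6c): T = (δ/N) Σ |ΔE(n')|
            T = max(delta / N * sum(abs_dE[-N:]), 1e-12)
    return w, E
```

For example, `train_alopex(lambda v: float(np.sum(v**2)), np.full(3, 0.5))` performs a biased random walk that settles near the minimum of the quadratic, with T shrinking automatically as the correlations get small.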
The use of a controllable 'temperature' and the use of probabilistic parameter updates are similar to the method of simulated annealing (Kirkpatrick, Gelatt, and Vecchi, 1983). But Alopex differs from simulated annealing in three important aspects: (i) the correlation (ΔE · Δw) is used instead of the change in error ΔE for weight updates; (ii) all weight changes are accepted at every iteration; and (iii) during an iteration, all weights are updated simultaneously.

References

Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983). Optimization by simulated annealing.
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by error propagation.