
Alopex: A correlation-based learning algorithm for feedforward and recurrent neural networks

K. P. Unnikrishnan, +1 more
01 May 1994, Vol. 6, Iss. 3, pp. 469-490
Abstract
We present a learning algorithm for neural networks, called Alopex. Instead of error gradient, Alopex uses local correlations between changes in individual weights and changes in the global error measure. The algorithm does not make any assumptions about transfer functions of individual neurons, and does not explicitly depend on the functional form of the error measure. Hence, it can be used in networks with arbitrary transfer functions and for minimizing a large class of error measures. The learning algorithm is the same for feedforward and recurrent networks. All the weights in a network are updated simultaneously, using only local computations. This allows complete parallelization of the algorithm. The algorithm is stochastic and it uses a “temperature” parameter in a manner similar to that in simulated annealing. A heuristic “annealing schedule” is presented that is effective in finding global minima of error surfaces. In this paper, we report extensive simulation studies illustrating these advantages and show that learning times are comparable to those for standard gradient descent methods. Feedforward networks trained with Alopex are used to solve the MONK's problems and symmetry problems. Recurrent networks trained with the same algorithm are used for solving temporal XOR problems. Scaling properties of the algorithm are demonstrated using encoder problems of different sizes and advantages of appropriate error measures are illustrated using a variety of problems.


Alopex: A Correlation-Based Learning Algorithm for
Feed-Forward and Recurrent Neural Networks
K. P. Unnikrishnan, and K. P. Venugopal

K. P. Unnikrishnan is with the Computer Science Department, GM Research Laboratories, Warren, MI 48090; and with the Artificial Intelligence Laboratory, University of Michigan, Ann Arbor, MI 48109.
K. P. Venugopal is with the Medical Image Processing Group, University of Pennsylvania, Philadelphia, PA 19104.

1. Introduction
Artificial neural networks are very useful because they can represent complex classification functions and can discover these representations using powerful learning algorithms. Multi-layer perceptrons using sigmoidal non-linearities at their computing nodes can represent large classes of functions (Hornik, Stinchcombe, and White, 1989). In general, an optimum set of weights in these networks is learned by minimizing an error functional. But many of these functions (that give error as a function of weights) contain local minima, making the task of learning in these networks difficult (Hinton, 1989). This problem can be mitigated by (i) choosing appropriate transfer functions at individual neurons and appropriate error functionals for minimization and (ii) by using powerful learning algorithms.
Learning algorithms for neural networks can be categorized into two classes.[1] The popular back-propagation (BP) and other related algorithms calculate explicit gradients of the error with respect to the weights. These require detailed knowledge of the network architecture and involve calculating derivatives of transfer functions. This limits the original version of BP (Rumelhart, Hinton, and Williams, 1986) to feed-forward networks with neurons containing smooth, differentiable and non-saturating transfer functions. Some variations of this algorithm (Williams and Zipser, 1989, for example) have been used in networks with feedback; but these algorithms need non-local information, and are computationally expensive.
A general purpose learning algorithm, without these limitations, can be very useful for neural networks. Such an algorithm, ideally, should use only locally available information; impose no restrictions on the network architecture, error measures or transfer functions of individual neurons; and should be able to find global minima of error surfaces. It should also allow simultaneous updating of weights and hence reduce the overhead on hardware implementations.

Learning algorithms that do not require explicit gradient calculations may offer a better choice in this respect. These algorithms usually estimate the gradient of the error by local measurements. One method is to systematically change the parameters (weights) to be optimized and measure the effect of these changes (perturbations) on the error to be minimized. Parameter perturbation methods have a long history in adaptive control, where they were commonly known as the "MIT rule" (Draper and Li, 1951; Whitaker, 1959). Many others have recently used perturbations of single weights (Jabri and Flower, 1991), multiple weights (Dembo and Kailath, 1990; Alspector et al., 1993), or single neurons (Widrow and Lehr, 1990).

[1] Methods that are not explicitly based on gradient concepts have also been used for training layered networks (Minsky, 1954; Rosenblatt, 1962). These methods are limited in their performance and applicability and hence are not considered in our discussions.
A set of closely related techniques in machine learning are Learning Automata (Narendra and Thathachar, 1989) and Reinforcement Learning (Barto, Sutton, and Brouwer, 1981). In this paper we present an algorithm called 'Alopex' that is in this general category.[2] Alopex has one of the longest histories among such methods, ever since its introduction for mapping visual receptive fields (Harth and Tzanakou, 1974). It has subsequently been modified and used in models of visual perception (Harth and Unnikrishnan, 1985; Harth, Unnikrishnan, and Pandya, 1987; Harth, Pandya, and Unnikrishnan, 1990), visual development (Nine and Unnikrishnan, 1993; Unnikrishnan and Nine, 1993), for solving combinatorial optimization problems (Harth, Pandya, and Unnikrishnan, 1986), for pattern classification (Venugopal, Pandya, and Sudhakar, 1991 & 1992b), and for control (Venugopal, Pandya, and Sudhakar, 1992b). In this paper we present a very brief description of the algorithm and show results of computer simulations where it has been used for training feed-forward and recurrent networks. Detailed theoretical analysis of the algorithm and comparisons with other closely related algorithms such as reinforcement learning will appear elsewhere (Sastry and Unnikrishnan, 1993).

2. The Alopex Algorithm
Learning in a neural network is treated as an optimization problem. The objective is to minimize an error measure, E, with respect to the network weights w, for a given set of training samples.[3] The algorithm can be described as follows: consider a neuron i with an interconnection strength w_ij from neuron j. During the n-th iteration, the weight w_ij is updated according to the rule,[4]

[2] Alopex is an acronym for Algorithm for pattern extraction, and refers to the alopecic performance of the algorithm.
[3] Earlier versions of this have been presented at conferences (Unnikrishnan and Pandit, 1991; Unnikrishnan and Venugopal, 1992).
[4] For the first two iterations, weights are chosen randomly.

    w_ij(n) = w_ij(n-1) + δ_ij(n)                                        (1)

where δ_ij(n) is a small positive or negative step of size δ with the following probabilities:[5]

    δ_ij(n) = -δ   with probability p_ij(n)
            = +δ   with probability 1 - p_ij(n)                          (2)

The probability p_ij(n) for a negative step is given by the Boltzmann distribution:

    p_ij(n) = 1 / (1 + e^(-C_ij(n)/T(n)))                                (3)

where C_ij(n) is given by the correlation:

    C_ij(n) = Δw_ij(n) · ΔE(n)                                           (4)

and T(n) is a positive 'temperature'. Δw_ij(n) and ΔE(n) are the changes in the weight w_ij and the error measure E over the previous two iterations (Eqs. 5a and 5b):

    Δw_ij(n) = w_ij(n-1) - w_ij(n-2)                                     (5a)
    ΔE(n)    = E(n-1) - E(n-2)                                           (5b)

The 'temperature' T in Eq. (3) is updated every N iterations using the following 'annealing schedule':

    T(n) = (1/(M·N)) Σ_i Σ_j Σ_{n'=n-N}^{n-1} |C_ij(n')|   if n is a multiple of N   (6a)
    T(n) = T(n-1)                                          otherwise                 (6b)

M in the above equation is the total number of connections. Since the magnitude of Δw is the same for all weights, Eq. (6a) reduces to:

    T(n) = (δ/N) Σ_{n'=n-N}^{n-1} |ΔE(n')|                               (6c)
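As an illustrative sketch (not the authors' code), a single Alopex iteration following Eqs. (1)-(5) can be written as a few array operations. The error values E and E_prev are assumed to be supplied by the caller from whatever error measure is being minimized:

```python
import numpy as np

def alopex_step(w, w_prev, E, E_prev, delta, T, rng):
    """One Alopex iteration over all weights (Eqs. 1-5).

    w, w_prev : weight arrays from the previous two iterations
    E, E_prev : error measures from the previous two iterations
    delta, T  : step size and current 'temperature'
    Returns the updated weight array.
    """
    dw = w - w_prev                                   # Eq. (5a)
    dE = E - E_prev                                   # Eq. (5b)
    C = dw * dE                                       # Eq. (4): per-weight correlation
    p = 1.0 / (1.0 + np.exp(-C / T))                  # Eq. (3): prob. of a negative step
    # Eq. (2): draw -delta with probability p, +delta otherwise
    steps = np.where(rng.random(w.shape) < p, -delta, +delta)
    return w + steps                                  # Eq. (1)
```

Note that all weights are updated simultaneously, using only the locally available correlation C_ij and the globally broadcast change in error ΔE.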
2.1 Behavior of the Algorithm
Equations (1) - (5) can be rewritten to make the essential computations clearer.

[5] In simulations, this is done by generating a uniform random number between 0 and 1 and comparing it with p_ij(n).

    w_ij(n) = w_ij(n-1) + δ · x_ij(n-1)                                  (7)

δ is the step size and x_ij is either +1 or -1 (randomly assigned for the first two iterations).

    x_ij(n-1) = +x_ij(n-2)   with probability 1 - p(n)
              = -x_ij(n-2)   with probability p(n)                       (8)

where

    p(n) = 1 / (1 + e^(-δ·ΔE(n)/T(n)))                                   (9)

From Eqs. (7) - (9) we can see that if ΔE is negative, the probability of moving each weight in the same direction is greater than 0.5. If ΔE is positive, the probability of moving each weight in the opposite direction is greater than 0.5. In other words, the algorithm favors weight changes that will decrease the error E.
The temperature T in Eq. (3) determines the stochasticity of the algorithm. With a non-zero value for T, the algorithm takes biased random walks in the weight space towards decreasing E. If T is too large, the probabilities are close to 0.5 and the algorithm does not settle into the global minimum of E. If T is too small, it gets trapped in local minima of E. Hence the value of T for each iteration is chosen very carefully. We have successfully used the heuristic 'annealing schedule' shown in Eq. (6). We start the simulations with a large T, and at regular intervals, set it equal to the average absolute value of the correlation C_ij over that interval. This method automatically reduces T when the correlations are small (which is likely to be near minima of error surfaces) and increases T in regions of large correlations. The correlations need to be averaged over a sufficiently large number of iterations so that the annealing does not freeze the algorithm at local minima. Towards the end, the step size δ can also be reduced for precise convergence.
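The loop described above, combining Eqs. (7)-(9) with the annealing schedule of Eq. (6c), can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy quadratic error, the parameter values, and the initialization details are our own choices.

```python
import numpy as np

def train_alopex(err, w0, delta=0.01, T0=1000.0, N=10, n_iters=2000, seed=0):
    """Minimize err(w) with the Alopex rule (Eqs. 7-9) and annealing (Eq. 6c).

    err : scalar error function of the weight vector (no gradients needed)
    """
    rng = np.random.default_rng(seed)
    w = w0.astype(float).copy()
    x = rng.choice([-1.0, 1.0], size=w.shape)   # random directions to start
    E_prev = err(w)
    w = w + delta * x                           # first step, taken blindly
    E = err(w)
    T = T0                                      # start with a large temperature
    abs_dE = []                                 # |ΔE| history for Eq. (6c)
    for n in range(n_iters):
        dE = E - E_prev                         # change in error, Eq. (5b)
        # Eq. (9): probability of reversing direction (clipped to avoid overflow)
        p = 1.0 / (1.0 + np.exp(np.clip(-delta * dE / T, -500.0, 500.0)))
        flip = rng.random(w.shape) < p          # Eq. (8)
        x = np.where(flip, -x, x)
        w = w + delta * x                       # Eq. (7): all weights move at once
        E_prev, E = E, err(w)
        abs_dE.append(abs(dE))
        if (n + 1) % N == 0:                    # Eq. (6c): T = (δ/N) Σ |ΔE(n')|
            T = max(delta / N * sum(abs_dE[-N:]), 1e-12)
    return w, E
```

For example, `train_alopex(lambda v: float(np.sum(v**2)), np.full(3, 0.5))` performs a biased random walk that settles near the minimum of the quadratic, with T shrinking automatically as the correlations get small.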
The use of a controllable 'temperature' and the use of probabilistic parameter updates are similar to the method of simulated annealing (Kirkpatrick, Gelatt, and Vecchi, 1983). But Alopex differs from simulated annealing in three important aspects: (i) the correlation (ΔE · Δw) is used instead of the change in error ΔE for weight updates; (ii) all weight changes are accepted at every iteration; and (iii) during an iteration, all weights are updated simultaneously.

References

Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983). Optimization by simulated annealing.
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by error propagation.