Brill Academic Publishers
P.O. Box 9000, 2300 PA Leiden, The Netherlands
Lecture Series on Computer and Computational Sciences
Volume 1, 2005, pp. 1-3

Parameter Optimization Algorithm with Improved Convergence Properties for Adaptive Learning

G.D. Magoulas†1, M.N. Vrahatis‡2

† School of Computer Science and Information Systems, Birkbeck University of London, London WC1E 7HX, UK
‡ Computational Intelligence Laboratory, Department of Mathematics, University of Patras, GR-26110 Patras, Greece

Abstract: The error in an artificial neural network is a function of adaptive parameters (weights and biases) that needs to be minimized. Research on adaptive learning usually focuses on gradient algorithms that employ problem-dependent heuristic learning parameters. This usually results in a trade-off between the convergence speed and the stability of the learning algorithm. The paper investigates gradient-based adaptive algorithms and discusses their limitations. It then describes a new algorithm that does not need user-defined learning parameters. The convergence properties of this method are discussed from both a theoretical and a practical perspective. The algorithm has been implemented and tested on real-life applications, exhibiting improved stability and high performance.

Keywords: Feedforward neural networks, Supervised training, Back-propagation algorithm,

Heuristic learning parameters, Non-linear Jacobi process, Globally convergent algorithms,

Local convergence analysis.

Mathematics Subject Classiﬁcation: 65K10, 49D10, 68T05, 68G05

1 Introduction

Let us first define the notation we will use in the paper. We use a unified notation for the weights. Thus, for a feedforward neural network (FNN) with a total of n weights, IR^n is the n-dimensional real space of column weight vectors w with components w_1, w_2, . . . , w_n, and w^* is the optimal weight vector with components w^*_1, w^*_2, . . . , w^*_n; E is the batch error measure defined as the sum-of-squared-differences error function over the entire training set; ∂_i E(w) denotes the partial derivative of E(w) with respect to the ith variable w_i; g(w) = (g_1(w), . . . , g_n(w)) defines the gradient ∇E(w) of the sum-of-squared-differences error function E at w, which is computed by applying the chain rule on the layers of an FNN (see [48]), while H = [H_ij] defines the Hessian ∇²E(w) of E at w. Also, throughout this paper diag{e_1, . . . , e_n} defines the n × n diagonal matrix with elements e_1, . . . , e_n, Θ_n = (0, 0, . . . , 0) denotes the origin of IR^n, and ρ(A) is the spectral radius of the matrix A.

The Back-Propagation (BP) algorithm [48] is widely recognized as a powerful tool for training FNNs. It minimizes the error function using the Steepest Descent (SD) method [15] with a constant learning rate η:

    w^{k+1} = w^k − η g(w^k).    (1)

1 E-mail: gmagoulas@dcs.bbk.ac.uk
2 E-mail: vrahatis@math.upatras.gr


The SD method requires the assumption that E is twice continuously differentiable on an open neighborhood S(w^0), where S(w^0) = {w : E(w) ≤ E(w^0)} is bounded, for some initial weight vector w^0. It also requires that η is chosen to satisfy the relation sup ‖H(w)‖ ≤ η^{−1} < ∞ in the level set S(w^0) [16, 17]. The approach adopted in practice is to apply a small constant learning rate value (0 < η < 1) in order to secure the convergence of the BP training algorithm and avoid oscillations in a direction where the error function is steep. However, this approach considerably slows down training since, in general, a small learning rate may not be appropriate for all portions of the error surface. Furthermore, it affects the convergence properties of training algorithms (see [25, 29]). Nevertheless, there are theoretical results that guarantee convergence when the learning rate is constant. This happens when the learning rate is proportional to the inverse of the Lipschitz constant, which, in practice, is not easily available [2, 31].
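The constant-learning-rate scheme of Eq. (1) can be sketched on a toy problem; the quadratic error and all names below are our own illustration, not the paper's FNN setting:

```python
# Illustrative sketch of the SD update of Eq. (1) on a toy quadratic
# error E(w) = 0.5*(4*w1^2 + w2^2); the gradient plays the role of g(w).

def grad(w):
    # Gradient of the toy error (analogue of g(w) in the paper).
    return [4.0 * w[0], 1.0 * w[1]]

def sd_train(w0, eta, epochs):
    # w^{k+1} = w^k - eta * g(w^k), Eq. (1), with a constant eta.
    w = list(w0)
    for _ in range(epochs):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

w = sd_train([1.0, 1.0], eta=0.1, epochs=200)
print(w)  # both components shrink toward 0
```

Note how a single η must serve both directions: it contracts the steep direction by a factor 0.6 per epoch but the flat one only by 0.9, which is exactly the trade-off discussed above.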

Attempts at adaptive learning are usually based on the following approaches: (i) start with a small learning rate and increase it exponentially if successive iterations reduce the error, or rapidly decrease it if a significant error increase occurs [4, 56]; (ii) start with a small learning rate and increase it if successive iterations keep the gradient direction fairly constant, or rapidly decrease it if the direction of the gradient varies greatly at each iteration [8]; and (iii) assign each weight an individual learning rate, which increases if the successive changes in the weights are in the same direction and decreases otherwise. The well-known delta-bar-delta method [18] and Silva and Almeida's method [49] follow this approach. Another method, named quickprop, has been presented in [11]. Quickprop is based on independent secant steps in the direction of each weight. Riedmiller and Braun in 1993 proposed the Rprop algorithm, which updates the weights using the learning rate and the sign of the partial derivative of the error function with respect to each weight. Note that all these adaptation methods employ heuristic learning parameters in an attempt to secure convergence of the BP algorithm to a minimizer of E and avoid oscillations.
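Approach (iii) can be sketched as follows; the increase/decrease factors (1.2 and 0.5) and the step bounds are heuristic choices of our own, in the spirit of Rprop, not values prescribed by the methods cited above:

```python
# Sketch of approach (iii): a per-weight step size that grows while the
# sign of the partial derivative stays the same and shrinks on a sign
# change (Rprop-style; the factors here are illustrative heuristics).

def adapt_steps(steps, grads, prev_grads, up=1.2, down=0.5,
                step_min=1e-6, step_max=50.0):
    new_steps = []
    for s, g, gp in zip(steps, grads, prev_grads):
        if g * gp > 0:          # same direction: accelerate
            s = min(s * up, step_max)
        elif g * gp < 0:        # sign change (oscillation): back off
            s = max(s * down, step_min)
        new_steps.append(s)
    return new_steps

steps = adapt_steps([0.1, 0.1, 0.1], [1.0, -2.0, 0.0], [0.5, 3.0, 1.0])
print(steps)  # first step grows, second shrinks, third is unchanged
```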

A diﬀerent approach is to exploit the local shape of the error surface as described by the direction

cosines or the Lipschitz constant. In the ﬁrst case the learning rate is a weighted average of the

direction cosines of the weight changes at the current and several previous successive iterations [23],

while in the second case the learning rate is an approximation of the Lipschitz constant [31].

A variety of approaches adapted from numerical analysis have also been applied in an attempt to use not only the gradient of the error function but also the second derivative in constructing efficient supervised training algorithms to accelerate the learning process. However, training algorithms that apply nonlinear conjugate gradient methods, such as the Fletcher-Reeves or the Polak-Ribiere methods [34, 53], or variable metric methods, such as the Broyden-Fletcher-Goldfarb-Shanno method [5, 58], or even Newton's method [6, 39], are computationally intensive for FNNs with several hundred weights: derivative calculations as well as subminimization procedures (for the case of nonlinear conjugate gradient methods) and approximations of various matrices (for the case of variable metric and quasi-Newton methods) are required. Furthermore, it is not certain that the extra computational cost speeds up the minimization process for nonconvex functions when far from a minimizer, as is usually the case with the neural network training problem [5, 9, 35]. Thus, the development of improved gradient-based BP algorithms receives significant attention from neural network researchers and practitioners.

The training algorithm introduced in this paper does not use a user-defined initial learning rate; instead, it self-determines the search direction and the learning rates at each epoch. It provides stable learning and robustness to oscillations. The paper is organized as follows. In Section 2 the class of adaptive learning algorithms that employ a different learning rate for each weight is presented, and the advantages as well as the disadvantages of these algorithms are discussed. The new algorithm is introduced in Section 3 and its convergence properties are investigated in Section 4. Experimental results are presented in Section 5 to evaluate and compare the performance of the new algorithm with several other BP methods. The paper ends, in Section 6, with concluding remarks.


2 Adaptive Learning and the Error Surface

The eigensystem of the Hessian matrix can be used to determine the shape of the error function E in the neighborhood of a local minimizer [1, 18]. Thus, studying the sensitivity of the minimizer to small changes by approximating the error function with a quadratic one, it is known that, in a sufficiently small neighborhood of w^*, the directions of the principal axes of the corresponding elliptical contours (n-dimensional ellipsoids) will be given by the eigenvectors of H(w^*), while the lengths of the axes will be inversely proportional to the square roots of the corresponding eigenvalues. Furthermore, a variation along the eigenvector corresponding to the maximum eigenvalue will cause the largest change in E, while the eigenvector corresponding to the minimum eigenvalue gives the least sensitive direction. Therefore, a value for the learning rate which yields a large variation along the eigenvector corresponding to the maximum eigenvalue may result in oscillations. On the other hand, a value for the learning rate which yields a small variation along the eigenvector corresponding to the minimum eigenvalue may result in small steps along this direction and thus in a slight reduction of the error function. In general, a learning rate appropriate for any one weight direction is not necessarily appropriate for other directions.
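The sensitivity argument can be checked numerically on a toy quadratic whose Hessian is diagonal, so its eigenvectors are the coordinate axes; the eigenvalues below are our own illustrative choice:

```python
# Toy quadratic E(w) = 0.5*(lam1*w1^2 + lam2*w2^2) with Hessian
# diag{lam1, lam2}: a same-size perturbation of the minimizer along the
# max-eigenvalue eigenvector changes E more than along the min one.

lam1, lam2 = 9.0, 1.0            # max and min Hessian eigenvalues

def E(w):
    return 0.5 * (lam1 * w[0]**2 + lam2 * w[1]**2)

w_star = [0.0, 0.0]              # the minimizer
delta = 0.1                      # size of the perturbation
change_max = E([delta, 0.0]) - E(w_star)  # along max-eigenvalue axis
change_min = E([0.0, delta]) - E(w_star)  # along min-eigenvalue axis
print(change_max, change_min)    # the first change is larger
```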

Various adaptive learning algorithms with a diﬀerent learning rate for each weight have been

suggested in the literature [11, 18, 33, 41, 46, 49]. This approach allows us to ﬁnd the proper

learning rate that compensates for the small magnitude of the gradient in a ﬂat direction in order

to avoid slow convergence, and dampens a large weight change in a steep direction in order to avoid

oscillations. Moreover, it exploits the parallelism inherent in the evaluation of E(w) and g(w) by

the BP algorithm.

Following this approach, Eq. (1) is reformulated to the following scheme:

    w^{k+1} = w^k − diag{η^k_1, . . . , η^k_n} g(w^k).    (2)

The weight vector in Eq. (2) is not updated in the direction of the negative of the gradient; instead, an alternative adaptive search direction is obtained by taking into consideration the weight change, evaluated by multiplying the length of the search step, i.e. the value of the learning rate, along each weight direction by the partial derivative of E(w) with respect to the corresponding weight, i.e. −η_i ∂_i E(w). In other words, these algorithms try to decrease the error in each direction by searching for the local minimum with small weight steps. These steps are usually constrained by problem-dependent heuristic parameters in order to ensure subminimization of the error function in each weight direction.
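Since the matrix in Eq. (2) is diagonal, the update reduces to componentwise scaling; a minimal sketch (the numeric values are our own):

```python
# One epoch of the per-weight scheme of Eq. (2):
# w^{k+1} = w^k - diag{eta_1, ..., eta_n} g(w^k).
# Multiplying by the diagonal matrix is just componentwise scaling.

def update(w, grads, etas):
    return [wi - ei * gi for wi, ei, gi in zip(w, etas, grads)]

w = [1.0, -2.0, 0.5]
g = [0.2, -0.4, 0.1]
etas = [0.5, 0.1, 1.0]          # a different learning rate per weight
w_new = update(w, g, etas)
print(w_new)
```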

A well-known difficulty of this approach is that the use of inappropriate heuristic values for a weight direction misguides the resultant search direction. In such cases, the training algorithms with an adaptive learning rate for each weight cannot exploit the global information obtained by taking into consideration all the directions. This is the case for many well-known training algorithms that employ heuristic parameters for properly tuning the adaptive learning rates [11, 18, 33, 41, 46, 49], and no guarantee is provided that the weight updates will converge to a minimizer of E. In certain cases the aforementioned methods, although originally developed for batch training, can be used for on-line training by minimizing a pattern-based error measure.

3 A Theoretical Derivation of the Adaptive Learning Process

An adaptive learning rate algorithm (see Eq. (2)) seeks a minimum w^* of the error function and generates with every training epoch a discretized path in the n-dimensional weight space. The limiting value of this path, lim_{k→∞} w^k, corresponds to a stationary point of E(w). This path depends on the values of the learning rates chosen in each epoch. Appropriate learning rates help to avoid convergence to a saddle point or a maximum. In the framework of Eq. (2) the learning process can theoretically be interpreted as follows.


Starting from an arbitrary initial weight vector w^0 ∈ D (a specific domain of E), the training algorithm subminimizes, at the kth epoch, in parallel, the n one-dimensional functions:

    E(w^k_1, . . . , w^k_{i−1}, w_i, w^k_{i+1}, . . . , w^k_n).    (3)

First, each function is minimized along the direction i and the corresponding subminimizer ŵ_i is obtained. Obviously, for this ŵ_i,

    ∂_i E(w^k_1, . . . , w^k_{i−1}, ŵ_i, w^k_{i+1}, . . . , w^k_n) = 0.    (4)

This is a one-dimensional subminimization because all other components of the weight vector, except the ith, are kept constant. Then each weight is updated according to the relation:

    w^{k+1}_i = ŵ_i.    (5)

In order to be consistent with Eq. (2), only a single iteration of the one-dimensional method in each weight direction is proposed. It is worth noticing that the number of iterations of the subminimization method is related to the requested accuracy in obtaining the subminimizer approximations. Thus, significant computational effort is needed in order to find very accurate approximations of the subminimizer in each weight direction at each epoch. Moreover, the computational effort of the subminimization method increases for FNNs with several hundred weights. On the other hand, it is not certain that this large computational effort speeds up the minimization process for nonconvex functions when the algorithm is away from a minimizer w^* [37]. Thus, we propose to obtain ŵ_i by minimizing Eq. (3) with one iteration of a minimization method.

The problem of minimizing the error function E along the ith direction,

    min_{η_i ≥ 0} E(w + η_i e_i),    (6)

is equivalent to the minimization of the one-dimensional function

    φ_i(η) = E(w + η_i e_i).    (7)

Since we have n directions, we consider n one-dimensional functions φ_i(η). Note that, according to experimental work [61], these functions can be approximated for certain learning tasks with quadratic functions in the neighborhood of η ≥ 0. In general, we can also use the formulation φ(η) = E(w + ηd), where d is the search direction vector and φ′(η) = ∇E(w + ηd)^⊤ d.

In our case, we want at the kth epoch to find the learning rate η_i that minimizes φ_i(η) along the ith direction. Since E(w) ≥ 0, ∀w ∈ IR^n, any w^* such that E(w^*) = 0 minimizes E(w). Thus, by applying one iteration of the Newton method to the one-dimensional equation φ_i(η) = 0 we obtain:

    η^1_i = η^0_i − φ_i(η^0) / φ′_i(η^0).    (8)

But η^0_i = 0 and φ′_i(η^0) = g(w)^⊤ d_i. Since d_i = e_i, Eq. (8) is reformulated as

    η_i = − E(w^k) / ∂_i E(w^k).    (9)

This is done at the kth epoch, in parallel, for all weight directions to evaluate the corresponding learning rates. Then, Eq. (5) takes the form

    w^{k+1}_i = w^k_i − E(w^k) / ∂_i E(w^k).    (10)


Eq. (10) constitutes the weight update formula of the new BP training algorithm with an adaptive learning rate for each weight.

The iterative scheme (10) takes into consideration information from both the error function and the magnitude of the gradient components. When the gradient magnitude is small, the local shape of E in this direction is flat; otherwise it is steep. The value of the error function indicates how close to the global minimizer this local shape is. These pieces of information help the iterative scheme (10) to escape from flat regions with high error values, which are located far from a desired minimizer.
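The update of Eq. (10) can be sketched on a toy least-squares error; the error function, the starting point, and the guard against a vanishing partial derivative below are our own illustrative choices, not part of the paper's algorithm:

```python
# Sketch of the weight update of Eq. (10): each weight moves by
# -E(w)/(partial_i E(w)), evaluated in parallel over all directions.
# Toy error E(w) = w1^2 + w2^2; a small guard skips directions whose
# partial derivative is (numerically) zero.

def E(w):
    return w[0]**2 + w[1]**2

def grad(w):
    return [2.0 * w[0], 2.0 * w[1]]

def train(w0, epochs, eps=1e-12):
    w = list(w0)
    for _ in range(epochs):
        e, g = E(w), grad(w)
        # Eq. (10): w_i <- w_i - E(w)/partial_i E(w)
        w = [wi - e / gi if abs(gi) > eps else wi
             for wi, gi in zip(w, g)]
    return w

w = train([1.0, 0.0], epochs=30)
print(w)   # the first component halves every epoch toward 0
```

On this toy problem the step along the active direction is w_1 − w_1²/(2w_1) = w_1/2, so the iterates contract linearly, consistent with the R-factor analysis of Section 4.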

4 Convergence Analysis

First, we recall two concepts which will be used in our convergence analysis.

1) The Property A^π: Young [60] has discovered a class of matrices, described as having property A, that can be partitioned into block-tridiagonal form, possibly after a suitable permutation. In Young's original presentation, the elements of a matrix A = [a_ij] are partitioned into two groups. In general, any partitioning of an n-dimensional vector x = (x^(1), . . . , x^(m)) into block components x^(p) of dimensions n_p, p = 1, . . . , m (with Σ^m_{p=1} n_p = n), is uniquely determined by a partitioning π = {π_p}^m_{p=1} of the set of the first n integers, where π_p contains the integers s_p + 1, . . . , s_p + n_p, with s_p = Σ^{p−1}_{j=1} n_j. The same partitioning π also induces a partitioning of any n × n matrix A into block matrix components A_ij of dimensions n_i × n_j. Note that the matrices A_ii are square.

Definition 1 [3]: The matrix A has the property A^π if A can be permuted by P A P^⊤ into a form that can be partitioned into block-tridiagonal form, that is,

    P A P^⊤ = [ D_1     L_1^⊤                              O        ]
              [ L_1     D_2     L_2^⊤                               ]
              [          .       .        .                         ]
              [                 L_{r−2}   D_{r−1}   L_{r−1}^⊤       ]
              [ O                         L_{r−1}   D_r             ]

where the matrices D_i, i = 1, . . . , r, are nonsingular.
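A minimal concrete instance of this block-tridiagonal pattern, with r = 3 and all blocks of size 1 so that the D_i and L_i are scalars (the entries are our own illustrative values):

```python
# Build a small block-tridiagonal matrix as in Definition 1, with
# r = 3 diagonal blocks of size 1, and verify that every entry outside
# the three block diagonals vanishes.

D = [2.0, 3.0, 4.0]      # nonsingular diagonal blocks D_1, D_2, D_3
L = [1.0, -1.0]          # subdiagonal blocks L_1, L_2

n = len(D)
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    A[i][i] = D[i]
for i in range(n - 1):
    A[i + 1][i] = L[i]   # L_i below the diagonal
    A[i][i + 1] = L[i]   # L_i^T above the diagonal (scalar case)

def is_block_tridiagonal(M):
    # With size-1 blocks this reduces to ordinary tridiagonality.
    return all(M[i][j] == 0.0
               for i in range(len(M)) for j in range(len(M))
               if abs(i - j) > 1)

print(is_block_tridiagonal(A))
```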

2) The Root-convergence factor: It is useful for any iterative procedure to have a measure of the rate of its convergence. In our case, we are interested in how fast the weight updates of Eq. (10), denoted P, converge to w^*. A measure of the rate of convergence is obtained by taking appropriate roots of successive errors. To this end we use the following definition.

Definition 2 [37]: Let {w^k}^∞_{k=0} be any sequence that converges to w^*. Then the number

    R{w^k} = lim sup_{k→∞} ‖w^k − w^*‖^{1/k},    (11)

is the root-convergence factor, or R-factor, of the sequence of weights. If the iterative procedure P converges to w^* and C(P, w^*) is the set of all sequences generated by P which converge to w^*, then

    R(P, w^*) = sup{ R{w^k} : {w^k} ∈ C(P, w^*) },    (12)

is the R-factor of P at w^*.

Our convergence analysis consists of two parts. In the first part, results regarding the local convergence properties of the algorithm are presented. In the second part, appropriate conditions are proposed to guarantee the global convergence of the algorithm, i.e. the convergence to a minimizer from any starting point.