
Parameter Optimization Algorithm with Improved Convergence Properties for Adaptive Learning

About: The article was published on 2005-10-01 and is currently open access. It has received no citations to date. The article focuses on the topics: adaptive learning and convergence.

Summary (3 min read)

1 Introduction

  • Let us first define the notation the authors will use in the paper.
  • This happens when the learning rate is proportional to the inverse of the Lipschitz constant which, in practice, is not easily available [2, 31].
  • A variety of approaches adapted from numerical analysis have also been applied, in an attempt to use not only the gradient of the error function but also the second derivative in constructing efficient supervised training algorithms to accelerate the learning process.
  • Experimental results are presented in Section 5 to evaluate and compare the performance of the new algorithms with several other BP methods.
  • The paper ends, in Section 6, with concluding remarks.

2 Adaptive Learning and the Error Surface

  • The eigensystem of the Hessian matrix can be used to determine the shape of the error function E in the neighborhood of a local minimizer [1, 18].
  • On the other hand, a value for the learning rate which yields a small variation along the eigenvector corresponding to the minimum eigenvalue may result in small steps along this direction and thus, in a slight reduction of the error function.
  • Moreover, it exploits the parallelism inherent in the evaluation of E(w) and g(w) by the BP algorithm.
  • These steps are usually constrained by problem–dependent heuristic parameters, in order to ensure subminimization of the error function in each weight direction.
  • This is the case of many well known training algorithms that employ heuristic parameters for properly tuning the adaptive learning rates [11, 18, 33, 41, 46, 49] and no guarantee is provided that the weight updates will converge to a minimizer of E.

3 A Theoretical Derivation of the Adaptive Learning Process

  • An adaptive learning rate algorithm (see Eq. (2)) seeks a minimum w∗ of the error function and generates with every training epoch a discretized path in the n–dimensional weight space.
  • Then each weight is updated according to the relation w_i^{k+1} = ŵ_i.
  • Thus, significant computational effort is needed in order to find very accurate approximations of the subminimizer in each weight direction at each epoch.
  • Since E(w) ≥ 0 for all w ∈ IR^n, a point w∗ with E(w∗) = 0 minimizes E(w); Eq. (9) is evaluated at the kth epoch in parallel, for all weight directions, to obtain the corresponding learning rates.
  • The iterative scheme (10) takes into consideration information from both the error function and the magnitude of the gradient components.

4 Convergence Analysis

  • First, the authors recall two concepts which will be used in their convergence analysis.
  • Young [60] has discovered a class of matrices, described as having property A, that can be partitioned into block–tridiagonal form, possibly after a suitable permutation; the block–partitioned generalization is known as the Property Aπ.
  • To this end the authors use the following definition.
  • In the first part, results regarding the local convergence properties of the algorithm are presented.

4.1 Local Convergence

  • In the first part, which concerns the local convergence of the algorithm, the objective is to show that there is a neighborhood of a minimizer of the error function for which convergence to the minimizer can be guaranteed.
  • Now, consider the decomposition of H(w∗) into its diagonal, strictly lower–triangular and strictly upper–triangular parts, H(w∗) = D(w∗) − L(w∗) − L^⊤(w∗) (a small sketch of this splitting follows this list).
  • Therefore the authors require that the iterative scheme (10) generates weight iterates that achieve a sufficient reduction in the error function at each epoch.
  • This is due to the fact that training starts far from a local minimum of the error function, and exact minimization steps along the search direction do not usually help, because of the nonlinearity of the error function.
  • These issues are investigated below in the framework of global convergence of the algorithm.
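The sketch below illustrates the splitting mentioned above on a small made-up symmetric matrix standing in for H(w∗); the matrix values are assumptions chosen only for the example.

```python
import numpy as np

# Example symmetric matrix standing in for H(w*) (values are illustrative only)
H = np.array([[ 4.0, -1.0,  0.0],
              [-1.0,  4.0, -1.0],
              [ 0.0, -1.0,  4.0]])

D = np.diag(np.diag(H))   # diagonal part D(w*)
L = -np.tril(H, k=-1)     # strictly lower-triangular part, with the sign convention H = D - L - L^T
assert np.allclose(H, D - L - L.T)
```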

4.2 Globally Convergent Algorithms

  • In order to ensure global convergence of the adaptive algorithm, i.e. convergence to a local minimizer of the error function from any starting point, the following assumptions are needed [10, 24]: a).
  • This has the effect that each ηi is decreased by the largest number in the sequence {q^{-m}}_{m=1}^{∞}, so that Condition (17) is satisfied (see the sketch after this list).
  • Thus, some training problems respond well to one or two reductions in the learning rates by modest amounts (such as 1/2), while others require many such reductions but might respond well to a more aggressive learning rate reduction (for example by factors of 1/10, or even 1/20).
  • When seeking to satisfy Condition (17) it is important to ensure that the learning rates are not reduced unnecessarily, to the point that Condition (18) is no longer satisfied.
  • Related techniques have been proposed by Dennis and Schnabel [10] and Battiti [4].
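The sketch below illustrates the reduction idea described in these bullets: each learning rate is repeatedly multiplied by 1/q until a sufficient-reduction test accepts the trial point. Since Conditions (17) and (18) are not reproduced in this extract, the Armijo-style acceptance test, the factor q and the tolerance sigma are assumptions made for the illustration, not the paper's exact conditions.

```python
import numpy as np

def reduce_rates_until_acceptable(w, d, eta, error_fn, grad_fn,
                                  q=2.0, sigma=1e-4, max_reductions=30):
    """Backtrack the learning rates by factors q^{-m}, m = 1, 2, ...,
    until the trial point gives a sufficient error reduction (Armijo-style test)."""
    E0, g = error_fn(w), grad_fn(w)
    for _ in range(max_reductions):
        w_trial = w + eta * d
        if error_fn(w_trial) <= E0 + sigma * np.dot(g, eta * d):   # sufficient reduction achieved
            return w_trial, eta
        eta = eta / q                                              # decrease the rates by 1/q and retry
    return w, eta
```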

5 Applications

  • The algorithms compared are: Back-propagation with momentum, named BPM; Adaptive Back-propagation with adaptive momentum (ABP), proposed by Vogl [56]; Back-propagation with an adaptive learning rate for each weight, proposed by Silva and Almeida [49], named SA; and Back-propagation with a self–determined learning rate for each weight (BPS), the algorithm proposed by the authors.
  • A well known initialization heuristic for FNNs is to select the weights with uniform probability from an interval (wmin, wmax), where usually wmin = −wmax (a minimal sketch follows this list).
  • If the initial weights are very small, the backpropagated error is so small that practically no change takes place for some weights and more iterations are necessary to decrease the error [47, 48].
  • The heuristic parameters of all the other methods have then been tuned.
  • To be more specific, various different values, up to 2, have been tested for the learning rate increment factor, while different values between 0.1 and 0.9 have been tried for the learning rate decrement factor.
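A minimal sketch of the interval-based weight initialization heuristic mentioned above; the bound w_max = 0.5, the seed and the layer sizes are illustrative assumptions.

```python
import numpy as np

def init_weights(n_in, n_out, w_max=0.5, seed=0):
    """Draw weights uniformly from (w_min, w_max), with w_min = -w_max."""
    rng = np.random.default_rng(seed)
    return rng.uniform(-w_max, w_max, size=(n_out, n_in))

W1 = init_weights(16, 15)   # e.g. input-to-hidden weights; sizes chosen only for illustration
```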

5.1 Texture Classification Problem

  • The first experiment is a texture classification problem.
  • In the co–occurrence method, the relative frequencies of gray–level pairs of pixels at certain relative displacements are computed and stored in a matrix (a small sketch follows this list).
  • The results of Table 2 suggest that BPS significantly outperforms BP and BPM in the number of gradient and error function evaluations as well as in the percentage of successful simulations.
  • SA is also faster than BPS and needs only three heuristic parameters to be tuned, but it has a smaller percentage of success than BPS.
  • The successfully trained FNNs are tested for their generalization capability using patterns from 20 subimages of the same size randomly selected from each image.
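A minimal sketch of the co-occurrence idea for one displacement; the tiny 4-level image and the (0, 1) displacement are made-up inputs, and no normalization or symmetrization choices from the paper are implied.

```python
import numpy as np

def cooccurrence(img, dr, dc, levels):
    """Count relative frequencies of gray-level pairs at displacement (dr, dc)."""
    M = np.zeros((levels, levels))
    rows, cols = img.shape
    for r in range(rows - dr):
        for c in range(cols - dc):
            M[img[r, c], img[r + dr, c + dc]] += 1
    return M / M.sum()

img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])
print(cooccurrence(img, 0, 1, levels=4))   # horizontally adjacent pixel pairs
```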

5.2 Vowel Spotting Problem

  • In the second experiment a 15–15–1 FNN (240 weights and 16 biases), based on neurons of hyperbolic tangent activations, is used for vowel spotting.
  • After applying a Hamming window, each frame is analyzed using the Perceptual Linear Predictive (PLP) speech analysis technique to obtain the characteristic features of the signal (a short framing-and-windowing sketch follows this list).
  • A mistaken decision regarding a non-vowel will produce unpredictable errors in the speaker classification module of the system, which uses the response of the FNN and is trained only with vowels [12, 13].
  • The results of Table 3 show that BPS compares favourably on the number of gradient evaluations to BP, BPM and ABP.
  • With regard to generalization, the adaptive methods provide comparable performance which, on average, improves the error rate percentage achieved by BP–trained FNNs by 2%.
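A minimal sketch of the framing-and-windowing step mentioned above (the PLP analysis itself is not shown); the frame length, hop size and random signal are illustrative assumptions.

```python
import numpy as np

def hamming_frames(signal, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames and apply a Hamming window to each."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])

frames = hamming_frames(np.random.randn(4000))   # each row is one windowed frame
```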

6 Conclusions

  • The paper analyzed adaptive learning with a different learning rate for each weight as a nonlinear Jacobi process.
  • A model for the analysis of the local and global convergence of the algorithm has been proposed.
  • However, this result is not necessarily related to the convergence speed when the algorithm is away from the minimizer.
  • The algorithm compares satisfactorily with other popular training algorithms without using highly problem–dependent heuristic parameters.
  • Its performance in the two reported experiments is promising.


Brill Academic Publishers, P.O. Box 9000, 2300 PA Leiden, The Netherlands
Lecture Series on Computer and Computational Sciences, Volume 1, 2005, pp. 1-3
Parameter Optimization Algorithm with Improved Convergence Properties for Adaptive Learning

G.D. Magoulas^1, M.N. Vrahatis^2

School of Computer Science and Information Systems, Birkbeck University of London, London WC1E 7HX, UK
Computational Intelligence Laboratory, Department of Mathematics, University of Patras, GR-26110 Patras, Greece
Abstract: The error in an artificial neural network is a function of adaptive parameters (weights and biases) that needs to be minimized. Research on adaptive learning usually focuses on gradient algorithms that employ problem–dependent heuristic learning parameters. This fact usually results in a trade–off between the convergence speed and the stability of the learning algorithm. The paper investigates gradient–based adaptive algorithms and discusses their limitations. It then describes a new algorithm that does not need user–defined learning parameters. The convergence properties of this method are discussed from both theoretical and practical perspective. The algorithm has been implemented and tested on real life applications exhibiting improved stability and high performance.
Keywords: Feedforward neural networks, Supervised training, Back-propagation algorithm,
Heuristic learning parameters, Non-linear Jacobi process, Globally convergent algorithms,
Local convergence analysis.
Mathematics Subject Classification: 65K10, 49D10, 68T05, 68G05
1 Introduction
Let us first define the notation we will use in the paper. We use a unified notation for the weights. Thus, for a feedforward neural network (FNN) with a total of n weights, IR^n is the n–dimensional real space of column weight vectors w with components w_1, w_2, \ldots, w_n, and w^* is the optimal weight vector with components w^*_1, w^*_2, \ldots, w^*_n; E is the batch error measure defined as the sum–of–squared–differences error function over the entire training set; \partial_i E(w) denotes the partial derivative of E(w) with respect to the ith variable w_i; g(w) = (g_1(w), \ldots, g_n(w)) defines the gradient \nabla E(w) of the sum-of-squared-differences error function E at w, which is computed by applying the chain rule on the layers of an FNN (see [48]), while H = [H_{ij}] defines the Hessian \nabla^2 E(w) of E at w. Also, throughout this paper \operatorname{diag}\{e_1, \ldots, e_n\} defines the n \times n diagonal matrix with elements e_1, \ldots, e_n, \Theta^n = (0, 0, \ldots, 0) denotes the origin of IR^n, and \rho(A) is the spectral radius of matrix A.
The Back-Propagation (BP) algorithm [48] is widely recognized as a powerful tool for training FNNs. It minimizes the error function using the Steepest Descent (SD) method [15] with constant learning rate η:

w^{k+1} = w^k - \eta\, g(w^k).   (1)

^1 E-mail: gmagoulas@dcs.bbk.ac.uk
^2 E-mail: vrahatis@math.upatras.gr

The SD method requires the assumption that E is twice continuously differentiable on an open neighborhood S(w^0), where S(w^0) = \{w : E(w) \le E(w^0)\} is bounded, for some initial weight vector w^0. It also requires that η is chosen to satisfy the relation \sup \|H(w)\| \le \eta^{-1} < \infty in the level set S(w^0) [16, 17]. The approach adopted in practice is to apply a small constant learning rate value (0 < η < 1) in order to secure the convergence of the BP training algorithm and avoid oscillations in a direction where the error function is steep. However, this approach considerably slows down training since, in general, a small learning rate may not be appropriate for all the portions of the error surface. Furthermore, it affects the convergence properties of training algorithms (see [25, 29]). Nevertheless, there are theoretical results that guarantee convergence when the learning rate is constant. This happens when the learning rate is proportional to the inverse of the Lipschitz constant which, in practice, is not easily available [2, 31].
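To make the constant-learning-rate scheme of Eq. (1) concrete, here is a minimal sketch of batch steepest descent; the quadratic error, the learning rate value and the epoch count are illustrative assumptions rather than settings from the paper.

```python
import numpy as np

def steepest_descent(grad, w0, eta=0.1, epochs=100):
    """Constant-learning-rate steepest descent, as in Eq. (1):
    w^{k+1} = w^k - eta * g(w^k)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(epochs):
        w = w - eta * grad(w)
    return w

# Illustrative use on a simple quadratic error E(w) = 0.5 * w^T A w
A = np.array([[3.0, 0.0], [0.0, 1.0]])
grad = lambda w: A @ w
print(steepest_descent(grad, w0=[1.0, 1.0], eta=0.1))
```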
Attempts at adaptive learning are usually based on the following approaches: (i) start with a small learning rate and increase it exponentially, if successive iterations reduce the error, or rapidly decrease it, if a significant error increase occurs [4, 56]; (ii) start with a small learning rate and increase it, if successive iterations keep the gradient direction fairly constant, or rapidly decrease it, if the direction of the gradient varies greatly at each iteration [8]; and (iii) for each weight an individual learning rate is given, which increases if the successive changes in the weights are in the same direction and decreases otherwise. The well known delta-bar-delta method [18] and Silva and Almeida's method [49] follow this approach. Another method, named quickprop, has been presented in [11]. Quickprop is based on independent secant steps in the direction of each weight. Riedmiller and Braun in 1993 proposed the Rprop algorithm. The algorithm updates the weights using the learning rate and the sign of the partial derivative of the error function with respect to each weight. Note that all these adaptation methods employ heuristic learning parameters in an attempt to secure convergence of the BP algorithm to a minimizer of E and avoid oscillations.
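As an illustration of approach (iii), the sketch below adapts each weight's learning rate from the sign of successive gradient components, in the spirit of delta-bar-delta and Rprop; the increase/decrease factors and the rate bounds are hypothetical values, not the heuristics of any particular published method.

```python
import numpy as np

def sign_adaptive_step(w, g, g_prev, eta, inc=1.2, dec=0.5,
                       eta_min=1e-6, eta_max=1.0):
    """One epoch of per-weight learning-rate adaptation:
    increase eta_i while the gradient sign persists, decrease it on a sign change."""
    same_sign = g * g_prev > 0
    eta = np.where(same_sign, np.minimum(eta * inc, eta_max),
                              np.maximum(eta * dec, eta_min))
    return w - eta * g, eta
```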
A different approach is to exploit the local shape of the error surface as described by the direction
cosines or the Lipschitz constant. In the first case the learning rate is a weighted average of the
direction cosines of the weight changes at the current and several previous successive iterations [23],
while in the second case the learning rate is an approximation of the Lipschitz constant [31].
A variety of approaches adapted from numerical analysis have also been applied, in an attempt to use not only the gradient of the error function but also the second derivative in constructing efficient supervised training algorithms to accelerate the learning process. However, training algorithms that apply nonlinear conjugate gradient methods, such as the Fletcher–Reeves or the Polak–Ribiere methods [34, 53], or variable metric methods, such as the Broyden–Fletcher–Goldfarb–Shanno method [5, 58], or even Newton's method [6, 39], are computationally intensive for FNNs with several hundred weights: derivative calculations as well as subminimization procedures (for the case of nonlinear conjugate gradient methods) and approximations of various matrices (for the case of variable metric and quasi-Newton methods) are required. Furthermore, it is not certain that the extra computational cost speeds up the minimization process for nonconvex functions when far from a minimizer, as is usually the case with the neural network training problem [5, 9, 35]. Thus, the development of improved gradient-based BP algorithms receives significant attention of neural network researchers and practitioners.
The training algorithm introduced in this paper does not use a user–defined initial learning rate; instead it self–determines the search direction and the learning rates at each epoch. It provides stable learning and robustness to oscillations. The paper is organized as follows. In Section 2 the class of adaptive learning algorithms that employ a different learning rate for each weight is presented and the advantages as well as the disadvantages of these algorithms are discussed. The new algorithm is introduced in Section 3 and its convergence properties are investigated in Section 4. Experimental results are presented in Section 5 to evaluate and compare the performance of the new algorithm with several other BP methods. The paper ends, in Section 6, with concluding remarks.

2 Adaptive Learning and the Error Surface
The eigensystem of the Hessian matrix can be used to determine the shape of the error function E in the neighborhood of a local minimizer [1, 18]. Thus, studying the sensitivity of the minimizer to small changes by approximating the error function with a quadratic one, it is known that, in a sufficiently small neighborhood of w^*, the directions of the principal axes of the corresponding elliptical contours (n–dimensional ellipsoids) will be given by the eigenvectors of H(w^*), while the lengths of the axes will be inversely proportional to the square roots of the corresponding eigenvalues. Furthermore, a variation along the eigenvector corresponding to the maximum eigenvalue will cause the largest change in E, while the eigenvector corresponding to the minimum eigenvalue gives the least sensitive direction. Therefore, a value for the learning rate which yields a large variation along the eigenvector corresponding to the maximum eigenvalue may result in oscillations. On the other hand, a value for the learning rate which yields a small variation along the eigenvector corresponding to the minimum eigenvalue may result in small steps along this direction and thus, in a slight reduction of the error function. In general, a learning rate appropriate for any one weight direction is not necessarily appropriate for other directions.
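The following small sketch illustrates this numerically under assumed values (a diagonal quadratic with eigenvalues 10 and 0.1, and two trial learning rates): a rate above 2/λ_max makes the iterates oscillate and grow along the stiff direction, while a rate small enough for that direction barely reduces the error along the flat one.

```python
import numpy as np

# Quadratic error E(w) = 0.5 * w^T H w with eigenvalues 10 and 0.1 (illustrative values)
H = np.diag([10.0, 0.1])
w = np.array([1.0, 1.0])

for eta in (0.25, 0.05):           # 0.25 > 2/10 oscillates; 0.05 is stable but slow in the flat direction
    v = w.copy()
    for _ in range(20):
        v = v - eta * (H @ v)      # one steepest-descent epoch
    print(f"eta={eta}: w after 20 epochs = {v}")
```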
Various adaptive learning algorithms with a different learning rate for each weight have been
suggested in the literature [11, 18, 33, 41, 46, 49]. This approach allows us to find the proper
learning rate that compensates for the small magnitude of the gradient in a flat direction in order
to avoid slow convergence, and dampens a large weight change in a steep direction in order to avoid
oscillations. Moreover, it exploits the parallelism inherent in the evaluation of E(w) and g(w) by
the BP algorithm.
Following this approach Eq. (1) is reformulated to the following scheme:

w^{k+1} = w^k - \operatorname{diag}\{\eta_1^k, \ldots, \eta_n^k\}\, g(w^k).   (2)

The weight vector in Eq. (2) is not updated in the direction of the negative of the gradient; instead, an alternative adaptive search direction is obtained by taking into consideration the weight change, evaluated by multiplying the length of the search step, i.e. the value of the learning rate, along each weight direction by the partial derivative of E(w) with respect to the corresponding weight, i.e. \eta_i \partial_i E(w). In other words, these algorithms try to decrease the error in each direction, by searching the local minimum with small weight steps. These steps are usually constrained by problem–dependent heuristic parameters, in order to ensure subminimization of the error function in each weight direction.
A well known difficulty of this approach is that the use of inappropriate heuristic values for a weight direction misguides the resultant search direction. In such cases, the training algorithms with an adaptive learning rate for each weight cannot exploit the global information obtained by taking into consideration all the directions. This is the case of many well known training algorithms that employ heuristic parameters for properly tuning the adaptive learning rates [11, 18, 33, 41, 46, 49] and no guarantee is provided that the weight updates will converge to a minimizer of E. In certain cases the aforementioned methods, although originally developed for batch training, can be used for on-line training by minimizing a pattern–based error measure.
3 A Theoretical Derivation of the Adaptive Learning Process
An adaptive learning rate algorithm (see Eq. (2)) seeks a minimum w^* of the error function and generates with every training epoch a discretized path in the n–dimensional weight space. The limiting value of this path, \lim_{k \to \infty} w^k, corresponds to a stationary point of E(w). This path depends on the values of the learning rates chosen in each epoch. Appropriate learning rates help to avoid convergence to a saddle point or a maximum. In the framework of Eq. (2) the learning process can theoretically be interpreted as follows.

Starting from an arbitrary initial weight vector w^0 \in D (a specific domain of E), the training algorithm subminimizes, at the kth epoch, in parallel, the n one–dimensional functions:

E(w_1^k, \ldots, w_{i-1}^k, w_i, w_{i+1}^k, \ldots, w_n^k).   (3)
First, each function is minimized along the direction i and the corresponding subminimizer \hat{w}_i is obtained. Obviously, for this \hat{w}_i,

\partial_i E(w_1^k, \ldots, w_{i-1}^k, \hat{w}_i, w_{i+1}^k, \ldots, w_n^k) = 0.   (4)

This is a one–dimensional subminimization because all other components of the weight vector, except the ith, are kept constant. Then each weight is updated according to the relation:

w_i^{k+1} = \hat{w}_i.   (5)
In order to be consistent with Eq. (2) only a single iteration of the one–dimensional method in each weight direction is proposed. It is worth noticing that the number of the iterations of the subminimization method is related to the requested accuracy in obtaining the subminimizer approximations. Thus, significant computational effort is needed in order to find very accurate approximations of the subminimizer in each weight direction at each epoch. Moreover, the computational effort for the subminimization method is increased for FNNs with several hundred weights. On the other hand, it is not certain that this large computational effort speeds up the minimization process for nonconvex functions when the algorithm is away from a minimizer w^* [37]. Thus, we propose to obtain \hat{w}_i by minimizing Eq. (3) with one iteration of a minimization method.
The problem of minimizing the error function E along the ith direction,

\min_{\eta_i \ge 0} E(w + \eta_i\, e_i),   (6)

is equivalent to the minimization of the one–dimensional function

\varphi_i(\eta) = E(w + \eta\, e_i).   (7)
Since we have n directions we consider n one–dimensional functions \varphi_i(\eta). Note that according to experimental work [61] these functions can be approximated for certain learning tasks with quadratic functions in the neighborhood of \eta = 0. In general, we can also use the formulation \varphi(\eta) = E(w + \eta d), where d is the search direction vector and \varphi'(\eta) = \nabla E(w + \eta d)^\top d.
In our case, we want at the kth epoch to find the learning rate \eta_i that minimizes \varphi_i(\eta) along the ith direction. Since E(w) \ge 0 for all w \in IR^n, a point w^* such that E(w^*) = 0 minimizes E(w). Thus, by applying one iteration of Newton's method to the one–dimensional equation \varphi_i(\eta) = 0 we obtain:

\eta_i^1 = \eta_i^0 - \frac{\varphi_i(\eta^0)}{\varphi_i'(\eta^0)}.   (8)
But \eta_i^0 = 0 and \varphi_i'(\eta^0) = g(w)^\top d_i. Since d_i = -e_i, Eq. (8) is reformulated as

\eta_i = \frac{E(w^k)}{\partial_i E(w^k)}.   (9)

This is done at the kth epoch in parallel, for all weight directions, to evaluate the corresponding learning rates. Then, Eq. (5) takes the form

w_i^{k+1} = w_i^k - \frac{E(w^k)}{\partial_i E(w^k)}.   (10)

Eq. (10) constitutes the weight update formula of the new BP training algorithm with an adaptive learning rate for each weight.

The iterative scheme (10) takes into consideration information from both the error function and the magnitude of the gradient components. When the gradient magnitude is small, the local shape of E in this direction is flat, otherwise it is steep. The value of the error function indicates how close to the global minimizer this local shape is. The above pieces of information help the iterative scheme (10) to escape from flat regions with high error values, which are located far from a desired minimizer.
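A minimal sketch of one epoch of the update in Eq. (10); error_fn and grad_fn are placeholders for the batch error and gradient routines, and the small guard added to near-zero gradient components is an assumption for numerical robustness, not part of the formula above.

```python
import numpy as np

def adaptive_epoch(w, error_fn, grad_fn, eps=1e-12):
    """One epoch of the scheme in Eq. (10):
    w_i^{k+1} = w_i^k - E(w^k) / d_i E(w^k), applied to all weights in parallel."""
    E = error_fn(w)                          # batch error E(w^k)
    g = grad_fn(w)                           # gradient components of E at w^k
    g = np.where(np.abs(g) < eps, eps, g)    # guard near-zero denominators (an added assumption)
    return w - E / g
```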
4 Convergence Analysis
First, we recall two concepts which will be used in our convergence analysis.
1) The Property A_\pi: Young [60] has discovered a class of matrices, described as having property A, that can be partitioned into block–tridiagonal form, possibly after a suitable permutation. In Young's original presentation, the elements of a matrix A = [a_{ij}] are partitioned into two groups. In general, any partitioning of an n–dimensional vector x = (x^{(1)}, \ldots, x^{(m)}) into block components x^{(p)} of dimensions n_p, p = 1, \ldots, m (with \sum_{p=1}^{m} n_p = n), is uniquely determined by a partitioning \pi = \{\pi_p\}_{p=1}^{m} of the set of the first n integers, where \pi_p contains the integers s_p + 1, \ldots, s_p + n_p, with s_p = \sum_{j=1}^{p-1} n_j. The same partitioning \pi also induces a partitioning of any n \times n matrix A into block matrix components A_{ij} of dimensions n_i \times n_j. Note that the matrices A_{ii} are square.

Definition 1 [3]: The matrix A has the property A_\pi if A can be permuted by P A P^\top into a form that can be partitioned into block–tridiagonal form, that is,

P A P^\top =
\begin{pmatrix}
D_1     & L_1^\top &          &          & O            \\
L_1     & D_2      & L_2^\top &          &              \\
        & \ddots   & \ddots   & \ddots   &              \\
        &          & L_{r-2}  & D_{r-1}  & L_{r-1}^\top \\
O       &          &          & L_{r-1}  & D_r
\end{pmatrix},

where the matrices D_i, i = 1, \ldots, r, are nonsingular.
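As a small, hypothetical illustration of the block–tridiagonal form in Definition 1 (block sizes and values are made up for the example), the sketch below assembles such a matrix from diagonal blocks D_i and sub-diagonal blocks L_i:

```python
import numpy as np

def block_tridiagonal(D, L):
    """Assemble a block-tridiagonal matrix from diagonal blocks D[0..r-1]
    and sub-diagonal blocks L[0..r-2]; the super-diagonal holds L[i].T."""
    sizes = [d.shape[0] for d in D]
    offsets = np.cumsum([0] + sizes)
    A = np.zeros((offsets[-1], offsets[-1]))
    for i, d in enumerate(D):
        A[offsets[i]:offsets[i+1], offsets[i]:offsets[i+1]] = d
    for i, l in enumerate(L):
        A[offsets[i+1]:offsets[i+2], offsets[i]:offsets[i+1]] = l
        A[offsets[i]:offsets[i+1], offsets[i+1]:offsets[i+2]] = l.T
    return A

D = [np.eye(2) * 2, np.eye(1) * 3, np.eye(2) * 4]   # nonsingular diagonal blocks
L = [np.ones((1, 2)), np.ones((2, 1))]              # sub-diagonal blocks
print(block_tridiagonal(D, L))
```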
2) The Root–convergence factor: It is useful for any iterative procedure to have a measure of the rate of its convergence. In our case, we are interested in how fast the weight update equation (10), denoted P, converges to w^*. A measure of the rate of its convergence is obtained by taking appropriate roots of successive errors. To this end we use the following definition.

Definition 2 [37]: Let \{w^k\}_{k=0}^{\infty} be any sequence that converges to w^*. Then the number

R\{w^k\} = \limsup_{k \to \infty} \|w^k - w^*\|^{1/k},   (11)

is the root–convergence factor, or R–factor, of the sequence of the weights. If the iterative procedure P converges to w^* and C(P, w^*) is the set of all sequences generated by P which converge to w^*, then

R(P, w^*) = \sup\{R\{w^k\} ;\ \{w^k\} \in C(P, w^*)\},   (12)

is the R–factor of P at w^*.
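As a quick numerical illustration of Definition 2, the sketch below estimates the R-factor from the tail of an error sequence; the toy sequence with \|w^k - w^*\| = 0.5^k is an assumption made up for the example, for which the estimate is close to 0.5.

```python
import numpy as np

def r_factor_estimate(errors):
    """Estimate R{w^k} = limsup ||w^k - w*||^(1/k) from the tail of an error sequence."""
    k = np.arange(1, len(errors) + 1)
    return float(np.max(errors[-10:] ** (1.0 / k[-10:])))

errors = 0.5 ** np.arange(1, 60)      # toy sequence with ||w^k - w*|| = 0.5^k
print(r_factor_estimate(errors))      # prints a value close to 0.5
```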
Our convergence analysis consists of two parts. In the first part, results regarding the local convergence properties of the algorithm are presented. In the second part, appropriate conditions are proposed to guarantee the global convergence of the algorithm, i.e. the convergence to a minimizer from any starting point.

References

Haralick R.M., Shanmugam K., Dinstein I., Textural features for image classification, IEEE Transactions on Systems, Man, and Cybernetics, SMC-3(6), 1973, 610-621.


Ortega J.M., Rheinboldt W.C., Iterative Solution of Nonlinear Equations in Several Variables, Academic Press, New York, 1970.

Dennis J.E., Schnabel R.B., Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Prentice-Hall, Englewood Cliffs, NJ, 1983.

Varga R.S., Matrix Iterative Analysis, Prentice-Hall, Englewood Cliffs, NJ, 1962.
