Journal ArticleDOI

On the limited memory BFGS method for large scale optimization

01 Dec 1989 - Mathematical Programming (Springer-Verlag New York, Inc.) - Vol. 45, Iss. 3, pp. 503-528
TL;DR: Numerical tests indicate that the L-BFGS method is faster than the method of Buckley and LeNir and is better able to use additional storage to accelerate convergence; the paper also studies the convergence properties of the method and proves global convergence on uniformly convex problems.
Abstract: We study the numerical performance of a limited memory quasi-Newton method for large scale optimization, which we call the L-BFGS method. We compare its performance with that of the method developed by Buckley and LeNir (1985), which combines cycles of BFGS steps and conjugate direction steps. Our numerical tests indicate that the L-BFGS method is faster than the method of Buckley and LeNir, and is better able to use additional storage to accelerate convergence. We show that the L-BFGS method can be greatly accelerated by means of a simple scaling. We then compare the L-BFGS method with the partitioned quasi-Newton method of Griewank and Toint (1982a). The results show that, for some problems, the partitioned quasi-Newton method is clearly superior to the L-BFGS method. However we find that for other problems the L-BFGS method is very competitive due to its low iteration cost. We also study the convergence properties of the L-BFGS method, and prove global convergence on uniformly convex problems.

Summary (3 min read)

Preliminaries

  • The method of Buckley and LeNir combines cycles of BFGS and conjugate gradient steps.
  • It starts by performing the usual BFGS method, but stores the corrections to the initial matrix separately to avoid using O(n²) storage.
  • The matrices H_k are not formed explicitly; instead, the m most recent values of y_j and s_j are stored separately.
  • The partitioned quasi-Newton (PQN) method requires that the user supply detailed information about the objective function, and it is particularly effective if the correct range of the Hessian of each element function is known.
  • When the number of variables is very large (in the hundreds or thousands), the computational effort of the iteration sometimes dominates the cost of evaluating the function and gradient.

Table: Set of test problems

  • The test problems and the starting points used for them are described in Liu and Nocedal. All runs reported in the paper were terminated when ||g_k|| ≤ ε·max(1, ||x_k||), where ||·|| denotes the Euclidean norm and ε is a small tolerance; a test of this form is sketched after this list.
  • The authors require low accuracy in the solution because this is common in practical applications.
  • Since the authors performed a very large number of tests, they describe the results fully in an accompanying report (Liu and Nocedal).
  • All the comments and conclusions made in the paper are based on the data presented there and in the accompanying report. Comparison with the method of Buckley and LeNir: for the smallest admissible value of m the method of Buckley and LeNir reduces to Shanno's method, and for sufficiently large m both methods are identical to the BFGS method.
  • For a given value of m the two methods require roughly the same amount of storage, but the L-BFGS method requires slightly less arithmetic work per iteration than the B-L method as implemented by Buckley and LeNir.
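
The stopping test has the structure sketched below; the exact tolerance used in the paper is not reproduced in this summary, so `eps` is an assumed placeholder.

```python
import numpy as np

def converged(g_k, x_k, eps=1e-5):
    """Termination test of the form ||g_k|| <= eps * max(1, ||x_k||).

    The structure follows the criterion quoted above; the value of eps is an
    assumption for illustration, not necessarily the constant used in the paper.
    """
    return np.linalg.norm(g_k) <= eps * max(1.0, np.linalg.norm(x_k))
```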

Table: Storage locations

  • The tests described below were made on a SUN workstation in double precision arithmetic, for which the unit roundoff is approximately 10⁻¹⁶.
  • For each run the authors verified that both methods converged to the same solution point.
  • The authors therefore consider the number of iterations and the total amount of time required by the two limited memory methods for several values of m; a rough storage count is sketched after this list.
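
For orientation, a rough storage count for a limited memory method that keeps m correction pairs (a general property of such methods, not a reproduction of the paper's storage table):

```python
def lbfgs_storage_doubles(n, m, work_vectors=4):
    """Approximate number of double-precision words needed by L-BFGS.

    Keeping m pairs (s_j, y_j) of length-n vectors costs about 2*m*n words,
    plus a few n-vectors of workspace; the exact bookkeeping in the paper's
    storage table may differ slightly, so this is only an estimate.
    """
    return 2 * m * n + work_vectors * n

# Example: n = 1000 variables with m = 5 corrections needs roughly 14,000 doubles.
print(lbfgs_storage_doubles(1000, 5))
```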

L-BFGS

  • For most problems the number of iterations is markedly reduced (compare the corresponding tables). The authors then compare this implementation of the L-BFGS method with the method of Buckley and LeNir, using total CPU time as the measure for simplicity.
  • Furthermore, an examination of the results given in Liu and Nocedal shows that the differences are very substantial in many cases.

Scaling the L-BFGS method

  • It is known that simple scalings of the variables can improve the performance of quasi Newton methods on small problems.
  • The authors' numerical experience appears to indicate that the two scalings are comparable in efficiency, and the scaling M should therefore be preferred since it is less expensive to implement; a sketch of a dynamic scaling of this kind appears after this list.
  • In their tests this formula sometimes performed well, but it was very inefficient on many problems.
  • The authors observe that, in general, when solving very large problems, increasing the storage m gives only a marginal improvement in performance (see the table of results for the L-BFGS method with scaling strategy M); Gilbert and Lemaréchal report similar results.
  • The authors tested three methods: the algorithm CONMIN developed by Shanno and Phua; the conjugate gradient method (CG) using the Polak-Ribière formula (see, for example, Powell), restarting every n steps; and the L-BFGS method with scaling M, for which they tried both accurate and inaccurate line searches.
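
A dynamic scaling of the kind discussed above replaces the initial matrix by a scaled identity at every iteration. The specific scaling label is garbled in this summary, so the formula below, a standard choice for limited memory BFGS, is an assumption used for illustration.

```python
import numpy as np

def scaled_h0_diagonal(s, y, n):
    """Return the diagonal of a scaled initial matrix H0 = gamma_k * I.

    gamma_k = (s_k^T y_k) / (y_k^T y_k) is a common dynamic scaling for
    limited memory BFGS; it is assumed here as one instance of the simple
    scalings described in the text.
    """
    gamma = float(s @ y) / float(y @ y)
    return gamma * np.ones(n)  # only the diagonal is stored: O(n) memory
```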

Table: CONMIN, CG and L-BFGS methods

  • The next two tables summarize the preceding results.
  • In these two problems the PQN method is vastly superior to the L-BFGS method in terms of function evaluations.
  • The number of variables entering into the element functions is denoted nve, and nve/vr is the number obtained after applying variable reduction. Using these results, the authors give the average time required to perform an iteration (it-time).
  • For the PQN method the authors used the results corresponding to B = I, and they recall that the L-BFGS method used scaling M. (Table: separability of the objective functions and average iteration time.)

Convergence Analysis

  • In this section the authors show that the limited memory BFGS method is globally convergent on uniformly convex problems and that its rate of convergence is R-linear.
  • These results are easy to establish after noting that all Hessian approximations H_k are obtained by updating a bounded matrix m times using the BFGS formula.
  • The authors make the following assumptions about the objective function.

Assumptions

  • Therefore, from the preceding relations, the authors conclude that there is a constant δ > 0 such that cos θ_k = (s_k^T B_k s_k) / (||s_k|| ||B_k s_k||) ≥ δ.
  • It is possible to prove this result for several other line search strategies, including backtracking, by adapting the arguments of Byrd and Nocedal (see the proof of their theorem), noting from the relations above that the relevant quantities remain uniformly bounded; a restatement of the underlying assumptions appears after this list.
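
For reference, the uniform convexity assumption and the angle bound used in arguments of this type can be written as follows; the constants m, M and δ are generic, and the exact inequality labels of the paper are not reproduced here.

```latex
% Uniform convexity of the objective:
\[
  m\,\|z\|^2 \;\le\; z^{T} \nabla^{2} f(x)\, z \;\le\; M\,\|z\|^2
  \qquad \text{for all } x,\, z \in \mathbb{R}^{n},\quad 0 < m \le M .
\]
% Angle bound between the quasi-Newton step and the steepest descent direction:
\[
  \cos\theta_k \;=\; \frac{s_k^{T} B_k s_k}{\|s_k\|\,\|B_k s_k\|} \;\ge\; \delta \;>\; 0 .
\]
```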

Final Remarks

  • The authors' tests indicate that a simple implementation of the L-BFGS method performs better than the code of Buckley and LeNir, and that the L-BFGS method can be greatly improved by means of a simple dynamic scaling such as M.
  • The partitioned quasi-Newton method is highly recommended if the user is able and willing to supply the detailed information about the objective function that it requires, and it is particularly effective when the element functions depend on a small number of variables.
  • The L-BFGS method is appealing for several reasons: it is very simple to implement; it requires only function and gradient values, and no other information on the problem; and it can be faster than the partitioned quasi-Newton method on problems where the element functions depend on more than a few variables.
  • The authors would like to thank Andreas Griewank and Claude Lemaréchal for several helpful conversations, and Richard Byrd for suggesting the scaling used in method M.
  • The authors are grateful to Jorge Moré, who encouraged them to pursue this investigation and made many valuable suggestions, and to the three referees for their helpful comments.


NORTHWESTERN UNIVERSITY
Department of Electrical Engineering and Computer Science

ON THE LIMITED MEMORY BFGS METHOD FOR LARGE SCALE OPTIMIZATION¹

by
Dong C. Liu² and Jorge Nocedal³

This paper has appeared in Mathematical Programming 45 (1989), pp. 503-528.

¹ This work was supported by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy, under contract DE-FG02-87ER25047, and by National Science Foundation Grant No. DCR-86-02071.
² Department of Electrical Engineering and Computer Science, Northwestern University, Evanston IL 60208.
³ Department of Electrical Engineering and Computer Science, Northwestern University, Evanston IL 60208.

ON THE LIMITED MEMORY BFGS METHOD FOR LARGE SCALE OPTIMIZATION

by
Dong C. Liu and Jorge Nocedal

ABSTRACT

We study the numerical performance of a limited memory quasi-Newton method for large scale optimization, which we call the L-BFGS method. We compare its performance with that of the method developed by Buckley and LeNir (1985), which combines cycles of BFGS steps and conjugate direction steps. Our numerical tests indicate that the L-BFGS method is faster than the method of Buckley and LeNir, and is better able to use additional storage to accelerate convergence. We show that the L-BFGS method can be greatly accelerated by means of a simple scaling. We then compare the L-BFGS method with the partitioned quasi-Newton method of Griewank and Toint (1982a). The results show that, for some problems, the partitioned quasi-Newton method is clearly superior to the L-BFGS method. However we find that for other problems the L-BFGS method is very competitive due to its low iteration cost. We also study the convergence properties of the L-BFGS method, and prove global convergence on uniformly convex problems.

Key words: large scale nonlinear optimization, limited memory methods, partitioned quasi-Newton method, conjugate gradient method.

Abbreviated title: Limited memory BFGS.

1. Introduction.

We consider the minimization of a smooth nonlinear function $f : \mathbb{R}^n \to \mathbb{R}$,
$$\min f(x) \qquad (1.1)$$
in the case where the number of variables $n$ is large, and where analytic expressions for the function $f$ and the gradient $g$ are available. Among the most useful methods for solving this problem are: (i) Newton's method and variations of it, see for example Steihaug (1983), O'Leary (1982), Toint (1981) and Nash (1985); (ii) the partitioned quasi-Newton method of Griewank and Toint (1982a); (iii) the conjugate gradient method, see for example Fletcher (1980) and Gill, Murray and Wright (1981); (iv) limited memory quasi-Newton methods.

This paper is devoted to the study of limited memory quasi-Newton methods for large scale optimization. These methods can be seen as extensions of the conjugate gradient method, in which additional storage is used to accelerate convergence. They are suitable for large scale problems because the amount of storage required by the algorithms (and thus the cost of the iteration) can be controlled by the user. Alternatively, limited memory methods can be viewed as implementations of quasi-Newton methods, in which storage is restricted. Their simplicity is one of their main appeals: they do not require knowledge of the sparsity structure of the Hessian, or knowledge of the separability of the objective function, and as we will see in this paper, they can be very simple to program.

Limited memory methods originated with the work of Perry (1977) and Shanno (1978b), and were subsequently developed and analyzed by Buckley (1978), Nazareth (1979), Nocedal (1980), Shanno (1978a), Gill and Murray (1979), and Buckley and LeNir (1983). Numerical tests performed during the last ten years on medium size problems have shown that limited memory methods require substantially fewer function evaluations than the conjugate gradient method, even when little additional storage is added. However little is known regarding the relative performance of these methods with respect to Newton's method or the partitioned quasi-Newton algorithm, when solving large problems. Moreover, since the study by Gill and Murray (1979), there have been no attempts to compare the various limited memory methods with each other, and it is therefore not known which is their most effective implementation.

In this paper we present and analyze the results of extensive numerical tests of two limited memory methods and of the partitioned quasi-Newton algorithm. We compare the combined CG-QN method of Buckley and LeNir (1983) as implemented in Buckley and LeNir (1985), the limited memory BFGS method described by Nocedal (1980), and the partitioned quasi-Newton method, as implemented by Toint (1983b). The results indicate that the limited memory BFGS method (L-BFGS) is superior to the method of Buckley and LeNir. They also show that for many problems the partitioned quasi-Newton method is extremely effective, and is superior to the limited memory methods. However we find that for other problems the L-BFGS method is very competitive, in terms of cpu time, with the partitioned quasi-Newton method.

We briefly review the methods to be tested in §2, where we also describe the problems used in our experiments. In §3 we present results that indicate that the limited memory BFGS method is faster than the method of Buckley and LeNir (1985), and is better able to use additional storage to accelerate convergence. In §4 we explore ways of improving the performance of the L-BFGS method, by choosing suitable diagonal scalings, and study its behavior on very large problems (where the number of variables is in the thousands). In §5 we compare the L-BFGS method with two well-known conjugate gradient methods, paying particular attention to execution times. In §6 we compare the L-BFGS method and the partitioned quasi-Newton method, and in §7 we give a convergence analysis of the L-BFGS method.

While this work was in progress we became aware that Gilbert and Lemaréchal (1988) had performed experiments that are similar to some of the ones reported here. They used a newer implementation by Buckley (1987) of the Buckley-LeNir method; this new code is more efficient than the ACM TOMS code of Buckley and LeNir (1985) used in our tests. Gilbert and Lemaréchal's implementation of the L-BFGS method is almost identical to ours. They conclude that the L-BFGS method performs better than Buckley's new code, but the differences are less pronounced than the ones reported in this paper.

Our L-BFGS code will be made available through the Harwell library under the name VA15.

2. Preliminaries

We begin by briefly reviewing the methods tested in this paper.

The method of Buckley and LeNir combines cycles of BFGS and conjugate gradient steps. It starts by performing the usual BFGS method, but stores the corrections to the initial matrix separately to avoid using $O(n^2)$ storage. When the available storage is used up, the current BFGS matrix is used as a fixed preconditioner, and the method performs preconditioned conjugate gradient steps. These steps are continued until the criterion of Powell (1977) indicates that a restart is desirable; all BFGS corrections are then discarded and the method performs a restart. This begins a new BFGS cycle.

To understand some of the details of this method one must note that Powell's restart criterion is based on the fact that, when the objective function is quadratic and the line search is exact, the gradients are orthogonal. Therefore to use Powell restarts, it is necessary that the line search be exact for quadratic objective functions, which means that the line search algorithm must perform at least one interpolation. This is expensive in terms of function evaluations, and some alternatives are discussed by Buckley and LeNir (1983).
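
Powell's criterion tests whether successive gradients have lost the orthogonality they would have in the quadratic, exact-line-search case. A minimal sketch of a test of this form is given below; the threshold 0.2 is the value commonly quoted for Powell (1977) and is an assumption here, not taken from this paper.

```python
import numpy as np

def powell_restart_needed(g_new, g_old, threshold=0.2):
    """Restart test based on loss of gradient orthogonality.

    For a quadratic objective with exact line searches successive gradients are
    orthogonal, so a large |g_{k+1}^T g_k| relative to ||g_{k+1}||^2 signals that
    conjugacy has degraded and a restart is desirable.  The 0.2 threshold is the
    value usually associated with Powell's criterion (assumed here).
    """
    return abs(float(g_new @ g_old)) >= threshold * float(g_new @ g_new)
```
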
The method of Buckley and LeNir generalizes an earlier algorithm of Shanno (1978b), by allowing additional storage to be used, and is regarded as an effective method; see Dennis and Schnabel (1987) and Toint (1986).

The limited memory BFGS method (L-BFGS) is described by Nocedal (1980), where it is called the SQN method. It is almost identical in its implementation to the well known BFGS method. The only difference is in the matrix update: the BFGS corrections are stored separately, and when the available storage is used up, the oldest correction is deleted to make space for the new one. All subsequent iterations are of this form: one correction is deleted and a new one inserted. Another description of the method, which will be useful in this paper, is as follows. The user specifies the number $m$ of BFGS corrections that are to be kept, and provides a sparse symmetric and positive definite matrix $H_0$, which approximates the inverse Hessian of $f$. During the first $m$ iterations the method is identical to the BFGS method. For $k > m$, $H_k$ is obtained by applying $m$ BFGS updates to $H_0$ using information from the $m$ previous iterations.
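
A minimal sketch of the rolling storage of correction pairs just described; the class and variable names are illustrative, not taken from the paper's code.

```python
from collections import deque

import numpy as np

class CorrectionStore:
    """Keep only the m most recent (s_j, y_j) correction pairs.

    Appending to a full store silently discards the oldest pair, mirroring the
    limited memory update described above: one correction is deleted and a new
    one is inserted at every iteration once the storage is used up.
    """

    def __init__(self, m):
        self.pairs = deque(maxlen=m)  # deque drops the oldest item automatically

    def add(self, s, y):
        self.pairs.append((np.asarray(s, float), np.asarray(y, float)))
```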

To give a precise description of the L-BFGS method we first need to introduce some notation. The iterates will be denoted by $x_k$, and we define $s_k = x_{k+1} - x_k$ and $y_k = g_{k+1} - g_k$. The method uses the inverse BFGS formula in the form
$$H_{k+1} = V_k^T H_k V_k + \rho_k s_k s_k^T \qquad (2.1)$$
where $\rho_k = 1 / y_k^T s_k$, and $V_k = I - \rho_k y_k s_k^T$; see Dennis and Schnabel (1983).
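
For clarity, one update of the form (2.1) can be written densely as below; the limited memory method never forms $H_k$ explicitly, so this is only an illustration of the formula.

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """Apply one inverse BFGS update (2.1): H_new = V^T H V + rho * s s^T,
    with rho = 1 / (y^T s) and V = I - rho * y s^T."""
    rho = 1.0 / float(y @ s)
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V.T @ H @ V + rho * np.outer(s, s)
```
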
Algorithm 2.1 (L-BFGS Method)

(1) Choose $x_0$, $m$, $0 < \beta' < 1/2$, $\beta' < \beta < 1$, and a symmetric and positive definite starting matrix $H_0$. Set $k = 0$.

(2) Compute
$$d_k = -H_k g_k, \qquad (2.2)$$
$$x_{k+1} = x_k + \alpha_k d_k, \qquad (2.3)$$
where $\alpha_k$ satisfies the Wolfe conditions:
$$f(x_k + \alpha_k d_k) \le f(x_k) + \beta' \alpha_k g_k^T d_k, \qquad (2.4)$$
$$g(x_k + \alpha_k d_k)^T d_k \ge \beta\, g_k^T d_k. \qquad (2.5)$$
(We always try the steplength $\alpha_k = 1$ first.)

(3) Let $\hat m = \min\{k, m-1\}$. Update $H_0$ $\hat m + 1$ times using the pairs $\{y_j, s_j\}_{j = k - \hat m}^{k}$, i.e. let
$$
\begin{aligned}
H_{k+1} ={}& \bigl(V_k^T \cdots V_{k-\hat m}^T\bigr) H_0 \bigl(V_{k-\hat m} \cdots V_k\bigr) \\
&+ \rho_{k-\hat m} \bigl(V_k^T \cdots V_{k-\hat m+1}^T\bigr) s_{k-\hat m} s_{k-\hat m}^T \bigl(V_{k-\hat m+1} \cdots V_k\bigr) \\
&+ \rho_{k-\hat m+1} \bigl(V_k^T \cdots V_{k-\hat m+2}^T\bigr) s_{k-\hat m+1} s_{k-\hat m+1}^T \bigl(V_{k-\hat m+2} \cdots V_k\bigr) \\
&\;\;\vdots \\
&+ \rho_k s_k s_k^T. \qquad (2.6)
\end{aligned}
$$

(4) Set $k := k + 1$ and go to 2.

We note that the matrices $H_k$ are not formed explicitly, but the $\hat m + 1$ previous values of $y_j$ and $s_j$ are stored separately. There is an efficient formula, due to Strang, for computing the product $H_k g_k$; see Nocedal (1980). Note that this algorithm is very simple to program; it is similar in length and complexity to a BFGS code that uses the inverse formula.
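
The product $H_k g_k$ can be computed directly from the stored pairs without forming $H_k$. The sketch below uses the standard two-loop recursion, which is equivalent in effect to applying the recursive formula (2.6) with $H_0 = \gamma I$; variable names are illustrative and the scaling $\gamma$ is an assumed parameter.

```python
import numpy as np

def lbfgs_direction(g, pairs, gamma=1.0):
    """Return d = -H_k g computed from the stored (s_j, y_j) pairs.

    `pairs` holds the most recent corrections, oldest first, and H_0 is taken
    to be gamma * I.  This is the standard two-loop recursion associated with
    the recursive update (2.6); it costs O(m n) operations and O(m n) storage.
    """
    q = np.array(g, dtype=float)
    stack = []
    for s, y in reversed(pairs):              # newest correction first
        rho = 1.0 / float(y @ s)
        alpha = rho * float(s @ q)
        q -= alpha * y
        stack.append((rho, alpha, s, y))
    r = gamma * q                             # apply the initial matrix H_0
    for rho, alpha, s, y in reversed(stack):  # oldest correction first
        beta = rho * float(y @ r)
        r += (alpha - beta) * s
    return -r
```

With no stored pairs and gamma = 1 the recursion simply returns the steepest descent direction, which matches the first iteration of Algorithm 2.1 when $H_0 = I$.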

Citations
Book
23 May 2011
TL;DR: It is argued that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas.
Abstract: Many problems of recent interest in statistics and machine learning can be posed in the framework of convex optimization. Due to the explosion in size and complexity of modern datasets, it is increasingly important to be able to solve problems with a very large number of features or training examples. As a result, both the decentralized collection or storage of these datasets as well as accompanying distributed solution methods are either necessary or at least highly desirable. In this review, we argue that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas. The method was developed in the 1970s, with roots in the 1950s, and is equivalent or closely related to many other algorithms, such as dual decomposition, the method of multipliers, Douglas–Rachford splitting, Spingarn's method of partial inverses, Dykstra's alternating projections, Bregman iterative algorithms for l1 problems, proximal methods, and others. After briefly surveying the theory and history of the algorithm, we discuss applications to a wide variety of statistical and machine learning problems of recent interest, including the lasso, sparse logistic regression, basis pursuit, covariance selection, support vector machines, and many others. We also discuss general distributed optimization, extensions to the nonconvex setting, and efficient implementation, including some details on distributed MPI and Hadoop MapReduce implementations.

17,433 citations

Book
01 Nov 2008
TL;DR: Numerical Optimization presents a comprehensive and up-to-date description of the most effective methods in continuous optimization, responding to the growing interest in optimization in engineering, science, and business by focusing on the methods that are best suited to practical problems.
Abstract: Numerical Optimization presents a comprehensive and up-to-date description of the most effective methods in continuous optimization. It responds to the growing interest in optimization in engineering, science, and business by focusing on the methods that are best suited to practical problems. For this new edition the book has been thoroughly updated throughout. There are new chapters on nonlinear interior methods and derivative-free methods for optimization, both of which are used widely in practice and the focus of much current research. Because of the emphasis on practical methods, as well as the extensive illustrations and exercises, the book is accessible to a wide audience. It can be used as a graduate text in engineering, operations research, mathematics, computer science, and business. It also serves as a handbook for researchers and practitioners in the field. The authors have strived to produce a text that is pleasant to read, informative, and rigorous - one that reveals both the beautiful nature of the discipline and its practical side.

17,420 citations


Cites methods from "On the limited memory BFGS method f..."

  • ...Jorge Nocedal Stephen J. Wright Numerical Optimization With 85 Illustrations , Springer Contents Preface vii 1 Introduction, 1 Mathematical Formulation 2 Example: A Transportation Problem 4 Continuous versus Discrete Optimization 4 Constrained and Unconstrained Optimization 6 Global and Local Optimization 6 Stochastic and Deterministic Optimization 7 Optimization Algorithms 7 Convexity 8 Notes and References 9 2 Fundamentals of Unconstrained Optimization 10 2.1 What Is a Solution?...

    [...]

  • ...Limited-memory BFGS methods are implemented in LBFGS [194] and M1QN3 [122]; see Gill and Leonard [125] for a variant that requires less storage and appears to be quite efficient....

    [...]

  • ...For further discussion on the L-BFGS method see Nocedal [228], Liu and Nocedal [194], and Gilbert and Lemaréchal [122]....

    [...]

Journal ArticleDOI
TL;DR: The Marquardt algorithm for nonlinear least squares is presented and is incorporated into the backpropagation algorithm for training feedforward neural networks and is found to be much more efficient than either of the other techniques when the network contains no more than a few hundred weights.
Abstract: The Marquardt algorithm for nonlinear least squares is presented and is incorporated into the backpropagation algorithm for training feedforward neural networks. The algorithm is tested on several function approximation problems, and is compared with a conjugate gradient algorithm and a variable learning rate algorithm. It is found that the Marquardt algorithm is much more efficient than either of the other techniques when the network contains no more than a few hundred weights. >

6,899 citations

Journal ArticleDOI
TL;DR: In this article, the authors introduce physics-informed neural networks, which are trained to solve supervised learning tasks while respecting any given laws of physics described by general nonlinear partial differential equations.

5,448 citations

Journal ArticleDOI
TL;DR: An algorithm for solving large nonlinear optimization problems with simple bounds is described, based on the gradient projection method and uses a limited memory BFGS matrix to approximate the Hessian of the objective function.
Abstract: An algorithm for solving large nonlinear optimization problems with simple bounds is described. It is based on the gradient projection method and uses a limited memory BFGS matrix to approximate the Hessian of the objective function. It is shown how to take advantage of the form of the limited memory approximation to implement the algorithm efficiently. The results of numerical tests on a set of large problems are reported.

5,079 citations


Cites methods from "On the limited memory BFGS method f..."

  • ...[6] R. H. Byrd, J. Nocedal, and R. B. Schnabel, "Representation of quasi-Newton matrices and their use in limited memory methods," Technical report, EECS Department, Northwestern University, 1991, to appear in Mathematical Programming....

    [...]

  • ...The new algorithm therefore has computational demands similar to those of the limited-memory algorithm (L-BFGS) for unconstrained problems described by Liu and Nocedal [19] and Gilbert and Lemaréchal [14]....

    [...]

  • ...The Hessian approximations Bk used in our algorithm are limited-memory BFGS matrices (Nocedal [21] and Byrd, Nocedal, and Schnabel [6])....

    [...]

  • ...We find that by making use of the compact representations of limited-memory matrices described by Byrd, Nocedal, and Schnabel [6], the computational cost of one iteration of the algorithm can be kept to be of order n. We used the gradient projection approach [16], [18], [3] to determine the active set, because recent studies [7], [5] indicate that it possesses good theoretical properties, and because it also appears to be efficient on many large problems [8], [20]....

    [...]


References
Book
01 Jan 2009
TL;DR: The aim of this book is to provide a Discussion of Constrained Optimization and its Applications to Linear Programming and Other Optimization Problems.
Abstract: Preface Table of Notation Part 1: Unconstrained Optimization Introduction Structure of Methods Newton-like Methods Conjugate Direction Methods Restricted Step Methods Sums of Squares and Nonlinear Equations Part 2: Constrained Optimization Introduction Linear Programming The Theory of Constrained Optimization Quadratic Programming General Linearly Constrained Optimization Nonlinear Programming Other Optimization Problems Non-Smooth Optimization References Subject Index.

7,278 citations

Book
01 Feb 1996
TL;DR: In this paper, Schnabel proposed a modular system of algorithms for unconstrained minimization and nonlinear equations, based on Newton's method for solving one equation in one unknown convergence of sequences of real numbers.
Abstract: Preface 1. Introduction. Problems to be considered Characteristics of 'real-world' problems Finite-precision arithmetic and measurement of error Exercises 2. Nonlinear Problems in One Variable. What is not possible Newton's method for solving one equation in one unknown Convergence of sequences of real numbers Convergence of Newton's method Globally convergent methods for solving one equation in one uknown Methods when derivatives are unavailable Minimization of a function of one variable Exercises 3. Numerical Linear Algebra Background. Vector and matrix norms and orthogonality Solving systems of linear equations-matrix factorizations Errors in solving linear systems Updating matrix factorizations Eigenvalues and positive definiteness Linear least squares Exercises 4. Multivariable Calculus Background Derivatives and multivariable models Multivariable finite-difference derivatives Necessary and sufficient conditions for unconstrained minimization Exercises 5. Newton's Method for Nonlinear Equations and Unconstrained Minimization. Newton's method for systems of nonlinear equations Local convergence of Newton's method The Kantorovich and contractive mapping theorems Finite-difference derivative methods for systems of nonlinear equations Newton's method for unconstrained minimization Finite difference derivative methods for unconstrained minimization Exercises 6. Globally Convergent Modifications of Newton's Method. The quasi-Newton framework Descent directions Line searches The model-trust region approach Global methods for systems of nonlinear equations Exercises 7. Stopping, Scaling, and Testing. Scaling Stopping criteria Testing Exercises 8. Secant Methods for Systems of Nonlinear Equations. Broyden's method Local convergence analysis of Broyden's method Implementation of quasi-Newton algorithms using Broyden's update Other secant updates for nonlinear equations Exercises 9. Secant Methods for Unconstrained Minimization. The symmetric secant update of Powell Symmetric positive definite secant updates Local convergence of positive definite secant methods Implementation of quasi-Newton algorithms using the positive definite secant update Another convergence result for the positive definite secant method Other secant updates for unconstrained minimization Exercises 10. Nonlinear Least Squares. The nonlinear least-squares problem Gauss-Newton-type methods Full Newton-type methods Other considerations in solving nonlinear least-squares problems Exercises 11. Methods for Problems with Special Structure. The sparse finite-difference Newton method Sparse secant methods Deriving least-change secant updates Analyzing least-change secant methods Exercises Appendix A. A Modular System of Algorithms for Unconstrained Minimization and Nonlinear Equations (by Robert Schnabel) Appendix B. Test Problems (by Robert Schnabel) References Author Index Subject Index.

6,831 citations

Book
01 Mar 1983
TL;DR: Newton's Method for Nonlinear Equations and Unconstrained Minimization and methods for solving nonlinear least-squares problems with Special Structure.
Abstract: Preface 1. Introduction. Problems to be considered Characteristics of 'real-world' problems Finite-precision arithmetic and measurement of error Exercises 2. Nonlinear Problems in One Variable. What is not possible Newton's method for solving one equation in one unknown Convergence of sequences of real numbers Convergence of Newton's method Globally convergent methods for solving one equation in one uknown Methods when derivatives are unavailable Minimization of a function of one variable Exercises 3. Numerical Linear Algebra Background. Vector and matrix norms and orthogonality Solving systems of linear equations-matrix factorizations Errors in solving linear systems Updating matrix factorizations Eigenvalues and positive definiteness Linear least squares Exercises 4. Multivariable Calculus Background Derivatives and multivariable models Multivariable finite-difference derivatives Necessary and sufficient conditions for unconstrained minimization Exercises 5. Newton's Method for Nonlinear Equations and Unconstrained Minimization. Newton's method for systems of nonlinear equations Local convergence of Newton's method The Kantorovich and contractive mapping theorems Finite-difference derivative methods for systems of nonlinear equations Newton's method for unconstrained minimization Finite difference derivative methods for unconstrained minimization Exercises 6. Globally Convergent Modifications of Newton's Method. The quasi-Newton framework Descent directions Line searches The model-trust region approach Global methods for systems of nonlinear equations Exercises 7. Stopping, Scaling, and Testing. Scaling Stopping criteria Testing Exercises 8. Secant Methods for Systems of Nonlinear Equations. Broyden's method Local convergence analysis of Broyden's method Implementation of quasi-Newton algorithms using Broyden's update Other secant updates for nonlinear equations Exercises 9. Secant Methods for Unconstrained Minimization. The symmetric secant update of Powell Symmetric positive definite secant updates Local convergence of positive definite secant methods Implementation of quasi-Newton algorithms using the positive definite secant update Another convergence result for the positive definite secant method Other secant updates for unconstrained minimization Exercises 10. Nonlinear Least Squares. The nonlinear least-squares problem Gauss-Newton-type methods Full Newton-type methods Other considerations in solving nonlinear least-squares problems Exercises 11. Methods for Problems with Special Structure. The sparse finite-difference Newton method Sparse secant methods Deriving least-change secant updates Analyzing least-change secant methods Exercises Appendix A. A Modular System of Algorithms for Unconstrained Minimization and Nonlinear Equations (by Robert Schnabel) Appendix B. Test Problems (by Robert Schnabel) References Author Index Subject Index.

6,217 citations


"On the limited memory BFGS method f..." refers methods in this paper

  • ...…that the limited memory BFGS method (L-BFGS) is superior to the method of Buckley and LeNir. They also show that for many problems the partitioned quasi-Newton method is extremely effective and is superior to the limited memory methods. However we find that for other problems the L-BFGS method is very…...

    [...]

Journal ArticleDOI
TL;DR: An update formula which generates matrices using information from the last m iterations, where m is any number supplied by the user, and the BFGS method is considered to be the most efficient.
Abstract: We study how to use the BFGS quasi-Newton matrices to precondition minimization methods for problems where the storage is critical. We give an update formula which generates matrices using information from the last m iterations, where m is any number supplied by the user. The quasi-Newton matrix is updated at every iteration by dropping the oldest information and replacing it by the newest information. It is shown that the matrices generated have some desirable properties. The resulting algorithms are tested numerically and compared with several well-known methods. 1. Introduction. For the problem of minimizing an unconstrained function f of n variables, quasi-Newton methods are widely employed (4). They construct a sequence of matrices which in some way approximate the hessian of f (or its inverse). These matrices are symmetric; therefore, it is necessary to have n(n + 1)/2 storage locations for each one. For large dimensional problems it will not be possible to retain the matrices in the high speed storage of a computer, and one has to resort to other kinds of algorithms. For example, one could use the methods (Toint (15), Shanno (12)) which preserve the sparsity structure of the hessian, or conjugate gradient methods (CG) which only have to store 3 or 4 vectors. Recently, some CG algorithms have been developed which use a variable amount of storage and which do not require knowledge about the sparsity structure of the problem (2), (7), (8). A disadvantage of these methods is that after a certain number of iterations the quasi-Newton matrix is discarded, and the algorithm is restarted using an initial matrix (usually a diagonal matrix). We describe an algorithm which uses a limited amount of storage and where the quasi-Newton matrix is updated continuously. At every step the oldest information contained in the matrix is discarded and replaced by a new one. In this way we hope to have a more up to date model of our function. We will concentrate on the BFGS method since it is considered to be the most efficient. We believe that similar algorithms cannot be developed for the other members of the Broyden class (1). Let f be the function to be minimized, g its gradient and h its hessian. We define

2,711 citations

Frequently Asked Questions (7)
Q1. What are the contributions in this paper?

The authors study the numerical performance of a limited memory quasi-Newton method for large scale optimization, which they call the L-BFGS method. They compare its performance with that of the method developed by Buckley and LeNir, which combines cycles of BFGS steps and conjugate direction steps. The authors show that the L-BFGS method can be greatly accelerated by means of a simple scaling. They also study the convergence properties of the L-BFGS method and prove global convergence on uniformly convex problems.

For large problems scaling becomes much more important; see Beale, Griewank and Toint, and Gill and Murray. Indeed, Griewank and Toint report that a simple scaling can dramatically reduce the number of iterations of their partitioned quasi-Newton method on some problems.

This begins a new BFGS cycle. To understand some of the details of this method one must note that Powell's restart criterion is based on the fact that, when the objective function is quadratic and the line search is exact, the gradients are orthogonal.

They also show that for many problems the partitioned quasi-Newton method is extremely effective and is superior to the limited memory methods.

Since corrections are added one by one, the average number of corrections used during a BFGS cycle is well below m. Indeed, what may be particularly detrimental to the algorithm is that the first two or three iterations of each BFGS cycle use only a small amount of information.

The authors also conclude that, for large problems with inexpensive functions, the simple CG method can still be considered among the best methods available to date. Based on their experience, the authors recommend that users of the Harwell code VA15, which implements the scaled L-BFGS method, use low storage and accurate line searches when function evaluation is inexpensive, and choose m suitably and use an inaccurate line search when the function is expensive. Comparison with the partitioned quasi-Newton method: the authors then compare the performance of the L-BFGS method with that of the partitioned quasi-Newton method (PQN) of Griewank and Toint, which is also designed for solving large problems.

The authors have observed, in general, that when solving very large problems, increasing the storage m gives only a marginal improvement in performance (see the table for the L-BFGS method with scaling strategy M); Gilbert and Lemaréchal report similar results.