# On the limited memory BFGS method for large scale optimization

## Summary (3 min read)

### Preliminaries

- The method of Buckley and LeNir combines cycles of BFGS and conjugate gradient steps.
- It starts by performing the usual BFGS method, but stores the corrections to the initial matrix separately to avoid using $O(n^2)$ storage.
- The matrices $H_k$ are not formed explicitly; instead the $m$ previous values of $y_j$ and $s_j$ are stored separately.
- The partitioned quasi-Newton method (PQN) requires that the user supply detailed information about the objective function, and is particularly effective if the correct range of the Hessian of each element function is known.
- When the number of variables is very large (in the hundreds or thousands), the computational effort of the iteration sometimes dominates the cost of evaluating the function and gradient.
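Since only the pairs $(s_j, y_j)$ are stored, the product $H_k g_k$ is computed implicitly. A minimal sketch of the standard two-loop recursion for this product (variable names are mine; the $\gamma$ scaling shown is one common choice of initial matrix, not necessarily the paper's):

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """Compute d = -H_k g via the L-BFGS two-loop recursion.

    s_list, y_list hold the m most recent correction pairs
    s_j = x_{j+1} - x_j and y_j = g_{j+1} - g_j (oldest first),
    so the matrix H_k is never formed explicitly.
    """
    q = g.copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: newest pair to oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * np.dot(s, q)
        alphas.append(alpha)
        q -= alpha * y
    # Initial matrix H_k^0 = gamma * I (a common dynamic scaling).
    if s_list:
        gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q
    # Second loop: oldest pair to newest.
    for s, y, rho, alpha in zip(s_list, y_list, rhos, reversed(alphas)):
        beta = rho * np.dot(y, r)
        r += (alpha - beta) * s
    return -r
```

With a single stored pair, the resulting $H_k$ satisfies the secant equation $H_k y_k = s_k$, which gives a quick sanity check of the recursion.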

### Table: Set of test problems

- The problems and the starting points used for them are described in Liu and Nocedal. All the runs reported in this paper were terminated when $\|g_k\| \le 10^{-5} \max(1, \|x_k\|)$, where $\|\cdot\|$ denotes the Euclidean norm.
- The authors require low accuracy in the solution because this is common in practical applications.
- Since the authors have performed a very large number of tests they describe the results fully in an accompanying report Liu and Nocedal.
- The authors note that all the comments and conclusions made in this paper are based on data presented here and in the accompanying report. Comparison with the method of Buckley and LeNir: when $m = 1$ the method of Buckley and LeNir reduces to Shanno's method, and when $m = \infty$ both methods are identical to the BFGS method.
- For a given value of m the two methods require roughly the same amount of storage, but the L-BFGS method requires slightly less arithmetic work per iteration than the B&L method as implemented by Buckley and LeNir.
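The low-accuracy stopping test used in these experiments can be sketched as follows (the $10^{-5}$ tolerance and the function name are illustrative choices of mine):

```python
import numpy as np

def converged(g, x, tol=1e-5):
    """Low-accuracy stopping test of the form used in the experiments:
    terminate when ||g_k|| <= tol * max(1, ||x_k||), Euclidean norm.
    The max(1, ||x||) factor makes the test relative for large iterates
    while staying absolute near the origin."""
    return np.linalg.norm(g) <= tol * max(1.0, np.linalg.norm(x))
```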

### Table: Storage locations

- The tests described below were made on a SUN workstation in double precision arithmetic, for which the unit roundoff is approximately $10^{-16}$.
- For each run the authors verified that both methods converged to the same solution point.
- The authors therefore consider the number of iterations and the total amount of time required by the two limited memory methods.

### L-BFGS

- For most problems the number of iterations is markedly reduced (compare the corresponding tables). This implementation of the L-BFGS method is then compared with the method of Buckley and LeNir, using total CPU time as the measure for simplicity.
- Furthermore, an examination of the results given in Liu and Nocedal shows that the differences are very substantial in many cases.

### Scaling the L-BFGS method

- It is known that simple scalings of the variables can improve the performance of quasi Newton methods on small problems.
- The authors' numerical experience appears to indicate that these two scalings are comparable in efficiency, and therefore M should be preferred since it is less expensive to implement.
- In their tests this formula sometimes performed well, but was very inefficient on many problems.
- The authors have observed that, in general, when solving very large problems, increasing the storage m beyond a moderate value gives only a marginal improvement of performance; Gilbert and Lemaréchal report similar results.
- The authors tested three methods: the algorithm CONMIN developed by Shanno and Phua; the conjugate gradient method (CG) using the Polak–Ribière formula (see, for example, Powell), restarting every n steps; and the L-BFGS method with scaling M, for which they tried both accurate and inaccurate line searches.
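Dynamic scalings of this kind typically set the initial matrix to $\gamma_k I$ with $\gamma_k = s_k^T y_k / y_k^T y_k$. A small numerical check (a sketch; the quadratic model is my own example) of the standard fact that this $\gamma_k$ always lies within the spectrum of the inverse Hessian for a quadratic:

```python
import numpy as np

rng = np.random.default_rng(0)
# Quadratic model f(x) = 0.5 x^T A x with eigenvalues between 1 and 100.
A = np.diag(np.linspace(1.0, 100.0, 5))
s = rng.standard_normal(5)
y = A @ s                   # for a quadratic, y_k = A s_k exactly
gamma = s @ y / (y @ y)     # scaling factor gamma_k = s^T y / y^T y
# gamma_k lies between 1/lambda_max and 1/lambda_min of A,
# so gamma_k * I is a crude but cheap estimate of A^{-1}.
assert 1.0 / 100.0 <= gamma <= 1.0
```

The scaling thus costs only two dot products per iteration, which is why a cheap strategy of this form is preferred over more elaborate ones of comparable quality.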

### Table: CONMIN, CG and L-BFGS methods

- The next two tables summarize the preceding results.
- In these two problems the PQN method is vastly superior to the L-BFGS method in terms of function evaluations.
- The number of variables entering into the element functions is nve, and nve-vr is the number obtained after applying variable reduction. Using these results, the authors give the average time required to perform an iteration (it-time).
- For the PQN method the authors used the results corresponding to B = I; recall that the L-BFGS method used scaling M. (Table: separability of the objective functions and average iteration time.)

### Convergence Analysis

- In this section the authors show that the limited memory BFGS method is globally convergent on uniformly convex problems and that its rate of convergence is R-linear.
- These results are easy to establish after noting that all Hessian approximations $H_k$ are obtained by updating a bounded matrix $m$ times using the BFGS formula.
- The authors make the following assumptions about the objective function.
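These assumptions take the standard form for such analyses; a reconstruction (the constant names $M_1$, $M_2$ are generic, not necessarily the paper's):

```latex
% Assumptions: f is twice continuously differentiable, and the level set
% D = { x : f(x) <= f(x_1) } is convex. There exist positive constants
% M_1 and M_2 such that, for all x in D and all z in R^n,
M_1 \|z\|^2 \;\le\; z^{T} G(x)\, z \;\le\; M_2 \|z\|^2 ,
% where G(x) denotes the Hessian of f. The upper bound keeps the updates
% bounded; the lower bound (uniform convexity) drives R-linear convergence.
```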

### Assumptions

- Therefore the authors conclude that there is a constant $\delta > 0$ such that $\cos\theta_k = \dfrac{s_k^T B_k s_k}{\|s_k\|\,\|B_k s_k\|} \ge \delta$.
- It is possible to prove this result for several other line search strategies, including backtracking, by adapting the arguments of Byrd and Nocedal (see the proof of their theorem); note that the bounds established there imply $M_k \le M$.

### Final Remarks

- The authors' tests indicate that a simple implementation of the L-BFGS method performs better than the code of Buckley and LeNir, and that the L-BFGS method can be greatly improved by means of a simple dynamic scaling such as M.
- The partitioned quasi-Newton method is highly recommended if the user is able and willing to supply the information on the objective function that the method requires; it is particularly effective when the element functions depend on a small number of variables.
- The L-BFGS method is appealing for several reasons: it is very simple to implement; it requires only function and gradient values, and no other information on the problem; and it can be faster than the partitioned quasi-Newton method on problems where the element functions depend on many variables.
- The authors would like to thank Andreas Griewank and Claude Lemaréchal for several helpful conversations, and Richard Byrd for suggesting the scaling used in method M.
- The authors are grateful to Jorge Moré, who encouraged them to pursue this investigation and made many valuable suggestions, and to the three referees for their helpful comments.


##### Frequently Asked Questions (7)

###### Q2. How can the L BFGS method be used to reduce the number of iterations?

For large problems scaling becomes much more important; see Beale, Griewank and Toint, and Gill and Murray. Indeed, Griewank and Toint report that a simple scaling can dramatically reduce the number of iterations of their partitioned quasi-Newton method on some problems.

###### Q3. What is the criterion for a restart?

This begins a new BFGS cycle. To understand some of the details of this method, one must note that Powell's restart criterion is based on the fact that when the objective function is quadratic and the line search is exact, the gradients are orthogonal.
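Powell's criterion therefore triggers a restart when successive gradients stray too far from orthogonality; a minimal sketch (the 0.2 threshold is the value customarily attributed to Powell, and the function name is mine):

```python
import numpy as np

def powell_restart(g_new, g_old, threshold=0.2):
    """Restart test: for a quadratic with exact line searches,
    successive gradients are orthogonal, so a value of
    |g_k^T g_{k-1}| that is large relative to ||g_k||^2
    signals lost conjugacy and triggers a restart."""
    return abs(np.dot(g_new, g_old)) >= threshold * np.dot(g_new, g_new)
```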

###### Q4. What is the main appeal of limited memory methods?

They also show that for many problems the partitioned quasi-Newton method is extremely effective and is superior to the limited memory methods.

###### Q5. how many corrections are used during the BFGS cycle?

The average number of corrections used during the BFGS cycle is less than m, since corrections are added one by one. Indeed, what may be particularly detrimental to the algorithm is that the first two or three iterations of the BFGS cycle use a small amount of information.

###### Q6. What is the method for calculating function calls?

The authors also conclude that for large problems with inexpensive functions the simple CG method can still be considered among the best methods available to date. Based on their experience, the authors recommend that users of the Harwell code VA15, which implements the L-BFGS method with scaling M, use low storage and accurate line searches when function evaluation is inexpensive, and use more storage and an inaccurate line search when the function is expensive. Comparison with the partitioned quasi-Newton method: the authors compare the performance of the L-BFGS method with that of the partitioned quasi-Newton method (PQN) of Griewank and Toint, which is also designed for solving large problems.

###### Q7. How does the L BFGS method perform?

The authors have observed that, in general, when solving very large problems, increasing the storage m beyond a moderate value gives only a marginal improvement of performance; Gilbert and Lemaréchal report similar results.