A SUBSPACE MINIMIZATION METHOD FOR THE TRUST-REGION STEP*
JENNIFER B. ERWAY† AND PHILIP E. GILL‡

*Research supported by the National Science Foundation grant DMS-0511766.
†Department of Mathematics, PO Box 7388, Wake Forest University, Winston-Salem, NC 27109 (erwayjb@wfu.edu).
‡Department of Mathematics, University of California, San Diego, La Jolla, CA 92093-0112 (pgill@ucsd.edu).
Abstract. We consider methods for large-scale unconstrained minimization based on finding an
approximate minimizer of a quadratic function subject to a two-norm trust-region constraint. The
Steihaug-Toint method uses the conjugate-gradient (CG) algorithm to minimize the quadratic over
a sequence of expanding subspaces until the iterates either converge to an interior point or cross the
constraint boundary. However, if the CG method is used with a preconditioner, the Steihaug-Toint
method requires that the trust-region norm be defined in terms of the preconditioning matrix. If
a different preconditioner is used for each subproblem, the shape of the trust-region can change
substantially from one subproblem to the next, which invalidates many of the assumptions on which
standard methods for adjusting the trust-region radius are based. In this paper we propose a method
that allows the trust-region norm to be defined independently of the preconditioner. The method
solves the inequality constrained trust-region subproblem over a sequence of evolving low-dimensional
subspaces. Each subspace includes an accelerator direction defined by a regularized Newton method
for satisfying the optimality conditions of a primal-dual interior method. A crucial property of
this direction is that it can be computed by applying the preconditioned CG method to a positive-
definite system in both the primal and dual variables of the trust-region subproblem. Numerical
experiments on problems from the CUTEr test collection indicate that the method can require
significantly fewer function evaluations than other methods. In addition, experiments with general-
purpose preconditioners show that it is possible to significantly reduce the number of matrix-vector
products relative to those required without preconditioning.
Key words. Large-scale unconstrained optimization, trust-region methods, conjugate-gradient
methods, Krylov methods, preconditioning.
AMS subject classifications. 49M37, 65F10, 65K05, 65K10, 90C06, 90C26, 90C30
1. Introduction. The $j$th iteration of a trust-region method for unconstrained minimization involves finding an approximate solution of the trust-region subproblem:
$$\underset{s \in \mathbb{R}^n}{\text{minimize}} \quad Q_j(s) = g_j^T s + \tfrac{1}{2} s^T H_j s \quad \text{subject to} \quad \|s\| \le \delta_j, \tag{1.1}$$
where $\delta_j$ is a given positive trust-region radius and $Q_j(s)$ is the quadratic model of a scalar-valued function with gradient $g_j$ and Hessian $H_j$. The focus of this paper is on the solution of (1.1) when the matrix $H_j$ is best accessed as an operator for the definition of matrix-vector products of the form $H_j v$.
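As a concrete illustration of this operator-only setting (the sketch is ours, not part of the paper), the model $Q_j$ can be evaluated with a single Hessian-vector product; the names `quadratic_model` and `hess_vec` are illustrative.

```python
import numpy as np

def quadratic_model(g, hess_vec, s):
    """Evaluate Q(s) = g's + (1/2) s'Hs using only a Hessian-vector product.

    g        : gradient vector, shape (n,)
    hess_vec : callable v -> H @ v (the operator access assumed in the text)
    s        : trial step, shape (n,)
    """
    Hs = hess_vec(s)                 # the only access to H is this product
    return g @ s + 0.5 * (s @ Hs)

# Tiny example with an explicit matrix standing in for the operator.
H = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, -2.0])
print(quadratic_model(g, lambda v: H @ v, np.array([0.1, 0.5])))
```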
In this context, Steihaug [22] and Toint [23] independently proposed methods for solving (1.1) when the trust region is defined in terms of the two-norm, i.e., the constraint is $\|s\|_2 \le \delta_j$. If $H_j$ is positive definite, the Newton equations $H_j s = -g_j$ define the unconstrained minimizer of (1.1). The Steihaug-Toint method begins with the application of the conjugate-gradient (CG) method to the Newton equations. This process is equivalent to minimizing $Q_j$ over a sequence of expanding subspaces generated by the conjugate-gradient directions. As long as the curvature of $Q_j$ remains positive on each of these subspaces, the CG iterates steadily increase in norm and either converge inside the trust region or form a piecewise-linear path with a unique intersection point on the trust-region boundary. When $H_j$ is not positive definite, a solution of (1.1) must lie on the boundary of the trust region and the CG method may generate a direction $p$ along which $Q_j$ has zero or negative curvature. In this case, the algorithm is terminated at the point on $p$ that intersects the boundary of the trust region.
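For reference, the following is a minimal sketch of the Steihaug-Toint iteration just described, written by us under simplifying assumptions (unpreconditioned CG, a plain residual tolerance, and a helper `_boundary_tau` of our own); it is not the authors' implementation.

```python
import numpy as np

def steihaug_toint(g, hess_vec, delta, tol=1e-8, max_iter=None):
    """Approximately solve min g's + 0.5 s'Hs  s.t. ||s||_2 <= delta by truncated CG."""
    n = g.shape[0]
    max_iter = max_iter or 2 * n
    s = np.zeros(n)
    r = -g.copy()                         # residual of H s = -g at s = 0
    p = r.copy()                          # first CG direction (steepest descent)
    if np.linalg.norm(r) < tol:
        return s
    for _ in range(max_iter):
        Hp = hess_vec(p)
        pHp = p @ Hp
        if pHp <= 0.0:
            # Zero or negative curvature detected: stop on the boundary along p.
            return s + _boundary_tau(s, p, delta) * p
        alpha = (r @ r) / pHp
        s_next = s + alpha * p
        if np.linalg.norm(s_next) >= delta:
            # The piecewise-linear CG path crosses the trust-region boundary.
            return s + _boundary_tau(s, p, delta) * p
        r_next = r - alpha * Hp
        if np.linalg.norm(r_next) < tol:
            return s_next                 # converged to an interior point
        beta = (r_next @ r_next) / (r @ r)
        p = r_next + beta * p
        s, r = s_next, r_next
    return s

def _boundary_tau(s, p, delta):
    """Positive tau with ||s + tau*p||_2 = delta (s is strictly inside the region)."""
    a, b, c = p @ p, 2.0 * (s @ p), s @ s - delta ** 2
    return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
```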
If the Steihaug-Toint method is terminated on the boundary of the trust region,
the step may bear little relation to an optimal solution of (1.1). This means that, in
contrast to line-search methods, it is not possible to choose an approximate solution
that balances the cost of computing the problem functions with the cost of computing
the trust-region step (see, e.g., [4] for more discussion of this issue). Several exten-
sions of the Steihaug-Toint method have been proposed that allow the accuracy of a
constrained solution to be specified. Gould, Lucidi, Roma, and Toint [10] proposed
the generalized Lanczos trust-region (GLTR) algorithm, which finds a constrained
minimizer of (1.1) over a sequence of expanding subspaces associated with the Lanczos process for reducing $H_j$ to tridiagonal form. Erway, Gill and Griffin [4] continue
to optimize on the trust-region boundary using the sequential subspace minimization
(SSM) method. This method approximates a constrained minimizer over a sequence of
evolving low-dimensional subspaces that do not necessarily form a nested sequence.
Erway, Gill and Griffin use a basis for each subspace that includes an accelerator
vector defined by a primal-dual augmented Lagrangian method.
These recent extensions to the Steihaug-Toint method add the ability to increase
the accuracy of the trust-region solution when needed. The result is a reliable and
efficient method for applying the CG method to large-scale optimization. However,
there are some situations where the Steihaug-Toint approach may not be efficient.
Preconditioning the conjugate-gradient method. In many applications the convergence rate of CG can be significantly improved by using a preconditioner, which is usually available in the form of a positive-definite operator $M_j^{-1}$ that clusters the eigenvalues of $M_j^{-1} H_j$. If a preconditioned CG method is used, the increasing norm property of the iterates holds only in the weighted norm $\|x\|_{M_j} = (x^T M_j x)^{1/2}$, which mandates the use of a trust region of the form $\|s\|_{M_j} \le \delta_j$. Unfortunately, if a different preconditioner is used for each trust-region subproblem, the shape of the trust region may alter dramatically from one subproblem to the next. Since a fundamental tenet of trust-region methods is that the value of $\delta_j$ be used to determine the value of $\delta_{j+1}$, the effectiveness of the trust-region strategy may be seriously compromised. We emphasize the distinction between the constant weighted trust region $\|Ns\|_2 = (s^T N^T N s)^{1/2} \le \delta_j$ typically associated with a constant nonsingular scaling matrix $N$, and the varying trust region $\|s\|_{M_j} \le \delta_j$ induced by the preconditioner.
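A small illustration (ours) of why the weighted norm matters: the same step can be feasible for one preconditioner-induced trust region and infeasible for another, even with the same radius.

```python
import numpy as np

def m_norm(s, M):
    """Weighted norm ||s||_M = sqrt(s' M s) for a symmetric positive-definite M."""
    return np.sqrt(s @ (M @ s))

s = np.array([1.0, 1.0])
delta = 1.6
M1 = np.eye(2)                      # identity: the ordinary two-norm ball
M2 = np.diag([4.0, 0.25])           # a different (diagonal) preconditioner

print(m_norm(s, M1) <= delta)       # True:  ||s||_M1 = sqrt(2)    ~ 1.41
print(m_norm(s, M2) <= delta)       # False: ||s||_M2 = sqrt(4.25) ~ 2.06
```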
Convergence to second-order points. The Steihaug-Toint method and its
extensions are first-order methods, in the sense that they are guaranteed to converge
to points that satisfy the first-order necessary conditions for optimality (i.e., g = 0). If
direct matrix factorizations are used, it is possible to approximate a global minimizer
of the trust-region subproblem and thereby guarantee convergence to points that
satisfy the second-order conditions for optimality, i.e., points at which the gradient is
zero and the Hessian is positive semidefinite (see, e.g., Moré and Sorensen [15]). We know of no method based on the conjugate-gradient method that is guaranteed to find a global solution of (1.1) in finite precision. For example, the Steihaug-Toint method is not guaranteed to compute a solution on the boundary when $Q_j$ is unbounded below. (Suppose that $H_j$ is indefinite and $Q_j(s)$ has a stationary point $\hat s$ such that $\|\hat s\| < \delta_j$. If $H_j$ is positive definite on the Krylov subspace spanned by $g_j$, $H_j g_j$, $H_j^2 g_j, \ldots$, then CG will terminate at the interior point $\hat s$.) Notwithstanding these theoretical difficulties, it seems worthwhile devising strategies that have the potential of providing convergence to a global solution in “most cases”.
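A two-dimensional instance of the parenthetical example above (our own illustration): the negative curvature of $H$ is invisible to the Krylov subspace generated by $g$, so CG stops at an interior stationary point even though $Q$ is unbounded below.

```python
import numpy as np

# H is indefinite, but g lies entirely in the subspace on which H is positive definite.
H = np.diag([1.0, -1.0])
g = np.array([1.0, 0.0])
delta = 10.0

# Krylov subspace from g: span{g, Hg, ...} = span{e1}.  CG solves Hs = -g there and
# returns the interior stationary point s_hat, with ||s_hat|| = 1 < delta.
s_hat = np.array([-1.0, 0.0])
print(np.allclose(H @ s_hat, -g), np.linalg.norm(s_hat) < delta)   # True True

# Yet Q is unbounded below along e2: Q(t e2) = -t^2/2.
t = 1.0e3
e2 = np.array([0.0, 1.0])
print(g @ (t * e2) + 0.5 * (t * e2) @ (H @ (t * e2)))              # -500000.0
```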
Efficiency for repeated constrained subproblems. When solving a difficult
problem, it is often the case that a sequence of problems of the form (1.1) must be
solved in which only the trust-region radius $\delta_j$ changes. However, the Steihaug-Toint
method is unable to exploit this information during the generation of the expanding
sequence of subspaces.
In this paper we consider an interior-point sequential subspace minimization (IP-
SSM) method that is designed to mitigate these ill-effects. (i) The method allows the
use of CG preconditioning in conjunction with a standard method for updating the
trust-region radius. (ii) The likelihood of approximating the global minimizer of (1.1)
is increased by the computation of an approximate left-most eigenpair of $H_j$ that is not based on the CG Krylov subspace. In particular, it allows the computation of a nonzero step when $g_j = 0$ and $H_j$ is indefinite. (iii) Information garnered during the solution of one subproblem may be used to expedite the solution of the next.
The IP-SSM method is a member of the class of sequential subspace minimiza-
tion (SSM) methods first proposed for the equality-constraint case by Hager [12, 13].
These methods approximate a constrained minimizer over a sequence of evolving low-
dimensional subspaces that include an “accelerator” direction designed to increase
the rate of convergence. Broadly speaking, SSM methods differ in the composition of
the basis for the subspace and in the definition of the accelerator direction. Hager
employs a subspace based on the gradient vector and the left-most eigenvector. The
accelerator direction is found by applying the CG method with a constraint precon-
ditioner to the KKT optimality conditions. (Hager’s method uses a very accurate
approximation to the left-most eigenvector and is not designed to find the low-cost
approximate solutions needed in the trust-region context.) An important property of
the IP-SSM method is that the accelerator direction is computed by applying the pre-
conditioned CG method to a regularized positive-definite system in both the primal
and dual variables of the inequality constrained problem.
Finally, we mention several Krylov-based iterative methods that are intended to find a solution of the problem of minimizing $Q_j(s) = g_j^T s + \tfrac{1}{2} s^T H_j s$ subject to the equality constraint $\|s\|_2 = \delta_j$. The methods of Sorensen [21], Rojas and Sorensen [19], Rojas, Santos and Sorensen [18], and Rendl and Wolkowicz [17] approximate the eigenvalues of a matrix obtained by augmenting $H_j$ by a row and column.
The paper is organized in four sections. In Section 2 we formulate the proposed
SSM method and consider some properties of the regularized Newton equations used
to generate the SSM accelerator direction. Section 3 includes numerical comparisons
with the Steihaug-Toint and GLTR methods on unconstrained problems from the
CUTEr test collection (see Bongartz et al. [1] and Gould, Orban and Toint [11]).
Finally, Section 4 includes some concluding remarks and observations.
1.1. Notation and Glossary. Unless explicitly indicated, $\|\cdot\|$ denotes the vector two-norm or its subordinate matrix norm. The symbol $e_i$ denotes the $i$th column of the identity matrix $I$, where the dimensions of $e_i$ and $I$ depend on the context. The eigenvalues of a real symmetric matrix $H$ are denoted by $\{\lambda_i\}$, where $\lambda_n \le \lambda_{n-1} \le \cdots \le \lambda_1$. The associated eigenvectors are denoted by $\{u_i\}$. An eigenvalue $\lambda$ and a corresponding normalized eigenvector $u$ such that $\lambda = \lambda_n$ are known as a left-most eigenpair of $H$. The Moore-Penrose pseudoinverse of a matrix $A$ is denoted by $A^\dagger$. Some sections include algorithms written in a Matlab-style pseudocode. In these algorithms, brackets are used to differentiate between computed and stored quantities. For example, the expression $[Ax] := Ax$ signifies that the matrix-vector product of $A$ with $x$ is computed and assigned to the vector $[Ax]$. Similarly, if $P$ is a matrix with columns $p_1, p_2, \ldots, p_m$, then $[AP]$ denotes the matrix with columns $[Ap_1], [Ap_2], \ldots, [Ap_m]$.
2. A SSM Method with Interior-Point Acceleration. In this section we omit the suffix $j$ and focus on a typical trust-region subproblem of the form
$$\underset{s \in \mathbb{R}^n}{\text{minimize}} \quad Q(s) = g^T s + \tfrac{1}{2} s^T H s \quad \text{subject to} \quad \|s\| \le \delta. \tag{2.1}$$
The Steihaug-Toint method and its extensions start with the unconstrained minimization of $Q$ and consider the constraint only if the unconstrained solution lies outside the trust region. In the proposed interior-point sequential subspace minimization (IP-SSM) method, the inequality constrained problem (2.1) is minimized directly over a sequence of low-dimensional subspaces, giving a sequence of reduced inequality constrained problems of the form
$$\underset{s \in \mathbb{R}^n}{\text{minimize}} \quad Q(s) \quad \text{subject to} \quad \|s\| \le \delta, \quad s \in \mathcal{S}_k = \operatorname{span}\{s_{k-1},\, z_k,\, s^a_k\}, \tag{2.2}$$
where $s_{k-1}$ is the current best estimate of the subproblem solution, $z_k$ is the current best estimate of $u_n$ (the left-most eigenvector of $H$), and $s^a_k$ is an interior-point accelerator direction. The Lanczos-CG algorithm is used to define the accelerator direction and to provide basis vectors for the low-dimensional subspaces associated with the reduced versions of the left-most eigenvalue problem.
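The sketch below (ours, not the Matlab-style pseudocode used in the paper) shows one straightforward way to carry out a single subspace minimization of the form (2.2): orthonormalize the three generators, form the reduced quadratic, and solve the resulting three-dimensional trust-region problem. The helper `solve_reduced_tr` is a simple bisection that ignores the so-called hard case.

```python
import numpy as np

def solve_reduced_tr(B, gr, delta, tol=1e-10):
    """Nearly exact solution of  min gr'y + 0.5 y'By  s.t. ||y|| <= delta  for a tiny
    dense B (here 3x3), by bisection on the shift sigma.  The 'hard case' is ignored."""
    lam_min = np.linalg.eigvalsh(B)[0]
    if lam_min > 0.0:
        y = np.linalg.solve(B, -gr)
        if np.linalg.norm(y) <= delta:
            return y                            # unconstrained minimizer is interior
    lo = max(0.0, -lam_min) + 1e-12
    hi = max(1.0, -lam_min) + 1e-12
    while np.linalg.norm(np.linalg.solve(B + hi * np.eye(len(gr)), -gr)) > delta:
        hi *= 2.0                               # bracket the boundary solution
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        y = np.linalg.solve(B + mid * np.eye(len(gr)), -gr)
        lo, hi = (mid, hi) if np.linalg.norm(y) > delta else (lo, mid)
        if hi - lo < tol:
            break
    return np.linalg.solve(B + hi * np.eye(len(gr)), -gr)

def ssm_step(g, hess_vec, delta, s_prev, z, s_acc):
    """One subspace minimization over S_k = span{s_prev, z, s_acc} (cf. (2.2))."""
    V, _ = np.linalg.qr(np.column_stack([s_prev, z, s_acc]))   # orthonormal basis
    HV = np.column_stack([hess_vec(V[:, i]) for i in range(V.shape[1])])
    B, gr = V.T @ HV, V.T @ g                    # reduced Hessian and gradient
    y = solve_reduced_tr(B, gr, delta)
    return V @ y                                 # ||V y|| = ||y|| <= delta
```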
2.1. Definition of the accelerator direction. The accelerator direction $s^a_k$ is an approximate Newton direction for the perturbed optimality conditions associated with the problem
$$\underset{s \in \mathbb{R}^n}{\text{minimize}} \quad Q(s) = g^T s + \tfrac{1}{2} s^T H s \quad \text{subject to} \quad \tfrac{1}{2}\delta^2 - \tfrac{1}{2} s^T s \ge 0. \tag{2.3}$$
This is an inequality constrained optimization problem with Lagrange multiplier $\sigma$ and Lagrangian function
$$L(s, \sigma) = Q(s) - \sigma(\tfrac{1}{2}\delta^2 - \tfrac{1}{2} s^T s) = Q(s) - \sigma c(s),$$
where $c(s)$ denotes the constraint residual $c(s) = \tfrac{1}{2}\delta^2 - \tfrac{1}{2} s^T s$. The necessary and sufficient conditions for a global solution of (2.3) imply the existence of a vector $s$ and scalar $\sigma$ such that
$$\begin{aligned} (H + \sigma I)s &= -g, && \text{with } H + \sigma I \text{ positive semidefinite},\\ c(s)\sigma &= 0, && \text{with } \sigma \ge 0 \text{ and } c(s) \ge 0. \end{aligned} \tag{2.4}$$
(For a proof, see, e.g., Gay [7], Sorensen [20], Moré and Sorensen [16], or Conn, Gould and Toint [2].) The conventional primal-dual interior-point approach to solving (2.3) is based on finding $s$ and $\sigma$ that satisfy the perturbed optimality conditions
$$\begin{aligned} (H + \sigma I)s &= -g, && \sigma > 0,\\ c(s)\sigma &= \mu, && c(s) > 0 \end{aligned} \tag{2.5}$$
for a sequence of decreasing values of the positive parameter $\mu$. Let $F(s, \sigma)$ denote the vector-valued function whose components are the residuals $(H + \sigma I)s + g$ and $c(s)\sigma - \mu$. Given an approximate zero $(s, \sigma)$ of $F$ such that $c(s) > 0$ and $\sigma > 0$, the Newton equations for the next iterate $(s + p, \sigma + q)$ are:
$$\begin{pmatrix} H + \sigma I & s \\ \sigma s^T & -c(s) \end{pmatrix} \begin{pmatrix} p \\ q \end{pmatrix} = -\begin{pmatrix} g + (H + \sigma I)s \\ \mu - c(s)\sigma \end{pmatrix}.$$
The assumption that $\sigma > 0$ implies that it is safe to divide the last equation by $\sigma$ to give the symmetrized equations:
$$\begin{pmatrix} H + \sigma I & s \\ s^T & -d \end{pmatrix} \begin{pmatrix} p \\ q \end{pmatrix} = -\begin{pmatrix} g + (H + \sigma I)s \\ d(\hat\sigma - \sigma) \end{pmatrix},$$
where $d = c(s)/\sigma$ and $\hat\sigma = \mu/c(s)$. The presence of the nonzero (2,2) block implies that the conventional interior-point approach defines a regularization of Newton's method for a solution of the optimality conditions (2.4). The regularized solution lies on the central path of solutions $\bigl(s(\mu), \sigma(\mu)\bigr)$ of (2.5) (see, e.g., [6]). This implies that the regularized solution $\bigl(s(\mu), \sigma(\mu)\bigr)$ will be different from $(s^*, \sigma^*)$ for a given nonzero $\mu$. Moreover, the influence of the regularization on the Newton equations diminishes as $\mu \to 0$.
These considerations suggest that we seek an alternative “exact” regularization that allows the use of a fixed value of $\mu$, but does not perturb the regularized solution. Consider the perturbed optimality conditions
$$\begin{aligned} (H + \sigma I)s &= -g, && \sigma > 0,\\ c(s)\sigma &= \mu(\sigma_e - \sigma), && c(s) > -\mu, \end{aligned} \tag{2.6}$$
where $\sigma_e$ is a nonnegative estimate of $\sigma^*$. If $\sigma_e = \sigma^*$, these conditions are satisfied by $(s^*, \sigma^*)$ for any positive $\mu$. The symmetrized Newton equations associated with conditions (2.6) are
$$\begin{pmatrix} H + \sigma I & s \\ s^T & -d \end{pmatrix} \begin{pmatrix} p \\ q \end{pmatrix} = -\begin{pmatrix} g + (H + \sigma I)s \\ d(\hat\sigma - \sigma) \end{pmatrix},$$
where now $d = (c(s) + \mu)/\sigma$ and $\hat\sigma = \mu\sigma_e/(c(s) + \mu)$. Forsgren, Gill and Griffin [5]
show that these equations are equivalent to the so-called doubly-augmented system:
$$\begin{pmatrix} H(\sigma) + (2/d)\,ss^T & -s \\ -s^T & d \end{pmatrix} \begin{pmatrix} p \\ q \end{pmatrix} = -\begin{pmatrix} g + H(\sigma)s - 2(\sigma - \hat\sigma)s \\ d(\sigma - \hat\sigma) \end{pmatrix},$$
where $H(\sigma) = H + \sigma I$. Finally, we multiply the last equation and last variable by $d^{-1/2}$ and $d^{1/2}$, respectively, to improve the scaling when $\sigma \approx 0$. This gives
$$\begin{pmatrix} H(\sigma) + 2\bar s\bar s^T & -\bar s \\ -\bar s^T & 1 \end{pmatrix} \begin{pmatrix} p \\ \bar q \end{pmatrix} = -\begin{pmatrix} g + H(\sigma)s - 2d^{1/2}(\sigma - \hat\sigma)\bar s \\ d^{1/2}(\sigma - \hat\sigma) \end{pmatrix}, \tag{2.7}$$
where $\bar s = d^{-1/2} s$ and $q = d^{-1/2}\bar q$. These equations are positive definite in a neighborhood of a minimizer $(s, \sigma)$ such that $\sigma \in (-\lambda_n, \infty)$, and they may be solved using the CG method. If a direction of negative or zero curvature is detected, the direction is used to update a lower bound on the best estimate of $\sigma$ (see Section 2.2).
It is not necessary to solve the perturbed equations (2.6) to high accuracy because
the quality of the accelerator step affects only the rate of convergence of the SSM
method. In the runs described in Section 3 only one Newton iteration was performed.
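To make the accelerator computation concrete, the following sketch (ours) assembles the scaled system using our reconstruction of (2.7) above, with a dense $H$ and SciPy's generic `cg` routine in place of the authors' preconditioned Lanczos-CG; sign conventions in the published paper may differ from this reconstruction by the sign of the dual variable.

```python
import numpy as np
from scipy.sparse.linalg import cg

def ip_accelerator_step(H, g, s, sigma, sigma_e, mu, delta):
    """One regularized Newton (accelerator) step from (s, sigma), following our
    reconstruction of the scaled doubly-augmented system (2.7).  Assumes sigma > 0,
    c(s) > -mu, and sigma > -lambda_min(H) so that the system is positive definite;
    a dense H is used purely for illustration."""
    n = len(g)
    c = 0.5 * delta ** 2 - 0.5 * (s @ s)            # constraint residual c(s)
    d = (c + mu) / sigma                            # regularized (2,2) quantity
    sigma_hat = mu * sigma_e / (c + mu)
    s_bar = s / np.sqrt(d)
    Hsig = H + sigma * np.eye(n)

    # Scaled doubly-augmented matrix and right-hand side (cf. (2.7)).
    K = np.block([[Hsig + 2.0 * np.outer(s_bar, s_bar), -s_bar[:, None]],
                  [-s_bar[None, :],                      np.array([[1.0]])]])
    rhs = -np.concatenate(
        [g + Hsig @ s - 2.0 * np.sqrt(d) * (sigma - sigma_hat) * s_bar,
         [np.sqrt(d) * (sigma - sigma_hat)]])

    x, info = cg(K, rhs)                            # plain CG; preconditioning omitted
    p, q_bar = x[:n], x[n]
    q = q_bar / np.sqrt(d)                          # recover the multiplier step
    return s + p, sigma + q, info
```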
