A SUBSPACE MINIMIZATION METHOD FOR THE TRUST-REGION STEP*
JENNIFER B. ERWAY† AND PHILIP E. GILL‡

*Research supported by the National Science Foundation grant DMS-0511766.
†Department of Mathematics, PO Box 7388, Wake Forest University, Winston-Salem, NC 27109 (erwayjb@wfu.edu).
‡Department of Mathematics, University of California, San Diego, La Jolla, CA 92093-0112 (pgill@ucsd.edu).
Abstract. We consider methods for large-scale unconstrained minimization based on finding an
approximate minimizer of a quadratic function subject to a two-norm trust-region constraint. The
Steihaug-Toint method uses the conjugate-gradient (CG) algorithm to minimize the quadratic over
a sequence of expanding subspaces until the iterates either converge to an interior point or cross the
constraint boundary. However, if the CG method is used with a preconditioner, the Steihaug-Toint
method requires that the trust-region norm be defined in terms of the preconditioning matrix. If
a different preconditioner is used for each subproblem, the shape of the trust-region can change
substantially from one subproblem to the next, which invalidates many of the assumptions on which
standard methods for adjusting the trust-region radius are based. In this paper we propose a method
that allows the trust-region norm to be defined independently of the preconditioner. The method
solves the inequality constrained trust-region subproblem over a sequence of evolving low-dimensional
subspaces. Each subspace includes an accelerator direction defined by a regularized Newton method
for satisfying the optimality conditions of a primal-dual interior method. A crucial property of
this direction is that it can be computed by applying the preconditioned CG method to a positive-
definite system in both the primal and dual variables of the trust-region subproblem. Numerical
experiments on problems from the CUTEr test collection indicate that the method can require
significantly fewer function evaluations than other methods. In addition, experiments with general-
purpose preconditioners show that it is possible to significantly reduce the number of matrix-vector
products relative to those required without preconditioning.
Key words. Large-scale unconstrained optimization, trust-region methods, conjugate-gradient
methods, Krylov methods, preconditioning.
AMS subject classifications. 49M37, 65F10, 65K05, 65K10, 90C06, 90C26, 90C30
1. Introduction. The $j$th iteration of a trust-region method for unconstrained minimization involves finding an approximate solution of the trust-region subproblem:
$$\underset{s \in \mathbb{R}^n}{\text{minimize}} \quad Q_j(s) = g_j^T s + \tfrac{1}{2} s^T H_j s \quad \text{subject to} \quad \|s\| \le \delta_j, \tag{1.1}$$
where $\delta_j$ is a given positive trust-region radius and $Q_j(s)$ is the quadratic model of a scalar-valued function with gradient $g_j$ and Hessian $H_j$. The focus of this paper is on the solution of (1.1) when the matrix $H_j$ is best accessed as an operator for the definition of matrix-vector products of the form $H_j v$.
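As a concrete illustration of this operator-only setting (the sketch is ours, not part of the paper), the model $Q_j$ can be evaluated with a single Hessian-vector product; the names `quadratic_model` and `hess_vec` are illustrative.

```python
import numpy as np

def quadratic_model(g, hess_vec, s):
    """Evaluate Q(s) = g's + (1/2) s'Hs using only a Hessian-vector product.

    g        : gradient vector, shape (n,)
    hess_vec : callable v -> H @ v (the operator access assumed in the text)
    s        : trial step, shape (n,)
    """
    Hs = hess_vec(s)                 # the only access to H is this product
    return g @ s + 0.5 * (s @ Hs)

# Tiny example with an explicit matrix standing in for the operator.
H = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, -2.0])
print(quadratic_model(g, lambda v: H @ v, np.array([0.1, 0.5])))
```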
In this context, Steihaug [22] and Toint [23] independently proposed methods for solving (1.1) when the trust region is defined in terms of the two-norm, i.e., the constraint is $\|s\|_2 \le \delta_j$. If $H_j$ is positive definite, the Newton equations $H_j s = -g_j$ define the unconstrained minimizer of (1.1). The Steihaug-Toint method begins with the application of the conjugate-gradient (CG) method to the Newton equations. This process is equivalent to minimizing $Q_j$ over a sequence of expanding subspaces generated by the conjugate-gradient directions. As long as the curvature of $Q_j$ remains positive on each of these subspaces, the CG iterates steadily increase in norm and either converge inside the trust region or form a piecewise-linear path with a unique intersection point on the trust-region boundary. When $H_j$ is not positive definite, a solution of (1.1) must lie on the boundary of the trust region and the CG method may generate a direction $p$ along which $Q_j$ has zero or negative curvature. In this case, the algorithm is terminated at the point on $p$ that intersects the boundary of the trust region.
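For reference, the following is a minimal sketch of the Steihaug-Toint iteration just described, written by us under simplifying assumptions (unpreconditioned CG, a plain residual tolerance, and a helper `_boundary_tau` of our own); it is not the authors' implementation.

```python
import numpy as np

def steihaug_toint(g, hess_vec, delta, tol=1e-8, max_iter=None):
    """Approximately solve min g's + 0.5 s'Hs  s.t. ||s||_2 <= delta by truncated CG."""
    n = g.shape[0]
    max_iter = max_iter or 2 * n
    s = np.zeros(n)
    r = -g.copy()                         # residual of H s = -g at s = 0
    p = r.copy()                          # first CG direction (steepest descent)
    if np.linalg.norm(r) < tol:
        return s
    for _ in range(max_iter):
        Hp = hess_vec(p)
        pHp = p @ Hp
        if pHp <= 0.0:
            # Zero or negative curvature detected: stop on the boundary along p.
            return s + _boundary_tau(s, p, delta) * p
        alpha = (r @ r) / pHp
        s_next = s + alpha * p
        if np.linalg.norm(s_next) >= delta:
            # The piecewise-linear CG path crosses the trust-region boundary.
            return s + _boundary_tau(s, p, delta) * p
        r_next = r - alpha * Hp
        if np.linalg.norm(r_next) < tol:
            return s_next                 # converged to an interior point
        beta = (r_next @ r_next) / (r @ r)
        p = r_next + beta * p
        s, r = s_next, r_next
    return s

def _boundary_tau(s, p, delta):
    """Positive tau with ||s + tau*p||_2 = delta (s is strictly inside the region)."""
    a, b, c = p @ p, 2.0 * (s @ p), s @ s - delta ** 2
    return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
```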
If the Steihaug-Toint method is terminated on the boundary of the trust region,
the step may bear little relation to an optimal solution of (1.1). This means that, in
contrast to line-search methods, it is not possible to choose an approximate solution
that balances the cost of computing the problem functions with the cost of computing
the trust-region step (see, e.g., [4] for more discussion of this issue). Several exten-
sions of the Steihaug-Toint method have been proposed that allow the accuracy of a
constrained solution to be specified. Gould, Lucidi, Roma, and Toint [10] proposed
the generalized Lanczos trust-region (GLTR) algorithm, which finds a constrained
minimizer of (1.1) over a sequence of expanding subspaces associated with the Lanczos process for reducing $H_j$ to tridiagonal form. Erway, Gill and Griffin [4] continue
to optimize on the trust-region boundary using the sequential subspace minimization
(SSM) method. This method approximates a constrained minimizer over a sequence of
evolving low-dimensional subspaces that do not necessarily form a nested sequence.
Erway, Gill and Griffin use a basis for each subspace that includes an accelerator
vector defined by a primal-dual augmented Lagrangian method.
These recent extensions to the Steihaug-Toint method add the ability to increase
the accuracy of the trust-region solution when needed. The result is a reliable and
efficient method for applying the CG method to large-scale optimization. However,
there are some situations where the Steihaug-Toint approach may not be efficient.
Preconditioning the conjugate-gradient method. In many applications the convergence rate of CG can be significantly improved by using a preconditioner, which is usually available in the form of a positive-definite operator $M_j^{-1}$ that clusters the eigenvalues of $M_j^{-1} H_j$. If a preconditioned CG method is used, the increasing norm property of the iterates holds only in the weighted norm $\|x\|_{M_j} = (x^T M_j x)^{1/2}$, which mandates the use of a trust region of the form $\|s\|_{M_j} \le \delta_j$. Unfortunately, if a different preconditioner is used for each trust-region subproblem, the shape of the trust region may alter dramatically from one subproblem to the next. Since a fundamental tenet of trust-region methods is that the value of $\delta_j$ be used to determine the value of $\delta_{j+1}$, the effectiveness of the trust-region strategy may be seriously compromised. We emphasize the distinction between the constant weighted trust region $\|Ns\|_2 = (s^T N^T N s)^{1/2} \le \delta_j$ typically associated with a constant nonsingular scaling matrix $N$, and the varying trust region $\|s\|_{M_j} \le \delta_j$ induced by the preconditioner.
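A small illustration (ours) of why the weighted norm matters: the same step can be feasible for one preconditioner-induced trust region and infeasible for another, even with the same radius.

```python
import numpy as np

def m_norm(s, M):
    """Weighted norm ||s||_M = sqrt(s' M s) for a symmetric positive-definite M."""
    return np.sqrt(s @ (M @ s))

s = np.array([1.0, 1.0])
delta = 1.6
M1 = np.eye(2)                      # identity: the ordinary two-norm ball
M2 = np.diag([4.0, 0.25])           # a different (diagonal) preconditioner

print(m_norm(s, M1) <= delta)       # True:  ||s||_M1 = sqrt(2)    ~ 1.41
print(m_norm(s, M2) <= delta)       # False: ||s||_M2 = sqrt(4.25) ~ 2.06
```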
Convergence to second-order points. The Steihaug-Toint method and its
extensions are first-order methods, in the sense that they are guaranteed to converge
to points that satisfy the first-order necessary conditions for optimality (i.e., g = 0). If
direct matrix factorizations are used, it is possible to approximate a global minimizer
of the trust-region subproblem and thereby guarantee convergence to points that
satisfy the second-order conditions for optimality, i.e., points at which the gradient is
zero and the Hessian is positive semidefinite (see, e.g., Moré and Sorensen [15]). We know of no method based on the conjugate-gradient method that is guaranteed to find a global solution of (1.1) in finite precision. For example, the Steihaug-Toint method is not guaranteed to compute a solution on the boundary when $Q_j$ is unbounded below. (Suppose that $H_j$ is indefinite and $Q_j(s)$ has a stationary point $\hat s$ such that $\|\hat s\| < \delta_j$. If $H_j$ is positive definite on the Krylov subspace spanned by $g_j$, $H_j g_j$, $H_j^2 g_j, \ldots$, then CG will terminate at the interior point $\hat s$.) Notwithstanding these theoretical difficulties, it seems worthwhile devising strategies that have the potential of providing convergence to a global solution in “most cases”.
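A two-dimensional instance of the parenthetical example above (our own illustration): the negative curvature of $H$ is invisible to the Krylov subspace generated by $g$, so CG stops at an interior stationary point even though $Q$ is unbounded below.

```python
import numpy as np

# H is indefinite, but g lies entirely in the subspace on which H is positive definite.
H = np.diag([1.0, -1.0])
g = np.array([1.0, 0.0])
delta = 10.0

# Krylov subspace from g: span{g, Hg, ...} = span{e1}.  CG solves Hs = -g there and
# returns the interior stationary point s_hat, with ||s_hat|| = 1 < delta.
s_hat = np.array([-1.0, 0.0])
print(np.allclose(H @ s_hat, -g), np.linalg.norm(s_hat) < delta)   # True True

# Yet Q is unbounded below along e2: Q(t e2) = -t^2/2.
t = 1.0e3
e2 = np.array([0.0, 1.0])
print(g @ (t * e2) + 0.5 * (t * e2) @ (H @ (t * e2)))              # -500000.0
```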
Efficiency for repeated constrained subproblems. When solving a difficult
problem, it is often the case that a sequence of problems of the form (1.1) must be
solved in which only the trust-region radius $\delta_j$ changes. However, the Steihaug-Toint
method is unable to exploit this information during the generation of the expanding
sequence of subspaces.
In this paper we consider an interior-point sequential subspace minimization (IP-
SSM) method that is designed to mitigate these ill-effects. (i) The method allows the
use of CG preconditioning in conjunction with a standard method for updating the
trust-region radius. (ii) The likelihood of approximating the global minimizer of (1.1)
is increased by the computation of an approximate left-most eigenpair of $H_j$ that is not based on the CG Krylov subspace. In particular, it allows the computation of a nonzero step when $g_j = 0$ and $H_j$ is indefinite. (iii) Information garnered during the solution of one subproblem may be used to expedite the solution of the next.
The IP-SSM method is a member of the class of sequential subspace minimiza-
tion (SSM) methods first proposed for the equality-constraint case by Hager [12, 13].
These methods approximate a constrained minimizer over a sequence of evolving low-
dimensional subspaces that include an “accelerator” direction designed to increase
the rate of convergence. Broadly speaking, SSM methods differ in the composition of
the basis for the subspace and in the definition of the accelerator direction. Hager
employs a subspace based on the gradient vector and the left-most eigenvector. The
accelerator direction is found by applying the CG method with a constraint precon-
ditioner to the KKT optimality conditions. (Hager’s method uses a very accurate
approximation to the left-most eigenvector and is not designed to find the low-cost
approximate solutions needed in the trust-region context.) An important property of
the IP-SSM method is that the accelerator direction is computed by applying the pre-
conditioned CG method to a regularized positive-definite system in both the primal
and dual variables of the inequality constrained problem.
Finally, we mention several Krylov-based iterative methods that are intended to find a solution of the problem of minimizing $Q_j(s) = g_j^T s + \tfrac{1}{2} s^T H_j s$ subject to the equality constraint $\|s\|_2 = \delta_j$. The methods of Sorensen [21], Rojas and Sorensen [19], Rojas, Santos and Sorensen [18], and Rendl and Wolkowicz [17] approximate the eigenvalues of a matrix obtained by augmenting $H_j$ by a row and column.
The paper is organized in four sections. In Section 2 we formulate the proposed
SSM method and consider some properties of the regularized Newton equations used
to generate the SSM accelerator direction. Section 3 includes numerical comparisons
with the Steihaug-Toint and GLTR methods on unconstrained problems from the
CUTEr test collection (see Bongartz et al. [1] and Gould, Orban and Toint [11]).
Finally, Section 4 includes some concluding remarks and observations.
1.1. Notation and Glossary. Unless explicitly indicated, $\|\cdot\|$ denotes the vector two-norm or its subordinate matrix norm. The symbol $e_i$ denotes the $i$th column of the identity matrix $I$, where the dimensions of $e_i$ and $I$ depend on the context. The eigenvalues of a real symmetric matrix $H$ are denoted by $\{\lambda_i\}$, where $\lambda_n \le \lambda_{n-1} \le \cdots \le \lambda_1$. The associated eigenvectors are denoted by $\{u_i\}$. An eigenvalue $\lambda$ and a corresponding normalized eigenvector $u$ such that $\lambda = \lambda_n$ are known as a left-most eigenpair of $H$. The Moore-Penrose pseudoinverse of a matrix $A$ is denoted by $A^\dagger$. Some sections include algorithms written in a Matlab-style pseudocode. In these algorithms, brackets are used to differentiate between computed and stored quantities. For example, the expression $[Ax] := Ax$ signifies that the matrix-vector product of $A$ with $x$ is computed and assigned to the vector $[Ax]$. Similarly, if $P$ is a matrix with columns $p_1, p_2, \ldots, p_m$, then $[AP]$ denotes the matrix with columns $[Ap_1], [Ap_2], \ldots, [Ap_m]$.
2. A SSM Method with Interior-Point Acceleration. In this section we omit the suffix $j$ and focus on a typical trust-region subproblem of the form
$$\underset{s \in \mathbb{R}^n}{\text{minimize}} \quad Q(s) = g^T s + \tfrac{1}{2} s^T H s \quad \text{subject to} \quad \|s\| \le \delta. \tag{2.1}$$
The Steihaug-Toint method and its extensions start with the unconstrained minimization of $Q$ and consider the constraint only if the unconstrained solution lies outside the trust region. In the proposed interior-point sequential subspace minimization (IP-SSM) method, the inequality constrained problem (2.1) is minimized directly over a sequence of low-dimensional subspaces, giving a sequence of reduced inequality constrained problems of the form
$$\underset{s \in \mathbb{R}^n}{\text{minimize}} \quad Q(s) \quad \text{subject to} \quad \|s\| \le \delta, \quad s \in \mathcal{S}_k = \operatorname{span}\{s_{k-1},\, z_k,\, s^a_k\}, \tag{2.2}$$
where $s_{k-1}$ is the current best estimate of the subproblem solution, $z_k$ is the current best estimate of $u_n$ (the left-most eigenvector of $H$), and $s^a_k$ is an interior-point accelerator direction. The Lanczos-CG algorithm is used to define the accelerator direction and to provide basis vectors for the low-dimensional subspaces associated with the reduced versions of the left-most eigenvalue problem.
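The sketch below (ours, not the Matlab-style pseudocode used in the paper) shows one straightforward way to carry out a single subspace minimization of the form (2.2): orthonormalize the three generators, form the reduced quadratic, and solve the resulting three-dimensional trust-region problem. The helper `solve_reduced_tr` is a simple bisection that ignores the so-called hard case.

```python
import numpy as np

def solve_reduced_tr(B, gr, delta, tol=1e-10):
    """Nearly exact solution of  min gr'y + 0.5 y'By  s.t. ||y|| <= delta  for a tiny
    dense B (here 3x3), by bisection on the shift sigma.  The 'hard case' is ignored."""
    lam_min = np.linalg.eigvalsh(B)[0]
    if lam_min > 0.0:
        y = np.linalg.solve(B, -gr)
        if np.linalg.norm(y) <= delta:
            return y                            # unconstrained minimizer is interior
    lo = max(0.0, -lam_min) + 1e-12
    hi = max(1.0, -lam_min) + 1e-12
    while np.linalg.norm(np.linalg.solve(B + hi * np.eye(len(gr)), -gr)) > delta:
        hi *= 2.0                               # bracket the boundary solution
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        y = np.linalg.solve(B + mid * np.eye(len(gr)), -gr)
        lo, hi = (mid, hi) if np.linalg.norm(y) > delta else (lo, mid)
        if hi - lo < tol:
            break
    return np.linalg.solve(B + hi * np.eye(len(gr)), -gr)

def ssm_step(g, hess_vec, delta, s_prev, z, s_acc):
    """One subspace minimization over S_k = span{s_prev, z, s_acc} (cf. (2.2))."""
    V, _ = np.linalg.qr(np.column_stack([s_prev, z, s_acc]))   # orthonormal basis
    HV = np.column_stack([hess_vec(V[:, i]) for i in range(V.shape[1])])
    B, gr = V.T @ HV, V.T @ g                    # reduced Hessian and gradient
    y = solve_reduced_tr(B, gr, delta)
    return V @ y                                 # ||V y|| = ||y|| <= delta
```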
2.1. Definition of the accelerator direction. The accelerator direction $s^a_k$ is an approximate Newton direction for the perturbed optimality conditions associated with the problem
$$\underset{s \in \mathbb{R}^n}{\text{minimize}} \quad Q(s) = g^T s + \tfrac{1}{2} s^T H s \quad \text{subject to} \quad \tfrac{1}{2}\delta^2 - \tfrac{1}{2} s^T s \ge 0. \tag{2.3}$$
This is an inequality constrained optimization problem with Lagrange multiplier $\sigma$ and Lagrangian function
$$L(s, \sigma) = Q(s) - \sigma(\tfrac{1}{2}\delta^2 - \tfrac{1}{2} s^T s) = Q(s) - \sigma c(s),$$
where $c(s)$ denotes the constraint residual $c(s) = \tfrac{1}{2}\delta^2 - \tfrac{1}{2} s^T s$. The necessary and sufficient conditions for a global solution of (2.3) imply the existence of a vector $s$ and scalar $\sigma$ such that
$$\begin{aligned} (H + \sigma I)s &= -g, && \text{with } H + \sigma I \text{ positive semidefinite},\\ c(s)\sigma &= 0, && \text{with } \sigma \ge 0 \text{ and } c(s) \ge 0. \end{aligned} \tag{2.4}$$
(For a proof, see, e.g., Gay [7], Sorensen [20], Moré and Sorensen [16], or Conn, Gould and Toint [2].) The conventional primal-dual interior-point approach to solving (2.3) is based on finding $s$ and $\sigma$ that satisfy the perturbed optimality conditions
$$\begin{aligned} (H + \sigma I)s &= -g, && \sigma > 0,\\ c(s)\sigma &= \mu, && c(s) > 0 \end{aligned} \tag{2.5}$$
for a sequence of decreasing values of the positive parameter $\mu$. Let $F(s, \sigma)$ denote the vector-valued function whose components are the residuals $(H + \sigma I)s + g$ and $c(s)\sigma - \mu$. Given an approximate zero $(s, \sigma)$ of $F$ such that $c(s) > 0$ and $\sigma > 0$, the Newton equations for the next iterate $(s + p, \sigma + q)$ are:
$$\begin{pmatrix} H + \sigma I & s \\ \sigma s^T & -c(s) \end{pmatrix} \begin{pmatrix} p \\ q \end{pmatrix} = -\begin{pmatrix} g + (H + \sigma I)s \\ \mu - c(s)\sigma \end{pmatrix}.$$
The assumption that $\sigma > 0$ implies that it is safe to divide the last equation by $\sigma$ to give the symmetrized equations:
$$\begin{pmatrix} H + \sigma I & s \\ s^T & -d \end{pmatrix} \begin{pmatrix} p \\ q \end{pmatrix} = -\begin{pmatrix} g + (H + \sigma I)s \\ d(\hat\sigma - \sigma) \end{pmatrix},$$
where $d = c(s)/\sigma$ and $\hat\sigma = \mu/c(s)$. The presence of the nonzero (2,2) block implies that the conventional interior-point approach defines a regularization of Newton's method for a solution of the optimality conditions (2.4). The regularized solution lies on the central path of solutions $\bigl(s(\mu), \sigma(\mu)\bigr)$ of (2.5) (see, e.g., [6]). This implies that the regularized solution $\bigl(s(\mu), \sigma(\mu)\bigr)$ will be different from $(s^*, \sigma^*)$ for a given nonzero $\mu$. Moreover, the influence of the regularization on the Newton equations diminishes as $\mu \to 0$.
These considerations suggest that we seek an alternative “exact” regularization that allows the use of a fixed value of $\mu$, but does not perturb the regularized solution. Consider the perturbed optimality conditions
$$\begin{aligned} (H + \sigma I)s &= -g, && \sigma > 0,\\ c(s)\sigma &= \mu(\sigma_e - \sigma), && c(s) > -\mu, \end{aligned} \tag{2.6}$$
where $\sigma_e$ is a nonnegative estimate of $\sigma^*$. If $\sigma_e = \sigma^*$, these conditions are satisfied by $(s^*, \sigma^*)$ for any positive $\mu$. The symmetrized Newton equations associated with conditions (2.6) are
$$\begin{pmatrix} H + \sigma I & s \\ s^T & -d \end{pmatrix} \begin{pmatrix} p \\ q \end{pmatrix} = -\begin{pmatrix} g + (H + \sigma I)s \\ d(\hat\sigma - \sigma) \end{pmatrix},$$
where now $d = (c(s) + \mu)/\sigma$ and $\hat\sigma = \mu\sigma_e/(c(s) + \mu)$. Forsgren, Gill and Griffin [5]
show that these equations are equivalent to the so-called doubly-augmented system:
$$\begin{pmatrix} H(\sigma) + (2/d)\,ss^T & -s \\ -s^T & d \end{pmatrix} \begin{pmatrix} p \\ q \end{pmatrix} = -\begin{pmatrix} g + H(\sigma)s - 2(\sigma - \hat\sigma)s \\ d(\sigma - \hat\sigma) \end{pmatrix},$$
where $H(\sigma) = H + \sigma I$. Finally, we multiply the last equation and last variable by $d^{-1/2}$ and $d^{1/2}$, respectively, to improve the scaling when $\sigma \approx 0$. This gives
$$\begin{pmatrix} H(\sigma) + 2\bar s\bar s^T & -\bar s \\ -\bar s^T & 1 \end{pmatrix} \begin{pmatrix} p \\ \bar q \end{pmatrix} = -\begin{pmatrix} g + H(\sigma)s - 2d^{1/2}(\sigma - \hat\sigma)\bar s \\ d^{1/2}(\sigma - \hat\sigma) \end{pmatrix}, \tag{2.7}$$
where $\bar s = d^{-1/2} s$ and $q = d^{-1/2}\bar q$. These equations are positive definite in a neighborhood of a minimizer $(s, \sigma)$ such that $\sigma \in (-\lambda_n, \infty)$, and they may be solved using the CG method. If a direction of negative or zero curvature is detected, the direction is used to update a lower bound on the best estimate of $\sigma$ (see Section 2.2).
It is not necessary to solve the perturbed equations (2.6) to high accuracy because
the quality of the accelerator step affects only the rate of convergence of the SSM
method. In the runs described in Section 3 only one Newton iteration was performed.
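To make the accelerator computation concrete, the following sketch (ours) assembles the scaled system using our reconstruction of (2.7) above, with a dense $H$ and SciPy's generic `cg` routine in place of the authors' preconditioned Lanczos-CG; sign conventions in the published paper may differ from this reconstruction by the sign of the dual variable.

```python
import numpy as np
from scipy.sparse.linalg import cg

def ip_accelerator_step(H, g, s, sigma, sigma_e, mu, delta):
    """One regularized Newton (accelerator) step from (s, sigma), following our
    reconstruction of the scaled doubly-augmented system (2.7).  Assumes sigma > 0,
    c(s) > -mu, and sigma > -lambda_min(H) so that the system is positive definite;
    a dense H is used purely for illustration."""
    n = len(g)
    c = 0.5 * delta ** 2 - 0.5 * (s @ s)            # constraint residual c(s)
    d = (c + mu) / sigma                            # regularized (2,2) quantity
    sigma_hat = mu * sigma_e / (c + mu)
    s_bar = s / np.sqrt(d)
    Hsig = H + sigma * np.eye(n)

    # Scaled doubly-augmented matrix and right-hand side (cf. (2.7)).
    K = np.block([[Hsig + 2.0 * np.outer(s_bar, s_bar), -s_bar[:, None]],
                  [-s_bar[None, :],                      np.array([[1.0]])]])
    rhs = -np.concatenate(
        [g + Hsig @ s - 2.0 * np.sqrt(d) * (sigma - sigma_hat) * s_bar,
         [np.sqrt(d) * (sigma - sigma_hat)]])

    x, info = cg(K, rhs)                            # plain CG; preconditioning omitted
    p, q_bar = x[:n], x[n]
    q = q_bar / np.sqrt(d)                          # recover the multiplier step
    return s + p, sigma + q, info
```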
