
Worst-case evaluation complexity for unconstrained nonlinear
optimization using high-order regularized models
E. G. Birgin, J. L. Gardenghi, J. M. Martínez, S. A. Santos and Ph. L. Toint

21 April 2016
Abstract

The worst-case evaluation complexity for smooth (possibly nonconvex) unconstrained optimization is considered. It is shown that, if one is willing to use derivatives of the objective function up to order $p$ (for $p \ge 1$) and to assume Lipschitz continuity of the $p$-th derivative, then an $\epsilon$-approximate first-order critical point can be computed in at most $O(\epsilon^{-(p+1)/p})$ evaluations of the problem's objective function and its derivatives. This generalizes and subsumes results known for $p = 1$ and $p = 2$.
1 Introduction
Recent years have seen a surge of interest in the analysis of worst-case evaluation complexity
of optimization algorithms for nonconvex problems (see, for instance, Vavasis [17], Nesterov
and Polyak [16], Nesterov [14, 15], Gratton, Sartenaer and Toint [13], Cartis, Gould and Toint
[3, 4, 5, 8], Bian, Chen and Ye [2], Bellavia, Cartis, Gould, Morini and Toint [1], Grapiglia,
Yuan and Yuan [12], Vicente [18]). In particular, the paper [16] was the first to show that a method using second derivatives can find an $\epsilon$-approximate first-order critical point for an unconstrained problem with Lipschitz continuous Hessians in at most $O(\epsilon^{-3/2})$ evaluations of the objective function (and its derivatives), in contrast with methods using first derivatives only, whose evaluation complexity was known [14] to be $O(\epsilon^{-2})$ for problems with Lipschitz continuous gradients. The purpose of the present short paper is to show that, if one is willing to use derivatives up to order $p$ (for $p \ge 1$) and to assume Lipschitz continuity of the $p$-th derivative, then an $\epsilon$-approximate first-order critical point can be computed in at most $O(\epsilon^{-(p+1)/p})$ evaluations of the objective function and its derivatives. This is achieved by the use of a regularization method very much in the spirit of the first- and second-order ARC methods described in [4, 5].
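For concreteness, the bound claimed above specializes as follows:
$$
p = 1:\ O(\epsilon^{-2}), \qquad p = 2:\ O(\epsilon^{-3/2}), \qquad p = 3:\ O(\epsilon^{-4/3}),
$$
so that each additional derivative order brings the worst-case exponent $-(p+1)/p$ closer to $-1$.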
This work has been partially supported by the Brazilian agencies FAPESP (grants 2010/10133-
0, 2013/03447-6, 2013/05475-7, 2013/07375-0, and 2013/23494-9) and CNPq (grants 304032/2010-7,
309517/2014-1, 303750/2014-6, and 490326/2013-7) and by the Belgian Fund for Scientific Research (FNRS).
Department of Computer Science, Institute of Mathematics and Statistics, University of São Paulo, Rua do Matão, 1010, Cidade Universitária, 05508-090, São Paulo, SP, Brazil. e-mail: {egbirgin | john}@ime.usp.br

Department of Applied Mathematics, Institute of Mathematics, Statistics, and Scientific Computing, University of Campinas, Campinas, SP, Brazil. e-mail: {martinez | sandra}@ime.unicamp.br

Namur Center for Complex Systems (naXys) and Department of Mathematics, University of Namur, 61, rue de Bruxelles, B-5000 Namur, Belgium. e-mail: philippe.toint@unamur.be
2 A regularized p-th order model and algorithm
For $p \ge 1$, $p$ integer, consider the problem
$$
\min_{x \in \mathbb{R}^n} f(x), \qquad (2.1)
$$
where we assume that $f$ from $\mathbb{R}^n$ to $\mathbb{R}$ is bounded below and $p$-times continuously differentiable. We also assume that its $p$-th derivative at $x$, the $p$-th order tensor
$$
\nabla_x^p f(x) = \left( \frac{\partial^p f}{\partial x_{i_1} \cdots \partial x_{i_p}}(x) \right)_{i_j \in \{1,\dots,n\},\ j = 1,\dots,p},
$$
is Lipschitz continuous, i.e. that there exists a constant $L \ge 0$ such that, for all $x, y \in \mathbb{R}^n$,
$$
\| \nabla_x^p f(x) - \nabla_x^p f(y) \|_{[p]} \le (p-1)! \, L \, \| x - y \|. \qquad (2.2)
$$
In (2.2), $\|\cdot\|_{[p]}$ is the tensor norm recursively induced by the Euclidean norm $\|\cdot\|$ on the space of $p$-th order tensors, which is given by
$$
\| T \|_{[p]} \stackrel{\rm def}{=} \max_{\|v_1\| = \cdots = \|v_p\| = 1} \left| T[v_1, \dots, v_p] \right|, \qquad (2.3)
$$
where $T[v_1, \dots, v_j]$ stands for the tensor of order $q - j \ge 0$ resulting from the application of the $q$-th order tensor $T$ to the vectors $v_1, \dots, v_j$.
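For $p = 2$, for instance, the recursively induced norm of the (symmetric) Hessian coincides with its spectral norm,
$$
\| \nabla_x^2 f(x) \|_{[2]} = \max_{\|v_1\| = \|v_2\| = 1} \left| \nabla_x^2 f(x)[v_1, v_2] \right| = \| \nabla_x^2 f(x) \|_2,
$$
so that (2.2) then reduces to the familiar Lipschitz condition $\| \nabla_x^2 f(x) - \nabla_x^2 f(y) \|_2 \le L \| x - y \|$ on the Hessian.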
Let $T_p(x,s)$ be the Taylor series of the function $f(x+s)$ at $x$ truncated at order $p$,
$$
T_p(x,s) \stackrel{\rm def}{=} f(x) + \sum_{j=1}^{p} \frac{1}{j!} \nabla_x^j f(x)[s]^j, \qquad (2.4)
$$
where the notation $T[s]^j$ stands for the tensor $T$ applied $j$ times to the vector $s$. Then Taylor's theorem, the identity
$$
\int_0^1 (1-\xi)^{p-1} \, d\xi = \frac{1}{p}, \qquad (2.5)
$$
the induced nature of $\|\cdot\|_{[p]}$ and (2.2) imply that, for all $x, s \in \mathbb{R}^n$,
$$
\begin{aligned}
f(x+s) & = T_{p-1}(x,s) + \frac{1}{(p-1)!} \int_0^1 (1-\xi)^{p-1} \, \nabla_x^p f(x+\xi s)[s]^p \, d\xi \\
& = T_p(x,s) + \frac{1}{(p-1)!} \int_0^1 (1-\xi)^{p-1} \left( \nabla_x^p f(x+\xi s)[s]^p - \nabla_x^p f(x)[s]^p \right) d\xi \\
& \le T_p(x,s) + \frac{1}{(p-1)!} \int_0^1 (1-\xi)^{p-1} \left| \nabla_x^p f(x+\xi s)[s]^p - \nabla_x^p f(x)[s]^p \right| d\xi \\
& \le T_p(x,s) + \int_0^1 \frac{(1-\xi)^{p-1}}{(p-1)!} \, d\xi \; \max_{\xi \in [0,1]} \left| \nabla_x^p f(x+\xi s)[s]^p - \nabla_x^p f(x)[s]^p \right| \\
& \le T_p(x,s) + \frac{1}{p!} \, \|s\|^p \, \max_{\xi \in [0,1]} \| \nabla_x^p f(x+\xi s) - \nabla_x^p f(x) \|_{[p]} \\
& \le T_p(x,s) + \frac{L}{p} \, \|s\|^{p+1}.
\end{aligned}
\qquad (2.6)
$$
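For $p = 2$, for example, (2.6) is the standard cubic overestimation property underlying the second-order methods of [16, 4, 5]:
$$
f(x+s) \le f(x) + \nabla_x^1 f(x)[s] + \tfrac{1}{2} \nabla_x^2 f(x)[s]^2 + \tfrac{L}{2} \|s\|^3.
$$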

Following the more general argument developed by Cartis, Gould and Toint [10], consider now, for an arbitrary unit vector $v$, $\phi(\alpha) = \nabla_x^1 f(x + \alpha s)[v]$ and $\tau_{p-1}(\alpha) = \sum_{i=0}^{p-1} \phi^{(i)}(0)\, \alpha^i / i!$. Taylor's identity then gives that
$$
\phi(1) - \tau_{p-1}(1) = \frac{1}{(p-2)!} \int_0^1 (1-\xi)^{p-2} \left[ \phi^{(p-1)}(\xi) - \phi^{(p-1)}(0) \right] d\xi.
$$
Hence, since $\tau_{p-1}(1) = \nabla_s^1 T_p(x,s)[v]$,
$$
\left( \nabla_x^1 f(x+s) - \nabla_s^1 T_p(x,s) \right)[v]
= \frac{1}{(p-2)!} \int_0^1 (1-\xi)^{p-2} \left[ \nabla_x^p f(x+\xi s) - \nabla_x^p f(x) \right][s]^{p-1}[v] \, d\xi.
$$
Thus, using the symmetry of the derivative tensors, picking v to maximize the absolute value
of the left-hand side and using (2.5), (2.3) and (2.2) successively, we obtain that
$$
\begin{aligned}
\| \nabla_x^1 f(x+s) - \nabla_s^1 T_p(x,s) \|
& = \frac{1}{(p-2)!} \left| \int_0^1 (1-\xi)^{p-2} \left( \nabla_x^p f(x+\xi s) - \nabla_x^p f(x) \right)[v]\!\left[ \frac{s}{\|s\|} \right]^{p-1} d\xi \right| \, \|s\|^{p-1} \\
& \le \|s\|^{p-1} \, \frac{1}{(p-2)!} \int_0^1 (1-\xi)^{p-2} \, d\xi \; \max_{\xi \in [0,1]} \left| \left( \nabla_x^p f(x+\xi s) - \nabla_x^p f(x) \right)[v]\!\left[ \frac{s}{\|s\|} \right]^{p-1} \right| \\
& \le \frac{1}{(p-1)!} \, \max_{\xi \in [0,1]} \; \max_{\|w_1\| = \cdots = \|w_p\| = 1} \left| \left( \nabla_x^p f(x+\xi s) - \nabla_x^p f(x) \right)[w_1, \dots, w_p] \right| \, \|s\|^{p-1} \\
& = \frac{1}{(p-1)!} \, \max_{\xi \in [0,1]} \| \nabla_x^p f(x+\xi s) - \nabla_x^p f(x) \|_{[p]} \, \|s\|^{p-1} \\
& \le L \, \|s\|^p.
\end{aligned}
\qquad (2.7)
$$
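Similarly, for $p = 2$, (2.7) reads
$$
\| \nabla_x^1 f(x+s) - \nabla_x^1 f(x) - \nabla_x^2 f(x)[s] \| \le L \|s\|^2,
$$
a (slightly weakened) version of the classical bound on the error of the linearized gradient for functions with Lipschitz continuous Hessians.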
In order to describe our algorithm, we also define the regularized Taylor series
$$
m(x, s, \sigma) = T_p(x,s) + \frac{\sigma}{p+1} \|s\|^{p+1}, \qquad (2.8)
$$
whose gradient is
$$
\nabla_s^1 m(x, s, \sigma) = \nabla_s^1 T_p(x,s) + \sigma \|s\|^p \, \frac{s}{\|s\|}. \qquad (2.9)
$$
Note that
$$
m(x, 0, \sigma) = T_p(x, 0) = f(x). \qquad (2.10)
$$
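For $p = 2$, the model (2.8) is precisely the cubic model used by the ARC methods of [4, 5],
$$
m(x, s, \sigma) = f(x) + \nabla_x^1 f(x)[s] + \tfrac{1}{2} \nabla_x^2 f(x)[s]^2 + \tfrac{\sigma}{3} \|s\|^3.
$$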
The minimization algorithm we consider is now detailed as Algorithm 1 below.
Each iteration of this algorithm requires the approximate minimization of $m(x_k, s, \sigma_k)$, but we may note that conditions (2.12) and (2.13) are relatively weak, in that they only require a decrease of the regularized $p$-th order model and an approximate first-order stationary point: no global optimization of this possibly nonconvex model is needed. Fortunately, this approximate minimization does not involve additional computations of $f$ or of its derivatives at points other than $x_k$, and therefore the exact method used and the resulting effort spent in Step 2 have no impact on the evaluation complexity.

Algorithm 1: ARp

Step 0: Initialization. An initial point $x_0$ and an initial regularization parameter $\sigma_0 > 0$ are given, as well as an accuracy level $\epsilon$. The constants $\theta$, $\eta_1$, $\eta_2$, $\gamma_1$, $\gamma_2$, $\gamma_3$ and $\sigma_{\min}$ are also given and satisfy
$$
\theta > 0, \quad \sigma_{\min} \in (0, \sigma_0], \quad 0 < \eta_1 \le \eta_2 < 1 \quad \text{and} \quad 0 < \gamma_1 < 1 < \gamma_2 < \gamma_3. \qquad (2.11)
$$
Compute $f(x_0)$ and set $k = 0$.

Step 1: Test for termination. Evaluate $\nabla_x^1 f(x_k)$. If $\|\nabla_x^1 f(x_k)\| \le \epsilon$, terminate with the approximate solution $x_\epsilon = x_k$. Otherwise compute derivatives of $f$ from order 2 to $p$ at $x_k$.

Step 2: Step calculation. Compute the step $s_k$ by approximately minimizing the model $m(x_k, s, \sigma_k)$ with respect to $s$ in the sense that the conditions
$$
m(x_k, s_k, \sigma_k) < m(x_k, 0, \sigma_k) \qquad (2.12)
$$
and
$$
\| \nabla_s^1 m(x_k, s_k, \sigma_k) \| \le \theta \, \|s_k\|^p \qquad (2.13)
$$
hold.

Step 3: Acceptance of the trial point. Compute $f(x_k + s_k)$ and define
$$
\rho_k = \frac{f(x_k) - f(x_k + s_k)}{T_p(x_k, 0) - T_p(x_k, s_k)}. \qquad (2.14)
$$
If $\rho_k \ge \eta_1$, then define $x_{k+1} = x_k + s_k$; otherwise define $x_{k+1} = x_k$.

Step 4: Regularization parameter update. Set
$$
\sigma_{k+1} \in
\begin{cases}
[\max(\sigma_{\min}, \gamma_1 \sigma_k), \sigma_k] & \text{if } \rho_k \ge \eta_2, \\
[\sigma_k, \gamma_2 \sigma_k] & \text{if } \rho_k \in [\eta_1, \eta_2), \\
[\gamma_2 \sigma_k, \gamma_3 \sigma_k] & \text{if } \rho_k < \eta_1.
\end{cases}
\qquad (2.15)
$$
Increment $k$ by one and go to Step 1 if $\rho_k \ge \eta_1$ or to Step 2 otherwise.

Also note that the numerator and denominator in (2.14) are strictly comparable, the latter being Taylor's approximation of the former, without the regularization parameter playing any role.
Iterations for which $\rho_k \ge \eta_1$ (and hence $x_{k+1} = x_k + s_k$) are called "successful" and we denote by $S_k \stackrel{\rm def}{=} \{ 0 \le j \le k \mid \rho_j \ge \eta_1 \}$ the index set of all successful iterations between 0 and $k$. We also denote by $U_k$ its complement in $\{0, \dots, k\}$, which corresponds to the index set of "unsuccessful" iterations between 0 and $k$. Note that, before termination, each successful iteration requires the evaluation of $f$ and its first $p$ derivatives, while only the evaluation of $f$ is needed at unsuccessful ones.
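To make the mechanism of Algorithm 1 concrete, the following Python sketch illustrates the $p = 2$ case, in which the regularized model (2.8) is a cubic. It is a minimal illustration under stated assumptions: the inner solver (a bisection on the secular equation characterizing the global minimizer of the cubic model), all parameter values, and the Rosenbrock test function are our own illustrative choices and are not prescribed by the paper.

```python
# A minimal illustrative sketch of Algorithm 1 (ARp) for the case p = 2, i.e. with
# a cubically regularized second-order model.  The inner solver, the parameter
# values and the test function below are illustrative assumptions, not choices
# prescribed by the paper.
import numpy as np

def cubic_model_minimizer(g, H, sigma):
    """(Essentially) global minimizer of g's + s'Hs/2 + (sigma/3)*||s||^3, found by
    bisection on the secular equation ||(H + lam*I)^{-1} g|| = lam/sigma with
    H + lam*I positive semidefinite (the degenerate 'hard case' is not handled)."""
    lam_min = np.linalg.eigvalsh(H)[0]
    lo = max(0.0, -lam_min) + 1e-12

    def step(lam):
        return np.linalg.solve(H + lam * np.eye(len(g)), -g)

    hi = lo + 1.0
    while np.linalg.norm(step(hi)) > hi / sigma:       # bracket the root
        hi *= 2.0
    for _ in range(200):                               # bisect to high accuracy
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(step(mid)) > mid / sigma:
            lo = mid
        else:
            hi = mid
    return step(hi)

def ar2(f, grad, hess, x0, eps=1e-6, sigma0=1.0, sigma_min=1e-8,
        eta1=0.1, eta2=0.9, gamma1=0.5, gamma3=5.0, max_iter=500):
    x, sigma = np.asarray(x0, dtype=float), sigma0
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:                   # Step 1: termination test
            return x, k
        H = hess(x)                                    # (re-evaluated for simplicity,
                                                       #  even after unsuccessful steps)
        # Step 2: since s is a global minimizer of the model and g != 0, the weak
        # conditions (2.12)-(2.13) hold automatically (up to rounding).
        s = cubic_model_minimizer(g, H, sigma)
        # Step 3: acceptance ratio (2.14); its denominator is the Taylor decrease.
        rho = (f(x) - f(x + s)) / -(g @ s + 0.5 * s @ H @ s)
        if rho >= eta1:
            x = x + s
        # Step 4: one valid choice inside each interval of (2.15).
        if rho >= eta2:
            sigma = max(sigma_min, gamma1 * sigma)
        elif rho < eta1:
            sigma = gamma3 * sigma
    return x, max_iter

if __name__ == "__main__":
    # Hypothetical usage on the two-dimensional Rosenbrock function.
    f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
    grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                               200 * (x[1] - x[0]**2)])
    hess = lambda x: np.array([[2 - 400 * (x[1] - 3 * x[0]**2), -400 * x[0]],
                               [-400 * x[0], 200.0]])
    x_eps, iters = ar2(f, grad, hess, np.array([-1.2, 1.0]))
    print("approximate first-order point:", x_eps, "after", iters, "iterations")
```

Because this inner solver returns an (essentially) global model minimizer, conditions (2.12) and (2.13) hold with room to spare; Algorithm 1 itself only asks for the far weaker approximate conditions, which is precisely why the effort spent in Step 2 costs no additional evaluations of $f$ or its derivatives.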
We first derive a very simple result on the model decrease obtained under condition (2.12).

Lemma 2.1 The mechanism of Algorithm 1 guarantees that, for all $k \ge 0$,
$$
T_p(x_k, 0) - T_p(x_k, s_k) > \frac{\sigma_k}{p+1} \|s_k\|^{p+1}. \qquad (2.16)
$$

Proof. Observe that, because of (2.12) and (2.8),
$$
0 < m(x_k, 0, \sigma_k) - m(x_k, s_k, \sigma_k) = T_p(x_k, 0) - T_p(x_k, s_k) - \frac{\sigma_k}{p+1} \|s_k\|^{p+1},
$$
which implies the desired bound. $\Box$
As a result, we obtain that (2.14) is well-defined for all $k \ge 0$. We next deduce a simple upper bound on the regularization parameter $\sigma_k$.
Lemma 2.2 Suppose that $f$ is $p$ times continuously differentiable with Lipschitz continuous $p$-th derivative (i.e., that (2.2) holds). Then, for all $k \ge 0$,
$$
\sigma_k \le \sigma_{\max} \stackrel{\rm def}{=} \max\!\left[ \sigma_0, \; \frac{\gamma_3 L (p+1)}{p \, (1 - \eta_2)} \right]. \qquad (2.17)
$$

Proof. Assume that
$$
\sigma_k \ge \frac{L (p+1)}{p \, (1 - \eta_2)}. \qquad (2.18)
$$
Using (2.6) and (2.16), we may then deduce that
$$
| \rho_k - 1 | \le \frac{| f(x_k + s_k) - T_p(x_k, s_k) |}{| T_p(x_k, 0) - T_p(x_k, s_k) |}
\le \frac{L (p+1)}{p \, \sigma_k} \le 1 - \eta_2,
$$
and thus that $\rho_k \ge \eta_2$. Then iteration $k$ is very successful in that $\rho_k \ge \eta_2$ and $\sigma_{k+1} \le \sigma_k$. As a consequence, the mechanism of the algorithm ensures that (2.17) holds. $\Box$
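As an illustration only (these parameter values are not prescribed by the paper), choosing $\gamma_3 = 5$, $\eta_2 = 0.9$ and $p = 2$ in (2.17) gives
$$
\sigma_{\max} = \max\!\left[ \sigma_0, \; \frac{5 \cdot L \cdot 3}{2 \cdot 0.1} \right] = \max\left[ \sigma_0, \, 75\,L \right],
$$
so the regularization parameter can never exceed a modest multiple of the Lipschitz constant.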
Our next step, very much in the line of the theory proposed in [5], is to show that the
steplength cannot be arbitrarily small compared with the gradient of the objective function
at the trial point $x_k + s_k$.

References

Nesterov, Y., Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.

Nesterov, Y. and Polyak, B. T., Cubic regularization of Newton method and its global performance. Mathematical Programming, 2006.

Cartis, C., Gould, N. I. M. and Toint, Ph. L., Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, 2011.

Cartis, C., Gould, N. I. M. and Toint, Ph. L., Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Mathematical Programming, 2011.