Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models∗

E. G. Birgin†, J. L. Gardenghi†, J. M. Martínez‡, S. A. Santos‡ and Ph. L. Toint§

21 April 2016
Abstract
The worst-case evaluation complexity for smooth (possibly nonconvex) unconstrained optimization is considered. It is shown that, if one is willing to use derivatives of the objective function up to order $p$ (for $p \ge 1$) and to assume Lipschitz continuity of the $p$-th derivative, then an $\epsilon$-approximate first-order critical point can be computed in at most $O(\epsilon^{-(p+1)/p})$ evaluations of the problem's objective function and its derivatives. This generalizes and subsumes results known for $p = 1$ and $p = 2$.
1 Introduction
Recent years have seen a surge of interest in the analysis of worst-case evaluation complexity of optimization algorithms for nonconvex problems (see, for instance, Vavasis [17], Nesterov and Polyak [16], Nesterov [14, 15], Gratton, Sartenaer and Toint [13], Cartis, Gould and Toint [3, 4, 5, 8], Bian, Chen and Ye [2], Bellavia, Cartis, Gould, Morini and Toint [1], Grapiglia, Yuan and Yuan [12], Vicente [18]). In particular, the paper [16] was the first to show that a method using second derivatives can find an $\epsilon$-approximate first-order critical point for an unconstrained problem with Lipschitz continuous Hessians in at most $O(\epsilon^{-3/2})$ evaluations of the objective function (and its derivatives), in contrast with methods using first derivatives only, whose evaluation complexity was known [14] to be $O(\epsilon^{-2})$ for problems with Lipschitz continuous gradients. The purpose of the present short paper is to show that, if one is willing to use derivatives up to order $p$ (for $p \ge 1$) and to assume Lipschitz continuity of the $p$-th derivative, then an $\epsilon$-approximate first-order critical point can be computed in at most $O(\epsilon^{-(p+1)/p})$ evaluations of the objective function and its derivatives. This is achieved by the use of a regularization method very much in the spirit of the first- and second-order ARC methods described in [4, 5].
∗This work has been partially supported by the Brazilian agencies FAPESP (grants 2010/10133-0, 2013/03447-6, 2013/05475-7, 2013/07375-0, and 2013/23494-9) and CNPq (grants 304032/2010-7, 309517/2014-1, 303750/2014-6, and 490326/2013-7) and by the Belgian Fund for Scientific Research (FNRS).
†Department of Computer Science, Institute of Mathematics and Statistics, University of São Paulo, Rua do Matão, 1010, Cidade Universitária, 05508-090, São Paulo, SP, Brazil. e-mail: {egbirgin | john}@ime.usp.br
‡Department of Applied Mathematics, Institute of Mathematics, Statistics, and Scientific Computing, University of Campinas, Campinas, SP, Brazil. e-mail: {martinez | sandra}@ime.unicamp.br
§Namur Center for Complex Systems (naXys) and Department of Mathematics, University of Namur, 61, rue de Bruxelles, B-5000 Namur, Belgium. e-mail: philippe.toint@unamur.be
2 A regularized p-th order model and algorithm
For $p \ge 1$, $p$ integer, consider the problem

$$\min_{x \in \mathbb{R}^n} f(x), \qquad (2.1)$$

where we assume that $f$ from $\mathbb{R}^n$ to $\mathbb{R}$ is bounded below and $p$-times continuously differentiable. We also assume that its $p$-th derivative at $x$, the $p$-th order tensor

$$\nabla_x^p f(x) = \left( \frac{\partial^p f}{\partial x_{i_1} \cdots \partial x_{i_p}}(x) \right)_{i_j \in \{1,\dots,n\},\ j = 1,\dots,p},$$

is Lipschitz continuous, i.e. that there exists a constant $L \ge 0$ such that, for all $x, y \in \mathbb{R}^n$,

$$\| \nabla_x^p f(x) - \nabla_x^p f(y) \|_{[p]} \le (p-1)!\, L\, \| x - y \|. \qquad (2.2)$$
In (2.2), $\|\cdot\|_{[p]}$ is the tensor norm recursively induced by the Euclidean norm $\|\cdot\|$ on the space of $p$-th order tensors, which is given by

$$\| T \|_{[p]} \stackrel{\rm def}{=} \max_{\|v_1\| = \cdots = \|v_p\| = 1} \left| T[v_1, \dots, v_p] \right|, \qquad (2.3)$$

where $T[v_1, \dots, v_j]$ stands for the tensor of order $q - j \ge 0$ resulting from the application of the $q$-th order tensor $T$ to the vectors $v_1, \dots, v_j$. Let $T_p(x, s)$ be the Taylor series of the function $f(x+s)$ at $x$ truncated at order $p$,

$$T_p(x, s) \stackrel{\rm def}{=} f(x) + \sum_{j=1}^{p} \frac{1}{j!} \nabla_x^j f(x)[s]^j, \qquad (2.4)$$
where the notation $T[s]^j$ stands for the tensor $T$ applied $j$ times to the vector $s$.
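To make the tensor notation concrete, the following minimal Python sketch evaluates $T[v_1,\dots,v_j]$ by successive contractions and assembles the truncated Taylor series (2.4) from derivative tensors supplied as NumPy arrays; the helper names `apply_tensor` and `taylor_model` are illustrative assumptions, not notation from the paper. (Note that, for $p = 2$, the induced norm (2.3) is simply the spectral norm of the Hessian matrix.)

```python
import math
import numpy as np

def apply_tensor(T, vectors):
    """Contract a q-th order tensor with j <= q vectors, giving T[v_1, ..., v_j]."""
    for v in vectors:
        T = np.tensordot(T, v, axes=([0], [0]))
    return T

def taylor_model(fx, derivatives, s):
    """T_p(x, s) as in (2.4): fx = f(x), derivatives[j-1] = j-th derivative tensor at x."""
    value = fx
    for j, D in enumerate(derivatives, start=1):
        value += apply_tensor(D, [s] * j) / math.factorial(j)
    return value

# Example with p = 2: T_2(x, s) = f(x) + g^T s + s^T H s / 2.
g, H, s = np.array([1.0, -2.0]), np.eye(2), np.array([0.5, 0.5])
print(taylor_model(0.0, [g, H], s))  # 0.0 + (-0.5) + 0.25 = -0.25
```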
Then Taylor's theorem, the identity

$$\int_0^1 (1 - \xi)^{p-1} \, d\xi = \frac{1}{p}, \qquad (2.5)$$

the induced nature of $\|\cdot\|_{[p]}$ and (2.2) imply that, for all $x, s \in \mathbb{R}^n$,
$$
\begin{array}{rcl}
f(x+s) & = & T_{p-1}(x,s) + \dfrac{1}{(p-1)!} \displaystyle\int_0^1 (1-\xi)^{p-1} \nabla_x^p f(x + \xi s)[s]^p \, d\xi \\[2ex]
& = & T_p(x,s) + \dfrac{1}{(p-1)!} \displaystyle\int_0^1 (1-\xi)^{p-1} \big( \nabla_x^p f(x + \xi s)[s]^p - \nabla_x^p f(x)[s]^p \big) \, d\xi \\[2ex]
& \le & T_p(x,s) + \dfrac{1}{(p-1)!} \displaystyle\int_0^1 (1-\xi)^{p-1} \big| \nabla_x^p f(x + \xi s)[s]^p - \nabla_x^p f(x)[s]^p \big| \, d\xi \\[2ex]
& \le & T_p(x,s) + \left[ \displaystyle\int_0^1 \frac{(1-\xi)^{p-1}}{(p-1)!} \, d\xi \right] \max_{\xi \in [0,1]} \big| \nabla_x^p f(x + \xi s)[s]^p - \nabla_x^p f(x)[s]^p \big| \\[2ex]
& \le & T_p(x,s) + \dfrac{1}{p!} \|s\|^p \max_{\xi \in [0,1]} \| \nabla_x^p f(x + \xi s) - \nabla_x^p f(x) \|_{[p]} \\[2ex]
& \le & T_p(x,s) + \dfrac{L}{p} \|s\|^{p+1}.
\end{array} \qquad (2.6)
$$

(For $p = 1$, (2.6) reduces to the familiar overestimation property $f(x+s) \le f(x) + \nabla_x^1 f(x)[s] + L\|s\|^2$ for functions with Lipschitz continuous gradients.)
Following the more general argument developed by Cartis, Gould and Toint [10], consider now, for an arbitrary unit vector $v$, $\phi(\alpha) = \nabla_x^1 f(x + \alpha s)[v]$ and $\tau_{p-1}(\alpha) = \sum_{i=0}^{p-1} \phi^{(i)}(0)\, \alpha^i / i!$.
Taylor's identity then gives that

$$\phi(1) - \tau_{p-1}(1) = \frac{1}{(p-2)!} \int_0^1 (1-\xi)^{p-2} \left[ \phi^{(p-1)}(\xi) - \phi^{(p-1)}(0) \right] d\xi.$$
Hence, since $\tau_{p-1}(1) = \nabla_s^1 T_p(x,s)[v]$,

$$\big( \nabla_x^1 f(x+s) - \nabla_s^1 T_p(x,s) \big)[v] = \frac{1}{(p-2)!} \int_0^1 (1-\xi)^{p-2} \big[ \nabla_x^p f(x + \xi s) - \nabla_x^p f(x) \big][s]^{p-1}[v] \, d\xi.$$
Thus, using the symmetry of the derivative tensors, picking v to maximize the absolute value
of the left-hand side and using (2.5), (2.3) and (2.2) successively, we obtain that
$$
\begin{array}{rcl}
\| \nabla_x^1 f(x+s) - \nabla_s^1 T_p(x,s) \|
& = & \dfrac{1}{(p-2)!} \left| \displaystyle\int_0^1 (1-\xi)^{p-2} \big( \nabla_x^p f(x + \xi s) - \nabla_x^p f(x) \big)[v] \left[ \dfrac{s}{\|s\|} \right]^{p-1} \|s\|^{p-1} \, d\xi \right| \\[2ex]
& \le & \dfrac{1}{(p-2)!} \left[ \displaystyle\int_0^1 (1-\xi)^{p-2} \, d\xi \right] \max_{\xi \in [0,1]} \left| \big( \nabla_x^p f(x + \xi s) - \nabla_x^p f(x) \big)[v] \left[ \dfrac{s}{\|s\|} \right]^{p-1} \right| \|s\|^{p-1} \\[2ex]
& \le & \dfrac{1}{(p-1)!} \max_{\xi \in [0,1]} \; \max_{\|w_1\| = \cdots = \|w_p\| = 1} \big| \big( \nabla_x^p f(x + \xi s) - \nabla_x^p f(x) \big)[w_1, \dots, w_p] \big| \; \|s\|^{p-1} \\[2ex]
& = & \dfrac{1}{(p-1)!} \max_{\xi \in [0,1]} \| \nabla_x^p f(x + \xi s) - \nabla_x^p f(x) \|_{[p]} \, \|s\|^{p-1} \\[2ex]
& \le & L \|s\|^p.
\end{array} \qquad (2.7)
$$
In order to describe our algorithm, we also define the regularized Taylor series

$$m(x, s, \sigma) = T_p(x, s) + \frac{\sigma}{p+1} \|s\|^{p+1}, \qquad (2.8)$$

whose gradient is

$$\nabla_s^1 m(x, s, \sigma) = \nabla_s^1 T_p(x, s) + \sigma \|s\|^p \frac{s}{\|s\|}. \qquad (2.9)$$
Note that

$$m(x, 0, \sigma) = T_p(x, 0) = f(x). \qquad (2.10)$$
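As a sanity check on (2.8)-(2.9), the following Python sketch (for $p = 2$, with illustrative random first and second derivatives `g` and `H` that are not from the paper) compares the analytic gradient (2.9) of the regularized model with a central finite-difference approximation; the two should agree up to truncation error.

```python
import numpy as np

def model(g, H, s, sigma, p=2):
    """m(x, s, sigma) - f(x): the regularized model (2.8), shifted by the constant f(x)."""
    return g @ s + 0.5 * s @ (H @ s) + sigma / (p + 1) * np.linalg.norm(s) ** (p + 1)

def model_grad(g, H, s, sigma, p=2):
    """The gradient (2.9): gradient of T_2 plus sigma * ||s||^(p-1) * s."""
    return g + H @ s + sigma * np.linalg.norm(s) ** (p - 1) * s

rng = np.random.default_rng(0)
g, s = rng.standard_normal(3), rng.standard_normal(3)
A = rng.standard_normal((3, 3)); H = A + A.T      # symmetric second-derivative matrix
sigma, h = 1.5, 1e-6
fd = np.array([(model(g, H, s + h * e, sigma) - model(g, H, s - h * e, sigma)) / (2 * h)
               for e in np.eye(3)])
print(np.allclose(fd, model_grad(g, H, s, sigma), atol=1e-5))  # expected: True
```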
The minimization algorithm we consider is now detailed as Algorithm 1 below.
Each iteration of this algorithm requires the approximate minimization of $m(x_k, s, \sigma_k)$, but we may note that conditions (2.12) and (2.13) are relatively weak, in that they only require a decrease of the regularized $p$-th order model and an approximate first-order stationary point: no global optimization of this possibly nonconvex model is needed. Fortunately, this approximate minimization does not involve additional computations of $f$ or of its derivatives at points other than $x_k$, and therefore the exact method used and the resulting effort spent in Step 2 have no impact on the evaluation complexity.
Algorithm 1: ARp

Step 0: Initialization. An initial point $x_0$ and an initial regularization parameter $\sigma_0 > 0$ are given, as well as an accuracy level $\epsilon$. The constants $\theta$, $\eta_1$, $\eta_2$, $\gamma_1$, $\gamma_2$, $\gamma_3$ and $\sigma_{\min}$ are also given and satisfy

$$\theta > 0, \quad \sigma_{\min} \in (0, \sigma_0], \quad 0 < \eta_1 \le \eta_2 < 1 \quad \text{and} \quad 0 < \gamma_1 < 1 < \gamma_2 < \gamma_3. \qquad (2.11)$$

Compute $f(x_0)$ and set $k = 0$.

Step 1: Test for termination. Evaluate $\nabla_x^1 f(x_k)$. If $\| \nabla_x^1 f(x_k) \| \le \epsilon$, terminate with the approximate solution $x_\epsilon = x_k$. Otherwise compute derivatives of $f$ from order 2 to $p$ at $x_k$.

Step 2: Step calculation. Compute the step $s_k$ by approximately minimizing the model $m(x_k, s, \sigma_k)$ with respect to $s$, in the sense that the conditions

$$m(x_k, s_k, \sigma_k) < m(x_k, 0, \sigma_k) \qquad (2.12)$$

and

$$\| \nabla_s^1 m(x_k, s_k, \sigma_k) \| \le \theta \|s_k\|^p \qquad (2.13)$$

hold.

Step 3: Acceptance of the trial point. Compute $f(x_k + s_k)$ and define

$$\rho_k = \frac{f(x_k) - f(x_k + s_k)}{T_p(x_k, 0) - T_p(x_k, s_k)}. \qquad (2.14)$$

If $\rho_k \ge \eta_1$, then define $x_{k+1} = x_k + s_k$; otherwise define $x_{k+1} = x_k$.

Step 4: Regularization parameter update. Set

$$\sigma_{k+1} \in \begin{cases} [\max(\sigma_{\min}, \gamma_1 \sigma_k), \sigma_k] & \text{if } \rho_k \ge \eta_2, \\ [\sigma_k, \gamma_2 \sigma_k] & \text{if } \rho_k \in [\eta_1, \eta_2), \\ [\gamma_2 \sigma_k, \gamma_3 \sigma_k] & \text{if } \rho_k < \eta_1. \end{cases} \qquad (2.15)$$

Increment $k$ by one and go to Step 1 if $\rho_k \ge \eta_1$ or to Step 2 otherwise.
Also note that the numerator and denominator in (2.14) are strictly comparable, the latter being Taylor's approximation of the former, without the regularization parameter playing any role.
Iterations for which $\rho_k \ge \eta_1$ (and hence $x_{k+1} = x_k + s_k$) are called "successful", and we denote by $S_k \stackrel{\rm def}{=} \{ 0 \le j \le k \mid \rho_j \ge \eta_1 \}$ the index set of all successful iterations between 0 and $k$. We also denote by $U_k$ its complement in $\{0, \dots, k\}$, which corresponds to the index set of "unsuccessful" iterations between 0 and $k$. Note that, before termination, each successful iteration requires the evaluation of $f$ and its first $p$ derivatives, while only the evaluation of $f$ is needed at unsuccessful ones.
We first derive a very simple result on the model decrease obtained under condition (2.12).
Lemma 2.1 The mechanism of Algorithm 1 guarantees that, for all $k \ge 0$,

$$T_p(x_k, 0) - T_p(x_k, s_k) \ge \frac{\sigma_k}{p+1} \|s_k\|^{p+1}. \qquad (2.16)$$
Proof. Observe that, because of (2.12) and (2.8),

$$0 \le m(x_k, 0, \sigma_k) - m(x_k, s_k, \sigma_k) = T_p(x_k, 0) - T_p(x_k, s_k) - \frac{\sigma_k}{p+1} \|s_k\|^{p+1},$$

which implies the desired bound. □
As a result, we obtain that (2.14) is well-defined for all $k \ge 0$. We next deduce a simple upper bound on the regularization parameter $\sigma_k$.
Lemma 2.2 Suppose that $f$ is $p$ times continuously differentiable with Lipschitz continuous $p$-th derivative (i.e., that (2.2) holds). Then, for all $k \ge 0$,

$$\sigma_k \le \sigma_{\max} \stackrel{\rm def}{=} \max\left[ \sigma_0, \; \frac{\gamma_3 L (p+1)}{p (1 - \eta_2)} \right]. \qquad (2.17)$$
Proof. Assume that

$$\sigma_k \ge \frac{L (p+1)}{p (1 - \eta_2)}. \qquad (2.18)$$

Using (2.6) and (2.16), we may then deduce that

$$|\rho_k - 1| \le \frac{| f(x_k + s_k) - T_p(x_k, s_k) |}{| T_p(x_k, 0) - T_p(x_k, s_k) |} \le \frac{L (p+1)}{p\, \sigma_k} \le 1 - \eta_2,$$

and thus that $\rho_k \ge \eta_2$. Then iteration $k$ is very successful in that $\rho_k \ge \eta_2$ and $\sigma_{k+1} \le \sigma_k$. As a consequence, since (2.15) increases the regularization parameter by a factor of at most $\gamma_3$, the mechanism of the algorithm ensures that (2.17) holds. □
Our next step, very much in the line of the theory proposed in [5], is to show that the steplength cannot be arbitrarily small compared with the gradient of the objective function at the trial point $x_k + s_k$.