Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models∗

E. G. Birgin†, J. L. Gardenghi†, J. M. Martínez‡, S. A. Santos‡ and Ph. L. Toint§

21 April 2016
Abstract
The worst-case evaluation complexity for smooth (possibly nonconvex) unconstrained optimization is considered. It is shown that, if one is willing to use derivatives of the objective function up to order $p$ (for $p \ge 1$) and to assume Lipschitz continuity of the $p$-th derivative, then an $\epsilon$-approximate first-order critical point can be computed in at most $O(\epsilon^{-(p+1)/p})$ evaluations of the problem's objective function and its derivatives. This generalizes and subsumes results known for $p = 1$ and $p = 2$.
1 Introduction
Recent years have seen a surge of interest in the analysis of worst-case evaluation complexity of optimization algorithms for nonconvex problems (see, for instance, Vavasis [17], Nesterov and Polyak [16], Nesterov [14, 15], Gratton, Sartenaer and Toint [13], Cartis, Gould and Toint [3, 4, 5, 8], Bian, Chen and Ye [2], Bellavia, Cartis, Gould, Morini and Toint [1], Grapiglia, Yuan and Yuan [12], Vicente [18]). In particular, the paper [16] was the first to show that a method using second derivatives can find an $\epsilon$-approximate first-order critical point for an unconstrained problem with Lipschitz continuous Hessians in at most $O(\epsilon^{-3/2})$ evaluations of the objective function (and its derivatives), in contrast with methods using first derivatives only, whose evaluation complexity was known [14] to be $O(\epsilon^{-2})$ for problems with Lipschitz continuous gradients. The purpose of the present short paper is to show that, if one is willing to use derivatives up to order $p$ (for $p \ge 1$) and to assume Lipschitz continuity of the $p$-th derivative, then an $\epsilon$-approximate first-order critical point can be computed in at most $O(\epsilon^{-(p+1)/p})$ evaluations of the objective function and its derivatives. This is achieved by the use of a regularization method very much in the spirit of the first- and second-order ARC methods described in [4, 5].
∗This work has been partially supported by the Brazilian agencies FAPESP (grants 2010/10133-0, 2013/03447-6, 2013/05475-7, 2013/07375-0, and 2013/23494-9) and CNPq (grants 304032/2010-7, 309517/2014-1, 303750/2014-6, and 490326/2013-7) and by the Belgian Fund for Scientific Research (FNRS).
†Department of Computer Science, Institute of Mathematics and Statistics, University of São Paulo, Rua do Matão, 1010, Cidade Universitária, 05508-090, São Paulo, SP, Brazil. e-mail: {egbirgin | john}@ime.usp.br
‡Department of Applied Mathematics, Institute of Mathematics, Statistics, and Scientific Computing, University of Campinas, Campinas, SP, Brazil. e-mail: {martinez | sandra}@ime.unicamp.br
§Namur Center for Complex Systems (naXys) and Department of Mathematics, University of Namur, 61, rue de Bruxelles, B-5000 Namur, Belgium. e-mail: philippe.toint@unamur.be
2 A regularized p-th order model and algorithm
For $p \ge 1$, $p$ integer, consider the problem

$$\min_{x \in \mathbb{R}^n} f(x), \qquad (2.1)$$

where we assume that $f$ from $\mathbb{R}^n$ to $\mathbb{R}$ is bounded below and $p$-times continuously differentiable. We also assume that its $p$-th derivative at $x$, the $p$-th order tensor

$$\nabla_x^p f(x) = \left( \frac{\partial^p f}{\partial x_{i_1} \cdots \partial x_{i_p}}(x) \right)_{i_j \in \{1,\dots,n\},\ j = 1,\dots,p},$$

is Lipschitz continuous, i.e. that there exists a constant $L \ge 0$ such that, for all $x, y \in \mathbb{R}^n$,

$$\| \nabla_x^p f(x) - \nabla_x^p f(y) \|_{[p]} \le (p-1)!\, L\, \| x - y \|. \qquad (2.2)$$
In (2.2), $\|\cdot\|_{[p]}$ is the tensor norm recursively induced by the Euclidean norm $\|\cdot\|$ on the space of $p$-th order tensors, which is given by

$$\| T \|_{[p]} \stackrel{\rm def}{=} \max_{\|v_1\| = \cdots = \|v_p\| = 1} \left| T[v_1, \dots, v_p] \right|, \qquad (2.3)$$

where $T[v_1, \dots, v_j]$ stands for the tensor of order $q - j \ge 0$ resulting from the application of the $q$-th order tensor $T$ to the vectors $v_1, \dots, v_j$. Let $T_p(x, s)$ be the Taylor series of the function $f(x+s)$ at $x$ truncated at order $p$,

$$T_p(x, s) \stackrel{\rm def}{=} f(x) + \sum_{j=1}^{p} \frac{1}{j!} \nabla_x^j f(x)[s]^j, \qquad (2.4)$$
where the notation $T[s]^j$ stands for the tensor $T$ applied $j$ times to the vector $s$.
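To make the tensor notation concrete, the following minimal Python sketch evaluates $T[v_1,\dots,v_j]$ by successive contractions and assembles the truncated Taylor series (2.4) from derivative tensors supplied as NumPy arrays; the helper names `apply_tensor` and `taylor_model` are illustrative assumptions, not notation from the paper. (Note that, for $p = 2$, the induced norm (2.3) is simply the spectral norm of the Hessian matrix.)

```python
import math
import numpy as np

def apply_tensor(T, vectors):
    """Contract a q-th order tensor with j <= q vectors, giving T[v_1, ..., v_j]."""
    for v in vectors:
        T = np.tensordot(T, v, axes=([0], [0]))
    return T

def taylor_model(fx, derivatives, s):
    """T_p(x, s) as in (2.4): fx = f(x), derivatives[j-1] = j-th derivative tensor at x."""
    value = fx
    for j, D in enumerate(derivatives, start=1):
        value += apply_tensor(D, [s] * j) / math.factorial(j)
    return value

# Example with p = 2: T_2(x, s) = f(x) + g^T s + s^T H s / 2.
g, H, s = np.array([1.0, -2.0]), np.eye(2), np.array([0.5, 0.5])
print(taylor_model(0.0, [g, H], s))  # 0.0 + (-0.5) + 0.25 = -0.25
```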
Then Taylor's theorem, the identity

$$\int_0^1 (1 - \xi)^{p-1} \, d\xi = \frac{1}{p}, \qquad (2.5)$$

the induced nature of $\|\cdot\|_{[p]}$ and (2.2) imply that, for all $x, s \in \mathbb{R}^n$,
$$
\begin{array}{rcl}
f(x+s) & = & T_{p-1}(x,s) + \dfrac{1}{(p-1)!} \displaystyle\int_0^1 (1-\xi)^{p-1} \nabla_x^p f(x + \xi s)[s]^p \, d\xi \\[2ex]
& = & T_p(x,s) + \dfrac{1}{(p-1)!} \displaystyle\int_0^1 (1-\xi)^{p-1} \big( \nabla_x^p f(x + \xi s)[s]^p - \nabla_x^p f(x)[s]^p \big) \, d\xi \\[2ex]
& \le & T_p(x,s) + \dfrac{1}{(p-1)!} \displaystyle\int_0^1 (1-\xi)^{p-1} \big| \nabla_x^p f(x + \xi s)[s]^p - \nabla_x^p f(x)[s]^p \big| \, d\xi \\[2ex]
& \le & T_p(x,s) + \left[ \displaystyle\int_0^1 \frac{(1-\xi)^{p-1}}{(p-1)!} \, d\xi \right] \max_{\xi \in [0,1]} \big| \nabla_x^p f(x + \xi s)[s]^p - \nabla_x^p f(x)[s]^p \big| \\[2ex]
& \le & T_p(x,s) + \dfrac{1}{p!} \|s\|^p \max_{\xi \in [0,1]} \| \nabla_x^p f(x + \xi s) - \nabla_x^p f(x) \|_{[p]} \\[2ex]
& \le & T_p(x,s) + \dfrac{L}{p} \|s\|^{p+1}.
\end{array} \qquad (2.6)
$$

(For $p = 1$, (2.6) reduces to the familiar overestimation property $f(x+s) \le f(x) + \nabla_x^1 f(x)[s] + L\|s\|^2$ for functions with Lipschitz continuous gradients.)
Following the more general argument developed by Cartis, Gould and Toint [10], consider now, for an arbitrary unit vector $v$, $\phi(\alpha) = \nabla_x^1 f(x + \alpha s)[v]$ and $\tau_{p-1}(\alpha) = \sum_{i=0}^{p-1} \phi^{(i)}(0)\, \alpha^i / i!$.
Taylor's identity then gives that

$$\phi(1) - \tau_{p-1}(1) = \frac{1}{(p-2)!} \int_0^1 (1-\xi)^{p-2} \left[ \phi^{(p-1)}(\xi) - \phi^{(p-1)}(0) \right] d\xi.$$
Hence, since $\tau_{p-1}(1) = \nabla_s^1 T_p(x,s)[v]$,

$$\big( \nabla_x^1 f(x+s) - \nabla_s^1 T_p(x,s) \big)[v] = \frac{1}{(p-2)!} \int_0^1 (1-\xi)^{p-2} \big[ \nabla_x^p f(x + \xi s) - \nabla_x^p f(x) \big][s]^{p-1}[v] \, d\xi.$$
Thus, using the symmetry of the derivative tensors, picking v to maximize the absolute value
of the left-hand side and using (2.5), (2.3) and (2.2) successively, we obtain that
$$
\begin{array}{rcl}
\| \nabla_x^1 f(x+s) - \nabla_s^1 T_p(x,s) \|
& = & \dfrac{1}{(p-2)!} \left| \displaystyle\int_0^1 (1-\xi)^{p-2} \big( \nabla_x^p f(x + \xi s) - \nabla_x^p f(x) \big)[v] \left[ \dfrac{s}{\|s\|} \right]^{p-1} \|s\|^{p-1} \, d\xi \right| \\[2ex]
& \le & \dfrac{1}{(p-2)!} \left[ \displaystyle\int_0^1 (1-\xi)^{p-2} \, d\xi \right] \max_{\xi \in [0,1]} \left| \big( \nabla_x^p f(x + \xi s) - \nabla_x^p f(x) \big)[v] \left[ \dfrac{s}{\|s\|} \right]^{p-1} \right| \|s\|^{p-1} \\[2ex]
& \le & \dfrac{1}{(p-1)!} \max_{\xi \in [0,1]} \; \max_{\|w_1\| = \cdots = \|w_p\| = 1} \big| \big( \nabla_x^p f(x + \xi s) - \nabla_x^p f(x) \big)[w_1, \dots, w_p] \big| \; \|s\|^{p-1} \\[2ex]
& = & \dfrac{1}{(p-1)!} \max_{\xi \in [0,1]} \| \nabla_x^p f(x + \xi s) - \nabla_x^p f(x) \|_{[p]} \, \|s\|^{p-1} \\[2ex]
& \le & L \|s\|^p.
\end{array} \qquad (2.7)
$$
In order to describe our algorithm, we also define the regularized Taylor series

$$m(x, s, \sigma) = T_p(x, s) + \frac{\sigma}{p+1} \|s\|^{p+1}, \qquad (2.8)$$

whose gradient is

$$\nabla_s^1 m(x, s, \sigma) = \nabla_s^1 T_p(x, s) + \sigma \|s\|^p \frac{s}{\|s\|}. \qquad (2.9)$$
Note that

$$m(x, 0, \sigma) = T_p(x, 0) = f(x). \qquad (2.10)$$
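As a sanity check on (2.8)-(2.9), the following Python sketch (for $p = 2$, with illustrative random first and second derivatives `g` and `H` that are not from the paper) compares the analytic gradient (2.9) of the regularized model with a central finite-difference approximation; the two should agree up to truncation error.

```python
import numpy as np

def model(g, H, s, sigma, p=2):
    """m(x, s, sigma) - f(x): the regularized model (2.8), shifted by the constant f(x)."""
    return g @ s + 0.5 * s @ (H @ s) + sigma / (p + 1) * np.linalg.norm(s) ** (p + 1)

def model_grad(g, H, s, sigma, p=2):
    """The gradient (2.9): gradient of T_2 plus sigma * ||s||^(p-1) * s."""
    return g + H @ s + sigma * np.linalg.norm(s) ** (p - 1) * s

rng = np.random.default_rng(0)
g, s = rng.standard_normal(3), rng.standard_normal(3)
A = rng.standard_normal((3, 3)); H = A + A.T      # symmetric second-derivative matrix
sigma, h = 1.5, 1e-6
fd = np.array([(model(g, H, s + h * e, sigma) - model(g, H, s - h * e, sigma)) / (2 * h)
               for e in np.eye(3)])
print(np.allclose(fd, model_grad(g, H, s, sigma), atol=1e-5))  # expected: True
```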
The minimization algorithm we consider is now detailed as Algorithm 1 below.
Each iteration of this algorithm requires the approximate minimization of $m(x_k, s, \sigma_k)$, but we may note that conditions (2.12) and (2.13) are relatively weak, in that they only require a decrease of the regularized $p$-th order model and an approximate first-order stationary point: no global optimization of this possibly nonconvex model is needed. Fortunately, this approximate minimization does not involve additional computations of $f$ or of its derivatives at points other than $x_k$, and therefore the exact method used and the resulting effort spent in Step 2 have no impact on the evaluation complexity.
Algorithm 1: ARp

Step 0: Initialization. An initial point $x_0$ and an initial regularization parameter $\sigma_0 > 0$ are given, as well as an accuracy level $\epsilon$. The constants $\theta$, $\eta_1$, $\eta_2$, $\gamma_1$, $\gamma_2$, $\gamma_3$ and $\sigma_{\min}$ are also given and satisfy

$$\theta > 0, \quad \sigma_{\min} \in (0, \sigma_0], \quad 0 < \eta_1 \le \eta_2 < 1 \quad \text{and} \quad 0 < \gamma_1 < 1 < \gamma_2 < \gamma_3. \qquad (2.11)$$

Compute $f(x_0)$ and set $k = 0$.

Step 1: Test for termination. Evaluate $\nabla_x^1 f(x_k)$. If $\| \nabla_x^1 f(x_k) \| \le \epsilon$, terminate with the approximate solution $x_\epsilon = x_k$. Otherwise compute derivatives of $f$ from order 2 to $p$ at $x_k$.

Step 2: Step calculation. Compute the step $s_k$ by approximately minimizing the model $m(x_k, s, \sigma_k)$ with respect to $s$, in the sense that the conditions

$$m(x_k, s_k, \sigma_k) < m(x_k, 0, \sigma_k) \qquad (2.12)$$

and

$$\| \nabla_s^1 m(x_k, s_k, \sigma_k) \| \le \theta \|s_k\|^p \qquad (2.13)$$

hold.

Step 3: Acceptance of the trial point. Compute $f(x_k + s_k)$ and define

$$\rho_k = \frac{f(x_k) - f(x_k + s_k)}{T_p(x_k, 0) - T_p(x_k, s_k)}. \qquad (2.14)$$

If $\rho_k \ge \eta_1$, then define $x_{k+1} = x_k + s_k$; otherwise define $x_{k+1} = x_k$.

Step 4: Regularization parameter update. Set

$$\sigma_{k+1} \in \begin{cases} [\max(\sigma_{\min}, \gamma_1 \sigma_k), \sigma_k] & \text{if } \rho_k \ge \eta_2, \\ [\sigma_k, \gamma_2 \sigma_k] & \text{if } \rho_k \in [\eta_1, \eta_2), \\ [\gamma_2 \sigma_k, \gamma_3 \sigma_k] & \text{if } \rho_k < \eta_1. \end{cases} \qquad (2.15)$$

Increment $k$ by one and go to Step 1 if $\rho_k \ge \eta_1$ or to Step 2 otherwise.
Also note that the numerator and denominator in (2.14) are strictly comparable, the latter being Taylor's approximation of the former, without the regularization parameter playing any role.
Iterations for which $\rho_k \ge \eta_1$ (and hence $x_{k+1} = x_k + s_k$) are called "successful", and we denote by $S_k \stackrel{\rm def}{=} \{ 0 \le j \le k \mid \rho_j \ge \eta_1 \}$ the index set of all successful iterations between 0 and $k$. We also denote by $U_k$ its complement in $\{0, \dots, k\}$, which corresponds to the index set of "unsuccessful" iterations between 0 and $k$. Note that, before termination, each successful iteration requires the evaluation of $f$ and its first $p$ derivatives, while only the evaluation of $f$ is needed at unsuccessful ones.
We first derive a very simple result on the model decrease obtained under condition (2.12).
Lemma 2.1 The mechanism of Algorithm 1 guarantees that, for all $k \ge 0$,

$$T_p(x_k, 0) - T_p(x_k, s_k) \ge \frac{\sigma_k}{p+1} \|s_k\|^{p+1}. \qquad (2.16)$$
Proof. Observe that, because of (2.12) and (2.8),

$$0 \le m(x_k, 0, \sigma_k) - m(x_k, s_k, \sigma_k) = T_p(x_k, 0) - T_p(x_k, s_k) - \frac{\sigma_k}{p+1} \|s_k\|^{p+1},$$

which implies the desired bound. □
As a result, we obtain that (2.14) is well-defined for all $k \ge 0$. We next deduce a simple upper bound on the regularization parameter $\sigma_k$.
Lemma 2.2 Suppose that $f$ is $p$ times continuously differentiable with Lipschitz continuous $p$-th derivative (i.e., that (2.2) holds). Then, for all $k \ge 0$,

$$\sigma_k \le \sigma_{\max} \stackrel{\rm def}{=} \max\left[ \sigma_0, \; \frac{\gamma_3 L (p+1)}{p (1 - \eta_2)} \right]. \qquad (2.17)$$
Proof. Assume that

$$\sigma_k \ge \frac{L (p+1)}{p (1 - \eta_2)}. \qquad (2.18)$$

Using (2.6) and (2.16), we may then deduce that

$$|\rho_k - 1| \le \frac{| f(x_k + s_k) - T_p(x_k, s_k) |}{| T_p(x_k, 0) - T_p(x_k, s_k) |} \le \frac{L (p+1)}{p\, \sigma_k} \le 1 - \eta_2,$$

and thus that $\rho_k \ge \eta_2$. Then iteration $k$ is very successful in that $\rho_k \ge \eta_2$ and $\sigma_{k+1} \le \sigma_k$. As a consequence, since (2.15) increases the regularization parameter by a factor of at most $\gamma_3$, the mechanism of the algorithm ensures that (2.17) holds. □
Our next step, very much in the line of the theory proposed in [5], is to show that the steplength cannot be arbitrarily small compared with the gradient of the objective function at the trial point $x_k + s_k$.