
A Descent Lemma Beyond Lipschitz Gradient Continuity: First-Order Methods Revisited and Applications

TL;DR: A framework is introduced that circumvents the intricate question of Lipschitz continuity of gradients by using an elegant and easy-to-check convexity condition which captures the geometry of the constraints.
Abstract: The proximal gradient method and its variants form one of the most attractive classes of first-order algorithms for minimizing the sum of two convex functions, one of which is nonsmooth. However, it requires the differentiable part of the objective to have a Lipschitz continuous gradient, thus precluding its use in many applications. In this paper we introduce a framework which allows one to circumvent the intricate question of Lipschitz continuity of gradients by using an elegant and easy-to-check convexity condition which captures the geometry of the constraints. This condition translates into a new descent lemma which in turn leads to a natural derivation of the proximal-gradient scheme with Bregman distances. We then identify a new notion of asymmetry measure for Bregman distances, which is central in determining the relevant step-size. These novelties allow us to prove a global sublinear rate of convergence, and, as a by-product, global pointwise convergence is obtained. This provides a new path to a broad spectrum of problems arising in key applications which were, until now, considered out of reach via proximal gradient methods. We illustrate this potential by showing how our results can be applied to build new and simple schemes for Poisson inverse problems.

Summary (2 min read)

Contribution and Outline

  • The methodology underlying their approach and leading to a proximal-based algorithm freed from Lipschitz gradient continuity is developed in Section 2.
  • A key player is a new simple, yet useful descent Lemma which allows one to trade Lipschitz continuity of the gradient for an elementary convexity property.
  • In particular, an important notion of asymmetry coefficient is introduced and shown to play a central role in determining the relevant step size of the proposed scheme.
  • The method is presented in Section 3 and its analysis is developed in Section 4, where a sublinear O(1/k) rate of convergence is established without the traditional Lipschitz gradient continuity of the smooth function.
  • To demonstrate the benefits and potential of their new approach, the authors illustrate in Section 5 how it can be successfully applied to a broad class of Poisson linear inverse problems, leading to new proximal-based algorithms for these problems.

2.2. A New Descent Lemma Beyond Lipschitz Continuity

  • The use of Bregman distances in optimization within various contexts is widespread and cannot be reviewed here.
  • Many interesting results connecting, for example, Bregman proximal distances with dynamical systems can be found in [10] and references therein, and many more properties and applications can be found in the fundamental and comprehensive work [2].

Here D_h(x, y) := h(x) − h(y) − ⟨∇h(y), x − y⟩ denotes the Bregman distance associated with h. Clearly, D_h(x, y) ≥ 0, with equality if and only if x = y when h is strictly convex.

  • The authors are then ready to establish the simple but key extended descent lemma.
  • It is easy to see that condition (LC) admits various alternative reformulations which can facilitate checking it, and which the authors conveniently collect in the following proposition.
  • Proposition 1. Consider the pair of functions (g, h) and assume that the above regularity conditions on h and g hold.
  • The proof follows easily from the definition of the Bregman distance and the usual convexity properties.
  • This holds with g(x) = x log x, which does not have a Lipschitz continuous gradient; a numerical check of the resulting extended descent lemma is sketched after this list.
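As a quick illustration of the trade described above, the following sketch checks the extended descent inequality g(x) ≤ g(y) + ⟨∇g(y), x − y⟩ + L·D_h(x, y) numerically for g(x) = Σ x_i log x_i, whose gradient is not Lipschitz on the positive orthant. The reference kernel h = g + ½‖·‖² and the constant L = 1 are our own illustrative choices (then L·h − g = ½‖·‖² is convex, so condition (LC) holds); they are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Smooth part: g(x) = sum(x*log x); its gradient log(x)+1 blows up near 0,
# so g has no globally Lipschitz gradient on the positive orthant.
g      = lambda x: np.sum(x * np.log(x))
grad_g = lambda x: np.log(x) + 1.0

# Illustrative reference kernel (our choice): h = g + 0.5*||x||^2, a Legendre
# function on R^d_{++}. Then L*h - g = 0.5*||x||^2 is convex with L = 1,
# i.e. condition (LC) holds with L = 1.
h      = lambda x: g(x) + 0.5 * x @ x
grad_h = lambda x: grad_g(x) + x
D_h    = lambda x, y: h(x) - h(y) - grad_h(y) @ (x - y)   # Bregman distance

L = 1.0
for _ in range(10000):
    x = rng.random(6) * 10 + 1e-6    # points in int dom h, some very close to 0
    y = rng.random(6) * 10 + 1e-6
    lhs = g(x)
    rhs = g(y) + grad_g(y) @ (x - y) + L * D_h(x, y)
    assert lhs <= rhs + 1e-9         # extended descent inequality
print("extended descent inequality verified on all sampled pairs")
```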

2.3. A Symmetry Measure for D_h

  • Since h is strictly convex, the objective in (11) has at most one minimizer.
  • Assuming (i), the authors obtain that the function in question plus the indicator i_{dom h} is coercive; since Bregman distances are nonnegative, the objective within T is also coercive and thus T is nonempty.
  • When assuming (ii), the argument follows from the supercoercivity properties of the same objective, see [4].
  • That T(x) lies in the interior of dom h can be seen through the optimality condition for T(x), which implies that ∂h(T(x)) must be nonempty; a one-line closed form for the special case f ≡ 0 is sketched after this list.
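To give a concrete sense of how the map T can be evaluated, here is a minimal sketch for the special case f ≡ 0, where the optimality condition reduces to ∇h(u) = ∇h(x) − λ∇g(x) and can be inverted through ∇h*. The Boltzmann-Shannon-type kernel, the quadratic objective, and the step size are illustrative choices of ours, not the paper's.

```python
import numpy as np

# Boltzmann-Shannon-type kernel h(x) = sum(x*log x - x) on R^d_{++} (illustrative choice)
grad_h      = lambda x: np.log(x)
grad_h_conj = lambda y: np.exp(y)          # inverse of grad_h, cf. identity (3)

def bregman_prox_step(x, grad_g_x, lam):
    """T(x) for f = 0: solve grad_h(u) = grad_h(x) - lam * grad_g(x)."""
    return grad_h_conj(grad_h(x) - lam * grad_g_x)

# Example: one step on g(x) = 0.5*||x - c||^2 from a strictly positive point
c = np.array([1.0, 2.0, 0.5])
x = np.array([3.0, 0.2, 1.0])
u = bregman_prox_step(x, x - c, lam=0.3)
print(u)                                    # stays automatically in int dom h = R^d_{++}
```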

Example 2.

  • Note that this yields a nonseparable Bregman distance which is relevant for ball constraints; one such kernel is sketched after this list.
  • Concerning NoLips, the situation is exactly the same: for a given kernel h, sets and functions which are prox-friendly are scarce and are modeled on h.
  • However, a major advantage of their approach is that one can choose the kernel h to adapt to the geometry of the given function/set.
  • This situation will be illustrated in Section 5, for a broad class of inverse problems involving Poisson noise.
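The summary does not reproduce Example 2 itself; as a hedged illustration consistent with the bullet above, one nonseparable Legendre kernel suited to the Euclidean unit ball is h(x) = −√(1 − ‖x‖²). Treat this specific kernel as our assumption rather than the paper's stated example.

```python
import numpy as np

# A nonseparable Legendre kernel for the Euclidean unit ball (our assumed example):
# h(x) = -sqrt(1 - ||x||^2), dom h = closed unit ball, differentiable on its interior.
h      = lambda x: -np.sqrt(1.0 - x @ x)
grad_h = lambda x: x / np.sqrt(1.0 - x @ x)       # blows up at the boundary (essential smoothness)

def D_h(x, y):
    """Bregman distance induced by h; couples coordinates through ||x||^2."""
    return h(x) - h(y) - grad_h(y) @ (x - y)

x = np.array([0.3, -0.2, 0.1])
y = np.array([0.0,  0.5, 0.4])
print(D_h(x, y), D_h(y, x))   # nonnegative, and generally not symmetric
```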

Example 3.

  • Here the symmetry coefficient α(h) and the relative convexity constant L play central roles; a numerical estimate of α(h) for two standard kernels is sketched after this list.
  • These issues are strongly related to the geometric features of h, but also to how well they match the pair (f + g, dom h).
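A rough way to get a feel for the symmetry coefficient, understood here as α(h) = inf{ D_h(x, y)/D_h(y, x) : x ≠ y in int dom h }, is Monte-Carlo sampling, which yields an upper bound on the infimum. The kernels, dimension and sampling ranges below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def symmetry_estimate(h, grad_h, sampler, n=50000):
    """Monte-Carlo upper bound on alpha(h) = inf D_h(x,y)/D_h(y,x) over x != y."""
    D = lambda x, y: h(x) - h(y) - grad_h(y) @ (x - y)
    best = 1.0
    for _ in range(n):
        x, y = sampler(), sampler()
        dxy, dyx = D(x, y), D(y, x)
        if dyx > 1e-12:
            best = min(best, dxy / dyx)
    return best

d = 3
# Squared Euclidean norm: D_h is symmetric, so alpha(h) = 1.
print(symmetry_estimate(lambda x: 0.5 * x @ x, lambda x: x,
                        lambda: rng.standard_normal(d)))
# Boltzmann-Shannon entropy: alpha(h) = 0; even coarse sampling gives a bound well below 1.
print(symmetry_estimate(lambda x: np.sum(x * np.log(x)), lambda x: np.log(x) + 1.0,
                        lambda: rng.random(d) * 5 + 1e-4))
```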

Assumptions H:

  • Remark 4. (a) All the examples given previously (the Boltzmann-Shannon, Fermi-Dirac, Hellinger and Burg entropies) satisfy the above set of assumptions.
  • For much more general and accurate results on the interplay between Legendre functions and Bregman separation properties on the boundary, the authors refer the reader to [2].
  • Theorem 1 recovers and extends the complexity/convergence results of [15, Theorem 3.4].
  • Formally, the authors are dealing with linear inverse problems that can be conveniently described as follows.
  • This class of problems is sufficiently broad to illustrate the theory and algorithm the authors have developed.

5.2. Two Simple Algorithms for Poisson Linear Inverse Problems

  • This work outlines in a simple and transparent way the basic ingredients needed to apply the proximal gradient methodology when the gradient of the smooth part in the composite model (P) is not Lipschitz continuous.
  • Thanks to a new and natural extension of the descent Lemma and a sharp definition of the step-size through the notion of symmetry, the authors have shown that NoLips shares convergence and complexity results akin to those of the usual proximal gradient.
  • The last section illustrates the potential of the proposed framework when applied to the key research area of linear inverse problems with Poisson noise arising in the imaging sciences; a sketch of such a scheme follows this list.
  • On the theoretical side, their approach lays the ground for many new and promising perspectives for gradient-based methods that were not conceivable before.
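The following minimal sketch shows the kind of closed-form, multiplicative-like scheme this framework suggests for Poisson linear inverse problems in the unregularized case f ≡ 0, using the Burg entropy kernel h(x) = −Σ log x_j. The relative-smoothness constant L = ‖b‖₁ and the step-size rule below reflect our reading of this setting and should be treated as assumptions rather than the paper's exact statement.

```python
import numpy as np

def nolips_poisson(A, b, iters=200, safety=0.99):
    """Bregman proximal gradient (NoLips-style) sketch for
    min_x sum_i (<a_i, x> - b_i * log <a_i, x>)  over x > 0,
    with the Burg kernel h(x) = -sum_j log x_j and L = ||b||_1 (assumed)."""
    d = A.shape[1]
    x = np.ones(d)                       # strictly positive starting point
    lam = safety / np.sum(b)             # step size below 1/L keeps the iterates positive
    for _ in range(iters):
        Ax = A @ x
        grad = A.T @ (1.0 - b / Ax)      # gradient of the smooth Poisson data term
        # Closed-form Bregman step: -1/x+ = -1/x - lam*grad  =>  x+ = x / (1 + lam*x*grad)
        x = x / (1.0 + lam * x * grad)
    return x

# Synthetic nonnegative data (illustrative only)
rng = np.random.default_rng(5)
A = rng.random((50, 20))
x_true = rng.random(20) + 0.1
b = rng.poisson(A @ x_true) + 1e-3       # keep b strictly positive for the log term
x_hat = nolips_poisson(A, b)
print("relative residual:", np.linalg.norm(A @ x_hat - b) / np.linalg.norm(b))
```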


Seediscussions,stats,andauthorprofilesforthispublicationat:https://www.researchgate.net/publication/308313905
ADescentLemmaBeyondLipschitzGradient
Continuity:First-OrderMethodsRevisitedand
Applications
ArticleinMathematicsofOperationsResearch·July2016
DOI:10.1287/moor.2016.0817
CITATIONS
16
READS
1,577
3authors,including:
Someoftheauthorsofthispublicationarealsoworkingontheserelatedprojects:
NonconvexOptimizationViewproject
JérômeBolte
ToulouseSchoolofEconomics
51PUBLICATIONS2,199CITATIONS
SEEPROFILE
AllcontentfollowingthispagewasuploadedbyJérômeBolteon19September2016.
Theuserhasrequestedenhancementofthedownloadedfile.

A descent Lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications

Heinz H. Bauschke
Mathematics, University of British Columbia, Kelowna, B.C. V1V 1V7, Canada, heinz.bauschke@ubc.ca

Jérôme Bolte
Toulouse School of Economics, Université Toulouse Capitole, Manufacture des Tabacs, 21 allée de Brienne, 31015 Toulouse, France, jerome.bolte@ut-capitole.fr

Marc Teboulle
School of Mathematical Sciences, Tel Aviv University, Ramat Aviv 69978, Israel, teboulle@post.tau.ac.il
Key words: first-order methods, composite nonsmooth convex minimization, descent lemma, proximal-gradient algorithms, complexity, Bregman distance, multiplicative Poisson linear inverse problems
MSC2000 subject classification: 90C25, 65K05
OR/MS subject classification: Convex Programming/Algorithms
1. Introduction  First-order methods have occupied the forefront of research in continuous optimization for more than a decade. This is due to their wide applicability in a huge spectrum of fundamental and disparate applications such as signal processing, image sciences, machine learning, communication systems, and astronomy, to mention just a few, but also to their computational simplicity, which makes them ideal methods for solving big data problems within medium accuracy levels. Recent research activities in this field are still conducted at a furious pace in all the aforementioned applications (and many more), as testified by the large volume of literature; see e.g. [29, 34] and references therein for an appetizer.
A fundamental generic optimization model that encompasses various classes of smooth/nonsmooth convex models arising in the alluded applications is the well known composite minimization problem, which consists in minimizing the sum of a possibly nonsmooth extended valued function and a differentiable one over a real Euclidean space X (see a more precise description in §2):

\[
(\mathrm{P}) \qquad \inf\{\, f(x) + g(x) \;:\; x \in X \,\}.
\]

Despite its striking simplicity, this model is very rich and has led to the development of fundamental and well known algorithms. A mother scheme is the so-called forward-backward splitting method, which goes back at least to Passty [30] and Bruck [12] and which was developed in the more general setting of maximal monotone operators. When specialized to the convex problem (P), this method is often called the proximal gradient method (PGM), a terminology we adopt in this article. One of the earliest works describing and analyzing the PGM is, for example, that of Fukushima and Mine [21]. The more recent work by Combettes and Wajs [17] provides important foundational insights and has popularized the method for a wide audience. More recently, the introduction of fast versions of the PGM, such as FISTA by Beck and Teboulle [5], which extends the seminal and fundamental work on the optimal gradient methods of Nesterov [27], has resulted in a burst of research activities.
A central property required in the analysis of gradient methods, like the PGM, is that of the Lipschitz continuity of the gradient of the smooth part. Such a property implies (and for a convex function is equivalent to) the so-called descent Lemma, e.g., [9], which provides a quadratic upper approximation to the smooth part. This simple process is at the root of the proximal gradient method, as well as many other methods. However, in many applications the differentiable function does not have such a property, e.g., in the broad class of Poisson inverse problems (see e.g. the recent review paper [8], which also includes over 130 references), therefore precluding the use of the PGM methodology. When both f and g have an easily computable proximal operator, one could also consider tackling the composite model (P) by applying the alternating direction method of multipliers (ADM) scheme [23]. For many problems, these schemes are known to be quite efficient. However, note that even in simple cases, one faces several serious difficulties that we now briefly recall. First, being a primal-dual splitting method, the ADM scheme may considerably increase the dimension of the problem (by the introduction of auxiliary splitting variables). Secondly, the method depends on one (or more) unknown penalty parameter that needs to be heuristically chosen. Finally, to our knowledge, the convergence rate results of ADM based schemes are weaker, holding only for the primal-dual gap in terms of ergodic sequences, see [14, 25, 33] and references therein. Moreover, the complexity bound constant not only depends on the unknown penalty parameter, but also on the norm of the matrix defining the splitting, which in many applications can be huge.
The main goal of this paper is to rectify this situation. We introduce a framework which allows us to derive a class of proximal gradient based algorithms which are proven to share most of the convergence properties and complexity of the classical proximal gradient, yet where the usual restrictive condition of Lipschitz continuity of the gradient of the differentiable part of problem (P) is not required. It is instead traded for a more general and flexible convexity condition which involves the problem's data and can be specified by the user for each given problem. This is a new path to a broad spectrum of optimization models arising in key applications which were not accessible before. Surprisingly, the derivation and the development of our results start from a very simple fact (which appears to have been overlooked) which underlines that the main ingredient in the success of the PGM is to have an appropriate descent Lemma, i.e., an adequate upper approximation of the objective function.
Contribution and Outline  The methodology underlying our approach and leading to a proximal-based algorithm freed from Lipschitz gradient continuity is developed in Section 2. A key player is a new simple, yet useful descent Lemma which allows us to trade Lipschitz continuity of the gradient for an elementary convexity property. We further clarify these results by deriving several properties and examples and highlighting the key differences with the traditional proximal gradient method. In particular, an important notion of asymmetry coefficient is introduced and shown to play a central role in determining the relevant step size of the proposed scheme. The method is presented in Section 3 and its analysis is developed in Section 4, where a sublinear

Bauschke, Bolte, and Teboulle: ADescentLemmaBeyondLipschitzContinuityforfirst-orderMethods
Mathematics of Op erations Research 00(0), pp. 000–000,
c
0000 INFORMS 3
O(1/k) rate of convergence is established without the traditional Lipschitz gradient continuity of the smooth function. As a by-product, pointwise convergence of the method is also established. To demonstrate the benefits and potential of our new approach, we illustrate in Section 5 how it can be successfully applied to a broad class of Poisson linear inverse problems, leading to new proximal-based algorithms for these problems.
Notation  Throughout the paper, the notation we employ is standard and as in [32] or [4]. We recall that for any set C, i_C(·) stands for the usual indicator function, which is equal to 0 if x ∈ C and +∞ otherwise, and \(\overline{C}\) denotes the closure of C. We set ℝ_{++} = (0, +∞).
2. A New Look at The Proximal Gradient Method  We start by recalling the basic elements underlying the proximal gradient method and its analysis, which motivates the forthcoming developments.
Let X = ℝ^d be a real Euclidean space with inner product ⟨·, ·⟩ and induced norm ‖·‖. Given a closed convex set C with nonempty interior, consider the convex problem

\[
\inf\{\, \Psi(x) := f(x) + g(x) \;:\; x \in C \,\},
\]

where f, g are proper, convex and lower semicontinuous (lsc), with g continuously differentiable on int dom g ≠ ∅ (see later on below for a precise description).
First consider the case when C = ℝ^d. For any fixed given point x ∈ X and any λ > 0, the main step of the proximal gradient method consists in minimizing an upper approximation of the objective obtained by summing a quadratic majorant of the differentiable part g and f, thus leaving untouched the nonsmooth part f of Ψ:

\[
x^+ = \operatorname{argmin}\Big\{\, g(x) + \langle \nabla g(x),\, u - x\rangle + \frac{1}{2\lambda}\|u - x\|^2 + f(u) \;:\; u \in \mathbb{R}^d \,\Big\}.
\]
This is the proximal gradient algorithm, see e.g. [6]. Clearly, the minimizer x^+ exists and is unique, and, ignoring the constant terms in x, it reduces to

\[
x^+ = \operatorname{argmin}_{u}\Big\{\, f(u) + \frac{1}{2\lambda}\,\big\|u - \big(x - \lambda\nabla g(x)\big)\big\|^2 \,\Big\} = \operatorname{prox}_{\lambda f}\big(x - \lambda\nabla g(x)\big), \tag{1}
\]
where prox_φ(·) stands for the so-called Moreau proximal map [26] of a proper lsc convex function φ. Thus, the PG scheme consists of the composition of a proximal (implicit/backward) step on f with a gradient (explicit/forward) step of g.¹
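For concreteness, here is a minimal sketch of iteration (1) for the standard instance g(u) = ½‖Au − b‖² and f(u) = μ‖u‖₁, for which prox_{λf} is componentwise soft-thresholding. The data, the weight μ and the step size λ ≤ 1/L_g are placeholder choices for illustration only.

```python
import numpy as np

def soft_threshold(v, tau):
    # prox of tau * ||.||_1: componentwise shrinkage toward zero
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient(A, b, mu, lam, iters=500):
    """Iterate x+ = prox_{lam*f}(x - lam*grad g(x)) for
    g(x) = 0.5*||Ax - b||^2 and f(x) = mu*||x||_1, cf. (1)."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                       # gradient of the smooth part g
        x = soft_threshold(x - lam * grad, lam * mu)   # forward step, then backward (prox) step
    return x

# Example usage with synthetic data; lam <= 1/L_g with L_g = ||A||_2^2
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
b = A @ (rng.standard_normal(100) * (rng.random(100) < 0.1))
lam = 1.0 / np.linalg.norm(A, 2) ** 2
x_hat = proximal_gradient(A, b, mu=0.1, lam=lam)
print("nonzeros:", np.count_nonzero(np.round(x_hat, 6)))
```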
A key assumption needed in the very construction and in the analysis of the PG scheme is that g admits a Lipschitz continuous gradient, with Lipschitz constant L_g. As a simple consequence of this assumption (for a convex function g, this is an equivalence), we obtain the so-called descent Lemma, see e.g., [9], namely for any L ≥ L_g,

\[
g(x) \le g(y) + \langle x - y, \nabla g(y)\rangle + \frac{L}{2}\|x - y\|^2, \qquad \forall x, y \in \mathbb{R}^d. \tag{2}
\]

This inequality not only naturally provides an upper quadratic approximation of g, but is also a crucial pillar in the analysis of any PG based method.
This leads us to the following simple observation:
¹ This also can be seen by convex calculus, which gives 0 ∈ λ(∂f(x^+) + ∇g(x)) + x^+ − x, which is equivalent to x^+ = (Id + λ∂f)^{-1}(Id − λ∇g)(x) = prox_{λf}(x − λ∇g(x)).

Bauschke, Bolte, and Teboulle: ADescentLemmaBeyondLipschitzContinuityforfirst-orderMethods
4 Mathematics of Op e r ati on s Research 00(0), pp. 000–000,
c
0000 INFORMS
Main Observation  Developing the squared norm in (2), simple algebra shows that it can be equivalently written as:

\[
\frac{L}{2}\|x\|^2 - g(x) \;\ge\; \frac{L}{2}\|y\|^2 - g(y) + \langle Ly - \nabla g(y),\, x - y\rangle, \qquad \forall x, y \in \mathbb{R}^d,
\]

which in turn is nothing else but the gradient inequality for the convex function (L/2)‖x‖² − g(x). Thus, for a given smooth convex function g on ℝ^d, the descent Lemma is equivalent to saying that (L/2)‖x‖² − g(x) is convex on ℝ^d.
This elementary and known fact (see, e.g., [4, Theorem 18.15(vi)]) seems to have been overlooked. It naturally suggests considering, instead of the squared norm used for the unconstrained case C = ℝ^d, a more general convex function that captures the geometry of the constraint C. This provides the motivation for the forthcoming proximal gradient based algorithm and its analysis for the constrained composite problem (P).
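A small numerical illustration of this observation: for the smooth convex function g(x) = Σ log(1 + e^{x_i}), whose gradient is 1/4-Lipschitz, both the descent inequality (2) with L = 1/4 and the gradient inequality of φ = (L/2)‖·‖² − g hold on randomly sampled pairs, as the equivalence predicts. The particular g and the sampling below are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 0.25  # Lipschitz constant of grad g for g(x) = sum(log(1 + exp(x)))

def g(x):
    return np.sum(np.logaddexp(0.0, x))      # log(1 + e^x), numerically stable

def grad_g(x):
    return 1.0 / (1.0 + np.exp(-x))          # sigmoid

def phi(x):
    return 0.5 * L * x @ x - g(x)            # the function whose convexity encodes (2)

def grad_phi(x):
    return L * x - grad_g(x)

for _ in range(10000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    descent = g(x) <= g(y) + grad_g(y) @ (x - y) + 0.5 * L * (x - y) @ (x - y) + 1e-12
    gradient_ineq = phi(x) >= phi(y) + grad_phi(y) @ (x - y) - 1e-12
    assert descent and gradient_ineq
print("descent lemma and convexity of (L/2)||x||^2 - g(x) hold on all sampled pairs")
```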
2.1. The Constrained Composite Problem  Our strategy to handle the constraint set C is standard: a Legendre function on C is chosen and its associated Bregman distance is used as a proximity measure. Let us first recall the definition of a Legendre function.
Definition 1 (Legendre functions). [32, Chapter 26] Let h : X → (−∞, +∞] be a lsc proper convex function. It is called:
(i) essentially smooth, if h is differentiable on int dom h, with moreover ‖∇h(x_k)‖ → +∞ for every sequence {x_k}_{k∈ℕ} ⊂ int dom h converging to a boundary point of dom h as k → +∞;
(ii) of Legendre type if h is essentially smooth and strictly convex on int dom h.
Also, let us recall the useful fact that h is of Legendre type if and only if its conjugate h* is of Legendre type. Moreover, the gradient of a Legendre function h is a bijection from int dom h to int dom h*, and its inverse is the gradient of the conjugate ([32, Thm 26.5]); that is, we have

\[
(\nabla h)^{-1} = \nabla h^* \quad \text{and} \quad h^*(\nabla h(x)) = \langle x, \nabla h(x)\rangle - h(x). \tag{3}
\]

Recall also that

\[
\operatorname{dom}\partial h = \operatorname{int\,dom} h \quad \text{with} \quad \partial h(x) = \{\nabla h(x)\}, \qquad \forall x \in \operatorname{int\,dom} h. \tag{4}
\]
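As a concrete check of identity (3), consider the Boltzmann-Shannon-type kernel h(x) = Σ(x_i log x_i − x_i) on the positive orthant, a standard Legendre function whose conjugate is h*(y) = Σ e^{y_i}; this specific kernel is chosen only for illustration.

```python
import numpy as np

# Legendre kernel h(x) = sum(x*log x - x) on the positive orthant (illustrative choice)
h       = lambda x: np.sum(x * np.log(x) - x)
grad_h  = lambda x: np.log(x)
h_conj  = lambda y: np.sum(np.exp(y))   # h*(y) = sum(exp(y))
grad_hc = lambda y: np.exp(y)           # gradient of the conjugate

rng = np.random.default_rng(2)
x = rng.random(4) + 0.1                 # a point in int dom h = R^d_{++}

# (3): (grad h)^{-1} = grad h*  and  h*(grad h(x)) = <x, grad h(x)> - h(x)
assert np.allclose(grad_hc(grad_h(x)), x)
assert np.isclose(h_conj(grad_h(x)), x @ grad_h(x) - h(x))
print("identity (3) verified for the Boltzmann-Shannon-type kernel")
```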
The Problem and Blanket Assumptions  Our aim is thus to solve

\[
v(\mathrm{P}) = \inf\{\, \Psi(x) := f(x) + g(x) \;:\; x \in \overline{\operatorname{dom}}\, h \,\},
\]

where \(\overline{\operatorname{dom}}\, h = \overline{C}\) denotes the closure of dom h.
The following assumptions on the problem's data are made throughout the paper (and referred to as the blanket assumptions).
Assumption A
(i) f : X → (−∞, +∞] is proper lower semicontinuous (lsc) convex,
(ii) h : X → (−∞, +∞] is of Legendre type,
(iii) g : X → (−∞, +∞] is proper lsc convex with dom g ⊇ dom h, and is differentiable on int dom h,
(iv) dom f ∩ int dom h ≠ ∅,
(v) −∞ < v(P) = inf{Ψ(x) : x ∈ \(\overline{\operatorname{dom}}\, h\)} = inf{Ψ(x) : x ∈ dom h}.
Note that the second equality in (v) follows e.g. from [4, Proposition 11.1(iv)] and (iv), because dom(f + g) ∩ int dom h = dom f ∩ int dom h ≠ ∅.


References
Book
01 Mar 2004
TL;DR: A comprehensive introduction to convex optimization, focused on recognizing convex optimization problems and then finding the most appropriate technique for solving them.
Abstract: Convex optimization problems arise frequently in many different fields. A comprehensive introduction to the subject, this book shows in detail how such problems can be solved numerically with great efficiency. The focus is on recognizing convex optimization problems and then finding the most appropriate technique for solving them. The text contains many worked examples and homework exercises and will appeal to students, researchers and practitioners in fields such as engineering, computer science, mathematics, statistics, finance, and economics.

33,341 citations

Journal ArticleDOI
TL;DR: A new fast iterative shrinkage-thresholding algorithm (FISTA) which preserves the computational simplicity of ISTA but with a global rate of convergence which is proven to be significantly better, both theoretically and practically.
Abstract: We consider the class of iterative shrinkage-thresholding algorithms (ISTA) for solving linear inverse problems arising in signal/image processing. This class of methods, which can be viewed as an extension of the classical gradient algorithm, is attractive due to its simplicity and thus is adequate for solving large-scale problems even with dense matrix data. However, such methods are also known to converge quite slowly. In this paper we present a new fast iterative shrinkage-thresholding algorithm (FISTA) which preserves the computational simplicity of ISTA but with a global rate of convergence which is proven to be significantly better, both theoretically and practically. Initial promising numerical results for wavelet-based image deblurring demonstrate the capabilities of FISTA which is shown to be faster than ISTA by several orders of magnitude.

11,413 citations


"A Descent Lemma Beyond Lipschitz Gr..." refers methods in this paper

  • ...[10] J. Bolte and M. Teboulle, Barrier operators and associated gradient-like dynamical systems for constrained minimization problems, SIAM Journal on Control and Optimization 42 (2003), 1266–1292....


  • ...[35] M. Teboulle, Entropic proximal mappings with application to nonlinear programming, Mathematics of Operations Research 17 (1992), 670–690....


  • ...[33] R. Shefi and M. Teboulle....


  • ...G. Chen and M. Teboulle, Journal of Mathematical Imaging and Vision, (2010), 1–26....


  • ...[5] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Science 2 (2009), 183–202....


01 Feb 1977

5,933 citations


"A Descent Lemma Beyond Lipschitz Gr..." refers methods in this paper

  • ...Notation Throughout the paper, the notation we employ is standard and as in [32] or [4]....


Journal ArticleDOI
TL;DR: A first-order primal-dual algorithm for non-smooth convex optimization problems with known saddle-point structure can achieve O(1/N^2) convergence on problems where the primal or the dual objective is uniformly convex, and it can show linear convergence, i.e. O(ω^N) for some ω ∈ (0,1), on smooth problems.
Abstract: In this paper we study a first-order primal-dual algorithm for non-smooth convex optimization problems with known saddle-point structure. We prove convergence to a saddle-point with rate O(1/N) in finite dimensions for the complete class of problems. We further show accelerations of the proposed algorithm to yield improved rates on problems with some degree of smoothness. In particular we show that we can achieve O(1/N^2) convergence on problems where the primal or the dual objective is uniformly convex, and we can show linear convergence, i.e. O(ω^N) for some ω ∈ (0,1), on smooth problems. The wide applicability of the proposed algorithm is demonstrated on several imaging problems such as image denoising, image deconvolution, image inpainting, motion estimation and multi-label image segmentation.

4,487 citations


"A Descent Lemma Beyond Lipschitz Gr..." refers background in this paper

  • ...Finally, to our knowledge, the convergence rate results of ADM based schemes are weaker, holding only for primal-dual gap in terms of ergodic sequences, see [14, 25, 33] and references therein....
