Showing papers by "Dmitriy Drusvyatskiy published in 2018"


Journal ArticleDOI
TL;DR: The proximal gradient algorithm for minimizing the sum of a smooth and a nonsmooth convex function often converges linearly even without strong convexity; this paper explains the phenomenon by establishing the equivalence of the underlying error bound to a natural quadratic growth condition.
Abstract: The proximal gradient algorithm for minimizing the sum of a smooth and nonsmooth convex function often converges linearly even without strong convexity. One common reason is that a multiple of the step length at each iteration may linearly bound the “error”—the distance to the solution set. We explain the observed linear convergence intuitively by proving the equivalence of such an error bound to a natural quadratic growth condition. Our approach generalizes to linear and quadratic convergence analysis for proximal methods (of Gauss-Newton type) for minimizing compositions of nonsmooth functions with smooth mappings. We observe incidentally that short step-lengths in the algorithm indicate near-stationarity, suggesting a reliable termination criterion.
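A minimal sketch of the iteration and of the step-length termination test mentioned above, assuming a gradient oracle grad_f for the smooth term and a proximal map prox_g for the nonsmooth term (the LASSO instance at the end is purely illustrative):

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, x0, step, tol=1e-8, max_iter=10_000):
    """Proximal gradient iteration x+ = prox_{step*g}(x - step*grad_f(x)).
    Following the paper's observation, a short step length ||x+ - x||
    is used as a certificate of near-stationarity and stops the loop."""
    x = x0
    for _ in range(max_iter):
        x_new = prox_g(x - step * grad_f(x), step)
        if np.linalg.norm(x_new - x) <= tol:  # short step => near-stationary
            return x_new
        x = x_new
    return x

# Illustrative instance: LASSO, f(x) = 0.5*||Ax - b||^2, g(x) = lam*||x||_1.
rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((20, 50)), rng.standard_normal(20), 0.1
grad_f = lambda x: A.T @ (A @ x - b)
prox_g = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)  # soft-threshold
x = proximal_gradient(grad_f, prox_g, np.zeros(50),
                      step=1.0 / np.linalg.norm(A, 2) ** 2)  # step = 1/L
```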

235 citations


Posted Content
TL;DR: This article shows that the stochastic subgradient method, applied to any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary, giving convergence guarantees in the absence of smoothness and convexity.
Abstract: This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity? We prove that the stochastic subgradient method, on any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary. More generally, our result applies to any function with a Whitney stratifiable graph. In particular, this work endows the stochastic subgradient method, and its proximal extension, with rigorous convergence guarantees for a wide class of problems arising in data science---including all popular deep learning architectures.
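In symbols, the method in question is the classical stochastic subgradient iteration (a standard formulation, stated here for orientation rather than quoted from the paper):

$$x_{k+1} = x_k - \alpha_k y_k, \qquad \mathbb{E}[y_k \mid x_k] \in \partial f(x_k),$$

and a limit point $\bar{x}$ is first-order stationary when $0 \in \partial f(\bar{x})$, with $\partial f$ the Clarke subdifferential.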

146 citations


Posted Content
TL;DR: It is proved that the proximal stochastic subgradient method, applied to a weakly convex problem, drives the gradient of the Moreau envelope to zero at the rate $O(k^{-1/4})$.
Abstract: We prove that the proximal stochastic subgradient method, applied to a weakly convex problem, drives the gradient of the Moreau envelope to zero at the rate $O(k^{-1/4})$. As a consequence, we resolve an open question on the convergence rate of the proximal stochastic gradient method for minimizing the sum of a smooth nonconvex function and a convex proximable function.
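For context, two standard definitions behind this statement (not restated in the abstract): a function $f$ is $\rho$-weakly convex if $f + \frac{\rho}{2}\|\cdot\|^2$ is convex, and its Moreau envelope is

$$f_\lambda(x) = \min_{y}\Big\{ f(y) + \tfrac{1}{2\lambda}\|y - x\|^2 \Big\},$$

which is smooth for $\lambda < \rho^{-1}$; the gradient norm $\|\nabla f_\lambda(x)\|$ is the stationarity measure that the method drives to zero.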

101 citations


Posted Content
TL;DR: Under reasonable conditions on the approximation quality and regularity of the models, the stochastic proximal point, proximal subgradient, and regularized Gauss-Newton methods for minimizing compositions of convex functions with smooth maps each drive a natural stationarity measure to zero at the rate $O(k^{-1/4})$.
Abstract: We consider a family of algorithms that successively sample and minimize simple stochastic models of the objective function. We show that under reasonable conditions on approximation quality and regularity of the models, any such algorithm drives a natural stationarity measure to zero at the rate $O(k^{-1/4})$. As a consequence, we obtain the first complexity guarantees for the stochastic proximal point, proximal subgradient, and regularized Gauss-Newton methods for minimizing compositions of convex functions with smooth maps. The guiding principle, underlying the complexity guarantees, is that all algorithms under consideration can be interpreted as approximate descent methods on an implicit smoothing of the problem, given by the Moreau envelope. Specializing to classical circumstances, we obtain the long-sought convergence rate of the stochastic projected gradient method, without batching, for minimizing a smooth function on a closed convex set.
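A schematic of the model-based template described above (my sketch, with all names illustrative): at each step, sample $\xi$, build a simple convex model of the objective around the current point, and minimize it with a quadratic penalty.

```python
import numpy as np

def stochastic_model_based(sample, model_argmin, x0, betas):
    """Generic template: at step k, draw xi ~ sample() and set
        x_{k+1} = argmin_y  f_{x_k}(y, xi) + (beta_k / 2) * ||y - x_k||^2.
    Linear, proximal-linear, and full models of the objective recover the
    stochastic subgradient, Gauss-Newton, and proximal point methods."""
    x = x0
    for beta in betas:
        x = model_argmin(x, sample(), beta)  # minimizer of the penalized model
    return x

# Illustrative instance: linear model f_x(y, a) = |a^T x| + g^T (y - x), with
# g a subgradient of y -> |a^T y| at x; the penalized minimizer is x - g/beta.
rng = np.random.default_rng(2)
sample = lambda: rng.standard_normal(10)

def model_argmin(x, a, beta):
    g = np.sign(a @ x) * a  # subgradient of y -> |a^T y| at x
    return x - g / beta

x = stochastic_model_based(sample, model_argmin, np.ones(10),
                           betas=[np.sqrt(k + 1) for k in range(1000)])
```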

95 citations


Posted Content
TL;DR: This work shows that subgradient methods converge linearly on sharp functions that are only weakly convex, provided that the methods are initialized within a fixed tube around the solution set.
Abstract: Subgradient methods converge linearly on a convex function that grows sharply away from its solution set. In this work, we show that the same is true for sharp functions that are only weakly convex, provided that the subgradient methods are initialized within a fixed tube around the solution set. A variety of statistical and signal processing tasks come equipped with good initialization, and provably lead to formulations that are both weakly convex and sharp. Therefore, in such settings, subgradient methods can serve as inexpensive local search procedures. We illustrate the proposed techniques on phase retrieval and covariance estimation problems.
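As an illustration of the kind of local search this enables, here is a hypothetical Polyak-type subgradient method for real phase retrieval with the sharp, weakly convex loss $f(x) = \frac{1}{m}\sum_i |\langle a_i, x\rangle^2 - b_i|$; the Polyak step uses $\min f = 0$, which holds for exact measurements. This is my own sketch of a standard scheme, not code from the paper:

```python
import numpy as np

def polyak_subgradient_phase_retrieval(A, b, x0, max_iter=500):
    """Minimize f(x) = mean_i |(a_i^T x)^2 - b_i| via Polyak subgradient
    steps x+ = x - ((f(x) - min f) / ||g||^2) * g, using min f = 0 for
    noiseless measurements.  With good initialization, the iterates stay
    in the 'tube' around the solution set where convergence is linear."""
    x = x0.copy()
    for _ in range(max_iter):
        r = (A @ x) ** 2 - b
        fval = np.mean(np.abs(r))
        if fval == 0.0:
            break
        g = A.T @ (np.sign(r) * 2 * (A @ x)) / len(b)  # a subgradient of f at x
        x -= (fval / (g @ g)) * g
    return x

# Noiseless instance: recover x_star (up to sign) from b_i = (a_i^T x_star)^2,
# starting from a small perturbation of the solution (the 'good initialization').
rng = np.random.default_rng(3)
A = rng.standard_normal((200, 20))
x_star = rng.standard_normal(20)
b = (A @ x_star) ** 2
x = polyak_subgradient_phase_retrieval(A, b, x_star + 0.1 * rng.standard_normal(20))
```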

51 citations


Proceedings Article
31 Mar 2018
TL;DR: Introduces a generic scheme for solving non-convex optimization problems with gradient-based algorithms originally designed for minimizing convex functions, and applies it to incremental algorithms such as SVRG and SAGA for sparse matrix factorization and for learning neural networks.
Abstract: We introduce a generic scheme to solve non-convex optimization problems using gradient-based algorithms originally designed for minimizing convex functions. Even though these methods may originally require convexity to operate, the proposed approach allows one to use them without assuming any knowledge about the convexity of the objective. In general, the scheme is guaranteed to produce a stationary point with a worst-case efficiency typical of first-order methods, and when the objective turns out to be convex, it automatically accelerates in the sense of Nesterov and achieves near-optimal convergence rate in function values. We conclude the paper by showing promising experimental results obtained by applying our approach to incremental algorithms such as SVRG and SAGA for sparse matrix factorization and for learning neural networks.
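A schematic of the core mechanism (my paraphrase under an assumed curvature bound, not the authors' exact scheme): adding a sufficiently strong quadratic around the current center makes each subproblem convex, so a method designed for convex problems can serve as the inner solver.

```python
import numpy as np

def convexified_outer_loop(grad_f, kappa, y0, n_outer=50, n_inner=200, lr=0.01):
    """Schematic: each outer step approximately solves
        min_x  f(x) + (kappa / 2) * ||x - y||^2,
    which is convex once kappa dominates the negative curvature of f, so any
    convex solver (here plain gradient descent standing in for SVRG/SAGA)
    can be used; the regularization center then moves to the new iterate."""
    y = y0
    for _ in range(n_outer):
        x = y.copy()
        for _ in range(n_inner):
            x -= lr * (grad_f(x) + kappa * (x - y))  # gradient of the subproblem
        y = x
    return y

# Toy nonconvex objective f(x) = ||x||^2 + cos(sum(x)); its Hessian is bounded
# below by (2 - dim) * I, so kappa = 5 convexifies the subproblems in dim 5.
grad_f = lambda x: 2 * x - np.sin(np.sum(x)) * np.ones_like(x)
y = convexified_outer_loop(grad_f, kappa=5.0, y0=np.ones(5))
```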

45 citations


Journal ArticleDOI
TL;DR: Shows that the iterate sequence of the geometric descent algorithm of Bubeck, Lee, and Singh is generated by a scheme that in each iteration computes an optimal average of quadratic lower models of the function, leading to limited-memory extensions with improved performance.
Abstract: In a recent paper, Bubeck, Lee, and Singh introduced a new first order method for minimizing smooth strongly convex functions. Their geometric descent algorithm, largely inspired by the ellipsoid method, enjoys the optimal linear rate of convergence. We show that the same iterate sequence is generated by a scheme that in each iteration computes an optimal average of quadratic lower models of the function. Indeed, the minimum of the averaged quadratic approaches the true minimum at an optimal rate. This intuitive viewpoint reveals clear connections to the original fast-gradient methods and cutting plane ideas, and leads to limited-memory extensions with improved performance.
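The building block is standard for a $\mu$-strongly convex $f$ (stated here for orientation): each gradient evaluation at a point $y$ produces the quadratic lower model

$$f(x) \;\ge\; Q_y(x) := f(y) + \langle \nabla f(y), x - y\rangle + \tfrac{\mu}{2}\|x - y\|^2,$$

and any convex combination of such models is again a lower bound on $f$. The scheme described in the abstract maintains an optimal average of these quadratics, and the minimum value of the averaged model is a lower estimate of $\min f$ that increases to the true minimum at the optimal linear rate.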

44 citations


Journal ArticleDOI
TL;DR: In this paper, the authors show that subgradient methods converge linearly on sharp functions that are only weakly convex, provided the methods are initialized within a fixed tube around the solution set.
Abstract: Subgradient methods converge linearly on a convex function that grows sharply away from its solution set. In this work, we show that the same is true for sharp functions that are only weakly convex, provided that the subgradient methods are initialized within a fixed tube around the solution set. A variety of statistical and signal processing tasks come equipped with good initialization and provably lead to formulations that are both weakly convex and sharp. Therefore, in such settings, subgradient methods can serve as inexpensive local search procedures. We illustrate the proposed techniques on phase retrieval and covariance estimation problems.

42 citations


Journal ArticleDOI
TL;DR: In this paper, the authors show that the partial minimization technique regularizes the problem, making it well-conditioned, and they illustrate the theory and algorithms on boundary control, optimal transport, and parameter estimation for robust dynamic inference.
Abstract: Common computational problems, such as parameter estimation in dynamic models and partial differential equation (PDE)-constrained optimization, require data fitting over a set of auxiliary parameters subject to physical constraints over an underlying state. Naive quadratically penalized formulations, commonly used in practice, suffer from inherent ill-conditioning. We show that, surprisingly, the partial minimization technique regularizes the problem, making it well-conditioned. This viewpoint sheds new light on variable projection techniques, as well as the penalty method for PDE-constrained optimization, and motivates robust extensions. In addition, we outline an inexact analysis, showing that the partial minimization subproblem can be solved very loosely in each iteration. We illustrate the theory and algorithms on boundary control, optimal transport, and parameter estimation for robust dynamic inference.
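A toy sketch of the partial-minimization (variable projection) idea on a separable least-squares fit, where the linear coefficients are eliminated by an inner solve; the exponential model and all names here are illustrative, not taken from the paper:

```python
import numpy as np

def reduced_objective(theta, t, y):
    """Variable projection for y ~ c1*exp(-theta1*t) + c2*exp(-theta2*t):
    for fixed nonlinear parameters theta, the linear coefficients c are
    eliminated by partial minimization (a least-squares solve), leaving a
    reduced, better-conditioned problem in theta alone."""
    Phi = np.exp(-np.outer(t, theta))             # design matrix for fixed theta
    c, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # inner (partial) minimization
    r = Phi @ c - y
    return 0.5 * (r @ r), c

# Synthetic data; the reduced objective can be handed to any smooth optimizer.
rng = np.random.default_rng(4)
t = np.linspace(0.0, 3.0, 50)
y = 2.0 * np.exp(-1.0 * t) + 0.5 * np.exp(-4.0 * t) + 0.01 * rng.standard_normal(50)
val, c = reduced_objective(np.array([1.1, 3.8]), t, y)
```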

27 citations


Posted Content
TL;DR: A scheme that iteratively samples and minimizes stochastic convex models of the objective function drives a natural stationarity measure to zero at the rate $O(k^{-1/4})$.
Abstract: Given a nonsmooth, nonconvex minimization problem, we consider algorithms that iteratively sample and minimize stochastic convex models of the objective function. Assuming that the one-sided approximation quality and the variation of the models are controlled by a Bregman divergence, we show that the scheme drives a natural stationarity measure to zero at the rate $O(k^{-1/4})$. Under additional convexity and relative strong convexity assumptions, the function values converge to the minimum at the rates $O(k^{-1/2})$ and $\widetilde{O}(k^{-1})$, respectively. We discuss consequences for stochastic proximal point, mirror descent, regularized Gauss-Newton, and saddle point algorithms.
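For reference, the Bregman divergence generated by a differentiable convex function $\Phi$, together with a schematic of the update it controls (consistent with, but not quoted from, the abstract):

$$D_\Phi(y, x) = \Phi(y) - \Phi(x) - \langle \nabla \Phi(x), y - x\rangle, \qquad x_{k+1} \in \operatorname*{argmin}_{y}\Big\{ f_{x_k}(y, \xi_k) + \beta_k D_\Phi(y, x_k) \Big\},$$

where $f_{x_k}(\cdot, \xi_k)$ is the sampled convex model; taking $\Phi = \frac{1}{2}\|\cdot\|^2$ recovers the Euclidean proximal point and subgradient instances, while other choices of $\Phi$ give mirror-descent-type updates.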

23 citations


Posted Content
TL;DR: This work investigates the stochastic optimization problem of minimizing population risk, where the loss defining the risk is assumed to be weakly convex, and establishes dimension-dependent rates on subgradient estimation in full generality, as well as dimension-independent rates when the loss is a generalized linear model.
Abstract: We investigate the stochastic optimization problem of minimizing population risk, where the loss defining the risk is assumed to be weakly convex. Compositions of Lipschitz convex functions with smooth maps are the primary examples of such losses. We analyze the estimation quality of such nonsmooth and nonconvex problems by their sample average approximations. Our main results establish dimension-dependent rates on subgradient estimation in full generality and dimension-independent rates when the loss is a generalized linear model. As an application of the developed techniques, we analyze the nonsmooth landscape of a robust nonlinear regression problem.

Journal ArticleDOI
TL;DR: Revisits the foundations of gauge duality, demonstrates that it can be explained using a modern approach to duality based on a perturbation framework, and gives a direct proof that optimal solutions of the Fenchel-Rockafellar dual of the gauge dual are precisely the primal solutions rescaled by the optimal value.
Abstract: We revisit the foundations of gauge duality and demonstrate that it can be explained using a modern approach to duality based on a perturbation framework. We therefore put gauge duality and Fenchel...

Posted Content
TL;DR: In this article, a stochastic subgradient method for minimizing a convex function with the improved rate $\widetilde{O}(k^{-1/2})$ was presented.
Abstract: In a recent paper, we showed that the stochastic subgradient method, applied to a weakly convex problem, drives the gradient of the Moreau envelope to zero at the rate $O(k^{-1/4})$. In this supplementary note, we present a stochastic subgradient method for minimizing a convex function, with the improved rate $\widetilde O(k^{-1/2})$.

Posted Content
TL;DR: In this article, the authors show that computationally cheap inexact projections may suffice in place of exact projections onto nonconvex sets: if one set is defined by sufficiently regular smooth constraints, then projecting onto the approximation obtained by linearizing those constraints around the current iterate suffices.
Abstract: Given two arbitrary closed sets in Euclidean space, a simple transversality condition guarantees that the method of alternating projections converges locally, at linear rate, to a point in the intersection. Exact projection onto nonconvex sets is typically intractable, but we show that computationally-cheap inexact projections may suffice instead. In particular, if one set is defined by sufficiently regular smooth constraints, then projecting onto the approximation obtained by linearizing those constraints around the current iterate suffices. On the other hand, if one set is a smooth manifold represented through local coordinates, then the approximate projection resulting from linearizing the coordinate system around the preceding iterate on the manifold also suffices.
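A sketch of the inexact alternating projections idea for one concrete case (my construction, not the paper's code): one set is the unit sphere, whose exact projection is cheap, and the other is a smooth constraint set $\{x : c(x) = 0\}$, onto which we project approximately by projecting onto the linearization $\{y : c(x) + \nabla c(x)^T (y - x) = 0\}$ at the current iterate:

```python
import numpy as np

def inexact_alternating_projections(c, grad_c, proj_sphere, x0, n_iter=100):
    """Alternate between an exact projection onto one set (a sphere) and a
    cheap inexact projection onto {x : c(x) = 0}, obtained by projecting
    onto the linearization of the constraint at the current iterate."""
    x = x0
    for _ in range(n_iter):
        # Inexact step: project x onto the hyperplane {y : c(x) + g^T (y - x) = 0}.
        g = grad_c(x)
        x = x - (c(x) / (g @ g)) * g
        # Exact step: project onto the unit sphere.
        x = proj_sphere(x)
    return x

# Toy instance in the plane: intersect the unit sphere with the smooth set
# {x : x1 * x2 - 0.25 = 0}; the two sets meet transversally, so the method
# converges locally at a linear rate.
c = lambda x: x[0] * x[1] - 0.25
grad_c = lambda x: np.array([x[1], x[0]])
proj_sphere = lambda x: x / np.linalg.norm(x)
x = inexact_alternating_projections(c, grad_c, proj_sphere, np.array([1.0, 0.3]))
```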