Author

Dmitriy Drusvyatskiy

Other affiliations: Cornell University
Bio: Dmitriy Drusvyatskiy is an academic researcher at the University of Washington. His research focuses on convex functions and subgradient methods. He has an h-index of 26, has co-authored 108 publications, and has received 2,310 citations. His previous affiliations include Cornell University.


Papers
Journal ArticleDOI
TL;DR: The proximal gradient algorithm for minimizing the sum of a smooth and a nonsmooth convex function often converges linearly even without strong convexity; this paper explains that behavior by establishing the equivalence of the underlying error bound to a natural quadratic growth condition.
Abstract: The proximal gradient algorithm for minimizing the sum of a smooth and nonsmooth convex function often converges linearly even without strong convexity. One common reason is that a multiple of the step length at each iteration may linearly bound the “error”—the distance to the solution set. We explain the observed linear convergence intuitively by proving the equivalence of such an error bound to a natural quadratic growth condition. Our approach generalizes to linear and quadratic convergence analysis for proximal methods (of Gauss-Newton type) for minimizing compositions of nonsmooth functions with smooth mappings. We observe incidentally that short step-lengths in the algorithm indicate near-stationarity, suggesting a reliable termination criterion.
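The termination observation above lends itself to a short illustration. The following is a minimal sketch (not code from the paper) of proximal gradient descent that stops once the prox-gradient step becomes short; the LASSO instance, step-size rule, and tolerance are illustrative assumptions.

```python
import numpy as np

def prox_l1(v, t):
    # Proximal operator of t * ||.||_1 (componentwise soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(grad_f, prox_g, x0, step, tol=1e-8, max_iter=10_000):
    # Minimize f(x) + g(x), with f smooth (gradient grad_f) and g prox-friendly.
    # Stop when the prox-gradient step is short, which (per the observation
    # above) signals near-stationarity.
    x = x0
    for _ in range(max_iter):
        x_next = prox_g(x - step * grad_f(x), step)
        if np.linalg.norm(x_next - x) <= tol:
            return x_next
        x = x_next
    return x

# Toy LASSO instance: 0.5*||Ax - b||^2 + lam*||x||_1 (illustrative only).
rng = np.random.default_rng(0)
A, b, lam = rng.normal(size=(20, 5)), rng.normal(size=20), 0.1
grad_f = lambda x: A.T @ (A @ x - b)
step = 1.0 / np.linalg.norm(A, 2) ** 2              # 1/L, with L = ||A||_2^2
x_hat = proximal_gradient(grad_f, lambda v, t: prox_l1(v, lam * t),
                          np.zeros(5), step)
```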

235 citations

Journal ArticleDOI
TL;DR: This work shows that, under weak convexity and Lipschitz conditions, these algorithms drive the expected norm of the gradient of the Moreau envelope to zero at the rate $O(k^{-1/4})$.
Abstract: We consider a family of algorithms that successively sample and minimize simple stochastic models of the objective function. We show that under reasonable conditions on approximation quality and re...
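For context, the stationarity measure in the TL;DR is standard and worth spelling out (notation mine, not quoted from the paper). The Moreau envelope of $f$ with parameter $\lambda > 0$ is

$$f_\lambda(x) = \min_y \Big\{ f(y) + \tfrac{1}{2\lambda}\|y - x\|^2 \Big\}, \qquad \nabla f_\lambda(x) = \tfrac{1}{\lambda}\big(x - \operatorname{prox}_{\lambda f}(x)\big),$$

where the gradient formula holds for weakly convex $f$ and sufficiently small $\lambda$. A small value of $\|\nabla f_\lambda(x)\|$ certifies that $x$ is close to a point that is nearly stationary for $f$, so the cited rate says this surrogate measure of stationarity decays in expectation like $O(k^{-1/4})$ along the iterates.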

191 citations

Journal ArticleDOI
TL;DR: In this paper, the authors consider the global efficiency of algorithms for minimizing the sum of a convex function and a composition of a Lipschitz convex function with a smooth map, and show that when the subproblems can only be solved by first-order methods, a simple combination of smoothing, the prox-linear method, and a fast-gradient scheme yields an algorithm with complexity $\widetilde{\mathcal{O}}(\varepsilon^{-3})$.
Abstract: We consider global efficiency of algorithms for minimizing a sum of a convex function and a composition of a Lipschitz convex function with a smooth map. The basic algorithm we rely on is the prox-linear method, which in each iteration solves a regularized subproblem formed by linearizing the smooth map. When the subproblems are solved exactly, the method has efficiency $\mathcal{O}(\varepsilon^{-2})$, akin to gradient descent for smooth minimization. We show that when the subproblems can only be solved by first-order methods, a simple combination of smoothing, the prox-linear method, and a fast-gradient scheme yields an algorithm with complexity $\widetilde{\mathcal{O}}(\varepsilon^{-3})$. We round off the paper with an inertial prox-linear method that automatically accelerates in the presence of convexity.
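To make the iteration concrete, here is a rough sketch (not the authors' code) of the prox-linear method for the special case where the outer convex function is $\|\cdot\|_1$, using cvxpy to solve each linearized subproblem; the toy instance and the step parameter $t$ are arbitrary assumptions.

```python
import numpy as np
import cvxpy as cp

def prox_linear(c, jac, x0, t=1.0, iters=50):
    # Prox-linear sketch for min_x ||c(x)||_1 with c a smooth map.
    # Each iteration linearizes c and solves the regularized convex subproblem
    #     min_y ||c(x) + J(x)(y - x)||_1 + (1/(2t)) * ||y - x||^2.
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        cx, J = c(x), jac(x)
        y = cp.Variable(x.size)
        cp.Problem(cp.Minimize(
            cp.norm1(cx + J @ (y - x)) + cp.sum_squares(y - x) / (2 * t)
        )).solve()
        x = y.value
    return x

# Toy instance: solve x_i^2 = b_i in the l1 sense (illustrative only).
b = np.array([1.0, 4.0, 9.0])
x_hat = prox_linear(c=lambda x: x**2 - b,
                    jac=lambda x: np.diag(2 * x),
                    x0=np.ones(3))
```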

154 citations

Posted Content
TL;DR: This work explains the observed linear convergence intuitively by proving the equivalence of such an error bound to a natural quadratic growth condition and generalizes to linear convergence analysis for proximal methods for minimizing compositions of nonsmooth functions with smooth mappings.
Abstract: The proximal gradient algorithm for minimizing the sum of a smooth and a nonsmooth convex function often converges linearly even without strong convexity. One common reason is that a multiple of the step length at each iteration may linearly bound the "error" -- the distance to the solution set. We explain the observed linear convergence intuitively by proving the equivalence of such an error bound to a natural quadratic growth condition. Our approach generalizes to linear convergence analysis for proximal methods (of Gauss-Newton type) for minimizing compositions of nonsmooth functions with smooth mappings. We observe incidentally that short step-lengths in the algorithm indicate near-stationarity, suggesting a reliable termination criterion.
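Stated loosely in my own notation, for the composite problem of minimizing $F = f + g$ with $f$ smooth and $g$ closed convex, with solution set $S$ and minimum value $F^\star$, the two conditions whose equivalence is proved are

$$\text{(error bound)}\quad \operatorname{dist}(x, S) \le \gamma\, \big\| x - \operatorname{prox}_{t g}\!\big(x - t \nabla f(x)\big) \big\|, \qquad \text{(quadratic growth)}\quad F(x) \ge F^\star + \tfrac{\alpha}{2}\, \operatorname{dist}^2(x, S),$$

both understood to hold for all $x$ in a suitable neighborhood of the solution set; the precise constants and regions are spelled out in the paper.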

151 citations

Posted Content
TL;DR: In particular, this work shows that the stochastic subgradient method applied to any semialgebraic (more generally, Whitney stratifiable) locally Lipschitz function produces limit points that are all first-order stationary, even in the absence of smoothness and convexity.
Abstract: This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity? We prove that the stochastic subgradient method, on any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary. More generally, our result applies to any function with a Whitney stratifiable graph. In particular, this work endows the stochastic subgradient method, and its proximal extension, with rigorous convergence guarantees for a wide class of problems arising in data science---including all popular deep learning architectures.
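The method in question is the basic stochastic subgradient iteration $x_{k+1} = x_k - \alpha_k g_k$, with $g_k$ a stochastic subgradient at $x_k$. A minimal numpy sketch on a toy nonsmooth problem follows (illustrative only; the problem, step sizes, and iteration count are my own choices, not the paper's).

```python
import numpy as np

def stochastic_subgradient(subgrad, x0, step_sizes, rng):
    # Plain stochastic subgradient iteration: x_{k+1} = x_k - alpha_k * g_k.
    x = np.asarray(x0, dtype=float)
    for alpha in step_sizes:
        x = x - alpha * subgrad(x, rng)
    return x

# Toy nonsmooth problem: min_x (1/n) * sum_i |a_i . x - b_i| (illustrative only).
rng = np.random.default_rng(0)
A, b = rng.normal(size=(200, 5)), rng.normal(size=200)

def subgrad(x, rng):
    i = rng.integers(len(b))                   # sample one term of the sum
    return np.sign(A[i] @ x - b[i]) * A[i]     # subgradient of |a_i . x - b_i|

steps = [0.5 / np.sqrt(k + 1) for k in range(2000)]
x_hat = stochastic_subgradient(subgrad, np.zeros(5), steps, rng)
```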

146 citations


Cited by
Book
27 Nov 2013
TL;DR: The many different interpretations of proximal operators and algorithms are discussed, their connections to many other topics in optimization and applied mathematics are described, some popular algorithms are surveyed, and a large number of examples of proximal operators that commonly arise in practice are provided.
Abstract: This monograph is about a class of optimization algorithms called proximal algorithms. Much like Newton's method is a standard tool for solving unconstrained smooth optimization problems of modest size, proximal algorithms can be viewed as an analogous tool for nonsmooth, constrained, large-scale, or distributed versions of these problems. They are very generally applicable, but are especially well-suited to problems of substantial recent interest involving large or high-dimensional datasets. Proximal methods sit at a higher level of abstraction than classical algorithms like Newton's method: the base operation is evaluating the proximal operator of a function, which itself involves solving a small convex optimization problem. These subproblems, which generalize the problem of projecting a point onto a convex set, often admit closed-form solutions or can be solved very quickly with standard or simple specialized methods. Here, we discuss the many different interpretations of proximal operators and algorithms, describe their connections to many other topics in optimization and applied mathematics, survey some popular algorithms, and provide a large number of examples of proximal operators that commonly arise in practice.
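For readers unfamiliar with the base operation, the proximal operator of a function $f$ with parameter $\lambda > 0$ is (a standard definition, summarized here rather than quoted from the monograph)

$$\operatorname{prox}_{\lambda f}(v) = \operatorname*{argmin}_x \Big\{ f(x) + \tfrac{1}{2\lambda}\|x - v\|^2 \Big\}.$$

When $f$ is the indicator of a convex set this is Euclidean projection onto the set, and when $f = \|\cdot\|_1$ it reduces to componentwise soft-thresholding, $\operatorname{sign}(v_i)\max(|v_i| - \lambda, 0)$; such closed forms are what keep the per-iteration cost of proximal algorithms low.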

3,627 citations

Book
16 Dec 2017

1,681 citations

Book
21 Feb 1970

986 citations

Posted Content
TL;DR: This article showed that randomly initialized gradient descent converges at a global linear rate to a global optimum of the quadratic loss for two-layer fully connected ReLU-activated neural networks, because over-parameterization and random initialization jointly restrict every weight vector to stay close to its initialization throughout training.
Abstract: One of the mysteries in the success of neural networks is randomly initialized first order methods like gradient descent can achieve zero training loss even though the objective function is non-convex and non-smooth. This paper demystifies this surprising phenomenon for two-layer fully connected ReLU activated neural networks. For an $m$ hidden node shallow neural network with ReLU activation and $n$ training data, we show as long as $m$ is large enough and no two inputs are parallel, randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. Our analysis relies on the following observation: over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows us to exploit a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum. We believe these insights are also useful in analyzing deep models and other first order methods.
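Below is a toy numpy sketch of the setting analyzed, under one common formulation (random fixed $\pm 1$ second layer, trained first layer, $1/\sqrt{m}$ output scaling, quadratic loss); the dimensions, width, and step size are arbitrary assumptions, and the code only illustrates the training dynamics rather than reproducing the paper's analysis or experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 5, 500                 # training points, input dim, hidden width
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs
y = rng.normal(size=n)

W = rng.normal(size=(m, d))              # trained first-layer weights
a = rng.choice([-1.0, 1.0], size=m)      # fixed random second layer

lr = 0.1
for _ in range(2000):
    pre = X @ W.T                                    # pre-activations, (n, m)
    residual = np.maximum(pre, 0.0) @ a / np.sqrt(m) - y
    mask = (pre > 0).astype(float)                   # ReLU derivative
    # Gradient of 0.5 * ||residual||^2 with respect to W.
    grad_W = ((mask * np.outer(residual, a)).T @ X) / np.sqrt(m)
    W -= lr * grad_W

train_loss = 0.5 * np.sum((np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m) - y) ** 2)
```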

662 citations