Showing papers in "arXiv: Optimization and Control in 2018"

Posted Content•

Local SGD Converges Fast and Communicates Little

[...]

École Polytechnique Fédérale de Lausanne¹

24 May 2018-arXiv: Optimization and Control

TL;DR: In this paper, the authors prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as mini-batch SGD in terms of number of evaluated gradients, that is, the scheme achieves linear speedup in the number of workers and minibatch size.

Abstract: Mini-batch stochastic gradient descent (SGD) is state of the art in large scale distributed training. The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice as the scheme often suffers from large network delays and bandwidth limits. To overcome this communication bottleneck, recent works propose to reduce the communication frequency. An algorithm of this type is local SGD, which runs SGD independently in parallel on different workers and averages the sequences only once in a while. This scheme shows promising results in practice, but has eluded thorough theoretical analysis. We prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as mini-batch SGD in terms of number of evaluated gradients, that is, the scheme achieves linear speedup in the number of workers and mini-batch size. The number of communication rounds can be reduced up to a factor of T^{1/2} (where T denotes the number of total steps) compared to mini-batch SGD. This also holds for asynchronous implementations. Local SGD can also be used for large scale training of deep learning models. The results shown here aim to serve as a guideline to further explore the theoretical and practical aspects of local SGD in these applications.
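
To make the scheme in this abstract concrete, here is a minimal sketch of local SGD on a toy one-dimensional quadratic. It is an illustration under stated assumptions, not the authors' implementation: the function name `local_sgd`, the objective, the Gaussian gradient noise and all parameter values are hypothetical choices.

```python
import random


def local_sgd(num_workers=4, rounds=50, local_steps=10, lr=0.05,
              noise=0.1, seed=0):
    """Minimize f(x) = 0.5 * (x - 3)^2 from noisy gradients with local SGD.

    Toy setup (an assumption for illustration): every worker sees the same
    objective, and gradient noise is Gaussian with standard deviation `noise`.
    """
    rng = random.Random(seed)
    target = 3.0
    x_global = 0.0
    for _ in range(rounds):              # one communication per round
        local_models = []
        for _ in range(num_workers):
            x = x_global                 # each worker starts from the average
            for _ in range(local_steps):
                grad = (x - target) + rng.gauss(0.0, noise)
                x -= lr * grad           # plain SGD step, no communication
            local_models.append(x)
        # The only synchronization: average the worker models.
        x_global = sum(local_models) / num_workers
    return x_global


print(local_sgd())
```

With these settings each worker takes 500 gradient steps but the models are averaged only 50 times; setting `local_steps=1` would average after every step, which for this quadratic coincides with the mini-batch SGD update.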

516 citations

Posted Content•

On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport

[...]

Lénaïc Chizat, Francis Bach

24 May 2018-arXiv: Optimization and Control

TL;DR: It is shown that, when initialized correctly and in the many-particle limit, the gradient flow on particle weights and positions, although non-convex, converges to global minimizers; the proof involves Wasserstein gradient flows, a by-product of optimal transport theory.

Abstract: Many tasks in machine learning and signal processing can be solved by minimizing a convex function of a measure. This includes sparse spikes deconvolution or training a neural network with a single hidden layer. For these problems, we study a simple minimization method: the unknown measure is discretized into a mixture of particles and a continuous-time gradient descent is performed on their weights and positions. This is an idealization of the usual way to train neural networks with a large hidden layer. We show that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers. The proof involves Wasserstein gradient flows, a by-product of optimal transport theory. Numerical experiments show that this asymptotic behavior is already at play for a reasonable number of particles, even in high dimension.

403 citations

Posted Content•

On Lazy Training in Differentiable Programming.

[...]

Lénaïc Chizat¹, Edouard Oyallon², Francis Bach³•Institutions (3)

Centre national de la recherche scientifique¹, École Centrale Paris², École Normale Supérieure³

19 Dec 2018-arXiv: Optimization and Control

TL;DR: In this article, the authors show that the lazy training phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels.

Abstract: In a series of recent theoretical works, it was shown that strongly over-parameterized neural networks trained with gradient-based methods could converge exponentially fast to zero training loss, with their parameters hardly varying. In this work, we show that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels. Through a theoretical analysis, we exhibit various situations where this phenomenon arises in non-convex optimization and we provide bounds on the distance between the lazy and linearized optimization paths. Our numerical experiments bring a critical note, as we observe that the performance of commonly used non-linear deep convolutional neural networks in computer vision degrades when trained in the lazy regime. This makes it unlikely that "lazy training" is behind the many successes of neural networks in difficult high dimensional tasks.

219 citations

Posted Content•

A Tour of Reinforcement Learning: The View from Continuous Control

[...]

Benjamin Recht¹•Institutions (1)

University of California, Berkeley¹

25 Jun 2018-arXiv: Optimization and Control

TL;DR: This article surveys reinforcement learning from the perspective of optimization and control, with a focus on continuous control applications.

Abstract: This manuscript surveys reinforcement learning from the perspective of optimization and control with a focus on continuous control applications. It surveys the general formulation, terminology, and typical experimental implementations of reinforcement learning and reviews competing solution paradigms. In order to compare the relative merits of various techniques, this survey presents a case study of the Linear Quadratic Regulator (LQR) with unknown dynamics, perhaps the simplest and best-studied problem in optimal control. The manuscript describes how merging techniques from learning theory and control can provide non-asymptotic characterizations of LQR performance and shows that these characterizations tend to match experimental behavior. In turn, when revisiting more complex applications, many of the observed phenomena in LQR persist. In particular, theory and experiment demonstrate the role and importance of models and the cost of generality in reinforcement learning algorithms. This survey concludes with a discussion of some of the challenges in designing learning systems that safely and reliably interact with complex and uncertain environments and how tools from reinforcement learning and control might be combined to approach these challenges.
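
For readers unfamiliar with the LQR case study mentioned above, the following sketch solves the scalar, known-dynamics version of the problem by Riccati iteration. This is only the classical baseline, not the unknown-dynamics setting the survey studies; the function name `scalar_lqr` and the numbers are assumptions made for illustration.

```python
def scalar_lqr(a, b, q, r, iters=200):
    """Riccati iteration for x_{t+1} = a*x_t + b*u_t, stage cost q*x^2 + r*u^2.

    Returns (p, k): the cost-to-go coefficient and the feedback gain,
    so that the controller is u_t = -k * x_t.
    """
    p = q
    for _ in range(iters):
        # Scalar discrete-time Riccati recursion.
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    k = a * b * p / (r + b * b * p)
    return p, k


p, k = scalar_lqr(a=1.0, b=1.0, q=1.0, r=1.0)
print(p, k)
```

For `a = b = q = r = 1` the fixed point satisfies p^2 = 1 + p, so `p` is the golden ratio and the closed-loop multiplier `a - b*k` is about 0.382, which is stable.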

193 citations

Posted Content•

Non-Convex Min-Max Optimization: Provable Algorithms and Applications in Machine Learning

[...]

Hassan Rafique, Mingrui Liu, Qihang Lin, Tianbao Yang

04 Oct 2018-arXiv: Optimization and Control

TL;DR: This paper proposes a proximally guided stochastic subgradient method and a proximally guided stochastic variance-reduced method for expected and finite-sum saddle-point problems, respectively, and establishes the computation complexities of both methods for finding a nearly stationary point of the corresponding minimization problem.

Abstract: Min-max saddle-point problems have broad applications in many tasks in machine learning, e.g., distributionally robust learning, learning with non-decomposable loss, or learning with uncertain data. Although convex-concave saddle-point problems have been broadly studied with efficient algorithms and solid theories available, it remains a challenge to design provably efficient algorithms for non-convex saddle-point problems, especially when the objective function involves an expectation or a large-scale finite sum. Motivated by recent literature on non-convex non-smooth minimization, this paper studies a family of non-convex min-max problems where the minimization component is non-convex (weakly convex) and the maximization component is concave. We propose a proximally guided stochastic subgradient method and a proximally guided stochastic variance-reduced method for expected and finite-sum saddle-point problems, respectively. We establish the computation complexities of both methods for finding a nearly stationary point of the corresponding minimization problem.

156 citations

Posted Content•

Stochastic subgradient method converges on tame functions

[...]

Damek Davis¹, Dmitriy Drusvyatskiy², Sham M. Kakade², Jason D. Lee³•Institutions (3)

Cornell University¹, University of Washington², University of Southern California³

20 Apr 2018-arXiv: Optimization and Control

TL;DR: This article shows that the stochastic subgradient method, on any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary, even in the absence of smoothness and convexity.

Abstract: This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity? We prove that the stochastic subgradient method, on any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary. More generally, our result applies to any function with a Whitney stratifiable graph. In particular, this work endows the stochastic subgradient method, and its proximal extension, with rigorous convergence guarantees for a wide class of problems arising in data science, including all popular deep learning architectures.

146 citations

Journal Article•DOI•

Resilient Disaster Recovery Logistics of Distribution Systems: Co-Optimize Service Restoration with Repair Crew and Mobile Power Source Dispatch

[...]

Shunbo Lei¹, Chen Chen², Yupeng Li¹, Yunhe Hou¹•Institutions (2)

University of Hong Kong¹, Argonne National Laboratory²

20 Jun 2018-arXiv: Optimization and Control

TL;DR: A novel co-optimization model is formulated to route RCs and MPSs in the transportation network, schedule them in the DS, and reconfigure the DS for microgrid formation coordinately and model topology constraints based on the concept of spanning forest.

Abstract: Repair crews (RCs) and mobile power sources (MPSs) are critical resources for distribution system (DS) outage management after a natural disaster. However, their logistics is not well investigated. We propose a resilient scheme for disaster recovery logistics to co-optimize DS restoration with dispatch of RCs and MPSs. A novel co-optimization model is formulated to route RCs and MPSs in the transportation network, schedule them in the DS, and reconfigure the DS for microgrid formation coordinately, etc. The model incorporates different timescales of DS restoration and RC/MPS dispatch, the coupling of transportation and power networks, etc. To ensure radiality of the DS with variable physical structure and MPS allocation, we also model topology constraints based on the concept of spanning forest. The model is convexified equivalently and linearized into a mixed-integer linear programming. To reduce its computation time, preprocessing methods are proposed to pre-assign a minimal set of repair tasks to depots and reduce the number of candidate nodes for MPS connection. Resilient recovery strategies thus are generated to enhance service restoration, especially by dynamic formation of microgrids that are powered by MPSs and topologized by repair actions of RCs and network reconfiguration of the DS. Case studies demonstrate the proposed methodology.

140 citations

Journal Article•DOI•

A linear algorithm for optimization over directed graphs with geometric convergence

[...]

Ran Xin¹, Usman A. Khan¹•Institutions (1)

Tufts University¹

07 Mar 2018-arXiv: Optimization and Control

TL;DR: This letter proposes a linear algorithm based on an inexact gradient method and a gradient estimation technique and shows that the proposed algorithm geometrically converges to the global minimizer with a sufficiently small step-size.

Abstract: In this letter, we study distributed optimization, where a network of agents, abstracted as a directed graph, collaborates to minimize the average of locally-known convex functions. Most of the existing approaches over directed graphs are based on push-sum (type) techniques, which use an independent algorithm to asymptotically learn either the left or right eigenvector of the underlying weight matrices. This strategy causes additional computation, communication, and nonlinearity in the algorithm. In contrast, we propose a linear algorithm based on an inexact gradient method and a gradient estimation technique. Under the assumptions that each local function is strongly-convex with Lipschitz-continuous gradients, we show that the proposed algorithm geometrically converges to the global minimizer with a sufficiently small step-size. We present simulations to illustrate the theoretical findings.

134 citations

Posted Content•

SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path Integrated Differential Estimator

[...]

Cong Fang¹, Chris Li², Zhouchen Lin¹, Tong Zhang³•Institutions (3)

Peking University¹, Princeton University², Rutgers University³

04 Jul 2018-arXiv: Optimization and Control

TL;DR: This paper proposes a new technique named SPIDER, which can be used to track many deterministic quantities of interest with significantly reduced computational cost and proves that SPIDER-SFO nearly matches the algorithmic lower bound for finding approximate first-order stationary points under the gradient Lipschitz assumption in the finite-sum setting.

Abstract: In this paper, we propose a new technique named Stochastic Path-Integrated Differential EstimatoR (SPIDER), which can be used to track many deterministic quantities of interest with significantly reduced computational cost. We apply SPIDER to two tasks, namely the stochastic first-order and zeroth-order methods. For the stochastic first-order method, combining SPIDER with normalized gradient descent, we propose two new algorithms, namely SPIDER-SFO and SPIDER-SFO$^{+}$, that solve non-convex stochastic optimization problems using stochastic gradients only. We provide sharp error-bound results on their convergence rates. In particular, we prove that the SPIDER-SFO and SPIDER-SFO$^{+}$ algorithms achieve a record-breaking gradient computation cost of $\mathcal{O}\left( \min( n^{1/2} \epsilon^{-2}, \epsilon^{-3} ) \right)$ for finding an $\epsilon$-approximate first-order and $\tilde{\mathcal{O}}\left( \min( n^{1/2} \epsilon^{-2}+\epsilon^{-2.5}, \epsilon^{-3} ) \right)$ for finding an $(\epsilon, \mathcal{O}(\epsilon^{0.5}))$-approximate second-order stationary point, respectively. In addition, we prove that SPIDER-SFO nearly matches the algorithmic lower bound for finding approximate first-order stationary points under the gradient Lipschitz assumption in the finite-sum setting. For the stochastic zeroth-order method, we prove a cost of $\mathcal{O}( d \min( n^{1/2} \epsilon^{-2}, \epsilon^{-3}) )$, which outperforms all existing results.
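
The idea of tracking a gradient along the optimization path, combined with a normalized step, can be sketched on a toy finite sum as follows. This is a simplified illustration, not the SPIDER-SFO algorithm as specified in the paper: the function name `spider_1d`, the single-sample updates, the refresh period `q`, and the clipped normalization `v / max(|v|, eps)` are all assumptions chosen to keep the example short.

```python
import random


def spider_1d(a, b, x0=0.0, steps=400, q=20, lr=0.01, eps=0.05, seed=0):
    """Minimize (1/n) * sum_i 0.5 * a[i] * (x - b[i])**2 in one dimension.

    A path-integrated estimator v tracks the full gradient: it is reset to
    the exact gradient every q steps and otherwise updated with a
    single-sample gradient difference.
    """
    rng = random.Random(seed)
    n = len(a)

    def grad_i(i, x):
        return a[i] * (x - b[i])

    x_prev, x, v = x0, x0, 0.0
    for k in range(steps):
        if k % q == 0:
            # Periodic refresh with the exact full gradient.
            v = sum(grad_i(i, x) for i in range(n)) / n
        else:
            # Recursive update: add a sampled gradient difference to v.
            i = rng.randrange(n)
            v = grad_i(i, x) - grad_i(i, x_prev) + v
        x_prev = x
        # Normalized step, clipped so that the step shrinks near stationarity.
        x = x - lr * v / max(abs(v), eps)
    return x


print(spider_1d(a=[1.0] * 5, b=[0.0, 1.0, 2.0, 3.0, 4.0]))
```

When all `a[i]` are equal, every sampled gradient difference equals the full gradient difference, so the estimator is exact and the iterate settles at the mean of `b`; with heterogeneous `a[i]` the estimator drifts between refreshes, which is why the periodic reset is needed.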

134 citations

Posted Content•

Approximation Methods for Bilevel Programming

[...]

Saeed Ghadimi, Mengdi Wang

06 Feb 2018-arXiv: Optimization and Control

TL;DR: An approximation algorithm is presented for solving a class of bilevel programming problems where the inner objective function is strongly convex, together with its finite-time convergence analysis under different convexity assumptions on the outer objective function.

Abstract: In this paper, we study a class of bilevel programming problems where the inner objective function is strongly convex. More specifically, under some mild assumptions on the partial derivatives of both inner and outer objective functions, we present an approximation algorithm for solving this class of problems and provide its finite-time convergence analysis under different convexity assumptions on the outer objective function. We also present an accelerated variant of this method which improves the rate of convergence under a convexity assumption. Furthermore, we generalize our results to the stochastic setting where only noisy information of both objective functions is available. To the best of our knowledge, this is the first time that such (stochastic) approximation algorithms with established iteration complexity (sample complexity) are provided for bilevel programming.

131 citations

Posted Content•

Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning

[...]

Hao Yu¹, Sen Yang¹, Shenghuo Zhu¹•Institutions (1)

Alibaba Group¹

17 Jul 2018-arXiv: Optimization and Control

TL;DR: In this paper, the authors provide a thorough and rigorous theoretical study of why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead, noting that model averaging can still achieve a good speed-up of the training time as long as the averaging interval is carefully controlled.

Abstract: In distributed training of deep neural networks, parallel mini-batch SGD is widely used to speed up the training process by using multiple workers. It uses multiple workers to sample local stochastic gradients in parallel, aggregates all gradients in a single server to obtain the average, and updates each worker's local model using an SGD update with the averaged gradient. Ideally, parallel mini-batch SGD can achieve a linear speed-up of the training time (with respect to the number of workers) compared with SGD over a single worker. However, such linear scalability in practice is significantly limited by the growing demand for gradient communication as more workers are involved. Model averaging, which periodically averages individual models trained over parallel workers, is another common practice used for distributed training of deep neural networks since (Zinkevich et al. 2010) (McDonald, Hall, and Mann 2010). Compared with parallel mini-batch SGD, the communication overhead of model averaging is significantly reduced. Impressively, numerous experimental works have verified that model averaging can still achieve a good speed-up of the training time as long as the averaging interval is carefully controlled. However, it remains a mystery in theory why such a simple heuristic works so well. This paper provides a thorough and rigorous theoretical study on why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.

Journal Article•DOI•

A Mean-Field Optimal Control Formulation of Deep Learning

[...]

Weinan E, Jiequn Han, Qianxiao Li

03 Jul 2018-arXiv: Optimization and Control

TL;DR: In this article, the authors introduced the mathematical formulation of the population risk minimization problem in deep learning as a mean-field optimal control problem and proved optimality conditions of both the Hamilton-Jacobi-Bellman type and the Pontryagin type.

Abstract: Recent work linking deep neural networks and dynamical systems opened up new avenues to analyze deep learning. In particular, it is observed that new insights can be obtained by recasting deep learning as an optimal control problem on difference or differential equations. However, the mathematical aspects of such a formulation have not been systematically explored. This paper introduces the mathematical formulation of the population risk minimization problem in deep learning as a mean-field optimal control problem. Mirroring the development of classical optimal control, we state and prove optimality conditions of both the Hamilton-Jacobi-Bellman type and the Pontryagin type. These mean-field results reflect the probabilistic nature of the learning problem. In addition, by appealing to the mean-field Pontryagin's maximum principle, we establish some quantitative relationships between population and empirical learning problems. This serves to establish a mathematical foundation for investigating the algorithmic and theoretical connections between optimal control and deep learning.

Journal Article•DOI•

Global Deterministic Optimization with Artificial Neural Networks Embedded

[...]

Artur M. Schweidtmann, Alexander Mitsos

22 Jan 2018-arXiv: Optimization and Control

TL;DR: A method is presented for deterministic global optimization of problems with embedded artificial neural networks, based on relaxations of algorithms using McCormick relaxations in a reduced space and the convex and concave envelopes of the nonlinear activation function.

Abstract: Artificial neural networks (ANNs) are used in various applications for data-driven black-box modeling and subsequent optimization. Herein, we present an efficient method for deterministic global optimization of ANN embedded optimization problems. The proposed method is based on relaxations of algorithms using McCormick relaxations in a reduced-space [SIOPT, 20 (2009), pp. 573-601] including the convex and concave envelopes of the nonlinear activation function of ANNs. The optimization problem is solved using our in-house global deterministic solver MAiNGO. The performance of the proposed method is shown in four optimization examples: an illustrative function, a fermentation process, a compressor plant and a chemical process optimization. The results show that computational solution time is favorable compared to the global general-purpose optimization solver BARON.

Journal Article•DOI•

Distributed GNE seeking under partial-decision information over networks via a doubly-augmented operator splitting approach.

[...]

Lacra Pavel¹•Institutions (1)

University of Toronto¹

13 Aug 2018-arXiv: Optimization and Control

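TL;DR: In this article, distributed generalized Nash equilibrium seeking under partial-decision information is recast as finding a zero of a sum of monotone operators through primal-dual analysis, and a single-layer, fully distributed algorithm is shown to converge to a variational GNE with fixed step-sizes.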

Abstract: We consider distributed computation of generalized Nash equilibrium (GNE) over networks, in games with shared coupling constraints. Existing methods require that each player has full access to opponents' decisions. In this paper, we assume that players have only partial-decision information, and can communicate with their neighbours over an arbitrary undirected graph. We recast the problem as that of finding a zero of a sum of monotone operators through primal-dual analysis. To distribute the problem, we doubly augment variables, so that each player has local decision estimates and local copies of Lagrangian multipliers. We introduce a single-layer algorithm, fully distributed with respect to both primal and dual variables. We show its convergence to a variational GNE with fixed step-sizes, by reformulating it as a forward-backward iteration for a pair of doubly-augmented monotone operators.

Posted Content•

Last-Iterate Convergence: Zero-Sum Games and Constrained Min-Max Optimization

[...]

Constantinos Daskalakis¹, Ioannis Panageas¹•Institutions (1)

Massachusetts Institute of Technology¹

11 Jul 2018-arXiv: Optimization and Control

TL;DR: It is shown that OMWU monotonically improves the Kullback-Leibler divergence of the current iterate to the (appropriately normalized) min-max solution until it enters a neighborhood of the solution and becomes a contracting map converging to the exact solution.

Abstract: Motivated by applications in Game Theory, Optimization, and Generative Adversarial Networks, recent work of Daskalakis et al. [DISZ17] and follow-up work of Liang and Stokes [LiangS18] have established that a variant of the widely used Gradient Descent/Ascent procedure, called "Optimistic Gradient Descent/Ascent (OGDA)", exhibits last-iterate convergence to saddle points in unconstrained convex-concave min-max optimization problems. We show that the same holds true in the more general problem of constrained min-max optimization under a variant of the no-regret Multiplicative-Weights-Update method called "Optimistic Multiplicative-Weights Update (OMWU)". This answers an open question of Syrgkanis et al. [SALS15]. The proof of our result requires fundamentally different techniques from those that exist in no-regret learning literature and the aforementioned papers. We show that OMWU monotonically improves the Kullback-Leibler divergence of the current iterate to the (appropriately normalized) min-max solution until it enters a neighborhood of the solution. Inside that neighborhood we show that OMWU is locally (asymptotically) stable converging to the exact solution. We believe that our techniques will be useful in the analysis of the last iterate of other learning algorithms.

Posted Content•

On Distributionally Robust Chance Constrained Programs with Wasserstein Distance

[...]

Weijun Xie¹•Institutions (1)

Virginia Tech¹

19 Jun 2018-arXiv: Optimization and Control

TL;DR: It is shown that a DRCCP can be reformulated as a conditional value-at-risk constrained optimization problem, and thus admits tight inner and outer approximations and a big-M free formulation.

Abstract: This paper studies a distributionally robust chance constrained program (DRCCP) with Wasserstein ambiguity set, where the uncertain constraints should be satisfied with a probability at least a given threshold for all the probability distributions of the uncertain parameters within a chosen Wasserstein distance from an empirical distribution. In this work, we investigate equivalent reformulations and approximations of such problems. We first show that a DRCCP can be reformulated as a conditional value-at-risk constrained optimization problem, and thus admits tight inner and outer approximations. We also show that a DRCCP of bounded feasible region is mixed integer representable by introducing big-M coefficients and additional binary variables. For a DRCCP with pure binary decision variables, by exploring the submodular structure, we show that it admits a big-M free formulation, which can be solved by a branch and cut algorithm. Finally, we present a numerical study to illustrate the effectiveness of the proposed formulations.
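
As background for the conditional value-at-risk (CVaR) reformulation mentioned in this abstract, the sketch below computes the empirical CVaR of a sample of losses using the Rockafellar-Uryasev formula CVaR_alpha = min_t { t + E[(loss - t)^+] / (1 - alpha) }. It illustrates the risk measure only, not the paper's DRCCP reformulation; the function name `empirical_cvar` and the data are assumptions.

```python
def empirical_cvar(losses, alpha):
    """Empirical CVaR at level alpha (0 < alpha < 1) of a list of losses.

    The objective t + mean((loss - t)^+) / (1 - alpha) is convex and
    piecewise linear in t with breakpoints at the sample values, so it is
    enough to evaluate it at the samples themselves.
    """
    n = len(losses)

    def objective(t):
        excess = sum(max(x - t, 0.0) for x in losses)
        return t + excess / (n * (1.0 - alpha))

    return min(objective(t) for t in losses)


losses = [float(i) for i in range(1, 11)]   # 1, 2, ..., 10
print(empirical_cvar(losses, alpha=0.8))    # average of the worst 20%: about 9.5
```

A constraint of the form `empirical_cvar(losses, alpha) <= 0` is the usual convex stand-in for requiring the loss to be nonpositive with probability at least `alpha`.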

Posted Content•

Stochastic subgradient method converges at the rate $O(k^{-1/4})$ on weakly convex functions

[...]

Damek Davis, Dmitriy Drusvyatskiy

08 Feb 2018-arXiv: Optimization and Control

TL;DR: It is proved that the proximal stochastic subgradient method, applied to a weakly convex problem, drives the gradient of the Moreau envelope to zero at the rate $O(k^{-1/4})$.

Abstract: We prove that the proximal stochastic subgradient method, applied to a weakly convex problem, drives the gradient of the Moreau envelope to zero at the rate $O(k^{-1/4})$. As a consequence, we resolve an open question on the convergence rate of the proximal stochastic gradient method for minimizing the sum of a smooth nonconvex function and a convex proximable function.
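
The stationarity measure in this abstract is the gradient of the Moreau envelope. As a small worked example, the sketch below evaluates the envelope and its gradient for f(x) = |x|, where the proximal map is soft-thresholding. The choice of f (convex rather than merely weakly convex), the function names and the parameter `lam` are assumptions for illustration.

```python
def prox_abs(x, lam):
    """Proximal map of f(x) = |x| with parameter lam (soft-thresholding)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_env_abs(x, lam):
    """Moreau envelope: min_y |y| + (x - y)^2 / (2 * lam)."""
    p = prox_abs(x, lam)
    return abs(p) + (x - p) ** 2 / (2.0 * lam)


def moreau_grad_abs(x, lam):
    """Gradient of the envelope: (x - prox(x)) / lam."""
    return (x - prox_abs(x, lam)) / lam


print(moreau_env_abs(2.0, 0.5), moreau_grad_abs(2.0, 0.5))
print(moreau_env_abs(0.2, 0.5), moreau_grad_abs(0.2, 0.5))
```

The envelope is smooth even though |x| is not: its gradient equals the sign of x away from the origin and x / lam near it, so a small envelope gradient means x is close to a point that is nearly stationary for f.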

Journal Article•DOI•

Input-to-State Safety with Control Barrier Functions

[...]

Shishir Kolathaya¹, Aaron D. Ames²•Institutions (2)

Indian Institute of Science¹, California Institute of Technology²

08 Mar 2018-arXiv: Optimization and Control

TL;DR: In this paper, the authors present a new notion of input-to-state safe control barrier functions (ISSf-CBFs), which ensure safety of nonlinear dynamical systems under input disturbances.

Abstract: This letter presents a new notion of input-to-state safe control barrier functions (ISSf-CBFs), which ensure safety of nonlinear dynamical systems under input disturbances. Similar to how safety conditions are specified in terms of forward invariance of a set, input-to-state safety (ISSf) conditions are specified in terms of forward invariance of a slightly larger set. In this context, invariance of the larger set implies that the states stay either inside or very close to the smaller safe set; and this closeness is bounded by the magnitude of the disturbances. The main contribution of the letter is the methodology used for obtaining a valid ISSf-CBF, given a control barrier function (CBF). The associated universal control law will also be provided. Towards the end, we will study unified quadratic programs (QPs) that combine control Lyapunov functions (CLFs) and ISSf-CBFs in order to obtain a single control law that ensures both safety and stability in systems with input disturbances.
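
The forward-invariance idea behind control barrier functions can be illustrated with a one-dimensional safety filter. The sketch below uses a plain CBF without disturbances, not the ISSf-CBF construction of the letter; the system x' = u, the safe set x <= x_max, the gain `gamma` and the function names are all assumptions.

```python
def cbf_filter(x, u_des, x_max=1.0, gamma=2.0):
    """Safety filter for x' = u with safe set h(x) = x_max - x >= 0.

    The CBF condition dh/dt >= -gamma * h(x) reads u <= gamma * (x_max - x).
    With a single scalar constraint, the QP that minimizes (u - u_des)^2
    subject to this condition has the closed-form solution below.
    """
    return min(u_des, gamma * (x_max - x))


def simulate(x0=0.0, u_des=1.5, dt=0.01, steps=500):
    """Forward-Euler rollout of the filtered system; returns the trajectory."""
    xs = [x0]
    for _ in range(steps):
        u = cbf_filter(xs[-1], u_des)
        xs.append(xs[-1] + dt * u)
    return xs


trajectory = simulate()
print(max(trajectory))
```

The desired input pushes the state toward the boundary at constant speed; the filter leaves it untouched far from the boundary and slows it down near `x_max`, so the trajectory approaches the boundary without crossing it (for the Euler discretization this holds whenever `dt * gamma <= 1`).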

Posted Content•DOI•

Data-Driven Chance Constrained Programs over Wasserstein Balls

[...]

Zhi Chen, Daniel Kuhn, Wolfram Wiesemann

01 Sep 2018-arXiv: Optimization and Control

TL;DR: This work provides an exact deterministic reformulation for data-driven chance constrained programs over Wasserstein balls and shows that two popular approximation schemes based on the conditional-value-at-risk and the Bonferroni inequality can perform poorly in practice and that these two schemes are generally incomparable with each other.

Abstract: We provide an exact deterministic reformulation for data-driven chance constrained programs over Wasserstein balls. For individual chance constraints as well as joint chance constraints with right-hand side uncertainty, our reformulation amounts to a mixed-integer conic program. In the special case of a Wasserstein ball with the $1$-norm or the $\infty$-norm, the cone is the nonnegative orthant, and the chance constrained program can be reformulated as a mixed-integer linear program. Using our reformulation, we show that two popular approximation schemes based on the conditional-value-at-risk and the Bonferroni inequality can perform poorly in practice and that these two schemes are generally incomparable with each other.

Posted Content•

Stochastic model-based minimization of weakly convex functions

[...]

Damek Davis, Dmitriy Drusvyatskiy

17 Mar 2018-arXiv: Optimization and Control

TL;DR: The authors show that, under reasonable conditions on approximation quality and regularity of the models, any algorithm that successively samples and minimizes simple stochastic models of the objective drives a natural stationarity measure to zero at the rate $O(k^{-1/4})$, yielding the first complexity guarantees for the stochastic proximal point, proximal subgradient, and regularized Gauss-Newton methods for minimizing compositions of convex functions with smooth maps.

Abstract: We consider a family of algorithms that successively sample and minimize simple stochastic models of the objective function. We show that under reasonable conditions on approximation quality and regularity of the models, any such algorithm drives a natural stationarity measure to zero at the rate $O(k^{-1/4})$. As a consequence, we obtain the first complexity guarantees for the stochastic proximal point, proximal subgradient, and regularized Gauss-Newton methods for minimizing compositions of convex functions with smooth maps. The guiding principle, underlying the complexity guarantees, is that all algorithms under consideration can be interpreted as approximate descent methods on an implicit smoothing of the problem, given by the Moreau envelope. Specializing to classical circumstances, we obtain the long-sought convergence rate of the stochastic projected gradient method, without batching, for minimizing a smooth function on a closed convex set.

Journal Article•DOI•

Enhancing the settling time estimation of a class of fixed-time stable systems

Rodrigo Aldana-López¹, David Gómez-Gutiérrez¹,², Esteban Jiménez-Rodríguez³, Juan Diego Sanchez-Torres⁴, Michael Defoort•Institutions (4)

Institute of Robotics and Intelligent Systems¹, Monterrey Institute of Technology and Higher Education², CINVESTAV³, Western Institute of Technology and Higher Education⁴

19 Sep 2018-arXiv: Optimization and Control

TL;DR: This article analyzes the convergence time of a class of fixed-time stable systems, with the aim of providing a new non-conservative upper bound for the settling time.

Abstract: This paper deals with the convergence time analysis of a class of fixed-time stable systems with the aim to provide a new non-conservative upper bound for its settling time. Our contribution is fourfold. First, we revisit the well-known class of fixed-time stable systems, given in (Polyakov et al., 2012), while showing the conservatism of the classical upper estimate of the settling time. Second, we provide the smallest constant that uniformly upper bounds the settling time of any trajectory of the system under consideration. Third, introducing a slight modification of the previous class of fixed-time systems, we propose a new predefined-time convergent algorithm where the least upper bound of the settling time is set a priori as a parameter of the system. Finally, predefined-time controllers for first-order and second-order systems are introduced. Some simulation results highlight the performance of the proposed scheme in terms of settling time estimation compared to existing methods.

Posted Content•

Understanding the Acceleration Phenomenon via High-Resolution Differential Equations

Bin Shi¹, Simon S. Du², Michael I. Jordan³, Weijie J. Su⁴•Institutions (4)

Chinese Academy of Sciences¹, University of Washington², University of California, Berkeley³, University of Pennsylvania⁴

21 Oct 2018-arXiv: Optimization and Control

TL;DR: This paper studies an alternative limiting process that yields high-resolution ODEs, which distinguish between Nesterov's accelerated gradient method for strongly convex functions (NAG-SC) and Polyak's heavy-ball method.

Abstract: Gradient-based optimization algorithms can be studied from the perspective of limiting ordinary differential equations (ODEs). Motivated by the fact that existing ODEs do not distinguish between two fundamentally different algorithms---Nesterov's accelerated gradient method for strongly convex functions (NAG-SC) and Polyak's heavy-ball method---we study an alternative limiting process that yields high-resolution ODEs. We show that these ODEs permit a general Lyapunov function framework for the analysis of convergence in both continuous and discrete time. We also show that these ODEs are more accurate surrogates for the underlying algorithms; in particular, they not only distinguish between NAG-SC and Polyak's heavy-ball method, but they allow the identification of a term that we refer to as "gradient correction" that is present in NAG-SC but not in the heavy-ball method and is responsible for the qualitative difference in convergence of the two methods. We also use the high-resolution ODE framework to study Nesterov's accelerated gradient method for (non-strongly) convex functions, uncovering a hitherto unknown result---that NAG-C minimizes the squared gradient norm at an inverse cubic rate. Finally, by modifying the high-resolution ODE of NAG-C, we obtain a family of new optimization methods that are shown to maintain the accelerated convergence rates of NAG-C for smooth convex functions.
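
The difference between the two discrete methods mentioned in this abstract can be sketched in a few lines. The code below is an illustration, not the authors' implementation: it assumes the commonly used single-sequence forms of the heavy-ball and NAG-SC updates with step size `s = 1/L` and momentum `alpha = (1 - sqrt(mu*s)) / (1 + sqrt(mu*s))`, applied to a hypothetical diagonal quadratic. The function names and the test problem are assumptions made for this example.

```python
import math


def grad(x, curvatures):
    """Gradient of the toy objective f(x) = 0.5 * sum(c_i * x_i**2)."""
    return [c * xi for c, xi in zip(curvatures, x)]


def run(method, x0, curvatures, steps):
    """Run 'heavy_ball' or 'nag_sc' on the toy quadratic and return the last iterate."""
    mu, lipschitz = min(curvatures), max(curvatures)
    s = 1.0 / lipschitz  # step size (assumed choice)
    alpha = (1 - math.sqrt(mu * s)) / (1 + math.sqrt(mu * s))  # momentum
    x_prev, x = list(x0), list(x0)
    for _ in range(steps):
        g, g_prev = grad(x, curvatures), grad(x_prev, curvatures)
        x_next = []
        for i in range(len(x)):
            # Shared part: momentum step plus gradient step.
            value = x[i] + alpha * (x[i] - x_prev[i]) - s * g[i]
            if method == "nag_sc":
                # Extra "gradient correction" term, present only in NAG-SC.
                value -= alpha * s * (g[i] - g_prev[i])
            x_next.append(value)
        x_prev, x = x, x_next
    return x


if __name__ == "__main__":
    for name in ("heavy_ball", "nag_sc"):
        print(name, run(name, [1.0, 1.0], [1.0, 100.0], 100))
```

The only difference between the two branches is the term proportional to the change in gradients, which is the discrete counterpart of the gradient correction discussed in the abstract.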

Posted Content•

Direct Runge-Kutta Discretization Achieves Acceleration

Jingzhao Zhang¹, Aryan Mokhtari¹, Suvrit Sra¹, Ali Jadbabaie¹•Institutions (1)

Massachusetts Institute of Technology¹

01 May 2018-arXiv: Optimization and Control

TL;DR: It is proved that under Lipschitz-gradient, convexity and order-$(s+2)$ differentiability assumptions, the sequence of iterates generated by discretizing the proposed second-order ODE converges to the optimal solution at a rate of $\mathcal{O}({N^{-2\frac{s}{s+1}}})$, where $s$ is the order of the Runge-Kutta numerical integrator.

Abstract: We study gradient-based optimization methods obtained by directly discretizing a second-order ordinary differential equation (ODE) related to the continuous limit of Nesterov's accelerated gradient method. When the function is smooth enough, we show that acceleration can be achieved by a stable discretization of this ODE using standard Runge-Kutta integrators. Specifically, we prove that under Lipschitz-gradient, convexity and order-$(s+2)$ differentiability assumptions, the sequence of iterates generated by discretizing the proposed second-order ODE converges to the optimal solution at a rate of $\mathcal{O}({N^{-2\frac{s}{s+1}}})$, where $s$ is the order of the Runge-Kutta numerical integrator. Furthermore, we introduce a new local flatness condition on the objective, under which rates even faster than $\mathcal{O}(N^{-2})$ can be achieved with low-order integrators and only gradient information. Notably, this flatness condition is satisfied by several standard loss functions used in machine learning. We provide numerical experiments that verify the theoretical rates predicted by our results.

Posted Content•

Distributed Stochastic Gradient Tracking Methods

Shi Pu¹, Angelia Nedic²•Institutions (2)

The Chinese University of Hong Kong¹, Arizona State University²

25 May 2018-arXiv: Optimization and Control

TL;DR: In this paper, the authors studied the problem of distributed multi-agent optimization over a network, where each agent possesses a local cost function that is smooth and strongly convex, and the global objective is to find a common solution that minimizes the average of all cost functions.

Abstract: In this paper, we study the problem of distributed multi-agent optimization over a network, where each agent possesses a local cost function that is smooth and strongly convex. The global objective is to find a common solution that minimizes the average of all cost functions. Assuming agents only have access to unbiased estimates of the gradients of their local cost functions, we consider a distributed stochastic gradient tracking method (DSGT) and a gossip-like stochastic gradient tracking method (GSGT). We show that, in expectation, the iterates generated by each agent are attracted to a neighborhood of the optimal solution, where they accumulate exponentially fast (under a constant stepsize choice). Under DSGT, the limiting (expected) error bounds on the distance of the iterates from the optimal solution decrease with the network size $n$, which is comparable to the performance of a centralized stochastic gradient algorithm. Moreover, we show that when the network is well-connected, GSGT incurs lower communication cost than DSGT while maintaining a similar computational cost. A numerical example further demonstrates the effectiveness of the proposed methods.
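
A minimal sketch of a gradient tracking iteration may help make the setting concrete. It assumes the usual form of the update, in which each agent mixes its neighbors' iterates with a doubly stochastic matrix and maintains a tracker `y` of the average gradient; the scalar quadratic costs `0.5 * (x - target_i)**2`, the mixing matrix, and all names are hypothetical choices for illustration and are not taken from the paper.

```python
import random


def dsgt(targets, mixing, step_size, steps, noise_std=0.0, seed=0):
    """Gradient tracking on local costs f_i(x) = 0.5 * (x - targets[i])**2.

    `mixing` is assumed to be a doubly stochastic matrix given as nested lists.
    """
    rng = random.Random(seed)
    n = len(targets)

    def noisy_grad(i, xi):
        # Unbiased stochastic gradient: exact gradient plus zero-mean noise.
        return (xi - targets[i]) + rng.gauss(0.0, noise_std)

    def mix(values):
        # One round of weighted averaging with the neighbors.
        return [sum(mixing[i][j] * values[j] for j in range(n)) for i in range(n)]

    x = [0.0] * n
    g = [noisy_grad(i, x[i]) for i in range(n)]
    y = list(g)  # gradient trackers start at the local gradients
    for _ in range(steps):
        x = mix([x[i] - step_size * y[i] for i in range(n)])
        g_new = [noisy_grad(i, x[i]) for i in range(n)]
        mixed_y = mix(y)
        y = [mixed_y[i] + g_new[i] - g[i] for i in range(n)]
        g = g_new
    return x


if __name__ == "__main__":
    w = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
    print(dsgt([1.0, 2.0, 6.0], w, step_size=0.1, steps=300, noise_std=0.1))
```

With `noise_std=0.0` every agent in this toy problem approaches the minimizer of the average cost (the mean of the targets); with noise, the iterates are expected to hover in a neighborhood of it, in line with the statement in the abstract.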

Posted Content•

On the Convergence of the SINDy Algorithm

Linan Zhang, Hayden Schaeffer

16 May 2018-arXiv: Optimization and Control

TL;DR: The authors prove that the algorithm proposed in [6] approximates local minimizers of an unconstrained $\ell^0$-penalized least-squares problem, and give sufficient conditions for general convergence, the rate of convergence, and one-step recovery.

Abstract: One way to understand time-series data is to identify the underlying dynamical system which generates it. This task can be done by selecting an appropriate model and a set of parameters which best fits the dynamics while providing the simplest representation (i.e., the smallest number of terms). One such approach is the sparse identification of nonlinear dynamics framework [6] which uses a sparsity-promoting algorithm that iterates between a partial least-squares fit and a thresholding (sparsity-promoting) step. In this work, we provide some theoretical results on the behavior and convergence of the algorithm proposed in [6]. In particular, we prove that the algorithm approximates local minimizers of an unconstrained $\ell^0$-penalized least-squares problem. From this, we provide sufficient conditions for general convergence, rate of convergence, and conditions for one-step recovery. Examples illustrate that the rates of convergence are sharp. In addition, our results extend to other algorithms related to the algorithm in [6], and provide theoretical verification to several observed phenomena.
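
The alternation between a least-squares fit on the current support and a thresholding step can be sketched as follows. This is a small pure-Python illustration under stated assumptions, not the reference implementation of [6]: the helper names, the use of normal equations, and the tiny dataset in the usage example are all hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def least_squares(features, targets, support):
    """Least-squares fit restricted to the columns in `support` (normal equations)."""
    gram = [[sum(row[p] * row[q] for row in features) for q in support]
            for p in support]
    rhs = [sum(row[p] * t for row, t in zip(features, targets)) for p in support]
    return solve_linear(gram, rhs)


def thresholded_least_squares(features, targets, threshold, max_iter=10):
    """Alternate a restricted least-squares fit with hard thresholding."""
    n = len(features[0])
    support = list(range(n))
    coef = [0.0] * n
    for _ in range(max_iter):
        if not support:
            break
        fit = least_squares(features, targets, support)
        coef = [0.0] * n
        for j, value in zip(support, fit):
            coef[j] = value
        # Keep only the coefficients whose magnitude reaches the threshold.
        kept = [j for j in support if abs(coef[j]) >= threshold]
        if kept == support:
            break  # the support is stable, so the iteration has converged
        for j in support:
            if j not in kept:
                coef[j] = 0.0
        support = kept
    return coef


if __name__ == "__main__":
    rows = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1]]
    values = [2 * r[0] + 0.01 * r[1] - 0.02 * r[2] for r in rows]
    print(thresholded_least_squares(rows, values, threshold=0.1))
```

In the usage example the two small coefficients fall below the threshold after the first fit, so the second fit uses only the first column and the support no longer changes.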

Posted Content•

The Limit Points of (Optimistic) Gradient Descent in Min-Max Optimization

Constantinos Daskalakis¹, Ioannis Panageas¹•Institutions (1)

Massachusetts Institute of Technology¹

11 Jul 2018-arXiv: Optimization and Control

TL;DR: This work characterizes the limit points of two basic first order methods, namely Gradient Descent/Ascent (GDA) and Optimistic Gradient Descent Ascent (OGDA), and shows that both dynamics avoid unstable critical points for almost all initializations.

Abstract: Motivated by applications in Optimization, Game Theory, and the training of Generative Adversarial Networks, the convergence properties of first order methods in min-max problems have received extensive study. It has been recognized that they may cycle, and there is no good understanding of their limit points when they do not. When they converge, do they converge to local min-max solutions? We characterize the limit points of two basic first order methods, namely Gradient Descent/Ascent (GDA) and Optimistic Gradient Descent Ascent (OGDA). We show that both dynamics avoid unstable critical points for almost all initializations. Moreover, for small step sizes and under mild assumptions, the set of OGDA-stable critical points is a superset of GDA-stable critical points, which is a superset of local min-max solutions (strict in some cases). The connecting thread is that the behavior of these dynamics can be studied from a dynamical systems perspective.
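
The two update rules can be compared on a standard toy problem. The sketch below is an illustration only and is not taken from the paper: it assumes the bilinear objective `f(x, y) = x * y` (minimize over `x`, maximize over `y`) and the common form of OGDA that uses twice the current gradient minus the previous one. On this problem, plain GDA spirals away from the saddle point at the origin, while OGDA with a small step size contracts toward it.

```python
import math


def gda(x, y, eta, steps):
    """Simultaneous gradient descent/ascent on the toy objective f(x, y) = x * y."""
    for _ in range(steps):
        gx, gy = y, x  # df/dx = y, df/dy = x
        x, y = x - eta * gx, y + eta * gy
    return x, y


def ogda(x, y, eta, steps):
    """Optimistic GDA: uses twice the current gradient minus the previous one."""
    gx_prev, gy_prev = y, x  # start with the gradient at the initial point
    for _ in range(steps):
        gx, gy = y, x
        x, y = (x - 2 * eta * gx + eta * gx_prev,
                y + 2 * eta * gy - eta * gy_prev)
        gx_prev, gy_prev = gx, gy
    return x, y


if __name__ == "__main__":
    print("GDA distance from origin: ", math.hypot(*gda(1.0, 1.0, 0.1, 2000)))
    print("OGDA distance from origin:", math.hypot(*ogda(1.0, 1.0, 0.1, 2000)))
```

For GDA each step multiplies the distance from the origin by `sqrt(1 + eta**2)`, which is why it cannot converge on this problem for any positive step size.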

Posted Content•

Spurious Valleys in Two-layer Neural Network Optimization Landscapes

Luca Venturi¹, Afonso S. Bandeira¹, Joan Bruna¹•Institutions (1)

New York University¹

18 Feb 2018-arXiv: Optimization and Control

TL;DR: Focusing on a class of two-layer neural networks defined by smooth (but generally non-linear) activation functions, a notion of intrinsic dimension is identified and it is shown that it provides necessary and sufficient conditions for the absence of spurious valleys.

Abstract: Neural networks provide a rich class of high-dimensional, non-convex optimization problems. Despite their non-convexity, gradient-descent methods often successfully optimize these models. This has motivated a recent spur in research attempting to characterize properties of their loss surface that may explain such success. In this paper, we address this phenomenon by studying a key topological property of the loss: the presence or absence of spurious valleys, defined as connected components of sub-level sets that do not include a global minimum. Focusing on a class of two-layer neural networks defined by smooth (but generally non-linear) activation functions, we identify a notion of intrinsic dimension and show that it provides necessary and sufficient conditions for the absence of spurious valleys. More concretely, finite intrinsic dimension guarantees that for sufficiently overparametrised models no spurious valleys exist, independently of the data distribution. Conversely, infinite intrinsic dimension implies that spurious valleys do exist for certain data distributions, independently of model overparametrisation. Besides these positive and negative results, we show that, although spurious valleys may exist in general, they are confined to low risk levels and avoided with high probability on overparametrised models.

Posted Content•

Graph Oracle Models, Lower Bounds, and Gaps for Parallel Stochastic Optimization

Blake Woodworth¹, Jialei Wang², Adam Smith³, Brendan McMahan⁴, Nathan Srebro¹•Institutions (4)

Toyota Technological Institute at Chicago¹, University of Chicago², Boston University³, Google⁴

25 May 2018-arXiv: Optimization and Control

TL;DR: A general oracle-based framework is suggested that captures parallel stochastic optimization in different parallelization settings described by a dependency graph, and generic lower bounds are derived in terms of this graph.

Abstract: We suggest a general oracle-based framework that captures different parallel stochastic optimization settings described by a dependency graph, and derive generic lower bounds in terms of this graph. We then use the framework and derive lower bounds for several specific parallel optimization settings, including delayed updates and parallel processing with intermittent communication. We highlight gaps between lower and upper bounds on the oracle complexity, and cases where the "natural" algorithms are not known to be optimal.

Posted Content•

On exponential convergence of SGD in non-convex over-parametrized learning

Raef Bassily, Mikhail Belkin, Siyuan Ma

06 Nov 2018-arXiv: Optimization and Control

TL;DR: It is argued that the PL condition provides a relevant and attractive setting for many machine learning problems, particularly in the over-parametrized regime.

Abstract: Large over-parametrized models learned via stochastic gradient descent (SGD) methods have become a key element in modern machine learning. Although SGD methods are very effective in practice, most theoretical analyses of SGD suggest slower convergence than what is empirically observed. In our recent work [8] we analyzed how interpolation, common in modern over-parametrized learning, results in exponential convergence of SGD with constant step size for convex loss functions. In this note, we extend those results to a much broader non-convex function class satisfying the Polyak-Lojasiewicz (PL) condition. A number of important non-convex problems in machine learning, including some classes of neural networks, have been recently shown to satisfy the PL condition. We argue that the PL condition provides a relevant and attractive setting for many machine learning problems, particularly in the over-parametrized regime.
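
For reference, the PL condition is usually stated as $\frac{1}{2}\|\nabla f(w)\|^2 \ge \mu\,(f(w) - f^*)$ for some $\mu > 0$ and all $w$. The sketch below illustrates the interpolation setting on a toy problem chosen for this example, not one from the note: an under-determined, consistent least-squares problem (three parameters, two data points), which is non-strongly-convex but satisfies the PL condition. All names and data are hypothetical. SGD with a constant step size drives the loss to numerical zero.

```python
import random


def sgd_interpolation(rows, targets, step_size, steps, seed=0):
    """Constant-step SGD on the per-sample losses 0.5 * (a_i . w - b_i)**2."""
    rng = random.Random(seed)
    w = [0.0] * len(rows[0])
    for _ in range(steps):
        i = rng.randrange(len(rows))  # sample one data point uniformly
        residual = sum(a * wj for a, wj in zip(rows[i], w)) - targets[i]
        # The stochastic gradient of the sampled loss is residual * a_i.
        w = [wj - step_size * residual * a for wj, a in zip(w, rows[i])]
    return w


def loss(rows, targets, w):
    """Average squared loss over the data set."""
    total = 0.0
    for row, target in zip(rows, targets):
        total += 0.5 * (sum(a * wj for a, wj in zip(row, w)) - target) ** 2
    return total / len(rows)


if __name__ == "__main__":
    data = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # 2 samples, 3 parameters
    labels = [1.0, 2.0]
    for steps in (0, 10, 50, 200):
        w = sgd_interpolation(data, labels, step_size=0.5, steps=steps)
        print(steps, loss(data, labels, w))
```

Because the model can fit every data point exactly, the stochastic gradients vanish at the solution, so no decreasing step-size schedule is needed in this example.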

Posted Content•

Peer-to-Peer Electricity Market Analysis: From Variational to Generalized Nash Equilibrium

Hélène Le Cadre, Paulin Jacquot, Cheng Wan, Clémence Alasseur

05 Dec 2018-arXiv: Optimization and Control

TL;DR: It is shown that the preferences have a large impact on the structure of the trades, but that one equilibrium (variational) is optimal, and the learning mechanism needed to reach an equilibrium state in the peer-to-peer market design is discussed together with privacy issues.

Abstract: We consider a network of prosumers involved in peer-to-peer energy exchanges, with differentiation price preferences on the trades with their neighbors, and we analyze two market designs: (i) a centralized market, used as a benchmark, where a global market operator optimizes the flows (trades) between the nodes, local demand and flexibility activation to maximize the system overall social welfare; (ii) a distributed peer-to-peer market design where prosumers in local energy communities optimize selfishly their trades, demand, and flexibility activation. We first characterize the solution of the peer-to-peer market as a Variational Equilibrium and prove that the set of Variational Equilibria coincides with the set of social welfare optimal solutions of market design (i). We give several results that help in understanding the structure of the trades at an equilibrium or at the optimum. We characterize the impact of preferences on the network line congestion and renewable energy waste under both designs. We provide a reduced example for which we give the set of all possible generalized equilibria, which enables us to give an approximation of the price of anarchy. We provide a more realistic example which relies on the IEEE 14-bus network, for which we can simulate the trades under different preference prices. Our analysis shows in particular that the preferences have a large impact on the structure of the trades, but that one equilibrium (variational) is optimal.
