
Showing papers on "Rate of convergence published in 2018"


Journal ArticleDOI
Guannan Qu1, Na Li1
TL;DR: It is shown that it is impossible for a class of distributed algorithms like DGD to achieve a linear convergence rate without using history information even if the objective function is strongly convex and smooth, and a novel gradient estimation scheme is proposed that uses history information to achieve fast and accurate estimation of the average gradient.
Abstract: There has been a growing effort in studying the distributed optimization problem over a network. The objective is to optimize a global function formed by a sum of local functions, using only local computation and communication. The literature has developed consensus-based distributed (sub)gradient descent (DGD) methods and has shown that they have the same convergence rate $O(\frac{\log t}{\sqrt{t}})$ as the centralized (sub)gradient methods (CGD), when the function is convex but possibly nonsmooth. However, when the function is convex and smooth, under the framework of DGD, it is unclear how to harness the smoothness to obtain a faster convergence rate comparable to CGD's convergence rate. In this paper, we propose a distributed algorithm that, despite using the same amount of communication per iteration as DGD, can effectively harness the function smoothness and converge to the optimum with a rate of $O(\frac{1}{t})$. If the objective function is further strongly convex, our algorithm has a linear convergence rate. Both rates match the convergence rate of CGD. The key step in our algorithm is a novel gradient estimation scheme that uses history information to achieve fast and accurate estimation of the average gradient. To motivate the necessity of history information, we also show that it is impossible for a class of distributed algorithms like DGD to achieve a linear convergence rate without using history information even if the objective function is strongly convex and smooth.
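The gradient estimation scheme described above is in the spirit of what is now commonly called gradient tracking. Below is a minimal numpy sketch of such a scheme on a toy scalar problem (our construction, not the paper's code; the ring mixing matrix, step size, and quadratic data are illustrative assumptions):

```python
import numpy as np

# Toy network of n agents, each holding a local quadratic
# f_i(x) = 0.5*a_i*x^2 + b_i*x, so agent i's gradient is a_i*x + b_i.
np.random.seed(0)
n = 5
a = np.random.uniform(1.0, 2.0, n)
b = np.random.uniform(-1.0, 1.0, n)
grad = lambda x: a * x + b             # entry i = agent i's local gradient

# Doubly stochastic mixing matrix for a ring network.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

eta = 0.1
x = np.zeros(n)                        # each agent's estimate of the optimum
y = grad(x)                            # tracker of the average gradient
for _ in range(200):
    x_new = W @ x - eta * y            # consensus step plus tracked gradient
    y = W @ y + grad(x_new) - grad(x)  # update tracker using gradient history
    x = x_new

print(x, -b.sum() / a.sum())           # all estimates approach the optimum
```

The history enters through the tracker y, which accumulates successive gradient differences so that its average always equals the average of the current local gradients.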

440 citations


Proceedings Article
13 Feb 2018
TL;DR: SignSGD as mentioned in this paper uses majority vote to aggregate gradient signs from each worker, enabling 1-bit compression of worker-server communication in both directions, which can achieve fast communication and fast convergence.
Abstract: Training large neural networks requires distributing learning across multiple workers, where the cost of communicating gradients can be a significant bottleneck. signSGD alleviates this problem by transmitting just the sign of each minibatch stochastic gradient. We prove that it can get the best of both worlds: compressed gradients and SGD-level convergence rate. The relative $\ell_1/\ell_2$ geometry of gradients, noise and curvature informs whether signSGD or SGD is theoretically better suited to a particular problem. On the practical side we find that the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep Imagenet models. We extend our theory to the distributed setting, where the parameter server uses majority vote to aggregate gradient signs from each worker enabling 1-bit compression of worker-server communication in both directions. Using a theorem by Gauss we prove that majority vote can achieve the same reduction in variance as full precision distributed SGD. Thus, there is great promise for sign-based optimisation schemes to achieve fast communication and fast convergence. Code to reproduce experiments is to be found at https://github.com/jxbz/signSGD.
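A toy sketch of the majority-vote scheme (our construction; the quadratic objective, worker count, and learning rate are illustrative assumptions, not from the paper's code):

```python
import numpy as np

# M workers each send the sign of a stochastic gradient (1 bit/coordinate);
# the server sends back the sign of the sum of signs: the majority vote.
np.random.seed(1)
d, M, lr = 10, 7, 0.01
x = np.random.randn(d)

def stochastic_grad(x):
    return 2 * x + 0.5 * np.random.randn(d)   # noisy gradient of ||x||^2

for _ in range(500):
    worker_signs = np.stack([np.sign(stochastic_grad(x)) for _ in range(M)])
    vote = np.sign(worker_signs.sum(axis=0))  # 1-bit aggregation both ways
    x -= lr * vote

print(np.linalg.norm(x))                      # iterate approaches the optimum
```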

332 citations


Journal ArticleDOI
TL;DR: A robot control/identification scheme to identify the unknown robot kinematic and dynamic parameters with enhanced convergence rate was developed, and the information of parameter estimation error was properly integrated into the proposed identification algorithm, such that enhanced estimation performance was achieved.
Abstract: For parameter identification of robot systems, most existing works have focused on estimation veracity, but few have considered the convergence speed. In this paper, we developed a robot control/identification scheme to identify the unknown robot kinematic and dynamic parameters with an enhanced convergence rate. Superior to the traditional methods, the information of the parameter estimation error was properly integrated into the proposed identification algorithm, such that enhanced estimation performance was achieved. Besides, the Newton–Euler (NE) method was used to build the robot dynamic model, where a singular value decomposition-based model reduction method was designed to remedy the potential singularity problems of the NE regressor. Moreover, an interval excitation condition was employed to relax the requirement of the persistent excitation condition for the kinematic estimation. By using the Lyapunov synthesis, explicit analysis of the convergence rate of the tracking errors and the estimated parameters was performed. Simulation studies were conducted to show the accurate and fast convergence of the proposed finite-time (FT) identification algorithm based on a 7-DOF arm of the Baxter robot.

321 citations


Posted Content
TL;DR: SignSGD can get the best of both worlds: compressed gradients and SGD-level convergence rate, and the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep Imagenet models.
Abstract: Training large neural networks requires distributing learning across multiple workers, where the cost of communicating gradients can be a significant bottleneck. signSGD alleviates this problem by transmitting just the sign of each minibatch stochastic gradient. We prove that it can get the best of both worlds: compressed gradients and SGD-level convergence rate. The relative $\ell_1/\ell_2$ geometry of gradients, noise and curvature informs whether signSGD or SGD is theoretically better suited to a particular problem. On the practical side we find that the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep Imagenet models. We extend our theory to the distributed setting, where the parameter server uses majority vote to aggregate gradient signs from each worker enabling 1-bit compression of worker-server communication in both directions. Using a theorem by Gauss we prove that majority vote can achieve the same reduction in variance as full precision distributed SGD. Thus, there is great promise for sign-based optimisation schemes to achieve fast communication and fast convergence. Code to reproduce experiments is to be found at https://github.com/jxbz/signSGD.

275 citations


Journal Article
TL;DR: Katyusha as mentioned in this paper is a primal-only stochastic gradient method with negative momentum on top of Nesterov's momentum, which can be incorporated into a variance reduction based algorithm and speed it up, both in terms of sequential and parallel performance.
Abstract: Nesterov's momentum trick is famously known for accelerating gradient descent, and has been proven useful in building fast iterative algorithms. However, in the stochastic setting, counterexamples exist and prevent Nesterov's momentum from providing similar acceleration, even if the underlying problem is convex and finite-sum. We introduce $\mathtt{Katyusha}$, a direct, primal-only stochastic gradient method to fix this issue. In convex finite-sum stochastic optimization, $\mathtt{Katyusha}$ has an optimal accelerated convergence rate, and enjoys an optimal parallel linear speedup in the mini-batch setting. The main ingredient is $\textit{Katyusha momentum}$, a novel "negative momentum" on top of Nesterov's momentum. It can be incorporated into a variance-reduction based algorithm and speed it up, both in terms of $\textit{sequential and parallel}$ performance. Since variance reduction has been successfully applied to a growing list of practical problems, our paper suggests that in each of such cases, one could potentially try to give Katyusha a hug.
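A hedged numpy sketch of the Katyusha iteration on a toy ridge-regression finite sum, with parameter choices following our reading of the paper ($\tau_2 = 1/2$, $\tau_1 = \min\{\sqrt{n\mu/3L}, 1/2\}$, $\alpha = 1/(3\tau_1 L)$); the toy problem and the plain snapshot averaging are our simplifications:

```python
import numpy as np

np.random.seed(2)
n, d, lam = 100, 10, 0.1
A = np.random.randn(n, d)
b = np.random.randn(n)

# f(x) = (1/n) * sum_i [ 0.5*(a_i @ x - b_i)^2 + 0.5*lam*||x||^2 ]
grad_i = lambda x, i: A[i] * (A[i] @ x - b[i]) + lam * x
full_grad = lambda x: A.T @ (A @ x - b) / n + lam * x
L = np.max(np.sum(A**2, axis=1)) + lam   # componentwise smoothness
mu = lam                                 # strong convexity

tau1 = min(np.sqrt(n * mu / (3 * L)), 0.5)
tau2 = 0.5
alpha = 1.0 / (3 * tau1 * L)

y = z = x_tilde = np.zeros(d)
for epoch in range(30):
    mu_tilde = full_grad(x_tilde)        # SVRG-style snapshot gradient
    iterates = []
    for _ in range(2 * n):
        x = tau1 * z + tau2 * x_tilde + (1 - tau1 - tau2) * y
        i = np.random.randint(n)
        g = mu_tilde + grad_i(x, i) - grad_i(x_tilde, i)  # variance-reduced
        z = z - alpha * g                # "negative momentum" anchor step
        y = x - g / (3 * L)
        iterates.append(x)
    x_tilde = np.mean(iterates, axis=0)

print(np.linalg.norm(full_grad(x_tilde)))  # near-stationary
```

The Katyusha momentum is the $\tau_2 \tilde{x}$ term: each iterate is pulled back toward the snapshot, a "magnet" that keeps the accelerated sequence from drifting on stochastic noise.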

273 citations


Journal ArticleDOI
TL;DR: This work presents an accelerated gradient method for nonconvex optimization problems with Lipschitz continuous first and second derivatives that is Hessian free, i.e., it only requires gradient computations, and is therefore suitable for large-scale applications.
Abstract: We present an accelerated gradient method for nonconvex optimization problems with Lipschitz continuous first and second derivatives. In a time $O(\epsilon^{-7/4} \log(1/\epsilon))$, the method finds an $\epsilon$-stationary point, i.e., a point $x$ such that $\|\nabla f(x)\| \le \epsilon$.

243 citations


Journal ArticleDOI
TL;DR: The Kurdyka–Łojasiewicz exponent is studied, an important quantity for analyzing the convergence rate of first-order methods, and various calculus rules are developed to deduce the KL exponent of new (possibly nonconvex and nonsmooth) functions formed from functions with known KL exponents.
Abstract: In this paper, we study the Kurdyka–Łojasiewicz (KL) exponent, an important quantity for analyzing the convergence rate of first-order methods. Specifically, we develop various calculus rules to deduce the KL exponent of new (possibly nonconvex and nonsmooth) functions formed from functions with known KL exponents. In addition, we show that the well-studied Luo–Tseng error bound together with a mild assumption on the separation of stationary values implies that the KL exponent is $\frac{1}{2}$. The Luo–Tseng error bound is known to hold for a large class of concrete structured optimization problems, and thus we deduce the KL exponent of a large class of functions whose exponents were previously unknown. Building upon this and the calculus rules, we are then able to show that for many convex or nonconvex optimization models for applications such as sparse recovery, the objective function's KL exponent is $\frac{1}{2}$. This includes the least squares problem with smoothly clipped absolute deviation regularization or minimax concave penalty regularization and the logistic regression problem with $\ell_1$ regularization. Since many existing local convergence rate analyses for first-order methods in the nonconvex scenario rely on the KL exponent, our results enable us to obtain explicit convergence rates for various first-order methods when they are applied to a large variety of practical optimization models. Finally, we further illustrate how our results can be applied to establishing local linear convergence of the proximal gradient algorithm and the inertial proximal algorithm with constant step sizes for some specific models that arise in sparse recovery.
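For reference, the KL property with exponent $\theta$ at a stationary point $\bar x$ can be stated as follows (our paraphrase; the case $\theta = \frac{1}{2}$ is the one that yields local linear convergence of first-order methods):

$$\operatorname{dist}\big(0, \partial f(x)\big) \;\ge\; c\,\big(f(x) - f(\bar x)\big)^{\theta} \quad \text{for all } x \text{ near } \bar x \text{ with } f(\bar x) < f(x) < f(\bar x) + \epsilon.$$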

242 citations


Journal ArticleDOI
TL;DR: The proximal gradient algorithm for minimizing the sum of a smooth and nonsmooth convex function often converges linearly even without strong convexity as mentioned in this paper, and the equivalence of such an error bound to a natural quadratic growth condition is established.
Abstract: The proximal gradient algorithm for minimizing the sum of a smooth and nonsmooth convex function often converges linearly even without strong convexity. One common reason is that a multiple of the step length at each iteration may linearly bound the “error”—the distance to the solution set. We explain the observed linear convergence intuitively by proving the equivalence of such an error bound to a natural quadratic growth condition. Our approach generalizes to linear and quadratic convergence analysis for proximal methods (of Gauss-Newton type) for minimizing compositions of nonsmooth functions with smooth mappings. We observe incidentally that short step-lengths in the algorithm indicate near-stationarity, suggesting a reliable termination criterion.
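In rough symbols (our notation, for $F = f + g$ with $f$ smooth, solution set $S$, and step length $t$), the two conditions shown to be equivalent read:

$$\text{(error bound)} \quad \operatorname{dist}(x, S) \;\le\; \kappa\,\big\| x - \operatorname{prox}_{tg}\big(x - t\nabla f(x)\big) \big\|,$$

$$\text{(quadratic growth)} \quad F(x) \;\ge\; \min F + \frac{\alpha}{2}\,\operatorname{dist}^{2}(x, S).$$

The first bounds the distance to the solution set by the length of a prox-gradient step; the second says the objective grows at least quadratically away from the solution set.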

235 citations


Journal Article
TL;DR: The continuous dynamical system approach to deep learning is explored in order to devise alternative frameworks for training algorithms using Pontryagin's maximum principle, demonstrating that it obtains a favorable initial convergence rate per iteration, provided Hamiltonian maximization can be efficiently carried out.
Abstract: The continuous dynamical system approach to deep learning is explored in order to devise alternative frameworks for training algorithms. Training is recast as a control problem and this allows us to formulate necessary optimality conditions in continuous time using Pontryagin's maximum principle (PMP). A modification of the method of successive approximations is then used to solve the PMP, giving rise to an alternative training algorithm for deep learning. This approach has the advantage that rigorous error estimates and convergence results can be established. We also show that it may avoid some pitfalls of gradient-based methods, such as slow convergence on flat landscapes near saddle points. Furthermore, we demonstrate that it obtains a favorable initial convergence rate per iteration, provided Hamiltonian maximization can be efficiently carried out - a step which is still in need of improvement. Overall, the approach opens up new avenues to attack problems associated with deep learning, such as trapping in slow manifolds and inapplicability of gradient-based methods for discrete trainable variables.

209 citations


Journal ArticleDOI
TL;DR: By using the Lyapunov analysis, it is proven that all the signals of the closed-loop systems are semiglobally uniformly ultimately bounded.
Abstract: This paper studies the zero-error tracking control problem of Euler-Lagrange systems subject to full-state constraints and nonparametric uncertainties. By blending an error transformation with barrier Lyapunov function, a neural adaptive tracking control scheme is developed, resulting in a solution with several salient features: 1) the control action is continuous and $\mathscr C^{1}$ smooth; 2) the full-state tracking error converges to a prescribed compact set around origin within a given finite time at a controllable rate of convergence that can be uniformly prespecified; 3) with Nussbaum gain in the loop, the tracking error further shrinks to zero as $t\to \infty $ ; and 4) the neural network (NN) unit can be safely included in the loop during the entire system operational envelope without the danger of violating the compact set precondition imposed on the NN training inputs. Furthermore, by using the Lyapunov analysis, it is proven that all the signals of the closed-loop systems are semiglobally uniformly ultimately bounded. The effectiveness and benefits of the proposed control method are validated via computer simulation.

203 citations


Proceedings Article
29 Aug 2018
TL;DR: In this article, a weight structure that is necessary for asymptotic convergence to the true sparse signal is introduced; with this structure, unfolded ISTA can attain linear convergence, which is better than the sublinear convergence of ISTA/FISTA in general cases.
Abstract: In recent years, unfolding iterative algorithms as neural networks has become an empirical success in solving sparse recovery problems. However, its theoretical understanding is still immature, which prevents us from fully utilizing the power of neural networks. In this work, we study unfolded ISTA (Iterative Shrinkage Thresholding Algorithm) for sparse signal recovery. We introduce a weight structure that is necessary for asymptotic convergence to the true sparse signal. With this structure, unfolded ISTA can attain a linear convergence, which is better than the sublinear convergence of ISTA/FISTA in general cases. Furthermore, we propose to incorporate thresholding in the network to perform support selection, which is easy to implement and able to boost the convergence rate both theoretically and empirically. Extensive simulations, including sparse vector recovery and a compressive sensing experiment on real image data, corroborate our theoretical results and demonstrate their practical usefulness. We have made our codes publicly available: https://github.com/xchen-tamu/linear-lista-cpss.
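A minimal numpy sketch of unrolled ISTA in the LISTA spirit (our toy instance; in a learned network the per-layer step size and threshold below would be trained rather than fixed, and the weight structure of the paper couples the layer matrices to $A$):

```python
import numpy as np

np.random.seed(3)
m, n, k = 50, 100, 3
A = np.random.randn(m, n) / np.sqrt(m)
x_true = np.zeros(n)
x_true[np.random.choice(n, k, replace=False)] = np.random.randn(k) + 2.0
y = A @ x_true                          # noiseless sparse measurements

soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

gamma = 1.0 / np.linalg.norm(A, 2)**2   # step size (learnable per layer)
lam = 0.1                               # threshold level (learnable per layer)
x = np.zeros(n)
for layer in range(200):                # each loop pass = one unrolled layer
    x = soft(x - gamma * A.T @ (A @ x - y), gamma * lam)

print(np.flatnonzero(np.abs(x) > 0.1))  # recovered support
print(np.flatnonzero(x_true))           # true support
```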

Journal ArticleDOI
TL;DR: The proposed algorithm, Accelerated Distributed Directed OPTimization (ADD-OPT), achieves the best known convergence rate for this class of problems, given strongly convex objective functions with globally Lipschitz-continuous gradients.
Abstract: In this paper, we consider distributed optimization problems where the goal is to minimize a sum of objective functions over a multiagent network. We focus on the case when the interagent communication is described by a strongly connected, directed graph. The proposed algorithm, Accelerated Distributed Directed OPTimization (ADD-OPT), achieves the best known convergence rate for this class of problems, $O(\mu^{k})$ for $0<\mu<1$, given strongly convex objective functions with globally Lipschitz-continuous gradients, where $k$ is the number of iterations. Moreover, ADD-OPT supports a wider and more realistic range of step sizes in contrast to existing work. In particular, we show that ADD-OPT converges for arbitrarily small (positive) step sizes. Simulations further illustrate our results.

Posted Content
TL;DR: It is proved that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic accelerated method for both convex and strongly-convex functions.
Abstract: Modern machine learning focuses on highly expressive models that are able to fit or interpolate the data completely, resulting in zero training loss. For such models, we show that the stochastic gradients of common loss functions satisfy a strong growth condition. Under this condition, we prove that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic accelerated method for both convex and strongly-convex functions. We also show that this condition implies that SGD can find a first-order stationary point as efficiently as full gradient descent in non-convex settings. Under interpolation, we further show that all smooth loss functions with a finite-sum structure satisfy a weaker growth condition. Given this weaker condition, we prove that SGD with a constant step-size attains the deterministic convergence rate in both the strongly-convex and convex settings. Under additional assumptions, the above results enable us to prove an O(1/k^2) mistake bound for k iterations of a stochastic perceptron algorithm using the squared-hinge loss. Finally, we validate our theoretical findings with experiments on synthetic and real datasets.
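The strong growth condition referenced above can be written as follows (our notation, for a finite sum $f = \frac{1}{n}\sum_i f_i$):

$$\mathbb{E}_i\,\big\|\nabla f_i(x)\big\|^2 \;\le\; \rho\,\big\|\nabla f(x)\big\|^2 \quad \text{for all } x,$$

which forces every stochastic gradient to vanish wherever the full gradient does: exactly the interpolation regime the paper studies.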

Proceedings Article
08 Aug 2018
TL;DR: In this article, the convergence of adaptive gradient-based momentum algorithms is studied for nonconvex stochastic optimization problems, and a set of mild sufficient conditions that guarantee convergence for the Adam-type methods is provided.
Abstract: This paper studies a class of adaptive gradient based momentum algorithms that update the search directions and learning rates simultaneously using past gradients. This class, which we refer to as the "Adam-type", includes the popular algorithms such as the Adam, AMSGrad and AdaGrad. Despite their popularity in training deep neural networks, the convergence of these algorithms for solving nonconvex problems remains an open question. This paper provides a set of mild sufficient conditions that guarantee the convergence for the Adam-type methods. We prove that under our derived conditions, these methods can achieve the convergence rate of order $O(\log{T}/\sqrt{T})$ for nonconvex stochastic optimization. We show the conditions are essential in the sense that violating them may make the algorithm diverge. Moreover, we propose and analyze a class of (deterministic) incremental adaptive gradient algorithms, which has the same $O(\log{T}/\sqrt{T})$ convergence rate. Our study could also be extended to a broader class of adaptive gradient methods in machine learning and optimization.
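A hedged sketch of the generic "Adam-type" template on a toy nonconvex problem (the objective and hyperparameters are illustrative assumptions; $\beta_1 = 0.9$, $\beta_2 = 0.999$ corresponds to Adam, while other choices of the second-moment recursion give AMSGrad or AdaGrad):

```python
import numpy as np

np.random.seed(4)
d = 5
x = np.random.randn(d)
# noisy gradient of the toy nonconvex loss f(x) = sum(0.5*x_i^2 + 2*sin(x_i))
grad = lambda x: x + 2 * np.cos(x) + 0.1 * np.random.randn(d)

m = np.zeros(d)
v = np.zeros(d)
beta1, beta2, alpha, eps = 0.9, 0.999, 0.01, 1e-8
for t in range(5000):
    g = grad(x)
    m = beta1 * m + (1 - beta1) * g       # search direction from past gradients
    v = beta2 * v + (1 - beta2) * g**2    # coordinate-wise learning rates
    x -= alpha * m / (np.sqrt(v) + eps)

print(np.linalg.norm(x + 2 * np.cos(x)))  # gradient norm of the true loss
```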

Journal ArticleDOI
TL;DR: An adaptive control for vehicle active suspensions with unknown nonlinearities (e.g., nonlinear springs and piecewise dampers) is proposed, such that both the transient and steady-state suspension response are guaranteed.
Abstract: This paper proposes an adaptive control for vehicle active suspensions with unknown nonlinearities (e.g., nonlinear springs and piecewise dampers). A prescribed performance function that characterizes the convergence rate, maximum overshoot, and steady-state error is incorporated into the control design to stabilize the vertical and pitch motions, such that both the transient and steady-state suspension response are guaranteed. Moreover, a novel adaptive law is used to achieve precise estimation of essential parameters (e.g., mass of vehicle body and moment of inertia for pitch motion), where the parameter estimation error is obtained explicitly and then used as a new leakage term. Theoretical studies prove the convergence of the estimated parameters, and compare the suggested controller with generic adaptive controllers using the gradient descent and e-modification schemes. In addition to motion displacements, dynamic tire loads and suspension travel constraints are also considered. Extensive comparative simulations on a dynamic simulator consisting of the commercial vehicle simulation software CarSim 8.1 and MATLAB Simulink are provided to show the efficacy of the proposed control, and to illustrate the improved performance.

Journal ArticleDOI
TL;DR: This paper considers a distributed optimization problem over a multiagent network, in which the objective function is a sum of individual cost functions at the agents, and proposes an algorithm that achieves the best known rate of convergence for this class of problems.
Abstract: This paper considers a distributed optimization problem over a multiagent network, in which the objective function is a sum of individual cost functions at the agents. We focus on the case when communication between the agents is described by a directed graph. Existing distributed optimization algorithms for directed graphs require at least the knowledge of the neighbors' out-degree at each agent (due to the requirement of column-stochastic matrices). In contrast, our algorithm requires no such knowledge. Moreover, the proposed algorithm achieves the best known rate of convergence for this class of problems, $O(\mu^k)$ for $0<\mu<1$, where $k$ is the number of iterations, given that the objective functions are strongly convex and have Lipschitz-continuous gradients. Numerical experiments are also provided to illustrate the theoretical findings.

Journal ArticleDOI
TL;DR: A general framework for tensor singular value decomposition (tensor SVD) is proposed, which focuses on the methodology and theory for extracting the hidden low-rank structure from high-dimensional tensor data.
Abstract: In this paper, we propose a general framework for tensor singular value decomposition (tensor SVD), which focuses on the methodology and theory for extracting the hidden low-rank structure from high-dimensional tensor data. Comprehensive results are developed on both the statistical and computational limits for tensor SVD. This problem exhibits three different phases according to the signal-to-noise ratio (SNR). In particular, with strong SNR, we show that the classical higher-order orthogonal iteration achieves the minimax optimal rate of convergence in estimation; with weak SNR, the information-theoretical lower bound implies that it is impossible to have consistent estimation in general; with moderate SNR, we show that the non-convex maximum likelihood estimation provides optimal solution, but with NP-hard computational cost; moreover, under the hardness hypothesis of hypergraphic planted clique detection, there are no polynomial-time algorithms performing consistently in general.
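A hedged numpy sketch of the higher-order orthogonal iteration (HOOI) for a rank-one signal tensor plus noise (the toy model and SNR are our assumptions, chosen to sit in the strong-SNR phase):

```python
import numpy as np

np.random.seed(5)
p, snr = 30, 50.0
u = np.linalg.qr(np.random.randn(p, 1))[0]   # true mode-1 singular vector
v = np.linalg.qr(np.random.randn(p, 1))[0]   # true mode-2 singular vector
w = np.linalg.qr(np.random.randn(p, 1))[0]   # true mode-3 singular vector
T = (snr * np.einsum('i,j,k->ijk', u[:, 0], v[:, 0], w[:, 0])
     + np.random.randn(p, p, p))             # signal + noise tensor

def top_vec(M):                              # leading left singular vector
    return np.linalg.svd(M, full_matrices=False)[0][:, :1]

# initialize by HOSVD of the three unfoldings, then alternate (HOOI)
U = top_vec(T.reshape(p, -1))
V = top_vec(T.transpose(1, 0, 2).reshape(p, -1))
W = top_vec(T.transpose(2, 0, 1).reshape(p, -1))
for _ in range(20):
    U = top_vec(np.einsum('ijk,jb,kc->ibc', T, V, W).reshape(p, -1))
    V = top_vec(np.einsum('ijk,ia,kc->jac', T, U, W).reshape(p, -1))
    W = top_vec(np.einsum('ijk,ia,jb->kab', T, U, V).reshape(p, -1))

print(abs(U[:, 0] @ u[:, 0]))                # alignment with the truth, near 1
```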

Journal ArticleDOI
TL;DR: In this article, the authors propose a constraint energy minimization to construct multiscale spaces for GMsFEM; the minimization is performed in the oversampling domain and can handle non-decaying components of the local minimizers.

Proceedings Article
03 Dec 2018
TL;DR: This paper studies the convergence rate of distributed SGD for non-convex optimization with two communication reducing strategies: sparse parameter averaging and gradient quantization, and proposes a strategy called periodic quantized averaging (PQASGD) that further reduces the communication cost while preserving the $O(1/\sqrt{MK})$ convergence rate.
Abstract: The large communication overhead has imposed a bottleneck on the performance of distributed Stochastic Gradient Descent (SGD) for training deep neural networks. Previous works have demonstrated the potential of using gradient sparsification and quantization to reduce the communication cost. However, there is still a lack of understanding about how sparse and quantized communication affects the convergence rate of the training algorithm. In this paper, we study the convergence rate of distributed SGD for non-convex optimization with two communication reducing strategies: sparse parameter averaging and gradient quantization. We show that an $O(1/\sqrt{MK})$ convergence rate can be achieved if the sparsification and quantization hyperparameters are configured properly. We also propose a strategy called periodic quantized averaging (PQASGD) that further reduces the communication cost while preserving the $O(1/\sqrt{MK})$ convergence rate. Our evaluation validates our theoretical results and shows that our PQASGD can converge as fast as full-communication SGD with only 3%-5% communication data size.
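A toy sketch of periodic quantized averaging in the spirit of PQASGD (our construction: a simple stochastic quantizer plus model averaging every H local steps; the quantizer, problem, and hyperparameters are illustrative assumptions):

```python
import numpy as np

np.random.seed(6)
M, d, lr, H = 4, 10, 0.05, 5    # workers, dimension, step size, period

def quantize(v, levels=4):      # simple QSGD-style stochastic quantizer
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return v
    scaled = np.abs(v) / norm * levels
    lower = np.floor(scaled)
    rounded = lower + (np.random.rand(*v.shape) < scaled - lower)
    return norm * np.sign(v) * rounded / levels

x = np.zeros((M, d))            # one model replica per worker
for t in range(1, 501):
    g = 2 * x + 0.2 * np.random.randn(M, d)   # noisy grads of ||x||^2
    x -= lr * g                 # local SGD step on each worker
    if t % H == 0:              # communicate only every H steps, quantized
        x[:] = np.mean([quantize(x[i]) for i in range(M)], axis=0)

print(np.linalg.norm(x[0]))     # consensus model near the optimum (origin)
```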

Journal ArticleDOI
TL;DR: The foundational role of the proximal framework is discussed, with a focus on non-Euclidean proximal distances of Bregman type, which are central to the analysis of many other fundamental first order minimization relatives.
Abstract: We discuss the foundational role of the proximal framework in the development and analysis of some iconic first order optimization algorithms, with a focus on non-Euclidean proximal distances of Bregman type, which are central to the analysis of many other fundamental first order minimization relatives. We stress simplification and unification by highlighting self-contained elementary proof-patterns to obtain convergence rate and global convergence both in the convex and the nonconvex settings, which in turn also allows us to present some novel results.
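The basic non-Euclidean proximal step underlying this framework replaces the squared Euclidean distance with a Bregman distance $D_h$ generated by a convex kernel $h$ (our notation):

$$x^{k+1} \in \operatorname*{arg\,min}_{x} \Big\{ \langle \nabla f(x^k), x - x^k \rangle + \tfrac{1}{\lambda} D_h(x, x^k) \Big\}, \qquad D_h(x, y) := h(x) - h(y) - \langle \nabla h(y), x - y \rangle.$$

Choosing $h(x) = \tfrac{1}{2}\|x\|^2$ recovers classical gradient/proximal descent; other kernels adapt the step geometry to the problem's constraints.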

Journal ArticleDOI
TL;DR: Simulation results show the effectiveness and performance of the proposed continuous-time algorithms, and show that the convergence rate of the second-order algorithm is faster than that of the first-order distributed algorithm.
Abstract: This paper proposes two second-order continuous-time algorithms to solve the economic power dispatch problem in smart grids. The collective aim is to minimize a sum of generation cost functions subject to the power demand and individual generator constraints. First, in the framework of nonsmooth analysis and algebraic graph theory, one distributed second-order algorithm is developed and guaranteed to find an optimal solution. As a result, the power demand constraints can be kept all the time under an appropriate initial condition. The second algorithm is under a centralized framework, and the optimal solution is robust in the sense that different initial power conditions do not change the convergence of the optimal solution. Finally, simulation results based on a five-unit system, the IEEE 30-bus system, and the IEEE 300-bus system show the effectiveness and performance of the proposed continuous-time algorithms. The examples also show that the convergence rate of the second-order algorithm is faster than that of the first-order distributed algorithm.

Posted Content
TL;DR: An approximation algorithm is presented for solving a class of bilevel programming problems where the inner objective function is strongly convex, together with its finite-time convergence analysis under different convexity assumptions on the outer objective function.
Abstract: In this paper, we study a class of bilevel programming problems where the inner objective function is strongly convex. More specifically, under some mild assumptions on the partial derivatives of both inner and outer objective functions, we present an approximation algorithm for solving this class of problems and provide its finite-time convergence analysis under different convexity assumptions on the outer objective function. We also present an accelerated variant of this method which improves the rate of convergence under the convexity assumption. Furthermore, we generalize our results to the stochastic setting where only noisy information of both objective functions is available. To the best of our knowledge, this is the first time that such (stochastic) approximation algorithms with established iteration complexity (sample complexity) are provided for bilevel programming.

Journal ArticleDOI
TL;DR: This paper presents sufficient conditions for the quadratic convergence of Newton's method in this type of grid with constant power terminals, and computational results complement this theoretical analysis.
Abstract: The power flow is a nonlinear problem that requires Newton's method to be solved in dc microgrids with constant power terminals. This paper presents sufficient conditions for the quadratic convergence of Newton's method in this type of grid. The classic Newton's method as well as an approximated Newton's method are analyzed in both master–slave and island operation with droop controls. Requirements for the convergence as well as for the existence and uniqueness of the solution starting from voltages close to 1 pu are presented. Computational results complement this theoretical analysis.
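A minimal sketch of the Newton iteration for a toy dc power-flow equation with one constant-power load bus (the two-bus model and the numbers are our illustrative assumptions, not the paper's test systems):

```python
import numpy as np

# One load bus at voltage v connected to a 1 pu slack bus through
# conductance g: power balance  g*v*(v - 1) + p = 0  for load power p.
def newton_dc(p, g=10.0, v0=1.0, tol=1e-12):
    v = v0
    for _ in range(20):
        f = g * v * (v - 1.0) + p       # power mismatch at the load bus
        J = g * (2.0 * v - 1.0)         # Jacobian (scalar derivative here)
        step = f / J
        v -= step
        if abs(step) < tol:             # quadratic convergence: the step
            break                       # shrinks roughly as its own square
    return v

print(newton_dc(p=1.0))                 # high-voltage solution near 0.887 pu
```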

Posted Content
TL;DR: A sharp analysis of a recently proposed adaptive gradient method, namely the partially adaptive momentum estimation method (Padam) (Chen and Gu, 2018), which admits many existing adaptive gradient methods such as RMSProp and AMSGrad as special cases, shows that Padam converges to a first-order stationary point for smooth nonconvex objectives.
Abstract: Adaptive gradient methods are workhorses in deep learning. However, the convergence guarantees of adaptive gradient methods for nonconvex optimization have not been thoroughly studied. In this paper, we provide a fine-grained convergence analysis for a general class of adaptive gradient methods including AMSGrad, RMSProp and AdaGrad. For smooth nonconvex functions, we prove that adaptive gradient methods in expectation converge to a first-order stationary point. Our convergence rate is better than existing results for adaptive gradient methods in terms of dimension, and is strictly faster than stochastic gradient descent (SGD) when the stochastic gradients are sparse. To the best of our knowledge, this is the first result showing the advantage of adaptive gradient methods over SGD in the nonconvex setting. In addition, we also prove high probability bounds on the convergence rates of AMSGrad, RMSProp as well as AdaGrad, which have not been established before. Our analyses shed light on better understanding the mechanism behind adaptive gradient methods in optimizing nonconvex objectives.
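A hedged sketch of the Padam update on a toy problem (hyperparameters are illustrative; the partial exponent $p \in (0, 1/2]$ interpolates between SGD with momentum as $p \to 0$ and AMSGrad at $p = 1/2$):

```python
import numpy as np

np.random.seed(7)
d = 5
x = np.random.randn(d)
# noisy gradient of the toy nonconvex loss f(x) = sum(0.5*x_i^2 + 2*sin(x_i))
grad = lambda x: x + 2 * np.cos(x) + 0.1 * np.random.randn(d)

m = np.zeros(d); v = np.zeros(d); v_hat = np.zeros(d)
beta1, beta2, lr, p, eps = 0.9, 0.999, 0.01, 0.125, 1e-8
for t in range(5000):
    g = grad(x)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    v_hat = np.maximum(v_hat, v)        # AMSGrad-style running max
    x -= lr * m / (v_hat**p + eps)      # partial exponent p instead of 1/2

print(np.linalg.norm(x + 2 * np.cos(x)))  # near-stationary for the toy loss
```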

Journal ArticleDOI
01 May 2018
TL;DR: Experimental results indicate that in terms of robustness, stability and quality of the solution obtained, EFADE is significantly better than, or at least comparable to, state-of-the-art approaches.
Abstract: This paper presents an enhanced fitness-adaptive differential evolution algorithm with novel mutation (EFADE) for solving global numerical optimization problems over continuous space. A new triangular mutation operator is introduced. It is based on the convex combination vector of the triplet defined by the three randomly chosen vectors and the difference vectors between the best, better and worst individuals among the three randomly selected vectors. The triangular mutation operator helps the search achieve a better balance between global exploration ability and local exploitation tendency, as well as enhancing the convergence rate of the algorithm through the optimization process. Besides, two novel, effective adaptation schemes are used to update the control parameters to appropriate values without either extra parameters or prior knowledge of the characteristics of the optimization problem. In order to verify and analyze the performance of EFADE, numerical experiments on a set of 28 test problems from the CEC2013 benchmark for 10, 30 and 50 dimensions, including a comparison with 12 recent DE-based algorithms and six recent evolutionary algorithms, are executed. Experimental results indicate that in terms of robustness, stability and quality of the solution obtained, EFADE is significantly better than, or at least comparable to, state-of-the-art approaches.
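A toy sketch of a triangular mutation step in the spirit of EFADE (our simplification: equal convex-combination weights and a single scale factor F, whereas EFADE uses adaptive weights and fitness-adaptive control parameters; the sphere objective is illustrative):

```python
import numpy as np

np.random.seed(8)
sphere = lambda x: np.sum(x**2, axis=-1)      # toy objective
NP, d, F, CR = 30, 10, 0.7, 0.9
pop = np.random.uniform(-5, 5, (NP, d))

for gen in range(200):
    for i in range(NP):
        r = np.random.choice(NP, 3, replace=False)
        order = np.argsort(sphere(pop[r]))    # sort triplet by fitness
        xb, xm, xw = pop[r[order]]            # best / better / worst
        centroid = (xb + xm + xw) / 3.0       # convex combination of triplet
        mutant = centroid + F * ((xb - xm) + (xb - xw) + (xm - xw))
        trial = np.where(np.random.rand(d) < CR, mutant, pop[i])  # crossover
        if sphere(trial) < sphere(pop[i]):    # greedy selection
            pop[i] = trial

print(sphere(pop).min())                      # near zero on the sphere function
```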

Proceedings Article
02 Dec 2018
TL;DR: The error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions, and the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate are provided.
Abstract: In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in $O(1/\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.

Journal ArticleDOI
TL;DR: It is shown that in terms of the order of the accuracy, the evaluation complexity of a line-search method which is based on random first-order models and directions is the same as its counterparts that use deterministic accurate models; the use of probabilistic models only increases the complexity by a constant, which depends on the probability of the models being good.
Abstract: We present global convergence rates for a line-search method which is based on random first-order models and directions whose quality is ensured only with certain probability. We show that in terms of the order of the accuracy, the evaluation complexity of such a method is the same as its counterparts that use deterministic accurate models; the use of probabilistic models only increases the complexity by a constant, which depends on the probability of the models being good. We particularize and improve these results in the convex and strongly convex case. We also analyse a probabilistic cubic regularization variant that allows approximate probabilistic second-order models and show improved complexity bounds compared to probabilistic first-order methods; again, as a function of the accuracy, the probabilistic cubic regularization bounds are of the same (optimal) order as for the deterministic case.

Journal ArticleDOI
01 Jan 2018
TL;DR: A novel gradient-based algorithm for unconstrained convex optimization is designed and analyzed, which can be seen as an extension of methods such as gradient descent, Nesterov's accelerated gradient descent, and the heavy-ball method.
Abstract: We design and analyze a novel gradient-based algorithm for unconstrained convex optimization. When the objective function is $m$ -strongly convex and its gradient is $L$ -Lipschitz continuous, the iterates and function values converge linearly to the optimum at rates $\rho $ and $\rho ^{2}$ , respectively, where $\rho = 1-\sqrt {m/L}$ . These are the fastest known guaranteed linear convergence rates for globally convergent first-order methods, and for high desired accuracies the corresponding iteration complexity is within a factor of two of the theoretical lower bound. We use a simple graphical design procedure based on integral quadratic constraints to derive closed-form expressions for the algorithm parameters. The new algorithm, which we call the triple momentum method, can be seen as an extension of methods such as gradient descent, Nesterov’s accelerated gradient descent, and the heavy-ball method.
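A numpy sketch of the triple momentum iteration with the closed-form parameters as we read them from the paper ($\rho = 1 - \sqrt{m/L}$ and $(\alpha, \beta, \gamma, \delta) = ((1+\rho)/L,\ \rho^2/(2-\rho),\ \rho^2/((1+\rho)(2-\rho)),\ \rho^2/(1-\rho^2))$); the quadratic test problem is our illustrative assumption:

```python
import numpy as np

np.random.seed(9)
d = 20
H = np.diag(np.linspace(1.0, 100.0, d))   # quadratic with m = 1, L = 100
grad = lambda x: H @ x
m_, L = 1.0, 100.0

rho = 1.0 - np.sqrt(m_ / L)               # guaranteed linear rate
alpha = (1.0 + rho) / L
beta = rho**2 / (2.0 - rho)
gamma = rho**2 / ((1.0 + rho) * (2.0 - rho))
delta = rho**2 / (1.0 - rho**2)

xi_prev = xi = np.random.randn(d)         # internal state sequence
for k in range(300):
    y = (1 + gamma) * xi - gamma * xi_prev        # gradient query point
    xi_next = (1 + beta) * xi - beta * xi_prev - alpha * grad(y)
    xi_prev, xi = xi, xi_next
x_out = (1 + delta) * xi - delta * xi_prev        # the method's output

print(np.linalg.norm(x_out))              # shrinks roughly like rho**k = 0.9**k
```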

Proceedings Article
03 Jul 2018
TL;DR: In this article, a simple variant of Nesterov's accelerated gradient descent (AGD) was shown to achieve a faster convergence rate than gradient descent (GD) in the nonconvex setting.
Abstract: Nesterov's accelerated gradient descent (AGD), an instance of the general family of "momentum methods", provably achieves faster convergence rate than gradient descent (GD) in the convex setting. However, whether these methods are superior to GD in the nonconvex setting remains open. This paper studies a simple variant of AGD, and shows that it escapes saddle points and finds a second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ iterations, faster than the $\tilde{O}(1/\epsilon^{2})$ iterations required by GD. To the best of our knowledge, this is the first Hessian-free algorithm to find a second-order stationary point faster than GD, and also the first single-loop algorithm with a faster rate than GD even in the setting of finding a first-order stationary point. Our analysis is based on two key ideas: (1) the use of a simple Hamiltonian function, inspired by a continuous-time perspective, which AGD monotonically decreases per step even for nonconvex functions, and (2) a novel framework called improve or localize, which is useful for tracking the long-term behavior of gradient-based optimization algorithms. We believe that these techniques may deepen our understanding of both acceleration algorithms and nonconvex optimization.

Journal ArticleDOI
TL;DR: A proximal difference-of-convex algorithm with extrapolation is proposed to possibly accelerate the proximal DCA, and it is shown that any cluster point of the sequence generated by the algorithm is a stationary point of the DC optimization problem for a fairly general choice of extrapolation parameters.
Abstract: We consider a class of difference-of-convex (DC) optimization problems whose objective is level-bounded and is the sum of a smooth convex function with Lipschitz gradient, a proper closed convex function and a continuous concave function. While this kind of problems can be solved by the classical difference-of-convex algorithm (DCA) (Pham et al. Acta Math Vietnam 22:289–355, 1997), the difficulty of the subproblems of this algorithm depends heavily on the choice of DC decomposition. Simpler subproblems can be obtained by using a specific DC decomposition described in Pham et al. (SIAM J Optim 8:476–505, 1998). This decomposition has been proposed in numerous work such as Gotoh et al. (DC formulations and algorithms for sparse optimization problems, 2017), and we refer to the resulting DCA as the proximal DCA. Although the subproblems are simpler, the proximal DCA is the same as the proximal gradient algorithm when the concave part of the objective is void, and hence is potentially slow in practice. In this paper, motivated by the extrapolation techniques for accelerating the proximal gradient algorithm in the convex settings, we consider a proximal difference-of-convex algorithm with extrapolation to possibly accelerate the proximal DCA. We show that any cluster point of the sequence generated by our algorithm is a stationary point of the DC optimization problem for a fairly general choice of extrapolation parameters: in particular, the parameters can be chosen as in FISTA with fixed restart (O’Donoghue and Candes in Found Comput Math 15, 715–732, 2015). In addition, by assuming the Kurdyka-Łojasiewicz property of the objective and the differentiability of the concave part, we establish global convergence of the sequence generated by our algorithm and analyze its convergence rate. Our numerical experiments on two difference-of-convex regularized least squares models show that our algorithm usually outperforms the proximal DCA and the general iterative shrinkage and thresholding algorithm proposed in Gong et al. (A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems, 2013).
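A hedged numpy sketch of the proximal DCA with extrapolation (pDCAe) on a toy least-squares problem with the DC regularizer $\lambda(\|x\|_1 - \|x\|_2)$; the FISTA-style extrapolation parameters without restart, and the problem data, are our illustrative choices:

```python
import numpy as np

np.random.seed(10)
m, n, lam = 40, 80, 0.5
A = np.random.randn(m, n) / np.sqrt(m)
b = np.random.randn(m)
L = np.linalg.norm(A, 2)**2          # Lipschitz constant of the smooth part

soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = x_old = np.zeros(n)
t = 1.0
for k in range(300):
    t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
    y = x + ((t - 1) / t_next) * (x - x_old)      # extrapolation step
    nx = np.linalg.norm(x)
    xi = lam * x / nx if nx > 0 else np.zeros(n)  # subgradient of lam*||x||_2
    g = A.T @ (A @ y - b) - xi       # smooth gradient minus concave subgradient
    x_old, x = x, soft(y - g / L, lam / L)        # prox of lam*||.||_1
    t = t_next

print(np.count_nonzero(np.abs(x) > 1e-8))         # sparse solution
```

With the concave subgradient set to zero, the loop above reduces to FISTA on the convex part, which is exactly the sense in which pDCAe extends proximal-gradient extrapolation to the DC setting.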