
Showing papers on "Rate of convergence published in 2018"


Journal ArticleDOI
Guannan Qu1, Na Li1
TL;DR: It is shown that it is impossible for a class of distributed algorithms like DGD to achieve a linear convergence rate without using history information even if the objective function is strongly convex and smooth, and a novel gradient estimation scheme is proposed that uses history information to achieve fast and accurate estimation of the average gradient.
Abstract: There has been a growing effort in studying the distributed optimization problem over a network. The objective is to optimize a global function formed by a sum of local functions, using only local computation and communication. The literature has developed consensus-based distributed (sub)gradient descent (DGD) methods and has shown that they have the same convergence rate $O(\frac{\log t}{\sqrt{t}})$ as the centralized (sub)gradient methods (CGD), when the function is convex but possibly nonsmooth. However, when the function is convex and smooth, under the framework of DGD, it is unclear how to harness the smoothness to obtain a faster convergence rate comparable to CGD's convergence rate. In this paper, we propose a distributed algorithm that, despite using the same amount of communication per iteration as DGD, can effectively harness the function smoothness and converge to the optimum with a rate of $O(\frac{1}{t})$. If the objective function is further strongly convex, our algorithm has a linear convergence rate. Both rates match the convergence rate of CGD. The key step in our algorithm is a novel gradient estimation scheme that uses history information to achieve fast and accurate estimation of the average gradient. To motivate the necessity of history information, we also show that it is impossible for a class of distributed algorithms like DGD to achieve a linear convergence rate without using history information even if the objective function is strongly convex and smooth.
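The gradient estimation scheme described above is in the spirit of what is now commonly called gradient tracking. Below is a minimal numpy sketch of such a scheme on a toy scalar problem (our construction, not the paper's code; the ring mixing matrix, step size, and quadratic data are illustrative assumptions):

```python
import numpy as np

# Toy network of n agents, each holding a local quadratic
# f_i(x) = 0.5*a_i*x^2 + b_i*x, so agent i's gradient is a_i*x + b_i.
np.random.seed(0)
n = 5
a = np.random.uniform(1.0, 2.0, n)
b = np.random.uniform(-1.0, 1.0, n)
grad = lambda x: a * x + b             # entry i = agent i's local gradient

# Doubly stochastic mixing matrix for a ring network.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

eta = 0.1
x = np.zeros(n)                        # each agent's estimate of the optimum
y = grad(x)                            # tracker of the average gradient
for _ in range(200):
    x_new = W @ x - eta * y            # consensus step plus tracked gradient
    y = W @ y + grad(x_new) - grad(x)  # update tracker using gradient history
    x = x_new

print(x, -b.sum() / a.sum())           # all estimates approach the optimum
```

The history enters through the tracker y, which accumulates successive gradient differences so that its average always equals the average of the current local gradients.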

440 citations


Proceedings Article
13 Feb 2018
TL;DR: SignSGD as mentioned in this paper uses majority vote to aggregate gradient signs from each worker, enabling 1-bit compression of worker-server communication in both directions, which can achieve fast communication and fast convergence.
Abstract: Training large neural networks requires distributing learning across multiple workers, where the cost of communicating gradients can be a significant bottleneck. signSGD alleviates this problem by transmitting just the sign of each minibatch stochastic gradient. We prove that it can get the best of both worlds: compressed gradients and SGD-level convergence rate. The relative $\ell_1/\ell_2$ geometry of gradients, noise and curvature informs whether signSGD or SGD is theoretically better suited to a particular problem. On the practical side we find that the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep Imagenet models. We extend our theory to the distributed setting, where the parameter server uses majority vote to aggregate gradient signs from each worker enabling 1-bit compression of worker-server communication in both directions. Using a theorem by Gauss we prove that majority vote can achieve the same reduction in variance as full precision distributed SGD. Thus, there is great promise for sign-based optimisation schemes to achieve fast communication and fast convergence. Code to reproduce experiments is to be found at https://github.com/jxbz/signSGD.
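A toy sketch of the majority-vote scheme (our construction; the quadratic objective, worker count, and learning rate are illustrative assumptions, not from the paper's code):

```python
import numpy as np

# M workers each send the sign of a stochastic gradient (1 bit/coordinate);
# the server sends back the sign of the sum of signs: the majority vote.
np.random.seed(1)
d, M, lr = 10, 7, 0.01
x = np.random.randn(d)

def stochastic_grad(x):
    return 2 * x + 0.5 * np.random.randn(d)   # noisy gradient of ||x||^2

for _ in range(500):
    worker_signs = np.stack([np.sign(stochastic_grad(x)) for _ in range(M)])
    vote = np.sign(worker_signs.sum(axis=0))  # 1-bit aggregation both ways
    x -= lr * vote

print(np.linalg.norm(x))                      # iterate approaches the optimum
```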

332 citations


Journal ArticleDOI
TL;DR: A robot control/identification scheme to identify the unknown robot kinematic and dynamic parameters with enhanced convergence rate was developed, and the information of parameter estimation error was properly integrated into the proposed identification algorithm, such that enhanced estimation performance was achieved.
Abstract: For parameter identification of robot systems, most existing works have focused on estimation veracity, but few have considered the convergence speed. In this paper, we developed a robot control/identification scheme to identify the unknown robot kinematic and dynamic parameters with an enhanced convergence rate. Superior to the traditional methods, the information of the parameter estimation error was properly integrated into the proposed identification algorithm, such that enhanced estimation performance was achieved. Besides, the Newton–Euler (NE) method was used to build the robot dynamic model, where a singular value decomposition-based model reduction method was designed to remedy the potential singularity problems of the NE regressor. Moreover, an interval excitation condition was employed to relax the requirement of the persistent excitation condition for the kinematic estimation. By using the Lyapunov synthesis, explicit analysis of the convergence rate of the tracking errors and the estimated parameters was performed. Simulation studies were conducted to show the accurate and fast convergence of the proposed finite-time (FT) identification algorithm based on a 7-DOF arm of the Baxter robot.

321 citations


Posted Content
TL;DR: SignSGD can get the best of both worlds: compressed gradients and SGD-level convergence rate, and the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep Imagenet models.
Abstract: Training large neural networks requires distributing learning across multiple workers, where the cost of communicating gradients can be a significant bottleneck. signSGD alleviates this problem by transmitting just the sign of each minibatch stochastic gradient. We prove that it can get the best of both worlds: compressed gradients and SGD-level convergence rate. The relative $\ell_1/\ell_2$ geometry of gradients, noise and curvature informs whether signSGD or SGD is theoretically better suited to a particular problem. On the practical side we find that the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep Imagenet models. We extend our theory to the distributed setting, where the parameter server uses majority vote to aggregate gradient signs from each worker enabling 1-bit compression of worker-server communication in both directions. Using a theorem by Gauss we prove that majority vote can achieve the same reduction in variance as full precision distributed SGD. Thus, there is great promise for sign-based optimisation schemes to achieve fast communication and fast convergence. Code to reproduce experiments is to be found at https://github.com/jxbz/signSGD.

275 citations


Journal Article
TL;DR: Katyusha as mentioned in this paper is a primal-only stochastic gradient method with negative momentum on top of Nesterov's momentum, which can be incorporated into a variance reduction based algorithm and speed it up, both in terms of sequential and parallel performance.
Abstract: Nesterov's momentum trick is famously known for accelerating gradient descent, and has been proven useful in building fast iterative algorithms. However, in the stochastic setting, counterexamples exist and prevent Nesterov's momentum from providing similar acceleration, even if the underlying problem is convex and finite-sum. We introduce $\mathtt{Katyusha}$, a direct, primal-only stochastic gradient method to fix this issue. In convex finite-sum stochastic optimization, $\mathtt{Katyusha}$ has an optimal accelerated convergence rate, and enjoys an optimal parallel linear speedup in the mini-batch setting. The main ingredient is $\textit{Katyusha momentum}$, a novel "negative momentum" on top of Nesterov's momentum. It can be incorporated into a variance-reduction based algorithm and speed it up, both in terms of $\textit{sequential and parallel}$ performance. Since variance reduction has been successfully applied to a growing list of practical problems, our paper suggests that in each of such cases, one could potentially try to give Katyusha a hug.
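A hedged numpy sketch of the Katyusha iteration on a toy ridge-regression finite sum, with parameter choices following our reading of the paper ($\tau_2 = 1/2$, $\tau_1 = \min\{\sqrt{n\mu/3L}, 1/2\}$, $\alpha = 1/(3\tau_1 L)$); the toy problem and the plain snapshot averaging are our simplifications:

```python
import numpy as np

np.random.seed(2)
n, d, lam = 100, 10, 0.1
A = np.random.randn(n, d)
b = np.random.randn(n)

# f(x) = (1/n) * sum_i [ 0.5*(a_i @ x - b_i)^2 + 0.5*lam*||x||^2 ]
grad_i = lambda x, i: A[i] * (A[i] @ x - b[i]) + lam * x
full_grad = lambda x: A.T @ (A @ x - b) / n + lam * x
L = np.max(np.sum(A**2, axis=1)) + lam   # componentwise smoothness
mu = lam                                 # strong convexity

tau1 = min(np.sqrt(n * mu / (3 * L)), 0.5)
tau2 = 0.5
alpha = 1.0 / (3 * tau1 * L)

y = z = x_tilde = np.zeros(d)
for epoch in range(30):
    mu_tilde = full_grad(x_tilde)        # SVRG-style snapshot gradient
    iterates = []
    for _ in range(2 * n):
        x = tau1 * z + tau2 * x_tilde + (1 - tau1 - tau2) * y
        i = np.random.randint(n)
        g = mu_tilde + grad_i(x, i) - grad_i(x_tilde, i)  # variance-reduced
        z = z - alpha * g                # "negative momentum" anchor step
        y = x - g / (3 * L)
        iterates.append(x)
    x_tilde = np.mean(iterates, axis=0)

print(np.linalg.norm(full_grad(x_tilde)))  # near-stationary
```

The Katyusha momentum is the $\tau_2 \tilde{x}$ term: each iterate is pulled back toward the snapshot, a "magnet" that keeps the accelerated sequence from drifting on stochastic noise.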

273 citations


Journal ArticleDOI
TL;DR: This work presents an accelerated gradient method for nonconvex optimization problems with Lipschitz continuous first and second derivatives that is Hessian free, i.e., it only requires gradient computations, and is therefore suitable for large-scale applications.
Abstract: We present an accelerated gradient method for nonconvex optimization problems with Lipschitz continuous first and second derivatives. In a time $O(\epsilon^{-7/4} \log(1/\epsilon))$, the method finds an $\epsilon$-stationary point, i.e., a point $x$ such that $\|\nabla f(x)\| \le \epsilon$.

243 citations


Journal ArticleDOI
TL;DR: The Kurdyka–Łojasiewicz exponent is studied, an important quantity for analyzing the convergence rate of first-order methods, and various calculus rules are developed to deduce the KL exponent of new (possibly nonconvex and nonsmooth) functions formed from functions with known KL exponents.
Abstract: In this paper, we study the Kurdyka–Łojasiewicz (KL) exponent, an important quantity for analyzing the convergence rate of first-order methods. Specifically, we develop various calculus rules to deduce the KL exponent of new (possibly nonconvex and nonsmooth) functions formed from functions with known KL exponents. In addition, we show that the well-studied Luo–Tseng error bound together with a mild assumption on the separation of stationary values implies that the KL exponent is $\frac{1}{2}$. The Luo–Tseng error bound is known to hold for a large class of concrete structured optimization problems, and thus we deduce the KL exponent of a large class of functions whose exponents were previously unknown. Building upon this and the calculus rules, we are then able to show that for many convex or nonconvex optimization models for applications such as sparse recovery, the objective function's KL exponent is $\frac{1}{2}$. This includes the least squares problem with smoothly clipped absolute deviation regularization or minimax concave penalty regularization and the logistic regression problem with $\ell_1$ regularization. Since many existing local convergence rate analyses for first-order methods in the nonconvex scenario rely on the KL exponent, our results enable us to obtain explicit convergence rates for various first-order methods when they are applied to a large variety of practical optimization models. Finally, we further illustrate how our results can be applied to establishing local linear convergence of the proximal gradient algorithm and the inertial proximal algorithm with constant step sizes for some specific models that arise in sparse recovery.
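For reference, the KL property with exponent $\theta$ at a stationary point $\bar x$ can be stated as follows (our paraphrase; the case $\theta = \frac{1}{2}$ is the one that yields local linear convergence of first-order methods):

$$\operatorname{dist}\big(0, \partial f(x)\big) \;\ge\; c\,\big(f(x) - f(\bar x)\big)^{\theta} \quad \text{for all } x \text{ near } \bar x \text{ with } f(\bar x) < f(x) < f(\bar x) + \epsilon.$$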

242 citations


Journal ArticleDOI
TL;DR: The proximal gradient algorithm for minimizing the sum of a smooth and nonsmooth convex function often converges linearly even without strong convexity as mentioned in this paper, and the equivalence of such an error bound to a natural quadratic growth condition is established.
Abstract: The proximal gradient algorithm for minimizing the sum of a smooth and nonsmooth convex function often converges linearly even without strong convexity. One common reason is that a multiple of the step length at each iteration may linearly bound the “error”—the distance to the solution set. We explain the observed linear convergence intuitively by proving the equivalence of such an error bound to a natural quadratic growth condition. Our approach generalizes to linear and quadratic convergence analysis for proximal methods (of Gauss-Newton type) for minimizing compositions of nonsmooth functions with smooth mappings. We observe incidentally that short step-lengths in the algorithm indicate near-stationarity, suggesting a reliable termination criterion.
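In rough symbols (our notation, for $F = f + g$ with $f$ smooth, solution set $S$, and step length $t$), the two conditions shown to be equivalent read:

$$\text{(error bound)} \quad \operatorname{dist}(x, S) \;\le\; \kappa\,\big\| x - \operatorname{prox}_{tg}\big(x - t\nabla f(x)\big) \big\|,$$

$$\text{(quadratic growth)} \quad F(x) \;\ge\; \min F + \frac{\alpha}{2}\,\operatorname{dist}^{2}(x, S).$$

The first bounds the distance to the solution set by the length of a prox-gradient step; the second says the objective grows at least quadratically away from the solution set.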

235 citations


Journal Article
TL;DR: The continuous dynamical system approach to deep learning is explored in order to devise alternative frameworks for training algorithms using Pontryagin's maximum principle, demonstrating that it obtains a favorable initial convergence rate per iteration, provided Hamiltonian maximization can be efficiently carried out.
Abstract: The continuous dynamical system approach to deep learning is explored in order to devise alternative frameworks for training algorithms. Training is recast as a control problem and this allows us to formulate necessary optimality conditions in continuous time using Pontryagin's maximum principle (PMP). A modification of the method of successive approximations is then used to solve the PMP, giving rise to an alternative training algorithm for deep learning. This approach has the advantage that rigorous error estimates and convergence results can be established. We also show that it may avoid some pitfalls of gradient-based methods, such as slow convergence on flat landscapes near saddle points. Furthermore, we demonstrate that it obtains a favorable initial convergence rate per iteration, provided Hamiltonian maximization can be efficiently carried out - a step which is still in need of improvement. Overall, the approach opens up new avenues to attack problems associated with deep learning, such as trapping in slow manifolds and inapplicability of gradient-based methods for discrete trainable variables.

209 citations


Journal ArticleDOI
TL;DR: By using the Lyapunov analysis, it is proven that all the signals of the closed-loop systems are semiglobally uniformly ultimately bounded.
Abstract: This paper studies the zero-error tracking control problem of Euler-Lagrange systems subject to full-state constraints and nonparametric uncertainties. By blending an error transformation with barrier Lyapunov function, a neural adaptive tracking control scheme is developed, resulting in a solution with several salient features: 1) the control action is continuous and $\mathscr C^{1}$ smooth; 2) the full-state tracking error converges to a prescribed compact set around origin within a given finite time at a controllable rate of convergence that can be uniformly prespecified; 3) with Nussbaum gain in the loop, the tracking error further shrinks to zero as $t\to \infty $ ; and 4) the neural network (NN) unit can be safely included in the loop during the entire system operational envelope without the danger of violating the compact set precondition imposed on the NN training inputs. Furthermore, by using the Lyapunov analysis, it is proven that all the signals of the closed-loop systems are semiglobally uniformly ultimately bounded. The effectiveness and benefits of the proposed control method are validated via computer simulation.

203 citations


Proceedings Article
29 Aug 2018
TL;DR: In this article, a weight structure that is necessary for asymptotic convergence to the true sparse signal is introduced; with this structure, unfolded ISTA can attain linear convergence, which is better than the sublinear convergence of ISTA/FISTA in general cases.
Abstract: In recent years, unfolding iterative algorithms as neural networks has become an empirical success in solving sparse recovery problems. However, its theoretical understanding is still immature, which prevents us from fully utilizing the power of neural networks. In this work, we study unfolded ISTA (Iterative Shrinkage Thresholding Algorithm) for sparse signal recovery. We introduce a weight structure that is necessary for asymptotic convergence to the true sparse signal. With this structure, unfolded ISTA can attain a linear convergence, which is better than the sublinear convergence of ISTA/FISTA in general cases. Furthermore, we propose to incorporate thresholding in the network to perform support selection, which is easy to implement and able to boost the convergence rate both theoretically and empirically. Extensive simulations, including sparse vector recovery and a compressive sensing experiment on real image data, corroborate our theoretical results and demonstrate their practical usefulness. We have made our codes publicly available: https://github.com/xchen-tamu/linear-lista-cpss.
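A minimal numpy sketch of unrolled ISTA in the LISTA spirit (our toy instance; in a learned network the per-layer step size and threshold below would be trained rather than fixed, and the weight structure of the paper couples the layer matrices to $A$):

```python
import numpy as np

np.random.seed(3)
m, n, k = 50, 100, 3
A = np.random.randn(m, n) / np.sqrt(m)
x_true = np.zeros(n)
x_true[np.random.choice(n, k, replace=False)] = np.random.randn(k) + 2.0
y = A @ x_true                          # noiseless sparse measurements

soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

gamma = 1.0 / np.linalg.norm(A, 2)**2   # step size (learnable per layer)
lam = 0.1                               # threshold level (learnable per layer)
x = np.zeros(n)
for layer in range(200):                # each loop pass = one unrolled layer
    x = soft(x - gamma * A.T @ (A @ x - y), gamma * lam)

print(np.flatnonzero(np.abs(x) > 0.1))  # recovered support
print(np.flatnonzero(x_true))           # true support
```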

Journal ArticleDOI
TL;DR: The proposed algorithm, Accelerated Distributed Directed OPTimization (ADD-OPT), achieves the best known convergence rate for this class of problems, given strongly convex objective functions with globally Lipschitz-continuous gradients.
Abstract: In this paper, we consider distributed optimization problems where the goal is to minimize a sum of objective functions over a multiagent network. We focus on the case when the interagent communication is described by a strongly connected, directed graph. The proposed algorithm, Accelerated Distributed Directed OPTimization (ADD-OPT), achieves the best known convergence rate for this class of problems, $O(\mu^{k})$ for $0<\mu<1$, given strongly convex objective functions with globally Lipschitz-continuous gradients, where $k$ is the number of iterations. Moreover, ADD-OPT supports a wider and more realistic range of step sizes in contrast to existing work. In particular, we show that ADD-OPT converges for arbitrarily small (positive) step sizes. Simulations further illustrate our results.

Posted Content
TL;DR: It is proved that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic accelerated method for both convex and strongly-convex functions.
Abstract: Modern machine learning focuses on highly expressive models that are able to fit or interpolate the data completely, resulting in zero training loss. For such models, we show that the stochastic gradients of common loss functions satisfy a strong growth condition. Under this condition, we prove that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic accelerated method for both convex and strongly-convex functions. We also show that this condition implies that SGD can find a first-order stationary point as efficiently as full gradient descent in non-convex settings. Under interpolation, we further show that all smooth loss functions with a finite-sum structure satisfy a weaker growth condition. Given this weaker condition, we prove that SGD with a constant step-size attains the deterministic convergence rate in both the strongly-convex and convex settings. Under additional assumptions, the above results enable us to prove an O(1/k^2) mistake bound for k iterations of a stochastic perceptron algorithm using the squared-hinge loss. Finally, we validate our theoretical findings with experiments on synthetic and real datasets.
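The strong growth condition referenced above can be written as follows (our notation, for a finite sum $f = \frac{1}{n}\sum_i f_i$):

$$\mathbb{E}_i\,\big\|\nabla f_i(x)\big\|^2 \;\le\; \rho\,\big\|\nabla f(x)\big\|^2 \quad \text{for all } x,$$

which forces every stochastic gradient to vanish wherever the full gradient does: exactly the interpolation regime the paper studies.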

Proceedings Article
08 Aug 2018
TL;DR: In this article, the convergence of adaptive gradient-based momentum algorithms is studied for nonconvex stochastic optimization problems, and a set of mild sufficient conditions that guarantee convergence for the Adam-type methods is provided.
Abstract: This paper studies a class of adaptive gradient based momentum algorithms that update the search directions and learning rates simultaneously using past gradients. This class, which we refer to as the "Adam-type", includes the popular algorithms such as the Adam, AMSGrad and AdaGrad. Despite their popularity in training deep neural networks, the convergence of these algorithms for solving nonconvex problems remains an open question. This paper provides a set of mild sufficient conditions that guarantee the convergence for the Adam-type methods. We prove that under our derived conditions, these methods can achieve the convergence rate of order $O(\log{T}/\sqrt{T})$ for nonconvex stochastic optimization. We show the conditions are essential in the sense that violating them may make the algorithm diverge. Moreover, we propose and analyze a class of (deterministic) incremental adaptive gradient algorithms, which has the same $O(\log{T}/\sqrt{T})$ convergence rate. Our study could also be extended to a broader class of adaptive gradient methods in machine learning and optimization.
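A hedged sketch of the generic "Adam-type" template on a toy nonconvex problem (the objective and hyperparameters are illustrative assumptions; $\beta_1 = 0.9$, $\beta_2 = 0.999$ corresponds to Adam, while other choices of the second-moment recursion give AMSGrad or AdaGrad):

```python
import numpy as np

np.random.seed(4)
d = 5
x = np.random.randn(d)
# noisy gradient of the toy nonconvex loss f(x) = sum(0.5*x_i^2 + 2*sin(x_i))
grad = lambda x: x + 2 * np.cos(x) + 0.1 * np.random.randn(d)

m = np.zeros(d)
v = np.zeros(d)
beta1, beta2, alpha, eps = 0.9, 0.999, 0.01, 1e-8
for t in range(5000):
    g = grad(x)
    m = beta1 * m + (1 - beta1) * g       # search direction from past gradients
    v = beta2 * v + (1 - beta2) * g**2    # coordinate-wise learning rates
    x -= alpha * m / (np.sqrt(v) + eps)

print(np.linalg.norm(x + 2 * np.cos(x)))  # gradient norm of the true loss
```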

Journal ArticleDOI
TL;DR: An adaptive control for vehicle active suspensions with unknown nonlinearities (e.g., nonlinear springs and piecewise dampers) is proposed, such that both the transient and steady-state suspension response are guaranteed.
Abstract: This paper proposes an adaptive control for vehicle active suspensions with unknown nonlinearities (e.g., nonlinear springs and piecewise dampers). A prescribed performance function that characterizes the convergence rate, maximum overshoot, and steady-state error is incorporated into the control design to stabilize the vertical and pitch motions, such that both the transient and steady-state suspension response are guaranteed. Moreover, a novel adaptive law is used to achieve precise estimation of essential parameters (e.g., mass of vehicle body and moment of inertia for pitch motion), where the parameter estimation error is obtained explicitly and then used as a new leakage term. Theoretical studies prove the convergence of the estimated parameters, and compare the suggested controller with generic adaptive controllers using the gradient descent and e-modification schemes. In addition to motion displacements, dynamic tire loads and suspension travel constraints are also considered. Extensive comparative simulations on a dynamic simulator consisting of the commercial vehicle simulation software CarSim 8.1 and MATLAB Simulink are provided to show the efficacy of the proposed control, and to illustrate the improved performance.

Journal ArticleDOI
TL;DR: This paper considers a distributed optimization problem over a multiagent network, in which the objective function is a sum of individual cost functions at the agents, and proposes an algorithm that achieves the best known rate of convergence for this class of problems.
Abstract: This paper considers a distributed optimization problem over a multiagent network, in which the objective function is a sum of individual cost functions at the agents. We focus on the case when communication between the agents is described by a directed graph. Existing distributed optimization algorithms for directed graphs require at least the knowledge of the neighbors' out-degree at each agent (due to the requirement of column-stochastic matrices). In contrast, our algorithm requires no such knowledge. Moreover, the proposed algorithm achieves the best known rate of convergence for this class of problems, $O(\mu^k)$ for $0<\mu<1$, where $k$ is the number of iterations, given that the objective functions are strongly convex and have Lipschitz-continuous gradients. Numerical experiments are also provided to illustrate the theoretical findings.

Journal ArticleDOI
TL;DR: A general framework for tensor singular value decomposition (tensor SVD) is proposed, which focuses on the methodology and theory for extracting the hidden low-rank structure from high-dimensional tensor data.
Abstract: In this paper, we propose a general framework for tensor singular value decomposition (tensor SVD), which focuses on the methodology and theory for extracting the hidden low-rank structure from high-dimensional tensor data. Comprehensive results are developed on both the statistical and computational limits for tensor SVD. This problem exhibits three different phases according to the signal-to-noise ratio (SNR). In particular, with strong SNR, we show that the classical higher-order orthogonal iteration achieves the minimax optimal rate of convergence in estimation; with weak SNR, the information-theoretical lower bound implies that it is impossible to have consistent estimation in general; with moderate SNR, we show that the non-convex maximum likelihood estimation provides optimal solution, but with NP-hard computational cost; moreover, under the hardness hypothesis of hypergraphic planted clique detection, there are no polynomial-time algorithms performing consistently in general.
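A hedged numpy sketch of the higher-order orthogonal iteration (HOOI) for a rank-one signal tensor plus noise (the toy model and SNR are our assumptions, chosen to sit in the strong-SNR phase):

```python
import numpy as np

np.random.seed(5)
p, snr = 30, 50.0
u = np.linalg.qr(np.random.randn(p, 1))[0]   # true mode-1 singular vector
v = np.linalg.qr(np.random.randn(p, 1))[0]   # true mode-2 singular vector
w = np.linalg.qr(np.random.randn(p, 1))[0]   # true mode-3 singular vector
T = (snr * np.einsum('i,j,k->ijk', u[:, 0], v[:, 0], w[:, 0])
     + np.random.randn(p, p, p))             # signal + noise tensor

def top_vec(M):                              # leading left singular vector
    return np.linalg.svd(M, full_matrices=False)[0][:, :1]

# initialize by HOSVD of the three unfoldings, then alternate (HOOI)
U = top_vec(T.reshape(p, -1))
V = top_vec(T.transpose(1, 0, 2).reshape(p, -1))
W = top_vec(T.transpose(2, 0, 1).reshape(p, -1))
for _ in range(20):
    U = top_vec(np.einsum('ijk,jb,kc->ibc', T, V, W).reshape(p, -1))
    V = top_vec(np.einsum('ijk,ia,kc->jac', T, U, W).reshape(p, -1))
    W = top_vec(np.einsum('ijk,ia,jb->kab', T, U, V).reshape(p, -1))

print(abs(U[:, 0] @ u[:, 0]))                # alignment with the truth, near 1
```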

Journal ArticleDOI
TL;DR: In this article, the authors propose a constraint energy minimization to construct multiscale spaces for GMsFEM; the minimization is performed in the oversampling domain and can handle non-decaying components of the local minimizers.

Proceedings Article
03 Dec 2018
TL;DR: This paper studies the convergence rate of distributed SGD for non-convex optimization with two communication reducing strategies: sparse parameter averaging and gradient quantization, and proposes a strategy called periodic quantized averaging (PQASGD) that further reduces the communication cost while preserving the $O(1/\sqrt{MK})$ convergence rate.
Abstract: The large communication overhead has imposed a bottleneck on the performance of distributed Stochastic Gradient Descent (SGD) for training deep neural networks. Previous works have demonstrated the potential of using gradient sparsification and quantization to reduce the communication cost. However, there is still a lack of understanding about how sparse and quantized communication affects the convergence rate of the training algorithm. In this paper, we study the convergence rate of distributed SGD for non-convex optimization with two communication reducing strategies: sparse parameter averaging and gradient quantization. We show that an $O(1/\sqrt{MK})$ convergence rate can be achieved if the sparsification and quantization hyperparameters are configured properly. We also propose a strategy called periodic quantized averaging (PQASGD) that further reduces the communication cost while preserving the $O(1/\sqrt{MK})$ convergence rate. Our evaluation validates our theoretical results and shows that our PQASGD can converge as fast as full-communication SGD with only 3%-5% communication data size.
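A toy sketch of periodic quantized averaging in the spirit of PQASGD (our construction: a simple stochastic quantizer plus model averaging every H local steps; the quantizer, problem, and hyperparameters are illustrative assumptions):

```python
import numpy as np

np.random.seed(6)
M, d, lr, H = 4, 10, 0.05, 5    # workers, dimension, step size, period

def quantize(v, levels=4):      # simple QSGD-style stochastic quantizer
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return v
    scaled = np.abs(v) / norm * levels
    lower = np.floor(scaled)
    rounded = lower + (np.random.rand(*v.shape) < scaled - lower)
    return norm * np.sign(v) * rounded / levels

x = np.zeros((M, d))            # one model replica per worker
for t in range(1, 501):
    g = 2 * x + 0.2 * np.random.randn(M, d)   # noisy grads of ||x||^2
    x -= lr * g                 # local SGD step on each worker
    if t % H == 0:              # communicate only every H steps, quantized
        x[:] = np.mean([quantize(x[i]) for i in range(M)], axis=0)

print(np.linalg.norm(x[0]))     # consensus model near the optimum (origin)
```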

Journal ArticleDOI
TL;DR: The foundational role of the proximal framework is discussed, with a focus on non-Euclidean proximal distances of Bregman type, which are central to the analysis of many other fundamental first order minimization relatives.
Abstract: We discuss the foundational role of the proximal framework in the development and analysis of some iconic first order optimization algorithms, with a focus on non-Euclidean proximal distances of Bregman type, which are central to the analysis of many other fundamental first order minimization relatives. We stress simplification and unification by highlighting self-contained elementary proof-patterns to obtain convergence rate and global convergence both in the convex and the nonconvex settings, which in turn also allows us to present some novel results.
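The basic non-Euclidean proximal step underlying this framework replaces the squared Euclidean distance with a Bregman distance $D_h$ generated by a convex kernel $h$ (our notation):

$$x^{k+1} \in \operatorname*{arg\,min}_{x} \Big\{ \langle \nabla f(x^k), x - x^k \rangle + \tfrac{1}{\lambda} D_h(x, x^k) \Big\}, \qquad D_h(x, y) := h(x) - h(y) - \langle \nabla h(y), x - y \rangle.$$

Choosing $h(x) = \tfrac{1}{2}\|x\|^2$ recovers classical gradient/proximal descent; other kernels adapt the step geometry to the problem's constraints.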

Journal ArticleDOI
TL;DR: Simulation results show the effectiveness and performance of the proposed continuous-time algorithms, and show that the convergence rate of the second-order algorithm is faster than that of the first-order distributed algorithm.
Abstract: This paper proposes two second-order continuous-time algorithms to solve the economic power dispatch problem in smart grids. The collective aim is to minimize a sum of generation cost functions subject to the power demand and individual generator constraints. First, in the framework of nonsmooth analysis and algebraic graph theory, one distributed second-order algorithm is developed and guaranteed to find an optimal solution. As a result, the power demand constraints can be kept all the time under an appropriate initial condition. The second algorithm is under a centralized framework, and the optimal solution is robust in the sense that different initial power conditions do not change the convergence of the optimal solution. Finally, simulation results based on a five-unit system, the IEEE 30-bus system, and the IEEE 300-bus system show the effectiveness and performance of the proposed continuous-time algorithms. The examples also show that the convergence rate of the second-order algorithm is faster than that of the first-order distributed algorithm.

Posted Content
TL;DR: An approximation algorithm is presented for solving a class of bilevel programming problems where the inner objective function is strongly convex, together with its finite-time convergence analysis under different convexity assumptions on the outer objective function.
Abstract: In this paper, we study a class of bilevel programming problems where the inner objective function is strongly convex. More specifically, under some mild assumptions on the partial derivatives of both inner and outer objective functions, we present an approximation algorithm for solving this class of problems and provide its finite-time convergence analysis under different convexity assumptions on the outer objective function. We also present an accelerated variant of this method which improves the rate of convergence under the convexity assumption. Furthermore, we generalize our results to the stochastic setting where only noisy information of both objective functions is available. To the best of our knowledge, this is the first time that such (stochastic) approximation algorithms with established iteration complexity (sample complexity) are provided for bilevel programming.

Journal ArticleDOI
TL;DR: This paper presents sufficient conditions for the quadratic convergence of Newton's method in this type of grid with constant power terminals, and computational results complement this theoretical analysis.
Abstract: The power flow is a nonlinear problem that requires Newton's method to be solved in dc microgrids with constant power terminals. This paper presents sufficient conditions for the quadratic convergence of Newton's method in this type of grid. The classic Newton's method as well as an approximated Newton's method are analyzed in both master–slave and island operation with droop controls. Requirements for the convergence as well as for the existence and uniqueness of the solution starting from voltages close to 1 pu are presented. Computational results complement this theoretical analysis.
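A minimal sketch of the Newton iteration for a toy dc power-flow equation with one constant-power load bus (the two-bus model and the numbers are our illustrative assumptions, not the paper's test systems):

```python
import numpy as np

# One load bus at voltage v connected to a 1 pu slack bus through
# conductance g: power balance  g*v*(v - 1) + p = 0  for load power p.
def newton_dc(p, g=10.0, v0=1.0, tol=1e-12):
    v = v0
    for _ in range(20):
        f = g * v * (v - 1.0) + p       # power mismatch at the load bus
        J = g * (2.0 * v - 1.0)         # Jacobian (scalar derivative here)
        step = f / J
        v -= step
        if abs(step) < tol:             # quadratic convergence: the step
            break                       # shrinks roughly as its own square
    return v

print(newton_dc(p=1.0))                 # high-voltage solution near 0.887 pu
```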

Posted Content
TL;DR: A sharp analysis of a recently proposed adaptive gradient method, namely the partially adaptive momentum estimation method (Padam) (Chen and Gu, 2018), which admits many existing adaptive gradient methods such as RMSProp and AMSGrad as special cases, shows that Padam converges to a first-order stationary point for smooth nonconvex objectives.
Abstract: Adaptive gradient methods are workhorses in deep learning. However, the convergence guarantees of adaptive gradient methods for nonconvex optimization have not been thoroughly studied. In this paper, we provide a fine-grained convergence analysis for a general class of adaptive gradient methods including AMSGrad, RMSProp and AdaGrad. For smooth nonconvex functions, we prove that adaptive gradient methods in expectation converge to a first-order stationary point. Our convergence rate is better than existing results for adaptive gradient methods in terms of dimension, and is strictly faster than stochastic gradient descent (SGD) when the stochastic gradients are sparse. To the best of our knowledge, this is the first result showing the advantage of adaptive gradient methods over SGD in the nonconvex setting. In addition, we also prove high probability bounds on the convergence rates of AMSGrad, RMSProp as well as AdaGrad, which have not been established before. Our analyses shed light on better understanding the mechanism behind adaptive gradient methods in optimizing nonconvex objectives.
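A hedged sketch of the Padam update on a toy problem (hyperparameters are illustrative; the partial exponent $p \in (0, 1/2]$ interpolates between SGD with momentum as $p \to 0$ and AMSGrad at $p = 1/2$):

```python
import numpy as np

np.random.seed(7)
d = 5
x = np.random.randn(d)
# noisy gradient of the toy nonconvex loss f(x) = sum(0.5*x_i^2 + 2*sin(x_i))
grad = lambda x: x + 2 * np.cos(x) + 0.1 * np.random.randn(d)

m = np.zeros(d); v = np.zeros(d); v_hat = np.zeros(d)
beta1, beta2, lr, p, eps = 0.9, 0.999, 0.01, 0.125, 1e-8
for t in range(5000):
    g = grad(x)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    v_hat = np.maximum(v_hat, v)        # AMSGrad-style running max
    x -= lr * m / (v_hat**p + eps)      # partial exponent p instead of 1/2

print(np.linalg.norm(x + 2 * np.cos(x)))  # near-stationary for the toy loss
```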

Journal ArticleDOI
01 May 2018
TL;DR: Experimental results indicate that in terms of robustness, stability and quality of the solution obtained, EFADE is significantly better than, or at least comparable to, state-of-the-art approaches.
Abstract: This paper presents an enhanced fitness-adaptive differential evolution algorithm with novel mutation (EFADE) for solving global numerical optimization problems over continuous space. A new triangular mutation operator is introduced. It is based on the convex combination vector of the triplet defined by the three randomly chosen vectors and the difference vectors between the best, better and worst individuals among the three randomly selected vectors. The triangular mutation operator helps the search achieve a better balance between global exploration ability and local exploitation tendency, as well as enhancing the convergence rate of the algorithm through the optimization process. Besides, two novel, effective adaptation schemes are used to update the control parameters to appropriate values without either extra parameters or prior knowledge of the characteristics of the optimization problem. In order to verify and analyze the performance of EFADE, numerical experiments on a set of 28 test problems from the CEC2013 benchmark for 10, 30 and 50 dimensions, including a comparison with 12 recent DE-based algorithms and six recent evolutionary algorithms, are executed. Experimental results indicate that in terms of robustness, stability and quality of the solution obtained, EFADE is significantly better than, or at least comparable to, state-of-the-art approaches.
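A toy sketch of a triangular mutation step in the spirit of EFADE (our simplification: equal convex-combination weights and a single scale factor F, whereas EFADE uses adaptive weights and fitness-adaptive control parameters; the sphere objective is illustrative):

```python
import numpy as np

np.random.seed(8)
sphere = lambda x: np.sum(x**2, axis=-1)      # toy objective
NP, d, F, CR = 30, 10, 0.7, 0.9
pop = np.random.uniform(-5, 5, (NP, d))

for gen in range(200):
    for i in range(NP):
        r = np.random.choice(NP, 3, replace=False)
        order = np.argsort(sphere(pop[r]))    # sort triplet by fitness
        xb, xm, xw = pop[r[order]]            # best / better / worst
        centroid = (xb + xm + xw) / 3.0       # convex combination of triplet
        mutant = centroid + F * ((xb - xm) + (xb - xw) + (xm - xw))
        trial = np.where(np.random.rand(d) < CR, mutant, pop[i])  # crossover
        if sphere(trial) < sphere(pop[i]):    # greedy selection
            pop[i] = trial

print(sphere(pop).min())                      # near zero on the sphere function
```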

Proceedings Article
02 Dec 2018
TL;DR: The error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions, and the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate are provided.
Abstract: In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in $O(1/\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.

Journal ArticleDOI
TL;DR: It is shown that in terms of the order of the accuracy, the evaluation complexity of a line-search method which is based on random first-order models and directions is the same as its counterparts that use deterministic accurate models; the use of probabilistic models only increases the complexity by a constant, which depends on the probability of the models being good.
Abstract: We present global convergence rates for a line-search method which is based on random first-order models and directions whose quality is ensured only with certain probability. We show that in terms of the order of the accuracy, the evaluation complexity of such a method is the same as its counterparts that use deterministic accurate models; the use of probabilistic models only increases the complexity by a constant, which depends on the probability of the models being good. We particularize and improve these results in the convex and strongly convex case. We also analyse a probabilistic cubic regularization variant that allows approximate probabilistic second-order models and show improved complexity bounds compared to probabilistic first-order methods; again, as a function of the accuracy, the probabilistic cubic regularization bounds are of the same (optimal) order as for the deterministic case.

Journal ArticleDOI
01 Jan 2018
TL;DR: A novel gradient-based algorithm for unconstrained convex optimization is designed and analyzed, which can be seen as an extension of methods such as gradient descent, Nesterov's accelerated gradient descent, and the heavy-ball method.
Abstract: We design and analyze a novel gradient-based algorithm for unconstrained convex optimization. When the objective function is $m$ -strongly convex and its gradient is $L$ -Lipschitz continuous, the iterates and function values converge linearly to the optimum at rates $\rho $ and $\rho ^{2}$ , respectively, where $\rho = 1-\sqrt {m/L}$ . These are the fastest known guaranteed linear convergence rates for globally convergent first-order methods, and for high desired accuracies the corresponding iteration complexity is within a factor of two of the theoretical lower bound. We use a simple graphical design procedure based on integral quadratic constraints to derive closed-form expressions for the algorithm parameters. The new algorithm, which we call the triple momentum method, can be seen as an extension of methods such as gradient descent, Nesterov’s accelerated gradient descent, and the heavy-ball method.
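A numpy sketch of the triple momentum iteration with the closed-form parameters as we read them from the paper ($\rho = 1 - \sqrt{m/L}$ and $(\alpha, \beta, \gamma, \delta) = ((1+\rho)/L,\ \rho^2/(2-\rho),\ \rho^2/((1+\rho)(2-\rho)),\ \rho^2/(1-\rho^2))$); the quadratic test problem is our illustrative assumption:

```python
import numpy as np

np.random.seed(9)
d = 20
H = np.diag(np.linspace(1.0, 100.0, d))   # quadratic with m = 1, L = 100
grad = lambda x: H @ x
m_, L = 1.0, 100.0

rho = 1.0 - np.sqrt(m_ / L)               # guaranteed linear rate
alpha = (1.0 + rho) / L
beta = rho**2 / (2.0 - rho)
gamma = rho**2 / ((1.0 + rho) * (2.0 - rho))
delta = rho**2 / (1.0 - rho**2)

xi_prev = xi = np.random.randn(d)         # internal state sequence
for k in range(300):
    y = (1 + gamma) * xi - gamma * xi_prev        # gradient query point
    xi_next = (1 + beta) * xi - beta * xi_prev - alpha * grad(y)
    xi_prev, xi = xi, xi_next
x_out = (1 + delta) * xi - delta * xi_prev        # the method's output

print(np.linalg.norm(x_out))              # shrinks roughly like rho**k = 0.9**k
```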

Proceedings Article
03 Jul 2018
TL;DR: In this article, a simple variant of Nesterov's accelerated gradient descent (AGD) was shown to achieve a faster convergence rate than gradient descent (GD) in the nonconvex setting.
Abstract: Nesterov's accelerated gradient descent (AGD), an instance of the general family of "momentum methods", provably achieves faster convergence rate than gradient descent (GD) in the convex setting. However, whether these methods are superior to GD in the nonconvex setting remains open. This paper studies a simple variant of AGD, and shows that it escapes saddle points and finds a second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ iterations, faster than the $\tilde{O}(1/\epsilon^{2})$ iterations required by GD. To the best of our knowledge, this is the first Hessian-free algorithm to find a second-order stationary point faster than GD, and also the first single-loop algorithm with a faster rate than GD even in the setting of finding a first-order stationary point. Our analysis is based on two key ideas: (1) the use of a simple Hamiltonian function, inspired by a continuous-time perspective, which AGD monotonically decreases per step even for nonconvex functions, and (2) a novel framework called improve or localize, which is useful for tracking the long-term behavior of gradient-based optimization algorithms. We believe that these techniques may deepen our understanding of both acceleration algorithms and nonconvex optimization.

Journal ArticleDOI
TL;DR: A proximal difference-of-convex algorithm with extrapolation is proposed to possibly accelerate the proximal DCA, and it is shown that any cluster point of the sequence generated by the algorithm is a stationary point of the DC optimization problem for a fairly general choice of extrapolation parameters.
Abstract: We consider a class of difference-of-convex (DC) optimization problems whose objective is level-bounded and is the sum of a smooth convex function with Lipschitz gradient, a proper closed convex function and a continuous concave function. While this kind of problems can be solved by the classical difference-of-convex algorithm (DCA) (Pham et al. Acta Math Vietnam 22:289–355, 1997), the difficulty of the subproblems of this algorithm depends heavily on the choice of DC decomposition. Simpler subproblems can be obtained by using a specific DC decomposition described in Pham et al. (SIAM J Optim 8:476–505, 1998). This decomposition has been proposed in numerous work such as Gotoh et al. (DC formulations and algorithms for sparse optimization problems, 2017), and we refer to the resulting DCA as the proximal DCA. Although the subproblems are simpler, the proximal DCA is the same as the proximal gradient algorithm when the concave part of the objective is void, and hence is potentially slow in practice. In this paper, motivated by the extrapolation techniques for accelerating the proximal gradient algorithm in the convex settings, we consider a proximal difference-of-convex algorithm with extrapolation to possibly accelerate the proximal DCA. We show that any cluster point of the sequence generated by our algorithm is a stationary point of the DC optimization problem for a fairly general choice of extrapolation parameters: in particular, the parameters can be chosen as in FISTA with fixed restart (O’Donoghue and Candes in Found Comput Math 15, 715–732, 2015). In addition, by assuming the Kurdyka-Łojasiewicz property of the objective and the differentiability of the concave part, we establish global convergence of the sequence generated by our algorithm and analyze its convergence rate. Our numerical experiments on two difference-of-convex regularized least squares models show that our algorithm usually outperforms the proximal DCA and the general iterative shrinkage and thresholding algorithm proposed in Gong et al. (A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems, 2013).
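A hedged numpy sketch of the proximal DCA with extrapolation (pDCAe) on a toy least-squares problem with the DC regularizer $\lambda(\|x\|_1 - \|x\|_2)$; the FISTA-style extrapolation parameters without restart, and the problem data, are our illustrative choices:

```python
import numpy as np

np.random.seed(10)
m, n, lam = 40, 80, 0.5
A = np.random.randn(m, n) / np.sqrt(m)
b = np.random.randn(m)
L = np.linalg.norm(A, 2)**2          # Lipschitz constant of the smooth part

soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = x_old = np.zeros(n)
t = 1.0
for k in range(300):
    t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
    y = x + ((t - 1) / t_next) * (x - x_old)      # extrapolation step
    nx = np.linalg.norm(x)
    xi = lam * x / nx if nx > 0 else np.zeros(n)  # subgradient of lam*||x||_2
    g = A.T @ (A @ y - b) - xi       # smooth gradient minus concave subgradient
    x_old, x = x, soft(y - g / L, lam / L)        # prox of lam*||.||_1
    t = t_next

print(np.count_nonzero(np.abs(x) > 1e-8))         # sparse solution
```

With the concave subgradient set to zero, the loop above reduces to FISTA on the convex part, which is exactly the sense in which pDCAe extends proximal-gradient extrapolation to the DC setting.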