Showing papers in "arXiv: Optimization and Control in 2018"


Posted Content
TL;DR: In this paper, the authors prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as mini-batch SGD in terms of number of evaluated gradients, that is, the scheme achieves linear speedup in the number of workers and minibatch size.
Abstract: Mini-batch stochastic gradient descent (SGD) is state of the art in large scale distributed training. The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice as the scheme often suffers from large network delays and bandwidth limits. To overcome this communication bottleneck recent works propose to reduce the communication frequency. An algorithm of this type is local SGD that runs SGD independently in parallel on different workers and averages the sequences only once in a while. This scheme shows promising results in practice, but eluded thorough theoretical analysis. We prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as mini-batch SGD in terms of number of evaluated gradients, that is, the scheme achieves linear speedup in the number of workers and mini-batch size. The number of communication rounds can be reduced up to a factor of T^{1/2} (where T denotes the total number of steps) compared to mini-batch SGD. This also holds for asynchronous implementations. Local SGD can also be used for large scale training of deep learning models. The results shown here aim to serve as a guideline to further explore the theoretical and practical aspects of local SGD in these applications.

516 citations
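
Illustration (not from the paper): a minimal sketch of the local SGD scheme described in the abstract above, in which workers run SGD independently and average their iterates only every few steps. The toy objective f(x) = 0.5 (x - 3)^2, the Gaussian gradient noise, and all names and parameter values are assumptions chosen for the example.

```python
import random


def local_sgd(num_workers=4, total_steps=200, sync_every=10, lr=0.05, seed=0):
    """Toy local SGD on f(x) = 0.5 * (x - 3)^2 with noisy gradients.

    Returns the averaged iterate and the number of communication rounds.
    """
    rng = random.Random(seed)
    target = 3.0
    xs = [0.0] * num_workers          # one local iterate per worker
    comm_rounds = 0
    for t in range(1, total_steps + 1):
        # Each worker takes an independent stochastic gradient step.
        for k in range(num_workers):
            grad = (xs[k] - target) + rng.gauss(0.0, 1.0)
            xs[k] -= lr * grad
        # Workers communicate (average) only every `sync_every` steps.
        if t % sync_every == 0:
            avg = sum(xs) / num_workers
            xs = [avg] * num_workers
            comm_rounds += 1
    return sum(xs) / num_workers, comm_rounds
```

Setting `sync_every=1` averages after every step, which on this toy problem plays the role of the mini-batch SGD baseline; larger values trade communication rounds for local computation.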


Posted Content
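TL;DR: For minimizing a convex function of a measure by gradient descent on the weights and positions of a particle discretization, it is shown that, when initialized correctly and in the many-particle limit, the resulting gradient flow, although non-convex, converges to global minimizers; the proof involves Wasserstein gradient flows, a by-product of optimal transport theory.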
Abstract: Many tasks in machine learning and signal processing can be solved by minimizing a convex function of a measure. This includes sparse spikes deconvolution or training a neural network with a single hidden layer. For these problems, we study a simple minimization method: the unknown measure is discretized into a mixture of particles and a continuous-time gradient descent is performed on their weights and positions. This is an idealization of the usual way to train neural networks with a large hidden layer. We show that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers. The proof involves Wasserstein gradient flows, a by-product of optimal transport theory. Numerical experiments show that this asymptotic behavior is already at play for a reasonable number of particles, even in high dimension.

403 citations


Posted Content
TL;DR: In this article, the authors show that the lazy training phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels.
Abstract: In a series of recent theoretical works, it was shown that strongly over-parameterized neural networks trained with gradient-based methods could converge exponentially fast to zero training loss, with their parameters hardly varying. In this work, we show that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels. Through a theoretical analysis, we exhibit various situations where this phenomenon arises in non-convex optimization and we provide bounds on the distance between the lazy and linearized optimization paths. Our numerical experiments bring a critical note, as we observe that the performance of commonly used non-linear deep convolutional neural networks in computer vision degrades when trained in the lazy regime. This makes it unlikely that "lazy training" is behind the many successes of neural networks in difficult high dimensional tasks.

219 citations


Posted Content
TL;DR: This article surveys reinforcement learning from the perspective of optimization and control, with a focus on continuous control applications.
Abstract: This manuscript surveys reinforcement learning from the perspective of optimization and control with a focus on continuous control applications. It surveys the general formulation, terminology, and typical experimental implementations of reinforcement learning and reviews competing solution paradigms. In order to compare the relative merits of various techniques, this survey presents a case study of the Linear Quadratic Regulator (LQR) with unknown dynamics, perhaps the simplest and best-studied problem in optimal control. The manuscript describes how merging techniques from learning theory and control can provide non-asymptotic characterizations of LQR performance and shows that these characterizations tend to match experimental behavior. In turn, when revisiting more complex applications, many of the observed phenomena in LQR persist. In particular, theory and experiment demonstrate the role and importance of models and the cost of generality in reinforcement learning algorithms. This survey concludes with a discussion of some of the challenges in designing learning systems that safely and reliably interact with complex and uncertain environments and how tools from reinforcement learning and control might be combined to approach these challenges.

193 citations


Posted Content
TL;DR: This paper proposes a proximally guided stochastic subgradient method and a proximally guided stochastic variance-reduced method for expected and finite-sum saddle-point problems, respectively, and establishes the computation complexities of both methods for finding a nearly stationary point of the corresponding minimization problem.
Abstract: Min-max saddle-point problems have broad applications in many tasks in machine learning, e.g., distributionally robust learning, learning with non-decomposable loss, or learning with uncertain data. Although convex-concave saddle-point problems have been broadly studied with efficient algorithms and solid theories available, it remains a challenge to design provably efficient algorithms for non-convex saddle-point problems, especially when the objective function involves an expectation or a large-scale finite sum. Motivated by recent literature on non-convex non-smooth minimization, this paper studies a family of non-convex min-max problems where the minimization component is non-convex (weakly convex) and the maximization component is concave. We propose a proximally guided stochastic subgradient method and a proximally guided stochastic variance-reduced method for expected and finite-sum saddle-point problems, respectively. We establish the computation complexities of both methods for finding a nearly stationary point of the corresponding minimization problem.

156 citations


Posted Content
TL;DR: This article shows that, even in the absence of smoothness and convexity, the stochastic subgradient method applied to any semialgebraic locally Lipschitz function produces limit points that are all first-order stationary.
Abstract: This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity? We prove that the stochastic subgradient method, on any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary. More generally, our result applies to any function with a Whitney stratifiable graph. In particular, this work endows the stochastic subgradient method, and its proximal extension, with rigorous convergence guarantees for a wide class of problems arising in data science---including all popular deep learning architectures.

146 citations


Journal ArticleDOI
TL;DR: A novel co-optimization model is formulated to route RCs and MPSs in the transportation network, schedule them in the DS, and reconfigure the DS for microgrid formation coordinately and model topology constraints based on the concept of spanning forest.
Abstract: Repair crews (RCs) and mobile power sources (MPSs) are critical resources for distribution system (DS) outage management after a natural disaster. However, their logistics have not been well investigated. We propose a resilient scheme for disaster recovery logistics to co-optimize DS restoration with dispatch of RCs and MPSs. A novel co-optimization model is formulated to route RCs and MPSs in the transportation network, schedule them in the DS, and reconfigure the DS for microgrid formation coordinately, etc. The model incorporates different timescales of DS restoration and RC/MPS dispatch, the coupling of transportation and power networks, etc. To ensure radiality of the DS with variable physical structure and MPS allocation, we also model topology constraints based on the concept of spanning forest. The model is convexified equivalently and linearized into a mixed-integer linear program. To reduce its computation time, preprocessing methods are proposed to pre-assign a minimal set of repair tasks to depots and reduce the number of candidate nodes for MPS connection. Resilient recovery strategies are thus generated to enhance service restoration, especially by dynamic formation of microgrids that are powered by MPSs and topologized by repair actions of RCs and network reconfiguration of the DS. Case studies demonstrate the proposed methodology.

140 citations


Journal ArticleDOI
Ran Xin, Usman A. Khan
TL;DR: This letter proposes a linear algorithm based on an inexact gradient method and a gradient estimation technique and shows that the proposed algorithm geometrically converges to the global minimizer with a sufficiently small step-size.
Abstract: In this letter, we study distributed optimization, where a network of agents, abstracted as a directed graph, collaborates to minimize the average of locally-known convex functions. Most of the existing approaches over directed graphs are based on push-sum (type) techniques, which use an independent algorithm to asymptotically learn either the left or right eigenvector of the underlying weight matrices. This strategy causes additional computation, communication, and nonlinearity in the algorithm. In contrast, we propose a linear algorithm based on an inexact gradient method and a gradient estimation technique. Under the assumptions that each local function is strongly-convex with Lipschitz-continuous gradients, we show that the proposed algorithm geometrically converges to the global minimizer with a sufficiently small step-size. We present simulations to illustrate the theoretical findings.

134 citations


Posted Content
TL;DR: This paper proposes a new technique named SPIDER, which can be used to track many deterministic quantities of interest with significantly reduced computational cost and proves that SPIDER-SFO nearly matches the algorithmic lower bound for finding approximate first-order stationary points under the gradient Lipschitz assumption in the finite-sum setting.
Abstract: In this paper, we propose a new technique named Stochastic Path-Integrated Differential EstimatoR (SPIDER), which can be used to track many deterministic quantities of interest with significantly reduced computational cost. We apply SPIDER to two tasks, namely the stochastic first-order and zeroth-order methods. For the stochastic first-order method, combining SPIDER with normalized gradient descent, we propose two new algorithms, namely SPIDER-SFO and SPIDER-SFO$^{+}$, that solve non-convex stochastic optimization problems using stochastic gradients only. We provide sharp error-bound results on their convergence rates. In particular, we prove that the SPIDER-SFO and SPIDER-SFO$^{+}$ algorithms achieve a record-breaking gradient computation cost of $\mathcal{O}\left( \min( n^{1/2} \epsilon^{-2}, \epsilon^{-3} ) \right)$ for finding an $\epsilon$-approximate first-order and $\tilde{\mathcal{O}}\left( \min( n^{1/2} \epsilon^{-2}+\epsilon^{-2.5}, \epsilon^{-3} ) \right)$ for finding an $(\epsilon, \mathcal{O}(\epsilon^{0.5}))$-approximate second-order stationary point, respectively. In addition, we prove that SPIDER-SFO nearly matches the algorithmic lower bound for finding approximate first-order stationary points under the gradient Lipschitz assumption in the finite-sum setting. For the stochastic zeroth-order method, we prove a cost of $\mathcal{O}( d \min( n^{1/2} \epsilon^{-2}, \epsilon^{-3}) )$ which outperforms all existing results.

134 citations


Posted Content
TL;DR: An approximation algorithm is presented for a class of bilevel programming problems in which the inner objective function is strongly convex, together with its finite-time convergence analysis under different convexity assumptions on the outer objective function.
Abstract: In this paper, we study a class of bilevel programming problems where the inner objective function is strongly convex. More specifically, under some mild assumptions on the partial derivatives of both inner and outer objective functions, we present an approximation algorithm for solving this class of problems and provide its finite-time convergence analysis under different convexity assumptions on the outer objective function. We also present an accelerated variant of this method which improves the rate of convergence under a convexity assumption. Furthermore, we generalize our results to the stochastic setting where only noisy information of both objective functions is available. To the best of our knowledge, this is the first time that such (stochastic) approximation algorithms with established iteration complexity (sample complexity) are provided for bilevel programming.

131 citations


Posted Content
Hao Yu, Sen Yang, Shenghuo Zhu
TL;DR: In this paper, the authors provide a thorough and rigorous theoretical study of why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead; prior experiments indicate that model averaging still achieves a good speed-up of the training time as long as the averaging interval is carefully controlled.
Abstract: In distributed training of deep neural networks, parallel mini-batch SGD is widely used to speed up the training process by using multiple workers. It uses multiple workers to sample local stochastic gradients in parallel, aggregates all gradients in a single server to obtain the average, and updates each worker's local model using an SGD update with the averaged gradient. Ideally, parallel mini-batch SGD can achieve a linear speed-up of the training time (with respect to the number of workers) compared with SGD over a single worker. However, such linear scalability in practice is significantly limited by the growing demand for gradient communication as more workers are involved. Model averaging, which periodically averages individual models trained over parallel workers, is another common practice used for distributed training of deep neural networks since (Zinkevich et al. 2010; McDonald, Hall, and Mann 2010). Compared with parallel mini-batch SGD, the communication overhead of model averaging is significantly reduced. Impressively, numerous experimental works have verified that model averaging can still achieve a good speed-up of the training time as long as the averaging interval is carefully controlled. However, it remains a mystery in theory why such a simple heuristic works so well. This paper provides a thorough and rigorous theoretical study on why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.

Journal ArticleDOI
TL;DR: In this article, the authors introduced the mathematical formulation of the population risk minimization problem in deep learning as a mean-field optimal control problem and proved optimality conditions of both the Hamilton-Jacobi-Bellman type and the Pontryagin type.
Abstract: Recent work linking deep neural networks and dynamical systems opened up new avenues to analyze deep learning. In particular, it is observed that new insights can be obtained by recasting deep learning as an optimal control problem on difference or differential equations. However, the mathematical aspects of such a formulation have not been systematically explored. This paper introduces the mathematical formulation of the population risk minimization problem in deep learning as a mean-field optimal control problem. Mirroring the development of classical optimal control, we state and prove optimality conditions of both the Hamilton-Jacobi-Bellman type and the Pontryagin type. These mean-field results reflect the probabilistic nature of the learning problem. In addition, by appealing to the mean-field Pontryagin's maximum principle, we establish some quantitative relationships between population and empirical learning problems. This serves to establish a mathematical foundation for investigating the algorithmic and theoretical connections between optimal control and deep learning.

Journal ArticleDOI
TL;DR: This work presents a method for deterministic global optimization of problems with embedded artificial neural networks, based on McCormick relaxations in a reduced space that employ the convex and concave envelopes of the nonlinear activation function.
Abstract: Artificial neural networks (ANNs) are used in various applications for data-driven black-box modeling and subsequent optimization. Herein, we present an efficient method for deterministic global optimization of ANN embedded optimization problems. The proposed method is based on relaxations of algorithms using McCormick relaxations in a reduced-space [SIOPT, 20 (2009), pp. 573-601] including the convex and concave envelopes of the nonlinear activation function of ANNs. The optimization problem is solved using our in-house global deterministic solver MAiNGO. The performance of the proposed method is shown in four optimization examples: an illustrative function, a fermentation process, a compressor plant and a chemical process optimization. The results show that computational solution time is favorable compared to the global general-purpose optimization solver BARON.

Journal ArticleDOI
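TL;DR: In this article, distributed computation of a generalized Nash equilibrium under partial-decision information is recast, through primal-dual analysis, as the problem of finding a zero of a sum of monotone operators, and a single-layer, fully distributed algorithm is shown to converge to a variational GNE with fixed step-sizes.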
Abstract: We consider distributed computation of generalized Nash equilibrium (GNE) over networks, in games with shared coupling constraints. Existing methods require that each player has full access to opponents' decisions. In this paper, we assume that players have only partial-decision information, and can communicate with their neighbours over an arbitrary undirected graph. We recast the problem as that of finding a zero of a sum of monotone operators through primal-dual analysis. To distribute the problem, we doubly augment variables, so that each player has local decision estimates and local copies of Lagrangian multipliers. We introduce a single-layer algorithm, fully distributed with respect to both primal and dual variables. We show its convergence to a variational GNE with fixed step-sizes, by reformulating it as a forward-backward iteration for a pair of doubly-augmented monotone operators.

Posted Content
TL;DR: It is shown that OMWU monotonically improves the Kullback-Leibler divergence of the current iterate to the (appropriately normalized) min-max solution until it enters a neighborhood of the solution and becomes a contracting map converging to the exact solution.
Abstract: Motivated by applications in Game Theory, Optimization, and Generative Adversarial Networks, recent work of Daskalakis et al. [DISZ17] and follow-up work of Liang and Stokes [LiangS18] have established that a variant of the widely used Gradient Descent/Ascent procedure, called "Optimistic Gradient Descent/Ascent (OGDA)", exhibits last-iterate convergence to saddle points in unconstrained convex-concave min-max optimization problems. We show that the same holds true in the more general problem of constrained min-max optimization under a variant of the no-regret Multiplicative-Weights-Update method called "Optimistic Multiplicative-Weights Update (OMWU)". This answers an open question of Syrgkanis et al. [SALS15]. The proof of our result requires fundamentally different techniques from those that exist in no-regret learning literature and the aforementioned papers. We show that OMWU monotonically improves the Kullback-Leibler divergence of the current iterate to the (appropriately normalized) min-max solution until it enters a neighborhood of the solution. Inside that neighborhood we show that OMWU is locally (asymptotically) stable converging to the exact solution. We believe that our techniques will be useful in the analysis of the last iterate of other learning algorithms.

Posted Content
Weijun Xie
TL;DR: It is shown that a DRCCP can be reformulated as a conditional value-at-risk constrained optimization problem, and thus admits tight inner and outer approximations and a big-M free formulation.
Abstract: This paper studies a distributionally robust chance constrained program (DRCCP) with Wasserstein ambiguity set, where the uncertain constraints should be satisfied with a probability at least a given threshold for all the probability distributions of the uncertain parameters within a chosen Wasserstein distance from an empirical distribution. In this work, we investigate equivalent reformulations and approximations of such problems. We first show that a DRCCP can be reformulated as a conditional value-at-risk constrained optimization problem, and thus admits tight inner and outer approximations. We also show that a DRCCP of bounded feasible region is mixed integer representable by introducing big-M coefficients and additional binary variables. For a DRCCP with pure binary decision variables, by exploring the submodular structure, we show that it admits a big-M free formulation, which can be solved by a branch and cut algorithm. Finally, we present a numerical study to illustrate the effectiveness of the proposed formulations.

Posted Content
TL;DR: It is proved that the proximal stochastic subgradient method, applied to a weakly convex problem, drives the gradient of the Moreau envelope to zero at the rate $O(k^{-1/4})$.
Abstract: We prove that the proximal stochastic subgradient method, applied to a weakly convex problem, drives the gradient of the Moreau envelope to zero at the rate $O(k^{-1/4})$. As a consequence, we resolve an open question on the convergence rate of the proximal stochastic gradient method for minimizing the sum of a smooth nonconvex function and a convex proximable function.
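
Illustration (not from the paper): a minimal sketch of a proximal stochastic subgradient iteration, x_{k+1} = prox(x_k - a_k g_k), on a toy weakly convex problem. The objective |x^2 - 1|, the box constraint [0, 2] (whose prox is a projection, a special case of the proximal step), the step-size schedule, and all names are assumptions chosen for the example.

```python
import math
import random


def prox_box(v, lo=0.0, hi=2.0):
    """Prox of the indicator of [lo, hi], i.e. Euclidean projection."""
    return min(max(v, lo), hi)


def prox_stochastic_subgradient(x0=2.0, steps=2000, noise_std=0.1, seed=0):
    """Minimize |x^2 - 1| over [0, 2] using noisy subgradients."""
    rng = random.Random(seed)
    x = x0
    for k in range(steps):
        step = 0.1 / math.sqrt(k + 1)      # diminishing step size
        residual = x * x - 1.0
        sign = (residual > 0) - (residual < 0)
        # A subgradient of |x^2 - 1| is sign(x^2 - 1) * 2x; add noise.
        g = sign * 2.0 * x + rng.gauss(0.0, noise_std)
        x = prox_box(x - step * g)         # proximal (here: projection) step
    return x
```

On this toy problem the iterates settle near the minimizer x = 1; the sketch says nothing about the rate established in the paper.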

Journal ArticleDOI
TL;DR: In this paper, the authors present a new notion of input-to-state safe control barrier functions (ISSf-CBFs), which ensure safety of nonlinear dynamical systems under input disturbances.
Abstract: This letter presents a new notion of input-to-state safe control barrier functions (ISSf-CBFs), which ensure safety of nonlinear dynamical systems under input disturbances. Similar to how safety conditions are specified in terms of forward invariance of a set, input-to-state safety (ISSf) conditions are specified in terms of forward invariance of a slightly larger set. In this context, invariance of the larger set implies that the states stay either inside or very close to the smaller safe set; and this closeness is bounded by the magnitude of the disturbances. The main contribution of the letter is the methodology used for obtaining a valid ISSf-CBF, given a control barrier function (CBF). The associated universal control law will also be provided. Towards the end, we will study unified quadratic programs (QPs) that combine control Lyapunov functions (CLFs) and ISSf-CBFs in order to obtain a single control law that ensures both safety and stability in systems with input disturbances.

Posted ContentDOI
TL;DR: This work provides an exact deterministic reformulation for data-driven chance constrained programs over Wasserstein balls and shows that two popular approximation schemes based on the conditional-value-at-risk and the Bonferroni inequality can perform poorly in practice and that these two schemes are generally incomparable with each other.
Abstract: We provide an exact deterministic reformulation for data-driven chance constrained programs over Wasserstein balls. For individual chance constraints as well as joint chance constraints with right-hand side uncertainty, our reformulation amounts to a mixed-integer conic program. In the special case of a Wasserstein ball with the $1$-norm or the $\infty$-norm, the cone is the nonnegative orthant, and the chance constrained program can be reformulated as a mixed-integer linear program. Using our reformulation, we show that two popular approximation schemes based on the conditional-value-at-risk and the Bonferroni inequality can perform poorly in practice and that these two schemes are generally incomparable with each other.

Posted Content
TL;DR: For the stochastic proximal point, proximal subgradient, and regularized Gauss-Newton methods for minimizing compositions of convex functions with smooth maps, the authors showed that under reasonable conditions on approximation quality and regularity of the models, any such algorithm can drive a natural stationarity measure to zero at the rate $O(k^{-1/4})$.
Abstract: We consider a family of algorithms that successively sample and minimize simple stochastic models of the objective function. We show that under reasonable conditions on approximation quality and regularity of the models, any such algorithm drives a natural stationarity measure to zero at the rate $O(k^{-1/4})$. As a consequence, we obtain the first complexity guarantees for the stochastic proximal point, proximal subgradient, and regularized Gauss-Newton methods for minimizing compositions of convex functions with smooth maps. The guiding principle, underlying the complexity guarantees, is that all algorithms under consideration can be interpreted as approximate descent methods on an implicit smoothing of the problem, given by the Moreau envelope. Specializing to classical circumstances, we obtain the long-sought convergence rate of the stochastic projected gradient method, without batching, for minimizing a smooth function on a closed convex set.
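
Illustration (not from the paper): the Moreau envelope mentioned above, written out in its standard form. The symbols $\varphi$, $\lambda$ and $\rho$ are notation assumed here; the gradient formula holds for a $\rho$-weakly convex $\varphi$ and $0 < \lambda < 1/\rho$.

```latex
\begin{align*}
\varphi_{\lambda}(x) &= \min_{y} \Big\{ \varphi(y) + \tfrac{1}{2\lambda}\,\|y - x\|^{2} \Big\}, \\
\operatorname{prox}_{\lambda\varphi}(x) &= \operatorname*{argmin}_{y} \Big\{ \varphi(y) + \tfrac{1}{2\lambda}\,\|y - x\|^{2} \Big\}, \\
\nabla \varphi_{\lambda}(x) &= \lambda^{-1}\big(x - \operatorname{prox}_{\lambda\varphi}(x)\big).
\end{align*}
```

A small $\|\nabla \varphi_{\lambda}(x)\|$ therefore means that $x$ is close to its proximal point, which is why this quantity can serve as a stationarity measure for nonsmooth problems.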

Journal ArticleDOI
TL;DR: In this article, the authors analyze the convergence time of a class of fixed-time stable systems, with the aim of providing a new non-conservative upper bound for the settling time.
Abstract: This paper deals with the convergence time analysis of a class of fixed-time stable systems with the aim to provide a new non-conservative upper bound for its settling time. Our contribution is fourfold. First, we revisit the well-known class of fixed-time stable systems, given in (Polyakov et al., 2012), while showing the conservatism of the classical upper estimate of the settling time. Second, we provide the smallest constant that uniformly upper bounds the settling time of any trajectory of the system under consideration. Third, introducing a slight modification of the previous class of fixed-time systems, we propose a new predefined-time convergent algorithm where the least upper bound of the settling time is set a priori as a parameter of the system. Finally, predefined-time controllers for first order and second order systems are introduced. Some simulation results highlight the performance of the proposed scheme in terms of settling time estimation compared to existing methods.

Posted Content
TL;DR: In this paper, an alternative limiting process that yields high-resolution ODEs was proposed, which can be used to distinguish between Nesterov's accelerated gradient method for strongly convex functions (NAG-SC) and Polyak's heavy-ball method.
Abstract: Gradient-based optimization algorithms can be studied from the perspective of limiting ordinary differential equations (ODEs). Motivated by the fact that existing ODEs do not distinguish between two fundamentally different algorithms---Nesterov's accelerated gradient method for strongly convex functions (NAG-SC) and Polyak's heavy-ball method---we study an alternative limiting process that yields high-resolution ODEs. We show that these ODEs permit a general Lyapunov function framework for the analysis of convergence in both continuous and discrete time. We also show that these ODEs are more accurate surrogates for the underlying algorithms; in particular, they not only distinguish between NAG-SC and Polyak's heavy-ball method, but they allow the identification of a term that we refer to as "gradient correction" that is present in NAG-SC but not in the heavy-ball method and is responsible for the qualitative difference in convergence of the two methods. We also use the high-resolution ODE framework to study Nesterov's accelerated gradient method for (non-strongly) convex functions, uncovering a hitherto unknown result---that NAG-C minimizes the squared gradient norm at an inverse cubic rate. Finally, by modifying the high-resolution ODE of NAG-C, we obtain a family of new optimization methods that are shown to maintain the accelerated convergence rates of NAG-C for smooth convex functions.
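
Illustration (not from the listing): the two discrete updates compared in the abstract above, in their commonly written single-sequence form. The step size $s$, the strong-convexity parameter $\mu$ and the momentum coefficient $\beta$ are notation assumed here.

```latex
\begin{align*}
\text{Heavy-ball:}\quad x_{k+1} &= x_k + \beta\,(x_k - x_{k-1}) - s\,\nabla f(x_k), \\
\text{NAG-SC:}\quad x_{k+1} &= x_k + \beta\,(x_k - x_{k-1}) - s\,\nabla f(x_k)
  - \beta s\,\big(\nabla f(x_k) - \nabla f(x_{k-1})\big), \\
\beta &= \frac{1 - \sqrt{\mu s}}{1 + \sqrt{\mu s}}.
\end{align*}
```

The NAG-SC line follows from eliminating $y_k$ in the two-sequence form $y_{k+1} = x_k - s\nabla f(x_k)$, $x_{k+1} = y_{k+1} + \beta\,(y_{k+1} - y_k)$. The extra term $\beta s\,(\nabla f(x_k) - \nabla f(x_{k-1}))$ is the only difference between the two updates and corresponds to the "gradient correction" named in the abstract.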

Posted Content
TL;DR: It is proved that under Lipschitz-gradient, convexity and order-$(s+2)$ differentiability assumptions, the sequence of iterates generated by discretizing the proposed second-order ODE converges to the optimal solution at a rate of $\mathcal{O}({N^{-2\frac{s}{s+1}}})$, where $s$ is the order of the Runge-Kutta numerical integrator.
Abstract: We study gradient-based optimization methods obtained by directly discretizing a second-order ordinary differential equation (ODE) related to the continuous limit of Nesterov's accelerated gradient method. When the function is smooth enough, we show that acceleration can be achieved by a stable discretization of this ODE using standard Runge-Kutta integrators. Specifically, we prove that under Lipschitz-gradient, convexity and order-$(s+2)$ differentiability assumptions, the sequence of iterates generated by discretizing the proposed second-order ODE converges to the optimal solution at a rate of $\mathcal{O}({N^{-2\frac{s}{s+1}}})$, where $s$ is the order of the Runge-Kutta numerical integrator. Furthermore, we introduce a new local flatness condition on the objective, under which rates even faster than $\mathcal{O}(N^{-2})$ can be achieved with low-order integrators and only gradient information. Notably, this flatness condition is satisfied by several standard loss functions used in machine learning. We provide numerical experiments that verify the theoretical rates predicted by our results.

Posted Content
TL;DR: In this paper, the authors studied the problem of distributed multi-agent optimization over a network, where each agent possesses a local cost function that is smooth and strongly convex, and the global objective is to find a common solution that minimizes the average of all cost functions.
Abstract: In this paper, we study the problem of distributed multi-agent optimization over a network, where each agent possesses a local cost function that is smooth and strongly convex. The global objective is to find a common solution that minimizes the average of all cost functions. Assuming agents only have access to unbiased estimates of the gradients of their local cost functions, we consider a distributed stochastic gradient tracking method (DSGT) and a gossip-like stochastic gradient tracking method (GSGT). We show that, in expectation, the iterates generated by each agent are attracted to a neighborhood of the optimal solution, where they accumulate exponentially fast (under a constant stepsize choice). Under DSGT, the limiting (expected) error bounds on the distance of the iterates from the optimal solution decrease with the network size $n$, which is a comparable performance to a centralized stochastic gradient algorithm. Moreover, we show that when the network is well-connected, GSGT incurs lower communication cost than DSGT while maintaining a similar computational cost. A numerical example further demonstrates the effectiveness of the proposed methods.
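
Illustration (not from the paper): a minimal sketch of a gradient tracking iteration of the kind described above, in which each agent mixes its iterate with its neighbours and maintains a running estimate of the average gradient. The scalar quadratic costs f_i(x) = 0.5 (x - b_i)^2, the fully connected mixing matrix, the noise-free default, and all names are assumptions chosen for the example.

```python
import random


def gradient_tracking(targets=(1.0, 2.0, 3.0), steps=300, lr=0.1,
                      noise_std=0.0, seed=0):
    """Toy gradient tracking with local costs f_i(x) = 0.5 * (x - b_i)^2."""
    rng = random.Random(seed)
    n = len(targets)
    # Doubly stochastic mixing matrix: weight 0.5 on self, rest split evenly.
    off = 0.5 / (n - 1)
    W = [[0.5 if i == j else off for j in range(n)] for i in range(n)]

    def mix(v):
        return [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]

    def grads(x):
        # Local (optionally noisy) gradient estimates, one per agent.
        return [x[i] - targets[i] + rng.gauss(0.0, noise_std) for i in range(n)]

    x = [0.0] * n
    g = grads(x)
    y = list(g)                      # trackers start at the local gradients
    for _ in range(steps):
        x = mix([x[i] - lr * y[i] for i in range(n)])
        g_new = grads(x)
        # Tracker update: mix, then add the change in the local gradient.
        y = [yi + gn - go for yi, gn, go in zip(mix(y), g_new, g)]
        g = g_new
    return x
```

With `noise_std=0.0` every agent reaches the minimizer of the average cost (the mean of `targets`); a positive `noise_std` leaves the agents fluctuating in a neighbourhood of it.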

Posted Content
TL;DR: In this paper, the authors provide some theoretical results on the behavior and convergence of the algorithm proposed in [6] and prove that the algorithm approximates local minimizers of an unconstrained $\ell^0$-penalized least-squares problem, and provide sufficient conditions for general convergence, rate of convergence, and conditions for one-step recovery.
Abstract: One way to understand time-series data is to identify the underlying dynamical system which generates it. This task can be done by selecting an appropriate model and a set of parameters which best fits the dynamics while providing the simplest representation (i.e. the smallest number of terms). One such approach is the sparse identification of nonlinear dynamics framework [6] which uses a sparsity-promoting algorithm that iterates between a partial least-squares fit and a thresholding (sparsity-promoting) step. In this work, we provide some theoretical results on the behavior and convergence of the algorithm proposed in [6]. In particular, we prove that the algorithm approximates local minimizers of an unconstrained $\ell^0$-penalized least-squares problem. From this, we provide sufficient conditions for general convergence, rate of convergence, and conditions for one-step recovery. Examples illustrate that the rates of convergence are sharp. In addition, our results extend to other algorithms related to the algorithm in [6], and provide theoretical verification to several observed phenomena.
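The alternation between a least-squares fit on the current support and a hard-thresholding step can be sketched as follows. This is an illustrative reading of the description above, not the reference implementation from [6]; the function name `stlsq`, the stopping rule (stop when the support no longer changes), and the test problem are assumptions.

```python
import numpy as np

def stlsq(A, b, threshold, max_iter=20):
    """Sequentially thresholded least squares (illustrative sketch).

    Alternates between zeroing coefficients with magnitude below
    `threshold` and refitting least squares on the remaining columns.
    """
    n = A.shape[1]
    x = np.linalg.lstsq(A, b, rcond=None)[0]  # initial full fit
    support = np.ones(n, dtype=bool)
    for _ in range(max_iter):
        new_support = np.abs(x) >= threshold
        if np.array_equal(new_support, support):
            break  # support is stable, so the iteration has converged
        support = new_support
        x = np.zeros(n)
        if support.any():
            # Partial least-squares fit restricted to the active columns.
            x[support] = np.linalg.lstsq(A[:, support], b, rcond=None)[0]
    return x

# Noiseless example with a 3-sparse coefficient vector.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
x_true = np.zeros(10)
x_true[[1, 4, 7]] = [2.0, -1.5, 1.0]
x_hat = stlsq(A, A @ x_true, threshold=0.5)
```

In this noiseless, well-conditioned example the first fit already identifies the correct support, which is a simple instance of the one-step recovery the abstract mentions.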

Posted Content
TL;DR: This work characterizes the limit points of two basic first order methods, namely Gradient Descent/Ascent (GDA) and Optimistic Gradient Descent Ascent (OGDA), and shows that both dynamics avoid unstable critical points for almost all initializations.
Abstract: Motivated by applications in Optimization, Game Theory, and the training of Generative Adversarial Networks, the convergence properties of first order methods in min-max problems have received extensive study. It has been recognized that they may cycle, and there is no good understanding of their limit points when they do not. When they converge, do they converge to local min-max solutions? We characterize the limit points of two basic first order methods, namely Gradient Descent/Ascent (GDA) and Optimistic Gradient Descent Ascent (OGDA). We show that both dynamics avoid unstable critical points for almost all initializations. Moreover, for small step sizes and under mild assumptions, the set of OGDA-stable critical points is a superset of GDA-stable critical points, which is a superset of local min-max solutions (strict in some cases). The connecting thread is that the behavior of these dynamics can be studied from a dynamical systems perspective.
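To make the two update rules concrete, the sketch below runs GDA and OGDA on the bilinear toy problem $\min_x \max_y \, xy$, whose only critical point is the origin. The toy objective, the step size, and the function names are assumptions chosen for illustration and are not taken from the abstract. OGDA differs from GDA only by reusing the previous gradient as a correction term.

```python
import numpy as np

def direction(z):
    """Descent in x and ascent in y for f(x, y) = x * y (assumed toy problem)."""
    x, y = z
    return np.array([-y, x])  # (-df/dx, +df/dy)

def gda(z0, eta=0.1, n_steps=2000):
    z = np.array(z0, dtype=float)
    for _ in range(n_steps):
        z = z + eta * direction(z)
    return z

def ogda(z0, eta=0.1, n_steps=2000):
    z = np.array(z0, dtype=float)
    g_prev = direction(z)  # convention: the first step behaves like plain GDA
    for _ in range(n_steps):
        g = direction(z)
        z = z + 2 * eta * g - eta * g_prev  # optimistic correction
        g_prev = g
    return z

z0 = (1.0, 1.0)
z_gda = gda(z0)    # spirals away from the origin
z_ogda = ogda(z0)  # contracts toward the origin
```

On this example the GDA iterates grow in norm while the OGDA iterates shrink toward the critical point, which gives a small instance of the two methods having different stable sets.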

Posted Content
TL;DR: Focusing on a class of two-layer neural networks defined by smooth (but generally non-linear) activation functions, a notion of intrinsic dimension is identified and it is shown that it provides necessary and sufficient conditions for the absence of spurious valleys.
Abstract: Neural networks provide a rich class of high-dimensional, non-convex optimization problems. Despite their non-convexity, gradient-descent methods often successfully optimize these models. This has motivated a recent spur in research attempting to characterize properties of their loss surface that may explain such success. In this paper, we address this phenomenon by studying a key topological property of the loss: the presence or absence of spurious valleys, defined as connected components of sub-level sets that do not include a global minimum. Focusing on a class of two-layer neural networks defined by smooth (but generally non-linear) activation functions, we identify a notion of intrinsic dimension and show that it provides necessary and sufficient conditions for the absence of spurious valleys. More concretely, finite intrinsic dimension guarantees that for sufficiently overparametrised models no spurious valleys exist, independently of the data distribution. Conversely, infinite intrinsic dimension implies that spurious valleys do exist for certain data distributions, independently of model overparametrisation. Besides these positive and negative results, we show that, although spurious valleys may exist in general, they are confined to low risk levels and avoided with high probability on overparametrised models.

Posted Content
TL;DR: A general oracle-based framework is suggested that captures parallel stochastic optimization in different parallelization settings described by a dependency graph, and generic lower bounds are derived in terms of this graph.
Abstract: We suggest a general oracle-based framework that captures different parallel stochastic optimization settings described by a dependency graph, and derive generic lower bounds in terms of this graph. We then use the framework and derive lower bounds for several specific parallel optimization settings, including delayed updates and parallel processing with intermittent communication. We highlight gaps between lower and upper bounds on the oracle complexity, and cases where the "natural" algorithms are not known to be optimal.

Posted Content
TL;DR: It is argued that the PL condition provides a relevant and attractive setting for many machine learning problems, particularly in the over-parametrized regime.
Abstract: Large over-parametrized models learned via stochastic gradient descent (SGD) methods have become a key element in modern machine learning. Although SGD methods are very effective in practice, most theoretical analyses of SGD suggest slower convergence than what is empirically observed. In our recent work [8] we analyzed how interpolation, common in modern over-parametrized learning, results in exponential convergence of SGD with constant step size for convex loss functions. In this note, we extend those results to a much broader non-convex function class satisfying the Polyak-Lojasiewicz (PL) condition. A number of important non-convex problems in machine learning, including some classes of neural networks, have been recently shown to satisfy the PL condition. We argue that the PL condition provides a relevant and attractive setting for many machine learning problems, particularly in the over-parametrized regime.
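The interpolation regime can be illustrated with an over-parametrized least-squares problem, where there are more parameters than samples and the data can be fit exactly. The sketch below is an assumed toy setup, not an experiment from the note: the function name `sgd_least_squares`, the problem sizes, and the step size rule (inverse of the largest squared row norm) are illustrative choices. The least-squares loss here is convex but not strongly convex, and it is a standard example of a function satisfying the PL condition.

```python
import numpy as np

def sgd_least_squares(A, b, n_steps=5000, seed=0):
    """Constant-step SGD on 0.5 * mean((A w - b)^2), one sample per step."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    # Assumed step size: 1 / (largest per-sample smoothness constant).
    eta = 1.0 / np.max(np.sum(A ** 2, axis=1))
    w = np.zeros(d)
    for _ in range(n_steps):
        i = rng.integers(n)              # sample one data point
        residual = A[i] @ w - b[i]
        w -= eta * residual * A[i]       # stochastic gradient step
    return w

def loss(A, b, w):
    return 0.5 * np.mean((A @ w - b) ** 2)

# 20 samples, 100 parameters: the model can interpolate the data.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 100))
b = rng.standard_normal(20)
w = sgd_least_squares(A, b)
```

Because every sample can be fit exactly, the stochastic gradients vanish at the solution, so a constant step size does not leave a residual noise floor and the loss keeps decreasing geometrically.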

Posted Content
TL;DR: It is shown that the preferences have a large impact on the structure of the trades, but that one equilibrium (variational) is optimal, and the learning mechanism needed to reach an equilibrium state in the peer-to-peer market design is discussed together with privacy issues.
Abstract: We consider a network of prosumers involved in peer-to-peer energy exchanges, with differentiation price preferences on the trades with their neighbors, and we analyze two market designs: (i) a centralized market, used as a benchmark, where a global market operator optimizes the flows (trades) between the nodes, local demand and flexibility activation to maximize the system overall social welfare; (ii) a distributed peer-to-peer market design where prosumers in local energy communities optimize selfishly their trades, demand, and flexibility activation. We first characterize the solution of the peer-to-peer market as a Variational Equilibrium and prove that the set of Variational Equilibria coincides with the set of social welfare optimal solutions of market design (i). We give several results that help in understanding the structure of the trades at an equilibrium or at the optimum. We characterize the impact of preferences on the network line congestion and renewable energy waste under both designs. We provide a reduced example for which we give the set of all possible generalized equilibria, which enables us to give an approximation of the price of anarchy. We provide a more realistic example which relies on the IEEE 14-bus network, for which we can simulate the trades under different preference prices. Our analysis shows in particular that the preferences have a large impact on the structure of the trades, but that one equilibrium (variational) is optimal.