
Showing papers in "arXiv: Optimization and Control in 2016"


Posted Content
TL;DR: CVXPY allows the user to express convex optimization problems in a natural syntax that follows the math, rather than in the restrictive standard form required by solvers.
Abstract: CVXPY is a domain-specific language for convex optimization embedded in Python. It allows the user to express convex optimization problems in a natural syntax that follows the math, rather than in the restrictive standard form required by solvers. CVXPY makes it easy to combine convex optimization with high-level features of Python such as parallelism and object-oriented design. CVXPY is available at this http URL under the GPL license, along with documentation and examples.

1,215 citations


Posted Content
TL;DR: The paper argues that the set of distributions hedged against should be chosen to suit the application at hand, and that some choices that were popular until recently are, for many applications, not good ones.
Abstract: Distributionally robust stochastic optimization (DRSO) is an approach to optimization under uncertainty in which, instead of assuming that there is an underlying probability distribution that is known exactly, one hedges against a chosen set of distributions. In this paper, we consider sets of distributions that are within a chosen Wasserstein distance from a nominal distribution. We argue that such a choice of sets has two advantages: (1) The resulting distributions hedged against are more reasonable than those resulting from other popular choices of sets, such as $\Phi$-divergence ambiguity sets. (2) The problem of determining the worst-case expectation has desirable tractability properties. We derive a dual reformulation of the corresponding DRSO problem and construct approximate worst-case distributions (or an exact worst-case distribution if it exists) explicitly via the first-order optimality conditions of the dual problem. Our contributions are five-fold. (i) We identify necessary and sufficient conditions for the existence of a worst-case distribution, which is naturally related to the growth rate of the objective function. (ii) We show that the worst-case distributions resulting from an appropriate Wasserstein distance have a concise structure and a clear interpretation. (iii) Using this structure, we show that data-driven DRSO problems can be approximated to any accuracy by robust optimization problems, and thereby many DRSO problems become tractable by using tools from robust optimization. (iv) To the best of our knowledge, our proof of strong duality is the first constructive proof for DRSO problems, and we show that the constructive proof technique is also useful in other contexts. (v) Our strong duality result holds in a very general setting, and we show that it can be applied to infinite dimensional process control problems and worst-case value-at-risk analysis.

505 citations


Posted Content
TL;DR: In this paper, stochastic variance reduced gradient (SVRG) methods for nonconvex finite-sum problems were studied and the authors proved nonasymptotic rates of convergence to stationary points.
Abstract: We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.

391 citations
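
To make the SVRG update in the abstract above concrete, here is a minimal sketch. It is an illustration only: the toy objective is a convex sum of scalar quadratics (the paper's setting is nonconvex), and the step size, epoch length, and the names `svrg`, `c`, and `a` are assumptions chosen for this example rather than anything taken from the paper.

```python
import random


def svrg(grad_i, n, x0, step, epochs, inner_steps, seed=0):
    """Minimise (1/n) * sum_i f_i(x) over a scalar x with SVRG."""
    rng = random.Random(seed)
    snapshot = x0
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = sum(grad_i(i, snapshot) for i in range(n)) / n
        x = snapshot
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced estimate: unbiased for the full gradient at x,
            # with a correction term anchored at the snapshot.
            g = grad_i(i, x) - grad_i(i, snapshot) + full_grad
            x -= step * g
        snapshot = x  # the last inner iterate becomes the next snapshot
    return snapshot


# Hypothetical toy problem: f_i(x) = 0.5 * c[i] * (x - a[i])**2.
c = [1.0, 2.0, 0.5, 1.5]
a = [1.0, 2.0, 3.0, 6.0]
x_svrg = svrg(lambda i, x: c[i] * (x - a[i]), n=4, x0=0.0,
              step=0.1, epochs=40, inner_steps=20)
# Closed-form minimiser of the toy problem: the c-weighted mean of a.
x_star = sum(ci * ai for ci, ai in zip(c, a)) / sum(c)
```

On this toy problem `x_svrg` can be compared against `x_star`; the sketch says nothing about the nonconvex rates proved in the paper.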


Posted Content
TL;DR: In this article, the authors considered the problem of distributed optimization over time-varying graphs and proposed a distributed algorithm, referred to as DIGing, based on a combination of a distributed inexact gradient method and a gradient tracking technique.
Abstract: This paper considers the problem of distributed optimization over time-varying graphs. For the case of undirected graphs, we introduce a distributed algorithm, referred to as DIGing, based on a combination of a distributed inexact gradient method and a gradient tracking technique. The DIGing algorithm uses doubly stochastic mixing matrices and employs fixed step-sizes and, yet, drives all the agents' iterates to a global and consensual minimizer. When the graphs are directed, in which case the implementation of doubly stochastic mixing matrices is unrealistic, we construct an algorithm that incorporates the push-sum protocol into the DIGing structure, thus obtaining the Push-DIGing algorithm. Push-DIGing uses column stochastic matrices and fixed step-sizes, but it still converges to a global and consensual minimizer. Under the strong convexity assumption, we prove that the algorithms converge at R-linear (geometric) rates as long as the step-sizes do not exceed some upper bounds. We establish explicit estimates for the convergence rates. For undirected graphs, these estimates show that DIGing scales polynomially in the number of agents. We also provide some numerical experiments to demonstrate the efficacy of the proposed algorithms and to validate our theoretical findings.

364 citations
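
The combination of a mixing step with gradient tracking described in the abstract above can be sketched as follows. This is a simplified illustration: it uses a single fixed doubly stochastic matrix rather than time-varying graphs, scalar decision variables, and hypothetical local objectives f_i(x) = 0.5 (x - b_i)^2. The function name `diging`, the matrix `W`, and the step size are assumptions of this example.

```python
def diging(grads, W, x0, step, iters):
    """Gradient-tracking iteration for n agents with scalar states.

    x <- W x - step * y
    y <- W y + grad(x_new) - grad(x_old)
    """
    n = len(x0)

    def mix(v):
        # One round of neighbour averaging with the mixing matrix W.
        return [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]

    x = list(x0)
    g = [grads[i](x[i]) for i in range(n)]
    y = list(g)  # the tracker starts at the local gradients
    for _ in range(iters):
        wx, wy = mix(x), mix(y)
        x_new = [wx[i] - step * y[i] for i in range(n)]
        g_new = [grads[i](x_new[i]) for i in range(n)]
        # y tracks the network-wide average gradient.
        y = [wy[i] + g_new[i] - g[i] for i in range(n)]
        x, g = x_new, g_new
    return x


# Hypothetical local objectives f_i(x) = 0.5 * (x - b[i])**2.
b = [1.0, 2.0, 6.0]
local_grads = [lambda x, bi=bi: x - bi for bi in b]
# Symmetric doubly stochastic mixing matrix for a 3-node path graph.
W = [[0.50, 0.50, 0.00],
     [0.50, 0.25, 0.25],
     [0.00, 0.25, 0.75]]
x_agents = diging(local_grads, W, x0=[0.0, 0.0, 0.0], step=0.1, iters=300)
```

With a fixed step size, every entry of `x_agents` should approach the minimiser of the summed objective, which for this toy problem is the mean of `b`.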


Journal ArticleDOI
TL;DR: In this paper, the authors present a methodology that allows safety conditions (expressed as control barrier functions) to be unified with performance objectives (represented as control Lyapunov functions) in the context of real-time optimization-based controllers.
Abstract: Safety critical systems involve the tight coupling between potentially conflicting control objectives and safety constraints. As a means of creating a formal framework for controlling systems of this form, and with a view toward automotive applications, this paper develops a methodology that allows safety conditions -- expressed as control barrier functions -- to be unified with performance objectives -- expressed as control Lyapunov functions -- in the context of real-time optimization-based controllers. Safety conditions are specified in terms of forward invariance of a set, and are verified via two novel generalizations of barrier functions; in each case, the existence of a barrier function satisfying Lyapunov-like conditions implies forward invariance of the set, and the relationship between these two classes of barrier functions is characterized. In addition, each of these formulations yields a notion of control barrier function (CBF), providing inequality constraints in the control input that, when satisfied, again imply forward invariance of the set. Through these constructions, CBFs can naturally be unified with control Lyapunov functions (CLFs) in the context of a quadratic program (QP); this allows for the achievement of control objectives (represented by CLFs) subject to conditions on the admissible states of the system (represented by CBFs). The mediation of safety and performance through a QP is demonstrated on adaptive cruise control and lane keeping, two automotive control problems that present both safety and performance considerations coupled with actuator bounds.

348 citations
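
As a rough illustration of how a control barrier function turns a safety condition into an inequality on the control input, consider the sketch below. It is deliberately much simpler than the paper's setting: a scalar integrator x' = u, a barrier h(x) = x_max - x, no control Lyapunov function, and a one-dimensional QP whose solution has a closed form. The names `safe_input` and `simulate`, the gain `alpha`, and the Euler discretisation are assumptions of this example.

```python
def safe_input(x, u_des, x_max, alpha):
    """Solve min (u - u_des)^2 subject to the CBF condition.

    For x' = u and h(x) = x_max - x, the condition h' + alpha * h >= 0
    reads -u + alpha * (x_max - x) >= 0, i.e. u <= alpha * (x_max - x).
    A scalar QP with one upper bound is solved by clipping.
    """
    return min(u_des, alpha * (x_max - x))


def simulate(x0, u_des, x_max, alpha, dt, steps):
    """Forward-Euler rollout of the filtered closed loop."""
    xs = [x0]
    for _ in range(steps):
        u = safe_input(xs[-1], u_des, x_max, alpha)
        xs.append(xs[-1] + dt * u)
    return xs


# The desired input pushes towards the boundary; the filter slows it down.
traj = simulate(x0=0.0, u_des=1.0, x_max=1.0, alpha=2.0, dt=0.01, steps=1000)
```

In this rollout the state approaches `x_max` without crossing it, provided `alpha * dt <= 1` so that the Euler step does not overshoot.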


Book ChapterDOI
TL;DR: This chapter tackles the discrepancy between theory and practice, uncovers fundamental limits of a class of operator-splitting schemes, and shows that the relaxed Peaceman-Rachford splitting algorithm is nearly as fast as the proximal point algorithm in the ergodic sense and nearly as slow as the subgradient method in the nonergodic sense.
Abstract: Operator-splitting schemes are iterative algorithms for solving many types of numerical problems. A lot is known about these methods: they converge, and in many cases we know how quickly they converge. But when they are applied to optimization problems, there is a gap in our understanding: The theoretical speed of operator-splitting schemes is nearly always measured in the ergodic sense, but ergodic operator-splitting schemes are rarely used in practice. In this chapter, we tackle the discrepancy between theory and practice and uncover fundamental limits of a class of operator-splitting schemes. Our surprising conclusion is that the relaxed Peaceman-Rachford splitting algorithm, a version of the Alternating Direction Method of Multipliers (ADMM), is nearly as fast as the proximal point algorithm in the ergodic sense and nearly as slow as the subgradient method in the nonergodic sense. A large class of operator-splitting schemes extend from the relaxed Peaceman-Rachford splitting algorithm. Our results show that this class of operator-splitting schemes is also nearly as slow as the subgradient method. The tools we create in this chapter can also be used to prove nonergodic convergence rates of more general splitting schemes, so they are interesting in their own right.

264 citations


Posted Content
TL;DR: There is an equivalence between the technique of estimate sequences and a family of Lyapunov functions in both continuous and discrete time, which allows for a simple and unified analysis of many existing momentum algorithms.
Abstract: Momentum methods play a significant role in optimization. Examples include Nesterov's accelerated gradient method and the conditional gradient algorithm. Several momentum methods are provably optimal under standard oracle models, and all use a technique called estimate sequences to analyze their convergence properties. The technique of estimate sequences has long been considered difficult to understand, leading many researchers to generate alternative, "more intuitive" methods and analyses. We show there is an equivalence between the technique of estimate sequences and a family of Lyapunov functions in both continuous and discrete time. This connection allows us to develop a simple and unified analysis of many existing momentum algorithms, introduce several new algorithms, and strengthen the connection between algorithms and continuous-time dynamical systems.

223 citations


Posted Content
TL;DR: Nine new test cases in MATPOWER format are published, the largest of which is a pan-European fictitious data set that stems from the PEGASE project; MATLAB code to transform the data into standard mathematical optimization format is also provided.
Abstract: In this paper, we publish nine new test cases in MATPOWER format. Four test cases are of the French very high-voltage grid, generated by the offline platform of iTesla: part of the data was sampled. Four test cases are RTE snapshots of the full French very high-voltage and high-voltage grid that come from French SCADAs via the Convergence software. The ninth and largest test case is a pan-European fictitious data set that stems from the PEGASE project. It complements the four PEGASE test cases that we previously published in MATPOWER version 5.1 in March 2015. We also provide a MATLAB code to transform the data into standard mathematical optimization format. Computational results confirming the validity of the data are presented in this paper.

197 citations


Posted Content
TL;DR: In this article, distributed synchronous and asynchronous algorithms for information exchange and equilibrium computation over a networked system are studied, and the almost-sure convergence of the resulting sequences to the equilibrium point is established.
Abstract: We consider a class of Nash games, termed as aggregative games, being played over a networked system. In an aggregative game, a player's objective is a function of the aggregate of all the players' decisions. Every player maintains an estimate of this aggregate, and the players exchange this information with their local neighbors over a connected network. We study distributed synchronous and asynchronous algorithms for information exchange and equilibrium computation over such a network. Under standard conditions, we establish the almost-sure convergence of the obtained sequences to the equilibrium point. We also consider extensions of our schemes to aggregative games where the players' objectives are coupled through a more general form of aggregate function. Finally, we present numerical results that demonstrate the performance of the proposed schemes.

184 citations


Posted Content
TL;DR: In this paper, a notion of relative smoothness and relative strong convexity was developed relative to a user-specified reference function, which should be computationally tractable for algorithms, and it was shown that many differentiable convex functions are relatively smooth with respect to a relatively simple reference function.
Abstract: The usual approach to developing and analyzing first-order methods for smooth convex optimization assumes that the gradient of the objective function is uniformly smooth with some Lipschitz constant $L$. However, in many settings the differentiable convex function $f(\cdot)$ is not uniformly smooth -- for example in $D$-optimal design where $f(x):=-\ln \det(HXH^T)$, or even the univariate setting with $f(x) := -\ln(x) + x^2$. Herein we develop a notion of "relative smoothness" and relative strong convexity that is determined relative to a user-specified "reference function" $h(\cdot)$ (that should be computationally tractable for algorithms), and we show that many differentiable convex functions are relatively smooth with respect to a correspondingly fairly-simple reference function $h(\cdot)$. We extend two standard algorithms -- the primal gradient scheme and the dual averaging scheme -- to our new setting, with associated computational guarantees. We apply our new approach to develop a new first-order method for the $D$-optimal design problem, with associated computational complexity analysis. Some of our results have a certain overlap with the recent work \cite{bbt}.

182 citations


Journal ArticleDOI
TL;DR: In this article, a distributed model predictive control (DMPC) algorithm for heterogeneous vehicle platoons with unidirectional topologies and a priori unknown desired set point is presented.
Abstract: This paper presents a distributed model predictive control (DMPC) algorithm for heterogeneous vehicle platoons with unidirectional topologies and a priori unknown desired set point. The vehicles (or nodes) in a platoon are dynamically decoupled but constrained by spatial geometry. Each node is assigned a local open-loop optimal control problem only relying on the information of neighboring nodes, in which the cost function is designed by penalizing on the errors between predicted and assumed trajectories. Together with this penalization, an equality based terminal constraint is proposed to ensure stability, which enforces the terminal states of each node in the predictive horizon equal to the average of its neighboring states. By using the sum of local cost functions as a Lyapunov candidate, it is proved that asymptotic stability of such a DMPC can be achieved through an explicit sufficient condition on the weights of the cost functions. Simulations with passenger cars demonstrate the effectiveness of proposed DMPC.

Posted Content
TL;DR: This work considers the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite variance random error, and presents the first algorithm that achieves jointly the optimal prediction error rates for least-squares regression.
Abstract: We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite variance random error. We present the first algorithm that achieves jointly the optimal prediction error rates for least-squares regression, both in terms of forgetting of initial conditions in $O(1/n^2)$, and in terms of dependence on the noise and dimension $d$ of the problem, as $O(d/n)$. Our new algorithm is based on averaged accelerated regularized gradient descent, and may also be analyzed through finer assumptions on initial conditions and the Hessian matrix, leading to dimension-free quantities that may still be small while the "optimal" terms above are large. In order to characterize the tightness of these new bounds, we consider an application to non-parametric regression and use the known lower bounds on the statistical performance (without computational limits), which happen to match our bounds obtained from a single pass on the data and thus show optimality of our algorithm in a wide variety of particular trade-offs between bias and variance.

Posted Content
TL;DR: This work gives a simple proof that the Frank-Wolfe algorithm obtains a stationary point at a rate of $O(1/\sqrt{t})$ on non-convex objectives with a Lipschitz continuous gradient.
Abstract: We give a simple proof that the Frank-Wolfe algorithm obtains a stationary point at a rate of $O(1/\sqrt{t})$ on non-convex objectives with a Lipschitz continuous gradient. Our analysis is affine invariant and is the first, to the best of our knowledge, giving a similar rate to what was already proven for projected gradient methods (though on slightly different measures of stationarity).
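
For readers unfamiliar with the Frank-Wolfe iteration and the "Frank-Wolfe gap" commonly used as a stationarity measure, a minimal sketch over the probability simplex follows. The step rule 2/(t+2), the concave toy objective f(x) = -0.5 ||x||^2, and the name `frank_wolfe_simplex` are assumptions of this example and are not taken from the paper, whose step-size choice may differ.

```python
def frank_wolfe_simplex(grad, x0, iters):
    """Frank-Wolfe over the probability simplex.

    Returns the final iterate and the smallest Frank-Wolfe gap seen.
    """
    x = list(x0)
    best_gap = float("inf")
    for t in range(iters):
        g = grad(x)
        # Linear minimisation oracle: the best simplex vertex is the
        # coordinate with the smallest gradient entry.
        j = min(range(len(x)), key=lambda i: g[i])
        # Frank-Wolfe gap <g, x - e_j>; zero at stationary points.
        gap = sum(gi * xi for gi, xi in zip(g, x)) - g[j]
        best_gap = min(best_gap, gap)
        gamma = 2.0 / (t + 2.0)  # classic open-loop step size (assumed)
        x = [(1.0 - gamma) * xi for xi in x]
        x[j] += gamma
    return x, best_gap


# Non-convex (concave) toy objective f(x) = -0.5 * ||x||^2, gradient -x.
x_fw, gap_fw = frank_wolfe_simplex(lambda x: [-xi for xi in x],
                                   x0=[0.5, 0.3, 0.2], iters=20)
```

On this toy problem the iterates move to the vertex closest to the starting point, where the gap vanishes; the example illustrates the mechanics only, not the rate.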

Posted Content
TL;DR: In this paper, a distributed Nash equilibrium seeking strategy is proposed for non-cooperative games, where the players cannot directly observe the actions of the players who are not their neighbors, and instead, the players are supposed to be communicating with each other via an undirected and connected communication graph.
Abstract: In this paper, Nash equilibrium seeking among a network of players is considered. Different from many existing works on Nash equilibrium seeking in non-cooperative games, the players considered in this paper cannot directly observe the actions of the players who are not their neighbors. Instead, the players are supposed to be capable of communicating with each other via an undirected and connected communication graph. By a synthesis of a leader-following consensus protocol and the gradient play, a distributed Nash equilibrium seeking strategy is proposed for the non-cooperative games. Analytical analysis on the convergence of the players' actions to the Nash equilibrium is conducted via Lyapunov stability analysis. For games with non-quadratic payoffs, where multiple isolated Nash equilibria may coexist in the game, a local convergence result is derived under certain conditions. Then, a stronger condition is provided to derive a non-local convergence result for the non-quadratic games. For quadratic games, it is shown that the proposed seeking strategy enables the players' actions to converge to the Nash equilibrium globally under the given conditions. Numerical examples are provided to verify the effectiveness of the proposed seeking strategy.

Posted Content
TL;DR: This work explains the observed linear convergence intuitively by proving the equivalence of such an error bound to a natural quadratic growth condition and generalizes to linear convergence analysis for proximal methods for minimizing compositions of nonsmooth functions with smooth mappings.
Abstract: The proximal gradient algorithm for minimizing the sum of a smooth and a nonsmooth convex function often converges linearly even without strong convexity. One common reason is that a multiple of the step length at each iteration may linearly bound the "error" -- the distance to the solution set. We explain the observed linear convergence intuitively by proving the equivalence of such an error bound to a natural quadratic growth condition. Our approach generalizes to linear convergence analysis for proximal methods (of Gauss-Newton type) for minimizing compositions of nonsmooth functions with smooth mappings. We observe incidentally that short step-lengths in the algorithm indicate near-stationarity, suggesting a reliable termination criterion.
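
The termination idea mentioned at the end of the abstract above (stop when the step length is short) can be sketched on a small lasso-type problem. This is an illustration under stated assumptions: the objective 0.5 ||x - b||^2 + lam ||x||_1, the fixed step 0.5, and the names `soft_threshold` and `prox_grad_lasso` are chosen for this example only.

```python
import math


def soft_threshold(v, t):
    """Proximal operator of t * |.| applied to the scalar v."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def prox_grad_lasso(b, lam, step, tol, max_iters):
    """Proximal gradient for 0.5 * ||x - b||^2 + lam * ||x||_1.

    Stops as soon as the length of the last step drops below tol.
    """
    x = [0.0] * len(b)
    iters_used = 0
    for _ in range(max_iters):
        # Gradient step on the smooth part, then prox of the l1 term.
        x_new = [soft_threshold(xi - step * (xi - bi), step * lam)
                 for xi, bi in zip(x, b)]
        move = math.sqrt(sum((p - q) ** 2 for p, q in zip(x_new, x)))
        x = x_new
        iters_used += 1
        if move < tol:  # short step taken as a sign of near-stationarity
            break
    return x, iters_used


x_pg, n_iters = prox_grad_lasso(b=[3.0, -0.5, 1.5], lam=1.0,
                                step=0.5, tol=1e-10, max_iters=500)
```

For this separable problem the minimiser is the soft-thresholded vector `b`, so the result can be checked directly against `[2.0, 0.0, 0.5]`.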

Journal ArticleDOI
TL;DR: Under similar but more restrictive conditions, it is shown that a modified version of the power method converges to the global optimum, which is simpler and faster than convex approaches.
Abstract: We estimate $n$ phases (angles) from noisy pairwise relative phase measurements. The task is modeled as a nonconvex least-squares optimization problem. It was recently shown that this problem can be solved in polynomial time via convex relaxation, under some conditions on the noise. In this paper, under similar but more restrictive conditions, we show that a modified version of the power method converges to the global optimum. This is simpler and (empirically) faster than convex approaches. Empirically, they both succeed in the same regime. Further analysis shows that, in the same noise regime as previously studied, second-order necessary optimality conditions for this quadratically constrained quadratic program are also sufficient, despite nonconvexity.
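
One way to picture a power-method-style iteration for phase estimation is to multiply by the measurement matrix and then project every entry back onto the unit circle. The sketch below does exactly that. It is an assumption about what such a modification could look like, not a reproduction of the paper's algorithm, and it uses noiseless measurements C = z z^* so that the outcome is easy to check. The names `generalized_power_method` and `alignment` are hypothetical.

```python
import cmath


def generalized_power_method(C, x0, iters):
    """Power iteration followed by entrywise projection onto the unit circle."""
    n = len(x0)
    x = list(x0)
    for _ in range(iters):
        y = [sum(C[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Keep the previous entry if the product happens to be exactly zero.
        x = [yi / abs(yi) if abs(yi) > 0 else xi for yi, xi in zip(y, x)]
    return x


# Ground-truth phases and noiseless relative measurements C_ij = z_i * conj(z_j).
theta = [0.1, 1.0, 2.0, -0.7]
z = [cmath.exp(1j * t) for t in theta]
C = [[zi * zj.conjugate() for zj in z] for zi in z]

x_hat = generalized_power_method(C, [1 + 0j] * len(z), iters=10)
# Phases are only identifiable up to a global rotation, so compare via
# the normalised inner product; a value of 1 means perfect recovery.
alignment = abs(sum(zi.conjugate() * xi for zi, xi in zip(z, x_hat))) / len(z)
```

In the noiseless case a single iteration already recovers the phases up to a global rotation; the noisy regime analysed in the paper is not covered by this toy check.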

Posted Content
TL;DR: This work considers the fundamental problem in non-convex optimization of efficiently reaching a stationary point, and proposes a first-order minibatch stochastic method that converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$.
Abstract: We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.

Posted Content
TL;DR: An accelerated variant of the DANE algorithm, called AIDE, is proposed that not only matches the communication lower bounds but can also be implemented using a purely first-order oracle.
Abstract: In this paper, we present two new communication-efficient methods for distributed minimization of an average of functions. The first algorithm is an inexact variant of the DANE algorithm that allows any local algorithm to return an approximate solution to a local subproblem. We show that such a strategy does not affect the theoretical guarantees of DANE significantly. In fact, our approach can be viewed as a robustification strategy since the method is substantially better behaved than DANE on data partitions arising in practice. It is well known that the DANE algorithm does not match the communication complexity lower bounds. To bridge this gap, we propose an accelerated variant of the first method, called AIDE, that not only matches the communication lower bounds but can also be implemented using a purely first-order oracle. Our empirical results show that AIDE is superior to other communication-efficient algorithms in settings that naturally arise in machine learning applications.

Posted Content
TL;DR: In this article, the authors provide tight upper and lower bounds on the complexity of minimizing the average of $m$ convex functions using gradient and prox oracles of the component functions, and show that there is a significant gap between complexity of deterministic vs randomized optimization.
Abstract: We provide tight upper and lower bounds on the complexity of minimizing the average of $m$ convex functions using gradient and prox oracles of the component functions. We show a significant gap between the complexity of deterministic vs randomized optimization. For smooth functions, we show that accelerated gradient descent (AGD) and an accelerated variant of SVRG are optimal in the deterministic and randomized settings respectively, and that a gradient oracle is sufficient for the optimal rate. For non-smooth functions, having access to prox oracles reduces the complexity and we present optimal methods based on smoothing that improve over methods using just gradient accesses.

Posted Content
TL;DR: A new class of stochastic optimization algorithms is proposed to cope with large-scale problems routinely encountered in machine learning applications, based on entropic regularization of the primal OT problem, which results in a smooth dual optimization problem that can be addressed with algorithms that have provably faster convergence.
Abstract: Optimal transport (OT) defines a powerful framework to compare probability distributions in a geometrically faithful way. However, the practical impact of OT is still limited because of its computational burden. We propose a new class of stochastic optimization algorithms to cope with large-scale problems routinely encountered in machine learning applications. These methods are able to manipulate arbitrary distributions (either discrete or continuous) by simply requiring to be able to draw samples from them, which is the typical setup in high-dimensional learning problems. This alleviates the need to discretize these densities, while giving access to provably convergent methods that output the correct distance without discretization error. These algorithms rely on two main ideas: (a) the dual OT problem can be re-cast as the maximization of an expectation; (b) entropic regularization of the primal OT problem results in a smooth dual optimization problem which can be addressed with algorithms that have a provably faster convergence. We instantiate these ideas in three different setups: (i) when comparing a discrete distribution to another, we show that incremental stochastic optimization schemes can beat Sinkhorn's algorithm, the current state-of-the-art finite dimensional OT solver; (ii) when comparing a discrete distribution to a continuous density, a semi-discrete reformulation of the dual program is amenable to averaged stochastic gradient descent, leading to better performance than approximately solving the problem by discretization; (iii) when dealing with two continuous densities, we propose a stochastic gradient descent over a reproducing kernel Hilbert space (RKHS). This is currently the only known method to solve this problem, apart from computing OT on finite samples. We back up these claims on a set of discrete, semi-discrete and continuous benchmark problems.

Proceedings ArticleDOI
TL;DR: Several safeguarding techniques, namely virtual control and trust regions, are incorporated into the algorithm to add another layer of algorithmic robustness, and the convergence analysis is carried out in continuous time so that the results are independent of any numerical scheme used for discretization.
Abstract: This paper presents an algorithm to solve non-convex optimal control problems, where non-convexity can arise from nonlinear dynamics and from non-convex state and control constraints. Assuming that the state and control constraints are already convex or convexified, the proposed algorithm convexifies the nonlinear dynamics, via a linearization, in a successive manner. Thus at each succession, a convex optimal control subproblem is solved. Since the dynamics are linearized and other constraints are convex, after a discretization, the subproblem can be expressed as a finite dimensional convex programming subproblem. Since convex optimization problems can be solved very efficiently, especially with custom solvers, this subproblem can be solved in time-critical applications, such as real-time path planning for autonomous vehicles. Several safeguarding techniques are incorporated into the algorithm, namely virtual control and trust regions, which add another layer of algorithmic robustness. A convergence analysis is presented in a continuous-time setting. By doing so, our convergence results will be independent of any numerical schemes used for discretization. Numerical simulations are performed for an illustrative trajectory optimization example.

Posted Content
TL;DR: The method improves upon the complexity of gradient descent and provides the additional second-order guarantee that $\nabla^2 f(x) \succeq -O(\epsilon^{1/2})I$ for the computed $x$.
Abstract: We present an accelerated gradient method for non-convex optimization problems with Lipschitz continuous first and second derivatives. The method requires time $O(\epsilon^{-7/4} \log(1/ \epsilon) )$ to find an $\epsilon$-stationary point, meaning a point $x$ such that $\|\nabla f(x)\| \le \epsilon$. The method improves upon the $O(\epsilon^{-2} )$ complexity of gradient descent and provides the additional second-order guarantee that $\nabla^2 f(x) \succeq -O(\epsilon^{1/2})I$ for the computed $x$. Furthermore, our method is Hessian free, i.e. it only requires gradient computations, and is therefore suitable for large scale applications.

Posted Content
TL;DR: In this paper, a coarse-to-fine scaling algorithm for entropic transport-type problems has been proposed, which combines several modifications: a log-domain stabilized formulation, the well-known epsilon-scaling heuristic, an adaptive truncation of the kernel and a coarse to fine scheme.
Abstract: Scaling algorithms for entropic transport-type problems have become a very popular numerical method, encompassing Wasserstein barycenters, multi-marginal problems, gradient flows and unbalanced transport. However, a standard implementation of the scaling algorithm has several numerical limitations: the scaling factors diverge and convergence becomes impractically slow as the entropy regularization approaches zero. Moreover, handling the dense kernel matrix becomes unfeasible for large problems. To address this, we combine several modifications: A log-domain stabilized formulation, the well-known epsilon-scaling heuristic, an adaptive truncation of the kernel and a coarse-to-fine scheme. This permits the solution of larger problems with smaller regularization and negligible truncation error. A new convergence analysis of the Sinkhorn algorithm is developed, working towards a better understanding of epsilon-scaling. Numerical examples illustrate efficiency and versatility of the modified algorithm.
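
To show what a log-domain formulation of the scaling (Sinkhorn) iteration looks like, here is a minimal sketch that updates dual potentials with log-sum-exp instead of multiplying scaling factors. It covers only that one ingredient: the epsilon-scaling, kernel truncation, and coarse-to-fine steps mentioned in the abstract are not included, and the paper's stabilised scheme may be organised differently. The names `logsumexp` and `sinkhorn_log` and the toy data are assumptions of this example.

```python
import math


def logsumexp(vals):
    """Numerically stable log(sum(exp(v) for v in vals))."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))


def sinkhorn_log(a, b, C, eps, iters):
    """Entropic OT between histograms a and b with cost matrix C.

    Works on dual potentials f, g, so no scaling factor can overflow.
    Returns the transport plan P_ij = exp((f_i + g_j - C_ij) / eps).
    """
    n, m = len(a), len(b)
    f = [0.0] * n
    g = [0.0] * m
    for _ in range(iters):
        # Row update: enforce the first marginal a.
        f = [eps * math.log(a[i])
             - eps * logsumexp([(g[j] - C[i][j]) / eps for j in range(m)])
             for i in range(n)]
        # Column update: enforce the second marginal b.
        g = [eps * math.log(b[j])
             - eps * logsumexp([(f[i] - C[i][j]) / eps for i in range(n)])
             for j in range(m)]
    return [[math.exp((f[i] + g[j] - C[i][j]) / eps) for j in range(m)]
            for i in range(n)]


a_hist = [0.5, 0.5]
b_hist = [0.3, 0.7]
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn_log(a_hist, b_hist, cost, eps=0.5, iters=200)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(2)) for j in range(2)]
```

After enough iterations `row_sums` and `col_sums` should match the two histograms; as the abstract notes, convergence slows down as `eps` shrinks, which is the regime the paper's additional modifications target.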

Posted Content
TL;DR: Numerical tests on large-scale logistic regression problems reveal that the proposed novel limited-memory stochastic block BFGS update is more robust and substantially outperforms current state-of-the-art methods.
Abstract: We propose a novel limited-memory stochastic block BFGS update for incorporating enriched curvature information in stochastic approximation methods. In our method, the estimate of the inverse Hessian matrix that it maintains is updated at each iteration using a sketch of the Hessian, i.e., a randomly generated compressed form of the Hessian. We propose several sketching strategies, present a new quasi-Newton method that uses stochastic block BFGS updates combined with the variance reduction approach SVRG to compute batch stochastic gradients, and prove linear convergence of the resulting method. Numerical tests on large-scale logistic regression problems reveal that our method is more robust and substantially outperforms current state-of-the-art methods.

Posted Content
TL;DR: The Barzilai-Borwein (BB) method is used to automatically compute step sizes for SGD and its variant, the stochastic variance reduced gradient (SVRG) method, yielding two algorithms, SGD-BB and SVRG-BB, whose performance is comparable to that of SGD and SVRG with best-tuned step sizes and superior to some advanced SGD variants.
Abstract: One of the major issues in stochastic gradient descent (SGD) methods is how to choose an appropriate step size while running the algorithm. Since the traditional line search technique does not apply for stochastic optimization algorithms, the common practice in SGD is either to use a diminishing step size, or to tune a fixed step size by hand, which can be time consuming in practice. In this paper, we propose to use the Barzilai-Borwein (BB) method to automatically compute step sizes for SGD and its variant: stochastic variance reduced gradient (SVRG) method, which leads to two algorithms: SGD-BB and SVRG-BB. We prove that SVRG-BB converges linearly for strongly convex objective functions. As a by-product, we prove the linear convergence result of SVRG with Option I proposed in [10], whose convergence result is missing in the literature. Numerical experiments on standard data sets show that the performance of SGD-BB and SVRG-BB is comparable to and sometimes even better than SGD and SVRG with best-tuned step sizes, and is superior to some advanced SGD variants.
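The idea of computing the SVRG step size from a Barzilai-Borwein ratio can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the abstract does not give the formula, so the rule used here (the BB ratio of successive snapshots and their full gradients, divided by the inner-loop length `m`) is an assumption about how SVRG-BB is usually stated. The least-squares objective, the early-stopping tolerance `tol`, and all function names are hypothetical. The snapshot is taken as the last inner iterate (Option I).

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def grad_i(x, a, b):
    """Gradient of the component f_i(x) = 0.5 * (a.x - b)^2."""
    r = dot(a, x) - b
    return [r * ak for ak in a]


def full_grad(x, A, y):
    """Gradient of the average (1/n) * sum_i f_i(x)."""
    n = len(A)
    g = [0.0] * len(x)
    for a, b in zip(A, y):
        g = [gk + gik / n for gk, gik in zip(g, grad_i(x, a, b))]
    return g


def svrg_bb(A, y, x0, eta0, m, epochs, tol=1e-8, seed=0):
    """SVRG whose step size is recomputed each epoch from a BB ratio."""
    rng = random.Random(seed)
    x_tilde = list(x0)
    prev_x = prev_g = None
    eta = eta0  # used only until two snapshots are available
    for _ in range(epochs):
        g = full_grad(x_tilde, A, y)
        if dot(g, g) <= tol * tol:  # stop once the full gradient is tiny
            break
        if prev_x is not None:
            s = [a - b for a, b in zip(x_tilde, prev_x)]  # snapshot difference
            d = [a - b for a, b in zip(g, prev_g)]        # gradient difference
            sd = dot(s, d)
            if sd > 0:  # keep the previous step size if curvature is not positive
                eta = dot(s, s) / (m * sd)
        prev_x, prev_g = x_tilde, g
        x = list(x_tilde)
        for _ in range(m):
            i = rng.randrange(len(A))
            gi = grad_i(x, A[i], y[i])
            gi_snap = grad_i(x_tilde, A[i], y[i])
            # Variance-reduced gradient: grad_i(x) - grad_i(x_tilde) + full gradient.
            x = [xk - eta * (p - q + r) for xk, p, q, r in zip(x, gi, gi_snap, g)]
        x_tilde = x  # Option I: the last inner iterate becomes the next snapshot
    return x_tilde
```

A usage example on a small consistent system: `svrg_bb([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]], [1.0, 2.0, 3.0, -1.0], [0.0, 0.0], eta0=0.1, m=8, epochs=300)` should approach the solution `[1.0, 2.0]`. Only the initial step size `eta0` is chosen by hand; later step sizes come from the BB ratio.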

Posted Content
TL;DR: A variational, continuous-time framework for understanding accelerated methods is proposed and a systematic methodology for converting accelerated higher-order methods from continuous time to discrete time is provided, which illuminates a class of dynamics that may be useful for designing better algorithms for optimization.
Abstract: Accelerated gradient methods play a central role in optimization, achieving optimal rates in many settings. While many generalizations and extensions of Nesterov's original acceleration method have been proposed, it is not yet clear what is the natural scope of the acceleration concept. In this paper, we study accelerated methods from a continuous-time perspective. We show that there is a Lagrangian functional that we call the \emph{Bregman Lagrangian} which generates a large class of accelerated methods in continuous time, including (but not limited to) accelerated gradient descent, its non-Euclidean extension, and accelerated higher-order gradient methods. We show that the continuous-time limit of all of these methods corresponds to traveling the same curve in spacetime at different speeds. From this perspective, Nesterov's technique and many of its generalizations can be viewed as a systematic way to go from the continuous-time curves generated by the Bregman Lagrangian to a family of discrete-time accelerated algorithms.

Posted Content
TL;DR: The empirical results for optimizing deep neural networks demonstrate that the stochastic variant of Nesterov's accelerated gradient method achieves a good tradeoff (between speed of convergence in training error and robustness of convergence in testing error) among the three stochastic methods.
Abstract: Recently, {\it stochastic momentum} methods have been widely adopted in training deep neural networks. However, their convergence analysis is still underexplored at the moment, in particular for non-convex optimization. This paper fills the gap between practice and theory by developing a basic convergence analysis of two stochastic momentum methods, namely stochastic heavy-ball method and the stochastic variant of Nesterov's accelerated gradient method. We hope that the basic convergence results developed in this paper can serve as a reference for the convergence of stochastic momentum methods and also serve as baselines for comparison in future development of stochastic momentum methods. The novelty of convergence analysis presented in this paper is a unified framework, revealing more insights about the similarities and differences between different stochastic momentum methods and stochastic gradient method. The unified framework exhibits a continuous change from the gradient method to Nesterov's accelerated gradient method and finally the heavy-ball method incurred by a free parameter, which can help explain a similar change observed in the testing error convergence behavior for deep learning. Furthermore, our empirical results for optimizing deep neural networks demonstrate that the stochastic variant of Nesterov's accelerated gradient method achieves a good tradeoff (between speed of convergence in training error and robustness of convergence in testing error) among the three stochastic methods.

Posted Content
TL;DR: Riemannian SVRG (RSVRG) is introduced, a new variance reduced Riemannian optimization method for finite sums of geodesically smooth functions.
Abstract: We study optimization of finite sums of geodesically smooth functions on Riemannian manifolds. Although variance reduction techniques for optimizing finite-sums have witnessed tremendous attention in the recent years, existing work is limited to vector space problems. We introduce Riemannian SVRG (RSVRG), a new variance reduced Riemannian optimization method. We analyze RSVRG for both geodesically convex and nonconvex (smooth) functions. Our analysis reveals that RSVRG inherits advantages of the usual SVRG method, but with factors depending on curvature of the manifold that influence its convergence. To our knowledge, RSVRG is the first provably fast stochastic Riemannian method. Moreover, our paper presents the first non-asymptotic complexity analysis (novel even for the batch setting) for nonconvex Riemannian optimization. Our results have several implications; for instance, they offer a Riemannian perspective on variance reduced PCA, which promises a short, transparent convergence analysis.

Posted Content
TL;DR: A multistage adaptive robust optimization model for the unit commitment (UC) problem is presented, which models the sequential nature of the dispatch process and utilizes a new type of dynamic uncertainty sets to capture the temporal and spatial correlations of wind and solar power.
Abstract: The deep penetration of wind and solar power is a critical component of the future power grid. However, the intermittency and stochasticity of these renewable resources bring significant challenges to the reliable and economic operation of power systems. Motivated by these challenges, we present a multistage adaptive robust optimization model for the unit commitment (UC) problem, which models the sequential nature of the dispatch process and utilizes a new type of dynamic uncertainty sets to capture the temporal and spatial correlations of wind and solar power. The model also considers the operation of energy storage devices. We propose a simplified and effective affine policy for dispatch decisions, and develop an efficient algorithmic framework using a combination of constraint generation and duality based reformulation with various improvements. Extensive computational experiments show that the proposed method can efficiently solve multistage robust UC problems on the Polish 2736-bus system under high dimensional uncertainty of 60 wind farms and 30 solar farms. The computational results also suggest that the proposed model leads to significant benefits in both costs and reliability over robust models with traditional uncertainty sets as well as deterministic models with reserve rules.

Posted Content
TL;DR: In this paper, the authors propose randomized Newton-type algorithms that exploit non-uniform sub-sampling of the component Hessians, as well as inexact updates, as means to reduce the computational complexity.
Abstract: We consider the problem of finding the minimizer of a convex function $F: \mathbb R^d \rightarrow \mathbb R$ of the form $F(w) := \sum_{i=1}^n f_i(w) + R(w)$ where a low-rank factorization of $\nabla^2 f_i(w)$ is readily available. We consider the regime where $n \gg d$. As second-order methods prove to be effective in finding the minimizer to a high-precision, in this work, we propose randomized Newton-type algorithms that exploit \textit{non-uniform} sub-sampling of $\{\nabla^2 f_i(w)\}_{i=1}^{n}$, as well as inexact updates, as means to reduce the computational complexity. Two non-uniform sampling distributions based on {\it block norm squares} and {\it block partial leverage scores} are considered in order to capture important terms among $\{\nabla^2 f_i(w)\}_{i=1}^{n}$. We show that at each iteration non-uniformly sampling at most $\mathcal O(d \log d)$ terms from $\{\nabla^2 f_i(w)\}_{i=1}^{n}$ is sufficient to achieve a linear-quadratic convergence rate in $w$ when a suitable initial point is provided. In addition, we show that our algorithms achieve a lower computational complexity and exhibit more robustness and better dependence on problem specific quantities, such as the condition number, compared to similar existing methods, especially the ones based on uniform sampling. Finally, we empirically demonstrate that our methods are at least twice as fast as Newton's methods with ridge logistic regression on several real datasets.