Showing papers in "arXiv: Optimization and Control in 2020"


Posted Content
TL;DR: It is shown that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions.
Abstract: Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks. Towards understanding this phenomenon, we analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations. We show that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions. In the presence of hidden low-dimensional structures, the resulting margin is independent of the ambient dimension, which leads to strong generalization bounds. In contrast, training only the output layer implicitly solves a kernel support vector machine, which a priori does not enjoy such an adaptivity. Our analysis of training is non-quantitative in terms of running time but we prove computational guarantees in simplified settings by showing equivalences with online mirror descent. Finally, numerical experiments suggest that our analysis describes well the practical behavior of two-layer neural networks with ReLU activation and confirm the statistical benefits of this implicit bias.

197 citations


Posted Content
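TL;DR: This paper presents the first first-order algorithm for smooth strongly-convex-strongly-concave minimax problems with $\tilde{O}(\sqrt{\kappa_{\mathbf x}\kappa_{\mathbf y}})$ gradient complexity, matching the best existing lower bound up to logarithmic factors.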
Abstract: This paper resolves a longstanding open question pertaining to the design of near-optimal first-order algorithms for smooth and strongly-convex-strongly-concave minimax problems. Current state-of-the-art first-order algorithms find an approximate Nash equilibrium using $\tilde{O}(\kappa_{\mathbf x}+\kappa_{\mathbf y})$ or $\tilde{O}(\min\{\kappa_{\mathbf x}\sqrt{\kappa_{\mathbf y}}, \sqrt{\kappa_{\mathbf x}}\kappa_{\mathbf y}\})$ gradient evaluations, where $\kappa_{\mathbf x}$ and $\kappa_{\mathbf y}$ are the condition numbers for the strong-convexity and strong-concavity assumptions. A gap still remains between these results and the best existing lower bound $\tilde{\Omega}(\sqrt{\kappa_{\mathbf x}\kappa_{\mathbf y}})$. This paper presents the first algorithm with $\tilde{O}(\sqrt{\kappa_{\mathbf x}\kappa_{\mathbf y}})$ gradient complexity, matching the lower bound up to logarithmic factors. Our algorithm is designed based on an accelerated proximal point method and an accelerated solver for minimax proximal steps. It can be easily extended to the settings of strongly-convex-concave, convex-concave, nonconvex-strongly-concave, and nonconvex-concave functions. This paper also presents algorithms that match or outperform all existing methods in these settings in terms of gradient complexity, up to logarithmic factors.

143 citations


Posted Content
TL;DR: These are the first convergence rate results for using nonlinear TTSA algorithms on the concerned class of bilevel optimization problems and it is shown that a two-timescale actor-critic proximal policy optimization algorithm can be viewed as a special case of the framework.
Abstract: This paper analyzes a two-timescale stochastic algorithm framework for bilevel optimization. Bilevel optimization is a class of problems which exhibit a two-level structure, and its goal is to minimize an outer objective function with variables which are constrained to be the optimal solution to an (inner) optimization problem. We consider the case when the inner problem is unconstrained and strongly convex, while the outer problem is constrained and has a smooth objective function. We propose a two-timescale stochastic approximation (TTSA) algorithm for tackling such a bilevel problem. In the algorithm, a stochastic gradient update with a larger step size is used for the inner problem, while a projected stochastic gradient update with a smaller step size is used for the outer problem. We analyze the convergence rates for the TTSA algorithm under various settings: when the outer problem is strongly convex (resp.~weakly convex), the TTSA algorithm finds an $\mathcal{O}(K^{-2/3})$-optimal (resp.~$\mathcal{O}(K^{-2/5})$-stationary) solution, where $K$ is the total iteration number. As an application, we show that a two-timescale natural actor-critic proximal policy optimization algorithm can be viewed as a special case of our TTSA framework. Importantly, the natural actor-critic algorithm is shown to converge at a rate of $\mathcal{O}(K^{-1/4})$ in terms of the gap in expected discounted reward compared to a global optimal policy.

142 citations
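
The two-timescale update described in the abstract above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' algorithm as published: the quadratic inner and outer objectives, the coupling constant `A`, the noise level, and the step-size exponents are all assumptions chosen for illustration. The inner variable takes the larger step, the outer variable takes the smaller step followed by a projection.

```python
import numpy as np

A = 2.0  # hypothetical coupling constant: the inner solution is y*(x) = A * x


def ttsa_toy(num_iters=20000, noise_std=0.1, seed=0):
    """Two-timescale stochastic approximation on a toy bilevel problem.

    Inner problem (unconstrained, strongly convex): min_y 0.5 * (y - A*x)**2
    Outer problem (x constrained to [-1, 1]):       min_x 0.5 * (y*(x) - 1)**2 + 0.5 * x**2
    """
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    for k in range(num_iters):
        beta = 0.5 / (k + 1) ** 0.4   # larger inner step size (assumed schedule)
        alpha = 0.5 / (k + 1) ** 0.6  # smaller outer step size (assumed schedule)

        # Inner update: noisy gradient step on the inner objective in y.
        grad_y_inner = (y - A * x) + noise_std * rng.standard_normal()
        y = y - beta * grad_y_inner

        # Outer update: noisy surrogate hypergradient, using the current y
        # in place of the exact inner solution y*(x). Here dy*/dx = A.
        hypergrad = x + A * (y - 1.0) + noise_std * rng.standard_normal()
        x = float(np.clip(x - alpha * hypergrad, -1.0, 1.0))  # projection onto [-1, 1]
    return x, y
```

For this toy problem the exact solution is x = A / (1 + A^2) = 0.4 and y = 0.8, so the returned iterates can be compared against those values.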


Posted Content
TL;DR: This work proposes a novel Lagrange multiplier update method that utilizes derivatives of the constraint function, and introduces a new method to ease controller tuning by providing invariance to the relative numerical scales of reward and cost.
Abstract: Lagrangian methods are widely used algorithms for constrained optimization problems, but their learning dynamics exhibit oscillations and overshoot which, when applied to safe reinforcement learning, lead to constraint-violating behavior during agent training. We address this shortcoming by proposing a novel Lagrange multiplier update method that utilizes derivatives of the constraint function. We take a controls perspective, wherein the traditional Lagrange multiplier update behaves as \emph{integral} control; our terms introduce \emph{proportional} and \emph{derivative} control, achieving favorable learning dynamics through damping and predictive measures. We apply our PID Lagrangian methods in deep RL, setting a new state of the art in Safety Gym, a safe RL benchmark. Lastly, we introduce a new method to ease controller tuning by providing invariance to the relative numerical scales of reward and cost. Our extensive experiments demonstrate improved performance and hyperparameter robustness, while our algorithms remain nearly as simple to derive and implement as the traditional Lagrangian approach.

110 citations
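
The controls view in the abstract above (integral control for the traditional multiplier update, plus proportional and derivative terms) can be sketched as a small controller class. This is an illustrative sketch under assumptions, not the paper's exact algorithm: the class name, the gain names `kp`, `ki`, `kd`, the one-sided derivative term, and the clipping of the integral at zero are all choices made here for illustration.

```python
class PIDLagrangian:
    """PID-style Lagrange multiplier for a constraint of the form cost <= cost_limit."""

    def __init__(self, kp, ki, kd, cost_limit):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit
        self.integral = 0.0
        self.prev_cost = None

    def update(self, cost):
        error = cost - self.cost_limit                   # current constraint violation
        self.integral = max(0.0, self.integral + error)  # integral term, kept non-negative
        # Derivative term: react only to increasing cost (assumed one-sided form).
        derivative = 0.0 if self.prev_cost is None else max(0.0, cost - self.prev_cost)
        self.prev_cost = cost
        # The multiplier is projected onto the non-negative reals.
        return max(0.0, self.kp * error + self.ki * self.integral + self.kd * derivative)
```

With `kp = kd = 0` the update reduces to projected gradient ascent on the multiplier with step size `ki`, which is the integral-only behavior the abstract attributes to the traditional Lagrangian method.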


Posted Content
TL;DR: A new variant of the recently introduced expected smoothness assumption which governs the behaviour of the second moment of the stochastic gradient is proposed and it is shown that this assumption is both more general and more reasonable than assumptions made in all prior work.
Abstract: Large-scale nonconvex optimization problems are ubiquitous in modern machine learning, and among practitioners interested in solving them, Stochastic Gradient Descent (SGD) reigns supreme. We revisit the analysis of SGD in the nonconvex setting and propose a new variant of the recently introduced expected smoothness assumption which governs the behaviour of the second moment of the stochastic gradient. We show that our assumption is both more general and more reasonable than assumptions made in all prior work. Moreover, our results yield the optimal $\mathcal{O}(\varepsilon^{-4})$ rate for finding a stationary point of nonconvex smooth functions, and recover the optimal $\mathcal{O}(\varepsilon^{-1})$ rate for finding a global solution if the Polyak-Łojasiewicz condition is satisfied. We compare against convergence rates under convexity and prove a theorem on the convergence of SGD under Quadratic Functional Growth and convexity, which might be of independent interest. Moreover, we perform our analysis in a framework which allows for a detailed study of the effects of a wide array of sampling strategies and minibatch sizes for finite-sum optimization problems. We corroborate our theoretical results with experiments on real and synthetic data.

89 citations


Posted Content
TL;DR: A new method is proposed to obtain feedback controllers of an unknown dynamical system directly from noisy input/state data. Built on a new matrix S-lemma, it yields non-conservative design methods for quadratic stabilization, H_2 and H_inf control in terms of data-based linear matrix inequalities, and enables control design from large data sets.
Abstract: We propose a new method to obtain feedback controllers of an unknown dynamical system directly from noisy input/state data. The key ingredient of our design is a new matrix S-lemma that will be proven in this paper. We provide both strict and non-strict versions of this S-lemma, that are of interest in their own right. Thereafter, we will apply these results to data-driven control. In particular, we will derive non-conservative design methods for quadratic stabilization, $\mathcal{H}_2$ and $\mathcal{H}_\infty$ control, all in terms of data-based linear matrix inequalities. In contrast to previous work, the dimensions of our decision variables are independent of the time horizon of the experiment. Our approach thus enables control design from large data sets.

83 citations


Posted Content
TL;DR: This work proposes and analyzes algorithms for distributionally robust optimization of convex losses with conditional value at risk (CVaR) and $\chi^2$ divergence uncertainty sets and proves that they require a number of gradient evaluations independent of training set size and number of parameters, making them suitable for large-scale applications.
Abstract: We propose and analyze algorithms for distributionally robust optimization of convex losses with conditional value at risk (CVaR) and $\chi^2$ divergence uncertainty sets. We prove that our algorithms require a number of gradient evaluations independent of training set size and number of parameters, making them suitable for large-scale applications. For $\chi^2$ uncertainty sets these are the first such guarantees in the literature, and for CVaR our guarantees scale linearly in the uncertainty level rather than quadratically as in previous work. We also provide lower bounds proving the worst-case optimality of our algorithms for CVaR and a penalized version of the $\chi^2$ problem. Our primary technical contributions are novel bounds on the bias of batch robust risk estimation and the variance of a multilevel Monte Carlo gradient estimator due to [Blanchet & Glynn, 2015]. Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9--36 times more efficient than full-batch methods.

82 citations
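
The CVaR objective mentioned in the abstract above can be made concrete with a batch estimator. The sketch below uses one standard definition of empirical CVaR (the average of the worst `alpha`-fraction of losses, with fractional weight on the boundary sample); the function name and the reading of `alpha` as the uncertainty level are assumptions here, and the paper's gradient estimators are not reproduced.

```python
import numpy as np


def batch_cvar(losses, alpha):
    """Empirical CVaR at level alpha in (0, 1]: mean of the worst alpha-fraction of losses.

    Equivalent to maximizing sum_i q_i * loss_i over weights q on the simplex
    subject to q_i <= 1 / (alpha * n).
    """
    losses = np.sort(np.asarray(losses, dtype=float))[::-1]  # largest loss first
    n = losses.size
    cap = 1.0 / (alpha * n)  # maximum weight any single sample may receive
    weights = np.zeros(n)
    remaining = 1.0
    for i in range(n):
        weights[i] = min(cap, remaining)  # greedily put mass on the largest losses
        remaining -= weights[i]
        if remaining <= 0.0:
            break
    return float(weights @ losses)
```

For example, with losses `[1, 2, 3, 4]` and `alpha = 0.5` the estimator averages the two largest losses, while `alpha = 1` recovers the ordinary mean.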


Posted Content
TL;DR: It is proved that the proposed stochastic Polyak step-size (SPS) enables SGD to converge to the true solution at a fast rate without requiring the knowledge of any problem-dependent constants or additional computational overhead.
Abstract: We propose a stochastic variant of the classical Polyak step-size (Polyak, 1987) commonly used in the subgradient method. Although computing the Polyak step-size requires knowledge of the optimal function values, this information is readily available for typical modern machine learning applications. Consequently, the proposed stochastic Polyak step-size (SPS) is an attractive choice for setting the learning rate for stochastic gradient descent (SGD). We provide theoretical convergence guarantees for SGD equipped with SPS in different settings, including strongly convex, convex and non-convex functions. Furthermore, our analysis results in novel convergence guarantees for SGD with a constant step-size. We show that SPS is particularly effective when training over-parameterized models capable of interpolating the training data. In this setting, we prove that SPS enables SGD to converge to the true solution at a fast rate without requiring the knowledge of any problem-dependent constants or additional computational overhead. We experimentally validate our theoretical results via extensive experiments on synthetic and real datasets. We demonstrate the strong performance of SGD with SPS compared to state-of-the-art optimization methods when training over-parameterized models.

81 citations
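
The step-size rule described in the abstract above can be sketched for a least-squares problem in the interpolation regime. This is a minimal illustration under assumptions: the per-sample losses, the constant `c`, and the cap `gamma_max` are choices made here, and each per-sample optimal value is taken to be zero because the linear system is assumed to be exactly solvable.

```python
import numpy as np


def sgd_sps(A, b, num_iters=2000, c=0.5, gamma_max=10.0, seed=0):
    """SGD with a stochastic Polyak step-size on f_i(x) = 0.5 * (a_i @ x - b_i)**2.

    Assumes interpolation (A @ x = b has a solution), so every f_i^* equals 0.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(num_iters):
        i = rng.integers(n)
        residual = A[i] @ x - b[i]
        loss_i = 0.5 * residual ** 2
        grad_i = residual * A[i]
        grad_norm_sq = grad_i @ grad_i
        if grad_norm_sq == 0.0:
            continue  # this sample is already fit exactly; no step needed
        # Polyak step: (f_i(x) - f_i^*) / (c * ||grad f_i(x)||^2), capped at gamma_max.
        step = min((loss_i - 0.0) / (c * grad_norm_sq), gamma_max)
        x = x - step * grad_i
    return x
```

No Lipschitz or strong-convexity constant is supplied anywhere: the step is computed from the sampled loss and gradient alone. With `c = 0.5` on this particular loss the step equals `1 / ||a_i||^2`, so each update projects onto the hyperplane of the sampled equation.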


Posted Content
TL;DR: Numerical experiments show that HPIPM reliably solves challenging QPs, and that it outperforms other state-of-the-art solvers in speed.
Abstract: This paper introduces HPIPM, a high-performance framework for quadratic programming (QP), designed to provide building blocks to efficiently and reliably solve model predictive control problems. HPIPM currently supports three QP types, and provides interior point method (IPM) solvers as well as (partial) condensing routines. In particular, the IPM for optimal control QPs is intended to supersede the HPMPC solver, and it largely improves robustness while keeping the focus on speed. Numerical experiments show that HPIPM reliably solves challenging QPs, and that it outperforms other state-of-the-art solvers in speed.

79 citations


Posted Content
TL;DR: The proposed algorithm outperforms or matches the performance of several recently proposed schemes while, arguably, being more transparent, easier to implement, and converging with respect to a stronger criterion.
Abstract: We propose an efficient algorithm for finding first-order Nash equilibria in min-max problems of the form $\min_{x \in X}\max_{y \in Y} F(x,y)$, where the objective function is smooth in both variables and concave with respect to $y$; the sets $X$ and $Y$ are convex and "projection-friendly", and $Y$ is compact. Our goal is to find an $(\varepsilon_x,\varepsilon_y)$-accurate first-order Nash equilibrium with respect to a stationarity criterion that is stronger than the commonly used proximal gradient norm. The proposed approach is fairly simple: we perform approximate proximal-point iterations on the primal function, with inexact oracle provided by Nesterov's algorithm run on the regularized function $F(x_t,\cdot)$ with $O(\varepsilon_y)$ regularization term, $x_t$ being the current primal iterate. The resulting iteration complexity is $O({\varepsilon_x}^{-2} \, {\varepsilon_y}^{-1/2})$ up to a logarithmic factor. As a byproduct, in the regime $\varepsilon_y = O(\varepsilon_x^2)$ our algorithm gives $O({\varepsilon_x}^{-3})$ complexity for finding $\varepsilon_x$-stationary point of the natural Moreau envelope of the primal function. Moreover, when $F(x,\cdot)$ is strongly concave, the complexity bound improves to $O({\varepsilon_x}^{-2}{\kappa_y}^{1/2})$ up to a logarithmic factor, where $\kappa_y$ is the appropriate condition number. In both scenarios, our algorithm outperforms or matches the performance (in terms of convergence rate) of several recently proposed schemes while, arguably, being more transparent, easier to implement, and converging with respect to a stronger criterion. Finally, we extend the approach to non-Euclidean proximal geometries.

77 citations


Posted Content
TL;DR: A new class of structured nonconvex-nonconcave min-max optimization problems is introduced, together with a generalization of the extragradient algorithm that provably converges to a stationary point; its iteration complexity and sample complexity bounds either match or improve the best known bounds.
Abstract: The use of min-max optimization in adversarial training of deep neural network classifiers and training of generative adversarial networks has motivated the study of nonconvex-nonconcave optimization objectives, which frequently arise in these applications. Unfortunately, recent results have established that even approximate first-order stationary points of such objectives are intractable, even under smoothness conditions, motivating the study of min-max objectives with additional structure. We introduce a new class of structured nonconvex-nonconcave min-max optimization problems, proposing a generalization of the extragradient algorithm which provably converges to a stationary point. The algorithm applies not only to Euclidean spaces, but also to general $\ell_p$-normed finite-dimensional real vector spaces. We also discuss its stability under stochastic oracles and provide bounds on its sample complexity. Our iteration complexity and sample complexity bounds either match or improve the best known bounds for the same or less general nonconvex-nonconcave settings, such as those that satisfy variational coherence or in which a weak solution to the associated variational inequality problem is assumed to exist.

Posted Content
TL;DR: An estimator based on Richardson extrapolation of the Sinkhorn divergence is proposed which enjoys improved statistical and computational efficiency guarantees, under a condition on the regularity of the approximation error, which is in particular satisfied for Gaussian densities.
Abstract: The squared Wasserstein distance is a natural quantity to compare probability distributions in a non-parametric setting. This quantity is usually estimated with the plug-in estimator, defined via a discrete optimal transport problem which can be solved to $\epsilon$-accuracy by adding an entropic regularization of order $\epsilon$ and using for instance Sinkhorn's algorithm. In this work, we propose instead to estimate it with the Sinkhorn divergence, which is also built on entropic regularization but includes debiasing terms. We show that, for smooth densities, this estimator has a comparable sample complexity but allows higher regularization levels, of order $\epsilon^{1/2}$, which leads to improved computational complexity bounds and a strong speedup in practice. Our theoretical analysis covers the case of both randomly sampled densities and deterministic discretizations on uniform grids. We also propose and analyze an estimator based on Richardson extrapolation of the Sinkhorn divergence which enjoys improved statistical and computational efficiency guarantees, under a condition on the regularity of the approximation error, which is in particular satisfied for Gaussian densities. We finally demonstrate the efficiency of the proposed estimators with numerical experiments.
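
The debiased estimator described in the abstract above can be sketched with a log-domain Sinkhorn iteration. The conventions below are assumptions made for illustration: the squared Euclidean cost without a factor 1/2, KL regularization relative to the product measure, a fixed iteration count, and the function names. The Richardson-extrapolated estimator is not reproduced.

```python
import numpy as np


def _logsumexp(M, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = M.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(M - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def entropic_ot(x, a, y, b, eps, num_iters=500):
    """Entropic OT value between sum_i a_i delta_{x_i} and sum_j b_j delta_{y_j}.

    x and y are arrays of shape (n, d) and (m, d); a and b are positive weights.
    """
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)  # squared Euclidean cost
    f, g = np.zeros(len(a)), np.zeros(len(b))
    log_a, log_b = np.log(a), np.log(b)
    for _ in range(num_iters):
        # Alternating updates of the dual potentials (Sinkhorn in the log domain).
        f = -eps * _logsumexp((g[None, :] - C) / eps + log_b[None, :], axis=1)
        g = -eps * _logsumexp((f[:, None] - C) / eps + log_a[:, None], axis=0)
    return float(a @ f + b @ g)  # dual objective at the final potentials


def sinkhorn_divergence(x, a, y, b, eps):
    """Debiased quantity OT_eps(a, b) - 0.5 * OT_eps(a, a) - 0.5 * OT_eps(b, b)."""
    return (entropic_ot(x, a, y, b, eps)
            - 0.5 * entropic_ot(x, a, x, a, eps)
            - 0.5 * entropic_ot(y, b, y, b, eps))
```

The two subtracted self-transport terms are the debiasing terms: they make the divergence of a measure with itself vanish, which the plain entropic value does not.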

Posted Content
TL;DR: This work shows that for a subclass of nonconvex-nonconcave objectives satisfying a so-called two-sided Polyak-Łojasiewicz inequality, the alternating gradient descent ascent (AGDA) algorithm converges globally at a linear rate and the stochastic AGDA achieves a sublinear rate.
Abstract: Nonconvex minimax problems appear frequently in emerging machine learning applications, such as generative adversarial networks and adversarial learning. Simple algorithms such as the gradient descent ascent (GDA) are the common practice for solving these nonconvex games and receive lots of empirical success. Yet, it is known that these vanilla GDA algorithms with constant step size can potentially diverge even in the convex setting. In this work, we show that for a subclass of nonconvex-nonconcave objectives satisfying a so-called two-sided Polyak-Łojasiewicz inequality, the alternating gradient descent ascent (AGDA) algorithm converges globally at a linear rate and the stochastic AGDA achieves a sublinear rate. We further develop a variance reduced algorithm that attains a provably faster rate than AGDA when the problem has the finite-sum structure.
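
The contrast drawn in the abstract above, between alternating updates that converge linearly and vanilla updates that can diverge, can be seen on two scalar games. Both objectives below are hypothetical test functions chosen here: the first is strongly convex in x and strongly concave in y (and therefore satisfies a two-sided Polyak-Łojasiewicz inequality), the second is the bilinear game f(x, y) = xy. Step sizes are illustrative assumptions.

```python
def grad_x(x, y):
    # Hypothetical objective: f(x, y) = 0.5 * x**2 + 2 * x * y - 0.5 * y**2
    return x + 2.0 * y


def grad_y(x, y):
    return 2.0 * x - y


def agda(x, y, tau_x=0.1, tau_y=0.1, num_iters=300):
    """Alternating gradient descent ascent: y is updated using the new x."""
    for _ in range(num_iters):
        x = x - tau_x * grad_x(x, y)  # descent step in x
        y = y + tau_y * grad_y(x, y)  # ascent step in y, evaluated at the updated x
    return x, y


def simultaneous_gda_bilinear(x, y, tau=0.1, num_iters=300):
    """Simultaneous GDA on the bilinear game f(x, y) = x * y, for contrast."""
    for _ in range(num_iters):
        x, y = x - tau * y, y + tau * x  # both players use the old iterate
    return x, y
```

On the first objective the alternating iterates contract toward the saddle point at the origin at a geometric rate. On the bilinear game each simultaneous step multiplies the squared distance to the origin by 1 + tau^2, so the iterates spiral outward for any constant step size.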

Posted Content
TL;DR: This paper proposes the first accelerated compressed gradient descent (ACGD) methods and improves upon the existing non-accelerated rates and recovers the optimal rates of accelerated gradient descent as a special case when no compression is applied.
Abstract: Due to the high communication cost in distributed and federated learning problems, methods relying on compression of communicated messages are becoming increasingly popular. While in other contexts the best performing gradient-type methods invariably rely on some form of acceleration/momentum to reduce the number of iterations, there are no methods which combine the benefits of both gradient compression and acceleration. In this paper, we remedy this situation and propose the first accelerated compressed gradient descent (ACGD) methods. In the single machine regime, we prove that ACGD enjoys the rate $O\Big((1+\omega)\sqrt{\frac{L}{\mu}}\log \frac{1}{\epsilon}\Big)$ for $\mu$-strongly convex problems and $O\Big((1+\omega)\sqrt{\frac{L}{\epsilon}}\Big)$ for convex problems, respectively, where $\omega$ is the compression parameter. Our results improve upon the existing non-accelerated rates $O\Big((1+\omega)\frac{L}{\mu}\log \frac{1}{\epsilon}\Big)$ and $O\Big((1+\omega)\frac{L}{\epsilon}\Big)$, respectively, and recover the optimal rates of accelerated gradient descent as a special case when no compression ($\omega=0$) is applied. We further propose a distributed variant of ACGD (called ADIANA) and prove the convergence rate $\widetilde{O}\Big(\omega+\sqrt{\frac{L}{\mu}}+\sqrt{\big(\frac{\omega}{n}+\sqrt{\frac{\omega}{n}}\big)\frac{\omega L}{\mu}}\Big)$, where $n$ is the number of devices/workers and $\widetilde{O}$ hides the logarithmic factor $\log \frac{1}{\epsilon}$. This improves upon the previous best result $\widetilde{O}\Big(\omega + \frac{L}{\mu}+\frac{\omega L}{n\mu} \Big)$ achieved by the DIANA method of Mishchenko et al. (2019). Finally, we conduct several experiments on real-world datasets which corroborate our theoretical results and confirm the practical superiority of our accelerated methods.
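
The role of the compression parameter $\omega$ in the abstract above can be illustrated with a concrete compressor. The sketch below uses rand-k sparsification, which is an example chosen here and not necessarily the operator used in the paper. It assumes the common convention that a compressor $C$ is unbiased and satisfies $\mathbb{E}\|C(x)-x\|^2 \le \omega\|x\|^2$; for rand-k on a $d$-dimensional vector this holds with $\omega = d/k - 1$. The accelerated method itself is not reproduced.

```python
from itertools import combinations

import numpy as np


def rand_k(x, k, rng):
    """Rand-k sparsification: keep k random coordinates and rescale by d / k."""
    d = x.size
    out = np.zeros_like(x, dtype=float)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = x[idx] * (d / k)
    return out


def exact_moments(x, k):
    """Exact mean and variance of rand-k, found by enumerating every size-k mask."""
    d = x.size
    outputs = []
    for idx in combinations(range(d), k):
        c = np.zeros(d)
        c[list(idx)] = x[list(idx)] * (d / k)
        outputs.append(c)
    outputs = np.array(outputs)
    mean = outputs.mean(axis=0)                         # E[C(x)]
    variance = ((outputs - x) ** 2).sum(axis=1).mean()  # E||C(x) - x||^2
    return mean, variance
```

Setting `k = d` gives $\omega = 0$, the no-compression case in which the abstract says the accelerated gradient descent rates are recovered.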

Posted Content
TL;DR: In this article, the authors extend Willems' lemma to the situation where multiple (possibly short) system trajectories are given instead of a single long one, and introduce a notion of collective persistency of excitation.
Abstract: Willems et al.'s fundamental lemma asserts that all trajectories of a linear system can be obtained from a single given one, assuming that a persistency of excitation condition holds. This result has profound implications for system identification and data-driven control, and has seen a revival over the last few years. The purpose of this paper is to extend Willems' lemma to the situation where multiple (possibly short) system trajectories are given instead of a single long one. To this end, we introduce a notion of collective persistency of excitation. We will then show that all trajectories of a linear system can be obtained from a given finite number of trajectories, as long as these are collectively persistently exciting. We will demonstrate that this result enables the identification of linear systems from data sets with missing data samples. Additionally, we show that the result is of practical significance in data-driven control of unstable systems.
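
The idea of collective persistency of excitation in the abstract above can be illustrated with a rank test on stacked Hankel matrices. The sketch below reflects one reading of the notion, namely that the Hankel matrices of the individual input trajectories are placed side by side and the result must have full row rank. The function names are hypothetical, and the order required by the lemma for a given system is not stated in the abstract and is left as an input.

```python
import numpy as np


def hankel(u, depth):
    """Hankel matrix with `depth` block rows built from a signal u of shape (T, m)."""
    u = np.asarray(u, dtype=float).reshape(len(u), -1)
    T = u.shape[0]
    cols = T - depth + 1
    if cols < 1:
        return np.zeros((depth * u.shape[1], 0))  # trajectory too short to contribute
    return np.column_stack([u[j:j + depth].ravel() for j in range(cols)])


def collectively_pe(inputs, order):
    """True if the side-by-side Hankel matrices of all inputs have full row rank."""
    H = np.hstack([hankel(u, order) for u in inputs])
    return bool(H.shape[1] > 0 and np.linalg.matrix_rank(H) == H.shape[0])
```

For instance, two scalar input sequences of length 4 each yield only two Hankel columns at order 3, so neither can be persistently exciting of order 3 alone, but together they can supply the three independent columns needed.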

Book Chapter
TL;DR: The simplicity of gradient sampling as an extension of the steepest descent method for minimizing smooth objectives is emphasized and various enhancements that have been proposed to improve practical performance are provided.
Abstract: This article reviews the gradient sampling methodology for solving nonsmooth, nonconvex optimization problems. We state an intuitively straightforward gradient sampling algorithm and summarize its convergence properties. Throughout this discussion, we emphasize the simplicity of gradient sampling as an extension of the steepest descent method for minimizing smooth objectives. We provide an overview of various enhancements that have been proposed to improve practical performance, as well as an overview of several extensions that have been proposed in the literature, such as to solve constrained problems. We also clarify certain technical aspects of the analysis of gradient sampling algorithms, most notably related to the assumptions one needs to make about the set of points at which the objective is continuously differentiable. Finally, we discuss possible future research directions.

Posted Content
TL;DR: A general asynchronous Stochastic Approximation scheme featuring a weighted infinity-norm contractive operator is considered, and a bound on its finite-time convergence rate on a single trajectory is proved.
Abstract: We consider a general asynchronous Stochastic Approximation (SA) scheme featuring a weighted infinity-norm contractive operator, and prove a bound on its finite-time convergence rate on a single trajectory. Additionally, we specialize the result to asynchronous $Q$-learning. The resulting bound matches the sharpest available bound for synchronous $Q$-learning, and improves over previous known bounds for asynchronous $Q$-learning.

Posted Content
TL;DR: It is claimed that depending on the structure of the Hessian of the loss at the minimum, and the choices of the algorithm parameters $\eta$ and $b$, the SGD iterates will converge to a heavy-tailed stationary distribution, and these results are the first of their kind to rigorously characterize the empirically observed heavy-tailed behavior of SGD.
Abstract: In recent years, various notions of capacity and complexity have been proposed for characterizing the generalization properties of stochastic gradient descent (SGD) in deep learning. Some of the popular notions that correlate well with the performance on unseen data are (i) the `flatness' of the local minimum found by SGD, which is related to the eigenvalues of the Hessian, (ii) the ratio of the stepsize $\eta$ to the batch-size $b$, which essentially controls the magnitude of the stochastic gradient noise, and (iii) the `tail-index', which measures the heaviness of the tails of the network weights at convergence. In this paper, we argue that these three seemingly unrelated perspectives for generalization are deeply linked to each other. We claim that depending on the structure of the Hessian of the loss at the minimum, and the choices of the algorithm parameters $\eta$ and $b$, the SGD iterates will converge to a \emph{heavy-tailed} stationary distribution. We rigorously prove this claim in the setting of quadratic optimization: we show that even in a simple linear regression problem with independent and identically distributed data whose distribution has finite moments of all order, the iterates can be heavy-tailed with infinite variance. We further characterize the behavior of the tails with respect to algorithm parameters, the dimension, and the curvature. We then translate our results into insights about the behavior of SGD in deep learning. We support our theory with experiments conducted on synthetic data, fully connected, and convolutional neural networks.

Posted Content
TL;DR: The authors showed that SGDM converges as fast as SGD for smooth objectives under both strongly convex and nonconvex settings, and established the first convergence guarantee for the multistage setting.
Abstract: SGD with momentum (SGDM) has been widely applied in many machine learning tasks, and it is often applied with dynamic stepsizes and momentum weights tuned in a stagewise manner. Despite its empirical advantage over SGD, the role of momentum is still unclear in general since previous analyses on SGDM either provide worse convergence bounds than those of SGD, or assume Lipschitz or quadratic objectives, which fail to hold in practice. Furthermore, the role of dynamic parameters has not been addressed. In this work, we show that SGDM converges as fast as SGD for smooth objectives under both strongly convex and nonconvex settings. We also establish \textit{the first} convergence guarantee for the multistage setting, and show that the multistage strategy is beneficial for SGDM compared to using fixed parameters. Finally, we verify these theoretical claims by numerical experiments.

Journal Article
TL;DR: The min-max optimization problem, also known as the saddle point problem, is a classical optimization problem that is also studied in the context of zero-sum games.
Abstract: The min-max optimization problem, also known as the saddle point problem, is a classical optimization problem which is also studied in the context of zero-sum games. Given a class of objective functions, the goal is to find a value for the argument which leads to a small objective value even for the worst case function in the given class. Min-max optimization problems have recently become very popular in a wide range of signal and data processing applications such as fair beamforming, training generative adversarial networks (GANs), and robust machine learning, to name just a few. The overarching goal of this article is to provide a survey of recent advances for an important subclass of min-max problems, where the minimization and maximization problems can be non-convex and/or non-concave. In particular, we will first present a number of applications to showcase the importance of such min-max problems; then we discuss key theoretical challenges, and provide a selective review of some exciting recent theoretical and algorithmic advances in tackling non-convex min-max problems. Finally, we will point out open questions and future research directions.

Posted Content
TL;DR: This paper analyzes the trajectories of stochastic gradient descent to help understand the algorithm's convergence properties in non-convex problems and shows that the sequence of iterates generated by SGD remains bounded and converges with probability $1$ under a very broad range of step-size schedules.
Abstract: This paper analyzes the trajectories of stochastic gradient descent (SGD) to help understand the algorithm's convergence properties in non-convex problems. We first show that the sequence of iterates generated by SGD remains bounded and converges with probability $1$ under a very broad range of step-size schedules. Subsequently, going beyond existing positive probability guarantees, we show that SGD avoids strict saddle points/manifolds with probability $1$ for the entire spectrum of step-size policies considered. Finally, we prove that the algorithm's rate of convergence to Hurwicz minimizers is $\mathcal{O}(1/n^{p})$ if the method is employed with a $\Theta(1/n^p)$ step-size schedule. This provides an important guideline for tuning the algorithm's step-size as it suggests that a cool-down phase with a vanishing step-size could lead to faster convergence; we demonstrate this heuristic using ResNet architectures on CIFAR.

Book Chapter
TL;DR: In this survey, several aspects of a finite difference method used to approximate the previously mentioned system of PDEs are discussed, including convergence, variational aspects and algorithms for solving the resulting systems of nonlinear equations.
Abstract: The theory of mean field games aims at studying deterministic or stochastic differential games (Nash equilibria) as the number of agents tends to infinity. Since very few mean field games have explicit or semi-explicit solutions, numerical simulations play a crucial role in obtaining quantitative information from this class of models. They may lead to systems of evolutive partial differential equations coupling a backward Bellman equation and a forward Fokker–Planck equation. In the present survey, we focus on such systems. The forward-backward structure is an important feature of this system, which makes it necessary to design unusual strategies for mathematical analysis and numerical approximation. In this survey, several aspects of a finite difference method used to approximate the previously mentioned system of PDEs are discussed, including convergence, variational aspects and algorithms for solving the resulting systems of nonlinear equations. Finally, we discuss in detail two applications of mean field games to the study of crowd motion and to macroeconomics, a comparison with mean field type control, and present numerical simulations.

Posted Content
TL;DR: This paper characterizes the convergence properties of a wide class of zeroth-, first-, and (scalable) second-order methods in non-convex/non-concave problems and shows that these state-of-the-art min-max optimization algorithms may converge with arbitrarily high probability to attractors that are in no way min-max optimal or even stationary.
Abstract: Compared to minimization problems, the min-max landscape in machine learning applications is considerably more convoluted because of the existence of cycles and similar phenomena. Such oscillatory behaviors are well-understood in the convex-concave regime, and many algorithms are known to overcome them. In this paper, we go beyond the convex-concave setting and we characterize the convergence properties of a wide class of zeroth-, first-, and (scalable) second-order methods in non-convex/non-concave problems. In particular, we show that these state-of-the-art min-max optimization algorithms may converge with arbitrarily high probability to attractors that are in no way min-max optimal or even stationary. Spurious convergence phenomena of this type can arise even in two-dimensional problems, a fact which corroborates the empirical evidence surrounding the formidable difficulty of training GANs.

Posted Content
TL;DR: In this article, a decentralized control system with linear dynamics, quadratic cost, and Gaussian disturbances is considered, and it is shown that the optimal control law is unique, linear, and identical across all subsystems.
Abstract: A decentralized control system with linear dynamics, quadratic cost, and Gaussian disturbances is considered. The system consists of a finite number of subsystems whose dynamics and per-step cost function are coupled through their mean-field (empirical average). The system has mean-field sharing information structure, i.e., each controller observes the state of its local subsystem (either perfectly or with noise) and the mean-field. It is shown that the optimal control law is unique, linear, and identical across all subsystems. Moreover, the optimal gains are computed by solving two decoupled Riccati equations in the full observation model and by solving an additional filter Riccati equation in the noisy observation model. These Riccati equations do not depend on the number of subsystems. It is also shown that the optimal decentralized performance is the same as the optimal centralized performance. An example, motivated by smart grids, is presented to illustrate the result.
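
As a rough illustration of the "two decoupled Riccati equations" in the full observation model, the sketch below solves two scalar discrete-time Riccati equations by fixed-point iteration: one for a subsystem's deviation from the mean-field and one for the mean-field itself. The scalar setting, the infinite-horizon formulation, and every numerical parameter are assumptions made for illustration, not values from the paper. Neither equation involves the number of subsystems, which matches the statement in the abstract.

```python
def solve_scalar_riccati(a, b, q, r, iters=500):
    """Iterate p <- q + a^2 p - (a b p)^2 / (r + b^2 p) to a fixed point.

    Returns the cost-to-go coefficient p and the feedback gain k for the
    control law u = -k * x.
    """
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    gain = a * b * p / (r + b * b * p)
    return p, gain


# Hypothetical parameters: one pair (a, b, q, r) for the deviation of a
# subsystem from the mean-field, another for the mean-field dynamics.
p_dev, k_dev = solve_scalar_riccati(a=1.1, b=1.0, q=1.0, r=1.0)
p_mf, k_mf = solve_scalar_riccati(a=1.3, b=0.8, q=2.0, r=1.5)

# Assumed form of the resulting control law for subsystem i:
#   u_i = -k_dev * (x_i - x_bar) - k_mf * x_bar
print(k_dev, k_mf)
```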

Posted Content
TL;DR: This work proposes a new moment-SOS hierarchy, called CS-TSSOS, for solving large-scale sparse polynomial optimization problems, and obtains a two-level hierarchy of semidefinite programming relaxations with the crucial property of involving quasi block-diagonal matrices and the guarantee of convergence to the global optimum.
Abstract: This work proposes a new moment-SOS hierarchy, called CS-TSSOS, for solving large-scale sparse polynomial optimization problems. Its novelty is to exploit simultaneously correlative sparsity and term sparsity by combining advantages of two existing frameworks for sparse polynomial optimization. The former is due to Waki et al. while the latter was initially proposed by Wang et al. and later exploited in the TSSOS hierarchy. In doing so we obtain CS-TSSOS -- a two-level hierarchy of semidefinite programming relaxations with (i) the crucial property of involving quasi block-diagonal matrices and (ii) the guarantee of convergence to the global optimum. We demonstrate its efficiency on several large-scale instances of the celebrated Max-Cut problem and the important industrial optimal power flow problem, involving up to several thousand variables and tens of thousands of constraints.

Posted Content
TL;DR: To the best of the authors' knowledge, this is the first time that a simple and unified single-loop algorithm is developed for solving both nonconvex-(strongly) concave and (strongly) convex-nonconcave minimax problems.
Abstract: Much recent research effort has been directed to the development of efficient algorithms for solving minimax problems with theoretical convergence guarantees due to the relevance of these problems to a few emergent applications. In this paper, we propose a unified single-loop alternating gradient projection (AGP) algorithm for solving nonconvex-(strongly) concave and (strongly) convex-nonconcave minimax problems. AGP employs simple gradient projection steps for updating the primal and dual variables alternately at each iteration. We show that it can find an $\varepsilon$-stationary point of the objective function in $\mathcal{O}\left( \varepsilon ^{-2} \right)$ (resp. $\mathcal{O}\left( \varepsilon ^{-4} \right)$) iterations under the nonconvex-strongly concave (resp. nonconvex-concave) setting. Moreover, its gradient complexity to obtain an $\varepsilon$-stationary point of the objective function is bounded by $\mathcal{O}\left( \varepsilon ^{-2} \right)$ (resp., $\mathcal{O}\left( \varepsilon ^{-4} \right)$) under the strongly convex-nonconcave (resp., convex-nonconcave) setting. To the best of our knowledge, this is the first time that a simple and unified single-loop algorithm is developed for solving both nonconvex-(strongly) concave and (strongly) convex-nonconcave minimax problems. Moreover, the complexity results for solving the latter (strongly) convex-nonconcave minimax problems have never been obtained before in the literature.
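
The single-loop, alternating structure described in this abstract can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it uses plain projected gradient steps with a fixed step size and omits any regularization or step-size schedule the authors may use. The toy objective f(x, y) = 0.5 * x^2 + x * y - 0.5 * y^2 on the box [-1, 1]^2, the function names, and all parameters are assumptions; the objective is strongly convex-strongly concave, chosen only so that the saddle point (0, 0) is known in advance.

```python
def clip(value, low=-1.0, high=1.0):
    """Euclidean projection of a scalar onto the interval [low, high]."""
    return max(low, min(high, value))


def alternating_gradient_projection(x, y, step=0.1, iters=500):
    """One projected descent step in x, then one projected ascent step in y.

    The y update uses the freshly updated x, which is what makes the
    scheme alternating rather than simultaneous.
    """
    for _ in range(iters):
        x = clip(x - step * (x + y))  # df/dx = x + y
        y = clip(y + step * (x - y))  # df/dy = x - y, with the new x
    return x, y


x_star, y_star = alternating_gradient_projection(1.0, -1.0)
print(x_star, y_star)  # both values end up close to the saddle point (0, 0)
```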

Posted Content
TL;DR: This paper presents a theoretical convergence analysis of the continuous time Fictitious Play process and proves that the induced exploitability decreases at a rate of $O(\frac{1}{t})$.
Abstract: In this paper, we extend the analysis of the continuous time Fictitious Play learning algorithm to various finite state Mean Field Game settings (finite horizon, $\gamma$-discounted), allowing in particular for the introduction of an additional common noise. We first present a theoretical convergence analysis of the continuous time Fictitious Play process and prove that the induced exploitability decreases at a rate $O(\frac{1}{t})$. Such analysis emphasizes the use of exploitability as a relevant metric for evaluating the convergence towards a Nash equilibrium in the context of Mean Field Games. These theoretical contributions are supported by numerical experiments provided in either model-based or model-free settings. We thereby provide, for the first time, converging learning dynamics for Mean Field Games in the presence of common noise.

Posted Content
TL;DR: In this article, the authors briefly review the development of ranking-and-selection (R&S) in the past 70 years, especially the theoretical achievements and practical applications in the last 20 years.
Abstract: In this paper, we briefly review the development of ranking-and-selection (R&S) in the past 70 years, especially the theoretical achievements and practical applications in the last 20 years. Different from the frequentist and Bayesian classifications adopted by Kim and Nelson (2006b) and Chick (2006) in their review articles, we categorize existing R&S procedures into fixed-precision and fixed-budget procedures, as in Hunter and Nelson (2017). We show that these two categories of procedures essentially differ in the underlying methodological formulations, i.e., they are built on hypothesis testing and dynamic-programming, respectively. In light of this variation, we review in detail some well-known procedures in the literature and show how they fit into these two formulations. In addition, we discuss the use of R&S procedures in solving various practical problems and propose what we think are the important research questions in the field.

Posted Content
TL;DR: It is proved for the first time that a ring-road mixed traffic system with one CAV and multiple heterogeneous human-driven vehicles is not completely controllable, but is stabilizable under a very mild condition and an upper bound for reachable traffic velocity is derived via controlling the CAV.
Abstract: Connected and automated vehicles (CAVs) have a great potential to improve traffic efficiency in mixed traffic systems, which has been demonstrated by multiple numerical simulations and field experiments. However, some fundamental properties of mixed traffic flow, including controllability and stabilizability, have not been well understood. This paper analyzes the controllability of mixed traffic systems and designs a system-level optimal control strategy. Using the Popov-Belevitch-Hautus (PBH) criterion, we prove for the first time that a ring-road mixed traffic system with one CAV and multiple heterogeneous human-driven vehicles is not completely controllable, but is stabilizable under a very mild condition. Then, we formulate the design of a system-level control strategy for the CAV as a structured optimal control problem, where the CAV's communication ability is explicitly considered. Finally, we derive an upper bound for reachable traffic velocity via controlling the CAV. Extensive numerical experiments verify the effectiveness of our analytical results and the proposed control strategy. Our results validate the possibility of utilizing CAVs as mobile actuators to smooth traffic flow actively.
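
The Popov-Belevitch-Hautus (PBH) criterion used in this abstract can be illustrated on a toy continuous-time system dx/dt = A x + B u. The pair (A, B) is controllable if the matrix [lambda * I - A, B] has full row rank for every eigenvalue lambda of A, and stabilizable if this holds for every eigenvalue with non-negative real part. The sketch below is restricted to 2-by-2 systems so that it needs only the standard library; the helper names and the example matrices are assumptions and do not represent the paper's ring-road traffic model.

```python
import cmath


def rank(rows, tol=1e-9):
    """Rank of a small complex matrix via Gaussian elimination with pivoting."""
    m = [[complex(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rk = 0
    for col in range(n_cols):
        pivot = max(range(rk, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(rk + 1, n_rows):
            factor = m[r][col] / m[rk][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rk])]
        rk += 1
        if rk == n_rows:
            break
    return rk


def eigenvalues_2x2(A):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    trace = A[0][0] + A[1][1]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    disc = cmath.sqrt(trace * trace - 4 * det)
    return [(trace + disc) / 2, (trace - disc) / 2]


def pbh_test(A, B):
    """Return (controllable, stabilizable) for dx/dt = A x + B u, A being 2x2."""
    n = len(A)
    controllable, stabilizable = True, True
    for lam in eigenvalues_2x2(A):
        pbh_matrix = [
            [(lam if i == j else 0) - A[i][j] for j in range(n)] + list(B[i])
            for i in range(n)
        ]
        if rank(pbh_matrix) < n:
            controllable = False
            if lam.real >= 0:  # an uncontrollable mode that does not decay
                stabilizable = False
    return controllable, stabilizable


# The mode at -1 cannot be influenced by the input, but it decays on its own,
# so the system is not controllable yet still stabilizable.
print(pbh_test([[1.0, 0.0], [0.0, -1.0]], [[1.0], [0.0]]))
```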

Posted Content
TL;DR: This work introduces MathOptInterface, an abstract data structure for representing mathematical optimization problems based on combining pre-defined functions and sets that leads naturally to a general file format for mathematical optimization the authors call MathOptFormat.
Abstract: We introduce MathOptInterface, an abstract data structure for representing mathematical optimization problems based on combining pre-defined functions and sets. MathOptInterface is significantly more general than existing data structures in the literature, encompassing, for example, a spectrum of problem classes from integer programming with indicator constraints to bilinear semidefinite programming. We also outline an automated rewriting system between equivalent formulations of a constraint. MathOptInterface has been implemented in practice, forming the foundation of a recent rewrite of JuMP, an open-source algebraic modeling language in the Julia language. The regularity of the MathOptInterface representation leads naturally to a general file format for mathematical optimization we call MathOptFormat. In addition, the automated rewriting system provides modeling power to users while making it easy to connect new solvers to JuMP.