Showing papers in "Mathematical Programming in 2021"

PDF

Open Access

Journal Article•DOI•

Understanding the acceleration phenomenon via high-resolution differential equations

[...]

Bin Shi¹, Simon S. Du², Michael I. Jordan³, Weijie J. Su⁴•Institutions (4)

Chinese Academy of Sciences¹, University of Washington², University of California, Berkeley³, University of Pennsylvania⁴

06 Jul 2021-Mathematical Programming

TL;DR: An alternative limiting process that yields high-resolution ODEs permit a general Lyapunov function framework for the analysis of convergence in both continuous and discrete time and are more accurate surrogates for the underlying algorithms.

...read moreread less

Abstract: Gradient-based optimization algorithms can be studied from the perspective of limiting ordinary differential equations (ODEs). Motivated by the fact that existing ODEs do not distinguish between two fundamentally different algorithms—Nesterov’s accelerated gradient method for strongly convex functions (NAG-SC) and Polyak’s heavy-ball method—we study an alternative limiting process that yields high-resolution ODEs. We show that these ODEs permit a general Lyapunov function framework for the analysis of convergence in both continuous and discrete time. We also show that these ODEs are more accurate surrogates for the underlying algorithms; in particular, they not only distinguish between NAG-SC and Polyak’s heavy-ball method, but they allow the identification of a term that we refer to as “gradient correction” that is present in NAG-SC but not in the heavy-ball method and is responsible for the qualitative difference in convergence of the two methods. We also use the high-resolution ODE framework to study Nesterov’s accelerated gradient method for (non-strongly) convex functions, uncovering a hitherto unknown result—that NAG-C minimizes the squared gradient norm at an inverse cubic rate. Finally, by modifying the high-resolution ODE of NAG-C, we obtain a family of new optimization methods that are shown to maintain the accelerated convergence rates of NAG-C for smooth convex functions.

...read moreread less

148 citations

Journal Article•DOI•

Implementable tensor methods in unconstrained convex optimization.

[...]

Yurii Nesterov¹•Institutions (1)

Catholic University of Leuven¹

01 Jan 2021-Mathematical Programming

TL;DR: New tensor methods for unconstrained convex optimization, which solve at each iteration an auxiliary problem of minimizing convex multivariate polynomial, and an efficient technique for solving the auxiliary problem, based on the recently developed relative smoothness condition are developed.

...read moreread less

Abstract: In this paper we develop new tensor methods for unconstrained convex optimization, which solve at each iteration an auxiliary problem of minimizing convex multivariate polynomial. We analyze the simplest scheme, based on minimization of a regularized local model of the objective function, and its accelerated version obtained in the framework of estimating sequences. Their rates of convergence are compared with the worst-case lower complexity bounds for corresponding problem classes. Finally, for the third-order methods, we suggest an efficient technique for solving the auxiliary problem, which is based on the recently developed relative smoothness condition (Bauschke et al. in Math Oper Res 42:330–348, 2017; Lu et al. in SIOPT 28(1):333–354, 2018). With this elaboration, the third-order methods become implementable and very fast. The rate of convergence in terms of the function value for the accelerated third-order scheme reaches the level $$O\left( {1 \over k^4}\right) $$ , where k is the number of iterations. This is very close to the lower bound of the order $$O\left( {1 \over k^5}\right) $$ , which is also justified in this paper. At the same time, in many important cases the computational cost of one iteration of this method remains on the level typical for the second-order methods.

...read moreread less

131 citations

Journal Article•DOI•

Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems

[...]

Yuyuan Ouyang¹, Yangyang Xu²•Institutions (2)

Clemson University¹, Rensselaer Polytechnic Institute²

01 Jan 2021-Mathematical Programming

TL;DR: In this article, lower complexity bounds of first-order methods on large-scale saddle-point problems were derived for affinely constrained smooth convex optimization problems, where the iterates are in the linear span of past first order information.

...read moreread less

Abstract: On solving a convex-concave bilinear saddle-point problem (SPP), there have been many works studying the complexity results of first-order methods. These results are all about upper complexity bounds, which can determine at most how many iterations would guarantee a solution of desired accuracy. In this paper, we pursue the opposite direction by deriving lower complexity bounds of first-order methods on large-scale SPPs. Our results apply to the methods whose iterates are in the linear span of past first-order information, as well as more general methods that produce their iterates in an arbitrary manner based on first-order information. We first work on the affinely constrained smooth convex optimization that is a special case of SPP. Different from gradient method on unconstrained problems, we show that first-order methods on affinely constrained problems generally cannot be accelerated from the known convergence rate O(1 / t) to $$O(1/t^2)$$ , and in addition, O(1 / t) is optimal for convex problems. Moreover, we prove that for strongly convex problems, $$O(1/t^2)$$ is the best possible convergence rate, while it is known that gradient methods can have linear convergence on unconstrained problems. Then we extend these results to general SPPs. It turns out that our lower complexity bounds match with several established upper complexity bounds in the literature, and thus they are tight and indicate the optimality of several existing first-order methods.

...read moreread less

125 citations

Journal Article•DOI•

Distributed stochastic gradient tracking methods

[...]

Shi Pu¹, Angelia Nedic²•Institutions (2)

The Chinese University of Hong Kong¹, Arizona State University²

01 May 2021-Mathematical Programming

TL;DR: It is shown that when the network is well-connected, GSGT incurs lower communication cost than DSGT while maintaining a similar computational cost, which is a comparable performance to a centralized stochastic gradient algorithm.

...read moreread less

Abstract: In this paper, we study the problem of distributed multi-agent optimization over a network, where each agent possesses a local cost function that is smooth and strongly convex. The global objective is to find a common solution that minimizes the average of all cost functions. Assuming agents only have access to unbiased estimates of the gradients of their local cost functions, we consider a distributed stochastic gradient tracking method (DSGT) and a gossip-like stochastic gradient tracking method (GSGT). We show that, in expectation, the iterates generated by each agent are attracted to a neighborhood of the optimal solution, where they accumulate exponentially fast (under a constant stepsize choice). Under DSGT, the limiting (expected) error bounds on the distance of the iterates from the optimal solution decrease with the network size n, which is a comparable performance to a centralized stochastic gradient algorithm. Moreover, we show that when the network is well-connected, GSGT incurs lower communication cost than DSGT while maintaining a similar computational cost. Numerical example further demonstrates the effectiveness of the proposed methods.

...read moreread less

117 citations

Journal Article•DOI•

On distributionally robust chance constrained programs with Wasserstein distance

[...]

Weijun Xie¹•Institutions (1)

Virginia Tech¹

01 Mar 2021-Mathematical Programming

TL;DR: In this article, a distributionally robust chance constrained program (DRCCP) with Wasserstein ambiguity set is studied, where the uncertain constraints should be satisfied with a probability at least a given threshold for all the probability distributions of the uncertain parameters within a chosen WASSERstein distance from an empirical distribution.

...read moreread less

Abstract: This paper studies a distributionally robust chance constrained program (DRCCP) with Wasserstein ambiguity set, where the uncertain constraints should be satisfied with a probability at least a given threshold for all the probability distributions of the uncertain parameters within a chosen Wasserstein distance from an empirical distribution. In this work, we investigate equivalent reformulations and approximations of such problems. We first show that a DRCCP can be reformulated as a conditional value-at-risk constrained optimization problem, and thus admits tight inner and outer approximations. We also show that a DRCCP of bounded feasible region is mixed integer representable by introducing big-M coefficients and additional binary variables. For a DRCCP with pure binary decision variables, by exploring the submodular structure, we show that it admits a big-M free formulation, which can be solved by a branch and cut algorithm. Finally, we present a numerical study to illustrate the effectiveness of the proposed formulations.

...read moreread less

108 citations

Journal Article•DOI•

Lower bounds for finding stationary points II: first-order methods

[...]

Yair Carmon¹, John C. Duchi¹, Oliver Hinder¹, Aaron Sidford¹•Institutions (1)

Stanford University¹

01 Jan 2021-Mathematical Programming

TL;DR: It is proved that deterministic first-order methods, even applied to arbitrarily smooth functions, cannot achieve convergence rates in ϵ better than ϵ - 8 / 5, which is within the best known rate for such methods.

...read moreread less

Abstract: We establish lower bounds on the complexity of finding $$\epsilon $$ -stationary points of smooth, non-convex high-dimensional functions using first-order methods. We prove that deterministic first-order methods, even applied to arbitrarily smooth functions, cannot achieve convergence rates in $$\epsilon $$ better than $$\epsilon ^{-8/5}$$ , which is within $$\epsilon ^{-1/15}\log \frac{1}{\epsilon }$$ of the best known rate for such methods. Moreover, for functions with Lipschitz first and second derivatives, we prove that no deterministic first-order method can achieve convergence rates better than $$\epsilon ^{-12/7}$$ , while $$\epsilon ^{-2}$$ is a lower bound for functions with only Lipschitz gradient. For convex functions with Lipschitz gradient, accelerated gradient descent achieves a better rate, showing that finding stationary points is easier given convexity.

...read moreread less

86 citations

Journal Article•DOI•

Stochastic quasi-gradient methods: variance reduction via Jacobian sketching

[...]

Robert M. Gower¹, Peter Richtárik², Peter Richtárik³, Peter Richtárik⁴, Francis Bach⁵ - Show less +1 more•Institutions (5)

Télécom ParisTech¹, King Abdullah University of Science and Technology², University of Edinburgh³, Moscow Institute of Physics and Technology⁴, PSL Research University⁵

01 Jul 2021-Mathematical Programming

TL;DR: It is proved that for smooth and strongly convex functions, JacSketch converges linearly with a meaningful rate dictated by a single convergence theorem which applies to general sketches, and a refined convergence theorem applies to a smaller class of sketches, featuring a novel proof technique based on a stochastic Lyapunov function.

...read moreread less

Abstract: We develop a new family of variance reduced stochastic gradient descent methods for minimizing the average of a very large number of smooth functions. Our method—JacSketch—is motivated by novel developments in randomized numerical linear algebra, and operates by maintaining a stochastic estimate of a Jacobian matrix composed of the gradients of individual functions. In each iteration, JacSketch efficiently updates the Jacobian matrix by first obtaining a random linear measurement of the true Jacobian through (cheap) sketching, and then projecting the previous estimate onto the solution space of a linear matrix equation whose solutions are consistent with the measurement. The Jacobian estimate is then used to compute a variance-reduced unbiased estimator of the gradient. Our strategy is analogous to the way quasi-Newton methods maintain an estimate of the Hessian, and hence our method can be seen as a stochastic quasi-gradient method. Our method can also be seen as stochastic gradient descent applied to a controlled stochastic optimization reformulation of the original problem, where the control comes from the Jacobian estimates. We prove that for smooth and strongly convex functions, JacSketch converges linearly with a meaningful rate dictated by a single convergence theorem which applies to general sketches. We also provide a refined convergence theorem which applies to a smaller class of sketches, featuring a novel proof technique based on a stochastic Lyapunov function. This enables us to obtain sharper complexity results for variants of JacSketch with importance sampling. By specializing our general approach to specific sketching strategies, JacSketch reduces to the celebrated stochastic average gradient (SAGA) method, and its several existing and many new minibatch, reduced memory, and importance sampling variants. Our rate for SAGA with importance sampling is the current best-known rate for this method, resolving a conjecture by Schmidt et al. (Proceedings of the eighteenth international conference on artificial intelligence and statistics, AISTATS 2015, San Diego, California, 2015). The rates we obtain for minibatch SAGA are also superior to existing rates and are sufficiently tight as to show a decrease in total complexity as the minibatch size increases. Moreover, we obtain the first minibatch SAGA method with importance sampling.

...read moreread less

74 citations

Journal Article•DOI•

Riemannian proximal gradient methods

[...]

Wen Huang¹, Ke Wei²•Institutions (2)

Xiamen University¹, Fudan University²

09 Mar 2021-Mathematical Programming

TL;DR: In this article, a Riemannian proximal gradient method (RPG) and its accelerated variant (ARPG) were developed for similar problems but constrained on a manifold, and the global convergence of RPG was established under mild assumptions, while the O(1/k) was also derived for ARPG based on the notion of retraction convexity.

...read moreread less

Abstract: In the Euclidean setting the proximal gradient method and its accelerated variants are a class of efficient algorithms for optimization problems with decomposable objective. In this paper, we develop a Riemannian proximal gradient method (RPG) and its accelerated variant (ARPG) for similar problems but constrained on a manifold. The global convergence of RPG is established under mild assumptions, and the O(1/k) is also derived for RPG based on the notion of retraction convexity. If assuming the objective function obeys the Rimannian Kurdyka–Łojasiewicz (KL) property, it is further shown that the sequence generated by RPG converges to a single stationary point. As in the Euclidean setting, local convergence rate can be established if the objective function satisfies the Riemannian KL property with an exponent. Moreover, we show that the restriction of a semialgebraic function onto the Stiefel manifold satisfies the Riemannian KL property, which covers for example the well-known sparse PCA problem. Numerical experiments on random and synthetic data are conducted to test the performance of the proposed RPG and ARPG.

...read moreread less

53 citations

Journal Article•DOI•

A primal-dual interior-point algorithm for nonsymmetric exponential-cone optimization

[...]

Joachim Dahl, Erling D. Andersen

09 Mar 2021-Mathematical Programming

TL;DR: A novel higher-order search direction, similar in spirit to a Mehrotra corrector for symmetric cone algorithms, is proposed, resulting in a practical algorithm with good numerical performance, on level with standard symmetric cones algorithms.

...read moreread less

Abstract: A new primal-dual interior-point algorithm applicable to nonsymmetric conic optimization is proposed. It is a generalization of the famous algorithm suggested by Nesterov and Todd for the symmetric conic case, and uses primal-dual scalings for nonsymmetric cones proposed by Tuncel. We specialize Tuncel’s primal-dual scalings for the important case of 3 dimensional exponential-cones, resulting in a practical algorithm with good numerical performance, on level with standard symmetric cone (e.g., quadratic cone) algorithms. A significant contribution of the paper is a novel higher-order search direction, similar in spirit to a Mehrotra corrector for symmetric cone algorithms. To a large extent, the efficiency of our proposed algorithm can be attributed to this new corrector.

...read moreread less

47 citations

Journal Article•DOI•

Conservative set valued fields, automatic differentiation, stochastic gradient method and deep learning

[...]

Jérôme Bolte¹, Edouard Pauwels¹•Institutions (1)

University of Toulouse¹

01 Jul 2021-Mathematical Programming

TL;DR: In this article, the authors introduce generalized derivatives called conservative fields for which they develop a calculus and provide representation formulas for nonsmooth approaches with a flexible calculus, such as backpropagation algorithm in deep learning.

...read moreread less

Abstract: Modern problems in AI or in numerical analysis require nonsmooth approaches with a flexible calculus. We introduce generalized derivatives called conservative fields for which we develop a calculus and provide representation formulas. Functions having a conservative field are called path differentiable: convex, concave, Clarke regular and any semialgebraic Lipschitz continuous functions are path differentiable. Using Whitney stratification techniques for semialgebraic and definable sets, our model provides variational formulas for nonsmooth automatic differentiation oracles, as for instance the famous backpropagation algorithm in deep learning. Our differential model is applied to establish the convergence in values of nonsmooth stochastic gradient methods as they are implemented in practice.

...read moreread less

44 citations

Journal Article•DOI•

Iteration complexity of inexact augmented Lagrangian methods for constrained convex programming

[...]

Yangyang Xu¹•Institutions (1)

Rensselaer Polytechnic Institute¹

01 Jan 2021-Mathematical Programming

TL;DR: In this paper, the convergence rate of the inexact version of the augmented Lagrangian method was studied for general convex programs with both equality and inequality constraints, and the global convergence rate was established in terms of the number of gradient evaluations to obtain a primal and/or primal-dual solution with a specified accuracy.

...read moreread less

Abstract: Augmented Lagrangian method (ALM) has been popularly used for solving constrained optimization problems. Practically, subproblems for updating primal variables in the framework of ALM usually can only be solved inexactly. The convergence and local convergence speed of ALM have been extensively studied. However, the global convergence rate of the inexact ALM is still open for problems with nonlinear inequality constraints. In this paper, we work on general convex programs with both equality and inequality constraints. For these problems, we establish the global convergence rate of the inexact ALM and estimate its iteration complexity in terms of the number of gradient evaluations to produce a primal and/or primal-dual solution with a specified accuracy. We first establish an ergodic convergence rate result of the inexact ALM that uses constant penalty parameters or geometrically increasing penalty parameters. Based on the convergence rate result, we then apply Nesterov’s optimal first-order method on each primal subproblem and estimate the iteration complexity of the inexact ALM. We show that if the objective is convex, then $$O(\varepsilon ^{-1})$$ gradient evaluations are sufficient to guarantee a primal $$\varepsilon $$ -solution in terms of both primal objective and feasibility violation. If the objective is strongly convex, the result can be improved to $$O(\varepsilon ^{-\frac{1}{2}}|\log \varepsilon |)$$ . To produce a primal-dual $$\varepsilon $$ -solution, more gradient evaluations are needed for convex case, and the number is $$O(\varepsilon ^{-\frac{4}{3}})$$ , while for strongly convex case, the number is still $$O(\varepsilon ^{-\frac{1}{2}}|\log \varepsilon |)$$ . Finally, we establish a nonergodic convergence rate result of the inexact ALM that uses geometrically increasing penalty parameters. This result is established only for the primal problem. We show that the nonergodic iteration complexity result is in the same order as that for the ergodic result. Numerical experiments on quadratically constrained quadratic programming are conducted to compare the performance of the inexact ALM with different settings.

...read moreread less

Journal Article•DOI•

The sum-of-squares hierarchy on the sphere and applications in quantum information theory

[...]

Kun Fang¹, Hamza Fawzi¹•Institutions (1)

University of Cambridge¹

01 Nov 2021-Mathematical Programming

TL;DR: In this article, the authors considered the problem of maximizing a homogeneous polynomial on the unit sphere and its hierarchy of sum-of-squares relaxations, and obtained a quadratic improvement of the known convergence rate by Reznick and Doherty.

...read moreread less

Abstract: We consider the problem of maximizing a homogeneous polynomial on the unit sphere and its hierarchy of sum-of-squares relaxations. Exploiting the polynomial kernel technique, we obtain a quadratic improvement of the known convergence rate by Reznick and Doherty and Wehner. Specifically, we show that the rate of convergence is no worse than $$O(d^2/\ell ^2)$$ in the regime $$\ell = \Omega (d)$$ where $$\ell $$ is the level of the hierarchy and d the dimension, solving a problem left open in the recent paper by de Klerk and Laurent ( arXiv:1904.08828 ). Importantly, our analysis also works for matrix-valued polynomials on the sphere which has applications in quantum information for the Best Separable State problem. By exploiting the duality relation between sums of squares and the Doherty–Parrilo–Spedalieri hierarchy in quantum information theory, we show that our result generalizes to nonquadratic polynomials the convergence rates of Navascues, Owari and Plenio.

...read moreread less

Journal Article•DOI•

Near-optimal discrete optimization for experimental design: a regret minimization approach

[...]

Zeyuan Allen-Zhu¹, Yuanzhi Li², Aarti Singh², Yining Wang³•Institutions (3)

Microsoft¹, Carnegie Mellon University², University of Florida³

01 Mar 2021-Mathematical Programming

TL;DR: In this article, a polynomial-time regret minimization framework was proposed to achieve a 1 + ε-approximation with only O(p/ε 2 ) design points, for all the optimality criteria above.

...read moreread less

Abstract: The experimental design problem concerns the selection of k points from a potentially large design pool of p-dimensional vectors, so as to maximize the statistical efficiency regressed on the selected k design points. Statistical efficiency is measured by optimality criteria, including A(verage), D(eterminant), T(race), E(igen), V(ariance) and G-optimality. Except for the T-optimality, exact optimization is challenging, and for certain instances of D/E-optimality exact or even approximate optimization is proven to be NP-hard. We propose a polynomial-time regret minimization framework to achieve a $$(1+\varepsilon )$$ approximation with only $$O(p/\varepsilon ^2)$$ design points, for all the optimality criteria above. In contrast, to the best of our knowledge, before our work, no polynomial-time algorithm achieves $$(1+\varepsilon )$$ approximations for D/E/G-optimality, and the best poly-time algorithm achieving $$(1+\varepsilon )$$ -approximation for A/V-optimality requires $$k=\varOmega (p^2/\varepsilon )$$ design points.

...read moreread less

Journal Article•DOI•

Sparse optimization on measures with over-parameterized gradient descent

[...]

Lénaïc Chizat¹•Institutions (1)

Université Paris-Saclay¹

17 Mar 2021-Mathematical Programming

TL;DR: This work shows that this problem can be solved by discretizing the measure and running non-convex gradient descent on the positions and weights of the particles, which leads to a global optimization algorithm with a complexity scaling as $\log(1/\epsilon)$ in the desired accuracy $-d$ for convex methods.

...read moreread less

Abstract: Minimizing a convex function of a measure with a sparsity-inducing penalty is a typical problem arising, e.g., in sparse spikes deconvolution or two-layer neural networks training. We show that this problem can be solved by discretizing the measure and running non-convex gradient descent on the positions and weights of the particles. For measures on a d-dimensional manifold and under some non-degeneracy assumptions, this leads to a global optimization algorithm with a complexity scaling as $$\log (1/\epsilon )$$ in the desired accuracy $$\epsilon $$ , instead of $$\epsilon ^{-d}$$ for convex methods. The key theoretical tools are a local convergence analysis in Wasserstein space and an analysis of a perturbed mirror descent in the space of measures. Our bounds involve quantities that are exponential in d which is unavoidable under our assumptions.

...read moreread less

Journal Article•DOI•

Prophet secretary through blind strategies

[...]

José R. Correa¹, Raimundo Saona¹, Bruno Ziliotto²•Institutions (2)

University of Chile¹, Paris Dauphine University²

01 Nov 2021-Mathematical Programming

TL;DR: For the prophet secretary problem, the best known constant is 0.675 as mentioned in this paper, which was improved to 0.732 by Azar et al. in 2018, which is the first time that the prophet inequality has been improved.

...read moreread less

Abstract: In the classic prophet inequality, a well-known problem in optimal stopping theory, samples from independent random variables (possibly differently distributed) arrive online. A gambler who knows the distributions, but cannot see the future, must decide at each point in time whether to stop and pick the current sample or to continue and lose that sample forever. The goal of the gambler is to maximize the expected value of what she picks and the performance measure is the worst case ratio between the expected value the gambler gets and what a prophet that sees all the realizations in advance gets. In the late seventies, Krengel and Sucheston (Bull Am Math Soc 83(4):745–747, 1977), established that this worst case ratio is 0.5. A particularly interesting variant is the so-called prophet secretary problem, in which the only difference is that the samples arrive in a uniformly random order. For this variant several algorithms are known to achieve a constant of $$1-1/e \approx 0.632$$ and very recently this barrier was slightly improved by Azar et al. (in: Proceedings of the ACM conference on economics and computation, EC, 2018). In this paper we introduce a new type of multi-threshold strategy, called blind strategy. Such a strategy sets a nonincreasing sequence of thresholds that depends only on the distribution of the maximum of the random variables, and the gambler stops the first time a sample surpasses the threshold of the stage. Our main result shows that these strategies can achieve a constant of 0.669 for the prophet secretary problem, improving upon the best known result of Azar et al. (in: Proceedings of the ACM conference on economics and computation, EC, 2018), and even that of Beyhaghi et al. (Improved approximations for posted price and second price mechanisms. CoRR arXiv:1807.03435 , 2018) that works in the case in which the gambler can select the order of the samples. The crux of the result is a very precise analysis of the underlying stopping time distribution for the gambler’s strategy that is inspired by the theory of Schur-convex functions. We further prove that our family of blind strategies cannot lead to a constant better than 0.675. Finally we prove that no algorithm for the gambler can achieve a constant better than $$\sqrt{3}-1 \approx 0.732$$ , which also improves upon a recent result of Azar et al. (in: Proceedings of the ACM conference on economics and computation, EC, 2018). This implies that the upper bound on what the gambler can get in the prophet secretary problem is strictly lower than what she can get in the i.i.d. case. This constitutes the first separation between the prophet secretary problem and the i.i.d. prophet inequality.

...read moreread less

Journal Article•DOI•

Adaptive regularization with cubics on manifolds

[...]

Naman Agarwal¹, Nicolas Boumal¹, Brian Bullins², Coralia Cartis³•Institutions (3)

Princeton University¹, Toyota Technological Institute at Chicago², University of Oxford³

01 Jul 2021-Mathematical Programming

TL;DR: A generalization of ARC to optimization on Riemannian manifolds is studied, which generalizes the iteration complexity results to this richer framework and identifies appropriate manifold-specific assumptions that allow it to secure complexity guarantees both when using the exponential map and when using a general retraction.

...read moreread less

Abstract: Adaptive regularization with cubics (ARC) is an algorithm for unconstrained, non-convex optimization. Akin to the trust-region method, its iterations can be thought of as approximate, safe-guarded Newton steps. For cost functions with Lipschitz continuous Hessian, ARC has optimal iteration complexity, in the sense that it produces an iterate with gradient smaller than $$\varepsilon $$ in $$O(1/\varepsilon ^{1.5})$$ iterations. For the same price, it can also guarantee a Hessian with smallest eigenvalue larger than $$-\sqrt{\varepsilon }$$ . In this paper, we study a generalization of ARC to optimization on Riemannian manifolds. In particular, we generalize the iteration complexity results to this richer framework. Our central contribution lies in the identification of appropriate manifold-specific assumptions that allow us to secure these complexity guarantees both when using the exponential map and when using a general retraction. A substantial part of the paper is devoted to studying these assumptions—relevant beyond ARC—and providing user-friendly sufficient conditions for them. Numerical experiments are encouraging.

...read moreread less

Journal Article•DOI•

Optimal complexity and certification of Bregman first-order methods

[...]

Radu-Alexandru Dragomir¹, Radu-Alexandru Dragomir², Adrien B. Taylor², Alexandre d'Aspremont², Jérôme Bolte¹ - Show less +1 more•Institutions (2)

University of Toulouse¹, École Normale Supérieure²

21 Apr 2021-Mathematical Programming

TL;DR: It is shown how to constructively obtain the corresponding worst-case functions by extending the computer-assisted performance estimation framework of Drori and Teboulle to Bregman first-order methods, and to handle the classes of differentiable and strictly convex functions.

...read moreread less

Abstract: We provide a lower bound showing that the O(1/k) convergence rate of the NoLips method (a.k.a. Bregman Gradient or Mirror Descent) is optimal for the class of problems satisfying the relative smoothness assumption. This assumption appeared in the recent developments around the Bregman Gradient method, where acceleration remained an open issue. The main inspiration behind this lower bound stems from an extension of the performance estimation framework of Drori and Teboulle (Mathematical Programming, 2014) to Bregman first-order methods. This technique allows computing worst-case scenarios for NoLips in the context of relatively-smooth minimization. In particular, we used numerically generated worst-case examples as a basis for obtaining the general lower bound.

...read moreread less

Journal Article•DOI•

Sparse noncommutative polynomial optimization

[...]

Igor Klep¹, Victor Magron², Janez Povh¹•Institutions (2)

University of Ljubljana¹, Centre national de la recherche scientifique²

28 Jan 2021-Mathematical Programming

TL;DR: This article focuses on optimization of polynomials in noncommuting variables, while taking into account sparsity in the input data, and a converging hierarchy of semidefinite relaxations for eigenvalue and trace optimization is provided.

...read moreread less

Abstract: This article focuses on optimization of polynomials in noncommuting variables, while taking into account sparsity in the input data. A converging hierarchy of semidefinite relaxations for eigenvalue and trace optimization is provided. This hierarchy is a noncommutative analogue of results due to Lasserre (SIAM J Optim 17(3):822–843, 2006) and Waki et al. (SIAM J Optim 17(1):218–242, 2006). The Gelfand–Naimark–Segal construction is applied to extract optimizers if flatness and irreducibility conditions are satisfied. Among the main techniques used are amalgamation results from operator algebra. The theoretical results are utilized to compute lower bounds on minimal eigenvalue of noncommutative polynomials from the literature.

...read moreread less

Journal Article•DOI•

Distributionally robust chance-constrained programs with right-hand side uncertainty under Wasserstein ambiguity

[...]

Nam Ho-Nguyen¹, Fatma Kılınç-Karzan², Simge Küçükyavuz³, Dabeen Lee•Institutions (3)

University of Sydney¹, Carnegie Mellon University², Northwestern University³

04 Feb 2021-Mathematical Programming

TL;DR: An improved formulation of exact deterministic mixed-integer programming (MIP) reformulations of distributionally robust chance-constrained programs (DR-CCP) with random right-hand sides over Wasserstein ambiguity sets is considered and several hidden connections are revealed.

...read moreread less

Abstract: We consider exact deterministic mixed-integer programming (MIP) reformulations of distributionally robust chance-constrained programs (DR-CCP) with random right-hand sides over Wasserstein ambiguity sets. The existing MIP formulations are known to have weak continuous relaxation bounds, and, consequently, for hard instances with small radius, or with large problem sizes, the branch-and-bound based solution processes suffer from large optimality gaps even after hours of computation time. This significantly hinders the practical application of the DR-CCP paradigm. Motivated by these challenges, we conduct a polyhedral study to strengthen these formulations. We reveal several hidden connections between DR-CCP and its nominal counterpart (the sample average approximation), mixing sets, and robust 0–1 programming. By exploiting these connections in combination, we provide an improved formulation and two classes of valid inequalities for DR-CCP. We test the impact of our results on a stochastic transportation problem numerically. Our experiments demonstrate the effectiveness of our approach; in particular our improved formulation and proposed valid inequalities reduce the overall solution times remarkably. Moreover, this allows us to significantly scale up the problem sizes that can be handled in such DR-CCP formulations by reducing the solution times from hours to seconds.

...read moreread less

Journal Article•DOI•

Inexact accelerated high-order proximal-point methods

[...]

Yurii Nesterov¹•Institutions (1)

Catholic University of Leuven¹

11 Nov 2021-Mathematical Programming

TL;DR: A new framework of Bi-Level Unconstrained Minimization (BLUM) for development of accelerated methods in Convex Programming, and presents new methods with the exact auxiliary search procedure, which have the rate of convergence O(k^{−(3p+1)/2}), where p ≥ 1 is the order of the proximal operator.

...read moreread less

Abstract: In this paper, we present a new framework of bi-level unconstrained minimization for development of accelerated methods in Convex Programming. These methods use approximations of the high-order proximal points, which are solutions of some auxiliary parametric optimization problems. For computing these points, we can use different methods, and, in particular, the lower-order schemes. This opens a possibility for the latter methods to overpass traditional limits of the Complexity Theory. As an example, we obtain a new second-order method with the convergence rate $$O\left( k^{-4}\right) $$ , where k is the iteration counter. This rate is better than the maximal possible rate of convergence for this type of methods, as applied to functions with Lipschitz continuous Hessian. We also present new methods with the exact auxiliary search procedure, which have the rate of convergence $$O\left( k^{-(3p+1)/ 2}\right) $$ , where $$p \ge 1$$ is the order of the proximal operator. The auxiliary problem at each iteration of these schemes is convex.

...read moreread less

Journal Article•DOI•

Tikhonov regularization of a second order dynamical system with Hessian driven damping

[...]

Radu Ioan Boţ¹, Ernö Robert Csetnek¹, Szilárd László²•Institutions (2)

University of Vienna¹, Technical University of Cluj-Napoca²

01 Sep 2021-Mathematical Programming

TL;DR: In this paper, the authors investigate the asymptotic properties of the trajectories generated by a second-order dynamical system with Hessian driven damping and a Tikhonov regularization term in connection with the minimization of a smooth convex function in Hilbert spaces.

...read moreread less

Abstract: We investigate the asymptotic properties of the trajectories generated by a second-order dynamical system with Hessian driven damping and a Tikhonov regularization term in connection with the minimization of a smooth convex function in Hilbert spaces. We obtain fast convergence results for the function values along the trajectories. The Tikhonov regularization term enables the derivation of strong convergence results of the trajectory to the minimizer of the objective function of minimum norm.

...read moreread less

Journal Article•DOI•

Why Random Reshuffling Beats Stochastic Gradient Descent

[...]

Mert Gurbuzbalaban¹, Asuman Ozdaglar², Pablo A. Parrilo²•Institutions (2)

Rutgers University¹, Massachusetts Institute of Technology²

01 Mar 2021-Mathematical Programming

TL;DR: In this article, the convergence rate of the random reshuffling (RR) method was analyzed for quadratic component functions and it was shown that the expected distance of the iterates generated by RR with iterate averaging and a diminishing stepsize converges to zero at rate.

...read moreread less

Abstract: We analyze the convergence rate of the random reshuffling (RR) method, which is a randomized first-order incremental algorithm for minimizing a finite sum of convex component functions. RR proceeds in cycles, picking a uniformly random order (permutation) and processing the component functions one at a time according to this order, i.e., at each cycle, each component function is sampled without replacement from the collection. Though RR has been numerically observed to outperform its with-replacement counterpart stochastic gradient descent (SGD), characterization of its convergence rate has been a long standing open question. In this paper, we answer this question by providing various convergence rate results for RR and variants when the sum function is strongly convex. We first focus on quadratic component functions and show that the expected distance of the iterates generated by RR with stepsize $$\alpha _k=\varTheta (1/k^s)$$ for $$s\in (0,1]$$ converges to zero at rate $$\mathcal{O}(1/k^s)$$ (with $$s=1$$ requiring adjusting the stepsize to the strong convexity constant). Our main result shows that when the component functions are quadratics or smooth (with a Lipschitz assumption on the Hessian matrices), RR with iterate averaging and a diminishing stepsize $$\alpha _k=\varTheta (1/k^s)$$ for $$s\in (1/2,1)$$ converges at rate $$\varTheta (1/k^{2s})$$ with probability one in the suboptimality of the objective value, thus improving upon the $$\varOmega (1/k)$$ rate of SGD. Our analysis draws on the theory of Polyak–Ruppert averaging and relies on decoupling the dependent cycle gradient error into an independent term over cycles and another term dominated by $$\alpha _k^2$$ . This allows us to apply law of large numbers to an appropriately weighted version of the cycle gradient errors, where the weights depend on the stepsize. We also provide high probability convergence rate estimates that shows decay rate of different terms and allows us to propose a modification of RR with convergence rate $$\mathcal{O}(\frac{1}{k^2})$$ .

...read moreread less

Journal Article•DOI•

On the equivalence of inexact proximal ALM and ADMM for a class of convex composite programming

[...]

Liang Chen¹, Liang Chen², Xudong Li³, Defeng Sun¹, Kim-Chuan Toh⁴ - Show less +1 more•Institutions (4)

Hong Kong Polytechnic University¹, Hunan University², Fudan University³, National University of Singapore⁴

01 Jan 2021-Mathematical Programming

TL;DR: This paper shows that for a class of linearly constrained convex composite optimization problems, an (inexact) symmetric Gauss–Seidel based majorized multi-block proximal alternating direction method of multipliers (ADMM) is equivalent to an inexact proximal augmented Lagrangian method.

...read moreread less

Abstract: In this paper, we show that for a class of linearly constrained convex composite optimization problems, an (inexact) symmetric Gauss–Seidel based majorized multi-block proximal alternating direction method of multipliers (ADMM) is equivalent to an inexact proximal augmented Lagrangian method. This equivalence not only provides new perspectives for understanding some ADMM-type algorithms but also supplies meaningful guidelines on implementing them to achieve better computational efficiency. Even for the two-block case, a by-product of this equivalence is the convergence of the whole sequence generated by the classic ADMM with a step-length that exceeds the conventional upper bound of $$(1+\sqrt{5})/2$$ , if one part of the objective is linear. This is exactly the problem setting in which the very first convergence analysis of ADMM was conducted by Gabay and Mercier (Comput Math Appl 2(1):17–40, 1976), but, even under notably stronger assumptions, only the convergence of the primal sequence was known. A collection of illustrative examples are provided to demonstrate the breadth of applications for which our results can be used. Numerical experiments on solving a large number of linear and convex quadratic semidefinite programming problems are conducted to illustrate how the theoretical results established here can lead to improvements on the corresponding practical implementations.

...read moreread less

Journal Article•DOI•

Generalized stochastic Frank–Wolfe algorithm with stochastic “substitute” gradient for structured convex optimization

[...]

Haihao Lu¹, Robert M. Freund²•Institutions (2)

University of Chicago¹, Massachusetts Institute of Technology²

01 May 2021-Mathematical Programming

TL;DR: This work presents a new generalized stochastic Frank–Wolfe method which closes the gap in the dependence on the optimality tolerance, and introduces the notion of a “substitute gradient” that is a not-necessarily-unbiased estimate of the gradient.

...read moreread less

Abstract: The stochastic Frank–Wolfe method has recently attracted much general interest in the context of optimization for statistical and machine learning due to its ability to work with a more general feasible region. However, there has been a complexity gap in the dependence on the optimality tolerance $$\varepsilon $$ in the guaranteed convergence rate for stochastic Frank–Wolfe compared to its deterministic counterpart. In this work, we present a new generalized stochastic Frank–Wolfe method which closes this gap for the class of structured optimization problems encountered in statistical and machine learning characterized by empirical loss minimization with a certain type of “linear prediction” property (formally defined in the paper), which is typically present in loss minimization problems in practice. Our method also introduces the notion of a “substitute gradient” that is a not-necessarily-unbiased estimate of the gradient. We show that our new method is equivalent to a particular randomized coordinate mirror descent algorithm applied to the dual problem, which in turn provides a new interpretation of randomized dual coordinate descent in the primal space. Also, in the special case of a strongly convex regularizer our generalized stochastic Frank–Wolfe method (as well as the randomized dual coordinate descent method) exhibits linear convergence. Furthermore, we present computational experiments that indicate that our method outperforms other stochastic Frank–Wolfe methods for a sufficiently small optimality tolerance, consistent with the theory developed herein.

...read moreread less

Journal Article•DOI•

New characterizations of Hoffman constants for systems of linear constraints

[...]

Javier Peña¹, Juan C. Vera², Luis F. Zuluaga³•Institutions (3)

Carnegie Mellon University¹, Tilburg University², Lehigh University³

01 May 2021-Mathematical Programming

TL;DR: The characterization of the Hoffman constant is stated in terms of the largest of a canonical collection of easily computable Hoffman constants, which suggests new algorithmic procedures to compute Hoffman constants.

...read moreread less

Abstract: We give a characterization of the Hoffman constant of a system of linear constraints in $${{\mathbb {R}}}^n$$ relative to a reference polyhedron $$R\subseteq {{\mathbb {R}}}^n$$ . The reference polyhedron R represents constraints that are easy to satisfy such as box constraints. In the special case $$R = {{\mathbb {R}}}^n$$ , we obtain a novel characterization of the classical Hoffman constant. More precisely, suppose $$R\subseteq \mathbb {R}^n$$ is a reference polyhedron, $$A\in {{\mathbb {R}}}^{m\times n},$$ and $$A(R):=\{Ax: x\in R\}$$ . We characterize the sharpest constant $$H(A\vert R)$$ such that for all $$b \in A(R) + {{\mathbb {R}}}^m_+$$ and $$u\in R$$ $$\begin{aligned} {\mathrm {dist}}(u, P_{A}(b)\cap R) \le H(A\vert R) \cdot \Vert (Au-b)_+\Vert , \end{aligned}$$ where $$P_A(b) = \{x\in {{\mathbb {R}}}^n:Ax\le b\}$$ . Our characterization is stated in terms of the largest of a canonical collection of easily computable Hoffman constants. Our characterization in turn suggests new algorithmic procedures to compute Hoffman constants.

...read moreread less

Journal Article•DOI•

Mixed-integer optimal control problems with switching costs: a shortest path approach

[...]

Felix Bestehorn¹, Christoph Hansknecht¹, Christian Kirches¹, Paul Manns¹•Institutions (1)

Braunschweig University of Technology¹

01 Aug 2021-Mathematical Programming

TL;DR: This work reformulates the rounding problem into a shortest path problem on a parameterized family of directed acyclic graphs (DAGs) and provides a proof of a runtime bound on equidistant rounding grids.

...read moreread less

Abstract: We investigate an extension of Mixed-Integer Optimal Control Problems by adding switching costs, which enables the penalization of chattering and extends current modeling capabilities. The decomposition approach, consisting of solving a partial outer convexification to obtain a relaxed solution and using rounding schemes to obtain a discrete-valued control can still be applied, but the rounding turns out to be difficult in the presence of switching costs or switching constraints as the underlying problem is an Integer Program. We therefore reformulate the rounding problem into a shortest path problem on a parameterized family of directed acyclic graphs (DAGs). Solving the shortest path problem then allows to minimize switching costs and still maintain approximability with respect to the tunable DAG parameter $$\theta $$ . We provide a proof of a runtime bound on equidistant rounding grids, where the bound is linear in time discretization granularity and polynomial in $$\theta $$ . The efficacy of our approach is demonstrated by a comparison with an integer programming approach on a benchmark problem.

...read moreread less

Journal Article•DOI•

On the tightness of SDP relaxations of QCQPs

[...]

Alex Wang¹, Fatma Kılınç-Karzan¹•Institutions (1)

Carnegie Mellon University¹

07 Jan 2021-Mathematical Programming

TL;DR: In this paper, the authors study conditions under which the standard semidefinite program (SDP) relaxation of a quadratic program (QCQP) is tight.

...read moreread less

Abstract: Quadratically constrained quadratic programs (QCQPs) are a fundamental class of optimization problems well-known to be NP-hard in general. In this paper we study conditions under which the standard semidefinite program (SDP) relaxation of a QCQP is tight. We begin by outlining a general framework for proving such sufficient conditions. Then using this framework, we show that the SDP relaxation is tight whenever the quadratic eigenvalue multiplicity, a parameter capturing the amount of symmetry present in a given problem, is large enough. We present similar sufficient conditions under which the projected epigraph of the SDP gives the convex hull of the epigraph in the original QCQP. Our results also imply new sufficient conditions for the tightness (as well as convex hull exactness) of a second order cone program relaxation of simultaneously diagonalizable QCQPs.

...read moreread less

Journal Article•DOI•

Better and simpler error analysis of the Sinkhorn–Knopp algorithm for matrix scaling

[...]

Deeparnab Chakrabarty¹, Sanjeev Khanna²•Institutions (2)

Dartmouth College¹, University of Pennsylvania²

01 Jul 2021-Mathematical Programming

TL;DR: An elementary convergence analysis for the Sinkhorn–Knopp algorithm is presented that improves upon the previous best bound and establishes a new inequality, referred to as (KL vs -error), which is a strengthening of Pinsker’s inequality and may be of independent interest.

...read moreread less

Abstract: Given a non-negative $$n \times m$$ real matrix A, the matrix scaling problem is to determine if it is possible to scale the rows and columns so that each row and each column sums to a specified positive target values. The Sinkhorn–Knopp algorithm is a simple and classic procedure which alternately scales all rows and all columns to meet these targets. The focus of this paper is the worst-case theoretical analysis of this algorithm. We present an elementary convergence analysis for this algorithm that improves upon the previous best bound. In a nutshell, our approach is to show (i) a simple bound on the number of iterations needed so that the KL-divergence between the current row-sums and the target row-sums drops below a specified threshold $$\delta $$ , and (ii) then show that for a suitable choice of $$\delta $$ , whenever KL-divergence is below $$\delta $$ , then the $$\ell _1$$ -error or the $$\ell _2$$ -error is below $$\varepsilon $$ . The well-known Pinsker’s inequality immediately allows us to translate a bound on the KL divergence to a bound on $$\ell _1$$ -error. To bound the $$\ell _2$$ -error in terms of the KL-divergence, we establish a new inequality, referred to as (KL vs $$\ell _1/\ell _2$$ ). This inequality is a strengthening of Pinsker’s inequality and may be of independent interest.

...read moreread less

Journal Article•DOI•

Accelerated proximal point method for maximally monotone operators

[...]

Donghwan Kim¹•Institutions (1)

KAIST¹

23 Mar 2021-Mathematical Programming

TL;DR: In this article, an accelerated proximal point method for maximally monotone operators is proposed via the performance estimation problem approach, which includes various well-known convex optimization methods, such as the proximal method of multipliers, and thus the proposed acceleration has wide applications.

...read moreread less

Abstract: This paper proposes an accelerated proximal point method for maximally monotone operators. The proof is computer-assisted via the performance estimation problem approach. The proximal point method includes various well-known convex optimization methods, such as the proximal method of multipliers and the alternating direction method of multipliers, and thus the proposed acceleration has wide applications. Numerical experiments are presented to demonstrate the accelerating behaviors.

...read moreread less

Journal Article•DOI•

Sparse semidefinite programs with guaranteed near-linear time complexity via dualized clique tree conversion

[...]

Richard Zhang¹, Richard Zhang², Javad Lavaei¹•Institutions (2)

University of California, Berkeley¹, University of Illinois at Urbana–Champaign²

01 Jul 2021-Mathematical Programming

TL;DR: In this article, the authors show that the coupling due to overlap constraints is guaranteed to be sparse over dense blocks, with a block sparsity pattern that coincides with the adjacency matrix of a tree.

...read moreread less

Abstract: Clique tree conversion solves large-scale semidefinite programs by splitting an $$n\times n$$ matrix variable into up to n smaller matrix variables, each representing a principal submatrix of up to $$\omega \times \omega $$ . Its fundamental weakness is the need to introduce overlap constraints that enforce agreement between different matrix variables, because these can result in dense coupling. In this paper, we show that by dualizing the clique tree conversion, the coupling due to the overlap constraints is guaranteed to be sparse over dense blocks, with a block sparsity pattern that coincides with the adjacency matrix of a tree. We consider two classes of semidefinite programs with favorable sparsity patterns that encompass the MAXCUT and MAX k-CUT relaxations, the Lovasz Theta problem, and the AC optimal power flow relaxation. Assuming that $$\omega \ll n$$ , we prove that the per-iteration cost of an interior-point method is linear O(n) time and memory, so an $$\epsilon $$ -accurate and $$\epsilon $$ -feasible iterate is obtained after $$O(\sqrt{n}\log (1/\epsilon ))$$ iterations in near-linear $$O(n^{1.5}\log (1/\epsilon ))$$ time. We confirm our theoretical insights with numerical results on semidefinite programs as large as $$n=13{,}659$$ .

...read moreread less

Collapse