
Showing papers in "Mathematical Programming in 2019"


Journal ArticleDOI
TL;DR: An extension to SDDP—called stochastic dual dynamic integer programming (SDDiP)—for solving MSIP problems with binary state variables is proposed, and it is shown that, under fairly reasonable assumptions, an MSIP problem with general state variables can be approximated by one with binary state variables to desired precision with only a modest increase in problem size.
Abstract: Multistage stochastic integer programming (MSIP) combines the difficulty of uncertainty, dynamics, and non-convexity, and constitutes a class of extremely challenging problems. A common formulation for these problems is a dynamic programming formulation involving nested cost-to-go functions. In the linear setting, the cost-to-go functions are convex polyhedral, and decomposition algorithms, such as nested Benders’ decomposition and its stochastic variant, stochastic dual dynamic programming (SDDP), which proceed by iteratively approximating these functions by cuts or linear inequalities, have been established as effective approaches. However, it is difficult to directly adapt these algorithms to MSIP due to the nonconvexity of integer programming value functions. In this paper we propose an extension to SDDP—called stochastic dual dynamic integer programming (SDDiP)—for solving MSIP problems with binary state variables. The crucial component of the algorithm is a new reformulation of the subproblems in each stage, in which local copies of the state variables are introduced, and a new class of cuts, termed Lagrangian cuts, derived from a Lagrangian relaxation of this reformulation. We show that the Lagrangian cuts satisfy a tightness condition and provide a rigorous proof of the finite convergence of SDDiP with probability one. We show that, under fairly reasonable assumptions, an MSIP problem with general state variables can be approximated by one with binary state variables to desired precision with only a modest increase in problem size. Thus our proposed SDDiP approach is applicable to very general classes of MSIP problems. Extensive computational experiments on three classes of real-world problems, namely electric generation expansion, financial portfolio management, and network revenue management, show that the proposed methodology is very effective in solving large-scale multistage stochastic integer optimization problems.
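
To make the binarization step concrete, the sketch below approximates a bounded continuous state variable by a binary expansion on an eps-grid. This is a minimal illustration of the idea under stated assumptions; all names are hypothetical, not taken from the paper.

```python
import numpy as np

def binarize_state(x, upper, eps):
    """Approximate x in [0, upper] by eps * sum_k 2^k * z_k with binary z_k."""
    K = int(np.ceil(np.log2(upper / eps + 1)))   # number of binary variables needed
    q = int(round(x / eps))                      # quantize x to the eps-grid
    z = [(q >> k) & 1 for k in range(K)]         # binary expansion of the grid index
    x_approx = eps * sum(2 ** k * zk for k, zk in enumerate(z))
    return z, x_approx

z, xa = binarize_state(3.7, upper=10.0, eps=0.01)
print(xa)  # 3.7; the quantization error is at most eps/2
```

The "modest increase in problem size" in the abstract corresponds to the logarithmic number K of binary variables introduced per state.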

196 citations


Journal ArticleDOI
TL;DR: In this article, the authors derive linear convergence rates of several first order methods for solving smooth non-strongly convex constrained optimization problems, i.e. involving an objective function with a Lipschitz continuous gradient that satisfies some relaxed strong convexity condition.
Abstract: The standard assumption for proving linear convergence of first order methods for smooth convex optimization is the strong convexity of the objective function, an assumption which does not hold for many practical applications. In this paper, we derive linear convergence rates of several first order methods for solving smooth non-strongly convex constrained optimization problems, i.e. involving an objective function with a Lipschitz continuous gradient that satisfies some relaxed strong convexity condition. In particular, in the case of smooth constrained convex optimization, we provide several relaxations of the strong convexity conditions and prove that they are sufficient for getting linear convergence for several first order methods such as projected gradient, fast gradient and feasible descent methods. We also provide examples of functional classes that satisfy our proposed relaxations of strong convexity conditions. Finally, we show that the proposed relaxed strong convexity conditions cover important applications ranging from solving linear systems and linear programming to dual formulations of linearly constrained convex problems.
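
For a concrete instance of one of the analyzed methods: projected gradient on a box-constrained least-squares problem whose objective is convex but not strongly convex (the design matrix below is rank-deficient). A minimal sketch with illustrative data, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))        # more columns than rows: not strongly convex
b = rng.standard_normal(20)
L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient

x = np.zeros(50)
for _ in range(500):
    grad = A.T @ (A @ x - b)             # gradient of f(x) = 0.5*||Ax - b||^2
    x = np.clip(x - grad / L, -1.0, 1.0) # gradient step + projection onto [-1, 1]^n
print(0.5 * np.linalg.norm(A @ x - b) ** 2)
```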

171 citations


Journal ArticleDOI
TL;DR: In this paper, the authors investigated the efficacy of gradient descent for the nonconvex least squares problem and showed that under Gaussian designs, gradient descent yields an accurate solution in O(log n+log log n+ log (1/πσon )\big ) iterations given nearly minimal samples, thus achieving nearoptimal computational and sample complexities at once.
Abstract: This paper considers the problem of solving systems of quadratic equations, namely, recovering an object of interest $$\varvec{x}^{\natural }\in {\mathbb {R}}^{n}$$ from m quadratic equations/samples $$y_{i}=(\varvec{a}_{i}^{\top }\varvec{x}^{\natural })^{2}, 1\le i\le m$$ . This problem, also dubbed as phase retrieval, spans multiple domains including physical sciences and machine learning. We investigate the efficacy of gradient descent (or Wirtinger flow) designed for the nonconvex least squares problem. We prove that under Gaussian designs, gradient descent—when randomly initialized—yields an $$\epsilon $$ -accurate solution in $$O\big (\log n+\log (1/\epsilon )\big )$$ iterations given nearly minimal samples, thus achieving near-optimal computational and sample complexities at once. This provides the first global convergence guarantee concerning vanilla gradient descent for phase retrieval, without the need of (i) carefully-designed initialization, (ii) sample splitting, or (iii) sophisticated saddle-point escaping schemes. All of these are achieved by exploiting the statistical models in analyzing optimization algorithms, via a leave-one-out approach that enables the decoupling of certain statistical dependency between the gradient descent iterates and the data.
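
A toy version of the scheme studied in the paper: gradient descent on the quartic least-squares loss, with random initialization and no spectral step. The step size and scaling below are illustrative and may need tuning; this is a sketch, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 500
x_true = rng.standard_normal(n)
x_true /= np.linalg.norm(x_true)           # unit-norm signal for a simple scale
A = rng.standard_normal((m, n))            # Gaussian design
y = (A @ x_true) ** 2                      # samples y_i = (a_i^T x)^2

def grad(x):
    # gradient of f(x) = (1/4m) * sum_i ((a_i^T x)^2 - y_i)^2
    Ax = A @ x
    return A.T @ ((Ax ** 2 - y) * Ax) / m

x = rng.standard_normal(n) / np.sqrt(n)    # random initialization
for _ in range(3000):
    x -= 0.1 * grad(x)

# the signal is identifiable only up to a global sign
print(min(np.linalg.norm(x - x_true), np.linalg.norm(x + x_true)))
```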

161 citations


Journal ArticleDOI
TL;DR: In this paper, the authors consider global efficiency of algorithms for minimizing a sum of convex functions and a composition of a Lipschitz convex function with a smooth map, and show that when the subproblems can only be solved by first-order methods, a simple combination of smoothing, the prox-linear method, and a fast-gradient scheme yields an algorithm with complexity $$\widetilde{\mathcal {O}}(\varepsilon ^{-3})$$ .
Abstract: We consider global efficiency of algorithms for minimizing a sum of a convex function and a composition of a Lipschitz convex function with a smooth map. The basic algorithm we rely on is the prox-linear method, which in each iteration solves a regularized subproblem formed by linearizing the smooth map. When the subproblems are solved exactly, the method has efficiency $$\mathcal {O}(\varepsilon ^{-2})$$ , akin to gradient descent for smooth minimization. We show that when the subproblems can only be solved by first-order methods, a simple combination of smoothing, the prox-linear method, and a fast-gradient scheme yields an algorithm with complexity $$\widetilde{\mathcal {O}}(\varepsilon ^{-3})$$ . We round off the paper with an inertial prox-linear method that automatically accelerates in presence of convexity.
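
For reference, with objective $$F(x) = g(x) + h(c(x))$$ ($$g$$ convex, $$h$$ Lipschitz convex, $$c$$ a smooth map), the prox-linear step solves the convex subproblem obtained by linearizing only the smooth map; a standard form of the update, written here with proximal parameter $$t > 0$$ :

$$x_{k+1} \in \mathop {\mathrm {argmin}}_{x} \; g(x) + h\big (c(x_k) + \nabla c(x_k)(x - x_k)\big ) + \frac{1}{2t}\Vert x - x_k\Vert ^2 .$$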

154 citations


Journal ArticleDOI
TL;DR: It is established that first-order methods avoid strict saddle points for almost all initializations; neither access to second-order derivative information nor randomness beyond initialization is necessary to provably avoid strict saddle points.
Abstract: We establish that first-order methods avoid strict saddle points for almost all initializations. Our results apply to a wide variety of first-order methods, including (manifold) gradient descent, block coordinate descent, mirror descent and variants thereof. The connecting thread is that such algorithms can be studied from a dynamical systems perspective in which appropriate instantiations of the Stable Manifold Theorem allow for a global stability analysis. Thus, neither access to second-order derivative information nor randomness beyond initialization is necessary to provably avoid strict saddle points.
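
The mechanism is visible already on $$f(x,y) = \tfrac{1}{2}(x^2 - y^2)$$ , whose origin is a strict saddle: the set of initializations attracted to it (here the line y = 0) has measure zero, exactly as the Stable Manifold Theorem predicts. A toy check:

```python
import numpy as np

def gd(x0, step=0.1, iters=400):
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        grad = np.array([x[0], -x[1]])   # gradient of f(x, y) = 0.5*(x^2 - y^2)
        x -= step * grad
    return x

print(gd([1.0, 0.0]))    # stable manifold: converges to the strict saddle (0, 0)
print(gd([1.0, 1e-8]))   # generic initialization: the unstable y-direction escapes
```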

150 citations


Journal ArticleDOI
TL;DR: In this article, the authors consider variants of trust-region and adaptive cubic regularization methods for non-convex optimization, in which the Hessian matrix is approximated, and provide iteration complexity bounds to achieve $$\varepsilon $$ -approximate second-order optimality, which have been shown to be tight.
Abstract: We consider variants of trust-region and adaptive cubic regularization methods for non-convex optimization, in which the Hessian matrix is approximated. Under certain conditions on the inexact Hessian, and using approximate solutions of the corresponding sub-problems, we provide iteration complexity bounds to achieve $$\varepsilon $$ -approximate second-order optimality which have been shown to be tight. Our Hessian approximation condition offers a range of advantages as compared with the prior works and allows for direct construction of the approximate Hessian with a priori guarantees through various techniques, including randomized sampling methods. In this light, we consider the canonical problem of finite-sum minimization, provide appropriate uniform and non-uniform sub-sampling strategies to construct such Hessian approximations, and obtain optimal iteration complexity for the corresponding sub-sampled trust-region and adaptive cubic regularization methods.
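
In the finite-sum setting $$f(x) = \frac{1}{n}\sum _i f_i(x)$$ , the inexact Hessian of the abstract can be built by uniform sub-sampling, which gives an unbiased estimate for moderate batch sizes. A sketch on a logistic-loss example (batch size and data are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 10
A = rng.standard_normal((n, d))
y = rng.integers(0, 2, n) * 2.0 - 1.0        # labels in {-1, +1}

def hess_i(x, i):
    # Hessian of the logistic loss log(1 + exp(-y_i * a_i^T x))
    s = 1.0 / (1.0 + np.exp(-y[i] * (A[i] @ x)))
    return s * (1.0 - s) * np.outer(A[i], A[i])

def subsampled_hessian(x, batch):
    idx = rng.choice(n, size=batch, replace=False)   # uniform sampling
    return sum(hess_i(x, i) for i in idx) / batch

x = rng.standard_normal(d)
H_full = sum(hess_i(x, i) for i in range(n)) / n
H_sub = subsampled_hessian(x, batch=100)
print(np.linalg.norm(H_sub - H_full, 2))     # spectral-norm approximation error
```

The trust-region or cubic-regularized sub-problem is then solved with the sampled matrix in place of the exact Hessian.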

147 citations


Journal ArticleDOI
TL;DR: This paper focuses on learning via “dual averaging”, a widely used class of no-regret learning schemes where players take small steps along their individual payoff gradients and then “mirror” the output back to their action sets, and introduces the notion of variational stability.
Abstract: This paper examines the convergence of no-regret learning in games with continuous action sets. For concreteness, we focus on learning via “dual averaging”, a widely used class of no-regret learning schemes where players take small steps along their individual payoff gradients and then “mirror” the output back to their action sets. In terms of feedback, we assume that players can only estimate their payoff gradients up to a zero-mean error with bounded variance. To study the convergence of the induced sequence of play, we introduce the notion of variational stability, and we show that stable equilibria are locally attracting with high probability whereas globally stable equilibria are globally attracting with probability 1. We also discuss some applications to mixed-strategy learning in finite games, and we provide explicit estimates of the method’s convergence speed.

144 citations


Journal ArticleDOI
TL;DR: This paper considers nonconvex distributed constrained optimization over networks, modeled as directed (possibly time-varying) graphs, and introduces the first algorithmic framework for the minimization of the sum of a smooth nonconvex (nonseparable) function—the agent’s sum-utility—plus a difference-of-convex function (with nonsmooth convex part).
Abstract: This paper considers nonconvex distributed constrained optimization over networks, modeled as directed (possibly time-varying) graphs. We introduce the first algorithmic framework for the minimization of the sum of a smooth nonconvex (nonseparable) function—the agent’s sum-utility—plus a difference-of-convex function (with nonsmooth convex part). This general formulation arises in many applications, from statistical machine learning to engineering. The proposed distributed method combines successive convex approximation techniques with a judiciously designed perturbed push-sum consensus mechanism that aims to track locally the gradient of the (smooth part of the) sum-utility. Sublinear convergence rate is proved when a fixed step-size (possibly different among the agents) is employed whereas asymptotic convergence to stationary solutions is proved using a diminishing step-size. Numerical results show that our algorithms compare favorably with current schemes on both convex and nonconvex problems.

135 citations


Journal ArticleDOI
TL;DR: For large-scale finite-sum minimization problems, non-asymptotic and high-probability global as well as local convergence properties of variants of Newton’s method where the Hessian and/or gradients are randomly sub-sampled are studied.
Abstract: For large-scale finite-sum minimization problems, we study non-asymptotic and high-probability global as well as local convergence properties of variants of Newton’s method where the Hessian and/or gradients are randomly sub-sampled. For Hessian sub-sampling, using random matrix concentration inequalities, one can sub-sample in a way that second-order information, i.e., curvature, is suitably preserved. For gradient sub-sampling, approximate matrix multiplication results from randomized numerical linear algebra provide a way to construct the sub-sampled gradient which contains as much of the first-order information as possible. While sample sizes all depend on problem specific constants, e.g., condition number, we demonstrate that local convergence rates are problem-independent.
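
A compact sketch of such an iteration, with both the gradient and the Hessian sub-sampled, on a ridge least-squares problem. Batch sizes are illustrative; the paper ties them to problem constants such as the condition number.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 2000, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
lam = 1e-3                                  # ridge parameter

x = np.zeros(d)
for _ in range(20):
    ig = rng.choice(n, 400, replace=False)  # sample for the gradient
    ih = rng.choice(n, 200, replace=False)  # sample for the Hessian
    g = 2 * A[ig].T @ (A[ig] @ x - b[ig]) / len(ig) + 2 * lam * x
    H = 2 * A[ih].T @ A[ih] / len(ih) + 2 * lam * np.eye(d)
    x -= np.linalg.solve(H, g)              # sub-sampled Newton step
print(x)
```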

129 citations


Journal ArticleDOI
TL;DR: In this article, the authors show that most of the results relating submodularity and convexity for set-functions can be extended to all submodular functions, and they provide a new interpretation of existing results for set-functions.
Abstract: Submodular set-functions have many applications in combinatorial optimization, as they can be minimized and approximately maximized in polynomial time. A key element in many of the algorithms and analyses is the possibility of extending the submodular set-function to a convex function, which opens up tools from convex optimization. Submodularity goes beyond set-functions and has naturally been considered for problems with multiple labels or for functions defined on continuous domains, where it corresponds essentially to cross second-derivatives being nonpositive. In this paper, we show that most results relating submodularity and convexity for set-functions can be extended to all submodular functions. In particular, (a) we naturally define a continuous extension in a set of probability measures, (b) show that the extension is convex if and only if the original function is submodular, (c) prove that the problem of minimizing a submodular function is equivalent to a typically non-smooth convex optimization problem, and (d) propose another convex optimization problem with better computational properties (e.g., a smooth dual problem). Most of these extensions from the set-function situation are obtained by drawing links with the theory of multi-marginal optimal transport, which also provides a new interpretation of existing results for set-functions. We then provide practical algorithms to minimize generic submodular functions on discrete domains, with associated convergence rates.

114 citations


Journal ArticleDOI
TL;DR: The lower bounds are sharp to within constants, and they show that gradient descent, cubic-regularized Newton’s method, and generalized pth order regularization are worst-case optimal within their natural function classes.
Abstract: We prove lower bounds on the complexity of finding $$\epsilon $$ -stationary points (points x such that $$\Vert \nabla f(x)\Vert \le \epsilon $$ ) of smooth, high-dimensional, and potentially non-convex functions f. We consider oracle-based complexity measures, where an algorithm is given access to the value and all derivatives of f at a query point x. We show that for any (potentially randomized) algorithm $$\mathsf {A}$$ , there exists a function f with Lipschitz pth order derivatives such that $$\mathsf {A}$$ requires at least $$\epsilon ^{-(p+1)/p}$$ queries to find an $$\epsilon $$ -stationary point. Our lower bounds are sharp to within constants, and they show that gradient descent, cubic-regularized Newton’s method, and generalized pth order regularization are worst-case optimal within their natural function classes.
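
Specializing the exponent makes the hierarchy explicit; the familiar first- and second-order rates appear as the cases p = 1 and p = 2:

$$\epsilon ^{-(p+1)/p} = \begin{cases} \epsilon ^{-2}, & p=1 \text{ (gradient descent)},\\ \epsilon ^{-3/2}, & p=2 \text{ (cubic-regularized Newton)},\\ \epsilon ^{-4/3}, & p=3. \end{cases}$$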

Journal ArticleDOI
TL;DR: In this article, a fully adaptive algorithm for monotone variational inequalities is presented, which uses two previous iterates for an approximation of the local Lipschitz constant without running a linesearch.
Abstract: The paper presents a fully adaptive algorithm for monotone variational inequalities. In each iteration the method uses two previous iterates for an approximation of the local Lipschitz constant without running a linesearch. Thus, every iteration of the method requires only one evaluation of a monotone operator F and a proximal mapping g. The operator F need not be Lipschitz continuous, which also makes the algorithm interesting in the area of composite minimization. The method exhibits an ergodic O(1 / k) convergence rate and R-linear rate under an error bound condition. We discuss possible applications of the method to fixed point problems as well as its different generalizations.
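
The flavor of a linesearch-free adaptive step can be sketched as follows. The specific rule below, which caps the step by a two-iterate estimate of the inverse local Lipschitz constant of F, is an illustrative stand-in in the spirit of the paper, not a reproduction of its exact update.

```python
import numpy as np

def adaptive_fb(F, prox_g, x0, iters=300, lam=1.0, growth=1.2):
    """Forward-backward x+ = prox_g(x - lam*F(x)) with an adaptive step size."""
    x_old, Fx_old = x0.copy(), F(x0)
    x = prox_g(x0 - lam * Fx_old)
    for _ in range(iters):
        Fx = F(x)
        den = np.linalg.norm(Fx - Fx_old)
        # local inverse-Lipschitz estimate from the two most recent iterates
        local = np.linalg.norm(x - x_old) / (2 * den) if den > 0 else growth * lam
        lam = min(growth * lam, local)       # one F evaluation, no linesearch
        x_old, Fx_old = x, Fx
        x = prox_g(x - lam * Fx)
    return x

# strongly monotone affine operator, g = indicator of the nonnegative orthant
M = np.array([[1.0, 2.0], [-2.0, 1.0]])
q = np.array([-1.0, 1.0])
print(adaptive_fb(lambda v: M @ v + q, lambda v: np.maximum(v, 0.0), np.zeros(2)))
# expected solution: M x + q = 0 at x = (0.6, 0.2), which is feasible
```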

Journal ArticleDOI
TL;DR: It is found that the ambiguous risk constraints in this setting can be recast as a set of second-order cone (SOC) constraints; to facilitate the algorithmic implementation, efficient ways of finding violated SOC constraints are also derived.
Abstract: Optimization problems face random constraint violations when uncertainty arises in constraint parameters. Effective ways of controlling such violations include risk constraints, e.g., chance constraints and conditional Value-at-Risk constraints. This paper studies these two types of risk constraints when the probability distribution of the uncertain parameters is ambiguous. In particular, we assume that the distributional information consists of the first two moments of the uncertainty and a generalized notion of unimodality. We find that the ambiguous risk constraints in this setting can be recast as a set of second-order cone (SOC) constraints. In order to facilitate the algorithmic implementation, we also derive efficient ways of finding violated SOC constraints. Finally, we demonstrate the theoretical results via computational case studies on power system operations.
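
The SOC structure is easiest to see in the classical moment-only case. Assuming only mean $$\mu $$ and covariance $$\Sigma $$ (no unimodality), the ambiguous chance constraint $$\inf _P {\mathbb {P}}(\xi ^{\top } x \le b) \ge 1-\varepsilon $$ becomes $$\mu ^{\top } x + \kappa \Vert \Sigma ^{1/2}x\Vert _2 \le b$$ with $$\kappa = \sqrt{(1-\varepsilon )/\varepsilon }$$ ; the unimodality information used in the paper refines such constants. A hedged cvxpy sketch (data illustrative):

```python
import numpy as np
import cvxpy as cp

eps = 0.05
mu = np.array([1.0, 2.0, 0.5])
Sigma = np.diag([0.5, 1.0, 0.2])
L = np.linalg.cholesky(Sigma)                 # Sigma = L @ L.T
kappa = np.sqrt((1 - eps) / eps)              # moment-only safety factor

x = cp.Variable(3)
cons = [
    mu @ x + kappa * cp.norm(L.T @ x, 2) <= 10.0,  # second-order cone constraint
    cp.sum(x) == 1,
    x >= 0,
]
cp.Problem(cp.Maximize(np.array([3.0, 1.0, 2.0]) @ x), cons).solve()
print(x.value)
```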

Journal ArticleDOI
TL;DR: One distinctive feature of the proposed perturbed proximal primal–dual algorithm is that the primal and dual steps are both perturbed appropriately using past iterates so that a number of asymptotic convergence and rate of convergence results can be obtained.
Abstract: In this paper, we propose a perturbed proximal primal–dual algorithm (PProx-PDA) for an important class of linearly constrained optimization problems, whose objective is the sum of smooth (possibly nonconvex) and convex (possibly nonsmooth) functions. This family of problems can be used to model many statistical and engineering applications, such as high-dimensional subspace estimation and the distributed machine learning. The proposed method is of the Uzawa type, in which a primal gradient descent step is performed followed by an (approximate) dual gradient ascent step. One distinctive feature of the proposed algorithm is that the primal and dual steps are both perturbed appropriately using past iterates so that a number of asymptotic convergence and rate of convergence results (to first-order stationary solutions) can be obtained. Finally, we conduct extensive numerical experiments to validate the effectiveness of the proposed algorithm.

Journal ArticleDOI
TL;DR: The growth behavior of the objective function around the critical points of the QP-OC problem is characterized and it is demonstrated how such characterization can be used to obtain strong convergence rate results for iterative methods that exploit the manifold structure of the orthogonality constraint to find a critical point of the problem.
Abstract: The problem of optimizing a quadratic form over an orthogonality constraint (QP-OC for short) is one of the most fundamental matrix optimization problems and arises in many applications. In this paper, we characterize the growth behavior of the objective function around the critical points of the QP-OC problem and demonstrate how such characterization can be used to obtain strong convergence rate results for iterative methods that exploit the manifold structure of the orthogonality constraint (i.e., the Stiefel manifold) to find a critical point of the problem. Specifically, our primary contribution is to show that the Łojasiewicz exponent at any critical point of the QP-OC problem is 1 / 2. Such a result is significant, as it expands the currently very limited repertoire of optimization problems for which the Łojasiewicz exponent is explicitly known. Moreover, it allows us to show, in a unified manner and for the first time, that a large family of retraction-based line-search methods will converge linearly to a critical point of the QP-OC problem. Then, as our secondary contribution, we propose a stochastic variance-reduced gradient (SVRG) method called Stiefel-SVRG for solving the QP-OC problem and present a novel Łojasiewicz inequality-based linear convergence analysis of the method. An important feature of Stiefel-SVRG is that it allows for general retractions and does not require the computation of any vector transport on the Stiefel manifold. As such, it is computationally more advantageous than other recently-proposed SVRG-type algorithms for manifold optimization.
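
A minimal retraction-based gradient scheme for a QP-OC instance, minimizing tr(X^T A X) over the Stiefel manifold with the QR retraction. A fixed step is used for brevity; the paper's linear-rate results cover line-search variants.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 20, 3
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                               # symmetric matrix in the quadratic form

def retract(Y):
    Q, R = np.linalg.qr(Y)
    return Q * np.sign(np.diag(R))              # QR retraction onto the Stiefel manifold

X = retract(rng.standard_normal((n, p)))
for _ in range(1000):
    G = 2 * A @ X                               # Euclidean gradient of tr(X^T A X)
    rgrad = G - X @ ((X.T @ G + G.T @ X) / 2)   # projection onto the tangent space
    X = retract(X - 0.01 * rgrad)

# objective approaches the sum of the p smallest eigenvalues of A
print(np.trace(X.T @ A @ X), np.sort(np.linalg.eigvalsh(A))[:p].sum())
```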

Journal ArticleDOI
TL;DR: This study provides an algorithmic version of the convergence results obtained by Attouch–Cabot (J Differ Equ 264:7138–7182, 2018) in the case of continuous dynamical systems.
Abstract: In a Hilbert space $${\mathcal {H}}$$ , given $$A{:}\;{\mathcal {H}}\rightarrow 2^{\mathcal {H}}$$ a maximally monotone operator, we study the convergence properties of a general class of relaxed inertial proximal algorithms. This study aims to extend to the case of the general monotone inclusion $$Ax \ni 0$$ the acceleration techniques initially introduced by Nesterov in the case of convex minimization. The relaxed form of the proximal algorithms plays a central role. It comes naturally with the regularization of the operator A by its Yosida approximation with a variable parameter, a technique recently introduced by Attouch–Peypouquet (Math Program Ser B, 2018. https://doi.org/10.1007/s10107-018-1252-x ) for a particular class of inertial proximal algorithms. Our study provides an algorithmic version of the convergence results obtained by Attouch–Cabot (J Differ Equ 264:7138–7182, 2018) in the case of continuous dynamical systems.

Journal ArticleDOI
TL;DR: The progressive hedging algorithm is demonstrated here to be applicable also to solving multistage stochastic variational inequality problems under monotonicity, thus increasing the range of applications for progressive hedging.
Abstract: The concept of a stochastic variational inequality has recently been articulated in a new way that is able to cover, in particular, the optimality conditions for a multistage stochastic programming problem. One of the long-standing methods for solving such an optimization problem under convexity is the progressive hedging algorithm. That approach is demonstrated here to be applicable also to solving multistage stochastic variational inequality problems under monotonicity, thus increasing the range of applications for progressive hedging. Stochastic complementarity problems as a special case are explored numerically in a linear two-stage formulation.
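
One progressive hedging cycle alternates independent scenario subproblems, an averaging step, and a multiplier update enforcing nonanticipativity. A toy two-stage sketch on scenario-wise quadratics (all data illustrative):

```python
import numpy as np

xi = np.array([1.0, 2.0, 6.0])       # scenario data
prob = np.array([0.5, 0.3, 0.2])     # scenario probabilities
r = 1.0                              # proximal penalty parameter

x = np.zeros(3)                      # scenario copies of the first-stage decision
w = np.zeros(3)                      # nonanticipativity multipliers
for _ in range(100):
    xbar = prob @ x                  # current implementable (averaged) decision
    # scenario subproblem: argmin 0.5*(x - xi_s)^2 + w_s*x + (r/2)*(x - xbar)^2
    x = (xi - w + r * xbar) / (1 + r)
    w = w + r * (x - prob @ x)       # multiplier update

print(prob @ x)                      # -> E[xi] = 2.3, the nonanticipative optimum
```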

Journal ArticleDOI
TL;DR: This paper provides a counterexample showing that the claim that maximum a posteriori estimators are a limiting case of Bayes estimators with 0–1 loss is in general false, and corrects it by providing a level-set condition for posterior densities under which the result holds.
Abstract: Maximum a posteriori and Bayes estimators are two common methods of point estimation in Bayesian statistics. It is commonly accepted that maximum a posteriori estimators are a limiting case of Bayes estimators with 0–1 loss. In this paper, we provide a counterexample which shows that in general this claim is false. We then correct the claim by providing a level-set condition for posterior densities under which the result holds. Since both estimators are defined in terms of optimization problems, the tools of variational analysis find a natural application to Bayesian point estimation.

Journal ArticleDOI
TL;DR: It is shown that even linear rates are expected for Bregman projections with respect to smooth or piecewise linear-quadratic functions, and also the regularized nuclear norm, which is used in the area of low rank matrix problems.
Abstract: The randomized version of the Kaczmarz method for the solution of consistent linear systems is known to converge linearly in expectation. And even in the possibly inconsistent case, when only noisy data is given, the iterates are expected to reach an error threshold in the order of the noise-level with the same rate as in the noiseless case. In this work we show that the same also holds for the iterates of the recently proposed randomized sparse Kaczmarz method for recovery of sparse solutions. Furthermore we consider the more general setting of convex feasibility problems and their solution by the method of randomized Bregman projections. This is motivated by the observation that, similarly to the Kaczmarz method, the Sparse Kaczmarz method can also be interpreted as an iterative Bregman projection method to solve a convex feasibility problem. We obtain expected sublinear rates for Bregman projections with respect to a general strongly convex function. Moreover, even linear rates are expected for Bregman projections with respect to smooth or piecewise linear-quadratic functions, and also the regularized nuclear norm, which is used in the area of low rank matrix problems.
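
The randomized sparse Kaczmarz method referred to above is only a few lines: a Kaczmarz step on an auxiliary variable followed by soft-thresholding, with rows sampled proportionally to their squared norms. A sketch with illustrative data:

```python
import numpy as np

def sparse_kaczmarz(A, b, lam=1.0, iters=10000, seed=0):
    """Randomized sparse Kaczmarz; lam = 0 recovers the plain Kaczmarz method."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    p = np.linalg.norm(A, axis=1) ** 2
    p /= p.sum()                                            # row-sampling probabilities
    z, x = np.zeros(n), np.zeros(n)
    for _ in range(iters):
        i = rng.choice(m, p=p)
        z -= (A[i] @ x - b[i]) / (A[i] @ A[i]) * A[i]       # Kaczmarz step on z
        x = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)   # soft-thresholding
    return x

rng = np.random.default_rng(5)
A = rng.standard_normal((100, 200))
x_true = np.zeros(200); x_true[:5] = 5.0                    # sparse ground truth
print(np.linalg.norm(sparse_kaczmarz(A, A @ x_true) - x_true))  # small error expected
```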

Journal ArticleDOI
TL;DR: In this article, the authors prove tight bounds on the oracle complexity of second-order methods for smooth convex functions, or equivalently, the worst-case number of iterations required to optimize such functions to a given accuracy.
Abstract: Second-order methods, which utilize gradients as well as Hessians to optimize a given function, are of major importance in mathematical optimization. In this work, we prove tight bounds on the oracle complexity of such methods for smooth convex functions, or equivalently, the worst-case number of iterations required to optimize such functions to a given accuracy. In particular, these bounds indicate when such methods can or cannot improve on gradient-based methods, whose oracle complexity is much better understood. We also provide generalizations of our results to higher-order methods.

Journal ArticleDOI
TL;DR: In this paper, the trajectories of a second-order differential equation with vanishing damping were studied, governed by the Yosida regularization of a maximally monotone operator with time-varying index, along with a new Regularized Inertial Proximal Algorithm obtained by means of a convenient finite difference discretization.
Abstract: We study the behavior of the trajectories of a second-order differential equation with vanishing damping, governed by the Yosida regularization of a maximally monotone operator with time-varying index, along with a new Regularized Inertial Proximal Algorithm obtained by means of a convenient finite-difference discretization. These systems are the counterpart to accelerated forward–backward algorithms in the context of maximally monotone operators. A proper tuning of the parameters allows us to prove the weak convergence of the trajectories to zeroes of the operator. Moreover, it is possible to estimate the rate at which the speed and acceleration vanish. We also study the effect of perturbations or computational errors that leave the convergence properties unchanged. We also analyze a growth condition under which strong convergence can be guaranteed. A simple example shows the criticality of the assumptions on the Yosida approximation parameter, and allows us to illustrate the behavior of these systems compared with some of their close relatives.

Journal ArticleDOI
TL;DR: It is illustrated that not all, but only some scenarios might have “effect” on the optimal value, and this notion is formally defined for the general class of problems where the distributional ambiguity is modeled by the so-called total variation distance.
Abstract: Traditional stochastic programs assume that the probability distribution of uncertainty is known. However, in practice, the probability distribution oftentimes is not known or cannot be accurately approximated. One way to address such distributional ambiguity is to work with distributionally robust convex stochastic programs (DRSPs), which minimize the worst-case expected cost with respect to a set of probability distributions. In this paper we analyze the case where there is a finite number of possible scenarios and study the question of how to identify the critical scenarios resulting from solving a DRSP. We illustrate that not all, but only some scenarios might have “effect” on the optimal value, and we formally define this notion for our general class of problems. In particular, we examine problems where the distributional ambiguity is modeled by the so-called total variation distance. We propose easy-to-check conditions to identify effective and ineffective scenarios for that class of problems. Computational results show that identifying effective scenarios provides useful insight on the underlying uncertainties of the problem.

Journal ArticleDOI
TL;DR: It is shown how the SLCP and DRLCP models can be used to study equilibrium arising from a two-stage duopoly game where each player plans to set up its optimal capacity at present with anticipated competition for production in the future.
Abstract: In this paper, we propose a discretization scheme for the two-stage stochastic linear complementarity problem (LCP) where the underlying random data are continuously distributed. Under some moderate conditions, we derive qualitative and quantitative convergence for the solutions obtained from solving the discretized two-stage stochastic LCP (SLCP). We explain how the discretized two-stage SLCP may be solved by the well-known progressive hedging method (PHM). Moreover, we extend the discussion by considering a two-stage distributionally robust LCP (DRLCP) with moment constraints and proposing a discretization scheme for the DRLCP. As an application, we show how the SLCP and DRLCP models can be used to study equilibrium arising from a two-stage duopoly game where each player plans to set up its optimal capacity at present with anticipated competition for production in the future.

Journal ArticleDOI
TL;DR: This work shows that each cycle of the classical block symmetric Gauss–Seidel (sGS) method exactly solves the associated quadratic programming (QP) problem augmented with an extra proximal term of the form $$\frac{1}{2}\Vert {{\varvec{x}}}-{{\varvec{x}}}^k\Vert _{\mathcal {T}}^2$$ , and extends the block sGS method to solve convex composite QP.
Abstract: For a symmetric positive semidefinite linear system of equations $$\mathcal{Q}{{\varvec{x}}}= {{\varvec{b}}}$$ , where $${{\varvec{x}}}= (x_1,\ldots ,x_s)$$ is partitioned into s blocks, with $$s \ge 2$$ , we show that each cycle of the classical block symmetric Gauss–Seidel (sGS) method exactly solves the associated quadratic programming (QP) problem augmented with an extra proximal term of the form $$\frac{1}{2}\Vert {{\varvec{x}}}-{{\varvec{x}}}^k\Vert _\mathcal{T}^2$$ , where $$\mathcal{T}$$ is a symmetric positive semidefinite matrix related to the sGS decomposition of $$\mathcal{Q}$$ and $${{\varvec{x}}}^k$$ is the previous iterate. By leveraging such a connection to optimization, we are able to extend the result (which we name as the block sGS decomposition theorem) for solving convex composite QP (CCQP) with an additional possibly nonsmooth term in $$x_1$$ , i.e., $$\min \{ p(x_1) + \frac{1}{2}\langle {{\varvec{x}}},\,\mathcal{Q}{{\varvec{x}}}\rangle -\langle {{\varvec{b}}},\,{{\varvec{x}}}\rangle \}$$ , where $$p(\cdot )$$ is a proper closed convex function. Based on the block sGS decomposition theorem, we extend the classical block sGS method to solve CCQP. In addition, our extended block sGS method has the flexibility of allowing for inexact computation in each step of the block sGS cycle. At the same time, we can also accelerate the inexact block sGS method to achieve an iteration complexity of $$O(1/k^2)$$ after performing k cycles. As a fundamental building block, the block sGS decomposition theorem has played a key role in various recently developed algorithms such as the inexact semiproximal ALM/ADMM for linearly constrained multi-block convex composite conic programming (CCCP), and the accelerated block coordinate descent method for multi-block CCCP.
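
One cycle of the classical block sGS method is a forward-then-backward sweep over the blocks; for two blocks it reads as below (dense SPD data, purely illustrative). The theorem above says each such cycle is exactly a proximal-QP step.

```python
import numpy as np

rng = np.random.default_rng(6)
n1, n2 = 4, 3
M = rng.standard_normal((n1 + n2, n1 + n2))
Q = M.T @ M + np.eye(n1 + n2)           # symmetric positive definite
b = rng.standard_normal(n1 + n2)
Q11, Q12, Q22 = Q[:n1, :n1], Q[:n1, n1:], Q[n1:, n1:]
b1, b2 = b[:n1], b[n1:]

x1, x2 = np.zeros(n1), np.zeros(n2)
for _ in range(50):                      # each pass is one symmetric Gauss-Seidel cycle
    x2 = np.linalg.solve(Q22, b2 - Q12.T @ x1)
    x1 = np.linalg.solve(Q11, b1 - Q12 @ x2)
    x2 = np.linalg.solve(Q22, b2 - Q12.T @ x1)
print(np.linalg.norm(Q @ np.concatenate([x1, x2]) - b))   # residual -> 0
```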

Journal ArticleDOI
TL;DR: A one-step approach to enhance a simple-averaging based distributed estimator by utilizing a single Newton–Raphson update is proposed, and the corresponding asymptotic properties of the newly proposed estimator are derived.
Abstract: Distributed statistical inference has recently attracted enormous attention. Much existing work focuses on the averaging estimator, e.g., Zhang and Duchi (J Mach Learn Res 14:3321–3363, 2013) together with many others. We propose a one-step approach to enhance a simple-averaging based distributed estimator by utilizing a single Newton–Raphson update. We derive the corresponding asymptotic properties of the newly proposed estimator. We find that the proposed one-step estimator enjoys the same asymptotic properties as the idealized centralized estimator. In particular, asymptotic normality is established for the proposed estimator, while other competitors may not enjoy the same property. The proposed one-step approach merely requires one additional round of communication relative to the averaging estimator, so the extra communication burden is insignificant. The proposed one-step approach leads to a lower upper bound on the mean squared error than other alternatives. In finite sample cases, numerical examples show that the proposed estimator outperforms the simple averaging estimator by a large margin in terms of the sample mean squared error. A potential application of the one-step approach is that one can use multiple machines to speed up large-scale statistical inference with little compromise in the quality of estimators. The proposed method becomes more valuable when data can only be available at distributed machines with limited communication bandwidth.
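
The recipe is two communication rounds: average the local estimators, then take one Newton–Raphson step with aggregated gradient and Hessian. A sketch for distributed least squares, where the single Newton step lands exactly on the centralized estimator because the loss is quadratic (names and data illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
machines, n_loc, d = 10, 200, 5
theta_true = rng.standard_normal(d)
data = []
for _ in range(machines):
    X = rng.standard_normal((n_loc, d))
    data.append((X, X @ theta_true + 0.5 * rng.standard_normal(n_loc)))

# round 1: each machine sends its local estimator; the center averages them
theta_bar = np.mean([np.linalg.solve(X.T @ X, X.T @ y) for X, y in data], axis=0)

# round 2: one Newton-Raphson update with aggregated gradient and Hessian
g = np.mean([X.T @ (X @ theta_bar - y) / n_loc for X, y in data], axis=0)
H = np.mean([X.T @ X / n_loc for X, _ in data], axis=0)
theta_one = theta_bar - np.linalg.solve(H, g)

print(np.linalg.norm(theta_bar - theta_true), np.linalg.norm(theta_one - theta_true))
```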

Journal ArticleDOI
TL;DR: Two proximal DC algorithms with extrapolation are proposed that have much simpler subproblems and also incorporate the extrapolation for possible acceleration, and one of them is potentially applicable to the DC problem in which the second convex component is the supremum of infinitely many convex smooth functions.
Abstract: In this paper we consider a class of structured nonsmooth difference-of-convex (DC) minimization in which the first convex component is the sum of a smooth and nonsmooth functions while the second convex component is the supremum of possibly infinitely many convex smooth functions. We first propose an inexact enhanced DC algorithm for solving this problem in which the second convex component is the supremum of finitely many convex smooth functions, and show that every accumulation point of the generated sequence is an $$(\alpha ,\eta )$$ -D-stationary point of the problem, which is generally stronger than an ordinary D-stationary point. In addition, inspired by the recent work (Pang et al. in Math Oper Res 42(1):95–118, 2017; Wen et al. in Comput Optim Appl 69(2):297–324, 2018), we propose two proximal DC algorithms with extrapolation for solving this problem. We show that every accumulation point of the solution sequence generated by them is an $$(\alpha ,\eta )$$ -D-stationary point of the problem, and establish the convergence of the entire sequence under some suitable assumption. We also introduce a concept of approximate $$(\alpha ,\eta )$$ -D-stationary point and derive iteration complexity of the proposed algorithms for finding an approximate $$(\alpha ,\eta )$$ -D-stationary point. In contrast with the DC algorithm (Pang et al. 2017), our proximal DC algorithms have much simpler subproblems and also incorporate the extrapolation for possible acceleration. Moreover, one of our proximal DC algorithms is potentially applicable to the DC problem in which the second convex component is the supremum of infinitely many convex smooth functions. In addition, our algorithms have stronger convergence results than the proximal DC algorithm in Wen et al. (2018).

Journal ArticleDOI
TL;DR: It is shown that the joint probability distribution of risk and complexity is concentrated in such a way that the complexity carries fundamental information to tightly judge the risk.
Abstract: Scenario optimization is a broad methodology to perform optimization based on empirical knowledge. One collects previous cases, called “scenarios”, for the set-up in which optimization is being performed, and makes a decision that is optimal for the cases that have been collected. For convex optimization, a solid theory has been developed that provides guarantees of performance, and constraint satisfaction, of the scenario solution. In this paper, we open a new direction of investigation: the risk that a performance is not achieved, or that constraints are violated, is studied jointly with the complexity (as precisely defined in the paper) of the solution. It is shown that the joint probability distribution of risk and complexity is concentrated in such a way that the complexity carries fundamental information to tightly judge the risk. This result is obtained without requiring extra knowledge on the underlying optimization problem than that carried by the scenarios; in particular, no extra knowledge on the distribution by which scenarios are generated is assumed, so that the result is broadly applicable. This deep-seated result unveils a fundamental and general structure of data-driven optimization and suggests practical approaches for risk assessment.

Journal ArticleDOI
TL;DR: This work proves that when the problem possesses the so-called Luo–Tseng error bound (EB) property, IRPN converges globally to an optimal solution, and the local convergence rate of the sequence of iterates generated by IRPN is linear, superlinear, or even quadratic, depending on the choice of parameters of the algorithm.
Abstract: We propose a new family of inexact sequential quadratic approximation (SQA) methods, which we call the inexact regularized proximal Newton (IRPN) method, for minimizing the sum of two closed proper convex functions, one of which is smooth and the other is possibly non-smooth. Our proposed method features strong convergence guarantees even when applied to problems with degenerate solutions while allowing the inner minimization to be solved inexactly. Specifically, we prove that when the problem possesses the so-called Luo–Tseng error bound (EB) property, IRPN converges globally to an optimal solution, and the local convergence rate of the sequence of iterates generated by IRPN is linear, superlinear, or even quadratic, depending on the choice of parameters of the algorithm. Prior to this work, such EB property has been extensively used to establish the linear convergence of various first-order methods. However, to the best of our knowledge, this work is the first to use the Luo–Tseng EB property to establish the superlinear convergence of SQA-type methods for non-smooth convex minimization. As a consequence of our result, IRPN is capable of solving regularized regression or classification problems under the high-dimensional setting with provable convergence guarantees. We compare our proposed IRPN with several empirically efficient algorithms by applying them to the $$\ell _1$$ -regularized logistic regression problem. Experiment results show the competitiveness of our proposed method.

Journal ArticleDOI
TL;DR: In this paper, the authors study the smooth structure of convex functions and develop several Newton-type methods for solving a class of smooth convex optimization problems involving generalized self-concordant functions.
Abstract: We study the smooth structure of convex functions by generalizing the powerful concept of self-concordance, introduced by Nesterov and Nemirovskii in the early 1990s, to a broader class of convex functions which we call generalized self-concordant functions. This notion allows us to develop a unified framework for designing Newton-type methods to solve convex optimization problems. The proposed theory provides a mathematical tool to analyze both local and global convergence of Newton-type methods without imposing unverifiable assumptions as long as the underlying functionals fall into our class of generalized self-concordant functions. First, we introduce the class of generalized self-concordant functions which covers the class of standard self-concordant functions as a special case. Next, we establish several properties and key estimates of this function class which can be used to design numerical methods. Then, we apply this theory to develop several Newton-type methods for solving a class of smooth convex optimization problems involving generalized self-concordant functions. We provide an explicit step-size for a damped-step Newton-type scheme which can guarantee global convergence without performing any globalization strategy. We also prove a local quadratic convergence of this method and its full-step variant without requiring the Lipschitz continuity of the objective Hessian mapping. Then, we extend our result to develop proximal Newton-type methods for a class of composite convex minimization problems involving generalized self-concordant functions. We also achieve both global and local convergence without additional assumptions. Finally, we verify our theoretical results via several numerical examples, and compare them with existing methods.

Journal ArticleDOI
TL;DR: Two Newton-like zero-finding procedures for nonsmooth convex functions, based on inexact evaluations and sensitivity information, are introduced and shown to lead to efficient solution schemes for the original problem.
Abstract: Convex optimization problems arising in applications often have favorable objective functions and complicated constraints, thereby precluding first-order methods from being immediately applicable. We describe an approach that exchanges the roles of the objective with one of the constraint functions, and instead approximately solves a sequence of parametric level-set problems. Two Newton-like zero-finding procedures for nonsmooth convex functions, based on inexact evaluations and sensitivity information, are introduced. It is shown that they lead to efficient solution schemes for the original problem. We describe the theoretical and practical properties of this approach for a broad range of problems, including low-rank semidefinite optimization, sparse optimization, and gauge optimization.