Journal ArticleDOI

Nonsmooth optimization via quasi-Newton methods

01 Oct 2013-Mathematical Programming (Springer Berlin Heidelberg)-Vol. 141, Iss: 1, pp 135-163
TL;DR: When f is locally Lipschitz and semi-algebraic with bounded sublevel sets, the BFGS method with the inexact line search almost always generates sequences whose cluster points are Clarke stationary and whose function values converge R-linearly to a Clarke stationary value.
Abstract: We investigate the behavior of quasi-Newton algorithms applied to minimize a nonsmooth function f, not necessarily convex. We introduce an inexact line search that generates a sequence of nested intervals containing a set of points of nonzero measure that satisfy the Armijo and Wolfe conditions if f is absolutely continuous along the line. Furthermore, the line search is guaranteed to terminate if f is semi-algebraic. It seems quite difficult to establish a convergence theorem for quasi-Newton methods applied to such general classes of functions, so we give a careful analysis of a special but illuminating case, the Euclidean norm, in one variable using the inexact line search and in two variables assuming that the line search is exact. In practice, we find that when f is locally Lipschitz and semi-algebraic with bounded sublevel sets, the BFGS (Broyden-Fletcher-Goldfarb-Shanno) method with the inexact line search almost always generates sequences whose cluster points are Clarke stationary and with function values converging R-linearly to a Clarke stationary value. We give references documenting the successful use of BFGS in a variety of nonsmooth applications, particularly the design of low-order controllers for linear dynamical systems. We conclude with a challenging open question.
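The inexact line search at the heart of the paper brackets a step satisfying the Armijo and weak Wolfe conditions by bisection on a nested interval. The Python sketch below pairs a simplified version of such a bracketing search with a standard BFGS update and applies it to the Euclidean norm, the special case analyzed in the paper; the function names, tolerances, and iteration caps are choices made for this illustration, not the authors' exact algorithm.

```python
import numpy as np

def f(x):
    # Euclidean norm: nonsmooth at the origin, the case analyzed in the paper
    return np.linalg.norm(x)

def grad(x):
    # gradient of the norm, defined everywhere except the origin
    return x / np.linalg.norm(x)

def weak_wolfe_search(f, grad, x, d, c1=1e-4, c2=0.9):
    # Bisection bracketing of a step satisfying Armijo + weak Wolfe,
    # in the spirit of the paper's nested-interval line search (simplified).
    f0, g0 = f(x), grad(x) @ d
    a, b, t = 0.0, np.inf, 1.0
    for _ in range(50):
        if f(x + t * d) > f0 + c1 * t * g0:    # Armijo fails: step too long
            b = t
        elif grad(x + t * d) @ d < c2 * g0:    # weak Wolfe fails: too short
            a = t
        else:
            return t
        t = (a + b) / 2 if b < np.inf else 2 * a
    return t

def bfgs(f, grad, x0, iters=50):
    x = np.asarray(x0, dtype=float)
    H = np.eye(len(x))                         # inverse Hessian approximation
    for _ in range(iters):
        g = grad(x)
        d = -H @ g
        t = weak_wolfe_search(f, grad, x, d)
        s = t * d
        y = grad(x + s) - g
        if s @ y > 1e-12:                      # update only if curvature holds
            rho = 1.0 / (s @ y)
            I = np.eye(len(x))
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x = x + s
    return x

x = bfgs(f, grad, [3.0, 1.0])
print(f(x))  # the norm is driven toward its minimum value 0
```

Even though the norm is nonsmooth at its minimizer, the Armijo condition forces monotone decrease, which is the behavior the paper studies in detail.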


Citations
Proceedings Article
07 Dec 2015
TL;DR: This paper identifies a severe problem of the counterfactual risk estimator typically used in batch learning from logged bandit feedback (BLBF), and proposes the use of an alternative estimator that avoids this problem.
Abstract: This paper identifies a severe problem of the counterfactual risk estimator typically used in batch learning from logged bandit feedback (BLBF), and proposes the use of an alternative estimator that avoids this problem. In the BLBF setting, the learner does not receive full-information feedback like in supervised learning, but observes feedback only for the actions taken by a historical policy. This makes BLBF algorithms particularly attractive for training online systems (e.g., ad placement, web search, recommendation) using their historical logs. The Counterfactual Risk Minimization (CRM) principle [1] offers a general recipe for designing BLBF algorithms. It requires a counterfactual risk estimator, and virtually all existing works on BLBF have focused on a particular unbiased estimator. We show that this conventional estimator suffers from a propensity overfitting problem when used for learning over complex hypothesis spaces. We propose to replace the risk estimator with a self-normalized estimator, showing that it neatly avoids this problem. This naturally gives rise to a new learning algorithm - Normalized Policy Optimizer for Exponential Models (Norm-POEM) - for structured output prediction using linear rules. We evaluate the empirical effectiveness of Norm-POEM on several multi-label classification problems, finding that it consistently outperforms the conventional estimator.
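The propensity-overfitting failure mode and its fix are easy to see numerically: uniformly scaling down the learned policy's propensities shrinks the conventional IPS estimate, but leaves the self-normalized estimate unchanged. A toy sketch on synthetic data (the names `ips` and `snips` are chosen here, not taken from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logged bandit feedback: propensities of the logging policy for the
# logged actions, and the losses observed for those actions.
n = 10_000
p0 = rng.uniform(0.1, 0.9, n)     # logging-policy propensities
loss = rng.uniform(0.0, 1.0, n)   # observed losses

def ips(pi, p0, loss):
    # conventional unbiased inverse-propensity-scoring risk estimate
    return np.mean((pi / p0) * loss)

def snips(pi, p0, loss):
    # self-normalized estimate: dividing by the mean importance weight
    # removes the incentive to shrink pi uniformly (propensity overfitting)
    w = pi / p0
    return np.sum(w * loss) / np.sum(w)

pi = np.full(n, 0.5)              # hypothetical target-policy propensities
print(ips(pi, p0, loss), snips(pi, p0, loss))
print(ips(0.5 * pi, p0, loss), snips(0.5 * pi, p0, loss))
# halving pi halves the IPS estimate but leaves SNIPS unchanged
```

A learner minimizing the IPS estimate can thus "win" by assigning small propensities everywhere; the self-normalized estimator closes that loophole.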

306 citations


Cites background from "Nonsmooth optimization via quasi-Ne..."

  • ...The choice of L-BFGS for non-convex and non-smooth optimization is well supported [25, 26]....

Journal Article
TL;DR: The empirical results show that the CRM objective implemented in POEM provides improved robustness and generalization performance compared to the state-of-the-art, and a decomposition of the POEM objective that enables efficient stochastic gradient optimization is presented.
Abstract: We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfactual nature of the learning problem (Bottou et al., 2013) through propensity scoring. Next, we prove generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. In analogy to the Structural Risk Minimization principle of Wapnik and Tscherwonenkis (1979), these constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM can be used to derive a new learning method--called Policy Optimizer for Exponential Models (POEM)--for learning stochastic linear rules for structured output prediction. We present a decomposition of the POEM objective that enables efficient stochastic gradient optimization. The effectiveness and efficiency of POEM is evaluated on several simulated multi-label classification problems, as well as on a real-world information retrieval problem. The empirical results show that the CRM objective implemented in POEM provides improved robustness and generalization performance compared to the state-of-the-art.

297 citations


Cites methods from "Nonsmooth optimization via quasi-Ne..."

  • ...In particular, prior work (Yu et al., 2010; Lewis and Overton, 2013) has established theoretically sound modifications to L-BFGS for non-smooth non-convex optimization....

Proceedings Article
06 Jul 2015
TL;DR: In this paper, the counterfactual risk minimization (CRM) principle is used to derive a new learning method called Policy Optimizer for Exponential Models (POEM) for learning stochastic linear rules for structured output prediction.
Abstract: We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfactual nature of the learning problem through propensity scoring. Next, we prove generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. These constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM can be used to derive a new learning method - called Policy Optimizer for Exponential Models (POEM) - for learning stochastic linear rules for structured output prediction. We present a decomposition of the POEM objective that enables efficient stochastic gradient optimization. POEM is evaluated on several multi-label classification problems showing substantially improved robustness and generalization performance compared to the state-of-the-art.
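The CRM principle described above can be caricatured in a few lines: minimize the propensity-weighted empirical risk plus a penalty proportional to the square root of its empirical variance. The sketch below assumes a penalty of the form lam * sqrt(Var / n) with an illustrative constant `lam`; the paper's actual generalization bound has its own specific constants.

```python
import numpy as np

def crm_objective(pi, p0, loss, lam=0.1):
    # propensity-weighted empirical risk plus a variance penalty of the
    # form lam * sqrt(Var / n); lam is an illustrative tuning constant
    r = (pi / p0) * loss          # per-example importance-weighted losses
    n = len(r)
    return r.mean() + lam * np.sqrt(r.var(ddof=1) / n)

# Sanity check: if the target policy equals the logging policy and the
# loss is constant, the penalty vanishes and the objective equals the loss.
p0 = np.full(1000, 0.5)
print(crm_objective(p0, p0, np.full(1000, 0.3)))  # approximately 0.3
```

Between two policies with equal estimated risk, this objective prefers the one whose importance-weighted losses have lower variance, which is the source of the improved robustness reported in the abstract.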

275 citations

Journal ArticleDOI
27 Jul 2015
TL;DR: This work introduces the first structure-transcending style similarity measure, validates that it aligns well with human perception of stylistic similarity, and employs crowdsourcing to quantify the different components of the measure.
Abstract: The human perception of stylistic similarity transcends structure and function: for instance, a bed and a dresser may share a common style. An algorithmically computed style similarity measure that mimics human perception can benefit a range of computer graphics applications. Previous work in style analysis focused on shapes within the same class, and leveraged structural similarity between these shapes to facilitate analysis. In contrast, we introduce the first structure-transcending style similarity measure and validate it to be well aligned with human perception of stylistic similarity. Our measure is inspired by observations about style similarity in art history literature, which point to the presence of similarly shaped, salient, geometric elements as one of the key indicators of stylistic similarity. We translate these observations into an algorithmic measure by first quantifying the geometric properties that make humans perceive geometric elements as similarly shaped and salient in the context of style, then employing this quantification to detect pairs of matching style-related elements on the analyzed models, and finally collating the element-level geometric similarity measurements into an object-level style measure consistent with human perception. To achieve this consistency we employ crowdsourcing to quantify the different components of our measure; we learn the relative perceptual importance of a range of elementary shape distances and other parameters used in our measurement from 50K responses to cross-structure style similarity queries provided by over 2500 participants. We train and validate our method on this dataset, showing it to successfully predict relative style similarity with near 90% accuracy based on 10-fold cross-validation.

226 citations


Additional excerpts

  • ...Even if convergence is not guaranteed with this implementation, in practice due to the use of inexact line search [Lewis and Overton 2013] and our bound constraints, the parameter updates do not run into non-differentiable values....

Journal ArticleDOI
TL;DR: This paper provides a comprehensive review of existing design approaches, including iterative linear matrix inequality heuristics, linear matrix inequalities with rank constraints, methods with decoupled Lyapunov matrices, and non-Lyapunov-based approaches.

165 citations

References
Book
01 Jan 1983
TL;DR: This monograph develops the theory of generalized gradients and its applications across analysis, including differential inclusions, the calculus of variations, optimal control, and mathematical programming.
Abstract: 1. Introduction and Preview 2. Generalized Gradients 3. Differential Inclusions 4. The Calculus of Variations 5. Optimal Control 6. Mathematical Programming 7. Topics in Analysis.

9,498 citations


"Nonsmooth optimization via quasi-Ne..." refers methods in this paper

  • ...We use ∂ f (x) to denote the Clarke subdifferential [9,45] of f at x , which for locally Lipschitz f is simply the convex hull of the limits of gradients of f evaluated at sequences converging to x [6, Theorem 6....

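The definition quoted above can be checked numerically on the simplest example, f(x) = |x|: gradients along sequences approaching 0 from the right tend to +1, and from the left to -1, so the Clarke subdifferential at 0 is their convex hull, the interval [-1, 1]. A small sketch:

```python
# Clarke subdifferential of f(x) = |x| at 0, checked numerically:
# gradients along sequences converging to 0 from either side.
def f(x):
    return abs(x)

def grad(x, h=1e-9):
    # central difference, valid away from the kink at 0
    return (f(x + h) - f(x - h)) / (2 * h)

from_right = [grad(10.0 ** -k) for k in range(1, 6)]    # all close to +1
from_left = [grad(-(10.0 ** -k)) for k in range(1, 6)]  # all close to -1
print(from_right)
print(from_left)
# the Clarke subdifferential at 0 is the convex hull of the limits: [-1, 1]
```

Since 0 lies in this convex hull, the origin is Clarke stationary for |x|, which is exactly the notion of stationarity used throughout the cited paper.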

Book
01 Jan 2000
TL;DR: This text covers convex analysis and nonsmooth optimization, including Fenchel duality, the Karush-Kuhn-Tucker theorem, fixed points, and the contrast between infinite and finite dimensions.
Abstract: Background * Inequality constraints * Fenchel duality * Convex analysis * Special cases * Nonsmooth optimization * The Karush-Kuhn-Tucker Theorem * Fixed points * Postscript: infinite versus finite dimensions * List of results and notation.

1,063 citations

Book
01 Jan 2006
TL;DR: This book will help readers to understand the mathematical foundations of the modern theory and methods of nonlinear optimization and to analyze new problems, develop optimality theory for them, and choose or construct numerical solution methods.
Abstract: Optimization is one of the most important areas of modern applied mathematics, with applications in fields from engineering and economics to finance, statistics, management science, and medicine. While many books have addressed its various aspects, Nonlinear Optimization is the first comprehensive treatment that will allow graduate students and researchers to understand its modern ideas, principles, and methods within a reasonable time, but without sacrificing mathematical precision. Andrzej Ruszczynski, a leading expert in the optimization of nonlinear stochastic systems, integrates the theory and the methods of nonlinear optimization in a unified, clear, and mathematically rigorous fashion, with detailed and easy-to-follow proofs illustrated by numerous examples and figures. The book covers convex analysis, the theory of optimality conditions, duality theory, and numerical methods for solving unconstrained and constrained optimization problems. It addresses not only classical material but also modern topics such as optimality conditions and numerical methods for problems involving nondifferentiable functions, semidefinite programming, metric regularity and stability theory of set-constrained systems, and sensitivity analysis of optimization problems. Based on a decade's worth of notes the author compiled in successfully teaching the subject, this book will help readers to understand the mathematical foundations of the modern theory and methods of nonlinear optimization and to analyze new problems, develop optimality theory for them, and choose or construct numerical solution methods. It is a must for anyone seriously interested in optimization.

913 citations

Book
01 Jun 1985
TL;DR: This monograph describes methods with subgradient locality measures for minimizing nonconvex functions, feasible point methods for convex constrained minimization problems, and methods of feasible directions for nonconvex constrained problems.
Abstract: Fundamentals.- Aggregate subgradient methods for unconstrained convex minimization.- Methods with subgradient locality measures for minimizing nonconvex functions.- Methods with subgradient deletion rules for unconstrained nonconvex minimization.- Feasible point methods for convex constrained minimization problems.- Methods of feasible directions for nonconvex constrained problems.- Bundle methods.- Numerical examples.

503 citations


"Nonsmooth optimization via quasi-Ne..." refers methods in this paper

  • ...Motivated by the low overhead of quasi-Newton methods, Lukšan and Vlček proposed new methods intended to combine the global convergence properties of bundle methods [19,22] with the efficiency of quasi-Newton methods; Haarala [18] gives a good overview....

  • ...The traditional approach to designing algorithms for nonsmooth optimization is to stabilize steepest descent by exploiting gradient or subgradient information evaluated at multiple points: this is the essential idea of bundle methods [19,22] and also of the gradient sampling algorithm [7,23]....

Journal ArticleDOI
TL;DR: A practical, robust algorithm based on gradient sampling is presented to locally minimize functions that are continuous on R^n and continuously differentiable on an open dense subset.
Abstract: Let f be a continuous function on R^n, and suppose f is continuously differentiable on an open dense subset. Such functions arise in many applications, and very often minimizers are points at which f is not differentiable. Of particular interest is the case where f is not convex, and perhaps not even locally Lipschitz, but is a function whose gradient is easily computed where it is defined. We present a practical, robust algorithm to locally minimize such functions, based on gradient sampling. No subgradient information is required by the algorithm. When f is locally Lipschitz and has bounded level sets, and the sampling radius ε is fixed, we show that, with probability 1, the algorithm generates a sequence with a cluster point that is Clarke ε-stationary. Furthermore, we show that if f has a unique Clarke stationary point x̄, then the set of all cluster points generated by the algorithm converges to x̄ as ε is reduced to zero. Numerical results are presented demonstrating the robustness of the algorithm and its applicability in a wide variety of contexts, including cases where f is not locally Lipschitz at minimizers. We report approximate local minimizers for functions in the applications literature which have not, to our knowledge, been obtained previously. When the termination criteria of the algorithm are satisfied, a precise statement about nearness to Clarke ε-stationarity is available. A MATLAB implementation of the algorithm is posted at http://www.cs.nyu.edu/overton/papers/gradsamp/alg.
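The gradient sampling idea can be sketched compactly: sample gradients in an ε-ball around the current point, take the minimum-norm element of their convex hull as a stabilized steepest-descent direction, and shrink ε when that norm is small. The code below is a simplified illustration on a toy nonsmooth function, not the paper's algorithm; the box sampling, the SLSQP-based QP solve, and all tolerances are choices made here.

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    return abs(x[0]) + 2 * abs(x[1])   # toy nonsmooth objective

def grad(x, h=1e-7):
    # central-difference gradient; defined almost everywhere
    g = np.zeros(len(x))
    for i in range(len(x)):
        e = np.zeros(len(x)); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def min_norm_hull(G):
    # minimum-norm point of the convex hull of the columns of G:
    # a small QP over the simplex, solved here with SLSQP for simplicity
    m = G.shape[1]
    res = minimize(lambda w: np.sum((G @ w) ** 2), np.full(m, 1.0 / m),
                   bounds=[(0.0, 1.0)] * m,
                   constraints={'type': 'eq', 'fun': lambda w: w.sum() - 1.0})
    return G @ res.x

def gradient_sampling(x0, eps=0.1, m=20, iters=60, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        # sample gradients near x (box sampling here, a ball in the paper)
        pts = x + eps * rng.uniform(-1.0, 1.0, size=(m, len(x)))
        G = np.stack([grad(p) for p in (x, *pts)], axis=1)
        g = min_norm_hull(G)               # stabilized descent direction
        if np.linalg.norm(g) < 1e-6:
            eps /= 2                        # approximately eps-stationary
            continue
        d = -g / np.linalg.norm(g)
        t = 1.0
        while f(x + t * d) > f(x) - 1e-4 * t * np.linalg.norm(g) and t > 1e-12:
            t /= 2                          # backtracking Armijo step
        x = x + t * d
    return x

x = gradient_sampling([2.0, -1.5])
print(x, f(x))  # x is driven toward the nonsmooth minimizer at the origin
```

The extra gradient evaluations per step are what make the method "far more computationally intensive than BFGS," as the citing paper's excerpts below note, in exchange for its probability-one convergence guarantee.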

477 citations


"Nonsmooth optimization via quasi-Ne..." refers methods in this paper

  • ...This problem was one of the examples in [7]; in the results reported there, the objective function was defined without the logarithm and we enforced the semidefinite constraint by an exact penalty function....

  • ...The gradient sampling method is far more computationally intensive than BFGS, but it does enjoy convergence guarantees with probability one [7,23]....

  • ...If the algorithm breaks down (in practice) without satisfying the desired termination condition, the user has the option to continue the optimization using the gradient sampling method of [7]....

  • ...The traditional approach to designing algorithms for nonsmooth optimization is to stabilize steepest descent by exploiting gradient or subgradient information evaluated at multiple points: this is the essential idea of bundle methods [19,22] and also of the gradient sampling algorithm [7,23]....
