Journal ArticleDOI

A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems

01 Jan 2009 - SIAM Journal on Imaging Sciences (Society for Industrial and Applied Mathematics) - Vol. 2, Iss. 1, pp. 183-202
TL;DR: A new fast iterative shrinkage-thresholding algorithm (FISTA) which preserves the computational simplicity of ISTA but with a global rate of convergence which is proven to be significantly better, both theoretically and practically.
Abstract: We consider the class of iterative shrinkage-thresholding algorithms (ISTA) for solving linear inverse problems arising in signal/image processing. This class of methods, which can be viewed as an extension of the classical gradient algorithm, is attractive due to its simplicity and thus is adequate for solving large-scale problems even with dense matrix data. However, such methods are also known to converge quite slowly. In this paper we present a new fast iterative shrinkage-thresholding algorithm (FISTA) which preserves the computational simplicity of ISTA but with a global rate of convergence which is proven to be significantly better, both theoretically and practically. Initial promising numerical results for wavelet-based image deblurring demonstrate the capabilities of FISTA which is shown to be faster than ISTA by several orders of magnitude.
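
For concreteness, here is a minimal Python sketch of the accelerated shrinkage-thresholding iteration applied to an ℓ1-regularized least-squares instance of the linear inverse problem (min_x 0.5·||Ax − b||² + λ||x||1). The function names, the fixed step size 1/L, and the iteration count are illustrative choices, not the authors' reference implementation.

```python
import numpy as np

def soft_threshold(x, tau):
    """Elementwise shrinkage: proximal operator of tau * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def fista(A, b, lam, n_iter=200):
    """Sketch of FISTA for min_x 0.5*||Ax - b||^2 + lam*||x||_1.

    Uses the fixed step size 1/L, where L (the largest eigenvalue of A^T A)
    is the Lipschitz constant of the gradient of the smooth term.
    """
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    y = x.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = A.T @ (A @ y - b)           # gradient of the smooth term at y
        x_new = soft_threshold(y - grad / L, lam / L)   # ISTA step from y
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum extrapolation
        x, t = x_new, t_new
    return x
```

Dropping the momentum extrapolation (always setting y = x_new) recovers the basic ISTA iteration that the abstract describes as converging more slowly.
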
Citations
Book
23 May 2011
TL;DR: It is argued that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas.
Abstract: Many problems of recent interest in statistics and machine learning can be posed in the framework of convex optimization. Due to the explosion in size and complexity of modern datasets, it is increasingly important to be able to solve problems with a very large number of features or training examples. As a result, both the decentralized collection or storage of these datasets as well as accompanying distributed solution methods are either necessary or at least highly desirable. In this review, we argue that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas. The method was developed in the 1970s, with roots in the 1950s, and is equivalent or closely related to many other algorithms, such as dual decomposition, the method of multipliers, Douglas–Rachford splitting, Spingarn's method of partial inverses, Dykstra's alternating projections, Bregman iterative algorithms for l1 problems, proximal methods, and others. After briefly surveying the theory and history of the algorithm, we discuss applications to a wide variety of statistical and machine learning problems of recent interest, including the lasso, sparse logistic regression, basis pursuit, covariance selection, support vector machines, and many others. We also discuss general distributed optimization, extensions to the nonconvex setting, and efficient implementation, including some details on distributed MPI and Hadoop MapReduce implementations.
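
As a concrete illustration of the method for one of the problems named above (the lasso), here is a minimal Python sketch of ADMM in its scaled form; the splitting, variable names, penalty parameter rho, and fixed iteration count are illustrative assumptions rather than the review's own code.

```python
import numpy as np

def admm_lasso(A, b, lam, rho=1.0, n_iter=200):
    """Sketch of scaled ADMM for the lasso: min_x 0.5*||Ax - b||^2 + lam*||x||_1.

    Splitting: f(x) = 0.5*||Ax - b||^2, g(z) = lam*||z||_1, constraint x = z.
    """
    n = A.shape[1]
    M = A.T @ A + rho * np.eye(n)      # the x-update is a ridge-regression solve
    Atb = A.T @ b
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)
    for _ in range(n_iter):
        x = np.linalg.solve(M, Atb + rho * (z - u))                       # x-update
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)   # z-update (soft threshold)
        u = u + x - z                                                     # scaled dual update
    return z
```
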

17,433 citations


Cites background from "A Fast Iterative Shrinkage-Threshol..."

  • ...[173] compares and benchmarks a number of representative algorithms, including gradient projection [73, 102], homotopy methods [52], iterative shrinkage-thresholding [45], proximal gradient [132, 133, 11, 12], augmented Lagrangian methods [175], and interior-point methods [103]....


Book
24 Aug 2012
TL;DR: This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach, and is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.
Abstract: Today's Web-enabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach. The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudo-code for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic methods, the book stresses a principled model-based approach, often using the language of graphical models to specify models in a concise and intuitive way. Almost all the models described have been implemented in a MATLAB software package--PMTK (probabilistic modeling toolkit)--that is freely available online. The book is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.

8,059 citations


Cites methods from "A Fast Iterative Shrinkage-Threshol..."

  • ...When this method is combined with the iterative soft thresholding technique (for R(θ) = λ||θ||1), plus a continuation method that gradually reduces λ, we get a fast method for the BPDN problem known as the fast iterative shrinkage thresholding algorithm or FISTA (Beck and Teboulle 2009)....


Journal ArticleDOI
TL;DR: In this paper, the authors prove that under some suitable assumptions, it is possible to recover both the low-rank and the sparse components exactly by solving a very convenient convex program called Principal Component Pursuit; among all feasible decompositions, simply minimize a weighted combination of the nuclear norm and of the ℓ1 norm.
Abstract: This article is about a curious phenomenon. Suppose we have a data matrix, which is the superposition of a low-rank component and a sparse component. Can we recover each component individually? We prove that under some suitable assumptions, it is possible to recover both the low-rank and the sparse components exactly by solving a very convenient convex program called Principal Component Pursuit; among all feasible decompositions, simply minimize a weighted combination of the nuclear norm and of the ℓ1 norm. This suggests the possibility of a principled approach to robust principal component analysis since our methodology and results assert that one can recover the principal components of a data matrix even though a positive fraction of its entries are arbitrarily corrupted. This extends to the situation where a fraction of the entries are missing as well. We discuss an algorithm for solving this optimization problem, and present applications in the area of video surveillance, where our methodology allows for the detection of objects in a cluttered background, and in the area of face recognition, where it offers a principled way of removing shadows and specularities in images of faces.
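
The convex program described above can be written out explicitly; the following LaTeX sketch uses illustrative symbols (M for the observed data matrix, L and S for the low-rank and sparse components, λ for the weight), which may differ from the article's own notation.

```latex
% Principal Component Pursuit as described above (symbol names illustrative):
% M is the observed matrix, L the low-rank part, S the sparse part.
\begin{equation*}
  \min_{L,\,S} \; \|L\|_{*} + \lambda \|S\|_{1}
  \quad \text{subject to} \quad L + S = M,
\end{equation*}
% where \|L\|_* is the nuclear norm (sum of singular values) and
% \|S\|_1 is the entrywise \ell_1 norm.
```
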

6,783 citations

Journal ArticleDOI
TL;DR: A first-order primal-dual algorithm for non-smooth convex optimization problems with known saddle-point structure can achieve O(1/N^2) convergence on problems where the primal or the dual objective is uniformly convex, and it can show linear convergence, i.e. O(ω^N) for some ω ∈ (0,1), on smooth problems.
Abstract: In this paper we study a first-order primal-dual algorithm for non-smooth convex optimization problems with known saddle-point structure. We prove convergence to a saddle-point with rate O(1/N) in finite dimensions for the complete class of problems. We further show accelerations of the proposed algorithm to yield improved rates on problems with some degree of smoothness. In particular we show that we can achieve O(1/N^2) convergence on problems where the primal or the dual objective is uniformly convex, and we can show linear convergence, i.e. O(ω^N) for some ω ∈ (0,1), on smooth problems. The wide applicability of the proposed algorithm is demonstrated on several imaging problems such as image denoising, image deconvolution, image inpainting, motion estimation and multi-label image segmentation.
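
For reference, a sketch of the basic first-order primal-dual iteration for a saddle-point problem of the form min_x G(x) + F(Kx), in its commonly stated form with step sizes σ, τ satisfying στ||K||² ≤ 1 and relaxation θ ∈ [0,1]; the accelerated and linearly convergent variants mentioned in the abstract use additional parameter-update rules given in the paper.

```latex
% Basic first-order primal-dual iteration for min_x G(x) + F(Kx)
% (commonly stated form; sigma, tau, theta as described in the lead-in):
\begin{aligned}
  y^{n+1} &= \operatorname{prox}_{\sigma F^{*}}\!\left(y^{n} + \sigma K \bar{x}^{n}\right),\\
  x^{n+1} &= \operatorname{prox}_{\tau G}\!\left(x^{n} - \tau K^{*} y^{n+1}\right),\\
  \bar{x}^{n+1} &= x^{n+1} + \theta\,\left(x^{n+1} - x^{n}\right).
\end{aligned}
```
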

4,487 citations


Cites background or methods from "A Fast Iterative Shrinkage-Threshol..."

  • ...It is shown in [2, 25, 27, 28] that if G or F∗ is uniformly convex (such that G∗, or respectively F, has a Lipschitz continuous gradient), O(1/N^2) convergence can be guaranteed....


  • ...Remark 4 In [2, 25, 27], the O(1/N^2) estimate is theoretically better than ours since it is on the dual energy G∗(−K∗y_N) + F∗(y_N) − (G∗(−K∗ŷ) + F∗(ŷ)) (which can easily be shown to bound ‖x_N − x̂‖², see for instance [13])....


  • ...• FISTA: O(1/N^2) fast iterative shrinkage thresholding algorithm on the dual ROF problem (66) [2, 25]....


  • ...(35) In that case one can show that ∇G∗ is 1/γ-Lipschitz so that the dual problem (4) can be solved in O(1/N^2) using any of the accelerated first order methods of [2, 25, 27], in the sense that the objective (in this case, the dual energy) approaches its optimal value at the rate O(1/N^2), where N is the number of first order iterations....


  • ...• NEST: Restarted version of Nesterov’s algorithm [2, 25, 28], on the dual Huber-ROF problem....


Book
27 Nov 2013
TL;DR: The many different interpretations of proximal operators and algorithms are discussed, their connections to many other topics in optimization and applied mathematics are described, some popular algorithms are surveyed, and a large number of examples of proximal operators that commonly arise in practice are provided.
Abstract: This monograph is about a class of optimization algorithms called proximal algorithms. Much like Newton's method is a standard tool for solving unconstrained smooth optimization problems of modest size, proximal algorithms can be viewed as an analogous tool for nonsmooth, constrained, large-scale, or distributed versions of these problems. They are very generally applicable, but are especially well-suited to problems of substantial recent interest involving large or high-dimensional datasets. Proximal methods sit at a higher level of abstraction than classical algorithms like Newton's method: the base operation is evaluating the proximal operator of a function, which itself involves solving a small convex optimization problem. These subproblems, which generalize the problem of projecting a point onto a convex set, often admit closed-form solutions or can be solved very quickly with standard or simple specialized methods. Here, we discuss the many different interpretations of proximal operators and algorithms, describe their connections to many other topics in optimization and applied mathematics, survey some popular algorithms, and provide a large number of examples of proximal operators that commonly arise in practice.
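
To make the base operation concrete, here is a minimal Python sketch of two proximal operators that admit closed forms (soft thresholding for the ℓ1 norm, and projection onto a box as the prox of an indicator function) together with a single forward-backward step; the function names and signatures are illustrative, not the monograph's notation.

```python
import numpy as np

def prox_l1(v, t):
    """prox_{t*||.||_1}(v): elementwise soft thresholding (closed form)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_box(v, lo, hi):
    """Prox of the indicator of the box [lo, hi]^n, i.e. Euclidean projection."""
    return np.clip(v, lo, hi)

def proximal_gradient_step(x, grad_f, t, prox_g):
    """One forward-backward step: gradient step on the smooth term f,
    followed by the proximal operator of the nonsmooth term g."""
    return prox_g(x - t * grad_f(x), t)
```
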

3,627 citations


Cites background or methods from "A Fast Iterative Shrinkage-Threshol..."

  • ...Important papers on forward-backward splitting include those by Passty [159], Lions and Mercier [129], Fukushima and Mine [88], Gabay [90], Lemaire [120], Eckstein [78], Chen [54], Chen and Rockafellar [55], Tseng [184, 185, 187], Combettes and Wajs [62], and Beck and Teboulle [17, 18]....


  • ...[Figure: convergence plot of f(x^k) − f* versus number of iterations, comparing the subgradient method with the generalized gradient method.] This is taken from the lecture notes of Geoff Gordon and Ryan Tibshirani; “generalized gradient” in the legend means ISTA....


  • ...This is also called the iterative soft thresholding algorithm, or ISTA....


  • ...Here are typical runs for the LASSO, comparing the standard proximal gradient method (ISTA) to its accelerated version (FISTA), plotting f(x^k) − f*....


  • ...Hence generalized gradient update step is: x⁺ = S_t(x + t·Aᵀ(y − Ax)). Resulting algorithm called ISTA (Iterative Soft-Thresholding Algorithm)....


References
Book
01 Jan 1995

12,671 citations

Journal ArticleDOI
TL;DR: Basis Pursuit (BP) is a principle for decomposing a signal into an "optimal" superposition of dictionary elements, where optimal means having the smallest l1 norm of coefficients among all such decompositions.
Abstract: The time-frequency and time-scale communities have recently developed a large number of overcomplete waveform dictionaries --- stationary wavelets, wavelet packets, cosine packets, chirplets, and warplets, to name a few. Decomposition into overcomplete systems is not unique, and several methods for decomposition have been proposed, including the method of frames (MOF), Matching pursuit (MP), and, for special dictionaries, the best orthogonal basis (BOB). Basis Pursuit (BP) is a principle for decomposing a signal into an "optimal" superposition of dictionary elements, where optimal means having the smallest l1 norm of coefficients among all such decompositions. We give examples exhibiting several advantages over MOF, MP, and BOB, including better sparsity and superresolution. BP has interesting relations to ideas in areas as diverse as ill-posed problems, abstract harmonic analysis, total variation denoising, and multiscale edge denoising. BP in highly overcomplete dictionaries leads to large-scale optimization problems. With signals of length 8192 and a wavelet packet dictionary, one gets an equivalent linear program of size 8192 by 212,992. Such problems can be attacked successfully only because of recent advances in linear programming by interior-point methods. We obtain reasonable success with a primal-dual logarithmic barrier method and conjugate-gradient solver.
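
The Basis Pursuit principle described above corresponds to the following optimization problem, written here as a sketch in illustrative notation (s the signal, Φ the overcomplete dictionary, α the coefficient vector):

```latex
% Basis Pursuit: among all exact decompositions of the signal s over the
% dictionary Phi, pick the one with smallest l1 norm of coefficients.
\begin{equation*}
  \min_{\alpha} \; \|\alpha\|_{1}
  \quad \text{subject to} \quad \Phi \alpha = s .
\end{equation*}
```
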

9,950 citations



Book
01 Jun 1970
TL;DR: A classic monograph on iterative methods for nonlinear equations and minimization in several variables, covering background material in linear algebra and analysis, gradient mappings and minimization, general iterative and minimization methods, local rates of convergence, and semilocal and global convergence, including convergence under partial ordering.
Abstract: Preface to the Classics Edition Preface Acknowledgments Glossary of Symbols Introduction Part I. Background Material. 1. Sample Problems 2. Linear Algebra 3. Analysis Part II. Nonconstructive Existence Theorems. 4. Gradient Mappings and Minimization 5. Contractions and the Continuation Property 6. The Degree of a Mapping Part III. Iterative Methods. 7. General Iterative Methods 8. Minimization Methods Part IV. Local Convergence. 9. Rates of Convergence-General 10. One-Step Stationary Methods 11. Multistep Methods and Additional One-Step Methods Part V. Semilocal and Global Convergence. 12. Contractions and Nonlinear Majorants 13. Convergence under Partial Ordering 14. Convergence of Minimization Methods An Annotated List of Basic Reference Books Bibliography Author Index Subject Index.

7,669 citations

Journal ArticleDOI
TL;DR: In this article, the authors propose a smoothness-adaptive thresholding procedure, called SureShrink, which assigns a threshold to each dyadic resolution level by minimizing the Stein unbiased estimate of risk (SURE) and is near minimax simultaneously over a whole interval of the Besov scale; the size of this interval depends on the choice of mother wavelet.
Abstract: We attempt to recover a function of unknown smoothness from noisy sampled data. We introduce a procedure, SureShrink, that suppresses noise by thresholding the empirical wavelet coefficients. The thresholding is adaptive: A threshold level is assigned to each dyadic resolution level by the principle of minimizing the Stein unbiased estimate of risk (Sure) for threshold estimates. The computational effort of the overall procedure is order N · log(N) as a function of the sample size N. SureShrink is smoothness adaptive: If the unknown function contains jumps, then the reconstruction (essentially) does also; if the unknown function has a smooth piece, then the reconstruction is (essentially) as smooth as the mother wavelet will allow. The procedure is in a sense optimally smoothness adaptive: It is near minimax simultaneously over a whole interval of the Besov scale; the size of this interval depends on the choice of mother wavelet. We know from a previous paper by the authors that traditional smoot...
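
A minimal Python sketch of the SURE-based threshold choice for a single resolution level, assuming the coefficients have already been rescaled to unit noise variance; this simplified version omits SureShrink's fallback to the universal threshold for very sparse levels, and the helper names are illustrative.

```python
import numpy as np

def sure_threshold(coeffs):
    """Choose a soft threshold for one resolution level by minimizing SURE.

    Assumes coefficients are rescaled to unit noise variance. Simplified sketch:
    the full SureShrink procedure also falls back to the universal threshold
    sqrt(2*log n) when the level is very sparse.
    """
    x = np.asarray(coeffs, dtype=float)
    n = x.size
    candidates = np.sort(np.abs(x))
    # SURE(t) = n - 2*#{i : |x_i| <= t} + sum_i min(x_i^2, t^2)
    risks = [n - 2 * np.sum(np.abs(x) <= t) + np.sum(np.minimum(x**2, t**2))
             for t in candidates]
    return candidates[int(np.argmin(risks))]

def soft(x, t):
    """Soft-threshold the empirical wavelet coefficients at level threshold t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
```
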

4,699 citations



Book
31 Jul 1996
TL;DR: A monograph on the regularization of inverse problems, covering ill-posed linear operator equations, continuous and iterative regularization methods, Tikhonov regularization (including nonlinear problems), the conjugate gradient method, regularization with differential operators, and numerical realization.
Abstract: Preface. 1. Introduction: Examples of Inverse Problems. 2. Ill-Posed Linear Operator Equations. 3. Regularization Operators. 4. Continuous Regularization Methods. 5. Tikhonov Regularization. 6. Iterative Regularization Methods. 7. The Conjugate Gradient Method. 8. Regularization with Differential Operators. 9. Numerical Realization. 10. Tikhonov Regularization of Nonlinear Problems. 11. Iterative Methods for Nonlinear Problems. A. Appendix: A.1. Weighted Polynomial Minimization Problems. A.2. Orthogonal Polynomials. A.3. Christoffel Functions. Bibliography. Index.
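
As a pointer to the central construction, a sketch of Tikhonov regularization in its standard form for a linear ill-posed problem Ax = y, with α > 0 the regularization parameter (the monograph also treats regularization with differential operators, where the penalty ||x||² is replaced by a term such as ||Lx||², as well as nonlinear problems):

```latex
% Tikhonov regularization of Ax = y in standard form (alpha > 0):
\begin{equation*}
  x_{\alpha} \;=\; \arg\min_{x}\; \|Ax - y\|^{2} + \alpha \|x\|^{2}
  \;=\; \left(A^{*}A + \alpha I\right)^{-1} A^{*} y .
\end{equation*}
```
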

4,690 citations