Journal ArticleDOI

A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems

01 Jan 2009 - SIAM Journal on Imaging Sciences (Society for Industrial and Applied Mathematics) - Vol. 2, Iss. 1, pp. 183-202
TL;DR: A new fast iterative shrinkage-thresholding algorithm (FISTA) which preserves the computational simplicity of ISTA but with a global rate of convergence which is proven to be significantly better, both theoretically and practically.
Abstract: We consider the class of iterative shrinkage-thresholding algorithms (ISTA) for solving linear inverse problems arising in signal/image processing. This class of methods, which can be viewed as an extension of the classical gradient algorithm, is attractive due to its simplicity and thus is adequate for solving large-scale problems even with dense matrix data. However, such methods are also known to converge quite slowly. In this paper we present a new fast iterative shrinkage-thresholding algorithm (FISTA) which preserves the computational simplicity of ISTA but with a global rate of convergence which is proven to be significantly better, both theoretically and practically. Initial promising numerical results for wavelet-based image deblurring demonstrate the capabilities of FISTA which is shown to be faster than ISTA by several orders of magnitude.
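
For concreteness, here is a minimal Python sketch of the accelerated shrinkage-thresholding iteration applied to an ℓ1-regularized least-squares instance of the linear inverse problem (min_x 0.5·||Ax − b||² + λ||x||1). The function names, the fixed step size 1/L, and the iteration count are illustrative choices, not the authors' reference implementation.

```python
import numpy as np

def soft_threshold(x, tau):
    """Elementwise shrinkage: proximal operator of tau * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def fista(A, b, lam, n_iter=200):
    """Sketch of FISTA for min_x 0.5*||Ax - b||^2 + lam*||x||_1.

    Uses the fixed step size 1/L, where L (the largest eigenvalue of A^T A)
    is the Lipschitz constant of the gradient of the smooth term.
    """
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    y = x.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = A.T @ (A @ y - b)           # gradient of the smooth term at y
        x_new = soft_threshold(y - grad / L, lam / L)   # ISTA step from y
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum extrapolation
        x, t = x_new, t_new
    return x
```

Dropping the momentum extrapolation (always setting y = x_new) recovers the basic ISTA iteration that the abstract describes as converging more slowly.
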
Citations
Book
23 May 2011
TL;DR: It is argued that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas.
Abstract: Many problems of recent interest in statistics and machine learning can be posed in the framework of convex optimization. Due to the explosion in size and complexity of modern datasets, it is increasingly important to be able to solve problems with a very large number of features or training examples. As a result, both the decentralized collection or storage of these datasets as well as accompanying distributed solution methods are either necessary or at least highly desirable. In this review, we argue that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas. The method was developed in the 1970s, with roots in the 1950s, and is equivalent or closely related to many other algorithms, such as dual decomposition, the method of multipliers, Douglas–Rachford splitting, Spingarn's method of partial inverses, Dykstra's alternating projections, Bregman iterative algorithms for l1 problems, proximal methods, and others. After briefly surveying the theory and history of the algorithm, we discuss applications to a wide variety of statistical and machine learning problems of recent interest, including the lasso, sparse logistic regression, basis pursuit, covariance selection, support vector machines, and many others. We also discuss general distributed optimization, extensions to the nonconvex setting, and efficient implementation, including some details on distributed MPI and Hadoop MapReduce implementations.
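
As a concrete illustration of the method for one of the problems named above (the lasso), here is a minimal Python sketch of ADMM in its scaled form; the splitting, variable names, penalty parameter rho, and fixed iteration count are illustrative assumptions rather than the review's own code.

```python
import numpy as np

def admm_lasso(A, b, lam, rho=1.0, n_iter=200):
    """Sketch of scaled ADMM for the lasso: min_x 0.5*||Ax - b||^2 + lam*||x||_1.

    Splitting: f(x) = 0.5*||Ax - b||^2, g(z) = lam*||z||_1, constraint x = z.
    """
    n = A.shape[1]
    M = A.T @ A + rho * np.eye(n)      # the x-update is a ridge-regression solve
    Atb = A.T @ b
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)
    for _ in range(n_iter):
        x = np.linalg.solve(M, Atb + rho * (z - u))                       # x-update
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)   # z-update (soft threshold)
        u = u + x - z                                                     # scaled dual update
    return z
```
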

17,433 citations


Cites background from "A Fast Iterative Shrinkage-Threshol..."

  • ...[173] compares and benchmarks a number of representative algorithms, including gradient projection [73, 102], homotopy methods [52], iterative shrinkage-thresholding [45], proximal gradient [132, 133, 11, 12], augmented Lagrangian methods [175], and interior-point methods [103]....


Book
24 Aug 2012
TL;DR: This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach, and is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.
Abstract: Today's Web-enabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach. The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudo-code for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic methods, the book stresses a principled model-based approach, often using the language of graphical models to specify models in a concise and intuitive way. Almost all the models described have been implemented in a MATLAB software package--PMTK (probabilistic modeling toolkit)--that is freely available online. The book is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.

8,059 citations


Cites methods from "A Fast Iterative Shrinkage-Threshol..."

  • ...When this method is combined with the iterative soft thresholding technique (for R(θ) = λ||θ||1), plus a continuation method that gradually reduces λ, we get a fast method for the BPDN problem known as the fast iterative shrinkage thresholding algorithm or FISTA (Beck and Teboulle 2009)....


Journal ArticleDOI
TL;DR: In this paper, the authors prove that under some suitable assumptions, it is possible to recover both the low-rank and the sparse components exactly by solving a very convenient convex program called Principal Component Pursuit; among all feasible decompositions, simply minimize a weighted combination of the nuclear norm and of the ℓ1 norm.
Abstract: This article is about a curious phenomenon. Suppose we have a data matrix, which is the superposition of a low-rank component and a sparse component. Can we recover each component individually? We prove that under some suitable assumptions, it is possible to recover both the low-rank and the sparse components exactly by solving a very convenient convex program called Principal Component Pursuit; among all feasible decompositions, simply minimize a weighted combination of the nuclear norm and of the ℓ1 norm. This suggests the possibility of a principled approach to robust principal component analysis since our methodology and results assert that one can recover the principal components of a data matrix even though a positive fraction of its entries are arbitrarily corrupted. This extends to the situation where a fraction of the entries are missing as well. We discuss an algorithm for solving this optimization problem, and present applications in the area of video surveillance, where our methodology allows for the detection of objects in a cluttered background, and in the area of face recognition, where it offers a principled way of removing shadows and specularities in images of faces.
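
The convex program described above can be written out explicitly; the following LaTeX sketch uses illustrative symbols (M for the observed data matrix, L and S for the low-rank and sparse components, λ for the weight), which may differ from the article's own notation.

```latex
% Principal Component Pursuit as described above (symbol names illustrative):
% M is the observed matrix, L the low-rank part, S the sparse part.
\begin{equation*}
  \min_{L,\,S} \; \|L\|_{*} + \lambda \|S\|_{1}
  \quad \text{subject to} \quad L + S = M,
\end{equation*}
% where \|L\|_* is the nuclear norm (sum of singular values) and
% \|S\|_1 is the entrywise \ell_1 norm.
```
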

6,783 citations

Journal ArticleDOI
TL;DR: A first-order primal-dual algorithm for non-smooth convex optimization problems with known saddle-point structure can achieve O(1/N^2) convergence on problems where the primal or the dual objective is uniformly convex, and it can show linear convergence, i.e. O(ω^N) for some ω ∈ (0,1), on smooth problems.
Abstract: In this paper we study a first-order primal-dual algorithm for non-smooth convex optimization problems with known saddle-point structure. We prove convergence to a saddle-point with rate O(1/N) in finite dimensions for the complete class of problems. We further show accelerations of the proposed algorithm to yield improved rates on problems with some degree of smoothness. In particular we show that we can achieve O(1/N^2) convergence on problems where the primal or the dual objective is uniformly convex, and we can show linear convergence, i.e. O(ω^N) for some ω ∈ (0,1), on smooth problems. The wide applicability of the proposed algorithm is demonstrated on several imaging problems such as image denoising, image deconvolution, image inpainting, motion estimation and multi-label image segmentation.
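
For reference, a sketch of the basic first-order primal-dual iteration for a saddle-point problem of the form min_x G(x) + F(Kx), in its commonly stated form with step sizes σ, τ satisfying στ||K||² ≤ 1 and relaxation θ ∈ [0,1]; the accelerated and linearly convergent variants mentioned in the abstract use additional parameter-update rules given in the paper.

```latex
% Basic first-order primal-dual iteration for min_x G(x) + F(Kx)
% (commonly stated form; sigma, tau, theta as described in the lead-in):
\begin{aligned}
  y^{n+1} &= \operatorname{prox}_{\sigma F^{*}}\!\left(y^{n} + \sigma K \bar{x}^{n}\right),\\
  x^{n+1} &= \operatorname{prox}_{\tau G}\!\left(x^{n} - \tau K^{*} y^{n+1}\right),\\
  \bar{x}^{n+1} &= x^{n+1} + \theta\,\left(x^{n+1} - x^{n}\right).
\end{aligned}
```
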

4,487 citations


Cites background or methods from "A Fast Iterative Shrinkage-Threshol..."

  • ...It is shown in [2, 25, 27, 28] that if G or F∗ is uniformly convex (such that G∗, or respectively F, has a Lipschitz continuous gradient), O(1/N^2) convergence can be guaranteed....


  • ...Remark 4 In [2, 25, 27], the O(1/N^2) estimate is theoretically better than ours since it is on the dual energy G∗(−K∗y_N) + F∗(y_N) − (G∗(−K∗ŷ) + F∗(ŷ)) (which can easily be shown to bound ‖x_N − x̂‖², see for instance [13])....


  • ...• FISTA: O(1/N^2) fast iterative shrinkage thresholding algorithm on the dual ROF problem (66) [2, 25]....


  • ...(35) In that case one can show that ∇G∗ is 1/γ-Lipschitz so that the dual problem (4) can be solved in O(1/N^2) using any of the accelerated first order methods of [2, 25, 27], in the sense that the objective (in this case, the dual energy) approaches its optimal value at the rate O(1/N^2), where N is the number of first order iterations....


  • ...• NEST: Restarted version of Nesterov’s algorithm [2, 25, 28], on the dual Huber-ROF problem....


Book
27 Nov 2013
TL;DR: The many different interpretations of proximal operators and algorithms are discussed, their connections to many other topics in optimization and applied mathematics are described, some popular algorithms are surveyed, and a large number of examples of proximal operators that commonly arise in practice are provided.
Abstract: This monograph is about a class of optimization algorithms called proximal algorithms. Much like Newton's method is a standard tool for solving unconstrained smooth optimization problems of modest size, proximal algorithms can be viewed as an analogous tool for nonsmooth, constrained, large-scale, or distributed versions of these problems. They are very generally applicable, but are especially well-suited to problems of substantial recent interest involving large or high-dimensional datasets. Proximal methods sit at a higher level of abstraction than classical algorithms like Newton's method: the base operation is evaluating the proximal operator of a function, which itself involves solving a small convex optimization problem. These subproblems, which generalize the problem of projecting a point onto a convex set, often admit closed-form solutions or can be solved very quickly with standard or simple specialized methods. Here, we discuss the many different interpretations of proximal operators and algorithms, describe their connections to many other topics in optimization and applied mathematics, survey some popular algorithms, and provide a large number of examples of proximal operators that commonly arise in practice.
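
To make the base operation concrete, here is a minimal Python sketch of two proximal operators that admit closed forms (soft thresholding for the ℓ1 norm, and projection onto a box as the prox of an indicator function) together with a single forward-backward step; the function names and signatures are illustrative, not the monograph's notation.

```python
import numpy as np

def prox_l1(v, t):
    """prox_{t*||.||_1}(v): elementwise soft thresholding (closed form)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_box(v, lo, hi):
    """Prox of the indicator of the box [lo, hi]^n, i.e. Euclidean projection."""
    return np.clip(v, lo, hi)

def proximal_gradient_step(x, grad_f, t, prox_g):
    """One forward-backward step: gradient step on the smooth term f,
    followed by the proximal operator of the nonsmooth term g."""
    return prox_g(x - t * grad_f(x), t)
```
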

3,627 citations


Cites background or methods from "A Fast Iterative Shrinkage-Threshol..."

  • ...Important papers on forward-backward splitting include those by Passty [159], Lions and Mercier [129], Fukushima and Mine [88], Gabay [90], Lemaire [120], Eckstein [78], Chen [54], Chen and Rockafellar [55], Tseng [184, 185, 187], Combettes and Wajs [62], and Beck and Teboulle [17, 18]....


  • ...[Figure: convergence plot of f(x^k) − f* versus number of iterations, comparing the subgradient method with the generalized gradient method.] This is taken from the lecture notes of Geoff Gordon and Ryan Tibshirani; “generalized gradient” in the legend means ISTA....


  • ...This is also called the iterative soft thresholding algorithm, or ISTA....


  • ...Here are typical runs for the LASSO, comparing the standard proximal gradient method (ISTA) to its accelerated version (FISTA), plotting f(x^k) − f*....


  • ...Hence generalized gradient update step is: x⁺ = S_t(x + t·Aᵀ(y − Ax)). Resulting algorithm called ISTA (Iterative Soft-Thresholding Algorithm)....


References
Book
01 Jan 1995

12,671 citations

Journal ArticleDOI
TL;DR: Basis Pursuit (BP) is a principle for decomposing a signal into an "optimal" superposition of dictionary elements, where optimal means having the smallest l1 norm of coefficients among all such decompositions.
Abstract: The time-frequency and time-scale communities have recently developed a large number of overcomplete waveform dictionaries --- stationary wavelets, wavelet packets, cosine packets, chirplets, and warplets, to name a few. Decomposition into overcomplete systems is not unique, and several methods for decomposition have been proposed, including the method of frames (MOF), Matching pursuit (MP), and, for special dictionaries, the best orthogonal basis (BOB). Basis Pursuit (BP) is a principle for decomposing a signal into an "optimal" superposition of dictionary elements, where optimal means having the smallest l1 norm of coefficients among all such decompositions. We give examples exhibiting several advantages over MOF, MP, and BOB, including better sparsity and superresolution. BP has interesting relations to ideas in areas as diverse as ill-posed problems, abstract harmonic analysis, total variation denoising, and multiscale edge denoising. BP in highly overcomplete dictionaries leads to large-scale optimization problems. With signals of length 8192 and a wavelet packet dictionary, one gets an equivalent linear program of size 8192 by 212,992. Such problems can be attacked successfully only because of recent advances in linear programming by interior-point methods. We obtain reasonable success with a primal-dual logarithmic barrier method and conjugate-gradient solver.
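
The Basis Pursuit principle described above corresponds to the following optimization problem, written here as a sketch in illustrative notation (s the signal, Φ the overcomplete dictionary, α the coefficient vector):

```latex
% Basis Pursuit: among all exact decompositions of the signal s over the
% dictionary Phi, pick the one with smallest l1 norm of coefficients.
\begin{equation*}
  \min_{\alpha} \; \|\alpha\|_{1}
  \quad \text{subject to} \quad \Phi \alpha = s .
\end{equation*}
```
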

9,950 citations



Book
01 Jun 1970
TL;DR: A classic monograph on iterative methods for nonlinear equations and minimization in several variables, covering background material in linear algebra and analysis, gradient mappings and minimization, general iterative and minimization methods, local rates of convergence, and semilocal and global convergence, including convergence under partial ordering.
Abstract: Preface to the Classics Edition Preface Acknowledgments Glossary of Symbols Introduction Part I. Background Material. 1. Sample Problems 2. Linear Algebra 3. Analysis Part II. Nonconstructive Existence Theorems. 4. Gradient Mappings and Minimization 5. Contractions and the Continuation Property 6. The Degree of a Mapping Part III. Iterative Methods. 7. General Iterative Methods 8. Minimization Methods Part IV. Local Convergence. 9. Rates of Convergence-General 10. One-Step Stationary Methods 11. Multistep Methods and Additional One-Step Methods Part V. Semilocal and Global Convergence. 12. Contractions and Nonlinear Majorants 13. Convergence under Partial Ordering 14. Convergence of Minimization Methods An Annotated List of Basic Reference Books Bibliography Author Index Subject Index.

7,669 citations

Journal ArticleDOI
TL;DR: In this article, the authors propose a smoothness-adaptive thresholding procedure, called SureShrink, which assigns a threshold to each dyadic resolution level by minimizing the Stein unbiased estimate of risk (SURE) and is near minimax simultaneously over a whole interval of the Besov scale; the size of this interval depends on the choice of mother wavelet.
Abstract: We attempt to recover a function of unknown smoothness from noisy sampled data. We introduce a procedure, SureShrink, that suppresses noise by thresholding the empirical wavelet coefficients. The thresholding is adaptive: A threshold level is assigned to each dyadic resolution level by the principle of minimizing the Stein unbiased estimate of risk (Sure) for threshold estimates. The computational effort of the overall procedure is order N · log(N) as a function of the sample size N. SureShrink is smoothness adaptive: If the unknown function contains jumps, then the reconstruction (essentially) does also; if the unknown function has a smooth piece, then the reconstruction is (essentially) as smooth as the mother wavelet will allow. The procedure is in a sense optimally smoothness adaptive: It is near minimax simultaneously over a whole interval of the Besov scale; the size of this interval depends on the choice of mother wavelet. We know from a previous paper by the authors that traditional smoot...
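
A minimal Python sketch of the SURE-based threshold choice for a single resolution level, assuming the coefficients have already been rescaled to unit noise variance; this simplified version omits SureShrink's fallback to the universal threshold for very sparse levels, and the helper names are illustrative.

```python
import numpy as np

def sure_threshold(coeffs):
    """Choose a soft threshold for one resolution level by minimizing SURE.

    Assumes coefficients are rescaled to unit noise variance. Simplified sketch:
    the full SureShrink procedure also falls back to the universal threshold
    sqrt(2*log n) when the level is very sparse.
    """
    x = np.asarray(coeffs, dtype=float)
    n = x.size
    candidates = np.sort(np.abs(x))
    # SURE(t) = n - 2*#{i : |x_i| <= t} + sum_i min(x_i^2, t^2)
    risks = [n - 2 * np.sum(np.abs(x) <= t) + np.sum(np.minimum(x**2, t**2))
             for t in candidates]
    return candidates[int(np.argmin(risks))]

def soft(x, t):
    """Soft-threshold the empirical wavelet coefficients at level threshold t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
```
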

4,699 citations



Book
31 Jul 1996
TL;DR: A monograph on the regularization of inverse problems, covering ill-posed linear operator equations, continuous and iterative regularization methods, Tikhonov regularization (including nonlinear problems), the conjugate gradient method, regularization with differential operators, and numerical realization.
Abstract: Preface. 1. Introduction: Examples of Inverse Problems. 2. Ill-Posed Linear Operator Equations. 3. Regularization Operators. 4. Continuous Regularization Methods. 5. Tikhonov Regularization. 6. Iterative Regularization Methods. 7. The Conjugate Gradient Method. 8. Regularization with Differential Operators. 9. Numerical Realization. 10. Tikhonov Regularization of Nonlinear Problems. 11. Iterative Methods for Nonlinear Problems. A. Appendix: A.1. Weighted Polynomial Minimization Problems. A.2. Orthogonal Polynomials. A.3. Christoffel Functions. Bibliography. Index.
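
As a pointer to the central construction, a sketch of Tikhonov regularization in its standard form for a linear ill-posed problem Ax = y, with α > 0 the regularization parameter (the monograph also treats regularization with differential operators, where the penalty ||x||² is replaced by a term such as ||Lx||², as well as nonlinear problems):

```latex
% Tikhonov regularization of Ax = y in standard form (alpha > 0):
\begin{equation*}
  x_{\alpha} \;=\; \arg\min_{x}\; \|Ax - y\|^{2} + \alpha \|x\|^{2}
  \;=\; \left(A^{*}A + \alpha I\right)^{-1} A^{*} y .
\end{equation*}
```
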

4,690 citations