Journal ArticleDOI

Multivariate stochastic approximation using a simultaneous perturbation gradient approximation

01 Mar 1992-IEEE Transactions on Automatic Control (IEEE)-Vol. 37, Iss: 3, pp 332-341
TL;DR: The paper presents an SA algorithm based on a simultaneous perturbation gradient approximation instead of the standard finite-difference approximation of Kiefer-Wolfowitz type procedures; the algorithm can be significantly more efficient than the standard algorithms in large-dimensional problems.
Abstract: The problem of finding a root of the multivariate gradient equation that arises in function minimization is considered. When only noisy measurements of the function are available, a stochastic approximation (SA) algorithm of the general Kiefer-Wolfowitz type is appropriate for estimating the root. The paper presents an SA algorithm that is based on a simultaneous perturbation gradient approximation instead of the standard finite-difference approximation of Kiefer-Wolfowitz type procedures. Theory and numerical experience indicate that the algorithm can be significantly more efficient than the standard algorithms in large-dimensional problems.
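
For orientation, a minimal sketch of the simultaneous perturbation idea is given below. It is an illustrative implementation under assumed hyperparameters (gain sequences a_k = a/k^0.602 and c_k = c/k^0.101; the names spsa_minimize and noisy_quadratic are invented for this sketch), not the paper's exact procedure:

```python
import numpy as np

def spsa_minimize(loss, theta0, n_iter=500, a=0.1, c=0.1, alpha=0.602, gamma=0.101, seed=0):
    """Sketch of SPSA: every coordinate of theta is perturbed at once with a
    random +/-1 (Bernoulli) vector, so each gradient estimate costs only two
    noisy loss evaluations regardless of the problem dimension p."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for k in range(1, n_iter + 1):
        a_k = a / k ** alpha                                  # step-size gain sequence
        c_k = c / k ** gamma                                  # perturbation-size gain sequence
        delta = rng.choice([-1.0, 1.0], size=theta.shape)     # simultaneous Bernoulli perturbation
        g_hat = (loss(theta + c_k * delta) - loss(theta - c_k * delta)) / (2.0 * c_k * delta)
        theta -= a_k * g_hat
    return theta

# Usage: minimize a noisy 50-dimensional quadratic (minimum at the origin).
p = 50
noisy_quadratic = lambda th: float(np.sum(th ** 2)) + np.random.normal(scale=0.01)
print(np.linalg.norm(spsa_minimize(noisy_quadratic, np.ones(p))))
```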


Citations
Journal ArticleDOI
14 Sep 2017-Nature
TL;DR: The experimental optimization of Hamiltonian problems with up to six qubits and more than one hundred Pauli terms is demonstrated, determining the ground-state energy for molecules of increasing size, up to BeH2.
Abstract: The ground-state energy of small molecules is determined efficiently using six qubits of a superconducting quantum processor. Quantum simulation is currently the most promising application of quantum computers. However, only a few quantum simulations of very small systems have been performed experimentally. Here, researchers from IBM present quantum simulations of larger systems using a variational quantum eigenvalue solver (or eigensolver), a previously suggested method for quantum optimization. They perform quantum chemical calculations of LiH and BeH2 and an energy minimization procedure on a four-qubit Heisenberg model. Their application of the variational quantum eigensolver is hardware-efficient, which means that it is optimized on the given architecture. Noise is a big problem in this implementation, but quantum error correction could eventually help this experimental set-up to yield a quantum simulation of chemically interesting systems on a quantum computer. Quantum computers can be used to address electronic-structure problems and problems in materials science and condensed matter physics that can be formulated as interacting fermionic problems, problems which stretch the limits of existing high-performance computers1. Finding exact solutions to such problems numerically has a computational cost that scales exponentially with the size of the system, and Monte Carlo methods are unsuitable owing to the fermionic sign problem. These limitations of classical computational methods have made solving even few-atom electronic-structure problems interesting for implementation using medium-sized quantum computers. Yet experimental implementations have so far been restricted to molecules involving only hydrogen and helium2,3,4,5,6,7,8. Here we demonstrate the experimental optimization of Hamiltonian problems with up to six qubits and more than one hundred Pauli terms, determining the ground-state energy for molecules of increasing size, up to BeH2. We achieve this result by using a variational quantum eigenvalue solver (eigensolver) with efficiently prepared trial states that are tailored specifically to the interactions that are available in our quantum processor, combined with a compact encoding of fermionic Hamiltonians9 and a robust stochastic optimization routine10. We demonstrate the flexibility of our approach by applying it to a problem of quantum magnetism, an antiferromagnetic Heisenberg model in an external magnetic field. In all cases, we find agreement between our experiments and numerical simulations using a model of the device with noise. Our results help to elucidate the requirements for scaling the method to larger systems and for bridging the gap between key problems in high-performance computing and their implementation on quantum hardware.
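
As a rough, textbook-level restatement (not taken from the paper itself): the eigensolver prepares a hardware-efficient trial state |ψ(θ)⟩ and a classical stochastic optimizer tunes θ to minimize the measured energy, which by the variational principle bounds the ground-state energy E_0 from above. Each "Pauli term" mentioned above contributes one measured expectation value:

```latex
H = \sum_{\alpha} h_\alpha P_\alpha ,
\qquad
E(\theta) = \langle \psi(\theta) \,|\, H \,|\, \psi(\theta) \rangle
          = \sum_{\alpha} h_\alpha \,\langle \psi(\theta) \,|\, P_\alpha \,|\, \psi(\theta) \rangle
          \;\ge\; E_0 .
```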

2,348 citations


Cites background or methods from "Multivariate stochastic approximati..."

  • ...Following Feynman’s idea for quantum simulation, a quantum algorithm for the ground state problem of interacting fermions was proposed in [14] and [15]....


  • ...The convergence of θ_k to the optimal solution θ* can be proven even in the presence of stochastic fluctuations, if the starting point is in the domain of attraction of the problem [15]....


  • ...The simultaneous perturbation stochastic approximation (SPSA) algorithm, introduced in [15], is a gradient-descent method that gives a level of accuracy in the optimization of the cost function that is comparable with finite-difference gradient approximations, while saving an order O(p) of cost function evaluations....


Posted Content
TL;DR: This work considers a small-scale version of conditional computation, where sparse stochastic units form a distributed representation of gaters that can turn off, in combinatorially many ways, large chunks of the computation performed in the rest of the neural network.
Abstract: Stochastic neurons and hard non-linearities can be useful for a number of reasons in deep learning models, but in many cases they pose a challenging problem: how to estimate the gradient of a loss function with respect to the input of such stochastic or non-smooth neurons? I.e., can we "back-propagate" through these stochastic neurons? We examine this question, existing approaches, and compare four families of solutions, applicable in different settings. One of them is the minimum variance unbiased gradient estimator for stochastic binary neurons (a special case of the REINFORCE algorithm). A second approach, introduced here, decomposes the operation of a binary stochastic neuron into a stochastic binary part and a smooth differentiable part, which approximates the expected effect of the pure stochastic binary neuron to first order. A third approach involves the injection of additive or multiplicative noise in a computational graph that is otherwise differentiable. A fourth approach heuristically copies the gradient with respect to the stochastic output directly as an estimator of the gradient with respect to the sigmoid argument (we call this the straight-through estimator). To explore a context where these estimators are useful, we consider a small-scale version of conditional computation, where sparse stochastic units form a distributed representation of gaters that can turn off, in combinatorially many ways, large chunks of the computation performed in the rest of the neural network. In this case, it is important that the gating units produce an actual 0 most of the time. The resulting sparsity can potentially be exploited to greatly reduce the computational cost of large deep networks for which conditional computation would be useful.
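
As a hedged illustration of the fourth family mentioned (the straight-through estimator), the sketch below copies the gradient with respect to the sampled binary gates straight back onto the pre-sigmoid logits; the toy loss, learning rate, and function names are invented for the example and are not from the cited work:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gates(logits):
    """Forward pass: hard 0/1 gates sampled from sigmoid probabilities (non-differentiable)."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return (rng.random(p.shape) < p).astype(float), p

def straight_through(dloss_dgates):
    """Backward pass: pass the gradient w.r.t. the binary gates directly to the
    logits, as if the sampling step were the identity (biased but simple)."""
    return dloss_dgates

# Toy usage: drive the expected number of active gates toward 2 by minimizing (sum(h) - 2)^2.
logits = np.zeros(10)
for _ in range(500):
    h, _ = sample_gates(logits)
    dloss_dh = np.full_like(h, 2.0 * (h.sum() - 2.0))   # dL/dh_i for L = (sum(h) - 2)^2
    logits -= 0.05 * straight_through(dloss_dh)
print(np.round(1.0 / (1.0 + np.exp(-logits)), 2))        # learned gate probabilities, roughly 0.2 each
```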

2,178 citations


Cites background or methods from "Multivariate stochastic approximati..."


  • ...Unlike the SPSA (Spall, 1992) estimator, our estimator is unbiased even though the perturbations are not small (0 or 1), and it multiplies by the perturbation rather than dividing by it....


  • ...Gradient estimators based on stochastic perturbations have long been shown to be much more efficient than standard finite-difference approximations (Spall, 1992)....


  • ...Instead, a perturbation-based estimator such as found in Simultaneous Perturbation Stochastic Approximation (SPSA) (Spall, 1992) chooses a random perturbation vector z (e.g., isotropic Gaussian noise of variance σ²) and estimates the gradient of the expected loss with respect to u_i through (L(u+z) − L(u−z)) / (2z_i)....


Journal ArticleDOI
TL;DR: A Composite PSO, in which the heuristic parameters of PSO are controlled by a Differential Evolution algorithm during the optimization, is described, and results for many well-known and widely used test functions are given.
Abstract: This paper presents an overview of our most recent results concerning the Particle Swarm Optimization (PSO) method. Techniques for the alleviation of local minima, and for detecting multiple minimizers are described. Moreover, results on the ability of the PSO in tackling Multiobjective, Minimax, Integer Programming and ℓ1 errors-in-variables problems, as well as problems in noisy and continuously changing environments, are reported. Finally, a Composite PSO, in which the heuristic parameters of PSO are controlled by a Differential Evolution algorithm during the optimization, is described, and results for many well-known and widely used test functions are given.
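
For context, the sketch below shows the canonical inertia-weight PSO update that variants such as the Composite PSO build on; the swarm size, search bounds, and coefficient values are illustrative assumptions, not the settings studied in the cited overview:

```python
import numpy as np

def pso_minimize(f, dim, n_particles=30, n_iter=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic inertia-weight PSO: each particle moves under inertia plus random
    attraction toward its own best position (pbest) and the swarm best (gbest)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5.0, 5.0, size=(n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_val = np.apply_along_axis(f, 1, x)
    gbest = pbest[np.argmin(pbest_val)].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # velocity update
        x = x + v                                                    # position update
        vals = np.apply_along_axis(f, 1, x)
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[np.argmin(pbest_val)].copy()
    return gbest

# Usage: minimize the 5-dimensional sphere function (minimum at the origin).
print(pso_minimize(lambda z: float(np.sum(z ** 2)), dim=5))
```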

1,436 citations


Cites methods from "Multivariate stochastic approximati..."

  • ...Recently, Arnold in his Ph.D. thesis (Arnold, 2001) extensively tested numerous optimization methods under noise, including: (1) the direct pattern search algorithm of Hooke and Jeeves (Hooke and Jeeves, 1961), (2) the simplex method of Nelder and Mead (Nelder and Mead, 1965), (3) the multi-directional search algorithm of Torczon (Torczon, 1989), (4) the implicit filtering algorithm of Gilmore and Kelley (Gilmore and Kelley, 1995; Kelley, 1999) that is based on explicitly approximating the local gradient of the objective functions by means of finite differencing, (5) the simultaneous perturbation stochastic approximation algorithm due to Spall (Spall, 1992; Spall, 1998a; Spall, 1998b), (6) the evolutionary gradient search algorithm of Salomon (Salomon, 1998), (7) the evolution strategy with cumulative mutation strength adaptation mechanism by Hansen and Ostermeier (Hansen, 1998; Hansen and Ostermeier, 2001)....


Journal ArticleDOI
TL;DR: This paper attempts to give an overview of deformable registration methods, putting emphasis on the most recent advances in the domain, and provides an extensive account of registration techniques in a systematic manner.
Abstract: Deformable image registration is a fundamental task in medical image processing. Among its most important applications, one may cite: 1) multi-modality fusion, where information acquired by different imaging devices or protocols is fused to facilitate diagnosis and treatment planning; 2) longitudinal studies, where temporal structural or anatomical changes are investigated; and 3) population modeling and statistical atlases used to study normal anatomical variability. In this paper, we attempt to give an overview of deformable registration methods, putting emphasis on the most recent advances in the domain. Additional emphasis has been given to techniques applied to medical images. In order to study image registration methods in depth, their main components are identified and studied independently. The most recent techniques are presented in a systematic fashion. The contribution of this paper is to provide an extensive account of registration techniques in a systematic manner.

1,434 citations


Cites background from "Multivariate stochastic approximati..."

  • ...The second one, known as Simultaneous Perturbation (SP) [379], estimates the gradient by perturbing it not along the basis axis but instead along a random perturbation vector ∆ whose elements are independent and symmetrically Bernoulli distributed....


Posted Content
TL;DR: This work explores the use of Evolution Strategies (ES), a class of black box optimization algorithms, as an alternative to popular MDP-based RL techniques such as Q-learning and Policy Gradients, and highlights several advantages of ES as a blackbox optimization technique.
Abstract: We explore the use of Evolution Strategies (ES), a class of black box optimization algorithms, as an alternative to popular MDP-based RL techniques such as Q-learning and Policy Gradients. Experiments on MuJoCo and Atari show that ES is a viable solution strategy that scales extremely well with the number of CPUs available: By using a novel communication strategy based on common random numbers, our ES implementation only needs to communicate scalars, making it possible to scale to over a thousand parallel workers. This allows us to solve 3D humanoid walking in 10 minutes and obtain competitive results on most Atari games after one hour of training. In addition, we highlight several advantages of ES as a black box optimization technique: it is invariant to action frequency and delayed rewards, tolerant of extremely long horizons, and does not need temporal discounting or value function approximation.
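
A minimal sketch of the kind of ES update described (Gaussian parameter perturbations combined with a score-function gradient estimate); the hyperparameters, fitness normalization, and toy objective are assumptions made for this example, not the cited implementation:

```python
import numpy as np

def es_step(F, theta, rng, sigma=0.1, lr=0.02, n_samples=50):
    """One Evolution Strategies update: sample Gaussian perturbations eps_i of the
    parameters, evaluate the black-box return F(theta + sigma * eps_i), and step
    along the estimate (1 / (n * sigma)) * sum_i F_i * eps_i of the gradient."""
    eps = rng.standard_normal((n_samples, theta.size))
    returns = np.array([F(theta + sigma * e) for e in eps])
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)   # simple fitness normalization
    grad_est = eps.T @ returns / (n_samples * sigma)
    return theta + lr * grad_est                                    # gradient ascent on the expected return

# Usage: maximize F(theta) = -||theta||^2, whose optimum is the origin.
rng = np.random.default_rng(1)
theta = np.ones(10)
for _ in range(300):
    theta = es_step(lambda t: -float(np.sum(t ** 2)), theta, rng)
print(np.linalg.norm(theta))   # should end up close to 0
```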

1,218 citations


Cites background or methods from "Multivariate stochastic approximati..."

  • ...…For the special case where pψ is factored Gaussian (as in this work), the resulting gradient estimator is also known as simultaneous perturbation stochastic approximation [Spall, 1992], parameter-exploring policy gradients [Sehnke et al., 2010], or zero-order gradient estimation [Nesterov and…...


  • ...Specifically, using the score function estimator for ∇_ψ E_{θ∼p_ψ} F(θ) in a fashion similar to REINFORCE [Williams, 1992], NES algorithms take gradient steps on ψ with the following estimator: ∇_ψ E_{θ∼p_ψ} F(θ) = E_{θ∼p_ψ}[F(θ) ∇_ψ log p_ψ(θ)]. For the special case where p_ψ is factored Gaussian (as in this work), the resulting gradient estimator is also known as simultaneous perturbation stochastic approximation [Spall, 1992], parameter-exploring policy gradients [Sehnke et al....


References
Journal ArticleDOI
TL;DR: In this article, the authors give a scheme whereby, starting from an arbitrary point $x_1$, one obtains successively $x_2, x_3, \cdots$ such that $x_n$ converges to the unknown maximum point $\theta$ of the regression function in probability as $n \rightarrow \infty$.
Abstract: Let $M(x)$ be a regression function which has a maximum at the unknown point $\theta$. $M(x)$ is itself unknown to the statistician who, however, can take observations at any level $x$. This paper gives a scheme whereby, starting from an arbitrary point $x_1$, one obtains successively $x_2, x_3, \cdots$ such that $x_n$ converges to $\theta$ in probability as $n \rightarrow \infty$.
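
For concreteness, one common statement of the resulting Kiefer-Wolfowitz recursion (a standard restatement rather than the paper's exact notation), with $y(x)$ a noisy observation of $M(x)$ and decreasing gain sequences $a_n$, $c_n$:

```latex
x_{n+1} = x_n + a_n \, \frac{y(x_n + c_n) - y(x_n - c_n)}{2 c_n},
\qquad
c_n \to 0, \quad \sum_n a_n = \infty, \quad \sum_n \frac{a_n^2}{c_n^2} < \infty .
```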

2,141 citations

Journal ArticleDOI
TL;DR: In this paper, multidimensional stochastic approximation schemes are presented, and conditions are given for these schemes to converge a.s. (almost surely) to the solutions of $k$ stochastic equations in $k$ unknowns.
Abstract: Multidimensional stochastic approximation schemes are presented, and conditions are given for these schemes to converge a.s. (almost surely) to the solutions of $k$ stochastic equations in $k$ unknowns and to the point where a regression function in $k$ variables achieves its maximum.
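
In the root-finding setting this describes, the multidimensional recursion takes the familiar Robbins-Monro form (again a standard restatement, not the paper's notation), where $Y_n$ is a noisy observation of the vector field $g$ whose zero is sought:

```latex
x_{n+1} = x_n - a_n \, Y_n(x_n),
\qquad
\sum_n a_n = \infty, \quad \sum_n a_n^2 < \infty .
```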

508 citations

Journal ArticleDOI
TL;DR: In this article, modifications of the Robbins-Monro and Kiefer-Wolfowitz procedures are considered for which the magnitude of the $n$th step depends on the number of changes in sign in $(X_i - X_{i - 1})$ for $i = 2, \cdots, n$.
Abstract: Using a stochastic approximation procedure $\{X_n\}, n = 1, 2, \cdots$, for a value $\theta$, it seems likely that frequent fluctuations in the sign of $(X_n - \theta) - (X_{n - 1} - \theta) = X_n - X_{n - 1}$ indicate that $|X_n - \theta|$ is small, whereas few fluctuations in the sign of $X_n - X_{n - 1}$ indicate that $X_n$ is still far away from $\theta$. In view of this, certain approximation procedures are considered, for which the magnitude of the $n$th step (i.e., $X_{n + 1} - X_n$) depends on the number of changes in sign in $(X_i - X_{i - 1})$ for $i = 2, \cdots, n$. In theorems 2 and 3, $X_{n + 1} - X_n$ is of the form $b_nZ_n$, where $Z_n$ is a random variable whose conditional expectation, given $X_1, \cdots, X_n$, has the opposite sign of $X_n - \theta$ and $b_n$ is a positive real number. $b_n$ depends in our processes on the changes in sign of $X_i - X_{i - 1}$ $(i \leqq n)$ in such a way that more changes in sign give a smaller $b_n$. Thus the smaller the number of changes in sign before the $n$th step, the larger we make the correction on $X_n$ at the $n$th step. These procedures may accelerate the convergence of $X_n$ to $\theta$, when compared to the usual procedures ([3] and [5]). The result that the considered procedures converge with probability one may be useful for finding optimal procedures. Application to the Robbins-Monro procedure (Theorem 2) seems more interesting than application to the Kiefer-Wolfowitz procedure (Theorem 3).
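
An informal sketch of this sign-change rule applied to a scalar Robbins-Monro-type iteration (the function names, test problem, and exact gain schedule $b_n = a/(1 + \text{number of sign changes})$ are illustrative assumptions, not the paper's precise procedure):

```python
import numpy as np

def kesten_robbins_monro(noisy_g, x0, n_steps=500, a=1.0, seed=0):
    """Scalar Robbins-Monro iteration x_{n+1} = x_n - b_n * Y_n in which, following
    the sign-change idea, the gain b_n = a / (1 + number of sign changes of the
    increments) is reduced only once the iterates start oscillating around the root."""
    rng = np.random.default_rng(seed)
    x, prev_inc, sign_changes = float(x0), None, 0
    for _ in range(n_steps):
        b = a / (1 + sign_changes)
        inc = -b * noisy_g(x, rng)                  # step of the form b_n * Z_n
        if prev_inc is not None and inc * prev_inc < 0:
            sign_changes += 1                       # oscillation suggests we are near the root
        x, prev_inc = x + inc, inc
    return x

# Usage: find the root (at 3) of g(x) = x - 3 observed with additive noise.
print(kesten_robbins_monro(lambda x, rng: (x - 3.0) + rng.normal(scale=0.5), x0=10.0))
```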

403 citations