Journal ArticleDOI

Multivariate stochastic approximation using a simultaneous perturbation gradient approximation

01 Mar 1992-IEEE Transactions on Automatic Control (IEEE)-Vol. 37, Iss: 3, pp 332-341
TL;DR: The paper presents an SA algorithm based on a simultaneous perturbation gradient approximation instead of the standard finite-difference approximation of Kiefer-Wolfowitz type procedures; the algorithm can be significantly more efficient than the standard algorithms in large-dimensional problems.
Abstract: The problem of finding a root of the multivariate gradient equation that arises in function minimization is considered. When only noisy measurements of the function are available, a stochastic approximation (SA) algorithm of the general Kiefer-Wolfowitz type is appropriate for estimating the root. The paper presents an SA algorithm that is based on a simultaneous perturbation gradient approximation instead of the standard finite-difference approximation of Kiefer-Wolfowitz type procedures. Theory and numerical experience indicate that the algorithm can be significantly more efficient than the standard algorithms in large-dimensional problems.
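
For orientation, a minimal sketch of the simultaneous perturbation idea is given below. It is an illustrative implementation under assumed hyperparameters (gain sequences a_k = a/k^0.602 and c_k = c/k^0.101; the names spsa_minimize and noisy_quadratic are invented for this sketch), not the paper's exact procedure:

```python
import numpy as np

def spsa_minimize(loss, theta0, n_iter=500, a=0.1, c=0.1, alpha=0.602, gamma=0.101, seed=0):
    """Sketch of SPSA: every coordinate of theta is perturbed at once with a
    random +/-1 (Bernoulli) vector, so each gradient estimate costs only two
    noisy loss evaluations regardless of the problem dimension p."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for k in range(1, n_iter + 1):
        a_k = a / k ** alpha                                  # step-size gain sequence
        c_k = c / k ** gamma                                  # perturbation-size gain sequence
        delta = rng.choice([-1.0, 1.0], size=theta.shape)     # simultaneous Bernoulli perturbation
        g_hat = (loss(theta + c_k * delta) - loss(theta - c_k * delta)) / (2.0 * c_k * delta)
        theta -= a_k * g_hat
    return theta

# Usage: minimize a noisy 50-dimensional quadratic (minimum at the origin).
p = 50
noisy_quadratic = lambda th: float(np.sum(th ** 2)) + np.random.normal(scale=0.01)
print(np.linalg.norm(spsa_minimize(noisy_quadratic, np.ones(p))))
```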


Citations
Journal ArticleDOI
14 Sep 2017-Nature
TL;DR: The experimental optimization of Hamiltonian problems with up to six qubits and more than one hundred Pauli terms is demonstrated, determining the ground-state energy for molecules of increasing size, up to BeH2.
Abstract: The ground-state energy of small molecules is determined efficiently using six qubits of a superconducting quantum processor. Quantum simulation is currently the most promising application of quantum computers. However, only a few quantum simulations of very small systems have been performed experimentally. Here, researchers from IBM present quantum simulations of larger systems using a variational quantum eigenvalue solver (or eigensolver), a previously suggested method for quantum optimization. They perform quantum chemical calculations of LiH and BeH2 and an energy minimization procedure on a four-qubit Heisenberg model. Their application of the variational quantum eigensolver is hardware-efficient, which means that it is optimized on the given architecture. Noise is a big problem in this implementation, but quantum error correction could eventually help this experimental set-up to yield a quantum simulation of chemically interesting systems on a quantum computer. Quantum computers can be used to address electronic-structure problems and problems in materials science and condensed matter physics that can be formulated as interacting fermionic problems, problems which stretch the limits of existing high-performance computers1. Finding exact solutions to such problems numerically has a computational cost that scales exponentially with the size of the system, and Monte Carlo methods are unsuitable owing to the fermionic sign problem. These limitations of classical computational methods have made solving even few-atom electronic-structure problems interesting for implementation using medium-sized quantum computers. Yet experimental implementations have so far been restricted to molecules involving only hydrogen and helium2,3,4,5,6,7,8. Here we demonstrate the experimental optimization of Hamiltonian problems with up to six qubits and more than one hundred Pauli terms, determining the ground-state energy for molecules of increasing size, up to BeH2. We achieve this result by using a variational quantum eigenvalue solver (eigensolver) with efficiently prepared trial states that are tailored specifically to the interactions that are available in our quantum processor, combined with a compact encoding of fermionic Hamiltonians9 and a robust stochastic optimization routine10. We demonstrate the flexibility of our approach by applying it to a problem of quantum magnetism, an antiferromagnetic Heisenberg model in an external magnetic field. In all cases, we find agreement between our experiments and numerical simulations using a model of the device with noise. Our results help to elucidate the requirements for scaling the method to larger systems and for bridging the gap between key problems in high-performance computing and their implementation on quantum hardware.
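
As a rough, textbook-level restatement (not taken from the paper itself): the eigensolver prepares a hardware-efficient trial state |ψ(θ)⟩ and a classical stochastic optimizer tunes θ to minimize the measured energy, which by the variational principle bounds the ground-state energy E_0 from above. Each "Pauli term" mentioned above contributes one measured expectation value:

```latex
H = \sum_{\alpha} h_\alpha P_\alpha ,
\qquad
E(\theta) = \langle \psi(\theta) \,|\, H \,|\, \psi(\theta) \rangle
          = \sum_{\alpha} h_\alpha \,\langle \psi(\theta) \,|\, P_\alpha \,|\, \psi(\theta) \rangle
          \;\ge\; E_0 .
```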

2,348 citations


Cites background or methods from "Multivariate stochastic approximati..."

  • ...Following Feynman’s idea for quantum simulation, a quantum algorithm for the ground state problem of interacting fermions was proposed in [14] and [15]....


  • ...The convergence of θ_k to the optimal solution θ* can be proven even in the presence of stochastic fluctuations, if the starting point is in the domain of attraction of the problem [15]....


  • ...The simultaneous perturbation stochastic approximation (SPSA) algorithm, introduced in [15], is a gradient-descent method that gives a level of accuracy in the optimization of the cost function that is comparable with finite-difference gradient approximations, while saving an order O(p) of cost function evaluations....


Posted Content
TL;DR: This work considers a small-scale version of conditional computation, where sparse stochastic units form a distributed representation of gaters that can turn off, in combinatorially many ways, large chunks of the computation performed in the rest of the neural network.
Abstract: Stochastic neurons and hard non-linearities can be useful for a number of reasons in deep learning models, but in many cases they pose a challenging problem: how to estimate the gradient of a loss function with respect to the input of such stochastic or non-smooth neurons? I.e., can we "back-propagate" through these stochastic neurons? We examine this question, existing approaches, and compare four families of solutions, applicable in different settings. One of them is the minimum variance unbiased gradient estimator for stochastic binary neurons (a special case of the REINFORCE algorithm). A second approach, introduced here, decomposes the operation of a binary stochastic neuron into a stochastic binary part and a smooth differentiable part, which approximates the expected effect of the pure stochastic binary neuron to first order. A third approach involves the injection of additive or multiplicative noise in a computational graph that is otherwise differentiable. A fourth approach heuristically copies the gradient with respect to the stochastic output directly as an estimator of the gradient with respect to the sigmoid argument (we call this the straight-through estimator). To explore a context where these estimators are useful, we consider a small-scale version of conditional computation, where sparse stochastic units form a distributed representation of gaters that can turn off, in combinatorially many ways, large chunks of the computation performed in the rest of the neural network. In this case, it is important that the gating units produce an actual 0 most of the time. The resulting sparsity can potentially be exploited to greatly reduce the computational cost of large deep networks for which conditional computation would be useful.
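
As a hedged illustration of the fourth family mentioned (the straight-through estimator), the sketch below copies the gradient with respect to the sampled binary gates straight back onto the pre-sigmoid logits; the toy loss, learning rate, and function names are invented for the example and are not from the cited work:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gates(logits):
    """Forward pass: hard 0/1 gates sampled from sigmoid probabilities (non-differentiable)."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return (rng.random(p.shape) < p).astype(float), p

def straight_through(dloss_dgates):
    """Backward pass: pass the gradient w.r.t. the binary gates directly to the
    logits, as if the sampling step were the identity (biased but simple)."""
    return dloss_dgates

# Toy usage: drive the expected number of active gates toward 2 by minimizing (sum(h) - 2)^2.
logits = np.zeros(10)
for _ in range(500):
    h, _ = sample_gates(logits)
    dloss_dh = np.full_like(h, 2.0 * (h.sum() - 2.0))   # dL/dh_i for L = (sum(h) - 2)^2
    logits -= 0.05 * straight_through(dloss_dh)
print(np.round(1.0 / (1.0 + np.exp(-logits)), 2))        # learned gate probabilities, roughly 0.2 each
```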

2,178 citations


Cites background or methods from "Multivariate stochastic approximati..."


  • ...Unlike the SPSA (Spall, 1992) estimator, our estimator is unbiased even though the perturbations are not small (0 or 1), and it multiplies by the perturbation rather than dividing by it....


  • ...Gradient estimators based on stochastic perturbations have long been shown to be much more efficient than standard finite-difference approximations (Spall, 1992)....


  • ...Instead, a perturbation-based estimator such as found in Simultaneous Perturbation Stochastic Approximation (SPSA) (Spall, 1992) chooses a random perturbation vector z (e.g., isotropic Gaussian noise of variance σ²) and estimates the gradient of the expected loss with respect to u_i through (L(u+z) − L(u−z)) / (2z_i)....


Journal ArticleDOI
TL;DR: A Composite PSO, in which the heuristic parameters of PSO are controlled by a Differential Evolution algorithm during the optimization, is described, and results for many well-known and widely used test functions are given.
Abstract: This paper presents an overview of our most recent results concerning the Particle Swarm Optimization (PSO) method. Techniques for the alleviation of local minima, and for detecting multiple minimizers are described. Moreover, results on the ability of the PSO in tackling Multiobjective, Minimax, Integer Programming and ℓ1 errors-in-variables problems, as well as problems in noisy and continuously changing environments, are reported. Finally, a Composite PSO, in which the heuristic parameters of PSO are controlled by a Differential Evolution algorithm during the optimization, is described, and results for many well-known and widely used test functions are given.
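
For context, the sketch below shows the canonical inertia-weight PSO update that variants such as the Composite PSO build on; the swarm size, search bounds, and coefficient values are illustrative assumptions, not the settings studied in the cited overview:

```python
import numpy as np

def pso_minimize(f, dim, n_particles=30, n_iter=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic inertia-weight PSO: each particle moves under inertia plus random
    attraction toward its own best position (pbest) and the swarm best (gbest)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5.0, 5.0, size=(n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_val = np.apply_along_axis(f, 1, x)
    gbest = pbest[np.argmin(pbest_val)].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # velocity update
        x = x + v                                                    # position update
        vals = np.apply_along_axis(f, 1, x)
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[np.argmin(pbest_val)].copy()
    return gbest

# Usage: minimize the 5-dimensional sphere function (minimum at the origin).
print(pso_minimize(lambda z: float(np.sum(z ** 2)), dim=5))
```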

1,436 citations


Cites methods from "Multivariate stochastic approximati..."

  • ...Recently, Arnold in his Ph.D. thesis (Arnold, 2001) extensively tested numerous optimization methods under noise, including: (1) the direct pattern search algorithm of Hooke and Jeeves (Hooke and Jeeves, 1961), (2) the simplex method of Nelder and Mead (Nelder and Mead, 1965), (3) the multi-directional search algorithm of Torczon (Torczon, 1989), (4) the implicit filtering algorithm of Gilmore and Kelley (Gilmore and Kelley, 1995; Kelley, 1999) that is based on explicitly approximating the local gradient of the objective functions by means of finite differencing, (5) the simultaneous perturbation stochastic approximation algorithm due to Spall (Spall, 1992; Spall, 1998a; Spall, 1998b), (6) the evolutionary gradient search algorithm of Salomon (Salomon, 1998), (7) the evolution strategy with cumulative mutation strength adaptation mechanism by Hansen and Ostermeier (Hansen, 1998; Hansen and Ostermeier, 2001)....


Journal ArticleDOI
TL;DR: This paper attempts to give an overview of deformable registration methods, putting emphasis on the most recent advances in the domain, and provides an extensive account of registration techniques in a systematic manner.
Abstract: Deformable image registration is a fundamental task in medical image processing. Among its most important applications, one may cite: 1) multi-modality fusion, where information acquired by different imaging devices or protocols is fused to facilitate diagnosis and treatment planning; 2) longitudinal studies, where temporal structural or anatomical changes are investigated; and 3) population modeling and statistical atlases used to study normal anatomical variability. In this paper, we attempt to give an overview of deformable registration methods, putting emphasis on the most recent advances in the domain. Additional emphasis has been given to techniques applied to medical images. In order to study image registration methods in depth, their main components are identified and studied independently. The most recent techniques are presented in a systematic fashion. The contribution of this paper is to provide an extensive account of registration techniques in a systematic manner.

1,434 citations


Cites background from "Multivariate stochastic approximati..."

  • ...The second one, known as Simultaneous Perturbation (SP) [379], estimates the gradient by perturbing it not along the basis axis but instead along a random perturbation vector ∆ whose elements are independent and symmetrically Bernoulli distributed....


Posted Content
TL;DR: This work explores the use of Evolution Strategies (ES), a class of black box optimization algorithms, as an alternative to popular MDP-based RL techniques such as Q-learning and Policy Gradients, and highlights several advantages of ES as a blackbox optimization technique.
Abstract: We explore the use of Evolution Strategies (ES), a class of black box optimization algorithms, as an alternative to popular MDP-based RL techniques such as Q-learning and Policy Gradients. Experiments on MuJoCo and Atari show that ES is a viable solution strategy that scales extremely well with the number of CPUs available: By using a novel communication strategy based on common random numbers, our ES implementation only needs to communicate scalars, making it possible to scale to over a thousand parallel workers. This allows us to solve 3D humanoid walking in 10 minutes and obtain competitive results on most Atari games after one hour of training. In addition, we highlight several advantages of ES as a black box optimization technique: it is invariant to action frequency and delayed rewards, tolerant of extremely long horizons, and does not need temporal discounting or value function approximation.
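
A minimal sketch of the kind of ES update described (Gaussian parameter perturbations combined with a score-function gradient estimate); the hyperparameters, fitness normalization, and toy objective are assumptions made for this example, not the cited implementation:

```python
import numpy as np

def es_step(F, theta, rng, sigma=0.1, lr=0.02, n_samples=50):
    """One Evolution Strategies update: sample Gaussian perturbations eps_i of the
    parameters, evaluate the black-box return F(theta + sigma * eps_i), and step
    along the estimate (1 / (n * sigma)) * sum_i F_i * eps_i of the gradient."""
    eps = rng.standard_normal((n_samples, theta.size))
    returns = np.array([F(theta + sigma * e) for e in eps])
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)   # simple fitness normalization
    grad_est = eps.T @ returns / (n_samples * sigma)
    return theta + lr * grad_est                                    # gradient ascent on the expected return

# Usage: maximize F(theta) = -||theta||^2, whose optimum is the origin.
rng = np.random.default_rng(1)
theta = np.ones(10)
for _ in range(300):
    theta = es_step(lambda t: -float(np.sum(t ** 2)), theta, rng)
print(np.linalg.norm(theta))   # should end up close to 0
```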

1,218 citations


Cites background or methods from "Multivariate stochastic approximati..."

  • ...…For the special case where pψ is factored Gaussian (as in this work), the resulting gradient estimator is also known as simultaneous perturbation stochastic approximation [Spall, 1992], parameter-exploring policy gradients [Sehnke et al., 2010], or zero-order gradient estimation [Nesterov and…...


  • ...Specifically, using the score function estimator for ∇_ψ E_{θ∼p_ψ} F(θ) in a fashion similar to REINFORCE [Williams, 1992], NES algorithms take gradient steps on ψ with the following estimator: ∇_ψ E_{θ∼p_ψ} F(θ) = E_{θ∼p_ψ}[F(θ) ∇_ψ log p_ψ(θ)]. For the special case where p_ψ is factored Gaussian (as in this work), the resulting gradient estimator is also known as simultaneous perturbation stochastic approximation [Spall, 1992], parameter-exploring policy gradients [Sehnke et al....


References
Journal ArticleDOI
TL;DR: In this article, the authors give a scheme whereby, starting from an arbitrary point $x_1$, one obtains successively $x_2, x_3, \cdots$ such that $x_n$ converges to the unknown maximum point $\theta$ of the regression function in probability as $n \rightarrow \infty$.
Abstract: Let $M(x)$ be a regression function which has a maximum at the unknown point $\theta$. $M(x)$ is itself unknown to the statistician who, however, can take observations at any level $x$. This paper gives a scheme whereby, starting from an arbitrary point $x_1$, one obtains successively $x_2, x_3, \cdots$ such that $x_n$ converges to $\theta$ in probability as $n \rightarrow \infty$.
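
For concreteness, one common statement of the resulting Kiefer-Wolfowitz recursion (a standard restatement rather than the paper's exact notation), with $y(x)$ a noisy observation of $M(x)$ and decreasing gain sequences $a_n$, $c_n$:

```latex
x_{n+1} = x_n + a_n \, \frac{y(x_n + c_n) - y(x_n - c_n)}{2 c_n},
\qquad
c_n \to 0, \quad \sum_n a_n = \infty, \quad \sum_n \frac{a_n^2}{c_n^2} < \infty .
```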

2,141 citations

Journal ArticleDOI
TL;DR: In this paper, multidimensional stochastic approximation schemes are presented, and conditions are given for these schemes to converge a.s. (almost surely) to the solutions of $k$ stochastic equations in $k$ unknowns.
Abstract: Multidimensional stochastic approximation schemes are presented, and conditions are given for these schemes to converge a.s. (almost surely) to the solutions of $k$ stochastic equations in $k$ unknowns and to the point where a regression function in $k$ variables achieves its maximum.
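
In the root-finding setting this describes, the multidimensional recursion takes the familiar Robbins-Monro form (again a standard restatement, not the paper's notation), where $Y_n$ is a noisy observation of the vector field $g$ whose zero is sought:

```latex
x_{n+1} = x_n - a_n \, Y_n(x_n),
\qquad
\sum_n a_n = \infty, \quad \sum_n a_n^2 < \infty .
```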

508 citations

Journal ArticleDOI
TL;DR: In this article, modifications of the Robbins-Monro and Kiefer-Wolfowitz procedures are considered for which the magnitude of the $n$th step depends on the number of changes in sign in $(X_i - X_{i - 1})$ for $i = 2, \cdots, n$.
Abstract: Using a stochastic approximation procedure $\{X_n\}, n = 1, 2, \cdots$, for a value $\theta$, it seems likely that frequent fluctuations in the sign of $(X_n - \theta) - (X_{n - 1} - \theta) = X_n - X_{n - 1}$ indicate that $|X_n - \theta|$ is small, whereas few fluctuations in the sign of $X_n - X_{n - 1}$ indicate that $X_n$ is still far away from $\theta$. In view of this, certain approximation procedures are considered, for which the magnitude of the $n$th step (i.e., $X_{n + 1} - X_n$) depends on the number of changes in sign in $(X_i - X_{i - 1})$ for $i = 2, \cdots, n$. In theorems 2 and 3, $X_{n + 1} - X_n$ is of the form $b_nZ_n$, where $Z_n$ is a random variable whose conditional expectation, given $X_1, \cdots, X_n$, has the opposite sign of $X_n - \theta$ and $b_n$ is a positive real number. $b_n$ depends in our processes on the changes in sign of $X_i - X_{i - 1}$ $(i \leqq n)$ in such a way that more changes in sign give a smaller $b_n$. Thus the smaller the number of changes in sign before the $n$th step, the larger we make the correction on $X_n$ at the $n$th step. These procedures may accelerate the convergence of $X_n$ to $\theta$, when compared to the usual procedures ([3] and [5]). The result that the considered procedures converge with probability one may be useful for finding optimal procedures. Application to the Robbins-Monro procedure (Theorem 2) seems more interesting than application to the Kiefer-Wolfowitz procedure (Theorem 3).
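
An informal sketch of this sign-change rule applied to a scalar Robbins-Monro-type iteration (the function names, test problem, and exact gain schedule $b_n = a/(1 + \text{number of sign changes})$ are illustrative assumptions, not the paper's precise procedure):

```python
import numpy as np

def kesten_robbins_monro(noisy_g, x0, n_steps=500, a=1.0, seed=0):
    """Scalar Robbins-Monro iteration x_{n+1} = x_n - b_n * Y_n in which, following
    the sign-change idea, the gain b_n = a / (1 + number of sign changes of the
    increments) is reduced only once the iterates start oscillating around the root."""
    rng = np.random.default_rng(seed)
    x, prev_inc, sign_changes = float(x0), None, 0
    for _ in range(n_steps):
        b = a / (1 + sign_changes)
        inc = -b * noisy_g(x, rng)                  # step of the form b_n * Z_n
        if prev_inc is not None and inc * prev_inc < 0:
            sign_changes += 1                       # oscillation suggests we are near the root
        x, prev_inc = x + inc, inc
    return x

# Usage: find the root (at 3) of g(x) = x - 3 observed with additive noise.
print(kesten_robbins_monro(lambda x, rng: (x - 3.0) + rng.normal(scale=0.5), x0=10.0))
```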

403 citations