Journal ArticleDOI

Asymptotic global behavior for stochastic approximation and diffusions with slowly decreasing noise effects: Global minimization via Monte Carlo

01 Mar 1987-Siam Journal on Applied Mathematics (Society for Industrial and Applied Mathematics)-Vol. 47, Iss: 1, pp 169-185
TL;DR: In this article, the authors study the asymptotic behavior of annealing-type algorithms in which the objective function values can only be sampled via Monte Carlo, so that the discrete algorithm is a combination of stochastic approximation and simulated annealing.
Abstract: The asymptotic behavior of the systems $X_{n+1} = X_n + a_n b(X_n, \xi_n) + a_n \sigma(X_n)\psi_n$ and $dy = \bar b(y)\,dt + \sqrt{a(t)}\,\sigma(y)\,dw$ is studied, where $\{\psi_n\}$ is i.i.d. Gaussian, $\{\xi_n\}$ is a (correlated) bounded sequence of random variables and $a_n \approx A_0/\log(A_1 + n)$. Without $\{\xi_n\}$, such algorithms are versions of the “simulated annealing” method for global optimization. When the objective function values can only be sampled via Monte Carlo, the discrete algorithm is a combination of stochastic approximation and simulated annealing. Our forms are appropriate. The $\{\psi_n\}$ are the “annealing” variables, and $\{\xi_n\}$ is the sampling noise. For large $A_0$, a full asymptotic analysis is presented, via the theory of large deviations: Mean escape time (after arbitrary time n) from neighborhoods of stable sets of the algorithm, mean transition times (after arbitrary time n) from a neighborhood of one stable set to another, a...
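As a concrete illustration of the discrete recursion above, here is a minimal numerical sketch, assuming the drift $b(X_n,\xi_n)$ is a noisy Monte Carlo estimate of $-\nabla U(X_n)$; the constants, the double-well test function, and all names are illustrative choices, not taken from the paper.

```python
import numpy as np

def annealed_stochastic_approximation(drift_estimate, x0, n_steps=20000,
                                      A0=0.05, A1=10.0, sigma=1.0, rng=None):
    """Sketch of X_{n+1} = X_n + a_n b(X_n, xi_n) + a_n sigma(X_n) psi_n
    with gain a_n = A0 / log(A1 + n); psi_n is i.i.d. Gaussian "annealing" noise,
    and the Monte Carlo sampling noise xi_n is hidden inside drift_estimate."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for n in range(n_steps):
        a_n = A0 / np.log(A1 + n)             # slowly decreasing gain
        drift = drift_estimate(x, rng)        # b(X_n, xi_n): noisy drift estimate
        psi = rng.standard_normal(x.shape)    # psi_n: Gaussian annealing noise
        x = x + a_n * drift + a_n * sigma * psi
    return x

# Toy example: double-well U(x) = (x^2 - 1)^2 with a noisy gradient estimate;
# for large n the iterate hovers near the global minima at +/-1.
if __name__ == "__main__":
    noisy_neg_grad = lambda x, r: -4.0 * x * (x**2 - 1.0) + 0.1 * r.standard_normal(x.shape)
    print(annealed_stochastic_approximation(noisy_neg_grad, x0=np.array([1.5])))
```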
Citations
Journal ArticleDOI
TL;DR: Concepts and analytical results from the literatures of mathematical statistics, econometrics, systems identification, and optimization theory relevant to the analysis of learning in artificial neural networks are reviewed.
Abstract: The premise of this article is that learning procedures used to train artificial neural networks are inherently statistical techniques. It follows that statistical theory can provide considerable insight into the properties, advantages, and disadvantages of different network learning methods. We review concepts and analytical results from the literatures of mathematical statistics, econometrics, systems identification, and optimization theory relevant to the analysis of learning in artificial neural networks. Because of the considerable variety of available learning procedures and necessary limitations of space, we cannot provide a comprehensive treatment. Our focus is primarily on learning procedures for feedforward networks. However, many of the concepts and issues arising in this framework are also quite broadly relevant to other network learning paradigms. In addition to providing useful insights, the material reviewed here suggests some potentially useful new training methods for artificial neural ne...

969 citations


Cites background or methods from "Asymptotic global behavior for stoc..."

  • ...The results of White (1989b) and of Kuan and White (1989) can be used to construct tests of the irrelevant input hypothesis and the irrelevant hidden unit hypothesis in ways directly analogous to those discussed earlier....

    [...]

  • ...Kushner (1987) has studied a modification of equation 4.7 guaranteed to converge w.p.1 to a global solution to equation 4.1 as n → ∞....

    [...]

Journal ArticleDOI
TL;DR: Artificial neural network models are reviewed and analyzed from an econometric perspective.
Abstract: (1994). Artificial neural networks: an econometric perspective. Econometric Reviews: Vol. 13, No. 1, pp. 1-91.

484 citations

Journal ArticleDOI
Guozhong An
TL;DR: It is shown that input noise and weight noise encourage the neural-network output to be a smooth function of the input or of its weights, respectively, and that in the weak-noise limit, noise added to the output of the network only changes the objective function by a constant and therefore cannot improve generalization.
Abstract: We study the effects of adding noise to the inputs, outputs, weight connections, and weight changes of multilayer feedforward neural networks during backpropagation training. We rigorously derive and analyze the objective functions that are minimized by the noise-affected training processes. We show that input noise and weight noise encourage the neural-network output to be a smooth function of the input or its weights, respectively. In the weak-noise limit, noise added to the output of the neural networks only changes the objective function by a constant. Hence, it cannot improve generalization. Input noise introduces penalty terms in the objective function that are related to, but distinct from, those found in the regularization approaches. Simulations have been performed on a regression and a classification problem to further substantiate our analysis. Input noise is found to be effective in improving the generalization performance for both problems. However, weight noise is found to be effective in improving the generalization performance only for the classification problem. Other forms of noise have practically no effect on generalization.

465 citations
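As a concrete illustration of one of the noise-injection schemes analyzed above, the sketch below trains a tiny single-hidden-layer network by backpropagation with Gaussian noise added to the inputs; the network size, learning rate, noise level, and regression target are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Regression data: all problem details here are illustrative.
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X)

# Single hidden layer with tanh units.
W1 = rng.standard_normal((1, 16)) * 0.5
b1 = np.zeros(16)
W2 = rng.standard_normal((16, 1)) * 0.5
b2 = np.zeros(1)
lr, noise_std = 0.05, 0.1

for epoch in range(2000):
    X_noisy = X + noise_std * rng.standard_normal(X.shape)  # perturb inputs only
    h = np.tanh(X_noisy @ W1 + b1)        # forward pass
    pred = h @ W2 + b2
    err = pred - y                         # squared-error residual
    # backpropagation
    dW2 = h.T @ err / len(X)
    db2 = err.mean(axis=0)
    dh = err @ W2.T * (1 - h**2)
    dW1 = X_noisy.T @ dh / len(X)
    db1 = dh.mean(axis=0)
    # gradient-descent update
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

# Evaluate on clean inputs: input noise acts like a smoothness penalty on the fit.
mse = np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - y) ** 2)
print(f"clean-input MSE after noisy-input training: {mse:.4f}")
```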

Journal ArticleDOI
TL;DR: In this article, the authors investigate the properties of a recursive estimation procedure (the method of "back-propagation") for a class of nonlinear regression models (single hidden-layer feedforward network models) recently developed by cognitive scientists.
Abstract: We investigate the properties of a recursive estimation procedure (the method of “back-propagation”) for a class of nonlinear regression models (single hidden-layer feedforward network models) recently developed by cognitive scientists. The results follow from more general results for a class of recursive m-estimators, obtained using theorems of Ljung (1977) and Walk (1977) for the method of stochastic approximation. Conditions are given ensuring that the back-propagation estimator converges almost surely to a parameter value that locally minimizes expected squared error loss (provided the estimator does not diverge) and that the back-propagation estimator is asymptotically normal when centered at this minimizer. This estimator is shown to be statistically inefficient, and a two-step procedure that has efficiency equivalent to that of nonlinear least squares is proposed. Practical issues are illustrated by a numerical example involving approximation of the Hénon map.

448 citations


Cites background from "Asymptotic global behavior for stoc..."

  • ...Results of Ljung (1977) and Kushner and Clark (1978) provide a basis for proving the consistency and asymptotic normality of back-propagation and related methods with dependent heterogeneous...

    [...]

  • ...Kushner (1987) established convergence to a global optimum with $\psi_n$ Gaussian and a gain sequence proportional to $1/\log(n + 1)$....

    [...]

References
Journal ArticleDOI
13 May 1983-Science
TL;DR: There is a deep and useful connection between statistical mechanics and multivariate or combinatorial optimization (finding the minimum of a given function depending on many parameters), and a detailed analogy with annealing in solids provides a framework for optimization of very large and complex systems.
Abstract: There is a deep and useful connection between statistical mechanics (the behavior of systems with many degrees of freedom in thermal equilibrium at a finite temperature) and multivariate or combinatorial optimization (finding the minimum of a given function depending on many parameters). A detailed analogy with annealing in solids provides a framework for optimization of the properties of very large and complex systems. This connection to statistical mechanics exposes new information and provides an unfamiliar perspective on traditional optimization problems and methods.

41,772 citations
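To make the annealing idea concrete, here is a minimal Metropolis-style simulated-annealing sketch applied to a toy spin-glass-like combinatorial problem; the logarithmic cooling schedule and all parameters are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def simulated_annealing(energy, propose, state, n_steps=50000, T0=2.0, rng=None):
    """Metropolis acceptance with slowly decreasing temperature: uphill moves
    of size dE are accepted with probability exp(-dE / T)."""
    rng = np.random.default_rng() if rng is None else rng
    E = energy(state)
    best, best_E = state.copy(), E
    for n in range(n_steps):
        T = T0 / np.log(2.0 + n)              # slow (logarithmic) cooling
        cand = propose(state, rng)
        dE = energy(cand) - E
        if dE <= 0 or rng.random() < np.exp(-dE / T):
            state, E = cand, E + dE
        if E < best_E:
            best, best_E = state.copy(), E
    return best, best_E

# Toy combinatorial problem: choose a +/-1 assignment minimizing a random
# quadratic (spin-glass-like) energy s^T J s.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    J = rng.standard_normal((30, 30))
    J = (J + J.T) / 2.0
    energy = lambda s: float(s @ J @ s)

    def flip_one_spin(s, r):
        c = s.copy()
        i = r.integers(len(c))
        c[i] = -c[i]                          # flip a single spin
        return c

    s0 = rng.choice([-1.0, 1.0], size=30)
    best, best_E = simulated_annealing(energy, flip_one_spin, s0, rng=rng)
    print("best energy found:", best_E)
```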

Journal ArticleDOI
TL;DR: The analogy between images and statistical mechanics systems is made and the analogous operation under the posterior distribution yields the maximum a posteriori (MAP) estimate of the image given the degraded observations, creating a highly parallel ``relaxation'' algorithm for MAP estimation.
Abstract: We make an analogy between images and statistical mechanics systems. Pixel gray levels and the presence and orientation of edges are viewed as states of atoms or molecules in a lattice-like physical system. The assignment of an energy function in the physical system determines its Gibbs distribution. Because of the equivalence between the Gibbs distribution and Markov random fields (MRFs), this assignment also determines an MRF image model. The energy function is a more convenient and natural mechanism for embodying picture attributes than are the local characteristics of the MRF. For a range of degradation mechanisms, including blurring, nonlinear deformations, and multiplicative or additive noise, the posterior distribution is an MRF with a structure akin to the image model. By the analogy, the posterior distribution defines another (imaginary) physical system. Gradual temperature reduction in the physical system isolates low energy states (``annealing''), or what is the same thing, the most probable states under the Gibbs distribution. The analogous operation under the posterior distribution yields the maximum a posteriori (MAP) estimate of the image given the degraded observations. The result is a highly parallel ``relaxation'' algorithm for MAP estimation. We establish convergence properties of the algorithm and we experiment with some simple pictures, for which good restorations are obtained at low signal-to-noise ratios.

18,761 citations
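Below is a small sketch of the annealed Gibbs-sampling idea described above, applied to MAP restoration of a binary (+/-1) image under an Ising-type prior; the energy, parameters, and temperature schedule are illustrative simplifications, not the paper's exact formulation.

```python
import numpy as np

def annealed_gibbs_denoise(noisy, n_sweeps=30, beta=1.5, lam=1.0, T0=4.0, rng=None):
    """MAP restoration by Gibbs sampling with gradual temperature reduction.
    Posterior energy (up to constants), with y the noisy data:
        U(x) = -beta * sum_<i,j> x_i x_j - lam * sum_i x_i y_i."""
    rng = np.random.default_rng() if rng is None else rng
    x = noisy.copy()
    H, W = x.shape
    for sweep in range(n_sweeps):
        T = T0 / np.log(2.0 + sweep)          # slowly decreasing temperature
        for i in range(H):
            for j in range(W):
                nb = 0.0                      # sum over the 4-neighbourhood
                if i > 0:     nb += x[i - 1, j]
                if i < H - 1: nb += x[i + 1, j]
                if j > 0:     nb += x[i, j - 1]
                if j < W - 1: nb += x[i, j + 1]
                # Gibbs conditional: P(x_ij = +1 | rest) = sigmoid(2(beta*nb + lam*y_ij)/T)
                dE = 2.0 * (beta * nb + lam * noisy[i, j])
                p_plus = 1.0 / (1.0 + np.exp(-dE / T))
                x[i, j] = 1.0 if rng.random() < p_plus else -1.0
    return x

# Usage sketch: a block image corrupted by flipping 20% of its pixels.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = -np.ones((32, 32)); clean[8:24, 8:24] = 1.0
    flips = rng.random(clean.shape) < 0.2
    noisy = np.where(flips, -clean, clean)
    restored = annealed_gibbs_denoise(noisy, rng=rng)
    print("pixel agreement with clean image:", np.mean(restored == clean))
```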

Book
01 Jan 1984
TL;DR: The large deviation problem for empirical distributions of Markov processes is studied in this book and applied to the Wiener sausage problem.
Abstract: Contents: Large Deviations; Cramer's Theorem; Multidimensional Version of Cramer's Theorem; An Infinite Dimensional Example: Brownian Motion; The Ventcel-Freidlin Theory; The Exit Problem; Empirical Distributions; The Large Deviation Problem for Empirical Distributions of Markov Processes; Some Properties of Entropy; Upper Bounds; Lower Bounds; Contraction Principle; Application to the Problem of the Wiener Sausage; The Polaron Problem.

921 citations

Journal ArticleDOI
TL;DR: In this article, the authors seek a global minimum of $U:[0,1]^n \to \mathbb{R}$ by adding slowly decreasing Brownian noise to gradient descent, so that the solution concentrates on the global minima of $U$ as the temperature $T(t)$ decreases to zero.
Abstract: We seek a global minimum of $U:[0,1]^n \to \mathbb{R}$. The solution to $(d/dt)x_t = -\nabla U(x_t)$ will find local minima. The solution to $dx_t = -\nabla U(x_t)\,dt + \sqrt{2T}\,dw_t$, where $w$ is standard (n-dimensional) Brownian motion and the boundaries are reflecting, will concentrate near the global minima of $U$, at least when the “temperature” $T$ is small: the equilibrium distribution for $x_t$ is Gibbs with density $\pi_T(x) \propto \exp\{-U(x)/T\}$. This suggests setting $T = T(t) \downarrow 0$ to find the global minima of $U$. We give conditions on $U(x)$ and $T(t)$ such that the solution to $dx_t = -\nabla U(x_t)\,dt + \sqrt{2T}\,dw_t$ converges weakly to a distribution concentrated on the global minima of $U$.

305 citations
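The annealed diffusion above can be illustrated with a crude Euler-Maruyama discretization; the potential, step size, temperature schedule, and the clipping used in place of true reflecting boundaries are all illustrative choices, not the paper's conditions.

```python
import numpy as np

def annealed_langevin(grad_U, x0, T_of_t, dt=1e-3, n_steps=100000, rng=None):
    """Euler-Maruyama sketch of  dx_t = -grad U(x_t) dt + sqrt(2 T(t)) dw_t
    on [0, 1]^n; clipping is only a crude stand-in for reflecting boundaries."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for n in range(n_steps):
        T = T_of_t(n * dt)                    # decreasing "temperature"
        x = x - grad_U(x) * dt + np.sqrt(2.0 * T * dt) * rng.standard_normal(x.shape)
        x = np.clip(x, 0.0, 1.0)
    return x

# Toy example: a double-well potential on [0, 1] with global minima near 0.25 and 0.75.
if __name__ == "__main__":
    grad_U = lambda x: 200.0 * (x - 0.25) * (x - 0.75) * (2.0 * x - 1.0)
    T_of_t = lambda t: 0.2 / np.log(np.e + t)  # slow logarithmic cooling
    print(annealed_langevin(grad_U, np.array([0.95]), T_of_t))
```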

Journal ArticleDOI
TL;DR: In this article, the averaging principle for stochastic differential equations is used to describe the behavior of the system over large time intervals, and the probability of large deviations from the averaged system is analyzed.
Abstract: Contents: Introduction; § 1. Null approximation and normal deviations; § 2. Large deviations from the averaged system; § 3. Large deviations. Continuation; § 4. Moderate deviations; § 5. The behaviour of the system over large time intervals; § 6. Examples. Remarks; § 7. The averaging principle for stochastic differential equations; § 8. Inequalities for the probabilities of large deviations; References.

94 citations