
WEIGHTED MEANS IN STOCHASTIC APPROXIMATION OF MINIMA

J. DIPPON AND J. RENZ

SIAM J. CONTROL OPTIM., Vol. 35, No. 5, pp. 1811–1827, September 1997
© 1997 Society for Industrial and Applied Mathematics
Abstract. Weighted averages of Kiefer–Wolfowitz-type procedures, which are driven by larger
step lengths than usual, can achieve the optimal rate of convergence. A priori knowledge of a lower
bound on the smallest eigenvalue of the Hessian matrix is avoided. The asymptotic mean squared
error of the weighted averaging algorithm is the same as would emerge using a Newton-type adaptive
algorithm. Several different gradient estimates are considered; one of them leads to a vanishing
asymptotic bias. This gradient estimate applied with the weighted averaging algorithm usually
yields a better asymptotic mean squared error than applied with the standard algorithm.
Key words. stochastic approximation, acceleration by weighted averaging, weak invariance
principle, consistency, Kiefer–Wolfowitz procedure, gradient estimation, optimization
AMS subject classifications. Primary, 62L20; Secondary, 60F05, 60F17, 93E23
PII. S0363012995283789
1. Introduction. In stochastic approximation the minimizer ϑ of an unknown regression function f : R^d → R can be estimated by running the recursion

\[
X_{n+1} = X_n - a_n Y_n, \tag{1.1}
\]

where Y_n is a gradient estimate of f at the point X_n and a_n are positive step lengths decreasing to zero. For instance, for d = 1 and decreasing span c_n, Kiefer and Wolfowitz [11] used divided differences Y_n = (Y_{n,1} − Y_{n,2})/(2c_n) as an approximation of f′(X_n), where Y_{n,1} and Y_{n,2} are error-contaminated observations of f(X_n + c_n) and f(X_n − c_n), respectively. If f is p-times differentiable at ϑ, and if the gradient estimates Y_n are constructed appropriately, one can obtain

\[
n^{\frac{\alpha}{2}\left(1-\frac{1}{p}\right)} (X_n - \vartheta) \xrightarrow{\mathcal{D}} N(\mu, K) \qquad (n \to \infty)
\]

with step lengths a_n = a n^{−α} for some a > 0 and α ∈ (0, 1] (see Fabian [8] for p ≥ 3 odd). Hence, for step lengths a_n = a/n, the convergence rate n^{−(1−1/p)/2} is obtained. This is the exact minimax order in the problem of estimating the minimizer of f for f belonging to a certain class of p-times differentiable functions (Polyak and Tsybakov [18]).
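As a concrete illustration of recursion (1.1) with the Kiefer–Wolfowitz divided-difference estimate, consider the following minimal simulation sketch; the quadratic test function, the noise level, and the tuning constants a, c, α, γ are illustrative choices and are not taken from the paper.

```python
import random

def kiefer_wolfowitz(f, x0, n_steps, a=1.0, c=1.0, alpha=1.0, gamma=0.25,
                     noise=0.1, rng=random):
    """Run X_{n+1} = X_n - a_n Y_n with the divided-difference estimate
    Y_n = (Y_{n,1} - Y_{n,2}) / (2 c_n), where Y_{n,1}, Y_{n,2} are noisy
    observations of f(X_n + c_n) and f(X_n - c_n) (the d = 1 case)."""
    x = x0
    for n in range(1, n_steps + 1):
        a_n = a / n**alpha           # step lengths decreasing to zero
        c_n = c / n**gamma           # decreasing span of the divided difference
        y1 = f(x + c_n) + noise * rng.gauss(0.0, 1.0)
        y2 = f(x - c_n) + noise * rng.gauss(0.0, 1.0)
        x = x - a_n * (y1 - y2) / (2.0 * c_n)
    return x

random.seed(0)
# illustrative regression function with minimizer theta = 2
x_final = kiefer_wolfowitz(lambda x: (x - 2.0)**2, x0=0.0, n_steps=5000)
```

For a quadratic f the divided difference reproduces the derivative exactly, so the iterates approach ϑ = 2 up to the accumulated observation noise.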
In this paper we investigate weighted means

\[
\widetilde{X}_{n,\delta} = \frac{1+\delta}{n^{1+\delta}} \sum_{i=1}^{n} i^{\delta} X_i \tag{1.2}
\]

of Kiefer–Wolfowitz-type processes (X_n) generated by recursion (1.1) with some gradient estimates Y_n for p-times differentiable regression functions and step lengths converging to zero more slowly than 1/n. We obtain

\[
n^{\frac{1}{2}\left(1-\frac{1}{p}\right)} \left( \widetilde{X}_{n,\delta} - \vartheta \right) \xrightarrow{\mathcal{D}} N(\widetilde{\mu}, \widetilde{K}) \qquad (n \to \infty)
\]
Received by the editors March 27, 1995; accepted for publication (in revised form) July 23, 1996. The research of the second author was supported by a Deutsche Forschungsgemeinschaft grant.
http://www.siam.org/journals/sicon/35-5/28378.html
Mathematisches Institut A, Universität Stuttgart, 70511 Stuttgart, Germany (dippon@mathematik.uni-stuttgart.de).
Landesgirokasse, 70144 Stuttgart, Germany (RiskManagement@t-online.de).
for some weight parameters δ and various types of gradient estimates (Theorems 3.2 and 4.2). The main advantages are the following. First, a priori knowledge of a lower bound on the smallest eigenvalue λ_0 of the Hessian Hf(ϑ) of f at ϑ is avoided. If, in the standard algorithm with a_n = a/n, the constant a is chosen too small, i.e., a ≤ (1 − 1/p)/(2λ_0), convergence can be very slow. To be safe one might choose a rather large. But the asymptotic mean squared error (AMSE) produced by the standard algorithm grows approximately linearly in a. These problems do not arise when the averaging algorithm is applied. On the other hand, if an asymptotic bias is present, the AMSE of the averaging algorithm cannot be greater than four times the AMSE of the standard algorithm with the optimal, but usually unknown, constant a. In this sense the averaging algorithm can be considered more stable than the standard one. Furthermore, the averaging algorithm has the same limit distribution as the Newton-type adaptive procedure suggested by Fabian [9] (section 5).
The method proposed in this paper is inspired by an idea of Ruppert [21] and Polyak [16], who suggested considering the arithmetic mean of the trajectories of a Robbins–Monro process that is likewise driven by step lengths decreasing more slowly than 1/n. In this case one obtains the best possible convergence rate and, in a certain sense, the optimal covariance of the asymptotic distribution [17]. Since then Yin [27], Pechtl [15], Kushner and Yang [13], Györfi and Walk [10], Nazin and Shcherbakov [14], and others have studied this idea.
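The weighted mean (1.2) itself is straightforward to compute; a minimal sketch (the sample values below are arbitrary):

```python
def weighted_mean(xs, delta):
    """Weighted mean (1.2): ((1+delta)/n^(1+delta)) * sum_{i=1..n} i^delta * X_i."""
    n = len(xs)
    s = sum(i**delta * x for i, x in enumerate(xs, start=1))
    return (1.0 + delta) * s / n**(1.0 + delta)

xs = [1.0, 2.0, 3.0, 4.0]
m0 = weighted_mean(xs, delta=0.0)      # delta = 0: the arithmetic mean
m_neg = weighted_mean(xs, delta=-0.5)  # delta < 0: relatively more weight on early iterates
```

For δ = 0 this reduces to the arithmetic mean considered by Ruppert and Polyak; negative δ shifts weight toward the early iterates.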
A further contribution of this paper is a new design for estimating the gradient which leads to a vanishing asymptotic bias µ̃ (for d = 1 see Renz [19]), regardless of which method (with or without averaging, or with adaptation) is used. Applying the weighted averaging algorithm together with this gradient estimate leads to a second moment of the asymptotic distribution which is minimal within a large class of procedures (relation (5.4)).
Spall [22] introduced another gradient estimate Y_n, based on the so-called simultaneous perturbation method. It uses only two observations at each step instead of 2d observations, as in the standard Kiefer–Wolfowitz method in R^d. This makes it suitable for certain optimization problems in high-dimensional spaces R^d. Taking weighted averages of the process generated with Spall's gradient estimate stabilizes the performance, as discussed below (Theorem 4.2 and section 5).
All these central limit theorems require consistency of the stochastic approximation method (Propositions 3.1 and 4.1). To prove the central limit theorems we apply a weak invariance principle stated in Lemma 7.1. Taking weighted averages of the trajectories leads to an accumulation of terms due to the nonlinearity of the regression function. To cope with this effect, the assumptions of this lemma are partly stronger than those of a functional central limit theorem for the nonweighted case (see Walk [24]). But fortunately, the additional conditions can be shown to be fulfilled for many stochastic approximation procedures. The assertions of both central limit theorems in this paper can be formulated as invariance principles in the spirit of Lemma 7.1.

As already indicated in Dippon and Renz [4], taking weighted averages of the trajectories works well with the original gradient estimate of Kiefer and Wolfowitz (p = 3).
2. Notations. For a d-dimensional Euclidean space the linear space of d × d matrices is denoted by L(R^d). x^⊤ is the transposed vector of x ∈ R^d, A^* is the adjoint matrix, and tr A is the trace of A ∈ L(R^d). The tensor product x ⊗ y : R^d → R^d is defined by ⟨y, ·⟩x, where x, y ∈ R^d and ⟨·, ·⟩ is the usual inner product. The space C([0, 1], R^d) of R^d-valued continuous functions on [0, 1] is equipped with the maximum norm. Hf(ϑ) is the Hessian of a function f : R^d → R at ϑ ∈ R^d. For x ∈ R we use ⌊x⌋ and ⌈x⌉, denoting the integer part of x and the least integer greater than or equal to x, respectively.
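In coordinates, x ⊗ y = ⟨y, ·⟩x is the rank-one (outer product) matrix x y^⊤; a minimal pure-Python check of this identity (the vectors are chosen arbitrarily):

```python
def tensor(x, y):
    """Matrix of the linear map z -> <y, z> x, i.e., the outer product x y^T."""
    return [[xi * yj for yj in y] for xi in x]

def apply_mat(mat, z):
    return [sum(mij * zj for mij, zj in zip(row, z)) for row in mat]

def inner(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

x, y, z = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
lhs = apply_mat(tensor(x, y), z)        # (x ⊗ y) z
rhs = [inner(y, z) * xi for xi in x]    # <y, z> x
```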
Let (Ω, A, P) be a probability space. Then a sequence (X_n) of R^d-valued random variables (r.v.'s) is called bounded in probability whenever lim_{R→∞} lim sup_n P(‖X_n‖ ≥ R) = 0; (X_n) converges to zero almost in L^r or is bounded almost in L^r (r ∈ (0, ∞)) if for each ε > 0 there exists an Ω_ε ∈ A with P(Ω_ε) ≥ 1 − ε such that (∫_{Ω_ε} ‖X_n‖^r dP)^{1/r} = o(1) or = O(1), respectively. Convergence almost in L^r implies convergence in probability, but it is weaker than a.s. convergence or convergence in the rth mean.
3. A Kiefer–Wolfowitz procedure with an improved gradient estimate. The Kiefer–Wolfowitz procedure, which finds the minimizer ϑ of a regression function f : R^d → R, has been modified by Fabian [6] in such a way that the rate of convergence nearly reaches the rate of a Robbins–Monro procedure if f is assumed to be sufficiently smooth in a neighborhood of ϑ. The method uses multiple observations per step.

We consider here, including the Fabian procedure, a modified Kiefer–Wolfowitz procedure which is given by recursion (1.1). There Y_n is an estimate of the gradient ∇f(X_n) based on error-contaminated observations of f. It is defined by

\[
Y_n = c_n^{-1} \left( \sum_{j=1}^{m} v_j \bigl[ \{ f(X_n + c_n u_j e_i) - V^{(i)}_{n,2j-1} \} - \{ f(X_n - c_n u_j e_i) - V^{(i)}_{n,2j} \} \bigr] \right)_{i=1,\dots,d}, \tag{3.1}
\]
where the following definitions and relations are used throughout section 3: m ∈ N, 0 < u_1 < ··· < u_m ≤ 1, and v_1, ..., v_m are real numbers with Σ_{j=1}^m v_j u_j^{2i−1} = (1/2)δ_{1i} for all i = 1, ..., m (as to the existence, compare Fabian [6]), and c_n = c n^{−γ} with c > 0 and 0 < γ < 1/2. The unit vectors in R^d are denoted by e_1, ..., e_d.
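For given nodes u_1 < ··· < u_m, the weights v_j are determined by the Vandermonde-type system Σ_j v_j u_j^{2i−1} = (1/2)δ_{1i}, i = 1, ..., m, and can be computed by elimination; a sketch (the nodes u = (1/2, 1) are an illustrative choice, not prescribed by the paper):

```python
def fabian_weights(u):
    """Solve sum_j v_j * u_j^(2i-1) = (1/2)*delta_{1i}, i = 1..m, by
    Gauss-Jordan elimination with partial pivoting on the odd-power system."""
    m = len(u)
    # augmented matrix: row i holds the entries u_j^(2i-1) and the right-hand side
    aug = [[u[j]**(2*i - 1) for j in range(m)] + [0.5 if i == 1 else 0.0]
           for i in range(1, m + 1)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[j][m] / aug[j][j] for j in range(m)]

v = fabian_weights([0.5, 1.0])   # m = 2: v = (4/3, -1/6)
```

One can check directly that these weights satisfy both defining relations Σ v_j u_j = 1/2 and Σ v_j u_j³ = 0.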
For future reference, we state the following additional conditions:
(A) ∇f exists on R^d with ∇f(ϑ) = 0.
Concerning the local differentiability of f at ϑ we consider two cases. In the first case (p = 2) we assume that there exist ε > 0, τ ∈ (0, 1], K_1, and K_2 such that
(B1a) Hf(ϑ) exists with ‖∇f(x) − Hf(ϑ)(x − ϑ)‖ ≤ K_1 ‖x − ϑ‖^{1+τ} for all x ∈ U_ε(ϑ),
(B1b) ‖∇f(x) − ∇f(y)‖ ≤ K_2 ‖x − y‖ for all x, y ∈ U_ε(ϑ).
(B1b) holds, for instance, if all second partial derivatives of f exist and are bounded on U_ε(ϑ). For the second case (p ≥ 3), we assume that there exist ε > 0 and L such that
(B2a) derivatives of f up to order p − 1 exist on U_ε(ϑ),
(B2b) the pth derivative of f at ϑ exists,
(B2c) ‖Hf(x) − Hf(y)‖ ≤ L ‖x − y‖ for all x, y ∈ U_ε(ϑ).
A sufficient condition for (B2c) to hold is that all third partial derivatives of f exist and are bounded on U_ε(ϑ).
For brevity, (B1) stands for (B1a) and (B1b), and (B2) for (B2a), (B2b), and (B2c). We use (B) to indicate that either (B1) or (B2) holds.
So far, m has not been specified. The number m must be adapted to the particular value of p given by (B1) or (B2). Fabian [6] considers in this connection the case
(C1) m := ⌊p/2⌋ = (p − 1)/2 for an odd p ≥ 3, γ := 1/(2p).
We will consider in addition the following case (for d = 1 see Renz [19]):
(C2) m := ⌈p/2⌉ for p ≥ 2 (p not necessarily odd), γ := 1/(2p),

which will result in an unbiased limit distribution, whereas (C1) generally leads to a
nonzero bias (Theorem 3.2).
Similarly as above, (C) means that either (C1) or (C2) holds. We note here that
the assumptions (B1) and (C1) do not occur together.
The sequence (W_n) of random variables W_n := (Σ_{j=1}^m v_j (V^{(i)}_{n,2j−1} − V^{(i)}_{n,2j}))_{i=1,...,d} satisfies

\[
\text{(D)} \qquad \| E\, W_m \otimes W_n \| \le \varrho_{n-m} \left( E\|W_m\|^2 \, E\|W_n\|^2 \right)^{1/2} \ \ (n \ge m),
\qquad \sum_{l=0}^{\infty} \varrho_l < \infty, \qquad E\|W_n\|^2 = O(1).
\]
Regarding assumption (B2b) it is worthwhile to note that this condition is invariant under rotation of coordinates (compare Fabian [8]). As a further comparison with related work (Fabian [8], Spall [23]), we remark that our results, Theorems 3.2 and 4.2, do not assume continuity of the highest-order partial derivatives.

Results about asymptotic normality in stochastic approximation usually rely on local smoothness of the regression function f around ϑ and on the consistency of the procedure. The next proposition shows consistency of the modified procedure. The assumptions imposed on f allow us to decouple the influence of the r.v.'s W_n and to use the weak dependence condition (D).
PROPOSITION 3.1. Let a_n = a/n^α with α ∈ (max{1/2 + 1/(2p), 1 − 1/p}, 1) or a_n = (a ln n)/n, a > 0. For recursion (1.1) with gradient estimate (3.1), assume that conditions (A) and (D) hold, f is bounded from below and has a Lipschitz continuous derivative with ∇f(x) ≠ 0 for all x ≠ ϑ, and sup{‖x‖ : f(x) ≤ λ} < ∞ for all λ > inf{f(x) : x ∈ R^d}. Then X_n → ϑ (n → ∞) a.s.
Under condition (C1) a nonweighted analogue of the next theorem can be found in Fabian [8].

THEOREM 3.2. Let a_n = (a ln n)/n for p = 2 and a_n = a/n^α with α ∈ (1/2 + 1/(2p), 1) for p ≥ 3. For recursion (1.1) with gradient estimate (3.1), assume that conditions (A)–(D) hold, A := Hf(ϑ) is positive definite, and X_n → ϑ a.s. Let

\[
B_n(t) := n^{-1/2} \Bigl\{ \sum_{i=1}^{\lfloor nt \rfloor} W_i + (nt - \lfloor nt \rfloor) W_{\lfloor nt \rfloor + 1} \Bigr\}.
\]

Suppose the existence of a Brownian motion B with covariance matrix S of B(1) and

\[
B_n \xrightarrow{\mathcal{D}} B \ \text{ in } \ C([0,1], \mathbb{R}^d) \qquad (n \to \infty).
\]

Then, for all δ > −(p+1)/(2p),

\[
n^{\frac{1}{2}\left(1-\frac{1}{p}\right)} \left( \widetilde{X}_{n,\delta} - \vartheta \right)
\xrightarrow{\mathcal{D}}
N\!\left( \frac{2p(1+\delta)}{p+1+2p\delta}\, c^{p-1} A^{-1} b,\ \frac{p(1+\delta)^2}{p+1+2p\delta}\, c^{-2} A^{-1} S A^{-1} \right)
\qquad (n \to \infty),
\]

where

\[
b = \frac{1}{p!} \sum_{j=1}^{m} v_j u_j^{p} \left( 1 + (-1)^{p+1} \right) \left( \frac{\partial^p}{(\partial x_i)^p} f(\vartheta) \right)_{i=1,\dots,d}
\]

and X̃_{n,δ} is defined in (1.2). In particular, under condition (C2), b = 0.
REMARK 3.3. The choices δ = 0 and δ = −2γ = −1/p are of special interest. Provided b ≠ 0, the pair (δ, c) = (0, c_0) with c_0 as given in (5.1) minimizes the second moment of the limit distribution. However, for fixed c > 0, the limit's covariance is minimized by δ = −2γ = −1/p. In particular, Theorem 3.2 yields for n → ∞

\[
n^{\frac{1}{2}\left(1-\frac{1}{p}\right)} \left( n^{-1} \sum_{k=1}^{n} X_k - \vartheta \right)
\xrightarrow{\mathcal{D}}
N\!\left( \frac{2p}{p+1}\, c^{p-1} A^{-1} b,\ \frac{p}{p+1}\, c^{-2} A^{-1} S A^{-1} \right),
\]

\[
n^{\frac{1}{2}\left(1-\frac{1}{p}\right)} \left( \frac{p-1}{p}\, n^{-\frac{p-1}{p}} \sum_{k=1}^{n} k^{-1/p} X_k - \vartheta \right)
\xrightarrow{\mathcal{D}}
N\!\left( 2\, c^{p-1} A^{-1} b,\ \frac{p-1}{p}\, c^{-2} A^{-1} S A^{-1} \right).
\]
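The scalar factors 2p(1+δ)/(p+1+2pδ) (in front of the bias term c^{p−1}A^{−1}b) and p(1+δ)²/(p+1+2pδ) (in front of c^{−2}A^{−1}SA^{−1}) specialize at δ = 0 and δ = −1/p to the values stated in Remark 3.3; a quick numerical verification sketch (not code from the paper):

```python
def bias_coeff(p, delta):
    """Scalar factor in the limit mean of Theorem 3.2."""
    return 2.0 * p * (1.0 + delta) / (p + 1.0 + 2.0 * p * delta)

def cov_coeff(p, delta):
    """Scalar factor in the limit covariance of Theorem 3.2."""
    return p * (1.0 + delta)**2 / (p + 1.0 + 2.0 * p * delta)

p = 3.0
b0, c0 = bias_coeff(p, 0.0), cov_coeff(p, 0.0)          # expect 2p/(p+1), p/(p+1)
b1, c1 = bias_coeff(p, -1.0/p), cov_coeff(p, -1.0/p)    # expect 2, (p-1)/p
```

At δ = −1/p the covariance factor (p−1)/p is indeed smaller than p/(p+1) at δ = 0, while the bias factor grows from 2p/(p+1) to 2, reflecting the trade-off described in the remark.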

4. A Kiefer–Wolfowitz procedure with simultaneous perturbation gradient approximation. The classical Kiefer–Wolfowitz (finite difference) stochastic approximation method (FDSA) needs 2d observations to obtain a finite difference approximation of the gradient of the function f : R^d → R whose minimizer ϑ is sought. To reduce the number of observations in each step, randomized gradient approximation methods have been considered in the literature. Two examples are random direction stochastic approximation (RDSA), suggested by Kushner and Clark [12], and simultaneous perturbation stochastic approximation (SPSA), suggested by Spall [22]. Both methods are based on only two observations in each iteration. Depending on the dimension d and the third derivatives of the regression function f, the AMSE of the SPSA method can be better or worse than that of the FDSA and RDSA methods. At least for second-order polynomials f, the FDSA method needs d times more observations than the SPSA method to achieve the same level of mean squared error asymptotically, when the same span c_n = c n^{−γ} is used (Spall [23]).
Before the idea of weighted averages is applied to the SPSA method, we will describe this algorithm in more detail. Again recursion (1.1) is used, but with step lengths a_n = a n^{−α} and with the following so-called simultaneous perturbation gradient estimate of ∇f(X_n):

\[
Y_n = \frac{1}{2 c_n}
\begin{pmatrix} (\Delta^{(1)}_n)^{-1} \\ \vdots \\ (\Delta^{(d)}_n)^{-1} \end{pmatrix}
\bigl( [ f(X_n + c_n \Delta_n) - W_{n,1} ] - [ f(X_n - c_n \Delta_n) - W_{n,2} ] \bigr) \tag{4.1}
\]

consisting of (artificially generated) random vectors Δ_n ∈ M(Ω, R^d), observation errors W_{n,1}, W_{n,2} ∈ M(Ω, R), and span c_n.
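A minimal sketch of the simultaneous perturbation estimate (4.1), using Rademacher (±1) perturbation components, which satisfy condition (E) below with α_0 = α_1 = 1; the quadratic test function and all constants are illustrative, and the observation errors W_{n,1}, W_{n,2} are omitted for clarity:

```python
import random

def spsa_gradient(f, x, c_n, rng=random):
    """Simultaneous perturbation estimate (4.1) of grad f(x): only two
    function evaluations are needed, regardless of the dimension d."""
    d = len(x)
    delta = [rng.choice([-1.0, 1.0]) for _ in range(d)]   # Rademacher perturbations
    x_plus = [xi + c_n * di for xi, di in zip(x, delta)]
    x_minus = [xi - c_n * di for xi, di in zip(x, delta)]
    diff = f(x_plus) - f(x_minus)
    return [diff / (2.0 * c_n * di) for di in delta]

def quad(x):
    # illustrative test function with minimizer (1, 1); its gradient at (3, 3) is (4, 4)
    return sum((xi - 1.0)**2 for xi in x)

random.seed(1)
n = 4000
avg = [0.0, 0.0]
for _ in range(n):
    g = spsa_gradient(quad, [3.0, 3.0], c_n=0.1)
    avg = [a + gi / n for a, gi in zip(avg, g)]
# avg approximates the true gradient (4, 4): a single estimate is noisy because
# it mixes all coordinates through the random directions, but its mean matches
# the gradient for quadratic f
```

Averaging many independent estimates recovers the gradient; this per-step noise is exactly what the weighted averaging of the iterates is designed to smooth out.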
We consider the following set of conditions.
(E) The components Δ^{(i)}_n of Δ_n, i = 1, ..., d, for n ∈ N fixed, form a set of independent, identically and symmetrically distributed r.v.'s with |Δ^{(l)}_n| having values between fixed positive numbers α_0 and α_1. The r.v. Δ_n is assumed to be independent of {X_1, ..., X_n, Δ_1, ..., Δ_{n−1}}. Furthermore, we use ξ² = E|Δ^{(l)}_n|² and ρ² = E|Δ^{(l)}_n|^{−2}. For simplicity, the column vector appearing in (4.1) is denoted by Δ_n^{−1}.
(F) The difference W_n = W_{n,1} − W_{n,2} of the observation errors satisfies E(W_n | F_n) = 0 and sup_n E(W_n² | G_n) < ∞ a.s., where F_n and G_n denote the σ-fields generated by {X_1, ..., X_n, Δ_1, ..., Δ_n} and {X_1, ..., X_n, Δ_1, ..., Δ_{n−1}}, respectively.
(G) ∞ > E(W_n² | F_n) → σ² a.s. and E(W_n² 1_{[W_n² ≥ rn]} | F_n) → 0 a.s. for every r > 0.
(H) (B2) holds for p = 3, and A = Hf(ϑ) is a positive definite matrix.
The proposition below presents conditions for the recursion's consistency. It is related to Blum's result [2] on multivariate Kiefer–Wolfowitz procedures. Under different and less intuitive assumptions and with a different method of proof, Spall [23] asserts consistency as well.
PROPOSITION 4.1. Let a_n = a/n^α with α ∈ (max{γ + 1/2, 1 − 2γ}, 1] and γ > 0. For recursion (1.1) with gradient estimate (4.1), assume that conditions (A), (E), and (F) hold, and that f is bounded from below and has a Lipschitz continuous gradient.
(a) If sup{‖x‖ : f(x) ≤ λ} < ∞ for all λ > inf{f(x) : x ∈ R^d}, then sup_n ‖X_n‖ < ∞ a.s.

References

J. C. Spall, Multivariate stochastic approximation using a simultaneous perturbation gradient approximation.
J. Kiefer and J. Wolfowitz, Stochastic estimation of the maximum of a regression function.
B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation by averaging.
H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems.
M. T. Wasan, Stochastic Approximation.