
WEIGHTED MEANS IN STOCHASTIC APPROXIMATION OF MINIMA

J. DIPPON AND J. RENZ

SIAM J. CONTROL OPTIM., Vol. 35, No. 5, pp. 1811–1827, September 1997
© 1997 Society for Industrial and Applied Mathematics
Abstract. Weighted averages of Kiefer–Wolfowitz-type procedures, which are driven by larger
step lengths than usual, can achieve the optimal rate of convergence. A priori knowledge of a lower
bound on the smallest eigenvalue of the Hessian matrix is avoided. The asymptotic mean squared
error of the weighted averaging algorithm is the same as would emerge using a Newton-type adaptive
algorithm. Several different gradient estimates are considered; one of them leads to a vanishing
asymptotic bias. This gradient estimate applied with the weighted averaging algorithm usually
yields a better asymptotic mean squared error than applied with the standard algorithm.
Key words. stochastic approximation, acceleration by weighted averaging, weak invariance
principle, consistency, Kiefer–Wolfowitz procedure, gradient estimation, optimization
AMS subject classifications. Primary, 62L20; Secondary, 60F05, 60F17, 93E23
PII. S0363012995283789
1. Introduction. In stochastic approximation the minimizer ϑ of an unknown regression function f : R^d → R can be estimated by running the recursion

\[
X_{n+1} = X_n - a_n Y_n, \tag{1.1}
\]

where Y_n is a gradient estimate of f at the point X_n and a_n are positive step lengths decreasing to zero. For instance, for d = 1 and decreasing span c_n, Kiefer and Wolfowitz [11] used divided differences Y_n = (Y_{n,1} − Y_{n,2})/(2c_n) as an approximation of f′(X_n), where Y_{n,1} and Y_{n,2} are error-contaminated observations of f(X_n + c_n) and f(X_n − c_n), respectively. If f is p-times differentiable at ϑ, and if the gradient estimates Y_n are constructed appropriately, one can obtain

\[
n^{\frac{\alpha}{2}\left(1-\frac{1}{p}\right)} (X_n - \vartheta) \xrightarrow{\mathcal{D}} N(\mu, K) \qquad (n \to \infty)
\]

with step lengths a_n = a n^{−α} for some a > 0 and α ∈ (0, 1] (see Fabian [8] for p ≥ 3 odd). Hence, for step lengths a_n = a/n, the convergence rate n^{−(1−1/p)/2} is obtained. This is the exact minimax order in the problem of estimating the minimizer of f for f belonging to a certain class of p-times differentiable functions (Polyak and Tsybakov [18]).
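As a concrete illustration of recursion (1.1) with the Kiefer–Wolfowitz divided-difference estimate, consider the following minimal simulation sketch; the quadratic test function, the noise level, and the tuning constants a, c, α, γ are illustrative choices and are not taken from the paper.

```python
import random

def kiefer_wolfowitz(f, x0, n_steps, a=1.0, c=1.0, alpha=1.0, gamma=0.25,
                     noise=0.1, rng=random):
    """Run X_{n+1} = X_n - a_n Y_n with the divided-difference estimate
    Y_n = (Y_{n,1} - Y_{n,2}) / (2 c_n), where Y_{n,1}, Y_{n,2} are noisy
    observations of f(X_n + c_n) and f(X_n - c_n) (the d = 1 case)."""
    x = x0
    for n in range(1, n_steps + 1):
        a_n = a / n**alpha           # step lengths decreasing to zero
        c_n = c / n**gamma           # decreasing span of the divided difference
        y1 = f(x + c_n) + noise * rng.gauss(0.0, 1.0)
        y2 = f(x - c_n) + noise * rng.gauss(0.0, 1.0)
        x = x - a_n * (y1 - y2) / (2.0 * c_n)
    return x

random.seed(0)
# illustrative regression function with minimizer theta = 2
x_final = kiefer_wolfowitz(lambda x: (x - 2.0)**2, x0=0.0, n_steps=5000)
```

For a quadratic f the divided difference reproduces the derivative exactly, so the iterates approach ϑ = 2 up to the accumulated observation noise.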
In this paper we investigate weighted means

\[
\widetilde{X}_{n,\delta} = \frac{1+\delta}{n^{1+\delta}} \sum_{i=1}^{n} i^{\delta} X_i \tag{1.2}
\]

of Kiefer–Wolfowitz-type processes (X_n) generated by recursion (1.1) with some gradient estimates Y_n for p-times differentiable regression functions and step lengths converging to zero more slowly than 1/n. We obtain

\[
n^{\frac{1}{2}\left(1-\frac{1}{p}\right)} \left( \widetilde{X}_{n,\delta} - \vartheta \right) \xrightarrow{\mathcal{D}} N(\widetilde{\mu}, \widetilde{K}) \qquad (n \to \infty)
\]
Received by the editors March 27, 1995; accepted for publication (in revised form) July 23, 1996. The research of the second author was supported by a Deutsche Forschungsgemeinschaft grant.
http://www.siam.org/journals/sicon/35-5/28378.html
Mathematisches Institut A, Universität Stuttgart, 70511 Stuttgart, Germany (dippon@mathematik.uni-stuttgart.de).
Landesgirokasse, 70144 Stuttgart, Germany (RiskManagement@t-online.de).
for some weight parameters δ and various types of gradient estimates (Theorems 3.2 and 4.2). The main advantages are the following. First, a priori knowledge of a lower bound on the smallest eigenvalue λ_0 of the Hessian Hf(ϑ) of f at ϑ is avoided. If, in the standard algorithm with a_n = a/n, the constant a is chosen too small, i.e., a ≤ (1 − 1/p)/(2λ_0), convergence can be very slow. To be safe one might choose a rather large. But the asymptotic mean squared error (AMSE) produced by the standard algorithm grows approximately linearly in a. These problems do not arise when the averaging algorithm is applied. On the other hand, if an asymptotic bias is present, the AMSE of the averaging algorithm cannot be greater than four times the AMSE of the standard algorithm with the optimal, but usually unknown, constant a. In this sense the averaging algorithm can be considered more stable than the standard one. Furthermore, the averaging algorithm has the same limit distribution as the Newton-type adaptive procedure suggested by Fabian [9] (section 5).
The method proposed in this paper is inspired by an idea of Ruppert [21] and Polyak [16], who suggested considering the arithmetic mean of the trajectories of a Robbins–Monro process that is likewise driven by step lengths decreasing more slowly than 1/n. In this case one obtains the best possible convergence rate and, in a certain sense, the optimal covariance of the asymptotic distribution [17]. Since then Yin [27], Pechtl [15], Kushner and Yang [13], Györfi and Walk [10], Nazin and Shcherbakov [14], and others have studied this idea.
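The weighted mean (1.2) itself is straightforward to compute; a minimal sketch (the sample values below are arbitrary):

```python
def weighted_mean(xs, delta):
    """Weighted mean (1.2): ((1+delta)/n^(1+delta)) * sum_{i=1..n} i^delta * X_i."""
    n = len(xs)
    s = sum(i**delta * x for i, x in enumerate(xs, start=1))
    return (1.0 + delta) * s / n**(1.0 + delta)

xs = [1.0, 2.0, 3.0, 4.0]
m0 = weighted_mean(xs, delta=0.0)      # delta = 0: the arithmetic mean
m_neg = weighted_mean(xs, delta=-0.5)  # delta < 0: relatively more weight on early iterates
```

For δ = 0 this reduces to the arithmetic mean considered by Ruppert and Polyak; negative δ shifts weight toward the early iterates.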
A further contribution of this paper is a new design for estimating the gradient which leads to a vanishing asymptotic bias µ̃ (for d = 1 see Renz [19]), regardless of which method (with or without averaging, or with adaptation) is used. Applying the weighted averaging algorithm together with this gradient estimate leads to a second moment of the asymptotic distribution which is minimal within a large class of procedures (relation (5.4)).
Spall [22] introduced another gradient estimate Y_n, based on the so-called simultaneous perturbation method. It uses only two observations at each step instead of 2d observations, as in the standard Kiefer–Wolfowitz method in R^d. This makes it suitable for certain optimization problems in high-dimensional spaces R^d. Taking weighted averages of the process generated with Spall's gradient estimate stabilizes the performance, as discussed below (Theorem 4.2 and section 5).
All these central limit theorems require consistency of the stochastic approximation method (Propositions 3.1 and 4.1). To prove the central limit theorems we apply a weak invariance principle stated in Lemma 7.1. Taking weighted averages of the trajectories leads to an accumulation of terms due to the nonlinearity of the regression function. To cope with this effect, the assumptions of this lemma are partly stronger than those of a functional central limit theorem for the nonweighted case (see Walk [24]). But fortunately, the additional conditions can be shown to be fulfilled for many stochastic approximation procedures. The assertions of both central limit theorems in this paper can be formulated as invariance principles in the spirit of Lemma 7.1.

As already indicated in Dippon and Renz [4], taking weighted averages of the trajectories works well with the original gradient estimate of Kiefer and Wolfowitz (p = 3).
2. Notations. For a d-dimensional Euclidean space the linear space of d × d matrices is denoted by L(R^d). x^⊤ is the transposed vector of x ∈ R^d, A^* is the adjoint matrix, and tr A is the trace of A ∈ L(R^d). The tensor product x ⊗ y : R^d → R^d is defined by ⟨y, ·⟩x, where x, y ∈ R^d and ⟨·, ·⟩ is the usual inner product. The space C([0, 1], R^d) of R^d-valued continuous functions on [0, 1] is equipped with the maximum norm. Hf(ϑ) is the Hessian of a function f : R^d → R at ϑ ∈ R^d. For x ∈ R we use ⌊x⌋ and ⌈x⌉, denoting the integer part of x and the least integer greater than or equal to x, respectively.
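In coordinates, x ⊗ y = ⟨y, ·⟩x is the rank-one (outer product) matrix x y^⊤; a minimal pure-Python check of this identity (the vectors are chosen arbitrarily):

```python
def tensor(x, y):
    """Matrix of the linear map z -> <y, z> x, i.e., the outer product x y^T."""
    return [[xi * yj for yj in y] for xi in x]

def apply_mat(mat, z):
    return [sum(mij * zj for mij, zj in zip(row, z)) for row in mat]

def inner(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

x, y, z = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
lhs = apply_mat(tensor(x, y), z)        # (x ⊗ y) z
rhs = [inner(y, z) * xi for xi in x]    # <y, z> x
```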
Let (Ω, A, P) be a probability space. Then a sequence (X_n) of R^d-valued random variables (r.v.'s) is called bounded in probability whenever lim_{R→∞} lim sup_n P(‖X_n‖ ≥ R) = 0; (X_n) converges to zero almost in L^r or is bounded almost in L^r (r ∈ (0, ∞)) if for each ε > 0 there exists an Ω_ε ∈ A with P(Ω_ε) ≥ 1 − ε such that (∫_{Ω_ε} ‖X_n‖^r dP)^{1/r} = o(1) or = O(1), respectively. Convergence almost in L^r implies convergence in probability, but it is weaker than a.s. convergence or convergence in the rth mean.
3. A Kiefer–Wolfowitz procedure with an improved gradient estimate. The Kiefer–Wolfowitz procedure, which finds the minimizer ϑ of a regression function f : R^d → R, has been modified by Fabian [6] in such a way that the rate of convergence nearly reaches the rate of a Robbins–Monro procedure if f is assumed to be sufficiently smooth in a neighborhood of ϑ. The method uses multiple observations per step.

We consider here, including the Fabian procedure, a modified Kiefer–Wolfowitz procedure which is given by recursion (1.1). There Y_n is an estimate of the gradient ∇f(X_n) based on error-contaminated observations of f. It is defined by

\[
Y_n = c_n^{-1} \left( \sum_{j=1}^{m} v_j \bigl[ \{ f(X_n + c_n u_j e_i) - V^{(i)}_{n,2j-1} \} - \{ f(X_n - c_n u_j e_i) - V^{(i)}_{n,2j} \} \bigr] \right)_{i=1,\dots,d}, \tag{3.1}
\]
where the following definitions and relations are used throughout section 3: m ∈ N, 0 < u_1 < ··· < u_m ≤ 1, and v_1, ..., v_m are real numbers with Σ_{j=1}^m v_j u_j^{2i−1} = (1/2)δ_{1i} for all i = 1, ..., m (as to the existence, compare Fabian [6]), and c_n = c n^{−γ} with c > 0 and 0 < γ < 1/2. The unit vectors in R^d are denoted by e_1, ..., e_d.
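For given nodes u_1 < ··· < u_m, the weights v_j are determined by the Vandermonde-type system Σ_j v_j u_j^{2i−1} = (1/2)δ_{1i}, i = 1, ..., m, and can be computed by elimination; a sketch (the nodes u = (1/2, 1) are an illustrative choice, not prescribed by the paper):

```python
def fabian_weights(u):
    """Solve sum_j v_j * u_j^(2i-1) = (1/2)*delta_{1i}, i = 1..m, by
    Gauss-Jordan elimination with partial pivoting on the odd-power system."""
    m = len(u)
    # augmented matrix: row i holds the entries u_j^(2i-1) and the right-hand side
    aug = [[u[j]**(2*i - 1) for j in range(m)] + [0.5 if i == 1 else 0.0]
           for i in range(1, m + 1)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[j][m] / aug[j][j] for j in range(m)]

v = fabian_weights([0.5, 1.0])   # m = 2: v = (4/3, -1/6)
```

One can check directly that these weights satisfy both defining relations Σ v_j u_j = 1/2 and Σ v_j u_j³ = 0.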
For future reference, we state the following additional conditions:
(A) ∇f exists on R^d with ∇f(ϑ) = 0.
Concerning the local differentiability of f at ϑ we consider two cases. In the first case (p = 2) we assume that there exist ε > 0, τ ∈ (0, 1], K_1, and K_2 such that
(B1a) Hf(ϑ) exists with ‖∇f(x) − Hf(ϑ)(x − ϑ)‖ ≤ K_1 ‖x − ϑ‖^{1+τ} for all x ∈ U_ε(ϑ),
(B1b) ‖∇f(x) − ∇f(y)‖ ≤ K_2 ‖x − y‖ for all x, y ∈ U_ε(ϑ).
(B1b) holds, for instance, if all second partial derivatives of f exist and are bounded on U_ε(ϑ). For the second case (p ≥ 3), we assume that there exist ε > 0 and L such that
(B2a) derivatives of f up to order p − 1 exist on U_ε(ϑ),
(B2b) the pth derivative of f at ϑ exists,
(B2c) ‖Hf(x) − Hf(y)‖ ≤ L ‖x − y‖ for all x, y ∈ U_ε(ϑ).
A sufficient condition for (B2c) to hold is that all third partial derivatives of f exist and are bounded on U_ε(ϑ).
For brevity, (B1) stands for (B1a) and (B1b), and (B2) for (B2a), (B2b), and (B2c). We use (B) to indicate that either (B1) or (B2) holds.
So far, m has not been specified. The number m must be adapted to the particular value of p given by (B1) or (B2). Fabian [6] considers in this connection the case
(C1) m := ⌊p/2⌋ = (p − 1)/2 for an odd p ≥ 3, γ := 1/(2p).
We will consider in addition the following case (for d = 1 see Renz [19]):
(C2) m := ⌈p/2⌉ for p ≥ 2 (p not necessarily odd), γ := 1/(2p),

which will result in an unbiased limit distribution, whereas (C1) generally leads to a
nonzero bias (Theorem 3.2).
Similarly as above, (C) means that either (C1) or (C2) holds. We note here that
the assumptions (B1) and (C1) do not occur together.
The sequence (W_n) of random variables W_n := (Σ_{j=1}^m v_j (V^{(i)}_{n,2j−1} − V^{(i)}_{n,2j}))_{i=1,...,d} satisfies

\[
\text{(D)} \qquad \| E\, W_m \otimes W_n \| \le \varrho_{n-m} \left( E\|W_m\|^2 \, E\|W_n\|^2 \right)^{1/2} \ \ (n \ge m),
\qquad \sum_{l=0}^{\infty} \varrho_l < \infty, \qquad E\|W_n\|^2 = O(1).
\]
Regarding assumption (B2b) it is worthwhile to note that this condition is invariant under rotation of coordinates (compare Fabian [8]). As a further comparison with related work (Fabian [8], Spall [23]), we remark that our results, Theorems 3.2 and 4.2, do not assume continuity of the highest-order partial derivatives.

Results about asymptotic normality in stochastic approximation usually rely on local smoothness of the regression function f around ϑ and on the consistency of the procedure. The next proposition shows consistency of the modified procedure. The assumptions imposed on f allow us to decouple the influence of the r.v.'s W_n and to use the weak dependence condition (D).
PROPOSITION 3.1. Let a_n = a/n^α with α ∈ (max{1/2 + 1/(2p), 1 − 1/p}, 1) or a_n = (a ln n)/n, a > 0. For recursion (1.1) with gradient estimate (3.1), assume that conditions (A) and (D) hold, f is bounded from below and has a Lipschitz continuous derivative with ∇f(x) ≠ 0 for all x ≠ ϑ, and sup{‖x‖ : f(x) ≤ λ} < ∞ for all λ > inf{f(x) : x ∈ R^d}. Then X_n → ϑ (n → ∞) a.s.
Under condition (C1) a nonweighted analogue of the next theorem can be found in Fabian [8].

THEOREM 3.2. Let a_n = (a ln n)/n for p = 2 and a_n = a/n^α with α ∈ (1/2 + 1/(2p), 1) for p ≥ 3. For recursion (1.1) with gradient estimate (3.1), assume that conditions (A)–(D) hold, A := Hf(ϑ) is positive definite, and X_n → ϑ a.s. Let

\[
B_n(t) := n^{-1/2} \Bigl\{ \sum_{i=1}^{\lfloor nt \rfloor} W_i + (nt - \lfloor nt \rfloor) W_{\lfloor nt \rfloor + 1} \Bigr\}.
\]

Suppose the existence of a Brownian motion B with covariance matrix S of B(1) and

\[
B_n \xrightarrow{\mathcal{D}} B \ \text{ in } \ C([0,1], \mathbb{R}^d) \qquad (n \to \infty).
\]

Then, for all δ > −(p+1)/(2p),

\[
n^{\frac{1}{2}\left(1-\frac{1}{p}\right)} \left( \widetilde{X}_{n,\delta} - \vartheta \right)
\xrightarrow{\mathcal{D}}
N\!\left( \frac{2p(1+\delta)}{p+1+2p\delta}\, c^{p-1} A^{-1} b,\ \frac{p(1+\delta)^2}{p+1+2p\delta}\, c^{-2} A^{-1} S A^{-1} \right)
\qquad (n \to \infty),
\]

where

\[
b = \frac{1}{p!} \sum_{j=1}^{m} v_j u_j^{p} \left( 1 + (-1)^{p+1} \right) \left( \frac{\partial^p}{(\partial x_i)^p} f(\vartheta) \right)_{i=1,\dots,d}
\]

and X̃_{n,δ} is defined in (1.2). In particular, under condition (C2), b = 0.
REMARK 3.3. The choices δ = 0 and δ = −2γ = −1/p are of special interest. Provided b ≠ 0, the pair (δ, c) = (0, c_0) with c_0 as given in (5.1) minimizes the second moment of the limit distribution. However, for fixed c > 0, the limit's covariance is minimized by δ = −2γ = −1/p. In particular, Theorem 3.2 yields for n → ∞

\[
n^{\frac{1}{2}\left(1-\frac{1}{p}\right)} \left( n^{-1} \sum_{k=1}^{n} X_k - \vartheta \right)
\xrightarrow{\mathcal{D}}
N\!\left( \frac{2p}{p+1}\, c^{p-1} A^{-1} b,\ \frac{p}{p+1}\, c^{-2} A^{-1} S A^{-1} \right),
\]

\[
n^{\frac{1}{2}\left(1-\frac{1}{p}\right)} \left( \frac{p-1}{p}\, n^{-\frac{p-1}{p}} \sum_{k=1}^{n} k^{-1/p} X_k - \vartheta \right)
\xrightarrow{\mathcal{D}}
N\!\left( 2\, c^{p-1} A^{-1} b,\ \frac{p-1}{p}\, c^{-2} A^{-1} S A^{-1} \right).
\]
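The scalar factors 2p(1+δ)/(p+1+2pδ) (in front of the bias term c^{p−1}A^{−1}b) and p(1+δ)²/(p+1+2pδ) (in front of c^{−2}A^{−1}SA^{−1}) specialize at δ = 0 and δ = −1/p to the values stated in Remark 3.3; a quick numerical verification sketch (not code from the paper):

```python
def bias_coeff(p, delta):
    """Scalar factor in the limit mean of Theorem 3.2."""
    return 2.0 * p * (1.0 + delta) / (p + 1.0 + 2.0 * p * delta)

def cov_coeff(p, delta):
    """Scalar factor in the limit covariance of Theorem 3.2."""
    return p * (1.0 + delta)**2 / (p + 1.0 + 2.0 * p * delta)

p = 3.0
b0, c0 = bias_coeff(p, 0.0), cov_coeff(p, 0.0)          # expect 2p/(p+1), p/(p+1)
b1, c1 = bias_coeff(p, -1.0/p), cov_coeff(p, -1.0/p)    # expect 2, (p-1)/p
```

At δ = −1/p the covariance factor (p−1)/p is indeed smaller than p/(p+1) at δ = 0, while the bias factor grows from 2p/(p+1) to 2, reflecting the trade-off described in the remark.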

4. A Kiefer–Wolfowitz procedure with simultaneous perturbation gradient approximation. The classical Kiefer–Wolfowitz (finite difference) stochastic approximation method (FDSA) needs 2d observations to obtain a finite difference approximation of the gradient of the function f : R^d → R whose minimizer ϑ is sought. To reduce the number of observations in each step, randomized gradient approximation methods have been considered in the literature. Two examples are random direction stochastic approximation (RDSA), suggested by Kushner and Clark [12], and simultaneous perturbation stochastic approximation (SPSA), suggested by Spall [22]. Both methods are based on only two observations in each iteration. Depending on the dimension d and the third derivatives of the regression function f, the AMSE of the SPSA method can be better or worse than that of the FDSA and RDSA methods. At least for second-order polynomials f, the FDSA method needs d times more observations than the SPSA method to achieve the same level of mean squared error asymptotically, when the same span c_n = c n^{−γ} is used (Spall [23]).
Before the idea of weighted averages is applied to the SPSA method, we will describe this algorithm in more detail. Again recursion (1.1) is used, but with step lengths a_n = a n^{−α} and with the following so-called simultaneous perturbation gradient estimate of ∇f(X_n):

\[
Y_n = \frac{1}{2 c_n}
\begin{pmatrix} (\Delta^{(1)}_n)^{-1} \\ \vdots \\ (\Delta^{(d)}_n)^{-1} \end{pmatrix}
\bigl( [ f(X_n + c_n \Delta_n) - W_{n,1} ] - [ f(X_n - c_n \Delta_n) - W_{n,2} ] \bigr) \tag{4.1}
\]

consisting of (artificially generated) random vectors Δ_n ∈ M(Ω, R^d), observation errors W_{n,1}, W_{n,2} ∈ M(Ω, R), and span c_n.
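A minimal sketch of the simultaneous perturbation estimate (4.1), using Rademacher (±1) perturbation components, which satisfy condition (E) below with α_0 = α_1 = 1; the quadratic test function and all constants are illustrative, and the observation errors W_{n,1}, W_{n,2} are omitted for clarity:

```python
import random

def spsa_gradient(f, x, c_n, rng=random):
    """Simultaneous perturbation estimate (4.1) of grad f(x): only two
    function evaluations are needed, regardless of the dimension d."""
    d = len(x)
    delta = [rng.choice([-1.0, 1.0]) for _ in range(d)]   # Rademacher perturbations
    x_plus = [xi + c_n * di for xi, di in zip(x, delta)]
    x_minus = [xi - c_n * di for xi, di in zip(x, delta)]
    diff = f(x_plus) - f(x_minus)
    return [diff / (2.0 * c_n * di) for di in delta]

def quad(x):
    # illustrative test function with minimizer (1, 1); its gradient at (3, 3) is (4, 4)
    return sum((xi - 1.0)**2 for xi in x)

random.seed(1)
n = 4000
avg = [0.0, 0.0]
for _ in range(n):
    g = spsa_gradient(quad, [3.0, 3.0], c_n=0.1)
    avg = [a + gi / n for a, gi in zip(avg, g)]
# avg approximates the true gradient (4, 4): a single estimate is noisy because
# it mixes all coordinates through the random directions, but its mean matches
# the gradient for quadratic f
```

Averaging many independent estimates recovers the gradient; this per-step noise is exactly what the weighted averaging of the iterates is designed to smooth out.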
We consider the following set of conditions.
(E) The components Δ^{(i)}_n of Δ_n, i = 1, ..., d, for n ∈ N fixed, form a set of independent, identically and symmetrically distributed r.v.'s with |Δ^{(l)}_n| having values between fixed positive numbers α_0 and α_1. The r.v. Δ_n is assumed to be independent of {X_1, ..., X_n, Δ_1, ..., Δ_{n−1}}. Furthermore, we use ξ² = E|Δ^{(l)}_n|² and ρ² = E|Δ^{(l)}_n|^{−2}. For simplicity, the column vector appearing in (4.1) is denoted by Δ_n^{−1}.
(F) The difference W_n = W_{n,1} − W_{n,2} of the observation errors satisfies E(W_n | F_n) = 0 and sup_n E(W_n² | G_n) < ∞ a.s., where F_n and G_n denote the σ-fields generated by {X_1, ..., X_n, Δ_1, ..., Δ_n} and {X_1, ..., X_n, Δ_1, ..., Δ_{n−1}}, respectively.
(G) ∞ > E(W_n² | F_n) → σ² a.s. and E(W_n² 1_{[W_n² ≥ rn]} | F_n) → 0 a.s. for every r > 0.
(H) (B2) holds for p = 3, and A = Hf(ϑ) is a positive definite matrix.
The proposition below presents conditions for the recursion's consistency. It is related to Blum's result [2] on multivariate Kiefer–Wolfowitz procedures. Under different and less intuitive assumptions and with a different method of proof, Spall [23] asserts consistency as well.
PROPOSITION 4.1. Let a_n = a/n^α with α ∈ (max{γ + 1/2, 1 − 2γ}, 1] and γ > 0. For recursion (1.1) with gradient estimate (4.1), assume that conditions (A), (E), and (F) hold, and that f is bounded from below and has a Lipschitz continuous gradient.
(a) If sup{‖x‖ : f(x) ≤ λ} < ∞ for all λ > inf{f(x) : x ∈ R^d}, then sup_n ‖X_n‖ < ∞ a.s.

References

J. C. Spall, Multivariate stochastic approximation using a simultaneous perturbation gradient approximation.
J. Kiefer and J. Wolfowitz, Stochastic estimation of the maximum of a regression function.
B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation by averaging.
H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems.
M. T. Wasan, Stochastic Approximation.