SIAM J. CONTROL OPTIM. © 2008 Society for Industrial and Applied Mathematics
Vol. 47, No. 5, pp. 2410–2439
A KNOWLEDGE-GRADIENT POLICY FOR SEQUENTIAL INFORMATION COLLECTION
PETER I. FRAZIER, WARREN B. POWELL, AND SAVAS DAYANIK
Abstract. In a sequential Bayesian ranking and selection problem with independent normal
populations and common known variance, we study a previously introduced measurement policy
which we refer to as the knowledge-gradient policy. This policy myopically maximizes the expected
increment in the value of information in each time period, where the value is measured according to
the terminal utility function. We show that the knowledge-gradient policy is optimal both when the
horizon is a single time period and in the limit as the horizon extends to infinity. We show furthermore
that, in some special cases, the knowledge-gradient policy is optimal regardless of the length of any
given fixed total sampling horizon. We bound the knowledge-gradient policy’s suboptimality in the
remaining cases, and show through simulations that it performs competitively with or significantly
better than other policies.
Key words. ranking and selection, Bayesian statistics, sequential decision analysis
AMS subject classifications. 62F07, 62F15, 62L05
DOI. 10.1137/070693424
1. Introduction. We consider a ranking and selection problem in which we are
faced with M ≥ 2 alternatives, each of which can be measured sequentially to estimate
its constant but unknown underlying average performance. The measurements are
noisy, and as we obtain more measurements, our estimates become more accurate.
We assume normally distributed measurement noise and independent normal Bayesian
priors for each alternative’s underlying average performance. We have a budget of N
measurements to spread over the M alternatives before deciding which is best. The
goal is to choose the alternative with the best underlying average performance.
Information collection problems of this type arise in a number of applications:
(i) Choosing the chemical compound from a library of existing test compounds
that has the greatest effectiveness against a particular disease. A compound’s
effectiveness may be measured by exposing cultured cells infected with the
disease to the compound and observing the result. The compound found most
effective will be developed into a drug for treating the disease.
(ii) Choosing the most efficient of several alternative assembly line configurations.
We may spend a limited amount of time testing different configurations,
but once we put one particular configuration into production, that choice will
remain in production for a period of several years.
(iii) Selecting the best of several policies applied to a stochastic Markov decision
process. The policies may be evaluated only through Monte Carlo simulation,
so a method of ranking and selection is needed to determine which policy is
best. This selection may be as part of a larger algorithm for finding the
optimal policy as in evolutionary policy iteration [3].
Received by the editors May 31, 2007; accepted for publication (in revised form) April 29, 2008;
published electronically September 8, 2008.
http://www.siam.org/journals/sicon/47-5/69342.html
Department of Operations Research & Financial Engineering, Princeton University, Princeton,
NJ 08544 (pfrazier@princeton.edu, powell@princeton.edu, sdayanik@princeton.edu). The second au-
thor’s research was partially supported by AFOSR contract FA9550-08-1-0195. The third author’s
research was partially supported by the Center for Dynamic Data Analysis for Homeland Security,
ONR Award N00014-07-1-0150.
In this article we study a measurement policy introduced in [16] under the name of the $(R_1, \ldots, R_1)$ policy, and referred to herein as the knowledge-gradient (KG) policy. We briefly describe this policy and leave further description for section 4.1. Let $\mu^n_x$ and $(\sigma^n_x)^2$ denote the mean and variance of the posterior predictive distribution for the unknown value of alternative $x$ after the first $n$ measurements. Then the KG policy is the policy that chooses its $(n+1)$st measurement $X^{KG}((\mu^n_1,\sigma^n_1),\ldots,(\mu^n_M,\sigma^n_M))$ from within $\{1,\ldots,M\}$ to maximize the single-period expected increase in value, $\mathbb{E}_n\left[(\max_x \mu^{n+1}_x) - (\max_x \mu^n_x)\right]$, where $\mathbb{E}_n$ indicates the conditional expectation with respect to what is known after the first $n$ measurements. That is,
$$X^{KG}((\mu^n_1,\sigma^n_1),\ldots,(\mu^n_M,\sigma^n_M)) \in \arg\max_{x^n \in \{1,\ldots,M\}} \mathbb{E}_n\left[(\max_x \mu^{n+1}_x) - (\max_x \mu^n_x)\right].$$
In this expression the expectation is implicitly a function of $x^n$, the measurement decision at time $n$. If the maximum is attained by more than one alternative, then we choose the one with the smallest index. As the terminal reward is given by $\max_{x=1,\ldots,M} \mu^N_x$, this policy is like a gradient ascent algorithm on a utility surface with domain parameterized by the state of knowledge $((\mu_1,\sigma_1),\ldots,(\mu_M,\sigma_M))$. It may also be viewed as a single-step Bayesian look-ahead policy.
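To make the policy concrete, the following sketch (our own Python illustration, not code from the paper) scores each alternative with the closed-form knowledge-gradient factor for independent normal beliefs, $\tilde\sigma_x f(-|\Delta_x|/\tilde\sigma_x)$ with $f(z) = z\Phi(z) + \phi(z)$; the paper derives this expression in section 4, which is not part of this excerpt, so the formula and all function names below should be read as assumptions of the sketch.

import numpy as np
from scipy.stats import norm

def kg_decision(mu, beta, beta_eps):
    """Return the alternative measured by the KG policy, given posterior means mu,
    posterior precisions beta, and measurement precision beta_eps (assumes M >= 2)."""
    mu = np.asarray(mu, dtype=float)
    beta = np.asarray(beta, dtype=float)
    # sigma_tilde_x^2 = Var_n[mu_x^{n+1}] = 1/beta_x^n - 1/(beta_x^n + beta_eps)
    sigma_tilde = np.sqrt(1.0 / beta - 1.0 / (beta + beta_eps))
    kg = np.empty(len(mu))
    for x in range(len(mu)):
        best_other = np.max(np.delete(mu, x))             # best of the other alternatives
        z = -abs(mu[x] - best_other) / sigma_tilde[x]
        kg[x] = sigma_tilde[x] * (z * norm.cdf(z) + norm.pdf(z))
    # np.argmax returns the smallest index among ties, matching the tie rule above.
    return int(np.argmax(kg))

For example, kg_decision([0.0, 0.2, 1.0], [1.0, 1.0, 1.0], beta_eps=1.0) returns the index of the alternative whose single measurement is expected to increase max_x mu_x the most.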
In this work we continue the analysis of [16]. We demonstrate that the KG policy,
introduced there as the most rudimentary of a collection of potential policies and
studied for its simplicity but neglected thereafter, is actually a powerful and efficient
tool for ranking and selection that should be considered for application alongside
current state-of-the-art policies. As discussed in detail in section 2, a number of other
sequential Bayesian look-ahead policies have been derived in recent years by solving
a sequence of single-stage optimization problems just as the KG policy does, and,
among these, the optimal computing budget allocation for linear loss of [18] and the
LL(S) policy of [12] assume situations most similar to the one assumed here. The
KG policy differs, however, from these other policies in that it solves its single-stage
problem exactly, while the other policies must use approximations. We believe that
solving the look-ahead problem exactly offers an advantage.
After formulating the problem in section 3 and defining the policy in section 4, we
show in section 5 that the KG policy is optimal in the limit as N → ∞ in the sense that
the policy incurs no opportunity cost in the limit as infinitely many measurements
are allowed. Also, by its construction and as noted in [16], KG is optimal when
there is only one measurement remaining. This provides optimality guarantees at
two extremes: N large and N small. While many policies are asymptotically optimal
without performing particularly well in the finite sample case, a policy with both
kinds of optimality satisfies a more stringent performance check. For example, the
equal-allocation policy is asymptotically optimal, but it is not optimal when N = 1,
except in certain special cases, and performs poorly overall. In the other extreme,
myopic policies for generic Markov decision processes often perform poorly because
they ignore long-term rewards. By being optimal for both N = 1 and N = ∞, KG
avoids the problem that most afflicts other myopic policies, while retaining single-
sample optimality.
In accordance with our belief that optimality at two extremes suggests good per-
formance in the region between, we provide a bound on the policy’s suboptimality for
finite N in section 6. In section 7 we introduce the KG persistence property and use
it to show both optimality for the case when M = 2 and for a further special case
in which the means and variances are ordered. Our proof that KG is optimal when

M = 2 confirms a claim made by Gupta and Miescke [15], who showed its optimality
among deterministic policies for M = 2, but did not offer a formal proof for optimality
among sequential policies. Finally, in section 8, we demonstrate in numerical exper-
iments that KG performs competitively against the other policies discussed here. In
particular, the KG policy is best according to the measure of average performance
across a number of randomly generated problems, and the margin by which it out-
performs the best competing policies on the most favorable problems is significantly
larger than the margin by which it is outperformed on the most unfavorable problems.
2. Literature review. The KG policy was introduced in [16] as the simplest
of a collection of look-ahead policies and was studied because its simplicity provided
tractability, but this simple policy has seldom been studied or applied in the years
since. Instead, a number of more complex Bayesian look-ahead policies have been
introduced. A series of researchers beginning with [4] and continuing with [5], [9], [7],
[8], [6] proposed and then refined a family of policies known as the optimal computing
budget allocation (OCBA). These policies are derived by formulating a static opti-
mization problem in which one chooses the measurements to maximize the probability
of later correctly selecting the best alternative. OCBA policies solve this optimization
problem by approximating the objective function with various bounds and relaxations,
and by assuming that the predictive mean will remain unchanged by measurement.
They then solve the approximate problem using gradient ascent or greedy heuristics,
or with an asymptotic solution that is exact in the limit as the number of measure-
ments in the second stage is large. All OCBA policies assume normal samples with
known sampling variance, but in practice one may estimate this variance through
sampling.
Any OCBA policy can be extended to multistage or fully sequential problems by
performing the second stage of the two-stage policy repeatedly, at each stage calling
all previous measurements the first stage and the set of measurements to be taken
next the second stage. It is in this extension that one sees the similarity to the one-
step Bayesian look-ahead approach of KG, which extends the one-stage policy that is optimal with one measurement remaining into a sequential policy by supposing at each point in time that the current measurement will be the last.
The OCBA policies mentioned above are designed to maximize the probability
of correctly selecting the best alternative, while KG is designed to maximize the
expected value of the chosen alternative. These different objective functions are also
termed 0–1 loss and linear loss, respectively. They are similar but not identical, 0–1
loss perhaps being more appropriate when knowledge of the identity of the best is
intrinsically valuable (and where accidentally choosing the second best is nearly as
harmful as choosing the worst), and linear loss being more appropriate when value is
obtained directly by implementing the chosen alternative.
Recently [18] introduced an OCBA policy designed to minimize expected linear
loss. Although more similar to KG than other OCBA policies, it differs in that it uses
the Bonferroni inequality to approximate the linear loss objective function for a single
stage, and then solves the approximate problem using a second approximation which
is accurate in the limit as the second stage is large. This is in contrast to KG, which
solves the single-stage problem exactly. The OCBA policy in [18] does not assume,
as the other OCBA approaches do, that the posterior predictive mean is equal to the
prior predictive mean, and in this regard it is more similar to the approach of [12]
discussed below.
A set of Bayesian look-ahead ranking and selection policies distinct from OCBA

was introduced in [12]. These policies differ by not assuming that the predictive means remain unchanged through time and by allowing the sampling variance to be unknown. This causes the posterior predictive mean to be Student-t distributed, inducing an optimization prob-
lem governing the second-stage allocation with an objective function that is somewhat
different from that in OCBA formulations. This objective function, corresponding to
expected loss, is bounded below, and this lower bound is then approximately min-
imized. The resulting solution minimizes the lower bound exactly in the limit as
sampling costs are small, or as the number of second-stage measurements is large.
Six policies are derived in total by considering both 0–1 and linear loss under
three different settings: two-stage measurements with a budget constraint; two-stage
without a budget constraint; and sequential. Among these policies, the one most
similar to KG is LL(S), which uses linear loss in a sequential setting, allocating τ
measurements at a time.
In [10] an unknown-variance version of the KG policy was developed under the name $LL_1$. The authors compared $LL_1$ to LL(S) using Monte Carlo simulations and found that $LL_1$ performed well for a small sampling budget, but degraded in performance as the sampling budget increased. We briefly discuss how these results relate to our own in section 8.
In addition to the Bayesian approaches to sequentially ranking and selecting nor-
mal populations described thus far, a substantial amount of progress has been made
using a frequentist approach. We do not review this literature in detail, but state
only that an overview may be found in [1] and that a more recent policy which per-
forms quite well in the multistage setting with normal rewards is given in [23], [22].
Other sequential and staged policies for independent normal rewards with frequentist
guarantees include those in [25], [27], [17], [26], and [24].
Sequential tests also exist which choose measurements based upon confidence bounds for the value $Y_x$. Such tests include interval estimation [19], which was developed for on-line bandit-style learning in a reinforcement learning setting, and upper confidence bound estimation [3], which was developed for estimating value functions for Markov decision processes. Both tests form frequentist confidence intervals for each $Y_x$ and then select the alternative with the largest upper bound on its confidence interval for measurement. Such policies have general applicability beyond the independent normal setting discussed here.
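As a generic illustration of this confidence-bound idea (our own sketch; the specific interval constructions of [19] and [3] differ and are not reproduced here):

import numpy as np

def confidence_bound_choice(sample_mean, sample_var, n_samples, z=1.96):
    """Measure the alternative whose confidence interval for Y_x has the largest
    upper bound. The interval width used here is a placeholder assumption."""
    upper = np.asarray(sample_mean) + z * np.sqrt(np.asarray(sample_var) / np.maximum(n_samples, 1))
    return int(np.argmax(upper))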
3. Problem formulation. We state a formal model for our problem, including
transition and objective functions. We then formulate the problem as a dynamic
program.
3.1. A formal model. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $\{1,\ldots,M\}$ be the set of alternatives. For each $x \in \{1,\ldots,M\}$ define a random variable $Y_x$ to be the true underlying value of alternative $x$. We assume a Bayesian setting for the problem in which we have a multivariate normal prior predictive distribution for the random vector $Y$, and we further assume that the components of $Y$ are independent under the prior and that $\max_{x=1,\ldots,M} |Y_x|$ is integrable. We will be allotted exactly $N$ measurements, and time will be indexed using $n$ with the first measurement decision made at time 0. At each time $0 \le n < N$, we choose an alternative $x^n$ to measure. Let $\varepsilon^{n+1}$ be the measurement error, which we assume is normally distributed with mean 0 and a finite known variance $(\sigma_\varepsilon)^2$ that is the same across all alternatives. We also assume that errors are independent of each other and of the random vector $Y$. Then define $\hat{y}^{n+1} = Y_{x^n} + \varepsilon^{n+1}$ to be the measurement value observed. At time $N$, we choose an implementation decision $x^N$ based on the measurements recorded, and we
receive an implementation reward $\hat{y}^{N+1}$. We assume that the reward is unbiased, so that $\hat{y}^{N+1}$ satisfies $\mathbb{E}\left[\hat{y}^{N+1} \mid Y, x^N\right] = Y_{x^N}$. Define the filtration $(\mathcal{F}^n)_{n=0}^{N}$ by letting $\mathcal{F}^n$ be the sigma-algebra generated by $x^0, \hat{y}^1, x^1, \ldots, x^{n-1}, \hat{y}^n$. We will use the notation $\mathbb{E}_n[\cdot]$ to indicate $\mathbb{E}[\,\cdot \mid \mathcal{F}^n]$, the conditional expectation taken with respect to $\mathcal{F}^n$. Measurement and implementation decisions $x^n$ are restricted to be $\mathcal{F}^n$-measurable so that decisions may depend only on measurements observed and decisions made in the past.
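As a concrete illustration of this sampling model (our own sketch under the assumptions just stated; the function names are ours, not the paper's):

import numpy as np

def sample_truth(mu0, sigma0, rng):
    """Draw the true values Y, with Y_x ~ N(mu0_x, sigma0_x^2) independently."""
    return rng.normal(mu0, sigma0)

def measure(Y, x, sigma_eps, rng):
    """Observe yhat^{n+1} = Y_x + eps^{n+1}, with eps^{n+1} ~ N(0, sigma_eps^2)."""
    return Y[x] + rng.normal(0.0, sigma_eps)

rng = np.random.default_rng(0)
Y = sample_truth(mu0=np.zeros(5), sigma0=np.ones(5), rng=rng)   # one problem instance, M = 5
yhat = measure(Y, x=2, sigma_eps=0.5, rng=rng)                  # one noisy measurement of alternative 2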
Let $\mu^0 := \mathbb{E}[Y]$ and $\Sigma^0 := \operatorname{Cov}[Y]$ be the mean and covariance of the predictive distribution for $Y$, so that $Y$ has prior predictive distribution $\mathcal{N}(\mu^0, \Sigma^0)$ and $\Sigma^0$ is a diagonal covariance matrix. Note that our assumed integrability of $\max_x |Y_x|$ is equivalent to assuming integrability of every $Y_x$, because $|Y_x| \le \max_x |Y_x|$ and $\max_x |Y_x| \le |Y_1| + \cdots + |Y_M|$; this in turn is equivalent to assuming $\Sigma^0_{xx}$ finite for every $x$.
We will use Bayes' rule to form a sequence of posterior predictive distributions for $Y$ from this prior and the successive measurements. Let $\mu^n := \mathbb{E}_n[Y]$ be the mean vector and $\Sigma^n := \operatorname{Cov}[Y \mid \mathcal{F}^n]$ the covariance matrix of the predictive distribution after $n$ measurements have been made. Because the error term $\varepsilon^{n+1}$ is independent and normally distributed, the predictive distribution for $Y$ will remain normal with independent components, and $\Sigma^n$ will be diagonal almost surely. We write $(\sigma^n_x)^2$ to refer to the diagonal component $\Sigma^n_{xx}$ of the covariance matrix. Then $Y_x \sim \mathcal{N}(\mu^n_x, (\sigma^n_x)^2)$ conditionally on $\mathcal{F}^n$. We will also write $\beta^n_x := (\sigma^n_x)^{-2}$ to refer to the precision of the predictive distribution for $Y_x$, $\beta^n := (\beta^n_1, \ldots, \beta^n_M)$ to refer to the vector of precisions, and $\beta^\varepsilon := (\sigma_\varepsilon)^{-2}$ to refer to the measurement precision. Note that $\sigma_\varepsilon < \infty$ implies $\beta^\varepsilon > 0$.
Our goal will be to choose the measurement policy $(x^0, \ldots, x^{N-1})$ and implementation decision $x^N$ that maximize $\mathbb{E}[Y_{x^N}]$. The implementation decision $x^N$ that maximizes $\mathbb{E}_N[Y_{x^N}] = \mu^N_{x^N}$ is any element of $\arg\max_x \mu^N_x$, and the value achieved is $\max_x \mu^N_x$. Thus, letting $\Pi$ be the set of measurement strategies $\pi = (x^0, \ldots, x^{N-1})$ adapted to the filtration, we may write our problem's objective function as
$$\sup_{\pi \in \Pi} \mathbb{E}^\pi\left[\max_x \mu^N_x\right]. \tag{1}$$
3.2. State space and transition function. Our state space is the space of all possible predictive distributions for $Y$. It can be shown by induction that these are all multivariate normal with independent components. We formally define the state space $\mathcal{S}$ by $\mathcal{S} := \mathbb{R}^M \times (0, \infty]^M$, and it consists of points $s = (\mu, \beta)$ where, for each $x \in \{1,\ldots,M\}$, $\mu_x$ and $\beta_x$ are, respectively, the mean and precision of a normal distribution. We will write $S^n := (\mu^n, \beta^n)$ to refer to the state at time $n$. The notation $S^n$ will refer to a random variable, while $s$ will refer to a fixed point in the state space.
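The paper goes on to formulate the problem as a dynamic program in a later subsection that is not part of this excerpt. As a sketch of what objective (1) implies under the definitions above, the terminal value and backward recursion would read (our own rendering; the paper's notation may differ):
$$V^N(s) := \max_{x \in \{1,\ldots,M\}} \mu_x, \qquad V^n(s) := \max_{x^n \in \{1,\ldots,M\}} \mathbb{E}\left[ V^{n+1}(S^{n+1}) \mid S^n = s,\, x^n \right], \quad 0 \le n < N.$$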
Fix a time $n$. We use Bayes' rule to update the predictive distribution of $Y_x$ conditioned on $\mathcal{F}^n$ to reflect the observation $\hat{y}^{n+1} = Y_{x^n} + \varepsilon^{n+1}$, obtaining a posterior predictive distribution conditioned on $\mathcal{F}^{n+1}$. Since $\varepsilon^{n+1}$ is an independent normal random variable and the family of normal distributions is closed under sampling, the posterior predictive distribution is also normal. Thus our posterior predictive distribution for $Y_x$ is $\mathcal{N}(\mu^{n+1}_x, 1/\beta^{n+1}_x)$, and writing it as a function of the prior and the observation reduces to writing $\mu^{n+1}$ and $\beta^{n+1}$ as functions of $\mu^n$, $\beta^n$, and $\hat{y}^{n+1}$. Bayes' rule tells us that these functions are
$$\mu^{n+1}_x = \begin{cases} \dfrac{\beta^n_x \mu^n_x + \beta^\varepsilon \hat{y}^{n+1}}{\beta^{n+1}_x} & \text{if } x^n = x, \\[4pt] \mu^n_x & \text{otherwise,} \end{cases} \tag{2}$$
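In code, the mean update (2), together with the conjugate-normal precision update $\beta^{n+1}_x = \beta^n_x + \beta^\varepsilon$ that its denominator presupposes (a standard result; the corresponding equation is not part of this excerpt), might be implemented as in the following sketch (our own illustration, not code from the paper):

import numpy as np

def bayes_update(mu, beta, x, yhat, beta_eps):
    """Return updated posterior means and precisions after observing yhat from
    measuring alternative x; inputs are copied rather than modified in place."""
    mu, beta = np.array(mu, dtype=float), np.array(beta, dtype=float)
    beta_new = beta[x] + beta_eps                               # precision update for the measured x
    mu[x] = (beta[x] * mu[x] + beta_eps * yhat) / beta_new      # equation (2)
    beta[x] = beta_new                                          # all other components are unchanged
    return mu, beta

Combined with the kg_decision and measure sketches above, a full run of the KG policy is a loop over n = 0, ..., N-1 that picks x = kg_decision(mu, beta, beta_eps), observes yhat = measure(Y, x, sigma_eps, rng), applies bayes_update, and finally reports max(mu) as the value of the implementation decision.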
