SIAM J. CONTROL OPTIM. © 2008 Society for Industrial and Applied Mathematics
Vol. 47, No. 5, pp. 2410–2439
A KNOWLEDGE-GRADIENT POLICY FOR SEQUENTIAL INFORMATION COLLECTION
PETER I. FRAZIER, WARREN B. POWELL, AND SAVAS DAYANIK
Abstract. In a sequential Bayesian ranking and selection problem with independent normal
populations and common known variance, we study a previously introduced measurement policy
which we refer to as the knowledge-gradient policy. This policy myopically maximizes the expected
increment in the value of information in each time period, where the value is measured according to
the terminal utility function. We show that the knowledge-gradient policy is optimal both when the
horizon is a single time period and in the limit as the horizon extends to infinity. We show furthermore
that, in some special cases, the knowledge-gradient policy is optimal regardless of the length of any
given fixed total sampling horizon. We bound the knowledge-gradient policy’s suboptimality in the
remaining cases, and show through simulations that it performs competitively with or significantly
better than other policies.
Key words. ranking and selection, Bayesian statistics, sequential decision analysis
AMS subject classifications. 62F07, 62F15, 62L05
DOI. 10.1137/070693424
1. Introduction. We consider a ranking and selection problem in which we are
faced with M ≥ 2 alternatives, each of which can be measured sequentially to estimate
its constant but unknown underlying average performance. The measurements are
noisy, and as we obtain more measurements, our estimates become more accurate.
We assume normally distributed measurement noise and independent normal Bayesian
priors for each alternative’s underlying average performance. We have a budget of N
measurements to spread over the M alternatives before deciding which is best. The
goal is to choose the alternative with the best underlying average performance.
Information collection problems of this type arise in a number of applications:
(i) Choosing the chemical compound from a library of existing test compounds
that has the greatest effectiveness against a particular disease. A compound’s
effectiveness may be measured by exposing cultured cells infected with the
disease to the compound and observing the result. The compound found most
effective will be developed into a drug for treating the disease.
(ii) Choosing the most efficient of several alternative assembly line configurations.
We may spend a limited amount of time testing different configurations,
but once we put one particular configuration into production, that choice will
remain in production for a period of several years.
(iii) Selecting the best of several policies applied to a stochastic Markov decision
process. The policies may be evaluated only through Monte Carlo simulation,
so a method of ranking and selection is needed to determine which policy is
best. This selection may be as part of a larger algorithm for finding the
optimal policy as in evolutionary policy iteration [3].
Received by the editors May 31, 2007; accepted for publication (in revised form) April 29, 2008;
published electronically September 8, 2008.
http://www.siam.org/journals/sicon/47-5/69342.html
Department of Operations Research & Financial Engineering, Princeton University, Princeton,
NJ 08544 (pfrazier@princeton.edu, powell@princeton.edu, sdayanik@princeton.edu). The second au-
thor’s research was partially supported by AFOSR contract FA9550-08-1-0195. The third author’s
research was partially supported by the Center for Dynamic Data Analysis for Homeland Security,
ONR Award N00014-07-1-0150.
In this article we study a measurement policy introduced in [16] under the name of the $(R_1, \ldots, R_1)$ policy, and referred to herein as the knowledge-gradient (KG) policy. We briefly describe this policy and leave further description for section 4.1. Let $\mu^n_x$ and $(\sigma^n_x)^2$ denote the mean and variance of the posterior predictive distribution for the unknown value of alternative $x$ after the first $n$ measurements. Then the KG policy is the policy that chooses its $(n+1)$st measurement $X^{KG}((\mu^n_1,\sigma^n_1),\ldots,(\mu^n_M,\sigma^n_M))$ from within $\{1,\ldots,M\}$ to maximize the single-period expected increase in value, $\mathbb{E}_n\left[(\max_x \mu^{n+1}_x) - (\max_x \mu^n_x)\right]$, where $\mathbb{E}_n$ indicates the conditional expectation with respect to what is known after the first $n$ measurements. That is,
$$X^{KG}((\mu^n_1,\sigma^n_1),\ldots,(\mu^n_M,\sigma^n_M)) \in \arg\max_{x^n \in \{1,\ldots,M\}} \mathbb{E}_n\left[(\max_x \mu^{n+1}_x) - (\max_x \mu^n_x)\right].$$
In this expression the expectation is implicitly a function of $x^n$, the measurement decision at time $n$. If the maximum is attained by more than one alternative, then we choose the one with the smallest index. As the terminal reward is given by $\max_{x=1,\ldots,M} \mu^N_x$, this policy is like a gradient ascent algorithm on a utility surface with domain parameterized by the state of knowledge $((\mu_1,\sigma_1),\ldots,(\mu_M,\sigma_M))$. It may also be viewed as a single-step Bayesian look-ahead policy.
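To make the policy concrete, the following sketch (our own Python illustration, not code from the paper) scores each alternative with the closed-form knowledge-gradient factor for independent normal beliefs, $\tilde\sigma_x f(-|\Delta_x|/\tilde\sigma_x)$ with $f(z) = z\Phi(z) + \phi(z)$; the paper derives this expression in section 4, which is not part of this excerpt, so the formula and all function names below should be read as assumptions of the sketch.

import numpy as np
from scipy.stats import norm

def kg_decision(mu, beta, beta_eps):
    """Return the alternative measured by the KG policy, given posterior means mu,
    posterior precisions beta, and measurement precision beta_eps (assumes M >= 2)."""
    mu = np.asarray(mu, dtype=float)
    beta = np.asarray(beta, dtype=float)
    # sigma_tilde_x^2 = Var_n[mu_x^{n+1}] = 1/beta_x^n - 1/(beta_x^n + beta_eps)
    sigma_tilde = np.sqrt(1.0 / beta - 1.0 / (beta + beta_eps))
    kg = np.empty(len(mu))
    for x in range(len(mu)):
        best_other = np.max(np.delete(mu, x))             # best of the other alternatives
        z = -abs(mu[x] - best_other) / sigma_tilde[x]
        kg[x] = sigma_tilde[x] * (z * norm.cdf(z) + norm.pdf(z))
    # np.argmax returns the smallest index among ties, matching the tie rule above.
    return int(np.argmax(kg))

For example, kg_decision([0.0, 0.2, 1.0], [1.0, 1.0, 1.0], beta_eps=1.0) returns the index of the alternative whose single measurement is expected to increase max_x mu_x the most.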
In this work we continue the analysis of [16]. We demonstrate that the KG policy,
introduced there as the most rudimentary of a collection of potential policies and
studied for its simplicity but neglected thereafter, is actually a powerful and efficient
tool for ranking and selection that should be considered for application alongside
current state-of-the-art policies. As discussed in detail in section 2, a number of other
sequential Bayesian look-ahead policies have been derived in recent years by solving
a sequence of single-stage optimization problems just as the KG policy does, and,
among these, the optimal computing budget allocation for linear loss of [18] and the
LL(S) policy of [12] assume situations most similar to the one assumed here. The
KG policy differs, however, from these other policies in that it solves its single-stage
problem exactly, while the other policies must use approximations. We believe that
solving the look-ahead problem exactly offers an advantage.
After formulating the problem in section 3 and defining the policy in section 4, we
show in section 5 that the KG policy is optimal in the limit as N → ∞ in the sense that
the policy incurs no opportunity cost in the limit as infinitely many measurements
are allowed. Also, by its construction and as noted in [16], KG is optimal when
there is only one measurement remaining. This provides optimality guarantees at
two extremes: N large and N small. While many policies are asymptotically optimal
without performing particularly well in the finite sample case, a policy with both
kinds of optimality satisfies a more stringent performance check. For example, the
equal-allocation policy is asymptotically optimal, but it is not optimal when N = 1,
except in certain special cases, and performs poorly overall. In the other extreme,
myopic policies for generic Markov decision processes often perform poorly because
they ignore long-term rewards. By being optimal for both N = 1 and N = ∞, KG
avoids the problem that most afflicts other myopic policies, while retaining single-
sample optimality.
In accordance with our belief that optimality at two extremes suggests good per-
formance in the region between, we provide a bound on the policy’s suboptimality for
finite N in section 6. In section 7 we introduce the KG persistence property and use
it to show both optimality for the case when M = 2 and for a further special case
in which the means and variances are ordered. Our proof that KG is optimal when

M = 2 confirms a claim made by Gupta and Miescke [15], who showed its optimality
among deterministic policies for M = 2, but did not offer a formal proof for optimality
among sequential policies. Finally, in section 8, we demonstrate in numerical exper-
iments that KG performs competitively against the other policies discussed here. In
particular, the KG policy is best according to the measure of average performance
across a number of randomly generated problems, and the margin by which it out-
performs the best competing policies on the most favorable problems is significantly
larger than the margin by which it is outperformed on the most unfavorable problems.
2. Literature review. The KG policy was introduced in [16] as the simplest
of a collection of look-ahead policies and was studied because its simplicity provided
tractability, but this simple policy has seldom been studied or applied in the years
since. Instead, a number of more complex Bayesian look-ahead policies have been
introduced. A series of researchers beginning with [4] and continuing with [5], [9], [7],
[8], [6] proposed and then refined a family of policies known as the optimal computing
budget allocation (OCBA). These policies are derived by formulating a static opti-
mization problem in which one chooses the measurements to maximize the probability
of later correctly selecting the best alternative. OCBA policies solve this optimization
problem by approximating the objective function with various bounds and relaxations,
and by assuming that the predictive mean will remain unchanged by measurement.
They then solve the approximate problem using gradient ascent or greedy heuristics,
or with an asymptotic solution that is exact in the limit as the number of measure-
ments in the second stage is large. All OCBA policies assume normal samples with
known sampling variance, but in practice one may estimate this variance through
sampling.
Any OCBA policy can be extended to multistage or fully sequential problems by
performing the second stage of the two-stage policy repeatedly, at each stage calling
all previous measurements the first stage and the set of measurements to be taken
next the second stage. It is in this extension that one sees the similarity to the one-
step Bayesian look-ahead approach of KG, which extends the one-stage policy that is optimal with one measurement remaining into a sequential policy by supposing at each point in time that the current measurement will be the last.
The OCBA policies mentioned above are designed to maximize the probability
of correctly selecting the best alternative, while KG is designed to maximize the
expected value of the chosen alternative. These different objective functions are also
termed 0–1 loss and linear loss, respectively. They are similar but not identical, 0–1
loss perhaps being more appropriate when knowledge of the identity of the best is
intrinsically valuable (and where accidentally choosing the second best is nearly as
harmful as choosing the worst), and linear loss being more appropriate when value is
obtained directly by implementing the chosen alternative.
Recently [18] introduced an OCBA policy designed to minimize expected linear
loss. Although more similar to KG than other OCBA policies, it differs in that it uses
the Bonferroni inequality to approximate the linear loss objective function for a single
stage, and then solves the approximate problem using a second approximation which
is accurate in the limit as the second stage is large. This is in contrast to KG, which
solves the single-stage problem exactly. The OCBA policy in [18] does not assume,
as the other OCBA approaches do, that the posterior predictive mean is equal to the
prior predictive mean, and in this regard it is more similar to the approach of [12]
discussed below.
A set of Bayesian look-ahead ranking and selection policies distinct from OCBA

was introduced in [12]. These policies differ by not assuming that the predictive means remain unchanged through time and by allowing the sampling variance to be unknown. This causes the posterior predictive mean to be Student-t distributed, inducing an optimization prob-
lem governing the second-stage allocation with an objective function that is somewhat
different from that in OCBA formulations. This objective function, corresponding to
expected loss, is bounded below, and this lower bound is then approximately min-
imized. The resulting solution minimizes the lower bound exactly in the limit as
sampling costs are small, or as the number of second-stage measurements is large.
Six policies are derived in total by considering both 0–1 and linear loss under
three different settings: two-stage measurements with a budget constraint; two-stage
without a budget constraint; and sequential. Among these policies, the one most
similar to KG is LL(S), which uses linear loss in a sequential setting, allocating τ
measurements at a time.
In [10] an unknown-variance version of the KG policy was developed under the name $LL_1$. The authors compared $LL_1$ to LL(S) using Monte Carlo simulations and found that $LL_1$ performed well for a small sampling budget, but degraded in performance as the sampling budget increased. We briefly discuss how these results relate to our own in section 8.
In addition to the Bayesian approaches to sequentially ranking and selecting nor-
mal populations described thus far, a substantial amount of progress has been made
using a frequentist approach. We do not review this literature in detail, but state
only that an overview may be found in [1] and that a more recent policy which per-
forms quite well in the multistage setting with normal rewards is given in [23], [22].
Other sequential and staged policies for independent normal rewards with frequentist
guarantees include those in [25], [27], [17], [26], and [24].
Sequential tests also exist which choose measurements based upon confidence bounds for the value $Y_x$. Such tests include interval estimation [19], which was developed for on-line bandit-style learning in a reinforcement learning setting, and upper confidence bound estimation [3], which was developed for estimating value functions for Markov decision processes. Both tests form frequentist confidence intervals for each $Y_x$ and then select the alternative with the largest upper bound on its confidence interval for measurement. Such policies have general applicability beyond the independent normal setting discussed here.
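As a generic illustration of this confidence-bound idea (our own sketch; the specific interval constructions of [19] and [3] differ and are not reproduced here):

import numpy as np

def confidence_bound_choice(sample_mean, sample_var, n_samples, z=1.96):
    """Measure the alternative whose confidence interval for Y_x has the largest
    upper bound. The interval width used here is a placeholder assumption."""
    upper = np.asarray(sample_mean) + z * np.sqrt(np.asarray(sample_var) / np.maximum(n_samples, 1))
    return int(np.argmax(upper))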
3. Problem formulation. We state a formal model for our problem, including
transition and objective functions. We then formulate the problem as a dynamic
program.
3.1. A formal model. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $\{1,\ldots,M\}$ be the set of alternatives. For each $x \in \{1,\ldots,M\}$ define a random variable $Y_x$ to be the true underlying value of alternative $x$. We assume a Bayesian setting for the problem in which we have a multivariate normal prior predictive distribution for the random vector $Y$, and we further assume that the components of $Y$ are independent under the prior and that $\max_{x=1,\ldots,M} |Y_x|$ is integrable. We will be allotted exactly $N$ measurements, and time will be indexed using $n$ with the first measurement decision made at time 0. At each time $0 \le n < N$, we choose an alternative $x^n$ to measure. Let $\varepsilon^{n+1}$ be the measurement error, which we assume is normally distributed with mean 0 and a finite known variance $(\sigma_\varepsilon)^2$ that is the same across all alternatives. We also assume that errors are independent of each other and of the random vector $Y$. Then define $\hat{y}^{n+1} = Y_{x^n} + \varepsilon^{n+1}$ to be the measurement value observed. At time $N$, we choose an implementation decision $x^N$ based on the measurements recorded, and we
receive an implementation reward $\hat{y}^{N+1}$. We assume that the reward is unbiased, so that $\hat{y}^{N+1}$ satisfies $\mathbb{E}\left[\hat{y}^{N+1} \mid Y, x^N\right] = Y_{x^N}$. Define the filtration $(\mathcal{F}^n)_{n=0}^{N}$ by letting $\mathcal{F}^n$ be the sigma-algebra generated by $x^0, \hat{y}^1, x^1, \ldots, x^{n-1}, \hat{y}^n$. We will use the notation $\mathbb{E}_n[\cdot]$ to indicate $\mathbb{E}[\,\cdot \mid \mathcal{F}^n]$, the conditional expectation taken with respect to $\mathcal{F}^n$. Measurement and implementation decisions $x^n$ are restricted to be $\mathcal{F}^n$-measurable so that decisions may depend only on measurements observed and decisions made in the past.
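As a concrete illustration of this sampling model (our own sketch under the assumptions just stated; the function names are ours, not the paper's):

import numpy as np

def sample_truth(mu0, sigma0, rng):
    """Draw the true values Y, with Y_x ~ N(mu0_x, sigma0_x^2) independently."""
    return rng.normal(mu0, sigma0)

def measure(Y, x, sigma_eps, rng):
    """Observe yhat^{n+1} = Y_x + eps^{n+1}, with eps^{n+1} ~ N(0, sigma_eps^2)."""
    return Y[x] + rng.normal(0.0, sigma_eps)

rng = np.random.default_rng(0)
Y = sample_truth(mu0=np.zeros(5), sigma0=np.ones(5), rng=rng)   # one problem instance, M = 5
yhat = measure(Y, x=2, sigma_eps=0.5, rng=rng)                  # one noisy measurement of alternative 2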
Let $\mu^0 := \mathbb{E}[Y]$ and $\Sigma^0 := \operatorname{Cov}[Y]$ be the mean and covariance of the predictive distribution for $Y$, so that $Y$ has prior predictive distribution $\mathcal{N}(\mu^0, \Sigma^0)$ and $\Sigma^0$ is a diagonal covariance matrix. Note that our assumed integrability of $\max_x |Y_x|$ is equivalent to assuming integrability of every $Y_x$, because $|Y_x| \le \max_x |Y_x|$ and $\max_x |Y_x| \le |Y_1| + \cdots + |Y_M|$; this in turn is equivalent to assuming $\Sigma^0_{xx}$ finite for every $x$.
We will use Bayes' rule to form a sequence of posterior predictive distributions for $Y$ from this prior and the successive measurements. Let $\mu^n := \mathbb{E}_n[Y]$ be the mean vector and $\Sigma^n := \operatorname{Cov}[Y \mid \mathcal{F}^n]$ the covariance matrix of the predictive distribution after $n$ measurements have been made. Because the error term $\varepsilon^{n+1}$ is independent and normally distributed, the predictive distribution for $Y$ will remain normal with independent components, and $\Sigma^n$ will be diagonal almost surely. We write $(\sigma^n_x)^2$ to refer to the diagonal component $\Sigma^n_{xx}$ of the covariance matrix. Then $Y_x \sim \mathcal{N}(\mu^n_x, (\sigma^n_x)^2)$ conditionally on $\mathcal{F}^n$. We will also write $\beta^n_x := (\sigma^n_x)^{-2}$ to refer to the precision of the predictive distribution for $Y_x$, $\beta^n := (\beta^n_1, \ldots, \beta^n_M)$ to refer to the vector of precisions, and $\beta^\varepsilon := (\sigma_\varepsilon)^{-2}$ to refer to the measurement precision. Note that $\sigma_\varepsilon < \infty$ implies $\beta^\varepsilon > 0$.
Our goal will be to choose the measurement policy $(x^0, \ldots, x^{N-1})$ and implementation decision $x^N$ that maximize $\mathbb{E}[Y_{x^N}]$. The implementation decision $x^N$ that maximizes $\mathbb{E}_N[Y_{x^N}] = \mu^N_{x^N}$ is any element of $\arg\max_x \mu^N_x$, and the value achieved is $\max_x \mu^N_x$. Thus, letting $\Pi$ be the set of measurement strategies $\pi = (x^0, \ldots, x^{N-1})$ adapted to the filtration, we may write our problem's objective function as
$$\sup_{\pi \in \Pi} \mathbb{E}^\pi\left[\max_x \mu^N_x\right]. \tag{1}$$
3.2. State space and transition function. Our state space is the space of all possible predictive distributions for $Y$. It can be shown by induction that these are all multivariate normal with independent components. We formally define the state space $\mathcal{S}$ by $\mathcal{S} := \mathbb{R}^M \times (0, \infty]^M$, and it consists of points $s = (\mu, \beta)$ where, for each $x \in \{1,\ldots,M\}$, $\mu_x$ and $\beta_x$ are, respectively, the mean and precision of a normal distribution. We will write $S^n := (\mu^n, \beta^n)$ to refer to the state at time $n$. The notation $S^n$ will refer to a random variable, while $s$ will refer to a fixed point in the state space.
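The paper goes on to formulate the problem as a dynamic program in a later subsection that is not part of this excerpt. As a sketch of what objective (1) implies under the definitions above, the terminal value and backward recursion would read (our own rendering; the paper's notation may differ):
$$V^N(s) := \max_{x \in \{1,\ldots,M\}} \mu_x, \qquad V^n(s) := \max_{x^n \in \{1,\ldots,M\}} \mathbb{E}\left[ V^{n+1}(S^{n+1}) \mid S^n = s,\, x^n \right], \quad 0 \le n < N.$$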
Fix a time $n$. We use Bayes' rule to update the predictive distribution of $Y_x$ conditioned on $\mathcal{F}^n$ to reflect the observation $\hat{y}^{n+1} = Y_{x^n} + \varepsilon^{n+1}$, obtaining a posterior predictive distribution conditioned on $\mathcal{F}^{n+1}$. Since $\varepsilon^{n+1}$ is an independent normal random variable and the family of normal distributions is closed under sampling, the posterior predictive distribution is also normal. Thus our posterior predictive distribution for $Y_x$ is $\mathcal{N}(\mu^{n+1}_x, 1/\beta^{n+1}_x)$, and writing it as a function of the prior and the observation reduces to writing $\mu^{n+1}$ and $\beta^{n+1}$ as functions of $\mu^n$, $\beta^n$, and $\hat{y}^{n+1}$. Bayes' rule tells us that these functions are
$$\mu^{n+1}_x = \begin{cases} \dfrac{\beta^n_x \mu^n_x + \beta^\varepsilon \hat{y}^{n+1}}{\beta^{n+1}_x} & \text{if } x^n = x, \\[4pt] \mu^n_x & \text{otherwise,} \end{cases} \tag{2}$$
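In code, the mean update (2), together with the conjugate-normal precision update $\beta^{n+1}_x = \beta^n_x + \beta^\varepsilon$ that its denominator presupposes (a standard result; the corresponding equation is not part of this excerpt), might be implemented as in the following sketch (our own illustration, not code from the paper):

import numpy as np

def bayes_update(mu, beta, x, yhat, beta_eps):
    """Return updated posterior means and precisions after observing yhat from
    measuring alternative x; inputs are copied rather than modified in place."""
    mu, beta = np.array(mu, dtype=float), np.array(beta, dtype=float)
    beta_new = beta[x] + beta_eps                               # precision update for the measured x
    mu[x] = (beta[x] * mu[x] + beta_eps * yhat) / beta_new      # equation (2)
    beta[x] = beta_new                                          # all other components are unchanged
    return mu, beta

Combined with the kg_decision and measure sketches above, a full run of the KG policy is a loop over n = 0, ..., N-1 that picks x = kg_decision(mu, beta, beta_eps), observes yhat = measure(Y, x, sigma_eps, rng), applies bayes_update, and finally reports max(mu) as the value of the implementation decision.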
