Cross-Entropy Method
Dirk P. Kroese
School of Mathematics and Physics
The University of Queensland
Brisbane 4072, Australia
kroese@maths.uq.edu.au
Abstract: The cross-entropy method is a recent versatile Monte Carlo technique.
This article provides a brief introduction to the cross-entropy method and discusses
how it can be used for rare-event probability estimation and for solving combinatorial,
continuous, constrained and noisy optimization problems. A comprehensive list of
references on cross-entropy methods and applications is included.
Keywords: cross-entropy, Kullback-Leibler divergence, rare events, importance sampling, stochastic search.
The cross-entropy (CE) method is a recent generic Monte Carlo technique
for solving complicated simulation and optimization problems. The approach
was introduced by R.Y. Rubinstein in [41, 42], extending his earlier work on
variance minimization methods for rare-event probability estimation [40].
The CE method can be applied to two types of problem:
1. Estimation: Estimate $\ell = \mathbb{E}[H(X)]$, where $X$ is a random variable or vector taking values in some set $\mathcal{X}$ and $H$ is a function on $\mathcal{X}$. An important special case is the estimation of a probability $\ell = \mathbb{P}(S(X) \geq \gamma)$, where $S$ is another function on $\mathcal{X}$.
2. Optimization: Optimize (that is, maximize or minimize) $S(x)$ over all $x \in \mathcal{X}$, where $S$ is some objective function on $\mathcal{X}$. $S$ can be either a known or a noisy function. In the latter case the objective function needs to be estimated, e.g., via simulation.
In the estimation setting, the CE method can be viewed as an adaptive importance sampling procedure that uses the cross-entropy or Kullback-Leibler divergence as a measure of closeness between two sampling distributions, as is explained further in Section 1. In the optimization setting, the optimization problem is first translated into a rare-event estimation problem, and then the CE method for estimation is used as an adaptive algorithm to locate the optimum, as is explained further in Section 2.
An easy tutorial on the CE method is given in [15]. A more comprehensive
treatment can be found in [45]; see also [46, Chapter 8]. The CE method
homepage can be found at www.cemethod.org.
The CE method has been successfully applied to a diverse range of estimation and optimization problems, including buffer allocation [1], queueing models of telecommunication systems [14, 16], optimal control of HIV/AIDS spread [48, 49], signal detection [30], combinatorial auctions [9], DNA sequence alignment [24, 38], scheduling and vehicle routing [3, 8, 11, 20, 23, 53], neural and reinforcement learning [31, 32, 34, 52, 54], project management [12], rare-event simulation with light- and heavy-tail distributions [2, 10, 21, 28], and clustering analysis [4, 5, 29]. Applications to classical combinatorial optimization problems including the max-cut, traveling salesman, and Hamiltonian cycle problems are given in [7, 17, 42, 43, 44]. Various CE estimation and noisy optimization problems for reliability systems and network design can be found in [6, 22, 25, 26, 35, 36, 37, 39]. Parallel implementations of the CE method are discussed in [18, 19], and recent generalizations and advances are explored in [51].
1 The Cross-Entropy Method for Estimation
Consider the estimation of
$$\ell = \mathbb{E}_f[H(X)] = \int H(x)\, f(x)\, dx\,, \qquad (1)$$
where $H$ is the sample performance function and $f$ is the probability density of the random variable (vector) $X$. (For notational convenience it is assumed that $X$ is a continuous random variable; if $X$ is a discrete random variable, simply replace the integral in (1) by a sum.) Let $g$ be another probability density such that for all $x$, $g(x) = 0$ implies that $H(x)\, f(x) = 0$. Using the probability density $g$, we can represent $\ell$ as
$$\ell = \int H(x)\, \frac{f(x)}{g(x)}\, g(x)\, dx = \mathbb{E}_g\!\left[ H(X)\, \frac{f(X)}{g(X)} \right]. \qquad (2)$$
Consequently, if $X_1, \ldots, X_N$ are independent random vectors, each with probability density $g$, then
$$\hat{\ell} = \frac{1}{N} \sum_{k=1}^{N} H(X_k)\, \frac{f(X_k)}{g(X_k)} \qquad (3)$$
is an unbiased estimator of $\ell$. Such an estimator is called an importance sampling estimator. The optimal importance sampling probability density is given by $g^*(x) \propto |H(x)|\, f(x)$ (see, e.g., [46, page 132]), which in general is difficult to obtain. The idea of the CE method is to choose the importance sampling density $g$ in a specified class of densities such that the cross-entropy or Kullback-Leibler divergence between the optimal importance sampling density $g^*$ and $g$ is minimal. The Kullback-Leibler divergence between two probability densities $g$ and $h$ is given by
$$\mathcal{D}(g, h) = \mathbb{E}_g\!\left[ \ln \frac{g(X)}{h(X)} \right] = \int g(x) \ln \frac{g(x)}{h(x)}\, dx = \int g(x) \ln g(x)\, dx - \int g(x) \ln h(x)\, dx\,. \qquad (4)$$
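As a concrete illustration (not part of the article: the tail-probability target, the shifted-normal proposal, and all numerical values are assumptions for the example), the following Python sketch compares crude Monte Carlo with the importance sampling estimator (3) for a small normal tail probability:

```python
import math

import numpy as np

rng = np.random.default_rng(0)

# Target: the rare-event probability ell = P(X > 4) for X ~ N(0, 1),
# i.e. H(x) = 1{x > 4} and f the standard normal density.
gamma, N = 4.0, 100_000

def normal_pdf(x, mu=0.0):
    # N(mu, 1) density
    return np.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

# Crude Monte Carlo: with N = 1e5 samples only a handful hit {x > 4}.
crude = np.mean(rng.normal(0.0, 1.0, N) > gamma)

# Importance sampling estimator (3) with proposal g = N(4, 1),
# which puts most of its mass on the rare event.
x = rng.normal(gamma, 1.0, N)
ell_hat = np.mean((x > gamma) * normal_pdf(x) / normal_pdf(x, mu=gamma))

exact = 0.5 * math.erfc(gamma / math.sqrt(2))  # P(X > 4) ~ 3.17e-5
```

With the proposal centered on the rare region, almost every sample contributes to the sum in (3), which is why the importance sampling estimate is accurate at a sample size where crude Monte Carlo sees only a handful of hits.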
In most cases of interest the sample performance function $H$ is non-negative, and the "nominal" probability density $f$ is parameterized by a finite-dimensional vector $u$; that is, $f(x) = f(x; u)$. It is then customary to choose the importance sampling probability density $g$ in the same family of probability densities; thus, $g(x) = f(x; v)$ for some reference parameter $v$. The CE minimization procedure then reduces to finding an optimal reference parameter vector, $v^*$ say, by cross-entropy minimization. This $v^*$ turns out to be the solution to the maximization problem $\max_v \int H(x)\, f(x; u) \ln f(x; v)\, dx$, which in turn can be estimated via simulation by solving, with respect to $v$, the stochastic counterpart program
$$\max_v \frac{1}{N} \sum_{k=1}^{N} H(X_k)\, \frac{f(X_k; u)}{f(X_k; w)}\, \ln f(X_k; v)\,, \qquad (5)$$
where $X_1, \ldots, X_N$ is a random sample from $f(\cdot\,; w)$, for any reference parameter $w$. The maximization (5) can often be solved analytically, in particular when the class of sampling distributions forms an exponential family; see, for example, [46, pages 319–320]. Indeed, analytical updating formulas can be found whenever explicit expressions for the maximum likelihood estimators of the parameters can be found, cf. [15, page 36].
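For instance, for the one-parameter exponential family $f(x; v) = v^{-1} e^{-x/v}$, setting the derivative of (5) with respect to $v$ to zero gives the closed-form update $\hat{v} = \sum_k H_k W_k X_k \big/ \sum_k H_k W_k$, with $W_k = f(X_k; u)/f(X_k; w)$. A minimal Python sketch (the parameter values below are illustrative assumptions, not from the article):

```python
import numpy as np

rng = np.random.default_rng(1)

# Closed-form solution of (5) for the family f(x; v) = exp(-x/v)/v
# (exponential with mean v).  Illustrative setup: nominal parameter
# u = 1, sampling parameter w = 2, and H(x) = 1{x > 3}.
u, w, gamma, N = 1.0, 2.0, 3.0, 100_000

x = rng.exponential(w, N)                 # sample from f(.; w)
H = (x > gamma).astype(float)             # indicator performance function
W = (w / u) * np.exp(-x / u + x / w)      # likelihood ratio f(x; u)/f(x; w)

# Maximizing (5): d/dv sum_k H_k W_k (-ln v - X_k/v) = 0 gives a
# weighted sample mean over the samples with H_k = 1.
v_hat = np.sum(H * W * x) / np.sum(H * W)

# Under f(.; u) = Exp(1), E[X | X > 3] = 4 by memorylessness, so
# v_hat should be close to the CE-optimal parameter v* = 4.
```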
Often $\ell = \mathbb{P}(S(X) \geq \gamma)$, for some performance function $S$ and level $\gamma$, in which case $H(x)$ takes the form of an indicator function: $H(x) = I_{\{S(x) \geq \gamma\}}$; that is, $H(x) = 1$ if $S(x) \geq \gamma$, and $0$ otherwise. A complication in solving (5) occurs when $\ell$ is a rare-event probability; that is, a very small probability (say less than $10^{-5}$). Then, for moderate sample size $N$ most or all of the values $H(X_k)$ in (5) are zero, and the maximization problem becomes useless. In that case a multi-level CE procedure is used, where a sequence of reference parameters and levels is constructed with the goal that the first converges to $v^*$ and the second to $\gamma$. This leads to the following algorithm; see, e.g., [46, page 238].
Algorithm 1.1 (CE Algorithm for Rare-Event Estimation)

1. Define $\hat{v}_0 = u$. Let $N^{\mathrm{e}} = \lceil \varrho N \rceil$. Set $t = 1$ (iteration counter).

2. Generate a random sample $X_1, \ldots, X_N$ according to the probability density $f(\cdot\,; \hat{v}_{t-1})$. Calculate the performances $S(X_i)$ for all $i$, and order them from smallest to largest, $S_{(1)} \leq \ldots \leq S_{(N)}$. Let $\hat{\gamma}_t$ be the sample $(1 - \varrho)$-quantile of performances; that is, $\hat{\gamma}_t = S_{(N - N^{\mathrm{e}} + 1)}$. If $\hat{\gamma}_t > \gamma$, reset $\hat{\gamma}_t$ to $\gamma$.

3. Use the same sample $X_1, \ldots, X_N$ to solve the stochastic program (5), with $w = \hat{v}_{t-1}$. Denote the solution by $\hat{v}_t$.

4. If $\hat{\gamma}_t < \gamma$, set $t = t + 1$ and reiterate from Step 2; otherwise, proceed with Step 5.

5. Let $T$ be the final iteration counter. Generate a sample $X_1, \ldots, X_{N_1}$ according to the probability density $f(\cdot\,; \hat{v}_T)$ and estimate $\ell$ via importance sampling, as in (3).
Apart from specifying the family of sampling probability densities, the initial vector $\hat{v}_0$, the sample size $N$ and the rarity parameter $\varrho$ (typically between 0.01 and 0.1), the algorithm is completely self-tuning. The sample size $N$ for determining a good reference parameter can usually be chosen much smaller than the sample size $N_1$ for the final importance sampling estimation, say $N = 1000$ versus $N_1 = 100{,}000$. Under certain technical conditions the deterministic version of Algorithm 1.1 is guaranteed to terminate (reach level $\gamma$) provided that $\varrho$ is chosen small enough; see Section 3.5 of [45].
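As a sketch of the full procedure (everything here is an illustrative assumption: the sum-of-exponentials model, the parameter values, and the closed-form exponential-family solution of (5)), Algorithm 1.1 might be implemented as:

```python
import math

import numpy as np

rng = np.random.default_rng(2)

# Illustrative problem: ell = P(X1 + ... + Xn >= gamma) with independent
# Xi ~ Exp(mean u).  For n = 5, u = 1, gamma = 20 the exact value is
# exp(-20) * sum_{k<5} 20^k / k! ~ 1.7e-5, small enough to be "rare".
n, u, gamma = 5, 1.0, 20.0
N, N1, rho = 1_000, 100_000, 0.1
Ne = math.ceil(rho * N)

def log_f(x, v):
    # log-density of independent Exp(mean v_i) components, per sample row
    return np.sum(-x / v - np.log(v), axis=1)

u_vec = np.full(n, u)
v = u_vec.copy()                               # Step 1: v_hat_0 = u
for t in range(1, 100):                        # Steps 2-4 (capped for safety)
    x = rng.exponential(v, size=(N, n))
    s = x.sum(axis=1)
    gamma_t = min(np.sort(s)[N - Ne], gamma)   # (1 - rho)-quantile, capped at gamma
    W = np.exp(log_f(x, u_vec) - log_f(x, v))  # f(.; u) / f(.; v_hat_{t-1})
    elite = s >= gamma_t
    we = W[elite]
    # Analytic solution of (5) for the exponential family: weighted means
    v = (we[:, None] * x[elite]).sum(axis=0) / we.sum()
    if gamma_t >= gamma:
        break

# Step 5: final importance sampling estimate, as in (3)
x = rng.exponential(v, size=(N1, n))
W = np.exp(log_f(x, u_vec) - log_f(x, v))
ell_hat = np.mean((x.sum(axis=1) >= gamma) * W)
```

Starting from $\hat{v}_0 = u$, the levels $\hat{\gamma}_t$ increase until $\gamma$ is reached, typically within a handful of iterations for this example, after which the final tilted density is used for the importance sampling estimate.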
2 The Cross-Entropy Method for Optimization
Let $\mathcal{X}$ be an arbitrary set of states and let $S$ be a real-valued performance function on $\mathcal{X}$. Suppose we wish to find the maximum of $S$ over $\mathcal{X}$, and the corresponding state $x^*$ at which this maximum is attained (assuming for simplicity that there is only one such state). Denoting the maximum by $\gamma^*$, we thus have
$$S(x^*) = \gamma^* = \max_{x \in \mathcal{X}} S(x)\,. \qquad (6)$$
This setting includes many types of optimization problems: discrete (combinatorial), continuous, mixed, and constrained problems. Moreover, if one is interested in minimizing rather than maximizing $S$, one can simply maximize $-S$.

Now associate with the above problem the estimation of the probability $\ell = \mathbb{P}(S(X) \geq \gamma)$, where $X$ has some probability density $f(x; u)$ on $\mathcal{X}$ (for example corresponding to the uniform distribution on $\mathcal{X}$) and $\gamma$ is some level. If $\gamma$ is chosen close to the unknown $\gamma^*$, then $\ell$ is, typically, a rare-event probability, and the CE approach of Section 1 can be used to find an importance sampling distribution close to the theoretically optimal importance sampling density, which concentrates all its mass on the point $x^*$. Sampling from such a distribution thus produces optimal or near-optimal states. A main difference with the CE method for rare-event simulation is that in the optimization setting the final level $\gamma = \gamma^*$ is not known in advance. The CE method for optimization produces a sequence of levels $\{\hat{\gamma}_t\}$ and reference parameters $\{\hat{v}_t\}$ such that the former tends to the optimal $\gamma^*$ and the latter to the optimal reference vector $v^*$ corresponding to the point mass at $x^*$; see, e.g., [46, page 251].
Algorithm 2.1 (CE Algorithm for Optimization)

1. Choose an initial parameter vector $\hat{v}_0$. Let $N^{\mathrm{e}} = \lceil \varrho N \rceil$. Set $t = 1$ (level counter).

2. Generate a sample $X_1, \ldots, X_N$ from the probability density $f(\cdot\,; \hat{v}_{t-1})$. Calculate the performances $S(X_i)$ for all $i$, and order them from smallest to largest, $S_{(1)} \leq \ldots \leq S_{(N)}$. Let $\hat{\gamma}_t$ be the sample $(1 - \varrho)$-quantile of performances; that is, $\hat{\gamma}_t = S_{(N - N^{\mathrm{e}} + 1)}$.

3. Use the same sample $X_1, \ldots, X_N$ and solve the stochastic program
$$\max_v \frac{1}{N} \sum_{k=1}^{N} I_{\{S(X_k) \geq \hat{\gamma}_t\}} \ln f(X_k; v)\,. \qquad (7)$$
Denote the solution by $\hat{v}_t$.

4. If the stopping criterion is met, stop; otherwise, set $t = t + 1$, and return to Step 2.
To run the algorithm, one needs to provide the class of sampling probability densities, the initial vector $\hat{v}_0$, the sample size $N$, the rarity parameter $\varrho$, and the stopping criterion. Any CE algorithm for optimization thus involves the following two main iterative phases:

1. Generate a random sample of objects in the search space $\mathcal{X}$ (trajectories, vectors, etc.) according to a specified probability distribution.

2. Update the parameters of that distribution, based on the $N^{\mathrm{e}}$ best-performing samples (the so-called elite samples), using cross-entropy minimization.
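These two phases can be sketched in Python for a one-dimensional normal sampling family, where the solution of (7) is simply the sample mean and standard deviation of the elite samples (the two-peak objective and all parameter values below are illustrative assumptions; the update uses the smoothed rule (8) discussed below):

```python
import numpy as np

rng = np.random.default_rng(3)

def S(x):
    # Two-peak objective: global maximum S(2) ~ 1, local maximum S(-2) ~ 0.8
    return np.exp(-(x - 2) ** 2) + 0.8 * np.exp(-(x + 2) ** 2)

# Normal sampling family f(.; v) with v = (mu, sigma); the solution of (7)
# is then the mean and standard deviation of the elite samples.
N, Ne, alpha = 200, 20, 0.7            # sample size, elite size, smoothing
mu, sigma = 0.0, 3.0                   # initial v_hat_0, wide enough to see both peaks
for t in range(50):
    x = rng.normal(mu, sigma, N)
    elite = x[np.argsort(S(x))[-Ne:]]  # the Ne best-performing samples
    # Smoothed update, cf. (8): blend elite statistics with v_hat_{t-1}
    mu = alpha * elite.mean() + (1 - alpha) * mu
    sigma = alpha * elite.std() + (1 - alpha) * sigma
    if sigma < 1e-4:                   # stop once sampling has degenerated
        break

# mu should now be close to the global maximizer x* = 2
```

During the iterations the mean drifts toward the global maximizer while the standard deviation shrinks toward zero, so the stopping criterion "all standard deviations small" is natural here.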
Apart from the fact that Step 5 in Algorithm 1.1 is missing in Algorithm 2.1, another main difference between the two algorithms is that the likelihood ratio term $f(X_k; u)/f(X_k; \hat{v}_{t-1})$ in (5) is missing in (7).
Often a smoothed updating rule is used, in which the parameter vector $\hat{v}_t$ is taken as
$$\hat{v}_t = \alpha\, \tilde{v}_t + (1 - \alpha)\, \hat{v}_{t-1}\,, \qquad (8)$$
where $\tilde{v}_t$ is the solution to (7) and $0 \leq \alpha \leq 1$ is a smoothing parameter. Many other modifications can be found in [27, 45, 46] and in the list of references. When there are two or more optimal solutions the CE algorithm typically "fluctuates" between the solutions before focusing in on one of them. The effect that smoothing has on convergence is discussed in detail in [13]. In particular, it is shown that with appropriate smoothing the CE method converges and finds the optimal solution with probability arbitrarily close to 1. Necessary conditions and sufficient conditions under which the optimal solution is generated eventually with probability 1 are also given. Other convergence results,