Cross-Entropy Method
Dirk P. Kroese
School of Mathematics and Physics
The University of Queensland
Brisbane 4072, Australia
kroese@maths.uq.edu.au
Abstract: The cross-entropy method is a recent versatile Monte Carlo technique.
This article provides a brief introduction to the cross-entropy method and discusses
how it can be used for rare-event probability estimation and for solving combinatorial,
continuous, constrained and noisy optimization problems. A comprehensive list of
references on cross-entropy methods and applications is included.
Keywords: cross-entropy, Kullback-Leibler divergence, rare events, importance sampling, stochastic search.
The cross-entropy (CE) method is a recent generic Monte Carlo technique
for solving complicated simulation and optimization problems. The approach
was introduced by R.Y. Rubinstein in [41, 42], extending his earlier work on
variance minimization methods for rare-event probability estimation [40].
The CE method can be applied to two types of problem:
1. Estimation: Estimate $\ell = \mathbb{E}[H(X)]$, where $X$ is a random variable or vector taking values in some set $\mathcal{X}$ and $H$ is a function on $\mathcal{X}$. An important special case is the estimation of a probability $\ell = \mathbb{P}(S(X) \geq \gamma)$, where $S$ is another function on $\mathcal{X}$.
2. Optimization: Optimize (that is, maximize or minimize) $S(x)$ over all $x \in \mathcal{X}$, where $S$ is some objective function on $\mathcal{X}$. $S$ can be either a known or a noisy function. In the latter case the objective function needs to be estimated, e.g., via simulation.
In the estimation setting, the CE method can be viewed as an adaptive importance sampling procedure that uses the cross-entropy or Kullback-Leibler divergence as a measure of closeness between two sampling distributions, as is explained further in Section 1. In the optimization setting, the optimization problem is first translated into a rare-event estimation problem, and then the CE method for estimation is used as an adaptive algorithm to locate the optimum, as is explained further in Section 2.
An easy tutorial on the CE method is given in [15]. A more comprehensive
treatment can be found in [45]; see also [46, Chapter 8]. The CE method
homepage can be found at www.cemethod.org.
The CE method has been successfully applied to a diverse range of estimation and optimization problems, including buffer allocation [1], queueing models of telecommunication systems [14, 16], optimal control of HIV/AIDS spread [48, 49], signal detection [30], combinatorial auctions [9], DNA sequence alignment [24, 38], scheduling and vehicle routing [3, 8, 11, 20, 23, 53], neural and reinforcement learning [31, 32, 34, 52, 54], project management [12], rare-event simulation with light- and heavy-tail distributions [2, 10, 21, 28], and clustering analysis [4, 5, 29]. Applications to classical combinatorial optimization problems including the max-cut, traveling salesman, and Hamiltonian cycle problems are given in [7, 17, 42, 43, 44]. Various CE estimation and noisy optimization problems for reliability systems and network design can be found in [6, 22, 25, 26, 35, 36, 37, 39]. Parallel implementations of the CE method are discussed in [18, 19], and recent generalizations and advances are explored in [51].
1 The Cross-Entropy Method for Estimation
Consider the estimation of
$$\ell = \mathbb{E}_f[H(X)] = \int H(x)\, f(x)\, dx\,, \qquad (1)$$
where $H$ is the sample performance function and $f$ is the probability density of the random variable (vector) $X$. (For notational convenience it is assumed that $X$ is a continuous random variable; if $X$ is a discrete random variable, simply replace the integral in (1) by a sum.) Let $g$ be another probability density such that for all $x$, $g(x) = 0$ implies that $H(x)\, f(x) = 0$. Using the probability density $g$, we can represent $\ell$ as
$$\ell = \int H(x)\, \frac{f(x)}{g(x)}\, g(x)\, dx = \mathbb{E}_g\!\left[ H(X)\, \frac{f(X)}{g(X)} \right]. \qquad (2)$$
Consequently, if $X_1, \ldots, X_N$ are independent random vectors, each with probability density $g$, then
$$\hat{\ell} = \frac{1}{N} \sum_{k=1}^{N} H(X_k)\, \frac{f(X_k)}{g(X_k)} \qquad (3)$$
is an unbiased estimator of $\ell$. Such an estimator is called an importance sampling estimator. The optimal importance sampling probability density is given by $g^*(x) \propto |H(x)|\, f(x)$ (see, e.g., [46, page 132]), which in general is difficult to obtain. The idea of the CE method is to choose the importance sampling density $g$ in a specified class of densities such that the cross-entropy or Kullback-Leibler divergence between the optimal importance sampling density $g^*$ and $g$ is minimal. The Kullback-Leibler divergence between two probability densities $g$ and $h$ is given by
$$\mathcal{D}(g, h) = \mathbb{E}_g\!\left[ \ln \frac{g(X)}{h(X)} \right] = \int g(x) \ln \frac{g(x)}{h(x)}\, dx = \int g(x) \ln g(x)\, dx - \int g(x) \ln h(x)\, dx\,. \qquad (4)$$
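As a concrete illustration (not part of the article: the tail-probability target, the shifted-normal proposal, and all numerical values are assumptions for the example), the following Python sketch compares crude Monte Carlo with the importance sampling estimator (3) for a small normal tail probability:

```python
import math

import numpy as np

rng = np.random.default_rng(0)

# Target: the rare-event probability ell = P(X > 4) for X ~ N(0, 1),
# i.e. H(x) = 1{x > 4} and f the standard normal density.
gamma, N = 4.0, 100_000

def normal_pdf(x, mu=0.0):
    # N(mu, 1) density
    return np.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

# Crude Monte Carlo: with N = 1e5 samples only a handful hit {x > 4}.
crude = np.mean(rng.normal(0.0, 1.0, N) > gamma)

# Importance sampling estimator (3) with proposal g = N(4, 1),
# which puts most of its mass on the rare event.
x = rng.normal(gamma, 1.0, N)
ell_hat = np.mean((x > gamma) * normal_pdf(x) / normal_pdf(x, mu=gamma))

exact = 0.5 * math.erfc(gamma / math.sqrt(2))  # P(X > 4) ~ 3.17e-5
```

With the proposal centered on the rare region, almost every sample contributes to the sum in (3), which is why the importance sampling estimate is accurate at a sample size where crude Monte Carlo sees only a handful of hits.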
In most cases of interest the sample performance function $H$ is non-negative, and the "nominal" probability density $f$ is parameterized by a finite-dimensional vector $u$; that is, $f(x) = f(x; u)$. It is then customary to choose the importance sampling probability density $g$ in the same family of probability densities; thus, $g(x) = f(x; v)$ for some reference parameter $v$. The CE minimization procedure then reduces to finding an optimal reference parameter vector, $v^*$ say, by cross-entropy minimization. This $v^*$ turns out to be the solution to the maximization problem $\max_v \int H(x)\, f(x; u) \ln f(x; v)\, dx$, which in turn can be estimated via simulation by solving, with respect to $v$, the stochastic counterpart program
$$\max_v \frac{1}{N} \sum_{k=1}^{N} H(X_k)\, \frac{f(X_k; u)}{f(X_k; w)}\, \ln f(X_k; v)\,, \qquad (5)$$
where $X_1, \ldots, X_N$ is a random sample from $f(\cdot\,; w)$, for any reference parameter $w$. The maximization (5) can often be solved analytically, in particular when the class of sampling distributions forms an exponential family; see, for example, [46, pages 319–320]. Indeed, analytical updating formulas can be found whenever explicit expressions for the maximum likelihood estimators of the parameters can be found, cf. [15, page 36].
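For instance, for the one-parameter exponential family $f(x; v) = v^{-1} e^{-x/v}$, setting the derivative of (5) with respect to $v$ to zero gives the closed-form update $\hat{v} = \sum_k H_k W_k X_k \big/ \sum_k H_k W_k$, with $W_k = f(X_k; u)/f(X_k; w)$. A minimal Python sketch (the parameter values below are illustrative assumptions, not from the article):

```python
import numpy as np

rng = np.random.default_rng(1)

# Closed-form solution of (5) for the family f(x; v) = exp(-x/v)/v
# (exponential with mean v).  Illustrative setup: nominal parameter
# u = 1, sampling parameter w = 2, and H(x) = 1{x > 3}.
u, w, gamma, N = 1.0, 2.0, 3.0, 100_000

x = rng.exponential(w, N)                 # sample from f(.; w)
H = (x > gamma).astype(float)             # indicator performance function
W = (w / u) * np.exp(-x / u + x / w)      # likelihood ratio f(x; u)/f(x; w)

# Maximizing (5): d/dv sum_k H_k W_k (-ln v - X_k/v) = 0 gives a
# weighted sample mean over the samples with H_k = 1.
v_hat = np.sum(H * W * x) / np.sum(H * W)

# Under f(.; u) = Exp(1), E[X | X > 3] = 4 by memorylessness, so
# v_hat should be close to the CE-optimal parameter v* = 4.
```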
Often $\ell = \mathbb{P}(S(X) \geq \gamma)$, for some performance function $S$ and level $\gamma$, in which case $H(x)$ takes the form of an indicator function: $H(x) = I_{\{S(x) \geq \gamma\}}$; that is, $H(x) = 1$ if $S(x) \geq \gamma$, and $0$ otherwise. A complication in solving (5) occurs when $\ell$ is a rare-event probability; that is, a very small probability (say less than $10^{-5}$). Then, for moderate sample size $N$ most or all of the values $H(X_k)$ in (5) are zero, and the maximization problem becomes useless. In that case a multi-level CE procedure is used, where a sequence of reference parameters and levels is constructed with the goal that the first converges to $v^*$ and the second to $\gamma$. This leads to the following algorithm; see, e.g., [46, page 238].
Algorithm 1.1 (CE Algorithm for Rare-Event Estimation)

1. Define $\hat{v}_0 = u$. Let $N^{\mathrm{e}} = \lceil \varrho N \rceil$. Set $t = 1$ (iteration counter).

2. Generate a random sample $X_1, \ldots, X_N$ according to the probability density $f(\cdot\,; \hat{v}_{t-1})$. Calculate the performances $S(X_i)$ for all $i$, and order them from smallest to largest, $S_{(1)} \leq \ldots \leq S_{(N)}$. Let $\hat{\gamma}_t$ be the sample $(1 - \varrho)$-quantile of performances; that is, $\hat{\gamma}_t = S_{(N - N^{\mathrm{e}} + 1)}$. If $\hat{\gamma}_t > \gamma$, reset $\hat{\gamma}_t$ to $\gamma$.

3. Use the same sample $X_1, \ldots, X_N$ to solve the stochastic program (5), with $w = \hat{v}_{t-1}$. Denote the solution by $\hat{v}_t$.

4. If $\hat{\gamma}_t < \gamma$, set $t = t + 1$ and reiterate from Step 2; otherwise, proceed with Step 5.

5. Let $T$ be the final iteration counter. Generate a sample $X_1, \ldots, X_{N_1}$ according to the probability density $f(\cdot\,; \hat{v}_T)$ and estimate $\ell$ via importance sampling, as in (3).
Apart from specifying the family of sampling probability densities, the initial vector $\hat{v}_0$, the sample size $N$ and the rarity parameter $\varrho$ (typically between 0.01 and 0.1), the algorithm is completely self-tuning. The sample size $N$ for determining a good reference parameter can usually be chosen much smaller than the sample size $N_1$ for the final importance sampling estimation, say $N = 1000$ versus $N_1 = 100{,}000$. Under certain technical conditions the deterministic version of Algorithm 1.1 is guaranteed to terminate (reach level $\gamma$) provided that $\varrho$ is chosen small enough; see Section 3.5 of [45].
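As a sketch of the full procedure (everything here is an illustrative assumption: the sum-of-exponentials model, the parameter values, and the closed-form exponential-family solution of (5)), Algorithm 1.1 might be implemented as:

```python
import math

import numpy as np

rng = np.random.default_rng(2)

# Illustrative problem: ell = P(X1 + ... + Xn >= gamma) with independent
# Xi ~ Exp(mean u).  For n = 5, u = 1, gamma = 20 the exact value is
# exp(-20) * sum_{k<5} 20^k / k! ~ 1.7e-5, small enough to be "rare".
n, u, gamma = 5, 1.0, 20.0
N, N1, rho = 1_000, 100_000, 0.1
Ne = math.ceil(rho * N)

def log_f(x, v):
    # log-density of independent Exp(mean v_i) components, per sample row
    return np.sum(-x / v - np.log(v), axis=1)

u_vec = np.full(n, u)
v = u_vec.copy()                               # Step 1: v_hat_0 = u
for t in range(1, 100):                        # Steps 2-4 (capped for safety)
    x = rng.exponential(v, size=(N, n))
    s = x.sum(axis=1)
    gamma_t = min(np.sort(s)[N - Ne], gamma)   # (1 - rho)-quantile, capped at gamma
    W = np.exp(log_f(x, u_vec) - log_f(x, v))  # f(.; u) / f(.; v_hat_{t-1})
    elite = s >= gamma_t
    we = W[elite]
    # Analytic solution of (5) for the exponential family: weighted means
    v = (we[:, None] * x[elite]).sum(axis=0) / we.sum()
    if gamma_t >= gamma:
        break

# Step 5: final importance sampling estimate, as in (3)
x = rng.exponential(v, size=(N1, n))
W = np.exp(log_f(x, u_vec) - log_f(x, v))
ell_hat = np.mean((x.sum(axis=1) >= gamma) * W)
```

Starting from $\hat{v}_0 = u$, the levels $\hat{\gamma}_t$ increase until $\gamma$ is reached, typically within a handful of iterations for this example, after which the final tilted density is used for the importance sampling estimate.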
2 The Cross-Entropy Method for Optimization
Let $\mathcal{X}$ be an arbitrary set of states and let $S$ be a real-valued performance function on $\mathcal{X}$. Suppose we wish to find the maximum of $S$ over $\mathcal{X}$, and the corresponding state $x^*$ at which this maximum is attained (assuming for simplicity that there is only one such state). Denoting the maximum by $\gamma^*$, we thus have
$$S(x^*) = \gamma^* = \max_{x \in \mathcal{X}} S(x)\,. \qquad (6)$$
This setting includes many types of optimization problems: discrete (combinatorial), continuous, mixed, and constrained problems. Moreover, if one is interested in minimizing rather than maximizing $S$, one can simply maximize $-S$.

Now associate with the above problem the estimation of the probability $\ell = \mathbb{P}(S(X) \geq \gamma)$, where $X$ has some probability density $f(x; u)$ on $\mathcal{X}$ (for example corresponding to the uniform distribution on $\mathcal{X}$) and $\gamma$ is some level. If $\gamma$ is chosen close to the unknown $\gamma^*$, then $\ell$ is, typically, a rare-event probability, and the CE approach of Section 1 can be used to find an importance sampling distribution close to the theoretically optimal importance sampling density, which concentrates all its mass on the point $x^*$. Sampling from such a distribution thus produces optimal or near-optimal states. A main difference with the CE method for rare-event simulation is that in the optimization setting the final level $\gamma = \gamma^*$ is not known in advance. The CE method for optimization produces a sequence of levels $\{\hat{\gamma}_t\}$ and reference parameters $\{\hat{v}_t\}$ such that the former tends to the optimal $\gamma^*$ and the latter to the optimal reference vector $v^*$ corresponding to the point mass at $x^*$; see, e.g., [46, page 251].
Algorithm 2.1 (CE Algorithm for Optimization)

1. Choose an initial parameter vector $\hat{v}_0$. Let $N^{\mathrm{e}} = \lceil \varrho N \rceil$. Set $t = 1$ (level counter).

2. Generate a sample $X_1, \ldots, X_N$ from the probability density $f(\cdot\,; \hat{v}_{t-1})$. Calculate the performances $S(X_i)$ for all $i$, and order them from smallest to largest, $S_{(1)} \leq \ldots \leq S_{(N)}$. Let $\hat{\gamma}_t$ be the sample $(1 - \varrho)$-quantile of performances; that is, $\hat{\gamma}_t = S_{(N - N^{\mathrm{e}} + 1)}$.

3. Use the same sample $X_1, \ldots, X_N$ and solve the stochastic program
$$\max_v \frac{1}{N} \sum_{k=1}^{N} I_{\{S(X_k) \geq \hat{\gamma}_t\}} \ln f(X_k; v)\,. \qquad (7)$$
Denote the solution by $\hat{v}_t$.

4. If the stopping criterion is met, stop; otherwise, set $t = t + 1$, and return to Step 2.
To run the algorithm, one needs to provide the class of sampling probability densities, the initial vector $\hat{v}_0$, the sample size $N$, the rarity parameter $\varrho$, and the stopping criterion. Any CE algorithm for optimization thus involves the following two main iterative phases:

1. Generate a random sample of objects in the search space $\mathcal{X}$ (trajectories, vectors, etc.) according to a specified probability distribution.

2. Update the parameters of that distribution, based on the $N^{\mathrm{e}}$ best-performing samples (the so-called elite samples), using cross-entropy minimization.
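These two phases can be sketched in Python for a one-dimensional normal sampling family, where the solution of (7) is simply the sample mean and standard deviation of the elite samples (the two-peak objective and all parameter values below are illustrative assumptions; the update uses the smoothed rule (8) discussed below):

```python
import numpy as np

rng = np.random.default_rng(3)

def S(x):
    # Two-peak objective: global maximum S(2) ~ 1, local maximum S(-2) ~ 0.8
    return np.exp(-(x - 2) ** 2) + 0.8 * np.exp(-(x + 2) ** 2)

# Normal sampling family f(.; v) with v = (mu, sigma); the solution of (7)
# is then the mean and standard deviation of the elite samples.
N, Ne, alpha = 200, 20, 0.7            # sample size, elite size, smoothing
mu, sigma = 0.0, 3.0                   # initial v_hat_0, wide enough to see both peaks
for t in range(50):
    x = rng.normal(mu, sigma, N)
    elite = x[np.argsort(S(x))[-Ne:]]  # the Ne best-performing samples
    # Smoothed update, cf. (8): blend elite statistics with v_hat_{t-1}
    mu = alpha * elite.mean() + (1 - alpha) * mu
    sigma = alpha * elite.std() + (1 - alpha) * sigma
    if sigma < 1e-4:                   # stop once sampling has degenerated
        break

# mu should now be close to the global maximizer x* = 2
```

During the iterations the mean drifts toward the global maximizer while the standard deviation shrinks toward zero, so the stopping criterion "all standard deviations small" is natural here.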
Apart from the fact that Step 5 in Algorithm 1.1 is missing in Algorithm 2.1, another main difference between the two algorithms is that the likelihood ratio term $f(X_k; u)/f(X_k; \hat{v}_{t-1})$ in (5) is missing in (7).
Often a smoothed updating rule is used, in which the parameter vector $\hat{v}_t$ is taken as
$$\hat{v}_t = \alpha\, \tilde{v}_t + (1 - \alpha)\, \hat{v}_{t-1}\,, \qquad (8)$$
where $\tilde{v}_t$ is the solution to (7) and $0 \leq \alpha \leq 1$ is a smoothing parameter. Many other modifications can be found in [27, 45, 46] and in the list of references. When there are two or more optimal solutions the CE algorithm typically "fluctuates" between the solutions before focusing in on one of them. The effect that smoothing has on convergence is discussed in detail in [13]. In particular, it is shown that with appropriate smoothing the CE method converges and finds the optimal solution with probability arbitrarily close to 1. Necessary conditions and sufficient conditions under which the optimal solution is generated eventually with probability 1 are also given. Other convergence results,