IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 1, NO. 1, APRIL 1997 67
No Free Lunch Theorems for Optimization
David H. Wolpert and William G. Macready
Abstract—A framework is developed to explore the connection between effective optimization algorithms and the problems they are solving. A number of "no free lunch" (NFL) theorems are presented which establish that for any algorithm, any elevated performance over one class of problems is offset by performance over another class. These theorems result in a geometric interpretation of what it means for an algorithm to be well suited to an optimization problem. Applications of the NFL theorems to information-theoretic aspects of optimization and benchmark measures of performance are also presented. Other issues addressed include time-varying optimization problems and a priori "head-to-head" minimax distinctions between optimization algorithms, distinctions that result despite the NFL theorems' enforcing of a type of uniformity over all algorithms.
Index Terms—Evolutionary algorithms, information theory, optimization.
I. INTRODUCTION
THE past few decades have seen an increased interest
in general-purpose "black-box" optimization algorithms
that exploit limited knowledge concerning the optimization
problem on which they are run. In large part these algorithms
have drawn inspiration from optimization processes that occur
in nature. In particular, the two most popular black-box
optimization strategies, evolutionary algorithms [1]–[3] and
simulated annealing [4], mimic processes in natural selection
and statistical mechanics, respectively.
In light of this interest in general-purpose optimization
algorithms, it has become important to understand the relationship between how well an algorithm a performs and the optimization problem f on which it is run. In this paper
we present a formal analysis that contributes toward such
an understanding by addressing questions like the following:
given the abundance of black-box optimization algorithms and
of optimization problems, how can we best match algorithms
to problems (i.e., how best can we relax the black-box nature
of the algorithms and have them exploit some knowledge
concerning the optimization problem)? In particular, while
serious optimization practitioners almost always perform such
matching, it is usually on a heuristic basis; can such matching
be formally analyzed? More generally, what is the underlying
mathematical “skeleton” of optimization theory before the
“flesh” of the probability distributions of a particular context
and set of optimization problems are imposed? What can
Manuscript received August 15, 1996; revised December 30, 1996. This
work was supported by the Santa Fe Institute and TXN Inc.
D. H. Wolpert is with IBM Almaden Research Center, San Jose, CA 95120-
6099 USA.
W. G. Macready was with Santa Fe Institute, Santa Fe, NM 87501 USA.
He is now with IBM Almaden Research Center, San Jose, CA 95120-6099
USA.
Publisher Item Identifier S 1089-778X(97)03422-X.
information theory and Bayesian analysis contribute to an
understanding of these issues? How a priori generalizable are
the performance results of a certain algorithm on a certain
class of problems to its performance on other classes of
problems? How should we even measure such generalization?
How should we assess the performance of algorithms on
problems so that we may programmatically compare those
algorithms?
Broadly speaking, we take two approaches to these ques-
tions. First, we investigate what a priori restrictions there are
on the performance of one or more algorithms as one runs
over the set of all optimization problems. Our second approach
is to instead focus on a particular problem and consider the
effects of running over all algorithms. In the current paper
we present results from both types of analyses but concentrate
largely on the first approach. The reader is referred to the
companion paper [5] for more types of analysis involving the
second approach.
We begin in Section II by introducing the necessary nota-
tion. Also discussed in this section is the model of computation
we adopt, its limitations, and the reasons we chose it.
One might expect that there are pairs of search algorithms a_1 and a_2 such that a_1 performs better than a_2 on average, even if a_2 sometimes outperforms a_1. As an example, one might expect that hill climbing usually outperforms hill descending if one's goal is to find a maximum of the cost function. One might also expect it would outperform a random search in such a context.
One of the main results of this paper is that such expecta-
tions are incorrect. We prove two “no free lunch” (NFL) the-
orems in Section III that demonstrate this and more generally
illuminate the connection between algorithms and problems.
Roughly speaking, we show that for both static and time-
dependent optimization problems, the average performance
of any pair of algorithms across all possible problems is
identical. This means in particular that if some algorithm a_1's performance is superior to that of another algorithm a_2 over some set of optimization problems, then the reverse must be true over the set of all other optimization problems. (The reader is urged to read this section carefully for a precise statement of these theorems.) This is true even if one of the algorithms is random; any algorithm a_1 performs worse than randomly just as readily (over the set of all optimization problems) as it performs better than randomly. Possible objections to these results are addressed in Sections III-A and III-B.
In Section IV we present a geometric interpretation of the
NFL theorems. In particular, we show that an algorithm’s
average performance is determined by how “aligned” it is
with the underlying probability distribution over optimization
problems on which it is run. This section is critical for an
understanding of how the NFL results are consistent with the
well-accepted fact that many search algorithms that do not take
into account knowledge concerning the cost function work
well in practice.
Section V-A demonstrates that the NFL theorems allow
one to answer a number of what would otherwise seem to
be intractable questions. The implications of these answers
for measures of algorithm performance and of how best to
compare optimization algorithms are explored in Section V-B.
In Section VI we discuss some of the ways in which,
despite the NFL theorems, algorithms can have a priori
distinctions that hold even if nothing is specified concerning
the optimization problems. In particular, we show that there
can be “head-to-head” minimax distinctions between a pair of
algorithms, i.e., that when considering one function at a time,
a pair of algorithms may be distinguishable, even if they are
not when one looks over all functions.
In Section VII we present an introduction to the alternative
approach to the formal analysis of optimization in which
problems are held fixed and one looks at properties across
the space of algorithms. Since these results hold in general,
they hold for any and all optimization problems and thus
are independent of the types of problems one is more or
less likely to encounter in the real world. In particular,
these results show that there is no a priori justification for
using a search algorithm’s observed behavior to date on a
particular cost function to predict its future behavior on that
function. In fact when choosing between algorithms based on
their observed performance it does not suffice to make an
assumption about the cost function; some (currently poorly
understood) assumptions are also being made about how the
algorithms in question are related to each other and to the
cost function. In addition to presenting results not found in
[5], this section serves as an introduction to the perspective
adopted in [5].
We conclude in Section VIII with a brief discussion, a
summary of results, and a short list of open problems.
We have confined all proofs to appendixes to facilitate the
flow of the paper. A more detailed, and substantially longer,
version of this paper, a version that also analyzes some issues
not addressed in this paper, can be found in [6].
II. PRELIMINARIES
We restrict attention to combinatorial optimization in which the search space X, though perhaps quite large, is finite. We further assume that the space of possible "cost" values Y is also finite. These restrictions are automatically met for optimization algorithms run on digital computers where typically Y is some 32 or 64 bit representation of the real numbers.
The size of the spaces X and Y are indicated by |X| and |Y|, respectively. An optimization problem f (sometimes called a "cost function" or an "objective function" or an "energy function") is represented as a mapping f : X → Y, and F = Y^X indicates the space of all possible problems. F is of size |Y|^|X|—a large but finite number. In addition to static f, we are also interested in optimization problems that depend explicitly on time. The extra notation required for such time-dependent problems will be introduced as needed.
It is common in the optimization community to adopt
an oracle-based view of computation. In this view, when
assessing the performance of algorithms, results are stated
in terms of the number of function evaluations required to
find a given solution. Practically though, many optimization
algorithms are wasteful of function evaluations. In particular,
many algorithms do not remember where they have already
searched and therefore often revisit the same points. Although
any algorithm that is wasteful in this fashion can be made
more efficient simply by remembering where it has been (cf.
tabu search [7], [8]), many real-world algorithms elect not to
employ this stratagem. From the point of view of the oracle-
based performance measures, these revisits are “artifacts”
distorting the apparent relationship between many such real-
world algorithms.
This difficulty is exacerbated by the fact that the amount
of revisiting that occurs is a complicated function of both
the algorithm and the optimization problem and therefore
cannot be simply “filtered out” of a mathematical analysis.
Accordingly, we have elected to circumvent the problem
entirely by comparing algorithms based on the number of
distinct function evaluations they have performed. Note that
this does not mean that we cannot compare algorithms that
are wasteful of evaluations—it simply means that we compare
algorithms by counting only their number of distinct calls to
the oracle.
We call a time-ordered set of m distinct visited points a "sample" of size m. Samples are denoted by d_m ≡ {(d_m^x(1), d_m^y(1)), ..., (d_m^x(m), d_m^y(m))}. The points in a sample are ordered according to the time at which they were generated. Thus d_m^x(i) indicates the X value of the ith successive element in a sample of size m and d_m^y(i) is its associated cost or Y value. d_m^y ≡ {d_m^y(1), ..., d_m^y(m)} will be used to indicate the ordered set of cost values. The space of all samples of size m is D_m = (X × Y)^m (so d_m ∈ D_m) and the set of all possible samples of arbitrary size is D ≡ ∪_{m≥0} D_m.
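To make the notation concrete, here is a minimal Python sketch (our illustration, with our own function names, not code from the paper) representing a sample d_m as a time-ordered list of distinct (x, y) pairs:

```python
# A sample d_m is a time-ordered list of m distinct visited points, each pairing
# an X value d_m^x(i) with its cost d_m^y(i). Function names here are ours.
def make_sample(f, xs):
    """Build d_m by querying the oracle f at the distinct points xs, in visit order."""
    assert len(set(xs)) == len(xs), "points in a sample must be distinct"
    return [(x, f(x)) for x in xs]

def d_y(sample):
    """The ordered set of cost values d_m^y."""
    return [y for _, y in sample]
```

For example, d_y(make_sample(lambda x: x * x, [2, 0, 1])) is [4, 0, 1]: the costs in visit order, not sorted.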
As an important clarification of this definition, consider a hill-descending algorithm. This is the algorithm that examines a set of neighboring points in X and moves to the one having the lowest cost. The process is then iterated from the newly chosen point. (Often, implementations of hill descending stop when they reach a local minimum, but they can easily be extended to run longer by randomly jumping to a new unvisited point once the neighborhood of a local minimum has been exhausted.) The point to note is that because a sample contains all the previous points at which the oracle was consulted, it includes the Y values of all the neighbors of the current point, and not only the lowest cost one that the algorithm moves to. This must be taken into account when counting the value of m.
An optimization algorithm a is represented as a mapping from previously visited sets of points to a single new (i.e., previously unvisited) point in X. Formally, a : d ∈ D → {x | x ∉ d^x}. Given our decision to only measure distinct function evaluations even if an algorithm revisits previously
searched points, our definition of an algorithm includes all
common black-box optimization techniques like simulated an-
nealing and evolutionary algorithms. (Techniques like branch
and bound [9] are not included since they rely explicitly on
the cost structure of partial solutions.)
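The definition of an algorithm as a map from samples to unvisited points can be sketched directly. The following toy illustration is ours, not code from the paper; the iteration loop counts only distinct oracle calls, as the text prescribes:

```python
# An algorithm a maps the current sample to a single previously unvisited point
# in X; iterating it for m distinct oracle calls yields the sample d_m.
def run(algorithm, f, X, m):
    sample = []
    for _ in range(m):
        x = algorithm(sample, X)
        assert x not in {xv for xv, _ in sample}   # each call is to a distinct point
        sample.append((x, f(x)))                   # one distinct call to the oracle
    return sample

# One trivial deterministic instance: always query the smallest unvisited point.
def enumerate_search(sample, X):
    visited = {xv for xv, _ in sample}
    return min(x for x in X if x not in visited)
```

Here run(enumerate_search, f, range(10), 3) queries x = 0, 1, 2 regardless of the observed costs; a hill descender would instead use the costs in the sample to pick its next point.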
As defined above, a search algorithm is deterministic; every sample maps to a unique new point. Of course, essentially all algorithms implemented on computers are deterministic,[1] and in this our definition is not restrictive. Nonetheless, it is worth noting that all of our results are extensible to nondeterministic algorithms, where the new point is chosen stochastically from the set of unvisited points. This point is returned to later.
Under the oracle-based model of computation any measure of the performance of an algorithm after m iterations is a function of the sample d_m. Such performance measures will be indicated by Φ(d_m^y). As an example, if we are trying to find a minimum of f, then a reasonable measure of the performance of a might be the value of the lowest Y value in d_m^y: Φ(d_m^y) = min_i {d_m^y(i) : i = 1, ..., m}. Note that measures of performance based on factors other than d_m^y (e.g., wall clock time) are outside the scope of our results.
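As a concrete instance of such a measure, the best-cost-so-far criterion just described can be written as (our sketch):

```python
# Phi(d_m^y): the lowest cost value observed among the m distinct evaluations.
# Appropriate when the goal is to minimize f; it depends on d_m^y only.
def phi(d_y_values):
    return min(d_y_values)
```

For instance, phi([5, 2, 7]) is 2, regardless of the order in which those costs were observed.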
We shall cast all of our results in terms of probability
theory. We do so for three reasons. First, it allows simple
generalization of our results to stochastic algorithms. Second,
even when the setting is deterministic, probability theory
provides a simple consistent framework in which to carry out
proofs. The third reason for using probability theory is perhaps
the most interesting. A crucial factor in the probabilistic
framework is the distribution P(f) = P(f(x_1), ..., f(x_{|X|})). This distribution, defined over F, gives the probability that each f ∈ F is the actual optimization problem at hand. An approach based on this distribution has the immediate advantage that often knowledge of a problem is statistical in nature and this information may be easily encodable in P(f). For example, Markov or Gibbs random field descriptions [10] of families of optimization problems express P(f) exactly.
Exploiting P(f), however, also has advantages even when we are presented with a single uniquely specified cost function.
One such advantage is the fact that although it may be fully
specified, many aspects of the cost function are effectively
unknown (e.g., we certainly do not know the extrema of the
function). It is in many ways most appropriate to have this
effective ignorance reflected in the analysis as a probability
distribution. More generally, optimization practitioners usually
act as though the cost function is partially unknown, in that the
same algorithm is used for all cost functions in a class of such
functions (e.g., in the class of all traveling salesman problems
having certain characteristics). In so doing, the practitioner
implicitly acknowledges that distinctions between the cost
functions in that class are irrelevant or at least unexploitable.
In this sense, even though we are presented with a single
particular problem from that class, we act as though we are
presented with a probability distribution over cost functions,
a distribution that is nonzero only for members of that class
of cost functions. P(f) is thus a prior specification of the class of the optimization problem at hand, with different classes of problems corresponding to different choices of what algorithms we will use, and giving rise to different distributions P(f).
[1] In particular, note that pseudorandom number generators are deterministic given a seed.
Given our choice to use probability theory, the performance of an algorithm a iterated m times on a cost function f is measured with P(d_m^y | f, m, a). This is the conditional probability of obtaining a particular sample d_m under the stated conditions. From P(d_m^y | f, m, a) performance measures Φ(d_m^y) can be found easily.
In the next section we analyze P(d_m^y | f, m, a) and in particular how it varies with the algorithm a. Before proceeding
with that analysis, however, it is worth briefly noting that there
are other formal approaches to the issues investigated in this
paper. Perhaps the most prominent of these is the field of com-
putational complexity. Unlike the approach taken in this paper,
computational complexity largely ignores the statistical nature
of search and concentrates instead on computational issues.
Much, though by no means all, of computational complexity is
concerned with physically unrealizable computational devices
(e.g., Turing machines) and the worst-case resource usage
required to find optimal solutions. In contrast, the analysis
in this paper does not concern itself with the computational
engine used by the search algorithm, but rather concentrates
exclusively on the underlying statistical nature of the search
problem. The current probabilistic approach is complementary
to computational complexity. Future work involves combining
our analysis of the statistical nature of search with practical
concerns for computational resources.
III. THE NFL THEOREMS
In this section we analyze the connection between algo-
rithms and cost functions. We have dubbed the associated
results NFL theorems because they demonstrate that if an
algorithm performs well on a certain class of problems then
it necessarily pays for that with degraded performance on the
set of all remaining problems. Additionally, the name em-
phasizes a parallel with similar results in supervised learning
[11], [12].
The precise question addressed in this section is: "How does the set of problems F_1 ⊂ F for which algorithm a_1 performs better than algorithm a_2 compare to the set F_2 ⊂ F for which the reverse is true?" To address this question we compare the sum over all f of P(d_m^y | f, m, a_1) to the sum over all f of P(d_m^y | f, m, a_2). This comparison constitutes a major result of this paper: P(d_m^y | f, m, a) is independent of a when averaged over all cost functions.
Theorem 1: For any pair of algorithms a_1 and a_2,
∑_f P(d_m^y | f, m, a_1) = ∑_f P(d_m^y | f, m, a_2).
A proof of this result is found in Appendix A. An immediate corollary of this result is that for any performance measure Φ(d_m^y), the average over all f of P(Φ(d_m^y) | f, m, a) is independent of a. The precise way that the sample is mapped to a performance measure is unimportant.
This theorem explicitly demonstrates that what an algorithm
gains in performance on one class of problems is necessarily
offset by its performance on the remaining problems; that is the only way that all algorithms can have the same f-averaged performance.
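Theorem 1 can be checked by brute force on a tiny instance. The sketch below is our construction, not the paper's: it enumerates every cost function f : X → Y for |X| = 3 and |Y| = 2, runs two different deterministic non-revisiting algorithms, and confirms that the histogram of samples d_2^y over all functions is identical for both:

```python
from itertools import product
from collections import Counter

X = (0, 1, 2)          # a tiny search space
Y = (0, 1)             # a tiny space of cost values

def run(algorithm, f, m):
    """Return d_m^y after m distinct oracle calls."""
    sample = []
    for _ in range(m):
        x = algorithm(sample)
        sample.append((x, f[x]))
    return tuple(y for _, y in sample)

def forward(sample):
    """Visit 0, 1, 2, ... regardless of the observed costs."""
    visited = {x for x, _ in sample}
    return min(x for x in X if x not in visited)

def adaptive(sample):
    """Start in the middle; let the last observed cost steer the next query."""
    visited = {x for x, _ in sample}
    unvisited = [x for x in X if x not in visited]
    if not sample:
        return 1
    return unvisited[0] if sample[-1][1] == 0 else unvisited[-1]

# Histogram of d_2^y over ALL |Y|^|X| = 8 cost functions, one per algorithm.
hist = {alg: Counter(run(alg, dict(zip(X, fv)), 2)
                     for fv in product(Y, repeat=len(X)))
        for alg in (forward, adaptive)}
assert hist[forward] == hist[adaptive]   # same f-averaged behavior: no free lunch
```

The cost-adaptive algorithm produces different samples than the oblivious one on individual functions, yet the two multisets of outcomes coincide once every f is counted.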
A result analogous to Theorem 1 holds for a class of time-dependent cost functions. The time-dependent functions we consider begin with an initial cost function f_1 that is present at the sampling of the first X value. Before the beginning of each subsequent iteration of the optimization algorithm, the cost function is deformed to a new function, as specified by a mapping T.[2] We indicate this mapping with the notation T_i. So the function present during the ith iteration is f_i, with f_{i+1} = T_i(f_i). T_i is assumed to be a (potentially i-dependent) bijection between F and F. We impose bijectivity because if it did not hold, the evolution of cost functions could narrow in on a region of f's for which some algorithms may perform better than others. This would constitute an a priori bias in favor of those algorithms, a bias whose analysis we wish to defer to future work.
How best to assess the quality of an algorithm's performance on time-dependent cost functions is not clear. Here we consider two schemes based on manipulations of the definition of the sample. In scheme 1 the particular Y value d_m^y(j) corresponding to a particular x value d_m^x(j) is given by the cost function that was present when d_m^x(j) was sampled. In contrast, for scheme 2 we imagine a sample D_m^y given by the values from the present cost function for each of the x values in d_m^x. Formally if d_m^x = {d_m^x(1), ..., d_m^x(m)}, then in scheme 1 we have d_m^y = {f_1(d_m^x(1)), ..., f_m(d_m^x(m))}, and in scheme 2 we have D_m^y = {f_m(d_m^x(1)), ..., f_m(d_m^x(m))} where f_m is the final cost function.
In some situations it may be that the members of the sample "live" for a long time, compared to the time scale of the dynamics of the cost function. In such situations it may be appropriate to judge the quality of the search algorithm by D_m^y; all those previous elements of the sample are still "alive" at time m, and therefore their current cost is of interest. On the other hand, if members of the sample live for only a short time on the time scale of the dynamics of the cost function, one may instead be concerned with things like how well the "living" member(s) of the sample track the changing cost function. In such situations, it may make more sense to judge the quality of the algorithm with the d_m^y sample.
Results similar to Theorem 1 can be derived for both schemes. By analogy with that theorem, we average over all possible ways a cost function may be time dependent, i.e., we average over all T (rather than over all f). Thus we consider ∑_T P(d_m^y | f_1, T, m, a) where f_1 is the initial cost function. Since T only takes effect for m > 1, and since f_1 is fixed, there are a priori distinctions between algorithms as far as the first member of the sample is concerned. After redefining samples, however, to only contain those elements added after the first iteration of the algorithm, we arrive at the following result, proven in Appendix B.
[2] An obvious restriction would be to require that T does not vary with time, so that it is a mapping simply from F to F. An analysis for T's limited in this way is beyond the scope of this paper.
Theorem 2: For all d_m^y, D_m^y, m > 1, algorithms a_1 and a_2, and initial cost functions f_1,
∑_T P(d_m^y | f_1, T, m, a_1) = ∑_T P(d_m^y | f_1, T, m, a_2),
and
∑_T P(D_m^y | f_1, T, m, a_1) = ∑_T P(D_m^y | f_1, T, m, a_2).
So, in particular, if one algorithm outperforms another for
certain kinds of cost function dynamics, then the reverse must
be true on the set of all other cost function dynamics.
Although this particular result is similar to the NFL result
for the static case, in general the time-dependent situation
is more subtle. In particular, with time dependence there
are situations in which there can be a priori distinctions
between algorithms even for those members of the sample
arising after the first. For example, in general there will be
distinctions between algorithms when considering the quantity ∑_T P(d_m^y | f, T, m, a). To see this, consider the case where X is a set of contiguous integers and for all iterations T is a shift operator, replacing f(x) by f(x − 1) for all x [with the value at the smallest x wrapping around to the largest]. For such a case we can construct algorithms which behave differently a priori. For example, take a to be the algorithm that first samples f at x_1, next at x_1 + 1, and so on, regardless of the values in the sample. Then for any T, d_m^y is always made up of identical Y values. Accordingly, ∑_T P(d_m^y | f, T, m, a) is nonzero only for d_m^y for which all values d_m^y(i) are identical. Other search algorithms, even for the same shift T, do not have this restriction on Y values. This constitutes an a priori distinction between algorithms.
A. Implications of the NFL Theorems
As emphasized above, the NFL theorems mean that if an
algorithm does particularly well on average for one class of
problems then it must do worse on average over the remaining
problems. In particular, if an algorithm performs better than
random search on some class of problems then it must
perform worse than random search on the remaining problems.
Thus comparisons reporting the performance of a particular
algorithm with a particular parameter setting on a few sample
problems are of limited utility. While such results do indicate
behavior on the narrow range of problems considered, one
should be very wary of trying to generalize those results to
other problems.
Note, however, that the NFL theorems need not be viewed as a way of comparing function classes F_1 and F_2 (or classes of evolution operators T_1 and T_2, as the case might be). They can be viewed instead as a statement concerning any algorithm's performance when f is not fixed, under the uniform prior over cost functions, P(f) = 1/|F|. If we wish instead to analyze performance where f is not fixed, as in this alternative interpretation of the NFL theorems, but in contrast with the NFL case f is now chosen from a nonuniform prior, then we must analyze explicitly the sum

P(d_m^y | m, a) = ∑_f P(d_m^y | f, m, a) P(f).   (1)

Since it is certainly true that any class of problems faced by
a practitioner will not have a flat prior, what are the practical
implications of the NFL theorems when viewed as a statement
concerning an algorithm's performance for nonfixed f? This
question is taken up in greater detail in Section IV but we
offer a few comments here.
First, if the practitioner has knowledge of problem charac-
teristics but does not incorporate them into the optimization
algorithm, then P(f) is effectively uniform. (Recall that P(f) can be viewed as a statement concerning the practitioner's
choice of optimization algorithms.) In such a case, the NFL
theorems establish that there are no formal assurances that the
algorithm chosen will be at all effective.
Second, while most classes of problems will certainly have
some structure which, if known, might be exploitable, the
simple existence of that structure does not justify choice
of a particular algorithm; that structure must be known and
reflected directly in the choice of algorithm to serve as such a
justification. In other words, the simple existence of structure
per se, absent a specification of that structure, cannot provide a
basis for preferring one algorithm over another. Formally, this
is established by the existence of NFL-type theorems in which
rather than average over specific cost functions f, one averages over specific "kinds of structure," i.e., theorems in which one averages P(d_m^y | m, a) over distributions P(f). That such theorems hold when one averages over all P(f) means that the indistinguishability of algorithms associated with uniform P(f) is not some pathological, outlier case. Rather, uniform P(f) is a "typical" distribution as far as indistinguishability of algorithms is concerned. The simple fact that the P(f) at hand is nonuniform cannot serve to determine one's choice of optimization algorithm.
Finally, it is important to emphasize that even if one is considering the case where f is not fixed, performing the associated average according to a uniform P(f) is not essential for NFL to hold. NFL can also be demonstrated for a range of nonuniform priors. For example, any prior of the form ∏_{x∈X} P(f(x)) (where P(y = f(x)) is the distribution of Y values) will also give NFL theorems. The f-average can also enforce correlations between costs at different X values and NFL-like results will still be obtained. For example, if costs are rank ordered (with ties broken in some arbitrary way) and we sum only over all cost functions given by permutations of those orderings, then NFL remains valid.
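The rank-ordering claim can likewise be verified numerically on a small instance (our sketch, not the paper's): fix one multiset of cost values, sum over all cost functions obtained by permuting the assignment of those values to points, and observe that the histogram of d_2^y is again algorithm-independent:

```python
from itertools import permutations
from collections import Counter

X = (0, 1, 2)
values = (0, 1, 2)     # one fixed rank-ordering of costs

def run(algorithm, f, m):
    """Return d_m^y after m distinct oracle calls."""
    sample = []
    for _ in range(m):
        x = algorithm(sample)
        sample.append((x, f[x]))
    return tuple(y for _, y in sample)

def forward(sample):
    """Oblivious sweep: visit 0, 1, 2, ... regardless of costs."""
    visited = {x for x, _ in sample}
    return min(x for x in X if x not in visited)

def adaptive(sample):
    """Start in the middle; let the last observed cost steer the next query."""
    visited = {x for x, _ in sample}
    unvisited = [x for x in X if x not in visited]
    if not sample:
        return 1
    return unvisited[0] if sample[-1][1] == 0 else unvisited[-1]

# Sum only over the 3! permutations of the fixed cost values.
hist = {alg: Counter(run(alg, dict(zip(X, p)), 2) for p in permutations(values))
        for alg in (forward, adaptive)}
assert hist[forward] == hist[adaptive]
```

Restricting the average to this permutation-closed set of six functions, rather than all of F, leaves the two histograms identical, consistent with the text's claim.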
The choice of uniform P(f) was motivated more by theoretical than by pragmatic concerns, as a way of analyzing the theoretical structure of optimization. Nevertheless, the cautionary observations presented above make clear that an analysis of the uniform P(f) case has a number of ramifications for practitioners.
B. Stochastic Optimization Algorithms
Thus far we have considered the case in which algorithms
are deterministic. What is the situation for stochastic algo-
rithms? As it turns out, NFL results hold even for these
algorithms.
The proof is straightforward. Let σ be a stochastic "nonpotentially revisiting" algorithm. Formally, this means that σ is a mapping taking any sample d to a d-dependent distribution over X that equals zero for all x ∈ d^x. In this sense σ is what in the statistics community is known as a "hyper-parameter," specifying the function P(d_{m+1}^x(m + 1) = x | d_m, σ) for all m and d_m. One can now reproduce the derivation of the NFL result for deterministic algorithms, only with a replaced by σ throughout. In so doing, all steps in the proof remain valid. This establishes that NFL results apply to stochastic algorithms as well as deterministic ones.
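A stochastic non-potentially-revisiting algorithm of this kind is easy to sketch (our illustration); the essential property is that the induced distribution over X assigns zero probability to every already-visited point:

```python
import random

# sigma maps a sample d to a d-dependent distribution over X that is zero on all
# points already in d^x. Here that distribution is uniform over the unvisited
# points, i.e., sigma is random search without revisits.
def sigma(sample, X, rng):
    visited = {x for x, _ in sample}
    return rng.choice([x for x in X if x not in visited])
```

Any other rule for weighting the unvisited points would serve equally well for the purposes of the proof sketched above.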
IV. A GEOMETRIC PERSPECTIVE ON THE NFL THEOREMS
Intuitively, the NFL theorem illustrates that if knowledge of f, perhaps specified through P(f), is not incorporated into a, then there are no formal assurances that a will be effective. Rather, in this case effective optimization relies on a fortuitous matching between f and a. This point is formally established by viewing the NFL theorem from a geometric perspective.
Consider the space F of all possible cost functions. As previously discussed in regard to (1), the probability of obtaining some d_m^y is
P(d_m^y | m, a) = ∑_f P(d_m^y | f, m, a) P(f)
where P(f) is the prior probability that the optimization problem at hand has cost function f. This sum over functions can be viewed as an inner product in F. Defining the F-space vectors v and p by their components v(f) ≡ P(d_m^y | f, m, a) and p(f) ≡ P(f), respectively,

P(d_m^y | m, a) = v · p.   (2)
This equation provides a geometric interpretation of the optimization process. d_m^y can be viewed as fixed to the sample that is desired, usually one with a low cost value, and m is a measure of the computational resources that can be afforded. Any knowledge of the properties of the cost function goes into the prior over cost functions P(f). Then (2) says the performance of an algorithm is determined by the magnitude of its projection onto p, i.e., by how aligned v is with the problems p. Alternatively, by averaging over d_m^y, it is easy to see that E(d_m^y | m, a) is an inner product between p and E(d_m^y | f, m, a). The expectation of any performance measure Φ(d_m^y) can be written similarly.
In any of these cases, P(f) or p must "match" or be aligned with a to get the desired behavior. This need for matching provides a new perspective on how certain algorithms can perform well in practice on specific kinds of problems. For example, it means that the years of research into the traveling salesman problem (TSP) have resulted in algorithms aligned with the (implicit) P(f) describing traveling salesman problems of interest to TSP researchers.
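Equation (2) can be made concrete with a small numerical sketch (ours, not the paper's): for a fixed deterministic algorithm, v has a 1 in the component of every f that would generate the target d_m^y, and its inner product with the prior p gives P(d_m^y | m, a). A prior aligned with the algorithm drives that projection up:

```python
from itertools import product

X = (0, 1, 2)
Y = (0, 1)
F = [dict(zip(X, fv)) for fv in product(Y, repeat=len(X))]   # all |Y|^|X| = 8 functions

def run(f, m):
    """A fixed deterministic algorithm: visit 0, 1, 2, ... in order."""
    return tuple(f[x] for x in range(m))

target = (0, 0)                   # the desired sample d_2^y

# Components of the F-space vectors in (2): v(f) = P(d_m^y | f, m, a), p(f) = P(f).
v = [1.0 if run(f, 2) == target else 0.0 for f in F]
p_uniform = [1.0 / len(F)] * len(F)
p_aligned = [0.5 if f[0] == 0 and f[1] == 0 else 0.0 for f in F]

def dot(u, w):
    return sum(a * b for a, b in zip(u, w))

# Under the uniform prior the projection is small; under a prior concentrated on
# functions this algorithm does well on, the projection is maximal.
assert abs(dot(v, p_uniform) - 0.25) < 1e-12
assert abs(dot(v, p_aligned) - 1.0) < 1e-12
```

The choice of target and the aligned prior are of course contrived for illustration; the point is only that performance is literally an inner product of v with p.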
Taking the geometric view, the NFL result that ∑_f P(d_m^y | f, m, a) is independent of a has the interpretation that for any particular d_m^y and m, all algorithms have the same projection onto the uniform P(f), represented by the diagonal vector 1. Formally, v · 1 = c(d_m^y, m), where c(d_m^y, m) is some constant depending only upon d_m^y and m. For deterministic algorithms, the components of v (i.e., the

TL;DR: This chapter contains sections titled: References Artificial Intelligence through a Simulation of Evolution Natural Automata and Prosthetic Devices and Artificial intelligence through a simulation of Evolution natural automata and prosthetic devices.