Coevolutionary Free Lunches

David H. Wolpert, dhw@email.arc.nasa.gov
William G. Macready, wgm@email.arc.nasa.gov
NASA Ames Research Center, Moffett Field, CA 94035

Abstract-Recent work on the foundations of optimization has begun to uncover its underlying rich structure. In particular, the "No Free Lunch" (NFL) theorems state that any two algorithms are equivalent when their performance is averaged across all possible problems. This highlights the need for exploiting problem-specific knowledge to achieve better than random performance. In this paper we present a general framework covering most search scenarios. In addition to the optimization scenarios addressed in the NFL results, this framework covers multi-armed bandit problems and evolution of multiple co-evolving agents. As a particular instance of the latter, it covers "self-play" problems. In these problems the agents work together to produce a champion, who then engages one or more antagonists in a subsequent multi-player game. In contrast to the traditional optimization case where the NFL results hold, we show that in self-play there are free lunches: in coevolution some algorithms have better performance than other algorithms, averaged across all possible problems. However, in the typical coevolutionary scenarios encountered in biology, where there is no champion, NFL still holds.

I. INTRODUCTION

Optimization algorithms have proven to be valuable in almost every setting where quantitative figures of merit are available. Recently, the mathematical foundations of optimization have begun to be uncovered [WM97], [MW96], [FHS01], [WM01], [CO01]. One particular result in this work, the "No Free Lunch" (NFL) theorems, establishes the equivalent performance of all optimization algorithms when averaged across all possible problems [WM97]. Numerous works have extended these early results and considered their application to different types of optimization (e.g. to multi-objective optimization [CK03]). The web site www.no-free-lunch.org offers a list of recent references.

However, all previous work has been cast in a limited manner that does not cover repeated-game scenarios where the figure of merit can vary based on the response of another player. In particular, the NFL theorems do not cover such scenarios. These game-like scenarios are usually called "coevolutionary" since they involve the behaviors of more than a single agent or player [FWH].

One important example of coevolution is "self-play", where the players cooperate to train one of them as a champion. That champion is then pitted against an antagonist in a subsequent multi-player game. The goal is to train that champion to perform as well as possible in that subsequent game. For a checkers example see [CF99]. We will refer to all players other than the one of direct attention as that player's "opponents", even when (as in self-play) the players are actually cooperating. (Sometimes when discussing self-play we will refer to the specific opponent to be faced by a champion in a subsequent game, an opponent not under our control, as the champion's "antagonist".)

Coevolution can also be used for problems that on the surface appear to have no connection to a game (for an early application to sorting networks see [Hil92]). Coevolution in these cases enables escape from poor local optima in favor of better local optima.

In this paper we first present a mathematical framework that covers both traditional optimization and coevolutionary scenarios. (It also covers other scenarios like multi-armed bandits.) We then use that framework to explore the differences between traditional optimization and coevolution. We find dramatic differences between the two scenarios. In particular, unlike the fundamental NFL result for traditional optimization, in the self-play domain there are algorithms which are superior to other algorithms for all problems. However, in the typical coevolutionary scenarios encountered in biology, where there is no champion, NFL still holds.

II. GENERAL FRAMEWORK

In this section we present a general framework and illustrate it on two examples. Despite its substantially greater breadth of applicability, the formal structure of this framework is only a slight extension of that used in [WM97].

A. Formal framework specification

Say we have two spaces, X and Z. To guide the intuition, a typical scenario might have x ∈ X be the joint strategy followed by our player(s), and z ∈ Z be the probability distribution over some space of possible rewards/payoffs to the champion, or over possible figures of merit, or some such. In addition to X and Z, we also have a fitness function

f : X → Z.    (1)

In the example where z is a probability distribution over rewards, f can be viewed as the specification of an x-conditioned probability distribution of rewards.

We have a total of m time-steps, and represent the information generated through those time-steps as

d_m = (d_m^x, d_m^z) = ({d^x(t)}_{t=1}^m, {d^z(t)}_{t=1}^m).    (2)

Each d^x(t) is a particular x ∈ X. Each d^z(t) is a (perhaps stochastic) function of f(d^x(t)). For example, say the z's, the values of f(x), are probability distributions over reward values. Then d^z(t) could consist of the full distribution f(d^x(t)). Alternatively, it could consist of a moment of that distribution, or even a random sample of it. In general, we allow the function specifying d^z(t) to vary with t, although that freedom will not be exploited here. As shorthand we will write d(t) to mean the pair (d^x(t), d^z(t)).

A search algorithm a is an initial distribution P_1(d^x(1)), together with a set of m − 1 separate conditional distributions P_t(d^x(t) | d_{t−1}), t = 2, ..., m. Such an algorithm specifies what x to choose, based on the information uncovered so far, for any time-step t.

Finally, we have a vector-valued cost function C(d_m, f), which we use to assess the performance of the algorithm. Often our goal is to find the a that will maximize E(C), for a particular choice of how to form the d^z(t)'s.

The NFL theorems concern averages over all f of quantities involving C. For those theorems to hold, i.e., for f-averages of C to be independent of the search algorithm, it is crucial that C does not depend on f. (The framework in [WM97] defines cost functions as real-valued functions of d_m.) When that independence is relaxed, the NFL theorems need not hold. Such relaxation occurs in self-play, for example, and is how one can have free lunches in self-play. This paper explores this phenomenon.

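To make the framework concrete, here is a minimal Python sketch (our own illustration, not code from the paper) that runs a generic search algorithm a for m steps against a fitness function f and then scores the resulting trace d_m with a cost function C. The particular `fitness`, `random_nonrevisiting`, and `cost` below are hypothetical stand-ins.

```python
import random

def run_search(algorithm, fitness, m, rng):
    """Generate the trace d_m = ({d^x(t)}, {d^z(t)}) by running `algorithm` for m steps."""
    d_x, d_z = [], []
    for t in range(m):
        x = algorithm(d_x, d_z, rng)   # draws d^x(t) from P_t(d^x(t) | d_{t-1})
        z = fitness(x)                 # d^z(t) is derived from f(d^x(t))
        d_x.append(x)
        d_z.append(z)
    return d_x, d_z

# Hypothetical stand-ins for f, a and C (not from the paper):
def fitness(x):                        # f : X -> Z, here a deterministic payoff
    return (7 * x + 3) % 5

def random_nonrevisiting(d_x, d_z, rng, X=range(20)):
    return rng.choice([x for x in X if x not in d_x])   # never revisit (as in Example 1 below)

def cost(d_x, d_z):                    # a cost depending only on the trace d_m, as NFL requires
    return max(d_z)

d_x, d_z = run_search(random_nonrevisiting, fitness, m=5, rng=random.Random(0))
print(cost(d_x, d_z))
```
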
B. Examples of the framework

a) Example 1: One example of this framework is the scenario considered in the NFL theorems. There each z is a probability distribution over a space Y ⊆ R. For convenience we take X and Y countable. Each d^z(t) is a sample of the associated distribution z(t) = f(d^x(t)). The search algorithm is constrained so that

P_t(d^x(t) = x | d_{t−1}) = 0  ∀ x ∈ d^x_{t−1},    (3)

i.e., so that the search never revisits points already sampled.¹ Finally, C(d_m, f) is allowed to be any scalar-valued function that depends on d_m exclusively.

The NFL theorems apply to any scenario meeting these specifications.

b) Example 2: Another example is the multi-armed bandit problem, introduced for optimization by Holland [Hol75] and thoroughly analyzed in [MW98]. The scenario for that problem is identical to that for the NFL results, except that there are no constraints on the search algorithm, Y = R, and every z is a Gaussian. The fact that revisits are allowed means that NFL need not apply.

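As a concrete, hypothetical instance of this bandit scenario, the sketch below draws Gaussian rewards and is free to revisit arms; unlike Example 1, the same x may be sampled many times. The arm parameters and the simple greedy rule are our own illustrative choices, not anything prescribed by the paper.

```python
import random

# Hypothetical 3-armed bandit: each z = f(x) is a Gaussian over rewards (Y = R).
arm_params = {0: (0.0, 1.0), 1: (0.5, 1.0), 2: (1.0, 1.0)}   # (mean, std) per arm

def greedy_bandit(m, rng):
    """Sample each arm once, then keep revisiting the arm with the best running mean."""
    totals = {a: 0.0 for a in arm_params}
    counts = {a: 0 for a in arm_params}
    history = []
    for t in range(m):
        x = t if t < len(arm_params) else max(totals, key=lambda a: totals[a] / counts[a])
        mu, sigma = arm_params[x]
        y = rng.gauss(mu, sigma)          # d^z(t): a sample of the Gaussian f(x)
        totals[x] += y
        counts[x] += 1
        history.append((x, y))
    return history

print(greedy_bandit(m=10, rng=random.Random(1)))
```
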
c) Example 3: Self-play is identical to the NFL scenario except that C depends on f. This dependence is based on a function A(d_m) mapping d_m to a subset of X. Intuitively, A specifies the details of the champion, based on the m repeated games and on the possible responses to the champion of an antagonist in a subsequent game. C is then based on this specification of the champion. Formally, it uses A to determine the quality of the search algorithm that generated d_m as follows:

C(d_m, f) = min_{x ∈ A(d_m)} E_f(x),    (4)

where E_f(x) is the expected value of the distribution of payoffs f(x). Intuitively, this measure is the worst possible payoff to the champion.

¹This requirement is just to "normalize" algorithms. In general, an algorithm that sometimes revisits points can outperform one that never does. Our requirement simply says that we are purely focusing on how well the algorithms choose new points, not how smart they are about whether to finish the search at t = m by sampling a new point or by returning to one already visited. See [WM97].

champion.
To
see in more detail how this describes self-play, assume
two players, with strategy spaces
X1
and
X2, X1
being the
strategy space of our champion. Take
IC
to be the joint strategy
of our players
in
any particular game, i.e.,
IC
E
X
=
X1
x
Xz.
So
d&
specifes the
m
strategies followed by our champion
(as
well
as
that of the other player) during the
m
training games.
dk
is the associated set of rewards to our champion, i.e., each
d"
(t)
is
a
sample of the distribution
f
(d"
(t)).
Let
21
E
X1
be the strategy our champion elects to follow
bawd
nn
th~
trriining
dzta.
Note
:hat
that
b'uategy
cai
be
represented
as
the set of all joint-strategies
x
whose Erst
component is
zl.
We adopt this representation, and write the
strategy chosen by our champion
-
the set of all
x's
consistent
with the champion's choice of strategy
x1
-
as
A(&)
C
X.
Say the antagonist our champion will now face is able
to choose the worst possible element of
X2
(as far as ex-
pected reward to our champion is concerned), given that our
champion chooses strategy
A(&).
If the antagonist does this
the expected reward to our champion is given by
C(dm,f)
as
deEned above. Obvious variants of
this
setup replace the
worst-case nature of
C
with some alternative, have
A
be
stochastic, etc. Whatever variant we choose, typically our
goal in self-play is to choose
a
andor
A
so
as
to maximize
E(C),
the expectation being over all possible
d,.
The fact
that
C
depends on
f
means that NFL need not apply. The
mathematical structure that replaces
NFL
is explored in the
following sections of this paper.
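As a concrete illustration (ours, not the paper's), the sketch below computes the worst-case payoff of Eq. (4) for a deterministic two-player payoff table: the champion commits to a strategy x_1, and C is the minimum payoff over every antagonist reply x_2. The names `payoff` and `worst_case_payoff` are hypothetical.

```python
# Hypothetical deterministic payoff table f(x1, x2) for a 2-player game.
payoff = {
    (1, 1): 0.5, (1, 2): 1.0,
    (2, 1): 1.0, (2, 2): 1.0,
}
X1 = {1, 2}   # champion's strategies
X2 = {1, 2}   # antagonist's replies

def worst_case_payoff(x1):
    """C(d_m, f) when A(d_m) fixes the champion's strategy to x1 (Eq. 4, deterministic f)."""
    return min(payoff[(x1, x2)] for x2 in X2)

# The best champion maximizes the worst-case payoff (a maximin choice).
best = max(X1, key=worst_case_payoff)
print({x1: worst_case_payoff(x1) for x1 in X1}, "best:", best)
```
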
d) Example 4: The basic description of self-play in the introduction looks like a special case of the more general biological coevolution scenario. However, in terms of our framework they are quite different.

In the general coevolution scenario there are a total of N agents (or players, or species, or genes, etc.). Their strategy spaces are written X_i, as in self-play. Now, though, X is extended beyond the current joint strategy to include the previous joint "population frequency" value. Formally, we write

x = (x_1, u_1) × ... × (x_N, u_N),    (5)

and interpret each u_i ∈ R as agent i's previous population frequency. As explained below, the reason for this extension of X is so that a can give the sequence of joint population frequencies that accompanies the sequence of joint strategies.

In the general coevolution scenario each z is a probability distribution over the possible current population frequencies of the agents. So given our definition of X, we interpret f as a map taking the previous joint population frequency, together with the current joint strategy of the agents, into a probability distribution over the possible current joint population frequencies of the agents; the probability of a current joint frequency y is obtained from the distribution f(x), i.e., P_f(y | x) = [f(x)](y).

As an example, in evolutionary game theory the joint strategy of the agents at any given t determines the change in each one's population frequency in that time-step. Accordingly, in the replicator dynamics of evolutionary game theory, f takes a joint strategy x_1 × ... × x_N and the values of all agents' previous population frequencies, and based on that determines the new value of each agent's population frequency.

As before, each d^z(t) contains the information coming out of f(d^x(t)). Here that information is the set of current population frequencies. The search algorithm a now plays two roles. One of these is to directly incorporate those current population frequencies into the {u_i} components of d^x(t+1). The other is, as before, to determine the joint strategy [x_1, ..., x_N] for time-step t+1. As in self-play, this strategy of each agent i is given by a (potentially stochastic and/or time-varying) function a_i. An application of a is given by the simultaneous operation of all those N distinct a_i on a common d_t, as well as the transfer of the joint population frequency from d^z(t), to produce d^x(t+1).

Note that the choice of joint strategy given by a may depend on the previous time-step's frequencies. As an example, this corresponds to sexual reproduction in which mating choices are random.² However, in the simplest version of evolutionary game theory the joint strategy is actually constant in time, with all the dynamics occurring via frequency updating in f. If the agents are identified with distinct genomes, then in this version reproduction is parthenogenetic.

Finally, C is now a vector with N components, each component j only depending on the associated d_m^z(j). In general, in biological coevolution scenarios (e.g., evolutionary game theory) there is no notion of a champion being produced by the search and subsequently pitted against an antagonist in a "bake-off". Accordingly, there is no particular significance to results for C's that depend on f.

This means that so long as we make the approximation, reasonable in real biological systems, that x's are never revisited, all of the requirements of Example 1 are met. This means that NFL applies. So in particular, say we restrict attention to the particular kinds of a of evolutionary game theory. Then any two choices of a (any two sets of strategy-making rules {a_i}) perform just as well as one another, averaged over all f's. More generally, we can consider other kinds of a as well, and the result still holds.

In Example 3 of section II-B we introduced the self-play model. In the remainder of this paper we show how free lunches may arise in this setting, and quantify the a priori differences between certain self-play algorithms. For expository simplicity, we modify the definitions introduced in the framework to tailor them for self-play.

²Obvious elaborations of the framework allow z to include relative rewards from the preceding round, as well as frequencies. This allows mate selection to be based on current differential fitness, as well as on overall frequency in the population.

III. APPLICATION TO SELF-PLAY

In self-play, agents (or game strategies) are paired against each other in a (perhaps stochastically formed) sequence to generate a set of 2-player games. After m distinct training games between an agent and its opponents, the agent enters a competition. Performance of the agent is measured with a payoff function. As shorthand, the (here deterministic) payoff function when the ith agent plays move (strategy) x̲_i and its opponent plays x̄_i is written as f_i(x̲_i, x̄_i). If we indicate the joint move of i and its opponent as x_i = (x̲_i, x̄_i), we can write the payoff to agent i as f_i(x_i). In the following we make no assumption about the structure of moves except that they are finite. x̲ might represent a sequence of plays representing an entire game of checkers, and x̄ might represent a complete set of opponent responses to each play. The payoff function f(x̲, x̄) might then represent the outcome of the game as +1 for a win for i, 0 for a draw, and −1 for a loss. Illegal joint moves can be eliminated by appropriately limiting the space of moves and opponent responses in order to satisfy the rules of the game. In other applications, x̲ might represent an algorithm to sort a list and x̄ a mutable set of lists to be sorted. The payoff would then reflect the ability of the algorithm to sort those lists in x̄.

We define the payoff for agent i playing move x̲_i, independent of an opponent's reply, as the least payoff over all possible opponent responses (a minimax criterion): g_i(x̲_i) ≡ min_{x̄_i} f_i(x̲_i, x̄_i). With this criterion, the best move an agent can make is the move which maximizes g_i, so that its performance in competition (over all possible opponents) will be as good as possible. We are not interested in search strategies just across i's possible moves, but more generally across all joint moves of i and its opponents. (Note that whether that opponent varies or not is irrelevant, since we are setting its moves.) The ultimate goal is to maximize i's minimax performance g_i.

We make one important observation. In general, using a random pairing strategy in the training phase will not result in a training set that can be used to guarantee that any particular move in the competition is better than the worst possible move. The only way to ensure an outcome guaranteed to be better than the worst possible is to exhaustively explore all possible responses to a move x̲, and then determine that the worst value of f_i over all such joint moves is better than the worst value for some other move x̲'. To do this certainly requires that m is greater than the total number of possible moves by the opponent; and even for very large m, unless all possible opponent responses have been explored we cannot make any such guarantees.

Pursuing this observation further, consider the situation where we know, through exhaustive sampling of the opponent, that the worst possible payoff for some move x̲ is g(x̲), and that another joint move x' = (x̲', x̄') with x̲' ≠ x̲ results in a payoff f(x') < g(x̲). In this case there is no need to explore other opponent responses to x̲', since it must be that g(x̲') < g(x̲), i.e. x̲' is minimax inferior to x̲. Thus, considering strategies for searching across the space of joint moves, any algorithm that avoids searching regions which are known to be minimax inferior (as above) will be more efficient than one which searches these regions (e.g. random search). This applies for all f_i, and so the smarter algorithm will have better expected performance than the dumb algorithm. Very roughly speaking, this result avoids the NFL implications because uniformly varying over all f_i does not uniformly vary over all possible g_i, which are the functions that ultimately determine performance.

In the following sections we explore this observation further.

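A small sketch (our own, with hypothetical names) of the pruning idea behind this observation: once some move's worst-case payoff g(x) is known exactly through exhaustive sampling of the opponent, any other move that has already produced a sampled payoff below g(x) is minimax inferior and can be skipped by the training search.

```python
def minimax_inferior_moves(samples, exhausted_move, num_responses):
    """samples: dict mapping (move, response) -> observed payoff.
    exhausted_move: a move whose full set of responses has been sampled."""
    replies = [y for (x, _), y in samples.items() if x == exhausted_move]
    assert len(replies) == num_responses, "exhausted_move must be fully sampled"
    g_bound = min(replies)                       # exact worst-case payoff g(exhausted_move)
    inferior = set()
    for (x, _), y in samples.items():
        if x != exhausted_move and y < g_bound:  # some reply already drives x below the bound
            inferior.add(x)                      # so the min over replies to x is below it too
    return g_bound, inferior

# Toy usage with a hypothetical 3-move, 2-response game:
samples = {(1, 1): 0.6, (1, 2): 0.8,             # move 1 exhausted: g(1) = 0.6
           (2, 1): 0.3,                          # move 2 already below 0.6: prune it
           (3, 1): 0.9}                          # move 3 still undecided
print(minimax_inferior_moves(samples, exhausted_move=1, num_responses=2))
```
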
A. Definitions

As much as possible we follow the notation of [WM97], extending it where necessary. That paper should be consulted as motivation for the analysis framework we employ. Without loss of generality we now consider two-player games, and leave the agent index i implicit. If there are l moves available to an agent, we label these by x̲ ∈ X̲ = [1, ..., l]. For each such move we assume the opponent may choose from one of ℓ(x̲) possible moves forming the space X̄(x̲).³ For simplicity we will take X̄(x̲) to be independent of x̲. Consequently, the size of the joint move space is |X| = Σ_{x̲=1}^{l} ℓ(x̲). If the training period consists of m distinct joint moves, even with m as large as |X| − l we cannot guarantee that the agent won't choose the worst possible move in the competition, as the worst possible move could be the opponent response that was left unexplored for each of the l possible moves.

In [WM97] a population is a sample of distinct points from the input space X and their corresponding fitness values. In this coevolutionary context the notion of a population of sampled configurations needs to be extended to include opponent responses. For simplicity we assume that fitness payoffs are a deterministic function of joint moves. Thus, rather than the more general output space Z, we assume payoff values lie in a finite totally ordered space. Consequently, the fitness function is the mapping f : X → Y, where X, the space of joint moves, consists of all pairs (x̲, x̄) with x̄ ∈ X̄(x̲).

As in the general framework, a population of size m is represented as d_m = {(d_m^X(i), d_m^Y(i))}_{i=1}^m, where d_m^X(i) = (d_m^x̲(i), d_m^x̄(i)) and d_m^Y(i) = f(d_m^x̲(i), d_m^x̄(i)), and i ∈ [1, ..., m] labels the samples taken. In the above definition d_m^x̲(i) is the ith move made by the agent, d_m^x̄(i) is the opponent response, and d_m^Y(i) is the corresponding payoff. As usual we assume that no joint configurations are revisited.

A particular coevolutionary optimization task is specified by defining the payoff function that is to be extremized. As discussed in [WM97], a class of problems is defined by specifying a probability density P(f) over the space of possible payoff functions. As long as both X and Y are finite (as they are in any computer implementation) this is straightforward.

In addition to this extended notion of a population, there is an additional consideration in the coevolutionary setting, namely the decision of what move to make in the competition based upon the results of the training population. Formally, we encapsulate the process of making this decision as 𝒜.⁴ 𝒜 consists of a set of distributions (one for each m, since we would like 𝒜 to select a move regardless of the size of the training set) of the form P(x̲ ∈ X̲ | d_m). If 𝒜 deterministically returns a single move, we indicate the mapping from training population to move as 𝒜(d_m). To summarize, the definition of a search method is extended for self-play to include:

- A search rule a, which determines the manner in which a population is expanded during training and is formally given by the set of distributions {P_t(d^X(t) | d_{t−1})}_{t=1}^m. This corresponds exactly to the definition of a search algorithm used in [WM97] for non-coevolutionary optimization.

- A move-choosing rule 𝒜, mapping probabilistically or deterministically to the single move used in the competition. We write 𝒜 explicitly as the probability density 𝒜(x̲ | d_m), where x̲ ∈ X̲. For deterministic 𝒜 we write the density as 𝒜(x̲ | d_m) = δ(x̲ − 𝒜(d_m)).

The tuple (a, 𝒜) is called a search process (as opposed to a search algorithm in [WM97]).

The search process seeks a strategy that will perform well in competition. If 𝒜 is deterministic, the natural measure of the performance of the search process (a, 𝒜) obtained during training is C = min_{i∈[1,m]} f(𝒜(d_m), d^x̄(i)). (If 𝒜 is not deterministic then we use the weighted average Σ_{x̲} min_{i∈[1,m]} f(x̲, d^x̄(i)) 𝒜(x̲ | d_m).) The best (a, 𝒜) for a particular f are those which maximize C.

³Note that the space of opponent moves varies with x̲. This is the typical situation in applications to games with complex rules (e.g. checkers).

⁴The notation 𝒜 is meant to suggest that, unlike the A(d_m) function introduced earlier, 𝒜 defines only the champion's move, and not the possible responses to this move.

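Read operationally, a search process is the pair (a, 𝒜): the rule a grows the training population and the rule 𝒜 picks the competition move, whose training-based score is C = min_{i∈[1,m]} f(𝒜(d_m), d^x̄(i)). The sketch below (our own; `a_random`, `A_greedy` and the toy payoff table are hypothetical) wires those two pieces together.

```python
import random

def run_search_process(a, A, f, m, rng):
    """Train with search rule `a` for m joint moves, then choose a move with rule `A`
    and score it by the worst payoff against the opponent responses seen in training."""
    d = []                                            # population of ((move, response), payoff)
    for _ in range(m):
        x, xbar = a(d, rng)                           # search rule: pick next joint move
        d.append(((x, xbar), f(x, xbar)))
    champion = A(d)                                   # move-choosing rule
    return champion, min(f(champion, xbar) for (_, xbar), _ in d)  # C = min_i f(A(d_m), d^x̄(i))

# Hypothetical ingredients for a 2-move / 2-response game:
MOVES, RESPONSES = (1, 2), (1, 2)
f = lambda x, xbar: {(1, 1): 0.5, (1, 2): 1.0, (2, 1): 1.0, (2, 2): 1.0}[(x, xbar)]

def a_random(d, rng):                                 # search rule: unvisited joint move at random
    seen = {jm for jm, _ in d}
    return rng.choice([(x, xb) for x in MOVES for xb in RESPONSES if (x, xb) not in seen])

def A_greedy(d):                                      # move rule: maximize the worst payoff seen so far
    worst = {}
    for (x, xbar), y in d:
        worst[x] = min(worst.get(x, 1.0), y)
    return max(worst, key=worst.get)

print(run_search_process(a_random, A_greedy, f, m=3, rng=random.Random(0)))
```
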
The traditional version of NFL (for traditional optimization) defines the performance differently, since there is no opponent. In the simplest case the performance of a (recall that there is no choosing algorithm) might be measured as C = max_{i∈[1,m]} d_m^Y(i). One traditional NFL result states that the average performance of any pair of algorithms is identical; formally, Σ_f P(C | f, m, a) is independent of a.⁵ A natural extension of this previous result considers a non-uniform average over fitness functions. In this case the quantity of interest is Σ_f P(C | f, m, a) P(f), where P(f) weights different fitness functions. NFL results can be proven for other non-uniform P(f) [SVW01].

A result akin to this one in the self-play setting would state that the uniform average Σ_f P(C | f, m, a, 𝒜) is independent of a and 𝒜. However, as we have informally seen, such a result cannot hold in general, since a search process with an a that exhausts an opponent's repertoire of moves has better guarantees than other search processes. A formal proof of this statement is presented in the next section.

⁵Actually far more can be said, and the reader is encouraged to consult [WM97] for details.

IV. AN INTUITIVE EXAMPLE

Before proving the existence of free lunches we give a motivating example, both to illustrate the definitions made in the above section and to show why we might expect free lunches to exist. Consider the concrete case where the player has two possible moves, i.e. X̲ = {1, 2}, the opponent has two responses for each of these moves, i.e. X̄ = {1, 2}, and there are only two possible payoff values, i.e. Y = {1/2, 1}. In this simple case there are 16 possible functions, and these are listed in Table I.

TABLE I
Exhaustive enumeration of all possible functions f(x̲, x̄) and g(x̲) = min_x̄ f(x̲, x̄) for X̲ = {1, 2}, X̄ = {1, 2}, and Y = {1/2, 1}. The payoff functions marked with * are those consistent with the population d_2 = {(1, 2; 1/2), (2, 2; 1)}.

 (x̲, x̄)   (1,1)  (1,2)  (2,1)  (2,2)   g(1)  g(2)
 f1         1/2    1/2    1/2    1/2     1/2   1/2
 f2         1      1/2    1/2    1/2     1/2   1/2
 f3         1/2    1      1/2    1/2     1/2   1/2
 f4         1      1      1/2    1/2     1     1/2
 f5         1/2    1/2    1      1/2     1/2   1/2
 f6         1      1/2    1      1/2     1/2   1/2
 f7         1/2    1      1      1/2     1/2   1/2
 f8         1      1      1      1/2     1     1/2
 f9  *      1/2    1/2    1/2    1       1/2   1/2
 f10 *      1      1/2    1/2    1       1/2   1/2
 f11        1/2    1      1/2    1       1/2   1/2
 f12        1      1      1/2    1       1     1/2
 f13 *      1/2    1/2    1      1       1/2   1
 f14 *      1      1/2    1      1       1/2   1
 f15        1/2    1      1      1       1/2   1
 f16        1      1      1      1       1     1

We can see that in this simple example the minimax criterion gives a very biased distribution over possible performance measures: 9/16 of the functions have g = [1/2 1/2], 3/16 have g = [1/2 1], 3/16 have g = [1 1/2], and 1/16 have g = [1 1], where g = [g(x̲=1) g(x̲=2)]. If we consider a particular population, say d_2 = {(1, 2; 1/2), (2, 2; 1)}, the payoff functions that are consistent with this population are f9, f10, f13, and f14, and the corresponding distribution over g functions is 1/2 [1/2 1/2]ᵀ and 1/2 [1/2 1]ᵀ. Given the fact that any population will give a biased sample over g functions, it may not be surprising that there are free lunches. We might expect that an algorithm which is able to exploit this biased sample would perform uniformly better than another algorithm which does not exploit the biased sample of g's. In the next section we prove the existence of free lunches by constructing such a pair of algorithms.

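The numbers above are easy to verify by brute force; the short script below (our own check, not from the paper) enumerates all 16 payoff functions on the 2x2 joint-move space, tabulates the distribution of g, and lists the functions consistent with the population d_2 = {(1, 2; 1/2), (2, 2; 1)}.

```python
from itertools import product
from collections import Counter

moves, responses, Y = [1, 2], [1, 2], [0.5, 1.0]
joint = [(x, xb) for x in moves for xb in responses]

# All 16 payoff functions f : joint moves -> Y.
functions = [dict(zip(joint, values)) for values in product(Y, repeat=len(joint))]

def g(f):
    """Minimax value g(x) = min over opponent responses of f(x, x_bar)."""
    return tuple(min(f[(x, xb)] for xb in responses) for x in moves)

print(Counter(g(f) for f in functions))          # the 9/16, 3/16, 3/16, 1/16 split

d2 = [((1, 2), 0.5), ((2, 2), 1.0)]               # the population from Table I
consistent = [f for f in functions if all(f[x] == y for x, y in d2)]
print(len(consistent), Counter(g(f) for f in consistent))
```
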
V. PROOF OF FREE LUNCHES

In this section a proof is presented that there are free lunches for self-play, by constructing a pair of search processes one of which explicitly has performance equal to or better than the other for all possible payoff functions f. As in earlier NFL work we assume that both |X| and |Y| are finite.⁶ For convenience, and with no loss in generality, we normalize the possible Y values so that they are equal to 1/|Y|, 2/|Y|, ..., 1.

The pair of processes we construct use the same search rule a (it is not important in the present context what a is) but different deterministic move-choosing rules 𝒜. In both cases a Bayesian estimate, based on uniform P(f) and the d_m at hand, is made of the expected value of g(x̲) = min_x̄ f(x̲, x̄) for each x̲. Since we are striving to maximize the worst possible payoff from f, the optimal search process selects the move which maximizes this expected value while the worst process selects the move which minimizes this value.

More formally, if E(C | d_m, a, 𝒜) differs for the two choices of 𝒜, always being higher for one of them, then E(C | m, a, 𝒜) = Σ_{d_m} P(d_m | a) E(C | d_m, 𝒜) differs for the two 𝒜. In turn, E(C | m, a, 𝒜) = Σ_{f, d_m} [P(C | f, m, a, 𝒜) × P(f)] ∝ Σ_{f, d_m} P(C | f, m, a, 𝒜) for the uniform prior P(f). Since this differs for the two 𝒜, so must Σ_f P(C | f, m, a, 𝒜).

Let ĝ(x̲) be a random variable representing the value of g(x̲) conditioned on d_m and x̲, i.e. it equals the worst possible payoff (to the agent) after the agent makes move x̲ and the opponent replies. Its distribution is P(ĝ(x̲) | x̲, d_m) = Σ_f P(ĝ(x̲) | x̲, d_m, f) P(f) for uniform P(f). In the example of section IV we have E ĝ(1) = 1/2 and E ĝ(2) = 3/4.

To determine the expected value of ĝ(x̲) we need to know P(f). Of the entire population d_m, only the subset sampled at x̲ is relevant. We assume that there are k(x̲, d_m) ≤ m such values.⁷ Since we are concerned with the worst possible opponent response, let r(x̲, d_m) be the minimal Y value obtained over the k(x̲, d_m) responses to x̲, i.e. r(x̲, d_m) = min_i d^Y(i), the minimum being taken over the samples at x̲. Since payoff values are normalized to lie between 0 and 1, 0 < r(x̲, d_m) ≤ 1. Given k(x̲, d_m) and r(x̲, d_m), P(ĝ | x̲, d_m) is otherwise independent of x̲ and d_m, and so we indicate the desired probability as π_{k,r}(ĝ).

In appendix A we derive the probability π_{k,r} in the case where all Y values are distinct (we do so because this results in a particularly simple expression for the expected value of ĝ) and in the case where Y values are not forced to be distinct. From these densities the expected value of ĝ(x̲) can be determined. In the case where Y values are not forced to be distinct there is no closed form for the expectation. However, in the continuum limit where |Y| → ∞ we find a closed-form expression in terms of k(x̲, d_m) and r(x̲, d_m) (see appendix B), where we have explicitly noted that both k and r depend on the move x̲ as well as on the training population d_m. As shorthand we define C_m(x̲) ≡ E(ĝ(x̲) | x̲, d_m).

The best move given the training population is the deterministic choice 𝒜_best(d_m) = argmax_{x̲} C_m(x̲) and the worst is 𝒜_worst(d_m) = argmin_{x̲} C_m(x̲). In the example of section IV with the population of size 2, 𝒜_best(d_2) = 2 and 𝒜_worst(d_2) = 1. As long as C_m(x̲) is not constant (which will usually be the case since the r values will differ), (a, 𝒜_best) and (a, 𝒜_worst) will differ, and the expected performance of 𝒜_best will be superior. This proves that the expected performance over all payoff functions of algorithm (a, 𝒜_best) is greater than that of algorithm (a, 𝒜_worst).

⁶Recall that |X| = Σ_{x̲} ℓ(x̲).

⁷Of course, we must also have k(x̲, d_m) ≤ ℓ(x̲) for populations d_m.

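For the finite setting of Table I the Bayesian estimate can be computed by brute force: average g(x̲) over every payoff function consistent with the training population (a uniform P(f)), then take the argmax/argmin as 𝒜_best/𝒜_worst. The script below (our own check, reusing the enumeration idea from Section IV) reproduces E ĝ(1) = 1/2, E ĝ(2) = 3/4, 𝒜_best(d_2) = 2 and 𝒜_worst(d_2) = 1.

```python
from itertools import product

moves, responses, Y = [1, 2], [1, 2], [0.5, 1.0]
joint = [(x, xb) for x in moves for xb in responses]
functions = [dict(zip(joint, v)) for v in product(Y, repeat=len(joint))]   # uniform P(f)

def expected_g(x, population):
    """Bayesian estimate C_m(x) = E[g-hat(x) | d_m] under a uniform prior over payoff tables."""
    consistent = [f for f in functions if all(f[jm] == y for jm, y in population)]
    return sum(min(f[(x, xb)] for xb in responses) for f in consistent) / len(consistent)

d2 = [((1, 2), 0.5), ((2, 2), 1.0)]
Cm = {x: expected_g(x, d2) for x in moves}
A_best = max(Cm, key=Cm.get)
A_worst = min(Cm, key=Cm.get)
print(Cm, "best:", A_best, "worst:", A_worst)   # {1: 0.5, 2: 0.75} best: 2 worst: 1
```
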
VI. OTHER FREE LUNCHES

We have shown the existence of free lunches for self-play by constructing a pair of algorithms with the same search rule

REFERENCES

[Hol75] J. H. Holland, Adaptation in Natural and Artificial Systems. University of Michigan Press, 1975.
[MS82] J. Maynard Smith, Evolution and the Theory of Games. Cambridge University Press, 1982.
[WM97] D. H. Wolpert and W. G. Macready, "No free lunch theorems for optimization," IEEE Transactions on Evolutionary Computation, vol. 1, no. 1, pp. 67-82, 1997.