Coevolutionary Free Lunches

David H. Wolpert, dhw@email.arc.nasa.gov
William G. Macready, wgm@email.arc.nasa.gov
NASA Ames Research Center, Moffett Field, CA 94035

Abstract-Recent work on the foundations of optimization has begun to uncover its underlying rich structure. In particular, the "No Free Lunch" (NFL) theorems state that any two algorithms are equivalent when their performance is averaged across all possible problems. This highlights the need for exploiting problem-specific knowledge to achieve better than random performance. In this paper we present a general framework covering most search scenarios. In addition to the optimization scenarios addressed in the NFL results, this framework covers multi-armed bandit problems and evolution of multiple co-evolving agents. As a particular instance of the latter, it covers "self-play" problems. In these problems the agents work together to produce a champion, who then engages one or more antagonists in a subsequent multi-player game. In contrast to the traditional optimization case where the NFL results hold, we show that in self-play there are free lunches: in coevolution some algorithms have better performance than other algorithms, averaged across all possible problems. However, in the typical coevolutionary scenarios encountered in biology, where there is no champion, NFL still holds.

I. INTRODUCTION

Optimization algorithms have proven to be valuable in almost every setting where quantitative figures of merit are available. Recently, the mathematical foundations of optimization have begun to be uncovered [WM97], [MW96], [FHS01], [WM01], [CO01]. One particular result in this work, the "No Free Lunch" (NFL) theorems, establishes the equivalent performance of all optimization algorithms when averaged across all possible problems [WM97]. Numerous works have extended these early results and considered their application to different types of optimization (e.g. to multi-objective optimization [CK03]). The web site www.no-free-lunch.org offers a list of recent references.

However, all previous work has been cast in a limited manner that does not cover repeated-game scenarios where the figure of merit can vary based on the response of another player. In particular, the NFL theorems do not cover such scenarios. These game-like scenarios are usually called "coevolutionary" since they involve the behaviors of more than a single agent or player [FWH].

One important example of coevolution is "self-play", where the players cooperate to train one of them as a champion. That champion is then pitted against an antagonist in a subsequent multi-player game. The goal is to train that champion to perform as well as possible in that subsequent game. For a checkers example see [CF99]. We will refer to all players other than the one of direct attention as that player's "opponents", even when (as in self-play) the players are actually cooperating. (Sometimes when discussing self-play we will refer to the specific opponent to be faced by a champion in a subsequent game, an opponent not under our control, as the champion's "antagonist".)

Coevolution can also be used for problems that on the surface appear to have no connection to a game (for an early application to sorting networks see [Hil92]). Coevolution in these cases enables escape from poor local optima in favor of better local optima.

In this paper we first present a mathematical framework that covers both traditional optimization and coevolutionary scenarios. (It also covers other scenarios like multi-armed bandits.) We then use that framework to explore the differences between traditional optimization and coevolution. We find dramatic differences between the two scenarios. In particular, unlike the fundamental NFL result for traditional optimization, in the self-play domain there are algorithms which are superior to other algorithms for all problems. However, in the typical coevolutionary scenarios encountered in biology, where there is no champion, NFL still holds.

II. GENERAL FRAMEWORK

In this section we present a general framework and illustrate it on two examples. Despite its substantially greater breadth of applicability, the formal structure of this framework is only a slight extension of that used in [WM97].

A. Formal framework specification

Say we have two spaces, X and Z. To guide the intuition, a typical scenario might have x ∈ X be the joint strategy followed by our player(s), and z ∈ Z be the probability distribution over some space of possible rewards/payoffs to the champion, or over possible figures of merit, or some such. In addition to X and Z, we also have a fitness function

f : X → Z.    (1)

In the example where z is a probability distribution over rewards, f can be viewed as the specification of an x-conditioned probability distribution of rewards.

We have a total of m time-steps, and represent the information generated through those time-steps as

d_m = (d_m^x, d_m^z) = ({d^x(t)}_{t=1}^m, {d^z(t)}_{t=1}^m).    (2)

Each d^x(t) is a particular x ∈ X. Each d^z(t) is a (perhaps stochastic) function of f(d^x(t)). For example, say the z's, the values of f(x), are probability distributions over reward values. Then d^z(t) could consist of the full distribution f(d^x(t)). Alternatively, it could consist of a moment of that distribution, or even a random sample of it. In general, we allow the function specifying d^z(t) to vary with t, although that freedom will not be exploited here. As shorthand we will write d(t) to mean the pair (d^x(t), d^z(t)).

A search algorithm a is an initial distribution P_1(d^x(1)), together with a set of m − 1 separate conditional distributions P_t(d^x(t) | d_{t−1}), t = 2, ..., m. Such an algorithm specifies what x to choose, based on the information uncovered so far, for any time-step t.

Finally, we have a vector-valued cost function C(d_m, f), which we use to assess the performance of the algorithm. Often our goal is to find the a that will maximize E(C), for a particular choice of how to form the d^z(t)'s.

The NFL theorems concern averages over all f of quantities involving C. For those theorems to hold, i.e., for f-averages of C to be independent of the search algorithm, it is crucial that C does not depend on f. (The framework in [WM97] defines cost functions as real-valued functions of d_m.) When that independence is relaxed, the NFL theorems need not hold. Such relaxation occurs in self-play, for example, and is how one can have free lunches in self-play. This paper explores this phenomenon.

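To make the framework concrete, here is a minimal Python sketch (our own illustration, not code from the paper) that runs a generic search algorithm a for m steps against a fitness function f and then scores the resulting trace d_m with a cost function C. The particular `fitness`, `random_nonrevisiting`, and `cost` below are hypothetical stand-ins.

```python
import random

def run_search(algorithm, fitness, m, rng):
    """Generate the trace d_m = ({d^x(t)}, {d^z(t)}) by running `algorithm` for m steps."""
    d_x, d_z = [], []
    for t in range(m):
        x = algorithm(d_x, d_z, rng)   # draws d^x(t) from P_t(d^x(t) | d_{t-1})
        z = fitness(x)                 # d^z(t) is derived from f(d^x(t))
        d_x.append(x)
        d_z.append(z)
    return d_x, d_z

# Hypothetical stand-ins for f, a and C (not from the paper):
def fitness(x):                        # f : X -> Z, here a deterministic payoff
    return (7 * x + 3) % 5

def random_nonrevisiting(d_x, d_z, rng, X=range(20)):
    return rng.choice([x for x in X if x not in d_x])   # never revisit (as in Example 1 below)

def cost(d_x, d_z):                    # a cost depending only on the trace d_m, as NFL requires
    return max(d_z)

d_x, d_z = run_search(random_nonrevisiting, fitness, m=5, rng=random.Random(0))
print(cost(d_x, d_z))
```
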
B. Examples of the framework

a) Example 1: One example of this framework is the scenario considered in the NFL theorems. There each z is a probability distribution over a space Y ⊆ R. For convenience we take X and Y countable. Each d^z(t) is a sample of the associated distribution z(t) = f(d^x(t)). The search algorithm is constrained so that

P_t(d^x(t) = x | d_{t−1}) = 0  ∀ x ∈ d^x_{t−1},    (3)

i.e., so that the search never revisits points already sampled.¹ Finally, C(d_m, f) is allowed to be any scalar-valued function that depends on d_m exclusively.

The NFL theorems apply to any scenario meeting these specifications.

b) Example 2: Another example is the multi-armed bandit problem, introduced for optimization by Holland [Hol75] and thoroughly analyzed in [MW98]. The scenario for that problem is identical to that for the NFL results, except that there are no constraints on the search algorithm, Y = R, and every z is a Gaussian. The fact that revisits are allowed means that NFL need not apply.

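As a concrete, hypothetical instance of this bandit scenario, the sketch below draws Gaussian rewards and is free to revisit arms; unlike Example 1, the same x may be sampled many times. The arm parameters and the simple greedy rule are our own illustrative choices, not anything prescribed by the paper.

```python
import random

# Hypothetical 3-armed bandit: each z = f(x) is a Gaussian over rewards (Y = R).
arm_params = {0: (0.0, 1.0), 1: (0.5, 1.0), 2: (1.0, 1.0)}   # (mean, std) per arm

def greedy_bandit(m, rng):
    """Sample each arm once, then keep revisiting the arm with the best running mean."""
    totals = {a: 0.0 for a in arm_params}
    counts = {a: 0 for a in arm_params}
    history = []
    for t in range(m):
        x = t if t < len(arm_params) else max(totals, key=lambda a: totals[a] / counts[a])
        mu, sigma = arm_params[x]
        y = rng.gauss(mu, sigma)          # d^z(t): a sample of the Gaussian f(x)
        totals[x] += y
        counts[x] += 1
        history.append((x, y))
    return history

print(greedy_bandit(m=10, rng=random.Random(1)))
```
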
c) Example 3: Self-play is identical to the NFL scenario except that C depends on f. This dependence is based on a function A(d_m) mapping d_m to a subset of X. Intuitively, A specifies the details of the champion, based on the m repeated games and on the possible responses to the champion of an antagonist in a subsequent game. C is then based on this specification of the champion. Formally, it uses A to determine the quality of the search algorithm that generated d_m as follows:

C(d_m, f) = min_{x ∈ A(d_m)} E_f(x),    (4)

where E_f(x) is the expected value of the distribution of payoffs f(x). Intuitively, this measure is the worst possible payoff to the champion.

¹This requirement is just to "normalize" algorithms. In general, an algorithm that sometimes revisits points can outperform one that never does. Our requirement simply says that we are purely focusing on how well the algorithms choose new points, not how smart they are about whether to finish the search at t = m by sampling a new point or by returning to one already visited. See [WM97].

champion.
To
see in more detail how this describes self-play, assume
two players, with strategy spaces
X1
and
X2, X1
being the
strategy space of our champion. Take
IC
to be the joint strategy
of our players
in
any particular game, i.e.,
IC
E
X
=
X1
x
Xz.
So
d&
specifes the
m
strategies followed by our champion
(as
well
as
that of the other player) during the
m
training games.
dk
is the associated set of rewards to our champion, i.e., each
d"
(t)
is
a
sample of the distribution
f
(d"
(t)).
Let
21
E
X1
be the strategy our champion elects to follow
bawd
nn
th~
trriining
dzta.
Note
:hat
that
b'uategy
cai
be
represented
as
the set of all joint-strategies
x
whose Erst
component is
zl.
We adopt this representation, and write the
strategy chosen by our champion
-
the set of all
x's
consistent
with the champion's choice of strategy
x1
-
as
A(&)
C
X.
Say the antagonist our champion will now face is able
to choose the worst possible element of
X2
(as far as ex-
pected reward to our champion is concerned), given that our
champion chooses strategy
A(&).
If the antagonist does this
the expected reward to our champion is given by
C(dm,f)
as
deEned above. Obvious variants of
this
setup replace the
worst-case nature of
C
with some alternative, have
A
be
stochastic, etc. Whatever variant we choose, typically our
goal in self-play is to choose
a
andor
A
so
as
to maximize
E(C),
the expectation being over all possible
d,.
The fact
that
C
depends on
f
means that NFL need not apply. The
mathematical structure that replaces
NFL
is explored in the
following sections of this paper.
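As a concrete illustration (ours, not the paper's), the sketch below computes the worst-case payoff of Eq. (4) for a deterministic two-player payoff table: the champion commits to a strategy x_1, and C is the minimum payoff over every antagonist reply x_2. The names `payoff` and `worst_case_payoff` are hypothetical.

```python
# Hypothetical deterministic payoff table f(x1, x2) for a 2-player game.
payoff = {
    (1, 1): 0.5, (1, 2): 1.0,
    (2, 1): 1.0, (2, 2): 1.0,
}
X1 = {1, 2}   # champion's strategies
X2 = {1, 2}   # antagonist's replies

def worst_case_payoff(x1):
    """C(d_m, f) when A(d_m) fixes the champion's strategy to x1 (Eq. 4, deterministic f)."""
    return min(payoff[(x1, x2)] for x2 in X2)

# The best champion maximizes the worst-case payoff (a maximin choice).
best = max(X1, key=worst_case_payoff)
print({x1: worst_case_payoff(x1) for x1 in X1}, "best:", best)
```
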
d) Example 4: The basic description of self-play in the introduction looks like a special case of the more general biological coevolution scenario. However, in terms of our framework they are quite different.

In the general coevolution scenario there are a total of N agents (or players, or species, or genes, etc.). Their strategy spaces are written X_i, as in self-play. Now, though, X is extended beyond the current joint strategy to include the previous joint "population frequency" value. Formally, we write

x = (x_1, u_1) × ... × (x_N, u_N),    (5)

and interpret each u_i ∈ R as agent i's previous population frequency. As explained below, the reason for this extension of X is so that a can give the sequence of joint population frequencies that accompanies the sequence of joint strategies.

In the general coevolution scenario each z is a probability distribution over the possible current population frequencies of the agents. So given our definition of X, we interpret f as a map taking the previous joint population frequency, together with the current joint strategy of the agents, into a probability distribution over the possible current joint population frequencies of the agents; the probability of a current joint frequency y is obtained from the distribution f(x), i.e., P_f(y | x) = [f(x)](y).

As an example, in evolutionary game theory the joint strategy of the agents at any given t determines the change in each one's population frequency in that time-step. Accordingly, in the replicator dynamics of evolutionary game theory, f takes a joint strategy x_1 × ... × x_N and the values of all agents' previous population frequencies, and based on that determines the new value of each agent's population frequency.

As before, each d^z(t) contains the information coming out of f(d^x(t)). Here that information is the set of current population frequencies. The search algorithm a now plays two roles. One of these is to directly incorporate those current population frequencies into the {u_i} components of d^x(t+1). The other is, as before, to determine the joint strategy [x_1, ..., x_N] for time-step t+1. As in self-play, this strategy of each agent i is given by a (potentially stochastic and/or time-varying) function a_i. An application of a is given by the simultaneous operation of all those N distinct a_i on a common d_t, as well as the transfer of the joint population frequency from d^z(t), to produce d^x(t+1).

Note that the choice of joint strategy given by a may depend on the previous time-step's frequencies. As an example, this corresponds to sexual reproduction in which mating choices are random.² However, in the simplest version of evolutionary game theory the joint strategy is actually constant in time, with all the dynamics occurring via frequency updating in f. If the agents are identified with distinct genomes, then in this version reproduction is parthenogenetic.

Finally, C is now a vector with N components, each component j only depending on the associated d_m^z(j). In general, in biological coevolution scenarios (e.g., evolutionary game theory) there is no notion of a champion being produced by the search and subsequently pitted against an antagonist in a "bake-off". Accordingly, there is no particular significance to results for C's that depend on f.

This means that so long as we make the approximation, reasonable in real biological systems, that x's are never revisited, all of the requirements of Example 1 are met. This means that NFL applies. So in particular, say we restrict attention to the particular kinds of a of evolutionary game theory. Then any two choices of a (any two sets of strategy-making rules {a_i}) perform just as well as one another, averaged over all f's. More generally, we can consider other kinds of a as well, and the result still holds.

In Example 3 of section II-B we introduced the self-play model. In the remainder of this paper we show how free lunches may arise in this setting, and quantify the a priori differences between certain self-play algorithms. For expository simplicity, we modify the definitions introduced in the framework to tailor them for self-play.

²Obvious elaborations of the framework allow z to include relative rewards from the preceding round, as well as frequencies. This allows mate selection to be based on current differential fitness, as well as on overall frequency in the population.

III. APPLICATION TO SELF-PLAY

In self-play, agents (or game strategies) are paired against each other in a (perhaps stochastically formed) sequence to generate a set of 2-player games. After m distinct training games between an agent and its opponents, the agent enters a competition. Performance of the agent is measured with a payoff function. As shorthand, the (here deterministic) payoff function when the ith agent plays move (strategy) x̲_i and its opponent plays x̄_i is written as f_i(x̲_i, x̄_i). If we indicate the joint move of i and its opponent as x_i = (x̲_i, x̄_i), we can write the payoff to agent i as f_i(x_i). In the following we make no assumption about the structure of moves except that they are finite. x̲ might represent a sequence of plays representing an entire game of checkers, and x̄ might represent a complete set of opponent responses to each play. The payoff function f(x̲, x̄) might then represent the outcome of the game as +1 for a win for i, 0 for a draw, and −1 for a loss. Illegal joint moves can be eliminated by appropriately limiting the space of moves and opponent responses in order to satisfy the rules of the game. In other applications, x̲ might represent an algorithm to sort a list and x̄ a mutable set of lists to be sorted. The payoff would then reflect the ability of the algorithm to sort those lists in x̄.

We define the payoff for agent i playing move x̲_i, independent of an opponent's reply, as the least payoff over all possible opponent responses (a minimax criterion): g_i(x̲_i) ≡ min_{x̄_i} f_i(x̲_i, x̄_i). With this criterion, the best move an agent can make is the move which maximizes g_i, so that its performance in competition (over all possible opponents) will be as good as possible. We are not interested in search strategies just across i's possible moves, but more generally across all joint moves of i and its opponents. (Note that whether that opponent varies or not is irrelevant, since we are setting its moves.) The ultimate goal is to maximize i's minimax performance g_i.

We make one important observation. In general, using a random pairing strategy in the training phase will not result in a training set that can be used to guarantee that any particular move in the competition is better than the worst possible move. The only way to ensure an outcome guaranteed to be better than the worst possible is to exhaustively explore all possible responses to a move x̲, and then determine that the worst value of f_i over all such joint moves is better than the worst value for some other move x̲'. To do this certainly requires that m is greater than the total number of possible moves by the opponent; and even for very large m, unless all possible opponent responses have been explored we cannot make any such guarantees.

Pursuing this observation further, consider the situation where we know, through exhaustive sampling of the opponent, that the worst possible payoff for some move x̲ is g(x̲), and that another joint move x' = (x̲', x̄') with x̲' ≠ x̲ results in a payoff f(x') < g(x̲). In this case there is no need to explore other opponent responses to x̲', since it must be that g(x̲') < g(x̲), i.e. x̲' is minimax inferior to x̲. Thus, considering strategies for searching across the space of joint moves, any algorithm that avoids searching regions which are known to be minimax inferior (as above) will be more efficient than one which searches these regions (e.g. random search). This applies for all f_i, and so the smarter algorithm will have better expected performance than the dumb algorithm. Very roughly speaking, this result avoids the NFL implications because uniformly varying over all f_i does not uniformly vary over all possible g_i, which are the functions that ultimately determine performance.

In the following sections we explore this observation further.

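A small sketch (our own, with hypothetical names) of the pruning idea behind this observation: once some move's worst-case payoff g(x) is known exactly through exhaustive sampling of the opponent, any other move that has already produced a sampled payoff below g(x) is minimax inferior and can be skipped by the training search.

```python
def minimax_inferior_moves(samples, exhausted_move, num_responses):
    """samples: dict mapping (move, response) -> observed payoff.
    exhausted_move: a move whose full set of responses has been sampled."""
    replies = [y for (x, _), y in samples.items() if x == exhausted_move]
    assert len(replies) == num_responses, "exhausted_move must be fully sampled"
    g_bound = min(replies)                       # exact worst-case payoff g(exhausted_move)
    inferior = set()
    for (x, _), y in samples.items():
        if x != exhausted_move and y < g_bound:  # some reply already drives x below the bound
            inferior.add(x)                      # so the min over replies to x is below it too
    return g_bound, inferior

# Toy usage with a hypothetical 3-move, 2-response game:
samples = {(1, 1): 0.6, (1, 2): 0.8,             # move 1 exhausted: g(1) = 0.6
           (2, 1): 0.3,                          # move 2 already below 0.6: prune it
           (3, 1): 0.9}                          # move 3 still undecided
print(minimax_inferior_moves(samples, exhausted_move=1, num_responses=2))
```
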
A. Definitions

As much as possible we follow the notation of [WM97], extending it where necessary. That paper should be consulted as motivation for the analysis framework we employ. Without loss of generality we now consider two-player games, and leave the agent index i implicit. If there are l moves available to an agent, we label these by x̲ ∈ X̲ = [1, ..., l]. For each such move we assume the opponent may choose from one of ℓ(x̲) possible moves forming the space X̄(x̲).³ For simplicity we will take X̄(x̲) to be independent of x̲. Consequently, the size of the joint move space is |X| = Σ_{x̲=1}^{l} ℓ(x̲). If the training period consists of m distinct joint moves, even with m as large as |X| − l we cannot guarantee that the agent won't choose the worst possible move in the competition, as the worst possible move could be the opponent response that was left unexplored for each of the l possible moves.

In [WM97] a population is a sample of distinct points from the input space X and their corresponding fitness values. In this coevolutionary context the notion of a population of sampled configurations needs to be extended to include opponent responses. For simplicity we assume that fitness payoffs are a deterministic function of joint moves. Thus, rather than the more general output space Z, we assume payoff values lie in a finite totally ordered space. Consequently, the fitness function is the mapping f : X → Y, where X, the space of joint moves, consists of all pairs (x̲, x̄) with x̄ ∈ X̄(x̲).

As in the general framework, a population of size m is represented as d_m = {(d_m^X(i), d_m^Y(i))}_{i=1}^m, where d_m^X(i) = (d_m^x̲(i), d_m^x̄(i)) and d_m^Y(i) = f(d_m^x̲(i), d_m^x̄(i)), and i ∈ [1, ..., m] labels the samples taken. In the above definition d_m^x̲(i) is the ith move made by the agent, d_m^x̄(i) is the opponent response, and d_m^Y(i) is the corresponding payoff. As usual we assume that no joint configurations are revisited.

A particular coevolutionary optimization task is specified by defining the payoff function that is to be extremized. As discussed in [WM97], a class of problems is defined by specifying a probability density P(f) over the space of possible payoff functions. As long as both X and Y are finite (as they are in any computer implementation) this is straightforward.

In addition to this extended notion of a population, there is an additional consideration in the coevolutionary setting, namely the decision of what move to make in the competition based upon the results of the training population. Formally, we encapsulate the process of making this decision as 𝒜.⁴ 𝒜 consists of a set of distributions (one for each m, since we would like 𝒜 to select a move regardless of the size of the training set) of the form P(x̲ ∈ X̲ | d_m). If 𝒜 deterministically returns a single move, we indicate the mapping from training population to move as 𝒜(d_m). To summarize, the definition of a search method is extended for self-play to include:

- A search rule a, which determines the manner in which a population is expanded during training and is formally given by the set of distributions {P_t(d^X(t) | d_{t−1})}_{t=1}^m. This corresponds exactly to the definition of a search algorithm used in [WM97] for non-coevolutionary optimization.

- A move-choosing rule 𝒜, mapping probabilistically or deterministically to the single move used in the competition. We write 𝒜 explicitly as the probability density 𝒜(x̲ | d_m), where x̲ ∈ X̲. For deterministic 𝒜 we write the density as 𝒜(x̲ | d_m) = δ(x̲ − 𝒜(d_m)).

The tuple (a, 𝒜) is called a search process (as opposed to a search algorithm in [WM97]).

The search process seeks a strategy that will perform well in competition. If 𝒜 is deterministic, the natural measure of the performance of the search process (a, 𝒜) obtained during training is C = min_{i∈[1,m]} f(𝒜(d_m), d^x̄(i)). (If 𝒜 is not deterministic then we use the weighted average Σ_{x̲} min_{i∈[1,m]} f(x̲, d^x̄(i)) 𝒜(x̲ | d_m).) The best (a, 𝒜) for a particular f are those which maximize C.

³Note that the space of opponent moves varies with x̲. This is the typical situation in applications to games with complex rules (e.g. checkers).

⁴The notation 𝒜 is meant to suggest that, unlike the A(d_m) function introduced earlier, 𝒜 defines only the champion's move, and not the possible responses to this move.

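Read operationally, a search process is the pair (a, 𝒜): the rule a grows the training population and the rule 𝒜 picks the competition move, whose training-based score is C = min_{i∈[1,m]} f(𝒜(d_m), d^x̄(i)). The sketch below (our own; `a_random`, `A_greedy` and the toy payoff table are hypothetical) wires those two pieces together.

```python
import random

def run_search_process(a, A, f, m, rng):
    """Train with search rule `a` for m joint moves, then choose a move with rule `A`
    and score it by the worst payoff against the opponent responses seen in training."""
    d = []                                            # population of ((move, response), payoff)
    for _ in range(m):
        x, xbar = a(d, rng)                           # search rule: pick next joint move
        d.append(((x, xbar), f(x, xbar)))
    champion = A(d)                                   # move-choosing rule
    return champion, min(f(champion, xbar) for (_, xbar), _ in d)  # C = min_i f(A(d_m), d^x̄(i))

# Hypothetical ingredients for a 2-move / 2-response game:
MOVES, RESPONSES = (1, 2), (1, 2)
f = lambda x, xbar: {(1, 1): 0.5, (1, 2): 1.0, (2, 1): 1.0, (2, 2): 1.0}[(x, xbar)]

def a_random(d, rng):                                 # search rule: unvisited joint move at random
    seen = {jm for jm, _ in d}
    return rng.choice([(x, xb) for x in MOVES for xb in RESPONSES if (x, xb) not in seen])

def A_greedy(d):                                      # move rule: maximize the worst payoff seen so far
    worst = {}
    for (x, xbar), y in d:
        worst[x] = min(worst.get(x, 1.0), y)
    return max(worst, key=worst.get)

print(run_search_process(a_random, A_greedy, f, m=3, rng=random.Random(0)))
```
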
The traditional version of NFL (for traditional optimization) defines the performance differently, since there is no opponent. In the simplest case the performance of a (recall that there is no choosing algorithm) might be measured as C = max_{i∈[1,m]} d_m^Y(i). One traditional NFL result states that the average performance of any pair of algorithms is identical; formally, Σ_f P(C | f, m, a) is independent of a.⁵ A natural extension of this previous result considers a non-uniform average over fitness functions. In this case the quantity of interest is Σ_f P(C | f, m, a) P(f), where P(f) weights different fitness functions. NFL results can be proven for other non-uniform P(f) [SVW01].

A result akin to this one in the self-play setting would state that the uniform average Σ_f P(C | f, m, a, 𝒜) is independent of a and 𝒜. However, as we have informally seen, such a result cannot hold in general, since a search process with an a that exhausts an opponent's repertoire of moves has better guarantees than other search processes. A formal proof of this statement is presented in the next section.

⁵Actually far more can be said, and the reader is encouraged to consult [WM97] for details.

IV. AN INTUITIVE EXAMPLE

Before proving the existence of free lunches we give a motivating example, both to illustrate the definitions made in the above section and to show why we might expect free lunches to exist. Consider the concrete case where the player has two possible moves, i.e. X̲ = {1, 2}, the opponent has two responses for each of these moves, i.e. X̄ = {1, 2}, and there are only two possible payoff values, i.e. Y = {1/2, 1}. In this simple case there are 16 possible functions, and these are listed in Table I.

TABLE I
Exhaustive enumeration of all possible functions f(x̲, x̄) and g(x̲) = min_x̄ f(x̲, x̄) for X̲ = {1, 2}, X̄ = {1, 2}, and Y = {1/2, 1}. The payoff functions marked with * are those consistent with the population d_2 = {(1, 2; 1/2), (2, 2; 1)}.

 (x̲, x̄)   (1,1)  (1,2)  (2,1)  (2,2)   g(1)  g(2)
 f1         1/2    1/2    1/2    1/2     1/2   1/2
 f2         1      1/2    1/2    1/2     1/2   1/2
 f3         1/2    1      1/2    1/2     1/2   1/2
 f4         1      1      1/2    1/2     1     1/2
 f5         1/2    1/2    1      1/2     1/2   1/2
 f6         1      1/2    1      1/2     1/2   1/2
 f7         1/2    1      1      1/2     1/2   1/2
 f8         1      1      1      1/2     1     1/2
 f9  *      1/2    1/2    1/2    1       1/2   1/2
 f10 *      1      1/2    1/2    1       1/2   1/2
 f11        1/2    1      1/2    1       1/2   1/2
 f12        1      1      1/2    1       1     1/2
 f13 *      1/2    1/2    1      1       1/2   1
 f14 *      1      1/2    1      1       1/2   1
 f15        1/2    1      1      1       1/2   1
 f16        1      1      1      1       1     1

We can see that in this simple example the minimax criterion gives a very biased distribution over possible performance measures: 9/16 of the functions have g = [1/2 1/2], 3/16 have g = [1/2 1], 3/16 have g = [1 1/2], and 1/16 have g = [1 1], where g = [g(x̲=1) g(x̲=2)]. If we consider a particular population, say d_2 = {(1, 2; 1/2), (2, 2; 1)}, the payoff functions that are consistent with this population are f9, f10, f13, and f14, and the corresponding distribution over g functions is 1/2 [1/2 1/2]ᵀ and 1/2 [1/2 1]ᵀ. Given the fact that any population will give a biased sample over g functions, it may not be surprising that there are free lunches. We might expect that an algorithm which is able to exploit this biased sample would perform uniformly better than another algorithm which does not exploit the biased sample of g's. In the next section we prove the existence of free lunches by constructing such a pair of algorithms.

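The numbers above are easy to verify by brute force; the short script below (our own check, not from the paper) enumerates all 16 payoff functions on the 2x2 joint-move space, tabulates the distribution of g, and lists the functions consistent with the population d_2 = {(1, 2; 1/2), (2, 2; 1)}.

```python
from itertools import product
from collections import Counter

moves, responses, Y = [1, 2], [1, 2], [0.5, 1.0]
joint = [(x, xb) for x in moves for xb in responses]

# All 16 payoff functions f : joint moves -> Y.
functions = [dict(zip(joint, values)) for values in product(Y, repeat=len(joint))]

def g(f):
    """Minimax value g(x) = min over opponent responses of f(x, x_bar)."""
    return tuple(min(f[(x, xb)] for xb in responses) for x in moves)

print(Counter(g(f) for f in functions))          # the 9/16, 3/16, 3/16, 1/16 split

d2 = [((1, 2), 0.5), ((2, 2), 1.0)]               # the population from Table I
consistent = [f for f in functions if all(f[x] == y for x, y in d2)]
print(len(consistent), Counter(g(f) for f in consistent))
```
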
V. PROOF OF FREE LUNCHES

In this section a proof is presented that there are free lunches for self-play, by constructing a pair of search processes one of which explicitly has performance equal to or better than the other for all possible payoff functions f. As in earlier NFL work we assume that both |X| and |Y| are finite.⁶ For convenience, and with no loss in generality, we normalize the possible Y values so that they are equal to 1/|Y|, 2/|Y|, ..., 1.

The pair of processes we construct use the same search rule a (it is not important in the present context what a is) but different deterministic move-choosing rules 𝒜. In both cases a Bayesian estimate, based on uniform P(f) and the d_m at hand, is made of the expected value of g(x̲) = min_x̄ f(x̲, x̄) for each x̲. Since we are striving to maximize the worst possible payoff from f, the optimal search process selects the move which maximizes this expected value while the worst process selects the move which minimizes this value.

More formally, if E(C | d_m, a, 𝒜) differs for the two choices of 𝒜, always being higher for one of them, then E(C | m, a, 𝒜) = Σ_{d_m} P(d_m | a) E(C | d_m, 𝒜) differs for the two 𝒜. In turn, E(C | m, a, 𝒜) = Σ_{f, d_m} [P(C | f, m, a, 𝒜) × P(f)] ∝ Σ_{f, d_m} P(C | f, m, a, 𝒜) for the uniform prior P(f). Since this differs for the two 𝒜, so must Σ_f P(C | f, m, a, 𝒜).

Let ĝ(x̲) be a random variable representing the value of g(x̲) conditioned on d_m and x̲, i.e. it equals the worst possible payoff (to the agent) after the agent makes move x̲ and the opponent replies. Its distribution is P(ĝ(x̲) | x̲, d_m) = Σ_f P(ĝ(x̲) | x̲, d_m, f) P(f) for uniform P(f). In the example of section IV we have E ĝ(1) = 1/2 and E ĝ(2) = 3/4.

To determine the expected value of ĝ(x̲) we need to know P(f). Of the entire population d_m, only the subset sampled at x̲ is relevant. We assume that there are k(x̲, d_m) ≤ m such values.⁷ Since we are concerned with the worst possible opponent response, let r(x̲, d_m) be the minimal Y value obtained over the k(x̲, d_m) responses to x̲, i.e. r(x̲, d_m) = min_i d^Y(i), the minimum being taken over the samples at x̲. Since payoff values are normalized to lie between 0 and 1, 0 < r(x̲, d_m) ≤ 1. Given k(x̲, d_m) and r(x̲, d_m), P(ĝ | x̲, d_m) is otherwise independent of x̲ and d_m, and so we indicate the desired probability as π_{k,r}(ĝ).

In appendix A we derive the probability π_{k,r} in the case where all Y values are distinct (we do so because this results in a particularly simple expression for the expected value of ĝ) and in the case where Y values are not forced to be distinct. From these densities the expected value of ĝ(x̲) can be determined. In the case where Y values are not forced to be distinct there is no closed form for the expectation. However, in the continuum limit where |Y| → ∞ we find a closed-form expression in terms of k(x̲, d_m) and r(x̲, d_m) (see appendix B), where we have explicitly noted that both k and r depend on the move x̲ as well as on the training population d_m. As shorthand we define C_m(x̲) ≡ E(ĝ(x̲) | x̲, d_m).

The best move given the training population is the deterministic choice 𝒜_best(d_m) = argmax_{x̲} C_m(x̲) and the worst is 𝒜_worst(d_m) = argmin_{x̲} C_m(x̲). In the example of section IV with the population of size 2, 𝒜_best(d_2) = 2 and 𝒜_worst(d_2) = 1. As long as C_m(x̲) is not constant (which will usually be the case since the r values will differ), (a, 𝒜_best) and (a, 𝒜_worst) will differ, and the expected performance of 𝒜_best will be superior. This proves that the expected performance over all payoff functions of algorithm (a, 𝒜_best) is greater than that of algorithm (a, 𝒜_worst).

⁶Recall that |X| = Σ_{x̲} ℓ(x̲).

⁷Of course, we must also have k(x̲, d_m) ≤ ℓ(x̲) for populations d_m.

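For the finite setting of Table I the Bayesian estimate can be computed by brute force: average g(x̲) over every payoff function consistent with the training population (a uniform P(f)), then take the argmax/argmin as 𝒜_best/𝒜_worst. The script below (our own check, reusing the enumeration idea from Section IV) reproduces E ĝ(1) = 1/2, E ĝ(2) = 3/4, 𝒜_best(d_2) = 2 and 𝒜_worst(d_2) = 1.

```python
from itertools import product

moves, responses, Y = [1, 2], [1, 2], [0.5, 1.0]
joint = [(x, xb) for x in moves for xb in responses]
functions = [dict(zip(joint, v)) for v in product(Y, repeat=len(joint))]   # uniform P(f)

def expected_g(x, population):
    """Bayesian estimate C_m(x) = E[g-hat(x) | d_m] under a uniform prior over payoff tables."""
    consistent = [f for f in functions if all(f[jm] == y for jm, y in population)]
    return sum(min(f[(x, xb)] for xb in responses) for f in consistent) / len(consistent)

d2 = [((1, 2), 0.5), ((2, 2), 1.0)]
Cm = {x: expected_g(x, d2) for x in moves}
A_best = max(Cm, key=Cm.get)
A_worst = min(Cm, key=Cm.get)
print(Cm, "best:", A_best, "worst:", A_worst)   # {1: 0.5, 2: 0.75} best: 2 worst: 1
```
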
VI. OTHER FREE LUNCHES

We have shown the existence of free lunches for self-play by constructing a pair of algorithms with the same search rule

REFERENCES

[Hol75] J. H. Holland, Adaptation in Natural and Artificial Systems. University of Michigan Press, 1975.
[MS82] J. Maynard Smith, Evolution and the Theory of Games. Cambridge University Press, 1982.
[WM97] D. H. Wolpert and W. G. Macready, "No free lunch theorems for optimization," IEEE Transactions on Evolutionary Computation, vol. 1, no. 1, pp. 67-82, 1997.