Book ChapterDOI

Creating an upper-confidence-tree program for havannah

TL;DR: This paper tests the generality of Monte-Carlo Tree Search and Upper Confidence Bounds by experimenting on the game of Havannah, and shows that the same results hold, with slight differences related to the absence of clearly known patterns for the game of Havannah.
Abstract: Monte-Carlo Tree Search and Upper Confidence Bounds provided huge improvements in computer-Go. In this paper, we test the generality of the approach by experimenting on the game, Havannah, which is known for being especially difficult for computers. We show that the same results hold, with slight differences related to the absence of clearly known patterns for the game of Havannah, in spite of the fact that Havannah is more related to connection games like Hex than to territory games like Go.

Summary (2 min read)

1 Introduction

  • – The rules are simple: each player places one stone on one empty cell.
  • These figures are presented in Fig. 1 for the sake of clarity.
  • In 2002 Freeling offered a prize of 1000 euros, available through 2012, for any computer program that could beat him in even one game of a ten-game match.
  • The following features make Havannah very difficult for computers, perhaps yet more difficult than the game of Go: – few local patterns are known for Havannah; – no natural evaluation function; – no pruning rule for reducing the number of reasonable moves; – large action space (271 for the first move with board size 10).

2 UCT

  • Upper Confidence Trees are the most straightforward choice when implementing a Monte-Carlo Tree Search.
  • As long as there is time before playing, the algorithm performs random simulations from a UCT tree leaf.
  • Usually, UCT provides better and better results when the number of simulations per move is increased.

3 Guiding exploration

  • UCT-like algorithms are quite strong for balancing exploration and exploitation.
  • On the other hand, they provide no information for unexplored moves, and on how to choose among these moves; and little information for loosely explored moves.
  • Various tricks have been proposed around this: First Play Urgency, Progressive widening, Rapid Action Value Estimates (RAVE).

3.1 Progressive widening/unpruning

  • Then, at the m-th simulation of a node, all moves with index larger than f(m) have −∞ score (i.e. are discarded), with f(m) some non-decreasing mapping from N to N.
  • Consider a node of a tree which is explored 50 times only (this certainly happens for many nodes deep in the tree).
  • Meanwhile, progressive widening will sample only a few moves, e.g. 4 moves, and sample the best of these 4 moves much more often - this is likely to be better than taking the average of all moves as an evaluation function (a small sketch of the schedule follows this list).
  • These experiments were performed with the exploration formula given in Eq.
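As a rough illustration of the schedule just described (a minimal sketch, not the authors' code: the function name, the default K = 1 and the exponent 1/4 taken from [19] are illustrative assumptions), progressive widening can be implemented as a filter over heuristically ranked moves:

```python
import math

def widened_candidates(ranked_moves, m, K=1.0, exponent=0.25):
    """Progressive widening: at the m-th simulation of a node, keep only the
    first f(m) = max(1, floor(K * m**exponent)) heuristically ranked moves;
    all other moves are treated as having score -infinity (discarded)."""
    f_m = max(1, math.floor(K * m ** exponent))
    return ranked_moves[:f_m]

# With K = 1 and exponent 1/4, a node simulated only 50 times considers
# just 2 moves, while after 10 000 simulations it considers 10 of them.
moves = list(range(271))                        # e.g. first move, board size 10
print(len(widened_candidates(moves, 50)))       # -> 2
print(len(widened_candidates(moves, 10000)))    # -> 10
```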

3.2 Rapid Action Value Estimate

  • In the case of Go, [4, 12] propose to average the score with a permutation-based statistical estimate.
  • In the game of Go, RAVE values are a great improvement.
  • They involve complicated implementations due to captures and re-captures.
  • In the case of Havannah there’s no such problem, and we’ll see that the results are good (a sketch of how RAVE statistics can be blended follows this list).
  • The first line corresponds to the configuration empirically chosen for 1000 simulations per move; its results are disappointing, almost equivalent to UCT, for these experiments with 30 000 simulations per move.
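The exact way RAVE values are mixed with the regular statistics is not spelled out in this summary. The sketch below shows one common simple blending scheme, under the assumption that the parameter R appearing in the experiments plays the role of an equivalence constant in a weight β = R/(R + n); the function name and the example numbers are illustrative only, not taken from the paper.

```python
def rave_blend(move_stats, rave_stats, R=50):
    """Blend the regular Monte-Carlo mean of a move with its RAVE
    ("all moves as first") mean; beta = R / (R + n) gives most weight to
    the RAVE estimate while the move has few real simulations."""
    n, wins = move_stats            # simulations in which the move was played here
    n_rave, wins_rave = rave_stats  # simulations in which it appeared later on
    mc = wins / n if n else 0.0
    amaf = wins_rave / n_rave if n_rave else 0.0
    beta = R / (R + n)
    return beta * amaf + (1 - beta) * mc

# A barely explored move leans on the RAVE estimate ...
print(rave_blend((2, 1), (40, 30)))        # ~0.74, close to the RAVE mean 0.75
# ... while a heavily explored move relies mostly on its own statistics.
print(rave_blend((500, 300), (600, 450)))  # ~0.61, close to the MC mean 0.60
```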

4 Games against Havannah-Applet

  • The authors tested their program against Havannah-Applet http://dfa.imn.htwk-leipzig.de/havannah/, recommended by the MindSports association as the only publicly available program that plays by the Havannah rules.
  • In both cases, their program (based on RAVE, with no exploration term and no progressive widening) was black (white starts and therefore has the advantage in Havannah).
  • The first game (played with 8 seconds per move for their program, against 30s for the opponent) is presented in Fig.
  • It then increases regularly until the end.
  • Consistently again, the estimated success rate is lower than 50 % at the beginning (as the opponent, playing first, has the advantage initially).

5 Discussion

  • The authors could clearly validate in the case of Havannah the efficiency of some well known techniques coming from computer-Go, showing the generality of the MCTS approach.
  • The success rate is higher than in the case of the game of Go (around 75%).
  • – Progressive widening, in spite of the fact that it was shown in [19] that it works even without a heuristic, was not significant for us.
  • The authors’ program could defeat Havannah-Applet easily, even though it was playing as black, and with only 2s per move instead of 30s (running on a single core).
  • Running more experiments was difficult due to lack of automated interface.


HAL Id: inria-00380539
https://hal.inria.fr/inria-00380539
Submitted on 2 May 2009
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Creating an Upper-Confidence-Tree program for Havannah
Fabien Teytaud, Olivier Teytaud
To cite this version:
Fabien Teytaud, Olivier Teytaud. Creating an Upper-Confidence-Tree program for Havannah. ACG 12, May 2009, Pamplona, Spain. inria-00380539

Creating an Upper-Confidence-Tree program for Havannah
F. Teytaud and O. Teytaud
TAO (Inria), LRI, UMR 8623 (CNRS - Univ. Paris-Sud),
bat 490, Univ. Paris-Sud, 91405 Orsay, France, fteytaud@lri.fr
Abstract. Monte-Carlo Tree Search and Upper Confidence Bounds provided huge improvements in computer-Go. In this paper, we test the generality of the approach by experimenting on another game, Havannah, which is known for being especially difficult for computers. We show that the same results hold, with slight differences related to the absence of clearly known patterns for the game of Havannah, in spite of the fact that Havannah is more related to connection games like Hex than to territory games like Go.
1 Introduction
This introduction is based on [21]. Havannah is a 2-player board game (black vs white) invented by Christian Freeling [21, 18]. It is played on a hexagonal board of hexagonal locations, with variable size (10 hexes per side usually for strong players).
White starts, after which moves alternate. The rules are simple:
– Each player places one stone on one empty cell. If there is no empty cell and if no player has won yet, the game is a draw (very rare cases).
– A player wins if he realizes:
  – a ring, i.e. a loop around one or more cells (empty or not, occupied by black or white stones);
  – a bridge, i.e. a continuous string of stones from one of the six corner cells to another of the six corner cells;
  – a fork, i.e. a connection between three edges of the board (corner points are not edges).
These figures are presented in Fig. 1 for the sake of clarity (a small code sketch of the bridge and fork tests is given below).
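As an aside (a minimal sketch, not from the paper: the axial-coordinate board representation, the helper names and the flood-fill approach are assumptions), the bridge and fork conditions can be tested by flood-filling one player's stones and counting the corners and distinct edges each connected group touches; ring detection, which requires a cycle test, is omitted.

```python
from collections import deque

DIRS = [(1, 0), (0, 1), (-1, 1), (-1, 0), (0, -1), (1, -1)]  # hex neighbours, axial coords

def border_lines(cell, size):
    """Indices (0..5) of the border lines of a hexagonal board of the given
    side length that `cell` lies on; corner cells lie on exactly two."""
    q, r = cell
    s = size - 1
    hits = [q == s, r == s, q + r == s, q == -s, r == -s, q + r == -s]
    return {i for i, hit in enumerate(hits) if hit}

def bridge_or_fork(stones, size):
    """True if `stones` (one player's cells) contains a bridge (a group
    connecting two corner cells) or a fork (a group touching three distinct
    edges; corner cells do not count as edge cells)."""
    seen = set()
    for start in stones:
        if start in seen:
            continue
        group, queue = [], deque([start])
        seen.add(start)
        while queue:                                  # flood-fill one group
            c = queue.popleft()
            group.append(c)
            for dq, dr in DIRS:
                nb = (c[0] + dq, c[1] + dr)
                if nb in stones and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        corners = sum(1 for c in group if len(border_lines(c, size)) == 2)
        edges = set().union(*(border_lines(c, size) for c in group
                              if len(border_lines(c, size)) == 1))
        if corners >= 2 or len(edges) >= 3:
            return True
    return False
```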
Although computers can play some abstract strategy games better than any human, the best Havannah-playing software plays weakly compared to human experts. In 2002 Freeling offered a prize of 1000 euros, available through 2012, for any computer program that could beat him in even one game of a ten-game match.

Fig. 1. Three finished games: a ring (a loop, by black), a bridge (linking two corners, by white) and a fork (linking three edges, by black).

Havannah is somewhat related to Go and Hex. It’s fully observable, involves connections; also, rings are somewhat related to the concept of eye or capture in the game of Go. Go has been for a while the main target for AI in games, as it’s a very famous game, very important in many Asian countries; however, it is no longer true that computers are only at the level of a novice. Since MCTS/UCT (Monte-Carlo Tree Search, Upper Confidence Trees) approaches have been defined [6, 9, 14], several improvements have appeared like First-Play Urgency [20], Rave-values [4, 12], patterns and progressive widening [10, 7], better than UCB-like (Upper Confidence Bounds) exploration terms [16], large-scale parallelization [11, 8, 5, 13], automatic building of huge opening books [2]. Thanks to all these improvements, MoGo has already won games against a professional player in 9x9 (Amsterdam, 2007; Paris, 2008; Taiwan 2009), and recently won with handicap 6 against a professional player, Li-Chen Chien (Tainan, 2009), and with handicap 7 against a top professional player, Zhou Junxun, winner of the LG-Cup 2007 (Tainan, 2009). The following features make Havannah very difficult for computers, perhaps yet more difficult than the game of Go:
– few local patterns are known for Havannah;
– no natural evaluation function;
– no pruning rule for reducing the number of reasonable moves;
– large action space (271 for the first move with board size 10).
The advantage of Havannah, as well as in Go (except in pathological cases for the game of Go), is that simulations have a bounded length: the size of the board.
The goal of this paper is to investigate the generality of this approach by testing it in Havannah.
By way of notation, x ± y means a result with average x, and 95% confidence interval [x − y, x + y] (i.e. y is two standard deviations).
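As a concrete illustration of this notation (not from the paper; the function name is ours), for a success rate estimated from n independent games the half-width y is simply twice the standard error of the estimated mean:

```python
import math

def reported_result(wins, n):
    """Return (x, y) so that a result is written x +/- y,
    with y = two standard deviations of the estimated mean."""
    x = wins / n
    y = 2 * math.sqrt(x * (1 - x) / n)
    return x, y

print(reported_result(750, 1000))   # roughly (0.75, 0.027)
```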
2 UCT
Upper Confidence Trees are the most straightforward choice when implementing a Monte-Carlo Tree Search. The basic principle is as follows. As long as there is time before playing, the algorithm performs random simulations from a UCT tree leaf. These simulations (playouts) are complete possible games, from the root of the tree until the game is over, played by a random player playing both black and white. In its most simple version, the random player, which is used when in a situation s which is not in the tree, just plays randomly and uniformly among legal moves in s; we did not implement anything more sophisticated, as we are more interested in algorithms than in heuristic tricks. When the random player is in a situation s already in the UCT tree, then its choices depend on the statistics: number of wins and number of losses in previous games, for each legal move in s. The detailed formula used for choosing a move depending on statistics is termed a bandit formula, discussed below. After each simulation, the first situation of the simulated game that was not yet in memory is archived in memory, and all statistics of numbers of wins and numbers of defeats are updated in each situation in memory which was traversed by the simulation.
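The whole procedure can be condensed into a short sketch. This is only one possible reading of the description above, assuming a generic game interface (legal_moves, play, to_move, winner) that is not part of the paper; play(s, move) is assumed to return a new situation without modifying s, and draws are scored 0.5 as an arbitrary choice.

```python
import math, random, time

class Node:
    def __init__(self):
        self.wins = 0.0        # wins for the player who made the move into this node
        self.sims = 0
        self.children = {}     # move -> Node

def bandit_score(parent_sims, child):
    """Empirical success rate plus the exploration term of Eq. 2 below."""
    return (child.wins / child.sims
            + math.sqrt(0.25 * math.log(2 + parent_sims) / child.sims))

def uct_search(root_state, seconds, legal_moves, play, to_move, winner):
    """winner(s) returns None while the game is running, else "black", "white" or "draw"."""
    root = Node()
    deadline = time.time() + seconds
    while time.time() < deadline:
        state, node, path = root_state, root, []
        # 1. descend the tree with the bandit formula while all legal moves have children
        while winner(state) is None and len(node.children) == len(legal_moves(state)):
            mover = to_move(state)
            move = max(node.children, key=lambda m: bandit_score(node.sims, node.children[m]))
            node = node.children[move]
            path.append((mover, node))
            state = play(state, move)
        # 2. archive the first situation of the simulated game not yet in memory
        if winner(state) is None:
            mover = to_move(state)
            move = random.choice([m for m in legal_moves(state) if m not in node.children])
            node.children[move] = Node()
            node = node.children[move]
            path.append((mover, node))
            state = play(state, move)
        # 3. playout: the uniform random player finishes the game
        while winner(state) is None:
            state = play(state, random.choice(legal_moves(state)))
        # 4. update the win/loss statistics of every traversed situation
        w = winner(state)
        root.sims += 1
        for mover, n in path:
            n.sims += 1
            n.wins += 1.0 if w == mover else (0.5 if w == "draw" else 0.0)
    # finally play the most simulated move at the root
    return max(root.children, key=lambda m: root.children[m].sims)
```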
Bandit formula
The most classical bandit formula is UCB (Upper Confidence Bounds [15, 3]); Monte-Carlo Tree Search based on UCB is termed UCT. The idea is to compute an exploration/exploitation score for each legal move in a situation s, for choosing which of these moves must be simulated. Then, the move with maximal exploration/exploitation score is chosen. The exploration/exploitation score of a move d is typically the sum of the empirical success rate score(d) of the move d, and of an exploration term which is strong for less explored moves. UCB uses Hoeffding’s bound for defining the exploration term:

exploration_Hoeffding = √( log(2/δ) / n )    (1)

where n is the number of simulations for move d. δ is usually chosen linear as a function of the number m of simulations of the situation s: δ is linear in m. In our implementation, with a uniform random player, the formula was empirically set to:

exploration_Hoeffding = √( 0.25 log(2 + m) / n ).    (2)
[1, 17] has shown the efficiency of using Bernstein’s bound instead of Hoeffding’s bound, in some settings. The exploration term is then:

exploration_Bernstein = √( score(d)(1 − score(d)) · 2 log(3/δ) / n ) + 3 log(3/δ) / n    (3)
This term is smaller for moves with small variance (score(d) close to 0 or to 1).
After the empirical tuning of Hoeffding’s equation 2, we tested Bernstein’s formula as follows (with p = score(d) for short):

exploration_Bernstein = √( 4Kp(1 − p) log(2 + m) / n ) + 3 · 2K log(2 + m) / n.    (4)

Within the “2+”, which is here for avoiding special cases for 0, this is Bernstein’s formula for δ linear as a function of m.
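For reference, both exploration terms translate directly into code (a small sketch; the function names are ours and Eq. 4 is used exactly as written above):

```python
import math

def exploration_hoeffding(m, n):
    """Eq. 2: m = simulations of the situation s, n = simulations of the move d."""
    return math.sqrt(0.25 * math.log(2 + m) / n)

def exploration_bernstein(p, m, n, K):
    """Eq. 4, with p = score(d), the empirical success rate of the move d."""
    return (math.sqrt(4 * K * p * (1 - p) * math.log(2 + m) / n)
            + 3 * 2 * K * math.log(2 + m) / n)

m, n, K = 1000, 30, 0.03
print(exploration_hoeffding(m, n))            # ~0.24
print(exploration_bernstein(0.50, m, n, K))   # maximal variance term
print(exploration_bernstein(0.95, m, n, K))   # smaller bonus for an extreme score
```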
We tested several values of K. The first value 0.25 corresponds to Hoeffding’s bound except that the second term is added:

K Score against Hoeffding’s formula 2
0.250 0.503 ± 0.011
0.100 0.578 ± 0.010
0.030 0.646 ± 0.005
0.010 0.652 ± 0.006
0.001 0.582 ± 0.012
0.000 0.439 ± 0.015
Experiments are performed with 1000 simulations per move, with board size 5. We tried to experiment with values of K below 0.01, but with poor results.
Scaling of UCT
Usually, UCT provides better and better results when the number of simulations per move is increased. Typically, the winning rate with 2z simulations against UCT with z simulations is roughly 63% in the game of Go (see e.g. [11]). In the case of Havannah, with a uniform random player (i.e. the random player plays uniformly among legal moves) and the bandit formula as in Eq. 2, we get the following results for the UCT tuned as above (K = 0.25):
Number of simulations of both players    Success rate
250 vs 125 0.75 ± 0.02
500 vs 250 0.84 ± 0.02
1000 vs 500 0.72 ± 0.03
2000 vs 1000 0.75 ± 0.02
4000 vs 2000 0.74 ± 0.05
These experiments are performed on a board of size 8.
3 Guiding exploration
UCT-like algorithms are quite strong for balancing exploration and exploitation. On the other hand, they provide no information for unexplored moves, and on how to choose among these moves; and little information for loosely explored moves. Various tricks have been proposed around this: First Play Urgency, Progressive widening, Rapid Action Value Estimates (RAVE).
3.1 Progressive widening/unpruning
In progressive widening [10, 7, 19], we first rank the legal moves at a situation s according to some heuristic: the moves are then renamed 1, 2, . . . , n, with i < j if move i is preferred to move j for the heuristic. Then, at the m-th simulation of a node, all moves with index larger than f(m) have −∞ score (i.e. are discarded), with f(m) some non-decreasing mapping from N to N. It was shown in [19] that this can work even with random ranking, with f(m) = Km^{1/4} for some constant K. In [10] it was shown that f(m) = Km^{1/3.4} for some constant K

Citations
Journal ArticleDOI
TL;DR: A survey of the literature on Monte Carlo tree search to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research; it outlines the core algorithm's derivation, imparts some structure on the many variations and enhancements that have been proposed, and summarizes the results from the key game and nongame domains.
Abstract: Monte Carlo tree search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm's derivation, impart some structure on the many variations and enhancements that have been proposed, and summarize the results from the key game and nongame domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.

2,682 citations


Cites background from "Creating an upper-confidence-tree p..."

  • ...Another approach called dynamic exploration, proposed by Bourki et al. [25], tunes parameters based on patterns in their Go program MOGO....

    [...]

  • ...MCCFR works by sampling blocks of terminal histories (paths through the game tree from root to leaf), and computing immediate counterfactual regrets over those blocks....

    [...]

Book
12 Dec 2012
TL;DR: In this article, the authors focus on regret analysis in the context of multi-armed bandit problems, where regret is defined as the balance between staying with the option that gave highest payoff in the past and exploring new options that might give higher payoffs in the future.
Abstract: A multi-armed bandit problem - or, simply, a bandit problem - is a sequential allocation problem defined by a set of actions. At each time step, a unit resource is allocated to an action and some observable payoff is obtained. The goal is to maximize the total payoff obtained in a sequence of allocations. The name bandit refers to the colloquial term for a slot machine (a "one-armed bandit" in American slang). In a casino, a sequential allocation problem is obtained when the player is facing many slot machines at once (a "multi-armed bandit"), and must repeatedly choose where to insert the next coin. Multi-armed bandit problems are the most basic examples of sequential decision problems with an exploration-exploitation trade-off. This is the balance between staying with the option that gave highest payoffs in the past and exploring new options that might give higher payoffs in the future. Although the study of bandit problems dates back to the 1930s, exploration-exploitation trade-offs arise in several modern applications, such as ad placement, website optimization, and packet routing. Mathematically, a multi-armed bandit is defined by the payoff process associated with each option. In this book, the focus is on two extreme cases in which the analysis of regret is particularly simple and elegant: independent and identically distributed payoffs and adversarial payoffs. Besides the basic setting of finitely many actions, it also analyzes some of the most important variants and extensions, such as the contextual bandit model. This monograph is an ideal reference for students and researchers with an interest in bandit problems.

2,427 citations

Journal ArticleDOI
TL;DR: The Monte-Carlo revolution in computer Go is surveyed, the key ideas that led to the success of MoGo and subsequent Go programs are outlined, and for the first time a comprehensive description, in theory and in practice, of this extended framework for Monte- Carlo tree search is provided.

375 citations

Journal ArticleDOI
TL;DR: This paper describes the leading algorithms for Monte-Carlo tree search and explains how they have advanced the state of the art in computer Go.
Abstract: The ancient oriental game of Go has long been considered a grand challenge for artificial intelligence. For decades, computer Go has defied the classical methods in game tree search that worked so successfully for chess and checkers. However, recent play in computer Go has been transformed by a new paradigm for tree search based on Monte-Carlo methods. Programs based on Monte-Carlo tree search now play at human-master levels and are beginning to challenge top professional players. In this paper, we describe the leading algorithms for Monte-Carlo tree search and explain how they have advanced the state of the art in computer Go.

212 citations

Journal Article
TL;DR: In this paper, three parallelization methods for Monte-Carlo Tree Search (MCTS) are discussed: leaf parallelization, root parallelization and tree parallelization (tree parallelization requires two techniques: adequately handling of local mutexes and virtual loss).
Abstract: Monte-Carlo Tree Search (MCTS) is a new best-first search method that started a revolution in the field of Computer Go. Parallelizing MCTS is an important way to increase the strength of any Go program. In this article, we discuss three parallelization methods for MCTS: leaf parallelization, root parallelization, and tree parallelization. To be effective, tree parallelization requires two techniques: adequate handling of (1) local mutexes and (2) virtual loss. Experiments in 13x13 Go reveal that in the program Mango root parallelization may lead to the best results for a specific time setting and specific program parameters. However, as soon as the selection mechanism is able to handle more adequately the balance of exploitation and exploration, tree parallelization should have attention too and could become a second choice for parallelizing MCTS. Preliminary experiments on the smaller 9x9 board provide promising prospects for tree parallelization.

198 citations

References
Journal ArticleDOI
TL;DR: This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
Abstract: Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions while taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is the loss due to the fact that the globally optimal policy is not followed all the times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first ones to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.

6,361 citations

Book ChapterDOI
18 Sep 2006
TL;DR: In this article, a bandit-based Monte-Carlo planning algorithm is proposed for large state-space Markovian decision problems (MDPs), which is one of the few viable approaches to find near-optimal solutions.
Abstract: For large state-space Markovian Decision Problems Monte-Carlo planning is one of the few viable approaches to find near-optimal solutions. In this paper we introduce a new algorithm, UCT, that applies bandit ideas to guide Monte-Carlo planning. In finite-horizon or discounted MDPs the algorithm is shown to be consistent and finite sample bounds are derived on the estimation error due to sampling. Experimental results show that in several domains, UCT is significantly more efficient than its alternatives.

2,695 citations

Book ChapterDOI
29 May 2006
TL;DR: A new framework to combine tree search with Monte-Carlo evaluation, which does not separate between a min-max phase and a Monte-Carlo phase, is presented; it provides fine-grained control of the tree growth, at the level of individual simulations, and allows efficient selectivity.
Abstract: A Monte-Carlo evaluation consists in estimating a position by averaging the outcome of several random continuations. The method can serve as an evaluation function at the leaves of a min-max tree. This paper presents a new framework to combine tree search with Monte-Carlo evaluation, that does not separate between a min-max phase and a Monte-Carlo phase. Instead of backing-up the min-max value close to the root, and the average value at some depth, a more general backup operator is defined that progressively changes from averaging to min-max as the number of simulations grows. This approach provides a fine-grained control of the tree growth, at the level of individual simulations, and allows efficient selectivity. The resulting algorithm was implemented in a 9 × 9 Go-playing program, Crazy Stone, that won the 10th KGS computer-Go tournament.

1,273 citations

Journal ArticleDOI
TL;DR: Two progressive strategies for MCTS are introduced, called progressive bias and progressive unpruning, which enable the use of relatively time-expensive heuristic knowledge without speed reduction.
Abstract: Monte-Carlo Tree Search (MCTS) is a new best-first search guided by the results of Monte-Carlo simulations. In this article, we introduce two progressive strategies for MCTS, called progressive bias and progressive unpruning. They enable the use of relatively time-expensive heuristic knowledge without speed reduction. Progressive bias directs the search according to heuristic knowledge. Progressive unpruning first reduces the branching factor, and then increases it gradually again. Experiments assess that the two progressive strategies significantly improve the level of our Go program Mango. Moreover, we see that the combination of both strategies performs even better on larger board sizes.

458 citations

Frequently Asked Questions (8)
Q1. What have the authors contributed in "Creating an upper-confidence-tree program for havannah" ?

In this paper, the authors test the generality of the approach by experimenting on another game, Havannah, which is known for being especially difficult for computers. The authors show that the same results hold, with slight differences related to the absence of clearly known patterns for the game of Havannah, in spite of the fact that Havannah is more related to connection games like Hex than to territory games like Go. 

In 2002 Freeling offered a prize of 1000 euros, available through 2012, for any computer program that could beat him in even one game of a ten-game match. 

Bandit formula: The most classical bandit formula is UCB (Upper Confidence Bounds [15, 3]); Monte-Carlo Tree Search based on UCB is termed UCT.

K=0, size 5, 30 000 simulations/move: 0.53 ± 0.02
R=50, K=0.25, size 5, 30 000 simulations/move: 0.47 ± 0.04
R=50, K=0.05, size 5, 30 000 simulations/move: 0.60 ± 0.02
R=50, K=0.02, size 5, 30 000 simulations/move: 0.60 ± 0.03
R=5, K=0.02, size 5, 30 000 simulations/move: 0.61 ± 0.06
R=20, K=0.02, size 5, 30 000 simulations/move: 0.66 ± 0.03

The authors experimented with 500 simulations per move, size 5, and various constants P and Q for the progressive widening f(m) = Q⌊m^P⌋:
Q, P    Success rate against no prog. widening

In their implementation, with a uniform random player, the formula was empirically set to: exploration_Hoeffding = √( 0.25 log(2 + m) / n ). (2) [1, 17] has shown the efficiency of using Bernstein’s bound instead of Hoeffding’s bound, in some settings.

In the following lines, using a weaker exploration and a small value of R, the authors get better results. [16] points out that, with big simulation times, K = 0 was better, but an exploration bonus depending on patterns was used instead.

The main weaknesses of RAVE are:
– tuning is required for each new number of simulations;
– the results for large numbers of simulations are less impressive (yet significant).