Book ChapterDOI

Creating an upper-confidence-tree program for havannah

TL;DR: This paper tests the generality of Monte-Carlo Tree Search and Upper Confidence Bounds by experimenting on the game of Havannah, and shows that the same results hold, with slight differences related to the absence of clearly known patterns for the game of Havannah.
Abstract: Monte-Carlo Tree Search and Upper Confidence Bounds provided huge improvements in computer-Go. In this paper, we test the generality of the approach by experimenting on the game, Havannah, which is known for being especially difficult for computers. We show that the same results hold, with slight differences related to the absence of clearly known patterns for the game of Havannah, in spite of the fact that Havannah is more related to connection games like Hex than to territory games like Go.

Summary (2 min read)

1 Introduction

  • – The rules are simple: each player places one stone on one empty cell.
  • These figures are presented in Fig. 1 for the sake of clarity.
  • In 2002 Freeling offered a prize of 1000 euros, available through 2012, for any computer program that could beat him in even one game of a ten-game match.
  • The following features make Havannah very difficult for computers, perhaps yet more difficult than the game of Go: – few local patterns are known for Havannah; – no natural evaluation function; – no pruning rule for reducing the number of reasonable moves; – large action space (271 for the first move with board size 10).

2 UCT

  • Upper Confidence Trees are the most straightforward choice when implementing a Monte-Carlo Tree Search.
  • As long as there is time before playing, the algorithm performs random simulations from a UCT tree leaf.
  • Usually, UCT provides better and better results when the number of simulations per move is increased.

3 Guiding exploration

  • UCT-like algorithms are quite strong for balancing exploration and exploitation.
  • On the other hand, they provide no information for unexplored moves, and on how to choose among these moves; and little information for loosely explored moves.
  • Various tricks have been proposed around this: First Play Urgency, Progressive widening, Rapid Action Value Estimates (RAVE).

3.1 Progressive widening/unpruning

  • Then, at the m-th simulation of a node, all moves with index larger than f(m) have −∞ score (i.e. are discarded), with f(m) some non-decreasing mapping from N to N.
  • Consider a node of a tree which is explored 50 times only (this certainly happens for many nodes deep in the tree).
  • Meanwhile, progressive widening will sample only a few moves, e.g. 4 moves, and sample the best of these 4 moves much more often - this is likely to be better than taking the average of all moves as an evaluation function (a small sketch of the schedule follows this list).
  • These experiments were performed with the exploration formula given in Eq.
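As a rough illustration of the schedule just described (a minimal sketch, not the authors' code: the function name, the default K = 1 and the exponent 1/4 taken from [19] are illustrative assumptions), progressive widening can be implemented as a filter over heuristically ranked moves:

```python
import math

def widened_candidates(ranked_moves, m, K=1.0, exponent=0.25):
    """Progressive widening: at the m-th simulation of a node, keep only the
    first f(m) = max(1, floor(K * m**exponent)) heuristically ranked moves;
    all other moves are treated as having score -infinity (discarded)."""
    f_m = max(1, math.floor(K * m ** exponent))
    return ranked_moves[:f_m]

# With K = 1 and exponent 1/4, a node simulated only 50 times considers
# just 2 moves, while after 10 000 simulations it considers 10 of them.
moves = list(range(271))                        # e.g. first move, board size 10
print(len(widened_candidates(moves, 50)))       # -> 2
print(len(widened_candidates(moves, 10000)))    # -> 10
```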

3.2 Rapid Action Value Estimate

  • In the case of Go, [4, 12] propose to average the score with a permutation-based statistical estimate.
  • In the game of Go, RAVE values are a great improvement.
  • They involve complicated implementations due to captures and re-captures.
  • In the case of Havannah there’s no such problem, and we’ll see that the results are good (a sketch of how RAVE statistics can be blended follows this list).
  • The first line corresponds to the configuration empirically chosen for 1000 simulations per move; its results are disappointing, almost equivalent to UCT, for these experiments with 30 000 simulations per move.
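The exact way RAVE values are mixed with the regular statistics is not spelled out in this summary. The sketch below shows one common simple blending scheme, under the assumption that the parameter R appearing in the experiments plays the role of an equivalence constant in a weight β = R/(R + n); the function name and the example numbers are illustrative only, not taken from the paper.

```python
def rave_blend(move_stats, rave_stats, R=50):
    """Blend the regular Monte-Carlo mean of a move with its RAVE
    ("all moves as first") mean; beta = R / (R + n) gives most weight to
    the RAVE estimate while the move has few real simulations."""
    n, wins = move_stats            # simulations in which the move was played here
    n_rave, wins_rave = rave_stats  # simulations in which it appeared later on
    mc = wins / n if n else 0.0
    amaf = wins_rave / n_rave if n_rave else 0.0
    beta = R / (R + n)
    return beta * amaf + (1 - beta) * mc

# A barely explored move leans on the RAVE estimate ...
print(rave_blend((2, 1), (40, 30)))        # ~0.74, close to the RAVE mean 0.75
# ... while a heavily explored move relies mostly on its own statistics.
print(rave_blend((500, 300), (600, 450)))  # ~0.61, close to the MC mean 0.60
```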

4 Games against Havannah-Applet

  • The authors tested their program against Havannah-Applet http://dfa.imn.htwk-leipzig.de/havannah/, recommended by the MindSports association as the only publicly available program that plays by the Havannah rules.
  • In both cases, their program (based on RAVE, with no exploration term and no progressive widening) was black (white starts and therefore has the advantage in Havannah).
  • The first game (played with 8 seconds per move for their program, against 30s for the opponent) is presented in Fig.
  • It then increases regularly until the end.
  • Consistently again, the estimated success rate is lower than 50 % at the beginning (as the opponent, playing first, has the advantage initially).

5 Discussion

  • The authors could clearly validate in the case of Havannah the efficiency of some well known techniques coming from computer-Go, showing the generality of the MCTS approach.
  • The success rate is higher than in the case of the game of Go (around 75%).
  • – Progressive widening, in spite of the fact that it was shown in [19] that it works even without a heuristic, was not significant for us.
  • The authors’ program could defeat Havannah-Applet easily, even though it was playing as black, and with only 2s per move instead of 30s (running on a single core).
  • Running more experiments was difficult due to lack of automated interface.


HAL Id: inria-00380539
https://hal.inria.fr/inria-00380539
Submitted on 2 May 2009
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Creating an Upper-Confidence-Tree program for Havannah
Fabien Teytaud, Olivier Teytaud
To cite this version:
Fabien Teytaud, Olivier Teytaud. Creating an Upper-Confidence-Tree program for Havannah. ACG 12, May 2009, Pamplona, Spain. inria-00380539

Creating an Upper-Confidence-Tree program for Havannah
F. Teytaud and O. Teytaud
TAO (Inria), LRI, UMR 8623 (CNRS - Univ. Paris-Sud),
bat 490, Univ. Paris-Sud, 91405 Orsay, France, fteytaud@lri.fr
Abstract. Monte-Carlo Tree Search and Upper Confidence Bounds provided huge improvements in computer-Go. In this paper, we test the generality of the approach by experimenting on another game, Havannah, which is known for being especially difficult for computers. We show that the same results hold, with slight differences related to the absence of clearly known patterns for the game of Havannah, in spite of the fact that Havannah is more related to connection games like Hex than to territory games like Go.
1 Introduction
This introduction is based on [21]. Havannah is a 2-player board game (black vs white) invented by Christian Freeling [21, 18]. It is played on a hexagonal board of hexagonal locations, with variable size (10 hexes per side usually for strong players).
White starts, after which moves alternate. The rules are simple:
– Each player places one stone on one empty cell. If there is no empty cell and if no player has won yet, the game is a draw (very rare cases).
– A player wins if he realizes:
  – a ring, i.e. a loop around one or more cells (empty or not, occupied by black or white stones);
  – a bridge, i.e. a continuous string of stones from one of the six corner cells to another of the six corner cells;
  – a fork, i.e. a connection between three edges of the board (corner points are not edges).
These figures are presented in Fig. 1 for the sake of clarity (a small code sketch of the bridge and fork tests is given below).
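As an aside (a minimal sketch, not from the paper: the axial-coordinate board representation, the helper names and the flood-fill approach are assumptions), the bridge and fork conditions can be tested by flood-filling one player's stones and counting the corners and distinct edges each connected group touches; ring detection, which requires a cycle test, is omitted.

```python
from collections import deque

DIRS = [(1, 0), (0, 1), (-1, 1), (-1, 0), (0, -1), (1, -1)]  # hex neighbours, axial coords

def border_lines(cell, size):
    """Indices (0..5) of the border lines of a hexagonal board of the given
    side length that `cell` lies on; corner cells lie on exactly two."""
    q, r = cell
    s = size - 1
    hits = [q == s, r == s, q + r == s, q == -s, r == -s, q + r == -s]
    return {i for i, hit in enumerate(hits) if hit}

def bridge_or_fork(stones, size):
    """True if `stones` (one player's cells) contains a bridge (a group
    connecting two corner cells) or a fork (a group touching three distinct
    edges; corner cells do not count as edge cells)."""
    seen = set()
    for start in stones:
        if start in seen:
            continue
        group, queue = [], deque([start])
        seen.add(start)
        while queue:                                  # flood-fill one group
            c = queue.popleft()
            group.append(c)
            for dq, dr in DIRS:
                nb = (c[0] + dq, c[1] + dr)
                if nb in stones and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        corners = sum(1 for c in group if len(border_lines(c, size)) == 2)
        edges = set().union(*(border_lines(c, size) for c in group
                              if len(border_lines(c, size)) == 1))
        if corners >= 2 or len(edges) >= 3:
            return True
    return False
```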
Although computers can play some abstract strategy games better than any human, the best Havannah-playing software plays weakly compared to human experts. In 2002 Freeling offered a prize of 1000 euros, available through 2012, for any computer program that could beat him in even one game of a ten-game match.

Fig. 1. Three finished games: a ring (a loop, by black), a bridge (linking two corners, by white) and a fork (linking three edges, by black).

Havannah is somewhat related to Go and Hex. It’s fully observable, involves connections; also, rings are somewhat related to the concept of eye or capture in the game of Go. Go has been for a while the main target for AI in games, as it’s a very famous game, very important in many Asian countries; however, it is no longer true that computers are only at the level of a novice. Since MCTS/UCT (Monte-Carlo Tree Search, Upper Confidence Trees) approaches have been defined [6, 9, 14], several improvements have appeared like First-Play Urgency [20], Rave-values [4, 12], patterns and progressive widening [10, 7], better than UCB-like (Upper Confidence Bounds) exploration terms [16], large-scale parallelization [11, 8, 5, 13], automatic building of huge opening books [2]. Thanks to all these improvements, MoGo has already won games against a professional player in 9x9 (Amsterdam, 2007; Paris, 2008; Taiwan 2009), and recently won with handicap 6 against a professional player, Li-Chen Chien (Tainan, 2009), and with handicap 7 against a top professional player, Zhou Junxun, winner of the LG-Cup 2007 (Tainan, 2009). The following features make Havannah very difficult for computers, perhaps yet more difficult than the game of Go:
– few local patterns are known for Havannah;
– no natural evaluation function;
– no pruning rule for reducing the number of reasonable moves;
– large action space (271 for the first move with board size 10).
The advantage of Havannah, as well as in Go (except in pathological cases for the game of Go), is that simulations have a bounded length: the size of the board.
The goal of this paper is to investigate the generality of this approach by testing it in Havannah.
By way of notation, x ± y means a result with average x, and 95% confidence interval [x − y, x + y] (i.e. y is two standard deviations).
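As a concrete illustration of this notation (not from the paper; the function name is ours), for a success rate estimated from n independent games the half-width y is simply twice the standard error of the estimated mean:

```python
import math

def reported_result(wins, n):
    """Return (x, y) so that a result is written x +/- y,
    with y = two standard deviations of the estimated mean."""
    x = wins / n
    y = 2 * math.sqrt(x * (1 - x) / n)
    return x, y

print(reported_result(750, 1000))   # roughly (0.75, 0.027)
```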
2 UCT
Upper Confidence Trees are the most straightforward choice when implementing a Monte-Carlo Tree Search. The basic principle is as follows. As long as there is time before playing, the algorithm performs random simulations from a UCT tree leaf. These simulations (playouts) are complete possible games, from the root of the tree until the game is over, played by a random player playing both black and white. In its most simple version, the random player, which is used when in a situation s which is not in the tree, just plays randomly and uniformly among legal moves in s; we did not implement anything more sophisticated, as we are more interested in algorithms than in heuristic tricks. When the random player is in a situation s already in the UCT tree, then its choices depend on the statistics: number of wins and number of losses in previous games, for each legal move in s. The detailed formula used for choosing a move depending on statistics is termed a bandit formula, discussed below. After each simulation, the first situation of the simulated game that was not yet in memory is archived in memory, and all statistics of numbers of wins and numbers of defeats are updated in each situation in memory which was traversed by the simulation.
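The whole procedure can be condensed into a short sketch. This is only one possible reading of the description above, assuming a generic game interface (legal_moves, play, to_move, winner) that is not part of the paper; play(s, move) is assumed to return a new situation without modifying s, and draws are scored 0.5 as an arbitrary choice.

```python
import math, random, time

class Node:
    def __init__(self):
        self.wins = 0.0        # wins for the player who made the move into this node
        self.sims = 0
        self.children = {}     # move -> Node

def bandit_score(parent_sims, child):
    """Empirical success rate plus the exploration term of Eq. 2 below."""
    return (child.wins / child.sims
            + math.sqrt(0.25 * math.log(2 + parent_sims) / child.sims))

def uct_search(root_state, seconds, legal_moves, play, to_move, winner):
    """winner(s) returns None while the game is running, else "black", "white" or "draw"."""
    root = Node()
    deadline = time.time() + seconds
    while time.time() < deadline:
        state, node, path = root_state, root, []
        # 1. descend the tree with the bandit formula while all legal moves have children
        while winner(state) is None and len(node.children) == len(legal_moves(state)):
            mover = to_move(state)
            move = max(node.children, key=lambda m: bandit_score(node.sims, node.children[m]))
            node = node.children[move]
            path.append((mover, node))
            state = play(state, move)
        # 2. archive the first situation of the simulated game not yet in memory
        if winner(state) is None:
            mover = to_move(state)
            move = random.choice([m for m in legal_moves(state) if m not in node.children])
            node.children[move] = Node()
            node = node.children[move]
            path.append((mover, node))
            state = play(state, move)
        # 3. playout: the uniform random player finishes the game
        while winner(state) is None:
            state = play(state, random.choice(legal_moves(state)))
        # 4. update the win/loss statistics of every traversed situation
        w = winner(state)
        root.sims += 1
        for mover, n in path:
            n.sims += 1
            n.wins += 1.0 if w == mover else (0.5 if w == "draw" else 0.0)
    # finally play the most simulated move at the root
    return max(root.children, key=lambda m: root.children[m].sims)
```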
Bandit formula
The most classical bandit formula is UCB (Upper Confidence Bounds [15, 3]); Monte-Carlo Tree Search based on UCB is termed UCT. The idea is to compute an exploration/exploitation score for each legal move in a situation s, for choosing which of these moves must be simulated. Then, the move with maximal exploration/exploitation score is chosen. The exploration/exploitation score of a move d is typically the sum of the empirical success rate score(d) of the move d, and of an exploration term which is strong for less explored moves. UCB uses Hoeffding’s bound for defining the exploration term:

exploration_Hoeffding = √( log(2/δ) / n )    (1)

where n is the number of simulations for move d. δ is usually chosen linear as a function of the number m of simulations of the situation s: δ is linear in m. In our implementation, with a uniform random player, the formula was empirically set to:

exploration_Hoeffding = √( 0.25 log(2 + m) / n ).    (2)
[1, 17] has shown the efficiency of using Bernstein’s bound instead of Hoeffding’s bound, in some settings. The exploration term is then:

exploration_Bernstein = √( score(d)(1 − score(d)) · 2 log(3/δ) / n ) + 3 log(3/δ) / n    (3)
This term is smaller for moves with small variance (score(d) close to 0 or to 1).
After the empirical tuning of Hoeffding’s equation 2, we tested Bernstein’s formula as follows (with p = score(d) for short):

exploration_Bernstein = √( 4Kp(1 − p) log(2 + m) / n ) + 3 · 2K log(2 + m) / n.    (4)

Within the “2+”, which is here for avoiding special cases for 0, this is Bernstein’s formula for δ linear as a function of m.
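For reference, both exploration terms translate directly into code (a small sketch; the function names are ours and Eq. 4 is used exactly as written above):

```python
import math

def exploration_hoeffding(m, n):
    """Eq. 2: m = simulations of the situation s, n = simulations of the move d."""
    return math.sqrt(0.25 * math.log(2 + m) / n)

def exploration_bernstein(p, m, n, K):
    """Eq. 4, with p = score(d), the empirical success rate of the move d."""
    return (math.sqrt(4 * K * p * (1 - p) * math.log(2 + m) / n)
            + 3 * 2 * K * math.log(2 + m) / n)

m, n, K = 1000, 30, 0.03
print(exploration_hoeffding(m, n))            # ~0.24
print(exploration_bernstein(0.50, m, n, K))   # maximal variance term
print(exploration_bernstein(0.95, m, n, K))   # smaller bonus for an extreme score
```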
We tested several values of K. The first value 0.25 corresponds to Hoeffding’s bound except that the second term is added:

K Score against Hoeffding’s formula 2
0.250 0.503 ± 0.011
0.100 0.578 ± 0.010
0.030 0.646 ± 0.005
0.010 0.652 ± 0.006
0.001 0.582 ± 0.012
0.000 0.439 ± 0.015
Experiments are performed with 1000 simulations per move, with board size 5. We tried to experiment with values of K below 0.01, but with poor results.
Scaling of UCT
Usually, UCT provides better and better results when the number of simulations per move is increased. Typically, the winning rate with 2z simulations against UCT with z simulations is roughly 63% in the game of Go (see e.g. [11]). In the case of Havannah, with a uniform random player (i.e. the random player plays uniformly among legal moves) and the bandit formula as in Eq. 2, we get the following results for the UCT tuned as above (K = 0.25):
Number of simulations of both players    Success rate
250 vs 125 0.75 ± 0.02
500 vs 250 0.84 ± 0.02
1000 vs 500 0.72 ± 0.03
2000 vs 1000 0.75 ± 0.02
4000 vs 2000 0.74 ± 0.05
These experiments are performed on a board of size 8.
3 Guiding exploration
UCT-like algorithms are quite strong for balancing exploration and exploitation. On the other hand, they provide no information for unexplored moves, and on how to choose among these moves; and little information for loosely explored moves. Various tricks have been proposed around this: First Play Urgency, Progressive widening, Rapid Action Value Estimates (RAVE).
3.1 Progressive widening/unpruning
In progressive widening [10, 7, 19], we first rank the legal moves at a situation s according to some heuristic: the moves are then renamed 1, 2, . . . , n, with i < j if move i is preferred to move j for the heuristic. Then, at the m-th simulation of a node, all moves with index larger than f(m) have −∞ score (i.e. are discarded), with f(m) some non-decreasing mapping from N to N. It was shown in [19] that this can work even with random ranking, with f(m) = Km^{1/4} for some constant K. In [10] it was shown that f(m) = Km^{1/3.4} for some constant K

Citations
Journal ArticleDOI
TL;DR: A survey of the literature on Monte Carlo tree search to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research; it outlines the core algorithm's derivation, imparts some structure on the many variations and enhancements that have been proposed, and summarizes the results from the key game and nongame domains.
Abstract: Monte Carlo tree search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm's derivation, impart some structure on the many variations and enhancements that have been proposed, and summarize the results from the key game and nongame domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.

2,682 citations


Cites background from "Creating an upper-confidence-tree p..."

  • ...Another approach called dynamic exploration, proposed by Bourki et al. [25], tunes parameters based on patterns in their Go program MOGO....

    [...]

  • ...MCCFR works by sampling blocks of terminal histories (paths through the game tree from root to leaf), and computing immediate counterfactual regrets over those blocks....

    [...]

Book
12 Dec 2012
TL;DR: In this article, the authors focus on regret analysis in the context of multi-armed bandit problems, where regret is defined as the balance between staying with the option that gave highest payoff in the past and exploring new options that might give higher payoffs in the future.
Abstract: A multi-armed bandit problem - or, simply, a bandit problem - is a sequential allocation problem defined by a set of actions. At each time step, a unit resource is allocated to an action and some observable payoff is obtained. The goal is to maximize the total payoff obtained in a sequence of allocations. The name bandit refers to the colloquial term for a slot machine (a "one-armed bandit" in American slang). In a casino, a sequential allocation problem is obtained when the player is facing many slot machines at once (a "multi-armed bandit"), and must repeatedly choose where to insert the next coin. Multi-armed bandit problems are the most basic examples of sequential decision problems with an exploration-exploitation trade-off. This is the balance between staying with the option that gave highest payoffs in the past and exploring new options that might give higher payoffs in the future. Although the study of bandit problems dates back to the 1930s, exploration-exploitation trade-offs arise in several modern applications, such as ad placement, website optimization, and packet routing. Mathematically, a multi-armed bandit is defined by the payoff process associated with each option. In this book, the focus is on two extreme cases in which the analysis of regret is particularly simple and elegant: independent and identically distributed payoffs and adversarial payoffs. Besides the basic setting of finitely many actions, it also analyzes some of the most important variants and extensions, such as the contextual bandit model. This monograph is an ideal reference for students and researchers with an interest in bandit problems.

2,427 citations

Journal ArticleDOI
TL;DR: The Monte-Carlo revolution in computer Go is surveyed, the key ideas that led to the success of MoGo and subsequent Go programs are outlined, and for the first time a comprehensive description, in theory and in practice, of this extended framework for Monte- Carlo tree search is provided.

375 citations

Journal ArticleDOI
TL;DR: This paper describes the leading algorithms for Monte-Carlo tree search and explains how they have advanced the state of the art in computer Go.
Abstract: The ancient oriental game of Go has long been considered a grand challenge for artificial intelligence. For decades, computer Go has defied the classical methods in game tree search that worked so successfully for chess and checkers. However, recent play in computer Go has been transformed by a new paradigm for tree search based on Monte-Carlo methods. Programs based on Monte-Carlo tree search now play at human-master levels and are beginning to challenge top professional players. In this paper, we describe the leading algorithms for Monte-Carlo tree search and explain how they have advanced the state of the art in computer Go.

212 citations

Journal Article
TL;DR: In this paper, three parallelization methods for Monte-Carlo Tree Search (MCTS) are discussed: leaf parallelization, root parallelization and tree parallelization (tree parallelization requires two techniques: adequately handling of local mutexes and virtual loss).
Abstract: Monte-Carlo Tree Search (MCTS) is a new best-first search method that started a revolution in the field of Computer Go. Parallelizing MCTS is an important way to increase the strength of any Go program. In this article, we discuss three parallelization methods for MCTS: leaf parallelization, root parallelization, and tree parallelization. To be effective, tree parallelization requires two techniques: adequate handling of (1) local mutexes and (2) virtual loss. Experiments in 13x13 Go reveal that in the program Mango root parallelization may lead to the best results for a specific time setting and specific program parameters. However, as soon as the selection mechanism is able to handle more adequately the balance of exploitation and exploration, tree parallelization should have attention too and could become a second choice for parallelizing MCTS. Preliminary experiments on the smaller 9x9 board provide promising prospects for tree parallelization.

198 citations

References
Journal ArticleDOI
TL;DR: This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
Abstract: Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions while taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is the loss due to the fact that the globally optimal policy is not followed all the times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first ones to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.

6,361 citations

Book ChapterDOI
18 Sep 2006
TL;DR: In this article, a bandit-based Monte-Carlo planning algorithm is proposed for large state-space Markovian decision problems (MDPs), which is one of the few viable approaches to find near-optimal solutions.
Abstract: For large state-space Markovian Decision Problems Monte-Carlo planning is one of the few viable approaches to find near-optimal solutions. In this paper we introduce a new algorithm, UCT, that applies bandit ideas to guide Monte-Carlo planning. In finite-horizon or discounted MDPs the algorithm is shown to be consistent and finite sample bounds are derived on the estimation error due to sampling. Experimental results show that in several domains, UCT is significantly more efficient than its alternatives.

2,695 citations

Book ChapterDOI
29 May 2006
TL;DR: A new framework to combine tree search with Monte-Carlo evaluation, which does not separate between a min-max phase and a Monte-Carlo phase, is presented; it provides fine-grained control of the tree growth, at the level of individual simulations, and allows efficient selectivity.
Abstract: A Monte-Carlo evaluation consists in estimating a position by averaging the outcome of several random continuations. The method can serve as an evaluation function at the leaves of a min-max tree. This paper presents a new framework to combine tree search with Monte-Carlo evaluation, that does not separate between a min-max phase and a Monte-Carlo phase. Instead of backing-up the min-max value close to the root, and the average value at some depth, a more general backup operator is defined that progressively changes from averaging to min-max as the number of simulations grows. This approach provides a fine-grained control of the tree growth, at the level of individual simulations, and allows efficient selectivity. The resulting algorithm was implemented in a 9 × 9 Go-playing program, Crazy Stone, that won the 10th KGS computer-Go tournament.

1,273 citations

Journal ArticleDOI
TL;DR: Two progressive strategies for MCTS are introduced, called progressive bias and progressive unpruning, which enable the use of relatively time-expensive heuristic knowledge without speed reduction.
Abstract: Monte-Carlo Tree Search (MCTS) is a new best-first search guided by the results of Monte-Carlo simulations. In this article, we introduce two progressive strategies for MCTS, called progressive bias and progressive unpruning. They enable the use of relatively time-expensive heuristic knowledge without speed reduction. Progressive bias directs the search according to heuristic knowledge. Progressive unpruning first reduces the branching factor, and then increases it gradually again. Experiments assess that the two progressive strategies significantly improve the level of our Go program Mango. Moreover, we see that the combination of both strategies performs even better on larger board sizes.

458 citations

Frequently Asked Questions (8)
Q1. What have the authors contributed in "Creating an upper-confidence-tree program for havannah" ?

In this paper, the authors test the generality of the approach by experimenting on another game, Havannah, which is known for being especially difficult for computers. The authors show that the same results hold, with slight differences related to the absence of clearly known patterns for the game of Havannah, in spite of the fact that Havannah is more related to connection games like Hex than to territory games like Go. 

In 2002 Freeling offered a prize of 1000 euros, available through 2012, for any computer program that could beat him in even one game of a ten-game match. 

Bandit formula: The most classical bandit formula is UCB (Upper Confidence Bounds [15, 3]); Monte-Carlo Tree Search based on UCB is termed UCT.

K=0, size 5, 30 000 simulations/move: 0.53 ± 0.02
R=50, K=0.25, size 5, 30 000 simulations/move: 0.47 ± 0.04
R=50, K=0.05, size 5, 30 000 simulations/move: 0.60 ± 0.02
R=50, K=0.02, size 5, 30 000 simulations/move: 0.60 ± 0.03
R=5, K=0.02, size 5, 30 000 simulations/move: 0.61 ± 0.06
R=20, K=0.02, size 5, 30 000 simulations/move: 0.66 ± 0.03

The authors experimented with 500 simulations per move, size 5, and various constants P and Q for the progressive widening f(m) = Q⌊m^P⌋:
Q, P    Success rate against no prog. widening

In their implementation, with a uniform random player, the formula was empirically set to: exploration_Hoeffding = √( 0.25 log(2 + m) / n ). (2) [1, 17] has shown the efficiency of using Bernstein’s bound instead of Hoeffding’s bound, in some settings.

In the following lines, using a weaker exploration and a small value of R, the authors get better results. [16] points out that, with big simulation times, K = 0 was better, but an exploration bonus depending on patterns was used instead.

The main weaknesses of RAVE are:
– tuning is required for each new number of simulations;
– the results for large numbers of simulations are less impressive (yet significant).