Creating an upper-confidence-tree program for havannah
Summary
1 Introduction
- The rules are simple: each player in turn places one stone on an empty cell.
- These figures are presented in Fig. 1 for the sake of clarity.
- In 2002 Freeling offered a prize of 1000 euros, available through 2012, for any computer program that could beat him in even one game of a ten-game match.
- The following features make Havannah very difficult for computers, perhaps yet more difficult than the game of Go:
  – few local patterns are known for Havannah;
  – there is no natural evaluation function;
  – there is no pruning rule for reducing the number of reasonable moves;
  – the action space is large (271 possible first moves with board size 10).
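The 271 figure follows directly from the board geometry: a hexagonal board of side n has 3n(n−1)+1 cells (a centered hexagonal number), so side 10 gives 271. A one-line check (the helper name is illustrative, not from the paper):

```python
def hex_board_cells(n):
    """Number of cells on a hexagonal board of side length n:
    3*n*(n-1) + 1, the centered hexagonal number."""
    return 3 * n * (n - 1) + 1

print(hex_board_cells(10))  # -> 271, the first-move action space at size 10
print(hex_board_cells(5))   # -> 61, the size-5 board used in the experiments
```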
2 UCT
- Trees are the most straightforward choice when implementing a Monte-Carlo Tree Search.
- As long as there is time before playing, the algorithm performs random simulations from a UCT tree leaf.
- Usually, UCT provides better and better results when the number of simulations per move is increased.
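As a minimal sketch of the selection rule underlying UCT (not the authors' implementation; the exploration constant c and the tuple layout are illustrative), each tree node picks the child maximizing an upper confidence bound on its mean reward:

```python
import math

def ucb_score(wins, visits, parent_visits, c=0.5):
    """UCB: empirical mean plus an exploration bonus that shrinks
    as a move is visited more often. Unvisited moves score +inf,
    so every child is tried at least once."""
    if visits == 0:
        return float("inf")
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children):
    """children: list of (move, wins, visits) tuples for one node;
    UCT descends the tree by repeatedly taking the argmax of UCB."""
    parent_visits = sum(v for _, _, v in children) or 1
    return max(children, key=lambda ch: ucb_score(ch[1], ch[2], parent_visits))

# toy node with two sampled moves and one unexplored move
print(select_child([("a", 6, 10), ("b", 4, 10), ("c", 0, 0)])[0])  # -> c
```

The +inf score for unvisited moves is exactly the weakness Section 3 discusses: UCB alone gives no ordering among unexplored moves.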
3 Guiding exploration
- UCT-like algorithms are quite strong for balancing exploration and exploitation.
- On the other hand, they provide no information for unexplored moves (and hence no guidance on how to choose among them), and little information for loosely explored moves.
- Various tricks have been proposed around this: First Play Urgency, Progressive widening, Rapid Action Value Estimates (RAVE).
3.1 Progressive widening/unpruning
- The legal moves are first ranked (e.g. by a heuristic); then, at the mth simulation of a node, all moves with rank larger than f(m) are given score −∞ (i.e. are discarded), where f is some non-decreasing mapping from ℕ to ℕ.
- Consider a node of a tree which is explored 50 times only (this certainly happens for many nodes deep in the tree).
- Meanwhile, progressive widening will sample only a few moves, e.g. 4, and sample the best of these 4 moves much more often; this is likely to be better than taking the average of all moves as an evaluation function.
- These experiments were performed with the exploration formula given in Eq.
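A minimal sketch of progressive widening as described above, assuming the moves are pre-ranked by some heuristic; the function names and the constants Q = 1, P = 0.4 are illustrative, not the paper's tuned values:

```python
import math

def widening_count(m, Q=1.0, P=0.4):
    """f(m) = floor(Q * m**P): a non-decreasing map from N to N giving
    the number of top-ranked moves considered after m simulations of a
    node (at least 1 so the node is never empty)."""
    return max(1, math.floor(Q * m ** P))

def candidate_moves(ranked_moves, m):
    """Moves beyond rank f(m) are treated as having score -inf,
    i.e. they are pruned until the node has been simulated enough."""
    return ranked_moves[: widening_count(m)]

# a node explored 50 times samples only the 4 best-ranked moves
print(widening_count(50))  # -> 4 with these illustrative constants
```

With these constants, a node explored only 50 times restricts attention to 4 moves, matching the "e.g. 4 moves" scenario above.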
3.2 Rapid Action Value Estimate
- In the case of Go, [4, 12] propose to average the score with a permutation-based statistical estimate.
- In the game of Go, RAVE values are a great improvement.
- They involve complicated implementations due to captures and re-captures.
- In the case of Havannah there is no such problem, and the results turn out to be good.
- The first line corresponds to the configuration empirically chosen for 1000 simulations per move; its results are disappointing, almost equivalent to UCT, for these experiments with 30 000 simulations per move.
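To make the idea concrete, here is a hedged sketch of RAVE-style value blending: the node's Monte-Carlo mean is mixed with the all-moves-as-first (AMAF) estimate, with a weight that decays as the node's own visit count grows. The schedule beta = R/(R + n) is one common choice from the computer-Go literature, not necessarily the exact formula of this paper; R plays the role of the tuning constant discussed in the experiments.

```python
def rave_value(q_mc, n_mc, q_amaf, n_amaf, R=50):
    """Blend the Monte-Carlo mean q_mc (n_mc visits) with the AMAF
    estimate q_amaf (n_amaf samples). beta = R / (R + n_mc) weights
    the AMAF value heavily early on and fades it out as data accrues."""
    if n_amaf == 0:
        return q_mc  # no AMAF statistics yet: fall back to plain MC
    beta = R / (R + n_mc)
    return (1 - beta) * q_mc + beta * q_amaf

# early on (2 visits) the AMAF statistic dominates the estimate
print(round(rave_value(0.5, 2, 0.8, 40, R=50), 3))  # -> 0.788
```

This illustrates why tuning matters: a different R changes how quickly the AMAF term fades, which is consistent with the sensitivity to R seen in the results above.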
4 Games against Havannah-Applet
- The authors tested their program against Havannah-Applet http://dfa.imn.htwk-leipzig.de/havannah/, recommended by the MindSports association as the only publicly available program that plays by the Havannah rules.
- In both cases, their program (based on RAVE, with no exploration term and no progressive widening) was black; white starts and therefore has the advantage in Havannah.
- The first game (played with 8 seconds per move for their program, against 30s for the opponent) is presented in Fig.
- The estimated success rate then increases steadily until the end of the game.
- Consistently again, the estimated success rate is lower than 50 % at the beginning (as the opponent, playing first, has the advantage initially).
5 Discussion
- The authors could clearly validate in the case of Havannah the efficiency of some well known techniques coming from computer-Go, showing the generality of the MCTS approach.
- The success rate is higher than in the case of the game of Go (around 75%).
- Progressive widening, in spite of the fact that it was shown in [19] to work even without a heuristic, was not significant in the authors' experiments.
- The authors' program could defeat Havannah-Applet easily, even though it was playing as black and with only 2 s per move instead of 30 s (running on a single core).
- Running more experiments was difficult due to lack of automated interface.
Frequently Asked Questions
Q2. How many euros did Freeling offer in 2002?
In 2002 Freeling offered a prize of 1000 euros, available through 2012, for any computer program that could beat him in even one game of a ten-game match.
Q3. What is the common bandit formula?
The most classical bandit formula is UCB (Upper Confidence Bounds [15, 3]); Monte-Carlo Tree Search based on UCB is termed UCT.
Q4. What success rates were obtained for heuristic RAVE at 30 000 simulations per move?
K=0, size 5, 30 000 simulations/move: 0.53 ± 0.02
R=50, K=0.25, size 5, 30 000 simulations/move: 0.47 ± 0.04
R=50, K=0.05, size 5, 30 000 simulations/move: 0.60 ± 0.02
R=50, K=0.02, size 5, 30 000 simulations/move: 0.60 ± 0.03
R=5, K=0.02, size 5, 30 000 simulations/move: 0.61 ± 0.06
R=20, K=0.02, size 5, 30 000 simulations/move: 0.66 ± 0.03
Q5. What is the heuristic for progressive widening?
The authors experimented with 500 simulations per move, size 5, and various constants P and Q for the progressive widening f(m) = ⌊Q · m^P⌋, measuring the success rate against no progressive widening.
Q6. what is the optimum value of the 'exploration' term?
In their implementation, with a uniform random player, the formula was empirically set to exploration_Hoeffding = sqrt(0.25 · log(2 + m) / n) (Eq. 2). [1, 17] have shown the efficiency of using Bernstein's bound instead of Hoeffding's bound in some settings.
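The empirically tuned bonus of Eq. (2) can be sketched as follows; the interpretation of m as the parent-node simulation count and n as the move's own simulation count is an assumption based on the usual UCT convention:

```python
import math

def exploration_hoeffding(m, n):
    """Exploration bonus of Eq. (2): sqrt(0.25 * log(2 + m) / n).
    m: simulation count of the parent node (assumed meaning);
    n: simulation count of the candidate move (assumed meaning)."""
    return math.sqrt(0.25 * math.log(2 + m) / n)

# the bonus shrinks as a move is sampled more often
print(exploration_hoeffding(100, 1) > exploration_hoeffding(100, 50))  # -> True
```

Note the 1/sqrt(n) decay: quadrupling a move's visit count halves its bonus, which is what gradually shifts the search from exploration to exploitation.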
Q7. What is the difference between the two lines?
In the following lines, using a weaker exploration term and a small value of R, the authors obtain better results. [16] points out that, with large simulation times, K = 0 was better, but an exploration bonus depending on patterns was used instead.
Q8. What are the main weaknesses of RAVE?
The main weaknesses of RAVE are:
– tuning is required for each new number of simulations;
– the results for large numbers of simulations are less impressive (yet significant).