
HAL Id: inria-00386477
https://hal.inria.fr/inria-00386477
Submitted on 21 May 2009
To cite this version:
Guillaume Chaslot, Christophe Fiter, Jean-Baptiste Hoock, Arpad Rimmel, Olivier Teytaud. Adding expert knowledge and exploration in Monte-Carlo Tree Search. Advances in Computer Games, 2009, Pamplona, Spain. ⟨inria-00386477⟩

Adding expert knowledge and exploration in Monte-Carlo Tree Search

Guillaume Chaslot (1), Christophe Fiter (2), Jean-Baptiste Hoock (2), Arpad Rimmel (2), Olivier Teytaud (2)

(1) Games and AI Group, MICC, Faculty of Humanities and Sciences, Universiteit Maastricht, Maastricht, The Netherlands
(2) TAO (Inria), LRI, UMR 8623 (CNRS - Univ. Paris-Sud), bat 490 Univ. Paris-Sud, 91405 Orsay, France, teytaud@lri.fr
Abstract. We present a new exploration term, more efficient than classical UCT-like exploration terms and combining efficiently expert rules, patterns extracted from datasets, All-Moves-As-First values and classical online values. As this improved bandit formula does not solve several important situations (semeais, nakade) in computer Go, we present three other important improvements which are central in the recent progress of our program MoGo:
– We show an expert-based improvement of Monte-Carlo simulations for nakade situations; we also emphasize some limitations of this modification.
– We show a technique which preserves diversity in the Monte-Carlo simulation, which greatly improves the results in 19x19.
– Whereas the UCB-based exploration term is not efficient in MoGo, we show a new exploration term which is highly efficient in MoGo.
MoGo recently won a game with handicap 7 against a 9 Dan professional player, Zhou JunXun, winner of the LG Cup 2007, and a game with handicap 6 against a 1 Dan professional player, Li-Chen Chien.

A preliminary version of this work was presented at the EWRL workshop, without proceedings.
1 Introduction
Monte-Carlo Tree Search (MCTS [5, 7, 11]) is a recent tool for difficult planning
tasks. Impressive results have already been produced in the case of the game of
Go [7, 10].
MCTS consists in building a tree, in which nodes are situations of the considered environment and branches are the actions that can be taken by the agent. The main point in MCTS is that the tree is highly unbalanced, with a strong bias in favor of important parts of the tree. The focus is on the parts of the tree in which the expected gain is the highest. For estimating which situation should be further analyzed, several algorithms have been proposed: UCT [11] (Upper Confidence Trees) focuses on the proportion of winning simulations plus an uncertainty measure; AMAF [4, 1, 10] (All Moves As First, also termed RAVE, for Rapid Action-Value Estimates, in the MCTS context) focuses on a compromise between UCT and heuristic information extracted from the simulations; BAST [6] (Bandit Algorithm for Search in Tree) uses UCB-like bounds modified through the overall number of nodes in the tree. Other related algorithms have been proposed, as in [5], essentially using a decreasing impact of a heuristic (pattern-dependent) bias as the number of simulations increases. In all these cases, the idea is to bias random simulations thanks to statistics.
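For concreteness, the classical UCT selection score mentioned above (proportion of wins plus an uncertainty measure) is typically of the standard UCB1 form

  UCT(d) = w(d) / n(d) + c * sqrt( log(n) / n(d) ),

where w(d) is the number of won simulations of decision d, n(d) its number of simulations, n the number of simulations of the parent situation, and c an exploration constant. This standard term is given only as background; MoGo's actual formula is the one presented in section 2.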
In the context of the game of Go (definitions of the different Go terms used in this article can be found on the web site http://senseis.xmp.net/), the nodes are equipped with one Go board configuration and with statistics, typically the number of won and lost games in the simulations started from this node (the RAVE statistics require some more statistics). MCTS uses these statistics in order to iteratively expand the tree in the regions where the expected reward is maximum. After each simulation from the current position (the root) until the end of the game, the win and loss statistics are updated in every node concerned by the simulation, and a new node corresponding to the first new situation of the simulation is created. The algorithm is therefore as follows:
Initialize the tree T to only one node, equipped with the current situation.
while there is time left do
  Simulate one game g from the root of the tree to a final position, choosing moves as follows:
    Bandit part: for a situation in T, choose the move with maximal score.
    MC part: for a situation out of T, choose the move thanks to Alg. 1.
  Update win/loss statistics in all situations of T crossed by g.
  Add in T the first situation of g which is not yet in T.
end while
Return the move simulated most often from the root of T.
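The loop above can be sketched in Python as follows; the state interface (is_final, legal_moves, play, won_by, player_to_move), the score and mc_policy callables, and the assumption that states are hashable are all placeholders introduced for illustration, not MoGo's actual code.

import time

def mcts(root, time_budget, score, mc_policy):
    # T: win/loss statistics for every situation already in the tree.
    tree = {root: {"wins": 0, "sims": 0}}
    deadline = time.time() + time_budget
    while time.time() < deadline:                        # while there is time left
        state, path, first_new = root, [], None
        while not state.is_final():
            if state in tree:
                # Bandit part: inside T, choose the move with maximal score.
                move = max(state.legal_moves(), key=lambda m: score(tree, state, m))
            else:
                if first_new is None:
                    first_new = state                    # first situation of g not yet in T
                # MC part: outside T, choose the move with the Monte-Carlo policy (Alg. 1).
                move = mc_policy(state)
            path.append(state)
            state = state.play(move)
        win = state.won_by(root.player_to_move())        # outcome of the simulated game g
        for s in path:                                   # update win/loss statistics along g
            if s in tree:
                tree[s]["sims"] += 1
                tree[s]["wins"] += int(win)
        if first_new is not None:                        # add the first new situation of g to T
            tree[first_new] = {"wins": int(win), "sims": 1}
    # Return the move simulated most often from the root of T.
    return max(root.legal_moves(), key=lambda m: tree.get(root.play(m), {"sims": 0})["sims"])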
The reader is referred to [5, 7, 11, 10, 9] for a detailed presentation of MCTS
techniques and various scores. We will propose our current bandit formula in
section 2.
The function used for taking decisions out of the tree (i.e. the so-called Monte-
Carlo part, MC) is defined in Algorithm 1. An atari occurs when a string (a group
of stones) can be captured in one move. Some Go knowledge has been added in
this part in the form of 3 × 3 expert-designed patterns in order to play more meaningful games.
Unfortunately, some bottlenecks appear in MCTS. In spite of many improvements in the bandit formula, there are still situations which are poorly handled by MCTS. MCTS uses a bandit formula for moves early in the tree, but it cannot figure out long-term effects which involve the behavior of the simulations far from the root. The situations which are to be clarified at the very end should therefore be included in the Monte-Carlo part and not in the bandit.
We therefore propose three improvements in the MC part:
– Diversity preservation, as explained in section 3.1;
– Nakade refinements, as explained in section 3.2;
– Elements around the semeai, as explained in section 3.3.

Algorithm 1 Algorithm for choosing a move in MC simulations, for the game of Go.
if the last move is an atari then
  Save the stones which are in atari if possible (this is checked by liberty count).
else
  if there is an empty location among the 8 locations around the last move which matches a pattern then
    Sequential move: play uniformly at random in one of these locations.
  else
    if there is a move which captures stones then
      Capture move: capture stones.
    else
      if there is a legal move then
        Legal move: play randomly a legal move.
      else
        Return pass.
      end if
    end if
  end if
end if
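A Python sketch of this policy is given below, under an assumed board API (is_atari, saving_moves, neighbours8, is_empty, matches_pattern, capturing_moves and legal_moves are placeholders, not MoGo's actual interface); falling through to the next rule when the stones in atari cannot be saved is our reading of the algorithm, not a detail stated above.

import random

def mc_move(board, last_move, rng=random):
    # Atari rule: save the stones put in atari by the last move, if possible (liberty count check).
    if last_move is not None and board.is_atari(last_move):
        saving = board.saving_moves(last_move)
        if saving:
            return rng.choice(saving)
    # Sequential move: empty locations among the 8 around the last move matching a 3x3 pattern.
    if last_move is not None:
        seq = [p for p in board.neighbours8(last_move)
               if board.is_empty(p) and board.matches_pattern(p)]
        if seq:
            return rng.choice(seq)                       # uniformly at random among them
    # Capture move: capture stones if possible.
    captures = board.capturing_moves()
    if captures:
        return rng.choice(captures)
    # Legal move, otherwise pass.
    legal = board.legal_moves()
    return rng.choice(legal) if legal else "pass"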
2 Combining offline, transient and online learning and expert knowledge, with an exploration term
In this section we present how we combine online learning (bandit module), transient learning (RAVE values), expert knowledge (detailed below) and offline pattern information. RAVE values are presented in [10]. We point out that this combination is far from being straightforward: due to the subtle equilibrium between online learning (i.e. naive success rates of moves), transient learning (RAVE values) and offline values, the first experiments were highly negative, and became clearly conclusive only after careful tuning of the parameters (we used both manual tuning and cross-entropy methods; parallelization was highly helpful for this).
The score for a decision d (i.e. a legal move) is as follows:
  score(d) = α p̂(d)   +   β p̃(d)   +   ( γ + C / log(2 + n(d)) ) H(d)      (1)
            (online)     (transient)               (offline)
where the coefficients α, β, γ and C are empirically tuned, depending on n(d) (the number of simulations of the decision d) and n (the number of simulations of the current board) as follows:

  β = #{rave sims} / ( #{rave sims} + #{sims} + c1 #{sims} #{rave sims} )    (2)
  γ = c2 / #{rave sims}                                                      (3)
  α = 1 − β − γ                                                              (4)
where #{rave sims} is the number of RAVE simulations, #{sims} is the number of simulations, and C, c1 and c2 are empirically tuned. For the sake of completeness, we specify that C, c1 and c2 depend on the board size, and are not the same in the root of the tree during the beginning of the thinking time, in the root of the tree during the end of the thinking time, and in the other nodes. Also, this formula is most often computed with an approximated (faster) formula, and sometimes with the complete formula; it was empirically found that the constants should not be the same in both cases. All these local engineering improvements make the formula quite unclear; the take-home message is mainly that MoGo has good results with α + β + γ = 1, γ = c2 / #{rave sims}, and the logarithmic formula C / log(2 + n(d)) for progressive unpruning. These rules imply that:
– initially, the most important part is the offline learning;
– later, the most important part is the transient learning (RAVE values);
– eventually, only the “real” (online) statistics matter.
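To make this schedule concrete, the following short Python function is a minimal sketch of Eqs. (1)-(4); the counters, the guard cases and the constant values c1, c2 and C are illustrative assumptions, not MoGo's tuned implementation.

import math

def bandit_score(won, sims, rave_won, rave_sims, H, c1=0.1, c2=100.0, C=1.0):
    # Eq. (2): weight of the transient (RAVE) term.
    denom = rave_sims + sims + c1 * sims * rave_sims
    beta = rave_sims / denom if denom > 0 else 0.0
    # Eq. (3): weight of the offline term, decreasing as RAVE simulations accumulate.
    gamma = c2 / rave_sims if rave_sims > 0 else 1.0
    # Eq. (4): the three weights sum to one.
    alpha = 1.0 - beta - gamma
    p_online = won / sims if sims > 0 else 0.0               # naive online success rate
    p_rave = rave_won / rave_sims if rave_sims > 0 else 0.0  # transient (AMAF/RAVE) value
    # Eq. (1): online + transient + offline, with the C / log(2 + n(d)) exploration term.
    return alpha * p_online + beta * p_rave + (gamma + C / math.log(2 + sims)) * H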
H(d) is the sum of two terms: patterns, as in [3, 5, 8], and the rules detailed below:
– capture moves (in particular, strings contiguous to a new string in atari), extension (in particular out of a ladder), avoid self-atari, atari (in particular when there is a ko), distance to the border (optimum distance = 3 in 19x19 Go), short distance to previous moves, short distance to the move before the previous move; also, locations which have probability nearly 1/3 of being of one's color at the end of the game are preferred.
The following rules are used in our implementation in 19x19 and improve the results:
– Territory line (i.e. line number 3), Line of death (i.e. first line), Peep-connect (i.e. connect two strings when the opponent threatens to cut), Hane (a move which “reaches around” one or more of the opponent's stones), Threat, Connect, Wall, Bad Kogeima (same pattern as a knight's move in chess), Empty triangle (three stones making a triangle without any surrounding opponent's stone).
They are used both (i) as an initial number of RAVE simulations and (ii) as an additive term in H. The additive term (ii) is proportional to the number of AMAF-simulations.
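As an illustration of these two injection points, here is a small sketch; the data structure, the constant k_virtual and the exact proportionality used for (ii) are assumptions made for illustration only, not MoGo's actual code.

from dataclasses import dataclass

@dataclass
class MoveStats:
    rave_won: float = 0.0    # AMAF/RAVE wins
    rave_sims: float = 0.0   # AMAF/RAVE simulations
    H: float = 0.0           # offline/expert term used in Eq. (1)

def add_expert_prior(stats: MoveStats, rule_value: float, k_virtual: float = 50.0) -> None:
    # (i) the expert rules as an initial number of RAVE simulations ("virtual" AMAF games).
    stats.rave_sims += k_virtual
    stats.rave_won += k_virtual * rule_value
    # (ii) the expert rules as an additive term in H, here taken proportional to the number
    #      of virtual AMAF simulations; this is one reading of the sentence above, not a
    #      verified detail of MoGo's implementation.
    stats.H += k_virtual * rule_value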
These shapes are illustrated in Figure 1. With a naive hand tuning of the parameters, used only for the simulations added in the AMAF statistics, they provide a winning rate of 63.9 ± 0.5 % against the version without these improvements. We

References

– L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning.
– R. Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search.
– G. Chaslot, M. Winands, H. J. van den Herik, J. Uiterwijk and B. Bouzy. Progressive strategies for Monte-Carlo Tree Search.
– S. Gelly and D. Silver. Combining online and offline knowledge in UCT.
– R. Coulom. Computing “Elo ratings” of move patterns in the game of Go.