
HAL Id: inria-00386477
https://hal.inria.fr/inria-00386477
Submitted on 21 May 2009
To cite this version:
Guillaume Chaslot, Christophe Fiter, Jean-Baptiste Hoock, Arpad Rimmel, Olivier Teytaud. Adding expert knowledge and exploration in Monte-Carlo Tree Search. Advances in Computer Games, 2009, Pamplona, Spain. ⟨inria-00386477⟩

Adding expert knowledge and exploration in Monte-Carlo Tree Search

Guillaume Chaslot (1), Christophe Fiter (2), Jean-Baptiste Hoock (2), Arpad Rimmel (2), Olivier Teytaud (2)

(1) Games and AI Group, MICC, Faculty of Humanities and Sciences, Universiteit Maastricht, Maastricht, The Netherlands
(2) TAO (Inria), LRI, UMR 8623 (CNRS - Univ. Paris-Sud), bat 490 Univ. Paris-Sud, 91405 Orsay, France, teytaud@lri.fr
Abstract. We present a new exploration term, more efficient than classical UCT-like exploration terms and combining efficiently expert rules, patterns extracted from datasets, All-Moves-As-First values and classical online values. As this improved bandit formula does not solve several important situations (semeais, nakade) in computer Go, we present three other important improvements which are central in the recent progress of our program MoGo:
– We show an expert-based improvement of Monte-Carlo simulations for nakade situations; we also emphasize some limitations of this modification.
– We show a technique which preserves diversity in the Monte-Carlo simulation, which greatly improves the results in 19x19.
– Whereas the UCB-based exploration term is not efficient in MoGo, we show a new exploration term which is highly efficient in MoGo.
MoGo recently won a game with handicap 7 against a 9 Dan professional player, Zhou JunXun, winner of the LG Cup 2007, and a game with handicap 6 against a 1 Dan professional player, Li-Chen Chien.

A preliminary version of this work was presented at the EWRL workshop, without proceedings.
1 Introduction
Monte-Carlo Tree Search (MCTS [5, 7, 11]) is a recent tool for difficult planning
tasks. Impressive results have already been produced in the case of the game of
Go [7, 10].
MCTS consists in building a tree, in which nodes are situations of the considered environment and branches are the actions that can be taken by the agent. The main point in MCTS is that the tree is highly unbalanced, with a strong bias in favor of important parts of the tree. The focus is on the parts of the tree in which the expected gain is the highest. For estimating which situation should be further analyzed, several algorithms have been proposed: UCT [11] (Upper Confidence Trees) focuses on the proportion of winning simulations plus an uncertainty measure; AMAF [4, 1, 10] (All Moves As First, also termed RAVE, for Rapid Action-Value Estimates, in the MCTS context) focuses on a compromise between UCT and heuristic information extracted from the simulations; BAST [6] (Bandit Algorithm for Search in Tree) uses UCB-like bounds modified through the overall number of nodes in the tree. Other related algorithms have been proposed, as in [5], essentially using a decreasing impact of a heuristic (pattern-dependent) bias as the number of simulations increases. In all these cases, the idea is to bias random simulations thanks to statistics.
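For concreteness, the classical UCT selection score mentioned above (proportion of wins plus an uncertainty measure) is typically of the standard UCB1 form

  UCT(d) = w(d) / n(d) + c * sqrt( log(n) / n(d) ),

where w(d) is the number of won simulations of decision d, n(d) its number of simulations, n the number of simulations of the parent situation, and c an exploration constant. This standard term is given only as background; MoGo's actual formula is the one presented in section 2.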
In the context of the game of Go (definitions of the different Go terms used in this article can be found on the web site http://senseis.xmp.net/), the nodes are equipped with one Go board configuration and with statistics, typically the number of won and lost games in the simulations started from this node (the RAVE statistics require some more statistics). MCTS uses these statistics in order to iteratively expand the tree in the regions where the expected reward is maximum. After each simulation from the current position (the root) until the end of the game, the win and loss statistics are updated in every node concerned by the simulation, and a new node corresponding to the first new situation of the simulation is created. The algorithm is therefore as follows:
Initialize the tree T to only one node, equipped with the current situation.
while there is time left do
  Simulate one game g from the root of the tree to a final position, choosing moves as follows:
    Bandit part: for a situation in T, choose the move with maximal score.
    MC part: for a situation out of T, choose the move thanks to Alg. 1.
  Update win/loss statistics in all situations of T crossed by g.
  Add in T the first situation of g which is not yet in T.
end while
Return the move simulated most often from the root of T.
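The loop above can be sketched in Python as follows; the state interface (is_final, legal_moves, play, won_by, player_to_move), the score and mc_policy callables, and the assumption that states are hashable are all placeholders introduced for illustration, not MoGo's actual code.

import time

def mcts(root, time_budget, score, mc_policy):
    # T: win/loss statistics for every situation already in the tree.
    tree = {root: {"wins": 0, "sims": 0}}
    deadline = time.time() + time_budget
    while time.time() < deadline:                        # while there is time left
        state, path, first_new = root, [], None
        while not state.is_final():
            if state in tree:
                # Bandit part: inside T, choose the move with maximal score.
                move = max(state.legal_moves(), key=lambda m: score(tree, state, m))
            else:
                if first_new is None:
                    first_new = state                    # first situation of g not yet in T
                # MC part: outside T, choose the move with the Monte-Carlo policy (Alg. 1).
                move = mc_policy(state)
            path.append(state)
            state = state.play(move)
        win = state.won_by(root.player_to_move())        # outcome of the simulated game g
        for s in path:                                   # update win/loss statistics along g
            if s in tree:
                tree[s]["sims"] += 1
                tree[s]["wins"] += int(win)
        if first_new is not None:                        # add the first new situation of g to T
            tree[first_new] = {"wins": int(win), "sims": 1}
    # Return the move simulated most often from the root of T.
    return max(root.legal_moves(), key=lambda m: tree.get(root.play(m), {"sims": 0})["sims"])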
The reader is referred to [5, 7, 11, 10, 9] for a detailed presentation of MCTS
techniques and various scores. We will propose our current bandit formula in
section 2.
The function used for taking decisions out of the tree (i.e. the so-called Monte-
Carlo part, MC) is defined in Algorithm 1. An atari occurs when a string (a group
of stones) can be captured in one move. Some Go knowledge has been added in
this part in the form of 3 × 3 expert-designed patterns in order to play more meaningful games.
Unfortunately, some bottlenecks appear in MCTS. In spite of many improvements in the bandit formula, there are still situations which are poorly handled by MCTS. MCTS uses a bandit formula for moves early in the tree, but it cannot figure out long-term effects which involve the behavior of the simulations far from the root. The situations which are to be clarified at the very end should therefore be included in the Monte-Carlo part and not in the bandit.
We therefore propose three improvements in the MC part:
– Diversity preservation, as explained in section 3.1;
– Nakade refinements, as explained in section 3.2;
– Elements around the semeai, as explained in section 3.3.

Algorithm 1 Algorithm for choosing a move in MC simulations, for the game of Go.
if the last move is an atari then
  Save the stones which are in atari if possible (this is checked by liberty count).
else
  if there is an empty location among the 8 locations around the last move which matches a pattern then
    Sequential move: play uniformly at random in one of these locations.
  else
    if there is a move which captures stones then
      Capture move: capture stones.
    else
      if there is a legal move then
        Legal move: play randomly a legal move.
      else
        Return pass.
      end if
    end if
  end if
end if
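A Python sketch of this policy is given below, under an assumed board API (is_atari, saving_moves, neighbours8, is_empty, matches_pattern, capturing_moves and legal_moves are placeholders, not MoGo's actual interface); falling through to the next rule when the stones in atari cannot be saved is our reading of the algorithm, not a detail stated above.

import random

def mc_move(board, last_move, rng=random):
    # Atari rule: save the stones put in atari by the last move, if possible (liberty count check).
    if last_move is not None and board.is_atari(last_move):
        saving = board.saving_moves(last_move)
        if saving:
            return rng.choice(saving)
    # Sequential move: empty locations among the 8 around the last move matching a 3x3 pattern.
    if last_move is not None:
        seq = [p for p in board.neighbours8(last_move)
               if board.is_empty(p) and board.matches_pattern(p)]
        if seq:
            return rng.choice(seq)                       # uniformly at random among them
    # Capture move: capture stones if possible.
    captures = board.capturing_moves()
    if captures:
        return rng.choice(captures)
    # Legal move, otherwise pass.
    legal = board.legal_moves()
    return rng.choice(legal) if legal else "pass"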
2 Combining offline, transient and online learning and expert knowledge, with an exploration term
In this section we present how we combine online learning (bandit module), transient learning (RAVE values), expert knowledge (detailed below) and offline pattern information. RAVE values are presented in [10]. We point out that this combination is far from being straightforward: due to the subtle equilibrium between online learning (i.e. naive success rates of moves), transient learning (RAVE values) and offline values, the first experiments were highly negative, and became clearly conclusive only after careful tuning of the parameters (we used both manual tuning and cross-entropy methods; parallelization was highly helpful for this).
The score for a decision d (i.e. a legal move) is as follows:
  score(d) = α p̂(d)   +   β p̃(d)   +   ( γ + C / log(2 + n(d)) ) H(d)      (1)
            (online)     (transient)               (offline)
where the coefficients α, β, γ and C are empirically tuned, depending on n(d) (the number of simulations of the decision d) and n (the number of simulations of the current board) as follows:

  β = #{rave sims} / ( #{rave sims} + #{sims} + c1 #{sims} #{rave sims} )    (2)
  γ = c2 / #{rave sims}                                                      (3)
  α = 1 − β − γ                                                              (4)
where #{rave sims} is the number of RAVE simulations, #{sims} is the number of simulations, and C, c1 and c2 are empirically tuned. For the sake of completeness, we specify that C, c1 and c2 depend on the board size, and are not the same in the root of the tree during the beginning of the thinking time, in the root of the tree during the end of the thinking time, and in the other nodes. Also, this formula is most often computed with an approximated (faster) formula, and sometimes with the complete formula; it was empirically found that the constants should not be the same in both cases. All these local engineering improvements make the formula quite unclear; the take-home message is mainly that MoGo has good results with α + β + γ = 1, γ = c2 / #{rave sims}, and the logarithmic formula C / log(2 + n(d)) for progressive unpruning. These rules imply that:
– initially, the most important part is the offline learning;
– later, the most important part is the transient learning (RAVE values);
– eventually, only the “real” (online) statistics matter.
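To make this schedule concrete, the following short Python function is a minimal sketch of Eqs. (1)-(4); the counters, the guard cases and the constant values c1, c2 and C are illustrative assumptions, not MoGo's tuned implementation.

import math

def bandit_score(won, sims, rave_won, rave_sims, H, c1=0.1, c2=100.0, C=1.0):
    # Eq. (2): weight of the transient (RAVE) term.
    denom = rave_sims + sims + c1 * sims * rave_sims
    beta = rave_sims / denom if denom > 0 else 0.0
    # Eq. (3): weight of the offline term, decreasing as RAVE simulations accumulate.
    gamma = c2 / rave_sims if rave_sims > 0 else 1.0
    # Eq. (4): the three weights sum to one.
    alpha = 1.0 - beta - gamma
    p_online = won / sims if sims > 0 else 0.0               # naive online success rate
    p_rave = rave_won / rave_sims if rave_sims > 0 else 0.0  # transient (AMAF/RAVE) value
    # Eq. (1): online + transient + offline, with the C / log(2 + n(d)) exploration term.
    return alpha * p_online + beta * p_rave + (gamma + C / math.log(2 + sims)) * H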
H(d) is the sum of two terms: patterns, as in [3, 5, 8], and the rules detailed below:
– capture moves (in particular, strings contiguous to a new string in atari), extension (in particular out of a ladder), avoid self-atari, atari (in particular when there is a ko), distance to the border (optimum distance = 3 in 19x19 Go), short distance to previous moves, short distance to the move before the previous move; also, locations which have probability nearly 1/3 of being of one's color at the end of the game are preferred.
The following rules are used in our implementation in 19x19 and improve the results:
– Territory line (i.e. line number 3), Line of death (i.e. first line), Peep-connect (i.e. connect two strings when the opponent threatens to cut), Hane (a move which “reaches around” one or more of the opponent's stones), Threat, Connect, Wall, Bad Kogeima (same pattern as a knight's move in chess), Empty triangle (three stones making a triangle without any surrounding opponent's stone).
They are used both (i) as an initial number of RAVE simulations and (ii) as an additive term in H. The additive term (ii) is proportional to the number of AMAF-simulations.
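As an illustration of these two injection points, here is a small sketch; the data structure, the constant k_virtual and the exact proportionality used for (ii) are assumptions made for illustration only, not MoGo's actual code.

from dataclasses import dataclass

@dataclass
class MoveStats:
    rave_won: float = 0.0    # AMAF/RAVE wins
    rave_sims: float = 0.0   # AMAF/RAVE simulations
    H: float = 0.0           # offline/expert term used in Eq. (1)

def add_expert_prior(stats: MoveStats, rule_value: float, k_virtual: float = 50.0) -> None:
    # (i) the expert rules as an initial number of RAVE simulations ("virtual" AMAF games).
    stats.rave_sims += k_virtual
    stats.rave_won += k_virtual * rule_value
    # (ii) the expert rules as an additive term in H, here taken proportional to the number
    #      of virtual AMAF simulations; this is one reading of the sentence above, not a
    #      verified detail of MoGo's implementation.
    stats.H += k_virtual * rule_value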
These shapes are illustrated in Figure 1. With a naive hand tuning of the parameters, used only for the simulations added in the AMAF statistics, they provide a winning rate of 63.9 ± 0.5 % against the version without these improvements. We

References

– L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning.
– R. Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search.
– G. Chaslot, M. Winands, H. J. van den Herik, J. Uiterwijk and B. Bouzy. Progressive strategies for Monte-Carlo Tree Search.
– S. Gelly and D. Silver. Combining online and offline knowledge in UCT.
– R. Coulom. Computing “Elo ratings” of move patterns in the game of Go.