19 juin 2012
2
19
/06
/juin
/2012
10:11
MCTS algorithm
Input: a state S
Output: a decision D
Global variable: a tree T, each node having nbSims
initiliazed at a single node with state S and nbSims=0 and totalReward=0
While (I have time)
{
Do one simulation from state S until a game is over
==> we get a sequence s(0), s(1), s(2), s(3),..., s(k) (with s(0)=S and s(k) a final state)
==> we get a reward r
Consider i minimal such that s(i) is not in T
If such an i exists, then add s(i) in T, with an edge from s(i-1) to s(i)
For each i such that s(i) is in T:
s(i).nbSims++
s(i).totalReward+=r
}
Procedure for doing one simulation from state S until a game is over:
s=S
while (s it not terminal)
{
choose child s' of s such that UCB(s') is maximum
s=s'
}
Procedure for computing the UCB score of a node s':
{
meanReward=s'.totalReward / s'.nbSims
fatherSims=s'.father.nbSims
UCB = meanReward + sqrt( log(fatherSims) / s'.nbSims)
}