Patrick Näf

Simulation Policies

The aim of this page is to describe ideas and experiments related to simulation policies in Fuego.

Links to other related wiki pages are collected in the Related section at the bottom of this page.

Data for CG2013 paper

Download an archive of the data from http://www.ualberta.ca/~sumudu/cg2013-files.tar.gz (100 MB)

The archive contains five subfolders: one for Fuego source-code patches, one for the raw data produced during the experiments, and three for scripts and configuration files used to run the various experiments. Each subfolder contains a README file with more details.

What makes a good policy?

Unfortunately, we still do not know much about this important topic. We should try hard to learn more. Gelly and Silver (2007) have clearly shown that a policy which is a stronger player by itself can result in a weaker player when used as the simulation policy of an MCTS engine: http://portal.acm.org/citation.cfm?id=1273531

Silver and Tesauro (2009) provide evidence that it is more important for policies to be balanced than for them to be strong players. I think this means that policies that make mistakes of roughly the same number and size for both players work better than those that introduce a systematic bias towards one player. See Fig. 1 in their paper: http://portal.acm.org/citation.cfm?id=1553495

Our own experience with Fuego confirms these observations. One early experiment in this direction was unpublished work on avoiding self-atari. Self-atari moves are obviously bad, since they allow a player's stones to be captured on the next move. However, all attempts to remove self-ataris from the playout policy in Fuego were unsuccessful. See the [SelfAtariExperiments] page for a summary and conclusions.

Evaluating the Quality of a Policy - State of the Art

The current state of the art evaluates a policy by playing a statistically significant number of games. A policy P1 is judged to be better than a policy P2 if it performs better in test games, either in self-play or against a fixed pool of opponents.
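
As a minimal sketch of this procedure (in Python, not Fuego code), one can judge significance with a normal-approximation confidence interval on the win rate. The play_game callable is a hypothetical stand-in for whatever harness actually runs the test games.

    import math

    def win_rate_interval(wins, games, z=1.96):
        """95% normal-approximation confidence interval for a win rate."""
        p = wins / games
        half = z * math.sqrt(p * (1.0 - p) / games)
        return p - half, p + half

    def estimate(policy, play_game, games=1000):
        """play_game(policy) -> True on a win; placeholder for the harness."""
        wins = sum(1 for _ in range(games) if play_game(policy))
        return win_rate_interval(wins, games)

P1 would then be judged better than P2 if its interval lies entirely above P2's; anything less is within noise.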

Conjecture

Simulation is used as an evaluation function. Therefore it should preserve the "goodness" of a position. A winning position should lead to many wins, and a losing position to many losses.

The simulation quality conjecture is as follows: The quality of a simulation policy is directly related to how well it preserves wins and losses.

For the conjecture to be useful, the scenario for testing it must be made precise. One big question is how to select a good test set over which to evaluate policies. A second question is how to evaluate the test positions themselves. A third question is how to measure how well a policy preserves wins and losses.
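
To make the third question concrete, here is a minimal sketch of one possible measure, a "preservation rate": the fraction of simulations whose outcome agrees with the known label of the start position. The simulate callable and the (position, label) pairs are assumptions standing in for the real playout code and labeled data.

    def preservation_rate(test_set, simulate, runs=100):
        """Fraction of simulations ending with the labeled (true) result.

        test_set: list of (position, label) pairs, e.g. label in {'W', 'L'}
                  from the point of view of the player to move.
        simulate: hypothetical playout, position -> 'W' or 'L'.
        """
        correct = 0
        total = len(test_set) * runs
        for position, label in test_set:
            correct += sum(simulate(position) == label for _ in range(runs))
        return correct / total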

Problems with Evaluation

Results may change depending on the number of simulations, time used, type of positions, behavior of the tree search, and even depending on whether the program plays Black or White.

Proposed Research

A first step is to develop a framework for evaluating the quality of a given simulation policy and for making it possible to compare two policies.

Build test sets

  1. Build a test set with high-quality labeled examples: positions for which the perfect evaluation (win or loss) is known.

I propose to build a test set for 9x9 with 7.5 komi only, for simplicity and because we have many annotated games. Problem: which distribution of opening, middle-game, and endgame positions should the set have?

  2. Build a test set with samples from the middle of simulations. Example: take the test set from step 1, run one simulation from each position, and pick a random position along this simulation.

Problem: this set will vary depending on the policy used to create the samples.

  3. Similar to step 2, but sample from all simulations during an MCTS, not just from a single simulation at the root.

In the progression from 1 to 3, the degree of realism improves: the positions from step 3 are true random samples from an MCTS. The sampled positions also lie later and later in the game. E.g. if we pick a position at move 10 in step 1, the average number of moves in a simulation is 100, and the average tree depth is 10, then a step 2 sample will lie on average 60 moves deep into the game (10 + 100/2), and a step 3 sample 70 moves deep (10 + 10 + 100/2).
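
A sketch of how step 2 and step 3 samples could be drawn. Here run_simulation and run_mcts_simulations are hypothetical helpers standing in for the real engine interface, each returning trajectories as lists of positions.

    import random

    def step2_sample(position, run_simulation):
        """Step 2: one simulation from a labeled position, then a uniform
        pick along its trajectory (hence the arithmetic above: on average
        halfway, 50 moves into a 100-move simulation)."""
        trajectory = run_simulation(position)
        return random.choice(trajectory)

    def step3_sample(position, run_mcts_simulations):
        """Step 3: pick uniformly among all simulations of a full MCTS;
        each simulation now starts at a tree leaf, ~10 moves deeper."""
        trajectory = random.choice(run_mcts_simulations(position))
        return random.choice(trajectory)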

Evaluate simulations

We would like to evaluate what happens during a simulation. The basic measure is how stable it is with respect to preserving wins and losses. Given an evaluation oracle, we can assign a win or a loss to every state along the trajectory of the simulation. Then we can compute statistics about when and how often the simulation switches from a win to a loss and vice versa. In terms of Silver and Tesauro, we might call a policy balanced if the probability of switching is the same for both players; the exact criterion still needs to be worked out.
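
A sketch of such switching statistics, assuming a hypothetical oracle(state) that returns the game-theoretic winner of each state, and assuming Black is to move in the first state of the trajectory. Counting switches separately per mover gives one concrete reading of "balanced": both players cause outcome switches at roughly the same rate.

    def switch_stats(trajectory, oracle):
        """Count win/loss switches along one simulation trajectory.

        trajectory: list of states; Black to move at index 0 (assumption).
        oracle(state) -> 'B' or 'W', the winner under perfect play.
        """
        labels = [oracle(state) for state in trajectory]
        switches = {'B': 0, 'W': 0}  # switches caused by each player's move
        for i in range(1, len(labels)):
            if labels[i] != labels[i - 1]:
                mover = 'B' if (i - 1) % 2 == 0 else 'W'
                switches[mover] += 1
        return switches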

Evaluate test positions and whole test sets

How "realistic" are they? How biased, how difficult?

Evaluate a single policy

Can we evaluate a policy using the test sets above, and the statistics on simulations defined above?

Compare two policies

Given two policies P1 and P2: in practice, they may be very similar and differ in only one detail, namely the proposed policy change that we want to evaluate.

Now we can define another kind of test set: positions where P1 and P2 propose different moves (or different sets of moves, or different probability distributions for selecting moves). This will sharpen the comparison by focusing on the cases where P1 and P2 actually differ.
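
A sketch of how such a disagreement set could be built, assuming each policy exposes a hypothetical move(position) call; the same idea extends to comparing move sets or full probability distributions.

    def disagreement_set(positions, p1_move, p2_move):
        """Keep only positions where the two policies choose different moves."""
        return [pos for pos in positions if p1_move(pos) != p2_move(pos)]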

Given labeled test sets as in step 1 above, we can run a fixed number of simulations from them using either policy, and check the correlation, i.e. the percentage of simulations that resulted in the correct outcome. (Could we also run a complete MCTS? Would that be better?)
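
Putting the pieces together, a sketch of the proposed comparison: the same fixed number of simulations per labeled position with each policy, then the correct-outcome percentages side by side. simulate_p1 and simulate_p2 are again placeholders for the two playout implementations.

    def compare_policies(test_set, simulate_p1, simulate_p2, runs=100):
        """Correct-outcome percentage of each policy on a labeled test set.

        test_set: list of (position, label) pairs; simulate_*(position)
        must return its result in the same label convention.
        """
        def correct_pct(simulate):
            hits = sum(simulate(pos) == label
                       for pos, label in test_set
                       for _ in range(runs))
            return 100.0 * hits / (len(test_set) * runs)
        return correct_pct(simulate_p1), correct_pct(simulate_p2)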


Related

Wiki: Home
Wiki: LabeledPositions
Wiki: MeasuringPolicies
Wiki: SelfAtariExperiments