Learning Policies from Self-Play with Policy Gradients and MCTS Value Estimates

Authors:
Dennis J. N. J. Soemers, Eric Piette, Matthew Stephenson, Cameron Browne

Venue:
IEEE Conference on Games (CoG), 2019

DOI:
10.1109/CIG.2019.8848037

Topics:
Reinforcement learning, Monte Carlo Tree Search, self-play, policy gradients, general game playing

Abstract

This paper investigates learning policies in self-play settings where Monte Carlo Tree Search (MCTS) and learning algorithms iteratively improve each other.

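Instead of training policies to mimic MCTS search behaviour by minimising a cross-entropy loss, the authors propose a novel objective function and derive a policy gradient for it. Because MCTS explores by design, a policy trained to imitate it tends to inherit that exploration; the proposed objective avoids encouraging it.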

The gradient is estimated from MCTS value estimates rather than visit counts, and the objective is designed to yield policies that are better suited to interpretation and strategy extraction. Empirical results across several board games show that the proposed method produces strong and less exploratory policies.
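The contrast between the two training signals can be illustrated with a small sketch. This is not the paper's objective function: it is a generic softmax policy over three actions, trained once by cross-entropy against normalised visit counts and once by gradient ascent on the expected value under (hypothetical) MCTS value estimates. The function names, the visit counts, the value estimates, and the learning rate are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_step(logits, visit_counts, lr):
    """One gradient-descent step on the cross-entropy between the
    normalised visit counts and the softmax policy."""
    pi = softmax(logits)
    n_total = sum(visit_counts)
    target = [n / n_total for n in visit_counts]
    # d(loss)/d(logit_a) = pi(a) - target(a)
    return [z - lr * (p - t) for z, p, t in zip(logits, pi, target)]

def value_gradient_step(logits, q_values, lr):
    """One gradient-ascent step on the expected value sum_a pi(a) * Q(a)."""
    pi = softmax(logits)
    baseline = sum(p * q for p, q in zip(pi, q_values))
    # d(objective)/d(logit_a) = pi(a) * (Q(a) - baseline)
    return [z + lr * p * (q - baseline)
            for z, p, q in zip(logits, pi, q_values)]

# Illustrative numbers only; they are not taken from the paper.
visit_counts = [60, 30, 10]   # hypothetical MCTS visit counts per action
q_values = [0.5, 0.3, -0.2]   # hypothetical MCTS value estimates per action

ce_logits = [0.0, 0.0, 0.0]
pg_logits = [0.0, 0.0, 0.0]
for _ in range(5000):
    ce_logits = cross_entropy_step(ce_logits, visit_counts, lr=0.5)
    pg_logits = value_gradient_step(pg_logits, q_values, lr=0.5)

ce_policy = softmax(ce_logits)  # approaches the visit-count distribution
pg_policy = softmax(pg_logits)  # concentrates on the highest-valued action
```

In this toy setting the cross-entropy policy keeps probability mass on every action the search visited, whereas the value-driven policy shifts almost all of its mass to the action with the highest value estimate.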

Context

This work is part of the broader Digital Ludeme Project, where the goal is not only to achieve strong game-playing performance, but also to extract interpretable strategies from learned policies.

Unlike standard approaches such as Expert Iteration and AlphaZero-style methods, which train policies to imitate the search, this paper proposes a formulation that explicitly avoids learning exploratory policies, since these are less suitable for human interpretation.

The resulting policies exhibit lower entropy and clearer action preferences, making them better candidates for explaining decision-making in general game playing systems.
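One common way to quantify how exploratory a policy is in a given state is the Shannon entropy of its action distribution. The sketch below is a minimal illustration under that assumption; the helper name `policy_entropy` and the two example distributions are hypothetical and are not results from the paper.

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution.
    Zero-probability actions contribute nothing to the sum."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)

# Hypothetical action distributions for the same state.
exploratory = [0.40, 0.35, 0.25]  # mass spread across actions
decisive = [0.90, 0.07, 0.03]     # one clearly preferred action

h_exploratory = policy_entropy(exploratory)
h_decisive = policy_entropy(decisive)
```

A uniform distribution over n actions has the maximum entropy, log(n), and a deterministic policy has entropy zero, so lower values indicate clearer action preferences.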

Full reference

Soemers, D. J. N. J., Piette, E., Stephenson, M., Browne, C. (2019). Learning Policies from Self-Play with Policy Gradients and MCTS Value Estimates. In Proceedings of the IEEE Conference on Games (CoG). DOI: 10.1109/CIG.2019.8848037

BibTeX

@inproceedings{soemers2019learning,
  author    = {Soemers, Dennis J. N. J. and Piette, Eric and Stephenson, Matthew and Browne, Cameron},
  title     = {Learning Policies from Self-Play with Policy Gradients and MCTS Value Estimates},
  booktitle = {Proceedings of the IEEE Conference on Games (CoG)},
  year      = {2019},
  doi       = {10.1109/CIG.2019.8848037},
  url       = {http://arxiv.org/abs/1905.05809}
}