Show Me The Code: Search
Reinforcement Learning With Search¶
This notebook talks about reinforcement learning with search methods. Specifically, we focus on the line of work comprising AlphaGo [1], AlphaGo Zero [2], AlphaZero [3], and MuZero [4]. We will walk through these works step by step, highlighting how each advance leads to stronger RL models.
Throughout the notebook, we ground the discussion in the game of Go, although the methods apply to other similar environments.
Search¶
What is Search?¶
Given a state of a Go game, what should your next move be? Some people play purely by intuition, i.e., they simply think of one position that looks good. Most others play with varying degrees of search: if I play this position, the opponent will play that one, then I can play that position to counter, ouch! I will lose the game if they play that move! OK, let me play another position.
Search is about "mentalizing" over rollouts of the game, evaluating the potential of each action based on the rollouts' outcomes, and deciding on the best action. Search has several characteristics:
- It is "mentalizing", i.e., it is purely imagination. No matter how much you search, it is over an imagined space. You don't play in the real game.
- It is about the next single action. You might reason about the opponent's move, then your next move, then the opponent's move, and so on. Different people reason for different numbers of steps (professionals can certainly reason further than ordinary players). But it is the same for everyone: no matter how much you reason, you only need to play the next action/position.
Random Search¶
The simplest search method is to search randomly. Starting from the root state, i.e., the game state you are at, you pick random actions in each search iteration. After a certain number of random searches, you gather the outcomes of all the simulated games and pick the most favorable action. However, random search is terribly inefficient, because you are essentially wandering in a giant ocean and you usually select very bad moves.
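Here is a minimal sketch of random search over rollouts. The game interface (`legal_actions`, `play`, `is_terminal`, `winner`) and the player identifier `me` are hypothetical placeholders for illustration, not from any of the papers:

```python
import random

def random_rollout(state, legal_actions, play, is_terminal, winner, me):
    # Play uniformly random moves until the game ends; return +1 for a win, -1 for a loss.
    while not is_terminal(state):
        state = play(state, random.choice(legal_actions(state)))
    return 1 if winner(state) == me else -1

def random_search(root, legal_actions, play, is_terminal, winner, me, n_sims=1000):
    # For each candidate move, average the outcomes of random rollouts
    # and pick the move with the best average outcome.
    scores = {a: 0.0 for a in legal_actions(root)}
    counts = {a: 0 for a in scores}
    for _ in range(n_sims):
        a = random.choice(list(scores))
        z = random_rollout(play(root, a), legal_actions, play, is_terminal, winner, me)
        scores[a] += z
        counts[a] += 1
    return max(scores, key=lambda a: scores[a] / max(counts[a], 1))
```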
Random Search With A Policy¶
Better than plain random search, you can search with a policy, such as a rule-based policy, your intuition, or a neural network. In each search iteration, you pick the next action randomly according to the predictive probabilities of the policy (yes, the policy is usually non-deterministic). Compared to uniform random sampling, this is more efficient because the policy usually guides you toward reasonable moves. With a decent policy, searching a lot can give you good results, as in the game of Chess. However, the game of Go is much more complex. Searching with the policy can result in repetitive moves, or in searching only the space of moves favored by the policy, which can miss very good moves.
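A sketch of a policy-guided rollout, reusing the same hypothetical game interface as above and assuming `policy(state)` returns a dictionary of move probabilities:

```python
import random

def policy_rollout(state, policy, play, is_terminal, winner, me):
    # Sample each move from the policy's predicted probabilities instead of uniformly.
    while not is_terminal(state):
        actions, probs = zip(*policy(state).items())  # policy(state) -> {action: prob}
        state = play(state, random.choices(actions, weights=probs, k=1)[0])
    return 1 if winner(state) == me else -1
```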
Frequency-aware Search with A Policy¶
Random search with a policy can lead to repetitive games because different search trajectories are sampled independently of one another. Let's fix this problem. Intuitively, if an action has already been selected many times, we would like to encourage the search to try less frequent actions for exploration. Therefore, we can add an exploration bonus to the selection rule: $$a_t = \arg\max_a \frac{P(a \mid s_t)}{N(a \mid s_t) + 1}$$ Here $N(a \mid s_t)$ is the number of times the action $a$ has been selected at state $s_t$.
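A sketch of this selection rule, again with a hypothetical `policy(state) -> {action: P(a|s)}` interface and a visit-count table maintained by the caller (states are assumed to be hashable):

```python
from collections import defaultdict

def frequency_aware_select(state, policy, visit_counts):
    # Pick the action maximizing P(a|s) / (N(a|s) + 1): high-prior moves are
    # preferred, but every visit to an action lowers its score, pushing the
    # search toward less-explored alternatives.
    priors = policy(state)        # {action: P(a|s)}
    counts = visit_counts[state]  # {action: N(a|s)}
    return max(priors, key=lambda a: priors[a] / (counts[a] + 1))

# Bookkeeping during repeated search iterations:
visit_counts = defaultdict(lambda: defaultdict(int))
# a = frequency_aware_select(s, policy, visit_counts)
# visit_counts[s][a] += 1
```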
Monte Carlo Tree Search (MCTS)¶
The idea of MCTS is to search more efficiently with guided state values (from either a value model or rollout outcomes). In essence, it tries to
- Maintain a tree structure so that we can avoid repetitive searching.
- Within the tree structure, search through the most promising moves while also exploring other moves.
- Expand the tree while searching so as to cover a larger space.
Here is an illustration of MCTS, taken from the AlphaGo paper.

Starting from the root state, i.e., the state you are at and from which you want to play the next move, MCTS proceeds through the following stages:
- Selection. Starting from the root state, navigate along the MCTS tree you already have. In each step $t$, select the action that maximizes the "state-action value" $Q(s_t, a)$ plus an exploration bonus $U(s_t, a)$. For actions taken before, $Q(s_t, a)$ is known; for actions not taken before at $s_t$, $Q(s_t, a)$ is zero.
Here $Q(s_t, a)$ stores our best knowledge so far of "how good is playing $a$ at state $s_t$", and $U(s_t, a) \approx \frac{P(s_t, a)}{1 + N(s_t, a)}$. Here $P(s_t, a)$ is the probability from the policy network you have (or the best existing policy), which is like selecting $a$ based on your policy, and $N(s_t, a)$ is the visit count in the MCTS tree. The more you have already played $(s_t, a)$, the less you want to play $a$ again.
- Expansion. When you do selection along the tree, you do not always choose an action $a$ that already exists in the tree. At some step $t$, you will choose an action $a$ that is not in the tree. This expands the tree.
- Evaluation. With the new edge $(s_t, a)$, you need to know its $Q(s_t, a)$, $P(s_t, a)$, and $N(s_t, a)$. Simply, $N(s_t, a) = 1$, and $P(s_t, a)$ is directly available from the existing policy. What about $Q(s_t, a)$?
If you had a value network, that would be great. OK, let's assume that we have trained a value network $v(s)$. In addition, we also want to know the game outcome if we keep playing until the end. Let's keep playing then. But we cannot run another MCTS because we are already inside an MCTS, so let's just play out the game using the current policy until the end. Then we will know whether we win or lose, giving us $z \in \{1, -1\}$. We combine the value network and the rollout outcome into $Q$: $$Q(s_t, a) = (1-\lambda) v(s_{t+1}) + \lambda z$$
- Backup. Finally, we need to update the statistics of all nodes visited in the search: increase their $N(s_t, a)$ and update their $Q(s_t, a)$ to the new average.
- Iteration. Steps 1 to 4 constitute one iteration of MCTS. We can run many (thousands of) iterations to build a big tree. The bigger the tree, the more confident we are.
Finally, after the MCTS has built a big tree, we pick a single move at the root state $s_0$: the most visited action from the root.
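Below is a minimal, self-contained sketch that puts the four stages together. The game and network interface (`policy(state) -> {action: prior}`, `value(state)` in $[-1, 1]$, `play(state, action)`, `is_terminal(state)`, `terminal_value(state)`) is assumed for illustration and is not from the original papers; the rollout term of the evaluation step is omitted for brevity.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s, a) from the policy network
        self.visit_count = 0      # N(s, a)
        self.value_sum = 0.0      # running sum used to average Q(s, a)
        self.children = {}        # action -> Node

    def q(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_action(node, c_puct=1.0):
    # Selection: maximize Q(s, a) + U(s, a); U favors high-prior, rarely visited
    # actions (AlphaGo's full PUCT rule also scales U by the parent visit count).
    def score(a):
        child = node.children[a]
        return child.q() + c_puct * child.prior / (1 + child.visit_count)
    return max(node.children, key=score)

def mcts(root_state, policy, value, play, is_terminal, terminal_value, n_iters=1600):
    root = Node(prior=1.0)
    root.children = {a: Node(p) for a, p in policy(root_state).items()}
    for _ in range(n_iters):
        node, state, path = root, root_state, []
        # 1. Selection: walk down the existing tree until reaching a leaf.
        while node.children and not is_terminal(state):
            action = select_action(node)
            node = node.children[action]
            state = play(state, action)
            path.append(node)
        # 2. Expansion + 3. Evaluation: expand the leaf with policy priors and
        #    evaluate it with the value network (the rollout term is omitted here).
        if is_terminal(state):
            leaf_value = terminal_value(state)
        else:
            node.children = {a: Node(p) for a, p in policy(state).items()}
            leaf_value = value(state)
        # 4. Backup: update N and Q along the visited path, flipping the sign
        #    at each ply because the players alternate.
        for visited in reversed(path):
            visited.value_sum += leaf_value
            visited.visit_count += 1
            leaf_value = -leaf_value
    # After all iterations, play the most visited action at the root.
    return max(root.children, key=lambda a: root.children[a].visit_count)
```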
AlphaGo [1]¶
AlphaGo trains two networks: a policy network and a value network.
- The policy network is trained with supervised learning on expert demonstrations, followed by policy gradient on self-play games.
- Given the trained policy network, it samples a lot of games. These games are used to train the value network.
With that, all training is done. At inference time, i.e., when playing against other Go players, AlphaGo uses MCTS to select each move. Yes! MCTS is only used at inference, not during training.
AlphaGo Zero [2]¶
Compared to AlphaGo, where MCTS is only used at inference, AlphaGo Zero uses MCTS to train the networks, as a policy iteration method. There are several differences:
- AlphaGo Zero is trained without any human data, purely with RL from scratch.
- The policy network and the value network are combined into one network.
- In each training step, a game of Go is played until the end: $s_0, a_0, s_1, a_1, \dots, s_T$. Each move $a_t$ is selected by an MCTS rollout starting from $s_t$, using the policy network and the value network. Finally, the policy network $q_\theta(s_t)$ is trained to mimic the MCTS-predicted probabilities $\pi(s_t)$, and the value network $v_\theta(s_t)$ is trained to regress the game outcome $z_T \in \{1, -1\}$. The objectives are $$\mathcal{L} = \left(z_T - v_\theta(s_t)\right)^2 - \pi(s_t)^\top \log q_\theta(s_t) + c \|\theta\|^2$$
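A hedged PyTorch-style sketch of this combined objective; the network interface (a `net` returning log move probabilities and a scalar value) and the tensor shapes are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def alphago_zero_loss(net, states, mcts_probs, outcomes, c=1e-4):
    # net(states) is assumed to return (log-probabilities over moves, value in [-1, 1]).
    log_p, v = net(states)
    value_loss = F.mse_loss(v.squeeze(-1), outcomes)           # (z - v)^2
    policy_loss = -(mcts_probs * log_p).sum(dim=-1).mean()     # -pi^T log q
    l2 = c * sum((w ** 2).sum() for w in net.parameters())     # weight decay
    return value_loss + policy_loss + l2
```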
AlphaZero [3, 5]¶
AlphaZero is almost the same as AlphaGo Zero, except that it applies to other games beyond Go, including Chess and Shogi. Similar in spirit to AlphaZero, Expert Iteration [5] was proposed to train policy networks against a stronger expert, namely MCTS search guided by the policy network itself.
MuZero [4]¶
In the game of Go, the game transition is known, i.e., when you play $a_t$ at $s_t$, you know exactly what the next state $s_{t+1}$ is, so you can roll out through the MCTS tree. However, in more general domains like Atari games, the transition is not known, i.e., you do not know exactly what $s_{t+1}$ is. To deal with this problem, MuZero trains a recurrent network that predicts the next hidden state given the current hidden state and action. These hidden states are then used in AlphaZero-style training. The key idea of learning the dynamics in a hidden space is that the model can focus on what really matters for the game instead of trying to predict everything, much of which is irrelevant, e.g., the exact RGB value at a certain pixel location.
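A minimal sketch of MuZero's three learned functions, assuming a simple fully connected architecture; the class name `MuZeroNet`, the layer sizes, and the head names are illustrative choices, not the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MuZeroNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden_dim=256):
        super().__init__()
        # h: observation -> initial hidden state
        self.representation = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        # g: (hidden state, action) -> next hidden state (plus a reward head)
        self.dynamics = nn.Sequential(nn.Linear(hidden_dim + n_actions, hidden_dim), nn.ReLU())
        self.reward_head = nn.Linear(hidden_dim, 1)
        # f: hidden state -> (policy logits, value)
        self.policy_head = nn.Linear(hidden_dim, n_actions)
        self.value_head = nn.Linear(hidden_dim, 1)
        self.n_actions = n_actions

    def initial_inference(self, obs):
        s = self.representation(obs)
        return s, self.policy_head(s), self.value_head(s)

    def recurrent_inference(self, s, action):
        # Unroll one step purely in the hidden space: no real environment
        # transition is needed, only what matters for reward, value, and policy.
        a = F.one_hot(action, self.n_actions).float()  # action: LongTensor of indices
        s_next = self.dynamics(torch.cat([s, a], dim=-1))
        return s_next, self.reward_head(s_next), self.policy_head(s_next), self.value_head(s_next)
```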
References¶
[1] Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
[2] Silver, David, et al. "Mastering the game of Go without human knowledge." Nature 550.7676 (2017): 354-359.
[3] Silver, David, et al. "Mastering chess and shogi by self-play with a general reinforcement learning algorithm." arXiv preprint arXiv:1712.01815 (2017).
[4] Schrittwieser, Julian, et al. "Mastering Atari, Go, chess and shogi by planning with a learned model." Nature 588.7839 (2020): 604-609.
[5] Anthony, Thomas, Zheng Tian, and David Barber. "Thinking fast and slow with deep learning and tree search." Advances in neural information processing systems 30 (2017).