Q-Learning with TensorFlow (Python)

In Part 5 of this series, we finished refactoring our Gym code to use a type family. This would make it much easier to add new games to our framework in the future. In this part, we're going to incorporate TensorFlow and perform some more advanced learning techniques.

We've used Q-Learning to train agents to play simple games like Frozen Lake and Blackjack. Our existing approach uses an exhaustive table mapping observations to expected rewards. But in most games we won't be able to construct such a table. The observation space will be too large, or it will be continuous. So in this article, we're going to explore how to use TensorFlow to build a more general function we can learn. We'll start this process in Python, where there's a bit less overhead.

Next up, we'll be using TensorFlow with our Haskell code. We'll explore an alternative form of our FrozenLake monad using this approach. To make sure you're ready for it, download our Haskell TensorFlow Guide.

A Q-Function

Our goal here will be to make a more general Q-Function, instead of using a table. A Q-Function provides another way of writing our chooseAction function. With the table approach, each of the 16 possible observations had 4 scores, one for each of the actions we can take. To choose an action, we just take the index with the highest score.
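As a quick reminder of the table version, choosing an action is just an argmax over the row for the current observation. Here's a minimal sketch (the q_table array below is hypothetical, standing in for our old table):

import numpy as np

# Hypothetical 16x4 table: one row per observation, one column per action.
q_table = np.zeros((16, 4))

def choose_action_from_table(obs):
  # Pick the action whose stored score is highest for this observation.
  return np.argmax(q_table[obs])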

We now want to incorporate a simple neural network for chooseAction. In our example, this network will consist of a single matrix of weights. The input to our network will be a vector of size 16. This vector will have all zeroes, except for the index of the current observation, which will be 1. Then the output of the network will be a vector of size 4. These will give the scores for each move from that observation. So our "weights" will have size 16x4.

One useful helper function we can write right away converts an observation into such an input tensor. This will make use of the identity matrix.

def obs_to_tensor(obs):
  # Take row `obs` of the 16x16 identity matrix: a 1x16 one-hot vector.
  return np.identity(16)[obs:obs+1]
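For instance, observation 3 becomes a 1x16 row whose only nonzero entry is at index 3. A quick check (assuming numpy is imported as np, as in the next section):

sample = obs_to_tensor(3)
print(sample.shape)       # (1, 16)
print(np.argmax(sample))  # 3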

Building the Graph

We can now go ahead and start building our tensor graph. We'll start with the part that makes moves from an observation. For this quick Python script, we'll let the tensors live in the global namespace.

import gym
import numpy as np
import tensorflow as tf

tf.reset_default_graph()
env = gym.make('FrozenLake-v0')

# 1x16 one-hot observation in, 16x4 weight matrix, 1x4 action scores out.
inputs = tf.placeholder(shape=[1,16], dtype=tf.float32)
weights = tf.Variable(tf.random_uniform([16, 4], 0, 0.01))
output = tf.matmul(inputs, weights)
prediction = tf.argmax(output, 1)

Each time we make a move, we'll pass the current observation tensor as the input placeholder. Then we multiply it by the weights to get scores for each possible action. Our final "prediction" is the output index with the highest score. Notice how we initialize our network with small random weights. This helps prevent our network from getting stuck early on.
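If you want to see the untrained network in motion, here's a quick sanity check (not from the original post) that initializes the variables in a throwaway session and runs a single prediction for observation 0:

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  # `action` is a 1-element array; `scores` is the raw 1x4 output.
  action, scores = sess.run(
    [prediction, output],
    feed_dict={inputs: obs_to_tensor(0)})
  print(action, scores)  # e.g. [1] and four small random values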

We can use these tensors to construct our choose_action function. This will, of course, take the current observation as an input. But it will also take an epsilon value for the random move probability. We use sess.run to evaluate our prediction and output tensors. If we choose a random move instead, we'll replace the actual "action" with a sample from the action space.

def choose_action(input_obs, epsilon):
  # Evaluate the network: `action` is a 1-element array, `all_outputs` is 1x4.
  action, all_outputs = sess.run(
    [prediction, output],
    feed_dict={inputs: obs_to_tensor(input_obs)})
  # With probability epsilon, override the choice with a random move.
  if np.random.rand(1) < epsilon:
    action[0] = env.action_space.sample()
  return action, all_outputs

The Learning Process

The first part of our graph tells us how to make moves, but we also need to update our weights so the network gets better! To do this, we'll add a few more tensors.

next_output = tf.placeholder(shape=[1,4], dtype=tf.float32)
loss = tf.reduce_sum(tf.square(next_output - output))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
update_model = trainer.minimize(loss)

init = tf.global_variables_initializer()

Let's go through these one-by-one. We need to take an extra input for the target values, which incorporate the "next" state of the game. We want the values we get in the original state to be closer to those! So our "loss" function is the squared difference of our "current" output and the "target" output. Then we create a "trainer" that minimizes the loss function. Because our weights are the "variable" in the system, they'll get updated to minimize this loss.
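To make the loss concrete, here's the same calculation worked by hand in plain numpy, with made-up numbers (illustrative only, not part of the training graph):

# Suppose the network currently produces these 4 scores for the current state,
# taking action 2 earned a reward of 1.0, and the best next-state value is
# 0.006, with gamma = 0.81.
current = np.array([[0.004, 0.007, 0.002, 0.005]])
target = current.copy()
target[0, 2] = 1.0 + 0.81 * 0.006
loss_value = np.sum(np.square(target - current))  # what `loss` computes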

We can use this second group of tensors to construct our "learning" function.

def learn_env(current_obs, next_obs, reward, action, all_outputs):
  gamma = 0.81
  # Score the "next" state without any randomness.
  _, all_next_outputs = choose_action(next_obs, 0.0)
  next_max = np.max(all_next_outputs)
  # The chosen action should be worth the reward plus the discounted
  # best value from the next state.
  target_outputs = all_outputs
  target_outputs[0, action[0]] = reward + gamma * next_max
  sess.run(
    [update_model, weights],
    feed_dict={inputs: obs_to_tensor(current_obs),
               next_output: target_outputs})

We start by choosing an action from the "next" position (without randomness). We get the largest value from that choice. We use this and the reward to build our "target" for what the current outputs should be. In other words, taking our action should give us the reward plus the best value we would get from the next position. Then we update our model!

Playing the Game

Now all that's left is to play out the game! This looks a lot like code from previous parts, so we won't go into too much depth. The key section is in the middle of the loop. We choose our next action, use it to step the environment, and use the reward to learn.

rewards_list = []

with tf.Session() as sess:
  sess.run(init)
  epsilon = 0.9
  decay_rate = 0.9
  num_episodes = 10000
  for i in range(num_episodes):
    # Reset environment and get first new observation
    current_obs = env.reset()
    sum_rewards = 0
    done = False
    num_steps = 0
    while num_steps < 100:
      num_steps += 1

      # Choose, Step, Learn!
      action, all_outputs = choose_action(current_obs, epsilon)
      next_obs, reward, done, _ = env.step(action[0])
      learn_env(current_obs, next_obs, reward, action, all_outputs)

      sum_rewards += reward
      current_obs = next_obs
      if done:
        if i % 100 == 99:
          epsilon *= decay_rate
        break
    rewards_list.append(sum_rewards)
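To get a rough sense of how training went, we can check the success rate over the last stretch of episodes (a quick check, not part of the original script; in Frozen Lake the reward is 1 only when we reach the goal):

print(sum(rewards_list[-1000:]) / 1000.0)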

Our results won't be quite as good as the table approach. Using a tensor function allows our system to be a lot more general. But the consequence of this is that the results aren't stable. We could, of course, improve the results by using more advanced algorithms. But we'll get into that another time!

Conclusion

Now that we know the core ideas behind using tensors for Q-Learning, it's time to do this in Haskell. In part 7 we'll do a refresher on how Haskell and TensorFlow work together, and apply it to our game. We'll see how we can work these ideas into our existing Environment framework.