Temporal Difference Primer


Last week we finished our exploration of supervised learning with our maze game. We explored a more complex model that used convolution and pooling. This week, we're going back to "unsupervised" learning. We'll consider another approach that does not require the specification of "correct" outputs.

This approach is Temporal Difference Learning (TD Learning). It relies on having a function to evaluate a game position. Its main principle is that the current position should have a similar evaluation to positions in the near future.

Our evaluation function will use weights whose values our training program will learn. We'll want to learn these weights so that the evaluation of each position stays as close as possible to the evaluations of the positions that follow it. In this article, we'll take a high-level look at this approach before we get into the details next time.
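Here's a minimal sketch of that idea, assuming a plain linear evaluation over a list of position features. All of the names here (tdUpdate, evaluate, the learning-rate parameter) are hypothetical; our real model will live in TensorFlow.

-- A minimal sketch of the TD idea, using a linear evaluation
-- over a list of position features.

type Weights  = [Double]
type Features = [Double]

-- Evaluate a position as a weighted sum of its features.
evaluate :: Weights -> Features -> Double
evaluate ws fs = sum (zipWith (*) ws fs)

-- One TD-style update: nudge the weights so the evaluation of the
-- current position moves toward the evaluation of the next position.
tdUpdate
  :: Double    -- learning rate
  -> Weights
  -> Features  -- features of the current position
  -> Features  -- features of the position we reach next
  -> Weights
tdUpdate lr ws current next =
  let tdError = evaluate ws next - evaluate ws current
  in  zipWith (\w f -> w + lr * tdError * f) ws current

In the full version, TensorFlow will handle this gradient step for us. The important part is that the "target" for each position is simply the evaluation of the position that comes after it.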

History of TD Learning

The concept of TD learning was first developed in the 1980s. One of the more famous applications of TD learning in the 1990s was TD-Gammon, an agent that learned to play Backgammon. It could play the game at an intermediate human level, and it got there initially without any hand-crafted features or game-specific knowledge built into the algorithm.

Getting to this level with a "knowledge-free" algorithm was almost unheard of at the time. Once hand-crafted features were added, the agent could play at a near-expert level. It explored many possibilities that human players had written off, and in doing so it contributed new ideas to high-level backgammon play. It was an important breakthrough in unsupervised techniques.

Q-Learning vs. TD Learning

A few weeks back, we explored Q-Learning. And at first glance, Q-Learning and TD learning might sound similar. But with temporal difference, we'll be learning a different function. In Q-Learning, we learned the Q function. This is a function that takes in our current game board and provides a score for each possible move. With TD, we'll be learning what we call the V function. This function is a direct evaluation of the current board.

With our game mechanics, our agent chooses between 10 different moves. So the "output" vector of our Q-Learning network had size 10. Now in temporal difference learning, we'll only output a single number. This will be an "evaluation", or score, of the current position.

If a game has more than 2 outcomes, you would want the evaluation function to give a score for each of them. But our game has a binary outcome, so one number is enough.
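To make the contrast concrete, here are two hypothetical signatures for the two approaches. These are just illustrative Haskell stand-ins, not the real TensorFlow code:

-- Hypothetical signatures to contrast the two functions; our actual
-- models are TensorFlow graphs, not plain Haskell functions.

data World = World   -- stand-in for our game state

-- Q-Learning: one score per possible move (10 of them in our game).
qFunction :: World -> [Double]
qFunction _ = replicate 10 0.0   -- placeholder output of length 10

-- TD Learning: a single evaluation of the position itself, which we
-- can read as something like a probability of winning.
vFunction :: World -> Double
vFunction _ = 0.5                -- placeholder evaluation

So where Q-Learning scores moves, TD learning scores positions, and we pick moves by comparing the positions they lead to.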

Basics

Despite this difference, our TensorFlow code will have a similar structure to Q-Learning. Here's a high-level overview (a short code sketch follows the list):

  1. Our model will take an arbitrary game state and produce a score.
  2. At each iteration, we will get the model's output score on all possible moves from that position. We'll account for enemy moves when doing this. We will then choose the move for the best resulting board.
  3. We will advance the world based on this move, and then pass the resulting world through our model again.
  4. We will then adjust the weights so that the evaluations of the new world and the original world are closer together.
  5. If the resulting world is either a "win" or a "loss", we'll use the correct value (1 or 0) as the evaluation. Otherwise, we'll use our evaluation function.
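Here's a rough Haskell sketch of one iteration of that loop. The World type, possibleMoves, and the outcome check are placeholders standing in for our real game code, and the evalFn argument plays the role of our learned V function:

import Data.List (maximumBy)
import Data.Ord (comparing)

data Outcome = Win | Loss | Ongoing deriving (Eq, Show)

data World = World { outcome :: Outcome }

-- All boards reachable in one step, already accounting for enemy moves.
possibleMoves :: World -> [World]
possibleMoves w = [w]   -- placeholder: the real game generates one per move

-- Step 5: terminal positions get the true value (1 or 0), everything
-- else gets the model's own evaluation.
target :: (World -> Double) -> World -> Double
target evalFn w = case outcome w of
  Win     -> 1.0
  Loss    -> 0.0
  Ongoing -> evalFn w

-- Steps 2 and 3: pick the move whose resulting board scores best, then
-- advance the world to that board.
chooseNext :: (World -> Double) -> World -> World
chooseNext evalFn w = maximumBy (comparing (target evalFn)) (possibleMoves w)

Step 4 is where the actual learning happens: we treat the target value of the new world as fixed and adjust the weights so that our evaluation of the original world moves toward it.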

What's Next

Next time, we'll dig into more specifics. It will be a bit tricky to use an evaluation function for our game in conjunction with TensorFlow. But once we have that, we can get into the meatier parts of this algorithm. We'll see exactly what operations we need to train our agent.

To learn more about using Haskell with AI, read our Haskell AI Series! This series shows some of the unique ideas that Haskell can bring to the world of machine learning.
