Our first algorithms of a totally different class. Temporal-Difference (TD), just like Monte-Carlo method, learns directly from an experience of interacting with an environment. TD is model-free, does not require any knowledge of MDP transactions or rewards. No need to worry about how different things affect our state values. There is, also no more need to have a full episode to get experience. TD applies to non-episodic – continuous environments as well. It learns from incomplete episodes, using BOOTSTRAPPING.

The first paragraph is a bit packed with terms. Let’s untangle it a little. First of all, the way we avoid the need to reach the end of the episode is that we simply **estimate** how much reward we are going to get when we reach the end (or some horizon, in case of continuous environments). It seem absurd, but TD updates a guess towards a guess. This fundamental idea behind TD learning is called BOOTSTRAPPING. Let us formalize what we just learned in a definition of the simplest TD learning algorithm:

Few things to note about the definition. Zero in TD(0) means that we are only looking one step ahead, later we will expand this algorithm to look farther. TD target is the return value estimate that we are “moving” towards and trying to “get right”. And as usual, the difference between the target and the current value is the error.

TD updates a guess towards a guess

It is worth mentioning another great benefit of “quicker” feedback from our environment is that in an extreme situation it is far more advantageous to account for an encountered conditions right away. Let’s say we have a drone that is learning how to deliver packages in an urban environment. Unfortunately neighborhood kids decided to trap our drone and get the package using a fishing net. So our drone senses a net flying towards him and calculates that it will inevitably crash if continues to fly straight towards the destination. But there is no action programmed for this exact situation see it flies on. Well, as it turns out there was not enough force behind the net throwing hand and the gravity took the net down before it hit our drone. So the drone continues to fly forward, as if nothing really happened. Hence, the extreme experience would not be backed up since the drone reaches its destination and the next time it encounters similar situation it will not be ready. I know, this is a lot of words for something this simple, but it illustrates the thinking behind the algorithm really well.

You cannot backup death

In conclusion of the post, let’s talk about the bias-vs-variance trade-off, as it is the perfect point to illustrate it. Let’s recall the Monte-Carlo return

Gt is an unbiased estimation of a state value under some policy. It is so, because there is no bias introduced and we get a true value. However, since a return depends on so many intermediate rewards, and each reward depends on some set of known/unknown variables (inevitable environment randomness) there is imminent variance in a return. Nevertheless, Monte Carlo does provide a true return value. TD target on the other hand has much lower variance, because it depends only on a single random action. Despite of TD being low variance it is highly biased, since we introduce an estimate of a state value, and that estimate can be off by a great amount.

MC – high variance. TD – high bias.

In the next part we’ll expand temporal difference method for policy evaluation to look farther down a path, and will hopefully have some code to show that it actually works.