In section 1 of TD learning we have seen a new class of algorithms that can learn online after every step. In other words TD can learn before and without the final outcome using Bootstrapping – the idea of updating a guess towards a guess. In particular, we’ve looked at the TD(0), an algorithms that updates state’s value after one step. Intuitive continuation is to gather more information by looking ahead more than one step and only then updating the value. Let’s see how we can fully leverage that idea.

Instead of looking ahead 1 step, let’s see how would we go about looking ahead a few more steps and try to generalize the TD algorithms to look n steps ahead. It is fairly intuitive, move forwards as much as you like, gather rewards and update initial state value using gathered rewards. If your environment is an episodic and you decide to look ahead all the way till the terminal episode, our TD algorithm would turn into a MC.

Let’s formalize what we just talked about in mathematical terms:

So far we’ve only looked at TD(0) case, when n = 1. We only look ahead one step and using the value of the state that we end up in. If we set n = 2, the return will be the sum of the return after one step, discounted reward after two steps, and twice discounted value of the state that we end up in after two steps. You get the idea. In the same way we may define n-step temporal-difference learning.

Now we know that it is convenient to look n steps ahead. The problem is that for different problems there are different optimal values for n. And we don’t really want to be figuring out the best n for every problem. We want a robust algorithm that doesn’t need to know the best n.

To build intuition, let’s combine returns from 2 different values of n [2,4] and average them. So we look two steps ahead, calculate the return; we look four steps ahead and again calculate the return. Sum up the results and divide by 2. It is a good example, but the real question is how do we combine the results for all n in an efficient manner? The answer is TD(λ).

The idea behind TD(λ) is to use combined geometrically weighted returns from all n-steps. The weights are controlled by a constant λ parameter, that is between 0 and 1. Now, if you are thinking that we kind of rewrote MC in a fancy way, meaning that our algorithm still suffers from the need to have complete episodes in order to update any values. You are exactly right. So called forward-view TD(λ) can only be computed from complete episodes. Luckily there is a little trick that achieves the same results that forward-view does but without the need to wait for episode completion. And it is… the backward-view, which updates state values online, every step, from incomplete sequences.

There is one minor problem with looking back in time – which states should be assigned more credit? The most frequent ones, or the most recent ones. The so called “ELIGIBILITY TRACES” combine two characteristics in the following way:

Which means that every time we visit a state it’s eligibility trace increases, and for every step that doesn’t visit the trace value decays exponentially. So we need to keep an eligibility trace for every state. Update value of every state every time we do updates in proportion to TD-error and eligibility trace. Let’s see what the complete Backward View TD(λ) looks like.

There you have it. Now you know what TD(λ) is. We also looked at the forward and backward views, and eligibility traces. I think it is enough for the model-free predictions. We’ve walked around the really interesting topic long enough and in the next part we are going to finally see what it takes to do the Model-Free Control.