Ever wondered how online algorithms came to be the thing? There is a clear path from Monte-Carlo method to the non-stationary online algorithms that rock our world nowadays. It seems trivial, but does help with understanding how different pieces fit together.

In previous part we’ve discovered the power of Model-Free Prediction algorithms. In a nutshell, Model-Free algorithms allows us to disregard everything that has an effect on the value function, and go directly to the actual value function. Let’s say, for example, we want to “teach” some robot to pick up an object and throw it in a basket. There are a lot of variables that affect our value function: object’s weight, shape, it’s position with regards of a robot and a basket, size and position of a basket, etc. Yet, we don’t need to define how all of the environment variables affect our value function.

Learning from an experience allows us to bypass all intermediate dependencies and go directly to the value function. This is one of the greatest powers of Reinforcement Learning. In this part we’ll rewrite our formulation of Monte-Carlo method to see how we could avoid one of the major drawbacks of the method – the need for complete episodes. There will be very little of new material, but understanding it really helps you to understand the grand scheme of things and how different pieces fit together.

Recall from the previous part, that the Monte-Carlo method estimates a value function based on complete episodes by simply calculating a mean of all returns.

Depending on your math background, you may know (I didn’t) that there is a way to calculate mean incrementally. Believe it or not, I had an interview question asking me to implement an incremental mean. I had to actually convince myself doing pen and paper calculations that it is, in fact, true. Anyways, here is how it is calculated.

If you don’t care about how we got the last line, let me explain why it is so darn interesting. What it really shows is that to calculate our new estimate (new mean) we can take our previous estimate (previous mean) and add some learning rate (step size, or in this case 1/j) multiplied by the difference between previous estimate and the value that we actually observed. It does look familiar, doesn’t it? The difference is nothing more than an **error** between predicted and actual estimate. The multiplication coefficient is your learning rate – alpha. There are a lot of online algorithms that take this form. I’m sure you’ve seen the last line written with another terms, but fundamentals stay the same. I hope it is very revealing to see how Monte-Carlo method is pretty much a grandfather of all Learning.

Let’s improve Monte-Carlo method by using incremental updates. Meaning that we don’t need to keep all samples that we saw, just the number of times we visited a state.

This simple and trivial rewrite of the Monte-Carlo method allowed us to get rid off keeping all returns. That’s nice. There is one more bit that separates this algorithms from becoming an online one – we still keep track of the number of visits. Meaning that episode has to terminate. Let’s think about this, we want our algorithms to be online. What if we simply replace error multiplication coefficient with a constant value and forget about the visit count for good? It sure sounds daring, but helps with problems may change their rules overtime.

It seems as if the previous equation is somehow universal, if you’ve ever read about back-propagation and gradient decent, you saw this exact equation written in somewhat different form. Well, now you know how it got to be what it is.

No code this time, it seems too trivial to update Monte-Carlo to use incremental updates. The next part is very exciting, as we going to move towards Temporal-Difference Learning. Stay tuned…