RL Part 4.2 Policy Iteration.

This is a continuation of my attempt to learn the basics of Reinforcement Learning. I took a short, well-deserved break and am now ready to continue. In the previous post we went over dynamic programming and discovered our first algorithm for evaluating a given fixed policy: Iterative Policy Evaluation. That is a good start, but it does not really help us find an optimal policy, which is the ultimate goal.

Let’s start with the following question: “How do we make some policy better?” In the previous part we figured out how to evaluate a given policy, and that is the step that allows us to compare policies. To actually improve a policy, we can force our agent to always take the most valuable action. Recall that a policy gives us, for the current state, the set of available actions along with a probability for each one. To improve the policy, all we need to do is replace those action probabilities with the single action that yields the most reward according to our value function. In other words, we force our agent to pick the action that transitions it to the state with the highest value. This gives us a full circle: evaluate policy -> improve policy -> evaluate policy -> improve policy, and so on. At some point policy evaluation and policy improvement converge to an optimal policy. I don’t know about you, but to me that sounds both intuitive and somewhat magical. The magic is that we end up with an optimal policy regardless of what our initial policy was.
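To make the greedy improvement step concrete, here is a minimal sketch (not the code from the linked repo; the state, action, and transition names are assumptions for illustration). It does a one-step lookahead for every action and puts all probability on the best one:

```python
import numpy as np

def improve_policy(V, states, actions, transitions, gamma=0.9):
    """Greedy policy improvement with respect to the value function V.

    transitions[(s, a)] is assumed to be a list of (prob, next_state, reward)
    tuples describing the environment dynamics (a hypothetical layout).
    """
    policy = {}
    for s in states:
        # One-step lookahead: expected return of taking each action in state s.
        action_values = np.array([
            sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[(s, a)])
            for a in actions
        ])
        # Act greedily: all probability mass goes to the best action.
        best = actions[int(np.argmax(action_values))]
        policy[s] = {a: (1.0 if a == best else 0.0) for a in actions}
    return policy
```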

Figure 1. Policy iteration.

The picture above describes the basic idea used throughout Reinforcement Learning. There are two alternating steps: policy evaluation and policy improvement. We evaluate the given policy; the result is a value function, which we then use to update the policy by acting greedily with respect to that value function. Then we repeat the cycle. If it seems like I keep saying the same thing over and over, it is because I do. It is a fundamental idea that needs to be well understood.
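The loop itself is short. Here is a sketch of the alternation, assuming an `evaluate_policy` function like the Iterative Policy Evaluation from the previous post and the `improve_policy` sketch above:

```python
def policy_iteration(states, actions, transitions, evaluate_policy, gamma=0.9):
    """Alternate policy evaluation and greedy improvement until the policy
    stops changing, at which point it is optimal for the given MDP.
    """
    # Start from an arbitrary policy: uniform over actions in every state.
    policy = {s: {a: 1.0 / len(actions) for a in actions} for s in states}
    while True:
        V = evaluate_policy(policy)          # e.g. iterative policy evaluation
        new_policy = improve_policy(V, states, actions, transitions, gamma)
        if new_policy == policy:             # greedy policy is stable -> done
            return new_policy, V
        policy = new_policy
```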

Let’s get to coding then. Click, click, click-click….. 95 lines in 4 clicks, not too bad. There you go, it is done.

TLDR: Here is the CODE. Make sure you have all the files from the parent folder.


I encourage you to write out the algorithm yourself. It is a great exercise and really forces you to understand the material.

Stay tuned for the next part. Write some code, will you!?
