So far in the series we’ve built an intuitive idea of what RL is, and we described the system using a Markov Reward Process and a Markov Decision Process. We know what a policy is and what the optimal state and action value functions are. We’ve seen the Bellman Optimality Equation, which let us define the optimal action value function recursively. BUT we haven’t actually solved the Markov Decision Process yet – found an optimal action value function that would tell us the best action in a given state. In this post we will explore the Dynamic Programming approach to do just that. TLDR: just show me the code.
Dynamic Programming overview
Before we jump in, let’s briefly go over what Dynamic Programming means, word by word. Dynamic refers to a process that unfolds in sequential steps. Programming does not refer to what we do when we write code, but to the plan – the sequence of steps – that we decide to apply. More broadly, it is a way of solving problems that satisfy the two following properties:
1. Optimal substructure – another way of saying recursive definition: the main problem can be described in terms of smaller instances of itself. For example, if we have a path A-B-C, the problem of finding a path from A to C can be decomposed into finding a path from A to B plus a path from B to C.
2. Overlapping sub-problems – plenty of repeated sub-problems, which allows us to memorize (or, as computer scientists call it, memoize) previously computed values. Say there is a graph and we need to find a path from A to B and a path from A to C. Once we have found the path from A to B, we don’t need to compute it again while looking for the path from A to C; we just reuse the previously computed value.
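The two properties above are easiest to see in a toy problem. Here is a minimal sketch (not from the post’s repo) using Fibonacci numbers: the recursive definition is the optimal substructure, and memoization exploits the overlapping sub-problems.

```python
from functools import lru_cache

# Naive recursion recomputes the same sub-problems exponentially many times.
def fib_naive(n):
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

# Memoization caches each sub-problem's result so it is computed only once.
@lru_cache(maxsize=None)
def fib_memo(n):
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)

print(fib_memo(50))  # returns instantly; fib_naive(50) would take ages
```

The same trick is what makes dynamic programming on MDPs tractable: each state value is computed once per sweep and reused everywhere it appears.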
You may ask how this helps us solve an MDP. Luckily, the Bellman Equations are recursive (optimal substructure) and the value functions reuse previously computed values (overlapping sub-problems). It should be noted that dynamic programming assumes full knowledge of the environment.
It is important to understand why we evaluate some arbitrary policy: it is the key to finding an optimal one. In a prediction task we are given an MDP and a policy, and the task is to compute the value function with respect to that policy. In other words, we need to predict the value of each state if we act according to the given policy. Remember that the value of a state is the expected sum of the immediate reward for leaving state s and the discounted value of the future state. Having an optimal value function is enough to have an optimal policy: acting greedily with respect to an optimal value function yields optimal control. Let’s formalize this and look at an example. Recall from Part 3 that the Bellman Expectation Equation for the state value function allows for a two-step look-ahead.
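As a refresher, the Bellman Expectation Equation for the state value function can be written as follows (this uses the standard textbook notation; the symbols here are my rendering, not necessarily the exact form shown in Part 3):

```latex
v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)
\left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a} \, v_{\pi}(s') \right)
```

Reading it left to right gives the two-step look-ahead: first average over the actions the policy might take in state s, then over the states the environment might transition to.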
The idea is the following: start with some initial policy. Using the Bellman Expectation Equation, update the state value function. Then compute the difference between the previous and current values across all states. If the difference is within some small margin, we assume the current state value function has converged for the given policy. In the backup diagram above, Vk is the previous state value function: we plug its values in at the bottom leaves and update Vk+1. Seems simple enough, right? Let’s write it down the way real scientists do:
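The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the repo’s exact code; the transition format (`P[s][a]` as a list of `(prob, next_state, reward, done)` tuples, as in Gym-style environments) and the names `P`, `policy`, `theta` are my own assumptions.

```python
import numpy as np

def evaluate_policy(P, policy, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation (illustrative sketch).

    P[s][a]      -- list of (prob, next_state, reward, done) tuples
    policy[s][a] -- probability of taking action a in state s
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman Expectation Equation: average over actions
            # (weighted by the policy), then over next states.
            v_new = 0.0
            for a, transitions in P[s].items():
                for prob, s_next, reward, done in transitions:
                    v_new += policy[s][a] * prob * (
                        reward + gamma * V[s_next] * (not done))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # values stopped changing: converged
            return V
```

Note the stopping rule: we track the largest change across all states in one sweep and stop once it falls below the small margin `theta`.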
GitHub repo of the implementation below.
If you want to see the code in action and play with it, there you go: the link to the GitHub repository. I encourage you to code it up yourself; it really helped me understand what is going on and what needs to be set up for everything to work. It took a day to cook up. Good times.
In the next part we’ll explore Policy Iteration and Value Iteration. I’ll start writing the implementation soon.