Recall that in part 2 we introduced the notion of a Markov Reward Process. It is really just a building block, since our agent was not able to take actions; it was simply transitioning from one state to another along with the environment. That's not really helpful, since we want our agent not only to experience transitions but also to pick its actions. In order to do so, we will add a set of actions to the MRP and promote it to a Markov Decision Process, the term that is actually used in Reinforcement Learning.
If you’ve been following the series, you saw that we’ve been gradually adding layers of complexity. Today we are adding a finite set of actions that our agent is able to perform. The set of actions does not have to be finite (there are ways to deal with continuous action sets), but for now we will stick with finite ones. Since we added the agent’s actions into the mix, the state transition probability will now also depend on the action our agent decides to take. Think of it this way: previously our agent was a passive part of the environment. If it rains and our agent is not under a cover, it just gets wet. Now that we have introduced actions, the future state depends on the agent’s action: if the agent decides to move under a cover (becomes active), it will not get wet, hence a different state. Previously, the state transition probability matrix was a 2-dimensional matrix; now each action has its own corresponding state transition matrix, since our actions affect what state the environment ends up in next. Intuitively, there is a multitude of possible policies to follow, and later we will see how to pick an optimal one. But for simplicity, let’s say that we have a single policy that we follow; we don’t care at this point whether it is optimal. So it should look something like the following:
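To make the extra dimension concrete, here is a minimal sketch: one transition matrix per action, stacked into a 3-dimensional array. All states, actions, and probabilities below are made up for illustration, not taken from the example MDP.

```python
import numpy as np

# Hypothetical MDP with 3 states and 2 actions; the numbers are illustrative only.
n_states, n_actions = 3, 2

# P[a, s, s2] = probability of landing in state s2 after taking action a in state s.
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.7, 0.3, 0.0],   # action 0: "stay put"
        [0.0, 0.9, 0.1],
        [0.0, 0.0, 1.0]]
P[1] = [[0.2, 0.8, 0.0],   # action 1: "move"
        [0.0, 0.1, 0.9],
        [0.0, 0.0, 1.0]]

# Each action's transition matrix is a valid stochastic matrix: rows sum to 1.
assert np.allclose(P.sum(axis=2), 1.0)
```

In the MRP we had a single `n_states x n_states` matrix; the only change here is the extra leading axis indexed by the action.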
Following the same logic, the reward function may also depend on the action. This is not always the case: in some environments the rewards stay the same no matter what actions you take, or depend only on the environment's transitions. So aside from adding a set of actions and an extra dimension to the transition matrix and the reward function, everything else stays the same. Let's see how we can modify the Markov Reward Process graph from the previous part to include decisions.
There are 4 distinct states (1..4) where our agent is faced with a choice of an action. Each action carries an associated reward. The small crossed-out circle represents randomness in our process. When the agent is in state 2 or 3 and decides to play a video game, just like before there is a 0.9 probability that our agent will be transitioned by the environment to the terminal state 5, and a 0.1 probability of transitioning back to the starting state 1. In this example, state transition probabilities are only defined for the crossed-out state. Since our agent can now choose an action, we need to formalize a way to do so. A policy does exactly that.
This seems very similar to the state transition matrix, but instead of a state-to-state mapping we have a state-to-action mapping. The idea stays the same: if our agent is in state s, the policy defines the probabilities of the agent choosing some action a. That's it. Simply put, a policy defines how our agent behaves. An important thing to note is that a policy only depends on the current state and is independent of the time step and the history. No matter how far into a trajectory our agent is or what happened previously, the policy stays the same.
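As a sketch, a stochastic policy can be represented as a mapping from states to action probabilities. The state and action names below are hypothetical, not the ones from our example MDP.

```python
import random

# Hypothetical stochastic policy: for each state, a probability distribution
# over actions. It depends only on the current state, never on time or history.
policy = {
    "state_1": {"study": 0.5, "play_video_game": 0.5},
    "state_2": {"study": 0.8, "play_video_game": 0.2},
}

def sample_action(state, rng=random):
    """Sample an action in `state` according to the policy's probabilities."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

# Every state's action probabilities must sum to 1.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in policy.values())
```

A deterministic policy is just the special case where one action per state has probability 1.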
Action Value Function
Since we added actions, we need to redefine our state value function to reflect that. Once again, it is important to understand that there are multiple policies our agent can follow, and each policy yields its own state value function.
If you think about the state value function definition, you might start to wonder how we would evaluate actions the same way we evaluate states. Since our actions affect future states, it makes sense to have an action value function.
Intuitively, it states the following: when our agent is in state s, what is the expected total reward it can get if it takes action a? This is what we are going to use to make decisions about which actions to take in order to behave optimally. The next thing on our list, just like in the previous post, is the Bellman Expectation Equation, but with an added policy layer. It sounds intimidating, but all it really says is that the action value function (just like the state value function) can be broken down into the immediate reward plus the discounted value of the successor (state, action) pair.
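In code, that one-step decomposition can be sketched like this. The data layout and the tiny example at the bottom are my own, not from the article.

```python
def q_backup(R, P, pi, q, s, a, gamma=1.0):
    """One Bellman expectation backup for q(s, a): the immediate reward plus
    the discounted, probability-weighted value of the successor
    (state, action) pairs under the policy.

    R[s][a]      - immediate reward for taking action a in state s
    P[s][a][s2]  - probability of transitioning to state s2
    pi[s2][a2]   - policy probability of picking action a2 in state s2
    q[s2][a2]    - current action value estimates
    """
    expected_next = 0.0
    for s2, p in P[s][a].items():
        for a2, pa in pi[s2].items():
            expected_next += p * pa * q[s2][a2]
    return R[s][a] + gamma * expected_next

# Tiny made-up example: one deterministic transition s -> t, one action in t.
R = {"s": {"a": 1.0}}
P = {"s": {"a": {"t": 1.0}}}
pi = {"t": {"b": 1.0}}
q = {"t": {"b": 2.0}}
print(q_backup(R, P, pi, q, "s", "a", gamma=0.9))  # 1.0 + 0.9 * 2.0
```

The nested loops are exactly the two levels of the look-ahead tree we are about to draw: first over successor states, then over the actions the policy might pick there.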
Once again, it may look intimidating if you are not used to dealing with expected values and probabilities. So let's take a look at a pictorial tree-like representation of what the Bellman Expectation Equation is telling us if we start at some state s.
Try not to concentrate on the formula too much; instead, stick with the intuitive look-ahead graph. The root of our two-step recursive look-ahead tree is some state s. What we do next is sum, over actions, the probability of taking an action a multiplied by the reward for that particular action plus the discounted sum of state transition probabilities multiplied by the state values of the resulting states. If I am just confusing you, let's look at the numbers. For now we assume that we know the state values of all states marked as s, and we want to calculate the state value of state x given the following rewards and probabilities.
The look-ahead tree does not have to be binary; it is binary for demo purposes. Assuming that the discount factor is 1, the Bellman Expectation Equation for the value of state x looks as follows.
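The actual rewards and probabilities live in the image, so here is the same calculation sketched with made-up numbers: two actions available in state x, each with an immediate reward and two possible successor states whose values we assume are already known.

```python
# Hypothetical two-step look-ahead from state x; all numbers are invented.
gamma = 1.0  # discount factor, as in the example

# For each action: (policy probability, immediate reward,
#                   list of (transition probability, successor state value))
actions = [
    (0.5, +1.0, [(0.4, 2.0), (0.6, 3.0)]),  # hypothetical action 1
    (0.5, -1.0, [(0.8, 4.0), (0.2, 0.0)]),  # hypothetical action 2
]

# v(x) = sum over actions of pi(a|x) * (reward + gamma * expected successor value)
v_x = sum(
    pi_a * (r + gamma * sum(p * v_s for p, v_s in successors))
    for pi_a, r, successors in actions
)
# action 1 term: 0.5 * (1 + 0.4*2 + 0.6*3) = 0.5 * 3.6 = 1.8
# action 2 term: 0.5 * (-1 + 0.8*4 + 0.2*0) = 0.5 * 2.2 = 1.1
print(v_x)  # 1.8 + 1.1 = 2.9 (up to floating point)
```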
We can also do the same two-step look-ahead starting from an action. This gives the Bellman Expectation Equation for the action value function. If we set the root of the tree to be some action a, it looks as follows.
The difference is that we don't weight the immediate reward, since we are evaluating an action and the reward is received for taking that very action. The idea stays the same: take the immediate reward for the action, plus the discounted sum over successor states of the transition probability times the policy-weighted action values. It sounds more confusing than it really is. Let's look at an example to hopefully get this scary formula to settle down. Once again, the discount factor is 1.
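Since the example's numbers are again in the image, here is the action-rooted look-ahead sketched with made-up values: the root action leads to two successor states, and in each successor the policy distributes probability over the next actions.

```python
# Hypothetical two-step look-ahead rooted at an action a; numbers are invented.
gamma = 1.0
r_a = 2.0  # immediate reward for taking the root action a (not weighted)

# Two successor states; for each: (transition probability,
#                                  list of (policy probability, q-value))
successors = [
    (0.7, [(0.5, 1.0), (0.5, 3.0)]),  # hypothetical successor state 1
    (0.3, [(1.0, -2.0)]),             # hypothetical successor state 2
]

# q(a) = immediate reward + gamma * sum over states of P * (policy-weighted q)
q_a = r_a + gamma * sum(
    p * sum(pi * q for pi, q in next_actions)
    for p, next_actions in successors
)
# 2 + 0.7*(0.5*1 + 0.5*3) + 0.3*(1.0*-2) = 2 + 1.4 - 0.6 = 2.8
```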
I encourage you to go through the math to really soak the idea in. Let’s wrap up with the culmination of the whole article. 1174 words not counting images just to arrive at what really matters, the way to choose actions optimally.
Optimal Value Function
So far we have used the Bellman Expectation Equation to formalize the state and action value functions and to calculate them. But we haven't seen how to pick an optimal action in some state s. This is where optimal value functions come into play. Just one more layer of complexity.
This is it. q* tells our agent what action to take in some particular state s. The optimal value function tells the agent how to behave to maximize total reward. If we know the optimal value function, we have solved the Markov Decision Process. Let's take a look at our example MDP one more time and, without going into a lot of calculations, figure out an optimal policy.
Later we will see how to calculate q* for each state, but for now let's try our intuition. We start from the “last” state 4 and see that there is only one action, “Get a Raise”, which yields +12 reward. So that's q* for state 4. Let's back up one state and take a look at state 3. There are two possible actions: “Play Video Game”, with an unknown value (for now), and “Publish a Paper”, with value -1 plus q* of state 4, giving -1 + 12 = +11. So q* for state 3 is +11. Following the same logic we calculate the value for state 2: -2 + 11 = +9, and for state 1: -4 + 9 = +5. Now it is time to calculate the value of “Play Video Game” in state 2: +1 + 1*(0.9 * 0 + 0.1 * 5) = 1.5. It is less than our previously calculated value, so this action is not optimal. The very last scary image for today formalizes what we just did using intuition, with the same two-step look-ahead tree-like recursive representation that we have seen before.
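The backward pass above can be written out in a few lines, using exactly the numbers from the text (the actions along the chain from state 1 to state 3 are not all named in the text, so the comments only name the ones that are):

```python
# Backward induction over the example MDP; gamma = 1 throughout.
gamma = 1.0

q4 = 12.0               # state 4: "Get a Raise" is the only action
q3 = -1.0 + gamma * q4  # state 3: "Publish a Paper" -> -1 + 12 = 11
q2 = -2.0 + gamma * q3  # state 2, chain action:      -2 + 11 = 9
q1 = -4.0 + gamma * q2  # state 1, chain action:      -4 + 9  = 5

# "Play Video Game" in state 2: reward +1, then the environment moves the
# agent to terminal state 5 (value 0) with prob 0.9, or back to state 1
# (value q1) with prob 0.1.
q2_play = 1.0 + gamma * (0.9 * 0.0 + 0.1 * q1)
print(q2_play)  # 1.5, which is less than 9, so playing is not optimal
```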
The only difference from the Bellman Expectation Equation we saw earlier is that instead of taking a weighted sum we take the max. Nothing more, nothing less. We can also define a similar-looking equation for v*, but it is December 31, it is late and I am tired. In the following posts I'm planning to explore different algorithms for computing q*. I'm glad we are done with the basic, yet important, theory and are moving on to actual algorithms. Yyyyeeee!!!