In Part 1 we found out what is Reinforcement Learning and basic aspects of it. Probably the most important among them is the notion of an environment. Environment is the part of RL system that our RL agent interacts with. An agent makes an action, an environment reacts and an agent observes a feedback from an action. This circle of events creates a process. In this post we’ll try to mathematically formalize (using Markov property) and describe an environment and a process in the simple terms.
State of the environment is said to have a Markov Property if a future state only depends on the current state. In other words, a past does not matter and the current state has (or accumulated) all the information about the history and fully determines a future state.
Now that we have a notion of a current state and a next/future/successor state, it is the time to introduce a State Transition Matrix (STM). STM contains probabilities of an environment transition from state to state.
State Transition Matrix
Each row in a State Transition Matrix represents the transition probabilities from that state to the successor state. It always helps to see a concrete example. Let’s say that we have a radio controlled car that is operated by some unknown algorithm. For simplicity we assume that the car is always on and can turn only while stationary.
State Transition Matrix for our environment (car in this case) has the following values (totally made up) :
- moving forward
- moving backward
- turning right
- turning left.
First of all, note that each row sums to 1. Probability cannot be greater than 100%.
Remember to look at the rows, as each row tells us transition probabilities, not columns. Let’s take a look at the first row and see what it tells us. Remember that each row number represents a current state. So the car is in the state number one, it is stationary. The probability of it staying stationary is 0.25, moving forward : 0.40, moving backward : 0.15, turning left or right : 0.10. There are zeros in the second and third rows because we assumed that the car cannot turn while moving. Now that we fully understand what a State Transition Matrix is let’s move on to a Markov Process.
Simply stated, a Markov Process is a sequence of random states with the Markov Property. Let’s see how we could visualize concrete example of a Markov Process.
The graph above simply visualizes state transition matrix for some finite set of states. We start with a desire to read a book about Reinforcement Learning at the “Read a book” state. After we are done reading a book there is 0.4 probability of transitioning to work on a project using knowledge from the book ( “Do a project” state). And 0.6 probability of getting bored and deciding to quit (“Get Bored” state). When we are in “Do a project” state we might decide to publish a paper on our amazing breakthrough with a probability of 0.4. Or decided to be the best at the latest and most popular multiplayer FPS game. If we decide to publish a paper there is 0.8 probability of getting a raise because the company we work for gets super famous because of our paper. When we get a raise there is nothing more to do than just get bored. And finally if we decide to play video games 8 hours a day for a few years. We forget a lot so we might go back to “Reading a book” with probability of 0.1 or “Get bored” with probability of 0.9. Rectangular box, “Get Bored” state, represents a terminal state; when the process stops. Just take a moment and stare at the graph. Later we will add few things to it, to make it actually usable for Reinforcement Learning.
Now that we have our Markov Process set up, let us draw a few Sample Episodes or just samples from it. A sample, for now, is just a random sequence of states that ends with a terminal state that uses dynamics set up by state transition matrix . My guess is that while staring at our Markov Process graph above you kind of saw what sequence of states we could sample from it. For instance:
- “Read a book” -> “Get Bored”
- “Read a book” -> “Do a project” -> “Publish a paper” -> “Beat video Game” -> “Get Bored”
- “Read a book” -> “Do a project” -> “Publish a paper” -> “Beat video Game” -> “Read a book” -> “Do a project” -> “Beat video Game” -> “Get Bored”
- and so on, you get the idea.
So far we’ve seen Markov Process without any rewards for transitioning from one state to another. Let’s see how we could incorporate rewards into what we have seen so far. Meet Markov Reward Process.
Markov Reward Process
From previous definition we see that there is a reward function added that is defined as Expected value of a random variable (weird looking Capital E) reward R at time t+1 if we are to transition from state t to some other state. Take a look at this for a quick refresher on random variables and expected values. Simply put a reward function tells us how much immediate reward we are going to get if we leave state s. Let’s add rewards to our Markov Process graph.
I hope you see where this is going. Since we have some made up numerical value for leaving some state in our environment we can calculate how much total reward we are going to get for some trajectory (another way of saying – sequence of states). We just need one more thing to make it even more interesting. Think about how would we value immediate reward more than the future ones, or vice versa. One way to do that is to use a discount coefficient gamma. Discounting rewards while summing to get a total rewards gives us yet another formal definition to process.
Important note: previous definition does not use expected values because we are evaluating sample episodes. If you are wondering why do we need to discount, think about what total reward would we get if we tried to sum up rewards for an infinite sequence. Total reward would also be equal to infinity, which isn’t so great since the whole goal of Reinforcement learning is to maximize total reward, not just set it to infinity. I hope that makes sense. Let’s think about what it would mean to use the edge values of gamma. If we set gamma to zero we would only “care” about the immediate reward, which would make this approach a short sighted greedy one. Just take what you can right now while we can. On the other hand setting gamma to one would mean that we are looking ahead as much as we can. Note that we can use gamma equals to one only if all trajectories terminate.
Well this is exiting; now we can say that being in one state is better than the other one. It gives the ability to evaluate our sample episodes and calculate how much total reward we are expected to get if we follow some trajectory. Let us ask this question: “What if we are not evaluating a sample episode, but actually trying to determine while we are drawing a sample what is the expected value of being in some state s?” The following definition tells us just that.
Let’s look at the concrete example using our previous Markov Reward Process graph.
Let’s calculate the total reward for the following trajectories with gamma 0.25:
1) “Read a book”->”Do a project”->”Publish a paprt”->”Beat video game”->”Get Bored”
G = -3 + (-2*1/4) + (-1*1/16) + (1*1/64) = -3.55.
2) “Read a book”->”Do a project”->”Get Bored”
G = -3 + (-2*1/4) = -3.5
I think you get the idea. As we draw samples from our Markov Reward Process and calculate returns for them we can start to calculate an expected value for states (State Value Function). In conclusion to this overly long post we will take a look at the fundamental equation of Reinforcement Learning.
Bellman Equation For MRP
Through some dark magic (law of total expectation). We can decompose value function into immediate reward plus value of the next state.
Let’s say we want to calculate the value function for the state “Publish a paper” and we already know the values (made up of course) of all possible successor states (“Get a raise” and “Beat video game”). So the reward for leaving the state “Publish a paper” is -1 + probability of transitioning to state “Get a raise” 0.8 * value of “Get a raise” 12 +
probability of transitioning to state “Beat a video game” 0.2 * value of “Beat a video game” 0.5 = 8.7
I suggest going through this post a few times. Also try to come up with your own simple Markov Reward Processes and do the math by hand. It will help you to retain what you just learned. Otherwise stay tuned for the next part, where we add actions to the mix and expand to Markov Decision Process.