If you’ve been following along with the series, you might start to wonder: what do we do if we want to solve a Markov Decision Process (MDP) but don’t know how the environment operates? In other words, we don’t have a model of our environment, but our agent still wants to figure out the best way to act. And if you think about it, the most interesting problems we might want to solve most likely don’t offer the luxury of knowing exactly how they work. So let’s see what one of the model-free prediction approaches, Monte-Carlo Learning, can do for us in this situation.
The Monte-Carlo method, for reasons we’ll see later, is not the most efficient method, but it is very effective, and despite being nearly a century old it is still widely used in practice. The basic idea is fairly simple: learn directly from full, complete episodes of the agent’s experience interacting with an environment.
Learn directly from complete episodes of the agent’s experience interacting with an environment.
It does not require any information about the MDP’s state transitions or rewards. It simply estimates a state’s value by averaging over samples from complete episodes of exploration. To give a concrete example: say our agent starts at state A and at some point reaches a terminal state T, and we let the agent run for two episodes. The agent collects a total reward of 2 in the first episode and 4 in the second. So we can already estimate the value of the starting state A as (2 + 4)/2 = 3. The more complete episodes our agent runs through, the more accurate our state-value estimates get. That’s the whole idea behind Monte-Carlo simulation. But don’t be fooled by the simplicity: there is a lot of formal mathematics behind it, and understanding this approach is very important going forward. And yes, you are correct: the episodes must terminate, otherwise the Monte-Carlo method would never stop.
Estimate a state’s value by averaging over samples from complete episodes of exploration.
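The two-episode example above can be sketched in a few lines of Python (the returns are the made-up numbers from the example, not output of a real environment):

```python
# Total (undiscounted) returns observed from state A over two
# complete episodes, taken from the example above.
episode_returns = [2, 4]

# Monte-Carlo estimate of V(A): the empirical mean of observed returns.
v_estimate = sum(episode_returns) / len(episode_returns)
print(v_estimate)  # 3.0
```

With more episodes, the same one-liner keeps averaging and the estimate converges toward the true value.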
The goal of this post is to figure out how to use the Monte-Carlo method to evaluate a given policy. Let’s formalize everything we talked about above. One experience episode (a sample) is a sequence of state-action-reward tuples generated under a given policy:
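In the standard notation (following Sutton and Barto’s conventions), such a sampled episode under a policy $\pi$ can be written as:

```latex
S_1, A_1, R_2, S_2, A_2, R_3, \dots, S_T \sim \pi
```

where $S_T$ is the terminal state that ends the episode.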
As our agent explores the environment, we look at each state the agent visits and estimate the return from that state/time onwards. The return is the total discounted reward. If you need a refresher on what a return is, read this. So we collect rewards for all the states we pass through. Recall from this post that the value function is the expected return. Well, in the Monte-Carlo method the empirical mean return is used instead. Just the plain old mean.
The Monte-Carlo method uses the empirical mean return instead of the expected return as the value of a state.
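Here is a minimal sketch of that idea in Python: compute the discounted return of each sampled episode, then take the plain mean. The reward sequences and discount factor below are made up purely for illustration:

```python
def discounted_return(rewards, gamma=0.9):
    """Total discounted reward from time t onwards:
    G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    Computed by folding backwards over the reward list."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Reward sequences from two hypothetical complete episodes,
# as seen from the starting state.
episodes = [[1, 0, 2], [0, 1, 1]]

# The empirical mean return stands in for the expected return.
returns = [discounted_return(ep) for ep in episodes]
v_estimate = sum(returns) / len(returns)
```

Nothing here needs the MDP’s transition probabilities or reward model; sampled episodes are enough.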
One little caveat is that we can’t really make each state a starting state and run multiple episodes from it; that would be extremely inefficient. There are better approaches: the First-Visit and Every-Visit Monte-Carlo policy evaluation methods. Let’s go over the First-Visit method and hopefully write some code.
First-Visit Monte-Carlo Policy Evaluation Method.
First-Visit and Every-Visit differ only when the policy introduces a cycle in the agent’s trajectory, so that a state can be visited more than once within a single episode; otherwise there is no difference between them. Let’s write up some pseudocode and get to coding:
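As a sketch, here is what First-Visit Monte-Carlo policy evaluation might look like in Python. The tiny environment below (states A and B, terminal state T, a cost of -1 per step, uniformly random transitions) is entirely made up for illustration; only `first_visit_mc` is the method itself:

```python
import random
from collections import defaultdict

def sample_episode():
    """Sample one complete episode from a made-up environment:
    states A and B, terminal state T, reward -1 per step.
    Stands in for an agent following a fixed policy in an
    unknown MDP; we only observe (state, reward) samples."""
    state, episode = "A", []
    while state != "T":
        episode.append((state, -1))
        state = random.choice(["A", "B", "T"])
    return episode

def first_visit_mc(num_episodes=1000, gamma=1.0):
    """First-Visit Monte-Carlo policy evaluation: for each episode,
    average the return following only the FIRST occurrence of each
    state; the state value is the mean over all episodes."""
    returns = defaultdict(list)
    for _ in range(num_episodes):
        episode = sample_episode()
        g, first_returns = 0.0, {}
        # Walk backwards, accumulating the return after each step.
        # Earlier (in time) visits overwrite later ones, so
        # first_returns ends up holding each state's first-visit return.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            first_returns[state] = g
        for state, g_first in first_returns.items():
            returns[state].append(g_first)
    # Empirical mean return per state.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

random.seed(0)  # for reproducibility of this sketch
values = first_visit_mc()
```

For an Every-Visit variant, you would append the return for every occurrence of a state in the episode instead of keeping only the first; in this toy environment, where episodes do revisit states, the two estimates differ slightly but converge to the same values.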
As usual, if you just want to see the code -> here it is, all ready to go. And, as usual, I encourage you to write it up on your own.
TLDR: As usual, here is the code.