Part 5.4 Model-Free Prediction: Temporal-Difference Learning, section 2 TD(λ).

In section 1 on TD learning we saw a new class of algorithms that can learn online, after every step. In other words, TD can learn before, and even without, the final outcome by using bootstrapping - the idea of updating a guess towards a guess. In particular, we've looked at TD(0), an algorithm that… Continue reading Part 5.4 Model-Free Prediction: Temporal-Difference Learning, section 2 TD(λ).
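For reference, the TD(0) update recapped above can be written, in the standard notation, as

    V(S_t) \leftarrow V(S_t) + \alpha \bigl( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \bigr)

where the TD target R_{t+1} + \gamma V(S_{t+1}) is itself a guess, which is exactly the "updating a guess towards a guess" idea.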

Part 5.3 Model-Free Prediction: Temporal-Difference Learning, section 1 TD(0)

Our first algorithm of a totally different class. Temporal-Difference (TD) learning, just like the Monte-Carlo method, learns directly from the experience of interacting with an environment. TD is model-free: it does not require any knowledge of MDP transitions or rewards. No need to worry about how different things affect our state values. There is also no more need… Continue reading Part 5.3 Model-Free Prediction: Temporal-Difference Learning, section 1 TD(0)
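To make the "learning directly from experience" point concrete, here is a minimal tabular TD(0) prediction sketch. It assumes a hypothetical environment with reset()/step() methods and a fixed policy function; the names and interface are illustrative, not taken from the post.

    from collections import defaultdict

    def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
        # Tabular state-value estimates, initialised to zero.
        V = defaultdict(float)
        for _ in range(num_episodes):
            state = env.reset()
            done = False
            while not done:
                action = policy(state)
                next_state, reward, done = env.step(action)  # hypothetical interface
                # Bootstrapped TD target: one real reward plus the discounted current guess.
                td_target = reward + gamma * V[next_state] * (not done)
                V[state] += alpha * (td_target - V[state])
                state = next_state
        return V

Note that no transition probabilities or reward model appear anywhere in the sketch, which is what "model-free" means here.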

Part 5.2 Model-Free Prediction: fundamentals of online algorithms.

Ever wondered how online algorithms became a thing? There is a clear path from the Monte-Carlo method to the non-stationary online algorithms that rock our world nowadays. It may seem trivial, but it does help with understanding how the different pieces fit together.
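As a small illustration of the path the post describes (my own sketch, not code from the post): the Monte-Carlo sample mean can be computed incrementally, and swapping the 1/k step size for a constant alpha gives an exponentially weighted estimate that keeps adapting, which is what makes it suitable for non-stationary, online settings.

    def incremental_mean(mean, k, x):
        # Exact running mean after seeing the k-th sample x.
        return mean + (1.0 / k) * (x - mean)

    def running_estimate(mean, x, alpha=0.1):
        # Constant step size: recent samples weigh more, old ones are gradually forgotten.
        return mean + alpha * (x - mean)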

Part 5.1. Model-Free prediction: Monte-Carlo method.

If you've been following along with the series, you might start to wonder: "What do we do if we want to solve a Markov Decision Process (MDP) but don't know how the environment operates?" In other words, we don't have a model of our environment, but our agent still wants to predict the best way to act.… Continue reading Part 5.1. Model-Free prediction: Monte-Carlo method.
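A minimal first-visit Monte-Carlo prediction sketch of the kind the post leads up to. It assumes a hypothetical generate_episode(env, policy) helper that returns a list of (state, reward) pairs, with each reward being the one received after leaving that state; all names here are illustrative.

    from collections import defaultdict

    def mc_prediction(env, policy, generate_episode, num_episodes=1000, gamma=0.99):
        V = defaultdict(float)
        visit_counts = defaultdict(int)
        for _ in range(num_episodes):
            episode = generate_episode(env, policy)  # [(state, reward), ...]
            G = 0.0
            # Walk the episode backwards, accumulating the discounted return.
            for t in reversed(range(len(episode))):
                state, reward = episode[t]
                G = reward + gamma * G
                # First-visit check: only update from the first occurrence of the state.
                if state not in (s for s, _ in episode[:t]):
                    visit_counts[state] += 1
                    V[state] += (G - V[state]) / visit_counts[state]
        return V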

RL Part 4.1 Dynamic Programming. Iterative Policy Evaluation.

So far in the series we've built an intuitive idea of what RL is, and we have described the system using a Markov Reward Process and a Markov Decision Process. We know what a policy is and what the optimal state and action value functions are. We've seen the Bellman Optimality Equation that helped us to define the optimal action value… Continue reading RL Part 4.1 Dynamic Programming. Iterative Policy Evaluation.
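For contrast with the model-free posts above, here is a minimal sketch of iterative policy evaluation with a known model. The data layout is my own assumption: P[s][a] is a list of (probability, next_state, reward) triples and pi[s][a] is the probability of choosing action a in state s.

    def iterative_policy_evaluation(P, pi, gamma=0.99, theta=1e-6):
        V = {s: 0.0 for s in P}
        while True:
            delta = 0.0
            for s in P:
                v_new = 0.0
                for a, transitions in P[s].items():
                    for prob, s_next, reward in transitions:
                        # Bellman expectation backup: expected one-step return under the policy.
                        v_new += pi[s][a] * prob * (reward + gamma * V[s_next])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                return V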

RL part 3. Markov Decision Process, policy, Bellman Optimality Equation.

Recall that in part 2 we introduced the notion of a Markov Reward Process, which is really just a building block, since our agent was not able to take actions. It was simply transitioning from one state to another along with our environment. That's not really helpful, since we want our agent to not only take… Continue reading RL part 3. Markov Decision Process, policy, Bellman Optimality Equation.
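For reference, the Bellman Optimality Equation mentioned in the title can be written, in one common notation, as

    v_*(s) = \max_a \Bigl( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \, v_*(s') \Bigr)

i.e. the optimal value of a state is obtained by picking the action with the best expected immediate reward plus discounted optimal value of the successor states.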

RL. part 2. Markov Reward Process.

In Part 1 we found out what Reinforcement Learning is and covered its basic aspects. Probably the most important among them is the notion of an environment. The environment is the part of the RL system that our RL agent interacts with. The agent takes an action, the environment reacts, and the agent observes feedback from… Continue reading RL. part 2. Markov Reward Process.

RL. part 0. Absolute minimum amount of math necessary for studying Reinforcement Learning.

Math is an absolute must-have for anyone trying to learn Reinforcement Learning techniques. Writing any kind of RL program requires a precise understanding of the algorithms and the underlying math. It will make your life easier; otherwise things will not work, the agent will not learn the way you expect it to, and… Continue reading RL. part 0. Absolute minimum amount of math necessary for studying Reinforcement Learning.

RL. part 1. What is Reinforcement Learning? Intuition.

The idea behind Reinforcement Learning (RL from here on) is fairly simple and intuitive. Let's learn by interacting with what we are trying to master. One analogy that, in my opinion, explains the term well is any kind of puzzle (a box puzzle in particular) that does not let you see its… Continue reading RL. part 1. What is Reinforcement Learning? Intuition.