
So now let's take a slightly more formal look at this decision process, shall we?

It is a model widely used in reinforcement learning,

called the Markov Decision Process.

It follows the same structure as the intuitive definition of a decision process we had in the previous chapter,

but it is slightly more restricted and has math all around it.

So, again we have an agent and an environment,

and this time the environment has a state, denoted by S here,

and the state is what the agent can observe from the environment.

The agent will be able to pick an action and send it back into the environment.

The action here is denoted by A.

The capital S and capital A are just the sets of all possible states and all possible actions, because mathematicians love sets.

Now, this third arrow, the vertical one, is how we formalize the feedback.

There is some kind of reward, denoted by R.

Again, this is just a real number,

and the larger the reward gets, the prouder the agent should be of himself, and the more you want to reinforce his behavior.
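The interaction loop just described can be sketched in code. Everything here (the states, the actions, and the reward rule) is a hypothetical toy invented for illustration, not something from the lecture:

```python
import random

STATES = [0, 1, 2]           # the set S of all possible states
ACTIONS = ["left", "right"]  # the set A of all possible actions

def environment_step(state, action):
    """The environment takes the agent's action A and answers
    with a next state S and a real-valued reward R."""
    next_state = random.choice(STATES)
    reward = 1.0 if action == "right" else 0.0  # toy reward signal
    return next_state, reward

def agent_policy(state):
    """The agent observes the state and picks an action."""
    return random.choice(ACTIONS)

state = random.choice(STATES)
total = 0.0
for _ in range(5):  # a few interaction steps
    action = agent_policy(state)                     # agent -> environment
    state, reward = environment_step(state, action)  # environment -> agent
    total += reward
```

The point is only the shape of the loop: observe a state, emit an action, receive a new state and a reward.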

Now, this process is called a Markov Decision Process for a reason.

There's a thing called the Markov assumption, which holds for such a process.

Intuitively, it means that the state S is sufficient to describe the environment, and there is nothing else affecting how the environment behaves.

In terms of math, it means that whenever you want to predict the probability of the next state and the reward your agent is going to get for his action, you only need the current state of the environment and the agent's action to do so, and no other input will be helpful.
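Written out in symbols (using lower-case letters for particular states, actions, and rewards at time steps t), the Markov assumption just described reads:

```latex
P(s_{t+1}, r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0)
  = P(s_{t+1}, r_{t+1} \mid s_t, a_t)
```

That is, conditioning on the whole history gives you nothing beyond conditioning on the current state and action.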

You probably noticed that from here on, things get slightly unrealistic with this Markov assumption.

It actually means that if you want to show your users the best possible banners, what you need is a state that encompasses everything about the user that defines how he behaves, and this may or may not include the quantum states of all the particles in his brain.

This is, of course, impossible, but just don't get too focused on it.

This is just a mathematical model, and in practice you can, of course, simplify it a little bit, because models don't have to be accurate.

In fact, they're never accurate; they're just sometimes useful.

In this case, you can make do with some kind of higher-level features that you use for your decision-making process, and just pretend that everything else is random noise, which is what mathematicians usually do.

Now, as usual, we want to optimize our reward, our feedback,

but the difference here is that, unlike in our intuitive definition, this time the environment can give you intermediate rewards after every time step.

Think of it this way: you have your robot, and you want your little robot to walk forward.

You can, of course, simply give him one reward for the entire session.

Whenever he falls, just measure how long he was able to walk before he fell, and reward him for this value.

But intuitively, you can also try to give him small portions of feedback whenever he moves himself forward slightly over the duration of one turn.

Now, for the purposes of the simple algorithm we are going to see now, this is no different from what we had before, because we want to optimize not just the individual rewards; we want to optimize the sum of rewards per session.

So, we don't want the robot to go as fast as possible right this moment; we want it to go as fast as possible over the duration of the entire episode.
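As a small sketch with made-up numbers: say the robot earns a small positive reward each turn it moves forward and a penalty when it falls. What we optimize is the sum over the whole session, not any single reward:

```python
# Hypothetical per-step rewards from one episode of the walking robot:
# small positive feedback for each turn it moved forward,
# and a final penalty when it fell.
rewards = [0.1, 0.2, 0.2, 0.1, -1.0]

# The objective is the total reward of the session, not any one step.
episode_return = sum(rewards)  # 0.1 + 0.2 + 0.2 + 0.1 - 1.0
```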

This is also quite useful when you, for example, train your agent to win a board game.

In chess, you can try to optimize the immediate reward.

You can try to take as many pawns as you can, but this might result in you losing the game quickly, because the move with the best immediate reward is not always the best move.

In fact, it's often the worst move you can take.

Now, what you want to do with this process is define an agent, or rather train an agent, so that he picks actions in a way that gets the highest possible reward.

This is from [inaudible].

Basically, you can think of a policy, for now, as a probability distribution that takes a state and assigns probabilities to all the possible actions.

Now, in this case, you can just use whatever machine learning model, or even a table, to build this distribution.

For now this is outside our scope, but we'll get into the implementation details later this week.
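For instance, a table-based policy can literally be a table mapping each state to a distribution over actions. The state and action names below are hypothetical, just to show the idea:

```python
import random

# A toy table-based policy: for each state, a probability
# distribution over all possible actions.
policy = {
    "start":  {"left": 0.5, "right": 0.5},
    "middle": {"left": 0.2, "right": 0.8},
}

def sample_action(state):
    """Draw an action from the policy's distribution for this state."""
    actions = list(policy[state])
    probs = [policy[state][a] for a in actions]
    return random.choices(actions, weights=probs, k=1)[0]

action = sample_action("middle")
```

A learned model would just replace the table lookup with a function of the state that outputs these probabilities.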

Again, we have a policy, and we want to optimize the expected reward under the policy.

If you break down all the math explicitly, then you'll get the following weird formula, which basically says that you have to, well, sample the first state,

then take the first action based on this first state with your agent's policy,

then observe the second state and get your reward.

Then take the second action, third state, third action, fourth state, and so on, until you reach the end of the episode, then just add up all the rewards.
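The sampling procedure just described is exactly a Monte Carlo estimate of the expected return: roll out many episodes, sum the rewards in each, and average. The toy environment and policy below are invented for illustration:

```python
import random

def policy(state):
    """A toy random policy over two actions."""
    return random.choice(["left", "right"])

def step(state, action):
    """A toy environment: reward 1 for 'right', 0 otherwise;
    the episode ends after three steps."""
    reward = 1.0 if action == "right" else 0.0
    done = state >= 2
    return state + 1, reward, done

def sample_episode_return():
    """Sample states and actions until the episode ends,
    then return the sum of all rewards."""
    state, total, done = 0, 0.0, False
    while not done:
        action = policy(state)
        state, reward, done = step(state, action)
        total += reward
    return total

# Average over many sampled episodes to estimate the expectation.
estimate = sum(sample_episode_return() for _ in range(10_000)) / 10_000
```

Here each of the three steps pays 1 with probability one half, so the estimate should hover around 1.5.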

In my humble opinion, this formula below looks slightly uglier than the informal definition at the top.

So, it's only important that you grasp the concept.

You don't have to memorize this, of course.
