
Speaking of those policy-based algorithms we've just learned, there's more than one way you can tune them to be more efficient or run more smoothly: by taking an intuitive approach and introducing some heuristics. You've probably learned about some of them already, but let's cover them again just to make sure that you've got them.

First, if you're using an actor-critic method, let's say advantage actor-critic, you have to balance the relative importance of the two losses it has. The first is the policy-based loss, the policy gradient; the second is the temporal difference loss you train your critic to minimize.

The idea here is that, in the majority of cases, you can assume the value-based loss, the temporal difference loss, to be less important. This is because if you have a perfect critic but a terrible actor, all you really have is a critic that estimates how well a random agent performs. But if you have a good actor and some random critic, you still have an algorithm that is at least as good as REINFORCE. You can express this intuition by reducing the relative weight of the value-based loss: just multiply it by some number less than one.
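As a rough sketch, the combined objective might look like this in Python. This is not a prescribed implementation from the lecture: the function name and the coefficient `value_coef=0.5` are illustrative choices, and numpy stands in for whatever deep learning framework you actually use.

```python
import numpy as np

def actor_critic_loss(logp_actions, advantages, td_errors, value_coef=0.5):
    # Actor loss: maximize log pi(a|s) * advantage, so minimize its negative.
    policy_loss = -np.mean(logp_actions * advantages)
    # Critic loss: mean squared temporal-difference error.
    value_loss = np.mean(td_errors ** 2)
    # Down-weight the critic term (value_coef < 1): a perfect critic with a
    # terrible actor is still useless, while a good actor survives a noisy critic.
    return policy_loss + value_coef * value_loss
```

In practice you would tune `value_coef` like any other hyperparameter.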

Another important point is that whenever you try to apply policy-based methods in practice, you might end up in a situation where some particular quirk breaks the policy. If, say, you hit a gradient explosion while using neural networks, you can end up with an algorithm that completely abandons one action in at least a subset of situations.

This is basically a vicious circle, because in this case your algorithm will only train on the actions it has just produced, since most of them are on-policy here, and so it won't be able to learn to pick this action ever again. Once you have abandoned an action, you no longer receive samples containing this action in any particular state, so you are no longer able to recover the notion that this action might sometimes be optimal.

Of course, if you're dead sure that this action is useless, it's okay to drop it, but in other cases you have to hint to your algorithm that it should not completely give up on actions.

As we already did in the cross-entropy method section in the very first week, there is a way to do this with neural networks by introducing a loss that regularizes the policy. In this case you can use, for example, the negative entropy. What you want to do is encourage your agent to increase the entropy of its policy, with, of course, some very small coefficient. If you remember how entropy works, this basically results in your agent preferring not to assign a probability of zero to anything.

Of course, this requires you to tune another parameter, but it is safe to assume that if you have a sufficiently small but non-zero coefficient multiplied by the entropy, your agent will probably forget this malicious policy of never taking an action after at least some large fixed number of iterations. This is a weak guarantee, but you're probably not going to get anything better with approximate methods.
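To make the entropy bonus concrete, here is a small sketch in Python. The function names and the coefficient `beta=0.01` are made up for illustration; the point is only that subtracting a small multiple of the entropy from the loss rewards policies that keep every action's probability above zero.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    # Shannon entropy of the action distribution: high when probability
    # mass is spread out, zero when the policy is deterministic.
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def loss_with_entropy_bonus(policy_loss, probs, beta=0.01):
    # Subtracting beta * entropy means minimizing the loss also pushes
    # entropy up, discouraging near-zero probabilities for any action.
    return policy_loss - beta * np.mean(entropy(probs))
```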

Another thing, as we have already discussed in the [inaudible] section: you can take advantage of the fact that in the modern world, almost anything, including a smartphone, probably has more than one CPU core in it. The idea here is that if you have parallel sessions, you can parallelize the sampling procedure: you can train your algorithm on sessions obtained by running several environments on several parallel cores. Or you can go even further by training in parallel and averaging, basically synchronizing weights, as it was done in A3C.

Finally, just a very tiny, teeny technical quirk, one concerning neural networks only, or, well, the one most relevant to neural networks. Using the policy gradient, you are probably required to construct a formula which uses the logarithm of the probability of taking an action a in state s, multiplied by your advantage or reward, depending on which algorithm you use; and in a deep learning framework, you have to do so a bit more carefully than you otherwise would.

The issue here is that if you simply take the probability and then take the logarithm of this probability, most frameworks will compute it in a numerically unstable way. Basically, even with floating-point precision, you may end up with a probability which rounds to almost zero, and the logarithm of almost zero is almost negative infinity.

You can mitigate this with the log-softmax formula. The idea here is that if you explicitly write down the formula for the logarithm of your softmax nonlinearity, you end up with an expression which is much simpler, and much more stable, than if you just chain the two operations and take all the derivatives.
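Here is the contrast in a small Python sketch (function names are illustrative; real frameworks ship a built-in `log_softmax` you should use instead). The stable version subtracts the maximum logit before exponentiating, so nothing overflows and the logarithm never sees a zero.

```python
import numpy as np

def naive_log_prob(logits):
    # Unsafe: softmax probabilities can underflow to 0, and log(0) = -inf.
    p = np.exp(logits) / np.sum(np.exp(logits))
    return np.log(p)

def log_softmax(logits):
    # Stable: log(softmax(x)) written out analytically as
    # (x - max) - log(sum(exp(x - max))), which never takes log of 0.
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))
```

For moderate logits the two agree; for extreme logits only `log_softmax` stays finite.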

Now, this is what you are going to do in the practice session, so don't worry if you haven't absorbed this concept entirely on the first attempt.