0:02

Now, if you tackle the problem of exploration versus exploitation from the perspective of Bayesian methods, you will probably find yourself doing the exact same thing you've been taught in the corresponding course on [inaudible].

Namely, you would start from some prior distribution: in this case, a distribution over the rewards in the bandit setting, or over the Q-function in the multi-step MDP setting. Then you would learn a posterior distribution over the same variable after observing the data, namely the actions you've taken in particular states and the rewards you've obtained for those actions.

After you've learned this posterior, you can simply take a percentile of this distribution, say the 95th or the 99th, and sort actions by this percentile to pick whichever looks most promising. The rest comes naturally once you've done the update properly.
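
As an illustrative sketch (not code from the course), here is that recipe on a toy Bernoulli bandit: a Beta prior per action, a conjugate posterior update after each observed reward, and action selection by the 95th percentile of the posterior. The arm probabilities and sample counts below are made-up assumptions.

```python
import random

# Hypothetical 3-armed Bernoulli bandit; the true success probabilities
# below are made up for illustration.
TRUE_P = [0.3, 0.5, 0.7]

# Beta(1, 1) prior per arm: alpha - 1 counts successes, beta - 1 failures.
alpha = [1.0, 1.0, 1.0]
beta = [1.0, 1.0, 1.0]

def percentile_95(a, b, n_samples=1000):
    """Approximate the 95th percentile of Beta(a, b) by sorting samples."""
    samples = sorted(random.betavariate(a, b) for _ in range(n_samples))
    return samples[int(0.95 * n_samples)]

random.seed(0)
for step in range(200):
    # Bayesian UCB: score each arm by a high percentile of its posterior
    # and pick the most promising one.
    action = max(range(3), key=lambda i: percentile_95(alpha[i], beta[i]))
    reward = 1 if random.random() < TRUE_P[action] else 0
    # Conjugate posterior update for the Bernoulli likelihood.
    alpha[action] += reward
    beta[action] += 1 - reward
```

Over time the posterior of the best arm concentrates, its upper percentile stays highest, and the agent pulls it most often.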

The only blank spot here is: how do we actually learn those distributions? How do we learn this P of Q given data? There are two main families of methods that can do that.

One approach is the so-called parametric Bayesian inference. This basically means that you have to select some known probability distribution family to use as the P of Q given data. One popular, if slightly too popular, choice for this role is the normal distribution family, which is parametrized, in this case, by mu and sigma squared: the first being the expected value and the other being the variance.

Once you've learned the mu and sigma squared of P of Q given data for each possible action in each possible state, or of P of Q given data for each action in the bandit setting, you can simply plug them into a readily available formula for the percentile, the one given in the reading section, to actually get the percentile and then execute your UCB policy.
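
For a normal posterior that percentile has a simple closed form: mu plus z_p times sigma, where z_p is the standard normal quantile (about 1.645 for the 95th percentile). A minimal sketch, with the action posteriors being made-up assumptions:

```python
import math

def normal_percentile(mu, sigma, p):
    """p-th quantile of N(mu, sigma^2), i.e. mu + z_p * sigma.
    The stdlib has erf but no inverse, so z_p is found by bisecting
    the standard normal CDF."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        cdf = 0.5 * (1 + math.erf(mid / math.sqrt(2)))
        lo, hi = (mid, hi) if cdf < p else (lo, mid)
    return mu + sigma * (lo + hi) / 2

# UCB scores: 95th percentile of each action's (hypothetical) posterior.
posteriors = {"left": (1.0, 0.5), "right": (0.8, 2.0)}
scores = {a: normal_percentile(mu, sigma, 0.95)
          for a, (mu, sigma) in posteriors.items()}
```

Here "right" wins despite its lower mean, because its larger uncertainty pushes its upper percentile higher; that is exactly the optimism that drives UCB exploration.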

Another way you could do exploration without UCB is to take those mus and sigmas and sample from them using a normal distribution sampler, a standard function in most math packages, to get samples and perform the Thompson sampling strategy.
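
A sketch of that sampling-based (Thompson) strategy, assuming hypothetical Gaussian posteriors over the Q-value of each action:

```python
import random

# Hypothetical (mu, sigma) posteriors per action; numbers are illustrative.
posteriors = {"a": (1.0, 0.1), "b": (0.9, 0.5), "c": (0.2, 0.3)}

def thompson_pick(posteriors):
    """Draw one sample per action from its posterior, act greedily on draws."""
    draws = {a: random.gauss(mu, sigma) for a, (mu, sigma) in posteriors.items()}
    return max(draws, key=draws.get)

random.seed(1)
counts = {a: 0 for a in posteriors}
for _ in range(1000):
    counts[thompson_pick(posteriors)] += 1
```

Unlike the percentile version, no quantile is computed: the randomness of the draw itself provides the exploration, so uncertain actions still win occasionally.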

This approach is quite powerful but has its limitations. The main limitation of parametric Bayesian inference is that it relies wholly on your choice of the distribution. For example, if you choose the normal distribution, it means that your distribution is, by definition, unimodal: there is one peak, and the further you go from this peak, the lower the probability gets.

The other problem is that the probability decays exponentially, as the exponent of the squared distance from the mean. This might be useful in some cases, but in other cases it might be detrimental. It might even cause you to overestimate some actions, like it did before, or underestimate them and never learn the optimal policy.

You could, of course, try a more complicated family, like a mixture of normal distributions, or, well, anything else that suits your particular problem: maybe a Gamma distribution, maybe a Landau distribution, or other, less popular choices. But this is still limited, because if you don't have expert knowledge of how the distribution should look, then you have to simply, well, live with what's currently there.

Now, an alternative approach, which is slightly more advanced and a whole lot more complicated, is non-parametric Bayesian inference.

In this case, we'll try to explain it on the example of Bayesian neural networks. Again, this is a rather complicated thing to explain within the scope of our course, but if I had to squeeze it into a few sentences: a Bayesian neural network is a mechanism by which you learn a neural model that predicts not just one value at its output, but a distribution. It does so by having not just one weight at each connection, but a distribution over that weight.

Now, think about it: regular neural networks have neurons which are connected with weights, and each weight is basically just a number anywhere between minus infinity and plus infinity. A Bayesian neural network extends this by saying that each weight is not just one number but some distribution, say a normal distribution, and to infer with this network you have to sample those weights from their distributions. You have to sample the weights each time, and of course, each time you will get slightly different, well, probabilities of Q at the end.
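
A deliberately tiny sketch of this idea (a single linear neuron, not a full Bayesian neural network): each weight is an assumed N(mu, sigma) distribution, and every forward pass samples concrete weights, so repeated passes on the same input produce different outputs.

```python
import random

# (mu, sigma) for each of the two weights; the values are assumptions.
weight_posteriors = [(0.8, 0.1), (-0.3, 0.4)]

def sample_forward(x):
    """One stochastic forward pass: sample the weights, then a dot product."""
    weights = [random.gauss(mu, sigma) for mu, sigma in weight_posteriors]
    return sum(w * xi for w, xi in zip(weights, x))

random.seed(0)
# Repeated passes on the same input trace out the output distribution.
outputs = [sample_forward([1.0, 2.0]) for _ in range(1000)]
```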

Now, this approach is slightly more powerful than the first one, because it doesn't chain itself to one particular formula like the normal distribution. Namely, the learned distribution doesn't have to be unimodal or exponentially decaying like a normal distribution, though it can be, if that's what your particular task demands. Or it can be some kind of trimodal or bimodal thing, or flat in the middle, which might make sense in some cases.

Again, this is much more powerful, but it comes at a price, and in this case the price is that you don't get the entire probability distribution from just one run of your network. Namely, if you have a Bayesian neural network, then once you sample all those weights and feed an input through this sampled set of weights, you are only going to get one sample, and if you sample again, you'll get another outcome.

So to get the whole picture, you have to repeat the sampling procedure many times, until you get enough data to estimate the percentiles. Of course, if you sample, say, 1,000 times, you can more or less reliably find the 90th percentile by just taking sample number 900 when ranked from the lowest to the highest; this is how you compute empirical percentiles in basic statistics.
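
That rank-and-index procedure in code (the samples here are just standard normal draws standing in for network outputs):

```python
import random

def empirical_percentile(samples, p):
    """Estimate the p-th percentile: sort the samples and index into them."""
    ranked = sorted(samples)
    return ranked[int(p / 100 * len(ranked))]

random.seed(0)
# 1,000 stand-in samples; element number 900 after ranking approximates
# the 90th percentile (the true value for N(0, 1) is about 1.28).
samples = [random.gauss(0.0, 1.0) for _ in range(1000)]
p90 = empirical_percentile(samples, 90)
```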

And of course, training those networks is another challenge in itself, which takes, well, another set of prior distributions over the parameters, the [inaudible] tricks, and all the stuff you've been learning throughout the Bayesian methods course. So we are not including them in the compulsory part of our course; you can read about them in more detail in the bonus section.

Okay, so: the first approach is rather simple; the second approach is slightly less simple, but potentially more powerful; and there are, of course, other approaches which are also non-parametric but don't involve Bayesian neural networks.

So, if you chose the Bayesian UCB over the frequentist one, the one based on an inequality, it means you've just traded a few added benefits for a few added drawbacks, as usually goes with any applied science.

On the plus side, the Bayesian methods no longer depend on an inequality, as the frequentist ones do. This means that you no longer run the risk of having to explore actions past the point where it becomes irrelevant, just because your inequality is slightly more optimistic than it should have been. Instead, you have a particular probability distribution with an exact percentile you can measure, which makes sense given your exact choice of prior.

And here we kind of stumble into the first major drawback: Bayesian methods can be made or broken by the choice of prior. If you choose the prior too, well, loosely, it means that you're going to over-explore: you start out not certain of anything, and you might have to spend more time exploring at the very beginning. Alternatively, if you choose a more, well, steep prior, one that looks closer to a delta function, it means that your inference is going to be a bit more greedy than it should have been, and in the worst case it might not even learn the optimal policy.

Another benefit is that if you pick a more complicated probability distribution family, say the one from a Bayesian neural network, or even a mixture of normal distributions, you can engineer a particular formulation of this distribution, the probability of a reward given a particular action, and tailor it to your particular problem given all the expert knowledge you have about it. This is slightly complicated, but you've probably been taught a few tricks on how to do this in the applied Bayesian methods setting, again, earlier in the specialization.

So here it goes: you get a more accurate estimate, which, however, wholly depends on the prior and has to be tuned for the task. This is how Bayesian methods compare with the frequentist ones.

So again, here's a motivational image showing that such distributions need not be unimodal, and so on and so forth; this one was learned with a Bayesian neural network.

So if you compare all the methods we've just learned, Thompson sampling, the UCB, and the Bayesian UCB, on a problem of information retrieval, the one where your bandit tries to select links for a user to satisfy his query, you'll see that they compare one way or another. This is the regret [inaudible] over time. You can see that, basically, depending on the choice of prior and on the particular inequality, those methods can have different performance.

Of course, that doesn't mean that you should just pick the one which looks the highest here; it means instead that the performance of these methods depends on the particular problem, and you'd better try all of them, maybe with some library that has all of them implemented, before picking a particular one.

Of course, if you have more domain knowledge, then a Bayesian method is, as a rule of thumb, going to be slightly more efficient than the frequentist ones. If, however, you have almost no domain knowledge, and you're just about to make up priors off the top of your hat, then you're likely going to shoot yourself in the foot by under-exploring or over-exploring with a wrong prior.

This concludes the section of our course where we learn about the advanced exploration methods. By now we have learned a few, or quite a lot actually, of the details of how exploration methods are defined and how you can compare them against one another.

The first new thing was the notion of regret: how much reward, or money, your agent would lose if it started exploring from random, compared to an agent that had the optimal policy from the get-go. This is basically a universal measure by which you can compare one choice of exploration versus another.

We learned about the future regret, which is regret measured from, say, now till 10 days into the future, and might make sense in some finite time horizon processes, and the notion of infinite regret, which basically sums the difference between your agent and the optimal agent from now until the moment it reaches the optimal policy, if at all. The infinite regret can grow logarithmically, or it can grow linearly, or otherwise, depending on the particular exploration policy.
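
A sketch of how cumulative regret separates these regimes, on a made-up 3-armed bandit with known mean rewards: a policy that explores uniformly forever accumulates regret linearly, while one that has settled on the best arm accumulates none.

```python
import random

# Hypothetical mean rewards per arm; arm 2 is optimal.
TRUE_MEANS = [0.2, 0.5, 0.8]
BEST = max(TRUE_MEANS)

def cumulative_regret(policy, steps=1000):
    """Sum, over time, the gap between the best arm's mean reward and
    the mean reward of the arm the policy actually pulls."""
    total, history = 0.0, []
    for t in range(steps):
        total += BEST - TRUE_MEANS[policy(t)]
        history.append(total)
    return history

random.seed(0)
uniform = cumulative_regret(lambda t: random.randrange(3))  # never stops exploring
settled = cumulative_regret(lambda t: 2)                    # optimal from the get-go
```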

We've also examined a few of the standard exploration strategies, like epsilon-greedy exploration and [inaudible]. We figured out that unless you pick their parameters in a very particular way, you can end up with something that never converges to an optimal policy and has a linearly growing regret, which is terrible, actually the worst thing you can get.

Now, the other new batch of exploration methods we've learned are the ones based on uncertainty, those represented by the two kinds of UCB: the frequentist one, based on the [inaudible] inequality, and the Bayesian one we've just covered. There was also Thompson sampling, which doesn't take a percentile but simply samples from the distributions.

Now, those exploration methods already allow you to improve and tune your agent's exploration in the contextual bandit setting, like online ads, recommendation systems, or information retrieval.

But we're going to take it a little further: we're going to extend those exploration methods to a model-based multi-step setting. In other words, we're going to find out how we can employ all those UCB-like exploration methods based on uncertainty in order to improve how you plan in an environment where you know the probability of the next state given a state and an action.

This is coming right up in the next section, [inaudible], to which I recommend you switch right now. Farewell.
