0:00

[MUSIC]

There are, as it turns out, multiple ways to do so.

The first one is specific to this UCB.

First, in general, you don't need to get the whole distribution.

You only need to know its percentile.

And you might have to compromise a little bit, in that you won't know the exact value; you will only know it up to some inequality.

It turns out that, in statistics, there is more than one way you can compute that.

There are concentration inequalities like the Chernoff inequality, and around five more of them.

They all bound something like the probability of some particular value being more than some particular threshold.

They guarantee this particular rate regardless of the distribution, or, as in the Hoeffding case, regardless of the distribution's shape, as long as the possible values of this action's q-value are between 0 and 1.

In this case, the formula on the slide is a simplified version of the Hoeffding inequality.

You can easily find the full version if you just look it up on Wikipedia.

But the essence here is that the probability of your action value, the q-value, being larger than the average q-value you've measured so far by more than t (some offset of, say, plus 5) decays as the exponent of -2, times n, the number of times you've measured this q-value, times t squared.

So if your average q-value is 10 and t is equal to 5, the probability of your true q-value being larger than 15 is the exponent of -2 times n times 25.

And n might be some 1,000.

Which makes it a really, really small number, but it is nonzero nevertheless.
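As a quick sketch (not from the lecture itself), the simplified Hoeffding bound described above can be computed directly; here with small numbers that respect the [0, 1] range assumption:

```python
import math

def hoeffding_upper_bound(n, t):
    # Simplified Hoeffding inequality: for n samples of a [0, 1]-bounded
    # variable, P(true mean > sample mean + t) <= exp(-2 * n * t^2)
    return math.exp(-2 * n * t ** 2)

# The bound tightens fast as n (times we've measured this q-value) grows
print(hoeffding_upper_bound(10, 0.1))    # exp(-0.2), about 0.819
print(hoeffding_upper_bound(1000, 0.1))  # exp(-20), about 2e-9
```

The function name here is mine, for illustration only; the bound itself is the one on the slide.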

What you want to do with this inequality is solve it not for the probability, but for t, for this offset here.

You want to find such an offset, such a difference between the expected value and your action value, the q-value, that the probability of being above this offset is less than or equal to, say, 5%.

Or, put another way, the probability of being within this offset is at least 95%.

This would be some estimate of the percentile you wanted, albeit, since it is an inequality, it might be a less efficient one.

If you solve this thing for t, plugging in some probability that you want, say you want to be 99% sure that you are within this range, then your solution is going to look something like this.

Of course, there can be a coefficient here before the square root, which depends on the probability, the certainty you want to get.

If you want to be 90% sure, it would be less than 1 here, and if you want to be 99.99999% sure, it will be, say, maybe 10 or 100 times more.

But the general idea looks like this.
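Solving exp(-2 · n · t²) = δ for t, where δ is the failure probability you tolerate, gives a sketch like this (the function name is mine, not from the lecture):

```python
import math

def hoeffding_radius(n, delta):
    # Offset t with P(true mean > sample mean + t) <= delta,
    # obtained by solving exp(-2 * n * t^2) = delta for t
    return math.sqrt(math.log(1 / delta) / (2 * n))

# Demanding more certainty (smaller delta) widens the offset
print(round(hoeffding_radius(100, 0.10), 4))  # about 0.1073
print(round(hoeffding_radius(100, 0.01), 4))  # about 0.1517
```

This is where the coefficient mentioned above comes from: shrinking δ grows the log(1/δ) factor under the square root.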

You take the number of times you've picked this action in a state.

And you take the number of times you've ever been to the state and picked any action, regardless of which one.

And then you plug them into this square root thing here.

So you take the logarithm of the number of times you've been to the state, divided by the number of times you've picked this action in the state, multiplied by 2, under the square root, all that stuff.

And this is how far you get from the expected value.

So the final priority of an action is the sum of this expected value and the addition from the UCB, the upper confidence bound.
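The resulting priority can be sketched as follows; this uses the common UCB-1 coefficient of 2, though, as noted above, the coefficient depends on the certainty you want:

```python
import math

def ucb1_priority(q_hat, n_state, n_action):
    # q_hat    -- average q-value measured so far
    # n_state  -- times we've been to this state (any action)
    # n_action -- times we've picked this particular action in the state
    if n_action == 0:
        return float("inf")  # never-tried actions get top priority
    return q_hat + math.sqrt(2 * math.log(n_state) / n_action)
```

Picking the action with the highest priority then trades off exploitation (q_hat) against exploration (the square-root bonus).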

Â 3:31

Now, let's see how this formula works.

Let's plug in some numbers.

For starters, let's say that we have never picked an action; there's a new action that we have not yet picked.

And I want you to tell me how interested we are in this action.

What are the odds that we're going to pick it?

Yes, we're actually really interested.

And unless there is some other action we have also picked zero times, the action which has not yet been picked will be the top priority.

Because if na, the number of times we've picked this action, is equal to 0, then the priority of this v-hat kind of gets set to infinity.

And it means that we are dead sure to pick this action.

It might be optimal, so we might just as well try it now.

Â 4:48

Well, yes, it turns out that v-hat is going to be very close to the expectation, and it's going to get even closer as N and na increase towards infinity.

In the limit, it will converge exactly to the expected value, because the logarithm of some value, divided by that value itself or something proportional to it, is going to converge to zero.

Okay, so this is how it's going to behave.

And again, it makes some sense, because after you've explored everything countless times, the exploration value doesn't really matter.

You know everything perfectly.

So you can just as well exploit.
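A small sketch of that convergence: if, say, an action keeps getting picked on a fixed fraction of visits (here na = N / 2, an assumption of mine for illustration), the exploration bonus vanishes as the counts grow:

```python
import math

# The bonus sqrt(2 * log(N) / na) shrinks towards zero as N grows
for n_state in (10, 1_000, 100_000):
    n_action = n_state // 2
    bonus = math.sqrt(2 * math.log(n_state) / n_action)
    print(n_state, round(bonus, 4))
```

So with everything explored countless times, the priority is dominated by the q-value itself.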

Â 5:26

This is one kind of approach to computing the upper bound.

Practically, this means that you have to store, for each state and for each action, how many times you have picked this action, and how many times you have been to this state.

That's, well, roughly three times as many parameters compared to standard Q-learning.

And that's kind of bearable in the tabular case.

If you're not using a table, if you're using some kind of approximation, you'll also have to approximate the densities of the state, and of the action being picked in the state, with some parametric functions.

But that's going to be covered in the reading section in much more detail.
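In the tabular case, that extra bookkeeping amounts to two count tables next to the q-table; a minimal sketch (the names are mine, not from the lecture):

```python
from collections import defaultdict

q_values = defaultdict(float)    # (state, action) -> average q-value so far
state_visits = defaultdict(int)  # state -> times we've been there
action_picks = defaultdict(int)  # (state, action) -> times we've picked it

def record_step(state, action):
    # Update the counts the UCB bonus needs after each step
    state_visits[state] += 1
    action_picks[(state, action)] += 1

record_step("s0", "a1")
record_step("s0", "a2")
print(state_visits["s0"], action_picks[("s0", "a1")])  # 2 1
```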

Â 6:07

So this is, of course, one thing we've derived from the Hoeffding, or Chernoff, inequality, but you can find your own modified formula which would work with any other inequality of your preference, and it might be better.

The catch here is that those are all inequalities, so they are upper bounds on how interested we are.

Okay, sorry, lower bounds on how interested we are.

So if something, if UCB-1, says to you that you should explore, then it might be that you really should.

But it might be that it just overestimates the value of this action, because it doesn't know anything about the distribution beforehand.

And again, for this particular inequality, we assume that it is over-

[MUSIC]
