0:00

[MUSIC]

In this video, we're going to cover sampling from one-dimensional distributions. Let's start with the simplest case: a one-dimensional discrete distribution. Say we have a random variable which takes the value a1 with probability 0.6, the value a2 with probability 0.1, and the value a3 with probability 0.3. How can we sample from it?

Well, we have to start from something; it's hard to generate randomness from nothing on a deterministic computer. So we're going to use a function that generates a uniform random number on the interval from zero to one. Such a function is included in almost every programming language's standard library, so it's usually there, and we can use it as the basis for all our further sampling techniques. Given this function, how can we generate a sample from our discrete distribution?

Well, let's take the unit interval and decompose it into three chunks, where the first chunk has length 0.6, the second chunk has length 0.1, and the third chunk has length 0.3. Then we generate a uniform random number on this interval, and if it lands in the first chunk, we say that we have just generated the value a1, and so on. So let's draw a random number from our function, which just throws a point onto this interval. Say it lands somewhere around 0.9, for example; that means we have just generated the value a3.

So, why does it work? Well, if you throw a random point onto an interval, the probability that it lands in some chunk is proportional to the length of that chunk. And the length of the first chunk is 0.6, which is exactly the probability with which we wanted a1 to be generated, and so on. So this is indeed a correct procedure for generating random numbers from discrete distributions with a finite number of values. The bottom line here is that it's really easy to generate samples from discrete distributions.

But note that here we had just three values. If we had, for example, 1 million values, or 1 billion values, it would be much harder, because we would have to spend time proportional to 1 billion to separate the interval into 1 billion chunks of appropriate sizes and then to find in which particular chunk the random value landed. But if you have a finite number of discrete outcomes and this number is not that large, say 1,000, then you can generate samples from this distribution easily and efficiently.
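The chunk-decomposition procedure above can be sketched in a few lines of Python (the function name `sample_discrete` and the variable names are illustrative, not from the video):

```python
import random

def sample_discrete(values, probs):
    """Sample one value: throw a uniform point on [0, 1) and find its chunk."""
    u = random.random()          # uniform random number on [0, 1)
    cumulative = 0.0
    for value, p in zip(values, probs):
        cumulative += p          # right edge of the current chunk
        if u < cumulative:
            return value
    return values[-1]            # guard against floating-point round-off

# The distribution from the video: a1, a2, a3 with probabilities 0.6, 0.1, 0.3.
samples = [sample_discrete(["a1", "a2", "a3"], [0.6, 0.1, 0.3])
           for _ in range(100_000)]
print(samples.count("a1") / len(samples))  # close to 0.6
```

The loop over chunks is exactly why this is linear in the number of outcomes: with a billion values you would walk a billion cumulative sums.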

Okay, let's move on to continuous distributions. Say, to begin with, that we want to sample a point from a Gaussian. How can we do it? Well, one way is to use the central limit theorem, which says that if we sum up a number of independent random values, we'll get something approximately Gaussian. So if we draw, say, 12 random numbers from the uniform distribution on the interval from 0 to 1 and add them all together, we'll get something with expected value 6, because the expected value of each x_i is 0.5, the center of the interval, and the sum of 12 x_i's lies between 0 and 12, with the center of that range at 6.

So we have to subtract 6 to make the expected value of this distribution 0. Note that this way our distribution will range not from minus infinity to plus infinity, as in the Gaussian case, but from minus 6 to 6. So it's indeed an approximation; it can't generate everything an actual Gaussian can, but it's a pretty close one. For example, it can never generate the number 100, but the Gaussian itself is also extremely unlikely to give you 100, so we can be okay with losing these highly improbable parts of the sample space.

And you may ask, why 12 here? Well, just to make the variance of the resulting random variable z equal to 1, as for the standard Gaussian: each uniform on [0, 1] has variance 1/12, so the sum of 12 of them has variance exactly 1. If you used, say, 100 values instead of 12, you would have to divide the sum by something to bring its variance back to 1. Or, if you don't want to use this kind of loose approximation, you can ask your favorite programming language for a function that generates numbers from a Gaussian directly, and it probably has one; this is a pretty standard thing for a language's library to provide, and it's more efficient and more accurate than the central limit theorem trick.
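Here is a small sketch of the central-limit-theorem trick (the name `approx_gaussian` is illustrative); for exact samples, Python's standard library already provides `random.gauss`:

```python
import random

def approx_gaussian():
    """Approximate a standard Gaussian draw via the central limit theorem."""
    # Sum of 12 Uniform(0, 1) draws: mean 12 * 0.5 = 6, variance 12 * (1/12) = 1.
    # Subtracting 6 centers the result at 0.
    return sum(random.random() for _ in range(12)) - 6.0

draws = [approx_gaussian() for _ in range(100_000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(mean, var)  # close to 0 and 1 respectively

# The exact library routine, for comparison:
exact = random.gauss(0.0, 1.0)
```

Note that every draw from `approx_gaussian` is confined to [-6, 6], which is the truncation the video discusses.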

So, okay, Gaussians are easy. But what if we have a more complicated distribution that we don't know how to sample from? Again, as we discussed in the previous video, this can happen if, for example, it is a posterior distribution over some parameters or some latent variables, and we may know it only up to a normalization constant. So what do we do in this situation?

Â 6:14

Well, we know how to sample from Gaussians, so let's try to use that. Let's upper-bound our distribution with some Gaussian times some constant, for example 2 in this example. The constant is necessary because it's impossible to upper-bound one normalized distribution with another one without multiplying it by something. Okay, so we have a distribution p(x), the blue curve, which we want to sample from, and we have the orange curve, which is 2 times some Gaussian, which we can sample from.

So how can we generate samples from the blue curve? Well, let's start by generating a sample from the orange curve, which we can do. With high probability we'll get a number around 0, because that's where the orange density is high, so quite a few points will land there. But the blue density is not high there, it's a local minimum, so we have to correct for this somehow. We have to, for example, throw away some of the points generated in this region where the actual blue probability is low. So let's do just that: let's reject some of the points generated from the orange curve. And what is the rule for rejecting a point? Well, first of all, it should be probabilistic, because you can't just reject all points in some region; otherwise, the resulting distribution of your accepted points will have zero probability there. So we have to reject points in a probabilistic manner.

And the probability of accepting a particular point x̃ should be proportional to the height of the blue curve relative to the height of the orange curve at that point. So if the two curves coincide at the current point, if they have the same value, you don't want to reject anything; you're happy to accept the point you just generated. If, in some region, the orange curve is high but the blue curve is 0, then you don't want to accept anything; you reject all points in that region. And if it's somewhere in between, if for example the ratio of the blue curve to the orange one is around 0.7, then we expect to keep 0.7, so 70%, of the points sampled in that region.

So the procedure is: we generate x̃ from the orange curve. Then we generate its y coordinate uniformly between 0 and the height of the orange curve at x̃. And we reject this point if it lands in the red region, above the blue curve, and accept it if it lands below the blue curve.
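The accept/reject procedure just described can be sketched as follows. The bimodal target `p_tilde`, the constant m = 8, and the proposal width sigma = 2 are illustrative choices, not from the video, picked so that the Gaussian-times-constant envelope really does upper-bound the target:

```python
import math
import random

def rejection_sample(p_tilde, m, sigma=1.0):
    """Draw one sample from an (unnormalized) density p_tilde by rejection.

    Proposal q is a zero-mean Gaussian with std sigma; we need
    p_tilde(x) <= m * q(x) for all x (the orange curve covers the blue one).
    """
    def q(x):  # density of the Gaussian proposal
        return math.exp(-x * x / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

    while True:
        x = random.gauss(0.0, sigma)       # x-tilde from the orange curve
        y = random.uniform(0.0, m * q(x))  # y coordinate under the orange curve
        if y < p_tilde(x):                 # below the blue curve: accept
            return x
        # otherwise the point fell in the red region: reject and retry

# Illustrative unnormalized bimodal target, just for demonstration:
p_tilde = lambda x: math.exp(-(x - 1) ** 2) + math.exp(-(x + 1) ** 2)
samples = [rejection_sample(p_tilde, m=8.0, sigma=2.0) for _ in range(10_000)]
print(sum(samples) / len(samples))  # close to 0: the target is symmetric
```

Note the target is only known up to its normalization constant here, which, as discussed below the procedure, is all rejection sampling needs.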

Â 9:33

You may ask, why does it work? Well, if you look at the points generated by this procedure before rejecting anything, they are uniformly distributed under the orange curve. When we apply our rejection procedure, we throw away the points that fall above the blue curve and below the orange one, and what is left is points uniformly distributed under the blue curve. And it's natural that the x coordinates of these points are then distributed exactly as the blue curve suggests. So that's great: this is a procedure for generating samples from any distribution you can upper-bound with something you can sample from.

But how efficient is it? If you want to generate one point, how many points will you have to generate from the orange curve before you accept one of them? Well, the probability of accepting each proposal is 1 divided by the constant, so one half in our example, and 1/m in the general case, which means you need m proposals per accepted point on average. The reason is that this is the ratio between the area under the blue curve, which is 1 because it's a probability distribution, and the area under the orange curve, which is m because it's m times a probability distribution. Informally, if we throw points into some area and then reject some of them, the probability of accepting a point is proportional to the area where we accept.

And note that we can use this method even if we know our distribution only up to the normalization constant, as usually happens in probabilistic modeling. In this case, if you can still upper-bound your unnormalized distribution, you can say that the constant, call it m̃, simply absorbs the unknown normalization constant. So you don't know the normalization, but you can upper-bound the distribution with some constant, and this will still work.
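As a quick sanity check of the 1/m acceptance rate, here is a hypothetical setup where a standard Gaussian target p is bounded by m = 2 times a wider Gaussian proposal q with std 2; one can verify p(x)/q(x) = 2 * exp(-3x^2/8) <= 2 for all x, so m = 2 is a valid envelope:

```python
import math
import random

def acceptance_rate(num_trials=100_000):
    """Empirically measure the rejection-sampling acceptance rate for m = 2."""
    def p(x):  # target: standard Gaussian density
        return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

    def q(x):  # proposal: zero-mean Gaussian with std 2
        return math.exp(-x * x / 8) / (2 * math.sqrt(2 * math.pi))

    accepted = 0
    for _ in range(num_trials):
        x = random.gauss(0.0, 2.0)
        y = random.uniform(0.0, 2 * q(x))  # point under the envelope 2 * q
        if y < p(x):
            accepted += 1
    return accepted / num_trials

print(acceptance_rate())  # close to 0.5 = 1/m
```

With m = 100 instead of 2, the same counter would sit near 0.01, which is the inefficiency the summary below warns about.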

So, as a summary, this method is called rejection sampling, and it's almost universally applicable: it's quite common that you can upper-bound your distribution and generate samples from it like this. But the downside is that if q and p are too different from each other, then you have to choose a large constant m, and then you will have to reject most of the points. If, for example, m is 100, then to generate just one point from your actual distribution, the blue curve, you'll have to generate 100 points, and on average just one of them will be accepted.

And another problem with this approach is that if you apply it in the multidimensional case, it will still work, but the constant m will usually be very large, roughly exponential in the number of dimensions. It's really improbable that you will be able to upper-bound a complicated distribution in ten-dimensional space with a constant smaller than, for example, 1 million, and having a constant of 1 million means that to generate just one point you have to wait on the order of a million computational cycles, which is a lot. So this is a method for one dimension, or maybe a few dimensions, but not for thousand-dimensional spaces.

Â [MUSIC]
