0:00

[MUSIC]

In the previous video, we discussed that it can be nice to build a Markov chain to sample from some distribution. But how do we build such a Markov chain? The simplest method for that is called Gibbs sampling, and it works as follows.

Say you have a distribution, a three-dimensional one, which we know up to a normalization constant. So we can compute this probability, or density in the continuous case, at any given point, but only up to normalization.

So Gibbs sampling works as follows. First of all, we start with some initial point, for example all zeros. It doesn't matter where we start. Well, it does matter for some convergence properties and things like that, but it doesn't matter for the sake of correctness.

Then, to build the point for the next iteration, we construct it one dimension at a time. The first dimension of the next point is a sample from the conditional distribution of x1, given the values of x2 and x3 that we had from our initialization.

You may ask: how can we sample from this conditional distribution? That's also a hard problem, right? But note that the conditional distribution is proportional to the joint distribution, so we know it, again, up to a normalization constant. And sampling from a one-dimensional distribution is usually much easier than sampling from a multidimensional one, so we assume that we can do it. In some particular practical cases we can devise a really efficient scheme for sampling from these one-dimensional distributions analytically. In others, we have to use general-purpose schemes like rejection sampling. But in any case, one-dimensional sampling is easy.
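For a discrete joint known only up to a constant, this one-dimensional conditional sampling can be sketched as follows (a minimal illustration; the function name and the NumPy array layout are assumptions, not something from the lecture):

```python
import numpy as np

def sample_conditional(joint, axis, fixed, rng):
    """Sample one coordinate given the others, for a discrete joint.

    joint : array of *unnormalized* joint probabilities p(x1, x2, x3)
    axis  : which coordinate to sample
    fixed : current values of the remaining coordinates, in order
    """
    # Slice the joint along the free axis: p(x_axis | rest) is
    # proportional to p(x_axis, rest), so a slice is all we need.
    index = list(fixed)
    index.insert(axis, slice(None))
    weights = joint[tuple(index)].astype(float)
    # Normalizing a one-dimensional slice is cheap, even when
    # normalizing the full joint is intractable.
    weights /= weights.sum()
    return rng.choice(len(weights), p=weights)
```

The key point is that only a one-dimensional slice of the unnormalized joint is ever normalized.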

So let's go to the next sub-step. On the next sub-step, we generate the second dimension of our first iteration's point. It is generated from the conditional distribution of x2, given x1 set to the value we have just generated, and x3 set to its value from the initialization point.

And finally, we generate the third dimension in the same fashion: from its conditional distribution, given the values we have just generated. That was iteration one. These three sub-steps together constitute just one big step of our Markov chain; it is the way we transition from x0 to x1.

And if you write it down for a general iteration number k, it looks like this: the point at iteration k + 1 is generated one dimension at a time, using conditional distributions that are conditioned on the values we have from the previous iteration and from the current iteration.
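Written out for the three-dimensional case, one big step from iteration k to k + 1 reads (each conditional is known up to normalization, since it is proportional to the joint):

```latex
\begin{aligned}
x_1^{(k+1)} &\sim p\!\left(x_1 \mid x_2 = x_2^{(k)},\; x_3 = x_3^{(k)}\right) \\
x_2^{(k+1)} &\sim p\!\left(x_2 \mid x_1 = x_1^{(k+1)},\; x_3 = x_3^{(k)}\right) \\
x_3^{(k+1)} &\sim p\!\left(x_3 \mid x_1 = x_1^{(k+1)},\; x_2 = x_2^{(k+1)}\right)
\end{aligned}
```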

And note that this scheme is really not parallel: each sub-step depends on the previous sub-steps having been executed. This means that no matter how many computers we have, if we have a one-million-dimensional space, we have to sample these dimensions one at a time, which is a downside. But we will discuss later how we can avoid it.

Anyway, this procedure gives you a way to move from x0 to x1, then to x2, and so on.
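The whole loop can be sketched in Python for a discrete three-dimensional joint known up to a constant (a minimal sketch; the function name, array layout, and use of NumPy are assumptions on my part):

```python
import numpy as np

def gibbs_sample(joint, n_steps, rng):
    """Run a Gibbs chain over a discrete 3-D distribution.

    joint is an array of unnormalized probabilities p(x1, x2, x3);
    we only ever normalize one-dimensional slices of it.
    """
    state = [0, 0, 0]          # arbitrary starting point, e.g. all zeros
    samples = []
    for _ in range(n_steps):
        for axis in range(3):  # one big step = three sequential sub-steps
            # Conditional of this coordinate given the current others:
            index = [slice(None) if i == axis else state[i] for i in range(3)]
            weights = joint[tuple(index)].astype(float)
            weights /= weights.sum()   # normalize the 1-D conditional
            state[axis] = rng.choice(len(weights), p=weights)
        samples.append(tuple(state))
    return samples
```

Note how the inner loop is inherently sequential, matching the non-parallel nature of the scheme discussed above.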

Now let's prove that this indeed defines a Markov chain that converges to the distribution p, so that if we sample like this long enough, we get samples from the desired distribution. We are going to do it on the blackboard.

Â 4:12

So let's prove that Gibbs sampling, with the three sub-steps considered as one big step, indeed gives you a Markov chain that converges to the desired distribution p. What we want to prove is that p of the new point (x prime, y prime, z prime) equals what we get by starting from the distribution p and then doing one step of the sampling. So on the right-hand side we marginalize out the previous point x, y, z, multiplied by the transition probability q: the probability of going from (x, y, z) to (x prime, y prime, z prime).
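In symbols, the stationarity condition being verified on the blackboard is:

```latex
p(x', y', z') = \sum_{x} \sum_{y} \sum_{z}
  \underbrace{p(x' \mid y, z)\, p(y' \mid x', z)\, p(z' \mid x', y')}_{q\left((x, y, z) \to (x', y', z')\right)}\; p(x, y, z)
```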

Â 5:17

Let's look at the right-hand side of this expression. We have a sum, with respect to the previous point, of the transition probability: how probable is it, under Gibbs sampling, to move from (x, y, z) to (x prime, y prime, z prime)? Well, first of all we have to move from x to x prime, which happens with the conditional probability of x prime given y = y and z = z. Then, assuming we have moved to x prime, what is the probability that we move from y to y prime? It is the conditional probability of y prime, given that we have already moved to x prime and started with z.

Â 6:27

This overall product is the transition probability q, and it is multiplied by the starting probability p(x, y, z). We want to prove that this equals p(x prime, y prime, z prime), because in that case we will have proven that p is stationary for our Markov chain, which means the Markov chain converges to this distribution p.

First of all, note that this part, p(z prime | x prime, y prime), doesn't depend on x, y, or z at all, so we can move it outside the sum. It becomes p(z prime | x prime, y prime) times the sum,

Â 7:32

times the probability p(x, y, z). And note here that the only factor that depends on x is p(x, y, z), which basically means that we can push the summation inside: we write the sum with respect to only y and z outside, and the summation with respect to x, over all its values from 1 to the cardinality of x, right next to this factor. Summing out x gives the marginal p(y, z). Also, by the way, note that if our variables x, y, and z were continuous, we would have integrals instead of sums, but the overall logic of the proof would be exactly the same. Okay, so we have this expression.

Â 9:21

Let's start with this term, p(y prime | x prime, z), times the product of these two factors, which multiply to a joint distribution, p(x prime, y, z). And note that the only part that depends on y is this joint distribution, so we can again move the summation with respect to y inside: we write the summation with respect to only z outside, and the summation with respect to y right here, next to p(x prime, y, z). And note that when we marginalize out y here, we end up with p(x prime, z).

Â 11:39

So we're done here. We have just proven that if you start with the distribution p at the point (x, y, z) and then make one step of this Gibbs sampling procedure, one coordinate at a time, you end up with the same distribution p at (x prime, y prime, z prime). This basically means that this distribution is stationary for Gibbs sampling, and that the chain will converge to this distribution from any starting point. So if we use Gibbs sampling, it will indeed eventually start to produce samples from the distribution p.
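The blackboard computation can be summarized in one chain of equalities:

```latex
\begin{aligned}
\sum_{x,y,z} &\; p(x' \mid y, z)\, p(y' \mid x', z)\, p(z' \mid x', y')\, p(x, y, z) \\
&= p(z' \mid x', y') \sum_{y,z} p(y' \mid x', z)\, p(x' \mid y, z) \sum_{x} p(x, y, z) \\
&= p(z' \mid x', y') \sum_{z} p(y' \mid x', z) \sum_{y} p(x' \mid y, z)\, p(y, z) \\
&= p(z' \mid x', y') \sum_{z} p(y' \mid x', z)\, p(x', z) \\
&= p(z' \mid x', y') \sum_{z} p(x', y', z)
 = p(z' \mid x', y')\, p(x', y')
 = p(x', y', z').
\end{aligned}
```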

Â [MUSIC]
