0:00

We defined the notion of a general Markov chain. We talked about how certain conditions guarantee that, if we sample from the chain, we achieve convergence to a stationary distribution. We then described how, once we generate samples from the Markov chain, we can use them to estimate properties of that stationary distribution, such as various statistics that we might care about. But the question of where the Markov chain might come from was left very vague. It turns out that for graphical models, people have constructed a general-purpose sampler that can be used to sample from any graphical model, and that class of chains is called the Gibbs chain. So now we're going to talk about the Gibbs chain and how to use it in the context of graphical models.

So, here our target distribution, the one from which we would like to sample, is the good old Gibbs distribution, P Phi of X1 up to Xn. And just as a reminder, P Phi can equally well represent a Gibbs distribution that we got from just a product of undirected factors, a Markov-network-style distribution. Or it could just as well represent a set of factors that came from a Bayesian network, reduced by evidence. And so that is the factor set Phi, and everything that I'm going to say from now on is agnostic to where those factors Phi came from, whether they came from a directed or an undirected model.

So now our Markov chain state space is the set of complete assignments, little x, to the set of variables big X. So note that we're sampling over an exponentially large state space: the space of all possible assignments to the random variables in the graphical model. Those little circles that we saw when we were talking about Markov chains, where we were going from one state to the other: each of those is now a complete assignment.

So it turns out that the Gibbs chain is actually a very simple and easy-to-understand Markov chain. And here's what it does. Assuming that I have a starting state, little x, what I'm going to do is take each variable in its turn, using some arbitrary ordering, and sample a new value for that variable given the values of all the rest. So let me introduce this notation: x minus i is the assignment to all of X1 up to Xn except Xi. So, for example, if we had three random variables X1, X2, and X3, we would sample X1 given X2 and X3, then X2 given X1 and X3, and then X3 given X1 and X2. Right, so one at a time.

Okay, so let me draw that out in order to make that clear. So here's my current assignment. Here's X1, X2, X3, and let's say that we have zero, zero, zero as my initial assignment. So I'm going to start by sampling X1 from the probability of X1 given X2 equals zero

Â 3:32

And X3 equals zero, and let's say I end up with one; the other two haven't changed yet. I now sample from P of X2 given X1 equals the new value one and X3 equals zero, and let's say that stays zero. And then I sample X3 from the probability of X3 given X1 equals one and X2 equals zero, and say that changes to the value one. And so now this is my new assignment, (1, 0, 1), which becomes my new state x prime.

That is a single step of the Gibbs chain. It goes and resamples every single variable in its turn, potentially modifying it; it doesn't have to modify anything, because you might end up staying in the same assignment. And when all of them are done, that's my new state.
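To make the mechanics concrete, here is a minimal sketch of one Gibbs step in Python. The three-variable joint table is made up for illustration; in a real graphical model we would never enumerate the joint, but it makes the per-variable conditionals explicit:

```python
import random

# Made-up joint distribution p(x1, x2, x3) over three binary variables.
p = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.10, (0, 1, 0): 0.05, (0, 1, 1): 0.15,
    (1, 0, 0): 0.10, (1, 0, 1): 0.20, (1, 1, 0): 0.10, (1, 1, 1): 0.10,
}

def gibbs_step(state, joint, rng=random):
    """One Gibbs step: resample each variable in turn from P(X_i | rest)."""
    state = list(state)
    for i in range(len(state)):
        weights = []
        for v in (0, 1):
            state[i] = v
            weights.append(joint[tuple(state)])  # unnormalized conditional
        # Normalize over the two values of X_i and sample the new value.
        state[i] = 0 if rng.random() < weights[0] / sum(weights) else 1
    return tuple(state)

state = (0, 0, 0)
state = gibbs_step(state, p)
```

Each pass through the loop is one of the "sample X1, then X2, then X3" sub-steps described above, always conditioning on the most recent values of the other variables.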

Â 4:37

So let's do this example in the context of a real graphical model. So here is the good old student Bayesian network. And I'm assuming that I have observed values for two of the variables: I've observed L equals l0 and S equals s1. And so now I have a product of reduced factors that represent the evidence that I received, and only three variables are now getting sampled, because the other two are fixed to their observed values. So now I'm sampling from P tilde Phi of D, I, G, given little s1 and l0.

And so what I'm going to do in this sampling process is start out with some arbitrary assignment to these three variables, D, I, and G. So, for example, since it's arbitrary: d0, i0, g0. And now I'm going to go ahead and sample each of these in turn. And so I'm going to initially sample D from P of D given i0, g0, l0, and s1.

Â 6:18

Okay? And we'll talk about how that's done in just a moment. And that might give me, say, d1 instead of d0. And now I'll continue, and I might sample I. So I'm going to sample I from the probability of I given d1, g0, l0, s1, and say I end up with i1. Finally, I sample G. And it couldn't have been g0, I apologize, because G only takes the values one, two, three. So let's say it was g1, and now I sample G from the probability of G given d1, i1, that's where I am right now, l0 and s1. And say I get g3 for the new value. So I use this one: my new assignment is d1, i1, g3.
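This sweep over D, I, G with fixed evidence can be sketched in code. The factor tables below are made up stand-ins, not the course's actual CPDs; the evidence s1, l0 is folded into reduced factors, exactly as described above:

```python
import random

# Illustrative (made-up) factors for the student-network sweep.
phi_D = {0: 0.6, 1: 0.4}                                   # P(D)
phi_I = {0: 0.7, 1: 0.3}                                   # P(I)
phi_G = {(d, i, g): 1.0 / 3.0                              # P(G | D, I), uniform stub
         for d in (0, 1) for i in (0, 1) for g in (1, 2, 3)}
phi_L = {g: [0.9, 0.6, 0.1][g - 1] for g in (1, 2, 3)}     # P(l0 | G), evidence-reduced
phi_S = {i: [0.95, 0.2][i] for i in (0, 1)}                # P(s1 | I), evidence-reduced

def weight(d, i, g):
    """Unnormalized measure of a full assignment, with evidence folded in."""
    return phi_D[d] * phi_I[i] * phi_G[(d, i, g)] * phi_L[g] * phi_S[i]

def sweep(d, i, g, rng=random):
    """One Gibbs sweep: resample D, then I, then G, given everything else."""
    for var, values in (("D", (0, 1)), ("I", (0, 1)), ("G", (1, 2, 3))):
        w = []
        for v in values:
            if var == "D":
                w.append(weight(v, i, g))
            elif var == "I":
                w.append(weight(d, v, g))
            else:
                w.append(weight(d, i, v))
        r, z = rng.random() * sum(w), 0.0
        for v, wv in zip(values, w):
            z += wv
            if r <= z:
                if var == "D":
                    d = v
                elif var == "I":
                    i = v
                else:
                    g = v
                break
    return d, i, g
```

The observed variables L and S never get resampled; they only enter through the reduced factors phi_L and phi_S.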

So how do I do this in a tractable way? So here is the trick. I mean, you might say, well, you're dealing with this intractable distribution, and you've just moved the problem from the original inference problem to computing these conditional probabilities. Why is that any easier? Well, it turns out it is. So let's look at the sampling step over here, where I sample Xi from this conditional distribution, and let's break that down into its constituent definitions. So this conditional distribution is a ratio between the joint distribution of the value xi together with the stuff on the right-hand side of the conditioning bar, divided by the stuff on the right-hand side of the conditioning bar by itself. Now, importantly, because each of these is an unnormalized measure divided by the partition function, the partition function cancels: if I had a one-over-Z here, the one-over-Z cancels, and I end up with a ratio of two unnormalized measures. The point being now that this numerator involves a complete assignment

Â 9:00

And therefore just a product of factors. So let's look at that in the context of a concrete example, and see how that might help us.
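As a side note, the cancellation just described can be written compactly (a restatement of the derivation above, where x minus i denotes the assignment to all variables except Xi; in the last step only the factors mentioning Xi survive, since all others cancel between numerator and denominator):

```latex
P_\Phi(x_i \mid \mathbf{x}_{-i})
  = \frac{\frac{1}{Z}\,\tilde P_\Phi(x_i, \mathbf{x}_{-i})}
         {\frac{1}{Z}\,\tilde P_\Phi(\mathbf{x}_{-i})}
  = \frac{\tilde P_\Phi(x_i, \mathbf{x}_{-i})}
         {\sum_{x_i'} \tilde P_\Phi(x_i', \mathbf{x}_{-i})}
  \;\propto\; \prod_{j \,:\, X_i \in \mathrm{Scope}[\phi_j]} \phi_j(x_i, \mathbf{x}_{-i})
```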

So now we're doing an example that's a Markov network, in order to demonstrate that this is equally applicable to both. So this is our good old misconception network, where we have these four factors phi1, phi2, phi3, and phi4. And let's imagine that we're trying to compute P Phi of A given B, C, and D. And we can see that what we have here is P tilde Phi of A, B, C, D, divided by the sum over A prime of P tilde Phi of A prime, B, C, D. And I'm using A prime to distinguish it from this A over here, just to avoid ambiguity. And now, since each of these unnormalized measures is a product of the factors, I'm going to plug that product of factors into each of the numerator and the denominator separately. And so over here I end up with phi1 of A, B, times phi2 of B, C, times phi3 of C, D, times phi4 of A, D. And similarly, I end up with exactly the same expression in the denominator, except I sum out over A prime.

Â 11:16

Well, that's for the numerator; what about the denominator? Well, the denominator is just our good old normalizing constant. So really what we have here is that this is proportional to phi1 of A, little b, times phi4 of A, little d. So this is a factor over A that I can compute by multiplying two factors, in this case singleton factors over A. I multiply them together and renormalize, and I get a distribution over A from which I can sample, and we already talked

about how to sample from a discrete distribution.

So what is the computational cost, now, of this algorithm?

The sampling step for Xi given all of the other variables is simply this expression over here, which is proportional to a product of factors. And not just any old product of factors, but only the factors that involve Xi:

Â 12:39

just the factors that have Xi in their scope. Which means that I only care about Xi and its neighbors in the graph. Which, for a large graph with tens of thousands of nodes, can be a considerable savings, okay?
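That local step can be sketched in code using the misconception example from above; the factor values here are made up for illustration. To resample A given b, c, d, we only multiply the factors whose scope contains A, here phi1(A, B) and phi4(A, D), and renormalize locally:

```python
import random

# Made-up factor tables; only these two mention A.
phi1 = {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}    # phi1(A, B)
phi4 = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}  # phi4(A, D)

def conditional_over_A(b, d):
    """Normalized P(A | b, c, d).  C drops out because no factor that
    mentions A mentions C, and the global partition function cancels."""
    w = [phi1[(a, b)] * phi4[(a, d)] for a in (0, 1)]
    z = sum(w)  # local normalizer over the two values of A
    return [wi / z for wi in w]

def sample_A(b, d, rng=random):
    dist = conditional_over_A(b, d)
    return 0 if rng.random() < dist[0] else 1
```

Notice the cost: two table lookups and one normalization over the values of A, regardless of how many other variables the network has.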

So, only Xi and its neighbors. So, do we like the Gibbs chain? We've certainly defined it. Does it have the properties that we would like it to have? Specifically, is it a regular chain, so that we have guaranteed convergence to a unique stationary distribution? Well, the answer is, unfortunately, not always. And to see that, we're going to return to our old enemy, the exclusive-or network, which, as I've said in other examples, is the counterexample to pretty much anything.

So let's look at the exclusive-or network. So here we have Y, which is the exclusive or of X1 and X2. And let's imagine that I observe Y equals one, and so these two disappear. And now I'm going to try to sample X1 and X2, where, if I were lucky, I would get each of the two possible states for X1 and X2 with probability one half.

So let's imagine that I start out with this one: my initial state is (0, 1) for X1 and X2, and of course Y equals one is my evidence. And now I'm going to sample X1 given X2 and Y. And that is a deterministic dependence, and the only possible value that I can get is zero for X1. Well, am I going to have any better luck

with X2? Nope. I'm going to end up with exactly the same (0, 1) that I started out with. And I can do the exact same thing again, and you know what? It'll work the same way every single time, and I will always end up with (0, 1). Conversely, if I start out with (1, 0), I will never leave (1, 0). This is a classical example of a non-mixing chain.

On the other hand, if all factors are positive, that is, if there are no zero entries in any of the factors, then you can show that the Gibbs chain is regular. Now, however, it's important to realize that regular doesn't mean good. So, I can make this chain regular by adding a tiny probability epsilon of having Y not be the exact exclusive or of X1 and X2. Now the chain is regular, because all possibilities have non-zero probability. And it's still going to mix extremely slowly for small values of epsilon. So regularity, remember, is a very weak condition. It says that if you sample long enough, you will eventually converge. But it doesn't mean it'll happen in a reasonable time frame.
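The stuck exclusive-or chain can be simulated in a few lines. With Y = 1 observed, each full conditional is deterministic, so the sampler never leaves its initial state:

```python
def gibbs_xor_step(x1, x2, y=1):
    """One Gibbs sweep for the XOR network with Y observed."""
    x1 = y ^ x2   # P(X1 | x2, y) puts all its mass on y XOR x2
    x2 = y ^ x1   # likewise for X2
    return x1, x2

state = (0, 1)
for _ in range(1000):
    state = gibbs_xor_step(*state)
# state is still (0, 1); starting from (1, 0) it would stay (1, 0) forever
```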

Â 16:16

And one such example of a very slow-to-mix chain actually occurs in practical applications. So here's one example, from our image segmentation domain.

And imagine that we're trying to resample, say, the class of, let me use a different color, this superpixel here. And we have some fairly strong potentials that tell me that the class of this superpixel should be very highly correlated with the class of its adjacent superpixels, because their appearance is very similar to each other. That is, I have some pairwise Markov network with some very strong potentials that try to enforce consistency between adjacent superpixels.

So if it turns out that, for whatever reason, my initial assignment was actually wrong, so, for example, a common error mode for segmentation models is that road is labeled as water or sky, because the appearance is actually kind of similar, a grayish kind of appearance. So say that I happened to label all of these superpixels as, say, water, and now I'm going to use Gibbs sampling. None of them is likely to move away from its current assignment, because each of the superpixels is really constrained by strong pairwise potentials with its neighbors. And so this notion of slow-to-mix Markov chains is not just a theoretical construct that occurs with exclusive-or; it actually happens in practice a fair amount.

Now, to summarize: what Gibbs sampling does

is take the very challenging problem of inference in a Gibbs distribution and convert it into a sequence of easy sampling steps, where each sampling step samples one variable given all of the others.

This method has the advantage that it's probably the simplest Markov chain for a probabilistic graphical model, and it's very efficient to compute each of the individual sampling steps. The cost is that it can be very slow to mix when the probabilities are peaked and the variables are very highly correlated. And it's only applicable in cases where we can sample from a product of factors. Now, we can always do that in a discrete graphical model, if our factors aren't too large. But if we have, for example, a very densely connected Markov network, or one with continuous distributions where the product of factors isn't something that has a nice parametric form from which we can sample, then Gibbs sampling might not actually be applicable. And so there are non-trivial limitations to the applicability of Gibbs sampling, and even when it applies, it can actually be very slow to mix. And we often want to do something more clever, and we'll talk about more clever algorithms later on.
