0:00

We previously defined the notion of a Markov chain that allows us to generate samples from an intractable distribution. But we left unanswered the question: assuming we have a Markov chain with the desired stationary distribution, how do we go about using it in the context of a concrete algorithmic procedure? Well, let's imagine that somebody has provided us with a Markov chain for a distribution.

Imagine that our goal is to compute some probability relative to a distribution P, but, for whatever reason, P is too hard to sample from directly. So how do I use the Markov chain to get around that?

First, we construct a Markov chain whose unique stationary distribution is P; so it usually needs to be a regular Markov chain. And then we're going to generate samples from this chain. We start out by sampling the initial state x_0 from some arbitrary distribution P_0. P_0 is not the same as P, so obviously I can't use that zeroth sample as a way of estimating an event relative to P. What I do now is start walking along the chain, sampling x_(t+1) from the transition model. And because the chain converges to its stationary distribution, eventually I'll have a sample that is very close to being a sample from the distribution P that I care about.

So, now, here's the sticky issue.
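Before getting to that, the procedure so far, initialize from P_0 and then repeatedly sample from the transition model, can be sketched in a few lines. This is a minimal sketch using a hypothetical two-state chain whose stationary distribution is (0.25, 0.75); nothing here comes from the lecture's actual example:

```python
import random

def step(x, rng):
    # Transition model T of a made-up two-state chain. Detailed balance
    # with pi = (0.25, 0.75): 0.25 * 0.3 == 0.75 * 0.1, so pi is stationary.
    if x == 0:
        return 1 if rng.random() < 0.3 else 0
    return 0 if rng.random() < 0.1 else 1

def run_chain(num_steps, x0, seed=0):
    # Sample x0 from some arbitrary P0 (here: a fixed state), then walk
    # the chain: x_{t+1} ~ T(. | x_t).
    rng = random.Random(seed)
    x, trace = x0, [x0]
    for _ in range(num_steps):
        x = step(x, rng)
        trace.append(x)
    return trace

trace = run_chain(20000, x0=0)
# Early samples reflect P0, not P; after the chain has (hopefully) mixed,
# the fraction of time spent in state 1 approaches pi(1) = 0.75.
frac_late = sum(trace[1000:]) / len(trace[1000:])
```

Note that the sketch discards a prefix of the run before computing the fraction, for exactly the reason discussed next.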

And it really is a sticky issue: when have we walked long enough that we can actually use our samples? We don't want to use the samples at the beginning, because they're far from P, and so we need to walk around long enough for the chain to get close to its stationary distribution. That state is called mixing.

So mixing is when P_t is close enough to π.

2:14

That is, your time-t distribution is close to the stationary distribution, and that's the point at which I can say: good enough, I can start collecting samples.

So how do we know whether a chain has mixed or not? And the short answer is: you don't.

And so this is, as I said, the sticky part of Markov chain methods. In general, you can never really prove that a chain has mixed. But in some cases you can show that it hasn't mixed. And then, if you run lots and lots of tests, and none of them has convinced you that the chain hasn't mixed, then you sort of resign yourself to assuming that it has, in fact, mixed.

So how do you convince yourself that a chain hasn't mixed? You compute chain statistics, and we'll show some examples of that in a moment, in different windows within a single run of the chain. So, for example, here I have a single run of the chain. I run it for a while, and then I look at little windows (usually they won't be this small), and I compare this window to that window, computing various statistics. For example: what is the probability of hitting a particular state? And I ask, is that probability in this window close to the probability in that window? If it is, then maybe I've reached at least some kind of convergence point, in terms of where the chain has reached.

Now this, of course, is not a sufficiently good answer, because the same behavior could also arise in a chain that has two very high-probability regions that are very hard to get between.

Â 4:15

If the chain is just ambling along in this part of the space and never hitting the other part, then you're still going to get very similar probabilities across windows of a single run of the chain, because all of the samples are taken from this part and none are ever taken from that part. And so we can't tell that the chain hasn't mixed.

So a more reliable evaluation criterion is to take these statistics across different runs that are initialized in different parts of the space. Then you might hope that one run of the chain is traversing this region and another run of the chain is traversing that region, and now the statistics would show a difference and indicate that mixing hasn't taken place.
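A minimal sketch of this kind of diagnostic: compute the same statistic, here the empirical probability of an event, in different windows of one run and in matching windows of two differently-initialized runs, and flag a problem when they disagree. The traces below are faked for illustration; in practice they would be the states visited by actual runs of the chain:

```python
import random

def window_prob(trace, start, end, event):
    # Empirical probability of `event` within trace[start:end].
    window = trace[start:end]
    return sum(1 for x in window if event(x)) / len(window)

# Stand-in traces (in reality: two runs of the chain, initialized in
# different parts of the space).
rng = random.Random(0)
run1 = [rng.random() < 0.7 for _ in range(6000)]
run2 = [rng.random() < 0.7 for _ in range(6000)]

hit = lambda x: x  # event: "chain is in the state of interest"

# Within-run check: two disjoint windows of the same run.
p_early = window_prob(run1, 1000, 3500, hit)
p_late = window_prob(run1, 3500, 6000, hit)

# Across-run check: the same window in differently-initialized runs.
p_run1 = window_prob(run1, 1000, 6000, hit)
p_run2 = window_prob(run2, 1000, 6000, hit)

# A large discrepancy in either comparison shows the chain has NOT
# mixed; agreement is only weak evidence that it has.
not_mixed = abs(p_early - p_late) > 0.05 or abs(p_run1 - p_run2) > 0.05
```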

So what might we do, more concretely? Let's look at two examples. Here are two example runs of a chain that we'll describe a little bit later. What we measure here, and this is the first statistic to measure in Markov chains, is the log-probability of a sample. So you compute the log-probability of each sample. Now, you can't always compute the log-probability directly; you might instead compute an unnormalized log-probability, as we'll talk about later. But basically, you compute the log-probability, or the log-probability up to a constant.

And now you can compare two runs. This is a run that's initialized from an arbitrary state, and this is one that's initialized from a high-probability state. You look at those and ask: has it mixed, relative to this criterion? The answer is: maybe. It looks okay, but you're not entirely sure. But we can look at other statistics.

Let's look at another example. What about this one? Here, again, is an example of two runs, one initialized from an arbitrary state and one from a high-probability state. And you can see that the log-probability values are nowhere close to each other, so in this case the answer is definitely not: these are two runs of the chain, and really, neither has mixed. And so you need to run for a lot longer; note that this axis goes up to 600,000, which kind of indicates how much time this might take.

A different way of looking at this uses a different kind of statistic.

So now we have, for example, the probability relative to a window that we compute in the chain. Remember, all of this is relative to a window in the chain, after we hope that mixing has taken place. And now we compute the probability that the states within this window lie in some set, for example the set of states where X3 = 2. We compute that statistic using the two initializations of the chain, so this is run one, and this is run two. And now we do a scatter plot for different statistics. Each of these points is a statistic: this is, say, the probability that X3 = 2; this might be the probability that X1 = 0; this might be some other probability, say that X5 = 7.

What you see here is a scatter plot: one coordinate is the estimate from one run, and the other is the estimate from the other run. Looking at this, it should be obvious that for the first chain, on the left-hand side, we have not got mixing, because you can see all these points that have high probability in one of the two runs and probability zero in the other, and vice versa. Whereas here, most of the estimates are clustered around the diagonal, so you're getting similar estimates from the two runs. And so again, the first one is a clear no, and the second is a maybe. And if I compute a lot of these statistics, and they all come up "maybe", then I'm willing to trust that mixing is taking place.
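This scatter comparison can be sketched in a few lines: estimate the probability of each event of interest from a post-burn-in window of each run, and check how far each pair of estimates strays from the diagonal. The traces and events below are made up for illustration:

```python
def event_probs(trace, events):
    # Empirical probability of each event over a (post-burn-in) trace.
    return [sum(1 for x in trace if e(x)) / len(trace) for e in events]

# Stand-in post-burn-in windows from two runs over states (x1, x3).
run1 = [(0, 2), (1, 2), (0, 2), (0, 1), (1, 2), (0, 2)]
run2 = [(0, 2), (0, 1), (1, 2), (0, 2), (0, 2), (1, 1)]

# Hypothetical events: X3 = 2, X1 = 0.
events = [lambda s: s[1] == 2, lambda s: s[0] == 0]

p1 = event_probs(run1, events)
p2 = event_probs(run2, events)

# Each (p1[i], p2[i]) pair is one point of the scatter plot; large
# off-diagonal distances mean the runs disagree, i.e. no mixing.
max_gap = max(abs(a - b) for a, b in zip(p1, p2))
```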

So now that I've started collecting samples, how do I use these samples? Well, one important observation to keep in mind is that, once the chain has mixed, all of my samples are from the stationary distribution. That is, if x_t is distributed as π, then so are x_(t+1), x_(t+2), x_(t+3), and so on and so forth. And so we can use every single one of those samples, because they are all from the correct distribution. So once I've determined that I've sampled long enough for mixing, or believe that I've sampled long enough for mixing, I should collect and use all of the samples.

And in fact, you might read some papers that say: oh, I'm only going to collect every hundredth sample. There are actually papers that prove that using every single sample is better than collecting every hundredth sample. But then you might ask: why would those papers tell me that I should only collect every hundredth sample, as opposed to collecting all the samples, if they're all from the correct distribution?

Because, and this is undoubtedly true, adjacent samples, ones that are nearby each other in time, are correlated with each other. Even if x_t is from the right distribution π, and so is x_(t+1), x_(t+1) is still going to be close to x_t. And so you're not really getting two different samples; you're getting two that are very close relatives of each other.

Now, as I said, it's important to recognize that phenomenon, because just because you've reached mixing and collected a thousand samples doesn't mean you have a thousand samples' worth of information. So you shouldn't go back and apply one of the bounds that we saw, assuming that you have i.i.d. samples. The samples are not i.i.d. That doesn't mean you shouldn't use them; using them is still better than not.

Now, this is where you get bitten twice by the same phenomenon. The worse a chain is at mixing, that is, the longer you need to wait for the initial samples to be good enough, the more correlated the samples are, because the slower you are moving around the space in general. And so if a chain is bad, it's bad in two different ways: it's bad because it takes you longer to mix, and it's bad because the samples you are collecting are not as useful, due to the correlation structure between the samples.
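A standard way to quantify how much that correlation costs you (a common MCMC diagnostic, not something defined in this lecture) is the effective sample size: n correlated samples are worth roughly n / (1 + 2 ∑ ρ_k) independent ones, where ρ_k is the lag-k autocorrelation. A rough sketch, truncating the sum at the first non-positive autocorrelation:

```python
def autocorr(xs, lag):
    # Sample autocorrelation of the trace at a given lag.
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0.0:
        return 0.0
    cov = sum((xs[t] - mean) * (xs[t + lag] - mean)
              for t in range(n - lag)) / n
    return cov / var

def effective_sample_size(xs, max_lag=200):
    # n_eff = n / (1 + 2 * sum of positive leading autocorrelations).
    rho_sum = 0.0
    for lag in range(1, max_lag + 1):
        rho = autocorr(xs, lag)
        if rho <= 0.0:
            break
        rho_sum += rho
    return len(xs) / (1.0 + 2.0 * rho_sum)

# A slowly-moving trace carries far less information than its raw length:
sticky = ([0] * 50 + [1] * 50) * 10  # long runs of repeated states
jumpy = [0, 1] * 500                 # changes state every step
ess_sticky = effective_sample_size(sticky)
ess_jumpy = effective_sample_size(jumpy)
```

Both traces have 1,000 entries, but the sticky one's effective sample size is a small fraction of that, which is exactly the "bitten twice" point: slow-mixing chains produce exactly these sticky, low-information traces.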

Â 13:12

And then finally, when I've collected enough samples, I can go ahead and compute the expectation of anything that I care about, whether it's an indicator function or a more complicated function.

Summarizing the implications: Markov chains are a very general-purpose class of methods for doing approximate inference in general probabilistic models, not even necessarily probabilistic graphical models. They're often very easy to implement, because the local sampling that we're doing is often quite straightforward to execute. And the method has good theoretical properties once we generate enough samples that are sufficiently far away from our starting distribution. Unfortunately, these theoretical guarantees are very theoretical in many cases.

And so this method also has some significant cons. First, it has a very large number of tunable parameters and design choices: What's my mixing time? Which statistics do I measure? How close do the statistics have to be to each other before I declare that mixing has taken place? How many samples do I collect? What window size do I use to evaluate mixing? All of these are design choices, they all make a difference, and so there's a lot of finicky tuning that needs to be done when running an MCMC method in practice.

The second is that, depending on the design of the chain, this can be quite slow to converge, because it's very difficult to design chains that have good mixing properties.

And finally, it's very hard to tell whether a chain is working. That is, it's not straightforward to determine whether the chain has mixed, or whether my samples are sufficiently uncorrelated with each other that I'm getting a reliable estimate of whatever it is I'm trying to figure out.

So: a lot of advantages, but also some significant disadvantages.
