0:00

A very different paradigm for inference in graphical models is the use of what are called sampling, or particle-based, methods. In these methods, rather than trying to manipulate the exponentially or even infinitely large probability distribution as a whole, we randomly sample instances from that distribution and then use those instances as a sparse representation. We can then use those instances to estimate various quantities that we care about regarding the statistics of the overall distribution. Before we show how this is used in the context of graphical models, let's go ahead and see how this is applied in a simple setting. So first of all, how do we use samples for estimating quantities that we care about?

So imagine that somehow, and we're going to talk about how that might happen, somebody gives us, or we manage to construct, a data set D which consists of a bunch of samples from a distribution P. For the moment we're going to assume that these samples are what's called IID, which stands for independent and identically distributed. Now, let's imagine that we're trying to compute, or estimate, and this is all going to be approximate, some quantity of the distribution P. For the moment, let's focus on the really simple case where these are binary-valued random variables. So the distribution P that we're sampling from is one where the probability that X = 1 is p. Think of this like tossing a coin. Now, what is a reasonable estimator for this parameter p that we're trying to estimate, the probability that X comes up heads? There's a fairly intuitive answer here: the estimator, which we're going to call P sub D, is simply obtained by looking at all of the samples that we got and counting the fraction of ones.

Â 2:23

And if you think about it, that's a perfectly reasonable approximation. Right? Now, more generally, for any distribution P and any function f whose statistics we're trying to estimate, we can approximate the expectation of the function f relative to the distribution P in terms of the average of the value of f on the samples. This is often called the empirical expectation. So this is how we might, if we had some way, which we have not yet talked about, of constructing or sampling from a distribution, use those samples for approximating various properties relative to the distribution P, whether it's the probability of an event or the expectation of a function. And note, by the way, that the expectation of a function subsumes the probability of an event, because you can also take the expectation of, for example, an indicator function, and that corresponds directly to a probability. So, for example, this probability can be viewed as the expectation, relative to P, of the indicator function representing X = 1.
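Since the lecture's formulas live on slides we can't see, here is a minimal sketch of the empirical expectation in Python, assuming a made-up Bernoulli parameter p = 0.3 as the distribution being sampled:

```python
import random

def empirical_expectation(samples, f):
    """Empirical estimate of E_P[f(X)]: the average of f over the samples."""
    return sum(f(x) for x in samples) / len(samples)

random.seed(0)
p = 0.3  # made-up Bernoulli parameter, for illustration only
# Draw M IID samples from Bernoulli(p) -- our stand-in for "P".
samples = [1 if random.random() < p else 0 for _ in range(100_000)]

# P(X = 1) is the expectation of the indicator function 1{x = 1}.
estimate = empirical_expectation(samples, lambda x: 1 if x == 1 else 0)
```

Estimating the probability of an event is just the special case where f is the indicator function of that event, which is exactly the point made above.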

Â 9:48

And once again we have a bound on the error, which is written over here, and once again the number of samples appears in the exponent, which makes us happy. We have the same type of epsilon-squared term, which shows the dependence on the tolerance that's required. But here we have one other term, which is the actual value p that we're trying to estimate. So given this bound on how far you are away from p, a bound on the tolerance epsilon, and also a bound on the probability with which you get a bad sample set, we can now say: for a given error probability that we want, that is, if we want to guarantee that we're within epsilon with probability greater than one minus delta, we need this many samples. And that's fairly straightforward algebra: you simply say this quantity is less than or equal to delta, then take logs and move things around, and it all comes out to be exactly this. And for the Chernoff bound we have a similar expression, and that gives us exactly the same kind of bound on M as a function of epsilon and delta, and in this case p as well.
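The algebra alluded to here can be sketched as follows, assuming the common textbook forms of the two bounds (the exact constants vary between statements of the Chernoff bound, so treat these as illustrative rather than the lecture's exact expressions):

```python
import math

def hoeffding_samples(epsilon, delta):
    """Additive guarantee (Hoeffding): with
    M >= ln(2/delta) / (2 * epsilon^2) samples,
    |P_D - p| <= epsilon with probability >= 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

def chernoff_samples(epsilon, delta, p):
    """Multiplicative guarantee (one common Chernoff form): with
    M >= 3 * ln(2/delta) / (p * epsilon^2) samples,
    P_D is within a factor (1 +/- epsilon) of p with probability
    >= 1 - delta. Note the 1/p factor."""
    return math.ceil(3 * math.log(2 / delta) / (p * epsilon ** 2))

# For a 1% additive tolerance at 95% confidence:
m_additive = hoeffding_samples(epsilon=0.01, delta=0.05)
# For 10% relative error when the true p is itself only 1%:
m_relative = chernoff_samples(epsilon=0.1, delta=0.05, p=0.01)
```

Note how the multiplicative sample count scales as one over p, which is exactly the "no free lunch" point made below.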

So this looks great, right? You give me an epsilon, which is your error tolerance, and a delta, which is the probability with which you're willing to be wrong, and I can say: if you can sample this many samples M, then I can give you those probabilistic guarantees. They're not deterministic, but they're pretty solid. Why is this not a perfect solution to our inference problems? Because each of these bounds has significant limitations. So let's think about the first, which is our additive bound. And let's imagine that you're going in to a doctor and asking, what is the probability that I have some horrible disease? Well, that probability, hopefully for you, is still pretty low. Maybe, if you're unlucky, it's, I don't know, 1% or 2%. Well, consider an additive error epsilon on 1%: the epsilon that you need to get something that's meaningfully bounded when the true probability p is 1% has to be really, really small. You can't do epsilon equals 0.01, because that could move you up from 1% to 2%, and that's a factor-of-two increase in your probability, which is something people really care about, the difference between 1% and 2%. So you'd need something more like 0.0001 or maybe 0.00001, depending on how confident you want to feel. And now this epsilon-squared term over here is beginning to look pretty scary in terms of the number of samples required to get a reasonable bound.

So you might come and say, well, fine, let's use the Chernoff bound, because that gives me relative errors on p. So if p is small, then by all means I can go ahead and use it: it's a multiplicative factor of p, so I can say p plus or minus 1% of p. Well, unfortunately, there's no free lunch, because if p is small, notice that it appears here in the denominator, and so once again we need a number of samples that could potentially be quite large when we're dealing with small probabilities. So the main message from this is that sampling-based estimation is a reasonable thing to do when p is not too small. When p is not too small, this works fine; when p begins to get smaller, the tractability of this is more in doubt.

Now that we understand when we might expect this to work, let's think about how we might apply it in the context of Bayesian networks. So here is our little baby network that we've used for the student example, and what we'd like to do is generate samples from this distribution. This is the distribution, remember, P(D, I, G, S, L), and we'd like to sample from this high-dimensional distribution. Not that high, but still. The way in which this is done is actually very natural when you think about the causal intuitions, or forward flow, of a Bayesian network. We start out, say, by sampling the difficulty variable, which is sampled from this distribution over here, 0.6 versus 0.4. So we use the trick that we showed a couple of slides ago, and say it comes out to be zero; I'm going to write down the zero over here. Now I'm going to sample I, and I'm going to toss a coin with probability 0.7 versus 0.3, and say it comes out i1. Now I get to sample grade. And because I previously sampled difficulty and intelligence, I know exactly what distribution grade needs to be sampled from: it's the distribution for i1, d0. So I now sample one of the three values of grade from the distribution in this row that I've picked out in the CPD, and say it comes out g1.
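The process just described can be sketched in code. P(D) and P(I) below follow the numbers quoted in the lecture (0.6/0.4 and 0.7/0.3); the grade CPD rows are invented purely for illustration, since the actual table isn't visible here:

```python
import random

# CPDs for part of the student network (grade rows are hypothetical).
P_D = [0.6, 0.4]              # P(d0), P(d1)
P_I = [0.7, 0.3]              # P(i0), P(i1)
P_G_GIVEN_ID = {              # (i, d) -> distribution over g1, g2, g3
    (0, 0): [0.30, 0.40, 0.30],
    (0, 1): [0.05, 0.25, 0.70],
    (1, 0): [0.90, 0.08, 0.02],
    (1, 1): [0.50, 0.30, 0.20],
}

def sample_discrete(dist):
    """The basic trick: split [0, 1) into intervals whose lengths are the
    given probabilities and see where a uniform draw lands."""
    u, cumulative = random.random(), 0.0
    for value, prob in enumerate(dist):
        cumulative += prob
        if u < cumulative:
            return value
    return len(dist) - 1  # guard against floating-point round-off

def forward_sample():
    """Sample variables in topological order, each from the CPD row
    selected by its already-sampled parents."""
    d = sample_discrete(P_D)
    i = sample_discrete(P_I)
    g = sample_discrete(P_G_GIVEN_ID[(i, d)])
    return {"D": d, "I": i, "G": g}
```

Sampling the parents first is what makes this work: by the time we reach G, the row of its CPD is fully determined.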

Â 17:04

And so if I want to use this process, which is very naturally called forward sampling, to estimate the probability of some assignment, little y, to some set of query variables Y, then what we're going to do is generate a bunch of samples from that Bayes net, as many as we think is adequate. And then, if I'm interested in some particular event, I simply compute the fraction of my samples where Y = y. I can use that same procedure for computing other expectations: if I have any other function of the sample, I can compute the function on each sample and then average it out in exactly the same way that we showed on the first slide. And now we can go ahead and apply the bounds that we showed before: for an additive bound we need this many samples, and for a multiplicative bound we need this many samples. These are just copying over the two bounds that we showed before. So that's great.

But notice that this gave me queries without evidence. What do we do if we want to ask a question not just about the probability of Y, but about the probability of Y given some evidence, E = e? Well, now it's not that easy to sample from a Bayesian network if I have evidence that's not at the root. If I have evidence that's at the root, that's fine: I know the value of that variable, and I can just stick that in and ignore everything else. But if I observe a variable somewhere in the middle or the bottom of the network, then what happens when I get there? Do I sample it? Do I ignore it? It seems a bit unclear. And so the simple solution to that, and we'll talk about others, is an algorithm called rejection sampling. What it does is the most naive thing that you could possibly imagine: I generate samples from the Bayes net, and then, if they don't agree with my evidence, I throw them out.

Â 20:42

So now let's go back to, for example, a medical diagnosis setting, or, for that matter, the message-decoding example that we talked about previously. How likely is your average evidence? Let's think about medical diagnosis. A person comes in, and they have their age, and their weight, and their gender, and, you know, 32 symptoms, and ten test results, and the fact that they went on a trip to Europe last week. All of these are pieces of evidence, and the question is, how likely is that configuration of evidence? The answer is: vanishingly small. How likely is a random person off the street to exhibit this exact combination? Almost zero. And similarly with the message example: how likely are you to get a particular configuration of noisy bits? Exponentially small also. So the number of samples needed grows exponentially with the number of observed variables. The more variables you've observed, the lower the probability of the evidence, and this is an exponential decay.

And so that means that if you observe more than a couple of variables, this is basically not a good algorithm. So to summarize: algorithmically, this is a very simple procedure. It's very easy to generate samples from a Bayesian network, and we showed how all these pieces fit together. And once you've done that, there are really elegant and theoretically compelling epsilon-delta bounds, as we've shown before; but their usefulness, unfortunately, is limited. The additive bounds tend to be useless for low-probability events. The multiplicative bounds are not good either, because the number of samples grows as a function of one over the probability of the event that we're trying to estimate. And all of this is completely irrelevant as soon as you have evidence, because then the number of required samples grows exponentially with the number of observed variables, which means this is not a feasible solution for most practical settings.

And one final note that's worth highlighting. This is the only case in, I think, the entire section on inference where I talked about Bayesian networks specifically. And that is because forward sampling, unlike any of the other inference algorithms that we're going to talk about, is just not feasible for Markov networks. There's no notion of starting at the root; there is no root. There is no variable that you would naturally start with whose probability you actually know. And so we don't apply forward sampling to Markov networks.
