0:00

We've already said that there are many different algorithms for inference in graphical models, but the simplest and most fundamental is an algorithm typically known as variable elimination. Let's consider the variable elimination algorithm in the context of a simple example: a graphical model structured as a chain. Here we're interested in computing the distribution over the variable E, so we're interested in P(E). And as we've already said, that probability is proportional to the unnormalized measure P-tilde over A, B, C, D, and E, summing out all of the variables except for E.

So now let's see how we can do this more efficiently than simply constructing the joint distribution and then summing things out. The first thing we do is write this unnormalized measure as a product of its constituent factors. And for the moment we're going to assume that we only have pairwise factors, one for each edge in this graph. So we have a factor for A, B; a factor for B, C; one for C, D; and one for D, E. Those are the factors phi_1 up to phi_4.

Now, what is the first observation that we can make when we see the summation over A, B, C, and D of a product of factors? Well, we've already done this exercise previously, when we were doing some proofs related to graphical models: if we have a factor that doesn't mention a particular variable in its scope, we can move it out of the summation over that variable. So specifically, phi_2(B, C) can be moved out of the summation over A, as can phi_3(C, D) and phi_4(D, E), which leaves us only with the summation over A of phi_1(A, B). So that gives us the expression over here.

Now, this is a summation over a pairwise factor, and the result is a factor over the single variable B, which we're going to call tau_1(B). And so we end up with an expression that looks like this. Let's continue developing this expression, noting that we now have an expression that doesn't involve A, only the variables B, C, D, and E, so that, effectively, we have eliminated A from our graphical model.

So let's go back to this expression. We now have this product of four factors, and once again we can look at which factors involve the variable B and which ones don't. The ones that don't can be moved out of the summation, just as before, giving us this, where now we have a product of these two factors, summed over B. And this is going to give us an expression whose scope is C. So we now have an expression that does not involve the variable B, and so we've eliminated B from this graphical model. We can similarly continue to eliminate C and D, so that, ultimately, we end up with an expression that involves only the variable E. And that expression is going to be proportional to the marginal probability of E.
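The chain computation just described can be sketched numerically. This is a minimal sketch with random pairwise tables standing in for the chain factors (the sizes and tables are hypothetical, just for illustration); eliminating A, B, C, and D one at a time gives the same marginal as building the full joint and summing it out:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3  # hypothetical: each variable takes k values

# Pairwise factors along the chain A - B - C - D - E.
phi1 = rng.random((k, k))  # phi1(A, B)
phi2 = rng.random((k, k))  # phi2(B, C)
phi3 = rng.random((k, k))  # phi3(C, D)
phi4 = rng.random((k, k))  # phi4(D, E)

# Variable elimination: push each sum inward, one variable at a time.
tau1 = phi1.sum(axis=0)                    # tau1(B) = sum_A phi1(A,B)
tau2 = (tau1[:, None] * phi2).sum(axis=0)  # tau2(C) = sum_B tau1(B) phi2(B,C)
tau3 = (tau2[:, None] * phi3).sum(axis=0)  # tau3(D) = sum_C tau2(C) phi3(C,D)
tau4 = (tau3[:, None] * phi4).sum(axis=0)  # tau4(E), proportional to P(E)
p_e = tau4 / tau4.sum()                    # renormalize to get P(E)

# Brute force for comparison: build the full joint, then sum out A, B, C, D.
joint = np.einsum('ab,bc,cd,de->abcde', phi1, phi2, phi3, phi4)
p_e_brute = joint.sum(axis=(0, 1, 2, 3))
p_e_brute = p_e_brute / p_e_brute.sum()

assert np.allclose(p_e, p_e_brute)
```

Note that each elimination step only ever touches a k-by-k table, so the work grows linearly in the length of the chain, whereas the brute-force joint has k^5 entries.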

Now let's do this algorithm in the context of a somewhat more complicated example, which is our enhanced student network that we've played around with before. Let's imagine that our goal is to compute the probability of the variable J. In order to do that, we're going to have to eliminate from the joint distribution all of the other variables except for J. So this is our expression. And note that we have this product of factors: I've already taken the CPDs in this expression and turned them into factors, so that we can have a consistent notation. Now we need to eliminate every one of the variables except for J.

We're going to start with eliminating the variable C first. Once again, we take the summation over C and push it in, leaving inside the summation only the factors that involve C, and those are phi_D and phi_C. Multiplying them together and eliminating C gives us a factor which we're going to call tau_1, and whose scope is D.

Â 4:34

Okay? By putting tau_1 into this expression and removing the factors that we've just multiplied together, we end up with this expression over here. Having eliminated C, we now go ahead and eliminate D. So, here we have the variable D. And which factors involve D? Well, tau_1(D) and phi_G(G, I, D). So everything else is taken out of the summation, and we just have this expression over here. And that's going to give us tau_2, whose scope is G and I, after having eliminated the variable D from this product whose scope is G, I, and D. And tau_2 gets put back into the bucket together with everything else.

Moving forward, we're now interested in eliminating I. The factors that involve I are tau_2(G, I), phi_S(S, I), and phi_I(I), and so we go ahead and multiply them together to give us a factor whose scope is G, I, S; eliminating I gives us a factor tau_3 whose scope is S and G. And so the process continues; let's just finish it all the way to the end.
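The chain of tau factors just computed can be reproduced numerically. This is a sketch with hypothetical random tables standing in for the student-network CPDs (all variables binary here except G, which gets three values, purely for illustration); `np.einsum` plays the role of factor product followed by summation:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical tables for the factors that matter in the first three steps.
phi_C = rng.random(2)          # phi_C(C)
phi_D = rng.random((2, 2))     # phi_D(D, C)
phi_I = rng.random(2)          # phi_I(I)
phi_G = rng.random((3, 2, 2))  # phi_G(G, I, D) -- G takes 3 values
phi_S = rng.random((2, 2))     # phi_S(S, I)

# tau1(D) = sum_C phi_D(D,C) phi_C(C)
tau1 = np.einsum('dc,c->d', phi_D, phi_C)
# tau2(G,I) = sum_D phi_G(G,I,D) tau1(D)
tau2 = np.einsum('gid,d->gi', phi_G, tau1)
# tau3(S,G) = sum_I tau2(G,I) phi_S(S,I) phi_I(I)
tau3 = np.einsum('gi,si,i->sg', tau2, phi_S, phi_I)

# Sanity check: the tau chain agrees with summing out C, D, I all at once.
brute = np.einsum('c,dc,i,gid,si->sg', phi_C, phi_D, phi_I, phi_G, phi_S)
assert np.allclose(tau3, brute)
```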

Now our goal is to eliminate H. Well, H is a little bit of an interesting case. The only factor that mentions H is the factor phi_H, and if we think about what phi_H is, it happens to be P(H | G, J). So if we're summing that over H, we're actually summing up what is, in fact, a conditional distribution. And since we know that when you sum a conditional distribution over the values of the variable on the left-hand side, the summation is necessarily equal to one, in principle we could have taken this entire expression, erased it, and written one instead. That would have given us a factor that doesn't depend on anything, which would have just disappeared. But for purposes of demonstration we're not actually going to do that, because in fact not every algorithm is clever enough to notice these kinds of coincidences; it depends on whether it was designed to look for that. And so we're going to do this in the same sort of naive way that we've done before, which is just to turn this into a factor, which is going to be tau_4 of G and J.

7:38

So now we have that factor, which really is identically one. But we're not going to pay attention to that particular aspect, for the purposes of demonstrating how the algorithm would work. Okay?
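The observation that summing a CPD over its own variable gives one can be checked directly. A minimal sketch, where the table is a random hypothetical CPD standing in for P(H | G, J), not the one from the network:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical CPD P(H | G, J): for each (G, J), a distribution over H.
raw = rng.random((2, 3, 2))                   # axes: H, G, J
phi_H = raw / raw.sum(axis=0, keepdims=True)  # normalize over H

# Summing the CPD over its own variable gives 1 for every parent setting,
# so tau4(G, J) = sum_H phi_H(H, G, J) is a table of all ones.
tau4 = phi_H.sum(axis=0)
assert np.allclose(tau4, 1.0)
```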

Next is eliminating G. We have this expression, which we inherited from the previous line, and G is one of the big ones, because it appears in phi_L, tau_3, and tau_4. So when we look at the variables in scope, we see that this product over here actually has a scope of L, G, S, J, which is the largest scope we've encountered so far. Summing out G, we end up with a factor whose scope is L, S, and J. So we still have L and S to eliminate. We put that factor back into the expression, and out comes a product of two factors, and really at this point we may just as well multiply them together and sum out L and S, ending up with a factor over J.

Â 9:09

So that gives us variable elimination in its naive form. What about variable elimination with evidence? Well, we've already basically established how to deal with evidence. If we're interested, for example, in solving the query P(J | I = i, H = h), the way in which we do that is by computing the probability of the joint event, P(J, I = i, H = h). And the way in which we do that is by reducing the factors to correspond to this evidence. So we take each of the factors that involve I, and we instantiate it to the particular value little i, and similarly for little h. And so we see here, for example, that whereas phi_I initially depended on I, now it doesn't depend on anything, because it is simply the value phi_I(i), which is a constant. And whereas phi_G, for example, originally depended on G, D, and I, as we can see in the original diagram, the reduced factor doesn't depend on I, and is really the probability of G given little i and D. The same reduction occurs for H = h, and so we end up with the following set of reduced factors. And now, once we have that set of reduced factors, we proceed exactly as before.
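Reducing a factor to the evidence is just selecting the matching slice of its table at the observed value. A small sketch, with hypothetical random tables for the factors that mention I:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical factors that mention I (binary here):
phi_I = rng.random(2)          # phi_I(I)
phi_G = rng.random((3, 2, 2))  # phi_G(G, I, D)
phi_S = rng.random((2, 2))     # phi_S(S, I)

i_obs = 1  # evidence: I = little i

# Reducing a factor to the evidence selects the slice where I = i_obs:
phi_I_red = phi_I[i_obs]        # no variables left -- just a constant
phi_G_red = phi_G[:, i_obs, :]  # scope shrinks from {G, I, D} to {G, D}
phi_S_red = phi_S[:, i_obs]     # scope shrinks from {S, I} to {S}

assert phi_G_red.shape == (3, 2)
assert phi_S_red.shape == (2,)
```

After this reduction, I no longer appears anywhere, which is exactly why the algorithm never needs to eliminate an observed variable.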

There are no changes whatsoever to the algorithm. The only aspect that's a little bit different is that we do not eliminate H or I, because there's no point eliminating a variable that has only a single value; there's no need to sum over it. So this gives us a unified framework for dealing with queries, whether they involve evidence or not. And then how do we get the probability of J given the evidence? Well, this is straightforward: we simply renormalize.

Â 11:28

Because we can take this and simply divide by what turns out to be the probability of the evidence, the normalizing constant. Okay. So, let's see whether the same idea applies to Markov networks. Here is our simple Markov network with four variables. Let's imagine that our goal is to compute P(D), and so in order to do that we need to eliminate A, B, and C from the unnormalized measure. So we have this being the unnormalized measure, and we're summing over A, B, and C.

And the process works in exactly the same way. If we want to sum out A first, then here are the factors that involve A: phi_1(A, B), this one, and phi_4(A, D). We multiply them together, we get a factor whose scope is A, B, D, and then we sum out A to get a factor whose scope is B, D. That gives us a new set of factors where A has effectively been eliminated from the graphical model. And at the end of the process, we get a factor over the single remaining variable D. So, tau_3(D), and that factor is not the probability of D; it's proportional to the probability of D. It's actually equal to P-tilde(D), the unnormalized measure, and so in order to get P(D) we renormalize.

So, to summarize: the main routine in this algorithm is a routine which we call "eliminate variable Z from a set of factors Phi", and what it does is the following. We first look within Phi, and we define a set of factors Phi-prime, which are all the factors that involve Z. That's what this mathematical expression says: the factors phi_i such that Z is in their scope. We take all those factors and we multiply them, and then we sum out the variable Z, which is the one that we want to eliminate. Now, and here is the important point: we've already used up these factors; these ones over here have now been used, and we don't want to reuse them. So we take them out of the set of factors, and instead we introduce the one we just created by multiplying those factors and summing out Z.

This basic operation is what we use in the context of the algorithm as a whole. We begin by reducing all factors by the evidence, which is just eliminating the rows that are not consistent with the observations; that is what gets us our set of factors Phi. Now, for each non-query variable, we need to eliminate it. So we run, one at a time, the routine that eliminates the variable Z from the set of factors Phi. Each such step changes the set of factors: it removes the factors Phi-prime from the previous line, and it adds a new factor tau.

Â 15:33

And then finally, at the very end, when all variables have been eliminated, there may be one or more factors remaining, at which point we multiply all the remaining factors and then normalize to get a distribution.
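The whole procedure just described (eliminate each non-query variable, multiply what remains, renormalize) can be sketched as follows. This is a minimal sketch, not the lecture's own code: factors are (scope, table) pairs, and the helper names and demo tables are my own, with evidence reduction omitted for brevity.

```python
import numpy as np

# A factor is a (scope, table) pair: scope is a tuple of variable names,
# table is an ndarray with one axis per scope variable, in scope order.

def _spec(factor_scopes, out_scope, letters):
    # Build an einsum subscript string from named scopes.
    ins = ','.join(''.join(letters[v] for v in s) for s in factor_scopes)
    return ins + '->' + ''.join(letters[v] for v in out_scope)

def eliminate_var(factors, z):
    """Multiply every factor whose scope mentions z, sum z out,
    remove the used factors, and add the new tau."""
    used = [f for f in factors if z in f[0]]
    rest = [f for f in factors if z not in f[0]]
    scope = []
    for s, _ in used:
        for v in s:
            if v not in scope:
                scope.append(v)
    letters = {v: chr(ord('a') + i) for i, v in enumerate(scope)}
    out = tuple(v for v in scope if v != z)
    tau = np.einsum(_spec([s for s, _ in used], out, letters),
                    *[t for _, t in used])
    return rest + [(out, tau)]

def variable_elimination(factors, order):
    """Eliminate the variables in `order`, multiply the remaining
    factors, and renormalize to a distribution over what is left."""
    for z in order:
        factors = eliminate_var(factors, z)
    scope = []
    for s, _ in factors:
        for v in s:
            if v not in scope:
                scope.append(v)
    letters = {v: chr(ord('a') + i) for i, v in enumerate(scope)}
    table = np.einsum(_spec([s for s, _ in factors], tuple(scope), letters),
                      *[t for _, t in factors])
    return table / table.sum()

# Demo on the chain A - B - C - D - E, querying P(E).
rng = np.random.default_rng(0)
phis = [(('A', 'B'), rng.random((2, 2))),
        (('B', 'C'), rng.random((2, 2))),
        (('C', 'D'), rng.random((2, 2))),
        (('D', 'E'), rng.random((2, 2)))]
p_e = variable_elimination(phis, ['A', 'B', 'C', 'D'])
```

The key invariant is the one stated in the summary: before a variable is summed out, every factor mentioning it has been multiplied into the product, and the used factors are removed so they are never multiplied in twice.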

So, to summarize: this is a very simple algorithm. It works equally well for Bayes nets and Markov nets, and it uses factor product and factor summation steps. It works by ensuring that when you do the summation step over a variable Z, all factors involving Z have been multiplied in, which is the critical piece of the correctness of this algorithm.

Â