0:02

So as we said at the very beginning, there are two main families of graphical models: those that are based on directed acyclic graphs, and those that are based on undirected graphs. The undirected graphical models are typically called Markov networks; they're also called Markov random fields. We're going to start by talking about the simplest subclass of those, which is pairwise Markov networks, and then we're going to generalize it.

So let's look again at a toy example just to illustrate what's going on. This is an example of four people who are studying together in study pairs. Notice that Alice and Charles don't get along, and Bob and Debbie had a bad breakup, so they don't talk to each other either.

And so really, we only have the study pairs that are marked by the edges on this diagram. The random variables here indicate whether the students have a particular misconception, because the material was a little confusing; so each random variable says, does this student have this particular misconception or not. The intuition here is that if two students study together, then they influence each other. So, for example, if Alice and Bob study together, then this edge indicates that if one of them has the misconception, the other one is likely to have the misconception.

Now, this doesn't fit neatly into the purview of a directed graph, because the influence flows in both directions, and so you can't really point an arrow from Alice to Bob or from Bob to Alice. So we're going to use an undirected graph to represent this. That's great, but how do you parameterize an undirected graph? You no longer have the notion of a conditional probability distribution, because there's no variable that's conditioned and no variable that you condition on. So how do we do this?

We're going to use the general notion of a factor, which we defined previously. And notice that this really is a general factor, in that the numbers aren't even within the range [0, 1]. Now, what do these factors mean? These factors have many names: they're called affinity functions, or compatibility functions.
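As a minimal sketch of this idea, a general factor is just a table mapping joint assignments to nonnegative numbers. The specific values below (30, 5, 1, 10) are assumed from the lecture's slide for this example, not stated in the transcript itself:

```python
# A general factor over two binary variables A and B: a table mapping
# each joint assignment to a nonnegative number (not necessarily in [0, 1]).
# Values assumed from the lecture's slide for phi1(A, B).
phi1 = {
    (0, 0): 30.0,  # neither student has the misconception: happiest
    (0, 1): 5.0,
    (1, 0): 1.0,
    (1, 1): 10.0,  # both have the misconception: second happiest
}

# The "happiest" assignment is simply the one with the largest value.
happiest = max(phi1, key=phi1.get)
```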

Â 2:24

So what these numbers mean is the local happiness of the variables A and B with a particular joint assignment. Here we can see that the happiest assignment, as far as A and B are concerned in isolation of everything else, is A = 0, B = 0. This is the case where neither student has the misconception; that's the happy assignment. We see that the second happiest assignment is A = 1, B = 1, where again the students agree, and in this case they both have the misconception. And finally, the other two in the middle are the least happy of all.

Now, this is a local happiness, and we have similar notions of happiness for the other pairs in the graph. So in this case we see not only that there is a strong sentiment in favor of agreement, but that it's much stronger than in the A, B case. B and C really like to agree with each other; it's very difficult for them to have opposing opinions, okay?

On the other hand, Charles and Debbie like to argue with each other all the time. If one of them says that it's going to rain today, the other one is going to say that it's sunny today. And so you can see that the preferred assignments, as far as their local opinion is concerned, are the ones where they disagree with each other. Okay? And again, A and D like to agree.

So this is describing the overall state by a bunch of little pieces, and to put these pieces together to define a joint probability distribution, we're going to use the notion of a product of factors. So here we are: we're going to take all these factors and multiply them together. That's great, except that this is in no way, shape, or form a probability distribution, because its numbers aren't even in the interval [0, 1], which is why you'll notice there's a little tilde on top of the P. This tilde denotes an unnormalized measure,

Â 4:38

okay? So how do we turn an unnormalized measure into a probability distribution? Well, we normalize it. Actually, sorry, before that: here is the unnormalized measure, shown here just to highlight the point. How do we turn this unnormalized measure into a probability distribution? We normalize it, and that's the normalization here. It has a name: it's called the partition function, for historical reasons that come from its origins in statistical physics, and I'm not even going to describe why it's called that, but that's what it's called. You can think of it simply as the normalizing constant that is going to make all of these sum to one. So we're going to get it by simply summing up all these entries, which gives us the value Z. And if we divide all of these entries by Z, we get a normalized probability distribution, and that is the probability distribution defined by this graph.
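The whole construction can be sketched in a few lines: multiply the factors over every joint assignment, sum the entries to get Z, then divide. The four factor tables below are assumed from the lecture's slides for this example:

```python
from itertools import product

# Pairwise factors; values assumed from the lecture's slides.
phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}    # phi1(A, B)
phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}  # phi2(B, C)
phi3 = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}  # phi3(C, D)
phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}  # phi4(D, A)

# Unnormalized measure P~: the product of all factors, per assignment.
p_tilde = {
    (a, b, c, d): phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]
    for a, b, c, d in product([0, 1], repeat=4)
}

# Partition function Z: the sum of all entries of the unnormalized measure.
Z = sum(p_tilde.values())

# Dividing every entry by Z yields a proper probability distribution.
p = {assignment: value / Z for assignment, value in p_tilde.items()}
```

With these assumed values, the entries of `p` sum to one, and the single most probable joint assignment is A = 0, B = 1, C = 1, D = 0, which is the case discussed below.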

Â 5:35

Now let's think about what these factors mean. Consider the factor phi one of A, B, which is this local happiness between A and B, and let's think about how it relates to the probability distribution. We might think that this is the marginal probability of A and B in the joint distribution, or maybe it's the conditional distribution of A given B, or maybe B given A, or maybe it's the joint probability of A and B given C or D. The answer is none of the above.

Let's go back and look at what this actually means in this particular context. So here we have the set of factors that we used to construct this distribution. And here, trust me, is the marginal probability of A and B as defined by the set of factors, Phi. I'm going to use a little subscript Phi here to denote the fact that it was derived from the set of factors Phi.

Â 7:04

Let's compare this distribution to the factor phi one. We can see that this does not respect the fact that A and B like to agree with each other. In phi one, A and B like to agree with each other by a lot; remember, this value is three times higher than the next highest value. But here, that assignment has probability only 0.13, and the other high-value assignment on this side, the one that had 10, has only 0.04. What is the single highest assignment here? It's this one.

Â 7:46

Let's think about why that is. This probability distribution is constructed by multiplying all four of these factors. If you think about what's going on here, you see that B really, really likes to agree with C, so these two are really closely tied together. And A and D similarly like to agree, so these two are really closely tied together as well. And C and D strongly like to disagree: they like to have opposite values. Now, all three of these factors are actually stronger than phi one; that is, the differences between the assignments are bigger than in phi one. So where are you going to break the cycle? You can't have D agreeing with A, A agreeing with B, B agreeing with C, and C disagreeing with D. It doesn't work. So somewhere this loop has to be broken, and the place where it gets broken is between A and B, because that's the weakest factor.

So the A, B probability is actually some kind of complicated aggregate of the different factors that are used to compose the Markov network. And this is actually an important point, because it's going to come back and haunt us in later parts of the course. There isn't a natural mapping between the probability distribution and the factors that are used to compose it. You can't look at the probability distribution and say, aha, this piece of it is what came from that factor. This is in direct contrast to Bayesian networks, where the factors were all conditional probabilities, and you could just look at the distribution and compute them. Here you can't do that, and that often turns out to affect things like how we can learn these factors from data, because you can't just extract them directly from the probability distribution.
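The "none of the above" point can be checked directly: sum the normalized joint over C and D to get the marginal of A and B, and compare it with phi one. Again, the factor values are assumed from the lecture's slides:

```python
from itertools import product

# Same assumed factor tables as on the lecture's slides.
phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}
phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}
phi3 = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}
phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}

# Unnormalized measure and partition function, as before.
p_tilde = {
    (a, b, c, d): phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]
    for a, b, c, d in product([0, 1], repeat=4)
}
Z = sum(p_tilde.values())

# Marginal P(A, B): sum the normalized joint over C and D.
marginal_ab = {
    (a, b): sum(p_tilde[a, b, c, d] for c, d in product([0, 1], repeat=2)) / Z
    for a, b in product([0, 1], repeat=2)
}
# phi1 strongly prefers (0, 0), yet the marginal puts most of its mass
# on (0, 1): the disagreement in the cycle is pushed onto the weakest
# factor, the one between A and B.
```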

So with that intuition, we can now go ahead and define a pairwise Markov network. I'm defining it explicitly because pairwise Markov networks are a sufficiently commonly used subclass of general Markov networks that it's worth giving them their own place. A pairwise Markov network is an undirected graph whose nodes are the random variables X1 up to Xn, and each edge connecting Xi to Xj is associated with a factor, also known as a potential, phi ij of Xi, Xj. That's a pairwise Markov network.

And here's an example of a slightly larger Markov network. This is a Markov network in the form of a grid. This is the kind of network that's used, for example, when we're doing various operations on images, because then the variables correspond to pixels. And this is the Markov network that corresponds to image segmentation when we're using superpixels, in which case it's no longer a regular grid.
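As a sketch of how such a grid network might be set up, here is a hypothetical helper (not from the lecture) that enumerates the nodes and edges of an n-by-m pixel grid; in a real image model, each edge would additionally carry its own pairwise potential:

```python
def grid_markov_network(n_rows, n_cols):
    """Build the node and edge sets of a grid-shaped pairwise Markov
    network, with one variable per pixel (hypothetical helper)."""
    nodes = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    edges = []
    for i in range(n_rows):
        for j in range(n_cols):
            if j + 1 < n_cols:           # edge to the horizontal neighbor
                edges.append(((i, j), (i, j + 1)))
            if i + 1 < n_rows:           # edge to the vertical neighbor
                edges.append(((i, j), (i + 1, j)))
    return nodes, edges
```

For an n-by-m grid this produces n*(m-1) horizontal plus (n-1)*m vertical edges; a superpixel segmentation graph would use the same node/edge structure, just without the regular grid pattern.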

Â