0:00

So we've previously seen the notion of pairwise Markov networks,

but now we're going to define a much more general notion, that is considerably more

expressive than the Pairwise case. And that definition is called the Gibbs

distribution. So in order to motivate the notion of a

Gibbs distribution, let's look at the most expressive Markov network that we

could possibly define in the context of pairwise interactions.

So here we have four random variables, A, B, C, and D, and I've introduced all of the possible pairwise edges between them. And so the question we'd like to ask ourselves is: is this good enough? That is, is this fully expressive?

0:49

Or in other words, can it represent any probability distribution over four random variables? One way to convince ourselves of whether it can or it can't is to consider the general case and look at a little bit of asymptotics.

So consider a fully connected pairwise Markov network over n random variables, and let's assume that each variable Xi has d values. How many parameters does the network have? For the moment, let's focus just on the pairwise potentials, and ignore the potentials associated with single nodes. A little bit of analysis: if we have n variables, how many edges are there in the network, and how many parameters per edge?

2:06

Right. Can we represent any probability distribution using O(n^2 d^2) parameters? How many parameters are there in a general probability distribution over n random variables, where each has d values? How many free parameters do we have? Well, d^n - 1, which is much, much bigger than that. Which means that, even thinking about it intuitively without getting into a formal argument, pairwise Markov networks are not sufficiently expressive to capture all probability distributions. So how do we increase the coverage of this undirected representation? We need to move away from pairwise edges.
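To make the counting concrete, here is a small Python sketch (the function names are mine, not from the lecture) comparing the two parameter counts:

```python
from math import comb

def pairwise_param_count(n, d):
    """Parameters of a fully connected pairwise Markov network:
    one d x d table per edge, n*(n-1)/2 edges (node potentials ignored)."""
    return comb(n, 2) * d * d

def full_joint_param_count(n, d):
    """Free parameters of an arbitrary joint distribution over n
    d-valued variables: d**n entries minus one normalization constraint."""
    return d ** n - 1

# Even for modest sizes the gap is enormous:
n, d = 10, 2
print(pairwise_param_count(n, d))    # 45 edges * 4 entries = 180
print(full_joint_param_count(n, d))  # 2**10 - 1 = 1023
```

The pairwise count grows polynomially while the general joint grows exponentially, which is the gap the lecture is pointing at.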

2:55

So, in order to parameterize what we call a general Gibbs distribution, we're going

to parameterize it using general factors, each of which has a scope that might

contain more than two variables. So whereas before we just had factors

over pairs, now we have factors over triplets, quadruplets, and anything else.

Can we now represent every probability distribution?

Well, sure, because we can have a factor over all n variables together, which immediately allows us to define a general probability distribution. In fact, we even said that a probability distribution is a factor whose scope is X1 up to Xn.

3:52

And we're going to define this distribution in two steps.

Three steps, even. The first thing we're going to do, just like in the case of pairwise Markov networks, is take all of the factors (here we have k factors) and multiply them. This is just the familiar operation of factor product, which we've seen in multiple contexts before.

Now, this is perfectly fine, but the problem is that this is not necessarily a probability distribution. In fact, it almost never will be a probability distribution, because we have put no constraints on the factors. That's why we have this tilde here, which highlights the fact that this is what we called previously an unnormalized measure.
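As a concrete sketch of this construction, here is the factor product over all k factors in Python, together with the normalization by the partition function that follows (the dictionary representation of factors and the specific factor values are purely illustrative, not from the lecture):

```python
from itertools import product

def gibbs_distribution(scopes, factors, variables, values=(0, 1)):
    """Multiply all factors into an unnormalized measure (the tilde-P),
    then normalize by the partition function Z."""
    unnormalized = {}
    for assignment in product(values, repeat=len(variables)):
        full = dict(zip(variables, assignment))
        score = 1.0
        for scope, factor in zip(scopes, factors):
            score *= factor[tuple(full[v] for v in scope)]
        unnormalized[assignment] = score   # no constraints: need not sum to 1
    Z = sum(unnormalized.values())         # the partition function
    return {a: s / Z for a, s in unnormalized.items()}

# Two toy binary factors with made-up values:
phi1 = {(a, b): 1.0 + a + b for a in (0, 1) for b in (0, 1)}  # scope (A, B)
phi2 = {(b, c): 2.0 - b * c for b in (0, 1) for c in (0, 1)}  # scope (B, C)
P = gibbs_distribution([("A", "B"), ("B", "C")], [phi1, phi2], ["A", "B", "C"])
assert abs(sum(P.values()) - 1.0) < 1e-12  # now a proper distribution
```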

4:58

We then define the partition function, which we get by summing the unnormalized measure over all possible assignments to the variables X1 up to Xn. That partition function can then be used to divide all of the entries in the unnormalized measure, to give us something that shouldn't have a tilde on it, but rather is a probability distribution, P sub phi of X1 up to Xn.

Now, that was just a definition of the distribution in terms of the set of factors. Where is the Markov network in all this? So let's think about what Markov network we would like to have for a Gibbs distribution with a certain set of factors phi. In order to get the intuition, let's look at a distribution that involves two factors: phi 1, whose scope is A, B, and C, and phi 2, whose scope is B, C, and D. (I'm going to use a different color for phi 2.)

So, what edges should the Markov network have when we want to encode the fact that A, B, and C all get to interact with each other, together? Intuitively, what we'd like to have is a network that has an edge between A and B, an edge between B and C, and an edge between A and C, because that captures the fact that there are direct probabilistic relationships between all of them.

What about the other one? Here we have edges between B and C, between C and D, and between B and D.

6:59

More generally, if we have a set of factors phi, each of which, each phi i, has a particular scope Di, then the induced Markov network, which we're going to call H sub phi, has an edge between a pair of variables Xi and Xj whenever there exists a factor phi m in phi such that Xi and Xj are both in the scope of the factor phi m. That is, two variables are connected whenever they appear together in the same scope.

7:56

Now we can go ahead and turn this around and define a notion, just like we have for Bayesian networks, of when a probability distribution P factorizes over a graph H. That is, at what point can I represent P over a particular graph H? This basically asks the question: is there a set of factors phi that will let me represent the probability distribution P? So this is just a straightforward walk through the definition we had before: does there exist a phi such that P is equal to P sub phi, where P sub phi is defined, as we defined previously, as a normalized product of factors, and such that H is the induced graph for the set of factors phi? So P factorizes over H if there exists a set of factors phi such that I can represent P using those factors, and H is the induced graph for that set of factors. That's when I can encode a distribution P over a graph H.

9:25

So now let's ask ourselves the question: if you give me a graph H, can I tell you what the factorization of a distribution P over that graph would be? That is, which representation of the distribution did I have in mind when I drew this graph? So here we have this graph over A, B, C, and D. Which Gibbs distribution would induce the graph H? Well, let's look at them one at a time.

Here we have phi 1 of A, B, D and phi 2 of B, C, D, and we see that there's an induced edge between A and B, A and D, and B and D, and correspondingly between B and C, B and D, and C and D. So the answer here is yes: this set of factors would induce this graph. Okay, what about the next one?

10:32

Consider this one, and ask yourself the same question. Well, A, B would induce this edge; B, C, yep; C, D, yep; and then A, D and B, D. Well, huh: here's another distribution, with a different set of factors, that induces the exact same graph. The third one is the same principle: we get the edges A, B; B, D; and A, D, and then we have B, C and C, D. So here's yet another distribution that induces the exact same graph. What that tells us is that we cannot, and this is important, cannot read the factorization from the graph.
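This can be checked mechanically. A small self-contained sketch, applying the induced-edge rule to a triplet factorization and a purely pairwise one (variable names follow the example; the representation is my own):

```python
from itertools import combinations

def induced_edges(scopes):
    """Connect every pair of variables that share a factor scope."""
    return {frozenset(p) for s in scopes for p in combinations(s, 2)}

# A higher-order factorization and a pairwise one over the same variables:
triplets = [("A", "B", "D"), ("B", "C", "D")]
pairwise = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"), ("B", "D")]

# Very different factorizations, exactly the same induced graph:
assert induced_edges(triplets) == induced_edges(pairwise)
```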

11:34

That is, we have different factorizations, with quite different expressive power, all of which induce the exact same graph. And we've already seen an example of that: when we had the fully connected pairwise Markov network, we had one parameterization with O(n^2 d^2) parameters, and we had another parameterization with a full factor over all n variables, so it had on the order of d^n parameters. These are two very different representations, with very different expressive powers, that nevertheless induce the exact same graph.

But then we must ask the question: why is the graph the same? What does the graph really tell us, given that it's not telling us the structure of the factorization? So, going back to the example on the previous slide:

We have these two factorizations: one uses triplet factors, and the second uses pairwise factors. Let's think about what the flow of influence is in these factorizations. When can one variable influence another? Thinking about it intuitively: when can B influence D? Is this different in the two distributions? The answer is, well, not really. Once we have a factor, in this case phi 1 and in that case phi 5, that ties B and D directly, the fact is that B can influence D. What about: can A influence C?

Well, A can influence C by going through, in this case, the A, B, D factor, and then subsequently utilizing the dependencies within the B, C, D factor; and in that case, it can use the A, B factor and then the C, D factor. So the point is that, although the parameterizations of the two distributions are different, the trails in the graph through which influence can flow are the same, regardless of this finer-grained structure of the factorization, which is why the graphs in those two cases are the same.

So let's formalize this definition. We're going to define the notion of an active trail in a Markov network, and this is actually a very simple definition, much simpler than the analogous definition in the context of Bayesian networks. A trail going from X1 up to Xn is active, given a set of observed variables Z, if, basically, no Xi along the trail is in Z, because an active trail only flows through variables that are unobserved. Once we observe a variable along the trail, influence essentially stops, because that variable is now set, and so you can't really influence it; and if you can't influence it, you can't influence anything subsequently along that path. So, for example, the trail from B to D via A is active, but not if A is observed: once I observe A, B can no longer influence D via A.
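The definition translates almost directly into code; a minimal sketch (assuming a trail is given as a list of variable names, which is my own representation):

```python
def is_active_trail(trail, observed):
    """A trail X1 - ... - Xn in a Markov network is active given
    observed set Z iff no variable on the trail is in Z."""
    return not any(x in observed for x in trail)

# The example from the lecture: the trail B - A - D.
assert is_active_trail(["B", "A", "D"], observed=set())      # active
assert not is_active_trail(["B", "A", "D"], observed={"A"})  # blocked by A
```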

15:13

So, to summarize: we defined the notion of a Gibbs distribution, which represents a distribution P as a normalized product of factors. We connected that to a graph structure, the induced Markov network, which connects every pair of nodes that appear together in the same factor. And although we noted that the Markov network structure doesn't fully specify the factorization of P, the justification for why the graphs for different factorizations are the same is that the active trails in a graph depend only on the graph structure.