0:00

Having defined a set of structure scoring criteria, we now turn to the following question: how to optimize that structure score over a set of possible candidate structures. And we're first going to consider the specific case of trees. So, just as a reminder, we define a scoring function that evaluates how well a structure matches the data.

So here is a set of different structures, and here is a graphical depiction of our score, and we would like to search for a structure that maximizes the score. So this is an optimization problem, where the input is a training set, a scoring function, and a set of possible structures, and the output is a network. And it turns out that absolutely the key property for computational efficiency is the decomposability of the score, which is the decomposition of the score as a sum of family scores, which we've seen before.
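The decomposability property referred to here can be written out as follows (a reconstruction using the family-score notation used earlier in the course; the slide's exact symbols may differ):

```latex
\mathrm{score}(\mathcal{G} : \mathcal{D})
  \;=\; \sum_{i=1}^{n} \mathrm{FamScore}\bigl(X_i \mid \mathrm{Pa}^{\mathcal{G}}_{X_i} : \mathcal{D}\bigr)
```

That is, the score of the whole graph is a sum of local terms, one per variable and its parents.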

Â 1:03

Now, the first problem that we're going to focus on is learning either a tree or a forest. So what is a forest? A forest has at most one parent per variable; so here, for example, is one forest. And notice that a forest doesn't have to be connected; that is, it can have separate disconnected components. A tree, on the other hand, would require that the graph be connected, so we might have this edge, for example, as well.

And notice that it's a directed tree. So why should we care about learning trees, given that trees are so degenerate, so restricted in their expressive power? Well, there are different types of reasons why we like trees. First of all, it turns out the math is very elegant. Now, that might not be a convincing reason by itself, but that elegant math gives rise to what turns out to be a very efficient optimization over the set of trees, one that allows me to solve the problem extremely effectively even for very high-dimensional problems. And then, finally, the third reason is that trees have limited expressive power, but that also implies that their parameterization is very sparse, and because their parameterization is sparse, it means they are less susceptible to over-fitting. And therefore, at least in cases where M, the number of training instances, is small relative to the network complexity, it turns out that a tree can provide better generalization, even though it doesn't capture the full expressive power that we would hope to have in our network. So, let's now talk about the problem of

learning forests. And we're going to define a little bit of notation: we're going to define p(i), for variable Xi, to be the parent of Xi, or 0 in case Xi has no parent. And note that each variable has at most one parent, so this is well defined.

So let's look at our decomposable score. And again, this can be any of the three scores that we talked about, because they all have this property. So the score is the sum of the scores of the individual families, Xi and its parents. And now the parents are very simple, because we have at most one parent per variable. So we're going to divide this into two cases: the first case is those variables that have a parent, and the second is those that don't.

Â 3:46

And so, for the nodes that have a parent, what we have is the score of Xi given its one parent, and for the nodes that don't have a parent, it's just the score of Xi as a standalone variable. Now we're going to shuffle things around a little bit, and we're going to do the following: we're going to subtract this expression, the score of Xi on its own, and then we're going to add it back in over here. So now we have a sum over all variables, that is, i equals 1 to n, of the score of Xi in isolation. But for the Xi that have a parent, I've subtracted that off in the first summation, and so I've compensated on both sides. The important property of this second expression is that it is the same for all trees, and so it doesn't affect the comparison between one tree and another. And so the only expression that I need to consider when comparing two trees is this expression over here, which is the sum over i of the score of Xi with its parent minus the score of Xi without its parent. So you can think of this as the score of the empty network, and the first summation is the improvement that a tree gives me over the empty network.
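The manipulation described above can be written out as follows (a reconstruction from the verbal description; the slide's exact notation may differ):

```latex
\mathrm{score}(\mathcal{G} : \mathcal{D})
  = \sum_{i : p(i) > 0} \mathrm{score}\bigl(X_i \mid X_{p(i)}\bigr)
    + \sum_{i : p(i) = 0} \mathrm{score}(X_i)
  = \underbrace{\sum_{i : p(i) > 0}
      \bigl[\mathrm{score}\bigl(X_i \mid X_{p(i)}\bigr) - \mathrm{score}(X_i)\bigr]}_{\text{improvement over the empty network}}
    + \underbrace{\sum_{i=1}^{n} \mathrm{score}(X_i)}_{\text{same for all trees}}
```

The bracketed difference for the pair (i, p(i)) is the edge weight that the optimization below works with.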

Â 6:51

For BIC or BDe, we know that there is a penalty term in the score. So even if introducing an edge from Xj to Xi increases the likelihood, we might end up paying for it more in the penalty term than what we gained. And so this weight can actually be negative, at which point the optimal structure might actually be a forest. That is, it might be detrimental to our overall score to make the graph connected, that is, a tree rather than a forest.

Okay. Now, a second observation. A score, remember, satisfies score equivalence, a property that we defined before, if I-equivalent structures have the same score. And all of the scores that we talked about are score equivalent. It turns out that for score-equivalent scores, we can show that the weight from i to j is equal to the weight from j to i. That is, if we compute this expression over here for j to i or for i to j, we're going to get the exact same value. You can see that for the likelihood score directly, because mutual information doesn't have any kind of directionality associated with it: the mutual information of Xi and Xj is the same as the mutual information of Xj and Xi. So for the likelihood score you can just see it, and it turns out that it also follows for the other cases.
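The symmetry of mutual information is easy to check numerically. Here is a minimal sketch (the function name `mutual_information` and the toy joint distribution are illustrative, not from the lecture) that computes the empirical mutual information of a joint probability table both ways:

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) * log(p(x,y) / (p(x) p(y))) for a joint table."""
    px = joint.sum(axis=1, keepdims=True)   # marginal of X (rows)
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y (columns)
    nz = joint > 0                          # skip empty cells to avoid log(0)
    return float((joint[nz] * np.log(joint[nz] / (px * py)[nz])).sum())

# A small joint distribution over two binary variables
joint = np.array([[0.3, 0.2],
                  [0.1, 0.4]])

i_xy = mutual_information(joint)     # I(X;Y)
i_yx = mutual_information(joint.T)   # I(Y;X): swap the roles of the variables
```

The two calls return the same value, which is why the edge weight has no direction for the likelihood score.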

And so that means that we can define an undirected graph, where it doesn't really matter whether I go from i to j or from j to i, because the weight is the same for both. And so, for score-equivalent scores, we therefore get the following algorithm. We define an undirected graph whose nodes are the variables in my graphical model. And then I define a weight w(i,j) to be the max of the edge score between Xi and Xj and zero; and remember that it doesn't matter whether I do i,j or j,i here, because we're going to get the exact same thing for score-equivalent scores. The zero is a way of eliminating negative edges, so that I can optimize efficiently without having to worry about negative edge costs. Now I can go ahead and find a forest or a tree that has the maximal weight. And the nice thing is that it turns out you can use any of the standard algorithms for maximum-weight spanning trees in order to do this. So, for example, either Prim's or Kruskal's algorithm can be used to solve this, and we can in fact find the solution in time that is O(n squared), which is an inevitable cost given that I'm considering all n squared combinations of pairs of variables as edges. And then, if I end up with edges of weight zero, that indicates that the edge's potential contribution was derived from a negative component of the score. So I remove all of the edges whose weight is zero to produce a forest. And that is an O(n squared)-time algorithm for finding the absolutely optimal tree or forest relative to any of the scores that we've defined.
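The procedure just described can be sketched as follows. This is a minimal illustration, not the lecture's code: `learn_forest` is a hypothetical name, and the input is assumed to be a symmetric matrix where `weights[i][j]` already holds max(score(Xi | Xj) − score(Xi), 0). It runs Kruskal's algorithm with union-find, keeping only strictly positive edges, so the result may be a forest rather than a tree:

```python
def learn_forest(weights):
    """Maximum-weight forest from a symmetric matrix of clipped edge scores."""
    n = len(weights)
    # Candidate edges with strictly positive weight, heaviest first
    # (zero-weight edges correspond to negative score contributions and are dropped).
    edges = sorted(
        ((weights[i][j], i, j)
         for i in range(n) for j in range(i + 1, n)
         if weights[i][j] > 0),
        reverse=True,
    )
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:              # adding (i, j) creates no cycle
            parent[ri] = rj
            forest.append((i, j))
    return forest
```

For example, with four variables where one variable has no positive-weight edge to anything, the result is a two-edge forest leaving that variable disconnected, exactly the behavior described above.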

So let's look at an example. This is the ICU-Alarm network again; this is a picture of the original network, and if we apply this tree-learning algorithm, we end up with the following structure, where the edges that have been marked in red are edges that existed in the original network, and the blue ones are edges that are spurious, that is, they weren't in the original network. So this blue one over here, for example, isn't in the original network. And so we see that although most of the edges that we end up with are edges in the original network, some aren't; some are derived from correlations that went through indirect paths in the original network. And the other thing that's important to see, looking at this, is that the inferred edges in the tree are intrinsically undirected. It doesn't mean that I can't force a direction on them, but the direction isn't determined by my optimization algorithm, and that means that I'm not able to determine what came before what in the original graph when I'm learning a tree. So, to summarize: structure learning is

an optimization problem over the combinatorial space of graphs. The decomposability property allows me to break the score up as a sum of terms for different families, and that allows me, in the specific case of tree-structured networks, to optimize very efficiently, in quadratic time, using a standard maximum-weight spanning-tree algorithm.
