We talked about how the problem of learning a general Bayes net structure

can be viewed as a combinatorial optimization problem in which we are

trying to find a high scoring network structure and how that's problem, that

problem is usually solved by some kind of heuristic search over the space of Bayes

net structures. Let's talk now, about the computational

cost of this algorithm and how we can use the property of decomposability, which we

also previously used in the context of tree learning to considerably reduce the

computational cost of this of this algorithm.

So as a reminder, the heuristic search procedure that we discussed iteratively

moves over the space of networks. And so if this is the network that we're

currently at, we need to evaluate several different moves away from that network

that [INAUDIBLE] usually local ways by adding, deleting, or reversing an arc.

We then score each of these subsequent networks and see which of those has the

highest score and take that move. And one can also have other algorithms

that evaluate that, that don't necessarily make the greedy move at each

point, but basic idea is the same. So what is the computational cost of this

algorithm? Let's do the naive computational

analysis. what we would get if we were to do the

naive implementation of this. So, how many operators are there in each

search step that we need to evaluate? so in a Bayesian network with n nodes, we

have n(n - 1) different possible edges.

[SOUND] One now, each of those edges is either

present in the graph or not present in the graph.

If the edge is present in the graph, we can either delete it,

[SOUND] reverse it. And if it's absent,

we can add it, which means that effectively, we have

either two or one possibilities for each of those n(n - 1) possible edges.

And so, we have O(n^2) possible operators that we need to evaluate at each point in

the step, each, each step in the algorithm.

What is the cost of evaluating each of the candidate successors that we would

get by taking one of these operators? So, reminding ourselves that there are

multiple components in the scores, one for each variable in the network,

because of the decomposability property. So we have n different components, and

for each of those, we have to look at the sufficient statistics and compute the

resulting entry in the score corresponding to that variable.

Computing sufficient statistics requires a traversal over the training data and so

that's something that takes O(M) time where M is the number of training

instances. So altogether, this step requires O(n *

M). Now, we haven't talked about this, but

one also needs to make sure that the resulting graph from this operator is in

fact acyclic and so we need to do an acyclicity check.

This is something that, in general, requires O of little m time, where m is

the number of edges in the graph. So, altogether, if we sum up all of these

different pieces of the cost, we end up with a computational cost which is O(n^2

* (Mn + m), whereby in large, little n is usually dwarfed by M, by big M times n

and that is the cost per search step. Now if you think of a network, it's not

even necessarily that large, something that has 50 or a 100 variables, so that n

is 50 to a 100, and the number of training instances is 10,000, this can

get really, really large and generally intractable to do in a lot of situations.

So how can we improve on this? Let's see how we can exploit the

decomposability property to get a significant computational savings in the

search process. Let's first look at a single small

operator such as the one where we add an edge between B and D to this network

where where such an edge did not exist before.

And, let's consider the score that we had for

the original network, which in this case is, because of decomposability property,

is the sum of family scores for the individual variables.

So we have a component that lists the score of A relative to its empty set of

parents, the score of B relative to the empty set, C relative to its parents A

and B, and D relative to its parent C. Let's compare that to the score of the

network following the move and we can see that this score for the new network is

actually very similar to the score of the original network, because, for the same

decomposability property we can break up the score into these little components

and since most families haven't changed, that component in the score is going to

be the same. Specifically, we're going to have the

exact same score for A relative to its empty parent set, the same score for B,

the same score for C, and only this only this last score for D is now going to be

different, because of the fact that we've modified the family for D.

But that immediately suggests that we don't really, really need to recompute

these earlier components of the score, because they haven't changed. We only

need to compute the last component corresponding to D.

In fact, we can do better than that by, instead of considering the absolute

score, instead, we're going to compute what's called the delta score which is

the difference between the score of the network following the change, this

network, and the score of the original network.

And we're going to compute the difference between those two scores, which as we can

see, is just the difference between the scores of the two families for D the B, C

family in, in the following, in the new network versus the C family and the

original network. And, that delta score is going to be

something that we can compute just looking at a single family, ignoring the

rest of the network. The same thing happens for other local

operators. So for example, if we consider now the

deletion operator that deletes an edge between B and C the, and we look at the

delta score, that delta score only cares about the the family C, because that's

the only family to have changed. And that's going to be, again, the

difference between the score of C with the single with the single edge A minus

the score of C with the family A, B.

So again, only one family gets affected by this change.

For an edge reversal, the situation's a little bit more complicated, because we

can see that by flipping the edge from B to C and making it go from C to B,

there's actually two families that are affected, the B family and the C family.

But that's still just two as opposed to the entire network and so we can see that

that delta score is going to have two components,

one is the delta score for the C family and the second is the delta score for the

B family. But in either case, we only end up

affecting either one family in the case of edge addition or in that case, case of

edge deletion or at most two families in the case of edge reversal.

And so, that means that we only have to consider a much smaller fraction of the

network when computing the delta square. A second use of the decomposability

property comes up when we consider multiple consecutive steps of the search

algorithm. So let's imagine that in the previous step, we decided to to take

that, step that, added the edge, from B to C, and now we have to consider the

next step in the search. Well, what are the operators that are

conceivable in this in this next step of the search? For example, one such

operator is to delete the edge from B to C [SOUND] this one.

[SOUND] And notice that that edge deletion operation is in fact an operator

that we also considered in the previous step of the search before we decided to

add the edge from B to D. Now, normally is this operator the same,

notice that the family of C hasn't changed between those between those two

cases, in both of these cases when we're considering the move C has the parents of

A and B. And so, the delta score for that particular move is not going to change

either, specifically, if in this case, the delta score was this score of C

minus, given family A minus the score of C given the family A,

B. We have exactly that same score,

the same delta score in in the previous step.

That is these two delta scores in the previous step and the new step are

exactly the same and so there's really no point to recompute it.

[SOUND] So which scores do we need to recompute?

The only scores that we need to recompute are the ones that were affected by the

step that we currently, that we just took.

So specifically, if we took a step that modified the family of D, then any step

that involves an additional change to the family of D will have a different delta

score, because the family is now different in doing the comparison.

However, families that were not affected by the move, don't need to be recomputed.

So to summarize, we only need to rescore delta moves, delta scores for families

that changed in the last move. [SOUND] So let's summarize the

computational cost of this procedure. What's the cost per move?

Having decided to take a move, we have only one or two families that were

affected by that move, that means that only O(n) delta scores

need to be computed, because for a given family, there is only n possible edges

that that impinge on that family. So only O(n) delta scores need to be

computed, and for each of those, we need to compute sufficient statistics which

takes O(M) time. So altogether, we have O(n * M) as the

cost of doing this step, which is actually two orders of magnitude better

than the old n^3 * M that we had for the original naive computation.

Now this tells us the cost after we pick a move.

What about the cost of picking a move? Now, naively, we might say that there is

n^22 possible operators that we can consider any in given move,

and so, we need to evaluate each of them, and pick the best one or consider each of

them, and pick the best one. But, in fact, we can do considerably

better by the use of clever data structures.

So specifically, we can maintain a priority queue of operators sorted by

their delta scores. Now, when we rescore those O(n)

operators, in this step over here, we need to modify

the score of those operators and reinsert them into the priority in their

appropriate place. But that's a computation that requires O(n log n) time

because there's only n of M. And, once we have done that, the best

operator will be at the top of the list which we can then take identify and take

in constant time. And so, this priority queue saves us

complexity by taking us from O(n^2) time for picking this for traversing the set

of operators to something that requires O(n log n).

And so, altogether, we've actually saved a considerable cost in both of these

steps. [SOUND] It turns out that one can get an

even higher level of computational efficiency based on a different

observation. So, it's a fact that in most network

learning algorithms, the plausible families, the ones that have some chance

of being selected are variations on a theme.

That is, for a given variable A, there might be some number of variables you

know, B1, B2, up to Bk for a very small k that are

reasonable parents to be selected as parents for A.

And so, how do we exploit that property in

computational, to get computational savings?

Turns out there's two different ways that one can utilize this.

The first is the fact that because we have the same set of variables being

constantly considered as being candidate families,

it means that we can use sufficient statistics that we computed in one step

and reuse them in a different step if we cash them,

because, because we're likely to encounter the same family more than once.

We might encounter B as a parent of A and A as a par, as a possible parent for B,

and for both of these, we have the sufficient statistics A, B that are going

to be utilized for both of them. And, so, if we compute this once and then

cache it, these sufficient statistics, we don't have to recompute them again.

That turns out to make a huge difference in terms of the computational cost of

this algorithm. The second way in which this could be

used is that if we can identify in advance the set of B1 up to Bk that are

reasonable parents to consider for A, we can restrict in advance that set and not

consider other parents at all, which reduces our complexity from O(n)n) to

O(k),k), where k is some bound on the number of plausible parents that we're

willing to consider. Now this, now is the heuristic in the

sense that this is a restriction that can actually change the outcome of our

algorithm. It's not just a way of reducing the cost,

but it turns out to be a very good approximation in practice.

To summarize it turns out that even the fairly simple heuristic structure search

that we employ in in, in such as greedy hill climbing can get expensive when n is

large, because the naive implementation has a cubic dependence on n.

But we can exploit the decomposability property, that we also exploited in the

context of tree learning, to get several orders of magnitude reduction in the cost

of the search. We can also exploit the recurrence of

plausible families at multiple steps in the search algorithm to get further

computational savings and also restrict the search space to to get better better

computational property.