0:00

But we've previously talked about the importance of allowing CPD representations that encode additional structure in the local dependency model of a variable on its parents. And we talked about the case of tree CPDs, which allow a variable to depend on different variables in different contexts. But none of that helps us deal with the situation that we used as a motivation for this: a variable such as, for example, cough, that depends on multiple different factors: pneumonia, flu, tuberculosis, bronchitis, and so on. This doesn't lend itself to, say, a tree CPD, because it's not the case that cough depends on one disease only in certain contexts and not in others. Really, it depends on all of them, and all of them contribute something to the probability of exhibiting a cough. So one way of capturing that kind of interaction is a model called the noisy OR CPD.

And the noisy OR is best understood by considering a slightly larger graphical model, where we break down the dependency of Y on its parents, X1 up to Xk, by introducing a set of intervening variables. So imagine, again, that Y is the cough variable and the Xi's are different diseases. What we're doing here is introducing, for each disease Xi, an intermediate variable Zi that captures the event that this disease, if present, causes a cough by itself: so Z1 is the event that X1 by itself causes Y. You can think of each of the diseases as a noisy transmitter: if you have the disease, say X1 is true, then Z1 says whether X1 succeeded in its attempt to make Y true. X2 has its own little filter, Z2, and Z2 makes that same decision relative to X2. Ultimately, Y is true if someone succeeded in making it true.

5:02

So that's the noisy OR CPD.
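As a concrete sketch (my own illustration, with made-up success probabilities, not parameters from the lecture), the noisy OR probability can be computed directly: Y fails to turn on only if every active cause independently fails.

```python
def noisy_or_prob(x, lam, lam0=0.0):
    """P(Y = 1 | x) under a noisy OR CPD.

    x    -- 0/1 values of the parents X1..Xk
    lam  -- lam[i] = P(Zi = 1 | Xi = 1): the chance that cause i,
            when present, succeeds in turning Y on by itself
    lam0 -- optional "leak": chance Y turns on with no cause active
    """
    p_all_fail = 1.0 - lam0              # the leak does not fire
    for xi, li in zip(x, lam):
        if xi:                           # only present causes can fire
            p_all_fail *= 1.0 - li       # cause i independently fails
    return 1.0 - p_all_fail              # Y is the OR of the Zi's

# With two active causes of strengths 0.8 and 0.6 and no leak:
# P(cough) = 1 - 0.2 * 0.4 = 0.92
print(noisy_or_prob([1, 1], [0.8, 0.6]))
```

Note how a parent that is off contributes nothing: its transmitter Zi can never fire.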

And you can generalize this to a much broader notion of independence of causal influence. It's called independence of causal influence because it assumes that you have a bunch of causes for a variable, and each of them acts independently to affect the truth of that variable. So there are no interactions between the different causes: they each have their own separate mechanism, and ultimately it's all aggregated together in a single variable Z, from which the truth of Y is then determined from the aggregate effect of all of the Zi's of the different causes.
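To make the "separate mechanisms plus an aggregator" structure concrete, here is a brute-force sketch (my own illustration, not code from the course) for binary Zi's with no leak term; swapping the aggregation function turns the same skeleton into a noisy OR or a noisy AND.

```python
import itertools

def ici_prob_y_true(x, p_success, aggregate=any):
    """P(Y = 1 | x) for a simple independence-of-causal-influence
    model with binary Zi's and no leak: each active cause i turns
    its Zi on independently with probability p_success[i], and Y is
    a deterministic function `aggregate` of the Zi's.

    Brute force over all Zi assignments (exponential in k, so this
    is for intuition only, not for real networks)."""
    total = 0.0
    for zs in itertools.product([0, 1], repeat=len(x)):
        p = 1.0
        for xi, zi, li in zip(x, zs, p_success):
            p_zi_on = li if xi else 0.0   # Zi can only fire if Xi is on
            p *= p_zi_on if zi else 1.0 - p_zi_on
        if aggregate(zs):                 # the aggregator decides Y
            total += p
    return total

# aggregate=any gives a noisy OR, aggregate=all a noisy AND
print(ici_prob_y_true([1, 1], [0.8, 0.6], aggregate=any))
print(ici_prob_y_true([1, 1], [0.8, 0.6], aggregate=all))
```

The point of the sketch is that the per-cause mechanisms never interact; only the deterministic aggregator combines them.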

So one example of this is the noisy OR that we've already seen, but it generalizes easily to a broad range of other cases. There are noisy ANDs, where the aggregation function is an AND. There are noisy maxes, which apply in the non-binary case, where causes might not just be turned on or off but rather have different extents of being turned on, and Z is then the maximal extent of the independent effect of each cause, and so on. So there's a large range of different models, all of which fit into this family. Noisy OR is probably the one that's most commonly used, but the other ones have also been used in other settings.

One model that might not immediately be seen to fit into this framework, but actually does, is the sigmoid CPD. So what's a sigmoid CPD? A sigmoid CPD says that each Xi induces a continuous quantity Zi = Wi * Xi. So if each Xi is discrete, then Zi is just a continuous value, Wi * Xi, where Wi parameterizes this edge, and it tells us, sort of, how much force Xi is going to exert on making Y true. So if Wi is zero, it tells us that Xi exerts no influence whatsoever. If Wi is positive, Xi is going to make Y more likely to be true, and if Wi is negative, it's going to make Y less likely to be true. All of these influences are aggregated together in an expression for the variable Z, which effectively adds up all of these different influences plus an additional bias term, W0.

And now we need to turn this ultimately into the probability of the variable Y, which is the variable that we care about. In order to do that, we pass this continuous quantity Z, which is a real number between negative infinity and infinity, through a sigmoid function. The sigmoid function is defined as follows, and it's a function that some of you have seen before, in the context of machine learning, for example: sigmoid(Z) = e^Z / (1 + e^Z). That is, it takes the continuous value Z, exponentiates it, and then divides by one plus that exponential. Since e^Z is a positive number, this gives us a number that is always in the interval (0, 1). And if we look at what this function looks like, it looks like this: the x axis here is the value Z, and the y axis is the value of the sigmoid function. You can see that as Z gets very negative, the probability goes to zero; as Z gets very high, the probability gets close to one; and there's an interval in the middle where intermediate values are taken.
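As a small sketch (with illustrative weights of my own choosing, not numbers from the lecture), the whole sigmoid CPD is just a weighted sum passed through this squashing function.

```python
import math

def sigmoid(z):
    """sigma(z) = e^z / (1 + e^z), written in the equivalent,
    numerically safer form 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_cpd_prob(x, w, w0):
    """P(Y = 1 | x) under a sigmoid CPD.

    x  -- 0/1 values of the parents X1..Xk
    w  -- weights Wi: how much force each Xi exerts on Y
    w0 -- bias term
    """
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))  # aggregate Z
    return sigmoid(z)                              # squash into (0, 1)

# One strong positive cause active, one negative cause inactive:
# Z = -2 + 3 = 1, so P(Y = 1) = sigma(1), roughly 0.73
print(sigmoid_cpd_prob([1, 0], [3.0, -1.0], w0=-2.0))
```

Note that an inactive parent (Xi = 0) contributes nothing to Z, matching the independence-of-causal-influence picture.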

So this is kind of like a squashing function that squashes the line on both ends.

Let's look at the behavior of the sigmoid CPD as a function of different parameters. So here is a case where all of the Xi's have the same parameter W. What we see on this axis is the value of this parameter W, and over here is the number of Xi's that are true. So let's look first at this second axis: the more parents that are true, the more parents that are on, the higher the probability that Y is true. And this holds for any value of W, because these are all positive influences: the more parents are true, the more things are pushing Y to take the value true. This other axis is the axis of the weight, and we can see that for low weights, you need an awful lot of X's to be true to get Y to be true, but as W increases, Y becomes true with a lot fewer positive influences. The graph on the right is what we get when we basically just increase the amplitude of the whole system: we multiply both W and W0 by a factor of ten. What happens is that the exponent gets pushed to extreme values much quicker, because Z effectively gets multiplied by a factor of ten, and that means that the transition becomes considerably sharper.
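That sharpening effect is easy to reproduce numerically. In this sketch (weights made up for illustration), multiplying both W and W0 by ten leaves the crossover point alone but makes the transition much steeper:

```python
import math

def p_y_true(n_on, w, w0):
    """P(Y = 1) when n_on identical parents, each with weight w,
    are on: sigma(w0 + n_on * w)."""
    return 1.0 / (1.0 + math.exp(-(w0 + n_on * w)))

for n in range(6):
    base  = p_y_true(n, 0.5, -1.0)    # modest amplitude
    sharp = p_y_true(n, 5.0, -10.0)   # same shape, 10x amplitude
    print(n, round(base, 3), round(sharp, 3))
```

Both curves cross 0.5 at n = 2 (where Z = 0), but the scaled curve jumps from nearly 0 to nearly 1 within a single step.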

That gives us a little bit of intuition for how the sigmoid function and the sigmoid CPD behave. So what are some examples of an application of this? I showed this network in an earlier part of this course: it's the CPCS network, and it was developed here at Stanford Medical School for diagnosis of internal diseases. And so, up here we have things that represent predisposing factors.

Â 11:28

And there's actually a fairly eclectic range here. So, for example, one predisposing factor is intimate contact with small rodents, because that's a contributing factor for hantavirus. And so there's a whole range of predisposing factors. Down here in the middle we have diseases, and down at the bottom we have symptoms and test results. Now, as I previously mentioned, there are approximately 500 variables in this network, and they take on about four values each. So the total number of entries in a joint distribution over this space would be approximately four to the 500, which is clearly an intractable number. If we take this distribution represented with the network shown in this diagram, we get considerable sparsification, and the factorized form has approximately 134 million parameters, which is still much too many for a human to estimate. By using, as in this case, a noisy max CPD, they brought the number of parameters down to about 8,000 total parameters for this network, which is a much more tractable number to deal with.
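The arithmetic behind those savings is worth spelling out. As a rough, illustrative calculation (this is the scaling argument, not the exact CPCS bookkeeping): a full table CPD for one variable grows exponentially in the number of parents, while an independence-of-causal-influence CPD like noisy max grows only linearly, because each cause contributes its own small set of parameters.

```python
# Rough, illustrative parameter counts for a single variable with
# k parents, each variable taking v values (hypothetical k, not the
# actual CPCS structure):
k, v = 10, 4

full_table = v ** (k + 1)   # one entry per joint assignment of the
                            # variable and all of its parents
noisy_max  = k * v          # roughly one small set of parameters
                            # per cause, linear in k

print(full_table, noisy_max)   # exponential vs linear in k
```

Applied across a 500-variable network, that per-variable reduction is what takes the total from hundreds of millions of parameters down to thousands.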
