0:00

So far we've talked a lot about the representation of probabilistic graphical models: we defined different classes of probabilistic graphical models, Bayes nets and Markov nets, and talked about their independence assumptions. Now let's operationalize probabilistic graphical models and figure out how to use this declarative representation to answer actual queries.

It turns out there are many queries that one can use a PGM to answer, but perhaps one of the most commonly used is what's called a conditional probability query. So let's define a conditional probability query. In a conditional probability query we have some set of observations, little e, about a set of variables, big E. These are the variables that we happened to observe. We also have a particular query that we care about, which we are going to denote as a set of variables Y. So our goal here is to compute the conditional probability of the variables Y given the evidence E = e, that is, P(Y | E = e).

This type of query is useful for a variety of different applications. For example, in the medical or fault diagnosis domains that we've spoken about, we might have observations about certain symptoms or test results, and we're interested in predicting the probability of different failure modes or different diseases. In the pedigree analysis example that we also talked about, we might have observations about the phenotype, or maybe even the genotype, of some individuals in a family, and we are interested in reaching conclusions about other individuals.

Unfortunately, like most interesting problems, the problem of inference in probabilistic graphical models is NP-hard. So before we talk about that, let's first remind ourselves what NP-hardness is.

I'm not going to define it formally in this course, but as a rough intuition, if a problem has been shown to be NP-hard, it means that it is extremely unlikely to have an efficient solution. Slightly more formally, it means that all algorithms that people have devised for this problem, and for a bunch of problems that are equally hard, require at the very least time exponential in the size of the representation of the problem, which means that it's unlikely to be solvable for any problem that is bigger than a small handful of, say, variables in our context.

So which particular problems in the context of PGM inference are NP-hard? Well, first, the basic problem of exact inference in a PGM is NP-hard. Given a PGM P_Φ defined by a set of factors Φ, a particular target variable X, and a value little x, computing the probability of the event X = x is NP-hard.

In fact, even a simpler special case of this problem, one where we just want to figure out whether this probability is even positive, is also NP-hard. So one might then say, well, okay, exact inference is intractable, but what if we compromise a little bit on accuracy? What if we're just looking for an approximate answer? After all, we don't necessarily care about the tenth significant digit in this probability.

Does that make the problem easier? Unfortunately, it turns out that the answer is no, and there are many different variants of NP-hardness results for approximate inference. This is just one of them. It says that, again, given a PGM, an event X = x, and some set of observations, the problem of finding an approximate answer p, for which we are guaranteed that this approximate answer is within epsilon of the truth, is NP-hard. And this is true for any epsilon that is less than 0.5. Note that epsilon equal to 0.5 can be obtained by simply guessing at random, or by guessing the value to be 0.5.

So any non-trivial approximation for this kind of conditional probability query is also intractable. This is a rather depressing observation, and it might make us want to give up on using PGMs for inference. But it turns out that one shouldn't get depressed too quickly, because, importantly, this is a worst-case result.

It means we can construct bizarre PGMs for which exponential time is the best we can do, but it doesn't mean that in the general case one can't do better. The rest of the inference section of this course will be devoted to showing a variety of algorithms, whether for exact inference or for approximate inference, that can do considerably better than this worst-case result might suggest. So first, to understand where these algorithms might be getting their power, let's drill down into the inference question a little bit more.

Here is a slightly elaborated version of our student network, where we now have additional variables. For example, the coherence of a course influences its difficulty, and there are a variety of other variables that we're not going to talk about.

Inference in this model, and inference for probabilistic graphical models in general, uses the notion of factors that we've talked about before. It turns out that it's convenient to use factors because it means that the algorithms apply equally well to Bayes nets and Markov networks, so it's a very useful abstraction. Let's think about this Bayesian network in the context of a set of factors. Here, for example, we had initially a prior probability over the C variable, P(C); that translates into a factor whose scope is C. We have, for example, P(G | I, D), and that translates into a factor whose scope is {G, I, D}. In general, each one of the CPDs in this network converts to a factor over the scope of the family, that is, the variable and its parents.
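To make the factor view concrete, here is a minimal sketch (not from the lecture) of how CPDs can be stored as table factors, using invented probabilities for binary variables:

```python
# A factor is just a table: a scope (an ordered tuple of variable names)
# plus a mapping from joint assignments to non-negative numbers.
# All numbers below are made up for illustration.

# Factor for the prior P(C): scope is (C,)
phi_C = {
    "scope": ("C",),
    "table": {(0,): 0.4, (1,): 0.6},
}

# Factor for the CPD P(G | I, D): scope is (G, I, D).
phi_G = {
    "scope": ("G", "I", "D"),
    "table": {
        (0, 0, 0): 0.7, (1, 0, 0): 0.3,
        (0, 0, 1): 0.9, (1, 0, 1): 0.1,
        (0, 1, 0): 0.2, (1, 1, 0): 0.8,
        (0, 1, 1): 0.5, (1, 1, 1): 0.5,
    },
}

# Sanity check: each conditional row of P(G | I, D) sums to 1,
# since each row of a CPD is itself a distribution.
for i in (0, 1):
    for d in (0, 1):
        row = sum(phi_G["table"][(g, i, d)] for g in (0, 1))
        assert abs(row - 1.0) < 1e-9
```

Nothing about this representation is specific to Bayesian networks: a Markov network factor is the same kind of table, just without the rows-sum-to-one constraint.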

So now let's assume that our goal is to compute the probability of the variable J; we'd like to compute P(J). What we see here is the joint distribution, which we have defined using the chain rule for Bayesian networks. In order to compute P(J), all we need to do is eliminate, or marginalize out, all of the variables except for J, and so we end up with a summation that looks like this. This is why this problem of conditional probability inference is often called sum-product: we have a sum over a product of factors.
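The sum-product computation just described can be sketched by brute force on a tiny, invented binary chain I → G → J (the CPD numbers below are made up; real networks are far too large for this kind of enumeration):

```python
from itertools import product

# Made-up CPDs for a tiny binary chain I -> G -> J.
p_I = {0: 0.3, 1: 0.7}
p_G_given_I = {(0, 0): 0.8, (1, 0): 0.2,      # keyed by (g, i): P(g | i)
               (0, 1): 0.4, (1, 1): 0.6}
p_J_given_G = {(0, 0): 0.9, (1, 0): 0.1,      # keyed by (j, g): P(j | g)
               (0, 1): 0.25, (1, 1): 0.75}

def marginal_J():
    """Sum-product by brute force: P(j) = sum_{i,g} P(i) P(g|i) P(j|g)."""
    pj = {0: 0.0, 1: 0.0}
    for i, g, j in product((0, 1), repeat=3):
        pj[j] += p_I[i] * p_G_given_I[(g, i)] * p_J_given_G[(j, g)]
    return pj

pj = marginal_J()
assert abs(pj[0] + pj[1] - 1.0) < 1e-9  # a valid distribution over J
```

The inner expression is exactly a product of the factors, and the marginalization is the outer sum; the cost of this naive version grows exponentially with the number of variables, which is the problem the later algorithms address.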

Â 7:04

The inference problem for Markov networks takes exactly the same form. Here we once again have a product of factors, and in this case the factors are the form in which the network is originally defined. So we have the factors φ1 over {A, B}, φ2, φ3, and φ4, and if we are interested in computing the probability P(D), then once again we need to compute the product of these factors and then marginalize out the other variables.

Now, what I wrote here is not actually quite right, because in Markov networks the product of the factors isn't actually P(A, B, C, D); rather, it's P̃(A, B, C, D), the unnormalized measure, and in order to get the normalized measure we need to normalize, that is, divide by the partition function.

So how do we deal with that if our goal is to compute P(D)? Well, the point is that if we ignore the partition function and instead compute P̃(D), we can infer that P(D) is actually equal to (1/Z) P̃(D), because the normalizing constant is a constant. And so once we've computed P̃(D), we can obtain P(D) by a simple process of renormalization of P̃(D). What about evidence?
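Before moving on to evidence, here is a sketch of that renormalization trick on a small invented cycle network A–B–C–D (all factor values are made up): we compute the unnormalized P̃(D) by brute force and then divide by Z, which we get for free by summing P̃ over D.

```python
from itertools import product

# Made-up pairwise factors on the cycle A-B-C-D-A; all variables binary.
phi1 = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 5.0}  # phi1(A, B)
phi2 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}  # phi2(B, C)
phi3 = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 2.0, (1, 1): 1.0}  # phi3(C, D)
phi4 = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}  # phi4(D, A)

# Unnormalized marginal: p~(d) = sum_{a,b,c} phi1 phi2 phi3 phi4.
p_tilde = {0: 0.0, 1: 0.0}
for a, b, c, d in product((0, 1), repeat=4):
    p_tilde[d] += phi1[(a, b)] * phi2[(b, c)] * phi3[(c, d)] * phi4[(d, a)]

# The partition function Z is just the total mass of p~,
# so renormalizing p~ over D yields P(D) without computing Z separately.
Z = p_tilde[0] + p_tilde[1]
p_D = {d: v / Z for d, v in p_tilde.items()}
assert abs(p_D[0] + p_D[1] - 1.0) < 1e-9
```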

Well, it turns out evidence can be handled by a simple preprocessing step of factor reduction. Here, if we're trying to compute the probability of a set of variables Y given evidence, that, by definition, is the ratio between the probability of Y and the evidence, divided by the probability of the evidence. Now look at the numerator of the expression over here, and define W to be all the variables that are neither query nor evidence.

Â 9:26

Then we can, once again, consider this as a sum-product expression. For P(Y, E = e), we are going to introduce the variables W into the expression: this probability over here is the sum of the joint probability, marginalizing out the W. Now, if we rewrite this expression over here, we can view it as a product of factors, and that's the case whether it's a Bayesian network or a Markov network. And that product of factors involves only those components of the factors that are consistent with my evidence, E = e, which means I reduce the factors by the evidence.

To understand what that means, let's look at this example over here. Let's imagine, for example, that we have an observation, say A = a0, and we want to compute the probability distribution in the context of this observation. What that means is that we're going to take every single one of those factors and remove, from every factor that involves A, the entries that correspond to A = a1, because those are not going to be consistent with our observation A = a0.
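As a sketch, reducing a table factor by evidence just means dropping the inconsistent rows; `reduce_factor` below is a hypothetical helper, with made-up factor values:

```python
def reduce_factor(scope, table, var, value):
    """Restrict `table` (keyed by assignments to `scope`) to var == value.

    Rows inconsistent with the observation are dropped, and the observed
    variable is removed from the scope of the resulting factor.
    """
    idx = scope.index(var)
    new_scope = scope[:idx] + scope[idx + 1:]
    new_table = {
        key[:idx] + key[idx + 1:]: v
        for key, v in table.items()
        if key[idx] == value
    }
    return new_scope, new_table

# A made-up factor over (A, B), both binary.
scope = ("A", "B")
phi = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 5.0}

# Observe A = a0 (encoded as 0): the rows with A = 1 disappear.
red_scope, red_phi = reduce_factor(scope, phi, "A", 0)
assert red_scope == ("B",)
assert red_phi == {(0,): 3.0, (1,): 1.0}
```

Note that reduction is purely local to each factor, which is why it works as a preprocessing step before any of the inference algorithms run.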

Once we've reduced those factors to be consistent with the evidence, we still have a product of factors, and we can go ahead and treat it in exactly the same way as we did before. So we now have a sum, over the variables W that need to be eliminated, of the product of the reduced factors, and once again we can ignore the partition function by computing this probability and then renormalizing at the end.

This applies equally well in the context of Bayesian networks. Here we might have, say, the observation I = i and H = h.

So now this is no longer an equality, because we have not yet conditioned on the evidence. If we want to make this equal, we have to reduce each one of the factors involving I and H based on the evidence. So this turns into φ_I(i), which, as it happens, is a constant, because there are no other variables in the scope, and similarly here and here. For H = h we do the same thing, which in this case involves only this factor over here. And now it's back to being an equality.

And if we want to compute the probability of J given I = i and H = h, we take this summation and renormalize.

12:42

Just like before. So, to summarize, the sum-product algorithm can be done as follows. You convert the conditional probability into a ratio. The numerator of this ratio is a product of reduced factors, summed out over the remaining variables. The denominator is simply the numerator, summed up over the query variables Y. And if we divide these two, we realize that we can get away with computing simply the product of reduced factors, and normalizing at the end.

There are many algorithms for computing conditional probability queries. One of them involves pushing the summations into the factor product; this gives rise to an algorithm called variable elimination, which turns out to be a special case of a class of algorithms called dynamic programming, and it's a form of exact inference.
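A minimal sketch of "pushing the summations in", on an invented binary chain I → G → J with made-up CPD numbers: instead of summing the full product over all joint assignments, we eliminate I first, producing a small intermediate factor, and then eliminate G.

```python
from itertools import product

# Made-up CPDs for the binary chain I -> G -> J.
p_I = {0: 0.3, 1: 0.7}
p_G_given_I = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.4, (1, 1): 0.6}    # (g, i)
p_J_given_G = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.25, (1, 1): 0.75}  # (j, g)

# Variable elimination exploits:
#   P(j) = sum_g P(j|g) * [ sum_i P(i) P(g|i) ]
# The bracketed term is computed once per value of g, rather than
# re-multiplying all factors for every joint assignment.

# Step 1: eliminate I, producing an intermediate factor tau(g).
tau = {g: sum(p_I[i] * p_G_given_I[(g, i)] for i in (0, 1)) for g in (0, 1)}

# Step 2: eliminate G, producing the marginal over J.
p_J = {j: sum(tau[g] * p_J_given_G[(j, g)] for g in (0, 1)) for j in (0, 1)}

# Same answer as brute-force summation over the full joint.
brute = {j: sum(p_I[i] * p_G_given_I[(g, i)] * p_J_given_G[(j, g)]
                for i, g in product((0, 1), repeat=2))
         for j in (0, 1)}
assert all(abs(p_J[j] - brute[j]) < 1e-12 for j in (0, 1))
```

On this toy chain the savings are negligible, but on long chains the same reordering turns an exponential-time sum into a linear-time one, which is the essence of the dynamic-programming view.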

A generalization of this algorithm performs message passing over a graph; it also deals effectively with summations and factor products, and there are many variants of it, some of which are exact and others approximate. Here are some names of algorithms, only some of which we'll have an opportunity to discuss. And then, finally, there's a very different class of algorithms that uses random sampling as its key technique: it samples complete instantiations, or assignments, in a variety of different ways, and uses those assignments to approximate the probability of a particular query. So here we have both exact and approximate methods, and this is an approximate method. We'll talk about some of these in the remaining sections of the course.
