0:00

We've spent the entire course talking about probabilistic graphical models: variants of probabilistic graphical models on the representation side, different inference algorithms that can be used to answer various types of questions about those models, and algorithms that can be used to learn probabilistic graphical models from data. Now let's take a step back and think about what probabilistic graphical models can do for us.

So, first, why probabilistic graphical models? Probabilistic graphical models are, in some sense, the synthesis, or the marriage, of foundational ideas from statistics and important techniques from computer science.

From the statistics side, we get sound probabilistic foundations: probability distributions and how they can be used to reason about uncertainty; manipulations of a probability distribution, such as conditioning; and techniques such as value of information and decision making that can be built on top of probability distributions to make decisions. From the computer science side, we get the idea of taking a probability distribution over a very high-dimensional space and using ideas such as graphs as a data structure, supporting algorithms that allow us to manipulate these high-dimensional objects very efficiently. It is these ideas from computer science that allow us to take sound probabilistic foundations and make them applicable to the kinds of complex, real-world problems that we now face in many scenarios.

Now, a key component of the ability to take a high-dimensional probability distribution and consider it as a single object is that we have obtained a declarative representation of our knowledge about the world. That declarative representation, the model (a probabilistic graphical model), is a standalone unit that we can consider separately from any algorithm we then use to manipulate it. That's important because it allows us to refine the model separately from refining the algorithm we're using. We can develop the model, consider whether it's the appropriate one for the application, and then apply one or multiple different algorithms to answer questions about that model. It also allows us to separate the model-construction phase from the inference phase; model construction can involve elicitation from a domain expert, learning from data, or both. We talked extensively about learning methods, and also, to some extent, about elicitation methods.

Now, when would we apply probabilistic graphical models? We would apply them when we have noisy data and uncertainty; that's pretty much true in any application, so that's pretty much a given. They're also particularly useful when we have lots of prior knowledge that we want to embed into our model. It's often easier to do this in the context of probabilistic graphical models than in a variety of other approaches; for example, in many traditional machine-learning settings, it's actually quite difficult to incorporate prior knowledge in as natural a way as one can do here.

Probabilistic graphical models are particularly useful when we wish to reason about multiple interrelated variables. If you're trying to predict just a single variable, probabilistic graphical models might not be the ideal solution, and there might be a range of other methods, say from traditional machine learning, that are equally or more competitive. But when you're trying to reason about multiple variables simultaneously, the ability to incorporate the interrelationships between them is often very powerful, and here probabilistic graphical models can really shine.

Another place where probabilistic graphical models bring significant value is when we want to construct richly structured models from modular building blocks. When we're trying, for example, to do fault diagnosis or medical diagnosis, we often have little building blocks of models that recur across different models for different diagnostic problems, and this architecture allows us to take those building blocks and construct very rich models very quickly by reusing components.
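The reusable-building-block idea can be made concrete with a small sketch. The noisy-OR CPD below is one classic such block; the function name and all numbers here are purely illustrative, not from the course:

```python
# A sketch of model reuse (names and numbers are illustrative): a
# noisy-OR CPD is a classic small building block that recurs across
# diagnostic models, and a single implementation can be dropped into
# many different networks.

def noisy_or(leak, strengths, causes_on):
    """P(effect = 1) under a noisy-OR CPD.

    leak       -- probability the effect fires with no active cause
    strengths  -- per-cause activation probabilities
    causes_on  -- booleans marking which causes are currently active
    """
    p_off = 1.0 - leak
    for s, on in zip(strengths, causes_on):
        if on:
            p_off *= 1.0 - s
    return 1.0 - p_off

# The same building block reused in two different diagnostic models:
p_fever   = noisy_or(0.01, [0.8, 0.6], [True, False])              # medical
p_no_boot = noisy_or(0.001, [0.9, 0.7, 0.5], [False, True, True])  # fault
```

The point is not the particular CPD but that one well-tested component can be instantiated many times across diagnostic problems.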

Â 4:44

We've talked about probabilistic graphical models in terms of a set of design choices: choices on the representation side, the inference algorithm that we choose to use, and the learning algorithm that we might adopt, should learning be required. It turns out that although the declarative representation allows us to separate these design choices and develop each of them in isolation, they're not actually totally independent, and it's important to consider the interactions between them.

On the representation side, it's clear that which representation we choose affects the cost of both inference and learning. We've talked about differences in the cost of inference for different types of graphical models, and we also saw that in the learning setting there are very different learnability and complexity issues when learning, say, directed versus undirected models.

The inference algorithm is also interrelated with everything else. The inference algorithm is used as a subroutine in many of the learning algorithms that we showed (for example, in MRF or CRF learning, even in the complete-data case, and in any setting where we're learning with incomplete data), so thinking about the complexity and properties of our inference algorithm is critical to designing good learning algorithms. We also know that some inference algorithms are only usable in certain types of models, so aiming for a particular inference algorithm affects our choice of representation. And finally, the learning algorithm also imposes significant constraints on the modeling, because our ability to identify the correct patterns from the data imposes significant constraints on, for example, the expressive power of our model: only certain types of patterns might be identifiable given the amount of data we have and the type of data we have (complete or incomplete, for example).

So let's look at one example of some of these trade-offs, in the context of image segmentation. Here, one of the first design decisions that we need to make is whether we're going to model this as a Bayes net (that is, a directed graphical model), an MRF, or a CRF, where we're trying to predict the segment labels, the superpixel labels S_i, given features X. That is, with a CRF we're modeling P(S | X) rather than a joint distribution. What are some of the trade-offs we might want to consider for this kind of decision?

One is the naturalness of the model. Just thinking about it initially, it seems fairly clear that a directed model with edges from one superpixel to another doesn't seem like a particularly natural choice in this context, because which direction would we pick? What implications would it have if we went, for example, with directed edges from the top right corner of the image to the bottom right corner? It just doesn't seem to make sense. So a more natural modeling paradigm here seems to be the MRF or the CRF. But that still leaves the question of which of those we should use.

Here, for example, one possible design criterion we ought to consider is that we might want a very rich feature representation: features X that are richly structured and capture different gradients of texture and color across the image, different histograms, and so on. These rich features are often highly correlated with each other, and we know that correlated features are hard to model, in the sense of capturing their correlation structure correctly. So in this case a CRF seems like it might be the better choice, because it avoids the issue of how to model the correlation structure among all these rich features. It allows us to use rich features without falling into the trap of incorrectly modeling their correlations, which might diminish our performance.

Another related question we need to consider is the cost of inference in this kind of model, which comes back and influences the modeling. For example, consider certain types of models we've discussed, where the structure of the potentials in the MRF or the CRF is associative, or regular.

Â 9:49

Then we know that we can apply very efficient inference algorithms that utilize min-cuts and other combinatorial graph algorithms, which are much, much faster than applying, say, message-passing schemes. That's all very good, but it imposes on us the constraint of making sure that our potentials are in fact associative, and that has implications for the ability of our model to correctly capture the structure in the distribution. For example, it is very easy to capture, using associative potentials, the fact that we expect adjacent pixels to have the same class, the same label. That's very easy to capture with this type of potential. Not quite so easy to capture is, for example, the tendency that cows are on top of grass, as opposed to on top of water, road, or sky. That's a potential that goes between one class and another, and it's very difficult to come up with an associative function that actually captures these kinds of tendencies.
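To make this concrete, here is a minimal sketch (with illustrative numbers, not from the lecture) of an associative, Potts-style pairwise potential, and of why it cannot express a directional preference like "cow above grass":

```python
# A sketch of an associative (Potts-style) pairwise potential: it rewards
# neighboring superpixels for taking the SAME label, which is the property
# that makes min-cut-style inference applicable. The weight is illustrative.

import math

def potts_potential(label_i, label_j, agreement_weight=2.0):
    """Associative pairwise factor: large when labels agree, 1 otherwise."""
    return math.exp(agreement_weight) if label_i == label_j else 1.0

# Adjacent pixels sharing a label is easy to encode:
assert potts_potential('cow', 'cow') > potts_potential('cow', 'grass')

# But the potential only distinguishes "same" from "different", so it is
# symmetric in its arguments. A directional preference such as "cow above
# grass, not grass above cow" on a vertical edge cannot be expressed by an
# associative potential alone:
assert potts_potential('cow', 'grass') == potts_potential('grass', 'cow')
```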

Another thing to consider is training cost. Here, in fact, an MRF or CRF is more expensive to train than a Bayesian network. We might still want to take the hit because of the benefits on the modeling side, but we've already seen that, in the case of complete data, training a Bayesian network is considerably more computationally efficient than training an MRF or a CRF. So this is a place where we're taking a hit for a design decision we made on the representation side.

And finally, if we have to learn with missing data, that sets up yet another constraint that we want to think about, one that might not be obvious going in.

So say we have a scenario where the segment labels are not observed; that is, we're trying to learn to segment in an unsupervised way, from unlabeled data.

12:15

Well, in this case, the S's are unobserved; they are latent. That means optimizing P(S | X) doesn't even make sense, because S is never observed. So in this case we simply cannot learn a CRF: the objective of optimizing the labels S given the features is completely undefined in this setting, and we have to resign ourselves to learning with a different representational scheme. This is another case where we see the intertwining between the different choices that we make and the different constraints that we have: learning with missing data interacts with the choice of representation in potentially unexpected ways.

Â 13:19

Now, one important benefit of probabilistic graphical models and their modularity is that they allow us to mix and match different ideas within the modeling framework. The reason that's important is that it gives us a much greater richness of options than considering each individual topic we learned as a solution in and of itself; we can often combine them in interesting ways.

To see one example, let's go back to the image segmentation problem we just talked about, and think about models that actually mix directed and undirected edges. Here, for example, if we're trying to learn image segmentation from unsupervised data, we might have undirected edges over the labels S, as an MRF, just as we see in this diagram, which reflects the natural lack of directionality we discussed before, but with directed edges from the S's to the features X_i. This is a model that is half undirected: it's undirected at the level of the labels, and directed going from the S's to the X's.
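Concretely, the directed "emission" part of such a mixed model might be sketched as follows, hypothetically taking a single 1-D Gaussian feature per superpixel (the data and function name are illustrative, not from the course):

```python
# A sketch of the directed emission part P(X_i | S_i) of a mixed model.
# Because these CPDs are directed, each class's parameters are simple
# per-class maximum-likelihood estimates, computed separately, with no
# global optimization over the whole model. (In the unsupervised setting,
# the hard labels below would be replaced by expected assignments computed
# by inference, as in EM.)

from statistics import fmean, pstdev

def fit_emissions(features, labels):
    """Per-class Gaussian MLE for P(x | s), as {class: (mean, std)}."""
    params = {}
    for s in set(labels):
        xs = [x for x, lab in zip(features, labels) if lab == s]
        params[s] = (fmean(xs), pstdev(xs))  # pstdev = population (ML) std
    return params

# Toy data: one feature value per superpixel, labeled 'grass' or 'cow'.
feats  = [0.1, 0.2, 0.15, 0.9, 1.0, 0.95]
labels = ['grass', 'grass', 'grass', 'cow', 'cow', 'cow']
print(fit_emissions(feats, labels))
```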

Now, why is that good? First, it gives us a meaningful optimization objective: we're trying to optimize, using our model parameters and model structure, the ability to explain the images we've seen using the segment characteristics, or the class-label characteristics. In that respect it's a generative model, which, by forcing us to in some sense generate the observed data, can allow us to learn something about the statistics and patterns in the distribution. But by making this part of the model directed rather than undirected (which we could have done by putting an MRF over the entire set of variables, which would also have been a generative model), we greatly facilitate the learning: each of the CPDs P(X_i | S_i) can be learned separately, without trying to optimize over the model as a whole, and so learning is considerably more efficient. So this is a case where mix and match is a better strategy than a unified model that is entirely directed or entirely undirected. Another type of mix and match can occur

on the inference side. We learned about a variety of different inference algorithms, and we might be tempted to just take a model and throw it into the black box of one or another of those algorithms. But it turns out to be much more beneficial, in certain cases, to apply different inference algorithms to different pieces of the model. For example, we might use something like belief propagation or Markov chain Monte Carlo on some subset of the variables, while on certain other sets of variables we can do exact inference, which gives much more accurate results over those subsets and perhaps improves the convergence properties of our algorithms considerably, as well as the accuracy that we get.

So let's take an example. This is a bipartite MRF that we have seen before, with a set of variables A on one side and B on the other, and let's assume there is fairly dense connectivity between the variables, maybe even full connectivity between the A's and the B's. Now, let's think about which algorithm we might want to use for this kind of model. If there is a very small set of A's, or a very small set of B's, we can use exact inference; but that's not usually the case, so we might have to resort to approximate inference. One possibility is to apply belief propagation, but if you look at this graph, you can see that it has a very large number of fairly tight loops, and we know that tight loops are a problem for belief propagation, so we might be worried about the quality of the convergence and the quality of the answers that we get. Another approach is to use Markov chain Monte Carlo, such as Gibbs sampling, but in this case, if we have a lot of A's and a lot of B's, then we're sampling over a very high-dimensional space, and we know that sampling methods in high-dimensional spaces again have limitations in terms of the rate of convergence and the quality of the answers.

So one possibility that we might consider here is a variant of Markov chain Monte Carlo where we sample only over the A's, say if there are fewer A's than B's. We sample over the A's, and then for each assignment to the A's (each assignment A1 through AK, which is one of our particles), we maintain a distribution over the B's in closed form. So we're effectively doing exact inference over the B's, and sampling only over the A's, which will considerably improve the rate of convergence as well as the statistical properties of our estimator.
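A toy sketch of this "sample the A's, stay exact over the B's" idea (a collapsed, or Rao-Blackwellized, sampler) might look as follows. All potentials and numbers are illustrative; for simplicity the A-samples are drawn i.i.d. here, whereas a real sampler would Gibbs-sweep the A's from their conditionals:

```python
# The key point is the update for B: we accumulate the closed-form
# conditional P(B = 1 | a) instead of a sampled 0/1 value, which lowers
# the variance of the resulting marginal estimate.

import math
import random

def p_b_given_a(a, weights, bias):
    """Closed-form P(B = 1 | a) for a logistic-style potential."""
    score = bias + sum(w * ai for w, ai in zip(weights, a))
    return 1.0 / (1.0 + math.exp(-score))

def estimate_b_marginal(num_samples=2000, seed=0):
    """Estimate P(B = 1) by averaging exact conditionals over sampled a's."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        a = [rng.random() < 0.5 for _ in range(3)]  # stand-in Gibbs step
        total += p_b_given_a(a, weights=[1.0, -0.5, 0.25], bias=0.0)
    return total / num_samples

# The exact marginal under this toy model is about 0.586; the estimate
# should land nearby.
print(round(estimate_b_marginal(), 3))
```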

And this is just one example; we could talk about others, such as variants of belief propagation in which we construct larger clusters and run exact inference within each cluster. There are many, many such examples where mixing different inference algorithms can be tremendously beneficial.

Mix and match on the learning side is equally useful: again, we can apply different learning algorithms to different parts of the model. Going back to our image segmentation example, and now assuming we're back in the case of learning from fully observed data, consider the CRF. It's very common to first train a high-quality, easily trainable model, such as a support vector machine or some kind of logistic regression, for P(S | X). We can train that very efficiently, using sophisticated machine-learning tricks like kernel methods and various other ideas, to get really high-accuracy node potentials. We then fix those and use CRF learning only for the higher-order potentials, say the pairwise or even higher-order ones. It turns out that when you do that, the result is in many cases less susceptible to overfitting, and it's much, much faster to train.

So by putting together different pieces of functionality, different ideas, we can build some very nice integrated applications, and I want to end this discussion with one example of a cool application that puts together a whole range of different ideas in a single framework. This is work from Microsoft Research, which has a longstanding tradition in the use of probabilistic graphical models, and it puts together ideas from speech recognition, from diagnosis, and from decision making under uncertainty.

So let's watch the video and then talk about it briefly.

Â 21:04

Hi, thanks for coming to the triage kiosk. I am in touch with the best doctors in the world. Are you here for your child or yourself?

No, for my child.

So, how can I help you?

She's been having diarrhea.

Oh, no. Sorry to hear that. Hi there. What's your name?

Manu.

And how old are you?

Five.

So, five years old, right?

Yeah. Almost six.

Okay. Six years old. Have you felt nauseous?

I don't know.

What we can see here is the use of value of information to figure out which questions ought to be asked, that is, which will be most useful for improving the quality of the diagnosis and the ability to make decisions in the Bayesian network that we see here on the right.

I'm sorry. Has your child been vomiting at all?

No.

Okay. So let's see, has there been a fever?

Slight.

Have you been feeling tired lately?

Yes.

Let's see, how about abdominal pains?

He's not been complaining about it.

Has your tummy been hurting?

Yes.

Oh, I'm really sorry. I hope you feel better soon.

Another component here is the use of the current model, the current state of the scenario, to guide the speech recognition system, so as to get better-quality results from it. Instead of doing totally unguided speech recognition, one uses the current scenario, as the system understands it, to guide the priors on what the person is probably saying.

Okay. Do you happen to know if your child has been losing weight?

No.

That's okay. Have you noticed whether your child's stool looks funny? Is it black or darker than usual?

It is darker than usual.

Okay, let's see. Considering everything you've told me, I'd like to schedule an appointment with the doctor, just to be sure. However, not to worry: I am not that concerned at this point.

So, to summarize.

Probabilistic graphical models provide us with a coherent framework for reasoning and learning in complex, uncertain domains, and over the years people have developed an enormous suite of tools for accomplishing different tasks within this framework. We can put those tools together within a single unified framework to solve much harder problems than any single tool can solve alone. We've seen, through the different examples in this course, that this framework is used in a huge range of applications, and there are many more in which it can be used. So overall, there are a lot of places where one can continue to work in this paradigm, both in applying it to applications of interest as new ideas come up, and in improving the foundational methods for representation, inference, and learning, which could allow us to address yet newer problems.
