0:00

Hello and welcome to the class on Probabilistic Graphical Models. Probabilistic graphical models are a bit of a mouthful, so before we define them, let's first figure out what they might be used for.

One example application, which in fact is the one where probabilistic graphical models, or PGMs as they're called, first made their way into computer science and artificial intelligence, is that of medical diagnosis. Consider a doctor who's faced with a patient. The doctor has a fair amount of information at her disposal when she looks at the patient: predisposing factors, symptoms, the results of various tests. And from that, she's supposed to figure out what diseases the patient might have and what the response to different treatments might be.

0:41

A very different application where PGMs have also been used is that of image segmentation. Here, we might have an image such as this one, with thousands or even hundreds of thousands of pixels, and what we'd like to do is figure out what each pixel corresponds to. For example, if we break up the image into these fairly large regions, so as to have less stuff to reason about, we want to figure out which of these regions corresponds to grass, sky, cow, or horse.

What do these two problems have in common? First, they have a very large number of variables that we have to reason about. In the context of the doctor, it's all of the predisposing factors, test results, possible diseases, and so on. And in the context of image segmentation, it's the labels for these different pixels, or for these larger regions called superpixels. The second thing these applications have in common is that, fundamentally, there is going to be significant uncertainty about the right answer, no matter how clever the algorithms we design.

1:42

So, probabilistic graphical models are a framework for dealing with this kind of application. Let's first understand what each of these words means in the context of this framework.

First, consider the word "models." What is a model? A model is a declarative representation of our understanding of the world. It is a representation within the computer that captures our understanding of what these variables are and how they interact with each other. The fact that it's declarative means that the representation stands on its own: we can look into it and make sense of it aside from any algorithm that we might choose to apply to it.

2:27

So, why is that important? It's important because that same representation, that same model, can then be used in the context of one algorithm that answers one kind of question, or other algorithms that might answer different kinds of questions, or the same question in more efficient ways, or that make different trade-offs between accuracy and computational cost.

The other advantage of having a standalone model is that we can separate the construction of the model from the algorithms that are used to reason over it. So, we can construct methodologies that elicit these models from a human expert, or ones that learn them from historical data using statistical machine learning techniques, or a combination of the two. And once again, the separation between the model, the reasoning algorithms, and the learning allows us to tackle each of these problems separately.

3:21

So, that was the word "model"; what about "probabilistic"? The word probabilistic is in there because these models are designed to help us deal with large amounts of uncertainty. Uncertainty comes in many forms and for many different reasons. First, it comes because we have only partial knowledge of the state of the world; for example, the doctor doesn't get to measure every symptom or every test result, and she's certainly uncertain about the diseases that the patient has. Uncertainty also comes because of noisy observations: even when we get to observe certain things, like blood pressure, those observations are often subject to significant amounts of noise. Uncertainty comes in because of modeling limitations as well, so we're going to have phenomena that are just not covered by our model: all sorts of obscure diseases, for example, that might cause the same set of symptoms. It's impossible for us to write down a model so detailed that it includes every possible contingency and every possible factor, and so we're going to have uncertainty and variability that is simply due to modeling limitations. And finally, some people would argue that the world is inherently stochastic. Certainly, if you go down to the quantum level, that's true. But even at a higher level, the modeling limitations of complicated systems are such that one might as well view the world as inherently stochastic.

Probability theory is a framework that allows us to deal with uncertainty in ways that are principled and that bring to bear important and valuable tools. First, probabilistic models give us, again, a declarative representation, one that is standalone, where you can look at a probability distribution and it has clear semantics that represent our uncertainty about the different states the world might be in. It also provides us with a toolbox of powerful reasoning patterns that include, for example, conditioning on new forms of evidence, and decision making under uncertainty.
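To make "conditioning on new evidence" concrete, here is a minimal sketch in Python. The tiny joint distribution below, over a hypothetical disease variable and a test result, uses numbers invented purely for illustration: conditioning simply keeps the entries consistent with the evidence and renormalizes them.

```python
# A toy joint distribution over two binary variables:
# Disease (D) and Test result (T). All numbers are
# hypothetical, chosen only to illustrate conditioning.
joint = {
    # (D, T): P(D, T)
    (0, 0): 0.855,  # no disease, negative test
    (0, 1): 0.045,  # no disease, positive test
    (1, 0): 0.010,  # disease, negative test
    (1, 1): 0.090,  # disease, positive test
}

def condition(joint, var_index, value):
    """Condition on one variable taking a value:
    keep the consistent entries, then renormalize."""
    kept = {a: p for a, p in joint.items() if a[var_index] == value}
    z = sum(kept.values())
    return {a: p / z for a, p in kept.items()}

# P(D | T = 1): probability of disease given a positive test.
posterior = condition(joint, var_index=1, value=1)
p_disease = sum(p for (d, t), p in posterior.items() if d == 1)
print(round(p_disease, 3))  # 0.09 / (0.045 + 0.09) = 0.667
```

Note how the positive test shifts the probability of disease from 10% in the prior to about 67% in the posterior; this is exactly the kind of reasoning pattern the framework supports.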

5:55

Finally, the word "graphical." The word graphical is here from the perspective of computer science, because probabilistic graphical models are a synthesis between ideas from probability theory and statistics and ideas from computer science. And the idea here is to use concepts from computer science, specifically graphs, to allow us to represent systems that are very complicated and that involve large numbers of variables. We've already seen such large numbers of variables in both of the applications that we used as examples, in the medical example as well as in the image segmentation example.

And so, in order to capture probability distributions over spaces involving such a large number of factors, we need to have probability distributions over what are called random variables. The focus of this class, and what we'll do for most of it, is to think about the world as represented by a set of random variables, X1 up to Xn, each of which captures some facet of the world: one symptom that may be present or absent, or a test result that might have a continuous set of possible values, or a pixel that might have one of several labels. So each of these is a random variable, and our goal is to capture our uncertainty about the possible states of the world in terms of a probability distribution, or what's called a joint distribution, over the possible assignments to this set of random variables.

Now, the important thing to realize when looking at this is that even in the simplest case, where each of these variables is, say, binary valued (which is not usually the case, but say so just for the sake of argument),

7:48

if you have n binary-valued variables, then this is a distribution over 2^n possible states of the world, one for each possible assignment. And so we have to deal with objects that are intrinsically, exponentially large. Our only way to do that is by exploiting data structures that use ideas from computer science to encode and exploit the structure in the distribution, and to represent and manipulate it in an effective way.
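This exponential blowup is easy to see directly. The sketch below, in plain Python with uniform probabilities chosen only for illustration, builds the explicit table for a joint distribution over n binary variables and shows how fast it grows:

```python
from itertools import product

# Over n binary variables there are 2**n assignments, and an
# explicit joint distribution stores one probability per
# assignment. (Uniform probabilities, purely for illustration.)
def explicit_joint(n):
    """Build the full table for a uniform joint over n binary variables."""
    return {assignment: 1.0 / 2**n for assignment in product((0, 1), repeat=n)}

for n in (2, 10, 20):
    print(n, 2**n)   # table size doubles with every added variable

table = explicit_joint(10)
print(len(table))    # 1024 entries just for 10 binary variables
```

At the scale of the earlier examples, with hundreds of variables, such a table could never be stored explicitly, which is why the structured representations below matter.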

So what are graphical models? Let's look at a couple of very simple examples. Here's a toy Bayesian network, one that will accompany us through much of the first part of this course. A Bayesian network is one of the two main classes of probabilistic graphical models, and it uses a directed graph as its intrinsic representation. In this case, remember we had a set of random variables X1 up to Xn; the random variables are represented by nodes in the graph.

To take a look at this very simple example, which we'll discuss again later, here we have a situation where a student takes a course and gets a grade in the course, and so that's one of our random variables. We have other random variables that are related to that, for example the intelligence of the student and the difficulty of the course. And others that might also be of interest, for example the quality of the recommendation letter that the student gets in the course, which depends on things such as the student's grade, and the score that the student might receive on the SAT. So, this is a representation of a probability distribution, in this case over these five random variables. And the edges in this graph represent the probabilistic connections between those random variables, in a way that is quite formal, as we'll define later on.
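One way to see what the directed graph buys us: a Bayesian network defines the joint distribution as a product of local conditional distributions, one per node given its parents. The sketch below encodes the student network's structure over Difficulty, Intelligence, Grade, SAT, and Letter; the structure matches the lecture's example, but all the numeric CPD values are hypothetical placeholders, not the course's actual numbers.

```python
# Student network factorization:
# P(D, I, G, S, L) = P(D) P(I) P(G | I, D) P(S | I) P(L | G).
# All probability values below are hypothetical placeholders.

p_D = {0: 0.6, 1: 0.4}              # course difficulty: easy/hard
p_I = {0: 0.7, 1: 0.3}              # student intelligence: low/high
p_G_given_ID = {                    # grade (1=A, 2=B, 3=C) given (I, D)
    (0, 0): {1: 0.30, 2: 0.40, 3: 0.30},
    (0, 1): {1: 0.05, 2: 0.25, 3: 0.70},
    (1, 0): {1: 0.90, 2: 0.08, 3: 0.02},
    (1, 1): {1: 0.50, 2: 0.30, 3: 0.20},
}
p_S_given_I = {0: {0: 0.95, 1: 0.05},   # SAT score given intelligence
               1: {0: 0.20, 1: 0.80}}
p_L_given_G = {1: {0: 0.10, 1: 0.90},   # letter quality given grade
               2: {0: 0.40, 1: 0.60},
               3: {0: 0.99, 1: 0.01}}

def joint(d, i, g, s, l):
    """Joint probability as the product of local conditionals."""
    return (p_D[d] * p_I[i] * p_G_given_ID[(i, d)][g]
            * p_S_given_I[i][s] * p_L_given_G[g][l])

# Sanity check: the products over all assignments sum to 1,
# so the factorization really defines a distribution.
total = sum(joint(d, i, g, s, l)
            for d in p_D for i in p_I for g in (1, 2, 3)
            for s in (0, 1) for l in (0, 1))
print(round(total, 6))  # 1.0
```

The point of the factorization is parameter economy: the local tables above hold a few dozen numbers, whereas the explicit joint over these five variables would need an entry for every one of its 2·2·3·2·2 = 48 assignments, and the gap widens exponentially as variables are added.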

The other main class of probabilistic graphical models is what's called the Markov network, and it uses an undirected graph.

10:04

In this case, we have an undirected graph over four random variables, A, B, C, and D, and we'll give an example of this type of network a little bit later on.
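As a sketch of what such an undirected model looks like computationally: in a pairwise Markov network, each edge carries a factor, the unnormalized score of an assignment is the product of its edge factors, and a partition function Z normalizes the scores into a distribution. The edge set (assumed here to form a cycle over A, B, C, D) and the factor values are invented for illustration.

```python
from itertools import product

# A pairwise Markov network over binary variables A, B, C, D,
# assumed to be connected in a cycle A-B, B-C, C-D, D-A.
# The factor values are hypothetical.
def phi(x, y):
    """A pairwise factor that favors agreeing neighbors."""
    return 10.0 if x == y else 1.0

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # A-B, B-C, C-D, D-A

def unnormalized(assignment):
    """Product of edge factors for one joint assignment."""
    score = 1.0
    for u, v in edges:
        score *= phi(assignment[u], assignment[v])
    return score

# The partition function Z turns the product of factors
# into a probability distribution.
Z = sum(unnormalized(a) for a in product((0, 1), repeat=4))

def prob(assignment):
    return unnormalized(assignment) / Z

print(prob((0, 0, 0, 0)) > prob((0, 1, 0, 1)))  # True: agreement is likelier
```

Unlike a Bayesian network, no factor here is a conditional probability of one variable given others; the semantics are symmetric, which is exactly what makes the undirected class a natural fit for the image segmentation example, where neighboring labels influence each other without a natural direction.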

So those were toy examples; here are some real examples of the same types of frameworks. This is a real Bayesian network. It's a network called CPCS, a real medical diagnosis network. It was designed at Stanford University for the purpose of diagnosing internal diseases, and it has some 480 nodes and a little over 900 edges. It was used for diagnosing internal diseases by physicians here.

Another real graphical model, in this case on the Markov network side, is one used for the image segmentation task that we talked about before. Here, the random variables represent the labels of pixels or superpixels, one per superpixel, say. And the edges represent, again, probabilistic relationships between the label of a pixel and the label of an adjacent pixel, since these are likely to be related to each other.

11:27

So to summarize, the graphical representation gives us an intuitive and compact data structure for capturing these high-dimensional probability distributions. It provides us, at the same time, as we'll see later in the course, with a suite of methods for efficient reasoning, using general-purpose algorithms that exploit the graphical structure. And because of the way in which the graph structure encodes the parameterization of the probability distribution, we can represent these high-dimensional distributions efficiently, using a very small number of parameters. That allows both feasible elicitation by hand from an expert and automatic learning from data, and in both cases, the reduction in the number of parameters is very valuable.

This framework is a very general one and has been applied to a very broad range of applications. I'm not going to list all of the ones on this slide, and there are many others that I could have put on the slide had there been more space. Let me just very briefly mention a few of them on subsequent slides.

So, we've already discussed image segmentation. Just to motivate the benefits of the PGM framework in this context, let's look at these two images as an example. Here are the original images, and this is the division of these images into what I mentioned were called superpixels, these slightly larger coherent regions.

13:17

And this is what you get if you apply a state-of-the-art machine learning framework to individual superpixels separately. Just trying to classify each superpixel separately, you can see that you get, especially in the case of the cow, a total mess, with different superpixels having drastically different labels; you can't even see the cow in this segmented image. Whereas if you construct a probabilistic graphical model that captures the global structure of the scene and the correlations, the probabilistic relationships, between these superpixels, you end up with a much more coherent segmentation that captures the structure of the scene.

We've already discussed medical diagnosis as an example, so here's a real-world application. This was something that was fielded on the Microsoft network for helping parents figure out what was wrong with a sick child; the site was called OnParenting. Parents could enter the primary complaint, and were then led through a series of questions that allowed the system to provide a probability distribution over the most likely diagnoses that might be ailing the child.

A very different application is textual information extraction, where we might have an article from, say, a newspaper, and we'd like to take this unstructured data and convert it into a structured form, where we have some representation of the people, locations, organizations, and perhaps relationships. So one such task might be to take this kind of sentence and recognize that these two words together form a person, which might not be that hard, given the presence of the word "Mrs." But this one is a little bit harder, because Green is also an ordinary word, and yet we want to identify it as a person. We might then want to infer that this is a location and perhaps that this is an organization.

15:20

It turns out that the state-of-the-art methodology for solving this problem is as a probabilistic graphical model, where we have a variable for each word that encodes the label for that word. For example, here we have the beginning of a person unit and an intermediate label for that person unit. And here is another person unit, whereas this variable we would like to label as a location. But we would like to capture, importantly, the correlations between both adjacent and non-adjacent words, using this occurrence of the word Green to infer that this other occurrence of the word Green is also a name.
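A minimal sketch of the idea, with entirely made-up scores: each word gets a label variable, and pairwise scores on adjacent labels reward coherent labelings (for example, an inside-person label following a begin-person label), so the jointly best labeling can differ from labeling each word on its own.

```python
from itertools import product

# Sequence labeling as a chain-structured model: one label
# variable per word, with pairwise scores coupling adjacent
# labels. All scores here are hypothetical.
LABELS = ("B-PER", "I-PER", "O")

# Unary scores: how well each label fits each word on its own.
unary = [
    {"B-PER": 2.0, "I-PER": 0.1, "O": 0.5},   # "Mrs."
    {"B-PER": 0.5, "I-PER": 2.0, "O": 1.0},   # "Green"
]

def pairwise(prev, cur):
    """I-PER should continue a person unit, not start one."""
    if cur == "I-PER":
        return 3.0 if prev in ("B-PER", "I-PER") else 0.01
    return 1.0

def score(labels):
    """Product of unary scores and adjacent pairwise scores."""
    s = unary[0][labels[0]]
    for t in range(1, len(labels)):
        s *= unary[t][labels[t]] * pairwise(labels[t - 1], labels[t])
    return s

# The jointly best labeling tags "Mrs. Green" as one person unit.
best = max(product(LABELS, repeat=2), key=score)
print(best)  # ('B-PER', 'I-PER')
```

Real systems of this kind score much richer features and longer chains, but the principle is the same: the pairwise couplings let evidence about one word's label propagate to its neighbors.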

A very different example altogether is one that integrates data from multiple sensors. This occurs in many applications; one such is integrating traffic-related data from sensors that we might have in the road or on top of bridges, weather information, and incident reports that we might get in some form. We'd like to take all of this together and use a model, one that, as it happens, was learned from data. That model is then used to predict both current and future road speeds, not only on roads where we have sensors that measure traffic but, even more interestingly, on roads where traffic wasn't measured. And it turns out that this was a very successful application that was fielded in several large cities with very good results.

From a totally different application domain, probabilistic graphical models have been used very successfully for discovering new knowledge that we didn't have before. In this case, the application was biological network reconstruction, where biologists measured the levels of a diverse set of proteins in different conditions, under different perturbations. From that, they learned the relationships between those proteins and discovered interactions where one protein was controlling another, including ones that were not known before.

So, let's conclude this introduction with an overview of what we'll learn in the class. We will cover three main pieces related to probabilistic graphical models. We'll cover representation of PGMs, including both directed and undirected representations. We'll also talk about higher-level structures that allow us to encode more complicated scenarios, including ones that involve temporal structure as well as ones that involve repeated or relational structure, specifically a class of models called plate models. We'll talk about inference, or reasoning, using these models, covering both exact reasoning, where exact probabilities are the outcome, as well as approximate methods that provide different trade-offs between accuracy and computation. And we'll also talk about how this class of models can be used for decision making under uncertainty.

18:41

And finally, we'll talk about how you might learn these models from historical data, and we'll talk about learning both the parameters and the structure of these models automatically. We'll deal both with the somewhat simpler scenario where we have complete data, that is, where all of the variables are always observed, as well as the more difficult case where some of the variables might not be observed all the time, or perhaps not at all. That introduces an interesting set of complications, but also a wealth of very exciting applications, as we'll see.
