0:00

Our final big module in this course is that of learning a probabilistic graphical model from data. Before we delve into the details of specific learning algorithms, let's think about some of the reasons why we might want to learn a probabilistic graphical model from data, some of the different scenarios in which this learning problem might arise, and how we might go about evaluating the results of our learning algorithm. So, the setup here is that we assume

that we have some kind of true distribution, which is typically denoted

by P*. And, in many cases, although not always,

we might assume that P* is actually generated from a probabilistic graphical

model M*. And that assumption allows us to talk

about the differences between a learned model and the ground-truth model M* that generated the distribution. Now, we're assuming that from this distribution P*, we get a data set D of instances d1 up to dM, and we're assuming that those are sampled from the distribution P*. Now, in addition to the data,

we may or may not have some amount of domain expertise that allows us to put some prior knowledge into the model. And in fact, the ability to put in prior

knowledge is one of the strengths of probabilistic graphical model learning,

as compared to a variety of other learning algorithms where this is not

always quite as easily done. So, combining elicitation from an expert and learning, what we end up with is a network that we can then look at and use for different purposes. So, to make this a little bit more concrete, let's look at the different scenarios in the context of Bayesian networks; the issues for Markov networks look essentially the same.

So in the case of known structure and complete data, we have a network which we

assume to be true. We have input data which is nice and

clean. You see that all of the variables have values in every single instance, and our goal is to produce the set of CPDs

for the network. In the case of unknown structure and

complete data, we have the same type of data set,

but notice that now the initial network has no edges in it and we now need to

infer the edge connectivity as well as the CPDs.

Incomplete data arises when, as you can see here, some of the variables are not observed in the training data, and as we'll see, this can actually complicate the learning problem quite considerably. And finally, there is the case of unknown structure and incomplete data. Now,

in the latent variable case, notice that we have a situation where we know about

three of the variables, X1, X2, and Y. But our final model has, in addition to X1, X2, and Y, an additional latent variable H that we didn't even know about; it might have been there, but we didn't observe any of the values for it.

We didn't even know of its existence. And we want to learn a model that

involves not only X1, X2 and Y, but also the variable H.

3:27

So, now let's think about the reasons why we might want to learn a probabilistic

graphical model. The most obvious one is that we want a model that we can use in the same way that we would use one that we elicited by hand: to answer probabilistic queries, whether conditional probability queries or MAP queries, about new instances that we

haven't seen before. Now, introducing concepts that we'll

study in more detail a little bit later on, the simplest possible metric that we

might envision for training a PGM is basically how probable are the instances

that we've seen relative to a given model?

So, this metric is called the training set likelihood, and it's formalized as follows: it is the probability of the data that we've seen, our data set D, relative to a given model M, that is, P(D | M). And the intuition behind this is that if a model makes the data more likely, that is, it was more likely to have generated this data set, then it's a pretty good model, or a pretty good assumption, about the process that generated our data. And here I'm just opening up this definition: this just turns into the product over the instances of the probability of each individual instance given the candidate model M, that is, P(D | M) = P(d1 | M) x ... x P(dM | M). And this is assuming that the instances are IID, independent and identically distributed, given the model M.
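To make this concrete in code, here is a minimal sketch, not from the lecture itself, of the training set likelihood under the IID assumption. The model interface, a prob method returning P(d | M), is a hypothetical stand-in for whatever representation the learned network uses; in practice we work with log-probabilities, since the product of many probabilities underflows, and maximizing the log-likelihood is equivalent to maximizing P(D | M).

```python
import math

def log_likelihood(model, data):
    """Training-set log-likelihood: sum of log P(d_m | M) over the instances d1..dM in D."""
    # model.prob(d) is an assumed interface returning the probability that the
    # candidate model M assigns to the fully observed instance d.
    return sum(math.log(model.prob(d)) for d in data)
```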

So, one important notion that will accompany us throughout this discussion is that while the training set likelihood seems intuitively like a pretty good scoring function for picking the model, it isn't what we actually care about. Because what we really care about is new data, not the data that we got before; we care about making conclusions about data we haven't seen. And so what we really want to do is evaluate our model on a separate test set.

And you've already seen the notion of test sets in the context of other learning problems, and the same idea is fundamental here for PGMs as well: our evaluation really should care not about the original data set D, but rather about a new data set D', which gives us a surrogate for what's called generalization performance.
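As a minimal sketch of that evaluation, again under an assumed model.prob interface, we can report the average log-likelihood on the training set D and on the held-out set D'; it is the held-out number that serves as our surrogate for generalization performance.

```python
import math

def average_log_likelihood(model, data):
    """Average log P(d | M) per instance; higher is better."""
    return sum(math.log(model.prob(d)) for d in data) / len(data)

def evaluate(model, train_data, test_data):
    """train_data plays the role of D, test_data the role of D' (both sampled from P*)."""
    return {
        "train_avg_ll": average_log_likelihood(model, train_data),  # what we optimized
        "test_avg_ll": average_log_likelihood(model, test_data),    # what we actually care about
    }
```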

A related but somewhat different variant of a learning task that you might want the PGM to perform arises when

we have a specific prediction problem that we care about.

So, for example, we might have a setting where we specifically care about predicting a particular set of target variables Y from a set of observed variables X. And we've seen multiple examples of this, such as image segmentation, where we have, for example, X being the pixels in the image and Y being the predicted class labels. Speech recognition is another such example, where we have an acoustic signal as X and a sequence of phonemes as Y.

So all of these are cases where we have a particular prediction task. Now, in this case, we often care about a specialized objective: for example, pixel-level segmentation accuracy in the context of image segmentation, or word accuracy rate in the context of speech recognition. Even though that's often the case, it turns out that in many cases it's convenient, for algorithmic and mathematical purposes, to select the model to optimize the same notion of either likelihood or conditional likelihood, where we compute the probability of the Ys given the Xs. And although that likelihood is not always a perfect surrogate for the specialized objective that we actually care about, it turns out to be mathematically convenient, and that's why it's often done.
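As an illustration, here is a minimal sketch, under an assumed model interface that is not from the lecture, contrasting the conditional likelihood objective P(Y | X) with a task-specific metric such as per-variable (for example, per-pixel) accuracy computed from MAP predictions.

```python
import math

def conditional_log_likelihood(model, pairs):
    """Sum of log P(y | x, M) over (x, y) pairs; model.cond_prob is an assumed interface."""
    return sum(math.log(model.cond_prob(y, x)) for x, y in pairs)

def per_variable_accuracy(model, pairs):
    """Task-specific metric: fraction of individual target variables predicted correctly."""
    correct, total = 0, 0
    for x, y in pairs:
        y_hat = model.map_assignment(x)  # MAP query: argmax over y of P(y | x, M)
        correct += sum(int(a == b) for a, b in zip(y_hat, y))
        total += len(y)
    return correct / total
```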

However, it's important to evaluate the model's performance on the true objective over test data, as opposed to just using likelihood as the evaluation of how successful our learning algorithm was. A third setting where one might want to

use PGM learning is actually qualitatively quite different.

In this case, we might not care about using the model for any particular

inference task, but rather we care about inferring the structure itself. That is, what we care about is knowledge discovery, or structure discovery, where our goal is to try and get as close as possible to the generating model M*.

Using PGM learning for this task might help us distinguish between direct and

indirect dependencies. So if we see a correlation between X and

Y in the data, we want to infer whether that corresponds to a direct

probabilistic interaction between them or something that proceeds via a third

variable C for example. In some cases, when we are learning a

Bayesian network, we might be able to infer the directionality of the edges and

thereby get some intuition regarding causality.

And, in other cases, when we learn models with latent variables, the existence of

those latent variables, their location, and often the way in which the values of

the latent variables get assigned to different instances gives us a lot of

information about the structure of the domain.

In many cases, although not always, we solve this learning problem by training with the same likelihood-based objective. Now, we know that that is not a particularly good surrogate for structural accuracy, but from a mathematical and algorithmic perspective, it's a very convenient

optimization objective and therefore it's often used in practice,

although there are also other ideas out there.

However, it's important not to use likelihood, even likelihood of the test set, as the sole objective for evaluating model performance. And in many cases, as we'll see in the context of some examples, the evaluation here needs to be done by comparing to whatever limited prior knowledge we have about the model M*. So, we can compare against prior knowledge that was not given to the algorithm and see whether the algorithm was able to adequately reconstruct it.

Now, we talked earlier in this module about the fact that the training likelihood tends to overfit the model. And that, in fact, is a general observation: when you select the model M to optimize the training set likelihood, that tends to overfit badly to statistical noise, random fluctuations that happen when we generate our training set. That happens in several different ways. It happens by overfitting at the level of parameters, where the parameters fit random noise in the training data; that can be avoided by the use of regularization, or parameter priors over the parameters, and we'll see how that gets done. It also happens when we overfit the

structure. And specifically one can show that if we

optimize the training set likelihood, then complex structures always win.

That is, we would always prefer the most complicated structure that our model

allows. And so if we're trying

to fit structure, it's important to either bound the model complexity or

penalize the model complexity so that we don't learn models that are just

ridiculously complicated for no good reason.
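One standard way to implement such a penalty, shown here only as an illustration and not as the lecture's specific proposal, is a BIC-style score that trades off training-set log-likelihood against the number of independent parameters in the candidate structure.

```python
import math

def bic_score(train_log_likelihood, num_free_params, num_instances):
    """Penalized likelihood score for a candidate structure: higher is better.
    The penalty grows with the number of free parameters and with log of the data set size M."""
    return train_log_likelihood - 0.5 * math.log(num_instances) * num_free_params
```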

Now, all of these different choices that we talked about are called

hyperparameters. So hyperparameters include things like

the parameter priors or the regularization over parameters,

the strength of the regularization. If we're doing complexity bounds or

complexity penalties, that's another hyperparameter.

All of these are things that we need to pick before we can actually apply our

learning algorithm. And, so how does that happen?

Well, we need to figure out a way to select them, and it turns out that that

decision makes a huge difference in many cases to the performance of our learning

algorithm. And so, how do we pick these

hyperparameters? Well, one obvious choice is to pick them

on the training set. A few seconds of thought ought to

convince us that that is a terrible idea, because we just talked about the fact

that on the training set, the optimal thing to do is to have maximum

complexity. And so if we pick these hyperparameters on the training set, they're going to become totally vacuous. Another obvious choice is to pick them on the test set; that turns out to be another terrible idea, because it makes our performance estimate overly optimistic, since we picked these very important parameters so

as to optimize performance on our test set.

So the training set is bad, the test set is bad, and so the correct strategy

is to use what is called a validation set, which is a set that is separate from

both our training set on the one hand and our test set on the other.

A variant on this is to use what is called cross-validation on the training

set where we split the training set iteratively into a training and a

validation component and use that to pick hyperparameters.

And these are all concepts that you've seen before in the context of other

learning algorithms, and they're equally important here.
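Here is a minimal sketch of that idea: k-fold cross-validation on the training set to pick a regularization strength, with fit_model and avg_log_likelihood passed in as assumed, hypothetical helpers; the test set is never touched during this selection.

```python
def pick_regularization(train_data, candidate_strengths, fit_model, avg_log_likelihood, k=5):
    """Split the training set into k folds; for each candidate hyperparameter value,
    train on k-1 folds, score on the held-out fold, and keep the best average."""
    fold = len(train_data) // k
    best_strength, best_score = None, float("-inf")
    for strength in candidate_strengths:
        scores = []
        for i in range(k):
            valid = train_data[i * fold:(i + 1) * fold]
            train = train_data[:i * fold] + train_data[(i + 1) * fold:]
            model = fit_model(train, strength)             # assumed learner
            scores.append(avg_log_likelihood(model, valid))
        avg = sum(scores) / k
        if avg > best_score:
            best_strength, best_score = strength, avg
    return best_strength
```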

Finally, let's talk about why and when you might want to use PGM

learning as opposed to a generic machine learning algorithm.

PGM learning is particularly useful when

what we're trying to do is make predictions not over a single output variable, such as a binary outcome, the positive class or the negative class,

but rather we're trying to make predictions over structured objects.

For example, labeling entire sequences, as when we're doing sequence labeling in speech recognition or in natural language processing; or labeling entire graphs, for example in the case of image segmentation, where we have a grid of pixels and we're trying to label all the pixels simultaneously.

This allows us to exploit correlations between the multiple predicted variables,

often giving us significant improvements to performance.

A second reason to use PGM learning is that it allows us to incorporate prior

knowledge into our model in a way that many other algorithms have a bit of a

difficulty in allowing. And finally, this is particularly useful when we're trying to learn a single model, a single PGM, for multiple different tasks: whereas with traditional learning algorithms you learn a particular X-Y mapping, here you can learn

a single graphical model and use it in multiple different ways for answering

different types of queries. And finally, the idea of using learning for knowledge discovery is also possible in the context of other learning algorithms, but it is particularly useful in the context of

PGMs, because the form of the knowledge is often particularly intuitive.