0:00

Our final big module in this course is learning a probabilistic graphical model from data. Before we delve into the details of specific learning algorithms, let's think about some of the reasons why we might want to learn a probabilistic graphical model from data, some of the different scenarios in which this learning problem might arise, and how we might go about evaluating the results of our learning algorithm.

The setup here is that we assume there is some true distribution, which is typically denoted P*. In many cases, although not always, we might assume that P* is actually generated by a probabilistic graphical model M*, and that assumption allows us to talk about the differences between a learned model and the ground-truth model that generated the distribution. We assume that from this distribution P* we get a data set D of instances d1 up to dM, sampled from the distribution P*.

Now, in addition to the data, we may or may not have some amount of domain expertise that allows us to put prior knowledge into the model. In fact, the ability to incorporate prior knowledge is one of the strengths of probabilistic graphical model learning, as compared to a variety of other learning algorithms where this is not always quite as easily done. Combining elicitation from an expert with learning, what we end up with is a network that we can then look at and use for different purposes.

To make this a little more concrete, let's look at the different scenarios in the context of Bayesian networks; the issues in the Markov network case look fairly identical.

In the case of known structure and complete data, we have a network structure which we assume to be correct, and input data which is nice and clean: all of the variables have values in every single instance. Our goal is to produce the set of CPDs for the network. In the case of unknown structure and complete data, we have the same type of data set, but the initial network has no edges in it, and we now need to infer the edge connectivity as well as the CPDs. Incomplete data arises when some of the variables are not observed in the training data; as we'll see, this can complicate the learning problem quite considerably. And finally, there is unknown structure with incomplete data. In the latent variable case, we have a situation where we know about three of the variables, X1, X2, and Y, but our final model has, in addition to X1, X2, and Y, an additional latent variable H that we didn't even know about: not only did we not observe any of its values, we didn't even know of its existence. And we want to learn a model that involves not only X1, X2, and Y, but also the variable H.
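For the simplest of these settings, known structure with complete data, the CPDs can be estimated directly from counts in the data. A minimal sketch, assuming a toy data format of one dict per instance (the function name and data format are illustrative, not from the lecture):

```python
from collections import Counter

def estimate_cpd(data, child, parents):
    """Maximum-likelihood CPD estimation by counting, for the
    known-structure, complete-data case."""
    joint = Counter()  # counts of (parent assignment, child value)
    marg = Counter()   # counts of parent assignment alone
    for inst in data:
        pa = tuple(inst[p] for p in parents)
        joint[pa, inst[child]] += 1
        marg[pa] += 1
    # Normalize each parent assignment's counts into P(child | parents).
    return {(pa, x): n / marg[pa] for (pa, x), n in joint.items()}

# Toy complete data over a parent X1 and child Y (values invented).
data = [{"X1": 0, "Y": 0}, {"X1": 0, "Y": 1},
        {"X1": 0, "Y": 1}, {"X1": 1, "Y": 1}]
cpd = estimate_cpd(data, child="Y", parents=["X1"])
print(cpd[((0,), 1)])  # P(Y=1 | X1=0) = 2/3
```

With incomplete data, these counts are no longer directly available, which is one reason the learning problem becomes considerably harder.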

Â 3:27

So now let's think about the reasons why we might want to learn a probabilistic graphical model. The most obvious one is that we want a model we can use in the same way we would use one elicited by hand: to answer probabilistic queries, whether conditional probability queries or MAP queries, about new instances that we haven't seen before.

Introducing concepts that we'll study in more detail a little later on, the simplest possible metric we might envision for training a PGM is basically: how probable are the instances that we've seen, relative to a given model? This metric is called the training set likelihood, and it's formalized as the probability of the data set D that we've observed, relative to a given model M. The intuition is that if a model makes the data more likely, that is, if it was more likely to have generated this data set, then it's a pretty good model, or a pretty good hypothesis about the process that generated our data. Opening up this definition, the likelihood just turns into the product over the instances of the probability of each individual instance given the candidate model M. This assumes that the instances are IID, independent and identically distributed, given the model M.
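Opening up the definition in code, the training-set log-likelihood (taking logs turns the product into a sum and avoids numerical underflow) might be sketched as follows, with a one-parameter coin model standing in for a real PGM; the model and numbers are made up for illustration:

```python
import math

def log_likelihood(data, model):
    """Training-set log-likelihood: the sum over the IID instances
    of log P(d | M). `model` is any callable returning the
    probability of a single instance."""
    return sum(math.log(model(d)) for d in data)

# Stand-in "model": a single biased coin with P(heads) = p.
def coin(p):
    return lambda x: p if x == "H" else 1.0 - p

data = ["H", "H", "T", "H"]  # 3 heads, 1 tail
# The model matching the empirical frequency (p = 0.75) makes the
# training data more likely than a mismatched one (p = 0.5).
ll_mle = log_likelihood(data, coin(0.75))
ll_half = log_likelihood(data, coin(0.5))
```

Picking the model that maximizes this quantity is exactly the maximum-likelihood principle the lecture alludes to.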

One important notion that will accompany us throughout this discussion is that while the training set likelihood seems intuitively like a pretty good surrogate, a pretty good scoring function for picking the model, it isn't what we actually care about. What we really care about is new data, not the data that we got before; we care about drawing conclusions about data we haven't seen. So what we really want to do is evaluate our model on a separate test set. You've all already seen the notion of test sets in the context of other learning problems, and the same idea is fundamental here in PGMs as well: our evaluation really should consider not the original data set D, but rather a new data set D', which gives us a surrogate for what's called generalization performance.

A related but somewhat different variant of the learning task you might want a PGM to perform arises when we have a specific prediction problem that we care about.

For example, we might specifically care about predicting a particular set of target variables Y from a set of observed variables X. We've seen multiple examples of this, such as image segmentation, where X is the pixels in the image and Y is the predicted class labels, or speech recognition, where X is an acoustic signal and Y is a sequence of phonemes. All of these are cases where we have a particular prediction task. Now, in this case we often care about a specialized objective: for example, pixel-level segmentation accuracy in the context of image segmentation, or word accuracy rate in the context of speech recognition.

Even though that's often the case, it turns out that in many settings it's convenient, for algorithmic and mathematical purposes, to select the model so as to optimize the same notion of likelihood, or rather the conditional likelihood, where we compute the probability of the Ys given the Xs. Although that likelihood is not always a perfect surrogate for the specialized objective that we actually care about, it turns out to be mathematically convenient, and that's why it's often done.
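The conditional likelihood can be sketched just like the joint likelihood, except that each term scores only P(y | x) rather than the probability of the whole instance. The toy conditional model and its numbers below are invented for illustration:

```python
import math

def conditional_log_likelihood(pairs, cond_model):
    """Conditional log-likelihood: the sum of log P(y | x) over
    (x, y) pairs. It scores only the predictive distribution
    P(Y | X), ignoring how the Xs themselves are distributed."""
    return sum(math.log(cond_model(y, x)) for x, y in pairs)

# Toy conditional model: P(Y=1 | X=1) = 0.9, P(Y=1 | X=0) = 0.2.
def cond_model(y, x):
    p1 = 0.9 if x == 1 else 0.2
    return p1 if y == 1 else 1.0 - p1

pairs = [(1, 1), (0, 0), (1, 1), (0, 1)]  # observed (x, y) pairs
cll = conditional_log_likelihood(pairs, cond_model)
```

This is the objective underlying discriminative training, for example of the CRFs used in segmentation and sequence labeling.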

However, it's important to evaluate the model's performance on the true objective over test data, as opposed to just using likelihood as the evaluation of how successful our learning algorithm was.

A third setting where one might want to use PGM learning is qualitatively quite different.

In this case, we might not care about using the model for any particular inference task; rather, we care about inferring the structure itself. That is, what we care about is knowledge discovery, or structure discovery, where our goal is to get as close as possible to the generating model M*. Using PGM learning for this task might help us distinguish between direct and indirect dependencies: if we see a correlation between X and Y in the data, we want to infer whether it corresponds to a direct probabilistic interaction between them, or to something that proceeds via a third variable C, for example. In some cases, when we're learning a Bayesian network, we might be able to infer the directionality of the edges and thereby get some intuition regarding causality. And in other cases, when we learn models with latent variables, the existence of those latent variables, their location, and often the way in which the values of the latent variables get assigned to different instances, gives us a lot of information about the structure of the domain.

In many cases, although not always, we solve this learning problem by training with the same likelihood-based objective. Now, we know that likelihood is not a particularly good surrogate for structural accuracy, but from a mathematical and algorithmic perspective it's a very convenient optimization objective, and therefore it's often used in practice, although there are also other ideas out there. However, it's important not to use likelihood, even likelihood on the test set, as the sole objective for evaluating model performance. In many cases, as we'll see in the context of some examples, the evaluation here needs to be done by comparing to whatever limited prior knowledge we have about the model M*. That is, we can hold out prior knowledge that was not given to the algorithm and see whether the algorithm was able to adequately reconstruct it.

Now, we talked earlier in this module about the fact that the training set likelihood tends to overfit the model. That, in fact, is a general observation: when you select the model M to optimize the training set likelihood, you tend to overfit badly to statistical noise, the random fluctuations that arise when we generate our training set. This happens in several different ways. It happens at the level of parameters, where the parameters fit random noise in the training data; that can be avoided by the use of regularization, or priors over the parameters, and we'll see how that gets done. It also happens when we overfit the structure. Specifically, one can show that if we optimize the training set likelihood, then more complex structures always win; that is, we would always prefer the most complicated structure that our model class allows. So if we're trying to fit structure, it's important to either bound the model complexity or penalize it, so that we don't learn models that are ridiculously complicated for no good reason.
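One standard instance of the "penalize model complexity" idea, which the lecture does not commit to by name, is the BIC score: the training log-likelihood minus a penalty that grows with the number of free parameters, so that extra edges must buy a real likelihood gain. A sketch with invented numbers:

```python
import math

def bic_score(train_ll, num_params, num_instances):
    """BIC-style penalized score: training log-likelihood minus a
    complexity penalty of (log M / 2) per free parameter."""
    return train_ll - 0.5 * num_params * math.log(num_instances)

# Invented numbers: the denser structure raises training likelihood
# only slightly, so the penalized score prefers the simpler one.
sparse = bic_score(train_ll=-120.0, num_params=5, num_instances=1000)
dense = bic_score(train_ll=-118.0, num_params=20, num_instances=1000)
```

Because the penalty per parameter grows with log M, the score lets structure get richer only as the data grows enough to support it.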

Now, all of the choices we just talked about are called hyperparameters. Hyperparameters include things like the parameter priors, or the regularization over the parameters and the strength of that regularization; if we're using complexity bounds or complexity penalties, those are hyperparameters too. All of these are things we need to pick before we can actually apply our learning algorithm. So how does that happen? Well, we need to figure out a way to select them, and it turns out that in many cases this decision makes a huge difference to the performance of our learning algorithm. So how do we pick these hyperparameters? One obvious choice is to pick them on the training set. A few seconds of thought ought to convince us that that is a terrible idea, because we just talked about the fact that on the training set the optimal thing to do is to have maximum complexity; so if we pick these hyperparameters on the training set, they're going to become totally vacuous.

Another obvious choice is to pick them on the test set. That turns out to be another terrible idea, because it makes our performance estimate overly optimistic: we will have picked these very important parameters so as to optimize performance on our test set. So the training set is bad, the test set is bad, and the correct strategy is to use what is called a validation set, a set that is separate from both our training set on the one hand and our test set on the other. A variant on this is to use what is called cross-validation on the training set, where we iteratively split the training set into a training component and a validation component and use that to pick hyperparameters. These are all concepts that you've seen before in the context of other learning algorithms, and they're equally important here.
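The cross-validation procedure can be sketched generically; here the "model" is a coin whose P(heads) is smoothed by a pseudocount alpha, with alpha playing the role of the regularization strength being selected. All names and numbers are illustrative, not from the lecture:

```python
import math

def cross_validate(data, hyperparams, fit, score, k=5):
    """k-fold cross-validation on the training set: repeatedly hold
    out one fold, fit on the rest, score on the held-out fold, and
    return the hyperparameter with the best average validation score."""
    def avg_score(hp):
        total = 0.0
        for i in range(k):
            val = data[i::k]  # every k-th instance forms the held-out fold
            train = [d for j, d in enumerate(data) if j % k != i]
            total += score(fit(train, hp), val)
        return total / k
    return max(hyperparams, key=avg_score)

# Toy model: estimate a coin's P(heads) with pseudocount alpha,
# and score by held-out log-likelihood.
def fit_coin(train, alpha):
    return (sum(train) + alpha) / (len(train) + 2.0 * alpha)

def heldout_ll(p, val):
    return sum(math.log(p if x else 1.0 - p) for x in val)

data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
best_alpha = cross_validate(data, [0.0, 1.0, 10.0], fit_coin, heldout_ll)
```

Note that the test set never appears anywhere in this procedure; it is touched only once, at the very end, to report performance.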

Finally, let's talk about why and when you might want to use PGM learning as opposed to a generic machine learning algorithm. PGM learning is particularly useful when we're trying to make predictions not over a single output variable, such as a binary outcome (the positive class or the negative class), but over structured objects: for example, labeling entire sequences, as in sequence labeling for speech recognition or natural language processing, or labeling entire graphs, as in image segmentation, where we have a grid of pixels and we're trying to label all the pixels simultaneously. This allows us to exploit correlations between the multiple predicted variables, often giving us significant improvements in performance. A second reason to use PGM learning is that it allows us to incorporate prior knowledge into our model in a way that many other algorithms have difficulty allowing. A third reason is when we're trying to learn a single model, a single PGM, for multiple different tasks: whereas a traditional learning algorithm learns a particular X-to-Y mapping, here you can learn a single graphical model and use it in multiple different ways, for answering different types of queries. And finally, the idea of using learning for knowledge discovery is also possible in the context of other learning algorithms, but it is particularly useful in the context of PGMs, because the form of the knowledge is often particularly intuitive.
