0:03

Let's start by discussing the problem of fitting a distribution p(x) to a dataset of points. Why? Well, we have already discussed this problem in week one, when we discussed how to fit a Gaussian to a dataset of points. We discussed it in week two, when we discussed the clustering problem and how to solve it by fitting a Gaussian mixture model to our data. And we also discussed probabilistic PCA, which is a kind of infinite mixture of Gaussians.

But now we want to return to this question, because it turns out that the methods we covered, like the Gaussian, the Gaussian mixture model, and probabilistic PCA, are not enough to capture complicated objects like natural images. So, you may want to fit a probability distribution to your dataset of natural images, for example, to generate new data. And if you try to do that with a Gaussian mixture model, it will work, but not as well as some more sophisticated models we will discuss this week.

So, in this example, we generated some fake celebrity faces by using a generative model. You can do these kinds of things if you have a probability distribution of your training data, because you can sample new images from this distribution. Also, if you have such a model, like p(x), you can build a kind of "Photoshop of the future" application, like here. With a few brush strokes, you can change a few pixels in your image, and the program will try to recolor everything else so the picture stays realistic. So, it will change the color of the hair, and so on.

One more reason to try to fit a distribution p(x) to complicated structured data like images is to detect anomalies. For example, you have a bank and a sequence of transactions. If you fit your probabilistic model to this sequence of transactions, then for a new transaction you can predict how probable it is according to your model and your current training dataset. And if this particular transaction is not very probable, then we may say that it's suspicious and ask a human to check it.

Also, for example, if you have security camera footage, you can train the model on footage from a normal day, and then, if something suspicious happens, you can detect it by seeing that some images from your cameras have a low probability p(x) according to your model. So, you can detect anomalies and suspicious behavior.

One more reason is that you may want to handle missing data. For example, you have some images with obscured parts, and you want to make predictions. In this case, having p(x), the probability of your data, will help you greatly.

And finally, sometimes people want to represent highly structured data with low-dimensional embeddings. This is not an inherent property of modeling data with p(x), but as we will see in the models we will cover, it comes naturally. The model will assign a latent code to any object it sees, and we can then use this latent code to explore the space of our objects quite nicely. For example, people sometimes build these kinds of latent codes for molecules and then try to discover new drugs by exploring the space of molecules in this latent space.

Okay, so let's say we're convinced: we want to model p(x) for natural images or other types of structured data. How can we do it? Probably the most natural approach is to use a convolutional neural network, because that's something that works really well for images. Let's say that our convolutional neural network will look at the image and return the probability of this image. It's the simplest possible parametric model that returns a probability for any image. And to make things more stable, let's say the CNN will actually return the logarithm of the probability.

The problem with this approach is that you have to normalize your distribution. You have to make it sum to one over all possible images in the world, and there are astronomically many of them. So, this normalization constant is very expensive to compute, and you have to compute it to do training or inference properly. This approach is infeasible: you can't do it because of normalization.
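
To see why the normalization constant is the bottleneck, here is a minimal sketch with a made-up scoring function standing in for the CNN. For a tiny 2x2 binary "image" we can still enumerate every possible image and compute the constant exactly; for realistic images the same sum has far too many terms.

```python
import itertools
import math

# Hypothetical unnormalized log-score for a tiny binary "image"
# (a stand-in for the CNN's output; any function of the pixels works).
def log_score(pixels):
    return 0.5 * sum(pixels)

h, w = 2, 2  # a 2x2 binary image: only 2**4 = 16 possible images
images = list(itertools.product([0, 1], repeat=h * w))

# The normalization constant is a sum over every possible image.
Z = sum(math.exp(log_score(x)) for x in images)
probs = [math.exp(log_score(x)) / Z for x in images]
print(abs(sum(probs) - 1.0) < 1e-9)  # normalized by construction

# A 100x100 grayscale image would need 256**10000 terms in Z,
# which is why computing the constant directly is hopeless.
```

The point is that the model itself is easy to write down; only the sum defining Z explodes with the image size.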

So, what else can you do? Well, you can use the chain rule. If you recall from week one, any probability distribution can be decomposed into a product of conditional distributions, and we can apply this to natural images, for example, like this.

So, we have an image; in this case, it's a three by three pixel image, but of course, in a practical situation you would use 100 by 100, or an even higher resolution. You can enumerate the pixels of this image somehow, for example in a row-by-row fashion, and then say that the distribution of the whole image is the same as the joint distribution of its pixels. And this joint distribution decomposes into a product of conditional distributions by the chain rule: the distribution of the whole image equals the marginal probability of the first pixel, times the probability of the second pixel given the first one, and so on.

Now you can try to build these kinds of conditional probability models to model your overall joint probability. And if your model for the conditional probabilities is flexible enough, you lose nothing, because any probability distribution can be represented in this way.
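
For the three by three image above, writing x_1, ..., x_9 for the pixels in row-by-row order, the chain rule decomposition reads:

```latex
p(x_1, \dots, x_9)
  = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_9 \mid x_1, \dots, x_8)
  = \prod_{k=1}^{9} p(x_k \mid x_1, \dots, x_{k-1})
```

This is an exact identity, so nothing is lost; the modeling question is only how to represent each conditional factor.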

A natural idea for representing these conditional probabilities is a recurrent neural network, which reads your image pixel by pixel and outputs a prediction for the next pixel, for example a prediction of its brightness. This approach makes modeling much easier, because now the normalization constant only has to deal with a one-dimensional distribution.

So if, for example, your image is grayscale, then each pixel can be encoded with a number from zero to 255, the brightness level, and the normalization constant can be computed by just summing over these 256 values. So it's easy.
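
As a sketch of why this per-pixel normalization is cheap, here is a softmax over 256 made-up logits (standing in for what an RNN might output for the next pixel); the sum has only 256 terms instead of one term per possible image.

```python
import math

# Hypothetical raw scores for the next pixel's brightness:
# one logit per gray level 0..255 (a real RNN would learn these).
logits = [0.01 * v for v in range(256)]

# Softmax: subtract the max for numerical stability, exponentiate, normalize.
m = max(logits)
exps = [math.exp(l - m) for l in logits]
Z = sum(exps)  # just 256 terms: trivially cheap
probs = [e / Z for e in exps]

print(abs(sum(probs) - 1.0) < 1e-9)  # the 256-way distribution sums to one
```

So each conditional factor is a small categorical distribution, and the intractable global normalization never appears.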

It's a really nice approach; check it out if you have time. But a downside is that you have to generate new images one pixel at a time. To generate a new image, you first sample x1 from its marginal distribution, then feed this freshly generated x1 into the RNN, which outputs the distribution of the next pixel, and so on. So, no matter how many computers you have, one high-resolution image can take minutes to generate, which is really long.
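
The sampling loop above might be sketched like this, with a hand-written stand-in for the RNN's conditional distribution (a real model would be learned). The point is only that each pixel must wait for all previous ones, so generation is strictly sequential.

```python
import random

random.seed(0)

def next_pixel_probs(previous_pixels):
    # Stand-in for the RNN: a distribution over 256 gray levels given
    # all previously generated pixels (here: peaked around their mean).
    mean = sum(previous_pixels) / len(previous_pixels) if previous_pixels else 128
    weights = [1.0 / (1.0 + abs(v - mean)) for v in range(256)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_image(num_pixels):
    pixels = []
    for _ in range(num_pixels):  # strictly sequential: no parallelism possible
        probs = next_pixel_probs(pixels)
        pixels.append(random.choices(range(256), weights=probs)[0])
    return pixels

image = sample_image(9)  # a tiny 3x3 image, generated one pixel at a time
print(len(image))  # 9
```

For a one-megapixel image the loop body runs a million times in order, which is where the minutes-per-image cost comes from.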

So, we may want to look at something else. One more thing you can do is to say that your distribution over pixels is factorized, so each pixel is independent of the others. In this case, you can easily fit this kind of distribution to your data, but it turns out to be too restrictive an assumption.

Even in the simple example of a dataset of handwritten digits, if you have around 10,000 of these small images and train this kind of factorized model on them, you get samples that look really bad, like these. That's because the assumption that each pixel is independent of the others really does not hold on real data. For example, if you saw one half of an image, you could probably restore the other half quite accurately, which means the pixels are not independent. So, this assumption is too restrictive.
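
A minimal sketch of the factorized model on toy binary "images" (made-up data, not real digits): each pixel gets its own independent Bernoulli probability, estimated as that pixel's mean over the dataset. Because the model ignores correlations, its samples break structure that every training image has.

```python
import random

random.seed(0)

# Toy dataset: each "image" is 4 pixels and is either all-black or all-white,
# so the pixels are perfectly correlated with each other.
data = [[0, 0, 0, 0]] * 50 + [[1, 1, 1, 1]] * 50

# Factorized model: independent Bernoulli per pixel, fit by the pixel means.
p = [sum(img[i] for img in data) / len(data) for i in range(4)]  # [0.5] * 4

# Sampling treats pixels independently, so it happily produces mixed images
# like [0, 1, 1, 0] that never occur in the data: the structure is lost.
sample = [int(random.random() < pi) for pi in p]
print(p, sample)
```

The fitted marginals are perfect (every pixel really is black half the time), yet the joint distribution is wrong, which is exactly why the digit samples look like noise.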

One more thing you can do is use a Gaussian mixture model. This model is really flexible in theory; it can represent any probability distribution. But in practice, for complicated data like natural images, it can be really inefficient. You may need thousands of Gaussian components, and in that case the overall method will fail to capture the structure, because it will be too hard to train.

One more thing we can try is an infinite mixture of Gaussians, like the probabilistic PCA method we covered in week two. Here the idea is that each object, each image x, has a corresponding latent variable t, and the image x is caused by this t, so we can marginalize t out. The conditional distribution of x given t is a Gaussian. So we have a mixture of infinitely many Gaussians: for each value of t there is one Gaussian, and we mix them with weights. Note that even if the Gaussians are factorized, so that they have independent components in each dimension, the mixture is not.
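
Written out, this latent variable model is:

```latex
p(x) = \int p(x \mid t)\, p(t)\, dt,
\qquad
p(x \mid t) = \mathcal{N}\!\left(x \mid \mu(t),\, \Sigma(t)\right)
```

For each value of t there is one Gaussian, and the prior p(t) plays the role of the mixture weights, so the integral is a mixture over a continuum of components.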

So, this is a slightly more powerful model than the Gaussian mixture model, and in the next videos we will discuss how to make it even more powerful by using neural networks inside it.