
In the previous video we sketched an algorithm to fit a Gaussian Mixture Model to a data set. Now let's see how it can look on a particular set of data.

So assume this data again, this one-dimensional data set of points. Let's try to find two Gaussians which explain this data well.

Let's start with initializing. Initialization here means choosing two random Gaussians and placing them somewhere. So let's say we choose one Gaussian centered around 175 and the other around 180, and the variances are, I don't know, around 5. Something like that; it's a random initialization.

On the next step we have to estimate the sources: the posterior probabilities of the latent variables, that is, from which Gaussian each data point came. Let's do it here. You can see that, for example, the rightmost data points are certainly orange, because the orange Gaussian explains them much better than the blue one, right? It makes sense.

The data points in between our randomly initialized Gaussians are kind of not settled: they are half blue, half orange. We're not sure about their exact assignment to these Gaussians.

The interesting story is about the leftmost point. On the one hand, both Gaussians think that this data point is kind of weird, because it has really low density according to both of them; neither really expects it to appear here. But on the other hand, the blue Gaussian's density is so much higher that this data point is almost completely blue.

You can see here that the blue Gaussian's density is around 0.01 at this point, which is low, almost zero. But the orange Gaussian's density is around 0.00002, which is much lower. So the posterior probability of the blue source is around 1, which means that this data point is almost certainly blue. Although neither Gaussian thinks this data point should really exist, the blue Gaussian explains it much better, just because it's closer.
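This posterior can be written down in a couple of lines. A minimal sketch, using the two densities quoted above and assuming equal mixture weights of 0.5 (the lecture doesn't state them):

```python
def responsibility(density_blue, density_orange, w_blue=0.5, w_orange=0.5):
    """Posterior probability that a point came from the blue Gaussian."""
    num = w_blue * density_blue
    return num / (num + w_orange * density_orange)

# Densities quoted above for the leftmost point: both are tiny, but the
# blue one is 500 times larger, so the posterior is almost 1.
p_blue = responsibility(0.01, 0.00002)
print(round(p_blue, 3))  # 0.998
```

Only the ratio of the two densities matters here, which is why two "almost zero" numbers still give a near-certain assignment.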

Okay, so that is the end of the first substep of the first iteration. We have this kind of soft assignment: for each data point we know its latent variable, or at least a distribution over it.

So now let's refit our Gaussians, let's re-estimate the parameters. It will look something like this: we fit one Gaussian to all the blue points, another Gaussian to all the orange points, and we get some two Gaussians. Note that the blue Gaussian has now shifted considerably to the left, just because it was assigned lots of points on the left-hand side of the plot. It has to explain them, so it shifts to sit around these blue data points.
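This refitting step can be sketched for one component. It assumes we already have responsibilities r[i] (how "blue" each point is) from the previous step; the data values are made up for illustration:

```python
def m_step_1d(xs, r):
    """Weighted mean/variance/weight update for one 1-D Gaussian component."""
    total = sum(r)
    mean = sum(ri * xi for ri, xi in zip(r, xs)) / total
    var = sum(ri * (xi - mean) ** 2 for ri, xi in zip(r, xs)) / total
    weight = total / len(xs)  # mixture weight of this component
    return mean, var, weight

xs = [160.0, 162.0, 175.0, 181.0]
r = [1.0, 1.0, 0.5, 0.0]  # mostly-blue points on the left pull the mean left
mean, var, weight = m_step_1d(xs, r)
```

Note how the half-blue point at 175 contributes to the mean with half the influence of the fully blue points, which is exactly the "shifted to the left" effect described above.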

So this is the end of iteration one, and it's already a reasonable solution. We have not converged yet, but it is already an approximation of the solution. From this, I think, we can already get some reasonable clustering: we can call all the orange points one cluster and all the blue points the other cluster, and about the points in between we can say we're not certain; we can't say anything definite about these half-blue, half-orange data points. But anyway, let's proceed a few iterations more.

So on the next iteration we re-assign: we recompute the assignments of the latent variables, right? So the colors of the data points will change. Look carefully here: they change just a little bit. You can see that some of the data points which were half blue, half orange shifted to be just a little bit blue and much more orange. It's because our blue Gaussian is now further to the left, so it explains the data points around height 175 a little bit worse, and these data points became more orange than blue. But overall the situation didn't change much, so we're really close to convergence.

On the next substep of the second iteration we re-estimate the parameters, and the Gaussians become a little bit more certain about their locations: their variances shrink.

And finally, one more iteration. Look carefully at the colors of the data points: they shift just a little bit more. And again we re-estimate the Gaussians, and we converge to this kind of solution of our original Gaussian mixture model fitting problem.

So this was the expectation maximisation algorithm for a Gaussian Mixture Model on this one-dimensional data.
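The whole loop we just walked through can be sketched compactly. The data, the starting parameters, and the iteration count below are illustrative assumptions, not the demo's exact numbers:

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, mus, variances, weights, n_iter=50):
    for _ in range(n_iter):
        # E-step: responsibility of component 0 ("blue") for each point.
        r = []
        for x in xs:
            p0 = weights[0] * normal_pdf(x, mus[0], variances[0])
            p1 = weights[1] * normal_pdf(x, mus[1], variances[1])
            r.append(p0 / (p0 + p1))
        # M-step: weighted re-estimation of both components.
        for k, rk in enumerate((r, [1 - ri for ri in r])):
            total = sum(rk)
            mus[k] = sum(ri * xi for ri, xi in zip(rk, xs)) / total
            variances[k] = sum(ri * (xi - mus[k]) ** 2
                               for ri, xi in zip(rk, xs)) / total
            weights[k] = total / len(xs)
    return mus, variances, weights

# Two well-separated clumps of "heights"; EM should find their centers
# even though both Gaussians start near the right clump.
xs = [158, 159, 160, 161, 162, 178, 179, 180, 181, 182]
mus, variances, weights = em_gmm_1d(xs, [175.0, 180.0], [5.0, 5.0], [0.5, 0.5])
```

A fixed iteration count stands in for a proper convergence check (e.g. stopping when the log-likelihood stops improving).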

Note, however, that this algorithm doesn't give you an optimal solution: it doesn't give you the global maximum of your likelihood. Actually, it turns out that the problem of finding the global maximum of this likelihood is NP-hard, so you can't expect anyone to give you an optimal solution in reasonable time. But, well, it gives you something reasonable, right? Sometimes, though, this expectation maximisation suffers from local maxima a lot.

We can see an example here. Let's initialize our Gaussians somewhere else, not as we initialized them in the first demo. Say we initialize both Gaussians on the left-hand side of the plot. On the first substep of the first iteration we estimate the sources, the distribution of the latent variables, and it looks like this: all the points to the right of height 165 will be certainly orange, because the orange Gaussian explains them better, and the others will be certainly blue. It's not a very good assignment, but it's not a solution yet, right? It is just the first substep of the first iteration; it shouldn't hurt us too much. But it turns out that because of this mistake, because we assigned so many points to be orange, on the next substep we have to estimate the variance of the orange Gaussian to be high, like this, just because it has so many points on both the right-hand side and the left-hand side.

And now we have a problem. At the start of iteration two, look at the leftmost point: it will become a little bit orange, just because now it is explained by the orange Gaussian about as well as by the blue one, since the orange Gaussian is now high in this region. This is a disaster, because when we refit the Gaussians, the orange Gaussian will see that it not only has lots of orange points on the left and on the right, but also a piece of this leftmost point. Not the full point, just a little chunk of it, but it still contributes to the variance estimate, so the variance estimate of the orange Gaussian will stay really high. The orange Gaussian thinks it owns all the points on this plot, which keeps its variance estimate high, while the blue Gaussian's variance estimate shrinks a little because it becomes more certain in its estimate, and we're stuck in this kind of situation.

So we can do a few more iterations, and we converge to this local maximum, which is certainly not optimal. The bottom line here is that expectation maximisation suffers from local maxima. As usual in optimization, the way to deal with it is to start several times from different random initializations and choose the best run. The nice thing here is that you can track the log-likelihood, so you can find the best run according to the training log-likelihood. So at least you can compare different runs and choose the best one.
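Comparing runs by training log-likelihood might look like this. The two candidate parameter sets are made up to stand in for two restarts: one that found both clumps, and one stuck in a wide local maximum like the one in the demo:

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, params):
    """Sum over points of the log of the mixture density.

    params is a list of (weight, mean, variance) triples.
    """
    ll = 0.0
    for x in xs:
        ll += math.log(sum(w * normal_pdf(x, mu, var)
                           for w, mu, var in params))
    return ll

xs = [158, 159, 160, 161, 162, 178, 179, 180, 181, 182]
runs = [
    [(0.5, 160.0, 2.0), (0.5, 180.0, 2.0)],      # restart that found both clumps
    [(0.5, 170.0, 120.0), (0.5, 171.0, 110.0)],  # restart stuck in a wide fit
]
best = max(runs, key=lambda p: log_likelihood(xs, p))
```

Since both runs are scored on the same data, the run with the higher training log-likelihood is the one to keep.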

Summary. A Gaussian Mixture Model is a flexible probabilistic model which you can fit to your data. It allows you to solve the clustering problem, and it also gives you a probabilistic model of your data, so you can, for example, sample new data points from this model.
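Sampling from a fitted mixture is a two-stage draw: first pick a component according to its weight, then sample from that Gaussian. A sketch with made-up parameters (means 160 and 180, standard deviation 1.5, fixed seed for reproducibility):

```python
import random

def sample_gmm(n, weights, mus, sigmas, rng=random.Random(0)):
    """Draw n points from a two-component 1-D Gaussian mixture."""
    points = []
    for _ in range(n):
        k = 0 if rng.random() < weights[0] else 1    # choose the source
        points.append(rng.gauss(mus[k], sigmas[k]))  # then sample from it
    return points

samples = sample_gmm(1000, [0.5, 0.5], [160.0, 180.0], [1.5, 1.5])
```

With these parameters the samples form two clumps, roughly half around 160 and half around 180, mirroring the data in the demo.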

The expectation maximisation algorithm is a way to train this Gaussian Mixture Model, and other latent variable models, as we will see in the next few videos; but for now it's just for the Gaussian Mixture Model. Sometimes it can be faster than Stochastic Gradient Descent, and it also helps you handle complicated constraints, like keeping the covariance matrix positive semi-definite on each iteration. And finally, expectation maximisation suffers from local maxima. But, you know, that's kind of expected, because the overall problem is NP-hard, and Stochastic Gradient Descent suffers from it too. Finding the optimal solution in reasonable time is just not possible.
