0:03

In the previous video, we decided to use a Gaussian mixture model to fit our dataset and to solve the clustering problem. But how? How can we do better than general stochastic gradient descent? In this video, we will discuss some intuition that leads to the expectation maximization algorithm for this particular case.

Recall that the density of each data point in our Gaussian mixture model is a weighted sum of three (or, in general, as many as you want) Gaussian densities.
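As a concrete sketch of this weighted sum, here is the mixture density in one dimension with three components; the weights and parameters below are made up purely for illustration:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma^2) at point x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gmm_density(x, weights, mus, sigmas):
    """Mixture density: p(x) = sum_c pi_c * N(x; mu_c, sigma_c^2)."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

# Three illustrative components; the weights pi must sum to 1.
p = gmm_density(0.5, weights=[0.3, 0.5, 0.2], mus=[-1.0, 0.0, 2.0], sigmas=[1.0, 0.5, 1.5])
```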

To start with, we will need to introduce a latent variable here, because it will make reasoning about our model much easier. How can we introduce a latent variable here? Well, we can do something like this. We can say that each data point was generated by using some information from a latent variable T. It's like we have one latent variable T for each data point X, and it causes X: it explains something about X. The reasonable thing to assume about T here is that it takes three values, 1, 2 or 3, and it shows us from which Gaussian this particular data point came.

We actually don't know, for any data point, from which Gaussian it came. It's a latent variable, right? We never observe it, neither in training nor in testing, but it can be helpful: if we know the latent variables, they help us understand something about our model.
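The generative story above (first draw the latent cluster T from the weights, then draw X from the corresponding Gaussian) can be sketched like this; the mixture parameters are again made up for illustration:

```python
import math
import random

random.seed(0)

# Illustrative mixture parameters: prior pi over t, and one Gaussian per value of t.
weights = [0.3, 0.5, 0.2]
mus     = [-1.0, 0.0, 2.0]
sigmas  = [1.0, 0.5, 1.5]

def sample_point():
    """Generative story: draw latent t ~ pi, then x ~ N(mu_t, sigma_t^2)."""
    t = random.choices([1, 2, 3], weights=weights)[0]  # latent cluster index (unobserved)
    x = random.gauss(mus[t - 1], sigmas[t - 1])        # observed data point
    return t, x

dataset = [sample_point() for _ in range(5)]
```

In real data we would only ever see the `x` values; the `t` values are exactly the latent variables the video is talking about.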

Later, when we fit the Gaussian mixture model to our data, we may find the distribution of the latent variable given the data. We may ask the model a question: what do you think is the value of the latent variable T for this particular data point? The answer to this question will basically be the clustering, right? It gives us the beliefs about which Gaussian each data point came from.

If we decide to use this kind of latent variable model, then it is reasonable to assume that the latent variable T has a prior distribution pi, which is exactly the weights of our Gaussians: the latent variable T equals some cluster number, for example 1, with probability pi one. And the likelihood, the density of the data point X given that we know which Gaussian it came from, is just a Gaussian distribution, because it came from that Gaussian. For each data point, if we know that it came from Gaussian number c, its density is just the Gaussian density with the parameters of Gaussian number c. Having introduced this kind of latent variable model, we may now look at what P of X given theta represents in it. We changed our model, and we now have to check that it still gives the same general results as our original model.

If we write down P of X given theta, given the parameters, in this latent variable model, we get the following. It's just the sum rule for probabilities, and it says that P of X given theta equals a sum with respect to T. We are marginalizing T out, summing over all possible values of T, from 1 to 3, the joint distribution P of X and T, which equals the likelihood times the prior. If you substitute the definitions of the prior and the likelihood into this summation, you can see that this last formula on the slide is exactly equal to the first one. We introduced the latent variable in such a way that when we marginalize it out, we get exactly what we had before.
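In symbols, the marginalization described here is:

```latex
p(x \mid \theta)
  = \sum_{c=1}^{3} p(x \mid t = c, \theta)\, p(t = c \mid \theta)
  = \sum_{c=1}^{3} \pi_c \, \mathcal{N}(x \mid \mu_c, \sigma_c^2),
```

which is exactly the weighted sum of Gaussian densities we started from.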

This means that we can now use this latent variable model, train it with the observed data X, and it will give us exactly the same results as we would get with the original model.

Let's now try to build some intuition about how to train this latent variable model. Say we have a dataset, and say it's one-dimensional: each data point, from 1 to N, is just one number, for simplicity of illustration. Recall that our goal is maximum likelihood estimation, that is, finding the parameters, right? How can we find the parameters?

Well, it turns out that if we know the sources, the values of the latent variables for each data point, then finding the parameters theta is easy. We can say that all the blue data points here are the points that we know came from the first Gaussian, look at them separately from the orange data points, and just fit one Gaussian to them, which we already know how to do, because it's just fitting a Gaussian distribution to the dataset of blue points. You can see the formulas at the bottom of the slide, but it's something we have already done in week one. This means that if we know the sources, the values of the latent variables, it's easy to estimate the parameters theta.
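With known (hard) sources, fitting each Gaussian is just the maximum likelihood estimate of a mean and variance per cluster. A minimal sketch, on a toy two-cluster dataset made up for illustration:

```python
# Toy 1-D dataset with known sources: t = 1 -> "blue" Gaussian, t = 2 -> "orange" Gaussian.
data = [(-1.2, 1), (-0.8, 1), (-1.0, 1), (2.1, 2), (1.9, 2), (2.0, 2)]

def fit_gaussian(points):
    """Maximum likelihood estimate of mean and variance for a single Gaussian."""
    n = len(points)
    mu = sum(points) / n
    var = sum((x - mu) ** 2 for x in points) / n
    return mu, var

# Look at each cluster separately and fit one Gaussian to it.
blue   = [x for x, t in data if t == 1]
orange = [x for x, t in data if t == 2]
mu1, var1 = fit_gaussian(blue)
mu2, var2 = fit_gaussian(orange)
```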

Actually, even if we don't have these hard assignments, but rather soft assignments, some posterior distribution on T, meaning that for each data point we don't assume that it belongs to just one cluster, but rather that it belongs to all clusters, all Gaussians, simultaneously, with probabilities given by the posterior P of T given X and the parameters, it is also easy to estimate the parameters.

So instead of just summing over all the blue points and averaging their positions to get the location of the blue Gaussian, we have to sum over all the points, but with weights. If the posterior P of T = 1 is 1 for some data point, it means that this data point is completely blue: it certainly belongs to the blue Gaussian, and it will be used as just a blue data point in the averaging, with weight 1. If for some data point P of T = 1 is 0, it means that this data point is certainly orange, and we simply don't use it at all when computing the blue Gaussian's mean. But if for some data point P of T = 1 is, for example, 0.8, it means that this data point is not certain: it thinks it belongs to the blue Gaussian, but it's not sure. It will strongly affect the position of the blue Gaussian and slightly affect the position of the orange Gaussian. We will derive these formulas later from stricter considerations, but for now, just believe me that if you know the posterior probability on T, you can easily estimate the parameters this way.
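The weighted averaging just described can be sketched in a couple of lines; the data points and posterior weights below are made up for illustration:

```python
# Soft-assignment estimate of the blue Gaussian's mean: each point contributes
# with weight q_i = p(t_i = 1 | x_i, theta), its posterior probability of being blue.
xs = [-1.2, -0.8, 2.0]   # toy data points
qs = [1.0, 0.8, 0.0]     # posterior probability of the blue cluster for each point

# Weighted average: certainly-blue points count fully, certainly-orange ones not at all.
mu_blue = sum(q * x for q, x in zip(qs, xs)) / sum(qs)
```

The variance is estimated analogously, with the same weights applied to the squared deviations from `mu_blue`.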

The bottom line here is that if we know the sources, no matter whether as soft assignments or hard assignments, then we can easily estimate the parameters of our Gaussian mixture model. But in practice, we don't, right? We don't know the sources, so how can we estimate them?

Well, it turns out that if we know the parameters, so the Gaussians, their locations and their variances, then we can easily estimate the sources, because we can just use Bayes' rule. By Bayes' rule, the soft assignment, the posterior probability that T equals, for example, the blue Gaussian for some particular data point, is just proportional to the joint probability, which is the likelihood times the prior. If we know the parameters theta, we can easily compute this likelihood and prior at any given point, so we can easily compute this posterior probability, which is basically the sources, the soft assignments of each data point to the clusters. You might think that the normalization constant here could be problematic, but it turns out that we have just two values here, two probabilities: the distribution P of T given the data and theta can take just two possible values. We can explicitly normalize this thing by summing the two unnormalized probabilities. It's no big deal.
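This Bayes' rule computation, with explicit normalization over the two values of T, might look like this; the two Gaussians' parameters and priors are made up for illustration:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Illustrative parameters for two Gaussians ("blue" and "orange") and their priors.
pis    = [0.5, 0.5]
mus    = [-1.0, 2.0]
sigmas = [1.0, 1.0]

x = 0.0
# Unnormalized posterior for each value of t: likelihood times prior.
unnorm = [pis[c] * gaussian_pdf(x, mus[c], sigmas[c]) for c in range(2)]
# Explicit normalization: just divide by the sum of the two unnormalized values.
z = sum(unnorm)
posterior = [u / z for u in unnorm]
```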

To summarize, we have kind of a chicken-and-egg problem. If we know the Gaussian parameters, we can easily estimate the sources. If we know the sources, we can easily estimate the Gaussian parameters. What can we do in this case? Well, the expectation maximization algorithm, in this particular case, suggests a very natural thing.

Let's, first of all, initialize the parameters somehow, randomly, and then, in a loop, repeat two steps until convergence. First, let's fix the parameters, assume that they are the true ones, and estimate the sources. In the next sub-step, let's use the sources to re-estimate the parameters, to update our beliefs about them. This way, after repeating these two steps for long enough, we will hopefully obtain a reasonable solution to our probability-distribution-fitting problem: we will be able to fit the Gaussian mixture model to our data.

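The whole loop can be sketched end to end for a two-component 1-D mixture. This is a minimal illustration, not production code: initialization is deterministic (spread to the data's extremes) rather than random, for reproducibility, and a small floor guards the variance against collapse:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_gmm_1d(data, n_iter=30):
    """EM for a two-component 1-D Gaussian mixture (an illustrative sketch)."""
    # Initialize the parameters somehow (here: deterministically, for reproducibility).
    mus, sigmas, pis = [min(data), max(data)], [1.0, 1.0], [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: fix the parameters and estimate the sources (soft assignments).
        q = []
        for x in data:
            unnorm = [pis[c] * gaussian_pdf(x, mus[c], sigmas[c]) for c in range(2)]
            z = sum(unnorm)
            q.append([u / z for u in unnorm])
        # M-step: fix the sources and re-estimate the parameters as weighted averages.
        for c in range(2):
            n_c = sum(qi[c] for qi in q)
            mus[c] = sum(qi[c] * x for qi, x in zip(q, data)) / n_c
            var = sum(qi[c] * (x - mus[c]) ** 2 for qi, x in zip(q, data)) / n_c
            sigmas[c] = max(math.sqrt(var), 1e-6)  # guard against variance collapse
            pis[c] = n_c / len(data)
    return pis, mus, sigmas

data = [-1.1, -0.9, -1.0, 1.9, 2.1, 2.0]
pis, mus, sigmas = em_gmm_1d(data)
```

On this toy dataset the two components settle near the two visible groups of points, which is exactly the chicken-and-egg loop resolving itself by alternation.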