0:03

Now let's look at the popular algorithm for hard clustering called K-means and see if we can connect it to the general form of expectation maximization. So, it would be really cool to derive that it is a special case of EM and to fit it into the general framework.

So, hard clustering. We have a data set of points, and we want to assign each data point to one of the clusters. Now we can't say that a point belongs to several clusters simultaneously: each data point should be assigned to one and just one cluster, okay?

The K-means algorithm suggests solving this hard clustering problem in the following way. First of all, let's initialize the parameters randomly. And here the parameters are just the means, so the locations of the clusters. We don't have the weights pi and we don't have shapes Sigma for the clusters, just locations. And then, in iterations, we repeat two sub-steps until convergence. The first sub-step is, for each data point, to find the closest cluster and assign the data point to that cluster. So each data point is assigned to the cluster closest to it, and by closest, I mean according to Euclidean distance.

And on the next sub-step, K-means suggests updating the parameters. For example, to update the first cluster centroid Mu_1, we have to find all the data points which were assigned to the first cluster on the first sub-step and then compute their average, which will be the updated center of this cluster. And then we repeat these two sub-steps until convergence.
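
The two sub-steps just described can be sketched in numpy as follows. This is a minimal illustrative implementation, not from the lecture itself; the function name and signature are my own:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means sketch: alternate hard assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct data points as initial centroids.
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Sub-step 1: assign each point to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Sub-step 2: move each centroid to the mean of the points assigned to it.
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):  # converged: centroids stopped moving
            break
        mu = new_mu
    return mu, c
```

On well-separated data this converges in a few iterations; like EM, the result depends on the random initialization.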

So if you look carefully at this algorithm, it looks really similar to EM, right? We also have random initialization, and then in iterations we repeat two steps which also look really close to the ones we had in EM. First of all, we compute some property for each data point, and then we update the parameters by using these properties.

Let's see if we can somehow modify our Gaussian Mixture Model and the EM algorithm applied to it so as to obtain exactly this K-means algorithm, and thus prove that K-means is just a special case of the EM algorithm applied to the Gaussian Mixture Model, or maybe to some other model.

So first of all, we have to say that we don't have those additional parameters, right? Let's say that the covariance matrices, the shapes, are all identity matrices, which means that the shape of each Gaussian is just a circle with a fixed radius, and that the prior weights pi are all uniform too, so they equal one divided by the number of clusters. This way we will have only Mu as the parameter of our Gaussian Mixture Model, or rather of this restricted Gaussian Mixture Model, and then the density of each data point, given that we know its cluster, looks like this.

So, it's some normalization constant times the exponent of minus 0.5 times the squared Euclidean distance between x_i and Mu_c, where Mu_c is the center of Gaussian number c. You can prove that this is exactly the formula for the multidimensional, multivariate Gaussian if the covariance matrix Sigma equals the identity matrix.
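
As a quick numerical check (my own sketch, not from the lecture), the density with identity covariance really does reduce to a normalizer times exp(-0.5 * squared distance):

```python
import numpy as np

def restricted_gmm_density(x, mu):
    """p(x | t=c, theta) for a Gaussian with identity covariance:
    (2*pi)^(-d/2) * exp(-0.5 * ||x - mu_c||^2)."""
    d = len(x)
    Z = (2 * np.pi) ** (-d / 2)  # normalization constant for Sigma = I
    return Z * np.exp(-0.5 * np.sum((x - mu) ** 2))
```

This matches the general multivariate Gaussian density (2*pi)^(-d/2) * |Sigma|^(-1/2) * exp(-0.5 * (x-mu)^T Sigma^(-1) (x-mu)) when Sigma is the identity.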

Now, let's look at the E and M steps for this restricted Gaussian Mixture Model. On the E-step, the expectation maximization algorithm suggests finding the minimum of the KL divergence between the variational distribution q and the posterior distribution p of t given the data and parameters. And if we do that, we will find not hard assignments but soft assignments, right? As we obtained in the Gaussian Mixture Model. So for each data point, this q will say with which probability it belongs to one cluster or another, one Gaussian or another. And this is not what we want: we want to derive K-means from this model, so we want this step to produce hard clustering.

How can we do it? Well, let's find not the best overall q, but the best q among some family of simple q's. So let's look for q among all delta functions, where a delta function means a probability distribution that takes one value with probability one and all the other values with probability zero. It's completely certain about what the outcome will be.

So, let's imagine that the posterior distribution looks like this: our data point belongs to cluster number two with probability 0.6 and to cluster number one with probability 0.4. What do you think the delta function approximation will be? What will be the closest q to it, according to KL divergence, among the family of delta functions? Well, it will be a distribution like this: it puts all the probability mass on the class that had the highest probability according to our posterior.
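
For a delta function q that puts all its mass on class c, the KL divergence KL(q || p) collapses to -log p(c), so it is minimized by the class with the highest posterior probability. A quick numpy check with the 0.4/0.6 example above (the variable names are mine):

```python
import numpy as np

posterior = np.array([0.4, 0.6])  # p(t=1), p(t=2) from the example

# For q = delta at class c: KL(q || p) = sum_t q(t) * log(q(t) / p(t)) = -log p(c).
kl_per_delta = -np.log(posterior)

best_c = kl_per_delta.argmin()  # the delta function closest to p in KL
```

The closest delta is indeed the one at the argmax of the posterior, cluster number two.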

So by restricting the set of possible q's on the E-step, we are actually approximating the actual posterior distribution with delta functions. And this way the optimal q on the E-step will look like this, right? It's some delta function: it has probability one when t_i equals some predefined value c_i, and zero otherwise. And what is c_i, this predefined value? Well, it's the most probable configuration according to our posterior distribution. So, we have to find the c_i which maximizes the posterior probability of the latent variable t_i given the data and the parameters theta.

Recall that the posterior distribution itself is proportional to the full joint distribution, which is p of x_i and t_i given the parameters, and we also have some normalization constant, but we don't care about it, because we want to maximize this thing with respect to t_i, or with respect to c, the value of t_i, and the normalization doesn't change anything.

And if you recall, this joint equals the normalization times p of x_i given t_i, which is the exponent of minus 0.5 times the squared Euclidean distance, times the prior, and we agreed that the prior is just uniform. So it doesn't depend on c, and we also don't care about it when we maximize. So finally, we want to maximize this expression with respect to the value of t_i, and we have the normalization one divided by Z and we have pi of c, both of which actually don't depend on c, so we can throw them away and it will not change anything. And maximizing the exponent of minus something is the same as minimizing that something, so we can just minimize 0.5 times the squared Euclidean distance.
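
To summarize the chain of simplifications in the notation above:

c_i = argmax_c p(t_i = c | x_i, theta)
    = argmax_c (1/Z) * pi_c * exp(-0.5 * ||x_i - mu_c||^2)
    = argmax_c exp(-0.5 * ||x_i - mu_c||^2)
    = argmin_c ||x_i - mu_c||^2

where 1/Z and pi_c = 1/K were dropped because they don't depend on c.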

And finally, we have an expression like this: the optimal q from the E-step is just a delta function which takes the value of the arg minimum of the Euclidean distance, so the closest cluster center, with probability one, and has probability zero to be anything else.

Okay? So, this is exactly what we have in K-means, which means that if you apply EM to this restricted Gaussian Mixture Model, and if you restrict the possible values of the variational distribution q on the E-step to be delta functions, you will get exactly the same first sub-step of the iteration.
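
As a sanity check (a sketch of my own, with illustrative values), for the restricted model the argmax of the GMM posterior responsibilities coincides with the argmin of the Euclidean distances, i.e. with the K-means assignment:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                           # data points
mu = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])  # cluster centers

# Squared Euclidean distance from each point to each center.
sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)

# Posterior responsibilities of the restricted GMM (uniform priors,
# identity covariances); the normalization cancels under argmax.
unnorm = np.exp(-0.5 * sq_dists)
responsibilities = unnorm / unnorm.sum(axis=1, keepdims=True)

em_assign = responsibilities.argmax(axis=1)   # best delta approximation
kmeans_assign = sq_dists.argmin(axis=1)       # K-means first sub-step
```

Since exp(-0.5 * d) is monotonically decreasing in d, the two assignments agree for every data point.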

So now, let's derive the formula for the M-step on the blackboard and see how the formula for the second sub-step of K-means is connected to the M-step of expectation maximization applied to this restricted Gaussian Mixture Model, with this particular choice of q, which is a delta function.
