On the M-step of
the Expectation Maximization algorithm for the restricted Gaussian mixture model,
we have to maximize the following expression.
So we have to find the maximum of the sum, over the individual objects in our data set,
the training objects, of the expected values
with respect to the variational distributions q of t_i of
the joint log-likelihood of x_i and t_i given the parameters,
and here the parameters are just mu.
So given mu, we maximize this expression with respect to mu,
that is with respect to mu_1 up to mu_C, where C is the number of clusters.
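In symbols, the objective described here looks roughly like this (a sketch, with N for the number of training objects and C for the number of clusters; the notation is assumed here rather than quoted from the slides):

\max_{\mu_1, \dots, \mu_C} \; \sum_{i=1}^{N} \mathbb{E}_{q(t_i)} \big[ \log p(x_i, t_i \mid \mu_1, \dots, \mu_C) \big]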
We can maximize this expression explicitly by computing the gradient and setting it to zero,
but recall that in the previous video
we maximized the same expression for the Gaussian mixture model,
so we can reuse the results from that video here.
Recall that the optimal parameter mu_c, for example, was the following one.
It was the average over all the data points in our training data set,
where each point x_i is weighted by the variational distribution q of t_i equals c,
that is, the weight of the point x_i according to the Gaussian number c,
divided by the normalization,
which is the sum of all the weights q of t_i equals c.
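Written out (a sketch of the formula recalled here), that update from the previous video is:

\mu_c = \frac{\sum_{i=1}^{N} q(t_i = c)\, x_i}{\sum_{i=1}^{N} q(t_i = c)}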
Recall that in our particular case, on the E-step,
we are using delta functions instead of the full variational distribution q.
Our q looks like this:
q of t_i equals c is one if c equals c_i-star,
which we have already pre-computed on the E-step,
and zero otherwise.
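That is, the restricted variational distribution is:

q(t_i = c) = \begin{cases} 1, & \text{if } c = c_i^{*} \\ 0, & \text{otherwise} \end{cases}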
We can continue and simplify this expression
by using this definition of q of t_i in this particular case.
This thing will again be just a sum over all the objects,
but all the objects whose c_i-star is not equal to c
will not contribute to the sum, right?
So any object which didn't come from the Gaussian number c
will have the weight q being zero,
both in the numerator and in the denominator,
and so it will not contribute to this sum.
Thus, we can write it down like this.
We can say that this thing will be equal to the sum over
all objects from the Gaussian number c, that is all i such
that c_i-star equals c, of these data points,
divided by the normalization, which is the number of such data points,
the number of i such that c_i-star equals c,
and this is exactly the formula from the second step of the K-means algorithm, the center update.
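Written out, the simplified update is just the K-means center update:

\mu_c = \frac{1}{\lvert \{ i : c_i^{*} = c \} \rvert} \sum_{i :\, c_i^{*} = c} x_i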
So we have just proven the following.
Take the M-step of the expectation maximization algorithm applied to
this restricted Gaussian mixture model, the one where we drop some of
the parameters and keep just the locations mu of the Gaussians.
If we use the restricted q,
the delta function from the restricted E-step,
then on the M-step we obtain this kind of formula:
we have to update each mu_c to be the center of all the data points
assigned to the cluster number c, exactly as we had in K-means.
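As a minimal sketch of this restricted M-step in code (assuming NumPy, a data matrix X of shape (N, d), and hard assignments c_star from the E-step; the function name is just for illustration):

```python
import numpy as np

def restricted_m_step(X, c_star, n_clusters):
    # With delta-function q, each center mu_c is simply the mean of the
    # data points currently assigned to cluster c (this sketch assumes
    # every cluster has at least one assigned point).
    return np.stack([X[c_star == c].mean(axis=0) for c in range(n_clusters)])
```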
So to summarize, K-means is actually
a special case of applying
the expectation maximization algorithm to a Gaussian mixture model,
but first of all,
this Gaussian mixture model has to be a restricted one, where we set
the covariance matrices to be identical and the prior probability of each Gaussian c to be uniform.
We also have to use
a simplified E-step: instead of computing
the full posterior distributions on the E-step,
we approximate them with delta functions.
This makes the E-step and the M-step faster, but it also restricts
the flexibility of the model and the efficiency of the EM algorithm.
But anyway, this kind of restriction still gives you
a valid EM algorithm, which is still guaranteed to converge and has nice properties.
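To make the equivalence concrete, here is a rough sketch (not code from the lecture) of this restricted EM loop; with identical covariances and uniform priors the E-step reduces to picking the nearest center, and together with the M-step above this is exactly K-means:

```python
import numpy as np

def restricted_em(X, n_clusters, n_iter=100, seed=0):
    # Restricted EM for the simplified Gaussian mixture model:
    # identical spherical covariances, uniform priors, delta-function q.
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Restricted E-step: hard assignment of each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=-1)
        c_star = dists.argmin(axis=1)
        # Restricted M-step: each center becomes the mean of its assigned points
        # (keep the old center if a cluster happens to be empty).
        mu = np.stack([
            X[c_star == c].mean(axis=0) if np.any(c_star == c) else mu[c]
            for c in range(n_clusters)
        ])
    return mu, c_star
```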
It's really cool: we have just taken a usual machine learning algorithm for clustering,
which is not probabilistic in any sense,
and shown that it is
a special case of the EM algorithm applied to some restricted Gaussian mixture model.
Using this knowledge, we can, for example,
change something in K-means, and we
know how to adapt the training algorithm to these changes.
For example, we can introduce missing data.
This way, we can build a missing-data K-means
without heuristics like setting the missing values to zero,
but by just extending
this restricted EM algorithm and handling the missing data naturally.
Actually, this kind of trick, simplifying the E-step by considering
a restricted set of distributions,
is strongly connected to something called variational inference,
which we will discuss in week four.