Well, one more thing we can do to

simplify this expression, to understand how to maximize it,

is to substitute in the definition of the marginal likelihood of x_i given

the parameters, from a few slides before.

So each data point's density is a mixture of Gaussian densities, right?

Okay, so now we have this optimization problem.

And just one more thing we forgot here: we have some constraints, right?

We have to say that the weights pi are non-negative, and that they sum up to 1.

Because otherwise, it will not be an actual probability distribution.
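If we write that out explicitly (a sketch, with N data points and K mixture components; the symbols follow the slides), the optimization problem is:

```latex
\max_{\pi, \mu, \Sigma} \; \sum_{i=1}^{N} \log \left( \sum_{c=1}^{K} \pi_c \, \mathcal{N}(x_i \mid \mu_c, \Sigma_c) \right)
\quad \text{subject to} \quad \pi_c \ge 0, \quad \sum_{c=1}^{K} \pi_c = 1.
```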

But now it seems like we're good to go.

Now you may use your favorite stochastic optimization algorithm from TensorFlow,

like Adam or whatever you would like to use,

and we can optimize this thing to find the optimal parameters, right?

Well, it turns out that we kind of forgot one important set of constraints here.

The covariance matrices sigma cannot be arbitrary.

Imagine

that your optimization algorithm proposes to use a covariance matrix with all zeros.

It just doesn't work. It doesn't define a proper Gaussian

distribution.

Because in the Gaussian distribution definition you have to invert this matrix,

and you have to compute its determinant, and divide by it.

So if you have a matrix which is all 0s,

you will have lots of problems like division by 0, and stuff like that.
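Just to make this concrete, here is a tiny NumPy check (my own illustration, not part of the lecture code) of what goes wrong with an all-zeros covariance matrix:

```python
import numpy as np

# An all-zeros "covariance matrix" breaks the Gaussian density formula:
# the density needs the determinant (we divide by it) and the inverse.
sigma = np.zeros((2, 2))

print(np.linalg.det(sigma))  # 0.0, so the density would divide by zero

try:
    np.linalg.inv(sigma)     # and the inversion fails outright
except np.linalg.LinAlgError as err:
    print("inversion failed:", err)
```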

So it's not a good idea to assume that the covariance matrix can be anything.

And actually, the set of valid covariance matrices is the set of so-called positive semi-definite matrices.

Don't worry if you don't know what that is, it's not that important right now.

The important part is that it's a really hard constraint to follow,

so it's hard to adapt your favorite stochastic gradient descent algorithm to

always follow this constraint.

So, maintaining this property that the matrices are always positive

semi-definite: I don't know how to do it efficiently. So we have a problem here;

we don't know how to optimize this thing, at least with stochastic gradient descent.

Well, it turns out that even if you get rid of this constraint, so

if you consider a simpler model, for

example if you say that all the covariance matrices are diagonal, which

means that the ellipsoids that correspond to the Gaussians cannot be rotated.

They have to be aligned with the axes.

In this case it's much easier to maintain this constraint.

And you can actually use some stochastic optimization here.

So for example, in this example I used Adam and

I tuned its learning rate to optimize this thing.

And you can see that it's doing a reasonable job here.

So the blue curve is the performance of Adam here.

On the x-axis we see epochs, and on the y-axis we see log likelihood,

which we are trying to maximize.

And so, Adam is doing a good job, right?

In like 10 epochs it optimized this thing to something reasonable.

And the green line here is the ground truth, which I have because I know

from which probability distribution I generated this data, so

I know the optimal value for the log-likelihood.
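As a sketch of how such an experiment can be set up (my own toy data, and SciPy's L-BFGS standing in for Adam, so this is not the lecturer's exact setup), here is the diagonal-covariance trick: the weights stay valid through a softmax, and the variances stay positive because we optimize their logarithms.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

# A sketch, not the lecture's code: fit a diagonal-covariance GMM by
# unconstrained optimization. The constraints become trivial if we
# reparameterize:
#   weights:   pi  = softmax(a)     -> non-negative, sums to 1
#   variances: var = exp(log_var)   -> strictly positive
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 0.5, size=(100, 2)),
                    rng.normal(+2.0, 0.5, size=(100, 2))])  # toy 2-cluster data
K, D = 2, X.shape[1]

def neg_log_likelihood(params):
    a = params[:K]
    mu = params[K:K + K * D].reshape(K, D)
    log_var = params[K + K * D:].reshape(K, D)
    log_pi = a - logsumexp(a)                      # log of softmax weights
    var = np.exp(log_var)
    # log N(x_i | mu_k, diag(var_k)) for every point i and component k
    log_comp = -0.5 * (np.sum(np.log(2 * np.pi * var), axis=1)
                       + np.sum((X[:, None, :] - mu) ** 2 / var, axis=2))
    return -np.sum(logsumexp(log_pi + log_comp, axis=1))

init = np.concatenate([np.zeros(K), rng.normal(size=K * D), np.zeros(K * D)])
res = minimize(neg_log_likelihood, init, method="L-BFGS-B")
print("log-likelihood: %.2f -> %.2f" % (-neg_log_likelihood(init), -res.fun))
```

The same reparameterization is what lets Adam or any other unconstrained optimizer work on this problem.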

But it turns out that even in this case, where we don't have these very complicated

constraints, you can do so much better by exploiting the structure of your problem.

And this is something we're going to discuss in the rest of the week:

something called the expectation maximization (EM) algorithm.

And if you apply it here, it just works so much better.

In a few iterations it found a value

which is better than the ground truth, which probably means it is overfitting,

but anyway, it works well on the test set as well.
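As a small preview of EM in action (assuming scikit-learn is available; this is not part of the lecture materials), its GaussianMixture class is trained with EM under the hood:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# A preview sketch, assuming scikit-learn is installed: GaussianMixture
# is fitted with the EM algorithm, and it handles full covariance
# matrices without any of the constraint headaches discussed above.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 0.5, size=(100, 2)),
                    rng.normal(+2.0, 0.5, size=(100, 2))])  # toy 2-cluster data

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("converged:", gmm.converged_, "after", gmm.n_iter_, "EM iterations")
print("average log-likelihood per point:", gmm.score(X))
```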

So to summarize, we may have two reasons not

to use stochastic gradient descent here.

First of all, it may be hard to follow some constraints which you may care about,

like positive semi-definite covariance matrices.

And second of all, the expectation maximization algorithm,

which can exploit the structure of your problem,

is sometimes much faster and more efficient.

So as a general summary: we discussed that the Gaussian mixture model is a flexible

probability distribution, which can solve the clustering problem for

you if you fit it to your data.

And sometimes it's hard to optimize with stochastic gradient descent, but

there is this alternative which we'll talk about in the next video.