In this video, we are going to discuss a latent variable model for clustering.

So what is clustering?

Imagine that you own a bank, and you have a bunch of customers,

and each of them has some income and some debt.

And you want from this data, so

you can represent each of your customers on a two dimensional plane as a point.

And from this data,

you want to decompose your customers into three different clusters.

Why?

Well, for example, you want to find people who spend money on cars and

make some promotions for them, some car related loan or something.

This can be useful for different retail companies and banks, and

companies like that.

So find meaningful subset of customers to work with.

And this is an unsupervised problem, so we don't have any labels,

we just have raw data, raw x-ses.

Usually clustering is done in a hard way, so for

each data point we assign it a color.

This data point is orange, so it belongs to the orange cluster and

this one is blue.

Sometimes, people do soft clustering.

So instead of assigning each data point a particular cluster,

we will assign each data point a probability distribution over clusters.

So for example, the orange points on the top of this picture are certainly orange.

And they have a probability distribution like almost 100%

to belong to their orange cluster and almost 0% to belong to the rest.

But the points on the border between orange and blue, they are kind of not settled.

They have for example, 40% probability to belong to the blue cluster, and

60% probability to belong to the orange cluster, and 0% to the green.

And we don't know which cluster at this point actually belong to.

So instead of just assigning each data point a particular cluster, we assume that

each data point belongs to every cluster, but with some different probabilities.

And to build a clustering methods with this property,

we will treat everything probabilistically.

Why can we want that?

Well, there are several reasons.

First of all, we may want to again handle missing data naturally.

And another reason, that we may want to consider clustering

in probabilistic way, is to tune hyperparamters.