So, now let's have a look at clustering using the k-means algorithm.

K-means is basically the "Hello World" of clustering algorithms.

So, again, let's consider a 3D vector space with 12 examples or 12 points,

and I intentionally created three clusters.

So by the naked eye,

it's easy to spot those.

This sometimes also holds for real-world datasets.

So consider this example,

you can easily spot the clusters and draw lines between them.

If you do so and compare it with the correct assignments,

you will notice that it's relatively easy to get good accuracy here,

just by using the naked eye.

But datasets tend to be high-dimensional,

so in hyper-dimensional space,

it's relatively hard to plot and draw lines.

Therefore, you need an algorithm which helps you.

So, this is where k-means kicks in.

So, k-means has to be initialized with the number of potential clusters.

So, assume we have some idea about the data and we set it to three.

Then k-means randomly places three cluster centroids in this 3D hyperspace,

and then it just computes the Euclidean distance between every point and

every cluster centroid, and assigns each point to the centroid with the minimum distance.
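As a minimal sketch of this assignment step (with hypothetical toy data; the point values and the NumPy implementation are my own illustration, not from the lecture):

```python
import numpy as np

# Hypothetical toy data: 12 points in a 3D vector space,
# and K = 3 randomly placed cluster centroids.
rng = np.random.default_rng(42)
points = rng.normal(size=(12, 3))
centroids = rng.normal(size=(3, 3))

# Euclidean distance between every point and every centroid: shape (12, 3).
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Assign each point to the centroid with the minimum distance.
assignments = np.argmin(distances, axis=1)
```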

So, after this has happened, for every centroid,

the point cloud surrounding that centroid is taken,

and the mean of all those points is computed,

and that mean gives us a new centroid.

Those are marked in black here, and then

the assignment of points to the nearest centroid happens again.

After this has been done, again

the mean of the new point clouds is computed,

and those means become the new centroids.

This is an iterative algorithm, and we can basically stop when it converges.
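Putting the assignment and update steps together, the whole iteration can be sketched like this (a minimal NumPy implementation of my own; here I stop when the centroids no longer move, which is one common convergence criterion):

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Minimal k-means sketch: assign each point to the nearest centroid,
    recompute each centroid as the mean of its point cloud, and stop
    when the centroids no longer move (convergence)."""
    rng = np.random.default_rng(seed)
    # Initialize the k centroids at randomly chosen data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Assignment step: Euclidean distance from every point to every
        # centroid, then take the minimum-distance centroid per point.
        distances = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        assignments = np.argmin(distances, axis=1)
        # Update step: each centroid becomes the mean of the points
        # assigned to it (kept in place if its cluster is empty).
        new_centroids = np.array([
            points[assignments == j].mean(axis=0)
            if np.any(assignments == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    return centroids, assignments

# Usage: 12 points forming three well-separated clusters, as in the lecture.
rng = np.random.default_rng(1)
data = np.concatenate([
    rng.normal(loc=c, scale=0.1, size=(4, 3))
    for c in ([0, 0, 0], [5, 5, 5], [10, 10, 10])
])
centroids, labels = kmeans(data, k=3)
```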

So, what does converging mean here?

Converging means we are actually optimizing over this function here.

So, what we want to do is we want to find

an optimal assignment of points to cluster centroids.

Those are in the set S, and in addition,

we want to find optimal positions of those cluster centroids.

So, here this sum goes up to K, the number of

clusters, and here we see that we take

the sum individually for each cluster centroid.

So, this sum is basically nothing else than the distance

between each point and the current cluster centroid,

and we take the absolute value and square it.

The reason why we square it, again,

is to penalize big distances and to favor small distances.
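Assembled from the description above, the objective can be written out like this (the symbol names S_k for the set of points assigned to cluster k and mu_k for that cluster's centroid position are my assumption, matching the S and K mentioned in the lecture):

```latex
\min_{S,\,\mu} \; \sum_{k=1}^{K} \; \sum_{x \in S_k} \left\lVert x - \mu_k \right\rVert^2
```

The outer sum runs over the K clusters, the inner sum over the points assigned to each cluster, and the squared norm is the squared Euclidean distance that penalizes big distances and favors small ones.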

At the end of the day,

after we have finished k-means,

we will find a good assignment of points to cluster centroids,

and we find a good position of cluster centroids in hyper space.

So, in summary, clustering is nothing else than

assigning points to clusters. In the specific case of

k-means, this works by, in effect, drawing hyper-dimensional spheres around the cluster centroids:

whatever point lies inside such a hypersphere belongs to

the cluster, and whatever point lies outside

doesn't belong to it and hopefully belongs to another cluster.

So, here you already see the limitation of k-means because we can

only define and learn spherical cluster boundaries,

and in later videos,

we will learn how we can get over this limitation.