Clustering, or unsupervised learning,
tries to find clusters within which
the instances are similar to each other.
Then, how can we represent a cluster?
We just use cluster centroids.
The number of cluster centroids is way smaller than the number of data points,
so basically we use these centroids to represent the clusters.
Now, one of the most common types of algorithms is the partitional method:
you partition the data into a set of predefined clusters,
and for each instance in the data set
you calculate which cluster it is closest to,
then assign that instance to that cluster. Okay, this is one method.
Of course there's another method, the bottom-up approach,
but we're not going to discuss it here.
We will focus on the partitional method.
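As a tiny sketch of that closest-cluster assignment (the point and the centroids here are made up purely for illustration):

```python
import math

def closest_centroid(point, centroids):
    """Index of the centroid nearest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical 2D example with two predefined cluster centroids.
centroids = [(0.0, 0.0), (10.0, 10.0)]
print(closest_centroid((1.0, 2.0), centroids))  # -> 0, closer to the first centroid
```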
Now, in this 2D case,
I have six clusters, right?
So if I know beforehand that I have six clusters,
then I will be able to cluster all of them accordingly.
But if I set k to two,
then there are probably different ways to cluster, right?
It could be one cluster here and one here, or one here and another one here.
So which one is the best?
There are so many combinations.
We introduce k-means, an intuitive and common algorithm.
What it does is this: given the data points,
we start with an initial set of k centroids, m_1 to m_k,
and t is the iteration index we have to go through, okay?
In the very beginning t is one.
We need to have k centroids.
Then what we do is, for each data point,
we compare it with all of the centroids,
see which one it is closest to,
and assign the data point to the cluster corresponding to that centroid.
So in this case, you see, at iteration t,
if I can find m_i, the centroid i that is closest, right,
the one whose distance is smaller than all the others, all the remaining ones,
then I will assign this x_p to i, so that's cluster i.
But this is just for one instance, right?
If I have n instances,
I have to assign all of them accordingly to
these k clusters, and then the next iteration is t equals two.
But the centroids should be updated,
because now each cluster has a new set of instances assigned to it,
so I need to recalculate the centroids, and then I repeat this process.
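Putting one iteration into symbols (following the m_i, x_p, t notation above):

```latex
% Assignment step at iteration t: x_p goes to the cluster with the nearest centroid
x_p \in C_i^{(t)} \quad \text{if} \quad \|x_p - m_i^{(t)}\| \le \|x_p - m_j^{(t)}\| \;\; \forall j \in \{1,\dots,k\}

% Update step: each centroid becomes the mean of the instances now in its cluster
m_i^{(t+1)} = \frac{1}{|C_i^{(t)}|} \sum_{x_p \in C_i^{(t)}} x_p
```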
This is the algorithm: we have a data set of real-valued attributes
and k, the number of clusters, and we will return a clustering.
The result is called a clustering, okay?
A clustering of the data into k clusters.
You may ask, how can I determine the very first k centroids?
This is why we usually take k random instances in the data space
as the initial cluster centroids.
Now, after we determine that,
we go through the whole dataset.
While the centroids have not converged,
and what that means we will define later,
we do the following:
assign each instance to the cluster that has the closest cluster centroid;
once all instances have been assigned,
recalculate each cluster centroid by averaging the instances inside that cluster.
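Here is a minimal runnable sketch of that pseudocode in Python (the function name `kmeans` and the epsilon-based stopping test are my choices, not fixed by the lecture):

```python
import math
import random

def kmeans(data, k, eps=1e-6, max_iter=100, seed=None):
    """Cluster `data` (a list of numeric tuples) into k clusters.

    Returns (centroids, assignments); assignments[p] is the index of
    the cluster that data[p] belongs to.
    """
    rng = random.Random(seed)
    # Initialization: k random instances from the data become the first centroids.
    centroids = rng.sample(data, k)
    for _ in range(max_iter):
        # Assignment step: each instance goes to its closest centroid.
        assignments = [
            min(range(k), key=lambda i: math.dist(x, centroids[i]))
            for x in data
        ]
        # Update step: recalculate each centroid as the mean of its cluster.
        new_centroids = []
        for i in range(k):
            members = [x for x, a in zip(data, assignments) if a == i]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:  # keep the old centroid if a cluster ends up empty
                new_centroids.append(centroids[i])
        # Convergence test: stop when no centroid moved more than eps.
        moved = max(math.dist(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved < eps:
            break
    return centroids, assignments
```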
So it's very intuitive. That's why for any clustering work,
if we know k, we use k-means as a baseline algorithm
to compare with other, more complicated algorithms.
Because it is intuitive, people tend to accept the result more.
But there are issues:
how can I determine the right k, right?
And when do we stop?
We mentioned convergence.
Convergence can be defined as when the centroids are no longer changing.
So, at iteration t, I have k centroids;
if at iteration t plus one
the set of centroids is the same as the set before,
then we say it has converged.
But there's another way to define convergence.
We can just say the algorithm stops when the Euclidean distance between
the centroids in two consecutive steps is less than some small positive value.
Why? Because k-means does not guarantee that the algorithm will converge,
and of course we cannot wait forever.
So we use this distance,
a small distance between the two sets of centroids, to decide when to stop.
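Both definitions are easy to state in code (a sketch; `old` and `new` are the centroid lists from two consecutive iterations, in matching order):

```python
import math

def converged_exact(old, new):
    """Definition 1: the set of centroids no longer changes at all."""
    return set(old) == set(new)

def converged_eps(old, new, eps=1e-6):
    """Definition 2: every centroid moved less than some small eps."""
    return all(math.dist(o, n) < eps for o, n in zip(old, new))
```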
An alternative way to look at k-means is to
formulate it with an objective function that we then try to minimize.
So, what is that?
Again, we calculate the distance between the data points in a cluster,
the ith cluster, and its centroid.
You see, we have k clusters,
with i going from one to k,
and for the ith cluster, n_i is how many instances are in that cluster.
So I just calculate the difference between each instance in
this cluster and the centroid, then square it.
Why? Because the contribution must be positive,
but x_ji minus c_i can be positive or negative;
that's why we have to square it.
Then we add them all together
and we get the total error.
Now we want to minimize it;
a clustering that does so is a good one.
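Written out with the indices used here (x_ji is the jth instance in cluster i, c_i its centroid, n_i the cluster size), together with the stopping rule described just below:

```latex
% Objective: total squared distance of instances to their cluster centroids
J = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left\| x_{ji} - c_i \right\|^2

% Stop when the objective barely changes between consecutive iterations
\left| J^{(t+1)} - J^{(t)} \right| < \epsilon
```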
Okay? So the stopping criterion here is: when the difference between
the objective function values of two consecutive iterations
of the k-means algorithm is less than some small value,
we stop. Now, more discussion.
One question is: how can we know we have the right k? What should we do?
We can always talk to domain experts, if they have a sense of the data,
or if they want you to cluster the data into a certain number of k clusters.
But another way is to let your algorithm tell you.
So you just run with k equals three,
five, seven, whatever numbers,
and let the computer work for you.
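A sketch of "let the computer work for you", reusing the hypothetical `kmeans` from the sketch above plus a small helper for the objective J (both names and the toy dataset are mine):

```python
import math

def sse(data, centroids, assignments):
    """Objective J: sum of squared distances of instances to their centroids."""
    return sum(math.dist(x, centroids[a]) ** 2 for x, a in zip(data, assignments))

# Made-up 2D data, just so the loop below runs end to end.
data = [(0.0, 0.0), (0.5, 0.2), (0.2, 0.6), (10.0, 10.0),
        (10.5, 9.8), (9.7, 10.3), (5.0, 5.0), (5.2, 4.9)]

for k in (3, 5, 7):  # whatever candidate values make sense for your data
    centroids, assignments = kmeans(data, k, seed=0)
    print(k, sse(data, centroids, assignments))
```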
But finding the global optimum of a k-partition is computationally expensive;
it's a hard problem.
It is equivalent to finding the optimal centroids that minimize the objective function,
and we cannot wait forever.
So we just try to use some heuristics to help us.
Because at the end of the day when we do clustering,
we want to understand the data.
Then, after we gain some understanding,
we can always revise our hypothesis and retry.
Okay? That's why we can run k-means multiple times.
Running k-means multiple times has several meanings:
one is that I can change k; the other is that,
after I fix a k value, I can still run it multiple times.
Why? Because remember, in the very beginning,
choosing the k initial centroids is a random process,
so we never know which random set is the best.
We can just run several times,
and then pick the good one.
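And a sketch of that second meaning: fix k, rerun with different random initializations, and keep the run with the lowest objective (again reusing the hypothetical `kmeans`, `sse`, and `data` from above):

```python
k = 5  # a fixed k, chosen beforehand
best = None
for seed in range(10):  # ten different random initial centroid sets
    centroids, assignments = kmeans(data, k, seed=seed)
    error = sse(data, centroids, assignments)
    if best is None or error < best[0]:
        best = (error, centroids, assignments)
```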
But now, we also want to talk about
the difference between means and medians, okay? Why?
Because the mean, as we know, is the average,
and the median is the middle value;
there are differences.
Okay? So, here is an example.
We have a startup with eight employees,
one CTO, and one CEO.
These are their salaries:
five employees with 50k each,
two employees with 80k,
one employee with 90k,
and the CTO and CEO with 150k each.
So what's the average?
To get the average salary,
you add them all up and divide by 10, right?
So it is 80k. What is the median?
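A quick check of that arithmetic (salaries in thousands; with ten values, the median is the average of the 5th and 6th sorted values):

```python
import statistics

salaries = [50, 50, 50, 50, 50, 80, 80, 90, 150, 150]  # in thousands
print(statistics.mean(salaries))    # -> 80, the average salary is 80k
print(statistics.median(salaries))  # -> 65.0, the median is only 65k
```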