0:00

[MUSIC]

>> In this session, we're going to examine agglomerative clustering algorithms.

We already introduced the general concepts of agglomerative and divisive clustering algorithms.

Now, from the computer science point of view, we can think of agglomerative clustering as essentially bottom-up clustering.

That means we start from singleton clusters; that is, every single element is treated as one cluster.

Then we try to merge the nearest neighbors into bigger and bigger clusters, and eventually merge them into one cluster.

This is bottom-up, or agglomerative, clustering.

1:11

The algorithm merges the nodes that have the least dissimilarity, or the nearest neighbors.

Eventually, all the nodes are merged into one cluster.

That means you start from the singletons and find every cluster's nearest neighbor.

Then you check which pair of nearest neighbors is closest and merge them.

Once you have merged into bigger clusters, you look at the nearest neighbor between clusters; that is the single link.

Then you decide which pair to merge, and eventually you merge everything into one cluster.
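The bottom-up procedure just described can be sketched in a few lines. This is a minimal, naive illustration (quadratic pair search at every step, 1-D points, single-link distance), not the lecture's own implementation; the point values are made up.

```python
# Bottom-up (agglomerative) clustering sketch: start with singleton
# clusters, repeatedly merge the closest pair, stop at one cluster.

def single_link(ca, cb):
    """Cluster distance = distance between the two closest members."""
    return min(abs(a - b) for a in ca for b in cb)

def agglomerate(points):
    clusters = [[p] for p in points]   # every element is its own cluster
    merges = []                        # record of each merge step
    while len(clusters) > 1:
        # find the pair of clusters with the least dissimilarity
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0], merges

final, steps = agglomerate([1.0, 1.5, 5.0, 5.2])
print(steps[0])  # first merge is the closest pair: [5.0] and [5.2]
```

A naive search like this is O(n^3) overall; practical implementations maintain a distance matrix or priority queue to merge faster.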

In real agglomerative clustering algorithms, different similarity measures are used.

For example, we just discussed AGNES.

It uses the nearest neighbor, which we call the single-link measure.

You can also use the farthest neighbor, or diameter.

That is the complete link.

2:29

Let's examine these different links in a little more detail.

We first look at the single link, also called nearest neighbor.

The similarity, or the distance, between two clusters is defined based on the most similar, or closest, elements in the two clusters.

To give you an example, suppose one cluster is the U.S. and one is Cuba, and you want to find their closest points. In the U.S., that point will likely be Key West.

Such merging essentially examines a cluster's neighbors, or close regions, rather than the overall structure of the cluster.

In that context, you will be able to find irregularly shaped, or non-elliptical, groups of objects.

But this link is also sensitive to noise and outliers, because if you get an outlier that is very close to the other cluster, you may decide to merge them.
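As a small illustration of that sensitivity, here is the single-link distance on made-up 1-D values, where one stray point in the second cluster sits right next to the first (this example is mine, not from the lecture):

```python
# Single-link (nearest-neighbor) distance: the minimum pairwise
# distance between elements of the two clusters.

def single_link(ca, cb):
    return min(abs(a - b) for a in ca for b in cb)

a = [0.0, 1.0, 2.0]
b = [2.5, 10.0, 11.0]    # 2.5 plays the role of an outlier near cluster a
print(single_link(a, b))  # 0.5 -- one stray point dominates the measure
```

Even though most of `b` is far from `a`, the single outlier at 2.5 makes the clusters look close, which is exactly why single link may merge them.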

Another approach, called complete link, checks the diameter of the cluster.

The idea is to define the similarity between two clusters based on their farthest neighbors, that is, the most dissimilar elements in the two clusters.

For example, if you define the distance between the U.S. and Cuba with the complete link, you may pick the farthest point, in Alaska. That is pretty far apart.

So that means you actually try to merge the two clusters to form one with the smallest diameter, because when you merge them together, you want the final cluster to be pretty compact.

You examine non-local behavior and obtain compactly shaped clusters.

But this measure is also sensitive to outliers: you probably do not even want to think about Alaska; you may be thinking about Hawaii or Guam, which are already pretty far apart, when you try to merge.

Then we look at the average link.

The average link between two clusters is actually rather expensive to compute, because what you want to calculate is the average distance between all the elements in one cluster and all the elements in the other cluster.

That means you calculate the distances for all pairs across the two clusters, then average them.

For example, if the number of elements in cluster C_a is N_a, and the number of elements in cluster C_b is N_b, then the total number of pairs whose distances you compute is N_a times N_b. That is a pretty big set to average over.
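The N_a × N_b pair count is visible directly in a sketch of the average link (illustrative 1-D values, not from the lecture):

```python
# Average link: the mean over all N_a * N_b cross-cluster pair
# distances -- this full pairwise enumeration is what makes it
# expensive for large clusters.

def average_link(ca, cb):
    pairs = [abs(a - b) for a in ca for b in cb]
    assert len(pairs) == len(ca) * len(cb)   # N_a * N_b pairs in total
    return sum(pairs) / len(pairs)

a = [0.0, 2.0]
b = [4.0, 6.0]
# pairs: |0-4|, |0-6|, |2-4|, |2-6| = 4, 6, 2, 4 -> mean 4.0
print(average_link(a, b))  # 4.0
```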

6:04

You may even want to put a weight there: for example, if cluster C_a has many, many points compared with cluster C_b, you want to take this into consideration.

So we introduce GAAC, or group-averaged agglomerative clustering.

Suppose the two clusters C_a and C_b are merged into one cluster, C_{a union b}. Then the centroid of the merged cluster is defined by taking the centroid of C_a times the number of elements in C_a, plus the centroid of C_b times the number of elements in C_b, and finally normalizing by the total number of elements.

So this is the GAAC measure.
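The merged-centroid computation just described, the size-weighted average of the two centroids normalized by the total element count, can be sketched like this (1-D points are my illustrative assumption):

```python
# Centroid of C_a union C_b as the size-weighted mean of the two
# cluster centroids, normalized by the total number of elements.

def centroid(c):
    return sum(c) / len(c)

def merged_centroid(ca, cb):
    na, nb = len(ca), len(cb)
    return (na * centroid(ca) + nb * centroid(cb)) / (na + nb)

a = [1.0, 3.0]    # centroid 2.0, N_a = 2
b = [10.0]        # centroid 10.0, N_b = 1
# weighted: (2 * 2.0 + 1 * 10.0) / 3 = 14/3, the centroid of all points
print(merged_centroid(a, b))
```

Note the result is exactly the centroid of the pooled points, which is why the size weights are needed: a large cluster should pull the merged centroid more than a small one.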

Another interesting measure people have defined is called Ward's criterion.

The idea is to measure the increase in SSE once you do the merge.

Remember, once you merge the two clusters C_a and C_b into C_{a union b}, the SSE, the sum of squared distances, will increase, because the cluster becomes bigger.

Once it has increased, you compare with the original clustering: what is the difference, what is the increase you got from this merge?

Ward's criterion is defined using this increase; essentially, you look at the distance between the two centroids and, based on a size-dependent weight, you calculate the increase.
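Both views of Ward's criterion, the direct SSE increase and the weighted centroid-distance form, can be checked against each other in a small sketch (1-D example data is made up):

```python
# Ward's criterion: the increase in SSE caused by merging C_a and C_b.

def sse(c):
    m = sum(c) / len(c)
    return sum((x - m) ** 2 for x in c)

def ward_increase(ca, cb):
    """Direct form: SSE(C_a union C_b) - SSE(C_a) - SSE(C_b)."""
    return sse(ca + cb) - sse(ca) - sse(cb)

def ward_closed_form(ca, cb):
    """Equivalent form: (N_a * N_b / (N_a + N_b)) * (centroid distance)^2."""
    na, nb = len(ca), len(cb)
    d = sum(ca) / na - sum(cb) / nb
    return (na * nb / (na + nb)) * d * d

a = [0.0, 2.0]
b = [10.0]
print(ward_increase(a, b), ward_closed_form(a, b))  # both 54.0
```

The second function is the "distance between the two centroids times a weight" view from the lecture: the weight N_a·N_b/(N_a+N_b) grows with cluster sizes, so merging two large, well-separated clusters is penalized most.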

[MUSIC]