0:00

[SOUND] In this session we are going to introduce you to

an interesting extension of the hierarchical clustering method,

called BIRCH: a micro-clustering-based approach.

BIRCH is an abbreviation of Balanced Iterative Reducing and

Clustering Using Hierarchies.

It was developed by a group of researchers at the University of Wisconsin in 1996.

The general philosophy is that it incrementally constructs

a CF tree, or Clustering Feature tree,

which is a hierarchical data structure for multiphase clustering.

0:48

For phase 1, essentially,

it scans the database to construct an initial in-memory CF tree,

which is a multi-level compression of the data that

tries to preserve the inherent clustering structure of the data.

1:41

The BIRCH method scales linearly.

That means it finds a good clustering with a single scan, and

then improves the quality with a few additional scans.

The clustering feature we first introduce is the CF vector in BIRCH.

2:01

For the BIRCH Clustering Feature, suppose you put these five

points into one cluster,

and suppose these are the five points at their positions.

Then the CF vector contains three components.

One is N, the number of data points.

The second is LS, the linear sum of the N points in the cluster.

The third is SS, the square sum of the N points.

So you can probably see the first component, 5, means there are 5 points.

The second one is the linear sum over each dimension.

The third one is the squared sum over each dimension.

So the Clustering Feature is essentially a summary of the statistics of

a given sub-cluster: from the statistical point of view, you can consider

the count the zeroth moment, the linear sum the first moment, and

the square sum the second moment of the sub-cluster.

That means it registers the crucial measurements for

computing clusters and utilizes storage quite efficiently.
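
To make the CF vector concrete, here is a small Python sketch computing it for five illustrative 2-D points (the coordinate values are hypothetical, chosen only for this example):

```python
# Five illustrative 2-D points placed in one cluster (hypothetical values).
points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]

# CF = (N, LS, SS): count, per-dimension linear sum, per-dimension square sum.
N = len(points)
LS = tuple(sum(p[d] for p in points) for d in range(2))
SS = tuple(sum(p[d] ** 2 for p in points) for d in range(2))

print(N, LS, SS)  # 5 (16, 30) (54, 190)
```

Note that merging two sub-clusters only requires adding their CF vectors component-wise, which is what makes the summary incremental.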

Now we can look at

the general concepts of centroid, radius, and diameter.

The centroid is essentially the center of the cluster.

Suppose the cluster contains n points, x sub i.

Then the centroid is computed as the sum of

all the points in the cluster divided by the number of points in the cluster;

what we get is the centroid of the cluster.

Then the radius is the average distance from the member

objects to the centroid.

That is, for every point we take its difference from the centroid,

then we sum their squared distances,

divide by the number of points in the cluster,

and take the square root.

Essentially it is the square root of the average squared distance

from any point of the cluster to its centroid.
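
The centroid and radius definitions above can be sketched in Python for the same five illustrative points (hypothetical values):

```python
import math

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
n = len(points)

# Centroid: per-dimension mean of the member points.
centroid = tuple(sum(p[d] for p in points) / n for d in range(2))

# Radius: square root of the mean squared distance
# from each member point to the centroid.
radius = math.sqrt(
    sum(sum((p[d] - centroid[d]) ** 2 for d in range(2)) for p in points) / n
)

print(centroid, radius)  # (3.2, 6.0) 1.6
```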

4:19

What is the diameter?

The diameter is essentially the average pairwise distance within the cluster.

That means, for x sub i and x sub j within the same cluster,

there are in total n times (n minus 1) ordered pairs;

we sum up all the squared pairwise distances, divide by that count,

and then take the square root.

That's the diameter.
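
The diameter can be sketched the same way; as a check, the same value can also be derived from the CF vector (N, LS, SS) alone, which illustrates why the CF summary registers enough information for these measurements:

```python
import math

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
n = len(points)

# Direct computation: average squared distance over all n*(n-1) ordered pairs.
total = sum(
    sum((points[i][d] - points[j][d]) ** 2 for d in range(2))
    for i in range(n) for j in range(n) if i != j
)
diameter = math.sqrt(total / (n * (n - 1)))

# The same value recovered from the CF vector alone.
LS = (16, 30)
SS = (54, 190)
ss_total = sum(SS)                   # sum of squared norms of all points
ls_norm_sq = sum(l * l for l in LS)  # squared norm of the linear sum
diameter_cf = math.sqrt((2 * n * ss_total - 2 * ls_norm_sq) / (n * (n - 1)))

print(diameter, diameter_cf)  # both ~2.5298
```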

4:45

Then we look at the CF Tree structure in BIRCH.

The CF tree structure is very much like a B+-tree.

We can do incremental insertion of new points.

That means when a new point comes, we find the closest leaf entry:

we start from the root,

then we traverse down until we find the closest entry,

add the point to that leaf entry, and update its Clustering Feature, CF.

5:14

If this leaf entry's diameter becomes greater than the maximum diameter,

then we split the leaf, and possibly we

even split the parents, based on the B+-tree algorithm.

A CF tree has two parameters. One is the branching factor,

the maximum number of children per node.

The other is the maximum diameter of the sub-clusters stored at the leaf nodes.

A CF tree is essentially a height-balanced tree that stores

the clustering features.

The non-leaf nodes store the sums of the clustering features of their children.
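
As a much-simplified sketch of the leaf-level insertion logic only (ignoring the tree hierarchy, node splits, and branching factor; the names and threshold value here are hypothetical), inserting a point into the closest leaf entry, or starting a new entry when the diameter threshold would be exceeded, might look like:

```python
import math

def merged_cf(cf, point):
    """Return the CF (N, LS, SS) after absorbing one point."""
    n, ls, ss = cf
    return (
        n + 1,
        tuple(l + x for l, x in zip(ls, point)),
        tuple(s + x * x for s, x in zip(ss, point)),
    )

def diameter(cf):
    """Diameter computed from the CF alone (sqrt of avg squared pairwise distance)."""
    n, ls, ss = cf
    if n < 2:
        return 0.0
    return math.sqrt((2 * n * sum(ss) - 2 * sum(l * l for l in ls)) / (n * (n - 1)))

def insert(entries, point, max_diameter):
    """Add a point to the closest leaf entry; split off a new entry
    when the diameter threshold would be violated."""
    if entries:
        # Closest entry by squared distance from the point to the entry centroid.
        best = min(entries, key=lambda cf: sum(
            (l / cf[0] - x) ** 2 for l, x in zip(cf[1], point)))
        candidate = merged_cf(best, point)
        if diameter(candidate) <= max_diameter:
            entries[entries.index(best)] = candidate
            return
    # Start a fresh entry containing just this point.
    entries.append((1, tuple(point), tuple(x * x for x in point)))

entries = []
for p in [(3, 4), (2, 6), (4, 5), (20, 20)]:
    insert(entries, p, max_diameter=4.0)

print(entries)  # the far-away point (20, 20) ends up in its own entry
```

In the full algorithm the same "find closest, absorb or split" step is applied down a height-balanced tree, with non-leaf nodes summarizing their children's CFs.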

5:55

So we can see BIRCH is an interesting algorithm, because it is

an integration of agglomerative clustering with other flexible clustering methods.

At the low level we do micro-clustering.

We explore the CF feature and the BIRCH tree structure,

which preserves the inherent clustering structure of the data.

6:18

At the high level we do macro-clustering.

It provides sufficient flexibility for integration with other clustering methods.

So this method has influenced many other clustering methods and

applications for large data sets.

6:36

There are some concerns.

One is that the BIRCH tree is still sensitive to the insertion order of the data points.

Another is that, since the leaf nodes have a fixed size,

the clustering obtained may not be as natural.

And also, the clusters tend to be spherical, given that the radius and

diameter measures are the major parameters.
