0:00

In this lecture, I will show you how to make a clustergram in MATLAB.

Hierarchical clustering is another way to visualize high-dimensional data. It clusters observations by distance and builds a hierarchical structure on top of that. It gives more detailed information about the differences among clusters. For example, it can tell you which genes contribute the most to the difference between two clusters.

Here is an example of a hierarchical clustergram. It is made of a heat map in the middle, dendrograms on the left and top, and row and column labels on the right and bottom. There is also a scale bar on the left.

This is the same data set I used in the PCA plotting. Each column is the gene expression profile of one tumor cell, and each row is a gene. The color indicates relative expression values: red indicates high expression values, and blue indicates low expression values.

Looking at the column labels, we find that gene expression profiles of the same subtype are nicely clustered together. And there are three red clusters in the heat map, corresponding to the three subtypes.

Recall that the colors indicate expression values. So we can say that this bunch of genes at the top is highly expressed in cluster one, which is subtype three. These genes in the middle are highly expressed in subtype two. And these genes at the bottom are highly expressed in cluster three, which is subtype one.

Here is an example of a clustergram simulated from random numbers. In this clustergram, no distinct clusters can be observed. Red and blue colors are just mixed all together. And the column labels of the three subtypes are, as expected, also mixed. You cannot find any order in it.

I always want to present a random figure, because the tumor gene expression data we used is quite good. You can see clear patterns in it, but many data sets will be noisy and fall somewhere between the nice tumor cell data and the simulated random data.

Though the clustergram may look amazing and complex at first sight, its mechanism is quite simple. In this and the next few slides, I will explain how it works.

Suppose that we now have six gene expression profiles, a to f. On the left are their representations in a two-dimensional PCA figure. The question is, how would we like to cluster them? Well, by eye, you may want to cluster b and c together, cluster d, e, and f together, and leave a alone, but this is quite arbitrary. So is there a way to rationally and computationally cluster these data points? Hierarchical clustering offers the solution.

4:19

Then, how many clusters we get depends on the level at which we set the cutoff. If we set the cutoff here, we will get only two clusters: cluster A and cluster BCDEF. If we set the cutoff here, we get three clusters: cluster A, cluster BC, and cluster DEF. And if we set the cutoff at the lowest level, we will have our original six data points.

The dendrogram we saw in the clustergram is just a compact representation of this hierarchical tree-like structure after turning it upside down.
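To make the cutoff idea concrete, here is a minimal sketch in plain Python rather than MATLAB. The coordinates for a to f are invented for illustration, average linkage on Euclidean distance is assumed, and the function name `cluster_into` is hypothetical. Stopping the merging when k clusters remain plays the role of setting the cutoff at a given level of the tree.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D coordinates for the six profiles a to f (not from the lecture).
points = {
    "a": (0.0, 8.0),
    "b": (5.0, 5.0),
    "c": (6.0, 5.5),
    "d": (8.0, 3.0),
    "e": (8.5, 2.5),
    "f": (9.5, 3.5),
}


def average_linkage(cluster_x, cluster_y):
    """Mean Euclidean distance over all cross-cluster pairs of points."""
    pair_distances = [dist(points[p], points[q]) for p in cluster_x for q in cluster_y]
    return sum(pair_distances) / len(pair_distances)


def cluster_into(k):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    clusters = [frozenset([name]) for name in points]  # start: one cluster per point
    while len(clusters) > k:
        x, y = min(combinations(clusters, 2), key=lambda pair: average_linkage(*pair))
        clusters = [c for c in clusters if c not in (x, y)] + [x | y]
    return sorted("".join(sorted(c)) for c in clusters)


two_clusters = cluster_into(2)    # a high cutoff
three_clusters = cluster_into(3)  # a lower cutoff
six_clusters = cluster_into(6)    # the lowest cutoff: nothing is merged
print(two_clusters, three_clusters, six_clusters)
```

With these made-up coordinates, asking for two clusters gives a and bcdef, asking for three gives a, bc, and def, and asking for six returns the original data points.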

That is the main idea of hierarchical clustering. Here are some additional things you may want to consider when making a clustergram.

The first topic is the metric. The metric defines how to measure the distance between two gene expression profiles. The most common metric is the Euclidean distance. Each gene expression profile is a vector of values, and the Euclidean distance is calculated by the formula below.
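For reference, the standard Euclidean distance between two profiles can be written as follows. The symbols are my own notation and may differ from the slide: $x$ and $y$ are the two profiles, and $n$ is the number of values in each profile.

```latex
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
```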

5:22

I think most of you are familiar with this formula. Besides Euclidean distance, you can choose cosine distance, correlation distance, Hamming distance, and so on. But most of the time, Euclidean distance will do the job. One special case is when your data set is binary. Then you may want to use Hamming distance as your metric, because it is specially designed for binary data.
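As a small illustration of the binary case, here is a sketch in plain Python rather than MATLAB. The two binary profiles are invented, and Hamming distance is taken here as the fraction of positions that differ (some texts use the raw count of differing positions instead).

```python
from math import dist


def hamming_distance(u, v):
    """Fraction of positions at which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v)) / len(u)


# Hypothetical binary profiles (1 = gene detected, 0 = not detected).
profile_1 = [1, 0, 1, 1, 0, 0, 1, 0]
profile_2 = [1, 1, 1, 0, 0, 0, 1, 0]

mismatch_fraction = hamming_distance(profile_1, profile_2)  # 2 of 8 positions differ
euclidean = dist(profile_1, profile_2)                      # square root of the mismatch count
print(mismatch_fraction, euclidean)
```

For 0/1 data the Hamming distance reads directly as "what share of the genes disagree", which is usually easier to interpret than the Euclidean value.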

Look at this picture again. You can see that hierarchical clustering is performed twice, in both directions: column-wise and row-wise. These two clusterings are independent of each other because the order of the components does not matter when you compute the distance between two vectors. If this doesn't make sense to you, don't worry. Just remember that the two clusterings are independent of each other.

The result is that similar expression profiles are clustered together, and genes that have similar expression across all profiles are also clustered together. For example, genes consistently highly expressed in cluster two are clustered together, like here.
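The point that the order of the components does not matter can be checked with a short sketch in plain Python rather than MATLAB. The two profiles are invented values. Reordering the genes in the same way for both profiles leaves their Euclidean distance unchanged, which is why reordering the rows does not affect the column clustering, and the other way around.

```python
import random
from math import dist, isclose

# Two hypothetical expression profiles measured over the same five genes.
profile_x = [2.0, 0.5, 3.1, 1.2, 4.4]
profile_y = [1.0, 0.7, 2.9, 3.3, 4.0]

# Reorder the genes, applying the same permutation to both profiles.
rng = random.Random(0)
order = list(range(len(profile_x)))
rng.shuffle(order)
shuffled_x = [profile_x[i] for i in order]
shuffled_y = [profile_y[i] for i in order]

original_distance = dist(profile_x, profile_y)
shuffled_distance = dist(shuffled_x, shuffled_y)
print(isclose(original_distance, shuffled_distance))
```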

The second topic is the linkage function. You need a linkage function when you want to calculate the distance between clusters. Here is a simple example. You want to calculate the distance between the cluster made of data points d and e and the data point f.

6:59

There are a few options. The most common method is called average. In this method, we calculate the distance between d and f and the distance between e and f. Then we use the average of the two distances as the distance between the de cluster and f. In the median method, we use the median of the distances. For single, we use the shortest of the two distances, and for complete, we use the longest of the two.

Here's one more example. If you want to calculate the distance between cluster bc and cluster de using the single method, you calculate the distances between b and d, c and d, b and e, and c and e, which gives you four distances. You will find that the distance between c and d is the shortest, so you use this distance as the distance between these two clusters.
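The bc versus de example can be sketched in plain Python rather than MATLAB. The coordinates are invented so that c and d happen to be the closest pair, as in the example above, and the helper name `pairwise_distances` is hypothetical. Only single, complete, and average linkage are shown.

```python
from math import dist
from statistics import mean

# Hypothetical 2-D coordinates (not from the lecture).
points = {"b": (5.0, 5.0), "c": (6.0, 5.5), "d": (8.0, 3.0), "e": (8.5, 2.5)}


def pairwise_distances(cluster_x, cluster_y):
    """Map each cross-cluster pair such as 'cd' to its Euclidean distance."""
    return {p + q: dist(points[p], points[q]) for p in cluster_x for q in cluster_y}


distances = pairwise_distances("bc", "de")  # four distances: bd, be, cd, ce

single = min(distances.values())            # shortest pair distance
complete = max(distances.values())          # longest pair distance
average = mean(list(distances.values()))    # mean of all pair distances
closest_pair = min(distances, key=distances.get)
print(closest_pair, single, average, complete)
```

Whatever the data, single linkage can never exceed average linkage, and average linkage can never exceed complete linkage, because they are the minimum, mean, and maximum of the same set of distances.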

One more thing to consider is standardization. Standardization converts data into standardized z-scores. A z-score tells you how many standard deviations a value is away from the mean. If a value equals the mean plus 2 standard deviations, its z-score is 2. Standardization is a normalization process that forces the values to fall into a range that is most suitable for visualization in a clustergram.

8:24

There are two options: row standardization and column standardization. Row standardization calculates the z-scores for each row, and column standardization calculates the z-scores for each column. For gene expression data, we generally use row standardization because we want to see, for each gene, how its expression values change across different conditions.
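Row standardization can be sketched in a few lines of plain Python rather than MATLAB. The expression matrix is invented, the function name `standardize_rows` is hypothetical, and the sample standard deviation is assumed. Note how two genes with very different absolute levels end up on the same scale.

```python
from statistics import mean, stdev


def standardize_rows(matrix):
    """Convert each row to z-scores: (value - row mean) / row standard deviation."""
    standardized = []
    for row in matrix:
        # stdev is the sample standard deviation; a constant row would give 0 here
        # and fail on the division, so this sketch assumes every gene varies.
        row_mean, row_sd = mean(row), stdev(row)
        standardized.append([(value - row_mean) / row_sd for value in row])
    return standardized


# Hypothetical expression matrix: rows are genes, columns are samples.
expression = [
    [10.0, 12.0, 14.0],     # a gene with low absolute expression
    [100.0, 300.0, 500.0],  # a gene with high absolute expression
]
z_scores = standardize_rows(expression)
print(z_scores)
```

Both rows come out as roughly -1, 0, 1, so the heat map colors show how each gene changes across samples instead of being dominated by the highly expressed gene.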

Okay, now we will begin our demo on clustergram in MATLAB.

11:47

This command, however, looks too long and is not easy to write. Actually, many popular properties are already set by default: the default metric is Euclidean, and the default linkage is average. So you can write the command in a shorter form, like the one below. In this command, you do not need to specify rowPdist, columnPdist, and linkage, because Euclidean and average are already used by default. So this command looks nicer and shorter, and it does the same thing. I paste it here and run it, and we get the same figure.

12:30

After you get this clustergram, you can use this button to get a scale bar, this button to toggle the dendrogram, this button to zoom in, and this button to zoom out. Once you are in zoom-in mode, you can use this button to pan over the figure.

One nice thing about this clustergram is that you can select a subset of the clustergram and copy it to a new clustergram. Then you can examine this part of the clustergram in close detail.

Here I will teach you a trick to export the clustergram in vector format. First, click Export Setup. Then change Rendering to Painters (vector format) and click Export.