So, in previous modules, we've talked about hierarchical data structures. We've talked about clustering and classification. We've talked about statistical analysis and methods for exploring data. In this module, we want to talk specifically about one data exploration method for trying to capture potential underlying structures within the data. Specifically, we want to learn about how to apply methods of hierarchical data analysis. So, given some set of data, imagine that we have a bunch of data about cars, and remember we have our car data set for miles per gallon, we might have the maker model of the car we may have the number of cylinders in a car, and we have a whole set of data, and what we can do is we can take methods called hierarchical clustering and produce a set of nested clusters, organizes a hierarchical tree, which then people can start reasoning about different levels within the tree and why these relationships between elements in the data might or might not exist, and then we can visualize this as what we call a dendrogram. So a dendrogram is just our node link diagram that we've talked about in previous modules, where the height between the levels corresponds to some distance metric. Remember in previous modules, we've also talked about how to calculate distance metrics and we'll go back through that again through. This can also be visualized as a tree-like diagram that record sequences emerges or splits. So, we can also take this dendrogram matrix and visualize it in some sort of nested sequence of merges as well, and we're going to talk about how we can do these merges and how to compute hierarchical clustering. Now, the nice thing about hierarchical clustering as compared to say, previous unsupervised methods we've learned like k-means, is there's no assumption on the number of clusters. In K-means, we have to choose a number K. We have to say I think there's three clusters in this data set. With hierarchical clustering, there's no assumption on the number of clusters. Instead, any number of clusters can be obtained by cutting the dendogram at the proper level. What I mean by this is, if this is our dendogram and you want three clusters, you figure out where you would draw a line through the data such that you would get three separate groups of things. So for example, now I've got an edge here, an edge here, a cut here, and a cut here. So, I wind up with four clusters, if I do my cut in that case. So thinking about where we can cut the node link dendrogram, and then the remaining children that are grouped together become our clusters. What's interesting is that hierarchical clustering could correspond to meaningful taxonomies and the data. This is often used in biological sciences, so, phylogeny reconstruction or in the web trying to see if we can meaningful identify product categories et cetera. But this is another way to explore and reason with your data and to help people look at these different hierarchies that are produced to see if there might be anything of interest there. Now there's two main types of hierarchical clustering methods. There's agglomerative and divisive. So, the agglomerative method is going to start with points as individual clusters. So, each data element is its own points, its own cluster, and then what we do is we look at all the points and at each step we're going to merge the two closest points and they become a new cluster. We continue doing this until all the points belonging to one big cluster, that that merge step is showing in our dendrogram tree. Now, divisive clustering does the opposite. We start with everything in one big cluster and at each step, we're going to split the cluster until it contains just a single point or until there's k clusters left. Traditional hierarchical algorithms use a similarity or distance matrix, and there's a ton of different distance metrics that we've talked about before, and these distance metrics are used to merge or split one cluster at a time. Now, the thing is hierarchical clustering is relatively computationally complex. There's a distance matrix, is used for deciding which clusters to merge and split, and it's going to be at least quadratic in the number of data points. So, it's often not usable for very very large data sets. There's a lot of different techniques for speeding this up, for doing distributed hierarchical clustering. So, there are ways for doing this, but if you have a very large data set, oftentimes you may choose a different data exploration method simply due to time complexities.