0:15

Hello and welcome to Lesson 3 in Module 15. This lesson will introduce the density-based clustering technique, which is typically done with an algorithm called DBSCAN. DBSCAN doesn't require the number of clusters to be specified ahead of time, and that's a real benefit. On the other hand, it's a little more complex to understand. It's a density-based algorithm in that it finds over-densities of points and assigns those together to a cluster.

So by the end of this lesson, you should have a better understanding of what density-based clustering algorithms are doing. You should understand how the DBSCAN algorithm actually works, as well as be able to apply it using the scikit-learn library.

There are two readings for this particular lesson. The first is a visual example showing how this clustering works. So I'm going to once again go back, and I'm going to pick a different number of clusters. Here, you can see there are three clusters of points. And here is the DBSCAN algorithm being run, in this case with a particular value of epsilon and a particular value of minPoints. These are two of the hyperparameters. So let me just go ahead and run it, and you can see what happens.

It basically finds points and says, "Are you near the other points that I'm looking at? If so, you're part of this cluster." You notice that it stops when it's run out of reachable points. When it's done with that particular group of points, it goes to the next points, in this case the blue ones, and it keeps looking at those. And then when it's done there, it jumps up here to the green one. And this time, you can see it's taking longer than the K-means algorithm did.

And when we're done, you can see that we have three clusters. We didn't say there were three ahead of time, but we have three clusters that are marked. We also have these open circles. These are points that are farther away from the main cluster, as determined by these values here, epsilon and minPoints, such that they are left as outliers. So the DBSCAN algorithm is nice in that it analyzes data, finds clusters as well as points that are not part of any cluster, and marks those points. The label it gives them is typically called noise.

Let's go ahead and look at the notebook for this particular lesson. First, we talk a little bit about the formalism of the DBSCAN algorithm. Then we'll look at the Iris data, followed by the digit data. For the digit data, we're going to explore some of the hyperparameters, how we can cluster the digits, and what dimension reduction will do to a density-based algorithm.

So again, for the formalism, there are two main hyperparameters that we're going to worry about. The first is epsilon and the second is minSamples. The epsilon hyperparameter defines the maximum distance between two points for them still to be considered in the same density neighborhood. You could think of it this way: if you start to build a cluster of points, that's a density neighborhood, and as soon as you move beyond epsilon, you're no longer in that same density neighborhood.

The second hyperparameter is minSamples. This parameter specifies how many points must lie within the neighborhood of a given point in order for it to be considered a core point. The set of core points is what defines our cluster, because we have to connect points via paths through these core points in order to build that cluster. And you saw that with the visualization: we started on a point, drew a little circle with the epsilon radius, picked up the other points inside it, and kept doing this, and as soon as we got the minimum number of samples, that formed a connected region. If a point is outside of that epsilon, then we call it an outlier, because it wasn't reachable by a path between core points.
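Those definitions map directly onto scikit-learn's DBSCAN. Here's a minimal sketch on made-up data: two tight groups and one isolated point, so the roles of eps, min_samples, core points, and noise are all visible.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one far-away point that should become noise.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],   # dense group A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],   # dense group B
    [10.0, 10.0],                                      # isolated outlier
])

# eps: max distance for two points to share a density neighborhood.
# min_samples: points required in that neighborhood for a core point.
db = DBSCAN(eps=0.3, min_samples=3).fit(X)

print(db.labels_)                # -1 marks noise
print(db.core_sample_indices_)   # indices of the core points
```

Every point in each group has all four group members within eps, so both groups are labeled as clusters (0 and 1) built entirely from core points, while the isolated point gets the noise label -1.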

So we look at the DBSCAN algorithm, and we use it with the scikit-learn library. We talk a little bit about the different hyperparameters you might want to change. We run it, and when we first run it, we get three clusters with these default values that I put in. It turns out I actually tuned those parameters to make sure we would have three clusters. You see the number of core points is 53. Well, the dataset has 150 points, so that means most of the data is not treated as a core point.
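Counting clusters, core points, and noise in the notebook's style might look like this sketch. The eps/min_samples values here are scikit-learn's defaults, not the lesson's tuned ones, so the exact counts will differ from those above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

X = load_iris().data  # 150 samples, 4 features

# Illustrative hyperparameters; the lesson tunes its own values.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # ignore noise label
n_core = len(db.core_sample_indices_)
n_noise = int(np.sum(labels == -1))

print(f"clusters: {n_clusters}, core points: {n_core}, noise: {n_noise}")
```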

So we really want to look at this in a plot, and we can do that. Here is our data, and we've marked our core points with an X. These are points that, with that set of hyperparameters, are marked as being basically critical for defining the locations of cluster centers. We can also ask, "Well, where are these cluster centers?" Here are our clusters and how many members they have, and then we can of course plot that.

And here, what we're doing is we're saying, "Here's our data, with the big circles labeled by the Iris type. Then the small circles mark the individual clusters, with the last thing, noise, being marked by a purple dot." So now we have four different types of clusters, if you will: class 0, class 1, class 2, and then finally, noise.

And if you look at this, you see that the concentrated, high-density regions are clustered together, but these points out here are marked as noise: even though they're part of the Iris data, they're a little farther away from the main body, and so they're marked as noise. Same with this point down here. When we come over to these two Iris species, in this two-dimensional space they are intertwined, and thus it's hard to separate them out. So you can see here's a single cluster, and here's the second cluster, and there are lots of noise points in between. This just demonstrates how the algorithm is actually working. Notice this point was marked as actually part of the cluster while these two on either side of it weren't, again, simply due to the nature of the algorithm trying to find paths of connected points, and we see that that point was not included.

The rest of this notebook then switches over to the digit data. We're going to view it, and then we're going to start changing hyperparameters and seeing what happens. Here, the whole idea of the search is, "Is there a particular value of epsilon that will give us the right number of clusters?", which in this case should be 10 plus the noise. So we say, "Which of these give us 11?" As soon as one does, we print it out, and it turns out there's just this one value when minSamples is 20.
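A sketch of that epsilon scan might look like the following. The lesson only tells us that minSamples is 20 and that exactly one value yields 10 clusters plus noise; the eps grid below is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_digits

X = load_digits().data  # 1797 samples, 64 pixel features

# Scan epsilon at fixed minSamples, looking for 10 clusters plus noise,
# i.e. 11 distinct labels. The grid is a guess, not the notebook's.
for eps in np.arange(10.0, 40.0, 2.5):
    labels = DBSCAN(eps=eps, min_samples=20).fit_predict(X)
    if len(set(labels)) == 11:
        n_noise = int(np.sum(labels == -1))
        print(f"eps={eps}: 10 clusters + noise ({n_noise} noise points)")
```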

See that a lot of the data is noise in this case. The others have different ratios of membership. So we can actually look at these. We can ask, "How many clusters do we have?" and "Can we print them?" You can see that a lot of the clusters have a reasonable number of members, but this one has a little more than normal, and this one and this one are a little lower than normal. We can then take a look at the mean cluster images. And again, you can see that this cluster is a little confusing, maybe a little ambiguous; maybe this one as well. But then again, this one kind of looks like a 7. This one kind of looks like a 5, a 3, a 6, a 0. So some of them do very well in terms of being viewed, and others don't.

When we make our confusion matrix, it's a little different because now we have to include our noise, and that's done here at the top. You can see that that noise cluster contains actual images of pretty much every digit, with the exception of maybe 6 and 0. And if we come up here, you can see the 6 and the 0 are both really crisp and clear, so these are typically not marked as noise or outliers. The other sets typically do have some that are marked that way.
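Building that confusion matrix with a dedicated noise row can be sketched as follows. The eps value here is a stand-in, not the notebook's tuned one.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_digits

digits = load_digits()

# Illustrative hyperparameters; the notebook's tuned eps may differ.
labels = DBSCAN(eps=22, min_samples=20).fit_predict(digits.data)

# Rows are cluster labels with noise (-1) first; columns are true digits 0-9.
cluster_ids = sorted(set(labels))
matrix = np.zeros((len(cluster_ids), 10), dtype=int)
for row, c in enumerate(cluster_ids):
    for digit in digits.target[labels == c]:
        matrix[row, digit] += 1

print(cluster_ids)
print(matrix)
```

Each sample is counted exactly once, so the matrix entries sum to the dataset size, and the noise row shows which true digits end up unclustered.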

But otherwise, you see a lot of the clusters are done quite well, right? This particular one is split evenly between noise and cluster 1. This one does pretty well. The only one that doesn't do well is cluster 1, which is predicted as a 1, a 4, or a 9. And if we come up here, you can see that that is a very ambiguous cluster. The other one here was 4. If we come down to 4, actually that's a little confusing, because it looks like it should be pretty good, but it's a little messed up.

The last thing was dimension reduction. We can apply dimension reduction and then go and see what our clusters are. Notice that when I do that and use the same hyperparameters, we only get three clusters, and they look kind of strange. A 7, maybe, but this one is uncertain; that one is definitely certain. So we can change our hyperparameters, and we can now recover our cluster centers again. We can look at them and do similar analyses.
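The dimension-reduction step can be sketched like this: project the digits with PCA, then rerun DBSCAN. The eps value is again illustrative; the point is that the same eps corresponds to a different density after projection, since distances shrink, which is why the hyperparameters must be re-tuned.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data

# Project the 64-dimensional digits down to 2 dimensions.
X2 = PCA(n_components=2).fit_transform(X)

# Same (illustrative) hyperparameters in both spaces; after PCA the
# identical eps captures a very different density neighborhood.
labels_orig = DBSCAN(eps=22, min_samples=20).fit_predict(X)
labels_pca = DBSCAN(eps=22, min_samples=20).fit_predict(X2)

print(len(set(labels_orig)), len(set(labels_pca)))
```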

So hopefully this has given you a bit of a feel for density-based clustering and the DBSCAN algorithm. The nice thing about this algorithm is you don't have to know the number of clusters ahead of time. So you might think about running DBSCAN first to get a feel for what the data might do, and then running K-means. But again, this algorithm has some useful properties: you don't specify the number of clusters, and it also has the idea of a noise cluster, whose members are technically outliers. And we'll see the importance of anomaly detection and being able to find outliers in a future module.

If you have any questions, let us know in the course forum. And of course, good luck.