Hello. This lesson introduces the kernel density estimation technique using the Python scikit-learn module. This is an important module: it is one of the most important, if not the most important, machine learning frameworks in existence, certainly within the Python world if not beyond.

And so this will be your introduction to that particular module, and we're going to use it to build density estimates for one- and two-dimensional data sets.

The last thing we're going to do is take this ability to build a density estimate function from data using the scikit-learn module and apply it to generate new data from the density estimate.

Now, this is a really cool thing and it's a really powerful technique.

And hopefully, you'll get an idea of what I'm saying by the end of this lesson.

The notebook for this lesson is the Advanced Density Estimate notebook.

What we're going to do here is build on the introductory density estimation notebook with more complex techniques.

So first we're going to set up our notebook just as we've been doing.

We display all plots inline, do our standard imports, ignore any warnings that might appear, set our seaborn style, and load the iris data set.
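A minimal sketch of what a setup cell like that might look like (the exact imports and style choice are assumptions; here the iris data is loaded through scikit-learn rather than seaborn's downloader):

```python
# Standard notebook setup (in a notebook you would also run %matplotlib inline).
import warnings
warnings.filterwarnings('ignore')  # ignore any warnings that might appear

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

sns.set_style('whitegrid')  # assumed style; the notebook's choice may differ

# Load the iris data set as a pandas DataFrame (150 rows, 4 features + target).
iris = load_iris(as_frame=True).frame
```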

Now, we saw in seaborn how we could construct a kernel density estimate for our data, but that mainly made it easier to visualize what was going on. With scikit-learn, we actually build a functional representation of the density estimate, which means we can use that function to do new things.

So in this first code cell,

we actually do that.

We create a kernel density estimate of our data set using a bandwidth that we calculated and the data that we read in, converted into a NumPy matrix. We then get a function that represents that kernel density estimate, fit it to our data, and extract the results. So what does this do? Well, it does the same thing we did before with seaborn in one line, but it takes many more lines of code here.
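As a rough sketch of what that cell does (the bandwidth value and variable names here are illustrative, not necessarily the notebook's), building and evaluating the estimate with scikit-learn looks like this:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KernelDensity

# scikit-learn expects a 2D array of shape (n_samples, n_features),
# so pull out sepal length as a single-column NumPy matrix.
sepal_length = load_iris().data[:, 0].reshape(-1, 1)

# Build the estimator with a chosen bandwidth and fit it to the data.
kde = KernelDensity(kernel='gaussian', bandwidth=0.25)
kde.fit(sepal_length)

# score_samples returns log densities on a grid; exponentiate to
# recover the density curve you would draw over the histogram.
grid = np.linspace(4, 8, 100).reshape(-1, 1)
density = np.exp(kde.score_samples(grid))
```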

So here's our histogram,

here's our kernel density function estimate.

Now, if that's all we did,

that wouldn't be that exciting.

But what is exciting is this,

we can now sample from that model.

What we do here is we take the model,

this curve you see up here,

and we actually say: give us 15 new samples from the model. And if we were to overplot these on this figure, you would see that they follow the model.

In other words, most of the data are going to be at

the central range here say between 5 and 6.5.

And if we look at the data you can see that.
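The sampling step is just the fitted model's sample method. A self-contained sketch (bandwidth again illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KernelDensity

sepal_length = load_iris().data[:, 0].reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=0.25).fit(sepal_length)

# Ask the fitted model for 15 new synthetic sepal-length values;
# random_state is fixed only so the draw is reproducible.
new_samples = kde.sample(15, random_state=0)
```

Because each draw is effectively an original data point plus Gaussian noise, the samples concentrate where the density is highest, around 5 to 6.5 here.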

Now, one choice when you're doing kernel density estimation is what type of kernel to use; we often use the Gaussian, or normal, kernel. The second choice is the bandwidth. Generally, we've just been using what seaborn defines as the default, but there are many different approaches, and there's a lot of literature on how to choose the bandwidth properly.

So here what we do is step through a number of different bandwidths, four in particular, plot the different kernel density estimates that come out, and show the results.

So first, with a really small bandwidth of 0.1, you can see how it captures the fluctuations in the histogram. But as soon as the bandwidth gets bigger, we start smoothing those over.

So the green one looks more like what seaborn did.

And then if we increase it, we get this red one, which you can now see looks like a standard functional form; it's almost a Gaussian, and the purple one here is even closer to that.

This shows you that by changing the bandwidth you can smooth over fluctuations.

So if, for instance, you think that the fluctuations are due

solely to the fact that you don't have a large sample,

you might want a larger bandwidth size.
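A sketch of that bandwidth sweep, assuming four illustrative values (the notebook's actual values may differ):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KernelDensity

sepal_length = load_iris().data[:, 0].reshape(-1, 1)
grid = np.linspace(3, 9, 200).reshape(-1, 1)

# One density curve per bandwidth: small values track every bump in
# the histogram, large values smooth toward a single Gaussian-like hump.
densities = {}
for bw in [0.1, 0.25, 0.5, 1.0]:
    kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(sepal_length)
    densities[bw] = np.exp(kde.score_samples(grid))
```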

Now, so far we've looked at univariate, or one-dimensional, kernel density estimation.

We can also do this in multiple dimensions.

The easiest way to understand what this means is to look at a scatterplot,

which we're going to do with the seaborn joint plot method.

The joint plot makes a scatterplot of two dimensions,

in this case sepal width versus sepal length and it

also adds in these univariate histograms.

So you can see the distribution of the data in each axis.

Now, there are a lot of points here. If you imagine tens of thousands of points or even more, it would become hard to distinguish where the points are.

Even if we added jitter,

there still would be too many points to visualize.

That's where a two dimensional density estimate comes into play.

And in this case, what we would do is generate a smoothed version in two dimensions, shown as a contour plot. And that's what you see here: where there's a high region in the contour, that's where there are lots of points.

So there's a cluster of points here and there's a cluster of points here,

and you could see how this falls off.

This should be familiar to anyone who's seen topography on a map: if you're hiking and want to know the easiest way to get from one valley to another, you would want to go through the saddle point, because it sits between the two peaks.

So hopefully that shows you how a two-dimensional kernel density estimate can be useful, at least in visual terms.

You could also, of course, construct a two-dimensional kernel density estimate with scikit-learn, end up with a functional representation of this visualization, and then sample from it and use it in subsequent calculations.
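A sketch of that idea, fitting a two-dimensional estimate on the two sepal columns and sampling new (length, width) pairs (the bandwidth is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KernelDensity

# Two features: sepal length and sepal width.
sepal_data = load_iris().data[:, :2]

# KernelDensity handles multiple dimensions the same way as one.
kde2d = KernelDensity(kernel='gaussian', bandwidth=0.3).fit(sepal_data)

# New synthetic (length, width) pairs drawn from the 2D density.
pairs = kde2d.sample(100, random_state=0)
```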

Now, in order to help you understand the true power of this density estimate,

we're going to switch to a different data set.

We're going to use a sample of hand-written digit data to create new fake digit data.

This was originally done in the scikit-learn documentation, and we've changed it slightly for this particular course and notebook structure.

So first, we're going to use some helper code that I wrote and included with the notebook to get the data. It returns the features, or columns, and the labels, along with the original image data that we can use for plotting.

And that's what we do here: we simply read the data in and plot it. You can see each column is a different type of image data.

This is zeroes, ones, twos, et cetera.

There are 1,797 instances of these data, and what we do then is construct a kernel density estimate. We just choose a bandwidth here of 1.5.

One of the things you should definitely try doing is

changing that bandwidth and seeing what happens.

Then we create the kernel density estimate on these images, that is, on the features that go along with all of those 1,700-plus images.

Then, using that kernel density estimate, we sample. In this case we're only grabbing 60, though you could grab more if you wanted to, and then we simply plot the data.
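A simplified sketch of that whole pipeline, fitting the estimate directly on the raw 64 pixel features (the original scikit-learn documentation example, and possibly the notebook's helper code, first reduces dimensionality with PCA, but the overall structure is the same):

```python
from sklearn.datasets import load_digits
from sklearn.neighbors import KernelDensity

# 1,797 hand-written digit images, each 8x8 pixels flattened to 64 features.
digits = load_digits()

# Fit a kernel density estimate over the 64-dimensional pixel space.
kde = KernelDensity(kernel='gaussian', bandwidth=1.5).fit(digits.data)

# Draw 60 synthetic "digits"; each row reshapes back to an 8x8 image
# that can be plotted alongside the real ones.
fake = kde.sample(60, random_state=0)
fake_images = fake.reshape(60, 8, 8)
```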

So first we plot real data,

one row of real data and then below that are fake digits.

Now, these are not organized in columns.

These are just random digits that come out.

Also, note that these are eight by eight pixel images

and we've actually blown them up for this particular visualization.

If you scroll up here, you'll notice those are smaller. But even just from this data you can still infer: here's a three that's been simulated, this is perhaps a seven, this is perhaps a two, and this is a one.

And as you change the bandwidth, you'll see how the clarity of these generated images changes.

So let me take a step back and just say this again,

these images that you're seeing were generated from a model

that we built on handwritten digit data that had been scanned into a computer.

So we took the original data and we built a model representation.

If you thought about this in terms of a matrix,

these images are eight by eight or have 64 pixels.

So if we were to do this in a spreadsheet,

we would have 64 columns in the spreadsheet.

Each column would represent one pixel in these images,

each row would represent a single image.

We've taken that data, turned it into a model representation, and now we can sample from that model and say: give us a bunch of images.

Some of them will be ones,

some of them will be twos, et cetera.

And as you can see, sometimes the model makes pretty realistic looking images.

I hope you've gotten a sense of the excitement and importance of this.

We're actually starting to move beyond simple data analytics into

more complex model building which really allows deeper insights into data.

If you have any questions on this,

let us know on the course forums. And good luck.