0:15

Hello and welcome to this lesson that introduces Dimension Reduction.

Dimension reduction is a very important technique for reducing the number of features that you're going to use when building a machine learning model. Unlike feature selection, however, dimension reduction can actually create new features that are generated from the original set of features. In addition, techniques such as principal component analysis also give you a measure of the importance of each of these new features. This measure of importance is called the explained variance. You can use the explained variance to determine how many of these new features you actually need to keep in order to capture the majority of the signal that you're trying to model with a machine learning algorithm.

So, in this particular lesson, you need to be able to explain how the PCA algorithm operates. You should be able to explain the relationship between individual PCA components and the explained variance, and you should be able to apply the PCA algorithm by using the scikit-learn library.
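To give you a taste of what that looks like in practice, here is my own minimal sketch, not code from the lesson notebook; the toy data is made up purely for illustration. Fitting PCA with scikit-learn and reading off the explained variance takes just a few lines:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: two strongly correlated features, so a single direction
# carries almost all of the variance.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, x + 0.1 * rng.normal(size=200)])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # first component dominates
```

The `explained_variance_ratio_` attribute is the measure of importance we just discussed: it tells you what fraction of the total variance each new component captures.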

The activities for this particular lesson include two readings. The first one is a visual, interactive website, and I'll show you that in just a second. The second is a notebook by Jake VanderPlas that shows you how PCA can be applied to data by using the scikit-learn library. And then, lastly, there is our own notebook on introduction to dimension reduction.

So, let me just jump straight into this interactive website. There are a couple of different examples they provide to help you understand the effect of PCA in creating new dimensions. First is a 2D visualization. We have our original data with the PCA components overlaid, and then over here on the right we have the same data points plotted in the principal component dimensions. There's also a visualization down here of how those data points are distributed along the original features and then along the new features. You can see that along the second principal component they're all compact. That's the idea: among the new components, the first one has most of the variance and, in this case, the second one has very little variance. So, the first new feature actually encodes most of the information.

Then there's an example in 3D, and there's another example in a much higher dimensional space that actually has more information in it. I'm only going to talk briefly here about the 2D example; you can play with the 3D example on your own.

This is interactive, so we can move data points around. As we do, you'll notice that the principal components change, the distribution of the data points over here changes as well, and the way the variance is spread across the two components changes, too. So, if we move these points around a lot, you can see that now there's not really a good signal. This is sort of a circle, so we can't expect that our data is going to be well represented in the new feature space. On the other hand, if I move these data points down so they become a line again, you can see what happens to the new principal components up here. These points are almost all completely in line with the first principal component, and the second principal component has basically zero variance again. So, this shows you how PCA generates new features that ideally capture as much of the variance in the first few components as possible. In this case we only had two features, so the first new component had most of the variance.

The second thing to look at is this notebook. It talks a lot about principal component analysis: how it works and how you might use it to improve the results of whatever model you're trying to employ. It's a nice notebook that does this well. PCA is used for dimension reduction because, when we have the explained variance, we can say: look, we don't need to keep all the features; we can keep just a couple and still have most of the signal.

Of course, most of the content is in the notebook for this week, which is our dimension reduction notebook. Dimension reduction, and PCA in particular, is an unsupervised technique. We don't use labels to generate it; we learn the distribution from the data and compute the new components, or features, automatically.

In this particular notebook, you're going to look at the idea of principal component analysis. You're going to see how we actually create features from a data set, in this case the iris data set, and how we can apply that to machine learning. So, we're going to see how a signal can be captured from the original features and how well we can capture it from a reduced set of features via principal component analysis.

Then we'll look at a similar idea to principal component analysis called factor analysis. After that, we're going to move on to PCA applied to the handwritten digit data set, and you're going to see how we can vary the fraction of explained variance we capture, how that impacts the reconstruction of the original data set, and other things, such as the covariance matrix. And lastly, we'll have a quick demonstration of some other dimension reduction techniques, just to show you how you can use them and how they compare to PCA itself in terms of capturing the signal and reconstructing the data.

So, first, what do we do? We look at the PCA algorithm. Here's what it does. We have a data set here; you can see that it has a long elliptical shape. The idea is that there's clearly an important component, or feature, along this diagonal, and then perpendicular to that (PCA works with an orthogonal basis) a direction that has less signal. So we could imagine rotating this coordinate frame to capture that, and that's exactly what we do. We apply PCA to the data we just randomly generated, and you can see that the data is now distributed primarily along the new primary component, with less spread along the secondary component.

We then apply this to the iris data set, and we look at it visually. You can see that, in the original feature space, there's a combination, petal length and petal width, with a very compact shape, similar to what we just saw with the ellipse from our random data. So, we can create new features from that data set, and when we do, you can see that we generate four new features. The first one has 92.5 percent of the signal; that's quite a bit. The next one has 5.3 percent. So, the first two components capture almost 98 percent of the entire signal. That means those two features are probably all we need when we actually do machine learning. The other two are very small, nearly noise.
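If you want to reproduce those numbers yourself, here is a minimal sketch using scikit-learn's built-in copy of the iris data (the exact loading code in our notebook may differ):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Fit PCA on all four iris features and report how much of the
# total variance each new component explains.
X, _ = load_iris(return_X_y=True)
pca = PCA().fit(X)
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"component {i}: {ratio:.1%}")
```

The first two ratios printed should match the 92.5 and 5.3 percent figures mentioned above, with the last two components contributing very little.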

We can then apply this to machine learning. We can use an SVM on our original data set and get an accuracy and a confusion matrix, then do PCA on that data, run the exact same machine learning, and see what we get; it turns out that the results are consistent. Up here, our confusion matrix had just two errors, down here, in versicolor. And if we scroll down, you see we get the exact same results. So that showed you that we could reduce the data. We actually cut the amount of data we were analyzing in half, from four features to two features, and yet we got the same result. That means we're going to be running our analysis faster and perhaps getting more precise results, because we're not being affected by the noise or the small-variance features as we were in the original.
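Here's a rough sketch of that comparison. I'm using cross-validation rather than whatever train/test split the notebook uses, so the exact accuracies may differ, but the two scores should come out very close:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Accuracy with all four original features
acc_full = cross_val_score(SVC(), X, y, cv=5).mean()

# Accuracy with only the first two principal components
X2 = PCA(n_components=2).fit_transform(X)
acc_pca = cross_val_score(SVC(), X2, y, cv=5).mean()

print(f"original: {acc_full:.3f}  PCA(2): {acc_pca:.3f}")
```

For a fully rigorous comparison you'd fit the PCA inside each training fold, but since PCA is unsupervised, fitting it once on the features is an acceptable shortcut for a quick sketch.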

Next, we introduce factor analysis, a slightly different technique for computing the coefficients; they no longer need to be orthogonal as they are with principal component analysis. When we do this, it again shows us that there are two important features, just like we saw with the original iris data set.
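In scikit-learn, swapping PCA for factor analysis is essentially a one-line change; a minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

# Factor analysis, like PCA, learns a small set of latent features,
# but its factors are not constrained to be orthogonal.
X, _ = load_iris(return_X_y=True)
fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
X_fa = fa.transform(X)
print(X_fa.shape)  # two new features per sample
```

The `fit`/`transform` interface is identical to PCA's, which makes it easy to compare the two techniques side by side.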

Next, we move into the digit data set. We look at some images, we apply PCA to them, and we can compute a mean image from all of our pixels. We can also look at the fraction of explained variance. This is an interesting plot, because it shows us how many features we need to retain in order to capture a given fraction of the total variance in our signal. You can see that with just 40 features (remember, we started with 64), we retain 99 percent of the original signal, and with around 20, we have 90 percent of the original signal. So this data set can be compacted into a much smaller amount of data and still retain most of the signal.
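You can compute those component counts directly from the cumulative explained variance. Here is a sketch; the exact counts depend on the scikit-learn digits sample, so treat the printed numbers as approximate:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 8x8 images, so 64 pixel features

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching each variance target
n_90 = int(np.searchsorted(cumulative, 0.90)) + 1
n_99 = int(np.searchsorted(cumulative, 0.99)) + 1
print(f"90% of variance: {n_90} components; 99%: {n_99} components")
```

Plotting `cumulative` against the component index reproduces the explained-variance curve from the notebook.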

We can also visualize the components, these generated components, the new dimensions if you will, and that's them going along here. When you look at the original data, you can imagine how this particular component is capturing part of the zero, maybe a little bit of the two, part of the four, five, six, eight, and nine. This one kind of looks like a zero; this one, maybe the two, or maybe part of the eight. So you can sort of understand what the algorithm is doing. But the important thing here is that we don't actually need to know what the algorithm is doing. It's generating these components mathematically, and we're simply looking at them.

Notice that as we get closer to component 20 here, though, the fluctuations become more random. You see less and less structure, which is telling us, in part, that there's less and less signal here. And as we get to 40 and beyond, you can see that these carry very little information. Remember, with 40 features we had already retained 99 percent of the signal, and this tells us that the remaining features contain very little useful information.

We can also use PCA to recover our original data. What we do here is perform PCA on the data using different numbers of retained components: one component, two, et cetera, up through 10, 20, and 40. We then plot the data reconstructed from just those components. So, if we only keep one component, which doesn't have most of the signal, and we try to reconstruct the original images shown here on the top row, you can see it doesn't do a very good job. This three doesn't really look like a three; the four, maybe a little bit. Not until we get down to about five, or even 10, do we start to see that, yeah, this kind of looks like a zero, one, two, three, et cetera. When we get out to 40, you can definitely see that we've captured most of the signal. In other words, if you compare this row, or even this row with 20 components, to the very first row, you can see that the reconstruction is fairly faithful.
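The reconstruction itself is just a transform followed by an inverse transform. Here is a sketch that also measures how the error shrinks as we keep more components (the notebook shows this visually; here I'm summarizing it with a single number per setting):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

# Keep only k components, then project back to the 64-pixel space.
errors = []
for k in (1, 5, 10, 20, 40):
    pca = PCA(n_components=k).fit(X)
    X_back = pca.inverse_transform(pca.transform(X))
    err = float(np.mean((X - X_back) ** 2))
    errors.append(err)
    print(f"k={k:2d}  mean squared reconstruction error: {err:.2f}")
```

Each reconstructed row in the notebook's figure corresponds to one value of `k`; as `k` grows, the mean squared error shrinks and the digits become recognizable.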

The other thing you can look at is the covariance matrix, which relates the different pixels to each other. Of course the diagonal is highlighted, but you can see that different pixels are tied to other pixels. That's an interesting way to try to understand how the PCA components are actually calculated.
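Computing that covariance matrix yourself is straightforward; a sketch (the PCA components are, in fact, the eigenvectors of this matrix, which is why it helps explain how they're calculated):

```python
import numpy as np
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)

# 64 x 64 matrix: entry (i, j) is the covariance of pixel i with pixel j
cov = np.cov(X, rowvar=False)
print(cov.shape)  # one row and column per pixel
```

Displaying `cov` with an image plot reproduces the matrix shown in the notebook, with the bright diagonal being each pixel's variance with itself.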

And then, lastly, we look at some different techniques for performing dimension reduction. Here's PCA, with some of its top 10 components. Here's another technique called non-negative matrix factorization. There's fast independent component analysis, there's mini-batch PCA, there's mini-batch dictionary learning, and then there's also factor analysis. These are just different ways to show how that all works.

We can then use some of those techniques to reconstruct the original data set, and again, they all do a fairly similar job. In part, that's simply because the data we're analyzing can be recovered reasonably well in this manner.
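Here is a minimal sketch of running a few of those alternative estimators on the digits. I've included NMF, FastICA, and factor analysis; the notebook's parameter choices may differ, and I've fixed random seeds for repeatability:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF, FastICA, FactorAnalysis

X, _ = load_digits(return_X_y=True)

# Each estimator learns 10 new features from the 64 original pixels.
estimators = [
    NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0),
    FastICA(n_components=10, random_state=0),
    FactorAnalysis(n_components=10, random_state=0),
]
for est in estimators:
    Z = est.fit_transform(X)
    print(type(est).__name__, Z.shape)
```

All three share scikit-learn's `fit_transform` interface, which is what makes swapping dimension reduction techniques in and out so easy.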

With that, I'm going to go ahead and stop. Hopefully, I've given you a good introduction to dimension reduction and the importance of PCA. This is a fundamental technique that we'll often want to use before we apply a subsequent algorithm. If you have any questions, be sure to ask me in the course forums, and good luck.
