0:00

The data used in machine learning processes often have many variables.

This is what we call high-dimensional data.

These dimensions may or may not matter in the context of our application and the questions we are asking.

Reducing such high dimensions to a more manageable set of related and useful variables improves the performance and accuracy of our analysis.

After this video, you will be able to explain what dimensionality reduction is, discuss the benefits of dimensionality reduction, and describe how PCA transforms your data.

The number of features or variables you have in your dataset determines the number of dimensions, or dimensionality, of your data.

If your dataset has two features, then it is two-dimensional data.

If it has three features, then it is three-dimensional data, and so on.

You want to use as many features as possible to capture the characteristics of your data, but you also don't want the dimensionality of your data to be too high.

As the dimensionality increases, the problem space you're looking at grows, requiring substantially more instances to adequately sample that space.

So as the dimensionality increases, the space that you are looking at grows exponentially.

As the space grows, data becomes increasingly sparse.

In this diagram, we see how the problem space grows as the dimensionality increases from 1 to 2 to 3.

In the left plot, we have a one-dimensional space partitioned into four regions, each with a size of 5 units.

The middle plot shows a two-dimensional space with regions of size 5x5 units.

The number of regions has now gone from 4 to 16.

In the third plot, the problem space is three-dimensional, with regions of size 5x5x5 units.

The number of regions increased even more, to 64.

We see that as the number of dimensions increases, the number of regions increases exponentially and the data becomes increasingly sparse.

With a small dataset relative to the problem space, analysis results degrade.
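The growth described in the diagram can be sketched in a few lines of code. This is an illustrative toy, not from the video: with 4 regions along each axis, the total number of regions is 4 raised to the number of dimensions.

```python
# Toy sketch of the diagram's point: with 4 regions per axis,
# the region count grows exponentially with the dimensionality.
regions_per_axis = 4

for dims in (1, 2, 3):
    print(dims, "dimensions:", regions_per_axis ** dims, "regions")
```

Running this prints 4, 16, and 64 regions for 1, 2, and 3 dimensions, matching the diagram.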

In addition, certain calculations used in analysis become much more difficult to define and compute effectively.

For example, distances between samples are harder to compare, since all samples are far away from each other.

All of these challenges represent the difficulty of dealing with high-dimensional data and are referred to as the curse of dimensionality.

To avoid the curse of dimensionality, you want to reduce the dimensionality of your data.

This means finding a smaller subset of features that can effectively capture the characteristics of your data.

Â 3:26

This reduces the dimensions of the data while eliminating irrelevant features, making the subsequent analysis simpler.

A technique commonly used to find the subset of most important dimensions is called principal component analysis, or PCA for short.

The goal of PCA is to map the data from the original high-dimensional space to a lower-dimensional space that captures as much of the variation in the data as possible.

In other words, PCA aims to find the most useful subset of dimensions to summarize the data.

Â 4:05

Here, we have data samples in a two-dimensional space that is defined by the x-axis and the y-axis.

You can see that most of the variation in the data lies along the red diagonal line.

This means that the data samples are best differentiated along this dimension, because they're spread out, not clumped together, along this dimension.

This dimension, indicated by the red line, is the first principal component, labeled as PC1 in the plot.

It captures the largest amount of variance along a single dimension in the data.

PC1, indicated by the red line, does not correspond to either axis.

The next principal component is determined by looking in the direction that is orthogonal, in other words perpendicular, to the first principal component, and that captures the next largest amount of variance in the data.

This is the second principal component, PC2, and it's indicated by the green line in the plot.

This process can be repeated to find as many principal components as desired.

Note that the principal components do not align with either the x-axis or the y-axis, and that they are orthogonal, in other words, perpendicular to each other.

This is what PCA does.

It finds the underlying dimensions, the principal components, that capture as much of the variation in the data as possible.

These principal components form a new coordinate system to transform the data to, instead of the conventional dimensions like x, y, and z.

So how does PCA help with dimensionality reduction?

Let's look again at this plot with the first principal component.

Since the first principal component captures most of the variation in the data, the original data samples can be mapped to this dimension, indicated by the red line, with minimal loss of information.

In this case, then, we map a two-dimensional dataset to a one-dimensional space while keeping as much information as possible.
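The 2-D-to-1-D mapping just described can be sketched numerically. This is a minimal illustration using NumPy (an assumption; the video names no library): build a diagonal cloud like the one in the plot, find the direction of greatest variance from the covariance matrix, and project each sample onto it.

```python
import numpy as np

# Minimal PCA sketch: project 2-D samples onto the direction
# of greatest variance (the first principal component, PC1).
rng = np.random.default_rng(0)

# Correlated 2-D data, like the diagonal cloud in the plot.
x = rng.normal(size=200)
data = np.column_stack([x, 0.8 * x + rng.normal(scale=0.3, size=200)])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
pc1 = eigvecs[:, -1]                     # direction with the largest variance

projected = centered @ pc1               # each 2-D sample becomes one number
print(projected.shape)                   # (200,)
```

Each two-dimensional sample is now a single coordinate along PC1, and because PC1 carries most of the variance, little information is lost.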

Here are some main points about principal component analysis.

PCA finds a new coordinate system for your data, such that the first coordinate, defined by the first principal component, captures the greatest variance in your data.

The second coordinate, defined by the second principal component, captures the second greatest variance in the data, and so on.

The first few principal components, which capture most of the variance in the data, can be used to define a lower-dimensional space for your data.
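Those main points suggest a simple recipe for choosing how many components to keep. This is a hedged sketch, not from the video: each component's share of the variance is its eigenvalue divided by the sum of all eigenvalues, so you can keep the first few components that together cover, say, 95% of the variance.

```python
import numpy as np

# Sketch: rank components by explained variance and keep the
# first few that together cover a chosen threshold (95% here).
rng = np.random.default_rng(1)
data = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # toy 5-D data

cov = np.cov(data - data.mean(axis=0), rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]       # sort descending
explained = eigvals / eigvals.sum()           # variance ratio per component

cumulative = np.cumsum(explained)
k = int(np.searchsorted(cumulative, 0.95)) + 1  # components for 95% variance
print("variance ratios:", np.round(explained, 3))
print("components kept:", k)
```

The 95% threshold is an arbitrary choice for illustration; in practice it depends on how much information loss the downstream analysis can tolerate.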

PCA can be a very useful technique for dimensionality reduction, especially when working with high-dimensional data.

While PCA is a useful technique for reducing the dimensionality of your data, which can help with the downstream analysis, it can also make the resulting analysis models more difficult to interpret.

The original features in your dataset have specific meanings, such as income, age, and occupation.

By mapping the data to a new coordinate system defined by principal components, the dimensions in your transformed data no longer have natural meanings.

This should be kept in mind when using PCA for dimensionality reduction.
