Minimising this objective requires us to find the orthonormal basis that spans the subspace we will ignore; once we have that basis, we take its orthogonal complement as the basis of the principal subspace.
Remember that the orthogonal complement of a subspace U consists of all vectors in the original vector space that are orthogonal to every vector in U.
Let us start with an example to determine the basis vectors.
And let's start in two dimensions, where we wish to find a one-dimensional subspace such that the variance of the data, when projected onto that subspace, is minimised.
So we're looking at two basis vectors b1 and b2 in R^2, where b1 spans the principal subspace and b2 its orthogonal complement, that is, the subspace that we will ignore.
We also have the constraint that b1 and b2 are orthonormal, which means that bi transpose times bj equals δij: this dot product is one if i equals j, and zero otherwise.
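In symbols, this orthonormality condition is:

```latex
\mathbf{b}_i^\top \mathbf{b}_j = \delta_{ij}
= \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}
```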
In our example with our two vectors, our loss function J is b2 transpose times S times b2, with the constraint that b2 transpose times b2 is one.
To solve this optimisation problem, we write down the Lagrangian, which is b2 transpose times S times b2 plus λ times (1 minus b2 transpose times b2), where λ is the Lagrange multiplier.
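Written out in symbols, with S denoting the data covariance matrix, the problem and its Lagrangian are:

```latex
\min_{\mathbf{b}_2}\; J = \mathbf{b}_2^\top S\,\mathbf{b}_2
\quad\text{subject to}\quad \mathbf{b}_2^\top \mathbf{b}_2 = 1,
\qquad
\mathcal{L}(\mathbf{b}_2, \lambda)
  = \mathbf{b}_2^\top S\,\mathbf{b}_2
  + \lambda\,\bigl(1 - \mathbf{b}_2^\top \mathbf{b}_2\bigr).
```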
Now we compute the gradients of the Lagrangian with respect to b2 and with respect to λ and set them to zero.
The partial derivative dL/dλ is 1 minus b2 transpose times b2, and this is zero if and only if b2 transpose times b2 is one, so we recover our constraint.
Now let's have a look at the partial derivative of L with respect to b2.
We get 2 times b2 transpose times S from the first term and minus 2λ times b2 transpose from the second term, and this needs to be zero, which is the case if and only if S times b2 equals λ times b2.
Here we end up with an eigenvalue problem: b2 is an eigenvector of the data covariance matrix, and the Lagrange multiplier plays the role of the corresponding eigenvalue.
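In symbols, setting both partial derivatives to zero gives:

```latex
\frac{\partial \mathcal{L}}{\partial \lambda}
  = 1 - \mathbf{b}_2^\top \mathbf{b}_2 = 0
  \;\Longleftrightarrow\; \mathbf{b}_2^\top \mathbf{b}_2 = 1,
\qquad
\frac{\partial \mathcal{L}}{\partial \mathbf{b}_2}
  = 2\,\mathbf{b}_2^\top S - 2\lambda\,\mathbf{b}_2^\top = 0
  \;\Longleftrightarrow\; S\,\mathbf{b}_2 = \lambda\,\mathbf{b}_2.
```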
If we now go back to our loss function, we can use this result.
We can write J, which was b2 transpose times S times b2.
Since S times b2 can be written as λ times b2, we get λ times b2 transpose times b2, and because b2 is a unit vector of our orthonormal basis, we end up with λ as our loss.
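The substitution chain we just went through reads:

```latex
J = \mathbf{b}_2^\top S\,\mathbf{b}_2
  = \mathbf{b}_2^\top (\lambda\,\mathbf{b}_2)
  = \lambda\,\mathbf{b}_2^\top \mathbf{b}_2
  = \lambda.
```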
Therefore, the average squared reconstruction error is minimised if λ is the smallest eigenvalue of the data covariance matrix.
That means we need to choose b2 as the corresponding eigenvector, and that vector spans the subspace that we will ignore.
b1, which spans the principal subspace, is then the eigenvector that belongs to the largest eigenvalue of the data covariance matrix.
Keep in mind that the eigenvectors of the covariance matrix are already orthogonal to each other because of the symmetry of the covariance matrix.
So if we look at a two-dimensional example, if this is our data, then the best projection we can get, the one that retains the most information, is the projection onto the subspace spanned by the eigenvector of the data covariance matrix that belongs to the largest eigenvalue, indicated by the long arrow over here.
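To make this concrete, here is a small NumPy sketch of the two-dimensional case; the toy data and variable names are illustrative choices, not part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2D data with more variance along one direction (illustrative only).
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[3.0, 1.2], [1.2, 1.0]],
                            size=500)

# Data covariance matrix S.
S = np.cov(X, rowvar=False)

# Eigendecomposition; eigh is suitable because S is symmetric.
eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order

b2 = eigvecs[:, 0]   # smallest eigenvalue: the direction we ignore
b1 = eigvecs[:, 1]   # largest eigenvalue: spans the principal subspace

# Variance of the data projected onto b1 matches the largest eigenvalue.
print(eigvals)
print(np.var(X @ b1, ddof=1))
```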
Now let's go to the general case.
If we want to find the M-dimensional principal subspace of a D-dimensional data set, we solve for the basis vectors bj, for j equal to M+1 up to D, which span the subspace that we will ignore.
When we optimise over these, we end up with the same kind of eigenvalue problem that we had earlier in our simple example: S times bj equals λj times bj, for j equal to M+1 up to D.
The loss function is then given by the sum of the corresponding eigenvalues, so we can write J as the sum over j from M+1 to D of λj.
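In symbols, the general case reads:

```latex
S\,\mathbf{b}_j = \lambda_j\,\mathbf{b}_j,
\quad j = M+1, \dots, D,
\qquad
J = \sum_{j=M+1}^{D} \lambda_j.
```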
So also in the general case, the average reconstruction error is minimised if we choose the basis vectors that span the ignored subspace to be the eigenvectors of the data covariance matrix that belong to the D minus M smallest eigenvalues.
Equivalently, this means that the principal subspace is spanned by the eigenvectors belonging to the M largest eigenvalues of the data covariance matrix.
This aligns nicely with the properties of the covariance matrix.
The eigenvectors of the covariance matrix are orthogonal to each other because of its symmetry, and the eigenvector belonging to the largest eigenvalue points in the direction in which the data has the largest variance; the variance in that direction is given by the corresponding eigenvalue.
Similarly, the eigenvector belonging to the second largest eigenvalue points in the direction of the second largest variance of the data, and so on.
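Here is a sketch of the general case in NumPy (the helper name principal_subspace and the toy data are my own, illustrative choices, not the course's code): it keeps the eigenvectors with the M largest eigenvalues and checks numerically that the variance along each principal direction equals its eigenvalue and that the average squared reconstruction error equals the sum of the discarded eigenvalues.

```python
import numpy as np

def principal_subspace(X, M):
    """Orthonormal basis of the M-dimensional principal subspace of the rows of X.

    A minimal sketch under the assumptions above, not a reference implementation.
    """
    Xc = X - X.mean(axis=0)                # centre the data
    S = np.cov(Xc, rowvar=False)           # data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # ascending eigenvalues, orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]      # reorder so the largest eigenvalue comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvecs[:, :M], eigvals         # columns of the basis span the principal subspace

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))   # toy 5-dimensional data
B, eigvals = principal_subspace(X, M=2)

Xc = X - X.mean(axis=0)

# Variance along each principal direction equals the corresponding eigenvalue.
print(np.var(Xc @ B, axis=0, ddof=1), eigvals[:2])

# Average squared reconstruction error equals the sum of the discarded eigenvalues
# (using the same 1/(N-1) convention as np.cov).
X_rec = Xc @ B @ B.T
err = np.sum((Xc - X_rec) ** 2) / (len(Xc) - 1)
print(err, eigvals[2:].sum())
```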
In this video, we identified the orthonormal basis of the principal subspace as the eigenvectors of the data covariance matrix that are associated with the largest eigenvalues.
In the next video, we will put all the pieces together and run through the PCA algorithm in detail.