0:00

In this example, we will look at linear regression. But before we start, we need to define the multivariate and univariate normal distributions.

The univariate normal distribution has the following probability density function. It has two parameters, mu and sigma. Mu is the mean of the random variable, and sigma squared is its variance. Its functional form is given as follows: a normalization constant, which ensures that the probability density function integrates to 1, times the exponent of a parabola. The maximum value of this parabola is at the point mu, and so the mode of the distribution is also the point mu.
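Written out, the density just described is the standard univariate Gaussian:

```latex
\mathcal{N}(x \mid \mu, \sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```

The factor in front is the normalization constant, and the expression inside the exponent is the parabola with its maximum at mu.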

If we vary the parameter mu, we get different probability densities. For example, the green curve has mu equal to -4, and the red one has mu equal to 4. If we vary the parameter sigma squared, we get either a sharp distribution or a wide one. The blue curve has variance equal to 1, and the red one has variance equal to 9.
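As a minimal sketch of these two effects (the function name `normal_pdf` and the evaluation points are just for illustration), we can evaluate the density directly:

```python
import math

def normal_pdf(x, mu, sigma2):
    """Univariate normal density N(x | mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# The mode is at mu: the density is maximal there.
peak_sharp = normal_pdf(0.0, 0.0, 1.0)  # variance 1, like the blue curve
peak_wide = normal_pdf(0.0, 0.0, 9.0)   # variance 9, like the red curve

# The sharper distribution has the higher peak at its mode.
print(peak_sharp, peak_wide)
```

Increasing the variance spreads the same unit of probability mass over a wider range, which is why the peak drops.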

The multivariate case looks exactly the same. We have two parameters, mu and sigma. Mu is the mean vector, and sigma is the covariance matrix. We again have a normalization constant, to ensure that the probability density function integrates to 1, and a quadratic term under the exponent. Again, the maximum value of the probability density function is at mu, and so the mode of the distribution is also equal to mu.
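In symbols, for a D-dimensional vector x, the multivariate density just described is:

```latex
\mathcal{N}(x \mid \mu, \Sigma)
  = \frac{1}{\sqrt{(2\pi)^D \,|\Sigma|}}
    \exp\!\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right)
```

The quadratic form under the exponent plays the role of the parabola from the univariate case.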

In neural networks, for example, we have a lot of parameters. Let's denote the number of parameters as D. The sigma matrix then has a lot of entries, about D squared. Actually, since sigma is symmetric, we need D(D+1)/2 parameters. It may be really costly to store such a matrix, so we can use an approximation. For example, we can use a diagonal matrix: all elements that are not on the diagonal are zero, and then we have only D parameters. An even simpler case has only one parameter; it is called a spherical normal distribution. In this case, the sigma matrix equals some scalar times the identity matrix.
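To see why this matters at scale, here is a small sketch of the parameter counts for the three covariance choices (the value of D is made up for illustration):

```python
# Number of free parameters in the covariance matrix of a
# D-dimensional Gaussian, for each of the three choices above.
D = 1_000_000  # e.g. the number of weights in a large model

full = D * (D + 1) // 2  # symmetric D x D matrix: D(D+1)/2 entries
diagonal = D             # only the diagonal entries
spherical = 1            # a single scalar times the identity

print(full, diagonal, spherical)
```

The full covariance needs half a trillion numbers here, while the spherical one needs a single scalar.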

Now let's talk about linear regression. In linear regression, we want to fit a straight line to data. We fit it in the following way. We want to minimize the errors: the red line is the prediction, the blue points are the true values, and we want, somehow, to minimize those black lines between them. The line is usually found by solving the so-called least squares problem. Our straight line is parameterized by a weight vector w. The prediction for each point is computed as w transposed times x_i, where x_i is our point. Then we compute the total sum of squares, that is, the squared difference between the prediction and the true value, and we try to find the vector w that minimizes this function.
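The least squares problem just described can be sketched with NumPy; the toy data points below are made up for illustration, roughly lying on the line y = 2x:

```python
import numpy as np

# Toy 1-D data: each row of X is a point x_i, y holds the true values.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Least squares: find w minimizing sum_i (w^T x_i - y_i)^2.
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)

predictions = X @ w
sse = np.sum((predictions - y) ** 2)  # the quantity being minimized
print(w, sse)
```

The fitted weight comes out close to 2, and `sse` is the sum of squared lengths of the "black lines" between predictions and true values.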

Let's see how this works from the Bayesian perspective. Here's our model. We have three random variables: the weights, the data, and the target. We're actually not interested in modeling the data, so we can write down the joint probability of the weights and the target, given the data. This is given by the following formula: the probability of the target given the weights and the data, times the probability of the weights. Now we need to define these two distributions. Let's assume them to be normal. The probability of the target given the weights and the data would be a Gaussian centered at the prediction, that is, w transposed X, with the covariance matrix sigma squared times the identity matrix. Finally, the probability of the weights would be a Gaussian centered around zero, with the covariance matrix gamma squared times the identity matrix.
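In symbols, writing gamma squared for the prior variance on the weights (the variance used in the derivation that follows), the model is:

```latex
p(y, w \mid X) = p(y \mid X, w)\, p(w), \qquad
p(y \mid X, w) = \mathcal{N}(y \mid w^\top X,\ \sigma^2 I), \qquad
p(w) = \mathcal{N}(w \mid 0,\ \gamma^2 I)
```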

Â 4:22

All right, so here are our formulas, and now let's train the linear regression. We'll do this in the following way. Let's compute the posterior probability over the weights, given the data. This would be the probability of the parameters given the data, that is, P(w | y, X). Using the definition of conditional probability, we can write that it is P(y, w | X) / P(y | X). Let's try not to compute the full posterior distribution, but to find the value at which this posterior distribution reaches its maximum. So we'll try to maximize this with respect to the weights. We can notice that the denominator does not depend on the weights, so we can maximize only the numerator and cross the denominator out. All right, so now we should maximize P(y, w | X). And this is actually given by our model, so we can plug in its formula: this would be P(y | X, w) P(w).

Â 7:04

We can take the logarithm, which does not change the position of the maximum, plug in the formulas for the normal distribution, and obtain the following result. The first term is log(C1 exp(-1/2 (y - w^T X)^T (sigma^2 I)^{-1} (y - w^T X))), where C1 is some normalization constant: the mean is w^T X, so the quadratic form is (y - w^T X), times the inverse of the covariance matrix, (sigma^2 I)^{-1}, and finally (y - w^T X) again, closing all the brackets. In a similar way, we can write down the second term: log(C2 exp(-1/2 w^T (gamma^2 I)^{-1} w)), since the mean is 0.

All right, so we can take the constants out of the logarithm, and the logarithm of the exponent is just the identity function. The inverse of the identity matrix is the identity matrix, and the inverse of sigma squared is one over sigma squared. So what we have left is -1/(2 sigma^2) (y - w^T X)^T (y - w^T X), and finally a term -1/(2 gamma^2) w^T w. The second quadratic form is actually a norm, w^T w = ||w||^2, and the first is also a norm, ||y - w^T X||^2.

So we try to maximize this with respect to w. If we multiply by -1 and also by 2 sigma^2, we go from the maximization problem to a minimization problem. Finally, the formula would be the norm ||y - w^T X||^2, plus some constant lambda, equal to sigma squared over gamma squared, times the norm ||w||^2. And since we multiplied by -1, it is a minimization problem.

So actually, the first term is the sum of squares, the least squares problem we started with. And the second term is an L2 regularizer. So by adding a normal prior on the weights, we went from the least squares problem to L2-regularized linear regression.
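This conclusion can be sketched numerically: the MAP estimate is the ridge solution w = (X^T X + lambda I)^{-1} X^T y with lambda = sigma^2 / gamma^2. The data, noise variance, and prior variance below are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 50 points in 3 dimensions with known true weights.
N, D = 50, 3
X = rng.normal(size=(N, D))
true_w = np.array([1.0, -2.0, 0.5])
sigma2 = 0.25  # noise variance sigma^2
gamma2 = 1.0   # prior variance gamma^2
y = X @ true_w + rng.normal(scale=np.sqrt(sigma2), size=N)

# MAP estimate = L2-regularized least squares with lambda = sigma^2 / gamma^2.
lam = sigma2 / gamma2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print(w_map)
```

A stronger prior (smaller gamma squared) gives a larger lambda and shrinks the weights harder toward zero, exactly as the derivation predicts.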
