Welcome back.

In the previous video, we saw a method of learning a matrix factorization that uses gradient descent to approximate the SVD.

We called this algorithm FunkSVD.

In this video, we're going to look at where the formulas came from.

We described the structure of the algorithm, and we gave you the update rules: how do you update the user and item feature values at each step of the algorithm?

But we didn't explain how we derived those rules, so in this video that's what we're going to do.

We're going to walk through and derive each of those update formulas.

So gradient descent tells us how to find a minimum.

We've got some function, and we want to find the x value where the function takes its smallest value (or its largest value, if we're doing gradient ascent).

We have an update rule that we use for this, and it uses a vector of partial derivatives.

So this lecture is going to use some calculus.

As you saw in the previous video, we follow the error.

We take the insight that the singular value decomposition is the breakdown of the matrix that has minimal RMSE.

So we're just going to directly, using gradient descent, find the matrices that minimize the RMSE.

First we do a couple of simplifications.

The RMSE is the square root of the sum of the squared errors divided by the number of ratings, n.

Now, square root is monotonic: if a value goes up, so does its square root.

Also, we're evaluating over a single dataset; we're testing all these candidate matrices on the same dataset, so n is the same.

Dividing by a constant is also monotonic: if a value goes up, so does that value divided by the constant.

So: find the recommendation parameters that minimize the sum of squared error.

Those same matrices will minimize the root mean squared error, and the sum of squared error is a much easier formula to work with.

So we're going to minimize the sum of squared error.
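In symbols (writing SSE for the sum of squared errors and n for the number of ratings), the argument is:

```latex
\mathrm{RMSE} = \sqrt{\frac{\mathrm{SSE}}{n}}, \qquad
\mathrm{SSE} = \sum_{(u,i)} \epsilon_{ui}^2
```

Since $x \mapsto \sqrt{x/n}$ is monotone increasing for fixed $n$, whatever parameters minimize SSE also minimize RMSE.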

Now, recall our prediction rule: the predicted score of item i for a particular user u, which I'm going to abbreviate here as r tilde sub ui, is our baseline value plus the dot product of the user and item feature vectors for that user and that item.

So we compute the error by subtracting the prediction from the rating, which is epsilon sub ui = r sub ui minus b sub ui minus this dot product.
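As a minimal sketch of this computation (the variable names and example values here are mine, for illustration only, not from the lecture):

```python
def predict(b_ui, p_u, q_i):
    # Predicted rating: baseline plus the dot product of the
    # user and item feature vectors.
    return b_ui + sum(p * q for p, q in zip(p_u, q_i))

# Hypothetical values for one (user, item) pair.
b_ui = 3.5          # baseline prediction for user u, item i
p_u = [0.5, -0.2]   # user u's feature vector
q_i = [0.3, 0.8]    # item i's feature vector
r_ui = 4.0          # observed rating

# epsilon_ui = r_ui - b_ui - (p_u . q_i)
error = r_ui - predict(b_ui, p_u, q_i)
```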

Then for the updates: gradient descent is based on a rule for theta (it's common to call the parameters of one of these models theta; here theta is P and Q together, the two matrices).

Theta at step n equals theta at step n minus 1, minus the gradient of our error with respect to theta, scaled by a learning rate.

And the gradient is just this big matrix of partial derivatives.

So we're training individual user and item feature values one at a time.

We've got a particular rating, r sub ui; it has a particular user u and a particular item i.

We're also training one feature at a time: we train the first feature, then we train the second feature, and so on. Say we're training feature f.

So we're trying to update p sub uf and q sub if, the user and item feature values for that particular feature.

So we really only have two values we care about at any step.

We're using something called stochastic gradient descent in FunkSVD, which means we're updating for every rating.

Rather than going over all the ratings and computing one big update matrix, we're immediately updating with every rating.

So at any given step through the algorithm, we care about these two values; we're trying to update these two values.

So how do we do that?

We want to take the derivative with respect to p sub uf of the squared error, that is, of epsilon sub ui squared.

If you've taken calculus and remember your derivative rules for dealing with powers, that's going to be equal to 2 times epsilon sub ui, times the derivative with respect to p sub uf of epsilon sub ui.

The error epsilon sub ui contains p sub uf in one term; everything else does not contain p sub uf at all.

When we take the derivative, those other terms are going to go away, because the derivative of anything that doesn't involve the variable we're differentiating with respect to is 0.

So it's 0, 0, a bunch of 0s here, except for one term, which is minus p sub uf times q sub if, and differentiating that is just multiplying by a constant.

So negative q sub if is all we have in this derivative.

So that's going to be equal to minus 2 epsilon sub ui times q sub if.

So the derivative with respect to the parameter we're updating, the user feature value, is minus 2 times the error times the item feature value.

And if you take the derivative with respect to the item feature value, it all works out the same way, so you get minus 2 times the error times the user feature value.
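Writing this derivation out in full:

```latex
\epsilon_{ui} = r_{ui} - b_{ui} - \sum_{f} p_{uf}\, q_{if}

\frac{\partial}{\partial p_{uf}} \epsilon_{ui}^2
  = 2\,\epsilon_{ui}\,\frac{\partial \epsilon_{ui}}{\partial p_{uf}}
  = 2\,\epsilon_{ui}\,(-q_{if})
  = -2\,\epsilon_{ui}\, q_{if}
```

and symmetrically, the derivative with respect to $q_{if}$ is $-2\,\epsilon_{ui}\,p_{uf}$.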

So this brings us to the update: since the update is the value minus the derivative times a learning rate lambda (to keep the steps small), we get p sub uf prime, the next value, equals p sub uf minus lambda times minus 2 epsilon sub ui q sub if.

We can roll this 2 into the lambda, and the two minus signs cancel, and we get the update rule that we saw in the previous video: the value plus lambda epsilon sub ui q sub if.
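A sketch of this update for a single rating and a single feature f (the function name is mine; I'm assuming the factor of 2 has been folded into the learning rate lambda, as described above):

```python
def funk_sgd_step(p_uf, q_if, error, lam):
    # One stochastic gradient descent step of FunkSVD for feature f:
    #   p_uf' = p_uf + lambda * epsilon_ui * q_if
    # and symmetrically for q_if.
    # Both updates use the values from before this step.
    new_p = p_uf + lam * error * q_if
    new_q = q_if + lam * error * p_uf
    return new_p, new_q

p, q = funk_sgd_step(p_uf=0.1, q_if=0.2, error=0.5, lam=0.05)
```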

That gives us our update rule for FunkSVD.

So, as we saw, in gradient descent the next value is the previous value minus the learning rate times the derivative.

It uses the previous step's values, the learning rate, and the gradient of the error.

We apply this to our derivative, and we get exactly the update rules that we saw last time: we add lambda times the error times the item feature value to get the user feature value update, and vice versa.

We then apply regularization to discourage the user and item feature values from getting very big without a lot of evidence.

So we take this additional term gamma, the regularization parameter, and we decrease the update based on how big the value we're updating is.

This makes it take a lot more data to justify large user and item feature values.
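With the regularization term added, the single-feature sketch from earlier becomes (again, the names here are mine, for illustration):

```python
def funk_sgd_step_regularized(p_uf, q_if, error, lam, gamma):
    # Regularized FunkSVD update: subtracting gamma * (current value)
    # shrinks feature values toward zero, so large values need more
    # supporting ratings to persist.
    new_p = p_uf + lam * (error * q_if - gamma * p_uf)
    new_q = q_if + lam * (error * p_uf - gamma * q_if)
    return new_p, new_q

p, q = funk_sgd_step_regularized(p_uf=0.1, q_if=0.2, error=0.5,
                                 lam=0.05, gamma=0.02)
```

Note that with gamma = 0 this reduces to the unregularized update.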

So that is the calculus behind the FunkSVD update rules.

A lot of algorithms are going to follow this same approach; we're going to talk about it in more detail towards the end of this course.

The basic approach is to optimize an error function, and in order to actually implement it, you need to compute the derivatives, which give you the update rules.

A lot of the time, when an algorithm is published, that's been done for you; sometimes, especially if you're developing your own new algorithm, you're going to have to do it yourself.

This is an example of how to do that with one widely used algorithm.
