0:00

Finally, let's take a look at a complete numerical example.

In this case again, just like in the Google PageRank lecture, we are constrained to a slide on a page, so the example does not represent the challenges of scale or sparsity. If you recall, scale and sparsity were among the big challenges of the Netflix Prize. So this example only serves to illustrate the five steps we just outlined for the neighborhood method.

Here is our starting point. We are given, as constants, the rating information. Each entry of this big matrix R is r_ui. You can see it is ten users by five movies: users 1 up to 10, and movies A to E. And it is 80 percent full: 40 out of the 50 possible entries are actually known. This is, again, very atypical. Generally speaking, only about one percent, instead of 80 percent, of the entries are filled in typical real data. But if we only had one percent here, we wouldn't have even a single entry of known data for training. Now, ten of these entries are represented by bars, meaning the rating simply does not exist for that user-movie pair.

Now we're going to hide ten out of the 40 known entries as test data and use the remaining 30 as training data. So, 30 samples for training and ten to test our prediction. Somewhat arbitrarily, we just pick these ten circled entries as test data. We pretend we don't know them at first, train with the remaining 30 points, and see how we do on these ten points. First of all, the very lazy predictor is just the average r_bar: in this case 3.83, and in the Netflix case, 3.6.

All right. Step number one: the baseline predictor. Let's construct the baseline predictor

r_hat(u,i) = r_bar + b_u + b_i,

where r_bar is 3.83 in this example, plus a bias for each user and a bias for each movie, and this holds for all (u, i) pairs. So, how many parameters have we got? Ten user parameters and five movie parameters: fifteen parameters in this case for the baseline predictor. And we're going to find the right baseline predictor parameters b_u* and b_i* by solving a least squares problem. The least squares problem is, as you'll recall: minimize ||Ab - c||^2, the two-norm squared, and the solution satisfies the normal equation A^T A b* = A^T c.

So, what are A, b, and c? The vector b is easy: it holds our fifteen optimization variables, the parameters under training, b_1 up to b_10 and then b_A up to b_E. What about the matrix A and the vector c? We simply construct those like we did in the small two-movie, one-user example before. In particular, A is a 30-by-fifteen matrix: 30, because we have 30 training points, from r_1A down to r_10E. So basically, all the entries here that are not circled in red are our training data. We have 30 equations and fifteen variables, corresponding to b_1 up to b_E. And this matrix A is 0-1 binary: it is zero everywhere except in the columns for that particular user and that particular movie. Say, in this row, the first equation out of 30, we're talking about the training data r_1A, so we write a one in the first position, corresponding to b_1, and in the eleventh position, corresponding to b_A. And the vector c is simply the known training data, r_1A down to r_10E, each minus r_bar.

So, now that we have defined A, b, and c, we can use the normal equation as the solution to this least squares problem, and the resulting vectors b_u* and b_i* are partially shown here: ten entries and five entries, respectively. Then you can stick these back into the predictor, with the b_u and b_i optimized by training, and you get the baseline predictor. That's step number one. Now, as a sanity check (in the textbook you can see all the entries), you can verify that b_1 is quite positive and b_A is also quite positive. That means user 1 tends to give high scores, and movie A is well-liked. And indeed, that's the case: user 1 tends to give high scores, and movie A is well-liked, with a lot of fours and fives. So, the bias computation is indeed making sense.
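To make step one concrete, here is a minimal sketch, assuming numpy. The ratings below are randomly generated stand-ins, not the actual table from the slide; the point is the mechanical construction of A and c and the least squares solve.

```python
import numpy as np

# Step one sketch: baseline predictor r_hat(u, i) = r_bar + b_u + b_i,
# fit by minimizing ||A b - c||^2. Ratings are made-up stand-ins.
n_users, n_movies = 10, 5
rng = np.random.default_rng(0)

# Sparse training set: (user, movie, rating) triples.
train = [(u, i, float(rng.integers(1, 6)))
         for u in range(n_users) for i in range(n_movies)
         if rng.random() < 0.6]

r_bar = np.mean([r for _, _, r in train])        # the "lazy" predictor

# A is 0-1 binary: each training rating gives one row, with a 1 in the
# user's column and a 1 in the movie's column (movies sit after the users).
A = np.zeros((len(train), n_users + n_movies))
c = np.zeros(len(train))
for row, (u, i, r) in enumerate(train):
    A[row, u] = 1.0
    A[row, n_users + i] = 1.0
    c[row] = r - r_bar                           # shift by the average

# lstsq solves the normal equations robustly (A^T A is singular here,
# since adding a constant to all b_u and subtracting it from all b_i
# leaves the predictions unchanged).
b_star, *_ = np.linalg.lstsq(A, c, rcond=None)
b_u, b_i = b_star[:n_users], b_star[n_users:]

# Baseline prediction for every (u, i) pair.
r_hat = r_bar + b_u[:, None] + b_i[None, :]
```

By construction, this baseline can only do at least as well on the training set as the lazy predictor r_bar alone, since setting all biases to zero is one feasible point of the least squares problem.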

Step number two is the prediction from the baseline. The baseline predictor, using the optimized b_u, b_i and the average r_bar, gives you this prediction, which we call r_hat. If you look at this r_hat, remember that some of these entries correspond to training data and some to test data. These, for example, are the test data. Let's compare with the actual data r_ui for, say, five out of the ten test entries. What do we have here?

We have five, three, two, two, five. Those are the actual r_ui's that we were hiding from ourselves. Now, are the baseline's predictions close to five, three, two, two, five? Reasonably close, I guess. Take 4.62: if you round up, it becomes five, but if you compute the RMSE contribution, there is a difference of 0.38. Then 3.49 versus three: this is getting a little off. Half a star off is quite off, actually. And depending on how you treat this 3.49, you may round it to four and actually incur a significant error; even without rounding, the difference of 0.49 is quite big. This difference is huge: it's between two and 2.78. This difference is also big, and this one is quite small. So, we got this one quite right, this one quite right, and this one is sort of reasonable, but these two predictions are not that good, okay? One way to deal with this is to use regularization to avoid overfitting, covered in the advanced material. But in general, the baseline predictor doesn't leverage any collaborative filtering of the global structure, so we're not doing very well yet. But we're not done.

So, at the end of step two we shift R by r_hat, and we get the shifted matrix R-tilde. I'm writing down just two of its five columns, corresponding to movies B and C, because we're going to use them to illustrate step number three: constructing the similarity matrix D, entry by entry d_ij. Let's look at a particular example, the movie pair B and C. Out of the ten users, how many have already rated both movies B and C? Well, user 1 rated both. User 2 rated both. But then the other users either have not rated one of the movies, or have rated it but we can't use it for training, because it's reserved for testing. So we go all the way down to user 9, where we actually have data for user 9 rating both movies B and C. So, there are three users, 1, 2, and 9, who rated both movies B and C and can therefore be used to compute the distance between movies B and C: the cosine similarity coefficient.

Â 10:09

The numerator is (-0.91)(0.9) + (-0.11)(-0.31) + (-0.87)(0.33), and then I have to normalize: divide by the square root of 0.91^2 + 0.11^2 + 0.87^2, times the square root of 0.9^2 + 0.31^2 + 0.33^2. This is the formula for the cosine similarity coefficient d_BC, which turns out to be -0.84. Notice that some of these entries are negative, because after shifting the ratings by the baseline predictor, each entry is positive or negative relative to the baseline. And the similarity coefficients are sometimes negative because the two movies are dissimilar from each other. These two are indeed quite dissimilar: -0.84 is quite close to -1. So, we can say movies B and C are quite dissimilar.
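The d_BC computation can be written out as code. The three co-rated users (1, 2, and 9) give one shifted rating per movie; the signs of the vectors below are inferred from the lecture's arithmetic, so treat the exact values as an assumption.

```python
import numpy as np

# Cosine similarity between movies B and C, restricted to the three
# users who rated both. Shifted-rating values inferred from the lecture.
b = np.array([-0.91, -0.11, -0.87])   # shifted ratings of movie B (users 1, 2, 9)
c = np.array([ 0.90, -0.31,  0.33])   # shifted ratings of movie C (users 1, 2, 9)

d_bc = b @ c / (np.linalg.norm(b) * np.linalg.norm(c))
print(round(d_bc, 2))                  # -0.84: B and C are quite dissimilar
```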

Now, we can do this for all pairs of movies to come up with the entire matrix D here. The diagonal entries don't matter, so just cross them out. The off-diagonal entries are symmetric by the way we define them: d_BC = d_CB. And indeed, this entry, d_BC or d_CB (same thing), is -0.84 as we just calculated, and you simply follow the same procedure for the rest of the entries. That's the end of step three: we have constructed the similarity matrix. Now, step four is to define a neighborhood. Let's say the neighborhood for any movie has size two. So, the two movies i with the largest absolute value |d_ij| relative to movie j will be included in the neighborhood.
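The neighborhood selection just described can be sketched as follows. Only the movie-D similarities to A and B (-0.97 and -0.73, which reappear in the next step) come from the lecture; the other entries are made-up placeholders.

```python
import numpy as np

# Size-2 neighborhood for movie D: the two movies with the largest |d_ij|.
# Only the A and B similarities are from the lecture; C and E are made up.
movies = ["A", "B", "C", "E"]
d_D = np.array([-0.97, -0.73, 0.10, -0.40])   # similarities to movie D

top2 = np.argsort(-np.abs(d_D))[:2]           # indices of the largest |d|
print([movies[j] for j in top2])              # ['A', 'B']
```

Note that the ranking uses the absolute value: a strongly negative similarity is just as informative as a strongly positive one.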

So, let's take an example: say, r_hat_3D. We want to estimate user 3's rating of movie D. Remember, we did not use that actual information: you can see here that (3, D) is one of the entries we were hiding from ourselves, so now we can predict it and see if we can get to the true value of two. So, we want to find out what r_hat_3D is based on the neighborhood method. Well, first of all, there is the baseline predictive component of r_hat_3D, which happens to be 2.78, and then there is the neighborhood prediction component.

Now, what is that? Since we decided the neighborhood consists of two movies, what are the top two most similar, or most dissimilar, movies for movie D? Looking at the row for movie D, we can see that movies A and B are the most similar or dissimilar in absolute |d_ij| terms. Actually, both are very dissimilar: A and B are both very dissimilar to D. So, we have used, so far, the information about different people's tastes on these movies, A down to E, between movie D and movies A and B. And we take that information, the fact that A and D are very dissimilar by people's record, and that B and D are also quite different by people's record, as the reason that A and B should be incorporated as we predict anyone's liking or disliking of movie D.

So, following our formula, we have -0.97 times the shifted rating of user 3 on movie A, plus -0.73 times the shifted rating of user 3 on movie B. The first rating happens to be 0.9 in the shifted domain, and the second happens to be -0.19 in the shifted domain. In particular, we see that the same user 3, whose taste on movie D we're trying to predict, really liked movie A. But since A and D are very different kinds of movies, they're so dissimilar, that means she will very likely not like movie D: a sizable positive rating is multiplied by a big negative similarity. The overall effect is to make this r_hat_3D a smaller number. And then we just normalize by the sum of the absolute values of the d_ij's, 0.97 plus 0.73. The result, therefore, happens to be 2.35.

In other words, the net effect of the neighborhood's help is to reduce the prediction from 2.78 to 2.35. The actual answer is two. And indeed, the baseline predictor overshot two by too much, getting very close to three stars. Whereas with the help of the neighborhood, knowing that A and B are very dissimilar from movie D, and that user 3 kind of likes A and doesn't hate B too much, which actually tells us that she is not going to like D a whole lot, we push 2.78 down to 2.35, which is indeed much closer to the actual ground truth of two. If you compute the squared error between two and 2.78, you can see it's much bigger than that between two and 2.35.
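The whole r_hat_3D update above fits in a few lines of code, using the lecture's own numbers: baseline 2.78, similarities -0.97 and -0.73, and user 3's shifted ratings 0.9 and -0.19 on the two neighbors.

```python
# Neighborhood prediction for user 3 on movie D: baseline plus a
# similarity-weighted average of the shifted ratings on the neighbors,
# normalized by the sum of |d|.
baseline = 2.78
d = [-0.97, -0.73]          # d_DA, d_DB
r_shift = [0.90, -0.19]     # user 3's shifted ratings on movies A and B

num = sum(dj * rj for dj, rj in zip(d, r_shift))
den = sum(abs(dj) for dj in d)
r_hat_3D = baseline + num / den
print(round(r_hat_3D, 2))   # 2.35, much closer to the true rating of 2
```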

Â 17:11

So, now we can compute this for all of the ten entries that we want to predict. And it turns out that we get a much improved performance over the baseline predictor r_hat. The baseline gives errors of 0.51 and 0.58, respectively, on the 30 known training data and the ten test data. This gets improved by the neighborhood predictor r_hat^N, where the overall root mean squared error for the training data is as small as 0.32, and for the ten test data as small as 0.54. Again, it's the ten test predictions that really matter, as in the Netflix Prize. You may think these are very small numbers, but again, on a scale up to five stars, half a star is a big deal.
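As a reminder of how these error numbers are produced, here is the RMSE formula in code. The actual test ratings (five, three, two, two, five) and the first three predictions are from the lecture; the last two predicted values are made-up placeholders.

```python
import numpy as np

# Root mean squared error over test entries: sqrt(mean((r_hat - r)^2)).
# Actual ratings and the first three predictions are from the lecture;
# the last two predictions (2.40, 4.55) are assumed placeholders.
actual    = np.array([5, 3, 2, 2, 5], dtype=float)
predicted = np.array([4.62, 3.49, 2.78, 2.40, 4.55])

rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(round(rmse, 2))   # 0.52 for these illustrative numbers
```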

And even improving the next decimal digit is often going to make a big difference in what we recommend, and that's what the ten percent goal of the Netflix challenge was trying to achieve. This represents a seven percent enhancement of the neighborhood method over the baseline method. And it turns out that if you use the neighborhood method plus temporal dynamics, plus the regularization idea for least squares, plus a couple of other tricks, you can easily get to almost nine percent enhancement over Cinematch, meaning you could actually win the entire first year's Progress Prize. But getting the remaining little-over-one percent, to reach the ten percent target of enhancement over Cinematch, took two more years of effort by many smart people around the world: thousands of teams, tens of thousands of different submissions, and two years' time to gain that last one and a half percent. That's why a lot of people believe that, had Netflix set the target at eleven or twelve percent, it might not even have been achievable using the known data; we will never know the exact answer. But what you have seen so far, with a few tricks, would have gotten you quite far in that prize if you were participating back in 2006.

So, that's the end of this longest lecture in the entire course. We looked at the Netflix Prize, a very interesting scientific story, as a special case of recommendation systems. In particular, collaborative filtering can leverage the similarities among users, or, as in our example, among movies, to enhance the accuracy of prediction, measured, say, by root mean squared error. And if you do minimize root mean squared error, which you don't have to, as we saw, this often leads to a least squares problem, which is a special case of the convex optimization that we'll see over and over again in this course. Today's was just a very brief introduction to the terminology of convexity. So, this lecture is long in part because we were not only talking about collaborative filtering as a learning idea, but also starting to introduce convex, nonlinear optimization. It's also long in part because the Netflix Prize is a very interesting story, and we showed the timeline and some details of that competition. Now, this special case of recommendation systems will lead us to the next lecture, Lecture 5, on the recommendation system of Amazon, in a different context and with a different question over there. So, see you.
