0:00

This is a long lecture, the longest in the entire course and one of the most mathematical. But finally, we have arrived at the point of introducing the neighborhood method.

Up to this point, everything we talked about concerned an individual row or an individual column in the movie-user rating table, or matrix. But now, we are ready to go into the global structure. And the way we go into the global structure is by defining a neighborhood L_i for each, say, movie i. We could also look at this from a user point of view, but we'll instead just focus on the movie-movie correlation.

So you can think of two columns representing two movies, and we want to look at how different they are; in other words, we define a pairwise statistical correlation between the two movies. Then we look at another pair of movies, still movie i, but instead of comparing with movie j, we'll compare with movie k and look at the ratings in those two columns. And then we're going to pick out all those movies that are somehow very similar or very dissimilar to movie i. Both kinds of movies will be very useful.

So this is what we're going to do. We'll first define a similarity metric between two movies, then we'll search through all the movies out there and pick out a subset that is either very similar or very dissimilar to movie i, and call that the neighborhood L_i. We'll then leverage that global information in making predictions about users liking or disliking this movie i.

So, this idea can be illustrated quite nicely through a simple picture. Suppose we have just two movies and two users, so we can easily compare two points on a 2D grid. The two points are the movies A and B, and we've got two users, one and two. The points are the movies and the axes are the users. So movie A gets a rating of five from user one and one from user two, that is, a five star and then a one star from these two users, whereas movie B gets a two star and a five star from these two users, respectively.

As you can tell, these two movies are very different: they are liked or disliked with opposite polarity by these two users. And indeed, this means that the angle theta is quite big. What is this angle? It is the angle between two straight lines: one from the origin to movie A's coordinates, and the other from the origin to movie B's coordinates. If these two points are far apart, one quantification of that is the angle theta. We measure the cosine of that angle and use that number to describe how similar or dissimilar movies A and B are. This is called the cosine coefficient. It is the particular instance of a similarity metric that we'll be using here.

So how do we write down the cosine coefficient? The cosine coefficient between two movies i and j is called d_ij. If you remember from basic analytic geometry, it is computed from each movie's ratings viewed as a vector from the origin to that point: call these vectors r_i and r_j. The cosine of the angle between them is the standard formula, the inner product normalized by the sizes:

d_ij = r_i^T r_j / (||r_i||_2 ||r_j||_2),

where ||.||_2 denotes the L2 norm. In the longer form, the numerator is the sum of r_ui times r_uj over all users u who have rated both movies i and j; that is the inner product of the two vectors. The denominator is the square root of the sum of r_ui-squared over that same set of u's, times the square root of the sum of r_uj-squared over the same set of u's; those are just the L2 norms of the two vectors. So this d_ij is the cosine coefficient, defined geometrically as the cosine of the angle theta and algebraically as above.
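As a sanity check, the formula maps almost directly into code. Here is a minimal Python sketch (the function name and the dict-based rating layout are my own illustration, not from the lecture), applied to the two-movie, two-user picture from earlier:

```python
import math

def cosine_coefficient(ratings_i, ratings_j):
    """Cosine coefficient d_ij between two movies.

    ratings_i, ratings_j map user id -> rating; only users who rated
    BOTH movies contribute to the sums.
    """
    common = set(ratings_i) & set(ratings_j)
    if not common:
        return 0.0
    dot = sum(ratings_i[u] * ratings_j[u] for u in common)
    norm_i = math.sqrt(sum(ratings_i[u] ** 2 for u in common))
    norm_j = math.sqrt(sum(ratings_j[u] ** 2 for u in common))
    return dot / (norm_i * norm_j)

# Movie A: five stars from user 1, one star from user 2; movie B: two and five.
movie_A = {1: 5, 2: 1}
movie_B = {1: 2, 2: 5}
d_AB = cosine_coefficient(movie_A, movie_B)  # about 0.546, a fairly wide angle
```

Note that the sums run only over users who rated both movies, exactly as in the lecture's definition.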

Now, because we have already shifted the rating data from r to r minus the baseline predictor r-hat for each u,i pair (we called this r-tilde_ui at the end of the last segment of the video), strictly speaking we should put tildes here: we are looking at the r-tilde values that have already subtracted out the influence of the baseline predictor. So now, what we need to do is look at all the pairs of movies i,j, except movie i paired with itself, and compute this quantity.

So, at this point, we have, for each movie i, computed d_ij for all the j's out there. We've got a list of numbers. Let's arrange them in descending order. These are the d_ij values for, say, movies five, eight, 10, two, and so on, with respect to movie i = 1, the first movie.

If you look at these, some are really big numbers and some are really small, really negative numbers. Again, remember that d_ij can be either positive or negative. We will say that those that are very positive and those that are very negative are both useful, because they represent very similar or very dissimilar movies. So let's look at the absolute values of the d_ij's, and include those that are big. I'll pick the top, say, five movies, where "top" means really big on either the positive end or the negative end. And that will define what we call the neighborhood: a list of other movies with respect to this given movie i. So for example, L_1 could be movies five and eight, plus, say, a movie 21 down at the negative end.

Now, who determines how big the set is? How long is the list? Who determines the cutoff threshold on the d_ij values? Well, the one determining those is you, the recommender. There are quite a few rules and guidelines for determining how big this list should be. If you make it too big, then you are indeed incorporating comparative information with respect to many other movies, but some of them are so weakly coupled with this movie that they don't quite matter; you're just adding noise. On the other hand, if you include too few neighborhood movies, then you're not fully leveraging all the information; you're losing precious information about other movies, so that's not good either. We won't have time to go into the exact art of picking the right size, and that's indeed part of the reason it took so much computation for the teams competing in the Netflix Prize.

But roughly speaking, the idea is that if you like a movie, say "Schindler's List," and, say, "Life is Beautiful," then maybe you'll also like "Doctor Zhivago." If you dislike "E.T." or dislike "The Lion King," then maybe you'll also dislike, say, "Toy Story." So that's the idea.

The idea is that we're going to look at all these movies as points in a space. Those that are very close by this cosine metric, we'll call neighbors. And therefore, for each movie i, we will define a neighborhood by the absolute values of these cosine coefficients.
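A minimal sketch of this selection step, assuming the d_ij values live in a dict keyed by (i, j) pairs (both the layout and the numbers are made up for illustration):

```python
def neighborhood(i, d, k=5):
    """Neighborhood L_i: the k movies with the largest |d_ij|.

    Ranking by absolute value keeps both the very similar (large
    positive d_ij) and the very dissimilar (large negative d_ij) movies.
    """
    others = [j for (a, j) in d if a == i and j != i]
    return sorted(others, key=lambda j: abs(d[(i, j)]), reverse=True)[:k]

# Hypothetical d_1j values for movie 1:
d = {(1, 5): 0.9, (1, 8): 0.7, (1, 21): -0.8, (1, 2): 0.1, (1, 3): -0.05}
L_1 = neighborhood(1, d, k=3)  # movie 21 gets in through its big negative d_ij
```

The cutoff k here is exactly the design knob the lecture discusses: too large adds noise from weakly coupled movies, too small throws away information.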

So now that we have defined the metric, cosine similarity, and from there defined a neighborhood for each movie, we can finally define the predictor based on that information, called the neighborhood predictor.

The neighborhood predictor will be denoted r-hat_ui, but since we have already used that symbol for the baseline predictor, let's add a superscript N. This N does not mean the number of users or movies; it's shorthand for neighborhood. So we want to find out what r-hat_ui^N is. It is basically the original baseline predictor plus what we learn from the neighbors. In the shifted ratings r-tilde (again, r-tilde is just r minus r-hat), each neighbor movie j, depending on whether it is positively or negatively correlated with movie i, is going to give us some information. So we want to add up that information, summing across all the movies j in the set L_i.

So we have to weight each such rating by a neighbor movie j in some way. What should the weight be? One choice is to just let the weight be the cosine coefficient itself, d_ij. Now, this may not be the best choice; we just don't have time to go into optimizing this particular choice of weight. Let's say it's just d_ij. So if d_ij is positive, we're adding to the prediction; if it's negative, we're subtracting from it. Then we have to normalize somehow, because we might be adding a lot of these movies in the neighborhood. So we divide by the total magnitude of the weights, the sum of |d_ij| over all j in the neighborhood L_i of movie i. Altogether:

r-hat_ui^N = r-hat_ui + [sum over j in L_i of d_ij r-tilde_uj] / [sum over j in L_i of |d_ij|].

Now, why do we put the absolute value sign in the denominator but not in the numerator? Well, because in the normalization, if you don't take absolute values, the positive and negative d_ij's cancel each other, which defeats the intention of counting their total size. But in the numerator, if we took absolute values, then we would lose track of whether j is very similar or very dissimilar to i. That's why, when we calculate the influence of the neighborhood movies j, we use the actual d_ij, which can be positive or negative, but when we normalize by the size, we use the absolute values of the d_ij's.
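Putting the weighting and the normalization together, a hypothetical sketch of the neighborhood predictor (every name and number here is illustrative, not from the lecture):

```python
def neighborhood_predict(baseline_ui, r_tilde_u, d_i, L_i):
    """r-hat^N_ui = baseline plus weighted influence of the neighborhood.

    baseline_ui: baseline predictor r-hat_ui for this (u, i) pair
    r_tilde_u:   movie j -> shifted rating r-tilde_uj for user u
    d_i:         movie j -> cosine coefficient d_ij
    L_i:         neighborhood of movie i
    """
    num = sum(d_i[j] * r_tilde_u[j] for j in L_i)   # signed weights in the numerator
    den = sum(abs(d_i[j]) for j in L_i)             # absolute values, so no cancellation
    return baseline_ui + (num / den if den > 0 else 0.0)

# Two similar neighbors and one dissimilar one:
pred = neighborhood_predict(
    baseline_ui=3.5,
    r_tilde_u={5: 1.0, 8: 0.5, 21: -1.0},
    d_i={5: 0.9, 8: 0.7, 21: -0.8},
    L_i=[5, 8, 21],
)
```

Note how the dissimilar movie 21, rated below baseline by this user, still pushes the prediction up: a negative weight times a negative shifted rating is positive evidence.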

Now, you can make this even simpler by setting all the weights to one: simply add up the r-tilde's and divide by the size of the neighborhood. That's it. That turns out not to perform very well, so we use the slightly more involved weights here.

All right, so this is what we have been looking for in the last, what, 80 minutes. This is the neighborhood predictor that will enable you to get quite far ahead in the Netflix Prize competition. It is composed of two parts. One part is the baseline predictor. Again, how do we get the baseline predictor? Very simply, by solving the least squares problem involving the b_u's and b_i's, because the baseline predictor r-hat is just the lazy average predictor r-bar plus b_u plus b_i for each u,i pair. So solve least squares against the ground truth, train your parameters b_u*, b_i*, stick them into the predictor, and you get r-hat.
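That least-squares step can be sketched with NumPy on toy data (a dense, unregularized formulation purely for illustration; Netflix-scale data would need sparse, regularized solvers):

```python
import numpy as np

# Toy ratings: (user, movie, rating) triples.
ratings = [(0, 0, 5.0), (0, 1, 2.0), (1, 0, 1.0), (1, 1, 5.0)]
n_users, n_movies = 2, 2

r_bar = np.mean([r for _, _, r in ratings])      # the "lazy average" predictor

# Least-squares system: r_ui - r_bar ~ b_u + b_i, with the unknowns
# stacked as [b_u for each user, then b_i for each movie].
A = np.zeros((len(ratings), n_users + n_movies))
y = np.zeros(len(ratings))
for row, (u, i, r) in enumerate(ratings):
    A[row, u] = 1.0              # selects b_u
    A[row, n_users + i] = 1.0    # selects b_i
    y[row] = r - r_bar

b, *_ = np.linalg.lstsq(A, y, rcond=None)
b_u, b_i = b[:n_users], b[n_users:]

def r_hat(u, i):
    """Baseline predictor r-hat_ui = r-bar + b_u + b_i."""
    return r_bar + b_u[u] + b_i[i]
```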

The second term in the neighborhood predictor is the actual neighborhood information. Again, we use the cosine coefficient as the metric. Then, for each movie i, we define a neighborhood of a certain size. And then, based on that, all the movies j in that neighborhood set are weighted with a certain weight, for example the d_ij's themselves, and then normalized. This is the total neighborhood influence on the prediction for the u,i pair.

So now, let's summarize what we have at this point. The neighborhood predictor for collaborative filtering in the Netflix problem consists of five steps.

First, train your baseline predictor, r-hat equals r-bar plus b_u plus b_i. I don't know what b_u and b_i should be, so I compare this prediction against the actual rating r_ui, look at the difference squared, and solve the least squares problem; that gives me b_u and b_i, and therefore the baseline predictor.

Second, once I have the baseline predictor r-hat, we can subtract r-hat from r and get the shifted r-tilde, which could be positive or negative depending on your r-hat.

Third, from the r-tilde's, we can compute the similarity metric. If we collect all the r-tilde's into a matrix, big R-tilde, then from this big R-tilde we get another matrix D, where each i,j entry of big D is simply the d_ij computed as a function of the r-tilde's.

Fourth, once we have computed the d_ij's, we can define a neighborhood L_i for each movie i, consisting of a bunch of movies very similar or very dissimilar to movie i.

And fifth, once that's done, we can plug everything into the r-hat^N_ui formula we just showed on the previous page, because we now have all the ingredients, including the baseline, the b's, and the r-tilde's. That gives us the final predictor, whose u,i entries we can collect into a table, or matrix, R-hat^N, N for the neighborhood method.
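Steps two and three, collecting the shifted ratings into big R-tilde and turning them into the matrix D, might look like this (toy numbers; NaN marks a missing rating, and the data layout is my own illustration):

```python
import numpy as np

# Shifted ratings R-tilde: rows are users, columns are movies; NaN = unrated.
R_tilde = np.array([[ 1.5, -0.5, np.nan],
                    [-1.0, np.nan, 0.5],
                    [ 0.5, -1.0,  1.0]])

n_movies = R_tilde.shape[1]
D = np.zeros((n_movies, n_movies))
for i in range(n_movies):
    for j in range(n_movies):
        if i == j:
            continue                              # skip movie i paired with itself
        both = ~np.isnan(R_tilde[:, i]) & ~np.isnan(R_tilde[:, j])
        ri, rj = R_tilde[both, i], R_tilde[both, j]
        denom = np.linalg.norm(ri) * np.linalg.norm(rj)
        D[i, j] = ri @ rj / denom if denom > 0 else 0.0
```

Because the sums run over users common to both movies, D comes out symmetric, and each row of D is what gets thresholded to form the neighborhoods L_i.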
