0:00

So, we're going to go into the neighborhood method now. But before we can do that, we first need to go through what's called a baseline predictor. And before we can go through a baseline predictor, we first want to talk about the general notion of a parameterized model. Now, we're going to see parameterized models in a very different context in the next chapter, with Bayesian analysis. In general, we want to build a model that is some mathematical representation of the problem we want to solve.

Sometimes this model carries with it a bunch of parameters; we'll see a lot of parameters today. These parameterized models first go into the so-called training phase, where you have some observations but you also have the ground truth. You take the observations, make predictions, and compare with the ground truth. The difference is then something we want to minimize by tuning the model parameters. After this step of training, you have optimized model parameters. Then you go into the actual real action, which is the prediction step, where you take these optimized parameters in the model, optimized by the training phase with the known ground truth, and you take new observations to make predictions. Sometimes other people have the ground truth, like Netflix had the ground truth for the Netflix Prize, but you don't. So: training followed by prediction.

The training phase trains your parameters, just like a voice recognition system on your computer, phone, or car. You first train it by speaking certain known sentences. These sentences are the ground truth; the computer takes in the observation, which is the recording of your voice, and trains the parameters before going into the actual action of making predictions in the future for you.

All right. Having said that, let's go into the baseline predictor. This is step number one before we can afford to go to step number two, the neighborhood predictor. The baseline predictor does not utilize the global structure of the user-movie ratings. We're going to start with the baseline of the baseline, the laziest predictor you can think of: the average predictor. Let's call it r bar. r bar is simply the summation of all the rui's, over all the (u, i) pairs where there is a rating, divided by the total number of ratings. That's it, simply the average. It turns out that on the Netflix Prize training data, 100 million data points, r bar is 3.6. That is, over all these 100 million ratings, the average is 3.6.
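As a quick sketch of the average predictor (the ratings here are made-up toy data, not the Netflix set), r bar is just one line:

```python
# Toy ratings: (user, movie) -> stars. Hypothetical data for illustration only.
ratings = {(1, "A"): 5, (1, "B"): 3, (2, "A"): 4, (2, "B"): 2}

# r_bar is the average over all (u, i) pairs that actually have a rating.
r_bar = sum(ratings.values()) / len(ratings)
print(r_bar)  # 3.5 for this toy data
```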

And we can say, well, that is our prediction: we predict that all user-movie interactions will be 3.6 stars in the future. Clearly, this is very lazy; we can do a lot better than this. For example, we know that certain movies are better and certain reviewers are harsher.

So let's say the predictor r hat is r bar plus some bias terms: a bias term bu with regard to the user, and a bias term bi with regard to the movie. So r hat ui is the average predictor, the laziest one, plus a bias term for that user and a bias term for that movie. Now we can look at the root mean squared error. We can drop the square root and just look at the mean squared error; that doesn't change our answer for the optimal parameters. So we look at the summation of the differences squared, the difference between rui and r hat ui, squared. We can also delete the step of dividing by the total number of ratings, because that's just a constant; it doesn't change our answer. So this is our objective function for the minimization: minimize, over the set of bu's and bi's, the summation of the squares of the differences between the ground truth rui, which you have in the training phase, and the prediction r hat ui, where r hat ui assumes this formula. Okay?

So this is a function involving some squares of the bu and bi terms. r bar is not a variable, because it is just the average, say 3.6 for Netflix; the bu's and bi's are your variables. That's an optimization problem: you've got an objective function, which is a sum of squares. It's definitely not a linear program anymore, but we'll see that it is a nice nonlinear program. And there are no constraints, okay? The variables are the bu's and bi's, and the given constants are the known truths, the rui's, and this simple scalar r bar. That describes the baseline predictor training optimization.
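To make the training objective concrete, here is a minimal sketch (the data and the name `squared_error` are mine, not the lecture's) that evaluates the objective for a given choice of bias terms:

```python
# Toy data: (user, movie) -> rating; bias terms b_u per user, b_i per movie.
ratings = {(1, "A"): 5.0, (1, "B"): 3.0, (2, "A"): 4.0}
r_bar = sum(ratings.values()) / len(ratings)  # = 4.0

def squared_error(b_user, b_movie):
    """Sum over rated (u, i) of (r_ui - r_hat_ui)^2, with r_hat_ui = r_bar + b_u + b_i."""
    total = 0.0
    for (u, i), r in ratings.items():
        r_hat = r_bar + b_user[u] + b_movie[i]
        total += (r - r_hat) ** 2
    return total

# With all biases zero, only the deviation from the plain average remains:
print(squared_error({1: 0.0, 2: 0.0}, {"A": 0.0, "B": 0.0}))  # 2.0
```

Training means searching over the bias dictionaries to make this number as small as possible.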

So let's take a look at a particular example of this. We can get rid of the square root, since we're effectively minimizing the mean squared error, or rather the total squared error, since we're not dividing by the number of ratings either. Let's say we have just two movies, A and B, and just one user, user 1. This is a ridiculously small example. In fact, it's so small that we have more parameters than data points: three parameters, bA and bB for the bias terms of the two movies, and b1, the bias term of this single user. We have three parameters and two known ratings from the ground truth, r1A and r1B. And what is our minimization problem? It is to look at the squares of the differences between the predictions and the ground truth. Since we take the square, it doesn't matter which way we write the difference; say we write r hat minus r. The prediction r hat 1A is b1 plus bA plus r bar, okay? This is our prediction.

Â 6:56

So the first error is r hat 1A minus the actual r1A, squared. Plus the other error, which is b1 plus bB plus the average r bar, minus the actual r1B rating, squared. So this is r hat 1B minus r1B, and the first one is r hat 1A minus the actual r1A, okay? We take the sum of the squares of the differences, and that is the objective function. We want to minimize this objective function, subject to no constraints, over the three optimization variables, namely our three model parameters bA, bB, b1. So, our question is: is this an easy optimization or not? It's clearly not linear, but it is an unconstrained quadratic minimization over the variables. It turns out it is a very easy problem, the so-called least squares problem, okay? We are minimizing a so-called convex quadratic function (we'll come back to this in a minute) over no constraints, and this is a very nice nonlinear optimization.

In general, we say a least squares problem is the minimization of a certain vector's l2 norm. First of all, what is the l2 norm? Given a vector x, one standard way to measure its size, the l2 norm, is the square root of the sum of the individual entries squared, summing over all the entries. We often don't like the square root, so we take the square of the l2 norm. The square of the l2 norm is simply the sum of the xi squared over i.
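In code, both quantities are one-liners (toy vector chosen by hand):

```python
import math

x = [3.0, 4.0]

# l2 norm: square root of the sum of squared entries.
l2 = math.sqrt(sum(xi ** 2 for xi in x))
# Squared l2 norm: just the sum of squares, no square root.
l2_squared = sum(xi ** 2 for xi in x)

print(l2, l2_squared)  # 5.0 25.0
```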

It's a very standard way to measure the length of a vector, and we would like to minimize the square of the l2 norm, the length of the vector Ab minus c, where b is our variable, A is a known matrix, and c is a known vector. Say A is N by M; then b is M by 1, a column vector, and c is N by 1, another column vector. In the particular case we just saw, we can write the problem of minimizing our expression in this standard form. It looks a little weird, but you'll see it's all right. The matrix A has rows (1, 1, 0) and (1, 0, 1). The variable vector is (b1, bA, bB). And the constant vector is (r1A minus r bar, r1B minus r bar). This is our A, this is our b vector, this is our c vector, just to conform to the standard expression that we want to minimize in the least squares problem.

You can easily check that minimizing this l2 norm squared is the same as minimizing our earlier expression, okay? For example, the first entry is just b1 + bA - r1A + r bar, and that's exactly the first term, all right? And then you can write out the other term. Just by the definition of the square of the l2 norm, you see we are recovering exactly the same objective function. So, in the baseline predictor problem, we are solving a least squares problem. The question, then, is how we solve a least squares problem in general. Okay?

Well, in general, we want to minimize Ab minus c, l2 norm squared, which can also be written as (Ab minus c) transpose times (Ab minus c); the vector multiplied, in inner product, with itself. You can write this out in scalar format if you're not comfortable with matrix notation: a little two by two matrix A, say with entries one, two, three, four, a little vector (b1, b2), minus a little vector (c1, c2). Then you can see that the derivation coming up is no mystery. But to save time, let's use the full matrix notation; it's almost similar to our scalar operations, okay? This will be (Ab) transpose times Ab, minus (Ab) transpose c, minus another (Ab) transpose c, plus c transpose c. And this can in turn be written, by the transpose rule, simply as b transpose A transpose A b, minus two times b transpose A transpose c, plus c transpose c.

Â 15:39

And you're minimizing that function over the bu's and bi's. It turns out that picking the right bu's and bi's, and thereby completing this baseline model, boils down to solving this kind of problem. This kind of problem, in this format, is called least squares, and the least squares solution is simply the solution to the linear equation A transpose A b equals A transpose c. We'll go through a complete numerical example towards the end of the lecture, but this is a quick summary of where we have been, okay?
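As a sketch of that summary (toy numbers; `np.linalg.lstsq` is used rather than a plain solve, since in our under-determined example A transpose A is singular):

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
c = np.array([1.4, -1.6])

# Least-squares solution: any b with A^T A b = A^T c minimizes ||Ab - c||^2.
b_star, residual, rank, _ = np.linalg.lstsq(A, c, rcond=None)

# The normal equations hold at the solution:
print(np.allclose(A.T @ A @ b_star, A.T @ c))  # True
```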

Now, you may wonder: this is kind of complicated. Why don't I just pick bu to be the average rating across all movies for this user, and bi the average rating across all users for this movie? That turns out to be a reasonable guess, but it does not necessarily minimize our error metric, the root mean squared error. All we're doing is saying: if you believe in this model, stick it into the error term. Then you get a least squares problem, which can be efficiently solved by solving a linear equation, just like Google PageRank solves a linear equation, though over there it's a very different set of equations. And then, once you get the optimal bu star, bi star, stick them back in, and you get your predictor that minimizes RMSE.

Â 17:06

It turns out that's almost a good idea, except you need something else. We often encounter what's called overfitting. This is in the textbook in the advanced material, but basically it says that you can add so many parameters, all these bu's and bi's, and fine-tune them so well based on the ground truth in the training data, that the model loses flexibility and predictive power for the actual prediction. So we want to somehow limit the impact of the parameters. We don't want to rely too much on hindsight perfection; we want to leave some room for foresight, because it's ultimately foresight that matters in predicting or recommending. One way to contain the weight of the parameters is to also minimize their size, say the l2 norm of the bu's formed as a vector and the bi's formed as a vector. That's called Regularized Least Squares. If we use the quadratic penalty, the l2 norm squared, it is still a least squares problem, just a bigger least squares problem.
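A minimal sketch of the regularized version: add lambda times ||b||^2 to the objective, which makes the optimality condition (A^T A + lambda I) b = A^T c. The value of lambda and the data here are illustrative choices, not values from the lecture:

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
c = np.array([1.4, -1.6])
lam = 0.1  # regularization weight, a tuning knob

# Minimize ||Ab - c||^2 + lam * ||b||^2:
# the optimality condition is (A^T A + lam * I) b = A^T c.
# Adding lam * I makes the system uniquely solvable even when A^T A is singular.
n = A.shape[1]
b_reg = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ c)
print(b_reg)
```

Note that without the lam term this tiny system would have infinitely many minimizers; regularization both shrinks the parameters and picks one solution.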

But it does try to contain too much reliance on parameters refined purely on hindsight. So, after the baseline predictor, what we have is the following. We have the actual rui, and we have the baseline predictor r hat ui, which is r bar, the lazy average, plus the optimized bu and the optimized bi from solving the least squares. And we call the difference, rui minus r hat ui, a shifted r, r tilde ui, just a shorthand notation. In other words, in r tilde ui we have removed the impact of the average, the per-user bias, and the per-movie bias. We're going to take this as the starting point into our neighborhood predictor, to look at similarities across movies and across users. But before we do that, a very quick detour for five minutes on the notion of convex optimization.

Now, we've seen linear programming, which is a special case of convex optimization, where we minimize a linear objective function over linear constraints: linear equations and inequalities. Pictorially, that means you use linear inequalities to cut the space of the variables into something called a polytope, with sharp corners. And then you slide a bar, a straight line, since it's a linear function, representing your objective function across this constraint set. So imagine an airplane flying above this constraint set, okay? And we want to identify the minimal value of this objective function over this kind of constraint set.

It turns out what's really important is that your objective function be convex, like a convex quadratic function such as x squared, over a constraint set that is also convex, like an oval or a circle. It's convexity that determines the watershed between easy and hard optimization problems. So, what is convex optimization? It refers to minimizing a convex objective function (so we've got to define what a convex function is) subject to a convex constraint set. And, to be computationally useful, this should be a convex constraint set that can be easily represented by upper-bound inequalities on convex functions. So we have to define what a convex set is and what a convex function is. We will encounter other forms of convex optimization in network economics, in Internet pricing, in congestion control, in routing, and many other topics. This is our first brief, quick introduction to the subject, and we'll see in this and future lectures that convex optimization is easy in both theory and practice. We'll just take three more minutes to introduce the basic notions before we bring them back, just in time for future applications in later lectures. So, what's a convex set? What's a convex function? One could go through a whole course on nothing but convex analysis, and there are such courses in math departments. But for today's purpose, we just need to specify them with some pictures.

A convex set is basically a set where you can pick any two elements in the set, draw the line segment in between, and the entire line segment also lies in the set. Then we call it a convex set. You might say, well, this sounds like all sets are convex. Well, no. In some sets, for example this one, you can pick two points in the set such that part of their line segment does not lie inside the set; this kidney-shaped thing is a non-convex set. So, generally speaking, we say a set C is convex if, picking any two points a and b in C, the entire line segment between a and b also lies in C. The segment can be represented as theta times a plus (one minus theta) times b, for theta between zero and one; for example, when theta is one half, it is the midpoint between a and b. And this must hold for all such theta; that's the key phrase: not for some theta, but for all theta. This is the mathematical representation of the line segment between points a and b.
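As a tiny numerical check of the definition (using the unit disk as our convex set C; the points and thetas are chosen arbitrarily, and this only samples thetas rather than proving the "for all" claim):

```python
import math

def in_unit_disk(p):
    """Membership test for C = {x : ||x|| <= 1}, which is a convex set."""
    return math.hypot(p[0], p[1]) <= 1.0

a, b = (1.0, 0.0), (0.0, 1.0)  # two points in C

# theta*a + (1 - theta)*b must lie in C for ALL theta in [0, 1].
all_inside = all(
    in_unit_disk((th * a[0] + (1 - th) * b[0],
                  th * a[1] + (1 - th) * b[1]))
    for th in [i / 10 for i in range(11)]
)
print(all_inside)  # True
```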

And, by the way, a and b are vectors in general, in n-dimensional space. Pictorially, it just says that a convex set is a set where the whole line segment lies inside. It turns out the most important geometric property of convex sets is that any two convex sets that don't touch each other can be separated by a straight line, so that one set lies on one side and the other lies on the other side. Would this be true for non-convex sets too? Not really. For example, take one set that is convex and another that is non-convex: they don't touch each other, and yet you cannot separate them with a straight line. Of course, you can separate them one way or another, because they don't touch, but you can't separate them with a straight line. It turns out this is the most useful property of convex sets, but we don't have time to go through its implications in this lecture.

Now, having specified what a convex set is, we define what a convex function is. A convex function, intuitively, is a function that curves upward, okay?

Â 24:34

Now, we know that if you look at a function f, its first derivative tells us whether it is monotonically increasing or decreasing, and the second derivative tells us whether it curves upward or downward. Let's say the function is a single-variable function, f of a scalar variable x, and let's make it quadratic, since we're talking about quadratics today: x squared. You can add a few other terms, say plus 2x plus three, okay? This, if you plot it, curves upward. Another function, minus x squared plus 2x plus three, if you plot it, curves downward. Both are quadratic functions; one is convex, the other is not.
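We can check the curvature of these two functions numerically with a finite-difference second derivative (a rough sketch; the helper name and step size h are my choices):

```python
def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

f_convex = lambda x: x ** 2 + 2 * x + 3     # curves upward
f_concave = lambda x: -x ** 2 + 2 * x + 3   # curves downward

print(second_derivative(f_convex, 0.0) > 0)   # True: positive curvature
print(second_derivative(f_concave, 0.0) < 0)  # True: negative curvature
```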

In fact, if you flip the other one, minus f of x is convex. We call a function whose negative is convex a concave function. So this is a convex function, and this is a concave function. And, of course, it doesn't have to be quadratic, and it doesn't have to be univariate; it could be a multivariate function. The idea is that if it curves upward, it's convex; downward, it's concave. Some functions are first convex and then concave, like the sigmoidal function we'll encounter in social network influence models later. It starts out convex, then at some point, the so-called inflection point, it starts to bend downward. In the first part, the second derivative is positive; in the second part, the second derivative is negative. This part curves up, this part curves down; this part is convex, this part is concave. The whole function, therefore, is neither convex nor concave. Now, what kind of function is both convex and concave? You say, wow, that means you need the second derivative to be both non-negative and non-positive. The only possibility is that it is exactly zero, meaning it has no curvature; that's the only way to satisfy both. No curvature means it is a straight line. It could be pointing up or down depending on the first derivative, but the second derivative is zero. In other words, a linear, or affine, function is the only function that is both convex and concave. Now, these pictures are drawn in 2D, so they're about a single variable. For a multivariable function, we have to talk about a slightly more involved second derivative. Recall your multivariable calculus course; assume that you still remember it, or can quickly brush up your memory.

Take a function of, let's say, just two variables (once you get the hang of that, it's the same for many variables), a quadratic: x squared plus xy plus y squared, okay? x and y are each single variables, but f is a function of both. Now, the second derivative of a multivariable function is not a simple expression anymore. It's not a number; it's a matrix.

Â 28:00

You take the partial derivative of this function with respect to x twice, with respect to y twice, with respect to x and then y, and with respect to y and then x; these form the matrix. In general, given a function f of variables x1, x2, up to xn, the partial derivative with respect to xi and then with respect to xj is the ij-th entry of a matrix. That matrix is called the Hessian matrix, or in short, the Hessian of the function. In this particular case it is very simple. Taking the derivative twice with respect to x, we get two; twice with respect to y, we get two; with respect to x and then y, or y and then x (the same thing for us), we get one. So this matrix, with rows (2, 1) and (1, 2), is the Hessian of this particular function, and this is the general definition right here: the ij-th entry of the matrix is partial squared f, partial xi partial xj.
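For f(x, y) = x squared + xy + y squared, the Hessian is constant, and its eigenvalues can be checked directly (a small numpy sketch):

```python
import numpy as np

# Hessian of f(x, y) = x^2 + x*y + y^2: constant, since f is quadratic.
H = np.array([[2.0, 1.0],   # d2f/dx2 = 2,  d2f/dxdy = 1
              [1.0, 2.0]])  # d2f/dydx = 1, d2f/dy2 = 2

eigenvalues = np.linalg.eigvalsh(H)  # eigvalsh: eigenvalues of a symmetric matrix
print(eigenvalues)  # [1. 3.] -- all non-negative, so f is convex
```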

Â 29:26

So, how do we make sense out of saying that a matrix is positive?

Â It's not that the entries of the matrix are positive, but its about whether this

Â matrix eigenvalues are all positive or non-negative.

Â And, you can check that this matrix eigenvalues are all, not negative.

Â And therefore, this function is a convex function.

Â But in general, we say, a function is convex if it satisfy the secondary test

Â where the Hessian matrix that's defined is so called positive semi-definite.

Â And, a matrix being positive semi-definite means all the eigenvalues are positive or

Â non-negative. Now, you may wonder what if my function

Â does not have a second derivative? If Hessian matrix doesn't exist, some

Â function are now smooth enough that you can take derivative twice, or even once.

Â Well, it turns out that, that's fine too. I can have a function with a, like sharp

Â corner, okay? At this point, we don't have a second

Â derivative. But, we can still define a function as

Â convex or not and this will be a convex function.

Â However, since all our functions throughout entire course are smooth enough

Â that the Hessian matrix exist, we will just follow this definition, okay?
