We mentioned in the last video that a procedure

called regularization helps linear regression

and other Machine Learning Algorithms when

their generalization performance starts to deteriorate.

Now, let's talk about regularization,

another key concept of Machine Learning,

in a bit more detail.

The idea of regularization becomes more clear if we recall that in Machine Learning,

we minimize training error,

MSEtrain for regression, even though our main interest is in MSEtest.

The main idea is to modify the objective function by adding to it a function of

model parameters in the hope that

the new function will produce a model with smaller variance.

Such a function, which I call J, is shown here.

It has two terms.

The first one is the familiar MSE train loss.

Note that it's a function of both model parameters W and data X.

The second term is given by

a regularizer function omega that enters the sum with the weight lambda.

This parameter is usually referred to as a regularization parameter.

Note that the second term is only a function of model parameters,

W, but not of inputs X.
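
Since the slide itself is not reproduced in this transcript, here is a hedged reconstruction of the objective just described. The targets y_i, the model prediction \hat{y}, and the sample count N are notational assumptions; only J, W, X, omega and lambda come from the lecture.

```latex
J(W) = \mathrm{MSE}_{\mathrm{train}}(W, X) + \lambda \, \Omega(W),
\qquad
\mathrm{MSE}_{\mathrm{train}}(W, X) = \frac{1}{N} \sum_{i=1}^{N} \bigl( y_i - \hat{y}(x_i; W) \bigr)^2
```

The first term depends on both the parameters and the data, while the second term depends on the parameters only.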

Now, the impact of this new term depends both on the form of omega of W,

as well as the value of the regularization parameter, lambda.

If lambda approaches zero,

we are back to the initial case without any regularization, whatsoever.

If lambda on the other hand is very large,

then the optimizer will ignore the first term altogether

and will try to minimize just the second term.

Because the second term doesn't depend on the data at all,

the resulting solution would simply be fixed by the form of the regularizer omega

of W once and for all, irrespective of the actual data.

For any intermediate values of lambda,

the resulting values of model parameters would be somewhere

in between values that would be obtained in these two limiting cases.

This case is what we need.
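
The two limiting cases can be illustrated with a minimal sketch. It assumes a squared Euclidean norm regularizer and synthetic data; the helper name `ridge_fit`, the data, and the lambda values are all made up for this illustration and are not part of the lecture.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize mean((X @ w - y)**2) + lam * ||w||^2 in closed form."""
    n_samples, n_features = X.shape
    # Setting the gradient to zero gives (X^T X / n + lam * I) w = X^T y / n.
    A = X.T @ X / n_samples + lam * np.eye(n_features)
    b = X.T @ y / n_samples
    return np.linalg.solve(A, b)

# Synthetic data (an assumption made only for this illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)

w_ols = ridge_fit(X, y, 0.0)   # lambda = 0: plain least squares
w_mid = ridge_fit(X, y, 1.0)   # intermediate lambda
w_big = ridge_fit(X, y, 1e6)   # very large lambda: regularizer dominates

for name, w in [("lambda=0", w_ols), ("lambda=1", w_mid), ("lambda=1e6", w_big)]:
    print(name, "weights:", np.round(w, 4), "norm:", np.linalg.norm(w))
```

With lambda equal to zero the weights match ordinary least squares; with a very large lambda they collapse towards zero, which is the minimizer of this particular regularizer regardless of the data.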

I will explain in a few moments how we choose an optimal value of lambda,

but first let me show you some popular examples

of regularizers that are commonly used in Machine Learning.

The first common regularizer is just the square of the Euclidean norm of the vector

W. A regression problem with such a regularizer is called Ridge Regression.

What it does is try to find a solution where the weights are not too large.

Another popular regularizer is the L1 norm of

vector W. This is called L1 or LASSO regularization,

and it turns out that such regularization enforces sparsity of the solution.

Yet another regularization that is sometimes used when weights should be

non-negative and sum up to one is to use the so-called entropy regularization,

which is shown here.

Such regularization has roots in Bayesian statistics.
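
The slides are not part of this transcript, so the formulas below are a hedged reconstruction of the three regularizers. The Ridge and LASSO forms are standard. The entropy form is an assumption: it is written here as the negative Shannon entropy of the weights, and the slide may use a different sign convention or a relative entropy with respect to some reference weights.

```latex
\Omega_{\mathrm{Ridge}}(W) = \lVert W \rVert_2^2 = \sum_i w_i^2,
\qquad
\Omega_{\mathrm{LASSO}}(W) = \lVert W \rVert_1 = \sum_i \lvert w_i \rvert,
\qquad
\Omega_{\mathrm{entropy}}(W) = \sum_i w_i \log w_i
\quad \text{with } w_i \ge 0, \ \sum_i w_i = 1
```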

We will talk about regularization methods in

much more depth in the next course of this specialization.

But for now, I want to discuss how to choose

the optimal value of lambda if you already

decided on the proper regularization term to use.

A regularization parameter will be our first example of the so-called hyperparameters.

We will see many more examples of hyperparameters in this specialization,

but what are those hyperparameters?

In general, hyperparameters are

any quantitative features of a Machine Learning Algorithm that are

not directly optimized by minimizing training or in-sample losses such as MSE train.

The main function of hyperparameters is to control model capacity,

that is the ability of the model to adjust to progressively more complex data.

One example of a hyperparameter for regression

is the degree of the polynomial used for the regression, that is, whether it's linear,

quadratic, cubic, and so on.

The regularization parameter lambda that we have

just introduced is another commonly used hyperparameter.

Other examples of hyperparameters include the number of levels in

a decision tree or the number of layers or nodes per layer in neural networks.

Some further examples would be given by

parameters that determine how fast the model adjusts to new data.

Such parameters are called learning rates and the appropriate choice is often

very important in practice.

What is common between

all such hyperparameters is that they are usually chosen using one of the two methods.

The first method is very straightforward.

We just split our training set into

a smaller training set and a set called the validation set.

For example, you can set aside

about 20 percent of your training data for a validation set.

The idea is to use the new training set to tune parameters of

a model and then use the validation set to tune hyperparameters.

When both the parameters and hyperparameters are tuned,

the final performance of the model is evaluated with a test set,

the same way as we did before.
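
The steps of this first method can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, the model is ridge regression, and the helper names `ridge_fit` and `mse`, the 20 percent test share, and the candidate lambda values are all hypothetical choices rather than anything prescribed in the lecture.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize mean((X @ w - y)**2) + lam * ||w||^2 in closed form."""
    n_samples, n_features = X.shape
    A = X.T @ X / n_samples + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y / n_samples)

def mse(X, y, w):
    """Mean squared error of the linear model w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))

# Synthetic data (an assumption made only for this illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=200)

# Hold out a test set first, then set aside 20% of the remaining
# training data as a validation set.
idx = rng.permutation(200)
test_idx, rest_idx = idx[160:], idx[:160]
n_val = int(0.2 * len(rest_idx))
val_idx, train_idx = rest_idx[:n_val], rest_idx[n_val:]

# Parameters are fit on the new training set for each candidate lambda;
# the hyperparameter is chosen by the error on the validation set.
candidates = [0.0, 0.01, 0.1, 1.0, 10.0]
val_errors = {
    lam: mse(X[val_idx], y[val_idx], ridge_fit(X[train_idx], y[train_idx], lam))
    for lam in candidates
}
best_lam = min(val_errors, key=val_errors.get)

# Final performance is reported on the untouched test set.
w_final = ridge_fit(X[train_idx], y[train_idx], best_lam)
test_error = mse(X[test_idx], y[test_idx], w_final)
print("best lambda:", best_lam, "test MSE:", test_error)
```

The test set plays no role in choosing lambda; it is used only once, at the very end.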

Such an approach is straightforward and theoretically solid,

but it might not be ideal when your data set is small.

In such cases, setting aside some data for a validation set might be

undesirable as it can lead to worse accuracy in sample.

To cope with such situations,

a method called cross-validation is used instead.

Unlike the first method,

the cross-validation method does not discard any information during training.

This is how cross-validation works.

Assume that we have N samples available for training but N is small,

so setting aside some part of data is problematic.

So, what we do is the following.

First, we define some set of possible values of a hyperparameter.

For example, the regularization parameter lambda that we want to optimize.

Usually, this is defined as a small range of possible values,

so that we just want to select the best value from a set of candidate values.

Next, we partition our whole training data set into k blocks,

X1 to Xk of about equal size.

And then we initiate the Round-robin procedure of the following sort

which we repeat for all values of

the regularization parameter in our set of candidate values.

First, we take the first block out and train our model on the rest of the training data.

When it's done, we evaluate the model error for the block X1 using

the current value of the hyperparameter and record this value.

So far, it looks exactly the same as the method based on the validation set,

but the difference comes in our next step.

After recording the out-of-sample error obtained when we took the first block,

X1, out, we put it back into the training set and take our next block, X2, out.

Then we repeat all steps above,

that is, we train the model using blocks X1, X3,

and so on, and then run the trained model on

block X2 and record the estimated out-of-sample error.

We then continue this Round-robin procedure and compute the average

out-of-sample error obtained when taking out all of the blocks sequentially.

We do this procedure separately for

all possible values of the hyperparameter that we wanted to

try and the best value of the hyperparameter will

be the one that provides the smallest average out-of-sample error.
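
The round-robin procedure just described can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: it assumes ridge regression as the model and synthetic data, and the names `ridge_fit` and `k_fold_cv_error`, the choice of five blocks, and the candidate values are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize mean((X @ w - y)**2) + lam * ||w||^2 in closed form."""
    n_samples, n_features = X.shape
    A = X.T @ X / n_samples + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y / n_samples)

def k_fold_cv_error(X, y, lam, k):
    """Average held-out MSE over k blocks for one value of lambda."""
    all_idx = np.arange(len(y))
    blocks = np.array_split(all_idx, k)   # k blocks of about equal size
    errors = []
    for held_out in blocks:
        # Train on every block except the one currently taken out.
        train = np.setdiff1d(all_idx, held_out)
        w = ridge_fit(X[train], y[train], lam)
        # Record the error on the block that was taken out.
        errors.append(np.mean((X[held_out] @ w - y[held_out]) ** 2))
    return float(np.mean(errors))

# Small synthetic data set (an assumption made only for this illustration).
# The rows are generated independently, so no shuffling is done before
# partitioning; real data may need to be shuffled first.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
y = X @ rng.normal(size=4) + 0.5 * rng.normal(size=30)

# Repeat the procedure for every candidate value and keep the best one.
candidates = [0.0, 0.01, 0.1, 1.0, 10.0]
cv_errors = {lam: k_fold_cv_error(X, y, lam, k=5) for lam in candidates}
best_lam = min(cv_errors, key=cv_errors.get)
print("average held-out MSE per lambda:", cv_errors)
print("best lambda:", best_lam)
```

Every sample is used for training in all rounds but one, and is used for evaluation exactly once.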

What I described is called a K-fold cross-validation where K

stands for the number of blocks in our partitioning of the training data set.

There are a few special cases here that are worth mentioning.

First, if we take K equals 1,

then obviously in this case we will not have any holdout block whatsoever.

So, this case is not very interesting for our purposes.

Second, if we take K equals N, the number of training samples,

we get the limiting case of cross-validation

that is called the leave-one-out cross-validation.

In this case, in each step of our Round-robin procedure,

we only have one data point that was not used for estimation of the model and,

therefore, can be used for an out-of-sample test.

Such leave-one-out cross-validation is rarely used in practice because it turns

out to be far too time consuming once your data set becomes large.

A much more popular choice is to use

a 10-fold or maybe five-fold cross-validation

where the number of blocks is respectively 10 or five.
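
To make these special cases concrete, here is a small sketch of how the block sizes change with the number of blocks. The helper `fold_sizes` and the sample count of 20 are hypothetical and chosen only for illustration.

```python
import numpy as np

def fold_sizes(n_samples, k):
    """Sizes of the k blocks when n_samples indices are split about equally."""
    return [len(block) for block in np.array_split(np.arange(n_samples), k)]

n_samples = 20
print(fold_sizes(n_samples, 1))          # [20]: one block, nothing left to train on
print(fold_sizes(n_samples, 5))          # [4, 4, 4, 4, 4]: five-fold
print(fold_sizes(n_samples, 10))         # ten blocks of 2 samples each
print(fold_sizes(n_samples, n_samples))  # twenty blocks of 1 sample: leave-one-out
```

The number of model fits per candidate hyperparameter value equals the number of blocks, which is why the leave-one-out case becomes expensive as the data set grows.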

Tuning hyperparameters is usually a very important part

of building Machine Learning Algorithms, and it may also

be quite a time-consuming part, especially if you have

many hyperparameters or when you don't have a good guess for them,

so that you have to try a wide range of possible values.

We will be doing such analysis in many parts

of our specialization including topics in supervised learning,

unsupervised learning, and reinforcement learning.

To summarize, in this lesson,

we have covered many important concepts of Machine Learning,

such as the problem of overfitting, bias-variance decomposition,

regularization, and hyperparameter tuning.

In the next lesson,

we will start seeing how these concepts apply to real world

financial problems that can be addressed using

methods of supervised learning. See you soon.