0:08

In the past few weeks, you've been using cross-validation to estimate training error, and you have validated the selected model on a test data set. Validation and cross-validation are critical in the machine learning process, so it is important to spend a little more time on these concepts. As we noted in the Bias-Variance Tradeoff video, estimation of the best statistical model in the training set capitalizes on random, sample-specific patterns and associations among variables. One of the challenges in machine learning is figuring out which model is best. So how do we know which model is the best model?

0:46

We need to be able to estimate the test error, which is the error a model makes when it is tested on observations that were not used to fit it. We can then select the model that has the smallest test error. There are different ways of estimating the test error. One way is to randomly split the data into a training data set and a test, or validation, data set. The model is developed on the training data set, then applied to the test data set to predict the values of the target, or response, variable for the observations in the test data set.
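
As a rough illustration of the split, fit, and predict steps just described, here is a minimal Python sketch. The one-predictor regression, the synthetic data, the 30 percent test fraction, and the helper names (`fit_line`, `mean_squared_error`, `validation_set_error`) are all assumptions made for this example and do not come from the lecture.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_squared_error(model, xs, ys):
    """Average squared difference between observed and predicted responses."""
    intercept, slope = model
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def validation_set_error(xs, ys, test_fraction=0.3, seed=0):
    """Split once at random, fit on the training part, score on the test part."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    rng.shuffle(indices)
    n_test = int(len(xs) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    return mean_squared_error(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx])


# Synthetic data, for illustration only: y = 2 + 3x + noise
data_rng = random.Random(42)
xs = [data_rng.uniform(0, 10) for _ in range(100)]
ys = [2 + 3 * x + data_rng.gauss(0, 1) for x in xs]

print("estimated test error (MSE):", validation_set_error(xs, ys))
```

The test observations play no part in fitting the line, so the printed value is an estimate of how the model would do on new data.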

This is called the validation set approach. While easy to implement, the validation set approach has a couple of drawbacks. First, the test error estimate can be highly variable depending on which observations are included in the training and test data sets, because the model is estimated on only a single data set and then validated on only a single data set. Second, because we end up splitting the observations into two data sets, the model is developed on only a subset of the data. Statistical methods generally perform worse when there are fewer observations, leading to greater estimation error in the training data set, which in turn leads to poorer performance of the statistical model in the test data set.
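
To make the first drawback concrete, the sketch below repeats a single training/test split 20 times with different random seeds on the same small data set and reports how much the test error estimate moves around. The data, the 50/50 split, and the function names are assumptions chosen for illustration only.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def split_error(xs, ys, seed, test_fraction=0.5):
    """Test MSE from one random training/test split chosen by `seed`."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    rng.shuffle(indices)
    n_test = int(len(xs) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    intercept, slope = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    return sum((ys[i] - (intercept + slope * xs[i])) ** 2 for i in test_idx) / n_test


# Small synthetic data set, for illustration only
data_rng = random.Random(0)
xs = [data_rng.uniform(0, 10) for _ in range(40)]
ys = [2 + 3 * x + data_rng.gauss(0, 2) for x in xs]

# Same data, same model, 20 different random splits
errors = [split_error(xs, ys, seed) for seed in range(20)]
print("smallest estimate:", min(errors))
print("largest estimate: ", max(errors))
print("standard deviation across splits:", statistics.stdev(errors))
```

Only the split changes between runs, so the spread between the smallest and largest estimates comes entirely from which observations landed in each half.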

To address these drawbacks, we can use cross-validation. The goal of cross-validation is to define a data set to test the model during the training phase. It involves partitioning the training data set into subsets, where one subset is held out to test the performance of the model. This held-out subset is called the validation data set.

2:15

There are different cross-validation methods. The leave-one-out cross-validation method holds out one observation from the training set for validation. The statistical model is fit on the rest of the observations, and the response for the single held-out observation is predicted from its predictor values and the regression coefficients of the model estimated on the other n-1 observations. Then the process is repeated by holding out a different observation for validation and training the model on the other observations.

Because the test error from any one of these fits is based on only a single observation, it is highly variable and is therefore a poor estimate of the true test error. But if we repeat this process for every observation, each time holding out a different validation observation and using the rest of the observations to train the model, then we will end up with as many test error estimates as there are observations in the full data set. These individual test error estimates can be averaged to get an overall test error estimate.
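
The hold-one-out, fit, predict, and average loop described above might look something like the following sketch. It assumes a simple one-predictor regression and synthetic data; the names `fit_line` and `loocv_error` are hypothetical and chosen only for this example.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def loocv_error(xs, ys):
    """Average squared prediction error over n fits, each leaving out one observation."""
    squared_errors = []
    for i in range(len(xs)):
        # Train on every observation except observation i
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        # Predict the single held-out response and record its squared error
        prediction = intercept + slope * xs[i]
        squared_errors.append((ys[i] - prediction) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Synthetic data, for illustration only
data_rng = random.Random(1)
xs = [data_rng.uniform(0, 10) for _ in range(30)]
ys = [2 + 3 * x + data_rng.gauss(0, 1) for x in xs]

print("leave-one-out estimate of test error:", loocv_error(xs, ys))
```

Note that the loop involves no random split: every observation is held out exactly once, so running it again on the same data gives the same answer, at the cost of one model fit per observation.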

The advantage of using leave-one-out cross-validation is that the regression coefficients will have less bias, because the model is fit on all but one observation in the data set.

3:29

In addition, unlike with the single validation set approach, the results will not vary as a result of how the data are split into training and test data sets, because the test error is estimated once for every observation and then averaged. The disadvantage is that, because the model is fit n times, where n is the number of observations in the data set, the leave-one-out cross-validation approach can be time-consuming and computationally intensive, especially in large data sets.

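K-fold cross-validation is a kind of compromise between the validation set and leave-one-out cross-validation approaches. In k-fold cross-validation, the observations are randomly divided into k groups, or folds, of roughly equal size, and each fold is held out once for validation while the model is fit on the remaining k-1 folds.

4:21

Then the errors for the folds are averaged, and the model with the smallest average error is selected.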
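
Here is one way the k-fold procedure and the final model selection could be sketched in Python. The two candidate models (a mean-only baseline and a one-predictor line), the synthetic data, and all function names are assumptions for illustration and are not taken from the lecture.

```python
import random


def fit_mean(xs, ys):
    """Baseline model: always predict the mean training response."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Least-squares line for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def k_fold_error(fit, xs, ys, k=5, seed=0):
    """Average of the k per-fold mean squared errors."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of roughly equal size
    fold_errors = []
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in indices if i not in held_out]
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_mse = sum((ys[i] - predict(xs[i])) ** 2 for i in fold) / len(fold)
        fold_errors.append(fold_mse)
    return sum(fold_errors) / k


# Synthetic data, for illustration only: y = 2 + 3x + noise
data_rng = random.Random(7)
xs = [data_rng.uniform(0, 10) for _ in range(100)]
ys = [2 + 3 * x + data_rng.gauss(0, 1) for x in xs]

# Both candidates are scored on the same folds because they share the same seed
candidates = {"mean only": fit_mean, "linear": fit_line}
cv_errors = {name: k_fold_error(fit, xs, ys, k=5) for name, fit in candidates.items()}
best = min(cv_errors, key=cv_errors.get)
print(cv_errors)
print("selected model:", best)
```

Using the same folds for every candidate keeps the comparison fair: differences in the averaged error then reflect the models rather than the luck of the split.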

One major advantage of k-fold cross-validation over leave-one-out cross-validation is that it requires considerably fewer computational resources. Rather than fitting the statistical model as many times as there are observations in your data set, you fit it a substantially smaller number of times, typically fewer than 20. Some statistical learning methods have computationally intensive fitting procedures, and data sets can have an extremely large number of observations, which makes leave-one-out cross-validation less feasible. So k-fold cross-validation is a nice compromise between single data set validation and leave-one-out cross-validation.

In addition, k-fold cross-validation often provides more accurate estimates of the test error rate than does leave-one-out cross-validation. Again, this has to do with the bias-variance tradeoff. We know that the validation set approach can overestimate the test error rate, because the training set will have only a proportion of the observations in the full data set. In the leave-one-out cross-validation approach, the training data set will have only one fewer observation than the full data set, so it can provide essentially unbiased estimates of the test error rate. So when it comes to bias, the leave-one-out cross-validation approach is actually superior, providing less biased estimates of the test error rate than k-fold cross-validation.

But bias is not the only thing we're concerned about. We're also concerned about variance. When it comes to having less variance in the test error rate, the k-fold approach is superior to the leave-one-out cross-validation approach: leave-one-out cross-validation has higher variance than does k-fold cross-validation. This is because in leave-one-out cross-validation, the n training data sets, each containing n-1 observations, contain pretty much the same observations each time. As a result, the estimates calculated in each cross-validation sample will be highly correlated with each other, and the mean of these highly correlated estimates will have greater variance than a mean of less correlated estimates. With k-fold cross-validation there's considerably less overlap in the cross-validation samples, which means less correlation between the cross-validation estimates and, consequently, less variance.
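
A short worked formula may help show why correlation matters here. As a simplifying assumption that is not part of the lecture, suppose the m estimates being averaged, e_1 through e_m, all have the same variance sigma squared and the same pairwise correlation rho. Then the variance of their mean is:

```latex
\mathrm{Var}\!\left(\frac{1}{m}\sum_{i=1}^{m} e_i\right)
  = \frac{\sigma^2}{m} + \frac{m-1}{m}\,\rho\,\sigma^2
```

When rho is near 0, the second term vanishes and the variance shrinks like sigma squared over m. When rho is near 1, the variance stays close to sigma squared no matter how many estimates are averaged, which is the situation the heavily overlapping leave-one-out training sets push toward.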

For many statistical methods, cross-validation is easily conducted with procedures or functions that will do it automatically. We just need to specify the type of cross-validation.

7:08

If we decide to go with the k-fold cross-validation approach, then we have to specify the number of folds. The number of folds can vary, but you will typically see k-fold cross-validation with k=5 or k=10. There is a bias-variance tradeoff associated with the choice of how many folds to specify in k-fold cross-validation. Using k=5 or k=10 has been found to estimate the test error rate with relatively low bias and variance, which is why these values of k are often used.
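
To see what the choice of k means in practice, the small sketch below tabulates, for a hypothetical data set of 1,000 observations, how many times the model is fit and how many observations each fit is trained on. The function names and the data set size are assumptions for illustration; setting k equal to n corresponds to leave-one-out cross-validation.

```python
def fold_sizes(n, k):
    """Sizes of k folds that differ by at most one observation."""
    base, extra = divmod(n, k)
    return [base + 1 if i < extra else base for i in range(k)]


def summarize(n, k):
    """Number of model fits and range of training-set sizes for k-fold CV."""
    sizes = fold_sizes(n, k)
    return {
        "fits": k,
        "min_train_size": n - max(sizes),
        "max_train_size": n - min(sizes),
    }


n = 1000  # hypothetical number of observations
for k in (5, 10, n):  # k = n is leave-one-out cross-validation
    print(k, summarize(n, k))
```

Larger k trains each model on more of the data, which reduces bias, but requires more fits and produces training sets that overlap more heavily.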
