0:00

This lecture is about covariate creation. Covariates are sometimes called predictors and sometimes called features. They're the variables you will actually include in your model, the ones you combine to predict whatever outcome you care about.

There are two levels of covariate creation, or feature creation. The first level is taking the raw data that you have and turning it into a predictor that you can use. The raw data often takes the form of an image, a text file, or a website. That kind of information is very hard to build a predictive model around until you have summarized it in some useful way into either a quantitative or qualitative variable.

0:40

So what we want to do is take that raw data and turn it into features or covariates: variables that describe the data as fully as possible while giving some compression, making it easier to fit standard machine-learning algorithms. So the idea here is, suppose you have an email, like the example email here on the left. It's very hard to plug the email itself into a prediction function, because most prediction functions are based on the idea of taking a small number of variables and building a quantitative model around them, and that doesn't work for free text, for example.

1:13

So the first thing you need to do is create some features, and those features are just variables that describe the raw data. In the case of an email, we might think of different ways to describe it. For example, we might calculate the fraction of letters that are capitals; in this case 100% of the letters in the email are capital letters. You might ask how frequently a particular word appears. For example, how often does "you" appear? "You" appears twice in this email, so we record the value two for this email. That's a feature. You might also count the number of dollar signs, which might be a really good predictor of whether an email is spam or not. Here you can see there are a large number of dollar signs, eight of them, so we have calculated another feature of the data set. This step, going from raw data to covariates, usually involves a lot of thinking about the structure of the data that you have and about the right way to extract the most useful information in the fewest number of variables while still capturing everything you want.

The next stage is transforming tidy covariates. In other words, we calculated a number, say capital average, the average number of capitals in the data set. But it might not be that average that relates best to the outcome we care about; it might be the average number of capitals squared or cubed, or some other function of it. So the next stage is transforming the variables into more useful variables. For example, if we load the kernlab package and the spam data set, we can take the capital average, which is basically this variable right here, the fraction of letters that are capitals. And we could square that number and assign it to a new variable, capital average squared, that might be useful later in our prediction algorithm.
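As a minimal sketch of this step, assuming the kernlab package is installed and using capitalAveSq as a hypothetical name for the new variable:

```r
# Sketch: squaring the capitalAve variable from the kernlab spam data set.
library(kernlab)
data(spam)

# capitalAve summarizes capital-letter usage per email; squaring it
# creates a transformed covariate that may relate better to the outcome.
spam$capitalAveSq <- spam$capitalAve^2
head(spam[, c("capitalAve", "capitalAveSq")])
```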

So those are the two steps in creating covariates. The first step, raw data to covariate, depends heavily on the application. As I showed you on the previous slide, in the email case it might be extracting the fraction of times a word appears, or something like that. In the case of voice, it might be knowing something about the frequency ranges or timbre in which voices typically fall. In the case of images, it might be identifying features of the images: if it's faces, where the noses or the ears or the eyes are. It will depend greatly on what your application is. The balancing act here is definitely summarization versus information loss. In other words, the best features are features that capture only the relevant information in, say, the image or the email, and throw out all the information that isn't really useful. So the idea is that you have to think very carefully about how to pick the right features, the ones that explain most of what's happening in your raw data.

3:55

Some examples: for text files, it might be the frequency of words or the frequency of phrases. There's this cool site, Google Ngrams, which tells you about the frequency of different phrases that appear in books going back in time. For images, it might be edges, corners, blobs, and ridges, for example; these are all ideas about how you identify different structures in an image. For websites, it might be the number and type of images, where the buttons are, colors, and videos. This is a hugely important area in web development called A/B testing, which is called a randomized trial in statistics: you show different versions of a website with different values of these features and predict which one will induce more clicks or get more people to buy products. For people, you can imagine features like height, weight, hair color, and so on. Basically, any summary of the raw data that you can make is a potential feature. Often this involves quite a bit of scientific thinking and business acumen to know what the right covariates are for a particular problem. The more knowledge you have of a system, the better job you'll do at feature extraction in general. In general, it's a good idea to have a really clear understanding of why a particular set of features is useful for predicting the outcome you care about.

5:12

So there's this balance between summarization and information loss, and in general it's better to err on the side of creating more features: you lose less information, and you can filter some of those features out during your model-building process. This can all be automated, and has been automated in various ways, but you generally have to use a lot of caution with that approach, because sometimes a particular feature will be very useful in the training set you created but won't be useful in a new set of data; it won't generalize well to the test set.

5:41

So the second level is taking tidy covariates, the features you've already created on the data set, and creating new covariates out of them. Usually these are transformations or functions of the covariates that might be useful when building a prediction model. This tends to be more necessary for methods like regression or support vector machines, which depend a bit more on what the distribution of the data looks like, and less so for things like classification trees, where there isn't as much model-based prediction. In other words, those methods don't depend quite so much on the data looking a particular way.

6:20

On the other hand, in general, it's a good idea to spend some time making sure you have the right covariates in your model. When you create or decide on these functions, you have to do it only in the training set. This is a common theme of machine learning: building features can only happen in the training set, not in the test set. Later, when you apply your prediction function to the test set, you will compute that same function of the covariates so you can apply your predictor. But the original creation, the thinking about which covariates to build, has to happen only in the training set; otherwise you'll be led to overfitting. The best approach I've found is exploratory analysis: making plots and tables of the data, and trying to understand the patterns of variation in your data set and how they might relate to the outcome. When you're using the caret package or doing this analysis in R, the new covariates need to be added to data frames so they can be used in downstream prediction. And it's important to make sure the names of the new variables are recognizable, so that you can create the same names on your testing data set.

7:45

One idea that's very common when building machine learning algorithms is to turn covariates that are qualitative, or factor variables, into what are called dummy variables. You probably learned a little about this in the regression models class, if you've taken it through this data science specialization. The basic idea: suppose we look in the training set at the variable called job class. That variable has two levels: it's either industrial or it's information. One thing we could try is to plug that variable directly into a prediction model, but the values of that variable will actually be a set of characters, either "industrial" or "information", and it's sometimes hard for prediction algorithms to use that qualitative information directly in order to actually do the prediction. So one thing we might want to do is turn it into a quantitative variable, and the way you can do that with the caret package is the dummy variables function. Basically, we pass in a model: the outcome is wage, job class is the predictor variable, and the training set is where we're going to build those dummy variables. Then if you use the predict function with this dummies object and a data set, in this case we just apply it to the training data set, you get two new variables out. The first is an indicator that you are industrial, and the second is an indicator that you are information. If the industrial indicator is one, it means that person had an industrial job; if it's zero, it means they did not. The same is true for information: if it's zero they did not have an information job, and if it's one they did. In this case, where there are only two levels of the variable, industrial and information, whenever you're one for industrial you're zero for information, and whenever you're zero for industrial you're one for information. But if you had three levels, every row would have two zeros, for the two classes you don't belong to, and a one for the class you do belong to. So this is taking these factor, or qualitative, variables and turning them into quantitative variables.
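A sketch of this dummy-variable step, assuming the Wage data from the ISLR package (the data set used in this part of the course) and, for brevity, applying dummyVars to the full data set rather than a training split:

```r
# Sketch: converting the jobclass factor into 0/1 indicator columns.
library(ISLR)
library(caret)
data(Wage)

table(Wage$jobclass)  # two levels: "1. Industrial" and "2. Information"

# Build the dummy-variable encoder from a model formula, then predict
# to get one indicator column per factor level.
dummies <- dummyVars(wage ~ jobclass, data = Wage)
head(predict(dummies, newdata = Wage))
```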

9:57

Another thing that happens is that some variables basically have no variability in them. For example, suppose you create a feature for emails that asks: does this email have any letters in it at all? Almost every single email will have at least one letter, so that variable will almost always be equal to true. It has no variability, so it's probably not going to be a useful covariate. One thing you can use is the nearZeroVar function in caret to identify variables that have very little variability and will likely not be good predictors. You apply it to a data frame, here the training data set, and I'm telling it to save the metrics so we can see how it decides which variables qualify. For example, it reports the percentage of unique values for each variable; in this case one variable has about 0.33% unique values and is not a near-zero-variance variable. But the variable sex, for example, is basically all males in this data set, so it has a very high frequency ratio; in other words, it's essentially all one category. It ends up flagged as a near-zero-variance variable, and you could use that column of the output to throw out variables like sex and, in this case, region, variables that don't really have any variability in them and shouldn't be used in prediction algorithms. So this is a nice way to throw those less meaningful predictors out right away.
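A sketch of the near-zero-variance check, again assuming the ISLR Wage data:

```r
# Sketch: flagging variables with almost no variability.
library(ISLR)
library(caret)
data(Wage)

# saveMetrics = TRUE returns the diagnostics instead of just the indices:
# freqRatio, percentUnique, zeroVar, and nzv columns. Variables flagged
# nzv = TRUE are candidates to drop before model fitting.
nsv <- nearZeroVar(Wage, saveMetrics = TRUE)
nsv
```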

11:44

Another thing you might do: if you use linear regression or generalized linear regression as your prediction algorithm, which we'll talk about in a future lecture, the idea is basically to fit straight lines through the data. Sometimes you want to be able to fit curvy lines, and one way to do that is with basis functions. You can find those, for example, in the splines package: the bs function will create a polynomial basis. In this case we pass in a single variable, the age variable from the training set, and we say we want a third-degree polynomial for this variable. When you do that, you get a three-column matrix out, so this is now three new variables. The first column corresponds to age, the actual age values, scaled for computational purposes.

12:35

The second column corresponds to something like age squared; in other words, you're allowing a quadratic relationship between age and the outcome. And the third column corresponds to age cubed, so you allow a cubic relationship between age and the outcome. If you include these covariates in the model, instead of just the age variable, when you fit a linear regression, you allow for curvy model fitting. Just to show you an example of that, here I fit a linear model; you'll remember that from your linear modeling class. Wage is the outcome; the tilde tells you what we're predicting it with. Here we pass it the bs basis, in other words all the predictors we generated from the polynomial basis: age, age squared, and age cubed.
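A sketch of the spline basis and the curvy linear fit, assuming the Wage data and using bsBasis and lm1 as hypothetical object names:

```r
# Sketch: a third-degree polynomial spline basis for age, fed to lm().
library(ISLR)
library(splines)
data(Wage)

bsBasis <- bs(Wage$age, df = 3)  # three columns: roughly age, age^2, age^3
lm1 <- lm(Wage$wage ~ bsBasis)   # allows a curvy wage-vs-age relationship

# Plot the raw data, then overlay the fitted curve.
plot(Wage$age, Wage$wage, pch = 19, cex = 0.5)
points(Wage$age, predict(lm1), col = "red", pch = 19, cex = 0.5)
```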

13:20

Then we can plot the age data versus the wage data, with age on the x axis and wage on the y axis. You can see there's a kind of curvilinear relationship between these two variables. And we can plot age against the predicted values from our linear model, including the curvy polynomial terms, and you see you get a curve fit through the data set as opposed to just a straight line. So that's one way you can generate new variables: by allowing more flexibility in the way you model specific variables.

13:53

So then, on the test set, you'll have to predict those same variables. This idea is incredibly critical for machine learning when you create new covariates: you have to create the covariates on the test data set using the exact same procedure you used on the training set. You can do that by saying: predict, from the basis I created with the bs function, a new set of values, in this case the testing-set age values. These are the values I will actually plug into my prediction model when I test it out on the test set. This is as opposed to creating a new set of predictors by applying the bs function directly to the test-set age variable, which would create a new basis on the test set that isn't related to the variables you created on the training set and may introduce some bias.
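A sketch of applying the same basis on the test side, with hypothetical training and test age vectors; the key point is that predict() evaluates the training-set basis at the new ages instead of building a fresh basis from the test data:

```r
# Sketch: reuse the training-set spline basis on test-set values.
library(splines)

age_train <- c(25, 33, 40, 47, 55, 60)  # hypothetical training ages
age_test  <- c(28, 52)                  # hypothetical test ages

bsBasis <- bs(age_train, df = 3)
predict(bsBasis, age_test)  # training basis evaluated at the test ages
```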

14:43

So a little more on these ideas, and some further reading for you. Level one feature creation is basically all about science, or application-specific knowledge. The best way I've found to learn about a specific application that I haven't talked about here, or that you're new to, is Googling "feature extraction" for the type of data you're trying to analyze: feature extraction for images, feature extraction for voice, things like that. You can also just look up that particular data type and gather as much information about it as you can. In particular, you're looking for the salient characteristics that are likely to differ between individual samples. In general, you want to err on the side of overcreating features, because you can always filter them out later in the machine learning process. In some applications, like images and voice, it's often both possible and pretty much necessary to create features that aren't just things you imagine out of your own mind. It's very hard to know exactly which components of an image to include as features in a model, and so there are approaches like deep learning, which you may have heard of, which is basically a way of automatically creating features for things like images and voice. There's a nice tutorial I've linked to here that explains how that feature creation process works for those data types. But in general, automatic feature creation requires an equal level of thinking, to make sure the features being generated by your feature creation process make sense.

16:12

Level 2 feature creation, covariates to new covariates, can be done largely with the preProcess component of the caret package. You can create new covariates using basically any of the functions in R, if they make sense to you. The key is, again, making lots of plots and doing exploratory analysis to see where the connections between the predictors and the outcome are. You can create new covariates if you think they will improve the fit. Again, you can err on the side of overcreating features, but sometimes features are just nonsensical and you shouldn't create them.

16:46

Be careful about overfitting, in the sense that if you create lots of features that are particularly good for just your training set, they may not work well in the test set. So a good idea, if you overcreate lots of features, is to do some filtering before you actually apply your machine learning algorithm. This tutorial on preprocessing with caret is very good; it's a good place to start for really basic preprocessing. And if you want to fit spline models like the ones I talked about, with flexible curves, you can use the gam method in the caret package, which allows smoothing of multiple variables, with a different smooth for each variable.
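A sketch of fitting such a smooth model through caret, assuming the Wage data; method = "gam" assumes the appropriate GAM backend package is installed, and modFit is a hypothetical name:

```r
# Sketch: a smoothed (spline-based) fit via caret's train() interface.
library(ISLR)
library(caret)
data(Wage)

inTrain  <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]

# train() with method = "gam" applies a smooth to each continuous predictor.
modFit <- train(wage ~ age, data = training, method = "gam")
print(modFit)
```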
