0:00

This lecture is about predicting with trees. The basic idea is that if you have a bunch of variables that you want to use to predict an outcome, you can take each of those variables and use it to split the outcome into different groups. As you split the outcomes into different groups, you can evaluate the homogeneity of the outcome within each group, and continue to split again if necessary, until you get outcomes that are separated into groups that are homogeneous enough, or small enough that you need to stop. The pros of this approach are that it's easy to interpret, and you tend to get better performance in non-linear settings than you do with the linear regression models we talked about in the previous lectures.

0:38

The cons are that without pruning or some kind of cross-validation, this can lead to overfitting. Uncertainty can also be harder to estimate than it is in the linear regression setting. In general, the results may be variable depending on the exact values of the parameters, or the variables that you collected.

Â 0:56

So here's an example of what a decision tree looks like after it has already been built. This is a decision tree that appeared in the New York Times during the 2008 elections, when Barack Obama was running against Hillary Clinton for the Democratic nomination for President. They were trying to find a prediction rule for whether a county would vote for Obama or for Hillary Clinton.

1:17

And the way that happened was that they built a prediction model that would ask questions of each of the different variables that they had in their data set. The best split happened to be this variable: if the county was more than 20% African American, then the county was much more likely to vote for Barack Obama. If it was less than 20% African American, then it became more likely that the county would vote for Hillary Clinton. Then within each of these two subgroups, they went and looked for other variables that might split those subgroups out. So on this branch here, the next question was: is the high school graduation rate greater than 78%? If it wasn't, then the county was more likely to vote for Hillary Clinton, and if it was, then it was more likely to vote for Barack Obama. The algorithm continues in that manner until you've split out into the smallest subgroups that you are willing to consider. And you see that within each of these leaves of the tree, the predictions were quite homogeneous. In other words, Obama won 383 out of about 450 counties in this case, and Hillary Clinton won 704 out of about 790 or 800 counties in this case. These questions split the counties into groups that were homogeneous within each leaf, so the prediction that that candidate would win the election was likely to come true.

2:44

So the basic algorithm behind building one of these trees is: start with all the variables in one big group. Then find the first variable that best splits the outcomes into two different homogeneous groups.

3:03

Within each split, we then search through all the variables again, including the variable we just split on, and try to find within that group whether there's another variable or split that separates the outcome into even more homogeneous groups. You continue until the groups are too small, or they're sufficiently pure, in other words sufficiently homogeneous, to stop the algorithm.
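That greedy search for the best split can be sketched in a few lines. This is illustrative Python, not the R code used later in the lecture; the function and variable names are made up for the example, and Gini impurity (described below) is used as the homogeneity measure.

```python
# A minimal sketch of the greedy split search at the heart of tree building.
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(x, y):
    """Try every candidate threshold on a single variable x and return
    the one minimizing the weighted impurity of the two resulting groups."""
    best = (None, float("inf"))
    for t in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        if not left or not right:
            continue  # a split must produce two non-empty groups
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data: values up to 3 are class "a", above are class "b".
x = [1, 2, 3, 4, 5, 6]
y = ["a", "a", "a", "b", "b", "b"]
threshold, score = best_split(x, y)
print(threshold, score)  # splits at 3, with weighted impurity 0.0
```

A full tree builder would call this over all variables, pick the best one, and then recurse into each resulting group, which is exactly the loop the lecture describes.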

So there are different measures of impurity, and they're all based on a probability that you can estimate within a particular group. In leaf m you have N total objects that you might consider, and you can count the number of times that a particular class appears in that leaf. The number of times class k appears in leaf m, divided by the total, gives this probability: it's p hat for the mth leaf and the kth class. The misclassification error is 1 minus the probability that you're equal to the most common class in that particular leaf. So for example, if it's a leaf where almost all of the counties would vote for Barack Obama, then the misclassification error is 1 minus the probability that a county in that leaf would vote for Barack Obama.

4:18

So zero means perfect purity: there's no misclassification error, and all of the counties would go for Barack Obama. 0.5 means no purity. It's 0.5 rather than 1 because, if the counties in a particular leaf were overwhelmingly likely to vote for Hillary Clinton, then you would get perfect purity in the other direction. So it's actually when the leaf is perfectly balanced between the two different outcomes that you don't have homogeneity within that particular leaf.

4:51

Similarly, there's something called the Gini index, which is not to be confused with the Gini coefficient in economics. It's basically 1 minus the sum of the squared probabilities that you belong to any of the different classes. So again, zero means perfect purity: one particular class has probability equal to 1 and all the other classes have probability equal to 0. 0.5 means no purity: all of the classes are perfectly balanced within each leaf. Deviance and information gain are another pair of measures that can be used; the measure is called deviance if you use log base e and information gain if you use log base 2. It's basically minus the sum, over classes, of the probability of being assigned to class k in leaf m, times the log, base 2 or base e, of that same probability.

5:48

A value of zero means there is perfect purity within the leaf, and a value of one means there's no purity.

If you go to this link from Wikipedia, you'll find a lot more information about how these measures of impurity are used to separate the values within each leaf.
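In standard notation, writing $\hat{p}_{mk}$ for the proportion of class $k$ in leaf $m$, the three measures just described are:

```latex
\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in \text{leaf } m} \mathbb{1}(y_i = k)
  \qquad \text{(proportion of class } k \text{ among the } N_m \text{ objects in leaf } m\text{)}

\text{Misclassification error:}\quad 1 - \hat{p}_{m\,k(m)},
  \qquad k(m) = \arg\max_k \hat{p}_{mk}

\text{Gini index:}\quad 1 - \sum_{k=1}^{K} \hat{p}_{mk}^{\,2}

\text{Deviance / information gain:}\quad -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}
  \qquad (\log_e \text{ for deviance},\ \log_2 \text{ for information gain})
```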

6:03

So here's what they look like for a specific example. Suppose we had a variable that was trying to split the dots into the blue and the red dots. If the variable split in such a way that 15 of the dots in a particular leaf were blue and only one was red, that would be a relatively pure situation. The misclassification rate would be one out of 16; that's just this one dot here, so it's a relatively low value. The Gini index would also be low, because it's 1 minus ((1/16) squared plus (15/16) squared), which is a relatively small number. And the information measure would similarly be small, towards 0, because it's minus (1/16 times log base 2 of 1/16, plus 15/16 times log base 2 of 15/16). So that's the case where there is a relatively pure outcome in the two groups.

We can also look at the case where it's not. Suppose this variable split the outcomes into a leaf where half of the values were blue and half of the values were red. That wouldn't be a very good split, because it's basically a coin flip which class you're from in that leaf. So the misclassification rate here is 0.5, which is very large. The Gini index is also maximized at 0.5, and the information measure is also large: it's one. So this is a split that wouldn't be made, because it wouldn't separate the two groups very well in that particular leaf.
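These two scenarios can be checked numerically. A quick self-contained sketch in Python (illustrative, not part of the lecture's R code):

```python
# Checking the lecture's two example splits with the three impurity measures.
from math import log2

def impurities(counts):
    """Return (misclassification, Gini, information) for per-class counts."""
    n = sum(counts)
    probs = [c / n for c in counts]
    misclass = 1 - max(probs)                      # misclassification error
    gini = 1 - sum(p ** 2 for p in probs)          # Gini index
    info = -sum(p * log2(p) for p in probs if p)   # information (log base 2)
    return misclass, gini, info

# Nearly pure leaf: 15 blue dots, 1 red dot.
m, g, i = impurities([15, 1])
print(round(m, 3), round(g, 3), round(i, 3))  # 0.062 0.117 0.337

# Perfectly balanced leaf: 8 blue, 8 red.
m, g, i = impurities([8, 8])
print(m, g, i)  # 0.5 0.5 1.0
```

All three measures are near zero for the nearly pure leaf and at their two-class maximum for the balanced one, matching the lecture's numbers.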

7:40

So here I'm loading the data with the command data(iris), and then I'm loading the ggplot2 library for making plots. The names of this data set are the different variables that we're going to be using to predict with: Sepal.Length, Sepal.Width, Petal.Length and Petal.Width. What we're trying to predict is the Species, and there are 50 observations of each of the three different species that we're trying to predict with those variables.

8:15

The first thing I do is plot the petal width versus the sepal width. That's petal width on the x axis and sepal width on the y axis, colored by the different species, and you can see there are three very distinct clusters here.

8:30

So it's a relatively easy classification problem. It might be a little bit challenging for a linear model, but not necessarily challenging for this more advanced model with classification trees.

8:43

You can fit the model using the train function in caret. So again, I'm training the model: the outcome is Species, and I'm predicting with all the remaining variables, which is why I have this tilde and then a dot. I'm telling it the method is rpart, which is one of the R packages for doing regression and classification trees. And I'm telling it to use the training data. If I look at the final model, it tells me what all the nodes are, how they're split, and what the probability of being in each class is for each split.
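The same kind of fit can be sketched outside R as well. Here is an illustrative Python parallel using scikit-learn on the same iris data; this is an analogue of the caret/rpart fit described above, not the lecture's actual code, and the split of the data into training and test sets is an assumption for the example.

```python
# Illustrative parallel to the lecture's caret/rpart fit, using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target)

# A CART tree grown with Gini impurity, the same family of model rpart fits.
tree = DecisionTreeClassifier(criterion="gini", random_state=42)
tree.fit(X_train, y_train)

# Print the splits, analogous to inspecting the final model in caret.
print(export_text(tree, feature_names=iris.feature_names))
print("test accuracy:", tree.score(X_test, y_test))
```

As in the lecture, the printed splits show the tree separating setosa off almost immediately on a petal measurement.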

9:18

So here, for example, there's a split that says Petal.Length is less than 2.45, and in that case all of the examples that have petal length less than 2.45 belong to the species setosa. You can read those model splits to tell you what the classification tree is doing. You can also make a plot of the classification tree: if you just plot the final model, it will produce what's called a dendrogram, like this. It's a little hard to read, and it's cut off here, but you can see that if petal length is less than 2.45, you're assigned to setosa. You can follow to the left what happens if the petal length is less than 2.45, and to the right what happens if petal length isn't less than 2.45, and then follow the next split here at this node down to see the total classification for any particular example.

A prettier version of that plot can be made with the rattle package. You use the function fancyRpartPlot and pass it the final model that was fit using caret. It again makes a dendrogram, but now it's a little bit easier to see. Here we see that if the petal length is less than 2.5, we move over here to the left, and if it's greater than 2.5, we go over here to the right. Then within that split, once you've already made that decision about those samples, if the petal length is less than 4.8 you go down here to the left, and if the petal length is greater you go over here to the right. So this makes it a little bit easier to see what's going on.

10:54

You can predict new values using the predict function, just like you can with the other linear regression models. Here the prediction is going to be a particular class label, because the classification tree was built to predict a particular class. So when I pass the new testing data to the model, it will actually write out the different categories of species, because it's predicting the class for each new sample.

11:19

Some notes and further resources. Classification trees are non-linear models, so they immediately use interactions between variables. This is important to keep in mind, because it's always a model that's built on the relationships between multiple variables, and if you have a large number of classes for a particular variable that you're predicting with, the models can overfit a little bit. Data transformations may be a little bit less important: if you do any monotone transformation of a continuous variable, in other words any transformation that doesn't change the order of the values but maybe makes them bigger or smaller, then you will get the same data splits with classification or regression trees. This can make the transformations a little bit less important.
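The monotone-transformation point can be illustrated directly. In this hedged Python sketch (again using scikit-learn rather than the lecture's R tools), squaring the non-negative petal-length feature preserves the order of its values, so the tree finds the same data splits and makes the same predictions, only the thresholds change:

```python
# Trees are invariant to monotone transformations of a continuous predictor:
# the splits partition the data identically, so predictions are unchanged.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

X_transformed = X.copy()
X_transformed[:, 2] = X_transformed[:, 2] ** 2  # monotone transform of petal length

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_sq = DecisionTreeClassifier(random_state=0).fit(X_transformed, y)

# The split thresholds differ, but the predicted classes are identical.
same = (tree_raw.predict(X) == tree_sq.predict(X_transformed)).all()
print(bool(same))  # True
```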

Trees can also be used for regression problems. I showed measures of misclassification impurity; for continuous outcomes you can instead use root mean squared error as the measure of impurity, and do a similar tree-building procedure to build regression models as well.

12:21

There are multiple tree-building options in R, both in the caret package, using the party package and the rpart package, or you can even use the tree package, which doesn't appear in caret but also has a lot of useful functions that can be used when building these models. For more information, you can read these two books, where there's a lot of good information about classification and regression trees, or this book that's more specifically targeted to this particular prediction algorithm.
