0:21

The one difference is that at each split in a classification tree, we also bootstrap the variables. In other words, only a random subset of the variables is considered at each potential split. This makes for a diverse set of potential trees that can be built.

And so the idea is that we grow a large number of trees, and then we either vote across or average those trees in order to get the prediction for a new outcome.
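The whole procedure can be sketched in a few lines of R. This is a minimal illustration of the idea (assuming the randomForest package and R's built-in iris data, which this lecture uses later), not the lecture's own code:

```r
# Random forest idea: grow many trees, each on a bootstrap sample,
# considering only a random subset of variables (mtry) at each split,
# then vote across the trees for the final prediction.
library(randomForest)
data(iris)

set.seed(123)
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500,  # number of trees to grow
                   mtry  = 2)    # variables randomly tried at each split
predict(rf, iris[1, ])          # majority vote across the 500 trees
```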

The pros of this approach are that it's quite accurate. Along with boosting, it's one of the most widely used and highly accurate methods for prediction in competitions like Kaggle.

The cons are that it can be quite slow, since it has to build a large number of trees. And it can be hard to interpret, in the sense that you might have a large number of trees that are averaged together, and those trees represent bootstrap samples with randomly selected variables at each node, which can be a little bit complicated to understand.

It can also lead to a little bit of overfitting, which is complicated by the fact that it's very hard to understand which trees are leading to that overfitting, and so it's very important to use cross-validation when building random forests.

1:27

Here's an example of how this works in practice. The idea is that you build a large number of trees, where each tree is based on a bootstrap sample. So, for example, this tree is built on a random subsample of your data, this one on a separate random subsample, and this one on yet another random subsample of the data. And then at each node we allow a different subset of the variables to potentially contribute to the splits.

1:50

Then if we get a new observation, say this new observation here, we run that observation through tree one, and it ends up at this leaf down at the bottom of that tree, so it gets a particular prediction here. Next, we take that same observation, run it through the next tree, and it goes down to a slightly different leaf and gets a slightly different set of predictions. And finally we go down the third tree and get yet another set of predictions. Then what we do is average those predictions together in order to get the predicted probability of each class across all the different trees.
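This averaging across trees is what you get from `predict` with `type = "prob"`; a sketch, again assuming the randomForest package and the iris data:

```r
library(randomForest)
data(iris)

set.seed(123)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
# Each tree votes for a class; the per-class vote fractions act as the
# averaged predicted probabilities across all the trees.
predict(rf, iris[c(1, 51, 101), ], type = "prob")
```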

2:50

I'm also telling it to fit the outcome Species and to use any of the other variables as potential predictors. I'm setting prox equal to TRUE; you'll see why I'm doing that in a minute, because it produces a little bit of extra information that I can use after building the model fit. So here it tells me that I built the model, I've done bootstrap resampling, and it tried a bunch of different values of the tuning parameter. The tuning parameter in particular is mtry, the number of variables that are randomly tried as candidates at each split.
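The fitting step being described is along these lines; a sketch assuming the caret package, the iris data, and a train/test split built with createDataPartition (the data-splitting code falls just before this part of the transcript):

```r
library(caret)
data(iris)

set.seed(123)
inTrain  <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
training <- iris[inTrain, ]
testing  <- iris[-inTrain, ]

# Fit a random forest; prox = TRUE asks for the proximity matrix,
# which is used later to compute class centers.
modFit <- train(Species ~ ., data = training, method = "rf", prox = TRUE)
modFit  # shows the bootstrap resampling and the mtry values tried
```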

3:23

I can look at a specific tree in our final model fit using the getTree function. So I applied getTree here to our final model and said I want the second tree out, and this is what the tree looks like. Each of these rows corresponds to a particular split. You can see what the left daughter of the split is, the right daughter, which variable we're splitting on, the value where that variable is split, and what the prediction is going to be out of that particular split.
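A sketch of that call, assuming a caret model fit named modFit like the one described above (refit here so the snippet stands alone):

```r
library(caret)
library(randomForest)
data(iris)

set.seed(123)
modFit <- train(Species ~ ., data = iris, method = "rf", prox = TRUE)

# Each row of the returned table is one split: the left and right
# daughter nodes, the variable split on, the split point, a status
# flag for terminal nodes, and the prediction at terminal nodes.
getTree(modFit$finalModel, k = 2, labelVar = TRUE)
```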

3:53

You can use this class center information as well to see what the predictions would be, or the center of the class predictions. So what I've done here is look at two particular variables, the petal length and the petal width. I plotted petal width on the x axis and petal length on the y axis. I then get the class centers; these are going to be the centers for the predicted values. So I pass in the training data, and I give it this proximity matrix which we asked for in the previous fitting, and that gives us the class centers. We can then plot those class centers to see where they fall in the data.

So now I've created the centers data set, along with the species labels. What I'm going to do is plot petal width versus petal length, colored by species in the training data; that's what I did with this qplot command. Then I add points on top of that, again petal width against petal length with the color being the species, but now drawn from irisP, which holds the centers of the data set. So what you can see is that each dot here represents an observation, and the x's show the class centers for each of the different predictions. You can see that each species has a prediction for these two variables that's right in the center of the cloud of points corresponding to that particular species.
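The class-center plot being narrated can be sketched as follows, assuming the caret, randomForest, and ggplot2 packages (qplot is used here to match the lecture, although newer ggplot2 versions deprecate it; the object names irisP, training, and modFit follow the transcript):

```r
library(caret)
library(randomForest)
library(ggplot2)
data(iris)

set.seed(123)
inTrain  <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
training <- iris[inTrain, ]
modFit   <- train(Species ~ ., data = training, method = "rf", prox = TRUE)

# Class "centers" for the two petal variables, computed from the
# proximity matrix requested with prox = TRUE.
irisP <- classCenter(training[, c(3, 4)], training$Species,
                     modFit$finalModel$prox)
irisP <- as.data.frame(irisP)
irisP$Species <- rownames(irisP)

# Dots = training observations; X's = the class centers.
p <- qplot(Petal.Width, Petal.Length, col = Species, data = training)
p + geom_point(aes(x = Petal.Width, y = Petal.Length, col = Species),
               size = 5, shape = 4, data = irisP)
```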

You can then predict new values using the predict function. So you pass to predict our model fit and the testing data set. Here I'm also setting a variable, predRight, in the testing set, which records whether we got the prediction right; in other words, whether our prediction is equal to the species in the testing data set. I can then make a table of our predictions versus the species to see what that variable would look like. So I can see, for example, that I missed two values here with my random forest model, but overall it was highly accurate in the prediction.
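A sketch of the prediction step, assuming the same caret setup as before (exact counts in the confusion table will depend on the random split, so your results may differ from the lecture's):

```r
library(caret)
library(ggplot2)
data(iris)

set.seed(123)
inTrain  <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
training <- iris[inTrain, ]
testing  <- iris[-inTrain, ]
modFit   <- train(Species ~ ., data = training, method = "rf")

pred <- predict(modFit, testing)
testing$predRight <- pred == testing$Species  # did we get it right?
table(pred, testing$Species)                  # confusion table

# Color the test points by whether the prediction was correct,
# to see where the misclassified points fall.
qplot(Petal.Width, Petal.Length, colour = predRight,
      data = testing, main = "newdata Predictions")
```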

I can then look and see which of the two values I missed. And perhaps unsurprisingly, you can see that the two I missed, marked in red here, are the two that lie in between two separate classes. So remember, there was one class up in this right corner and one class right here in the middle in this cloud, and the two points that lie right on the border were misclassified. So you can use this to explore and see where your prediction is doing well and where it is doing poorly.

6:19

Random forests are usually one of the top performing algorithms, along with boosting, in prediction contests. They're often difficult to interpret because of the multiple trees that we're fitting, but they can be very accurate for a wide range of problems. You can check out the rfcv function to make sure that cross-validation is being performed, but the train function in caret also handles that for you.
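A sketch of both options (assuming randomForest's rfcv and caret's trainControl; the fold count here is illustrative):

```r
library(caret)
library(randomForest)
data(iris)

set.seed(123)
# Option 1: rfcv estimates cross-validated prediction error as the
# number of predictors is successively reduced.
cvres <- rfcv(trainx = iris[, -5], trainy = iris$Species, cv.fold = 5)
cvres$error.cv  # CV error rate for each number of variables

# Option 2: let caret's train perform k-fold cross-validation for you.
modFit <- train(Species ~ ., data = iris, method = "rf",
                trControl = trainControl(method = "cv", number = 5))
```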

For more information, you can read about random forests directly from the inventor here. The Wikipedia page for random forests is also quite good, and The Elements of Statistical Learning covers it as well.
