In this lesson, we will talk about another large class of
regression and classification models, different from neural networks.
In the previous lesson,
we talked about support vector machines, which are widely used for
classification and regression in many industrial and financial applications.
This time, I would like to talk about tree-based methods.
First, what is the main idea of these methods?
Let me explain this using this specific example.
Here, you see results obtained by training a tree called CART,
which stands for classification and regression tree.
Like most of the other methods presented in this lesson,
this approach was pioneered by Leo Breiman, who proposed it in his papers with coauthors from 1984.
The graph here shows the results obtained by training
a CART classification tree on our banking dataset using scikit-learn.
Let's see how a tree is constructed here.
The tree starts by choosing one feature, in this case
the log of total assets,
and picking a split threshold for this feature.
This happens in the root node of the tree.
After that, all data points are split into two groups
depending on whether the test of the first node is true or not.
This produces two child nodes.
Then the whole procedure repeats for each of the child nodes.
Separately, for each of them,
we choose a feature and a split value for this feature.
This creates a pair of children for each child of the root node, and so on.
This continues until we either reach a preset
limit on the depth of the tree, or reach a required accuracy of
classification, or are left with nodes that can no longer be partitioned any further.
I will go momentarily over how a feature and the threshold are chosen for each node,
but for now, let's just note that according to the CART tree,
there are three most important features that
determine whether a given bank will fail or not.
These are the logarithm of total assets,
the ratio of non-performing loans to total loans,
and the ratio of net income to total assets.
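To make this concrete, here is a minimal sketch of how such a tree could be trained and visualized with scikit-learn. The feature names and the synthetic data below are illustrative stand-ins for the banking dataset, not the actual data used in the lecture.

```python
# A minimal sketch of training and visualizing a CART classifier with scikit-learn.
# The features and labels are synthetic placeholders, not the lecture's banking data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.normal(12.0, 2.0, n),   # log of total assets (illustrative)
    rng.uniform(0.0, 0.2, n),   # non-performing loans / total loans
    rng.normal(0.01, 0.02, n),  # net income / total assets
])
# Synthetic labels: 1 = failed, 0 = survived (for illustration only)
y = ((X[:, 1] > 0.1) & (X[:, 2] < 0.0)).astype(int)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

plt.figure(figsize=(12, 6))
plot_tree(clf, feature_names=["log_TA", "NPL_ratio", "NI_to_TA"],
          class_names=["survived", "failed"], filled=True)
plt.show()
```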
Now, let's see what the other numbers shown in these boxes on the graph mean.
First, for each node,
after selecting the feature and the threshold,
the algorithm reports a number called impurity.
By default, scikit-learn uses the Gini impurity.
The Gini impurity for node i is equal to 1 minus the sum of
squared fractions of instances of different classes going through this node.
These fractions are denoted p sub ik,
where k is the class index and the sum runs over all classes.
For binary classification, there are only two classes.
Imagine that node i contains only instances
of class 1 and no instances of class 2.
Then p sub i1 will be 1 and p sub i2 will be 0,
so the Gini impurity for this node will be 0.
In other words, this node will be pure.
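As a small sketch of this definition, the helper below computes the Gini impurity of a node from its class counts; the function name is mine, not part of any library.

```python
import numpy as np

def gini_impurity(class_counts):
    """Gini impurity G_i = 1 - sum_k p_{i,k}^2 for one node,
    where p_{i,k} is the fraction of instances of class k in node i."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([50, 0]))   # pure node -> 0.0
print(gini_impurity([25, 25]))  # maximally mixed binary node -> 0.5
```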
Next, the samples entry in each box shows
the percentage of all samples that go through that node.
And finally, the two numbers under value show
how many instances of each class this node applies to.
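If you want to read these quantities programmatically rather than from the plot, the fitted tree exposes them through its tree_ attribute. This sketch assumes clf is the DecisionTreeClassifier fitted in the earlier sketch; note that scikit-learn stores absolute sample counts here, while the plot can display them as percentages.

```python
# A sketch of inspecting node-level quantities of a fitted tree.
# Assumes `clf` is the DecisionTreeClassifier fitted in the earlier sketch.
tree = clf.tree_
for node_id in range(tree.node_count):
    print(f"node {node_id}: impurity={tree.impurity[node_id]:.3f}, "
          f"samples={tree.n_node_samples[node_id]}, "
          f"value={tree.value[node_id].ravel()}")
```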
Now, let's talk about how CART chooses
what feature to pick and what threshold for this feature to use.
This is decided using a very simple cost function, shown in this equation.
In this equation, k is the index of a feature and t sub k is
the corresponding threshold for feature k.
m is the total number of instances going through the node,
and m sub left and m sub right are the numbers of
instances going to the left and right child nodes, respectively.
And finally, G sub left and G sub right stand
for the Gini impurities of the left and right subsets.
At each step of constructing a tree,
the CART algorithm chooses k and t sub k that minimize this cost function.
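Here is a minimal, unoptimized sketch of that greedy search, written only to make the cost function J(k, t_k) = (m_left / m) G_left + (m_right / m) G_right concrete. It reuses the gini_impurity helper from the earlier sketch and is not how scikit-learn is implemented internally.

```python
import numpy as np

def best_split(X, y):
    """Greedy search over features k and thresholds t_k minimizing
    J(k, t_k) = (m_left / m) * G_left + (m_right / m) * G_right."""
    m, n_features = X.shape
    best = (None, None, np.inf)  # (feature index k, threshold t_k, cost J)
    for k in range(n_features):
        for t_k in np.unique(X[:, k]):
            left = y[X[:, k] <= t_k]
            right = y[X[:, k] > t_k]
            if len(left) == 0 or len(right) == 0:
                continue  # skip degenerate splits
            g_left = gini_impurity(np.bincount(left))
            g_right = gini_impurity(np.bincount(right))
            cost = len(left) / m * g_left + len(right) / m * g_right
            if cost < best[2]:
                best = (k, t_k, cost)
    return best
```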
Instead of the Gini impurities G sub left and G sub right,
we could use other metrics.
For example, we could use the entropy of the node, denoted H sub i
and defined as minus the sum over all classes k
of p sub ik times the log of p sub ik.
If a node is pure, that is, if it has only instances of one class,
then the entropy will be 0.
And this is similar to Gini impurity.
So using entropy instead of Gini impurity usually
produces similar results, though sometimes
the trees obtained with entropy are more balanced.
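In scikit-learn, switching between the two measures is just a parameter choice: the criterion argument of DecisionTreeClassifier accepts either value, as this small sketch shows.

```python
from sklearn.tree import DecisionTreeClassifier

# Gini impurity is the default criterion; entropy is the alternative.
clf_gini = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf_entropy = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
```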
The algorithm shown in this formula is a greedy algorithm,
which means it only does what is optimal at this particular step.
It does not try to account for
the possible impact of the current partitioning on future partitioning.
Now, an important question with trees is how to constrain their complexity.
If we let it grow unbounded,
a tree might become very complex, producing
an excellent fit in sample but a bad fit out of sample;
in other words, overfitting.
There are a few parameters or rather
hyperparameters with trees that you can tune to prevent overfitting.
One of them is the maximum depth of the tree.
You can set it to some reasonably small number to keep your tree interpretable.
The other parameter is the minimum number of
instances passing through a node to let it split.
Another hyperparameter is an impurity split threshold
such that a node will split if its impurity is above this threshold.
Otherwise, it's a leaf node.
There are some other hyperparameters such as minimum number of samples in a leaf.
All of them can be tuned using methods that we outlined in our first course namely,
using either a separate validation set or relying on cross validation.
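As a hedged sketch of such tuning, the grid search below uses current scikit-learn parameter names; the impurity split threshold mentioned above corresponds to the older min_impurity_split parameter, which newer scikit-learn versions replace with min_impurity_decrease. The grid values and the training data names are arbitrary examples.

```python
# A sketch of cross-validated hyperparameter tuning for a decision tree.
# X_train, y_train are placeholders for your training data.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "max_depth": [2, 3, 4, 5],              # maximum depth of the tree
    "min_samples_split": [2, 10, 50],       # min instances in a node to allow a split
    "min_samples_leaf": [1, 5, 20],         # min samples required in a leaf
    "min_impurity_decrease": [0.0, 0.001, 0.01],  # impurity-based stopping rule
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
# search.fit(X_train, y_train)
# print(search.best_params_)
```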
So as a short summary,
what can we say about trees in terms of their pros and cons?
Among pros of trees,
the first place is definitely taken by their simplicity and interpretability.
Once you build a tree,
either for classification or regression,
you can always visualize and try to interpret it.
Trees require very little data preprocessing.
In particular, there is no need to rescale data or to fill in missing values.
With trees, you can often treat a missing value of a discrete variable as just another possible value.
The results obtained with trees turn out to be
invariant with respect to monotone transformations of
inputs, and this is because the algorithm depends only on
the ranking of different feature values rather than on their absolute values.
Trees also perform automatic feature selection.
In particular, the feature picked at the root node of the tree is the most important one.
Finally, trees can be scaled to really large data sets.
All these are pros of the trees.
Now, what about the cons?
Are there any cons for the trees?
The answer is yes, there are.
And as usual, they are flip sides of their pros.
In particular, trees often have lower accuracy than other models such
as logistic regression or neural nets or support vector machines.
This can be traced back to the fact that trees use
greedy algorithms such as the CART algorithm.
This might produce a sub-optimal solution, but
better solutions are harder to find because the original problem is NP-hard.
The other con has the same origin:
the greedy algorithm of building the tree.
The problem is that under moderate variations in the input data,
the greedy tree algorithm may produce trees that look very different.
This means that trees are estimators with high variance and hence
with a potentially large generalization error.
We will see in the next video how this latter problem can be partially overcome.