The Decision Tree Algorithm: Fighting the Over-Fitting Issue – Part 2: Reduced Error Pruning

What is this post about?

In the previous post, we learned all about what over-fitting is, and why over-fitting to our training data can easily hinder the generalization power of the final trained model on future, unseen test data. In particular, we said that there are 2 cases where over-fitting can harm us:

  1. If the ground truth labels are noisy
  2. If the training data is a poor representative of the actual generative distribution of the data

In this post we will focus on probably the most popular technique to fight the over-fitting problem. Ready when you are 😉

Training Validation Approach

I wouldn’t be exaggerating if I told you that this is possibly the most common approach to fighting the over-fitting problem in decision trees. There are 2 variants of this approach, and both are based on the idea of pruning a fully grown tree! Remember that if we let the tree grow indefinitely, it will keep growing until the training error is reduced to 0, and here lies the real danger of over-fitting: the tree perfectly learns a training set that could contain certain circumstantial regularities that have happened purely by chance! These regularities might not be a real theme in the data and might just be a random presence in one part of the training set! We do not want the tree to keep growing until all this randomness and noise have been learned! That would simply destroy the generalization power of the model!

The idea behind the training validation approach is to divide the original training data into 2 groups:

  1. Training Data
  2. Validation Data

We let the tree learn this newly built training data perfectly and grow for as long as it needs to! Then we use the validation set to decide which parts of the tree need to be pruned! Why is this a nice idea?

Even though our tree will learn the random errors and coincidental regularities present in the training data, the validation set is quite unlikely to exhibit the same regularities and fluctuations!

So, we can prune certain nodes of the tree and measure the performance of the tree on this unseen validation set, which can tell us whether the decision to prune a certain node (and the sub-tree rooted at that node) was a good idea!

How to choose this validation set?

  1. We need to make sure that the validation set is big enough! It has to be, on its own, a statistically significant sample of the entire instance space (the instance space is the space of all possible data, and our original training data is itself just a sample from this space! Obviously, there is no way for anybody to have ALL the data in the instance space! We hope that our training data is a big enough sample and a good representative of this practically infinite instance space).
  2. In general, the original training data is divided into 3 equal parts! One-third of it is used for validation and fighting the over-fitting issue, and the remaining two-thirds are used for the actual training (see the sketch right after this list).
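To make this concrete, here is a minimal sketch of carving out such a validation set in Python, using scikit-learn's train_test_split on a toy dataset (the make_classification call is just a stand-in for your original training data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# A toy stand-in for "the original training data"
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Two-thirds for the actual training, one-third held out for pruning decisions
X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=1/3,   # the validation share
    stratify=y,      # keep class proportions similar in both splits
    random_state=0,
)
```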

Let’s start learning about the first practical method of the training validation approach.

Reduced Error Pruning

So, clearly, the idea is to prune a fully-grown tree, which means that each node is a potential candidate for pruning!

When we choose a particular node for pruning, we prune the whole sub-tree rooted at that node! This turns the node into a leaf node, as it no longer has any children!

Say we are doing a classification task. As we have seen, each leaf node of a decision tree is a pure node, with a particular class label dominating the whole sub-set of training examples at that leaf node. However, if through pruning we turn a non-leaf node into a leaf node, then what label should be assigned to it?

We will assign to that node the most dominant class label among the training examples that were associated with it before the pruning happened!
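As a tiny illustration, here is one way to compute that majority label in Python (the helper name majority_label is just something I made up for this post, not a library function):

```python
from collections import Counter

def majority_label(labels):
    """Return the most common class label among the examples at a node."""
    return Counter(labels).most_common(1)[0][0]

print(majority_label(["spam", "ham", "spam"]))  # -> spam
```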

Now, the question is: if every node is a potential pruning candidate, how do we decide which one to choose for pruning? This is where our validation set comes into play:

  1. Measure the performance of the tree on the validation set before pruning node X (Let’s call it pre-prune performance)
  2. Prune the tree at node X, and assign the most dominating class label of the training examples at node X to it
  3. Measure the performance of the pruned tree on the validation set (Let’s call it post-prune performance)
  4. Consider node X for pruning if:

post-prune performance ≥ pre-prune performance

In plain English, if the performance of the pruned tree on the validation set is no worse than the performance of the tree before pruning on the same validation set, we will consider node X for pruning, but we will not prune it yet! Why not just prune the bloody tree at node X already? That is a good question!

Let’s say there are 3 nodes whose pruning results in a tree whose performance on the validation set is no worse than that of the tree before pruning, and let’s call them X1, X2, and X3. So, which one should we choose? I am sure you know the answer: the node that results in the highest increase in the performance of the tree on the validation set, of course! So, the intuition for using the validation set is that by pruning, we get rid of the leaf nodes in the original tree that were created due to the aforementioned coincidental regularities in the training set, as those regularities are unlikely to appear in the validation set as well!

For how long do we have to prune the tree, you ask? We keep slicing up the tree until further pruning at any node would be harmful (i.e., until it would decrease the accuracy of the tree on the validation set). The sketch below puts all of these pieces together.
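Here is a minimal, self-contained sketch of reduced error pruning in Python, tying steps 1 to 4 and this greedy loop together. Everything in it (the Node class, the {feature: value} dict encoding of examples, the function names) is my own illustrative assumption for this post, not the API of any particular library:

```python
from collections import Counter

def majority_label(labels):
    """The most common class label among the training examples at a node."""
    return Counter(labels).most_common(1)[0][0]

class Node:
    """A minimal decision-tree node, just for illustration."""
    def __init__(self, feature=None, children=None, labels=None):
        self.feature = feature          # feature tested here (None for a leaf)
        self.children = children or {}  # feature value -> child Node
        self.labels = labels or []      # training labels that reached this node
        self.pruned_label = None        # set once the node is turned into a leaf

    def is_leaf(self):
        return self.pruned_label is not None or not self.children

    def predict(self, x):
        """Classify one example, given as a {feature: value} dict."""
        if self.pruned_label is not None:
            return self.pruned_label                 # this node was pruned
        if not self.children:
            return majority_label(self.labels)       # an original leaf
        child = self.children.get(x.get(self.feature))
        if child is None:                            # unseen feature value
            return majority_label(self.labels)
        return child.predict(x)

def accuracy(tree, X_val, y_val):
    """Performance of the tree on the validation set."""
    return sum(tree.predict(x) == y for x, y in zip(X_val, y_val)) / len(y_val)

def internal_nodes(node):
    """Yield every pruning candidate, i.e., every non-leaf node."""
    if not node.is_leaf():
        yield node
        for child in node.children.values():
            yield from internal_nodes(child)

def reduced_error_pruning(tree, X_val, y_val):
    while True:
        base = accuracy(tree, X_val, y_val)          # pre-prune performance
        best_node, best_acc = None, base
        for node in internal_nodes(tree):
            # Step 2: tentatively prune, turning the node into a majority leaf
            node.pruned_label = majority_label(node.labels)
            acc = accuracy(tree, X_val, y_val)       # step 3: post-prune performance
            node.pruned_label = None                 # undo the tentative prune
            if acc >= best_acc:                      # step 4: no worse than before
                best_node, best_acc = node, acc
        if best_node is None:                        # any prune would be harmful...
            return tree                              # ...so we stop
        best_node.pruned_label = majority_label(best_node.labels)
```

Note how the "no worse than" criterion from step 4 shows up as acc >= best_acc, and how the loop stops exactly when no candidate passes it, i.e., when any further pruning would hurt the accuracy on the validation set.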

Final Thoughts on Reduced Error Pruning

So, we have seen that the role of the validation set is to guide the pruning process.

This approach is only useful if we have a large set of data.

And if at any point I made it sound like this is a flawless and brilliant approach, I am sorry, because:

The main drawback of this approach is that when the data is limited, withholding a part of it for the validation set reduces the number of examples available for the actual training.

This means that the tree would simply not be trained well, as the remaining training data is quite limited!

What is Next?

In the next post we will talk about another pruning method that can fight over-fitting in decision trees, which can be used even if the data is limited. It is called the Rule Post-Pruning Approach.

Until then,

On behalf of MLDawn,

Take Care 😉
