What is this post about, and what is next?
Hi there!!! Let’s cut to the chase, shall we?! You have this decision tree algorithm, coded beautifully through Scikitlearn (I applaud you!), or from scratch using numpy (I am proud of you!). You let the bloody thing train, and train and you see the training error (say the error in classification) keep going down on your training data. You are happy and proud (You can feel the lurking evil, can’t you?)
Then you want to test you decision tree on some new data and see how well could it generalize its knowledge to the unseen test data. To your surprise your decision tree performs quite poorly! You think you have done something wrong, so you again test your tree on the training data to see whether it can still perform beautifully on that data, and it still does!!! What the hell?
So, what the heck is going on? Where have you gone wrong?
My friends, you have fallen victim to a nasty phenomenon in machine learning called: Over-Fitting!
In this post and the subsequent few posts, you will learn about over-fitting and some common ways to address it, while training a decision tree algorithm. More specifically, you will learn the following in this post and a few posts after this:
- What over-fitting is!
- Common techniques to fight it in decision trees! Such as : Reduced Error Pruning and Rule Post-Pruning
Let’s start 🙂
What is Over-Fitting?
Let’s start with this quote:
A proper fit is a good-fit, not an over-fit, not even an under-fit! A good fit!
You do remember that the traditional decision tree , also known as the ID3 algorithm, keeps up the splitting until literally, and I mean literally, all the training examples are classified correctly! What a nice algorithm indeed (I am smirking)!
– So the training error will always go down to zero?
– Isn’t this great?
– This can lead to a disaster!
In general, learning a training set perfectly can lead to a disaster: A tree that cannot generalize its knowledge to the unseen data during testing. More specifically, learning a training set perfectly (over-fitting to it!) can harm us in 2 cases:
- If the training data has errors
- If the training data is too small to be a good representative of the actual distribution under study!
We need to understand how each of these cases can affect the predictive capabilities of a trained decision tree on the unseen test data.
Nothing like a visual example can help this sink. Check this out in Figure. 1 below:
Figure 1 ) The Effect of Having Noisy Labels in the Training Set and the Result After Correcting the Noisy Labels
a) A training set of 2 types of cars: Sports and Family cars and the corresponding trained decision tree
b) Same data and tree, but we just realized that 2 of the data entries where labeled wrongly!
c) During testing, we might easily make a mistake on the test data because of the faulty training examples
d) The corrected training set and the corresponding corrected decision tree
Figure 1-a shows a traditional 2 dimensional training data of 2 types of cars. Color red denotes the family cars (F) while yellow is for the sports car. That is right! This is a binary classification problem. If you notice, every car has only 2 features (aka. attributes): car price and mileage! This is the question:
Given the mileage and price of a car, can we classify it into either family or sports car? If there is a hidden relationship, can our decision tree learn it?
you can see the trained decision tree and also the corresponding green dash line that shows the splitting condition at every split node of the decision tree. So far so good! Our tree can classify our training data perfectly. Why? Because it has reached leaf nodes that are by definition completely pure! In other words, it has grown as much as possible to make sure it can train each and every one of our training examples perfectly with 0 error!
But what if we have learned a faulty training set perfectly?! What will happen to the predictive power of our tree on the unseen test data? What will happen to its generalization power?
Short Answer: We are potentially screwed at this point!
Let’s see what we mean by error in the training set. Figure 1-b shows that 2 of our data points have been mistakenly labeled as family cars. Now ask yourself:
My decision tree has learned the training data perfectly! But the data has 2 mislabeled entries! So, is my tree still valid?
Your tree has learned a partially wrong data perfectly. Let’s test the tree! Figure 1-c shows the red and yellow areas of the input space the way your tree perceives it. Imagine our unseen test data is indeed quite similar to the mislabeled data! It is indeed a sports car! But guess what? Your tree perceives it as a family car as it falls in the red region of the input space (You can verify this by navigating the tree as well). You have over-fitted the erroneous training data, hence your tree performed poorly on the unseen test data. Figure 1-d shows the corrected dataset and the corresponding, simpler and corrected tree. Note that this tree will classify the aforementioned unseen test data, correctly as sports car.
The second case where learning the training data perfectly can put us in trouble during testing is if the training data is quite insufficient in representing the concept that we are trying to learn. This means that we might not have enough training data. As a result, learning this incomplete data perfectly can make the tree make big mistakes on the test data that might indeed belong to the area of the input space from which we have had NO examples during training our tree.
This issue is beautifully demonstrated in Figure.2 below:
Figure 2 ) The Effect of Having Insufficient Training Data and the Result After Adding the Missing Data to the Training Set
a) A training set of 2 types of cars: Sports and Family and the corresponding decision tree
b) The little blue circle reveals 6 sports cars that are missing from our training data.
c) The result on the test data when the tree has over-fitted a highly insufficient training data
d) The corrected training set and the corresponding corrected tree
Figure 2-a shows the same dataset and tree that we have seen before. However, this time, instead of having noisy labels, we have an incomplete training set. Figure 2-b shows all the data regarding the sports cars that are currently missing in our training set, and our trained tree is naturally oblivious about all of them (can’t learn what you can’t see, right?). Figure 2-c demonstrates the perception of our trained tree (on the incomplete dataset) of the input space. So, the red area is where the tree thinks the family cars belong to and the yellow area belongs to the sports cars.
Now, look at the unseen test data in the figure! This test data is actually quite similar to the data that is missing from our training data, which belong to the sports car class, and it indeed belongs to the sports car class! However, the tree will predict it as a family car (you can verify this by navigating through the tree)! So you can see how doing great on an incomplete training set can disable the algorithm to generalize its knowledge to the unseen data. Figure 2-d is where the missing data is added to our training set and the tree has been trained with this new and more complete training set. You can also see the resulted corrected tree!
So, when an algorithm (let’s say a decision tree) is said to have over-fit the training data?
A decision tree (T) is said to have over-fit the training data, if there exists some other tree, (T’), such that even though T has a smaller error on the training set compared to T’; T’ has indeed a smaller error on the entire distribution of the data (Where the unseen test data can occur) compared to T. So, T’ has more generalization power than T and T has over-fit the training data!
Perfect. I do hope that this has been informative for you!