Which Attribute to Choose? (Part 1)

In our last post, we introduced the idea of decision trees (DTs) and gave you the big picture. Now it is time to get into some of the details. For example, how does a DT choose the best attribute from the dataset? There must be a way for the DT to compare the worth of each attribute and figure out which one leads to purer subtables (i.e., more certainty). There is indeed a famous quantitative measure for this, called Information Gain. But in order to understand it, we first have to learn about the concept of Entropy. As a reminder, here is our training set:

[Figure: the training set from the previous post]

What is Entropy?

The entropy of a set of examples tells us how pure that set is! For example, suppose we have two sets of fruits: 1) 5 apples and 5 oranges, and 2) 9 apples and 1 orange. We say that set 2 is much purer (i.e., has much lower entropy) than set 1, as it consists almost purely of apples. Set 1, however, is a half-and-half situation and is very impure (i.e., has much higher entropy), as neither apples nor oranges dominate! Now, back to the adults' world and enough with fruits :)
In a binary classification problem, such as the dataset above, we have two sets of examples: 1) positive (Yes) and 2) negative (No), and this is how we compute the entropy of our dataset:

Entropy(S) = -p₊ log₂(p₊) - p₋ log₂(p₋)

where p₊ and p₋ are the proportions of positive and negative examples in S. Please note that the base of the logarithm is 2. As for the proportions, let’s …
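As a quick sketch, the entropy formula above can be computed in a few lines of Python. The function name and the fruit counts below are purely illustrative, taken from the apples-and-oranges example earlier; by convention, a class with zero examples contributes 0 to the sum:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a set, given the count of each class."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:  # a class with zero members contributes nothing
            p = c / total
            h -= p * math.log2(p)
    return h

set1 = [5, 5]  # 5 apples, 5 oranges: half-and-half, maximally impure
set2 = [9, 1]  # 9 apples, 1 orange: almost pure
print(entropy(set1))  # 1.0
print(entropy(set2))  # ≈ 0.469
```

As expected, the half-and-half set hits the maximum entropy of 1 bit, while the almost-pure set is much closer to 0.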