The online calculator below parses the set of training examples, then computes the information gain for each attribute/feauture. If you are unsure what it is all about or you want to see the formulas, read the explanation below the calculator.
Note: Training examples should be entered as csv list, with semicolon used as separator. First row is considered to be row of labels, first attributes/features labels, then the class label. All other rows are examples. The default data in this calculator is the famous example of data for "Play Tennis" decision tree
Information Gain and Decision Trees
Information Gain is the metric which is particularly useful in building decision trees. A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represent classification rules.1
Let's look at the calculator's default data.
Attributes to be analyzed are:
- Outlook: Sunny/Overcast/Rain
- Humidity: High/Normal
- Wind: True/False
- Temperature: Hot/Mild/Cool
Class label is:
- Play: Yes/No
So, by analyzing the attributes one by one, algorithm should effectifely answer the question: "Should we play tennis?" Thus, in order to perform as less steps as possible, we need to choose the best decision attribute on each step. The one, which gives us maximum information.
How to measure the information which attribute can give us? The one of the ways is to measure the reduction in entropy, and this is exactly what Information Gain metric does.
Lets get back to the example. In our training set we have 5 examples labelled as "No" and 9 examples labelled as "Yes". According to the well-known Shannon Entropy formula, the current entropy is
Now let's imagine we want to classify some example. We decide to test "Windy" attribute first. Technically, we are performing a split on "Windy" attribute.
If the value of "Windy" attribute is "True", we are left with 6 examples. Three of them has "Yes" as Play label, and three of them has "No" as Play label.
Their entropy is
So, if our example under test has "True" as "Windy" attribute, we are left with more uncertainty than before.
Now, if the value of "Windy" attribute is "False", we are left with 8 examples. Six of them has "Yes" as Play label, and two of them has "No" as Play label.
Their entropy is
This is, of course, better than our initial 0.94 bits of entropy (if we are lucky to get "False" in our example under test).
In order to estimate entropy reduction in general, we need to average using the probability to get "True" and "False" attribute values. We have 6 examples with "True" value of "Windy" attribute and 8 examples with "False" value of "Windy" attribute. So, the average entropy after split would be
Thus, our initial entropy is 0.94, and the average entropy after split on "Windy" attribute is 0.892. Hence the information gain as reduction in entropy is
The general formula for the information gain for the attribute a is
- a set of training examples, each of the form where is the value of the attribute or feature of example and is the corresponding class label,
- the entropy of T conditioned on a (Conditional entropy)
The conditional entropy formula is
- the set of training examples of T such for which attribute a is equal to v
Using this approach, we can find information gain for each of the attributes, and find out that the "Outlook" attribute gives us the greatest information gain, 0.247 bits. Now we can conclude that first split on "Windy" attribute was a really bad idea, and the given training examples suggest that we should test on the "Outlook" attribute first.
On a final note. You might think why we need decision tree if we can just provide the decision for each combination of attributes. Of course you can, but even for this small example, total number of combinations is 3*2*2*3=36. From the other side, we have just used a subset of combinations (14 examples) to train our algorithm (by building decision tree) and now it can classify all other combinations without our help. That's the point of machine learning.