Tree Entropy in R Programming
Last Updated: 25 Aug, 2020
Entropy in R Programming is a measure of the impurity or uncertainty present in the data. It is a deciding factor when splitting the data in a decision tree. A pure sample, in which all outcomes belong to one class, has an entropy of zero, while a sample split equally between two classes has an entropy of one. Two major quantities considered when choosing an appropriate split are information gain (IG) and entropy.
Formula :
Entropy = - \sum p(x) \log p(x)
where p(x) is the probability of class x in the sample.
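As a sketch, this formula can be written as a small R helper; the function name `entropy` is our own, not a standard API:

```r
# Shannon entropy (log base 2) of a vector of class labels.
# table() counts each class; dividing by the length gives p(x).
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

entropy(c("bad", "bad", "good", "bad"))  # 0.811 (rounded)
```

A sample containing only one class returns 0, matching the "pure sample" case described above.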
For example, consider a school data set of a decision tree whose entropy needs to be calculated.
| Library available | Coaching joined | Parent's education | Student's performance |
|---|---|---|---|
| yes | yes | uneducated | bad |
| yes | no | uneducated | bad |
| no | no | educated | good |
| no | no | uneducated | bad |
It is clear that a student's performance is affected by three factors: library available, coaching joined, and parent's education. A decision tree can be constructed from these three variables to predict a student's performance, and hence they are called predictor variables. The variable that provides more information is considered the better splitter for the decision tree.
To calculate the entropy of the parent node (Student's performance) with the formula above, the probabilities need to be calculated first. There are four values in the Student's performance column, of which one is good and three are bad.
P_{good} = \frac{\text{good outcomes in sample}}{\text{total outcomes in sample}} = \frac{1}{4} = 0.25
P_{bad} = \frac{\text{bad outcomes in sample}}{\text{total outcomes in sample}} = \frac{3}{4} = 0.75
Hence, the total entropy of the parent node is:
Entropy = - \sum p(x) \log_2 p(x)
= -(P_{good} \log_2 P_{good} + P_{bad} \log_2 P_{bad})
= -(0.25 \log_2 0.25 + 0.75 \log_2 0.75) = 0.811
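This calculation can be checked directly in R with plain arithmetic, no packages needed:

```r
# Entropy of the parent node 'Student's performance':
# 1 good and 3 bad outcomes out of 4.
p_good <- 1 / 4
p_bad  <- 3 / 4
entropy_parent <- -(p_good * log2(p_good) + p_bad * log2(p_bad))
round(entropy_parent, 3)  # 0.811
```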
Information Gain using Entropy
Information gain is the criterion used to decide which variable is best for splitting the data at each node of the decision tree. The IG of every predictor variable is calculated, and the variable with the highest IG is chosen as the splitting variable for that node.
Formula:
\text{Information Gain (IG)} = Entropy_{parent} - (\text{weighted average} \times Entropy_{children})
To calculate the IG of the predictor variable 'coaching joined', first split the parent node on this variable. This produces two parts (child nodes), whose entropies are calculated individually.
The entropy of the left part
There are two possible outputs: good and bad. The left part contains three outcomes in total, two bad and one good. Hence, P_{good} and P_{bad} are calculated again as follows:
P_{good} = \frac{1}{3} \approx 0.333
P_{bad} = \frac{2}{3} \approx 0.667
Entropy_{left} = -(0.667 \log_2 0.667 + 0.333 \log_2 0.333) \approx 0.918
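The same arithmetic for the left child, in R:

```r
# Left child (coaching joined = no): 1 good, 2 bad out of 3 outcomes.
p_good <- 1 / 3
p_bad  <- 2 / 3
entropy_left <- -(p_good * log2(p_good) + p_bad * log2(p_bad))
round(entropy_left, 3)  # 0.918
```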
The entropy of the right part
There is only one outcome in the right part, i.e., a bad performance, so its probability is one. The entropy is therefore 0, because there is only one category the output can belong to.
Calculating the weighted average with Entropy of children
\text{weighted average} \times Entropy_{children} = \frac{\text{no. of outcomes in left child}}{\text{total outcomes in parent}} \times Entropy_{left} + \frac{\text{no. of outcomes in right child}}{\text{total outcomes in parent}} \times Entropy_{right}
There are 3 outcomes in the left child node and 1 in the right child node, while Entropy_{left} has been calculated as 0.918 and Entropy_{right} is 0.
Substituting these values into the formula above gives the weighted average for this example:
\text{weighted average} \times Entropy_{children} = \frac{3}{4} \times 0.918 + \frac{1}{4} \times 0 \approx 0.689
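The weighted average can be verified in R, computing the left-child entropy from its class probabilities rather than from a rounded value:

```r
# Weighted child entropy for the 'coaching joined' split.
w_left  <- 3 / 4   # 3 of 4 outcomes fall in the left child
w_right <- 1 / 4   # 1 of 4 outcomes falls in the right child
h_left  <- -(1/3 * log2(1/3) + 2/3 * log2(2/3))  # ~0.918
h_right <- 0                                     # pure node
weighted_children <- w_left * h_left + w_right * h_right
round(weighted_children, 3)  # 0.689
```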
Calculating IG
Now substitute the calculated weighted average into the IG formula to obtain the IG of 'coaching joined':
IG(\text{coaching joined}) = Entropy_{parent} - (\text{weighted average} \times Entropy_{children})
IG(\text{coaching joined}) = 0.811 - 0.689 = 0.122
Using the same steps and formula, the IG of each remaining predictor variable is calculated and compared, and the variable with the highest IG is selected for splitting the data at each node.
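Putting all the steps together, here is a sketch of a generic information-gain helper in base R. The data frame, column names, and function names are our own for illustration, not a standard API:

```r
# Shannon entropy (log base 2) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information gain obtained by splitting `target` on `feature`:
# parent entropy minus the weighted average of the child entropies.
info_gain <- function(df, feature, target) {
  h_parent <- entropy(df[[target]])
  parts    <- split(df[[target]], df[[feature]])
  weights  <- sapply(parts, length) / nrow(df)
  h_parent - sum(weights * sapply(parts, entropy))
}

# The worked example: 'coaching joined' vs 'student's performance'.
students <- data.frame(
  coaching    = c("yes", "no", "no", "no"),
  performance = c("bad", "bad", "good", "bad")
)
round(info_gain(students, "coaching", "performance"), 3)  # 0.123
```

The unrounded result is ≈ 0.1226, which matches the hand calculation 0.811 − 0.689 ≈ 0.122 up to rounding. Calling `info_gain` once per predictor column and taking the maximum reproduces the variable-selection step described above.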