Decision Trees are among the most powerful and popular algorithms for both regression and classification tasks. They have a flowchart-like structure and fall under the category of supervised learning algorithms. Because a decision tree can be visualized like a flowchart, it closely mimics human decision-making, which is why decision trees are easy to understand and interpret.
What is a Decision Tree?
Decision Trees are a type of tree-structured classifier. They have three types of nodes:
- Root Nodes
- Internal Nodes
- Leaf Nodes
The root node is the primary node that represents the entire sample, which is further split into several other nodes. The internal nodes represent tests on an attribute, while the branches represent the outcomes of those tests. Finally, the leaf nodes denote the class labels: the decision reached after evaluating the attributes along the path.
How do Decision Trees work?
Decision trees classify instances by sorting them down the tree from the root node to a leaf node; this is known as the top-down approach. Once a particular data point is fed into the decision tree, it passes through a sequence of nodes, answering a yes/no question at each one, until it reaches its designated leaf node.
Each internal node in the decision tree represents a test on an attribute, and each branch descending to a child node corresponds to one of the possible answers to that test. In this way, over a series of such tests, the decision tree predicts a value in a regression task or assigns a class in a classification task.
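The question-answering descent described above can be sketched as a chain of yes/no tests. Here is a toy, hand-written example; the thresholds and feature names are purely illustrative, not learned from data:

```python
def classify_iris(petal_length_cm, petal_width_cm):
    """Toy hand-written decision tree: each `if` is an internal node test,
    each `return` a leaf node. Thresholds are illustrative only."""
    if petal_length_cm < 2.5:        # root node: test on petal length
        return "setosa"              # leaf node
    if petal_width_cm < 1.8:         # internal node: test on petal width
        return "versicolor"          # leaf node
    return "virginica"               # leaf node

print(classify_iris(1.4, 0.2))       # -> setosa (short petal, first branch)
```

A real decision tree learns these tests and thresholds from the training data instead of having them written by hand.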
Decision Tree Implementation
Now that we have the basics of a decision tree, let us walk through one of its implementations in Python programming.
In the following example we are going to use the famous “Iris Flower” dataset. Originally published by Ronald Fisher in 1936 and hosted at the UCI Machine Learning Repository (Link: https://archive.ics.uci.edu/ml/datasets/Iris), this small dataset is widely used for testing out machine learning algorithms and visualizations.
The dataset has a total of 150 rows and 5 columns, of which 4 columns are the attributes or features and the last column is the Iris flower species. Iris is a genus of flowering plants. The four attributes, measured in cm, are:
- Sepal Length
- Sepal Width
- Petal Length
- Petal Width
These four features are used to classify the type of Iris flower depending upon its size and shape. The 5th and last column holds the Iris flower class: Iris Setosa, Iris Versicolor or Iris Virginica.
For our problem, we have to build a machine learning model using the Decision Tree algorithm that learns from the features and classifies each sample into one of the Iris flower classes.
Let us go through its implementation in Python, step by step:
Step 1: Importing the libraries
The first step in building any machine learning model in Python is to import the necessary libraries such as NumPy, Pandas and Matplotlib. The tree module is imported from the sklearn library to visualize the Decision Tree model at the end.
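A minimal sketch of the imports this step describes; the aliases (np, pd, plt) are the conventional ones and an assumption, since the article's code is not shown:

```python
import numpy as np                 # numerical arrays
import pandas as pd                # tabular data handling
import matplotlib.pyplot as plt    # plotting
from sklearn import tree           # used later to visualize the fitted tree
```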
Step 2: Importing the dataset
Once we have imported the Iris dataset, we store the .csv file in a Pandas DataFrame, from which we can easily access the rows and columns of the table. The first four columns of the dataframe are the independent variables, or features, which the decision tree classifier will learn from; these are stored in the variable X.
The dependent variable, which is the Iris flower class with its 3 species, is stored in the variable y. The dataset is inspected by printing the first 5 rows.
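A sketch of this step. The article loads a .csv file; since that file is not bundled here, `load_iris(as_frame=True)` is used as a self-contained stand-in that yields the same 150 × 5 table:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in for reading the iris .csv into a DataFrame: load_iris(as_frame=True)
# returns the same 150 rows with 4 feature columns plus a target column.
iris = load_iris(as_frame=True)
df = iris.frame

X = df.iloc[:, :4].values      # independent variables: the 4 features
y = df.iloc[:, 4].values       # dependent variable: species encoded as 0/1/2

print(df.head())               # inspect the first 5 rows
```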
Step 3: Splitting the dataset into the Training set and Test set
In this step, after reading the dataset, we split it into a training set, on which the classifier model will be trained, and a test set, on which the trained model will be evaluated. The predictions on the test set are compared against the true labels to measure the accuracy of the trained model.
Here, we have used a test size of 0.25, which means that 25% of the dataset is randomly split off as the test set and the remaining 75% forms the training set used to train the model. Hence, out of 150 datapoints, 38 randomly chosen datapoints form the test set and the remaining 112 samples form the training set.
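A sketch of the split this step describes, assuming scikit-learn's `train_test_split`; the fixed `random_state=0` is an assumption added for reproducibility:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# test_size=0.25: 38 of the 150 rows are held out at random for testing,
# leaving 112 rows for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(X_train.shape, X_test.shape)   # -> (112, 4) (38, 4)
```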
Step 4: Training the Decision Tree Classification model on the Training Set
Once the dataset has been split and is ready for training, the DecisionTreeClassifier class is imported from the sklearn library and fitted on the training variables (X_train and y_train) to build the model. During this training process, the classifier grows the tree greedily, at each node choosing the attribute split that best reduces an impurity measure such as the Gini index; unlike neural networks, no gradient descent or backpropagation is involved.
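A sketch of the training step under the same assumptions as above (scikit-learn workflow, `random_state=0` fixed for reproducibility):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# criterion="gini" is scikit-learn's default impurity measure; the tree is
# grown by greedy recursive splitting of the 112 training samples
classifier = DecisionTreeClassifier(criterion="gini", random_state=0)
classifier.fit(X_train, y_train)
print(classifier.get_depth())    # depth of the learned tree
```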
Step 5: Predicting the Test Set Results
As we have our model ready, shouldn’t we check its accuracy on the test set? This step tests the model built using the decision tree algorithm on the test set that was split earlier. The results are stored in a variable, “y_pred”.
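A sketch of the prediction step; the variable name `y_pred` matches the article's text, and the rest of the pipeline is repeated so the snippet is self-contained:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
classifier = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# One prediction per test-set row: 38 predicted class labels
y_pred = classifier.predict(X_test)
print(y_pred[:10])
```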
Step 6: Comparing the Real Values with Predicted Values
This is another simple step, where we build a small dataframe with two columns: the real values of the test set on one side and the predicted values on the other. This lets us compare the results produced by the model against the ground truth.
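A sketch of the side-by-side comparison; the column names are assumptions, and the pipeline is repeated for self-containment:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
classifier = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Two-column table: ground truth next to the model's predictions
comparison = pd.DataFrame({"Real Values": y_test, "Predicted Values": y_pred})
print(comparison.head())
```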
Step 7: Confusion Matrix and Accuracy
Now that we have both the real and predicted values for the test set, let us build a confusion matrix and calculate the accuracy of our model using simple library functions from sklearn. The accuracy score is computed from the real and predicted values of the test set. The model built using the above steps gives an accuracy of 92.1%, reported as 0.92105 in the step below.
The confusion matrix is a table that shows the correct and incorrect predictions for a classification problem. Read simply, the values along the diagonal are correct predictions and the values off the diagonal are incorrect predictions.
Counting over the 38 test-set datapoints, we get 35 correct and 3 incorrect predictions, which corresponds to the reported 92% accuracy. The accuracy can be improved by tuning the hyperparameters passed as arguments to the classifier before training the model.
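A sketch of the evaluation step using sklearn's `confusion_matrix` and `accuracy_score`. Note that with a different random split (the article's split is not reproducible from the text) the exact score may differ from the 92.1% reported above:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
classifier = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Rows are true classes, columns predicted classes; the diagonal holds
# the correct predictions
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print(cm)
print(acc)    # exact value depends on the random split
```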
Step 8: Visualizing the Decision Tree Classifier
Finally, in the last step we visualize the Decision Tree we built. Looking at the root node, the number of “samples” is 112, matching the training-set size from the earlier split. The Gini index is calculated at each split of the decision tree algorithm, and the per-class counts of the 3 classes are shown in the “value” field at each node.
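A sketch of the visualization using sklearn's `tree.plot_tree`; the figure size, output filename and non-interactive backend are assumptions for a scripted run:

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend for scripted runs
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)
classifier = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Each node box shows the split test, the gini value, the 'samples' count
# (112 at the root) and 'value', the per-class sample counts
plt.figure(figsize=(12, 8))
nodes = tree.plot_tree(classifier,
                       feature_names=iris.feature_names,
                       class_names=list(iris.target_names),
                       filled=True)
plt.savefig("iris_tree.png")
```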
Hence, in this way, we have understood the concept of the Decision Tree algorithm and built a simple classifier to solve a classification problem with it.
If you’re interested in learning more about decision trees and machine learning, check out IIIT-B & upGrad’s PG Diploma in Machine Learning & AI, which is designed for working professionals and offers 450+ hours of rigorous training, 30+ case studies & assignments, IIIT-B Alumni status, 5+ practical hands-on capstone projects & job assistance with top firms.