Decision Trees are among the most powerful and popular algorithms for both regression and classification tasks. They have a flowchart-like structure and fall under the category of supervised learning algorithms. Because a decision tree can be visualized like a flowchart, it closely mimics human decision-making, which is why decision trees are easy to understand and interpret.
What is a Decision Tree?
Decision Trees are tree-structured classifiers. They have three types of nodes:
- Root Nodes
- Internal Nodes
- Leaf Nodes
The root node is the topmost node and represents the entire sample, which is then split into further nodes. Internal nodes represent tests on attributes, while the branches represent the outcomes of those tests. Finally, the leaf nodes hold the class labels, which are the decisions reached after evaluating all the attributes.
How do Decision Trees work?
Decision trees classify instances by sorting them down the tree from the root node to a leaf node; this is known as the top-down approach. Once a particular data point is fed into the decision tree, it passes through a sequence of nodes, answering a Yes/No question at each one, until it reaches its designated leaf node.
Each node in the decision tree represents a test on an attribute, and each branch descending from a node corresponds to one of the possible answers to that test. In this way, over successive splits, the decision tree predicts a value in a regression task or assigns a class in a classification task.
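The traversal described above can be sketched in plain Python. The tree structure, attribute names, and thresholds below are purely illustrative and not taken from the article:

```python
def classify(node, sample):
    """Walk from the root to a leaf, answering one yes/no test per node."""
    while "label" not in node:          # internal nodes carry a test
        answer = "yes" if sample[node["attribute"]] <= node["threshold"] else "no"
        node = node[answer]             # follow the branch for that answer
    return node["label"]                # leaf nodes carry the class label

# A hypothetical two-level tree for Iris-like measurements (in cm):
tree_root = {
    "attribute": "petal_length", "threshold": 2.5,
    "yes": {"label": "setosa"},
    "no": {
        "attribute": "petal_width", "threshold": 1.7,
        "yes": {"label": "versicolor"},
        "no": {"label": "virginica"},
    },
}

print(classify(tree_root, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
```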
Decision Tree Implementation
Now that we have the basics of a decision tree, let us walk through one of its implementations in Python.
In the following example we are going to use the famous “Iris Flower” dataset. Introduced by the statistician Ronald Fisher in 1936 and now hosted at the UCI Machine Learning Repository (Link: https://archive.ics.uci.edu/ml/datasets/Iris), this small dataset is widely used for testing machine learning algorithms and visualizations.
The dataset has a total of 150 rows and 5 columns, of which 4 columns are the attributes or features and the last column is the Iris flower species. Iris is a genus of flowering plants. The four attributes, all measured in cm, are:
- Sepal Length
- Sepal Width
- Petal Length
- Petal Width
These four features capture each flower’s size and shape and are used to classify its species. The 5th and last column contains the Iris flower class: Iris Setosa, Iris Versicolor, or Iris Virginica.
For our problem, we have to build a machine learning model using the Decision Tree algorithm to learn the features and classify samples by Iris flower class.
Let us go through the implementation in Python, step by step:
Step 1: Importing the libraries
The first step in building any machine learning model in Python is to import the necessary libraries such as NumPy, Pandas and Matplotlib. The tree module is imported from the scikit-learn (sklearn) library to visualize the Decision Tree model at the end.
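A typical import cell for this walkthrough might look as follows; the article does not show its exact code, so treat this as a reasonable reconstruction:

```python
import numpy as np                  # numerical arrays
import pandas as pd                 # DataFrames for tabular data
import matplotlib.pyplot as plt     # plotting
from sklearn import tree            # utilities to visualise the fitted tree
```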
Step 2: Importing the dataset
Once we have downloaded the Iris dataset, we load the .csv file into a Pandas DataFrame, from which we can easily access the rows and columns of the table. The first four columns of the DataFrame are the independent variables, or features, which the decision tree classifier will learn from; they are stored in the variable X.
The dependent variable, the Iris flower class with its 3 species, is stored in the variable y. The dataset is inspected by printing the first 5 rows.
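The article loads the data from a .csv file; to keep this sketch self-contained, the version below uses scikit-learn’s bundled copy of the same Iris dataset instead:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame                         # 150 rows: 4 feature columns + "target"

X = df.iloc[:, :4].values               # the four measurements, as a NumPy array
y = iris.target_names[iris.target]      # species names as strings

print(df.head())                        # inspect the first 5 rows
```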
Step 3: Splitting the dataset into the Training set and Test set
In the following step, after reading the dataset, we split it into a training set, on which the classifier model will be trained, and a test set, on which the trained model will be evaluated. The predictions on the test set are then compared with the true labels to check the accuracy of the trained model.
Here, we use a test size of 0.25, which means that 25% of the dataset is randomly held out as the test set and the remaining 75% forms the training set. Hence, out of 150 data points, 38 randomly chosen data points make up the test set and the remaining 112 samples are used for training.
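The split can be done with scikit-learn’s train_test_split; the random_state value below is an arbitrary choice for reproducibility, not one stated in the article:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# test_size=0.25 holds out 38 of the 150 rows; 112 remain for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(len(X_train), len(X_test))  # 112 38
```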
Step 4: Training the Decision Tree Classification model on the Training Set
Once the data has been split and is ready for training, the DecisionTreeClassifier class is imported from the sklearn library and fitted on the training variables (X_train and y_train) to build the model. Note that, unlike neural networks, decision trees are not trained with gradient descent or backpropagation; instead, the classifier greedily partitions the data, choosing at each node the split that best reduces an impurity measure such as the Gini index.
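Fitting the classifier is then a single call. Here criterion="gini" is scikit-learn’s default shown explicitly, and random_state=0 is our assumption for reproducibility:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Greedy recursive partitioning: each split minimises Gini impurity
classifier = DecisionTreeClassifier(criterion="gini", random_state=0)
classifier.fit(X_train, y_train)
```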
Step 5: Predicting the Test Set Results
As we have our model ready, shouldn’t we check its accuracy on the test set? This step tests the model built using the decision tree algorithm on the test set that was split earlier. The results are stored in a variable, “y_pred”.
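Prediction is again a one-line call on the fitted classifier (the setup repeats the earlier assumed split with random_state=0):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)   # one predicted class per test row
print(y_pred[:5])
```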
Step 6: Comparing the Real Values with Predicted Values
This is another simple step: we build a small DataFrame with two columns, the real values of the test set on one side and the predicted values on the other. This lets us compare the model’s results at a glance.
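A minimal sketch of that comparison DataFrame, with the column names being our own choice:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Real labels and predictions side by side, one row per test sample
comparison = pd.DataFrame({"Real Values": y_test,
                           "Predicted Values": y_pred})
print(comparison.head())
```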
Step 7: Confusion Matrix and Accuracy
Now that we have both the real and predicted values for the test set, let us build a confusion matrix and calculate the accuracy of our model using library functions within sklearn. The accuracy score is computed from the real and predicted values of the test set. The model built using the above steps gives an accuracy of 92.1%, reported as 0.92105 in the step below.
The confusion matrix is a table that shows the correct and incorrect predictions for a classification problem: the values on the diagonal are correct predictions, and the off-diagonal values are incorrect predictions.
Of the 38 test set data points, 35 are predicted correctly and 3 incorrectly, which corresponds to the 92% accuracy. The accuracy can be improved by tuning the hyperparameters, which are passed as arguments to the classifier before training the model.
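Both metrics come from sklearn.metrics. Note that the article’s exact figures (35/38, 0.92105) depend on its particular random split; with a different split or seed, such as the assumed random_state=0 below, the numbers will differ slightly:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

cm = confusion_matrix(y_test, y_pred)   # 3x3: one row/column per species
acc = accuracy_score(y_test, y_pred)    # fraction of correct predictions
print(cm)
print(acc)
```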
Step 8: Visualizing the Decision Tree Classifier
Finally, in the last step we visualize the decision tree that was built. Looking at the root node, the number of “samples” is 112, matching the training set split earlier. The Gini index is calculated at each split of the decision tree algorithm, and the distribution of the 3 classes is shown in the “value” field of each node.
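The visualization can be produced with sklearn’s plot_tree (again using the assumed random_state=0 split, so the root node shows samples = 112):

```python
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X_train, y_train)

# Draw the tree: each box shows the test, gini, samples, value and class
fig, ax = plt.subplots(figsize=(12, 8))
tree.plot_tree(classifier,
               feature_names=iris.feature_names,
               class_names=list(iris.target_names),
               filled=True, ax=ax)
# plt.show()  # or fig.savefig("iris_tree.png")
```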
Hence, in this way, we have understood the Decision Tree algorithm and built a simple classifier to solve a classification problem with it.
If you’re interested in learning more about decision trees and machine learning, check out IIIT-B & upGrad’s PG Diploma in Machine Learning & AI, which is designed for working professionals and offers 450+ hours of rigorous training, 30+ case studies & assignments, IIIT-B alumni status, 5+ practical hands-on capstone projects & job assistance with top firms.
What are the cons of using decision trees?
While decision trees help in the classification or sorting of data, their use sometimes creates a few problems too. Often, decision trees overfit the training data, which makes the final results highly inaccurate on new data. For large datasets, a single decision tree is not recommended because the tree becomes overly complex. Decision trees are also highly unstable: a small change in the dataset can greatly change the structure of the tree.
How does a random forest algorithm work?
A random forest is essentially a collection of diverse decision trees, just like a forest is made up of many trees. The random forest algorithm’s outcome is obtained by aggregating the individual trees’ predictions, which also reduces the likelihood of over-fitting the data. Random forest classification employs an ensemble approach: multiple decision trees are trained on bootstrap samples of the training data, and when nodes are split, only a randomly picked subset of the attributes is considered.
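A minimal sketch of the idea with scikit-learn, reusing the Iris data from above (the parameter values shown are illustrative choices, not prescriptions):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Each tree is trained on a bootstrap sample of the rows, and each split
# considers only a random subset of the features (max_features), which
# decorrelates the trees and reduces over-fitting.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # majority-vote accuracy on the test set
```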
How is a decision table different from a decision tree?
A decision table may be produced from a decision tree, but not the other way around. A decision tree is made up of nodes and branches, whereas a decision table is made up of rows and columns. In decision tables, more than one OR condition can be expressed; in decision trees, this is not the case. Decision tables are only useful when few properties are present; decision trees, on the other hand, can be used effectively with a large number of properties and sophisticated logic.