Do you ever wonder how Netflix picks a movie to recommend to you? Or how Amazon chooses the products to show in your feed?
Both rely on recommendation systems, and the random forest classifier is one of the algorithms that can be used to build them.
The random forest classifier is among the most popular classification algorithms. Today, we’ll learn about this robust machine learning algorithm and see how it works. You’ll also learn about its implementation as we’ll share a step-by-step tutorial on how to use the random forest classifier in a real-life problem.
We’ll also cover the advantages and disadvantages of random forests, how they differ from decision trees, and how to build one with scikit-learn (sklearn).
Random Forest Classifier: An Introduction
The random forest classifier is a supervised learning algorithm which you can use for regression and classification problems. It is among the most popular machine learning algorithms due to its high flexibility and ease of implementation.
Why is the random forest classifier called the random forest?
That’s because it consists of multiple decision trees, just as a forest has many trees. On top of that, it uses randomness to enhance its accuracy and combat overfitting, which can be a serious issue for decision trees. The algorithm builds each decision tree on a random sample of the data and gets a prediction from every tree. After that, it selects the final answer through votes.
It has numerous applications in our daily lives such as feature selectors, recommender systems, and image classifiers. Some of its real-life applications include fraud detection, classification of loan applications, and disease prediction. It forms the basis for the Boruta algorithm, which picks vital features in a dataset.
How does it work?
Assuming your dataset has “m” features, the random forest will randomly choose “k” features where k < m. Among these k features, the algorithm picks the root node by choosing the feature (and split point) that gives the highest information gain.
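To see what “highest information gain” means in practice, here is a minimal sketch that computes it for a toy split. The helper names entropy and information_gain and the four toy labels are our own, chosen only for illustration; this is the textbook definition rather than scikit-learn’s internal code.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical toy labels: two samples of class 0 and two of class 1.
parent = np.array([0, 0, 1, 1])

# A split that separates the classes perfectly removes all uncertainty.
perfect_gain = information_gain(parent, parent[:2], parent[2:])

# A split that leaves both children evenly mixed removes none.
useless_gain = information_gain(parent, np.array([0, 1]), np.array([0, 1]))

print(perfect_gain, useless_gain)
```

A tree-building algorithm would evaluate candidate splits like these and keep the one with the larger gain.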
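After that, the algorithm splits the node into child nodes and repeats the process on every child until the tree is grown. It builds “n” such trees, each on its own bootstrap sample, i.e., rows drawn at random from the training set with replacement. Now you have a forest with n trees. Finally, you’ll perform aggregation, i.e., combine the results of all the decision trees present in your forest.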
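To make these steps concrete, here is a minimal sketch that grows a small forest by hand, assuming scikit-learn and NumPy are installed. The forest size of 25, the seed, and the variable names are our own choices for illustration. Note that scikit-learn draws a fresh random subset of features at every split rather than once per tree, so this approximates the description above and is not the exact procedure.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)   # fixed seed, chosen only for repeatability
n_trees = 25                     # hypothetical forest size

trees = []
for i in range(n_trees):
    # Draw as many rows as the dataset has, with replacement (a bootstrap sample).
    rows = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random subset of the features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[rows], y[rows]))

# Every tree votes, and the most common class wins.
votes = np.stack([tree.predict(X) for tree in trees])   # shape (n_trees, n_samples)
forest_pred = np.array([np.bincount(col).argmax() for col in votes.T])

print("Agreement with the true labels:", (forest_pred == y).mean())
```

In practice you would let RandomForestClassifier do all of this for you; the loop is only meant to show where the randomness enters.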
It’s certainly one of the most sophisticated algorithms as it builds on the functionality of decision trees.
Technically, it is an ensemble algorithm. The algorithm generates the individual decision trees by using an attribute selection measure such as information gain or the Gini index. Every tree relies on an independent random sample. In a classification problem, every tree votes and the most popular class is the end result. On the other hand, in a regression problem, you’ll compute the average of all the tree outputs and that would be your end result.
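The averaging step for regression is easy to check yourself. The sketch below, which assumes scikit-learn is installed and uses synthetic data generated only for this illustration, averages the individual tree predictions by hand and compares them with the forest’s own output.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, generated only for this illustration.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Ask every fitted tree for its own prediction on the first five rows...
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])

# ...and average them by hand.
manual_average = per_tree.mean(axis=0)

# Compare the hand-computed average with the forest's prediction.
print(np.allclose(manual_average, forest.predict(X[:5])))
```

One detail worth knowing: scikit-learn’s RandomForestClassifier averages the class probabilities predicted by the trees instead of counting one hard vote per tree, so its result can occasionally differ from a strict majority vote.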
A random forest Python implementation is much simpler and more robust than many other non-linear algorithms used for classification problems.
The following example will help you understand how you use the random forest classifier in your day-to-day life:
Suppose you wanted to buy a new car and you ask your best friend Supratik for his recommendations. He would ask you about your preferences, your budget, and your requirements and would also share his past experiences with his car to give you a recommendation.
Here, Supratik is using the decision tree method to give you feedback based on your responses. After his suggestions, you still feel unsure about his advice, so you ask Aditya for his recommendations, and he also asks you about your preferences and other requirements.
Suppose you iterate this process and ask ‘n’ friends this question. Now you have several cars to choose from. You gather all the votes from your friends and decide to buy the car that has the most votes. You have now used the random forest method to pick a car to buy.
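However, a single decision tree that keeps splitting is prone to overfitting, because every further split makes its rules more specific to the data it was trained on. Random forest combats this issue by using randomness: each tree sees a different random sample, so the quirks of individual trees tend to cancel out in the vote.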
Pros and Cons of Random Forest Classifier
Every machine learning algorithm has its advantages and disadvantages. Following are the advantages of the random forest classification algorithm:
- The random forest algorithm is often more accurate than many other non-linear classifiers.
- This algorithm is also very robust because it uses multiple decision trees to arrive at its result.
- The random forest classifier is much less prone to overfitting than a single decision tree because it averages the predictions of many trees, which cancels out much of their individual variance.
- You can use this algorithm for both regression and classification problems, making it a highly versatile algorithm.
- Random forests can cope with missing values. They can replace a missing continuous value with the median or calculate the proximity-weighted average of the missing values to solve this problem.
- This algorithm offers you relative feature importance, which allows you to select the most contributing features for your classifier easily.
Its disadvantages are as follows:
- This algorithm can be substantially slower than simpler classification algorithms because it uses multiple decision trees to make predictions. When a random forest classifier makes a prediction, every tree in the forest has to make a prediction for the same input and vote on it. This process can be very time-consuming.
- Because of its slow pace, random forest classifiers can be unsuitable for real-time predictions.
- The model can be quite challenging to interpret in comparison to a decision tree. With a single tree, you can explain a prediction by following the path from the root to a leaf. That’s not possible in a random forest, as it combines many decision trees.
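The relative feature importance mentioned above is available in scikit-learn through the feature_importances_ attribute of a fitted forest. Here is a short sketch on the iris dataset; the variable names are our own, and the exact scores depend on the random seed.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

# feature_importances_ holds one impurity-based score per feature; the scores sum to 1.
ranked = sorted(zip(iris.feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Features at the top of the list contributed most to the splits, which makes them natural candidates to keep if you need a smaller model.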
Difference between Random Forest and Decision Trees
A decision tree, as the name suggests, is a tree-like flowchart with branches and nodes. At every node, the algorithm splits the data based on the input features and generates multiple branches as output. It’s an iterative process: each split adds branches and separates the data further. This process repeats itself until it creates a node where almost all of the data belongs to the same class and no more splits are possible.
On the other hand, a random forest uses multiple decision trees, thus the name ‘forest’. It gathers votes from the various decision trees it used to make the required prediction.
Hence, the primary difference between a random forest classifier and a decision tree is that the former uses a collection of the latter. Here are some additional differences between the two:
- Decision trees are prone to overfitting, while random forests are much less so. That’s because random forest classifiers train each tree on a random subset of the data and features to counter this problem.
- Decision trees are faster than random forests. Random forests use multiple decision trees, which takes a lot of computation power and thus, more time.
- Decision trees are easier to interpret than random forests. You can easily convert a decision tree into a set of rules, but it’s rather difficult to do the same with a random forest.
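To illustrate the last point, scikit-learn can print a fitted decision tree as plain-text rules with export_text. The sketch below uses the iris dataset and a depth limit of 2, which is an arbitrary choice of ours to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# A shallow tree (max_depth=2 is an arbitrary choice) keeps the printed rules short.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Turn the fitted tree into indented if/else style rules.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

A random forest with 100 trees would give you 100 such rule sets, which is why it is much harder to explain a single prediction.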
Building the Algorithm (Random Forest Sklearn)
In the following example, we implement a random forest in Python by using the scikit-learn library. You can follow the steps of this tutorial to build a random forest classifier of your own.
About 80% of any data science task goes into preparing the data, which includes data cleaning, fixing missing values, and much more. However, in this example, we’ll focus solely on the implementation of our algorithm.
First step: Import the libraries and load the dataset
First, we’ll have to import the required libraries and load our dataset into a data frame.
#Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#Importing the dataset
from sklearn.datasets import load_iris
dataset = load_iris()
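If you prefer to inspect the data as a table, you can build a pandas data frame from the loaded dataset. The following is a small sketch; the names df and species are our own.

```python
import pandas as pd
from sklearn.datasets import load_iris

dataset = load_iris()

# One column per measurement, plus the species name for each row.
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df["species"] = dataset.target_names[dataset.target]

print(df.shape)
print(df["species"].value_counts())
```

The iris dataset has 150 rows, four numeric features, and three species with 50 rows each, so it needs no cleaning before training.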
Second step: Split the dataset into a training set and a test set
After we have imported the necessary libraries and loaded the data, we must split our dataset into a training set and a test set. The training set will help us train the model and the test set will help us determine how accurate our model actually is.
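# Split the data into a training set and a test set (test_size=0.3 and random_state=0 are assumptions)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.3, random_state=0)
# Fit a single decision tree to the training set
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion='entropy', splitter='best', random_state=0)
model.fit(X_train, y_train)
# Output:
# DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
# min_weight_fraction_leaf=0.0, presort=False, random_state=0,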
Third step: Create a random forest classifier
Now, we’ll create our random forest classifier by using Python and scikit-learn.
# Fit the random forest classifier to the training set
# (X_train and y_train come from the train/test split in the second step)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=0)
model.fit(X_train, y_train)
# Output:
# RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
# max_depth=None, max_features='auto', max_leaf_nodes=None,
# min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
# oob_score=False, random_state=0, verbose=0, warm_start=False)
Fourth step: Predict the results and make the confusion matrix
Once we have created our classifier, we can use it to predict the results for the test set, build the confusion matrix, and get the accuracy score for the model. The higher the score, the more accurate our model is.
#Predict the test set results
y_pred = model.predict(X_test)
#Create the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Output:
# array([[16,  0,  0],
#        [ 0, 17,  1],
#        [ 0,  0, 11]])
# Get the accuracy score for your model
model.score(X_test, y_test)
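For reference, here is one way the four steps could fit together in a single runnable script. The test_size=0.3 and random_state=0 values for the split are assumptions, since the tutorial does not state them. The sketch also computes the accuracy directly from the confusion matrix: correct predictions lie on the diagonal, so in the matrix shown above 44 of the 45 test samples (about 97.8%) are classified correctly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# test_size=0.3 and random_state=0 are assumptions, not values stated in the tutorial.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RandomForestClassifier(n_estimators=100, criterion="entropy", random_state=0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

# Correct predictions sit on the diagonal of the confusion matrix.
accuracy_from_cm = np.trace(cm) / cm.sum()

print(cm)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

Your exact numbers may differ slightly depending on your scikit-learn version and the split you choose.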
Random forest classifiers have many applications. They are among the most robust machine learning algorithms and are a must-have in any AI and ML professional’s toolkit.