Everything You Need to Know About Random Forest Algorithm Optimization

Suppose you have built a machine learning program and trained it with a random forest model, but its output is not as accurate as you want it to be. What do you do?

There are three common ways to improve the output of a machine learning model:

  • Improving the input data quality and feature engineering
  • Tuning the hyperparameters of the algorithm
  • Using a different algorithm

But what if you have already used all the data sources available? The next logical step is hyperparameter tuning. If you have built a random forest model on the best data available and want to improve its output further, random forest hyperparameter tuning is the option to try.

Before we delve into random forest hyperparameter tuning, let’s first have a look at hyperparameters and hyperparameter tuning in general.

What are Hyperparameters?

In the context of machine learning, hyperparameters are parameters whose values control the learning process of the model. They are set before training begins and, unlike the model's internal parameters, are not learned from the training data.

In a random forest, hyperparameters include the number of decision trees and the number of features considered at each node split.
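The distinction can be illustrated with a minimal sketch, assuming scikit-learn is the library in use. The synthetic dataset and the specific values below are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hyperparameters are chosen before training, as constructor arguments.
forest = RandomForestClassifier(n_estimators=20, max_features="sqrt", random_state=0)
hyperparams = forest.get_params()
print(hyperparams["n_estimators"], hyperparams["max_features"])

# The fitted trees, by contrast, only exist after training on data.
forest.fit(X, y)
n_fitted_trees = len(forest.estimators_)
print(n_fitted_trees)
```

Here `get_params()` returns the values that were set by hand, while `estimators_` holds the trees that were learned from the data.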

What is Hyperparameter Tuning?

Hyperparameter tuning is the process of searching for an ideal set of hyperparameters for a machine learning problem.

Now that we have seen what hyperparameters and hyperparameter tuning are, let us have a look at the hyperparameters of a random forest and how to tune them.

What is Random Forest Hyperparameter Tuning?

To understand what random forest hyperparameter tuning is, we will have a look at five hyperparameters and the effect of tuning each.

Hyperparameter 1: max_depth

max_depth sets the maximum length of the path between the root node and any leaf node of a tree in the random forest. By tuning this hyperparameter, we can limit the depth to which each tree is allowed to grow. It restricts the growth of the decision trees at a macro level.
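As a rough sketch of this effect, assuming scikit-learn's `RandomForestClassifier` and a synthetic dataset chosen only for illustration, the depth of each fitted tree can be inspected with `get_depth()`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# One forest with a depth limit, one without (max_depth=None lets trees grow freely).
shallow = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0).fit(X, y)
unlimited = RandomForestClassifier(n_estimators=10, max_depth=None, random_state=0).fit(X, y)

# Each fitted tree reports its own depth.
shallow_depths = [tree.get_depth() for tree in shallow.estimators_]
unlimited_depths = [tree.get_depth() for tree in unlimited.estimators_]

print("max depth with limit:   ", max(shallow_depths))
print("max depth without limit:", max(unlimited_depths))
```

No tree in the limited forest should be deeper than the value passed as `max_depth`.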

Hyperparameter 2: max_terminal_nodes

This hyperparameter restricts the growth of a decision tree in the random forest by capping the number of terminal (leaf) nodes. In scikit-learn, the corresponding parameter is called max_leaf_nodes. A node is split only if the tree would still have no more than the specified number of terminal nodes after the split.

For instance, suppose that a tree consists of a single node and the maximum number of terminal nodes is set to four. Since there is only one node to begin with, it will be split, and the tree will grow further. Once the tree has four terminal nodes, splitting stops and the tree does not grow any further. Limiting the number of terminal nodes helps prevent overfitting. However, if the limit is very small, the forest is likely to underfit.
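The limit of four terminal nodes can be checked with a small sketch. It assumes scikit-learn, where this hyperparameter is exposed as `max_leaf_nodes`, and uses a synthetic dataset purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Cap every tree in the forest at four terminal (leaf) nodes.
forest = RandomForestClassifier(n_estimators=10, max_leaf_nodes=4, random_state=0)
forest.fit(X, y)

# Count the leaves of each fitted tree.
leaf_counts = [tree.get_n_leaves() for tree in forest.estimators_]
print(leaf_counts)
```

Every entry in `leaf_counts` should be at most four, since growth stops once the cap is reached.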

Hyperparameter 3: n_estimators

A data scientist always faces the dilemma of how many decision trees to use. One may say that using more trees is the way to go. This often holds true, but it also increases the training and prediction time of the random forest algorithm.

The n_estimators hyperparameter sets the number of trees in the random forest model. In scikit-learn, the default value of n_estimators is 100 (it was ten in versions before 0.22), which means that 100 different decision trees are constructed by default. By tuning this hyperparameter, we can change the number of trees that will be constructed.
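The trade-off between the number of trees and the running time can be explored with a sketch like the following. It assumes scikit-learn; the dataset is synthetic and the candidate values are arbitrary, so the numbers it prints are only illustrative.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

results = {}
for n_trees in (10, 50, 100):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    start = time.perf_counter()
    # Mean accuracy over 3 cross-validation folds.
    accuracy = cross_val_score(forest, X, y, cv=3).mean()
    elapsed = time.perf_counter() - start
    results[n_trees] = (accuracy, elapsed)
    print(f"{n_trees} trees: accuracy={accuracy:.3f}, time={elapsed:.2f}s")
```

On real data, the accuracy gain typically flattens out as trees are added while the running time keeps growing, which is why the value is worth tuning.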

Hyperparameter 4: max_features

This hyperparameter sets the number of features that each tree in the forest considers when looking for the best split at a node. A value of around six is sometimes reported to work well, but the best value depends on the dataset and should be found by experiment. You can also leave max_features at its default, which for scikit-learn's RandomForestClassifier is the square root of the number of features present in the dataset.
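To make the square-root rule concrete, here is a small sketch assuming scikit-learn's `RandomForestClassifier` with `max_features="sqrt"` set explicitly. The 16-feature synthetic dataset is an assumption chosen so that the square root is a whole number.

```python
import math

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 16 features, used only for illustration.
X, y = make_classification(n_samples=200, n_features=16, random_state=0)

forest = RandomForestClassifier(n_estimators=10, max_features="sqrt", random_state=0)
forest.fit(X, y)

# Each fitted tree records how many features it considers per split.
features_per_split = forest.estimators_[0].max_features_
expected = int(math.sqrt(X.shape[1]))
print(features_per_split, expected)
```

With 16 features, each split considers 4 randomly chosen candidates rather than all 16.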

Hyperparameter 5: min_samples_split

This hyperparameter sets the minimum number of samples required to split an internal node. By default, the value of this parameter is two, which means that a node must contain at least two samples before it can be split.
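The effect of a larger value can be verified with a sketch like the one below. It assumes scikit-learn and reads the low-level `tree_` attribute of each fitted tree, where a leaf is marked by a child index of -1. The threshold of 20 and the synthetic dataset are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=5, min_samples_split=20, random_state=0)
forest.fit(X, y)

# For every tree, find the smallest sample count among the nodes that were split.
# Internal (split) nodes are those whose left child index is not -1.
smallest_split_node = min(
    int(t.tree_.n_node_samples[t.tree_.children_left != -1].min())
    for t in forest.estimators_
)
print(smallest_split_node)
```

No node holding fewer than 20 samples is split, so the printed value should be 20 or more.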

How To Do Random Forest Hyperparameter Tuning?

You can carry out random forest hyperparameter tuning manually, by passing different hyperparameter values to the function that creates the model. Random forest hyperparameter tuning is more of an experimental approach than a theoretical one. Thus, you may need to try out different combinations of hyperparameter values and evaluate the performance of each before deciding on one.

For example, suppose you want to tune the number of estimators and the minimum number of samples needed to split a node. In scikit-learn, you can set both when creating the model:

from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(random_state=1, n_estimators=20, min_samples_split=2)

In the above example, the number of estimators is set to twenty instead of the default (100 in scikit-learn 0.22 and later, ten in earlier versions), so the algorithm will create twenty trees in the random forest. Similarly, an internal node will be split only if it has at least two samples, which is the same as the default.
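Since tuning means trying several combinations and comparing them, the manual process can be sketched as a loop. This is a minimal illustration, assuming scikit-learn, a synthetic dataset, and an arbitrary grid of candidate values. The names `param_grid`, `best_score`, and `best_params` are introduced here for the example only.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# Candidate values to try (arbitrary choices for this sketch).
param_grid = {"n_estimators": [10, 20, 50], "min_samples_split": [2, 5, 10]}

best_score, best_params = -1.0, None
for n_estimators, min_samples_split in product(
    param_grid["n_estimators"], param_grid["min_samples_split"]
):
    forest = RandomForestClassifier(
        random_state=1,
        n_estimators=n_estimators,
        min_samples_split=min_samples_split,
    )
    # Evaluate this combination with 3-fold cross-validation.
    score = cross_val_score(forest, X, y, cv=3).mean()
    if score > best_score:
        best_score = score
        best_params = {
            "n_estimators": n_estimators,
            "min_samples_split": min_samples_split,
        }

print(best_params, round(best_score, 3))
```

scikit-learn's `GridSearchCV` automates this kind of exhaustive search. The explicit loop is shown here only to make each step of the manual process visible.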

Conclusion

We hope that this blog helped you understand random forest hyperparameter tuning. There are many other hyperparameters that you can tune to improve the output of the machine learning program. In many cases, hyperparameter tuning is enough to noticeably improve that output.

However, in some cases, even random forest hyperparameter tuning might not prove helpful. In such situations, you will need to consider a different machine learning algorithm, such as linear or logistic regression, KNN, or any other algorithm that you deem fit.
