Random Forest Hyperparameter Tuning: Process Explained with Code Examples

Random Forest is a Machine Learning algorithm that uses decision trees as its base learners. It is easy to use and flexible, and because of its simplicity and versatility it is very widely adopted. It gives good results on many classification tasks, even without much hyperparameter tuning.

In this article, we will focus on how Random Forest works and on the different hyperparameters that can be controlled for optimal results. The need for hyperparameter tuning arises because every dataset has its own characteristics.

These characteristics include the types of variables, the size of the data, whether the target is binary or multiclass, the number of categories in categorical variables, the standard deviation of numerical features, the normality of the data, and so on. Hence, tuning the model according to the data is imperative for maximizing its performance.

Construction and Working

The Random Forest algorithm works as a large collection of decorrelated decision trees and is built on bagging (bootstrap aggregating). Bagging falls in the category of ensemble learning and is based on the idea that averaging many noisy but approximately unbiased models produces a combined model with lower variance. Let us understand how a Random Forest is constructed.

Suppose S is the data matrix used for Random Forest classification, with N instances and features A, B, and C. From this data, random bootstrap subsets are drawn, and one decision tree is trained on each subset. As more subsets are drawn, the number of decision trees in the forest grows accordingly.

The outputs of all the trained decision trees are combined by voting, and the majority-voted class is the final output of the Random Forest. Individual decision trees tend to overfit the data: they are usually low bias but high variance. Random Forest is used precisely to reduce this variance error on the test set.
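To make this concrete, here is a minimal sketch of the same idea, bagging plus majority voting, built by hand on top of sklearn's DecisionTreeClassifier and a toy dataset from make_classification (both are illustrative assumptions, not part of the original article):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy dataset, assumed for illustration only
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

n_trees = 25
rng = np.random.default_rng(42)
trees = []

# Bagging: train each tree on a bootstrap sample (sampling rows with replacement)
for _ in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier()
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Majority vote across the trees gives the ensemble's prediction
all_preds = np.array([t.predict(X) for t in trees])          # shape: (n_trees, n_samples)
majority_vote = (all_preds.mean(axis=0) >= 0.5).astype(int)  # works for a binary target
print("Training accuracy of the bagged ensemble:", (majority_vote == y).mean())

RandomForestClassifier adds one more ingredient on top of this sketch: it also samples a random subset of features at each split, which is what decorrelates the trees.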

Hyperparameters

There are various hyperparameters that can be controlled in a random forest:

  1. N_estimators: The number of decision trees built in the forest. The default value in sklearn is 100. n_estimators is usually related to the size of the data: to capture the trends in a larger dataset, a larger number of trees is needed.
  2. Criterion: The function used to measure the quality of a split in a decision tree. For classification, the supported criteria are gini (Gini impurity) and entropy (information gain); the default is gini. For regression, Mean Squared Error (MSE) or Mean Absolute Error (MAE) can be used, with MSE (called squared_error in recent sklearn versions) as the default.
  3. Max_depth: The maximum number of levels allowed in a decision tree. If set to None, each tree keeps splitting until its leaves are pure or another stopping criterion (such as min_samples_split) is reached.
  4. Max_features: The maximum number of features considered when splitting a node. Common options are sqrt and log2: if the total number of features is n_features, then sqrt(n_features) or log2(n_features) features are considered at each split.
  5. Bootstrap: If set to True, bootstrap samples (drawn with replacement) are used to build each decision tree; otherwise the whole dataset is used for every tree.
  6. Min_samples_split: The minimum number of samples required to split an internal node. The default value is 2. The problem with such a small value is that nodes keep splitting as long as they contain at least 2 samples, so the trees grow very deep. With a more lenient value such as 6, splitting stops earlier and the decision tree won't overfit the data as much.
  7. Min_samples_leaf: The minimum number of samples required in a leaf node of the decision tree. It affects the terminal nodes and helps control the depth of the tree. If a split would leave fewer than min_samples_leaf samples in a child node, the split is not performed and the node remains a leaf. (A short instantiation sketch follows this list.)
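
As a minimal sketch, the hyperparameters above map directly onto constructor arguments of sklearn's RandomForestClassifier; the specific values below are illustrative assumptions, not recommendations:

from sklearn.ensemble import RandomForestClassifier

# Illustrative values only; the right settings depend on the dataset
rf = RandomForestClassifier(
    n_estimators=200,        # number of trees in the forest
    criterion="gini",        # split-quality measure ("entropy" is the alternative)
    max_depth=10,            # maximum depth of each tree (None = grow until pure)
    max_features="sqrt",     # features considered at each split
    bootstrap=True,          # train each tree on a bootstrap sample
    min_samples_split=6,     # minimum samples needed to split an internal node
    min_samples_leaf=3,      # minimum samples required in a leaf
)
# rf.fit(train_features, train_labels)  # training arrays assumed, as in the code later on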

There are other, less important parameters that can also be considered during the hyperparameter tuning process:

n_jobs: the number of processors that can be used for training (-1 means use all available processors).

max_samples: the maximum number (or fraction) of samples drawn from the data to train each decision tree; only used when bootstrap=True.

random_state: fixing the random_state makes the model reproducible, so repeated runs produce the same outputs and accuracy.

class_weight: a dictionary of class weights that can help handle imbalanced datasets.
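
A short sketch of how these secondary parameters are passed; the class weights and sample fraction below are assumptions chosen only for illustration:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_jobs=-1,                   # use all available processors for training
    max_samples=0.8,             # each tree sees at most 80% of the rows (bootstrap=True)
    random_state=42,             # makes results reproducible across runs
    class_weight={0: 1, 1: 5},   # up-weight the minority class of an imbalanced dataset
)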


Hyperparameter Tuning Processes

There are various ways of performing hyperparameter tuning. After the base model has been created and evaluated, the hyperparameters can be tuned to improve specific metrics such as the accuracy or F1 score of the model.

One must check for overfitting and the bias-variance trade-off before and after the adjustments, and the model should be tuned according to the real-world requirement. An overfitting model can be very sensitive to fluctuations in the validation data, so the cross-validation scores, together with their standard deviation, should be checked for possible overfitting before and after tuning.
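As a quick sketch of such a check, assuming a candidate model rf and the training arrays train_features and train_labels used later in the article, the mean and standard deviation of the cross-validation scores can be compared before and after tuning:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the training data
scores = cross_val_score(rf, train_features, train_labels, cv=5, scoring="accuracy")
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
# A high mean with a large deviation suggests the model is sensitive
# to the particular split, i.e. a possible overfit.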

The methods for tuning a Random Forest in Python are covered next.

Randomized Search CV

We can use scikit-learn's RandomizedSearchCV, where we define a grid of parameter values and the random forest model is fitted repeatedly with parameter combinations sampled at random from that grid. Because not every combination is tried, we may not find the absolute best parameters, but we will get the best model among the different models that were fitted and tested.

Source Code:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Create a grid of parameter values that will be sampled at random
random_grid = {
    'bootstrap': [True],
    'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
}

# Use the random grid to search for the best hyperparameters
rf = RandomForestRegressor()  # create the base model

rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=100, cv=5, verbose=2,
                               random_state=42, n_jobs=-1)

rf_random.fit(train_features, train_labels)  # fit starts the search/training process

The randomized search will evaluate parameter combinations through 5-fold cross-validation over 100 iterations and end up with the best parameters it has sampled.
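Once the search has finished, the best sampled combination and the corresponding refitted model can be retrieved through the search object's standard attributes:

# Best combination of parameters found during the random search
print(rf_random.best_params_)

# The model refitted on the full training data with those parameters
best_random = rf_random.best_estimator_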

Grid Search CV

Grid search is used after randomized search to narrow down the range and pin down the best hyperparameters. Now that we know where to focus, we can explicitly run those parameter values through grid search, which evaluates every combination, and obtain the final value for each hyperparameter.

Source Code:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Create the parameter grid based on the results of the random search
param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000]
}

# Create the base model
rf = RandomForestRegressor()

# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=3, n_jobs=-1, verbose=2)

# Fit the grid search to the data
grid_search.fit(train_features, train_labels)

Results after execution:

grid_search.best_params_

{'bootstrap': True,
 'max_depth': 80,
 'max_features': 3,
 'min_samples_leaf': 5,
 'min_samples_split': 12,
 'n_estimators': 100}

best_grid = grid_search.best_estimator_
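
Finally, the tuned model can be compared against an untuned baseline on a held-out test set; test_features and test_labels are assumed to exist alongside the training arrays, and score() here returns the regressor's R^2:

# Compare the tuned model with an untuned baseline on held-out data
base_model = RandomForestRegressor(random_state=42)
base_model.fit(train_features, train_labels)

print("Base model R^2:  %.3f" % base_model.score(test_features, test_labels))
print("Tuned model R^2: %.3f" % best_grid.score(test_features, test_labels))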


Conclusion

We went through the working of a Random Forest model and how each hyperparameter alters the decision trees and hence the Random Forest model as a whole. We also looked at an efficient technique that combines randomized and grid search to reach the best parameters for our model. Hyperparameter tuning is very important, as it helps us control the bias and variance of our model.

If you're interested in learning more about decision trees and Machine Learning, check out IIIT-B & upGrad's PG Diploma in Machine Learning & AI, which is designed for working professionals and offers 450+ hours of rigorous training, 30+ case studies & assignments, IIIT-B alumni status, 5+ practical hands-on capstone projects, and job assistance with top firms.
