In Data Science, validation is probably one of the most important techniques used by Data Scientists to assess the stability of an ML model and evaluate how well it would generalize to new data. Validation ensures that the ML model picks up the relevant patterns from the dataset while successfully filtering out the noise. Essentially, the goal of validation techniques is to make sure ML models strike a good balance between bias and variance.
Today we’re going to discuss at length one such model validation technique – Cross-Validation.
What is Cross-Validation?
Cross-Validation is a validation technique designed to evaluate and assess how the results of statistical analysis (model) will generalize to an independent dataset. Cross-Validation is primarily used in scenarios where prediction is the main aim, and the user wants to estimate how well and accurately a predictive model will perform in real-world situations.
Cross-Validation tests the model during the training phase to help minimize problems like overfitting and underfitting. However, you must remember that both the validation set and the training set must be drawn from the same distribution, or else problems will arise in the validation phase.
Benefits of Cross-Validation
- It helps evaluate the quality of your model.
- It helps to reduce/avoid problems of overfitting and underfitting.
- It lets you select the model that will deliver the best performance on unseen data.
What are Overfitting and Underfitting?
Overfitting refers to the condition when a model becomes too data-sensitive and ends up capturing a lot of noise and random patterns that do not generalize well to unseen data. While such a model usually performs well on the training set, its performance suffers on the test set.
Underfitting refers to the problem when the model fails to capture enough patterns in the dataset, thereby delivering a poor performance for both the training as well as the test set.
Going by these two extremities, the perfect model is one that performs equally well for both training and test sets.
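The contrast between these extremes can be sketched with scikit-learn by fitting polynomial models of increasing degree to synthetic data and comparing training and test scores (the quadratic data, the chosen degrees, and all variable names here are illustrative assumptions, not part of the original text):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical synthetic data: a noisy quadratic relationship
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=1.0, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for degree in (1, 2, 15):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    # R^2 on the training set vs. the held-out test set
    scores[degree] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(degree, scores[degree])
```

The degree-1 model underfits (poor scores on both sets), while the degree-15 model overfits (a high training score that the test score fails to match); the degree-2 model sits between the two extremes.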
Cross-Validation: Different Validation Strategies
Validation strategies are categorized based on the number of splits done in a dataset. Now, let’s look at the different Cross-Validation strategies in Python.
1. Validation set
This validation approach divides the dataset into two equal parts – while 50% of the dataset is reserved for validation, the remaining 50% is reserved for model training. Since this approach trains the model based on only 50% of a given dataset, there always remains a possibility of missing out on relevant and meaningful information hidden in the other 50% of the data. As a result, this approach generally creates a higher bias in the model.
```python
from sklearn.model_selection import train_test_split

# data is your full dataset; half is held out for validation
train, validation = train_test_split(data, test_size=0.50, random_state=5)
```
2. Train/Test split
In this validation approach, the dataset is split into two parts – a training set and a test set. This is done to avoid any overlap between the training set and the test set (if the two sets overlap, the model will be faulty). Thus, it is crucial to ensure that the dataset does not contain any duplicated samples. The train/test split strategy lets you retrain your model on the whole dataset without altering any of the model's hyperparameters.
However, this approach has one significant limitation – the model’s performance and accuracy largely depend on how it is split. For instance, if the split isn’t random, or one subset of the dataset has only a part of the complete information, it will lead to overfitting. With this approach, you cannot be sure which data points will be in which validation set, thereby creating different results for different sets. Hence, the train/test split strategy should only be used when you have enough data at hand.
```python
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> list(y)
[0, 1, 2, 3, 4]
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
```
3. K-fold

As seen in the previous two strategies, there is a possibility of missing out on important information in the dataset, which increases the probability of bias-induced error or overfitting. This calls for a method that reserves abundant data for model training while also leaving sufficient data for validation.
Enter the K-fold validation technique. In this strategy, the dataset is split into ‘k’ number of subsets or folds, wherein k-1 subsets are reserved for model training, and the last subset is used for validation (test set). The model is averaged against the individual folds and then finalized. Once the model is finalized, you can test it using the test set.
Here, each data point appears in the validation set exactly once and in the training set k-1 times. Since most of the data is used for fitting, the risk of underfitting is significantly reduced. Similarly, because every data point is eventually validated against, the risk of overfitting any single split is also reduced.
The K-fold strategy is best for instances where you have a limited amount of data, or where the scores or optimal model parameters differ substantially between splits.
```python
import numpy as np
from sklearn.model_selection import KFold  # import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])  # create the feature array
y = np.array([1, 2, 3, 4])  # create the target array
kf = KFold(n_splits=2)  # define the split – into 2 folds
kf.get_n_splits(X)  # returns the number of splitting iterations in the cross-validator

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
```
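The averaging step described above can be sketched with scikit-learn's cross_val_score helper, which fits the model once per fold and returns one score per fold (the Iris dataset and logistic regression here are illustrative choices, not part of the original text):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold split; shuffling avoids folds that contain only one class
kf = KFold(n_splits=5, shuffle=True, random_state=5)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=kf)  # one accuracy score per fold
print(scores)         # five per-fold accuracies
print(scores.mean())  # the averaged cross-validation score
```

The mean of the five fold scores is the cross-validated estimate of model performance; its spread across folds hints at how sensitive the model is to the particular split.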
4. Leave one out
The leave one out cross-validation (LOOCV) is a special case of K-fold in which k equals the number of samples in the dataset. Here, only one data point is reserved for the test set, and the remaining n-1 points form the training set; the procedure then iterates until every sample in the dataset has served as the test set once. It is the most useful method when there's too little data available.

Since this approach uses all data points for training, the bias is typically low. However, as the validation process is repeated 'n' times (n = number of data points), it leads to greater execution time. Another notable constraint of this method is that testing the model against a single data point can produce a high-variance estimate of model effectiveness; if that data point is an outlier, the variance will be higher still.
```python
>>> import numpy as np
>>> from sklearn.model_selection import LeaveOneOut
>>> X = np.array([[1, 2], [3, 4]])
>>> y = np.array([1, 2])
>>> loo = LeaveOneOut()
>>> for train_index, test_index in loo.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
...     print(X_train, X_test, y_train, y_test)
TRAIN: [1] TEST: [0]
[[3 4]] [[1 2]] [2] [1]
TRAIN: [0] TEST: [1]
[[1 2]] [[3 4]] [1] [2]
```
5. Stratification

Typically, for the train/test split and K-fold strategies, the data is shuffled to create random training and validation splits. Random shuffling, however, can leave each fold with a different target distribution. Stratification fixes this by preserving the target distribution across folds while splitting the data.

In this process, the data is rearranged so that each fold becomes representative of the whole. So, if you are dealing with a binary classification problem where each class makes up 50% of the data, stratification arranges the data so that every fold contains roughly half of the instances from each class.

The stratification process is best suited for small and imbalanced datasets and for multiclass classification problems.
```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, random_state=None)
# X is the feature set and y is the target
for train_index, test_index in skf.split(X, y):
    print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```
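To see stratification preserving class proportions in practice, the sketch below builds a hypothetical imbalanced binary target (80 samples of one class, 20 of the other – illustrative numbers, not from the original text) and counts the classes in each validation fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced binary target: 80 samples of class 0, 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_counts = []
for fold, (train_index, test_index) in enumerate(skf.split(X, y)):
    counts = np.bincount(y[test_index])  # class counts in this validation fold
    fold_counts.append(tuple(counts))
    print("Fold", fold, "class counts:", counts)
```

Because 80 and 20 both divide evenly by 5, every validation fold here keeps the exact 80/20 ratio – 16 samples of class 0 and 4 of class 1 – whereas a plain KFold on this ordered data would put only class 0 into the first folds.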
When to Use Each of These Five Cross-Validation Strategies?
As we mentioned before, each Cross-Validation technique has unique use cases, and hence, they perform best when applied correctly to the right scenarios. For instance, if you have enough data, and the scores and optimal parameters (of the model) for different splits are likely to be similar, the train/test split approach will work excellently.
However, if the scores and optimal parameters vary for different splits, the K-fold technique will be best. For instances where you have too little data, the LOOCV approach works best, whereas, for small and unbalanced datasets, stratification is the way to go.
We hope this detailed article helped you gain an in-depth idea of Cross-Validation in Python.
If you are reading this article, you most likely have ambitions of becoming a Python developer. If you're interested in learning Python and want to get your hands dirty with various tools and libraries, check out IIIT-B's PG Diploma in Data Science.