Model Development is a crucial step in a Data Science Project Life Cycle, in which we fit different types of Machine Learning models, Supervised or Unsupervised depending on the Business Problem, to our data set.
Because many models can be used to solve the same business problem, we need to make sure that whichever model we select at the end of this phase performs well on unseen data. So, we cannot select our best performing model just by looking at evaluation metrics computed on the data the model was trained on.
We need something more than that metric to help us decide on the final Machine Learning model that we deploy to production.
The process of determining whether the mathematical results that quantify relationships between variables are acceptable as descriptions of the data is known as Validation. Usually, an error estimate for the model is made after training it on the train data set, which is better known as the evaluation of residuals.
In this process, we measure the Training Error by calculating the difference between the predicted response and the original response. But this metric cannot be trusted on its own, because it only tells us how well the model fits the data it was trained on. It’s possible that the model is Underfitting or Overfitting the data.
So, the problem with this evaluation technique, or with any other metric computed only on the training data, is that it gives no indication of how well the model will perform on an unseen data set. The technique that helps us learn this about our model is known as Cross-Validation.
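To see why the training error alone can mislead, here is a minimal sketch in plain Python. The toy data and the function names are hypothetical and chosen only for illustration. The model is a 1-nearest-neighbor predictor, which memorizes its training data and therefore has zero training error, while its error on points it has never seen is larger.

```python
import random

def nearest_neighbor_predict(train_x, train_y, x):
    """Predict for x by copying the response of the closest training point."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]

def mean_squared_error(y_true, y_pred):
    """Average squared difference between original and predicted responses."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy data: y = 2x plus Gaussian noise.
rng = random.Random(0)
xs = [i / 10 for i in range(40)]
ys = [2 * x + rng.gauss(0, 1) for x in xs]

# Even positions are used for training, odd positions play the unseen data.
train_x, train_y = xs[::2], ys[::2]
unseen_x, unseen_y = xs[1::2], ys[1::2]

train_pred = [nearest_neighbor_predict(train_x, train_y, x) for x in train_x]
unseen_pred = [nearest_neighbor_predict(train_x, train_y, x) for x in unseen_x]

train_error = mean_squared_error(train_y, train_pred)
unseen_error = mean_squared_error(unseen_y, unseen_pred)

print("training error:", train_error)  # the model reproduces its own training data
print("unseen error:  ", unseen_error)
```

A training error of zero here says nothing about quality: the model has simply memorized the noise.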
In this article, we will get to know the different types of cross-validation techniques and the pros and cons of each. Let’s start with the definition of Cross-Validation.
Cross-Validation is a resampling technique that helps us check how efficient and accurate our model is on unseen data. It evaluates a Machine Learning model by training several copies of it on different subsets of the available input data and evaluating each copy on the complementary subset that it was not trained on.
We have different types of Cross-Validation techniques, but let’s first see the basic procedure they share:
- The first step is to divide the cleaned data set into K partitions (folds) of equal size.
- Then we treat Fold-1 as the test fold and the other K-1 folds as train folds, train the model on the train folds, and compute its score on the test fold.
- We repeat the previous step for every fold, each time taking a different fold as the test fold and the remaining folds as train folds.
- The last step is to take the average of the scores of all the folds.
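The steps above can be sketched in plain Python as follows. This is only an illustration under some assumptions: the helper names (`k_fold_indices`, `cross_validate`, `fit_mean`, `score_mse`) are hypothetical, the "model" is just the mean of the training responses, the score is a mean squared error (lower is better), and the data is not shuffled before it is partitioned, which you would normally do in practice.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(xs, ys, k, fit, score):
    """Use each fold once as the test fold and average the k scores."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return sum(scores) / len(scores), scores

def fit_mean(train_x, train_y):
    """Toy model: always predict the mean of the training responses."""
    return sum(train_y) / len(train_y)

def score_mse(model, test_x, test_y):
    """Mean squared error of the constant prediction on the test fold."""
    return sum((y - model) ** 2 for y in test_y) / len(test_y)

xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mean_score, fold_scores = cross_validate(xs, ys, 3, fit_mean, score_mse)
print(fold_scores)  # [9.25, 0.25, 9.25]
print(mean_score)   # 6.25
```

Passing `fit` and `score` as functions keeps the splitting logic independent of any particular model or metric.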
Types of Cross-Validation
1. Holdout Method
This technique works by setting aside a part of the data set, training the model on the rest, and then asking the trained model for predictions on the part that was set aside. We then calculate the error on those predictions, which tells us how our model is doing on unseen data. This is known as the Holdout Method.
- The held-out data is fully independent of the data used for training.
- This method only needs to be run once, so it has a lower computational cost.
- The performance estimate is subject to higher variance, because it depends on a single split and on a smaller amount of data.
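A holdout split can be sketched in a few lines of plain Python. The function name `holdout_split`, the 25% test fraction, and the toy responses are assumptions made for this example. Repeating the split with different seeds hints at the variance mentioned above, since each split generally gives a different error estimate for the same simple model (the mean of the training responses).

```python
import random

def holdout_split(n_samples, test_fraction=0.25, seed=0):
    """Shuffle the indices once and set aside a fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]  # (train indices, test indices)

# Hypothetical toy responses.
ys = [float(i * i) for i in range(20)]

errors = []
for seed in range(5):
    train_idx, test_idx = holdout_split(len(ys), 0.25, seed)
    # Toy model: predict the mean of the training responses.
    prediction = sum(ys[i] for i in train_idx) / len(train_idx)
    mse = sum((ys[i] - prediction) ** 2 for i in test_idx) / len(test_idx)
    errors.append(mse)

print(errors)  # one error estimate per random split
```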
2. K-Fold Cross-Validation
In a Data-Driven World, there is never enough data to train your model. On top of that, removing a part of it for validation raises the risk of Underfitting: we may lose important patterns and trends in our data set, which in turn increases Bias. So ideally, we require a method that provides ample data for training the model and also leaves ample data for validation.
In K-Fold cross-validation, the data is divided into k subsets. We can think of it as the holdout method repeated k times, such that each time one of the k subsets is used as the validation set and the other k-1 subsets form the training set. The error is averaged over all k trials to get an overall estimate of our model’s performance.
We can see that each data point will be in a validation set exactly once and will be in a training set k-1 times. This helps us reduce bias, as we are using most of the data for fitting, and reduces variance, as every data point is eventually used for validation.
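- It needs only k training runs, so it is far cheaper than exhaustive methods, although costlier than a single holdout split.
- The averaged score is not affected much if an outlier is present in the data, since each point falls in only one validation fold.
- It helps us overcome the variability of an estimate that is based on a single split.
- Imbalanced data sets will distort the results, because some folds may contain few or no samples of a minority class.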
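The claim that each data point lands in a validation set exactly once and in a training set k-1 times can be checked directly. The sketch below assumes a simple shuffled assignment of indices to folds; the function name `shuffled_folds` and the sizes used are hypothetical.

```python
import random
from collections import Counter

def shuffled_folds(n_samples, k, seed=0):
    """Shuffle the indices, then deal them out into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

n_samples, k = 10, 5
folds = shuffled_folds(n_samples, k)

validation_counts = Counter()
training_counts = Counter()
for fold_number, validation_fold in enumerate(folds):
    # This fold is the validation set for the current trial.
    validation_counts.update(validation_fold)
    # Every other fold belongs to the training set for the current trial.
    for other_number, other_fold in enumerate(folds):
        if other_number != fold_number:
            training_counts.update(other_fold)

print(dict(validation_counts))  # every index appears once
print(dict(training_counts))    # every index appears k - 1 = 4 times
```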
3. Stratified K-Fold Cross-Validation
The K Fold Cross Validation technique will not work as expected for an Imbalanced Data set. When we have an imbalanced data set, we need a slight change to the K Fold cross validation technique, such that each fold contains approximately the same proportion of samples of each output class as the complete data set. This variation, which uses strata in K Fold Cross Validation, is known as Stratified K Fold Cross Validation.
- It gives a dependable basis for improving different models through hyper-parameter tuning.
- It helps us compare models.
- It helps in reducing both the Bias and the Variance of the performance estimate.
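One way to build such folds, shown here only as a sketch, is to group the sample indices by class and deal each class out to the folds in round-robin order. The function name `stratified_folds` and the 16-to-4 class split are assumptions for this example, and no shuffling is done.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Assign indices to k folds so that each class is spread evenly across folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the samples of this class out to the folds one at a time.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# Hypothetical imbalanced labels: 80% negative, 20% positive.
labels = ["neg"] * 16 + ["pos"] * 4
folds = stratified_folds(labels, 4)

for fold in folds:
    print(Counter(labels[i] for i in fold))  # 4 "neg" and 1 "pos" in every fold
```

Every fold keeps the 80/20 class ratio of the complete data set, which a plain contiguous split of these labels would not.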
4. Leave-P-Out Cross-Validation
In this approach, we leave p data points out of a total of n data points: the remaining n-p samples are used to train the model and the p points are used as the validation set. This is repeated for every possible combination of p points out of n, and then the error is averaged.
- It has zero randomness, since every possible split is evaluated.
- The Bias will be lower, since almost all of the data is used for training when p is small.
- This method is exhaustive, so it becomes computationally infeasible as n or p grows.
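A minimal sketch of the splits, using `itertools.combinations` from the Python standard library, shows both how the method works and why it becomes expensive. The generator name `leave_p_out_splits` and the sizes used are hypothetical.

```python
from itertools import combinations
from math import comb

def leave_p_out_splits(n_samples, p):
    """Yield every (training indices, validation indices) pair with p points held out."""
    all_indices = range(n_samples)
    for validation in combinations(all_indices, p):
        held_out = set(validation)
        training = [i for i in all_indices if i not in held_out]
        yield training, list(validation)

splits = list(leave_p_out_splits(5, 2))
print(len(splits))  # 10 splits for n = 5, p = 2
print(splits[0])    # ([2, 3, 4], [0, 1])

# The number of models to train grows very quickly with n and p.
print(comb(100, 2))  # 4950
print(comb(100, 5))  # 75287520
```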
In this article, we have learned about the importance of validating a Machine Learning model in the Data Science Project Life Cycle, seen what validation and cross-validation are, explored the different types of Cross-Validation techniques, and looked at some advantages and disadvantages of each.
What is the need for cross-validation in machine learning?
Cross-validation is a machine learning technique in which the available data is repeatedly split into two parts: a training set and a test set. In each round, the training set is used to build the model, and the test set is used to estimate how well the model will perform in production. The reason for doing this is that a model can work great on the training data and still perform poorly on real-world data. If you do not cross-validate your model, you have no reliable way to detect that risk before deployment.
What is k-fold cross validation?
In machine learning and data mining, k-fold cross validation is a form of cross-validation in which the data is divided into k approximately equal subsets, with each of the k subsets used as test data in turn and the remaining k-1 subsets used as training data. K is often 10 or 5. Leave-one-out cross-validation is the special case in which k equals the number of data points. K-fold cross-validation is particularly useful in model selection, since averaging over k folds reduces the variance of the estimate of the generalization error compared with a single holdout split.
What are the advantages of cross validation?
Cross validation is a form of validation in which the data set is repeatedly partitioned into a training set and a test set (or validation set). The test set is used to measure the accuracy of your model on data it was not trained on. In other words, it gives you a methodology to measure how good your model is using only the sample of data you have. Its main advantages are that it gives a more trustworthy estimate of the error on unseen data than the training error does, that every data point is used for both training and validation, and that it provides a fair basis for comparing models.