When you start out in data science and machine learning, there is a tendency to begin with model creation and algorithms and to put off learning how to test a model's effectiveness on real-world data.
Cross-Validation is a model validation technique that improves on simple hold-out validation by training and testing on several different subsets of the data. It helps in understanding the bias-variance trade-off and gives a better picture of how the model performs beyond the data we trained it on. This article is a start-to-end guide to model validation in R and explains why model validation is needed.
The Instability of Learning Models
To understand this, consider three plots, each showing a different model fitted to the same data.
Each plot shows a learned model of how the price of an article depends on its size.
In each case, an equation relating price to size is fitted to the training points and drawn on the plot.
The first plot has a high error even on the training points, so it does not perform well on the test set either. This is "Underfitting": the model is not able to capture the actual pattern in the data.
The second plot captures the dependency of price on size well. Its training error is low, and the relationship generalizes to new data.
In the last plot, the fitted relationship has almost no training error at all. It is built by following every fluctuation in the data points, including the noise, which makes the model very fragile. The fit contorts itself to minimize the training error and so produces an overly complicated pattern. This is known as "Overfitting". Here, there is usually a large gap between the error on the training set and the error on the test set.
In data science, we are always looking for the model that performs best. But it is often hard to tell whether an improved score means that the relationship is captured better or that the model is simply over-fitting the data. Validation techniques help answer this question and lead to models that generalize better.
What is Overfitting & Underfitting?
Underfitting in machine learning means that the model captures too little of the pattern in the data. It performs poorly on both the training set and the test set.
Overfitting in machine learning means that the model captures noise along with the pattern. Such a model does not generalize well to data it was not trained on: it performs extremely well on the training set but poorly on the test set.
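The difference between the two can be seen in a small experiment. The sketch below is an illustration only: the price and size data are simulated (they are not the data behind the plots discussed earlier), and the polynomial degrees 1, 2 and 10 are arbitrary choices standing in for a too-simple, a reasonable, and a too-flexible model.

```r
# Simulated data (an assumption for illustration): price depends on size
# through a smooth curve plus random noise.
set.seed(42)
n     <- 80
size  <- runif(n, min = 0, max = 10)
price <- 5 + 3 * size - 0.25 * size^2 + rnorm(n, sd = 1)
dat   <- data.frame(size = size, price = price)

# Hold out half of the rows as a test set.
train_idx <- sample(n, size = n / 2)
train     <- dat[train_idx, ]
test      <- dat[-train_idx, ]

# Mean squared error of a fitted model on a given data frame.
mse <- function(model, data) {
  mean((data$price - predict(model, newdata = data))^2)
}

# Degree 1 is too rigid for the curve, degree 2 matches the simulated
# relationship, and degree 10 has enough freedom to follow the noise.
degrees <- c(1, 2, 10)
results <- t(sapply(degrees, function(d) {
  fit <- lm(price ~ poly(size, d), data = train)
  c(degree = d, train_mse = mse(fit, train), test_mse = mse(fit, test))
}))
print(results)
```

The training error can only go down as the degree grows, because each model contains the previous one as a special case. The test error is the column to watch: a model whose test error is much higher than its training error is a candidate for overfitting.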
What is Cross-Validation?
Cross-Validation aims to test the model's ability to predict new data that was not used in estimating it, so that problems like overfitting or selection bias are flagged. It also gives insight into how the model will generalize to an independent dataset.
Steps to organize Cross-Validation:
- We set aside part of the dataset as a held-out sample.
- We train the model on the remaining part of the dataset.
- We test the model on the held-out sample. This set is used to quantify the performance of the model.
Statistical model validation
In statistics, model validation is the task of confirming that the outputs of a statistical model are acceptable with respect to the real data-generating process. In other words, it checks that the model's outputs are close enough to the outputs of the data-generating process for the objectives of the investigation to be met.
Validation is generally based not only on the data that was used to construct the model but also on data that was not used in construction. So, validation usually involves testing some of the model's predictions.
What is the use of cross-validation?
Cross-Validation is primarily used in applied machine learning to estimate the skill of a model on future data. That is, we use a limited sample to estimate how the model is expected to perform when making predictions on data that was not used during training.
Does Cross-Validation reduce Overfitting?
Cross-Validation is a strong safeguard against overfitting. The idea is to use the initial training data to generate many smaller train-test splits, and then to use these splits for tuning the model. In standard k-fold Cross-Validation, we divide the data into k subsets, which are called folds.
Methods Used for Cross-Validation in R
Data scientists use many methods to perform Cross-Validation. We discuss some of them here.
1. Validation Set Approach
The Validation Set Approach estimates a model's error rate by holding out part of the data as a testing dataset. We build the model using the remaining observations, known as the training dataset. The fitted model is then applied to the testing dataset, and we calculate the error on it. Because this error is measured on data the model has not seen, it helps reveal overfitting.
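A minimal sketch of this approach in base R, using the built-in iris data. The 80/20 split, the linear model, and the choice of `Sepal.Length` as the response are assumptions made for illustration.

```r
# Validation set approach on the built-in iris data.
# Assumed task: predict Sepal.Length from the other three measurements.
data(iris)
set.seed(123)

# Randomly assign 80% of the rows to training and the rest to testing.
n         <- nrow(iris)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train     <- iris[train_idx, ]
test      <- iris[-train_idx, ]

# Fit the model on the training rows only.
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = train)

# Measure the error on the held-out rows.
pred      <- predict(fit, newdata = test)
test_rmse <- sqrt(mean((test$Sepal.Length - pred)^2))
test_rmse
```

The estimate depends on which rows happen to land in the testing dataset, so a different seed gives a different `test_rmse`. The methods that follow reduce this dependence by averaging over several splits.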
In code, this approach creates a training dataset and a separate testing dataset. The training dataset is used to build a predictive model, which is then applied to the testing dataset to check the error rate.
2. Leave-one-out cross-validation (LOOCV)
Leave-one-out Cross-Validation (LOOCV) is the special case of k-fold Cross-Validation in which the number of folds equals the number of instances in the data set. The learning algorithm is run once for each instance, using all the other instances as the training set and the selected instance as a single-item test set. In statistics, a closely related procedure is jackknife estimation.
R Code Snippet:
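A minimal sketch of LOOCV in base R on the iris data. The linear model and the choice of `Sepal.Length` as the response are assumptions for illustration; any model with a `predict` method could take their place.

```r
# Leave-one-out cross-validation on the built-in iris data.
data(iris)
n      <- nrow(iris)
errors <- numeric(n)

for (i in seq_len(n)) {
  # Train on every row except row i.
  fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
            data = iris[-i, ])
  # Predict the single held-out row and record the prediction error.
  pred      <- predict(fit, newdata = iris[i, , drop = FALSE])
  errors[i] <- iris$Sepal.Length[i] - pred
}

loocv_rmse <- sqrt(mean(errors^2))
loocv_rmse
```

The loop fits 150 models, one per row. That is cheap here but becomes expensive for large datasets or slow learners, which is one reason k-fold Cross-Validation with a small k is often preferred.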
We can also leave out p training examples at a time instead of one, which creates a validation set of size p in each iteration. This process is known as LPOCV (Leave-P-Out Cross-Validation).
3. k-Fold Cross-Validation
k-fold Cross-Validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
The procedure has a single parameter, k, which is the number of groups that the data sample is split into. This is why it is named k-fold Cross-Validation.
Data scientists often use it in applied machine learning to estimate the skill of a model on unseen data.
It is comparatively simple to understand, and it generally gives a less biased and less optimistic estimate of model skill than a single train/test split.
The general procedure is built-up with a few simple steps:
- Shuffle the dataset to randomize it.
- Split the dataset into k groups of similar size.
- For each unique group:
  - Take the group as the test data set.
  - Take all the remaining groups together as the training data set.
  - Fit a model on the training set and evaluate it on the test set.
  - Note down the evaluation score.
- Summarize the model's skill from the k evaluation scores.
R code Snippet:
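The procedure can be sketched in base R as follows. The value k = 5, the linear model, and the RMSE score are assumptions chosen for illustration on the iris data.

```r
# k-fold cross-validation on the built-in iris data.
data(iris)
set.seed(123)
k <- 5
n <- nrow(iris)

# Shuffle, then assign each row to one of k folds of near-equal size.
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: train on the other k - 1 folds, test on this one.
fold_rmse <- sapply(seq_len(k), function(f) {
  train <- iris[fold_id != f, ]
  test  <- iris[fold_id == f, ]
  fit   <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
              data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$Sepal.Length - pred)^2))
})

fold_rmse        # one evaluation score per fold
mean(fold_rmse)  # overall estimate of model skill
```

Every row is used for testing exactly once and for training k - 1 times, so no observation is wasted.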
4. Stratified k-fold Cross-Validation
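Stratification is a rearrangement of the data to make sure that each fold is a good representative of the whole. Consider a binary classification problem in which each class makes up 50% of the data: stratification arranges the data so that each fold also contains roughly 50% of each class.
For classification problems, and especially for imbalanced classes, stratified k-fold Cross-Validation is generally a better choice than plain k-fold in terms of both bias and variance.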
R Code Snippet:
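A minimal sketch of stratified fold assignment in base R. It uses `Species` in the iris data as the class label and k = 5, both chosen for illustration. Only the fold construction is shown; a classifier would then be trained and scored on these folds in the same way as in plain k-fold.

```r
# Stratified fold assignment on the built-in iris data.
data(iris)
set.seed(123)
k <- 5

# Within each class, shuffle the rows and deal them out to folds 1..k,
# so that every fold receives the same share of each class.
fold_id <- integer(nrow(iris))
for (cls in levels(iris$Species)) {
  rows          <- which(iris$Species == cls)
  fold_id[rows] <- sample(rep(seq_len(k), length.out = length(rows)))
}

# iris has 50 rows per species, so each fold holds 10 rows of each.
table(fold = fold_id, species = iris$Species)
```

With plain random folds the class counts per fold would vary by chance. The effect is small for a balanced dataset like iris but can be severe when one class is rare.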
5. Adversarial Validation
The basic idea is to check how similar the features and their distributions are between the training and test sets. If the two sets are hard to tell apart, their distributions are similar, and the general validation methods should work.
With real datasets, there are sometimes cases where the test set and the training set are very different. The scores produced by internal Cross-Validation are then far from the scores obtained on the test set. This is where adversarial validation comes into play.
It measures the degree of similarity between the training and test sets in terms of feature distribution. This is done by merging the train and test sets, labeling each row zero or one (zero for train, one for test), and treating the result as a binary classification task.
We create a new target variable which is 0 for each row in the train set and 1 for each row in the test set.
Now we combine the train and test datasets.
Using the newly created target variable, we fit a classification model and predict, for each row, the probability that it belongs to the test set.
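These steps can be sketched in base R. Everything specific here is an assumption for illustration: a random split of iris stands in for separately collected train and test sets, logistic regression (`glm`) is the classifier, and the helper `auc` is a hypothetical name for a small function that computes the area under the ROC curve from ranks.

```r
# Adversarial validation sketch on the built-in iris data.
data(iris)
set.seed(123)

# Assumed setup: a random split stands in for real train and test sets.
features  <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
train_idx <- sample(nrow(iris), size = 100)
train     <- iris[train_idx, features]
test      <- iris[-train_idx, features]

# Label the origin of each row (0 = train, 1 = test) and combine the sets.
train$is_test <- 0
test$is_test  <- 1
combined      <- rbind(train, test)

# Fit a classifier that tries to tell test rows from train rows.
adv_fit <- glm(is_test ~ ., data = combined, family = binomial)
p_test  <- predict(adv_fit, type = "response")

# AUC via the rank-sum (Mann-Whitney) formula.
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

adv_auc <- auc(p_test, combined$is_test)
adv_auc
```

An AUC near 0.5 means the classifier cannot distinguish the two sets, so they are similarly distributed. An AUC close to 1 means the sets differ markedly. Note that the AUC here is computed on the same rows the classifier was fitted on, which makes it somewhat optimistic; in practice the classifier itself would be cross-validated.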
6. Cross-Validation for time series
A time-series dataset cannot be split randomly, because shuffling destroys the time order and lets the model train on observations that come after the ones it is tested on. In a time series problem, we perform Cross-Validation as shown below.
For time-series Cross-Validation, we create the folds in a forward-chaining fashion.
Suppose, for example, that we have a time series of annual consumer demand for a particular product over n years. We make the folds like this:
fold 1: training group 1, test group 2
fold 2: training group 1,2, test group 3
fold 3: training group 1,2,3, test group 4
fold 4: training group 1,2,3,4, test group 5
fold 5: training group 1,2,3,4,5, test group 6
fold n-1: training group 1 to n-1, test group n
A new train and test set are selected progressively. We start with a train set containing the minimum number of observations required to fit the model. Then, with every fold, the train set grows by one group and the test set moves one group forward.
R Code Snippet:
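A minimal sketch of forward-chaining validation in base R. The built-in annual `Nile` series stands in for the consumer-demand example, the forecast is a naive "repeat the last value" rule, and `min_train` is an assumed starting size; none of these come from the original example. Packages such as forecast offer a ready-made `tsCV()` function with a similar `h` argument, but only base R is used here.

```r
# Forward-chaining (rolling-origin) validation on a built-in annual series.
y         <- as.numeric(Nile)  # stand-in series with 100 yearly observations
h         <- 1                 # forecast horizon: 1 step ahead
min_train <- 20                # assumed minimum number of training observations

# Each origin t defines one fold: train on y[1..t], test on y[t + h].
origins <- min_train:(length(y) - h)

errors <- sapply(origins, function(t) {
  train    <- y[1:t]
  forecast <- train[length(train)]  # naive forecast: repeat the last value
  y[t + h] - forecast
})

ts_rmse <- sqrt(mean(errors^2))
ts_rmse
```

The training window only ever contains observations that precede the test point, so no information from the future leaks into the model.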
h = 1 means that we consider the error of 1-step-ahead forecasts.
How to measure the model’s bias-variance?
With k-fold Cross-Validation, we obtain k different estimates of the model's error, one per fold. For an ideal model, these errors would all be zero. To gauge the model's bias, we take the average of all the errors. The lower the average, the better the model.
To gauge the model's variance, we take the standard deviation of all the errors. A small standard deviation means that the model does not vary much across different subsets of the training data.
The focus should be on striking a balance between bias and variance. If we reduce the variance while keeping the bias under control, we can get reasonably close to that balance, which leads to a model that predicts better.
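As a rough sketch of this comparison, the code below computes the mean and standard deviation of the fold errors for two candidate models on the iris data. The two formulas, k = 10, and the helper name `cv_errors` are assumptions chosen for illustration.

```r
# Compare two models by the mean and spread of their k-fold errors.
data(iris)
set.seed(123)
k       <- 10
fold_id <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Return the RMSE on each of the k folds for a given model formula.
cv_errors <- function(formula) {
  sapply(seq_len(k), function(f) {
    fit  <- lm(formula, data = iris[fold_id != f, ])
    pred <- predict(fit, newdata = iris[fold_id == f, ])
    sqrt(mean((iris$Sepal.Length[fold_id == f] - pred)^2))
  })
}

errs_small <- cv_errors(Sepal.Length ~ Sepal.Width)
errs_full  <- cv_errors(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width)

# Mean error as a proxy for bias, standard deviation as a proxy for variance.
summary_tab <- rbind(
  small = c(mean_error = mean(errs_small), sd_error = sd(errs_small)),
  full  = c(mean_error = mean(errs_full),  sd_error = sd(errs_full))
)
summary_tab
```

Both models are scored on the same folds, so differences in the table reflect the models rather than the luck of the split.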
In this article, we discussed Cross-Validation and its application in R, and we looked at how it helps to detect and avoid overfitting. We covered different procedures, including the validation set approach, LOOCV, k-fold Cross-Validation, and stratified k-fold, each followed by an implementation in R on the Iris dataset.