Choosing the right metric is a crucial step in any Machine Learning project. Every Machine Learning model needs to be evaluated against some metric to check how well it has learnt from the training data and how well it performs on test data. These are called Performance Metrics, and they differ for regression and classification models.
By the end of this tutorial, you will know:
- Metrics for regression
- Metrics for different types of classification
- When to prefer which type of metric
Metrics for Regression
Regression problems involve predicting a target with continuous values from a set of independent features. This is a type of Supervised learning where we compare the prediction with the actual value and then calculate the difference, or error term. The smaller the error, the better the model performs. Several regression metrics are in wide use today. Let’s go over them one by one.
1. Mean Squared Error
Mean Squared Error (MSE) is one of the most widely used regression metrics. It takes each error (Y_Pred – Y_Actual), squares it, and averages the squared errors over all instances. Squaring has two important effects compared with using the raw error. First, an error can be negative, and squaring turns every error into a non-negative term, so the errors can be added without cancelling each other out.
Second, squaring magnifies errors that are already large and shrinks errors whose absolute value is less than 1. This magnifying effect penalises the instances where the error is large. MSE is also popular as a loss function because it is differentiable at every point, which makes the gradient easy to calculate.
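As a rough illustration of the squaring effect described above, here is a minimal sketch of MSE in plain Python. The function name and the sample values are assumptions made up for this example, not part of any particular library.

```python
def mean_squared_error(y_actual, y_pred):
    """Average of the squared differences between predictions and actual values."""
    if len(y_actual) != len(y_pred):
        raise ValueError("y_actual and y_pred must have the same length")
    squared_errors = [(p - a) ** 2 for a, p in zip(y_actual, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical values chosen for illustration only.
y_actual = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Raw errors are -0.5, 0.0, 2.0 and 1.0.
# Squared, they become 0.25, 0.0, 4.0 and 1.0: the small error shrinks
# and the large one grows.
mse = mean_squared_error(y_actual, y_pred)
print(mse)  # 1.3125
```

In practice a library routine such as scikit-learn's mean_squared_error would normally be used; the hand-written version is only meant to show the arithmetic.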
2. Root Mean Squared Error
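A shortcoming of MSE is that squaring puts the result in squared units of the target, which makes it hard to interpret and exaggerates the apparent size of the errors. Root Mean Squared Error (RMSE) takes the square root of MSE to bring the result back to the same units as the target. Like MSE, it still penalises large errors more heavily than small ones, so it is useful when large errors are especially undesirable.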
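A minimal sketch of RMSE, computed as the square root of the mean of the squared errors, which expresses the result in the same units as the target. The function name and the sample values are assumptions for illustration.

```python
import math


def root_mean_squared_error(y_actual, y_pred):
    """Square root of the average squared difference between predictions and actuals."""
    if len(y_actual) != len(y_pred):
        raise ValueError("y_actual and y_pred must have the same length")
    squared_errors = [(p - a) ** 2 for a, p in zip(y_actual, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values chosen for illustration only.
y_actual = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

rmse = root_mean_squared_error(y_actual, y_pred)
print(round(rmse, 4))  # square root of 1.3125, roughly 1.1456
```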
3. Mean Absolute Error
Mean Absolute Error (MAE) averages the absolute values of the errors, |Y_Pred – Y_Actual|. Unlike MSE, it does not magnify the larger errors, which makes it more robust to outliers. For the same reason, it is not suitable for applications where outliers need to be penalised heavily. MAE is a linear score, which means all the individual differences are weighted equally.
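To make the outlier behaviour concrete, the sketch below compares MAE and MSE on two sets of predictions that differ only in one badly wrong value. Both helper functions and all the numbers are assumptions invented for this example.

```python
def mean_absolute_error(y_actual, y_pred):
    """Average of the absolute differences between predictions and actual values."""
    absolute_errors = [abs(p - a) for a, p in zip(y_actual, y_pred)]
    return sum(absolute_errors) / len(absolute_errors)


def mean_squared_error(y_actual, y_pred):
    """Average of the squared differences, included here only for comparison."""
    squared_errors = [(p - a) ** 2 for a, p in zip(y_actual, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical values chosen for illustration only.
y_actual = [10.0, 12.0, 11.0, 13.0]
y_pred_clean = [11.0, 11.0, 12.0, 12.0]    # every prediction is off by 1
y_pred_outlier = [11.0, 11.0, 12.0, 23.0]  # the last prediction is off by 10

mae_clean = mean_absolute_error(y_actual, y_pred_clean)      # 1.0
mse_clean = mean_squared_error(y_actual, y_pred_clean)       # 1.0
mae_outlier = mean_absolute_error(y_actual, y_pred_outlier)  # 3.25
mse_outlier = mean_squared_error(y_actual, y_pred_outlier)   # 25.75

print(mae_clean, mae_outlier)
print(mse_clean, mse_outlier)
```

With a single outlier, MAE grows from 1.0 to 3.25 while MSE jumps from 1.0 to 25.75, which is the magnifying effect that MAE avoids.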
4. R Squared
R Squared is a goodness-of-fit measure for regression models. It reflects how closely the data points are scattered around the fitted regression line. It is also called the Coefficient of Determination. A higher R Squared value means that there is less difference between the predicted values and the actual values.
When measured on the training data, the R Squared value never decreases as more and more features are added to the model. This means that R Squared on its own is not a reliable measure of performance, as it can be large even if the added features are not contributing any value.
In Regression Analysis, R Squared is used to determine how much of the variation in the target is explained by the features. In simple terms, it measures the strength of the relationship between your model and the dependent variable, usually on a 0 – 100% scale. R Squared is one minus the ratio of the Residual Sum of Squares (SSR) to the Total Sum of Squares (SST). R Sqr is defined as:
R Sqr = 1 – SSR/SST, where
SSR is the sum of the squared differences between the actual observed value Y and the predicted value Y_Pred, and SST is the sum of the squared differences between the actual observed value Y and the average of the observed values, Y_Avg.
Generally, the higher the R Sqr, the better the model. But is that always so? No.
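The definition R Sqr = 1 – SSR/SST can be sketched directly in Python. The function name and the sample values below are assumptions for illustration; the variable names ssr, sst and y_avg mirror the symbols used in the text.

```python
def r_squared(y_actual, y_pred):
    """R Sqr = 1 - SSR/SST, following the definition in the text."""
    y_avg = sum(y_actual) / len(y_actual)
    # SSR: squared differences between actual and predicted values.
    ssr = sum((a - p) ** 2 for a, p in zip(y_actual, y_pred))
    # SST: squared differences between actual values and their average.
    sst = sum((a - y_avg) ** 2 for a in y_actual)
    if sst == 0:
        raise ValueError("R squared is undefined when all actual values are equal")
    return 1 - ssr / sst


# Hypothetical values chosen for illustration only.
y_actual = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.5, 2.0, 3.0, 4.0, 4.5]

r2 = r_squared(y_actual, y_pred)           # SSR = 0.5, SST = 10.0
r2_mean_model = r_squared(y_actual, [3.0] * 5)  # always predicting the average

print(round(r2, 4))    # 0.95
print(r2_mean_model)   # 0.0
```

A model that always predicts the average of the observed values has SSR equal to SST, so its R Sqr is 0. That gives a useful baseline for reading the score.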
5. Adjusted R Squared
Adjusted R Squared overcomes the shortcoming of R Squared, which cannot correctly reflect whether model performance actually improves when more features are added. The R Squared value on its own shows an incomplete picture and can be very misleading.
In essence, the R Sqr value never decreases when new features are added, even if a feature hurts the model’s performance on unseen data. You might not notice when your model starts to overfit.
Adjusted R Sqr adjusts for this increase in the number of variables, and its value decreases when a feature doesn’t improve the model enough to justify its inclusion. We use Adjusted R Sqr to compare the goodness-of-fit of regression models that contain different numbers of independent variables.
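The text does not give a formula, so the sketch below assumes the commonly used definition, Adjusted R Sqr = 1 – (1 – R Sqr)(n – 1)/(n – k – 1), where n is the number of samples and k is the number of features. The function name and the R Sqr values are hypothetical and chosen only to show the penalty at work.

```python
def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R Sqr = 1 - (1 - R Sqr) * (n - 1) / (n - k - 1)."""
    degrees_of_freedom = n_samples - n_features - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more samples than features plus one")
    return 1 - (1 - r2) * (n_samples - 1) / degrees_of_freedom


# Hypothetical scenario: adding 5 features to a 3-feature model fitted on
# 50 samples nudges R Sqr from 0.800 up to 0.805.
adj_small = adjusted_r_squared(0.800, n_samples=50, n_features=3)
adj_large = adjusted_r_squared(0.805, n_samples=50, n_features=8)

print(round(adj_small, 4))  # about 0.787
print(round(adj_large, 4))  # about 0.767
```

Although the raw R Sqr went up slightly, the adjusted value went down, signalling that the extra features did not earn their place.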
Metrics for Classification
Just like regression metrics, there are different types of metrics for classification as well. Different types of metrics are used for different types of classification and data. Let’s go over them one by one.
1. Accuracy
Accuracy is the simplest and most straightforward metric for classification. It calculates the percentage of instances that are predicted correctly out of the total number of instances. For example, if 90 out of 100 instances are predicted correctly, then the accuracy will be 90%. Accuracy, however, is not the right metric for many classification tasks, as it does not take class imbalance into account.
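A small sketch of why accuracy can mislead on imbalanced data. The dataset below is hypothetical: 95 negatives and 5 positives, scored against a "model" that simply predicts the majority class every time.

```python
def accuracy(y_actual, y_pred):
    """Fraction of instances whose predicted label matches the actual label."""
    if len(y_actual) != len(y_pred):
        raise ValueError("y_actual and y_pred must have the same length")
    correct = sum(1 for a, p in zip(y_actual, y_pred) if a == p)
    return correct / len(y_actual)


# Hypothetical imbalanced dataset: 95 instances of class 0, 5 of class 1.
y_actual = [0] * 95 + [1] * 5
# A trivial model that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_actual, y_pred)
print(acc)  # 0.95, even though no positive instance was found
```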
2. Precision, Recall
For a better picture of model performance, we need to see how many false positives and how many false negatives the model produced. Precision tells us how many of the predicted positives are actually positive. In other words, it is the proportion of correctly predicted positives out of the total positive predictions, TP/(TP+FP). Recall tells us how many of the actual positives were predicted as positive. In other words, it is the proportion of correctly predicted positives out of the total number of actual positives, TP/(TP+FN).
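The sketch below counts true positives, false positives and false negatives and derives precision as TP/(TP+FP) and recall as TP/(TP+FN). The function name, the choice to return 0.0 when a denominator is zero, and the labels are all assumptions made for this example.

```python
def precision_recall(y_actual, y_pred, positive=1):
    """Return (precision, recall) for the given positive label."""
    tp = fp = fn = 0
    for a, p in zip(y_actual, y_pred):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
    # Assumption: report 0.0 when there are no predicted or no actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 4 actual positives followed by 6 actual negatives.
y_actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# TP = 2, FP = 1, FN = 2
precision, recall = precision_recall(y_actual, y_pred)
print(round(precision, 3), recall)  # about 0.667 and 0.5
```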
3. Confusion Matrix
A Confusion Matrix tabulates the True Positives, True Negatives, False Positives and False Negatives. It shows, for each actual class, how many instances were predicted as each class. It is an NxN matrix where N is the number of classes. Confusion Matrix is not so confusing after all!
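A minimal sketch of building an NxN confusion matrix for three classes. It assumes the convention that rows are actual classes and columns are predicted classes (the same orientation scikit-learn's confusion_matrix uses); the function name and the labels are hypothetical.

```python
def confusion_matrix(y_actual, y_pred, labels):
    """Build an NxN count matrix: rows are actual classes, columns are predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(y_actual, y_pred):
        matrix[index[a]][index[p]] += 1
    return matrix


# Hypothetical three-class example.
labels = ["cat", "dog", "bird"]
y_actual = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat"]

matrix = confusion_matrix(y_actual, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
# cat  [1, 1, 0]
# dog  [0, 2, 0]
# bird [1, 0, 1]
```

The diagonal holds the correct predictions; every off-diagonal cell shows one specific kind of confusion, for example one bird predicted as a cat.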
4. F1 Score
F1 Score combines Precision and Recall into a single metric. It is the harmonic mean of the Precision and Recall values. This matters because if, in some case, the recall value is 1 (i.e. 100%) and the precision value is 0, the arithmetic mean of Precision and Recall would be 0.5, whereas the harmonic mean gives an F1 Score of 0. This tells us that the harmonic mean penalizes extreme values more.
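The comparison between the two means can be checked with a short sketch. The function names are assumptions, and returning 0.0 when both inputs are zero is a convention adopted here to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both values are zero
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(precision, recall):
    """Plain average, shown only for comparison with the harmonic mean."""
    return (precision + recall) / 2


# The extreme case from the text: recall of 1 with precision of 0.
f1_extreme = f1_score(0.0, 1.0)
mean_extreme = arithmetic_mean(0.0, 1.0)
print(f1_extreme, mean_extreme)  # 0.0 0.5

# When the two values agree, both means agree as well.
f1_balanced = f1_score(0.5, 0.5)
print(f1_balanced)  # 0.5
```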
5. AUC ROC Curve
Accuracy and F1 score can be misleading when it comes to imbalanced data. The AUC (Area Under the Curve) ROC (Receiver Operating Characteristic) curve tells us the degree of separability of the classes predicted by the model. The higher the score, the better the model is at predicting 0s as 0s and 1s as 1s. The ROC curve is plotted with the True Positive Rate (TPR) on the Y-axis and the False Positive Rate (FPR) on the X-axis, and AUC is the area under that curve.
TPR = TP/(TP+FN)
FPR = FP/(TN+FP)
If AUC ROC comes out to be 1, it means that the model is correctly predicting all the classes and there is complete separability.
If it is 0.5, it means that there is no separability and the model’s predictions are no better than random guessing.
If it is 0, it means that the model is predicting the inverted classes. That is, 0s as 1s and 1s as 0s.
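The three cases above can be reproduced with a short sketch. Rather than tracing the curve threshold by threshold, auc_roc below uses an equivalent pairwise formulation: the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as half. That is a swapped-in technique that yields the same area. The helper tpr_fpr shows a single point on the curve. All names and scores are assumptions for illustration, and both classes are assumed to be present.

```python
def tpr_fpr(y_actual, scores, threshold):
    """TPR and FPR when every score >= threshold is predicted as class 1."""
    tp = fp = tn = fn = 0
    for a, s in zip(y_actual, scores):
        predicted = 1 if s >= threshold else 0
        if a == 1 and predicted == 1:
            tp += 1
        elif a == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (tn + fp)


def auc_roc(y_actual, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for a, s in zip(y_actual, scores) if a == 1]
    neg = [s for a, s in zip(y_actual, scores) if a == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores.
y_actual = [0, 0, 1, 1]
perfect = [0.1, 0.2, 0.8, 0.9]    # every 1 scores above every 0
inverted = [0.9, 0.8, 0.2, 0.1]   # every 1 scores below every 0
mixed = [0.1, 0.4, 0.35, 0.8]     # one positive/negative pair is misordered

print(auc_roc(y_actual, perfect))    # 1.0
print(auc_roc(y_actual, inverted))   # 0.0
print(auc_roc(y_actual, mixed))      # 0.75
print(tpr_fpr(y_actual, mixed, 0.5)) # (0.5, 0.0)
```

For real work, a library routine such as scikit-learn's roc_auc_score is the usual choice; the pairwise loop here is quadratic and only meant to make the meaning of the score visible.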
Before you go
In this article, we discussed the various performance metrics for classification and regression. These are the most used metrics and hence it is crucial to know about them. For classification, there are even more metrics which are specifically made for multi-class classification and multi-label classification such as Kappa Score, Precision at K, Average Precision at K, etc.