By the end of this tutorial, you will have knowledge of the following:
- What is Homoscedasticity & Heteroscedasticity?
- How to know if Heteroscedasticity is present.
- Effects of Heteroscedasticity in Machine Learning.
- Treating Heteroscedasticity.
What Is Homoscedasticity & Heteroscedasticity?
Homoscedasticity means "having the same variance". One of the main assumptions of Linear Regression is that the errors, or residual terms (Y_actual - Y_pred), are homoscedastic.
In other words, Linear Regression assumes that the error terms have the same variance for all instances, whatever the values of the independent variables. The variance does not have to be small, only constant.
Let’s understand it with the help of an example. Consider we have two variables – Carpet area of the house and price of the house. As the carpet area increases, the prices also increase.
So we fit a linear regression model and see that the errors have the same variance throughout. The graph in the image below has Carpet Area on the X-axis and Price on the Y-axis.
As you can see, the data points lie close to the regression line, with a similar spread throughout.
Also, if we plot the residuals against Carpet Area, they scatter in a band of roughly constant width around a horizontal line at zero. This is a clear sign of Homoscedasticity.
When this condition is violated, there is Heteroscedasticity in the model. Taking the same example, suppose that for houses with a smaller carpet area the errors or residuals are very small, and that as the carpet area increases, the spread of the prices around the predictions increases, which results in larger error or residual terms. When we plot the values again, we see the typical cone shape, which strongly indicates the presence of Heteroscedasticity in the model.
Specifically, Heteroscedasticity is a systematic increase or decrease in the variance of the residuals over the range of the independent variables. This is an issue because Linear Regression assumes that all errors have the same variance.
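To make the difference concrete, here is a minimal sketch that simulates both cases. The data, the coefficients and the helper names (`residuals`, `spread_ratio`) are made up for illustration. It compares the spread of the residuals for the larger houses with the spread for the smaller houses: a ratio near 1 suggests Homoscedasticity, while a ratio well above 1 suggests the cone shape described above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: carpet area in square feet, price in arbitrary units.
area = rng.uniform(500, 3000, size=1000)

# Homoscedastic case: the noise has the same spread for every house.
price_homo = 50 + 0.1 * area + rng.normal(0, 20, size=area.size)
# Heteroscedastic case: the noise spread grows in proportion to carpet area.
price_hetero = 50 + 0.1 * area + rng.normal(0, 0.02 * area)

def residuals(x, y):
    """Fit y = a + b*x by least squares and return actual minus predicted."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (intercept + slope * x)

def spread_ratio(x, res):
    """Std of residuals in the larger half of x divided by the smaller half."""
    order = np.argsort(x)
    half = len(x) // 2
    return res[order[half:]].std() / res[order[:half]].std()

ratio_homo = spread_ratio(area, residuals(area, price_homo))
ratio_hetero = spread_ratio(area, residuals(area, price_hetero))
print(f"homoscedastic spread ratio:   {ratio_homo:.2f}")
print(f"heteroscedastic spread ratio: {ratio_hetero:.2f}")
```

Splitting the data into two halves is only a rough device chosen for this sketch; it is not a formal test.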
How To Know If Heteroscedasticity is Present?
The simplest way to check for Heteroscedasticity is to plot the residuals against the fitted values. If the spread of the residuals changes systematically across the plot, Heteroscedasticity is present. Typically the spread increases as the fitted values increase, producing a cone shape.
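If you want a number to go with the plot, one rough option is to check whether the size of the residuals moves together with the fitted values. The sketch below uses simulated data and a plain correlation between the fitted values and the absolute residuals; this is an informal stand-in for the visual check, not a formal statistical test, and all names and values in it are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data whose noise grows with x.
x = rng.uniform(1, 10, size=500)
y = 3.0 + 2.0 * x + rng.normal(0, 0.5 * x)

slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
resid = y - fitted

# With constant spread, |residual| should be roughly unrelated to the fitted value.
corr = np.corrcoef(fitted, np.abs(resid))[0, 1]
print(f"correlation between fitted values and |residuals|: {corr:.2f}")

# Text-only version of the residual plot: spread within four fitted-value bins.
bins = np.array_split(np.argsort(fitted), 4)
bin_spread = [resid[idx].std() for idx in bins]
print("residual std per quartile of fitted value:", np.round(bin_spread, 2))
```

A clearly positive correlation, or a spread that keeps growing from the first bin to the last, matches the cone shape you would see in the plot.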
Usual Reasons For Heteroscedasticity
- When a variable has a very wide range, that is, when its smallest and largest values are far apart. Outliers can also cause this.
- When you are fitting the wrong model. If you fit a linear regression model to data that is non-linear, it will lead to Heteroscedasticity.
- When the scale of values in a variable is not the same.
- When a wrong transformation on data is used for regression.
- When there is left/right skewness present in the data.
Pure Vs Impure Heteroscedasticity
Given the reasons above, Heteroscedasticity can be either Pure or Impure. When we fit the right model (linear or non-linear) and there is still a visible pattern in the residuals, it is called Pure Heteroscedasticity.
However, if we fit the wrong model and then observe a pattern in the residuals, it is a case of Impure Heteroscedasticity. The measures needed to overcome it depend on the type of Heteroscedasticity, and also on the domain you are working in.
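The impure case can be illustrated with a small simulation. In the sketch below the true relationship is assumed to be quadratic and the noise has constant spread, so any pattern in the residuals comes only from choosing the wrong model. The data and the `curve` helper variable are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: quadratic relationship, constant noise spread.
x = rng.uniform(0, 10, size=500)
y = 1.0 + 0.5 * x**2 + rng.normal(0, 1.0, size=x.size)

# Wrong model: a straight line.
resid_line = y - np.polyval(np.polyfit(x, y, deg=1), x)
# Right model: a quadratic.
resid_quad = y - np.polyval(np.polyfit(x, y, deg=2), x)

# A leftover curve shows up as a correlation between the residuals
# and the squared distance of x from its mean.
curve = (x - x.mean()) ** 2
corr_line = np.corrcoef(curve, resid_line)[0, 1]
corr_quad = np.corrcoef(curve, resid_quad)[0, 1]
print(f"pattern left by the line fit:      {corr_line:.2f}")
print(f"pattern left by the quadratic fit: {corr_quad:.2f}")
```

Once the model matches the data, the pattern disappears without touching the variables, which is why it helps to rule out the impure case first.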
Effects Of Heteroscedasticity In Machine Learning
As we discussed earlier, the linear regression model assumes that the errors are homoscedastic. If that assumption is broken, we cannot fully trust the results we get.
If Heteroscedasticity is present then the instances with high variance will have a larger impact on the prediction which we don’t want.
- Heteroscedasticity makes the coefficient estimates less precise. The estimates remain unbiased, but with lower precision any single estimate is likely to be further from the true population value.
- Heteroscedasticity is also likely to produce p-values that are smaller than they should be. This is because the variance of the coefficient estimates has increased, but the standard OLS (Ordinary Least Squares) model does not detect it. The OLS model therefore calculates p-values using an underestimated variance. This can lead us to conclude that regression coefficients are significant when they are actually not.
- The standard errors produced will also be biased. Standard errors are crucial for significance tests and confidence intervals. If the standard errors are biased, the tests and the confidence intervals for the regression coefficients will be unreliable.
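The points above can be checked with a small Monte Carlo experiment. The sketch below is an illustration under assumed values (sample size, coefficients and a noise spread that grows with x are all made up): it refits the same regression on many simulated datasets and compares how much the slope really varies with the standard error that the usual OLS formula reports.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 2000

# Fixed design; the noise standard deviation is assumed to grow with x.
x = np.linspace(1, 10, n)
noise_sd = 0.1 * x**2
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)

slopes = np.empty(reps)
reported_se = np.empty(reps)
for i in range(reps):
    y = 2.0 + 3.0 * x + rng.normal(0, noise_sd)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)  # one pooled error variance, as OLS assumes
    slopes[i] = beta[1]
    reported_se[i] = np.sqrt(sigma2 * XtX_inv[1, 1])

actual_sd = slopes.std(ddof=1)
mean_reported_se = reported_se.mean()
print(f"average slope estimate:        {slopes.mean():.3f} (true value 3.0)")
print(f"actual spread of the slope:    {actual_sd:.3f}")
print(f"average OLS standard error:    {mean_reported_se:.3f}")
```

In this setup the average slope stays close to the true value, but the reported standard error is smaller than the actual spread of the estimates, which is what makes the p-values look better than they are.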
How To Treat Heteroscedasticity?
If you detect the presence of Heteroscedasticity, then there are multiple ways to tackle it. First, let’s consider an example where we have 2 variables: Population of City and Number of Infections of COVID-19.
Now in this example, there will be a huge difference in the number of infections in large metro cities vs small tier-3 cities. Population of City will be the independent variable and Number of Infections will be the dependent variable.
Suppose we fit a regression model to this data and observe Heteroscedasticity similar to the image above. So now we know that there is Heteroscedasticity present in the model and it needs to be fixed.
Now the first step would be to identify the source of Heteroscedasticity. In our case, it is the variable with a large variance.
There can be multiple ways to deal with Heteroscedasticity, but we’ll look at three such methods.
Manipulating The Variables
We can modify the variables/features we have to reduce the impact of this large variance on the model predictions. One way to do this is to express the features as rates and percentages rather than actual values.
The features will then convey slightly different information, but it is worth trying. Whether this approach can be used also depends on the problem and the data.
This method involves the least modification of the features, often helps solve the problem, and in some cases even improves the model's performance.
So in our case, we can change the feature “Number of Infections” to “Rate of infections”. This will help reduce the variance as quite obviously the number of infections in cities with a large population will be large.
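Here is a minimal sketch of that conversion on simulated city data. The populations, the assumed 2% average infection rate and the helper `residual_spread_ratio` are all hypothetical; the point is only to compare how the residual spread behaves for the raw count and for the rate.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical cities: populations from 10 thousand to 10 million.
population = 10 ** rng.uniform(4, 7, size=1000)
# Assumed for illustration: each city's infection rate varies around 2%.
true_rate = rng.normal(0.02, 0.004, size=population.size)
infections = true_rate * population

# Convert the raw count into a rate per 1,000 residents.
rate_per_1000 = infections / population * 1000

def residual_spread_ratio(x, y):
    """Fit y on x, then compare residual std in the larger vs smaller half of x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    res = y - (intercept + slope * x)
    order = np.argsort(x)
    half = len(x) // 2
    return res[order[half:]].std() / res[order[:half]].std()

count_ratio = residual_spread_ratio(population, infections)
rate_ratio = residual_spread_ratio(population, rate_per_1000)
print(f"spread ratio using counts: {count_ratio:.1f}")
print(f"spread ratio using rates:  {rate_ratio:.2f}")
```

With counts, the residuals for large cities are far more spread out than for small ones; with rates, the two halves look alike. Real data will rarely be this clean.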
Weighted Regression
Weighted regression is a modification of normal regression where the data points are assigned weights according to their variance. The points with large variance are given small weights and the points with small variance are given larger weights.
The model is then fitted by minimizing the weighted sum of squared residuals, so the small weights reduce the influence of the high-variance points.
When correct weights are used, Heteroscedasticity is replaced by Homoscedasticity. But how do we find the correct weights? One quick way is to use the inverse of the variable responsible for the large variance as the weight.
So in our case, the weight will be Inverse of City Population.
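The sketch below shows one way this could look with plain numpy. It assumes, for illustration only, that the noise variance is proportional to the city population, which is the situation where the inverse of the population is a suitable weight. Weighted least squares is done here by scaling each row by the square root of its weight and then solving an ordinary least squares problem.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical cities; noise variance is assumed proportional to population.
population = rng.uniform(1e4, 1e6, size=1000)
infections = 0.02 * population + rng.normal(0, 2.0 * np.sqrt(population))

X = np.column_stack([np.ones_like(population), population])
weights = 1.0 / population  # inverse of the high-variance variable

# Weighted least squares: scale rows by sqrt(weight), then solve plain least squares.
sw = np.sqrt(weights)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], infections * sw, rcond=None)

resid = infections - X @ beta_wls
weighted_resid = resid * sw

def spread_ratio(values, by):
    """Std of values in the larger half of `by` divided by the smaller half."""
    order = np.argsort(by)
    half = len(by) // 2
    return values[order[half:]].std() / values[order[:half]].std()

raw_ratio = spread_ratio(resid, population)
weighted_ratio = spread_ratio(weighted_resid, population)
print(f"estimated slope:                   {beta_wls[1]:.4f}")
print(f"spread ratio, raw residuals:       {raw_ratio:.2f}")
print(f"spread ratio, weighted residuals:  {weighted_ratio:.2f}")
```

If the weights match the true variance structure, the weighted residuals should show a roughly constant spread. If they do not, a different weight is worth trying.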
Transforming The Data
Transforming the data is the last resort, because by doing so you lose the interpretability of the feature.
That is, you can no longer easily explain what the feature is showing.
Common choices are the Box-Cox transformation and the log transformation.
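As a rough illustration, the sketch below applies a log transformation to simulated data with multiplicative noise, where the spread of y grows with x. The data and helper names are hypothetical. The `box_cox` function implements the standard Box-Cox formula for a given lambda, with lambda = 0 being the log transform; in practice lambda is usually estimated from the data, which this sketch does not attempt. Both x and y are logged here because the simulated relationship is multiplicative.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data with multiplicative noise, so the spread of y grows with x.
x = rng.uniform(1, 100, size=1000)
y = 5.0 * x * np.exp(rng.normal(0, 0.3, size=x.size))

def box_cox(values, lam):
    """Box-Cox transform for positive values; lam = 0 gives the log transform."""
    if lam == 0:
        return np.log(values)
    return (values ** lam - 1.0) / lam

def residual_spread_ratio(x, y):
    """Fit y on x, then compare residual std in the larger vs smaller half of x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    res = y - (intercept + slope * x)
    order = np.argsort(x)
    half = len(x) // 2
    return res[order[half:]].std() / res[order[:half]].std()

raw_ratio = residual_spread_ratio(x, y)
log_ratio = residual_spread_ratio(np.log(x), box_cox(y, 0))
print(f"spread ratio before transforming: {raw_ratio:.2f}")
print(f"spread ratio after log transform: {log_ratio:.2f}")
```

The cost mentioned above is visible here too: the fitted coefficients now describe log values, not the original units.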
Before You Go
There can be many reasons for Heteroscedasticity in your data. It also highly varies from one domain to another.
So it is essential to have that domain knowledge before you start with the above processes to remove Heteroscedasticity.
In this blog, we discussed Homoscedasticity and Heteroscedasticity, how to detect Heteroscedasticity, how it affects regression models, and how to treat it.