Homoscedasticity In Machine Learning: Detection, Effects & How to Treat

Last updated:
6th Jan, 2021

By the end of this tutorial, you will have knowledge of the following:

  • What is Homoscedasticity & Heteroscedasticity?
  • How to know if Heteroscedasticity is present.
  • Effects of Heteroscedasticity in Machine Learning.
  • Treating Heteroscedasticity

What Is Homoscedasticity & Heteroscedasticity?

Homoscedasticity means “having the same variance”. One of the main assumptions of Linear Regression is that the errors, or residual terms (Y_actual – Y_pred), are homoscedastic.

In other words, Linear Regression assumes that the error terms have the same variance for all instances, regardless of the values of the independent variables. The assumption is that the spread of the errors is constant, not that the errors are identical or small.

Let’s understand it with the help of an example. Suppose we have two variables: the carpet area of a house and the price of the house. As the carpet area increases, the price also increases.

So we fit a linear regression model and see that the errors have the same variance throughout. Picture a graph with Carpet Area on the X-axis and Price on the Y-axis.

The data points lie close to the linear regression line, with a similar spread around it across the whole range of carpet area.

Also, if we plot the residuals against the carpet area, they form a band of roughly constant width around a horizontal line at zero, parallel to the X-axis. This is a clear sign of Homoscedasticity.

When this condition is violated, there is Heteroscedasticity in the model. Considering the same example as above, let’s say that for houses with a smaller carpet area the errors or residuals are very small. As the carpet area increases, the variance of the prices around the predictions increases, which results in larger error or residual terms. When we plot the values again, we see the typical cone shape, which strongly indicates the presence of Heteroscedasticity in the model.

Specifically speaking, Heteroscedasticity is a systematic increase or decrease in the variance of the residuals over the range of the independent variables. This is an issue because Homoscedasticity is an assumption of linear regression: all errors should have the same variance.
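To make the difference concrete, here is a minimal sketch that simulates the carpet area and price example in both situations. All numbers (the area range, the price formula, the noise levels) and the helper names `residuals` and `spread_ratio` are assumptions made up for illustration; they are not taken from real housing data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: carpet area in square feet, price in arbitrary units.
area = rng.uniform(500, 3000, size=2000)
true_price = 50 + 0.1 * area

# Homoscedastic case: the noise has the same spread for every house.
price_homo = true_price + rng.normal(0, 20, size=area.size)
# Heteroscedastic case: the noise spread grows in proportion to the area.
price_hetero = true_price + rng.normal(0, 1, size=area.size) * 0.02 * area


def residuals(x, y):
    """Fit y = intercept + slope * x by least squares; return y - y_hat."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (intercept + slope * x)


def spread_ratio(x, resid):
    """Residual std. dev. in the top 25% of x divided by the bottom 25%."""
    lo, hi = np.quantile(x, [0.25, 0.75])
    return resid[x >= hi].std() / resid[x <= lo].std()


ratio_homo = spread_ratio(area, residuals(area, price_homo))
ratio_hetero = spread_ratio(area, residuals(area, price_hetero))
print(f"homoscedastic spread ratio:   {ratio_homo:.2f}")    # close to 1
print(f"heteroscedastic spread ratio: {ratio_hetero:.2f}")  # well above 1
```

A ratio near 1 means the residuals are about as spread out for large houses as for small ones. A ratio well above 1 is the numeric counterpart of the cone shape.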

How To Know If Heteroscedasticity is Present?

The easiest way to know if Heteroscedasticity is present is to plot the residuals against the fitted values. If the spread of the residuals changes systematically across the plot, then there is Heteroscedasticity. Typically the spread of the residuals increases as the fitted values increase, producing a cone shape.
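When a plot is not convenient, the same idea can be checked numerically. The sketch below is a rough heuristic rather than a formal test: it measures how strongly the absolute residuals move with the fitted values and prints the residual spread in four bins. The simulated data and the name `cone_score` are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data whose noise spread grows with x (an assumption for the demo).
x = rng.uniform(1, 10, size=1000)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=x.size) * 0.5 * x

# Ordinary least squares fit of y = intercept + slope * x.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
resid = y - fitted

# Rough numeric stand-in for the residual plot: if the absolute residuals
# rise or fall with the fitted values, the spread is not constant.
cone_score = np.corrcoef(fitted, np.abs(resid))[0, 1]
print(f"correlation(|residual|, fitted) = {cone_score:.2f}")

# Text version of the plot: residual std. dev. within four fitted-value bins.
edges = np.quantile(fitted, [0.0, 0.25, 0.5, 0.75, 1.0])
bin_index = np.digitize(fitted, edges[1:-1])  # bin labels 0, 1, 2, 3
bin_spread = [resid[bin_index == b].std() for b in range(4)]
print("residual spread by bin:", np.round(bin_spread, 2))
```

A correlation close to zero and similar spreads in every bin are what Homoscedasticity looks like. A clearly positive correlation with steadily growing bin spreads corresponds to the cone.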

Usual Reasons For Heteroscedasticity

  • When there is a large range in a variable. In other words, when the smallest and the largest values in a variable are very far apart. These extremes can also be outliers.
  • When you are fitting the wrong model. If you fit a linear regression model to data that is non-linear, the residuals will show a pattern that can include Heteroscedasticity.
  • When the scale of values in a variable is not the same across observations.
  • When a wrong transformation on data is used for regression.
  • When there is left/right skewness present in the data.

Pure Vs Impure Heteroscedasticity

Given the above reasons, Heteroscedasticity can be either Pure or Impure. When we fit the right model (linear or non-linear) and there is still a visible pattern in the spread of the residuals, it is called Pure Heteroscedasticity.

However, if we fit the wrong model, for example by leaving out an important variable, and then observe a pattern in the residuals, it is a case of Impure Heteroscedasticity. The measures needed to overcome it depend on the type of Heteroscedasticity. They also depend on the domain you’re working in.
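The impure case can be illustrated with a small simulation. In this sketch, the data-generating formula (including the `x * z` term), the coefficients, and the helper names are all hypothetical. Leaving the `x * z` term out of the model pushes its effect into the residuals, and putting it back removes the pattern.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(1, 10, size=n)
z = rng.normal(0, 1, size=n)  # a second predictor
# Hypothetical true relationship, with constant-variance noise.
y = 1.0 + 2.0 * x + 1.5 * x * z + rng.normal(0, 1, size=n)


def ols_residuals(columns, y):
    """Least-squares fit of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef


def spread_ratio(x, resid):
    """Residual std. dev. in the top 25% of x divided by the bottom 25%."""
    lo, hi = np.quantile(x, [0.25, 0.75])
    return resid[x >= hi].std() / resid[x <= lo].std()


# Wrong model: the x*z term is left out, so the residual spread grows with x.
ratio_wrong = spread_ratio(x, ols_residuals([x], y))
# Right model: with the term included, the spread is roughly constant.
ratio_right = spread_ratio(x, ols_residuals([x, z, x * z], y))
print(f"spread ratio, misspecified model: {ratio_wrong:.2f}")
print(f"spread ratio, correct model:      {ratio_right:.2f}")
```

Here the fix is to correct the model, not to reweight or transform the data, which is why it helps to tell the two types apart first.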

Effects Of Heteroscedasticity In Machine Learning

As we discussed earlier, the linear regression model assumes Homoscedasticity in the errors. If that assumption is broken, we can’t fully trust the results we get.

If Heteroscedasticity is present, the instances with high variance will have a larger impact on the fitted model, which we don’t want.

  • Heteroscedasticity makes the coefficient estimates less precise, so the estimated coefficients are more likely to be far from the true population values.
  • Heteroscedasticity is also likely to produce p-values that are smaller than they should be. This happens because Heteroscedasticity increases the variance of the coefficient estimates, but the standard OLS (Ordinary Least Squares) procedure does not detect this increase. OLS therefore calculates p-values using an underestimated variance. This can lead us to incorrectly conclude that regression coefficients are significant when they are actually not.
  • The standard errors produced will also be biased. Standard errors are crucial in calculating significance tests and confidence intervals. If the standard errors are biased, the tests and confidence intervals built on them will be unreliable, even though the OLS coefficient estimates themselves remain unbiased.
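A small Monte Carlo experiment can illustrate the points above. The sketch below refits the same simple regression on many simulated datasets and compares how much the slope really varies with the standard error that the usual OLS formula reports. The noise model (standard deviation equal to `0.2 * x**2`), the sample sizes, and the true coefficients are assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sims = 200, 2000
x = np.linspace(0.0, 10.0, n)
sigma = 0.2 * x**2  # assumed error std. dev., growing with x
x_c = x - x.mean()
sxx = np.sum(x_c**2)

# One simulated dataset per row, all sharing the same x values.
noise = rng.normal(0.0, 1.0, size=(n_sims, n)) * sigma
y = 1.0 + 2.0 * x + noise

# Closed-form OLS slope and intercept for every dataset at once.
slopes = (y @ x_c) / sxx
intercepts = y.mean(axis=1) - slopes * x.mean()

# Classical OLS standard error of the slope (assumes constant variance).
resid = y - intercepts[:, None] - slopes[:, None] * x
s2 = np.sum(resid**2, axis=1) / (n - 2)
classical_se = np.sqrt(s2 / sxx)

empirical_sd = slopes.std(ddof=1)          # how much the slope really varies
mean_classical_se = classical_se.mean()    # what the OLS formula reports
print(f"average slope estimate:      {slopes.mean():.3f}")
print(f"actual std. dev. of slopes:  {empirical_sd:.3f}")
print(f"average reported std. error: {mean_classical_se:.3f}")
```

In this setup the average slope stays close to the true value of 2, while the reported standard error is smaller than the actual variability of the slope. That gap is what makes p-values too small.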

How To Treat Heteroscedasticity?

If you detect Heteroscedasticity, there are multiple ways to tackle it. First, let’s consider an example where we have 2 variables: Population of City and Number of Infections of COVID-19.

In this example, there will be a huge difference in the number of infections between large metro cities and small tier-3 cities. Population of City is the independent variable and Number of Infections is the dependent variable.

Suppose we fit a regression model to this data and observe a cone-shaped pattern in the residuals, as described earlier. So now we know that there is Heteroscedasticity present in the model and it needs to be fixed.

The first step is to identify the source of Heteroscedasticity. In our case, it is the variable with a very wide range of values, Population of City.

There are multiple ways to deal with Heteroscedasticity, but we’ll look at three such methods.

Manipulating The Variables

We can modify the variables/features we have to reduce the impact of this large variance on the model predictions. One way to do this is to convert the features to rates and percentages rather than actual values.

This makes the features convey slightly different information, but it is worth trying. Whether this approach can be implemented also depends on the problem and the data.

This method involves the least modification of the features, often helps solve the problem, and in some cases even improves the model’s performance.

So in our case, we can change the variable “Number of Infections” to “Rate of Infections”. This helps reduce the variance because, quite obviously, the number of infections in cities with a large population will be large.
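As a rough sketch of this idea, the code below simulates hypothetical cities whose infection rate varies around 2% regardless of size, then compares the residual spread when modelling raw counts against modelling a rate per 100,000 people. The populations, the rate distribution, and the helper names are assumptions and do not reflect real COVID-19 figures.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
# Hypothetical cities with populations from 10 thousand to 5 million.
population = rng.uniform(1e4, 5e6, size=n)
# Assumed: each city's infection rate varies around 2%, whatever its size.
rate = rng.normal(0.02, 0.004, size=n)
infections = rate * population

# The manipulated variable: infections per 100,000 people.
rate_per_100k = infections / population * 1e5


def ols_residuals(x, y):
    """Fit y = intercept + slope * x by least squares; return y - y_hat."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (intercept + slope * x)


def spread_ratio(x, resid):
    """Residual std. dev. in the top 25% of x divided by the bottom 25%."""
    lo, hi = np.quantile(x, [0.25, 0.75])
    return resid[x >= hi].std() / resid[x <= lo].std()


ratio_counts = spread_ratio(population, ols_residuals(population, infections))
ratio_rates = spread_ratio(population, ols_residuals(population, rate_per_100k))
print(f"spread ratio using raw counts: {ratio_counts:.2f}")
print(f"spread ratio using rates:      {ratio_rates:.2f}")
```

With raw counts, the residuals for large cities are far more spread out than for small ones. With rates, the spread is about the same everywhere, at the cost of answering a slightly different question.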

Weighted Regression

Weighted regression is a modification of normal regression where the data points are assigned weights according to their variance. The ones with large variance are given small weights and the ones with small variance are given larger weights.

Weighted regression minimizes the sum of the weighted squared residuals, so the small weights reduce the influence of the high-variance points on the fit.

When the correct weights are used, Heteroscedasticity is replaced by Homoscedasticity. But how do we find the correct weights? Ideally, each weight is the inverse of the error variance for that point. One quick way is to use the inverse of the variable that drives the variance as the weight.

So in our case, the weight will be the inverse of City Population.
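Here is a minimal sketch of weighted least squares written with plain NumPy. It assumes simulated data in which the error variance is proportional to the city population, which is the situation where a weight of 1 / population is appropriate. The numbers and the helper name `spread_ratio` are made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
population = rng.uniform(1e4, 1e6, size=n)
# Assumption for this demo: error variance is proportional to population.
noise = rng.normal(0, 1, size=n) * 5.0 * np.sqrt(population)
infections = 0.02 * population + noise

X = np.column_stack([np.ones(n), population])  # intercept and population
weights = 1.0 / population

# Weighted least squares minimises sum(w_i * residual_i**2). Scaling each row
# by sqrt(w_i) turns that into an ordinary least-squares problem.
sqrt_w = np.sqrt(weights)
coef_wls, *_ = np.linalg.lstsq(X * sqrt_w[:, None], infections * sqrt_w, rcond=None)

resid = infections - X @ coef_wls
weighted_resid = resid * sqrt_w  # residuals on the scale the fit actually used


def spread_ratio(x, resid):
    """Residual std. dev. in the top 25% of x divided by the bottom 25%."""
    lo, hi = np.quantile(x, [0.25, 0.75])
    return resid[x >= hi].std() / resid[x <= lo].std()


ratio_raw = spread_ratio(population, resid)
ratio_weighted = spread_ratio(population, weighted_resid)
print(f"estimated slope:                  {coef_wls[1]:.4f}")
print(f"spread ratio, raw residuals:      {ratio_raw:.2f}")
print(f"spread ratio, weighted residuals: {ratio_weighted:.2f}")
```

The raw residuals still fan out with population, but the weighted residuals have a roughly constant spread, which is the sense in which correct weights replace Heteroscedasticity with Homoscedasticity. If the error standard deviation, rather than the variance, were proportional to population, the matching weight would be 1 / population squared.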

Transformations

Transforming the data is the last resort, because doing so makes the feature harder to interpret.

What that means is that you can no longer easily explain what the feature is showing.

Common choices are the log transformation and Box-Cox transformations.
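The following sketch shows the effect of a log transformation on data with multiplicative noise. The `box_cox` helper is a hand-written version of the standard Box-Cox formula, (y^λ − 1) / λ with the natural log as the λ = 0 case, and the simulated relationship is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
x = rng.uniform(1, 100, size=n)
# Assumed multiplicative noise: the error scales with the level of y.
y = 3.0 * x**1.2 * np.exp(rng.normal(0, 0.3, size=n))


def box_cox(values, lam):
    """Box-Cox transform of positive values; lam = 0 gives the natural log."""
    values = np.asarray(values, dtype=float)
    if lam == 0:
        return np.log(values)
    return (values**lam - 1.0) / lam


def ols_fit(x, y):
    """Return (slope, residuals) of the least-squares line y ~ x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, y - (intercept + slope * x)


def spread_ratio(x, resid):
    """Residual std. dev. in the top 25% of x divided by the bottom 25%."""
    lo, hi = np.quantile(x, [0.25, 0.75])
    return resid[x >= hi].std() / resid[x <= lo].std()


# Regression on the original scale versus the log scale (Box-Cox, lam = 0).
_, resid_raw = ols_fit(x, y)
slope_log, resid_log = ols_fit(box_cox(x, 0), box_cox(y, 0))

ratio_raw = spread_ratio(x, resid_raw)
ratio_log = spread_ratio(x, resid_log)
print(f"spread ratio on the original scale: {ratio_raw:.2f}")
print(f"spread ratio on the log scale:      {ratio_log:.2f}")
print(f"slope on the log scale:             {slope_log:.2f}")
```

The loss of interpretability is visible here: the slope on the log scale no longer describes units of y per unit of x, but the percentage change in y for a one percent change in x. For values of λ other than 0, libraries can estimate a suitable λ from the data; SciPy's `scipy.stats.boxcox` is one such option.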

Before You Go

There can be many reasons for Heteroscedasticity in your data, and they vary considerably from one domain to another.

So it is essential to have domain knowledge as well before you start with the above processes to remove Heteroscedasticity.

In this blog, we discussed Homoscedasticity and Heteroscedasticity, how to detect Heteroscedasticity, how it affects linear regression models, and how to treat it.

Pavan Vadapalli

Blog Author
Director of Engineering @ upGrad. Motivated to leverage technology to solve problems. Seasoned leader for startups and fast moving orgs. Working on solving problems of scale and long term technology strategy.
Frequently Asked Questions (FAQs)

1. What is meant by locally weighted regression in machine learning?

2. What is the White test for heteroscedasticity?

The White test is preferred for checking heteroscedasticity when the independent variables may have a non-linear or interactive effect on the error variance. However, being an asymptotic test, it is suitable only for large samples. Under the White test, the heteroscedasticity can be a function of one or more of your independent variables. It is comparable to the Breusch-Pagan test, the only difference being that the White test allows for a non-linear and interactive influence of the independent variables on the error variance.

3. What exactly is the null hypothesis for heteroscedasticity?

Outliers in the data can cause heteroscedasticity. Heteroscedasticity can also be produced when variables are omitted from the model. A test for heteroscedasticity involves two hypotheses: the null hypothesis and the alternative hypothesis. When applying the White, Breusch-Pagan, or Cook-Weisberg tests, the null hypothesis is that the variances of the errors are equal. The alternative hypothesis is that the variances of the errors are not equal.
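Both tests can be sketched in a few lines for a model with a single predictor. This version is an assumption-laden illustration, not a reference implementation: it uses the n × R² form of the statistic (Koenker's studentized variant of Breusch-Pagan), hand-written chi-square tail formulas that are valid only for 1 and 2 degrees of freedom, and simulated data. The function names are hypothetical.

```python
import math
import numpy as np


def r_squared(X, target):
    """R-squared of a least-squares fit of target on X (X includes a constant)."""
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    return 1.0 - resid.var() / target.var()


def lm_tests(x, y):
    """Return (p_breusch_pagan, p_white) for the one-predictor model y ~ x."""
    n = len(y)
    ones = np.ones(n)
    X = np.column_stack([ones, x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    squared_resid = (y - X @ coef) ** 2

    # Breusch-Pagan, n * R^2 form: regress squared residuals on x.
    lm_bp = max(n * r_squared(X, squared_resid), 0.0)
    p_bp = math.erfc(math.sqrt(lm_bp / 2.0))  # chi-square tail, 1 d.o.f.

    # White: regress squared residuals on x and x**2 (one predictor, so there
    # are no cross-product terms to add).
    X_white = np.column_stack([ones, x, x**2])
    lm_white = max(n * r_squared(X_white, squared_resid), 0.0)
    p_white = math.exp(-lm_white / 2.0)  # chi-square tail, 2 d.o.f.
    return p_bp, p_white


rng = np.random.default_rng(11)
x = rng.uniform(1, 10, size=1000)
y_homo = 3.0 + 2.0 * x + rng.normal(0, 2.0, size=x.size)
y_hetero = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=x.size) * 0.5 * x

p_bp_homo, p_white_homo = lm_tests(x, y_homo)
p_bp_hetero, p_white_hetero = lm_tests(x, y_hetero)
print(f"equal variances:   BP p = {p_bp_homo:.3g}, White p = {p_white_homo:.3g}")
print(f"growing variances: BP p = {p_bp_hetero:.3g}, White p = {p_white_hetero:.3g}")
```

A small p-value means rejecting the null hypothesis that the error variances are equal. For real work, a statistics library is a safer choice; statsmodels, for example, provides `het_breuschpagan` and `het_white` in `statsmodels.stats.diagnostic`.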
