
Homoscedasticity In Machine Learning: Detection, Effects & How to Treat

Last updated: 6th Jan, 2021 | Read Time: 8 Mins

By the end of this tutorial, you will have knowledge of the following:

  • What is Homoscedasticity & Heteroscedasticity?
  • How to know if Heteroscedasticity is present.
  • Effects of Heteroscedasticity in Machine Learning.
  • Treating Heteroscedasticity


What Is Homoscedasticity & Heteroscedasticity?

Homoscedasticity means having “the same variance”. One of the main assumptions of linear regression is that the errors, or residual terms (Y_pred – Y_actual), are homoscedastic.

In other words, linear regression assumes that the error terms have the same, constant variance across all instances.



Let’s understand this with the help of an example. Suppose we have two variables: the carpet area of a house and the price of the house. As the carpet area increases, the price also increases.

So we fit a linear regression model and see that the errors have the same variance throughout. The graph in the image below has Carpet Area on the X-axis and Price on the Y-axis.

As you can see, the predictions fall almost along the linear regression line, with similar variance throughout.

Also, if we plot these residuals against the X-axis, we would see them scattered in a band parallel to the X-axis. This is a clear sign of Homoscedasticity.


When this condition is violated, there is Heteroscedasticity in the model. Considering the same example as above, say that for houses with smaller carpet areas the errors or residuals are very small. As the carpet area increases, the variance in the predictions increases, resulting in larger error or residual terms. When we plot the values again, we see the typical cone shape, which strongly indicates the presence of Heteroscedasticity in the model.


Specifically speaking, Heteroscedasticity is a systematic increase or decrease in the variance of the residuals over the range of the independent variables. This is an issue because Homoscedasticity is an assumption of linear regression: all errors should have the same variance.
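The cone-shaped pattern described above can be reproduced with a short simulation. The sketch below (synthetic data, illustrative numbers only) fits a line to a homoscedastic and a heteroscedastic version of the same carpet-area/price relationship, then compares the residual spread in the lower and upper halves of the predictor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: carpet area (sq ft) vs house price.
area = rng.uniform(500, 3000, 200)

# Homoscedastic errors: constant spread regardless of area.
price_homo = 50 * area + rng.normal(0, 5000, 200)

# Heteroscedastic errors: spread grows with area (the "cone" pattern).
price_hetero = 50 * area + rng.normal(0, 5 * area)

def residual_spread_ratio(x, y):
    """Fit a line, then compare residual std in the upper vs lower half of x."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    lo = resid[x < np.median(x)].std()
    hi = resid[x >= np.median(x)].std()
    return hi / lo

print(residual_spread_ratio(area, price_homo))    # close to 1: homoscedastic
print(residual_spread_ratio(area, price_hetero))  # well above 1: heteroscedastic
```

A ratio near 1 means the residual band is flat; a ratio well above 1 is the numeric counterpart of the cone in the plot.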

How To Know If Heteroscedasticity is Present?

In the simplest terms, the easiest way to detect Heteroscedasticity is by plotting the residuals. If you see any pattern, Heteroscedasticity is present. Typically, the residual spread increases as the fitted values increase, producing a cone-shaped plot.
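Beyond eyeballing the residual plot, a formal check is the Breusch-Pagan test: regress the squared residuals on the predictors and test whether they explain any of the variation. Below is a minimal sketch on synthetic data (in practice you can call `statsmodels.stats.diagnostic.het_breuschpagan` instead of rolling your own):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.uniform(1, 10, 300)
y = 2 * x + rng.normal(0, x)          # error spread grows with x

def breusch_pagan(x, y):
    """Koenker's version of the Breusch-Pagan test for one predictor."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Regress squared residuals on the same predictors; under
    # homoscedasticity they should be unexplainable (R^2 near 0).
    u2 = resid ** 2
    g, *_ = np.linalg.lstsq(X, u2, rcond=None)
    ss_res = ((u2 - X @ g) ** 2).sum()
    ss_tot = ((u2 - u2.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot
    lm = len(y) * r2                  # LM statistic: n * R^2
    p = stats.chi2.sf(lm, df=1)       # df = number of predictors
    return lm, p

lm, p = breusch_pagan(x, y)
print(f"LM = {lm:.1f}, p = {p:.4f}")  # a small p-value rejects constant variance
```

A p-value below your significance level (say 0.05) rejects the null hypothesis of constant error variance.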


Usual Reasons For Heteroscedasticity

  • When there is a large variance in a variable, i.e. when the smallest and largest values in the variable are extreme. These extremes can also be outliers.
  • When you fit the wrong model. Fitting a linear regression model to non-linear data will lead to Heteroscedasticity.
  • When the scale of values in a variable is not uniform.
  • When a wrong transformation of the data is used for regression.
  • When there is left/right skewness in the data.

Pure Vs Impure Heteroscedasticity

Given the above reasons, Heteroscedasticity can be either Pure or Impure. When we fit the right model (linear or non-linear) and a visible pattern still remains in the residuals, it is called Pure Heteroscedasticity.

However, if we fit the wrong model and then observe a pattern in the residuals, it is a case of Impure Heteroscedasticity. The measures needed to overcome it depend on the type of Heteroscedasticity, and also on the domain you’re working in.

Effects Of Heteroscedasticity In Machine Learning

As we discussed earlier, the linear regression model makes an assumption about Homoscedasticity being present in the data. If that assumption is broken then we won’t be able to trust the results we get.

If Heteroscedasticity is present, the instances with high variance will have a larger impact on the fit, which we don’t want.

  • The presence of Heteroscedasticity makes the coefficient estimates less precise, so the estimates are likely to be further from the true population values.
  • Heteroscedasticity is also likely to produce p-values smaller than they should be. This is because the variance of the coefficient estimates has increased, but the standard OLS (Ordinary Least Squares) model does not detect it, so it calculates p-values using an underestimated variance. This can lead us to conclude, incorrectly, that regression coefficients are significant when they are not.
  • The standard errors produced will also be biased. Standard errors are crucial in calculating significance tests and confidence intervals. If the standard errors are biased, the resulting tests and confidence intervals will be incorrect.
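The understated standard errors can be demonstrated with a small Monte Carlo sketch (synthetic data, illustrative numbers): under heteroscedastic noise, the classic OLS standard error of the slope is, on average, smaller than the slope's actual sampling spread.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 500
x = np.linspace(1, 10, n)
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)

slopes, classic_ses = [], []
for _ in range(reps):
    y = 3 * x + rng.normal(0, x ** 2)    # error variance grows sharply with x
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (n - 2)         # classic homoscedastic variance estimate
    slopes.append(beta[1])
    classic_ses.append(np.sqrt(s2 * XtX_inv[1, 1]))

true_spread = np.std(slopes)             # actual sampling spread of the slope
avg_reported = np.mean(classic_ses)      # what plain OLS reports on average
print(true_spread > avg_reported)        # OLS understates the true uncertainty
```

In practice, heteroscedasticity-consistent ("robust" or White) standard errors correct for this; for example, statsmodels supports `fit(cov_type='HC3')`.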

How To Treat Heteroscedasticity?

If you detect the presence of Heteroscedasticity, there are multiple ways to tackle it. First, let’s consider an example where we have two variables: Population of City and Number of COVID-19 Infections.

In this example, there will be a huge difference between the number of infections in large metro cities and in small tier-3 cities. Here, Population of City is the independent variable and Number of Infections is the dependent variable.

Suppose we fit a regression model to this data and observe Heteroscedasticity similar to the image above. We now know that Heteroscedasticity is present in the model and needs to be fixed.

Now the first step would be to identify the source of Heteroscedasticity. In our case, it is the variable with a large variance.

There can be multiple ways to deal with Heteroscedasticity, but we’ll look at three such methods.

Manipulating The Variables

We can modify the variables/features to reduce the impact of this large variance on the model’s predictions. One way to do this is to convert the features to rates and percentages rather than raw values.

The features would then convey slightly different information, but the approach is worth trying. Whether it can be applied depends on the problem and the data.

This method involves the least modification to the features, often solves the problem, and in some cases even improves the model’s performance.

So in our case, we can change the feature “Number of Infections” to “Rate of Infections”. This helps reduce the variance, since the number of infections will obviously be large in cities with large populations.
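A sketch of this rescaling with made-up city figures (illustrative numbers, not real statistics):

```python
import pandas as pd

# Hypothetical city-level data.
df = pd.DataFrame({
    "city": ["Metro A", "Metro B", "Town C", "Town D"],
    "population": [12_000_000, 8_000_000, 300_000, 150_000],
    "infections": [240_000, 140_000, 4_500, 2_100],
})

# Raw counts span several orders of magnitude...
print(df["infections"].max() / df["infections"].min())   # huge spread

# ...but per-capita rates are on a comparable scale.
df["infection_rate"] = df["infections"] / df["population"]
print(df["infection_rate"].round(4).tolist())
```

The raw counts differ by a factor of over a hundred, while the per-capita rates all land within a narrow range, which is exactly the variance reduction we are after.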

Weighted Regression

Weighted regression is a modification of ordinary regression in which the data points are assigned weights according to their variance: points with large variance get small weights, and points with small variance get larger weights.

In the weighted sum of squared residuals, the small weights shrink the contribution of the high-variance points, so they no longer dominate the fit.

When the correct weights are used, Heteroscedasticity is replaced by Homoscedasticity. But how do we find the correct weights? One quick way is to use the inverse of the high-variance variable as the weight.

So in our case, the weight will be the inverse of the city population.
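A minimal sketch of this weighting scheme on synthetic data, using the inverse of the population as the weight as described above. Weighted least squares is equivalent to scaling each row by the square root of its weight and then running ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(7)
pop = rng.uniform(1e5, 1e7, 100)                      # city population
infections = 0.02 * pop + rng.normal(0, 0.002 * pop)  # noise scales with pop

X = np.column_stack([np.ones_like(pop), pop])

# Ordinary least squares: all points weighted equally.
beta_ols, *_ = np.linalg.lstsq(X, infections, rcond=None)

# Weighted least squares: weight each point by 1 / population, so the
# high-variance (large-city) points count less.  This minimises
# sum(w_i * residual_i^2).
w_sqrt = np.sqrt(1.0 / pop)
beta_wls, *_ = np.linalg.lstsq(X * w_sqrt[:, None], infections * w_sqrt,
                               rcond=None)

print(beta_ols[1], beta_wls[1])   # both near the true slope 0.02
```

With statsmodels, the same fit is `sm.WLS(y, X, weights=1/pop).fit()`. Note that the inverse-of-the-variable weight is a quick heuristic; if the error standard deviation grows proportionally with the variable, the theoretically optimal weight is the inverse of its square.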


Transforming The Data

Transforming the data is the last resort, because it costs you the interpretability of the feature: you can no longer easily explain what the feature represents.

Common choices are log transformations and Box-Cox transformations.
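A sketch of how a log transformation stabilises multiplicative noise on synthetic data (`scipy.stats.boxcox` offers a more general family of power transforms):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 500)
# Multiplicative noise: the spread of y grows with its level.
y = 50 * x * np.exp(rng.normal(0, 0.3, 500))

# On the log scale the noise becomes additive and roughly constant:
# log(y) = log(50) + log(x) + noise
log_y = np.log(y)

def half_spread_ratio(x, v):
    """Std of v in the upper half of x divided by std in the lower half."""
    return v[x >= np.median(x)].std() / v[x < np.median(x)].std()

# Residuals around the (known) true mean, before and after the transform.
resid_raw = y - 50 * x
resid_log = log_y - np.log(50 * x)
print(half_spread_ratio(x, resid_raw))   # spread grows with x
print(half_spread_ratio(x, resid_log))   # roughly constant after the log
```

The residual spread on the raw scale roughly doubles from the lower to the upper half of x, while on the log scale it stays flat, i.e. the transform has restored Homoscedasticity.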


Before You Go

There can be many reasons for Heteroscedasticity in your data, and they vary highly from one domain to another. So it is essential to understand your domain before you apply the above processes to remove Heteroscedasticity.

In this blog, we discussed Homoscedasticity and Heteroscedasticity, how to detect Heteroscedasticity, its effects on machine learning models, and how to treat it.

If you’re interested to learn more about machine learning, check out IIIT-B & upGrad’s Executive PG Programme in Machine Learning & AI which is designed for working professionals and offers 450+ hours of rigorous training, 30+ case studies & assignments, IIIT-B Alumni status, 5+ practical hands-on capstone projects & job assistance with top firms.


Pavan Vadapalli

Blog Author
Director of Engineering @ upGrad. Motivated to leverage technology to solve problems. Seasoned leader for startups and fast moving orgs. Working on solving problems of scale and long term technology strategy.

Frequently Asked Questions (FAQs)

1. What is meant by locally weighted regression in machine learning?

Locally weighted regression fits a separate regression for each query point, giving higher weights to training points that lie close to the query point (typically via a kernel function) and lower weights to distant points. Because each local fit adapts to its own neighbourhood, this non-parametric method can capture non-linear relationships that a single global linear model would miss.

2. What is the White test for heteroscedasticity?

The White test is preferred when you want to allow the independent variables to have a non-linear, interactive effect on the error variance. However, being an asymptotic test, it is suited to large samples only. Under the White test, the heteroscedasticity can be a function of one or more of your independent variables. It is comparable to the Breusch-Pagan test, the only difference being that the White test allows for non-linear and interactive effects of the independent variables on the error variance.

3. What exactly is the null hypothesis for heteroscedasticity?

Heteroscedasticity can be caused by outliers in the data, or by omitting relevant variables from the model. A heteroscedasticity test involves just two hypotheses: the null hypothesis and the alternative hypothesis. For the White, Breusch-Pagan, or Cook-Weisberg tests, the null hypothesis is that the variances of the errors are all equal; the alternative hypothesis is that they are not.
