What Are Overfitting & Underfitting in Machine Learning? [Everything You Need to Learn]

Machine Learning is not the easiest subject to master. Overfitting and Underfitting are two of the many terms that are common in the Machine Learning community. Understanding these concepts will lay the foundation for your future learning.

We will cover these concepts in depth in this article. We’ll discuss the basic idea behind these errors, why they occur, and how you can fix them. You’ll also learn a little about data models and their relationship with these errors.

So without beating around the bush, let’s dive right in: 

What is a Data Model?

Before we start discussing what Overfitting and Underfitting are, let’s first understand what a model is. A data model is a system for making predictions from an input. You can say that a model is a theory for solving a problem. For example, if you want to predict the growth of multiple companies, you can take their profits as the input and generate results based on the relationship between their earnings and growth. The output for this example would be the predicted growth of the companies.

So the input is the current profit of the companies, whereas their growth projections are the output. The relationship between these two is the model. Models are necessary to generate outputs. 

The model learns the relationship between the input and output from a training dataset. We call inputs features and outputs labels, so you will see these names in the article too. During training, you give the model the features as well as the labels and let it figure out the relationship between them. Once it has completed the training, you can try out the model by giving it only a set of features whose correct labels are available to you.

After it has generated its predictions, you compare them with the correct labels you have and see how accurate the model was. Models come in many forms.
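The train-then-check loop described above can be sketched in a few lines of plain Python. This is only an illustration: the profit and growth numbers are made up, `fit_line` and `predict` are hypothetical helper names, and the model is assumed to be an ordinary least-squares line on a single feature.

```python
def fit_line(features, labels):
    """Least-squares fit of label = intercept + slope * feature."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(labels) / n
    sxx = sum((x - mean_x) ** 2 for x in features)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, labels))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, features):
    """Apply a fitted (intercept, slope) model to new features."""
    intercept, slope = model
    return [intercept + slope * x for x in features]


# Made-up training data: profit is the feature, growth is the label.
train_profit = [1.0, 2.0, 3.0, 4.0, 5.0]
train_growth = [2.1, 3.9, 6.2, 7.8, 10.1]

model = fit_line(train_profit, train_growth)

# Features the model has not seen, whose correct labels we already know.
check_profit = [6.0, 7.0]
check_growth = [12.0, 14.1]

predictions = predict(model, check_profit)
errors = [abs(p - y) for p, y in zip(predictions, check_growth)]
print("predictions:", predictions)
print("absolute errors:", errors)
```

The smaller the errors on examples the model never saw during training, the better it has captured the relationship between features and labels.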

Data Training and Testing

You might give your data model perfect features when you’re a beginner, but that’s not what happens in the real world. Data in the real world is filled with noise and useless information. No matter what the source of your data is, you’ll find some values in it that don’t fit the trend.

In our example of companies’ growth projections, you know their growth wouldn’t rely entirely on their profits. There would be a lot of factors at play. If you generate your own practice data, you should add some noise to make it realistic. Once you have created your data, you have to divide it into two sets, one for training and one for testing.

You’d use the training data to help the model learn the relationship between features and labels. And you’d use the testing data to evaluate its performance.
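A minimal sketch of this step, assuming synthetic data in which growth is roughly twice the profit plus Gaussian noise. The relationship, the 80/20 split, and the `train_test_split` helper are illustrative assumptions, not something prescribed by the article.

```python
import random


def train_test_split(features, labels, test_fraction=0.2, seed=0):
    """Shuffle the examples, then hold out a fraction of them for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([features[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([features[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


# Synthetic data: growth = 2 * profit, plus noise so it is not a perfect line.
rng = random.Random(42)
profits = [float(i) for i in range(20)]
growth = [2.0 * p + rng.gauss(0.0, 1.0) for p in profits]

(train_x, train_y), (test_x, test_y) = train_test_split(profits, growth)
print(len(train_x), "training examples,", len(test_x), "testing examples")
```

Shuffling before splitting keeps the testing set from being, say, only the largest companies, which would make the evaluation misleading.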

There are many kinds of models in the data world. Choosing one can be a little daunting, but with a bit of practice, it gets easier. A standard choice is polynomial regression. It’s a form of linear regression in which the input is raised to a range of powers, so the model stays linear in its coefficients even though the fitted curve is no longer a straight line.

You define a polynomial by its order. The order of a polynomial is the highest power of x in its equation, and it is also called the degree. For example, a straight line has degree 1.
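To make the idea of degree concrete, here is a small sketch in which a polynomial is stored as a list of coefficients and the degree is simply the highest power used. The helper names `polynomial_features` and `evaluate` are hypothetical.

```python
def polynomial_features(x, degree):
    """Return [1, x, x**2, ..., x**degree] for a single input x."""
    return [x ** k for k in range(degree + 1)]


def evaluate(coefficients, x):
    """Evaluate a polynomial; coefficients[k] multiplies x**k."""
    degree = len(coefficients) - 1
    powers = polynomial_features(x, degree)
    return sum(c * p for c, p in zip(coefficients, powers))


line = [1.0, 2.0]              # 1 + 2x, degree 1 (a straight line)
cubic = [0.0, -1.0, 0.0, 0.5]  # -x + 0.5x^3, degree 3

print(polynomial_features(2.0, 3))  # [1.0, 2.0, 4.0, 8.0]
print(evaluate(line, 3.0))          # 7.0
print(evaluate(cubic, 2.0))         # 2.0
```

Polynomial regression fits the coefficients of these powers, which is why it still counts as linear regression: the prediction is a weighted sum of the features 1, x, x², and so on.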

Importance of Fixing Overfitting and Underfitting in Machine Learning

In polynomial regression, Overfitting and Underfitting come down to the degree you choose for your model. As we mentioned earlier, the degree of the polynomial is the highest power of x in its equation. This value indicates how flexible your model is. If your model has a high degree, it has a lot more freedom. With a high degree, a model can bend to pass close to many data points.

On the other hand, a model with a lower degree than required wouldn’t be able to follow enough of the data points. Both of these situations lead to poor results that aren’t useful.

The first problem, a higher degree than necessary, is Overfitting. The second problem, a lower degree than required, is Underfitting. As you can see, both can be detrimental to your model and damage your results.

If you don’t fix these issues, your model won’t give you accurate results, and its predicted labels will be useless.
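The effect of the degree can be sketched with three models fitted to the same noisy, roughly linear data: a constant (degree 0), a straight line (degree 1), and the polynomial that passes exactly through all eight training points (degree 7). The data, the noise level, and the helper names are assumptions made for illustration, and the degree-7 model is built by Lagrange interpolation rather than by a regression library.

```python
import random


def mse(model, xs, ys):
    """Mean squared error of a model (a function x -> prediction)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_constant(xs, ys):
    """Degree 0: always predict the mean label."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Degree 1: least-squares straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolant(xs, ys):
    """Degree n-1: the polynomial through every training point (Lagrange form)."""
    def model(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return model


# Assumed ground truth: y = 2x + 1, observed with Gaussian noise.
rng = random.Random(0)
train_x = [float(i) for i in range(8)]
test_x = [i + 0.5 for i in range(7)]
train_y = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in train_x]
test_y = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in test_x]

results = {}
for name, fit in [("degree 0", fit_constant),
                  ("degree 1", fit_line),
                  ("degree 7", fit_interpolant)]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(name, "train MSE: %.3f  test MSE: %.3f" % results[name])
```

The degree-7 model reaches zero training error because it reproduces the noise in every training point, while the constant model cannot even follow the upward trend. Comparing the training and testing columns is what reveals which of the two problems a model has.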

Now that we know their basic concept, let’s discuss each one of them in detail:

What is Overfitting?

When a machine learning algorithm starts to fit the noise in the data rather than the underlying trend, we call it Overfitting. In simpler words, the algorithm is paying too much attention to small details. In machine learning, the goal is to predict the probable output for new inputs, and Overfitting can badly hurt that accuracy. Fitting the training data very closely might sound like a good thing, but it is not.

A severe example of Overfitting in machine learning is a graph where the fitted curve passes through every single dot, as if connecting them one by one. We want to capture the trend, but such a curve doesn’t do that.

A model that memorizes everything in its training data but cannot make good predictions on new data is useless, as it leads to inaccurate results.

What to do when you notice Overfitting?

We can fix this issue by limiting how much freedom the model has to fit the training data, so that it stops chasing every individual data point. High variance (Overfitting) makes results worse, not better. Some of the conventional techniques used to solve Overfitting are as follows:

Decreasing the Iterations

By stopping training after fewer iterations, before Overfitting sets in, we can prevent it from happening. You can find the right number of iterations by trial and error, for example by watching the error on held-out data as training proceeds.
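One common way to pick the number of iterations is early stopping: keep training while the error on a held-out validation set improves, and keep the weights from the best iteration. The sketch below assumes a degree-6 polynomial trained by full-batch gradient descent on made-up data; the learning rate, the patience value, and all function names are illustrative choices.

```python
import math
import random

DEGREE = 6  # deliberately flexible for only 15 training points


def features(x):
    return [x ** k for k in range(DEGREE + 1)]


def predict(weights, x):
    return sum(w * f for w, f in zip(weights, features(x)))


def mse(weights, xs, ys):
    return sum((predict(weights, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_step(weights, xs, ys, lr):
    """One full-batch gradient-descent update of the mean squared error."""
    grads = [0.0] * len(weights)
    for x, y in zip(xs, ys):
        err = predict(weights, x) - y
        for k, f in enumerate(features(x)):
            grads[k] += 2.0 * err * f / len(xs)
    return [w - lr * g for w, g in zip(weights, grads)]


def train_with_early_stopping(train, val, max_iters=2000, patience=20, lr=0.05):
    """Stop once validation error has not improved for `patience` iterations."""
    weights = [0.0] * (DEGREE + 1)
    best_weights, best_val, best_iter = weights, mse(weights, *val), 0
    for it in range(1, max_iters + 1):
        weights = gradient_step(weights, *train, lr)
        val_loss = mse(weights, *val)
        if val_loss < best_val:
            best_weights, best_val, best_iter = weights, val_loss, it
        elif it - best_iter >= patience:
            break  # validation error has stopped improving
    return best_weights, best_val, best_iter


def make_data(rng, n):
    """Made-up data: a sine curve observed with noise, inputs in [-1, 1]."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [math.sin(3.0 * x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


rng = random.Random(1)
train = make_data(rng, 15)
val = make_data(rng, 15)

best_weights, best_val, best_iter = train_with_early_stopping(train, val)
print("best iteration:", best_iter, "validation MSE: %.4f" % best_val)
```

The patience setting avoids stopping on a single noisy iteration; a larger value trains longer before giving up.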

Regularization

Regularization constrains, or shrinks, the coefficient estimates toward 0. In simpler words, it tells the algorithm to prefer a simpler, smoother model over a highly flexible one.
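As a minimal illustration of shrinkage, consider ridge (L2) regularization on a one-feature model. Minimizing the squared error plus `penalty * slope**2` gives the closed-form slope `sxy / (sxx + penalty)`, so a larger penalty pulls the slope toward 0. The data and the `ridge_slope` name are assumptions; ridge is only one of several regularization methods.

```python
def ridge_slope(xs, ys, penalty):
    """Slope of a one-feature ridge regression (L2 penalty on the slope only)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / (sxx + penalty)


# Made-up data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slopes = [ridge_slope(xs, ys, penalty) for penalty in (0.0, 10.0, 100.0)]
for penalty, slope in zip((0.0, 10.0, 100.0), slopes):
    print("penalty %6.1f -> slope %.3f" % (penalty, slope))
```

With a penalty of 0 this is ordinary least squares. In practice the penalty is tuned on held-out data, since too much shrinkage causes Underfitting instead.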

Pruning

Pruning is one of the most common ways to reduce Overfitting in tree-based models such as decision trees. It gets rid of any nodes that add little to no predictive power.
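The idea can be sketched on a toy decision tree in which every split records how much it reduced the error (its gain). The `Node` class, the `gain` field, and the `min_gain` threshold are hypothetical; real libraries use their own pruning criteria.

```python
class Node:
    """A toy tree node; internal nodes are assumed to have both children."""

    def __init__(self, prediction, gain=0.0, left=None, right=None):
        self.prediction = prediction  # value predicted if this node is a leaf
        self.gain = gain              # error reduction achieved by this split
        self.left = left
        self.right = right

    def is_leaf(self):
        return self.left is None and self.right is None


def prune(node, min_gain):
    """Bottom-up pruning: collapse splits whose gain is below min_gain."""
    if node.is_leaf():
        return node
    node.left = prune(node.left, min_gain)
    node.right = prune(node.right, min_gain)
    if node.left.is_leaf() and node.right.is_leaf() and node.gain < min_gain:
        node.left = node.right = None  # turn this split into a leaf
    return node


def count_nodes(node):
    if node is None:
        return 0
    return 1 + count_nodes(node.left) + count_nodes(node.right)


# The root split is useful (gain 5.0); the split below it barely helps (gain 0.1).
weak_split = Node(prediction=3.0, gain=0.1,
                  left=Node(prediction=2.9), right=Node(prediction=3.1))
tree = Node(prediction=5.0, gain=5.0, left=weak_split, right=Node(prediction=8.0))

nodes_before = count_nodes(tree)
tree = prune(tree, min_gain=1.0)
nodes_after = count_nodes(tree)
print(nodes_before, "nodes before pruning,", nodes_after, "after")
```

After pruning, the weak split becomes a leaf that predicts 3.0, so the tree no longer distinguishes between two groups that were nearly identical anyway.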

Fivefold Cross-Validation

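Cross-validation is one of the less complicated methods for checking for Overfitting. In fivefold cross-validation, you split the training data into five parts (folds), train on four of them, validate on the fifth, and repeat until every fold has served as the validation set once. A model that scores well on its training folds but poorly on the held-out fold is likely overfitting.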
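A minimal sketch of fivefold cross-validation, assuming the examples are already in random order and the model is a least-squares line. The data and the helper names are made up for illustration.

```python
def k_fold_indices(n_examples, k=5):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_examples))
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def fit_line(xs, ys):
    """Least-squares straight line, returned as (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cross_validate(xs, ys, k=5):
    """Return the validation MSE of each fold."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        intercept, slope = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        fold_mse = sum((intercept + slope * xs[i] - ys[i]) ** 2
                       for i in val_idx) / len(val_idx)
        scores.append(fold_mse)
    return scores


# Made-up data, roughly y = 2x + 1.
xs = [float(i) for i in range(10)]
ys = [1.2, 2.9, 5.1, 7.2, 8.8, 11.1, 13.0, 14.9, 17.2, 18.9]

scores = cross_validate(xs, ys)
print("per-fold validation MSE:", [round(s, 3) for s in scores])
print("mean validation MSE:", round(sum(scores) / len(scores), 3))
```

Averaging over the folds gives a steadier estimate than a single train/test split, because every example is used for validation exactly once.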

What is Underfitting?

As the name suggests, Underfitting is when the model does not fit the data well enough to give you useful results. An underfit data model cannot follow enough of the data points. With too low a degree, the fitted curve ends up missing most of the pattern present in the data.

In other words, the model is ‘too simple’ to generate good results if it is underfit. However, solving this problem is usually easier and doesn’t require as much effort as fixing Overfitting.

What to do when you notice Underfitting?

If your model is underfit, you should give it more features. With more features, it’ll have a larger hypothesis space, and it can use that space to generate accurate results. Detecting Underfitting is easier than detecting Overfitting, because an underfit model performs poorly even on its training data, so you shouldn’t have any problem identifying this error. However, you should increase the features rather than the amount of data while dealing with an underfit model. Adding more examples does not help in this case, because the model is too simple to make use of them.
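As a sketch of this fix, suppose the true relationship is quadratic. A straight line underfits it, and adding x² as one extra feature removes the error. The data is synthetic and noise-free, and the tiny least-squares solver (normal equations plus Gaussian elimination) is written out only to keep the example self-contained.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in reversed(range(n)):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution


def fit(rows, ys):
    """Least squares via the normal equations (X^T X) w = X^T y."""
    m = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(m)]
    return solve(xtx, xty)


def mse(weights, rows, ys):
    preds = [sum(w * f for w, f in zip(weights, r)) for r in rows]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x ** 2 for x in xs]  # the true relationship is quadratic

line_rows = [[1.0, x] for x in xs]          # features: bias, x
quad_rows = [[1.0, x, x ** 2] for x in xs]  # features: bias, x, x^2

line_mse = mse(fit(line_rows, ys), line_rows, ys)
quad_mse = mse(fit(quad_rows, ys), quad_rows, ys)
print("training MSE with x only:   %.3f" % line_mse)   # 12.000
print("training MSE with x and x^2: %.3f" % quad_mse)  # 0.000
```

The high training error of the straight line is the telltale sign of Underfitting: the model is wrong even on the data it was fitted to.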

Hit the Sweet Spot

In machine learning, you want your data model to sit between Underfitting and Overfitting. It should neither follow the data points too closely nor too loosely. As you train your model further, you can improve it and fix its errors. At first, the error on both the training set and the testing set will fall.

A great way to hit the sweet spot between Overfitting and Underfitting is to stop training your model once its error on the testing set starts increasing, even if the error on the training set is still falling. It’s a general solution, which you can use in addition to the methods we have mentioned previously in this article.

Conclusion

Every data professional faces the problems of Overfitting and Underfitting. Training a data model isn’t easy, and it takes a lot of practice to get comfortable with it. However, with experience, you’ll begin to identify problems early on and avoid the causes of errors altogether.

It’s vital to be familiar with such errors if you want to become a machine learning expert.

Kechit Goyal
