Blog_Banner_Asset
    Homebreadcumb forward arrow iconBlogbreadcumb forward arrow iconArtificial Intelligencebreadcumb forward arrow iconPolynomial Regression: Importance, Step-by-Step Implementation

Polynomial Regression: Importance, Step-by-Step Implementation

Last updated:
29th Jan, 2021
Views
Read Time
10 Mins
share image icon
In this article
Chevron in toc
View All
Polynomial Regression: Importance, Step-by-Step Implementation

Introduction

In this vast field of Machine Learning, what would be the first algorithm that most of us would have studied? Yes, it is the Linear Regression. Mostly being the first program and algorithm that one would have learned in their initial days of Machine Learning Programming, Linear Regression has its own importance and power with a linear type of data.

Top Machine Learning and AI Courses Online

What if the dataset we come across is not linearly separable? What if the linear regression model is not able to derive any sort of relationship between both the independent and dependent variables?

There comes another type of regression known as the Polynomial Regression. True to its name, Polynomial Regression is a regression algorithm that models the relationship between the dependent (y) variable and the independent variable (x) as an nth degree polynomial. In this article, we shall understand the algorithm and math behind Polynomial Regression along with its implementation in Python.

Ads of upGrad blog

Trending Machine Learning Skills

What is Polynomial Regression?

As defined earlier, Polynomial Regression is a special case of linear regression in which a polynomial equation with a specified (n) degree is fit on the non-linear data which forms a curvilinear relationship between the dependent and independent variables.

y= b0+b1x1+ b2x12+ b3x13+…… bnx1n

Here,

y is the dependent variable (output variable)

x1 is the independent variable (predictors)

b0 is the bias

b1, b2, ….bn are the weights in the regression equation.

As the degree of the polynomial equation (n) becomes higher, the polynomial equation becomes more complicated and there is a possibility of the model tending to overfit which will be discussed in the later part.

Comparison of Regression Equations

Simple Linear Regression ===>         y= b0+b1x

Multiple Linear Regression ===>     y= b0+b1x1+ b2x2+ b3x3+…… bnxn

Polynomial Regression ===>         y= b0+b1x1+ b2x12+ b3x13+…… bnx1n

From the above three equations, we see that there are several subtle differences in them. The Simple and Multiple Linear Regressions are different from the Polynomial Regression equation in that it has a degree of only 1. The Multiple Linear Regression consists of several variables x1, x2, and so on. Though the Polynomial Regression equation has only one variable x1, it has a degree n which differentiates it from the other two.

Need for Polynomial Regression

From the below diagrams we can see that in the first diagram, a linear line is attempted to be fit on the given set of non-linear datapoints. It is understood that it becomes very difficult for a straight line to form a relationship with this non-linear data. Because of this when we train the model, the loss function increases causing the high error.

On the other hand, when we apply Polynomial Regression it is clearly visible that the line fits well on the data points. This signifies that the polynomial equation that fits the datapoints derives some sort of relationship between the variables in the dataset. Thus, for such cases where the data points are arranged in a non-linear manner, we require the Polynomial Regression model.

Implementation of Polynomial Regression in Python

From here, we shall build a Machine Learning model in Python implementing Polynomial Regression. We shall compare the results obtained with Linear Regression and Polynomial Regression. Let us first understand the problem that we are going to solve with Polynomial Regression.

Problem Description

In this, consider the case of a Start-up looking to hire several candidates from a company. There are different openings for different job roles in the company. The start-up has details of the salary for each role in the previous company. Thus, when a candidate mentions his or her previous salary, the HR of the start-up needs to verify it with the existing data. Thus, we have two independent variables which are Position and Level. The dependent variable (output) is the Salary which is to be predicted using Polynomial Regression.

On visualizing the above table in a graph, we see that the data is non-linear in nature. In other words, as the level increases the salary increases at a higher rate thus giving us a curve as shown below.

Step 1: Data Pre-Processing

The first step in building any Machine Learning model is to import the libraries. Here, we have only three basic libraries to be imported. After this, the dataset is imported from my GitHub repository and the dependent variables and independent variables are assigned. The independent variables are stored in the variable X and the dependent variable is stored in the variable y.

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

dataset = pd.read_csv(‘https://raw.githubusercontent.com/mk-gurucharan/Regression/master/PositionSalaries_Data.csv’)

X = dataset.iloc[:, 1:-1].values

y = dataset.iloc[:, -1].values

Here in the term [:, 1:-1], the first colon represents that all rows must be taken and the term 1:-1 denotes that the columns to be included are from the first column to the penultimate column which is given by -1.

Step 2: Linear Regression Model

In the next step, we shall build a Multiple Linear Regression model and use it to predict the salary data from the independent variables. For this, the class LinearRegression is imported from the sklearn library. It is then fitted on the variables X and y for training purposes.

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()

regressor.fit(X, y)

Once the model is built, on visualizing the results, we get the following graph.

As it is clearly seen, by trying to fit a straight line on a non-linear dataset, there is no relationship that is derived by the Machine Learning model. Thus, we need to go for Polynomial Regression to get a relationship between the variables.

Step 3: Polynomial Regression Model

In this next step, we shall fit a Polynomial Regression model on this dataset and visualize the results. For this, we import another Class from the sklearn module named as PolynomialFeatures in which we give the degree of the polynomial equation to be built. Then the LinearRegression class is used to fit the Polynomial equation to the dataset.

from sklearn.preprocessing import PolynomialFeatures

from sklearn.linear_model import LinearRegression

poly_reg = PolynomialFeatures(degree = 2)

X_poly = poly_reg.fit_transform(X)

lin_reg = LinearRegression()

lin_reg.fit(X_poly, y)

In the above case, we have given the degree of the polynomial equation to be equal to 2. On plotting the graph, we see that there is some sort of curve that is derived but still there is much deviation from the real data (in red) and the predicted curve points (in green). Thus, in the next step we shall increase the degree of the polynomial to higher numbers such as 3 & 4 and then compare it with each other.

On comparing the results of the Polynomial Regression with degrees 3 and 4, we see that as the degree increases, the model trains well with the data. Thus, we can infer that a higher degree enables the Polynomial equation to fit more accurately on the training data. However, this is the perfect case of overfitting. Thus, it becomes important to choose the value of n precisely to prevent overfitting.

What is Overfitting?

As the name says, Overfitting is termed as a situation in statistics when a function (or a Machine Learning model in this case) is too closely fit on to a set of limited data points. This causes the function to perform poorly with new data points.

In Machine Learning if a model is said to be overfitting on a given set of training data points, then when the same model is introduced to a completely new set of points (say the test dataset), then it performs very badly on it as the overfitting model hasn’t generalized well with the data and is only overfitting on the training data points.

Also Read: Machine Learning Project Ideas

In polynomial regression, there is a good chance of the model getting overfit on the training data as the degree of the polynomial is increased. In the example shown above, we see a typical case of overfitting in polynomial regression which can be corrected with only a trial-and-error basis for choosing the optimal value of the degree.

Ads of upGrad blog

Popular AI and ML Blogs & Free Courses

Conclusion

To conclude, Polynomial Regression is utilized in many situations where there is a non-linear relationship between the dependent and independent variables. Though this algorithm suffers from sensitivity towards outliers, it can be corrected by treating them before fitting the regression line. Thus, in this article, we have been introduced to the concept of Polynomial Regression along with an example of its implementation in Python Programming on a simple dataset.

If you’re interested to learn more about machine learning, check out IIIT-B & upGrad’s PG Diploma in Machine Learning & AI which is designed for working professionals and offers 450+ hours of rigorous training, 30+ case studies & assignments, IIIT-B Alumni status, 5+ practical hands-on capstone projects & job assistance with top firms.

Learn ML Course from the World’s top Universities. Earn Masters, Executive PGP, or Advanced Certificate Programs to fast-track your career.

Profile

Pavan Vadapalli

Blog Author
Director of Engineering @ upGrad. Motivated to leverage technology to solve problems. Seasoned leader for startups and fast moving orgs. Working on solving problems of scale and long term technology strategy.
Get Free Consultation

Select Coursecaret down icon
Selectcaret down icon
By clicking 'Submit' you Agree to  
UpGrad's Terms & Conditions

Our Popular Machine Learning Course

Frequently Asked Questions (FAQs)

1What do you mean by linear regression?

Linear regression is a type of predictive numerical analysis through which we can find the value of an unknown variable with the help of a dependent variable. It also explains the connection between one dependent and one or more independent variables. Linear regression is a statistical technique for demonstrating a link between two variables. Linear regression plots a trend line from a set of data points. Linear regression can be used to generate a prediction model from seemingly random data, such as cancer diagnoses or stock prices. There are several methods for calculating linear regression. The ordinary least-squares approach, which estimates unknown variables in data and visually transforms into the sum of the vertical distances between the data points and the trend line, is one of the most prevalent.

2What are some of Linear Regression's drawbacks?

In most cases, regression analysis is used in research to establish that there is a link between variables. However, correlation does not imply causation since a link between two variables does not imply that one causes the other to happen. Even a line in a basic linear regression that suits the data points well may not ensure a relationship between circumstances and logical outcomes. Using a linear regression model, you may determine whether or not there is any correlation between variables. Extra investigation and statistical analysis will be required to determine the exact nature of the link and whether one variable causes the other.

3What are the basic assumptions of linear regression?

In linear regression, there are three key assumptions. The dependent and independent variables must, first and foremost, have a linear connection. A scatter plot of the dependent and independent variables is used to check this relationship. Second, there should be minimal or zero multi-collinearity between the independent variables in the dataset. It implies that the independent variables are unrelated. The value must be limited, which is determined by the domain requirement. Homoscedasticity is the third factor. The assumption that errors are evenly distributed is one of the most essential assumptions.

Explore Free Courses

Suggested Blogs

Artificial Intelligence course fees
5438
Artificial intelligence (AI) was one of the most used words in 2023, which emphasizes how important and widespread this technology has become. If you
Read More

by venkatesh Rajanala

29 Feb 2024

Artificial Intelligence in Banking 2024: Examples & Challenges
6176
Introduction Millennials and their changing preferences have led to a wide-scale disruption of daily processes in many industries and a simultaneous g
Read More

by Pavan Vadapalli

27 Feb 2024

Top 9 Python Libraries for Machine Learning in 2024
75639
Machine learning is the most algorithm-intense field in computer science. Gone are those days when people had to code all algorithms for machine learn
Read More

by upGrad

19 Feb 2024

Top 15 IoT Interview Questions & Answers 2024 – For Beginners & Experienced
64469
These days, the minute you indulge in any technology-oriented discussion, interview questions on cloud computing come up in some form or the other. Th
Read More

by Kechit Goyal

19 Feb 2024

Data Preprocessing in Machine Learning: 7 Easy Steps To Follow
152994
Summary: In this article, you will learn about data preprocessing in Machine Learning: 7 easy steps to follow. Acquire the dataset Import all the cr
Read More

by Kechit Goyal

18 Feb 2024

Artificial Intelligence Salary in India [For Beginners & Experienced] in 2024
908759
Artificial Intelligence (AI) has been one of the hottest buzzwords in the tech sphere for quite some time now. As Data Science is advancing, both AI a
Read More

by upGrad

18 Feb 2024

24 Exciting IoT Project Ideas & Topics For Beginners 2024 [Latest]
760392
Summary: In this article, you will learn the 24 Exciting IoT Project Ideas & Topics. Take a glimpse at the project ideas listed below. Smart Agr
Read More

by Kechit Goyal

18 Feb 2024

Natural Language Processing (NLP) Projects & Topics For Beginners [2023]
107744
What are Natural Language Processing Projects? NLP project ideas advanced encompass various applications and research areas that leverage computation
Read More

by Pavan Vadapalli

17 Feb 2024

45+ Interesting Machine Learning Project Ideas For Beginners [2024]
328367
Summary: In this Article, you will learn Stock Prices Predictor Sports Predictor Develop A Sentiment Analyzer Enhance Healthcare Prepare ML Algorith
Read More

by Jaideep Khare

16 Feb 2024

Schedule 1:1 free counsellingTalk to Career Expert
icon
footer sticky close icon