In the vast field of Machine Learning, what is the first algorithm that most of us study? Yes, it is Linear Regression. Often the first algorithm, and the first program, that one writes when starting out in Machine Learning, Linear Regression is simple yet powerful when the data follows a linear trend.
But what if the relationship in the dataset is not linear? What if a linear regression model cannot capture the relationship between the independent and dependent variables?
This is where another type of regression, known as Polynomial Regression, comes in. True to its name, Polynomial Regression is a regression algorithm that models the relationship between the dependent variable (y) and the independent variable (x) as an nth-degree polynomial. In this article, we shall understand the algorithm and math behind Polynomial Regression along with its implementation in Python.
What is Polynomial Regression?
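As defined earlier, Polynomial Regression is a special case of linear regression in which a polynomial equation of a specified degree n is fit to non-linear data, that is, data with a curvilinear relationship between the dependent and independent variables. It still counts as linear regression because the equation is linear in its coefficients, even though it is non-linear in the independent variable.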
$y = b_0 + b_1 x_1 + b_2 x_1^2 + b_3 x_1^3 + \dots + b_n x_1^n$
where:
y is the dependent variable (output variable)
x_1 is the independent variable (predictor)
b_0 is the bias (intercept)
b_1, b_2, ..., b_n are the weights in the regression equation.
As the degree n of the polynomial increases, the equation becomes more flexible and the model becomes more likely to overfit, which is discussed in a later part of this article.
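To make the equation concrete, here is a minimal sketch that fits a degree-3 polynomial with ordinary least squares using only NumPy. The data is hypothetical, generated from a known cubic so that the recovered weights can be checked; the names `A` and `b` are illustrative and not part of the article's code.

```python
import numpy as np

# Hypothetical data from a known cubic: y = 1 + 2x - 3x^2 + 0.5x^3
x = np.linspace(-2.0, 2.0, 9)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.5 * x**3

n = 3  # degree of the polynomial

# Design matrix with columns [1, x, x^2, ..., x^n]
A = np.vander(x, N=n + 1, increasing=True)

# Ordinary least squares: the model is linear in the weights b0..bn
b, *_ = np.linalg.lstsq(A, y, rcond=None)

# b[0] is the bias b0, b[1:] are the weights b1..bn
print(np.round(b, 6))
```

Because the data contains no noise, the printed weights should match the generating coefficients 1, 2, -3 and 0.5 up to floating-point error.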
Comparison of Regression Equations
Simple Linear Regression ===> $y = b_0 + b_1 x$
Multiple Linear Regression ===> $y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n$
Polynomial Regression ===> $y = b_0 + b_1 x_1 + b_2 x_1^2 + b_3 x_1^3 + \dots + b_n x_1^n$
Comparing the three equations above reveals a few subtle differences. The Simple and Multiple Linear Regression equations differ from the Polynomial Regression equation in that they have a degree of only 1. The Multiple Linear Regression equation consists of several variables x1, x2, and so on. The Polynomial Regression equation has only one variable, x1, but it is raised to powers up to the degree n, which differentiates it from the other two.
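One way to see the difference between the three equations is to look at the feature columns each model is fit on. The sketch below builds them by hand with NumPy; the values of `x1` and `x2` are made up for illustration.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([0.5, 0.1, 0.9, 0.3])  # a second, hypothetical predictor

ones = np.ones_like(x1)  # column for the bias b0

# Simple linear regression: columns [1, x1]
A_simple = np.column_stack([ones, x1])

# Multiple linear regression: columns [1, x1, x2], one per variable
A_multiple = np.column_stack([ones, x1, x2])

# Polynomial regression of degree 3: columns [1, x1, x1^2, x1^3],
# all derived from the single variable x1
A_poly = np.column_stack([ones, x1, x1**2, x1**3])

print(A_simple.shape, A_multiple.shape, A_poly.shape)  # (4, 2) (4, 3) (4, 4)
```

In all three cases the weights are then found with the same least-squares procedure; only the columns differ.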
Need for Polynomial Regression
In the first of the diagrams below, a straight line is fit to a set of non-linear data points. A straight line cannot capture the relationship in such data, so even after training the loss remains high and the model makes large errors.
On the other hand, when we apply Polynomial Regression, it is clearly visible that the curve fits the data points well. This signifies that the polynomial equation captures the relationship between the variables in the dataset. Thus, for cases where the data points are arranged in a non-linear manner, we require the Polynomial Regression model.
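The same comparison can be made numerically. The following is a minimal sketch on synthetic data (a quadratic trend plus random noise, not the article's dataset) that fits a straight line and a degree-2 polynomial with `np.polyfit` and compares their mean squared error on the training points. The helper `training_mse` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic non-linear data: quadratic trend plus Gaussian noise
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 0.5 * x**2 + rng.normal(scale=1.0, size=x.size)

def training_mse(degree):
    """Fit a polynomial of the given degree and return its training MSE."""
    coeffs = np.polyfit(x, y, deg=degree)
    pred = np.polyval(coeffs, x)
    return float(np.mean((y - pred) ** 2))

mse_line = training_mse(1)  # straight line
mse_quad = training_mse(2)  # degree-2 polynomial

print(f"degree 1 MSE: {mse_line:.2f}")
print(f"degree 2 MSE: {mse_quad:.2f}")
```

On data like this the straight line leaves a much larger error than the quadratic, which is the numerical counterpart of what the two diagrams show.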
Implementation of Polynomial Regression in Python
From here, we shall build a Machine Learning model in Python implementing Polynomial Regression. We shall compare the results obtained with Linear Regression and Polynomial Regression. Let us first understand the problem that we are going to solve with Polynomial Regression.
Consider the case of a start-up looking to hire several candidates from another company. That company has different openings for different job roles, and the start-up has the salary details for each role there. When a candidate mentions his or her previous salary, the HR of the start-up needs to verify it against this data. The dataset describes each role with two columns, Position and Level. Level is the numeric counterpart of Position, so only Level is used as the independent variable in the code that follows. The dependent variable (output) is the Salary, which is to be predicted using Polynomial Regression.
On visualizing the above table in a graph, we see that the data is non-linear in nature. In other words, as the level increases the salary increases at a higher rate thus giving us a curve as shown below.
Step 1: Data Pre-Processing
The first step in building any Machine Learning model is to import the libraries. Here, we have only three basic libraries to be imported. After this, the dataset is imported from my GitHub repository and the dependent variables and independent variables are assigned. The independent variables are stored in the variable X and the dependent variable is stored in the variable y.
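import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Load the dataset from the author's GitHub repository
dataset = pd.read_csv('https://raw.githubusercontent.com/mk-gurucharan/Regression/master/PositionSalaries_Data.csv')
# Independent variable(s): every column except the first and the last
X = dataset.iloc[:, 1:-1].values
# Dependent variable: the last column
y = dataset.iloc[:, -1].values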
Here, in the term [:, 1:-1], the first colon means that all rows are taken. The term 1:-1 means that the columns start at index 1 (the second column, since indexing starts at 0) and stop before the last column, which is denoted by -1. The first column and the last column are therefore both excluded from X.
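The slicing can be checked on a tiny stand-in DataFrame. The sketch below assumes the same three-column layout (Position, Level, Salary); the rows are made up and are not taken from the real dataset.

```python
import pandas as pd

# Stand-in frame with an assumed Position / Level / Salary layout
df = pd.DataFrame({
    "Position": ["Analyst", "Manager", "Director"],
    "Level": [1, 2, 3],
    "Salary": [100, 200, 300],
})

X = df.iloc[:, 1:-1].values  # columns from index 1 up to, not including, the last
y = df.iloc[:, -1].values    # the last column only

print(X.shape)  # (3, 1): a 2-D array holding just the Level column
print(y.shape)  # (3,): a 1-D array holding the Salary column
```

Using the slice 1:-1 rather than the single index 1 keeps X two-dimensional, which is the shape scikit-learn estimators expect for the features.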
Step 2: Linear Regression Model
In the next step, we shall build a simple Linear Regression model (X holds a single independent variable) and use it to predict the salary. For this, the class LinearRegression is imported from the sklearn library. It is then fitted on the variables X and y for training purposes.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
# Train the model on the data
regressor.fit(X, y)
Once the model is built, on visualizing the results, we get the following graph.
As is clearly seen, a straight line fit to a non-linear dataset does not capture the relationship in the data. Thus, we need to go for Polynomial Regression to model the relationship between the variables.
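The poor fit of the straight line also shows up in its residuals. The sketch below uses a hypothetical convex curve over ten levels (not the article's salary figures) and prints the sign of each residual after fitting `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data: ten levels and a steeply rising curve
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = 1000.0 * X.ravel() ** 3

regressor = LinearRegression()
regressor.fit(X, y)

# Residual = actual value minus the straight line's prediction
residuals = y - regressor.predict(X)
r2 = regressor.score(X, y)

print(f"R^2 of the straight line: {r2:.3f}")
print(np.sign(residuals).astype(int))
```

The residuals are positive at both ends and negative in the middle. A systematic pattern like this, rather than random scatter around zero, indicates that the line is missing curvature in the data.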
Step 3: Polynomial Regression Model
In this next step, we shall fit a Polynomial Regression model on this dataset and visualize the results. For this, we import another class from sklearn, named PolynomialFeatures, to which we give the degree of the polynomial equation to be built. It transforms X into its polynomial terms, and the LinearRegression class is then used to fit the polynomial equation to the dataset.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# Expand X into the columns [1, x, x^2]
poly_reg = PolynomialFeatures(degree = 2)
X_poly = poly_reg.fit_transform(X)
# Fit an ordinary linear regression on the expanded features
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
In the above case, we have set the degree of the polynomial equation to 2. On plotting the graph, we see that a curve is obtained, but there is still considerable deviation between the real data (in red) and the predicted curve points (in green). Thus, in the next step we shall increase the degree of the polynomial to 3 and 4 and compare the results.
On comparing the results of Polynomial Regression with degrees 3 and 4, we see that as the degree increases, the curve follows the training data more closely. Thus, we can infer that a higher degree enables the polynomial equation to fit the training data more accurately. However, a close fit to the training data is not the same as a good model; pushed far enough, it is exactly how overfitting arises. It therefore becomes important to choose the value of n carefully to prevent overfitting.
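The effect of the degree can also be sketched numerically. The example below first shows the columns that `PolynomialFeatures` generates, then fits degrees 2, 3 and 4 and records the R² score on the training data. It uses a hypothetical exponential-looking curve rather than the article's dataset, and the names `first_rows` and `train_r2` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical stand-in data: ten levels and a steeply rising curve
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.exp(0.5 * X.ravel())

# For degree 2, each row of the transformed matrix is [1, x, x^2]
first_rows = PolynomialFeatures(degree=2).fit_transform(X[:3])
print(first_rows)

train_r2 = {}
for degree in (2, 3, 4):
    poly_reg = PolynomialFeatures(degree=degree)
    X_poly = poly_reg.fit_transform(X)
    lin_reg = LinearRegression().fit(X_poly, y)
    # R^2 measured on the same points the model was trained on
    train_r2[degree] = lin_reg.score(X_poly, y)
    print(degree, round(train_r2[degree], 4))
```

The training score cannot get worse as the degree rises, because each higher-degree model contains the lower-degree ones as special cases. This says nothing about how the model behaves on points it has not seen.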
What is Overfitting?
As the name suggests, overfitting is the situation in statistics in which a function (or a Machine Learning model in this case) is fit too closely to a limited set of data points. This causes the function to perform poorly on new data points.
In Machine Learning, a model that overfits a given set of training data points performs very badly when it is introduced to a completely new set of points (say, the test dataset). The model has learned the particular training points, noise included, rather than the general pattern in the data.
In polynomial regression, the chance of overfitting the training data grows as the degree of the polynomial is increased. The example shown above is a typical case of overfitting in polynomial regression. There is no formula for the optimal degree, so it is chosen by trial and error: try several degrees and keep the one that works best.
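The trial-and-error search over the degree can be made a little more systematic by comparing the error on the training points with the error on points held back from training. The following is only a sketch on synthetic data (a quadratic trend plus noise); the split, the noise level and the candidate degrees are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def true_curve(x):
    # Hypothetical underlying relationship (degree 2)
    return 1.0 + 2.0 * x - 3.0 * x**2

# Small training set and a separate held-out set, both noisy
x_train = np.linspace(-1.0, 1.0, 12)
x_val = np.linspace(-0.95, 0.95, 20)
y_train = true_curve(x_train) + rng.normal(scale=0.2, size=x_train.size)
y_val = true_curve(x_val) + rng.normal(scale=0.2, size=x_val.size)

def mse(coeffs, x, y):
    return float(np.mean((y - np.polyval(coeffs, x)) ** 2))

train_mse, val_mse = {}, {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse[degree] = mse(coeffs, x_train, y_train)
    val_mse[degree] = mse(coeffs, x_val, y_val)
    print(degree, round(train_mse[degree], 4), round(val_mse[degree], 4))

# Keep the degree with the lowest error on the held-out points
best_degree = min(val_mse, key=val_mse.get)
print("chosen degree:", best_degree)
```

The training error keeps shrinking as the degree grows, while the held-out error of the degree-9 model stays well above its training error. That gap is the signature of overfitting described above.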
To conclude, Polynomial Regression is used in many situations where there is a non-linear relationship between the dependent and independent variables. Though the algorithm is sensitive to outliers, this can be mitigated by treating them before fitting the regression curve. In this article, we have been introduced to the concept of Polynomial Regression along with an example of its implementation in Python on a simple dataset.
What do you mean by linear regression?
Linear regression is a type of predictive analysis through which we can estimate the value of an unknown (dependent) variable with the help of one or more independent variables. It describes the relationship between one dependent variable and one or more independent variables by fitting a trend line to a set of data points. Linear regression can be used to generate a prediction model from seemingly random data, such as cancer diagnoses or stock prices. There are several methods for calculating linear regression. One of the most prevalent is the ordinary least-squares approach, which chooses the trend line that minimizes the sum of the squared vertical distances between the data points and the line.
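For a single independent variable, the ordinary least-squares criterion can be written as follows. The symbol m, the number of data points, is introduced here for illustration; b_0 and b_1 are the bias and weight of the trend line.

```latex
\[
(\hat{b}_0, \hat{b}_1)
  = \arg\min_{b_0,\, b_1} \sum_{i=1}^{m} \bigl( y_i - b_0 - b_1 x_i \bigr)^2
\]
```

Each term in the sum is the squared vertical distance between a data point and the line.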
What are some of Linear Regression's drawbacks?
In most cases, regression analysis is used in research to establish that there is a link between variables. However, correlation does not imply causation: a link between two variables does not mean that one causes the other. Even a simple linear regression line that fits the data points well does not guarantee a cause-and-effect relationship. A linear regression model can tell you whether there is a correlation between variables, but further investigation and statistical analysis are required to determine the exact nature of the link and whether one variable causes the other.
What are the basic assumptions of linear regression?
In linear regression, there are three key assumptions. First and foremost, the dependent and independent variables must have a linear relationship. A scatter plot of the dependent variable against the independent variables is used to check this. Second, there should be minimal or zero multicollinearity in the dataset, which means that the independent variables should not be strongly correlated with one another. How much correlation is acceptable depends on the requirements of the domain. The third assumption is homoscedasticity: the errors should have the same variance across all values of the independent variables. It is one of the most essential assumptions.