In practice, supervised machine learning covers two primary kinds of task: classification and regression. Classification predicts discrete outputs, while regression predicts continuous values.
In algebra, linearity denotes a straight-line relationship between variables. Plotted on a graph, this relationship appears as a straight line.
Linear regression is a supervised machine learning algorithm. It looks for the line that best fits the data points on a plot. In its simple form, it models the relationship between one dependent variable and one independent variable with a straight line, which can then be used to estimate the dependent variable for a given value of the independent one.
A linear regression model fits this line by choosing the intercept and slope that give the lowest cost, that is, the smallest overall error between the line's predictions and the observed values of the dependent variable.
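The idea of a lowest-cost line can be made concrete with a small sketch. It assumes the cost is the mean squared error, a common choice that the text above does not name explicitly, and it uses made-up toy data. The function name mean_squared_error_cost is purely illustrative.

```python
import numpy as np

def mean_squared_error_cost(x, y, intercept, slope):
    """Average squared vertical distance between the points and the line."""
    predictions = intercept + slope * x
    return float(np.mean((y - predictions) ** 2))

# Hypothetical toy data lying roughly on the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Two candidate lines: one close to the data, one far from it
cost_good = mean_squared_error_cost(x, y, intercept=1.0, slope=2.0)
cost_bad = mean_squared_error_cost(x, y, intercept=0.0, slope=1.0)

print("cost of y = 1 + 2x:", cost_good)
print("cost of y = 0 + 1x:", cost_bad)
```

The line with the smaller cost is the better fit. Fitting a linear regression amounts to searching for the intercept and slope that make this number as small as possible.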
Mathematically, a simple linear regression model can be written in three equivalent ways (y being the dependent variable and x the independent variable):
- y = intercept + (slope × x) + error
- y = constant + (coefficient × x) + error
- y = a + bx + e
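To see how a, b and e relate to actual numbers, here is a minimal sketch that estimates them with the standard least-squares formulas for a single input variable. The data points are invented for illustration.

```python
import numpy as np

# Hypothetical toy data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Least-squares estimates for one independent variable:
# b = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
# a = mean_y - b * mean_x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# e is whatever the line does not explain for each observation
e = y - (a + b * x)

print("a (intercept):", round(a, 4))
print("b (slope):", round(b, 4))
print("e (errors):", np.round(e, 4))
```

With an intercept in the model, the errors of a least-squares fit always sum to zero, which is a quick sanity check on the calculation.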
Why is linear regression essential?
Linear regression models are comparatively simple and easy to interpret: the fitted equation shows directly how the prediction changes with each input. This makes them useful in many fields, for instance academic research and business studies.
Linear regression is a long-established, well-studied statistical method for making predictions, although its accuracy depends on how well the data meet its assumptions. It is used across the environmental, behavioural and social sciences, among others.
Because the procedure is long established, the properties of these models are well understood and they are quick to train. They also help turn large raw data sets into actionable information.
Key assumptions of effective linear regression
- For each variable: The number of valid cases, mean, and standard deviation should be considered.
- For each model: Regression coefficients, correlation matrix, part and partial correlations, standard error of the estimate, analysis-of-variance table, predicted values, and residuals should be considered.
- Plots: Scatterplots, histograms, partial plots, and normal probability plots are considered.
- Data: The dependent and independent variables should be quantitative. Categorical variables need to be re-coded as binary (dummy) variables or other types of contrast variables.
- Other assumptions: For every value of an independent variable, the dependent variable should be normally distributed. The variance of that distribution should be constant for every value of the independent variable. The relationship between the dependent variable and each independent variable should be linear, and all observations should be independent.
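Some of these assumptions can be checked roughly by looking at the residuals of a fitted line. The sketch below is only an informal check on synthetic data, not a formal statistical test. It fits a line with np.polyfit, then compares the spread of the residuals for low and high values of x as a crude test of constant variance.

```python
import numpy as np

# Synthetic data that satisfies the assumptions by construction:
# a linear trend plus independent normal noise with constant variance
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 3.0 + 1.5 * x + rng.normal(scale=1.0, size=x.size)

# np.polyfit returns coefficients from the highest degree down: [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Residuals should centre on zero and have a similar spread across the range of x
low_half = residuals[x <= np.median(x)]
high_half = residuals[x > np.median(x)]

print("mean residual:", round(float(residuals.mean()), 6))
print("residual std for low x:", round(float(low_half.std()), 3))
print("residual std for high x:", round(float(high_half.std()), 3))
```

If the two spreads differ widely, or the residuals show a curve when plotted against x, the constant-variance or linearity assumption is probably violated.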
Here is an existing example of a simple linear regression:
The dataset in the example contains daily weather information recorded around the world over a particular period. It includes factors such as precipitation, snowfall, temperatures, wind speed, thunderstorms and other weather conditions.
The aim is to use a simple linear regression model to predict the maximum temperature, taking the minimum temperature as the input.
First, import the required libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as seabornInstance
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
# Jupyter notebook magic that renders plots inline; omit it in a plain Python script
%matplotlib inline
To load the dataset with pandas, use the following command:
dataset = pd.read_csv('/Users/nageshsinghchauhan/Documents/projects/ML/ML_BLOG_LInearRegression/Weather.csv')
To explore the data, first check the number of rows and columns in the dataset:
dataset.shape
The output received should be (119040, 31), which means the data contains 119040 rows and 31 columns.
To see the statistical details of the dataset, use describe():
dataset.describe()
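The walkthrough above stops at exploring the data. The remaining steps, fitting the model and predicting the maximum temperature from the minimum, might look like the sketch below. Since the original CSV is not available here, the sketch builds a small synthetic stand-in, and the column names MinTemp and MaxTemp are assumptions rather than names confirmed by the text.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the weather data (not the real dataset).
# "MinTemp" and "MaxTemp" are assumed column names.
rng = np.random.default_rng(42)
min_temp = rng.uniform(-5, 25, size=300)
weather = pd.DataFrame({
    "MinTemp": min_temp,
    "MaxTemp": min_temp + 10 + rng.normal(scale=2.0, size=300),
})

X = weather[["MinTemp"]]  # input must be two-dimensional
y = weather["MaxTemp"]    # value to predict

# Hold back 20% of the rows to test the fitted model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

print("intercept:", regressor.intercept_)
print("slope:", regressor.coef_[0])
print("R^2 on test data:", regressor.score(X_test, y_test))
```

With the real data, the DataFrame built here would be replaced by the dataset loaded with pd.read_csv, after confirming the actual column names.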
Here is another example, which demonstrates how to import and use the Python libraries needed to apply linear regression to a data set:
1. Importing all the required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
2. Reading the data set
cd C:\Users\Dev\Desktop\Kaggle\Salinity
# Changing the file read location to the location of the dataset
df = pd.read_csv('bottle.csv')
df_binary = df[['Salnty', 'T_degC']].copy()
# Taking only the selected two attributes from the dataset
# (.copy() avoids pandas warnings when the subset is modified later)
df_binary.columns = ['Sal', 'Temp']
# Renaming the columns for easier writing of the code
df_binary.head()
# Displaying the first five rows along with the column names
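Before plotting or cleaning, it can help to count how many values are missing in each column. The sketch below uses a tiny made-up frame with the same Sal and Temp column names to show the count and the effect of a forward fill, which copies the last known value into each gap.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for df_binary
df_binary = pd.DataFrame({
    "Sal": [33.4, np.nan, 33.6, 33.9],
    "Temp": [10.5, 10.4, np.nan, 9.8],
})

# Number of missing values in each column
missing_per_column = df_binary.isna().sum()
print(missing_per_column)

# Forward fill: each gap takes the value from the row above it
filled = df_binary.ffill()
print(filled)
```

A forward fill cannot repair a gap in the very first row, because there is no earlier value to copy. Rows like that still need to be dropped.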
3. Exploring the data scatter
sns.lmplot(x="Sal", y="Temp", data=df_binary, order=2, ci=None)
# Plotting the data scatter
4. Data cleaning
# Eliminating NaN or missing input numbers
df_binary = df_binary.ffill()
5. Training the model
df_binary = df_binary.dropna()
# Dropping any rows that still contain NaN values, before the arrays are built
X = np.array(df_binary['Sal']).reshape(-1, 1)
y = np.array(df_binary['Temp']).reshape(-1, 1)
# Separating the data into independent and dependent variables
# Reshaping each column into a two-dimensional numpy array,
# which is the input shape scikit-learn expects
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# Splitting the data into training and testing data
regr = LinearRegression()
regr.fit(X_train, y_train)
print(regr.score(X_test, y_test))
# Printing the R^2 score of the model on the test data
6. Exploring the results
y_pred = regr.predict(X_test)
plt.scatter(X_test, y_test, color='b')
plt.plot(X_test, y_pred, color='k')
plt.show()
# Scatter of the test data with the fitted regression line
7. Working with a smaller dataset
df_binary500 = df_binary.iloc[:500]
# Selecting the first 500 rows of the data
sns.lmplot(x="Sal", y="Temp", data=df_binary500, order=2, ci=None)
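After plotting the smaller dataset, a natural next step is to refit the model on those 500 rows and measure its error. The sketch below is one way to do that, using the error metrics in sklearn.metrics. The real bottle.csv is not available here, so the data is a synthetic stand-in with invented numbers. Only the column names Sal and Temp follow the example above.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_binary500 (made-up values, not the real data)
rng = np.random.default_rng(1)
sal = rng.uniform(33.0, 35.0, size=500)
df_binary500 = pd.DataFrame({
    "Sal": sal,
    "Temp": 150.0 - 4.0 * sal + rng.normal(scale=1.0, size=500),
})

X = df_binary500[["Sal"]].to_numpy()   # shape (500, 1)
y = df_binary500["Temp"].to_numpy()    # shape (500,)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

regr = LinearRegression().fit(X_train, y_train)
y_pred = regr.predict(X_test)

# Common error metrics, all in the units of the dependent variable
# except the MSE, which is in squared units
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print("Mean absolute error:", mae)
print("Mean squared error:", mse)
print("Root mean squared error:", rmse)
print("R^2 score:", regr.score(X_test, y_test))
```

Comparing these numbers with the ones obtained on the full dataset shows whether the smaller sample changes the quality of the fit.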
What is linear regression used for?
This kind of analysis is generally used to predict the value of one variable based on the known value of another. The variable being predicted is called the dependent variable, and the variable used to predict it is called the independent variable.
How to install scikit-learn?
There are two main options. The first is to install the version of scikit-learn provided by your operating system or Python distribution, which is the quickest route for those who have it available. The second is to install the latest official release, for example with pip or conda.
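Once the package is installed (for instance with the command pip install -U scikit-learn), a short check like the one below can confirm which version Python sees. It is a minimal sketch that relies only on the standard library. The helper name installed_version is illustrative.

```python
from importlib import metadata

def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

version = installed_version("scikit-learn")
if version is None:
    print("scikit-learn is not installed in this environment")
else:
    print("scikit-learn version:", version)
```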
How does scikit-learn work?
scikit-learn provides a range of supervised and unsupervised learning algorithms through a consistent Python interface. It is released under a permissive BSD license and is packaged by many Linux distributions, which encourages its use in both business and education.
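The consistent interface is easiest to see side by side. In the sketch below, a supervised model (LinearRegression) and an unsupervised one (KMeans) are both trained with fit and queried with predict. The data is a toy example invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Toy inputs forming two well-separated groups
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = 2.0 * X.ravel() + 1.0  # targets that follow y = 2x + 1 exactly

# Supervised: fit on inputs and targets, then predict a value
reg = LinearRegression().fit(X, y)
prediction = reg.predict([[4.0]])
print("predicted value for x = 4:", prediction[0])

# Unsupervised: fit on inputs only, then predict a cluster label
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.predict(X)
print("cluster labels:", labels)
```

Because both estimators share the same method names, swapping one algorithm for another usually means changing a single line.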
